Website Checklist and Troubleshooting
Checklist Summary:
- Check website regularly:
- Log in as admin (Cosmic e-Lab):
- View all analyses
- View Session tracking
- Run an analysis
- Log in as guest to any e-Lab:
- Retrieve data, posters, plots
- www.i2u2.org is not responsive:
- Check that the browser is not the problem:
- Try another browser and/or quit and restart the browser you are using.
- Check if you have internet access.
- Try www.i2u2.org:8080 in your browser:
- If it works, then we need to restart Apache (this is more common on i2u2-dev than on i2u2-prod) (see number 3 below).
- Also check i2u2-dev.crc.nd.edu:
- If both sites are unresponsive, it might be a problem with Notre Dame's network. Contact systems while you continue checking.
- If the problem is only with www.i2u2.org, then we might have a problem with the Tomcat server (see number 3 below).
- The e-Labs are slow:
- Check the Tomcat log to see whether it is busy with many analyses or masterclasses (see number 1 below).
- Can't log in:
- Try another user, try guest, etc.
- Distinguish between a slow login and a user who can never log in. The latter may be a database problem (see number 5 below).
- If it is just slow, check the Tomcat log to see whether the site is busy.
- Found a bug?:
- If you know how to fix the code, then you might need a rollout (see numbers 6 and 7 below).
- LIGO is not displaying data: see number 8 below.
- CMS is not displaying charts (Exploration): see number 11 below.
- Can't upload or save:
- If you see a message on the screen saying that there was a key violation or that there was an Exception when trying to save/upload, see number 12 below.
Checklist Details:
1 – Check Tomcat log. Steps:
a – log in to production using a terminal:
ssh <your_login>@i2u2-prod.crc.nd.edu
b – log in to development using a terminal:
ssh <your_login>@i2u2-dev.crc.nd.edu
Then:
sudo
cd /home/quarkcat/sw/tomcat/logs
tail -f catalina.out
This displays the log as it grows. You can see whatever the app sent to the output, as well as exceptions or other problems as they happen. I do not check development (www13.i2u2.org) unless it is needed.
Edit does this once per day or so. Try to log in (to production, or to 13 or 17...) and watch catalina.out. If you can't log in and catalina.out shows errors, restart Tomcat (see number 3 below).
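A quicker way to scan the recent log for errors, instead of watching it live, is something like this (a sketch; adjust the path or the number of lines as needed):
# show only error lines from the last 2000 lines of the log
tail -n 2000 /home/quarkcat/sw/tomcat/logs/catalina.out | grep -iE "exception|error"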
2 – Check website:
go to
www.i2u2.org
log in as admin and look at session tracking
view the list of all analyses
log in as TestTeacher in the three e-Labs and click the pencil for the logbook (mainly after a rollout)
log in as guest and click on some links
check the LIGO e-Lab to make sure the Python server has data available (see LIGO, number 8 below).
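For a quick scripted check before opening a browser, something like the following works (a sketch; it only confirms that Apache and Tomcat answer HTTP requests, not that the e-Labs behave correctly):
# -I fetches headers only; a 200 or 302 response means the server is answering
curl -sI http://www.i2u2.org | head -n 1
curl -sI http://www.i2u2.org:8080 | head -n 1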
3 – If a server restart is needed:
*TOMCAT: only if you see problems (exceptions or some ugly message) in the catalina.out log and the website is not responsive.
*APACHE: if www18.i2u2.org is not responsive but www18.i2u2.org:8080 is actually working.
a – production:
ssh www18.i2u2.org
sudo
cd /etc/init.d
if you need to restart tomcat, run:
./tomcat6 restart
or:
./tomcat6 stop
./tomcat6 start
if you need to restart apache, run:
./apache2 restart
or:
./apache2 stop
./apache2 start
b – development:
ssh www13.i2u2.org
sudo
cd /etc/init.d
if you need to restart tomcat, run:
./i2u2tomcat restart
or:
./i2u2tomcat stop
./i2u2tomcat start
if you need to restart apache, run:
./httpd restart
or:
./httpd stop
./httpd start
4 – Also:
- check if it is the hosting site's (Notre Dame's) problem: if www.i2u2.org is not responding, try www13.i2u2.org. If none of them respond, then call systems. Also try ssh to any of the machines; sometimes you can ssh in and see that the servers are fine and yet the site is still not responsive.
- check if it is your browser: Safari sometimes chokes. Test other browsers. If you only have Safari, clear the history and quit, then restart it and it should work.
- check your internet connection (Comcast had an outage in my area once, and the first thing I thought was that the website was down).
5 – Could be the database… (check with somebody before doing this)
This is rarer, but it has happened that some index gets corrupted and then the site might not work correctly. Months ago I had a problem with the guest user; it turned out to be a problem with the usage table index. This is what I did to fix it:
ssh www18.i2u2.org
sudo
cd /etc/init.d
./tomcat6 stop
-to reindex:
ssh data1.i2u2.org
sudo
Back up the database before reindexing (see the backup steps below).
psql postgres
\c userdb2006_1022
reindex DATABASE userdb2006_1022; (Enter)
You will see the tables being reindexed. Wait until they are all done.
\c vds2006_1022
reindex DATABASE vds2006_1022; (Enter)
You will see the tables being reindexed. This will take longer when it hits anno_lfn because this is a huge table that holds most of the metadata. Be patient… it will eventually finish but it can take 10-15 minutes.
This is only needed when there seem to be problems with user logins (i.e., guest) or with data searches.
-to backup the database:
ssh data1.i2u2.org
sudo
pg_dump userdb2006_1022 > production_userdb_backup_plus_date
pg_dump vds2006_1022 > production_vds_backup_plus_date
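For example, to stamp the backup files with today's date (a sketch; any naming convention is fine as long as you can tell the dumps apart later):
pg_dump userdb2006_1022 > production_userdb_backup_$(date +%Y-%m-%d)
pg_dump vds2006_1022 > production_vds_backup_$(date +%Y-%m-%d)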
-to restore from backup:
ssh data1.i2u2.org
sudo
psql userdb2006_1022 < production_userdb_backup_plus_date
psql vds2006_1022 < production_vds_backup_plus_date
-to restart the database:
ssh data1.i2u2.org
sudo
psql --version
sudo service postgresql-<version> restart
You must restart tomcat6 on the web server after a database restart. Our database layer cannot handle dropped connections (it predates most modern layers).
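Before restarting tomcat6, you can confirm the database is accepting connections again with a quick query (a sketch, run on data1):
# should print the PostgreSQL version if the server is back up
psql postgres -c "select version();"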
6 – If you need to rollout:
Only do this if there has been a change to the code. It has sometimes happened that, after a rollout, when I tested the app by logging in to the three e-Labs and clicking the pencil, or by going to posters or plots and searching, I got exceptions about classes not being found. This indicated that the rollout that just happened did not complete successfully (even though it appeared to). So I had to roll out again and test.
a – production:
ssh www18.i2u2.org
b – development:
ssh www13.i2u2.org
sudo
cd /home/quarkcat
run: ./deploy-from-svn branches/3.1 (or whatever branch applies; see number 7 below for an explanation)
If the rollout is successful, you will see Tomcat restarted automatically at the end of the process.
If the rollout is not successful, it will tell you to look at a file it creates called deployment.log.
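The quickest check when it fails is to look at the end of that log (a sketch; I am assuming it is written in /home/quarkcat, where the deploy script runs):
tail -n 50 /home/quarkcat/deployment.log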
7 – Repository
We keep the repository at
http://cdcvs.fnal.gov/subversion/quarknet/
There are several branches in use and they mean different things. These three are the branches that I actively use for development and rollouts:
ROLL OUT TO TEST NEW STUFF:
http://cdcvs.fnal.gov/subversion/quarknet/branches/3.0-test
This is the branch that gets everything that is ready for people other than me to look at and test. I always deploy this branch on 13 or 17 for testing purposes. You can certainly deploy from this branch on those machines at any time. You can also use it to test your own code before it goes to production.
ROLL OUT TO PRODUCTION:
http://cdcvs.fnal.gov/subversion/quarknet/branches/3.0-rollout-test or
http://cdcvs.fnal.gov/subversion/quarknet/branches/3.0-rollout
This is where I move all changes that have been approved and will go to production. I usually test it on 17 before I roll out to 18, making sure that all the changes are in place and that it rolls out without any problems.
AN ATTEMPT TO HAVE AN ‘ALWAYS WORKING’ BRANCH:
http://cdcvs.fnal.gov/subversion/quarknet/branches/3.1
This branch should be reliable because it is almost always a step behind (and I can use it to roll back to if anything goes wrong during a rollout). I update it with the latest changes in production a few days after my latest rollout.
http://cdcvs.fnal.gov/subversion/quarknet/trunk
trunk/master has been updated with the latest reliable code (in sync with branches/3.1). The unreliable code that was in trunk was copied over to branches/old_trunk.
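If you need a local working copy of one of these branches (for example, to inspect or fix code before a rollout), a plain Subversion checkout works (a sketch; the target directory name is up to you):
# check out the 'always working' branch into a local directory
svn checkout http://cdcvs.fnal.gov/subversion/quarknet/branches/3.1 quarknet-3.1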
8 – LIGO
[See also the LIGO Data Guide - JG]
The LIGO streams are served by a Python REST server (code that runs in the background).
I usually go to the LIGO data tab (current or old) and click the Plot button (if the Python code is not running, you will get the message 'no data to plot').
The Python server needs to run on data4.i2u2.org. To check if it is running:
ssh data4.i2u2.org
ps -ef | grep "DataServer.py"
(if you do not see any process running, then you need to restart it)
sudo
cd /disks/i2u2/ligo/data/streams
nohup python DataServer.py &
(you can then close your terminal window; nohup keeps it running)
Killing the PID of DataServer.py (followed by a restart) may help; see the sketch below. If there is still no data in the UI, let Edit know.
Sometimes you will need to restart tomcat as well (on 18 and 13).
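Putting that together, a typical restart of the data server looks roughly like this (a sketch; replace <PID> with the process id that ps reports for DataServer.py):
# find the PID of the running server (second column of the ps output)
ps -ef | grep "DataServer.py" | grep -v grep
kill <PID>
cd /disks/i2u2/ligo/data/streams
nohup python DataServer.py &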
9 – CRONJOBS
As far as I know, there are two machines running cronjobs that we might need to check. I do not check them daily, but sometimes I make sure things are running.
ssh data2.i2u2.org
sudo
crontab -l -u quarkcat
(this will display the cronjobs)
Cronjobs active in data2.i2u2.org
0 0 * * * rsync -a --verbose --password-file=/home/quarkcat/.rsyncpw i2u2data@terra.ligo.caltech.edu::ligo/trend_after23April2013/second-trend/ /disks/i2u2/ligo/data/frames/trend_after23April2013/second-trend > /tmp/second.log 2>&1
0 0 * * * rsync -a --verbose --password-file=/home/quarkcat/.rsyncpw i2u2data@terra.ligo.caltech.edu::ligo/trend_after23April2013/minute-trend/ /disks/i2u2/ligo/data/frames/trend_after23April2013/minute-trend > /tmp/minute.log 2>&1
50 0 * * * /usr/local/ligotools/i2u2tools/bin/ImportData /disks/i2u2/ligo/data/frames/trend_after23April2013 /usr/local/ligotools/ligotools /disks/i2u2/ligo/data/streams > /tmp/convert.log 2>&1
30 1 * * * rsync -a --verbose /disks/i2u2/ligo/data/streams/ /disks/data4/i2u2/ligo/data/streams > /tmp/data4.log 2>&1
30 2 * * * rsync -zarv --verbose --include="*.raw.*" --exclude="*/" /disks/data4/i2u2-dev/cosmic/data/ /disks/i2u2-dev/cosmic/raw_data/ > /tmp/raw_data_development.log 2>&1
30 3 * * * rsync -zarv --verbose --include="*.raw.*" --exclude="*/" /disks/data4/i2u2/cosmic/data/ /disks/i2u2/cosmic/raw_data/ > /tmp/raw_data_production.log 2>&1
ssh data4.i2u2.org
sudo
crontab -l -u quarkcat
(this will display the cronjobs)
cronjobs active in data4.i2u2.org
# Run python code to get size of /disks daily so we can find out growth rate
0 0 * * * python /home/quarkcat/tools/storage-check/calculate_data_growth.py > /tmp/data-growth.log 2>&1
# Purge Cosmic scratch data (not plots)
*/15 2 * * * find /disks/i2u2/cosmic/users/ -maxdepth 10 -mindepth 10 -type f -name ".wd" -exec rm {} \;
*/15 2 * * * find /disks/i2u2/cosmic/users/ -maxdepth 10 -mindepth 10 -type f -name ".thresh" -exec rm {} \;
*/15 2 * * * find /disks/i2u2/cosmic/users/ -maxdepth 10 -mindepth 10 -type f -name ".out" -exec rm {} \;
*/15 3 * * * find /disks/i2u2/cosmic/users/ -maxdepth 10 -mindepth 10 -type f -name "singlechannelOut" -exec rm {} \;
# Purge all Cosmic scratch files greater than 7 days old
0 2 * * * find /disks/i2u2/cosmic/users/ -maxdepth 9 -mindepth 9 -type d -mtime +7 -wholename "*scratch/*" -exec rm -r {} \;
Some of the daily jobs (the LIGO rsync, the data conversion, and some of my scripts) write a log to the /tmp folder, which makes it easier to tell whether something went wrong when the cronjob ran. Let me know if you want to go over these in detail.
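For a quick check, you can just look at the tails of those logs (a sketch; the file names come from the crontab entries above):
tail /tmp/second.log /tmp/minute.log /tmp/convert.log /tmp/data4.log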
10 – Python tools
There are two tools that I currently use on data4.i2u2.org.
ssh data4.i2u2.org
sudo -u quarkcat
cd /home/quarkcat/tools
cd storage-check
view data_growth.txt
A cron job appends the size of the /disks folder to this file every night. This is to monitor our file growth.
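To see the last week or so of growth (a sketch, assuming one line is appended per night):
tail -n 7 /home/quarkcat/tools/storage-check/data_growth.txt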
ssh data4.i2u2.org
sudo -u quarkcat
cd /home/quarkcat/tools
cd threshold-times-check
python thresh_check_18.py
view output_thresh_check18.txt
This shows which split files do not have a .thresh file. I then check whether these were failed uploads or whether the .thresh files should actually have been created and were not.
Most of the splits I have found here (with the exception of two from Kevin Martz) are the ones I mentioned in the telecon. They seem to have split OK, and yet there is no record of the ProcessUpload or any metadata.
At the moment the most important thing is this: if there are files without a .thresh, and these files have metadata and we can access them through the application, then we need to create the .thresh manually. I created a tool for this that an admin can use while I find out what the problem with the app is.
11 – CMS MySql problems
According to MySQL, the problem is that mysqld has received many connection requests from the given host that were interrupted midway, so the host gets blocked. To unblock it, run:
ssh data1.i2u2.org
sudo
mysqladmin -u root -p flush-hosts
(the -p flag prompts for the MySQL root password)
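To confirm the block is cleared, a simple status call should succeed again (a sketch; it also prompts for the root password):
mysqladmin -u root -p status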
12 – Database id, primary key violation problems
This likely means that the primary key sequence in the table you are working with has gotten out of sync. This can be caused by a mass data-addition process. Believe it or not, this has a name, "bug by design", and you have to manually reset the primary key sequence. I ran these commands and that fixed it:
ssh data1.i2u2.org
sudo
psql postgres
(connect to postgres)
\c vds2006_1022
(connect to the database)
select max(id) from anno_lfn;
max
5007770
(1 row)
then run:
select nextval('anno_id_seq');
nextval
5000096
The nextval should be the max id of the table + 1 (in this case 5007771); here it returned 5000096, which is behind the table, so the sequence needs to be reset.
Therefore, run:
select setval('anno_id_seq', (select max(id) from anno_lfn) + 1);
setval
5007771
(1 row)
Problem fixed
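To double-check without advancing the sequence again, you can read its current value (a sketch, run from the same data1 session; after the setval above, last_value should be max(id) + 1):
psql vds2006_1022 -c "select last_value from anno_id_seq;"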
-- Main.EditPeronja - 2014-04-03