Website Checklist and Troubleshooting
Checklist Summary:
- Check website regularly:
- Log in as admin (Cosmic e-Lab):
- View all analyses
- View Session tracking
- Run an analysis
- Log in as guest to any e-Lab:
- Retrieve data, posters, plots
- www.i2u2.org is not responsive:
- Check that the browser is not the problem:
- Try another browser and/or quit and restart the browser you are using.
- Check if you have internet access.
- Try www.i2u2.org:8080 in your browser:
- If it works, then we need to restart Apache (this is more common on i2u2-dev than on i2u2-prod) (see number 3 below).
- Also check i2u2-dev.crc.nd.edu:
- If both sites are unresponsive, it might be a problem with Notre Dame's network. Contact systems while you continue checking.
- If the problem is only with www.i2u2.org, then we might have a problem with the Tomcat server (see number 3 below).
- The e-Labs are slow:
- Check the Tomcat log to see whether it is busy with many analyses or masterclasses (see number 1 below).
- Can't log in:
- Try another user, try guest, etc.
- Distinguish between a slow login and a user who can never log in. The latter may be a database problem (see number 5 below).
- If it is just slow, check the Tomcat log to see whether the site is busy.
- Found a bug?:
- If you know how to fix the code, then you might need a rollout (see numbers 6 and 7 below).
- LIGO is not displaying data: see number 8 below.
- CMS is not displaying charts (Exploration): see number 11 below.
- Can't upload or save:
- If you see a message on the screen saying that there was a key violation or that there was an Exception when trying to save/upload, see number 12 below.
Checklist Details:
1 – Check Tomcat log. Steps:
a – log in to production using a terminal:
ssh <your_login>@i2u2-prod.crc.nd.edu
b – log in to development using a terminal:
ssh <your_login>@i2u2-dev.crc.nd.edu
Then:
sudo
cd /home/quarkcat/sw/tomcat/logs
tail -f catalina.out
This displays the log as it grows. You can see whatever the app sent to the output, as well as exceptions or other problems as they happen. I do not check development (www13.i2u2.org) unless it is needed.
Edit does this once per day or so. Try to log in (to production, or to 13 or 17...) and watch catalina.out. If you can't log in and catalina.out shows errors, restart Tomcat (see number 3 below).
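A quicker way to scan the recent log for errors, instead of watching it live, is something like this (a sketch; adjust the path or the number of lines as needed):
# show only error lines from the last 2000 lines of the log
tail -n 2000 /home/quarkcat/sw/tomcat/logs/catalina.out | grep -iE "exception|error"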
2 – Check website:
go to
www.i2u2.org
log in as admin and look at session tracking
view the list of all analyses
log in as TestTeacher in the three e-Labs and click the pencil for the logbook (mainly after a rollout)
log in as guest and click on some links
check the LIGO e-Lab to make sure the Python server has data available (see LIGO, number 8 below).
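For a quick scripted check before opening a browser, something like the following works (a sketch; it only confirms that Apache and Tomcat answer HTTP requests, not that the e-Labs behave correctly):
# -I fetches headers only; a 200 or 302 response means the server is answering
curl -sI http://www.i2u2.org | head -n 1
curl -sI http://www.i2u2.org:8080 | head -n 1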
3 – If a server restart is needed:
*TOMCAT: only if you see problems (exceptions or some ugly message) in the catalina.out log and the website is not responsive.
*APACHE: if www18.i2u2.org is not responsive but www18.i2u2.org:8080 is actually working.
a – production:
ssh www18.i2u2.org
sudo
cd /etc/init.d
if you need to restart tomcat, run:
./tomcat6 restart
or:
./tomcat6 stop
./tomcat6 start
if you need to restart apache, run:
./apache2 restart
or:
./apache2 stop
./apache2 start
b – development:
ssh www13.i2u2.org
sudo
cd /etc/init.d
if you need to restart tomcat, run:
./i2u2tomcat restart
or:
./i2u2tomcat stop
./i2u2tomcat start
if you need to restart apache, run:
./httpd restart
or:
./httpd stop
./httpd start
4 – Also:
- check if it is the hosting site's (Notre Dame's) problem: if www.i2u2.org is not responding, try www13.i2u2.org. If none of them respond, then call systems. Also try ssh to any of the machines; sometimes you can ssh in and see that the servers are fine and yet the site is still not responsive.
- check if it is your browser: Safari sometimes chokes. Test other browsers. If you only have Safari, clear the history and quit, then restart it and it should work.
- check your internet connection (Comcast had an outage in my area once, and the first thing I thought was that the website was down).
5 – Could be the database… (check with somebody before doing this)
This is rarer, but it has happened that some index gets corrupted and then the site might not work correctly. Months ago I had a problem with the guest user; it turned out to be a problem with the usage table index. This is what I did to fix it:
ssh www18.i2u2.org
sudo
cd /etc/init.d
./tomcat6 stop
-to reindex:
ssh data1.i2u2.org
sudo
Back up the database before reindexing (see the backup steps below).
psql postgres
\c userdb2006_1022
reindex DATABASE userdb2006_1022; (Enter)
You will see the tables being reindexed. Wait until they are all done.
\c vds2006_1022
reindex DATABASE vds2006_1022; (Enter)
You will see the tables being reindexed. This will take longer when it hits anno_lfn because this is a huge table that holds most of the metadata. Be patient… it will eventually finish but it can take 10-15 minutes.
This is only needed when there seem to be problems with user logins (i.e., guest) or with data searches.
-to backup the database:
ssh data1.i2u2.org
sudo
pg_dump userdb2006_1022 > production_userdb_backup_plus_date
pg_dump vds2006_1022 > production_vds_backup_plus_date
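For example, to stamp the backup files with today's date (a sketch; any naming convention is fine as long as you can tell the dumps apart later):
pg_dump userdb2006_1022 > production_userdb_backup_$(date +%Y-%m-%d)
pg_dump vds2006_1022 > production_vds_backup_$(date +%Y-%m-%d)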
-to restore from backup:
ssh data1.i2u2.org
sudo
psql userdb2006_1022 < production_userdb_backup_plus_date
psql vds2006_1022 < production_vds_backup_plus_date
-to restart the database:
ssh data1.i2u2.org
sudo
psql --version
sudo service postgresql-<version> restart
You must restart tomcat6 on the web server after a database restart. Our database layer cannot handle dropped connections (it predates most modern layers).
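Before restarting tomcat6, you can confirm the database is accepting connections again with a quick query (a sketch, run on data1):
# should print the PostgreSQL version if the server is back up
psql postgres -c "select version();"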
6 – If you need to rollout:
Only do this if there has been a change to the code. It has sometimes happened that, after a rollout, when I tested the app by logging in to the three e-Labs and clicking the pencil, or by going to posters or plots and searching, I got exceptions about classes not being found. This indicated that the rollout that just happened did not complete successfully (even though it appeared to). So I had to roll out again and test.
a – production:
ssh www18.i2u2.org
b – development:
ssh www13.i2u2.org
sudo
cd /home/quarkcat
run: ./deploy-from-svn branches/3.1 (or whatever branch applies; see number 7 below for an explanation)
If the rollout is successful, you will see Tomcat restarted automatically at the end of the process.
If the rollout is not successful, it will tell you to look at a file it creates called deployment.log.
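The quickest check when it fails is to look at the end of that log (a sketch; I am assuming it is written in /home/quarkcat, where the deploy script runs):
tail -n 50 /home/quarkcat/deployment.log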
7 – Repository
We keep the repository at
http://cdcvs.fnal.gov/subversion/quarknet/
There are several branches in use and they mean different things. These three are the branches that I actively use for development and rollouts:
ROLL OUT TO TEST NEW STUFF:
http://cdcvs.fnal.gov/subversion/quarknet/branches/3.0-test
This is the branch that gets everything that is ready for people other than me to look at and test. I always deploy this branch on 13 or 17 for testing purposes. You can certainly deploy from this branch on those machines at any time. You can also use it to test your own code before it goes to production.
ROLL OUT TO PRODUCTION:
http://cdcvs.fnal.gov/subversion/quarknet/branches/3.0-rollout-test or
http://cdcvs.fnal.gov/subversion/quarknet/branches/3.0-rollout
This is where I move all changes that have been approved and will go to production. I usually test it on 17 before I roll out to 18, making sure that all the changes are in place and that it rolls out without any problems.
AN ATTEMPT TO HAVE AN ‘ALWAYS WORKING’ BRANCH:
http://cdcvs.fnal.gov/subversion/quarknet/branches/3.1
This branch should be reliable because it is almost always a step behind (and I can use it to roll back to if anything goes wrong during a rollout). I update it with the latest changes in production a few days after my latest rollout.
http://cdcvs.fnal.gov/subversion/quarknet/trunk
trunk/master has been updated with the latest reliable code (in sync with branches/3.1). The unreliable code that was in trunk was copied over to branches/old_trunk.
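If you need a local working copy of one of these branches (for example, to inspect or fix code before a rollout), a plain Subversion checkout works (a sketch; the target directory name is up to you):
# check out the 'always working' branch into a local directory
svn checkout http://cdcvs.fnal.gov/subversion/quarknet/branches/3.1 quarknet-3.1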
8 – LIGO
[See also the LIGO Data Guide - JG]
The LIGO streams are served by a Python REST server (code that runs in the background).
I usually go to the LIGO data tab (current or old) and click the Plot button (if the Python code is not running, you will get the message 'no data to plot').
The Python server needs to run on data4.i2u2.org. To check if it is running:
ssh data4.i2u2.org
ps -ef | grep "DataServer.py"
(if you do not see any process running, then you need to restart it)
sudo
cd /disks/i2u2/ligo/data/streams
nohup python DataServer.py &
(you can then close your terminal window; nohup keeps it running)
Killing the PID of DataServer.py (followed by a restart) may help; see the sketch below. If there is still no data in the UI, let Edit know.
Sometimes you will need to restart tomcat as well (on 18 and 13).
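Putting that together, a typical restart of the data server looks roughly like this (a sketch; replace <PID> with the process id that ps reports for DataServer.py):
# find the PID of the running server (second column of the ps output)
ps -ef | grep "DataServer.py" | grep -v grep
kill <PID>
cd /disks/i2u2/ligo/data/streams
nohup python DataServer.py &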
9 – CRONJOBS
As far as I know, there are two machines running cronjobs that we might need to check. I do not check them daily, but sometimes I make sure things are running.
ssh data2.i2u2.org
sudo
crontab -l -u quarkcat
(this will display the cronjobs)
Cronjobs active in data2.i2u2.org
0 0 * * * rsync -a --verbose --password-file=/home/quarkcat/.rsyncpw i2u2data@terra.ligo.caltech.edu::ligo/trend_after23April2013/second-trend/ /disks/i2u2/ligo/data/frames/trend_after23April2013/second-trend > /tmp/second.log 2>&1
0 0 * * * rsync -a --verbose --password-file=/home/quarkcat/.rsyncpw i2u2data@terra.ligo.caltech.edu::ligo/trend_after23April2013/minute-trend/ /disks/i2u2/ligo/data/frames/trend_after23April2013/minute-trend > /tmp/minute.log 2>&1
50 0 * * * /usr/local/ligotools/i2u2tools/bin/ImportData /disks/i2u2/ligo/data/frames/trend_after23April2013 /usr/local/ligotools/ligotools /disks/i2u2/ligo/data/streams > /tmp/convert.log 2>&1
30 1 * * * rsync -a --verbose /disks/i2u2/ligo/data/streams/ /disks/data4/i2u2/ligo/data/streams > /tmp/data4.log 2>&1
30 2 * * * rsync -zarv --verbose --include="*.raw.*" --exclude="*/" /disks/data4/i2u2-dev/cosmic/data/ /disks/i2u2-dev/cosmic/raw_data/ > /tmp/raw_data_development.log 2>&1
30 3 * * * rsync -zarv --verbose --include="*.raw.*" --exclude="*/" /disks/data4/i2u2/cosmic/data/ /disks/i2u2/cosmic/raw_data/ > /tmp/raw_data_production.log 2>&1
ssh data4.i2u2.org
sudo
crontab -l -u quarkcat
(this will display the cronjobs)
cronjobs active in data4.i2u2.org
# Run python code to get size of /disks daily so we can find out growth rate
0 0 * * * python /home/quarkcat/tools/storage-check/calculate_data_growth.py > /tmp/data-growth.log 2>&1
# Purge Cosmic scratch data (not plots)
*/15 2 * * * find /disks/i2u2/cosmic/users/ -maxdepth 10 -mindepth 10 -type f -name ".wd" -exec rm {} \;
*/15 2 * * * find /disks/i2u2/cosmic/users/ -maxdepth 10 -mindepth 10 -type f -name ".thresh" -exec rm {} \;
*/15 2 * * * find /disks/i2u2/cosmic/users/ -maxdepth 10 -mindepth 10 -type f -name ".out" -exec rm {} \;
*/15 3 * * * find /disks/i2u2/cosmic/users/ -maxdepth 10 -mindepth 10 -type f -name "singlechannelOut" -exec rm {} \;
# Purge all Cosmic scratch files greater than 7 days old
0 2 * * * find /disks/i2u2/cosmic/users/ -maxdepth 9 -mindepth 9 -type d -mtime +7 -wholename "*scratch/*" -exec rm -r {} \;
Some of the daily jobs (the LIGO rsync, the data conversion, and some of my scripts) write a log to the /tmp folder, which makes it easier to tell whether something went wrong when the cronjob ran. Let me know if you want to go over these in detail.
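For a quick check, you can just look at the tails of those logs (a sketch; the file names come from the crontab entries above):
tail /tmp/second.log /tmp/minute.log /tmp/convert.log /tmp/data4.log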
10 – Python tools
There are two tools that I currently use on data4.i2u2.org.
ssh data4.i2u2.org
sudo -u quarkcat
cd /home/quarkcat/tools
cd storage-check
view data_growth.txt
A cron job appends the size of the /disks folder to this file every night. This is to monitor our file growth.
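To see the last week or so of growth (a sketch, assuming one line is appended per night):
tail -n 7 /home/quarkcat/tools/storage-check/data_growth.txt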
ssh data4.i2u2.org
sudo -u quarkcat
cd /home/quarkcat/tools
cd threshold-times-check
python thresh_check_18.py
view output_thresh_check18.txt
This shows which split files do not have a .thresh file. I then check whether these were failed uploads or whether the .thresh files should actually have been created and were not.
Most of the splits I have found here (with the exception of two from Kevin Martz) are the ones I mentioned in the telecon. They seem to have split OK, and yet there is no record of the ProcessUpload or any metadata.
At the moment the most important thing is this: if there are files without a .thresh, and these files have metadata and we can access them through the application, then we need to create the .thresh manually. I created a tool for this that an admin can use while I find out what the problem with the app is.
11 – CMS MySql problems
According to MySQL, the problem is that mysqld has received many connection requests from the given host that were interrupted midway, so the host gets blocked. To unblock it, run:
ssh data1.i2u2.org
sudo
mysqladmin -u root -p flush-hosts
(the -p flag prompts for the MySQL root password)
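To confirm the block is cleared, a simple status call should succeed again (a sketch; it also prompts for the root password):
mysqladmin -u root -p status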
12 – Database id, primary key violation problems
This likely means that the primary key sequence in the table you are working with has gotten out of sync. This can be caused by a mass data-addition process. Believe it or not, this has a name, "bug by design", and you have to manually reset the primary key sequence. I ran these commands and that fixed it:
ssh data1.i2u2.org
sudo
psql postgres
(connect to postgres)
\c vds2006_1022
(connect to the database)
select max(id) from anno_lfn;
max
5007770
(1 row)
then run:
select nextval('anno_id_seq');
nextval
5000096
The nextval should be the max id of the table + 1 (in this case 5007771); here it returned 5000096, which is behind the table, so the sequence needs to be reset.
Therefore, run:
select setval('anno_id_seq', (select max(id) from anno_lfn) + 1);
setval
5007771
(1 row)
Problem fixed
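To double-check without advancing the sequence again, you can read its current value (a sketch, run from the same data1 session; after the setval above, last_value should be max(id) + 1):
psql vds2006_1022 -c "select last_value from anno_id_seq;"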
-- Main.EditPeronja - 2014-04-03