Troubleshooting
- Telescope
- Cryogenics
- Electronics
- Rotator
- Computer Crashes
  - andante crashes
  - allegro crashes
  - allegro and andante crash
  - kilauea crashes
  - The telescope computer (hau) crashes
  - Everybody crashes!
- Data Streams
- Data Stream Programs or Windows Die
  - dirsync Crashes or Hangs
  - header_copy Crashes or Hangs
  - write_log Crashes or Hangs
  - Inspecting the Encoder Logs
- Merging Dies
- Data Processing
  - For all cases
  - Xterms die
  - Xterm(s) remain alive but IDL quits
  - Xterm(s) remain alive but an IDL routine crashes
  - IDL cleaning code does not clean observations of a particular source
- General Computer Troubleshooting
  - andante (the DAS/fridge computer)
  - allegro
  - kilauea
  - Problems Accessing Data Disks or Dying Data Disks
  - Dying System Disks
- Revision History
Back to BolocamWebPage
Back to ExpertManual
Telescope
For non-Bolocam related problems (dome, dish, antenna computer,
etc.), go to the main CSO Hawaii webpage (http://www.cso.caltech.edu),
scroll down and click on "Local Information". You will find
generic troubleshooting information there.
Cryogenics
About the only cryogenics problem that the typical observer can deal
with is running out of LHe. If you know how to do LHe fills, you
can go ahead and refill; see the cryogen fill
instructions. If you caught the problem quickly enough and
the fridge did not die, you may be able to continue observing. If
the UC Fridge GRT reading on the fridge
monitoring page returns to its previous value, you're fine and you
can keep observing.
If the fridge did run out (IC and/or UC Fridge GRT readings high and
not recovering), then you can at least speed the recovery along by
following the recovery
instructions.
If you are not experienced with doing cryogen fills, your night is
done. Leave a note for the day crew and shut down for the
day. They will refill, recover, and set the fridge to cycle and
be cold for the next night.
If you have more serious problems -- cryogen hold time sharply
decreased, fridge cycle failing, etc. -- let the day crew and the Bolocam support person know.
Electronics
There are two kinds of electronics problems one typically runs into:
- functionality problems:
Some signal is just not present, or is reading completely incorrectly,
etc. It is likely that, for some reason or another, some switch
has been put into the wrong state, some cable has become disconnected,
etc. The best thing to do is to carefully walk through the system
and make sure everything is set up correctly. Go to the Electronics page and the Setting up for Observing page
and make sure all necessary connections have been made and all the
power switches are on.
- noise problems: If noise
problems appear in most or all channels simultaneously, or in all
channels of a given hextant, it is likely that the problem is the bias
board. If you are not an expert, your best bet is to simply
replace the bias board with the spare. We use either bias board
2.1 or 3.1, so you can grab the unused one (usually in the 3rd floor
storage room, in the cabinet) and replace the problem board. You must turn off the power to the board
before removing it; see the Electronics
page for instructions on how to do this and for pictures to identify
the bias boards and instructions
for setting them up for observing.
If noise problems appear in a single channel or only a few channels,
note them and send the list to the
Bolocam support person, and then
continue with your observing. He will get in touch with the day
crew to troubleshoot the problem channels.
Rotator
Typical problems that might occur are:
- The rotator will not home when the home program is used.
- The home program
cannot set the origin of the rotator encoder when it is homed.
- The interactive
program reads back junk from the encoder.
- The interactive
program cannot seem to get the rotator to go to a programmed angle.
- The rotator program
either rotates when you have asked it not to rotate, or does not rotate
when you have asked it to rotate.
- The rotator hits its limit switch, resulting in killing the motor
power and the rotator swinging to some arbitrary angle.
The sequence of troubleshooting is as follows:
- You should stop your current observing macro but otherwise leave
all programs running.
- First, assume that it is a mild problem that can be fixed simply by
resetting the rotator. Use the home program; instructions are
given elsewhere.
This essentially resets the entire system and will likely get rid of
any mild problems. If rehoming is successful, test the rotator by
rotating to some small angles (between -30 and +30 deg) using the interactive program, and
try reading back the angles and see if they make sense. If interactive works, you can
then restart observing. The observation during which the problem occurred
should probably be discarded, but in principle later observations
should be fine. If either the rotator or write_log program died, you
will have to restart them as explained below
prior to restarting observing.
- If you can't rehome, or rehoming does not help, maybe the DIP
switches on the fiber-optic isolators have gotten screwed
up. Check that they are set correctly by comparing to the
instructions
on the Setting
up for Observing page.
- If rehoming fails, you should determine whether it is an obvious
broken connection problem. Go outside and check all the rotator
cabling, which is described on the Setting
up for Observing page. The most likely failure mode is the
fiber-optic cables; spares can be found in a box near allegro.
Make sure you hook the replacement up properly, paying attention to the
connector colors and where they connect to. If the other parts of
the
cabling fail, you may be able to find replacements by digging around
the AOS lab. After replacing the damaged cabling, try rehoming
again. If that works, try testing using interactive as above. If
that works, you can probably start observing again, though again you
may have to restart rotator
or write_log as above.
- If rehoming continues to fail, there may be communication or
control problems. If you suddenly get errors in your rotator or write_log window such as modprobe: can't locate module,
then somehow one of the run-time kernel modules has been
unloaded. Log in to allegro.submm.caltech.edu
as observer (password in
white Bolocam Manual binder) and type
> insmod rocket
> insmod seaio
You should receive messages like
Using /lib/modules/2.4.13-0.6/kernel/drivers/char/rocket.o
Using /lib/modules/misc/seaio.o
possibly with warnings or the messages
insmod: a module named rocket already exists
insmod: a module named seaio already exists
There may be other warnings. As long as none of them say a module
could not be loaded, then things should work. Try rehoming and
running interactive as
above; if successful, you can restart observing, restarting rotator and/or interactive as above if
necessary.
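As a quick sanity check before rehoming, you can confirm that both modules show up in the lsmod listing. The snippet below is a sketch, not one of the installed site scripts; it runs the check against a captured sample listing so the logic is self-contained. On allegro itself you would simply pipe the real `lsmod` output through the same filter.

```shell
# Sketch: confirm the rocket and seaio kernel modules are loaded by
# looking for them in lsmod-style output. A captured sample listing is
# used here so the example is self-contained; on allegro, replace
# "$sample_lsmod" with the output of the real `lsmod` command.
sample_lsmod='Module                  Size  Used by
rocket                 24576  0
seaio                  16384  0
usbcore               286720  4'

# Keep only the module-name column for the two modules we care about.
loaded=$(printf '%s\n' "$sample_lsmod" | awk '$1 == "rocket" || $1 == "seaio" { print $1 }')

missing=""
for m in rocket seaio; do
  case "$loaded" in
    *"$m"*) echo "$m: loaded" ;;
    *)      echo "$m: NOT loaded -- rerun insmod"; missing="$missing $m" ;;
  esac
done
```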
- If you are still having problems, then the best thing to do is
just lock the rotator to its home position and disable the rotator for
the night. By turning the motor power off (see the rotator
instructions elsewhere),
you can rotate the dewar to its home
position (where the homing sensor tab occludes the right half of the
homing sensor). Turn the power back on to have it hold
there.
You will have to restart
the rotator program with
rotation disabled (R = 0).
If you had problems communicating with the encoder, then you will have
to restart write_log with the encoder
readout disabled (do not include the -e flag). Clearly note
when this occurred in your observing logs, as it will be necessary to recalibrate
the rotator angle from that point onward. The data will be
entirely analyzable; it will just have to be treated differently than the
preceding data.
Regardless of the problem, inform the
Bolocam support person, providing
details, so the problem can be rectified.
Computer Crashes
Ah, the bane of every system, the reason we should just go back to
using chart recorders and slide rules!
Remarkably, our critical computers, andante and allegro, are quite stable.
This is because we do not run much on them: andante runs only the DAS and
the fridge control; allegro
runs only the data copying programs and gbolostrip. kilauea tends to be less stable
due to strange goings-on with its video card. We provide
instructions here for recovering from crashes of each of these machines.
For an explanation of the data streams, see the Data
Acquisition,
Rotator Control, and Data Handling page.
andante crashes
This is not too tragic. Do the following:
- Reboot andante twice.
Go to the folder containing the raw data (D:\DAS_DATA\YYYYMMDD) and
delete any files that are the wrong size (compare to the other files,
all should be the same size to within 1-2 bytes). Delete any .lck files also.
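The wrong-size files can be spotted by eye, but the size check can also be scripted. The sketch below is hypothetical (not one of the installed scripts) and builds its own demo directory so it is self-contained; in practice you would point the same loop at the real raw-data directory, e.g. /data00/rawdir/YYYYMMDD.

```shell
# Sketch: flag files whose size differs from the most common file size
# by more than 2 bytes -- the same criterion described above for
# spotting truncated minute files. Demo directory stands in for the
# real raw-data directory.
datadir=$(mktemp -d)
head -c 1000 /dev/zero > "$datadir/obs01"
head -c 1001 /dev/zero > "$datadir/obs02"   # within the 1-2 byte tolerance
head -c 1000 /dev/zero > "$datadir/obs03"
head -c  400 /dev/zero > "$datadir/obs04"   # truncated by a crash

# "Typical" size = the most common size among the files.
typical=$(for f in "$datadir"/*; do wc -c < "$f"; done \
          | sort -n | uniq -c | sort -rn | head -1 | awk '{print $2}')

suspects=""
for f in "$datadir"/*; do
  size=$(wc -c < "$f")
  d=$((size - typical)); [ "$d" -lt 0 ] && d=$((0 - d))
  if [ "$d" -gt 2 ]; then
    echo "SUSPECT: $f ($size bytes; typical size $typical)"
    suspects="$suspects $(basename "$f")"
  fi
done
```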
- Log into allegro as
observer (password in
the white Bolocam Manual binder). Go to /data00/rawdir/YYYYMMDD and
delete any undersized files you find there also.
- If gbolostrip is
still running, kill it. The window's kill button may not work; in
that case, find the gbolostrip process with ps and grep and kill it by
PID from the command line.
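A generic sketch of the ps/grep/kill pattern, demonstrated here on a throwaway sleep process rather than gbolostrip itself; on allegro you would grep for gbolostrip (or, as noted below, dirsync.py) instead:

```shell
# Sketch: find a process by name with ps/grep and kill it by PID.
# A throwaway `sleep` stands in for the stuck program.
sleep 600 &
demo_pid=$!

# List all processes, keep lines matching the name, drop the grep
# process itself, and take the PID column.
pids=$(ps -A -o pid= -o args= | grep 'sleep 600' | grep -v grep | awk '{print $1}')

# Send SIGTERM to each match; escalate to kill -9 only if it ignores it.
for p in $pids; do kill "$p" 2>/dev/null; done
```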
- Remount andante's
data disk on allegro. Log
into allegro as root. Go to /smb. Follow the
instructions in the AAAREADME
file that is located there. Check that the disk is mounted by
typing ls /smb/andante.
You should see the YYYYMMDD
data directories.
If you have problems, make sure that andante is properly set up to
share D:\DAS_DATA and
subdirectories thereof. If you think everything is properly set
up, then the problem may be that dirsync.py
is trying to access /smb/andante.
You need to kill dirsync.py.
You can do this in the same way as you killed gbolostrip, just replace gbolostrip with dirsync.py in the grep command. Once you
have killed dirsync.py,
you should be able to mount /smb/andante
as instructed above. Restart dirsync.py as explained below. Your log
monitoring window should still be running, you don't need to restart it.
Remember to exit your root
session now.
- Restart the DAS in the same way as you did at the start of the
night (see the daily
startup instructions).
- Restart gbolostrip
in the same way as you did at the start of the
night (see the daily
startup instructions).
- Restart merge on kilauea as explained below, working around the hole in the data
due to the missing DAS files. If you receive "short file" errors,
then you did not properly clean up the short raw data files on either andante or allegro as instructed
above. Check this and try again.
- The remainder of the analysis software should work around the
hole without any problems. If you do have problems see the Data Processing section of this page.
allegro crashes
This is a pain because the encoder log files are completely lost for
this period. Do the following:
- Reboot allegro.
- Mount \\andante\d\das_data
and hau:/var/plog on allegro and allegro:/data00 on kilauea as explained elsewhere.
- Restart the data stream programs as indicated below; remember to provide the nlast argument to start_tel_util.
- Restart merge on kilauea as explained below, working around
the hole in the data due to the missing encoder log files.
- The remainder of the analysis software should work around the
hole without any problems. If you do have problems see the Data
Processing section of this page.
allegro and andante crash
You should only be so lucky. The main thing here is to bring up
both computers and get everything cross-mounted as explained separately
for each machine above before starting any programs. Then you can
restart the data-copying programs, then the DAS, then merge.
kilauea crashes
This is not so bad because no data are lost. Don't be fooled by
the fact that your data copying programs were running in windows on kilauea; they weren't really --
only the log files were being displayed in these windows. Do the
following:
- Reboot kilauea.
- Restart UIP using the instructions found here:
UIP guide.
- Mount allegro:/data00
and kilauea:/data_bolocam
on kilauea as explained elsewhere.
- Restart the log monitoring for the data-copying programs as
explained below.
- Restart merge on kilauea as explained below
from the last point where you think things were merged properly.
Err on the side of remerging rather than missing unmerged data.
- Restart the analysis software on kilauea as explained below. The software will
automatically figure out what data has and has not been processed,
though you may need to clean up .lck
files as indicated.
The telescope computer (hau) crashes
This happens very infrequently. Your observation is
terminated. write_log
will continue running without too much problem, but, obviously, it gets
no information from the
telescope and so will write invalid values.
To recover, do the following:
- In the write_log
log screen, you will see timeout errors while the telescope computer is
unavailable. This is fine. They should stop and you should
see normal write_log
messages when the telescope computer becomes available again.
- Reboot the telescope computer (see the CSO web page as instructed
above).
- Mount hau:/var/plog
as explained elsewhere.
- Wait a minute and see if dirsync
starts copying new telescope computer files by watching dirsync's log screen.
Check that normal write_log
messages begin to appear. If one of these programs fails to start
executing properly again, follow the instructions for restarting them below.
- merge will
presumably have died because it could not find any pointing log
files. You will have to merge
around the hole as explained below.
Everybody crashes!
Again, get all the computers up and the disks cross-mounted first, then
start up the various programs.
Data Streams
For an explanation of the data streams, see the Data Acquisition,
Rotator Control, and Data Handling page.
Data Stream Programs or Windows Die
The more likely occurrence is that the X connection to the machine
displaying the monitoring windows for the data copying programs goes
down (for example, if kilauea
crashes). This is not a major
problem! The data copying programs are running
autonomously on allegro;
all that has happened is that the windows that
display the log files written by these programs have died. You have not lost any data, all you need to
do is restart the monitor windows. Once you have your X
server back up and running, log into allegro (set X forwarding as
necessary) and type
> start_tel_util YYYYMMDD R E 0
YYYYMMDD is UT date, R indicates whether you want to
use the dewar rotator or not (R
= 1 means "use the rotator"), and E indicates whether you want to
read the rotator encoder (E = 1
indicates that the encoder should be read; if you don't read the
encoder, the rotator angle will be taken to be 0 and you will have to
deal with this later in the analysis). The last argument, 0, tells the program that all
the processes are already running, you just want to create the
monitoring windows. start_tel_util
will check to see whether all 4 programs are indeed running; it will
advise you if there is a problem.
If, on the other hand, allegro
has itself died, then the file copying
programs have died. Once you have allegro back up and are ready
to start taking data again, you can start them back up using the command
> start_tel_util YYYYMMDD R E 1 nlast
where YYYYMMDD, R, and E are as above. The 4th
argument is set to 1 to
advise start_tel_util
that it needs to start up the programs again, not just start up the
log
monitoring windows. nlast is very important; it is the number
of the last rotator log that was written (in /data00/encdir/YYYYMMDD).
Remember, nlast is just the number, not the entire filename. You will have lost the
encoder log files between nlast
and the minute you restart the programs; you will have to force merging
to work around them as indicated below.
However, as long as you use the nlast
argument, the observation number should pick up where it left
off. If you forget the nlast argument, the observation number will
start again from 0 and you will have a mess on your hands (it
can be cleaned up, but you will have to consult an expert).
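Since getting nlast right matters so much, it can help to pull the highest log number out of the encoder directory with a one-liner. The enc_NNNN.log naming used in this sketch is an assumption for the demo; check the actual filenames in /data00/encdir/YYYYMMDD and adjust the sed pattern to match before relying on it.

```shell
# Hypothetical helper: print the largest numeric suffix among the
# encoder logs in a directory. The enc_NNNN.log naming is assumed for
# the demo; adapt the sed pattern to the real encoder-log filenames.
encdir=$(mktemp -d)   # demo stand-in for /data00/encdir/YYYYMMDD
touch "$encdir/enc_0001.log" "$encdir/enc_0002.log" "$encdir/enc_0017.log"

# Strip each name down to its digits, sort numerically, keep the largest.
nlast=$(ls "$encdir" | sed -n 's/^enc_0*\([0-9][0-9]*\)\.log$/\1/p' | sort -n | tail -1)
echo "nlast = $nlast"
```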
If allegro has not died
but you suspect that one or more of the file copying programs has died,
you can check by logging into allegro
and typing
> check_tel_util
You will get messages indicating which processes are still
running. Proceed as follows:
- If all of the processes have died, you can restart in the same
way as you would if allegro
had died.
- If write_log has
died, the easiest thing to do is to kill the other programs and restart
everything as if allegro
had died. Just issue the command
> kill_tel_util
You will see messages indicating which programs were killed and which
were not running. Then issue the
> start_tel_util YYYYMMDD R E 1 nlast
command as you would have above. Make
sure to type the above line correctly.
- If write_log has
not died, you are better off restarting the processes by hand so the
encoder logs remain continuous. Issue whichever of the following
commands are necessary (corresponding to the processes that need to be
restarted):
> /home/observer/src/rotator/rotator R \
  >>& /data00/encdir/rotator_YYYYMMDD.log &
> /home/observer/src/dirsync/dirsync.py \
  /smb/andante/YYYYMMDD \
  /data00/rawdir/YYYYMMDD \
  >>& /data00/rawdir/dirsync_YYYYMMDD.log &
> /home/observer/src/dirsync/header_copy.py \
  /data/plog/YYYYMMDD \
  /data00/headerdir/YYYYMMDD \
  >>& /data00/rawdir/dirsync_YYYYMMDD.log &
You can then restart the log monitoring windows using the
> start_tel_util YYYYMMDD R E 0
command as you would have if only the X connection had died. You
may end up with duplicate log monitoring windows, just kill the
duplicates: killing the duplicate log monitoring windows does not
affect the operation of the running programs. Make sure to type the above line correctly,
otherwise you may get unexpected behavior.
dirsync Crashes or Hangs
Check: /smb/andante/YYYYMMDD is visible on allegro but not readable by observer.
Remedy: Check the Windows sharing setup for \\andante\d\das_data and \\andante\d\das_data\YYYYMMDD.

Check: /smb/andante/YYYYMMDD is not visible on allegro, but a directory listing of /smb/andante returns something.
Remedy: \\andante\d\das_data\YYYYMMDD probably has not been created. Do so from andante's desktop.

Check: /smb/andante/YYYYMMDD is not visible on allegro, and a directory listing of /smb/andante returns nothing.
Remedy:
- Check that andante is powered on and Windows has not crashed. Reboot if necessary.
- If andante is on, then \\andante\d\das_data probably has not been cross-mounted. cd to /smb on allegro and follow the instructions in /smb/AAAREADME. You may need the root password; it is on allegro's monitor. If this fails, then it is likely that \\andante\d\das_data is not being shared properly. Check the sharing setup for this directory on andante directly. A reboot of andante may be necessary. It is very unlikely that the problem is with allegro, as this cross-mounting has operated without problems on allegro's side since 2000.

Check: /data00/rawdir/YYYYMMDD does not exist.
Remedy: Should not happen -- start_tel_util should not have started dirsync. Check that /data00/rawdir exists and that observer has write permissions. If the permissions are wrong, change them by becoming root.

Check: /data00/headerdir/YYYYMMDD is not writeable by observer.
Remedy: Should not happen -- start_tel_util should not have started dirsync. Change permissions by becoming root.

Check: Is there free disk space on /data00? Check using df -k.
Remedy: Move or delete some data.
header_copy Crashes or Hangs
Check: /data/plog is visible on allegro but not readable by observer.
Remedy: Check permissions for /data/plog; become root and change them if necessary.

Check: /data/plog is visible on allegro but is empty.
Remedy: hau:/var/plog probably has not been cross-mounted. Check using df -k. If hau:/var/plog is not mounted at /data/plog, then become root and mount it by typing mount /data/plog. If this fails, then it is likely that hau is either not exporting /var/plog or not considering allegro to be a valid mount client. Contact a CSO staff member in the following order: Hiro, Ruisheng, Richard, Martin, anyone else. Of course, hau may just be dead, but presumably you would have been told that by now.

Check: /data/plog is visible on allegro and contains files, but nothing is being copied.
Remedy: Is there a .lck file in /data/plog? Check by doing ls /data/plog/*.lck. If not, then you are probably suffering from the antenna computer "no more free inodes" problem. You have to reboot the antenna computer; see this link. Once the antenna computer display shows something, in UIP type ANTENNA/RESTART/NOSYNC; you should see the antenna display come back up. If not, consult the CSO Troubleshooting page. If you still can't get it to come up, contact someone (try the pager first, then Hiro).

Check: /data00/headerdir/YYYYMMDD does not exist.
Remedy: Should not happen -- start_tel_util should not have started header_copy. Check that /data00/headerdir exists and that observer has write permissions. If the permissions are wrong, change them by becoming root.

Check: /data00/headerdir/YYYYMMDD is not writeable by observer.
Remedy: Should not happen -- start_tel_util should not have started header_copy. Change permissions by becoming root.

Check: Is there free disk space on /data00? Check using df -k.
Remedy: Move or delete some data.
write_log Crashes or Hangs
Check: write_log gives an RPC timeout error.
Remedy: hau's RPC server is not up, is failing, or the network connection to hau is not good. Not much you can do; try calling Hiro. Check whether you are also having access problems with /data/plog.

Check: /data00/encdir/YYYYMMDD does not exist.
Remedy: Should not happen -- start_tel_util should not have started write_log. Check that /data00/encdir exists and that observer has write permissions. If the permissions are wrong, change them by becoming root.

Check: /data00/encdir/YYYYMMDD is not writeable by observer.
Remedy: Should not happen -- start_tel_util should not have started write_log. Change permissions by becoming root.

Check: Is there free disk space on /data00? Check using df -k.
Remedy: Move or delete some data.
Inspecting the Encoder Logs
Sometimes you may not be sure what has happened with the encoder logs
and you want to inspect them directly to see which observation numbers
are present and whether they match up with the source names as you
expect. There is a simple utility for doing this, sum_encdir. To use it,
simply type
> sum_encdir /data00/encdir/YYYYMMDD
A list of observation numbers and source names will be printed out.
Merging Dies
Merging can die if any of the necessary files (raw bolometer
data, pointing files from telescope, encoder log files from rotator)
are missing or if the raw data files are short. Typical error
messages are:
Error opening das directory
Error opening header directory
Error opening encoder directory
These imply that the given directory
could not be found. Since start_merge
ensures the directories exist when it begins, this means that a
directory has "vanished" in midstream. This is usually because a
cross-mounted disk from another computer has gone offline, usually
because the computer has crashed. For example, if you are merging on kilauea and allegro crashes, you will get
these errors. Consult the instructions above for dealing with a
crashed computer.
Cannot open file XXXX, reached max number of tries
This means that for a given raw data
file, no pointing log or encoder log file was found after waiting for
some number of 30-second intervals.
Previous number, this number
This means that the raw data file
minute number incremented by more than 1, which implies files were lost.
Now about to crash!
File size is XXXX
File pointer position is YYYY
feof reports ZZZZ
Now crashing, satisfied...
Happily aborting with error
This error occurs when a raw data file
is the wrong size. Raw data files have an almost perfectly fixed
length set by the number of sampled channels and the number of samples
per minute. This error will usually happen on the last file of
the night because the DAS is usually stopped mid-minute. That's
fine. You should worry when it occurs partway through the night.
You should also worry if merging remains stuck in the wait loop for the
next file. New raw data files should appear every minute, so if
merging stalls for much longer than that, it indicates that the raw data
files are not being generated or are not being copied to allegro.
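When you need to locate the hole that triggered the "minute number incremented by more than 1" error, the check can be reproduced by hand. This is a sketch only: the raw_NNNN naming is an assumption for the demo, so substitute the real DAS filename pattern before using it on actual data.

```shell
# Sketch: report gaps in a sequence of minute-numbered raw files, i.e.
# places where the number jumps by more than 1. The raw_NNNN naming is
# assumed for the demo; adapt the sed pattern to the real DAS filenames.
rawdir=$(mktemp -d)   # demo stand-in for /data00/rawdir/YYYYMMDD
touch "$rawdir/raw_0001" "$rawdir/raw_0002" "$rawdir/raw_0005"

gaps=""
prev=""
for n in $(ls "$rawdir" | sed -n 's/^raw_0*\([0-9][0-9]*\)$/\1/p' | sort -n); do
  if [ -n "$prev" ] && [ $((n - prev)) -gt 1 ]; then
    echo "gap: minutes $((prev + 1)) through $((n - 1)) are missing"
    gaps="$gaps $((prev + 1))-$((n - 1))"
  fi
  prev=$n
done
```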
For the various
cases, do the following:
Data Processing
This section describes how to restart the auto-analysis programs.
NOTE: For any instance where you are
asked to delete files, be careful to always use the -i option so that
you can confirm any deletes. This should be the default on allegro and kilauea, but be sure about it
before you delete anything.
For all cases
Processes that die involuntarily can leave partially written output
files, especially cleaning. Look for .lck files in your data
directories (see the Analysis
Software page for details on where these
would be). For any .lck
files that exist, delete the .lck
file and the associated data file. For example, if the data
directories start at ~/data,
then the command
> find ~/data -path '*.lck' -follow
will find all the .lck
files. Don't forget to include the single quotes. Only delete the .lck files on kilauea; do not delete .lck
files in the cross-mounted directories rawdir/, headerdir/, or encdir/.
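The full cleanup can be sketched as below. This assumes the lock file is named as the data file plus a .lck suffix; verify that convention against the real sliced-data files before deleting anything. The sketch only lists the stale pairs, leaving the actual rm -i to you.

```shell
# Sketch: locate every .lck file under a data tree and pair it with the
# data file it presumably locks (<name>.lck -> <name>). The naming
# convention is an assumption -- verify before deleting, and use rm -i
# when you do. This sketch only prints the pairs.
data=$(mktemp -d)     # demo stand-in for ~/data on kilauea
mkdir -p "$data/sliced"
touch "$data/sliced/obs_0042.nc" "$data/sliced/obs_0042.nc.lck"
touch "$data/sliced/obs_0043.nc"  # healthy file, no lock

stale=""
for lck in $(find "$data" -path '*.lck' -follow); do
  datafile=${lck%.lck}
  echo "stale pair: $lck and $datafile"
  stale="$stale $(basename "$datafile")"
done
```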
Xterms die
Either
because you accidentally killed them, or because kilauea's X server dies, or
because kilauea crashes,
etc. You can
restart the xterm(s) and the routine(s) running in them as
follows. If all your xterms
died, you should still use this by-hand method because start_autos does not supply the necessary obsnum_start argument to run_auto_slice_files.
- Start a new xterm in the second kilauea display and type idl
to start IDL.
- Issue the appropriate command to restart the script:
- For slicing, use
the optional obsnum_start argument
to tell it the first observation you want sliced. Otherwise it
will start from obsnum = 1
for that day. That is, type
run_auto_slice_files, YYYYMMDD, obsnum_start = obsnum_start
- For the other commands just type one of the following
(note
the @ sign!):
- @run_auto_clean_files_ptg
- @run_auto_map_files_ptg
- @run_auto_centroid_files
- @run_auto_clean_files_blankfields
- @run_auto_map_files_blankfields
- @run_auto_diag_clean_files_blankfields
- @run_auto_diag_map_files_blankfields
Xterm(s) remain alive but IDL quits
(very unlikely) Even though the xterm hasn't died, you
will need to kill the offensive xterm(s) and follow the above
instructions for starting new xterms. The reasons are technical,
you can ask the Bolocam support person
if you really care.
Xterm(s) remain alive but an IDL routine crashes
You will need to restart the IDL code by
hand. Various scenarios are described below. Be sure to ALWAYS type retall at the IDL prompt before
attempting to restart the IDL code; this brings you back to the main
IDL program level and prevents unpredictable behavior that may arise
from restarting the code from inside a routine that has crashed.
No ill effects arise from typing retall
when it isn't necessary, so go crazy! Do not use the .full_reset_session or .reset_session executive
commands; these will erase assorted variables that were initialized at
startup and are needed for some of the code to run properly.
IDL cleaning code does not clean observations of a particular source
You probably have
forgotten to add your source to the appropriate source list
files. See the Analysis Software
page for instructions on making the cleaning
pipeline aware of new sources. The pipeline won't be aware of
these changes, though, until you restart it. You can do this in
one of two ways:
- If you still have all the pipeline windows up, just hit q in all of them except the slice_files window to stop the
ongoing processes. If q
does not work, try Ctrl-c
then type retall to get
back to the MAIN level in
IDL. Then, in each window except
slice_files, type
the appropriate one of the following (refer to the xterm title)
- @run_auto_clean_files_ptg
- @run_auto_map_files_ptg
- @run_auto_centroid_files
- @run_auto_clean_files_blankfields
- @run_auto_map_files_blankfields
- @run_auto_diag_clean_files_blankfields
- @run_auto_diag_map_files_blankfields
Don't forget the @
sign! This procedure is similar to what is done above for when
all the Xterms die.
- If you need to restart everything because the windows are
gone, do the usual start_autos
from the shell command line, but then hit Ctrl-c in the slice_files window as soon as
you can so you don't reslice all the data for that day (which will then
cause it all to be reprocessed).
The reason the above works is that, as long as you don't reslice the
files, the analysis routines realize that only the unprocessed
observations need to be analyzed -- the revision dates on the processed
observations' files tell the pipeline they are done. If you
reslice the files,
though, then the sliced files get new revision dates and the pipeline
thinks all the downstream files are out of date and need to be
regenerated.
General Computer Troubleshooting
Computers are built to fail, one might say. Here are some
problems you might run into and how to deal with them, working from the
front-end to the back-end. If you run into a problem that
prevents you from taking data and can't solve it with the following
information, call the CSO pager. The on-call staff member will
either be able to help you or to get the necessary person in touch with
you.
andante (the DAS/fridge computer)
andante has had a troubled
history that seems to dog it no matter what computer we call andante. We have had to
reinstall the system more times than we would like. Hence, we
have become quite expert at it. Here's how to deal if andante starts acting up.
If you start to see crashing of either LabView, disk cross-mounting to allegro, or the entire system
itself, and the problems are not obviously attributable to a specific
cause, the likely problem is that something bad has happened to
Windows. Don't fight it! Your first course of action is to
switch over to the image disk.
When we have andante in a
happily working state, we make a byte-by-byte image of it onto a
second, identical disk. That disk is then powered down and left
sitting in andante.
To switch to the image disk, do the following:
- First, find the target of the desktop shortcuts for BCAM_DAS and fridge_cycle. Copy these
programs off andante, as
updates may have been done since the last time the image disk was
made. If you have network access, you can use SSH (shortcut on
the desktop) to copy the programs to any other computer; allegro or kilauea are good choices since
they are on the local network. If not, you can probably copy the
programs off to a floppy disk. Make sure you copy the programs
themselves, not just the shortcuts! You can find the targets of
the shortcuts by right-clicking on the shortcut and selecting
Properties.
- Second, shut down andante
and open it up. It is nontrivial to open the computer up due to
the way the cover locks. See the instructions.
Find the current system drive and the image drive (both are IDE drives)
-- they are probably sitting right next to each other. The image
drive will likely have no power or IDE cables connected. Simply
switch the power and IDE cables over to the image drive and try
booting. Close up the computer if you are able to boot properly.
- Use SSH to recopy the DAS and fridge cycle programs down to the
image drive. Make sure to put them in the right folders (find the
targets of the desktop shortcuts) and to redefine the shortcuts as
necessary. Do this even if the originals have the same names
(e.g., BCAM_DAS_20040225)
-- there might be minor updates that did not warrant a new name but
need to be propagated.
- If you have had to switch to the image disk, inform the Bolocam support person so that we
can recover the original system disk at the next chance, turning it
into the image disk.
You may have gotten into the much worse situation where you actually
need to reinstall Windows from scratch and you can't just image the
working drive. This will overwrite much of the configuration
information, so it takes some work to get back to a properly working
state. If you have to do this, follow these instructions.
You will be prompted frequently for reboots; go ahead and reboot as
necessary. Log in as bolocam
whenever possible.
- The CDs you will need are in the Bolocam file cabinet in the computer
room. You will find Windows XP Professional Service Pack 2,
Partition Magic 8.0, and LabView 7.1.
- Some software must be downloaded from Caltech's site-licensed
software site. You need a Caltech ITS account for this. If
you don't have one, contact the Bolocam
support person.
- First, power down the computer and remove all the National
Instruments cards. See the instructions
for opening the computer. Remove the PCI-6031E, PCI-6034E, and
PCI-GPIB+ cards from the
computer. Note which slots they were initially in so you can
return them to the right places, and be careful about static
electricity.
- Make sure the computer is connected to the web.
- Install a fresh version of Windows XP PRO - SP2. (In the
options, choose to format the hard drive and install Windows XP).
- Log on as administrator
and make sure to create a password if you weren't prompted to do so
during the installation of Windows (use the same password as noted in the
white Bolocam binder).
- Run Windows Update until the Windows installation is fully up to
date, with all security patches.
- Create a new user account bolocam
with full administrator rights with the same password as written in the
Bolocam white binder.
- Log out of the administrator
account and log in as bolocam.
- If you are using the Dell Precision 420 as andante, get the video card
driver. The video card is a Matrox G400 (http://www.matrox.com). After
rebooting, you can run the
resolution up to something sensible (1200 x
768). You might need to change the frequency to 75 Hz.
These latter settings are accessible by right-clicking in an open space
on the Windows Desktop, which will open the display settings.
Click on the Settings
tab. To find the frequency setting, click on Advanced and then the Monitor tab.
- Download and install VPN-3000 Virtual Private Network client
software from the Caltech ITS site:
- Run VPN-3000 to obtain a Caltech virtual IP address and install
Caltech site-licensed software from http://software.caltech.edu:
- Norton Antivirus. Make sure LiveUpdate is run and that
it is configured to download updates daily at around 22:00 UT (noon
local time).
- F-Secure SSH.
- You may disconnect VPN-3000 at this point.
- Install Partition Magic 8.0 from CD.
- Install the FULL version of Labview 7.1 from CD. Note that the
FULL installation includes the very useful MAX (Measurements &
Automation Explorer).
- Shut down and install all the National Instruments cards, being
sure to put them back in the same slots you removed them from.
Again, take precautions against static electricity.
- Plug the GPIB connector and the two ADC cables in (one ADC cable
comes from the SCXI chassis and connects to the upper PCI card, the
other comes from the white thermometry breakout box and connects to the
middle PCI card. The connectors have different form factors so
there should be no confusion). Restart the computer and log in as
bolocam.
- Fire up MAX (there should be a
shortcut on the desktop labeled Measurement and Automation
Explorer). You should see:
My System
  Devices & Interfaces
    Traditional NI-DAQ Devices
    GPIB
To see the fridge power supplies (the Tektronix
PS2520G modules), right-click on GPIB
and click Scan for Instruments.
Two
GPIB devices should come up. (You may need to left-click on GPIB to open the tree up
further.)
NEED TO UPDATE THE FOLLOWING WHILE
HAVING ACCESS TO PC.
To see the MUX chassis, right click
on Traditional NI-DAQ Devices
and choose Add SCXI Chassis
and pick SCXI-1001.
Right-click on the SCXI-1001 entry, select Properties, and make sure Chassis ID is set to 1 and Chassis Address to 0.
Click on SCXI-1001 and you should see 12 SCXI-1100 modules appear in
the right window. Right-click on the first one and select Properties. Under the General tab, go to Connected to: and select the
PCI-6034E card. Also click the This device will control the chassis
checkbox. Leave the defaults in the other tabs. For the
other SCXI-1100 modules, open their Properties windows and make
sure that the Connected to:
box says None. The This device will control the chassis
checkbox will be grayed out.
- Correct the device numbers in the BCAM_DAS and fridge_cycle LabView
programs. Open fridge_cycle
and look for the PCI-6031E
Device Number
control to the right on the front panel (probably off the
screen). If the default value does not point to the
PCI-6031E card (to the device number indicated in MAX), then change the
value. To save the new value as the default, change to edit mode (Operate -> Change to Edit Mode),
right-click on the
control and select Data
Operations -> Make Current Value Default. Then switch
back to run mode (Operate ->
Change to Run Mode) and save the program. Similarly, open
up
the BCAM_DAS program and
look for the PCI-6034E Device
Number
control at the top of its front panel and repeat the above for this
program.
One or both of these programs may complain of missing vi's
on startup; they are probably in one of the llb files in Program Files/National Instruments.
Dig around until you find them; they will be there.
- Set the IP address to be andante's
static address. Click on Start
-> Connect to -> Show all Connections and then select Local Area Connection and click
on Properties. In
the Components window,
select TCP/IP or possibly
Internet Protocol (TCP/IP)
and then click on Properties.
Check the Use the following IP
address: radio button and type in the following:
IP address: 128.171.86.211
subnet mask: 255.255.255.0
gateway: 128.171.86.2
Also check the Use the following
DNS server addresses: radio button and type in the following:
Preferred DNS Server: 128.171.3.13
Have it change the IP address immediately (i.e., don't wait to reboot).
- Set up the computer
to do network time synchronization. Double-click on the clock in
the lower-right corner of the desktop. The Date and Time Properties dialog
box will come up. Click on the Time Zone tab and make sure the
clock is set to the GMT time zone. Click on the Internet Time tab. Enable
automatic time synchronization with hau.submm.caltech.edu.
Click the Update Now
button.
- Set up disks properly:
- Using Partition Magic, split the master drive into two
partitions, C:\ (~21 GB)
and D:\ (~17 GB; label it DATA). Follow the
instructions. You'll be prompted to reboot the system at the end.
- Create the following directories in D:\
D:\das_data
D:\fridge_data
D:\lab_tests
- Make D:\das_data remotely accessible so data can be
transferred to allegro:
- Use Windows Explorer to get access to D:\
- Right-click on the das_data directory and select Properties.
- Click on the Sharing
tab.
- In the Network
Sharing and Security box, enable Share this folder on the network
and give it the name das_data.
Make sure Allow network users to
change my files remains disabled.
- Turn on the network firewall:
- Right-click on Start
-> Connect To -> Show All Connections
- Right-click on Local
Area Connection and select Properties.
- Click the Advanced
tab and select Settings...
in the Windows Firewall
box. Click the On
button, and click Ok in
all windows that get opened.
- Close the Network
Connections window.
- Turn on the Remote Desktop server to allow remote users to use
this computer:
- Right-click on the My
Computer icon on the desktop and select Properties.
- Click on the Remote
tab.
- In the Remote Desktop
box, enable Allow users to
connect remotely to this computer.
- Make sure that the Remote
Assistance box is disabled.
- Close the windows that have been opened, clicking Ok where necessary.
- The firewall will be
automatically adjusted to allow remote users to connect.
- Check that it works
by remotely connecting from another computer; directions are provided elsewhere.
- Enable automatic Windows Updates:
- Right-click on the My
Computer icon on the desktop and select Properties.
- Click on the Automatic
Updates tab. Make
sure the Automatic option
box is enabled. Set updates to run every day at 21:00 UT
(11:00 am local time) so they don't interfere with observing.
allegro
allegro has been
remarkably stable. If it crashes, instructions for bringing it
back up and cross-mounting disks have been given above. If the system itself seems to
be going belly-up -- e.g., frequent crashing, unexpected behavior --
you can switch to the image disk. This is a disk that, like for andante, is basically a
byte-by-byte image of the system and /home disk. Your data
will be unaffected by this switch. To do this:
- If possible, dismount /data00
from kilauea by logging
in to kilauea as root (password in the white
Bolocam Manual binder) and typing umount
/data00. If you get a device busy error, you may have
to ask people to log out of sessions that happen to be sitting in
one of the /data00
directories (unlikely). If you can't get the disk to
unmount, skip to the next step.
- If possible, copy the directory /home/observer/src to another
computer (e.g., kilauea)
so that the code on the image disk can be updated if necessary.
To do this:
- log in to allegro
as observer
- make sure you are in /home/observer
- type tar --gzip -cvf
src.tar.gz src
- copy src.tar.gz
off to a different computer
- Shut down allegro:
log in as root (password
in the white Bolocam Manual binder), then type shutdown -h now. The
computer will shut down and power off. Flip the power switch on
the back of the computer to the off position also.
- Open up allegro.
You will see a 20 GB drive connected to the primary IDE port on the
motherboard -- this is the current system disk. (Follow the
cables back to the motherboard and you will see the connectors labeled on
the board.) Somewhere else inside the computer there will be
another 20 GB disk without a power cable attached -- this is the image
disk. Switch the IDE and power cables from the original system
disk to the image disk. You may have to move disks around in
order to be able to connect the IDE cable to the image disk.
- Turn the rear power switch back on and then press the front panel
power button to boot the computer from the image disk.
- Update the src
directory as follows:
- log in to allegro
as observer
- make sure you are in /home/observer
- type mv src src_old
- copy src.tar.gz
from wherever you copied it to the current directory
- type tar --gunzip -xvf
src.tar.gz
- Follow the remaining directions above
for bringing allegro back
up (cross-mounting disks)
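The backup-and-restore steps above boil down to a gzipped-tar round trip. A self-contained sketch in a scratch directory (on the real machine the tree is /home/observer/src; the paths and file below are stand-ins):

```shell
# Round-trip the src tree through a gzipped tarball, as in the steps above.
set -e
work=$(mktemp -d)                 # stands in for /home/observer
mkdir -p "$work/src"
echo "observing code" > "$work/src/file.txt"
cd "$work"
tar --gzip -cvf src.tar.gz src    # pack:   tar --gzip -cvf src.tar.gz src
mv src src_old                    # set the existing tree aside
tar --gunzip -xvf src.tar.gz      # unpack: tar --gunzip -xvf src.tar.gz
cmp -s src/file.txt src_old/file.txt && echo "round trip intact"
```

The only extra step on the real systems is carrying src.tar.gz to another computer (e.g., with scp) between the pack and the unpack.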
kilauea
kilauea is managed by Ruisheng Peng. If it
starts having problems, let him and the
Bolocam support person know.
If kilauea just dies
completely and won't reboot, you don't have much recourse until you get
in touch with Ruisheng. However, kilauea is not critical to
taking data. All the normal data-taking processes will continue
to run if kilauea
dies. You can log in to puuoo
as bolocam (same password
as on kilauea) and from
there restart the xterms that monitor rotator, dirsync.py, header_copy.py, and write_log, and also restart gbolostrip following the
directions given above as if you were
doing it on kilauea.
You can also do this from any other computer with an X server; feel free
to use your laptop, or you can also use Reflection X on pika, the PC in the main
computer room.
Problems Accessing
Data Disks or Dying Data Disks
The summit is not a friendly place for hard drives, especially data
drives that get heavily exercised. We keep spare 120 GB data
drives ready for when a data drive on one of the linux machines
dies. Symptoms of this
happening are i/o errors from the processes that write data to or read
data from the particular disks, or simply the directories on a drive not
appearing.
allegro:
For allegro, we
have a spare data drive sitting in the computer ready to go. Just
switch to this drive. Even if it turns out that the original
drive had not fully failed, switching drives will minimize
downtime. Further investigation can be done during the
daytime. To switch drives, do as follows:
- Shut down the computer as explained above, being sure to turn off
the power switch on the back of the computer.
- Open up the computer. Find the 120 GB drive connected to
the secondary IDE bus (follow the cables back and look on the
motherboard for the IDE connector label); this is the current data
drive. Find the spare 120 GB drive also, which will have its
power cable unplugged. Switch the IDE cable from the original 120
GB drive to the spare. Connect a power cable to the spare drive.
- Boot the computer back up as explained above.
- You should see the new /data00
mounted and the /data00/rawdir,
/data00/headerdir, and
/data00/encdir
directories. Now continue as if allegro had simply crashed,
following the instructions given above (mounting disks, etc.).
You will have to restart everything as if it were the start of the
night. Be sure to note in your observing logs the time of the
crash and the last observation taken before the crash. After
restarting everything, your observation numbers will start at 1 again,
but the offset can be inserted later. Include the offset
observation number in your logs of the new observations. For
example, if the disk died during observation 34, then your first
observation after restarting should be labeled 1 (35) to indicate what the
corrected observation number will be.
- For merging and downstream, you need to move the files already
generated for the given day out of the way, as they will cause
confusion to the analysis software. Do as follows:
- Contact the Bolocam support person;
he will work with the day crew to recover the data from the original /data00 and to reprocess the
full day as if there had been no problem.
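The renumbering convention above is simple to script when writing up the log afterward. A sketch, assuming (as in the example) that the disk died during observation 34:

```shell
# Print "new (corrected)" observation labels after a restart.
offset=34                 # last observation number before the crash
labels=$(for n in 1 2 3; do
  printf '%d (%d)\n' "$n" $((n + offset))
done)
echo "$labels"            # 1 (35), 2 (36), 3 (37), one per line
```

Substitute your own offset and range; the corrected number is always the new number plus the last pre-crash observation number.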
kilauea: This
machine uses a SCSI RAID for almost every disk, so its disks should be
pretty robust. Responding to various disk failure modes:
- kilauea:/data
becomes unavailable: This should not matter, as we don't use it
anymore, except if kilauea:/bigdisk
is down. Since /home/kilauea
and /data are on the same
RAID, it's pretty likely that /home/kilauea
will also become unavailable, in which case you can't do anything on kilauea. You can of
course keep taking data -- you just can't analyze it. Call or
email Ruisheng Peng and the
Bolocam support person to let
them know of the kilauea
disk problem.
- kilauea:/bigdisk
becomes unavailable: This makes the processed data directories
unavailable. You can temporarily use kilauea:/data for the processed
data. You need to
- Create the appropriate /data/bolocam/YYYYMM/
directory.
- Create all the standard subdirectories (merged/, sliced/, etc.) as explained elsewhere.
- Create soft links from /home/kilauea/bolocam/data
to these new directories as explained elsewhere.
- Restart the data processing as if starting it at the beginning
of the night. You will have lost all your processed data, but the
processing catches up on the night's data pretty quickly.
- Call or
email Ruisheng Peng and the Bolocam support person to let them
know of the kilauea
disk problem. Depending on whether the problem is recoverable,
you may have lost all of your processed data and will need to reprocess
it. See the AnalysisSoftware
page for instructions.
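The directory-and-link steps of the /bigdisk fallback can be sketched as follows. This runs in a sandbox so it is safe to try anywhere; on kilauea the real roots are /data and /home/kilauea, YYYYMM is the current month, and the exact set of subdirectories and links is defined elsewhere in the manual (merged/ and sliced/ below are just the examples named above):

```shell
# Recreate the processed-data tree under /data and point the usual
# /home/kilauea/bolocam/data path at it with a soft link.
set -e
root=$(mktemp -d)                          # sandbox standing in for /
month=200506                               # illustrative YYYYMM
mkdir -p "$root/data/bolocam/$month"       # temporary processed-data area
for sub in merged sliced; do               # standard subdirectories
  mkdir -p "$root/data/bolocam/$month/$sub"
done
mkdir -p "$root/home/kilauea/bolocam"
ln -s "$root/data/bolocam/$month" "$root/home/kilauea/bolocam/data"
ls "$root/home/kilauea/bolocam/data"       # the link resolves to the new area
```

Because the data-processing code only ever looks under /home/kilauea/bolocam/data, relinking that path is all it takes to redirect it to the temporary area.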
Dying System Disks
System disks can also die. To make it possible to recover quickly
from such a problem, we have created image drives for the andante and allegro system drives.
They are left powered off and disconnected inside the particular
computer.
Note that, on andante,
the system and data drives are different partitions of the same disk,
and so if one begins to fail, so does the other.
The image drives nominally have the same version of the
Bolocam-specific
software as the original drive, but they could be slightly out of
date. To be sure to get the most recent software, first copy the
current programs off the failing drive, as described above for andante and allegro.
To switch over to the image drive, do the following (same for both
computers):
- Shut down and open up the computer.
- Find the 20 GB drive that has no power or IDE cables connected to
it; this is the image drive.
- Switch the IDE cable from the failing drive to the image
drive. Connect a power cable to the image drive.
- Restart the computer as usual.
Once you have switched to the
image drive, restore the software from the copies you made, again as
described above for andante and allegro.
Revision History
- 2003/12/07
Separate troubleshooting section from rest of
Observer Manual
- 2004/02/02 SG
Add link back to main page, minor updates
- 2004/02/04 SG
Add instructions on getting rid of partially completed files and on the
"Unable to free memory" error. Rearrange a bit.
- 2004/02/09 SG
Add instructions for dealing with big trck_das/trck_tel offsets in
cleaning and for observations of a given source not being cleaned,
improve section on restarting merging.
- 2004/02/26 SG
Add instructions for restarting data copying programs using new version
of start_tel_util. Advise about raw data file size problems in
merging troubleshooting section.
- 2004/02/27 SG
Modify instructions for restarting data copying programs to use
check_tel_util and kill_tel_util.
- 2004/04/28 SG
Minor fixes, lots of new material
- 2004/05/02 SG
Lots of new computer troubleshooting, esp. reinstalling Windows on the
DAS computer
- 2004/05/03 SG
Update for modifications to write_log
- 2004/05/06 SG
Update restart of DAS after andante crash -- can have problems
smbmounting if gbolostrip is still trying to access the disk.
- 2004/05/08 SG
Further updating of DAS restart instructions.
- 2004/05/09 SG
Add instructions for switching to image system disks.
- 2004/05/26 SG
Add instructions for dealing with 2-drive death in RAID.
- 2004/10/02 SG
Add more detailed information on merging errors. Now that
write_log deals with RPC timeouts properly, do not force kill of
write_log when telescope computer crashes. Remove warning about
header_copy copying files from previous day if telescope computer was
restarted, should no longer happen.
- 2004/12/10 SG
Add check of fiber-optic isolator DIP switches if there are rotator
problems.
- 2005/03/19 SG
All kinds of updates for puuoo SCSI RAID.
- 2005/06/03 SG
Significant updates for reinstalling software on andante and for move
of /bigdisk to kilauea.
- 2005/12/16 SG
More detailed instructions for recovering when a source was not added
to the params files before processing.
- 2010/12/18 JS
Updated the kilauea restart procedure.
Questions or
comments?
Contact the Bolocam support person.