Troubleshooting
- Telescope
- Cryogenics
- Electronics
- Rotator
- Computer Crashes
  - andante crashes
  - allegro crashes
  - allegro and andante crash
  - kilauea crashes
  - The telescope computer (hau) crashes
  - Everybody crashes!
- Data Streams
- Data Stream Programs or Windows Die
  - dirsync Crashes or Hangs
  - header_copy Crashes or Hangs
  - write_log Crashes or Hangs
  - Inspecting the Encoder Logs
- Merging Dies
- Data Processing
  - For all cases
  - Xterms die
  - Xterm(s) remain alive but IDL quits
  - Xterm(s) remain alive but an IDL routine crashes
  - IDL cleaning code does not clean observations of a particular source
- General Computer Troubleshooting
  - andante (the DAS/fridge computer)
  - allegro
  - kilauea
  - Problems Accessing Data Disks or Dying Data Disks
  - Dying System Disks
- Revision History
Back to BolocamWebPage
Back to ExpertManual
Telescope
For non-Bolocam related problems (dome, dish, antenna computer,
etc.), go to the main CSO Hawaii webpage (http://www.cso.caltech.edu),
scroll down and click on "Local Information". You will find
generic troubleshooting information there.
Cryogenics
About the only cryogenics problem that the typical observer can deal
with is running out of LHe. If you know how to do LHe fills, you
can go ahead and refill; see the cryogen fill
instructions. If you caught the problem quickly enough and
the fridge did not die, you may be able to continue observing. If
the UC Fridge GRT reading on the fridge
monitoring page returns to its previous value, you're fine and you
can keep observing.
If the fridge did run out (IC and/or UC Fridge GRT readings high and
not recovering), then you can at least speed the recovery along by
following the recovery
instructions.
If you are not experienced with doing cryogen fills, your night is
done. Leave a note for the day crew and shut down for the
day. They will refill, recover, and set the fridge to cycle and
be cold for the next night.
If you have more serious problems -- cryogen hold time sharply
decreased, fridge cycle failing, etc. -- let the day crew and the Bolocam support person know.
Electronics
There are two kinds of electronics problems one typically runs into:
- functionality problems:
Some signal is just not present, or is reading completely incorrectly,
etc. It is likely that, for some reason or another, some switch
has been put into the wrong state, some cable has become disconnected,
etc. The best thing to do is to carefully walk through the system
and make sure everything is set up correctly. Go to the Electronics page and the Setting up for Observing page
and make sure all necessary connections have been made and all the
power switches are on.
- noise problems: If noise
problems appear in most or all channels simultaneously, or in all
channels of a given hextant, it is likely that the problem is the bias
board. If you are not an expert, your best bet is to simply
replace the bias board with the spare. We use either bias board
2.1 or 3.1, so you can grab the unused one (usually in the 3rd floor
storage room, in the cabinet) and replace the problem board. You must turn off the power to the board
before removing it; see the Electronics
page for instructions on how to do this and for pictures to identify
the bias boards and instructions
for setting them up for observing.
If noise problems appear in a single channel or only a few channels,
note them and send the list to the
Bolocam support person, and then
continue with your observing. He will get in touch with the day
crew to troubleshoot the problem channels.
Rotator
Typical problems that might occur are:
- The rotator will not home when the home program is used.
- The home program
cannot set the origin of the rotator encoder when it is homed.
- The interactive
program reads back junk from the encoder.
- The interactive
program cannot seem to get the rotator to go to a programmed angle.
- The rotator program
either rotates when you have asked it not to rotate, or does not rotate
when you have asked it to rotate.
- The rotator hits its limit switch, resulting in killing the motor
power and the rotator swinging to some arbitrary angle.
The sequence of troubleshooting is as follows:
- You should stop your current observing macro but otherwise leave
all programs running.
- First, assume that it is a mild problem that can be fixed simply by
resetting the rotator. Use the home program; instructions are
given elsewhere.
This essentially resets the entire system and will likely get rid of
any mild problems. If rehoming is successful, test the rotator by
rotating to some small angles (between -30 and +30 deg) using the interactive program, and
try reading back the angles and see if they make sense. If interactive works, you can
then restart observing. The observation during which the problem occurred
should probably be discarded, but in principle later observations
should be fine. If either the rotator or write_log program died, you
will have to restart them as explained below
prior to restarting observing.
- If you can't rehome, or rehoming does not help, maybe the DIP
switches on the fiber-optic isolators have gotten screwed
up. Check that they are set correctly by comparing to the
instructions
on the Setting
up for Observing page.
- If rehoming fails, you should determine whether it is an obvious
broken connection problem. Go outside and check all the rotator
cabling, which is described on the Setting
up for Observing page. The most likely failure mode is the
fiber-optic cables; spares can be found in a box near allegro.
Make sure you hook the replacement up properly, paying attention to the
connector colors and where they connect to. If the other parts of
the
cabling fail, you may be able to find replacements by digging around
the AOS lab. After replacing the damaged cabling, try rehoming
again. If that works, try testing using interactive as above. If
that works, you can probably start observing again, though again you
may have to restart rotator
or write_log as above.
- If rehoming continues to fail, there may be communication or
control problems. If you suddenly get errors in your rotator or write_log window such as modprobe: can't locate module,
then somehow one of the run-time kernel modules has been
unloaded. Log in to allegro.submm.caltech.edu
as observer (password in
white Bolocam Manual binder) and type
> insmod rocket
> insmod seaio
You should receive messages like
Using /lib/modules/2.4.13-0.6/kernel/drivers/char/rocket.o
Using /lib/modules/misc/seaio.o
possibly with warnings or the messages
insmod: a module named rocket already exists
insmod: a module named seaio already exists
There may be other warnings. As long as none of them say a module
could not be loaded, then things should work. Try rehoming and
running interactive as
above; if successful, you can restart observing, restarting rotator and/or interactive as above if
necessary.
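As a quick sanity check before rehoming, you can confirm that both modules show up in the lsmod listing. The snippet below is a sketch, not one of the installed site scripts; it runs the check against a captured sample listing so the logic is self-contained. On allegro itself you would simply pipe the real `lsmod` output through the same filter.

```shell
# Sketch: confirm the rocket and seaio kernel modules are loaded by
# looking for them in lsmod-style output. A captured sample listing is
# used here so the example is self-contained; on allegro, replace
# "$sample_lsmod" with the output of the real `lsmod` command.
sample_lsmod='Module                  Size  Used by
rocket                 24576  0
seaio                  16384  0
usbcore               286720  4'

# Keep only the module-name column for the two modules we care about.
loaded=$(printf '%s\n' "$sample_lsmod" | awk '$1 == "rocket" || $1 == "seaio" { print $1 }')

missing=""
for m in rocket seaio; do
  case "$loaded" in
    *"$m"*) echo "$m: loaded" ;;
    *)      echo "$m: NOT loaded -- rerun insmod"; missing="$missing $m" ;;
  esac
done
```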
- If you are still having problems, then the best thing to do is
just lock the rotator to its home position and disable the rotator for
the night. By turning the motor power off (see the rotator
instructions elsewhere),
you can rotate the dewar to its home
position (where the homing sensor tab occludes the right half of the
homing sensor). Turn the power back on to have it hold
there.
You will have to restart
the rotator program with
rotation disabled (R = 0).
If you had problems communicating with the encoder, then you will have
to restart write_log with the encoder
readout disabled (do not include the -e flag). Clearly note
when this occurred in your observing logs, as it will be necessary to recalibrate
the rotator angle from that point onward. The data will be
entirely analyzable; it will just have to be treated differently than the
preceding data.
Regardless of the problem, inform the
Bolocam support person, providing
details, so the problem can be rectified.
Computer Crashes
Ah, the bane of every system, the reason we should just go back to
using chart recorders and slide rules!
Remarkably, our critical computers, andante and allegro, are quite stable.
This is because we do not run much on them: andante runs only the DAS and
the fridge control; allegro
runs only the data copying programs and gbolostrip. kilauea tends to be less stable
due to strange goings-on with its video card. We provide
instructions here for recovering from crashes of each of these machines.
For an explanation of the data streams, see the Data
Acquisition,
Rotator Control, and Data Handling page.
andante crashes
This is not too tragic. Do the following:
- Reboot andante twice.
Go to the folder containing the raw data (D:\DAS_DATA\YYYYMMDD) and
delete any files that are the wrong size (compare to the other files,
all should be the same size to within 1-2 bytes). Delete any .lck files also.
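The wrong-size files can be spotted by eye, but the size check can also be scripted. The sketch below is hypothetical (not one of the installed scripts) and builds its own demo directory so it is self-contained; in practice you would point the same loop at the real raw-data directory, e.g. /data00/rawdir/YYYYMMDD.

```shell
# Sketch: flag files whose size differs from the most common file size
# by more than 2 bytes -- the same criterion described above for
# spotting truncated minute files. Demo directory stands in for the
# real raw-data directory.
datadir=$(mktemp -d)
head -c 1000 /dev/zero > "$datadir/obs01"
head -c 1001 /dev/zero > "$datadir/obs02"   # within the 1-2 byte tolerance
head -c 1000 /dev/zero > "$datadir/obs03"
head -c  400 /dev/zero > "$datadir/obs04"   # truncated by a crash

# "Typical" size = the most common size among the files.
typical=$(for f in "$datadir"/*; do wc -c < "$f"; done \
          | sort -n | uniq -c | sort -rn | head -1 | awk '{print $2}')

suspects=""
for f in "$datadir"/*; do
  size=$(wc -c < "$f")
  d=$((size - typical)); [ "$d" -lt 0 ] && d=$((0 - d))
  if [ "$d" -gt 2 ]; then
    echo "SUSPECT: $f ($size bytes; typical size $typical)"
    suspects="$suspects $(basename "$f")"
  fi
done
```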
- Log into allegro as
observer (password in
the white Bolocam Manual binder). Go to /data00/rawdir/YYYYMMDD and
delete any undersized files you find there also.
- If gbolostrip is
still running, kill it. The window's kill button may not work; in
that case, find the gbolostrip process with ps and grep and kill it by
PID from the command line.
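A generic sketch of the ps/grep/kill pattern, demonstrated here on a throwaway sleep process rather than gbolostrip itself; on allegro you would grep for gbolostrip (or, as noted below, dirsync.py) instead:

```shell
# Sketch: find a process by name with ps/grep and kill it by PID.
# A throwaway `sleep` stands in for the stuck program.
sleep 600 &
demo_pid=$!

# List all processes, keep lines matching the name, drop the grep
# process itself, and take the PID column.
pids=$(ps -A -o pid= -o args= | grep 'sleep 600' | grep -v grep | awk '{print $1}')

# Send SIGTERM to each match; escalate to kill -9 only if it ignores it.
for p in $pids; do kill "$p" 2>/dev/null; done
```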
- Remount andante's
data disk on allegro. Log
into allegro as root. Go to /smb. Follow the
instructions in the AAAREADME
file that is located there. Check that the disk is mounted by
typing ls /smb/andante.
You should see the YYYYMMDD
data directories.
If you have problems, make sure that andante is properly set up to
share D:\DAS_DATA and
subdirectories thereof. If you think everything is properly set
up, then the problem may be that dirsync.py
is trying to access /smb/andante.
You need to kill dirsync.py.
You can do this in the same way as you killed gbolostrip, just replace gbolostrip with dirsync.py in the grep command. Once you
have killed dirsync.py,
you should be able to mount /smb/andante
as instructed above. Restart dirsync.py as explained below. Your log
monitoring window should still be running, you don't need to restart it.
Remember to exit your root
session now.
- Restart the DAS in the same way as you did at the start of the
night (see the daily
startup instructions).
- Restart gbolostrip
in the same way as you did at the start of the
night (see the daily
startup instructions).
- Restart merge on kilauea as explained below, working around the hole in the data
due to the missing DAS files. If you receive "short file" errors,
then you did not properly clean up the short raw data files on either andante or allegro as instructed
above. Check this and try again.
- The remainder of the analysis software should work around the
hole without any problems. If you do have problems see the Data Processing section of this page.
allegro crashes
This is a pain because the encoder log files are completely lost for
this period. Do the following:
- Reboot allegro.
- Mount \\andante\d\das_data
and hau:/var/plog on allegro and allegro:/data00 on kilauea as explained elsewhere.
- Restart the data stream programs as indicated below; remember to provide the nlast argument to start_tel_util.
- Restart merge on kilauea as explained below, working around
the hole in the data due to the missing encoder log files.
- The remainder of the analysis software should work around the
hole without any problems. If you do have problems see the Data
Processing section of this page.
allegro and andante crash
You should only be so lucky. The main thing here is to bring up
both computers and get everything cross-mounted as explained separately
for each machine above before starting any programs. Then you can
restart the data-copying programs, then the DAS, then merge.
kilauea crashes
This is not so bad because no data are lost. Don't be fooled by
the fact that your data copying programs were running in windows on kilauea; they weren't really --
only the log files were being displayed in these windows. Do the
following:
- Reboot kilauea.
- Restart UIP using the instructions found here:
UIP guide.
- Mount allegro:/data00
and kilauea:/data_bolocam
on kilauea as explained elsewhere.
- Restart the log monitoring for the data-copying programs as
explained below.
- Restart merge on kilauea as explained below
from the last point where you think things were merged properly.
Err on the side of remerging rather than missing unmerged data.
- Restart the analysis software on kilauea as explained below. The software will
automatically figure out what data has and has not been processed,
though you may need to clean up .lck
files as indicated.
The telescope computer (hau) crashes
This happens very infrequently. Your observation is
terminated. write_log
will continue running without too much problem, but, obviously, it gets
no information from the
telescope and so will write invalid values.
To recover, do the following:
- In the write_log
log screen, you will see timeout errors while the telescope computer is
unavailable. This is fine. They should stop and you should
see normal write_log
messages when the telescope computer becomes available again.
- Reboot the telescope computer (see the CSO web page as instructed
above).
- Mount hau:/var/plog
as explained elsewhere.
- Wait a minute and see if dirsync
starts copying new telescope computer files by watching dirsync's log screen.
Check that normal write_log
messages begin to appear. If one of these programs fails to start
executing properly again, follow the instructions for restarting them below.
- merge will
presumably have died because it could not find any pointing log
files. You will have to merge
around the hole as explained below.
Everybody crashes!
Again, get all the computers up and the disks cross-mounted first, then
start up the various programs.
Data Streams
For an explanation of the data streams, see the Data Acquisition,
Rotator Control, and Data Handling page.
Data Stream Programs or Windows Die
The more likely occurrence is that the X connection to the machine
displaying the monitoring windows for the data copying programs goes
down (for example, if kilauea
crashes). This is not a major
problem! The data copying programs are running
autonomously on allegro;
all that has happened is that the windows that
display the log files written by these programs have died. You have not lost any data, all you need to
do is restart the monitor windows. Once you have your X
server back up and running, log into allegro (set X forwarding as
necessary) and type
> start_tel_util YYYYMMDD R E 0
YYYYMMDD is UT date, R indicates whether you want to
use the dewar rotator or not (R
= 1 means "use the rotator"), and E indicates whether you want to
read the rotator encoder (E = 1
indicates that the encoder should be read; if you don't read the
encoder, the rotator angle will be taken to be 0 and you will have to
deal with this later in the analysis). The last argument, 0, tells the program that all
the processes are already running, you just want to create the
monitoring windows. start_tel_util
will check to see whether all 4 programs are indeed running; it will
advise you if there is a problem.
If, on the other hand, allegro
has itself died, then the file copying
programs have died. Once you have allegro back up and are ready
to start taking data again, you can start them back up using the command
> start_tel_util YYYYMMDD R E 1 nlast
where YYYYMMDD, R, and E are as above. The 4th
argument is set to 1 to
advise start_tel_util
that it needs to start up the programs again, not just start up the
log
monitoring windows. nlast is very important; it is the number
of the last rotator log that was written (in /data00/encdir/YYYYMMDD).
Remember, nlast is just the number, not the entire filename. You will have lost the
encoder log files between nlast
and the minute you restart the programs; you will have to force merging
to work around them as indicated below.
However, as long as you use the nlast
argument, the observation number should pick up where it left
off. If you forget the nlast argument, the observation number will
start again from 0 and you will have a mess on your hands (it
can be cleaned up, but you will have to consult an expert).
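Since getting nlast right matters so much, it can help to pull the highest log number out of the encoder directory with a one-liner. The enc_NNNN.log naming used in this sketch is an assumption for the demo; check the actual filenames in /data00/encdir/YYYYMMDD and adjust the sed pattern to match before relying on it.

```shell
# Hypothetical helper: print the largest numeric suffix among the
# encoder logs in a directory. The enc_NNNN.log naming is assumed for
# the demo; adapt the sed pattern to the real encoder-log filenames.
encdir=$(mktemp -d)   # demo stand-in for /data00/encdir/YYYYMMDD
touch "$encdir/enc_0001.log" "$encdir/enc_0002.log" "$encdir/enc_0017.log"

# Strip each name down to its digits, sort numerically, keep the largest.
nlast=$(ls "$encdir" | sed -n 's/^enc_0*\([0-9][0-9]*\)\.log$/\1/p' | sort -n | tail -1)
echo "nlast = $nlast"
```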
If allegro has not died
but you suspect that one or more of the file copying programs has died,
you can check by logging into allegro
and typing
> check_tel_util
You will get messages indicating which processes are still
running. Proceed as follows:
- If all of the processes have died, you can restart in the same
way as you would if allegro
had died.
- If write_log has
died, the easiest thing to do is to kill the other programs and restart
everything as if allegro
had died. Just issue the command
> kill_tel_util
You will see messages indicating which programs were killed and which
were not running. Then issue the
> start_tel_util YYYYMMDD R E 1 nlast
command as you would have above. Make
sure to type the above line correctly.
- If write_log has
not died, you are better off restarting the processes by hand so the
encoder logs remain continuous. Issue whichever of the following
commands are necessary (corresponding to the processes that need to be
restarted):
> /home/observer/src/rotator/rotator R \
  >>& /data00/encdir/rotator_YYYYMMDD.log &
> /home/observer/src/dirsync/dirsync.py \
  /smb/andante/YYYYMMDD \
  /data00/rawdir/YYYYMMDD \
  >>& /data00/rawdir/dirsync_YYYYMMDD.log &
> /home/observer/src/dirsync/header_copy.py \
  /data/plog/YYYYMMDD \
  /data00/headerdir/YYYYMMDD \
  >>& /data00/rawdir/dirsync_YYYYMMDD.log &
You can then restart the log monitoring windows using the
> start_tel_util YYYYMMDD R E 0
command as you would have if only the X connection had died. You
may end up with duplicate log monitoring windows, just kill the
duplicates: killing the duplicate log monitoring windows does not
affect the operation of the running programs. Make sure to type the above line correctly,
otherwise you may get unexpected behavior.
dirsync Crashes or Hangs
Check: /smb/andante/YYYYMMDD is visible on allegro but not readable by observer.
Remedy: Check the Windows sharing setup for \\andante\d\das_data and \\andante\d\das_data\YYYYMMDD.

Check: /smb/andante/YYYYMMDD is not visible on allegro, but a directory listing of /smb/andante returns something.
Remedy: \\andante\d\das_data\YYYYMMDD probably has not been created. Do so from andante's desktop.

Check: /smb/andante/YYYYMMDD is not visible on allegro, and a directory listing of /smb/andante returns nothing.
Remedy:
- Check that andante is powered on and Windows has not crashed. Reboot if necessary.
- If andante is on, then \\andante\d\das_data probably has not been cross-mounted. cd to /smb on allegro and follow the instructions in /smb/AAAREADME. You may need the root password; it is on allegro's monitor. If this fails, then it is likely that \\andante\d\das_data is not being shared properly. Check the sharing setup for this directory on andante directly. A reboot of andante may be necessary. It is very unlikely that the problem is with allegro, as this cross-mounting has operated without problems on allegro's side since 2000.

Check: /data00/rawdir/YYYYMMDD does not exist.
Remedy: Should not happen -- start_tel_util should not have started dirsync. Check that /data00/rawdir exists and that observer has write permissions. If the permissions are wrong, change them by becoming root.

Check: /data00/headerdir/YYYYMMDD is not writeable by observer.
Remedy: Should not happen -- start_tel_util should not have started dirsync. Change permissions by becoming root.

Check: Is there free disk space on /data00? Check using df -k.
Remedy: Move or delete some data.
header_copy Crashes or Hangs
Check: /data/plog is visible on allegro but not readable by observer.
Remedy: Check permissions for /data/plog; become root and change them if necessary.

Check: /data/plog is visible on allegro but is empty.
Remedy: hau:/var/plog probably has not been cross-mounted. Check using df -k. If hau:/var/plog is not mounted at /data/plog, then become root and mount it by typing mount /data/plog. If this fails, then it is likely that hau is either not exporting /var/plog or not considering allegro to be a valid mount client. Contact a CSO staff member in the following order: Hiro, Ruisheng, Richard, Martin, anyone else. Of course, hau may just be dead, but presumably you would have been told that by now.

Check: /data/plog is visible on allegro and contains files, but nothing is being copied.
Remedy: Is there a .lck file in /data/plog? Check by doing ls /data/plog/*.lck. If not, then you are probably suffering from the antenna computer "no more free inodes" problem. You have to reboot the antenna computer; see this link. Once the antenna computer display shows something, in UIP type ANTENNA/RESTART/NOSYNC; you should see the antenna display come back up. If not, consult the CSO Troubleshooting page. If you still can't get it to come up, contact someone (try the pager first, then Hiro).

Check: /data00/headerdir/YYYYMMDD does not exist.
Remedy: Should not happen -- start_tel_util should not have started header_copy. Check that /data00/headerdir exists and that observer has write permissions. If the permissions are wrong, change them by becoming root.

Check: /data00/headerdir/YYYYMMDD is not writeable by observer.
Remedy: Should not happen -- start_tel_util should not have started header_copy. Change permissions by becoming root.

Check: Is there free disk space on /data00? Check using df -k.
Remedy: Move or delete some data.
write_log Crashes or Hangs
Check: write_log gives an RPC timeout error.
Remedy: hau's RPC server is not up, is failing, or the network connection to hau is not good. Not much you can do; try calling Hiro. Check whether you are also having access problems with /data/plog.

Check: /data00/encdir/YYYYMMDD does not exist.
Remedy: Should not happen -- start_tel_util should not have started write_log. Check that /data00/encdir exists and that observer has write permissions. If the permissions are wrong, change them by becoming root.

Check: /data00/encdir/YYYYMMDD is not writeable by observer.
Remedy: Should not happen -- start_tel_util should not have started write_log. Change permissions by becoming root.

Check: Is there free disk space on /data00? Check using df -k.
Remedy: Move or delete some data.
Inspecting the Encoder Logs
Sometimes you may not be sure what has happened with the encoder logs
and you want to inspect them directly to see which observation numbers
are present and whether they match up with the source names as you
expect. There is a simple utility for doing this, sum_encdir. To use it,
simply type
> sum_encdir /data00/encdir/YYYYMMDD
A list of observation numbers and source names will be printed out.
Merging Dies
Merging can die if any of the necessary files (raw bolometer
data, pointing files from telescope, encoder log files from rotator)
are missing or if the raw data files are short. Typical error
messages are:
Error opening das directory
Error opening header directory
Error opening encoder directory
These imply that the given directory
could not be found. Since start_merge
ensures the directories exist when it begins, this means that a
directory has "vanished" in midstream. This is usually because a
cross-mounted disk from another computer has gone offline, usually
because the computer has crashed. For example, if you are merging on kilauea and allegro crashes, you will get
these errors. Consult the instructions above for dealing with a
crashed computer.
Cannot open file XXXX, reached max number of tries
This means that for a given raw data
file, no pointing log or encoder log file was found after waiting for
some number of 30-second intervals.
Previous number, this number
This means that the raw data file
minute number incremented by more than 1, which implies files were lost.
Now about to crash!
File size is XXXX
File pointer position is YYYY
feof reports ZZZZ
Now crashing, satisfied...
Happily aborting with error
This error occurs when a raw data file
is the wrong size. Raw data files have an almost perfectly fixed
length set by the number of sampled channels and the number of samples
per minute. This error will usually happen on the last file of
the night because the DAS is usually stopped mid-minute. That's
fine. You should worry when it occurs partway through the night.
You should also worry if merging remains stuck in the wait loop for the
next file. New raw data files should appear every minute, so if
merging stalls for much longer than that, it indicates that the raw data
files are not being generated or are not being copied to allegro.
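When you need to locate the hole that triggered the "minute number incremented by more than 1" error, the check can be reproduced by hand. This is a sketch only: the raw_NNNN naming is an assumption for the demo, so substitute the real DAS filename pattern before using it on actual data.

```shell
# Sketch: report gaps in a sequence of minute-numbered raw files, i.e.
# places where the number jumps by more than 1. The raw_NNNN naming is
# assumed for the demo; adapt the sed pattern to the real DAS filenames.
rawdir=$(mktemp -d)   # demo stand-in for /data00/rawdir/YYYYMMDD
touch "$rawdir/raw_0001" "$rawdir/raw_0002" "$rawdir/raw_0005"

gaps=""
prev=""
for n in $(ls "$rawdir" | sed -n 's/^raw_0*\([0-9][0-9]*\)$/\1/p' | sort -n); do
  if [ -n "$prev" ] && [ $((n - prev)) -gt 1 ]; then
    echo "gap: minutes $((prev + 1)) through $((n - 1)) are missing"
    gaps="$gaps $((prev + 1))-$((n - 1))"
  fi
  prev=$n
done
```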
For the various
cases, do the following:
Data Processing
This section describes how to restart the auto-analysis programs.
NOTE: For any instance where you are
asked to delete files, be careful to always use the -i option so that
you can confirm any deletes. This should be the default on allegro and kilauea, but be sure about it
before you delete anything.
For all cases
Processes that die involuntarily can leave partially written output
files, especially cleaning. Look for .lck files in your data
directories (see the Analysis
Software page for details on where these
would be). For any .lck
files that exist, delete the .lck
file and the associated data file. For example, if the data
directories start at ~/data,
then the command
> find ~/data -path '*.lck' -follow
will find all the .lck
files. Don't forget to include the single quotes. Only delete the .lck files on kilauea; do not delete .lck
files in the cross-mounted directories rawdir/, headerdir/, or encdir/.
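The full cleanup can be sketched as below. This assumes the lock file is named as the data file plus a .lck suffix; verify that convention against the real sliced-data files before deleting anything. The sketch only lists the stale pairs, leaving the actual rm -i to you.

```shell
# Sketch: locate every .lck file under a data tree and pair it with the
# data file it presumably locks (<name>.lck -> <name>). The naming
# convention is an assumption -- verify before deleting, and use rm -i
# when you do. This sketch only prints the pairs.
data=$(mktemp -d)     # demo stand-in for ~/data on kilauea
mkdir -p "$data/sliced"
touch "$data/sliced/obs_0042.nc" "$data/sliced/obs_0042.nc.lck"
touch "$data/sliced/obs_0043.nc"  # healthy file, no lock

stale=""
for lck in $(find "$data" -path '*.lck' -follow); do
  datafile=${lck%.lck}
  echo "stale pair: $lck and $datafile"
  stale="$stale $(basename "$datafile")"
done
```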
Xterms die
Either
because you accidentally killed them, or because kilauea's X server dies, or
because kilauea crashes,
etc. You can
restart the xterm(s) and the routine(s) running in them as
follows. If all your xterms
died, you should still use this by-hand method because start_autos does not supply the necessary obsnum_start argument to run_auto_slice_files.
- Start a new xterm in the second kilauea display and type idl
to start IDL.
- Issue the appropriate command to restart the script:
- For slicing, use
the optional obsnum_start argument
to tell it the first observation you want sliced. Otherwise it
will start from obsnum = 1
for that day. That is, type
run_auto_slice_files, YYYYMMDD, obsnum_start = obsnum_start
- For the other commands just type one of the following
(note
the @ sign!):
- @run_auto_clean_files_ptg
- @run_auto_map_files_ptg
- @run_auto_centroid_files
- @run_auto_clean_files_blankfields
- @run_auto_map_files_blankfields
- @run_auto_diag_clean_files_blankfields
- @run_auto_diag_map_files_blankfields
Xterm(s) remain alive but IDL quits
(very unlikely) Even though the xterm hasn't died, you
will need to kill the offensive xterm(s) and follow the above
instructions for starting new xterms. The reasons are technical,
you can ask the Bolocam support person
if you really care.
Xterm(s) remain alive but an IDL routine crashes
You will need to restart the IDL code by
hand. Various scenarios are described below. Be sure to ALWAYS type retall at the IDL prompt before
attempting to restart the IDL code; this brings you back to the main
IDL program level and prevents unpredictable behavior that may arise
from restarting the code from inside a routine that has crashed.
No ill effects arise from typing retall
when it isn't necessary, so go crazy! Do not use the .full_reset_session or .reset_session executive
commands; these will erase assorted variables that were initialized at
startup and are needed for some of the code to run properly.
IDL cleaning code does not clean observations of a particular source
You probably have
forgotten to add your source to the appropriate source list
files. See the Analysis Software
page for instructions on making the cleaning
pipeline aware of new sources. The pipeline won't be aware of
these changes, though, until you restart it. You can do this in
one of two ways:
- If you still have all the pipeline windows up, just hit q in all of them except the slice_files window to stop the
ongoing processes. If q
does not work, try Ctrl-c
then type retall to get
back to the MAIN level in
IDL. Then, in each window except
slice_files, type
the appropriate one of the following (refer to the xterm title)
- @run_auto_clean_files_ptg
- @run_auto_map_files_ptg
- @run_auto_centroid_files
- @run_auto_clean_files_blankfields
- @run_auto_map_files_blankfields
- @run_auto_diag_clean_files_blankfields
- @run_auto_diag_map_files_blankfields
Don't forget the @
sign! This procedure is similar to what is done above for when
all the Xterms die.
- If you need to restart everything because the windows are
gone, do the usual start_autos
from the shell command line, but then hit Ctrl-c in the slice_files window as soon as
you can so you don't reslice all the data for that day (which will then
cause it all to be reprocessed).
The reason the above works is that, as long as you don't reslice the
files, the analysis routines realize that only the unprocessed
observations need to be analyzed -- the revision dates on the processed
observations' files tell the pipeline they are done. If you
reslice the files,
though, then the sliced files get new revision dates and the pipeline
thinks all the downstream files are out of date and need to be
regenerated.
General Computer Troubleshooting
Computers are built to fail, one might say. Here are some
problems you might run into and how to deal with them, working from the
front-end to the back-end. If you run into a problem that
prevents you from taking data and can't solve it with the following
information, call the CSO pager. The on-call staff member will
either be able to help you or to get the necessary person in touch with
you.
andante (the DAS/fridge computer)
andante has had a troubled
history that seems to dog it no matter what computer we call andante. We have had to
reinstall the system more times than we would like. Hence, we
have become quite expert at it. Here's how to deal if andante starts acting up.
If you start to see crashing of either LabView, disk cross-mounting to allegro, or the entire system
itself, and the problems are not obviously attributable to a specific
cause, the likely problem is that something bad has happened to
Windows. Don't fight it! Your first course of action is to
switch over to the image disk.
When we have andante in a
happily working state, we make a byte-by-byte image of it onto a
second, identical disk. That disk is then powered down and left
sitting in andante.
To switch to the image disk, do the following:
- First, find the target of the desktop shortcuts for BCAM_DAS and fridge_cycle. Copy these
programs off andante, as
updates may have been done since the last time the image disk was
made. If you have network access, you can use SSH (shortcut on
the desktop) to copy the programs to any other computer; allegro or kilauea are good choices since
they are on the local network. If not, you can probably copy the
programs off to a floppy disk. Make sure you copy the programs
themselves, not just the shortcuts! You can find the targets of
the shortcuts by right-clicking on the shortcut and selecting
Properties.
- Second, shut down andante
and open it up. It is nontrivial to open the computer up due to
the way the cover locks. See the instructions.
Find the current system drive and the image drive (both are IDE drives)
-- they are probably sitting right next to each other. The image
drive will likely have no power or IDE cables connected. Simply
switch the power and IDE cables over to the image drive and try
booting. Close up the computer if you are able to boot properly.
- Use SSH to recopy the DAS and fridge cycle programs down to the
image drive. Make sure to put them in the right folders (find the
targets of the desktop shortcuts) and to redefine the shortcuts as
necessary. Do this even if the originals have the same names
(e.g., BCAM_DAS_20040225)
-- there might be minor updates that did not warrant a new name but
need to be propagated.
- If you have had to switch to the image disk, inform the Bolocam support person so that we
can recover the original system disk at the next chance, turning it
into the image disk.
You may have gotten into the much worse situation where you actually
need to reinstall Windows from scratch and you can't just image the
working drive. This will overwrite much of the configuration
information, so it takes some work to get back to a properly working
state. If you have to do this, follow these instructions.
You will be prompted frequently for reboots; go ahead and reboot as
necessary. Log in as bolocam
whenever possible.
- The CDs you will need are in the Bolocam file cabinet in the computer
room. You will find Windows XP Professional Service Pack 2,
Partition Magic 8.0, and LabView 7.1.
- Some software must be downloaded from Caltech's site-licensed
software site. You need a Caltech ITS account for this. If
you don't have one, contact the Bolocam
support person.
- First, power down the computer and remove all the National
Instruments cards. See the instructions
for opening the computer. Remove the PCI-6031E, PCI-6034E, and
PCI-GPIB+ cards from the
computer. Note which slots they were initially in so you can
return them to the right places, and be careful about static
electricity.
- Make sure the computer is connected to the web.
- Install a fresh version of Windows XP PRO - SP2. (In the
options, choose to format the hard drive and install Windows XP).
- Log on as administrator
and make sure to create a password if you weren't prompted to do so
during the installation of Windows (use the same password as noted in the
white Bolocam binder).
- Run Windows Update until the Windows installation is fully up to
date, with all security patches.
- Create a new user account bolocam
with full administrator rights with the same password as written in the
Bolocam white binder.
- Log out of the administrator
account and log in as bolocam.
- If you are using the Dell Precision 420 as andante, get the video card
driver. The video card is a Matrox G400 (http://www.matrox.com). After
rebooting, you can run the
resolution up to something sensible (1200 x
768). You might need to change the frequency to 75 Hz.
These latter settings are accessible by right-clicking in an open space
on the Windows Desktop, which will open the display settings.
Click on the Settings
tab. To find the frequency setting, click on Advanced and then the Monitor tab.
- Download and install VPN-3000 Virtual Private Network client
software from the Caltech ITS site:
- Run VPN-3000 to obtain a Caltech virtual IP address and install
Caltech site-licensed software from http://software.caltech.edu:
- Norton Antivirus. Make sure LiveUpdate is run and that
it is configured to download updates daily at around 22:00 UT (noon
local time).
- F-Secure SSH.
- You may disconnect VPN-3000 at this point.
- Install Partition Magic 8.0 from CD.
- Install the FULL version of Labview 7.1 from CD. Note that the
FULL installation includes the very useful MAX (Measurements &
Automation Explorer).
- Shut down and install all the National Instruments cards, being
sure to put them back in the same slots you removed them from.
Again, take precautions against static electricity.
- Plug the GPIB connector and the two ADC cables in (one ADC cable
comes from the SCXI chassis and connects to the upper PCI card, the
other comes from the white thermometry breakout box and connects to the
middle PCI card. The connectors have different form factors so
there should be no confusion). Restart the computer and log in as
bolocam.
- Fire up MAX (there should be a
shortcut on the desktop labeled Measurement and Automation
Explorer). You should see:
My System
  Devices & Interfaces
    Traditional NI-DAQ Devices
    GPIB
To see the fridge power supplies (the Tektronix
PS2520G modules), right-click on GPIB
and click Scan for Instruments.
Two
GPIB devices should come up. (You may need to left-click on GPIB to open the tree up
further.)
NEED TO UPDATE THE FOLLOWING WHILE
HAVING ACCESS TO PC.
To see the MUX chassis, right click
on Traditional NI-DAQ Devices
and choose Add SCXI Chassis
and pick SCXI-1001.
Right-click on the SCXI-1001 entry, select Properties, and make sure Chassis ID is set to 1 and Chassis Address to 0.
Click on SCXI-1001 and you should see 12 SCXI-1100 modules appear in
the right window. Right-click on the first one and select Properties. Under the General tab, go to Connected to: and select the
PCI-6034E card. Also click the This device will control the chassis
checkbox. Leave the defaults in the other tabs. For the
other SCXI-1100 modules, open their Properties windows and make
sure that the Connected to:
box says None. The This device will control the chassis
checkbox will be grayed out.
- Correct the device numbers in the BCAM_DAS and fridge_cycle LabView
programs. Open fridge_cycle
and look for the PCI-6031E
Device Number
control to the right on the front panel (probably off the
screen). If the default value does not point to the
PCI-6031E card (to the device number indicated in MAX), then change the
value. To save the new value as the default, change to edit mode (Operate -> Change to Edit Mode),
right-click on the
control and select Data
Operations -> Make Current Value Default. Then switch
back to run mode (Operate ->
Change to Run Mode) and save the program. Similarly, open
up
the BCAM_DAS program and
look for the PCI-6034E Device
Number
control at the top of its front panel and repeat the above for this
program.
One or both of these programs may complain of missing vi's
on startup; they are probably in one of the llb files in Program Files/National Instruments.
Dig around until you find them; they will be there.
- Set the IP address to be andante's
static address. Click on Start
-> Connect to -> Show all Connections and then select Local Area Connection and click
on Properties. In
the Components window,
select TCP/IP or possibly
Internet Protocol (TCP/IP)
and then click on Properties.
Check the Use the following IP
address: radio button and type in the following:
IP address: 128.171.86.211
subnet mask: 255.255.255.0
gateway: 128.171.86.2
Also check the Use the following
DNS server addresses: radio button and type in the following:
Preferred DNS Server: 128.171.3.13
Have it change the IP address immediately (i.e., don't wait to reboot).
- Set up the computer
to do network time synchronization. Double-click on the clock in
the lower-right corner of the desktop. The Date and Time Properties dialog
box will come up. Click on the Time Zone tab and make sure the
clock is set to the GMT time zone. Click on the Internet Time tab. Enable
automatic time synchronization with hau.submm.caltech.edu.
Click the Update Now
button.
- Set up disks properly:
- Using Partition Magic, split the master drive into two
partitions, C:\ (~21 GB)
and D:\ (~17 GB; label it DATA). Follow the
instructions. You'll be prompted to reboot the system at the end.
- Create the following directories in D:\
D:\das_data
D:\fridge_data
D:\lab_tests
- Make D:\das_data remotely accessible so data can be
transferred to allegro:
- Use Windows Explorer to get access to D:\
- Right-click on the das_data directory and select Properties.
- Click on the Sharing
tab.
- In the Network
Sharing and Security box, enable Share this folder on the network
and give it the name das_data.
Make sure Allow network users to
change my files remains disabled.
- Turn on the network firewall:
- Right-click on Start
-> Connect To -> Show All Connections
- Right-click on Local
Area Connection and select Properties.
- Click the Advanced
tab and select Settings...
in the Windows Firewall
box. Click the On
button, and click Ok in
all windows that get opened.
- Close the Network
Connections window.
- Turn on the Remote Desktop server to allow remote users to use
this computer:
- Right-click on the My
Computer icon on the desktop and select Properties.
- Click on the Remote
tab.
- In the Remote Desktop
box, enable Allow users to
connect remotely to this computer.
- Make sure that the Remote
Assistance box is disabled.
- Close the windows that have been opened, clicking Ok where necessary.
- The firewall will be
automatically adjusted to allow remote users to connect.
- Check that it works
by remotely connecting from another computer; directions are provided elsewhere.
- Enable automatic Windows Updates:
- Right-click on the My
Computer icon on the desktop and select Properties.
- Click on the Automatic
Updates tab. Make
sure the Automatic option
box is enabled. Set updates to run every day at 21:00 UT
(11:00 am local time) so they don't interfere with observing.
allegro
allegro has been
remarkably stable. If it crashes, instructions for bringing it
back up and cross-mounting disks have been given above. If the system itself seems to
be going belly-up -- e.g., frequent crashing, unexpected behavior --
you can switch to the image disk. This is a disk that, like for andante, is basically a
byte-by-byte image of the system and /home disk. Your data
will be unaffected by this switch. To do this:
- If possible, dismount /data00
from kilauea by logging
in to kilauea as root (password in the white
Bolocam Manual binder) and typing umount
/data00. If you get a device busy error, you may have
to ask people to log out of sessions that happen to be sitting in
one of the /data00
directories (unlikely). If you can't get the disk to
unmount, skip to the next step.
- If possible, copy the directory /home/observer/src to another
computer (e.g., kilauea)
so that the code on the image disk can be updated if necessary.
To do this:
- log in to allegro
as observer
- make sure you are in /home/observer
- type tar --gzip -cvf
src.tar.gz src
- copy src.tar.gz
off to a different computer
- Shut down allegro:
log in as root (password
in the white Bolocam Manual binder), then type shutdown -h now. The
computer will shut down and power off. Flip the power switch on
the back of the computer to the off position also.
- Open up allegro.
You will see a 20 GB drive connected to the primary IDE port on the
motherboard -- this is the current system disk. (Follow the
cables back to the motherboard and you will see the connectors labeled on
the board.) Somewhere else inside the computer there will be
another 20 GB disk without a power cable attached -- this is the image
disk. Switch the IDE and power cables from the original system
disk to the image disk. You may have to move disks around in
order to be able to connect the IDE cable to the image disk.
- Turn the rear power switch back on and then press the front panel
power button to boot the computer from the image disk.
- Update the src
directory as follows:
- log in to allegro
as observer
- make sure you are in /home/observer
- type mv src src_old
- copy src.tar.gz
from wherever you copied it to the current directory
- type tar --gunzip -xvf
src.tar.gz
- Follow the remaining directions above
for bringing allegro back
up (cross-mounting disks)
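The backup-and-restore steps above boil down to a gzipped-tar round trip. A self-contained sketch in a scratch directory (on the real machine the tree is /home/observer/src; the paths and file below are stand-ins):

```shell
# Round-trip the src tree through a gzipped tarball, as in the steps above.
set -e
work=$(mktemp -d)                 # stands in for /home/observer
mkdir -p "$work/src"
echo "observing code" > "$work/src/file.txt"
cd "$work"
tar --gzip -cvf src.tar.gz src    # pack:   tar --gzip -cvf src.tar.gz src
mv src src_old                    # set the existing tree aside
tar --gunzip -xvf src.tar.gz      # unpack: tar --gunzip -xvf src.tar.gz
cmp -s src/file.txt src_old/file.txt && echo "round trip intact"
```

The only extra step on the real systems is carrying src.tar.gz to another computer (e.g., with scp) between the pack and the unpack.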
kilauea
kilauea is managed by Ruisheng Peng. If it
starts having problems, let him and the
Bolocam support person know.
If kilauea just dies
completely and won't reboot, you don't have much recourse until you get
in touch with Ruisheng. However, kilauea is not critical to
taking data. All the normal data-taking processes will continue
to run if kilauea
dies. You can log in to puuoo
as bolocam (same password
as on kilauea) and from
there restart the xterms that monitor rotator, dirsync.py, header_copy.py, and write_log, and also restart gbolostrip following the
directions given above as if you were
doing it on kilauea.
You can also do this from any other computer with an X server; feel free
to use your laptop, or you can also use Reflection X on pika, the PC in the main
computer room.
Problems Accessing
Data Disks or Dying Data Disks
The summit is not a friendly place for hard drives, especially data
drives that get heavily exercised. We keep spare 120 GB data
drives ready for when a data drive on one of the linux machines
dies. Symptoms of this
happening are i/o errors from the processes that write data to or read
data from the particular disks, or simply the directories on a drive not
appearing.
allegro:
For allegro, we
have a spare data drive sitting in the computer ready to go. Just
switch to this drive. Even if it turns out that the original
drive had not fully failed, switching drives will minimize
downtime. Further investigation can be done during the
daytime. To switch drives, do as follows:
- Shut down the computer as explained above, being sure to turn off
the power switch on the back of the computer.
- Open up the computer. Find the 120 GB drive connected to
the secondary IDE bus (follow the cables back and look on the
motherboard for the IDE connector label); this is the current data
drive. Find the spare 120 GB drive also, which will have its
power cable unplugged. Switch the IDE cable from the original 120
GB drive to the spare. Connect a power cable to the spare drive.
- Boot the computer back up as explained above.
- You should see the new /data00
mounted and the /data00/rawdir,
/data00/headerdir, and
/data00/encdir
directories. Now continue as if allegro had simply crashed,
following the instructions given above (mounting disks, etc.).
You will have to restart everything as if it were the start of the
night. Be sure to note in your observing logs the time of the
crash and the last observation taken before the crash. After
restarting everything, your observation numbers will start at 1 again,
but the offset can be inserted later. Include the offset
observation number in your logs of the new observations. For
example, if the disk died during observation 34, then your first
observation after restarting should be labeled 1 (35) to indicate what the
corrected observation number will be.
- For merging and downstream, you need to move the files already
generated for the given day out of the way, as they will cause
confusion to the analysis software. Do as follows:
- Contact the Bolocam support person;
he will work with the day crew to recover the data from the original /data00 and to reprocess the
full day as if there had been no problem.
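The renumbering convention above is simple to script when writing up the log afterward. A sketch, assuming (as in the example) that the disk died during observation 34:

```shell
# Print "new (corrected)" observation labels after a restart.
offset=34                 # last observation number before the crash
labels=$(for n in 1 2 3; do
  printf '%d (%d)\n' "$n" $((n + offset))
done)
echo "$labels"            # 1 (35), 2 (36), 3 (37), one per line
```

Substitute your own offset and range; the corrected number is always the new number plus the last pre-crash observation number.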
kilauea: This
machine uses a SCSI RAID for almost every disk, so its disks should be
pretty robust. Responding to various disk failure modes:
- kilauea:/data
becomes unavailable: This should not matter, as we don't use it
anymore, except if kilauea:/bigdisk
is down. Since /home/kilauea
and /data are on the same
RAID, it's pretty likely that /home/kilauea
will also become unavailable, in which case you can't do anything on kilauea. You can of
course keep taking data -- you just can't analyze it. Call or
email Ruisheng Peng and the
Bolocam support person to let
them know of the kilauea
disk problem.
- kilauea:/bigdisk
becomes unavailable: This makes the processed data directories
unavailable. You can temporarily use kilauea:/data for the processed
data. You need to
- Create the appropriate /data/bolocam/YYYYMM/
directory.
- Create all the standard subdirectories (merged/, sliced/, etc.) as explained elsewhere.
- Create soft links from /home/kilauea/bolocam/data
to these new directories as explained elsewhere.
- Restart the data processing as if starting it at the beginning
of the night. You will have lost all your processed data, but the
processing catches up on the night's data pretty quickly.
- Call or
email Ruisheng Peng and the Bolocam support person to let them
know of the kilauea
disk problem. Depending on whether the problem is recoverable,
you may have lost all of your processed data and will need to reprocess
it. See the AnalysisSoftware
page for instructions.
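The directory-and-link steps of the /bigdisk fallback can be sketched as follows. This runs in a sandbox so it is safe to try anywhere; on kilauea the real roots are /data and /home/kilauea, YYYYMM is the current month, and the exact set of subdirectories and links is defined elsewhere in the manual (merged/ and sliced/ below are just the examples named above):

```shell
# Recreate the processed-data tree under /data and point the usual
# /home/kilauea/bolocam/data path at it with a soft link.
set -e
root=$(mktemp -d)                          # sandbox standing in for /
month=200506                               # illustrative YYYYMM
mkdir -p "$root/data/bolocam/$month"       # temporary processed-data area
for sub in merged sliced; do               # standard subdirectories
  mkdir -p "$root/data/bolocam/$month/$sub"
done
mkdir -p "$root/home/kilauea/bolocam"
ln -s "$root/data/bolocam/$month" "$root/home/kilauea/bolocam/data"
ls "$root/home/kilauea/bolocam/data"       # the link resolves to the new area
```

Because the data-processing code only ever looks under /home/kilauea/bolocam/data, relinking that path is all it takes to redirect it to the temporary area.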
Dying System Disks
System disks can also die. To make it possible to recover quickly
from such a problem, we have created image drives for the andante and allegro system drives.
They are left powered off and disconnected inside the particular
computer.
Note that, on andante,
the system and data drives are different partitions of the same disk,
and so if one begins to fail, so does the other.
The image drives nominally have the same version of the
Bolocam-specific
software as the original drive, but they could be slightly out of
date. To be sure to get the most recent software, first copy the
current programs off the failing drive, as described above for andante and allegro.
To switch over to the image drive, do the following (same for both
computers):
- Shut down and open up the computer.
- Find the 20 GB drive that has no power or IDE cables connected to
it; this is the image drive.
- Switch the IDE cable from the failing drive to the image
drive. Connect a power cable to the image drive.
- Restart the computer as usual.
Once you have switched to the
image drive, restore the software from the copies you made, again as
described above for andante and allegro.
Revision History
- 2003/12/07
Separate troubleshooting section from rest of
Observer Manual
- 2004/02/02 SG
Add link back to main page, minor updates
- 2004/02/04 SG
Add instructions on getting rid of partially completed files and on the
"Unable to free memory" error. Rearrange a bit.
- 2004/02/09 SG
Add instructions for dealing with big trck_das/trck_tel offsets in
cleaning and for observations of a given source not being cleaned,
improve section on restarting merging.
- 2004/02/26 SG
Add instructions for restarting data copying programs using new version
of start_tel_util. Advise about raw data file size problems in
merging troubleshooting section.
- 2004/02/27 SG
Modify instructions for restarting data copying programs to use
check_tel_util and kill_tel_util.
- 2004/04/28 SG
Minor fixes, lots of new material
- 2004/05/02 SG
Lots of new computer troubleshooting, esp. reinstalling Windows on the
DAS computer
- 2004/05/03 SG
Update for modifications to write_log
- 2004/05/06 SG
Update restart of DAS after andante crash -- can have problems
smbmounting if gbolostrip is still trying to access the disk.
- 2004/05/08 SG
Further updating of DAS restart instructions.
- 2004/05/09 SG
Add instructions for switching to image system disks.
- 2004/05/26 SG
Add instructions for dealing with 2-drive death in RAID.
- 2004/10/02 SG
Add more detailed information on merging errors. Now that
write_log deals with RPC timeouts properly, do not force kill of
write_log when telescope computer crashes. Remove warning about
header_copy copying files from previous day if telescope computer was
restarted, should no longer happen.
- 2004/12/10 SG
Add check of fiber-optic isolator DIP switches if there are rotator
problems.
- 2005/03/19 SG
All kinds of updates for puuoo SCSI RAID.
- 2005/06/03 SG
Significant updates for reinstalling software on andante and for move
of /bigdisk to kilauea.
- 2005/12/16 SG
More detailed instructions for recovering when a source was not added
to the params files before processing.
- 2010/12/18 JS
Updated the kilauea restart procedure.
Questions or
comments?
Contact the Bolocam support person.