Bad blocks on Symbolics disks...

Once a month or so our 3670 has been crashing due to "Irrecoverable
disk error" or "Unproceedable disk search error." The 3670 is our
namespace, file, mail, print, dialnet domain and documentation server
for the entire office LISPM network and uses three disks with 15,000
LMFS files.  It also contained over fifteen customized demonstration
and project world loads.  One crash a month was tolerable.  However, one
day I came into work and found that I couldn't list the disk or access
most files without generating an "Irrecoverable Disk Error - 1 pending
operation to disk unit 2."

The world appeared to be corrupted but it was much worse than that.  There
were apparently two bad blocks in one of the unit 2 files.

Also, every month or two we would run into a corrupted file in the LMFS
(Mail log file trashed or some such similar death).  Symbolics tells me
that this happens.... a file will be corrupted on the disk every month
or two and it's to be expected.  This is disturbing in that Symbolics
is telling me I cannot rely on the file system to store data without

Things looked bad so at this point we called in Symbolics support.
Zeroing blocks (in the style of ECC errors) didn't help and attempts
to write blocks into the bad blocks list failed.  Unfortunately
Symbolics and I accidentally removed the bad-blocks list while attempting
to further correct the problem.  When asking which course of action to
take in repairing the file (zero out, splice or remove), the debugger
was asking what operation to perform on the BAD-BLOCKS file, not the file
with the bad blocks in it... and we selected "delete."

and herein lies the interesting part of my story...

Symbolics concluded that our only remaining course of action was to
format the disk and put a new IFS on it, and before doing that we would
vacate the LMFS partitions on the bad disk (FEP2) into another
partition on another disk (FEP0).  This sounded straightforward
enough.  The vacate process should have kept track of all pointers and
the several LMFI we had scattered about the FEP0 and FEP1 disks should
have been able to point to each other even after the vacation.

After formatting I rebooted the system and did a set site, but set site
complained that an octal ID number for a partition was not found.  In
other words, the vacate did not properly link the LMFS partitions back
together.  Symbolics concluded that nothing could be done for the LMFS
at this point.  Patching the volume table was pointless since some
functions thought there were 9 partitions, others thought there were 7.
The LMFS was royally screwed.  I ended up removing all 7 LMFS
partitions and consequently all 200,000 records of our LMFS on all
three disks.

Mark Grover was wandering the halls mumbling "Oh the horror... the horror..."

The moral of this story is that vacating LMFS partitions is not a
guarantee of preserving the LMFS.  It is also a sad commentary on
Symbolics LISPMs when two bad disk blocks should necessitate ten days
of downtime, rebuilding one entire disk and our entire LMFS.  I've
never been satisfied with the amount of system administrative software
on the machine and this is one example of where that weakness can be
devastating... not to mention loss of project development time and the
cost to me personally.

The story ends happily though.... I keep double and sometimes triple
sets of backup tapes of both the FEP and the LMFS and so was able to
restore our LMFS and worlds into the clean partitions.  Our machine was
back up and running ten days after the first crash.  We are still
operating at reduced capacity since the one disk will have to be
replaced, but we are hopeful that once this is done we will have no
more problems.

John T. Nelson			UUCP: rutgers!mimsy!rlgvax!sundc!potomac!jtn
Advanced Decision Systems	Internet:  jtn@ads.arpa
1500 Wilson Blvd #512; Arlington, VA 22209-2401		(703) 243-1611

			*OOP*  *ACK*
               _   /|