SuperMUC: Explanation for recent system problems
LRZ aktuell
publish at lrz.de
Fri Mar 15 15:17:16 CET 2013
Dear users of SuperMUC,
After a phase of stable operation since the end of January, problems
with the stability of the I/O subsystems resurfaced after the last
maintenance on March 11 and 12. This message explains what happened,
based on the root cause analysis performed by IBM and DDN over the
past few days.
During the maintenance, a mechanism on the storage systems that is
intended to contribute to data integrity was temporarily switched off.
When this mechanism was returned to operation, a number of data blocks
were erroneously marked as bad due to a bug in the controller firmware.
As a consequence, a total of three files were damaged, and reading any
of them caused a GPFS outage. The resolution process consisted of
identifying the bug, the list of bad blocks, and the three files
associated with the bad blocks. The files have been rendered
inaccessible, and their owners have been notified.
The firmware bug is triggered when a combination of exactly three
conditions is fulfilled (one of them being the switching-on process
mentioned above). Unfortunately, with a storage system of the size
installed at our site, the probability of this occurring is rather
high. Changing a setting on the controllers eliminates one of the
conditions that trigger the bug; for this reason we believe that the
system can now be operated safely again. The changed setting will be
reverted to its old value once a new firmware release that fixes the
bug has been installed.
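
As a rough illustration of why system size matters: if each storage
unit independently had a small chance of meeting all trigger conditions,
the chance that at least one unit is affected grows quickly with the
number of units. The sketch below assumes independence, and its
per-unit probability and unit counts are purely hypothetical; they are
not figures from the IBM/DDN analysis.

```python
def prob_at_least_one(p_single: float, n_units: int) -> float:
    """Probability that at least one of n_units independent units
    meets the trigger conditions, given per-unit probability p_single."""
    return 1.0 - (1.0 - p_single) ** n_units


if __name__ == "__main__":
    # Hypothetical per-unit probability, chosen only for illustration.
    p_single = 0.001
    for n_units in (10, 1000, 10000):
        print(n_units, prob_at_least_one(p_single, n_units))
```

With a fixed per-unit probability, the result approaches 1 as the
number of units increases, which is the effect described above.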
Please remember to archive important data stored on the WORK or SCRATCH
file systems to tape via TSM (see
http://www.lrz.de/services/compute/backup/ and use a login to
supermuc-tsm.lrz.de; the file system policies are described in
http://www.lrz.de/services/compute/supermuc/filesystems/). Even for
rapidly changing datasets (like restart files), it may be advisable to
do this at least occasionally in order to limit the loss of CPU budget
already invested in them. Data corruption can result not only from
system-related bugs but also from external influences, e.g., a power
failure.
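
A minimal sketch of how such an occasional archive run might be
scripted on the TSM login node, assuming the standard TSM command-line
client (dsmc) is available there. The directory path and description
text are hypothetical placeholders, and the helper names are
illustrative only; please consult the backup documentation linked above
for the options that actually apply at LRZ.

```python
import subprocess
from typing import List


def build_archive_command(directory: str, description: str) -> List[str]:
    """Build a `dsmc archive` call for a directory tree.

    The command is only assembled here, not executed.
    """
    # dsmc expands the wildcard itself, so no shell is involved.
    pattern = directory.rstrip("/") + "/*"
    return [
        "dsmc", "archive", pattern,
        "-subdir=yes",                    # include subdirectories
        f"-description={description}",    # label to find the archive later
    ]


def archive(directory: str, description: str) -> None:
    """Run the archive command; raises CalledProcessError on failure."""
    subprocess.run(build_archive_command(directory, description), check=True)


if __name__ == "__main__":
    # Hypothetical path; replace with your own WORK or SCRATCH directory.
    cmd = build_archive_command("/path/to/work/myrun", "restart files")
    print(" ".join(cmd))
```

Separating command construction from execution makes it possible to
inspect the exact call before anything is sent to tape.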
Once more, apologies for the delay in processing your jobs and thanks
for your patience.
This information is also available on our web server:
http://www.lrz-muenchen.de/services/compute/supermuc/aktuell/ali4544/
Reinhold Bader