SuperMUC: Explanation for recent system problems

LRZ aktuell publish at lrz.de
Fri Mar 15 15:17:16 CET 2013


Dear users of SuperMUC,
 
 After a phase of stable operation since the end of January, problems
 with the stability of the I/O subsystems resurfaced after the last
 maintenance on March 11 and 12. This message explains what happened,
 based on the root cause analysis performed by IBM and DDN over the
 past few days.
 
 During the maintenance, a mechanism on the storage systems that is
 intended to contribute to data integrity was temporarily switched off;
 when this mechanism was returned to operation, a number of data blocks
 were erroneously marked as bad due to a bug in the controller firmware.
 As a consequence, a total of three files were damaged, and a GPFS
 outage resulted whenever any of these three files was read. The
 resolution process consisted of identifying the bug, the list of bad
 blocks, and the three files associated with the bad blocks. The files
 have been rendered inaccessible, and their owners have been notified.
 
 The firmware bug is triggered if a combination of exactly three
 conditions is fulfilled (one of them being the switching-on process
 mentioned above); unfortunately, with a storage system of the size
 installed at our site, the probability of this occurring is rather
 high. Changing a setting on the controllers eliminates one of the
 conditions that trigger the bug; for this reason we believe that the
 system can now be operated safely again. The changed setting will be
 reverted to its old value once a new firmware release that fixes the
 bug has been installed.
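 
 As a rough illustration of why system size matters here, the following
 Python sketch computes the chance that at least one of many independent
 units meets a rare condition. The function name, the per-unit
 probability, and the unit counts are hypothetical values chosen for
 illustration only; they are not figures from the IBM and DDN analysis.

```python
def prob_at_least_one(p_per_unit: float, n_units: int) -> float:
    """Chance that at least one of n_units independent units is affected,
    given a per-unit probability p_per_unit (assumed equal for all units)."""
    return 1.0 - (1.0 - p_per_unit) ** n_units


# Hypothetical numbers for illustration only (not measured values).
if __name__ == "__main__":
    for n_units in (10, 1_000, 100_000):
        print(n_units, prob_at_least_one(1e-4, n_units))
```

 Even a very small per-unit probability approaches certainty once the
 number of units is large enough, which is the effect described above.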
 
 Please remember to archive important data stored on the WORK or SCRATCH
 file systems to tape via TSM (see
 http://www.lrz.de/services/compute/backup/ and use a login to
 supermuc-tsm.lrz.de; the file system policies are described at
 http://www.lrz.de/services/compute/supermuc/filesystems/). Even for
 rapidly changing datasets (like restart files), it may be advisable to
 do this at least occasionally in order to prevent accumulated loss of
 CPU budget. Data corruption can result not only from system-related
 bugs but also from external influences, e.g., a power failure.
 
 Once more, apologies for the delay in processing your jobs and thanks
 for your patience.


 This information is also available on our web server
 http://www.lrz-muenchen.de/services/compute/supermuc/aktuell/ali4544/

 Reinhold Bader


