SuperMUC: Explanation for recent system problems
Dear users of SuperMUC,

After a phase of stable operation since the end of January, problems with the stability of the I/O subsystems resurfaced after the last maintenance on March 11 and 12. This document explains what happened, based on the root cause analysis performed by IBM and DDN over the last few days.

During the maintenance, a mechanism on the storage systems that is intended to contribute to data integrity was temporarily switched off; when this mechanism was returned to operation, a number of data blocks were erroneously marked as bad due to a bug in the controller firmware. As a consequence, a total of three files were damaged, and reading any of these three files led to a GPFS outage.

The resolution process consisted of identifying the bug, the list of bad blocks, and the three files associated with the bad blocks. The files have been rendered inaccessible, and their owners have been notified.

The firmware bug is triggered if a combination of exactly three conditions is fulfilled (one of them being the switching-on process mentioned above); unfortunately, with a storage system of the size installed at our site, the probability of this occurring is rather high. Changing a setting on the controllers eliminates one of the conditions that trigger the bug; for this reason we believe that the system can now be operated safely again. The changed setting will be reverted to its old value once a new firmware release that fixes the bug is installed.

Please remember to archive important data stored on the WORK or SCRATCH file systems to tape via TSM (see http://www.lrz.de/services/compute/backup/ and use a login to supermuc-tsm.lrz.de; the file system policies are described in http://www.lrz.de/services/compute/supermuc/filesystems/). Even for rapidly changing datasets (like restart files), it may be advisable to do this at least occasionally in order to prevent accumulated loss of CPU budget.
Corrupted data can result not only from system-related bugs but also from external influences, e.g., a power failure.

Once more, apologies for the delay in processing your jobs, and thanks for your patience.

This information is also available on our web server: http://www.lrz-muenchen.de/services/compute/supermuc/aktuell/ali4544/

Reinhold Bader