SuperMUC: On the way to resolving the stability issues

Wed Jan 30 08:54:59 CET 2013

Dear users of the SuperMUC Petaflop system,

 As many of you know, we have encountered significant stability
 issues on the system over the past months, not all of which had been
 resolved by January 2013. These problems particularly affect large
 jobs, as well as jobs that place a heavy load on the I/O subsystems,
 especially GPFS.

 Investigation of the causes of these instabilities has progressed to a
 point that allows us to provide information on IBM's and LRZ's plans
 for significantly improving operational stability.

 Issue 1: Occasional switch outages caused by a race condition in the
 management part of the switch logic. This issue will be fixed during
 the Jan 28-31 maintenance phase.

 Issue 2: A relatively high number of InfiniBand cables have shown
 degradation or failure. While the exact cause is still under
 investigation, the prime suspect is excessive temperature in some
 parts of the air-cooled switch infrastructure. This will be addressed
 by increasing the fan speeds during the Jan 28-31 maintenance phase.
 The defective cables will also be replaced.

 Issue 3: The hard disks underlying GPFS are not designed for
 uninterrupted high-bandwidth I/O and may degrade in performance over
 time if overstressed. Given the I/O profiles we observe, this is not a
 problem as long as such disks are automatically retired from use once
 they degrade, and then exchanged. However, this automatic mechanism
 inside the storage system does not presently work in some scenarios.
 Unfortunately, even a small number of slow disks will considerably
 slow down the complete file system. At present, manual disk reviews
 are carried out at short intervals to alleviate this problem; we
 expect that a full solution will become available within a few weeks.
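
 As a rough illustration of why a few slow disks can hold back a whole
 parallel file system, the sketch below models a transfer striped
 evenly across several disks. It is a simplified model under stated
 assumptions, not a description of GPFS internals: the function name
 striped_throughput and all bandwidth figures are hypothetical, and the
 model assumes that each disk receives an equal share of the data and
 that the transfer completes only when the slowest disk has finished.

```python
def striped_throughput(disk_bandwidths_mb_s):
    """Aggregate bandwidth (MB/s) of an evenly striped transfer.

    Assumption: every disk receives an equal share of the data, and the
    transfer finishes only when the slowest disk has finished. Each disk
    then moves 1/n of the data at the pace of the slowest one, so the
    aggregate rate is n times the minimum per-disk bandwidth.
    """
    if not disk_bandwidths_mb_s:
        raise ValueError("need at least one disk")
    n = len(disk_bandwidths_mb_s)
    return n * min(disk_bandwidths_mb_s)


# Hypothetical numbers for illustration, not measurements from SuperMUC.
healthy = [100.0] * 10            # ten disks at 100 MB/s each
degraded = [100.0] * 9 + [10.0]   # one of the ten has dropped to 10 MB/s

print(striped_throughput(healthy))   # 1000.0
print(striped_throughput(degraded))  # 100.0
```

 In this simplified model a single degraded disk out of ten reduces the
 aggregate rate by a factor of ten, which is why retiring such disks
 promptly matters.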

 Given the above, considerable stability improvements should already be
 observable once the system is returned to regular operation after the
 above-mentioned maintenance. Please provide feedback (via our Service
 Desk) as soon as possible if you find that this is not the case, and
 also if you already have an incident ticket open with us.

 This information is also available on our web server:
 http://www.lrz-muenchen.de/services/compute/supermuc/aktuell/ali4512/

 Reinhold Bader