SuperMUC: On the way to resolving the stability issues
publish at lrz.de
Wed Jan 30 08:54:59 CET 2013
Dear users of the SuperMUC Petaflop system,
As many of you know, we have encountered significant stability
issues on the system over the past months, not all of which were
resolved by January 2013. These problems particularly impact large
jobs, as well as jobs that highly stress the I/O subsystems, especially
Investigation of the causes for these instabilities has progressed to a
point that allows us to provide information on IBM's and LRZ's plans
for significant improvement of operational stability.
Issue 1: Occasional switch outages caused by a race condition in the
management part of the switch logic. This issue will be fixed during
the Jan 28-31 maintenance phase.
Issue 2: A relatively high number of InfiniBand cables have shown
degradation or failure. While the exact cause is still under
investigation, the prime suspect is excessive temperature in some
parts of the air-cooled switch infrastructure. This will be addressed by
increasing the fan speeds during the Jan 28-31 maintenance phase. The
defective cables will also be replaced.
Issue 3: The hard disks underlying the GPFS are not designed for
uninterrupted high-bandwidth I/O and may degrade in performance over
time if overstressed. Given the I/O profiles we observe, this is not an
issue if such disks are automatically retired from usage once the
problem arises, and then exchanged. However, this automatic retirement
inside the storage system does not presently work in some scenarios.
Unfortunately, even a small number of slow disks will considerably slow
down the complete file system. At present, manual disk reviews are done
at short intervals to alleviate this problem; we expect that a full
solution will become available within a few weeks.
Given the above, considerable stability improvements should already be
observable once the system is returned to regular operation after the
above-mentioned maintenance. Please provide feedback (via our Service
Desk) as soon as possible if you find this is not the case, and also
if you already have an incident ticket open at our site.
This information is also available on our web server