Dear users of the SuperMUC Petaflop system,

As many of you know, we have encountered significant stability issues on the system over the last months, not all of which had been resolved by January 2013. These problems particularly affect large jobs, as well as jobs that place heavy stress on the I/O subsystems, especially GPFS. Investigation of the causes of these instabilities has progressed to a point that allows us to provide information on IBM's and LRZ's plans for significantly improving operational stability.

Issue 1: Occasional switch outages caused by a race condition in the management part of the switch logic. This issue will be fixed during the Jan 28-31 maintenance phase.

Issue 2: A relatively high number of InfiniBand cables have shown degradation or failure. While the exact cause is still under investigation, the prime suspect is excessive temperature in some parts of the air-cooled switch infrastructure. This will be addressed by increasing the fan speeds during the Jan 28-31 maintenance phase. The defective cables will also be replaced.

Issue 3: The hard disks underlying GPFS are not designed for uninterrupted high-bandwidth I/O and may degrade in performance over time if overstressed. Given the I/O profiles we observe, this is not a problem as long as such disks are automatically retired from use once degradation sets in, and then exchanged. However, this automatic mechanism inside the storage system does not presently work in some scenarios. Unfortunately, even a small number of slow disks will considerably slow down the complete file system. At present, manual disk reviews are carried out at short intervals to alleviate this problem; we expect a full solution to become available within a few weeks.

Given the above, considerable stability improvements should already be observable once the system is returned to regular operation after the above-mentioned maintenance.
Please provide feedback via our Service Desk as soon as possible if you find that this is not the case, and also if you already have an incident ticket open at our site.

This information is also available on our web server:
http://www.lrz-muenchen.de/services/compute/supermuc/aktuell/ali4512/

Reinhold Bader