PRACE PATC Course: Node-Level Performance Engineering (Dec 3 and 4, 2013)

Thu Oct 31 11:43:14 CET 2013

    Date:      Dec 3, 2013 9:00 - 17:00                                
                Dec 4, 2013 9:00 - 17:00                                

   Location:    LRZ Building, University campus Garching, near Munich   

                This course teaches performance engineering approaches  
                at the compute node level. "Performance Engineering" as
                we define it is more than employing tools to identify   
                hotspots and bottlenecks. It is about developing a      
                thorough understanding of the interactions between      
                software and hardware. This process must start at the   
                core, socket, and node level, where the code that does
                the actual computational work is executed. Once the
                architectural requirements of a code are understood
                and correlated with performance measurements, the       
                potential benefit of optimizations can often be         
                predicted. We introduce a "holistic" node-level
                performance engineering strategy, apply it to different 
                algorithms from computational science, and also show how
                an awareness of the performance features of an          
                application may lead to notable reductions in power     
                consumption.                                            

                Introduction                                            

                  * Intel and AMD x86 architectures                     
                  * ccNUMA                                              
                  * Performance modeling & engineering approaches       
                  * Our Approach                                        

                Practical performance analysis                          

                  * The LIKWID tools                                    
                  * Typical performance patterns                        

                Microbenchmarks and the memory hierarchy                

                  * Understanding the memory hierarchy                  
                      + Data transfer between memory levels             
                      + Write allocate vs. NT stores                    
                      + Modeling of cache hierarchies                   
                      + Contention                                      
                  * NUMA effects - anisotropy and asymmetry

                Typical node-level software overheads                   

                  * Cost of synchronization                             
   Contents:      * Work distribution                                   

                Example Problem: The 3D Jacobi solver                   

                  * Core-level optimizations                            
                      + Blocking                                        
                      + Non-temporal stores
                      + SIMD vectorization (SSE, AVX)                   
                  * Multithreading - contention at different levels of
                    the memory hierarchy
                  * Temporal blocking

                Example Problem: The Lattice-Boltzmann Method (LBM)     

                  * Introduction                                        
                  * Roofline Model                                      
                  * Data layout                                         
                  * Non-temporal stores
                  * Model for in-cache data & multicore scaling         
                  * Sparse representation and options for propagation   

                Example Problem: Sparse Matrix-Vector Multiplication    

                  * Data layouts                                        
                  * Performance model - CPU vs. GPU
                  * Bandwidth reduction                                 

                Example Problem: A backprojection algorithm for CT      
                reconstruction                                          

                  * The algorithm                                       
                  * Naive analysis                                      
                  * Detailed analysis and performance model             
                  * Optimizations                                       

                Energy & Parallel Scalability                           

                  * Energy consumption of modern processors             
                  * The energy-to-solution metric                       
                  * Performance engineering == power engineering        
                  * Case studies                                        

                Between modules, there is time for questions and
                answers.

 Prerequisites: Participants must have basic knowledge of programming
                in Fortran or C

   Language:    English                                                 

   Teachers:    Prof. Gerhard Wellein/RRZE, Dr. Georg Hager/RRZE, et al.

                LRZ registration form:
 Registration:  http://www.lrz-muenchen.de/services/schulung/kursanmeldung
                (Please choose course HNPF1W13)

 This information is available on the web at
 http://www.lrz-muenchen.de/services/compute/supermuc/aktuell/ali4695/

 Matthias Brehm