
Tutorials

The SC Tutorials program is always one of the highlights of the SC Conference, offering attendees a variety of short courses on key topics and technologies relevant to high performance computing, networking, storage, and analysis. Tutorials also provide the opportunity to interact with recognized leaders in the field and to learn about the latest technology trends, theory, and practical techniques. As in years past, tutorial submissions were subjected to a rigorous peer review process. Of the 77 submissions, the Tutorials Committee selected the following 33 tutorials for presentation.

Sunday, 11/16/14

Full-Day
(8:30am-5pm)

A “Hands-On” Introduction to OpenMP
Presenters:
Mark Bull (Edinburgh Parallel Computing Centre)
Tim Mattson (Intel Corporation)
Mike Pearce (Intel Corporation)

Abstract: OpenMP is the de facto standard for writing parallel applications for shared memory computers. With multi-core processors in everything from tablets to high-end servers, the need for multithreaded applications is growing and OpenMP is one of the most straightforward ways to write such programs. In this tutorial, we will cover the core features of the OpenMP 4.0 standard. This will be a hands-on tutorial. We expect students to use their own laptops (with Windows, Linux, or OS X). We will have access to systems with OpenMP (a remote SMP server), but the best option is for students to load an OpenMP compiler onto their laptops before the tutorial. Information about OpenMP compilers is available at www.openmp.org.
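
For readers who want to try OpenMP before the session, here is a minimal C sketch of the kind of "hello world" program a hands-on introduction typically starts from (illustrative, not taken from the tutorial materials):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* Fork a team of threads; each thread prints its own ID. */
        #pragma omp parallel
        {
            int id = omp_get_thread_num();
            int nthreads = omp_get_num_threads();
            printf("Hello from thread %d of %d\n", id, nthreads);
        }
        return 0;
    }

Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp hello_omp.c; the flag differs by compiler.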

 

Sunday, 11/16/14

Full-Day
(8:30am-5pm)

Efficient Parallel Debugging for MPI, Threads, and Beyond
Presenters:
Bronis R. de Supinski (Lawrence Livermore National Laboratory)
Ganesh Gopalakrishnan (University of Utah)
Tobias Hilbrich (Technische Universität Dresden)
David Lecomber (Allinea Software)
Matthias S. Müller (Aachen University)
Mark O’Connor (Allinea Software)
Joachim Protze (Aachen University)

Abstract: Parallel software enables modern simulations on high performance computing systems. Defects—or commonly bugs—in these simulations can have dramatic consequences on matters of utmost importance. This is especially true if defects remain unnoticed and silently corrupt results. The correctness of parallel software is a key challenge for simulations that we can trust. At the same time, the parallelism that enables these simulations also challenges their developers, since it gives rise to new sources of defects. We invite attendees to a tutorial that addresses the efficient removal of software defects. The tutorial provides systematic debugging techniques that are tailored to defects that revolve around parallelism. We present leading-edge tools that aid developers in pinpointing, understanding, and removing defects in MPI, OpenMP, MPI-OpenMP, and further parallel programming paradigms. The tutorial tools include the widely used parallel debugger Allinea DDT, the Intel Inspector XE that reveals data races, and the two MPI correctness tools MUST and ISP. Hands-on examples—make sure to bring a laptop computer—will guide attendees in the use of these tools. We will conclude the tutorial with a discussion and will provide pointers for debugging with paradigms such as CUDA or on architectures such as Xeon Phi.
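
To give a flavor of the defects these tools target, here is a hedged sketch (not from the tutorial materials) of a classic MPI mistake: both ranks post a blocking send before their receive, which can deadlock and which correctness tools such as MUST or ISP will flag:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, peer, sendbuf, recvbuf = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = 1 - rank;              /* assumes exactly two ranks */
        sendbuf = rank;

        /* Potential deadlock: both ranks block in MPI_Send if the message
           is too large for the MPI library to buffer internally. */
        MPI_Send(&sendbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(&recvbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d received %d\n", rank, recvbuf);
        MPI_Finalize();
        return 0;
    }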

 

Sunday, 11/16/14

Full-Day
(8:30am-5pm)

From "Hello World" to Exascale Using x86, GPUs and Intel Xeon Phi Coprocessors
Presenter:
Robert M. Farber (Blackdog Endeavors, LLC)

Abstract: Both GPUs and Intel Xeon Phi coprocessors can provide a teraflop/s of performance. Working source code will demonstrate how to achieve such high performance using OpenACC, OpenMP, CUDA, and Intel Xeon Phi. Key data structures for GPUs and multi-core, such as low-wait counters, accumulators, and massively parallel stacks, will be covered. Short, understandable examples will walk students from “Hello World” first programs to exascale-capable computation via a generic mapping for numerical optimization that demonstrates near-linear scaling on conventional, GPU, and Intel Xeon Phi based leadership-class supercomputers. Students will work hands-on with code that delivers an average of a teraflop/s of performance per device, plus MPI code that scales to the largest leadership-class supercomputers, and will leave with the ability to solve generic optimization problems including data-intensive PCA (Principal Components Analysis) and NLPCA (Nonlinear Principal Components), plus numerous machine learning and optimization algorithms. A generic framework for data-intensive computing will be discussed and provided. Real-time visualization and video processing will also be covered because GPUs make superb big-data visualization platforms.
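
As a hint of what such code looks like, here is a hedged sketch of an OpenACC-parallel objective-function evaluation in the spirit of the optimization mapping described above (the sum-of-squared-errors objective and array names are illustrative, not the tutorial's actual framework):

    #include <stdio.h>
    #include <stdlib.h>

    /* Sum-of-squared-errors objective, evaluated in parallel on the device. */
    double objective(const double *pred, const double *obs, int n)
    {
        double err = 0.0;
        #pragma acc parallel loop reduction(+:err) copyin(pred[0:n], obs[0:n])
        for (int i = 0; i < n; i++) {
            double d = pred[i] - obs[i];
            err += d * d;
        }
        return err;
    }

    int main(void)
    {
        enum { N = 1 << 20 };
        double *pred = malloc(N * sizeof *pred);
        double *obs  = malloc(N * sizeof *obs);
        for (int i = 0; i < N; i++) { pred[i] = i * 1e-6; obs[i] = i * 1e-6 + 0.5; }
        printf("objective = %g\n", objective(pred, obs, N));
        free(pred); free(obs);
        return 0;
    }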

Sunday, 11/16/14

Full-Day
(8:30am-5pm)

Hands-On Practical Hybrid Parallel Application Performance Engineering
Presenters:
Markus Geimer (Forschungszentrum Jülich GmbH)
Sameer S. Shende (University of Oregon)
Bert Wesarg (Technical University Dresden)
Brian J. N. Wylie (Forschungszentrum Jülich GmbH)

Abstract: This tutorial presents state-of-the-art performance tools for leading-edge HPC systems founded on the Score-P community-developed instrumentation and measurement infrastructure, demonstrating how they can be used for performance engineering of effective scientific applications based on standard MPI, OpenMP, hybrid combinations of both, and the increasingly common use of accelerators. Parallel performance tools from the Virtual Institute - High Productivity Supercomputing (VI-HPS) are introduced and featured in hands-on exercises with Scalasca, Vampir, and TAU. We present the complete workflow of performance engineering, including instrumentation, measurement (profiling and tracing, timing and PAPI hardware counters), data storage, analysis, and visualization. Emphasis is placed on how tools are used in combination for identifying performance problems and investigating optimization alternatives. Participants will use their own notebook computers with a provided Linux Live-ISO image containing all of the tools (running within a virtual machine or booted directly from DVD/USB), which will help prepare them to locate and diagnose performance bottlenecks in their own parallel programs.

 

Sunday, 11/16/14

Full-Day
(8:30am-5pm)

Large Scale Visualization with ParaView
Presenters:
David E. DeMarle (Kitware Inc.)
Joseph Insley (Argonne National Laboratory)
Ollie Lo (Los Alamos National Laboratory)
Robert Maynard (Kitware Inc.)
Kenneth Moreland (Sandia National Laboratories)
W. Alan Scott (Sandia National Laboratories)

Abstract: ParaView is a powerful open-source turnkey application for analyzing and visualizing large data sets in parallel. Designed to be configurable, extendible, and scalable, ParaView is built upon the Visualization Toolkit (VTK) to allow rapid deployment of visualization components. This tutorial presents the architecture of ParaView and the fundamentals of parallel visualization. Attendees will learn the basics of using ParaView for scientific visualization with hands-on lessons. The tutorial features detailed guidance in visualizing the massive simulations run on today’s supercomputers and an introduction to scripting and extending ParaView. Attendees should bring laptops to install ParaView and follow along with the demonstrations.

 

Sunday, 11/16/14

Full-Day
(8:30am-5pm)

Linear Algebra Libraries for High-Performance Computing: Scientific Computing with Multicore and Accelerators
Presenters:
James Demmel (University of California, Berkeley)
Michael Heroux (Sandia National Laboratories)
Jakub Kurzak (University of Tennessee)

Abstract: Today, a desktop with a multicore processor and a GPU accelerator can already provide a teraflop/s of performance, while the performance of high-end systems, based on multicores and accelerators, is measured in petaflop/s. This tremendous computational power can only be fully utilized with the appropriate software infrastructure, both at the low end (desktop, server) and at the high end (supercomputer installation). Most often, a major part of the computational effort in scientific and engineering computing goes into solving linear algebra subproblems. After providing a historical overview of legacy software packages, the tutorial surveys the current state-of-the-art numerical libraries for solving problems in linear algebra, both dense and sparse. The MAGMA, (D)PLASMA and Trilinos software packages are discussed in detail. The tutorial also highlights recent advances in algorithms that minimize communication, i.e., data motion, which is much more expensive than arithmetic.
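
As a baseline for what these libraries modernize, here is a hedged sketch of a dense solve through the legacy LAPACK interface via its C wrapper LAPACKE (MAGMA and PLASMA provide analogous routines, whose exact calls are not reproduced here):

    #include <stdio.h>
    #include <lapacke.h>

    int main(void)
    {
        /* Solve the 2x2 system A x = b by LU factorization (dgesv). */
        double a[4] = { 4.0, 1.0,
                        1.0, 3.0 };
        double b[2] = { 1.0, 2.0 };
        lapack_int ipiv[2];

        lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1, a, 2, ipiv, b, 1);
        if (info != 0) {
            fprintf(stderr, "dgesv failed: info = %d\n", (int)info);
            return 1;
        }
        printf("x = [%f, %f]\n", b[0], b[1]);
        return 0;
    }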

 

Sunday, 11/16/14

Full-Day
(8:30am-5pm)

Parallel Computing 101
Presenters:
Christiane Jablonowski (University of Michigan)
Quentin F. Stout (University of Michigan)

Abstract: This tutorial provides a comprehensive overview of parallel computing, emphasizing those aspects most relevant to the user. It is suitable for new users, managers, students and anyone seeking an overview of parallel computing. It discusses software and hardware/software interaction, with an emphasis on standards, portability, and systems that are widely available. The tutorial surveys basic parallel computing concepts, using examples selected from multiple engineering and scientific problems. These examples illustrate using MPI on distributed memory systems, OpenMP on shared memory systems, MPI+OpenMP on hybrid systems, GPU programming, and Hadoop on big data. It discusses numerous parallelization and load balancing approaches, and software engineering and performance improvement aspects, including the use of state-of-the-art tools. The tutorial helps attendees make intelligent decisions by covering the primary options that are available, explaining how they are used and what they are most suitable for. Extensive pointers to the literature and web-based resources are provided to facilitate follow-up studies.
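
For attendees who have never written a parallel program, here is a minimal sketch (illustrative, not tutorial material) of the MPI "hello world" that distributed-memory examples usually start from:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's ID */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

A typical run compiles with mpicc and launches with mpirun -np 4 ./a.out (launcher names vary by MPI implementation).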

 

Sunday, 11/16/14

Full-Day
(8:30am-5pm)

Parallel Programming in Modern Fortran
Presenters:
Salvatore Filippone (University of Rome Tor Vergata)
Karla Morris (Sandia National Laboratories)
Damian Rouson (Sourcery, Inc.)

Abstract: User surveys from high-performance computing (HPC) centers in the U.S. and Europe consistently show Fortran predominating, but most users describe their programming skills as self-taught and most continue to use older versions of the language. The increasing compiler support for the parallel programming features of Fortran 2008 makes the time ripe to offer instruction on these features to the HPC community. This tutorial teaches single-program, multiple-data (SPMD) programming with Fortran 2008 coarrays. We also introduce Fortran's loop concurrency and pure procedure features and demonstrate their use in asynchronous expression evaluation for partial differential equation (PDE) solvers. We incorporate other language features, including object-oriented (OO) programming, when they support our chief aim of teaching parallel programming. In particular, we demonstrate OO design patterns that enable hybrid CPU/GPU calculations on sparse matrices in the Parallel Sparse Basic Linear Algebra Subroutines (PSBLAS) library. The students will obtain hands-on experience with parallel programming in modern Fortran through the use of virtual machines.

Sunday, 11/16/14

Full-Day
(8:30am-5pm)

Programming the Xeon Phi
Presenters:
Lars Koesterke (Texas Advanced Computing Center)
Kent Milfeld (Texas Advanced Computing Center)
Dan Stanzione (Texas Advanced Computing Center)
Jerome Vienne (Texas Advanced Computing Center)

Abstract: The use of heterogeneous architectures in HPC at the large scale has become increasingly common over the past few years. One new technology for HPC is the Intel Xeon Phi coprocessor, also known as the MIC. The Xeon Phi is x86 based, hosts its own Linux OS, and is capable of running most codes with little porting effort. However, the MIC architecture has significant features that are different from those of current x86 CPUs. Attaining optimal performance requires an understanding of possible execution models and the architecture. This tutorial is designed to introduce attendees to the MIC architecture in a practical manner. Experienced C/C++ and Fortran programmers will be introduced to techniques essential for utilizing the MIC architecture efficiently. Multiple lectures and hands-on exercises will be used to acquaint attendees with the MIC platform and to explore the different execution modes as well as parallelization and optimization through example testing and reports. All exercises will be executed on the Stampede system at the Texas Advanced Computing Center (TACC). Stampede features more than 2 PF of performance using 100,000 Intel Xeon E5 cores and an additional 7+ PF of performance from more than 6,400 Xeon Phi coprocessors.
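
One of the execution modes discussed is host-directed offload. Here is a hedged sketch using the Intel compiler's offload pragmas (array names and sizes are illustrative; exact clause spellings should be checked against the compiler documentation):

    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 0.0; }

        /* Ship a to the coprocessor, run the loop there with OpenMP
           threads, and copy b back to the host afterwards. */
        #pragma offload target(mic) in(a:length(N)) inout(b:length(N))
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            b[i] += 2.0 * a[i];

        printf("b[42] = %f\n", b[42]);
        free(a); free(b);
        return 0;
    }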

Sunday, 11/16/14

Full-Day
(8:30am-5pm)

SciDB - Manage and Analyze Terabytes of Array Data
Presenters:
Lisa Gerhardt (Lawrence Berkeley National Laboratory)
Jeremy Kepner (Massachusetts Institute of Technology Lincoln Laboratory)
Marilyn Matz (Paradigm4)
Alex Poliakov (Paradigm4)
Yushu Yao (Lawrence Berkeley National Laboratory)

Abstract: With the emergence of the Internet of Everything in the commercial and industrial worlds and with the advances in device and instrument technologies in the science world, there is an urgent need for data scientists and scientists to be able to work more easily with extremely large and diverse data sets. SciDB is an open-source analytical database for scalable complex analytics on very large array or multi-structured data from a variety of sources, programmable from R and Python. It runs on HPC, commodity hardware grids, or in a cloud. We present an overview of SciDB’s array data model, programming and query interfaces, and math library. We will demonstrate how to set up a SciDB cluster, ingest data from various standard file formats, design schema, and analyze data on two use cases: one from computational genomics and one using satellite imagery of the earth. This tutorial will help computational scientists learn how to do interactive exploratory data mining and analytics on terabytes of data. During the tutorial, attendees will have an opportunity to describe their own data and key operations needed for analysis. The presenters can guide them through implementing their use case in SciDB.

 

Sunday, 11/16/14

Half-Day
(8:30am-12pm)

A Computation-Driven Introduction to Parallel Programming in Chapel
Presenters:
Bradford L. Chamberlain (Cray Inc.)
Sung-Eun Choi (Cray Inc.)
Greg Titus (Cray Inc.)

Abstract: Chapel (http://chapel.cray.com) is an emerging parallel language whose design and development are being led by Cray Inc. in collaboration with members of computing labs, academia, and industry—both domestically and internationally. Chapel aims to vastly improve programmability, generality, and portability compared to current parallel programming models while supporting comparable or improved performance. Chapel’s design and implementation are portable and open-source, supporting a wide spectrum of platforms from desktops (Mac, Linux, and Windows) to commodity clusters and large-scale systems developed by Cray and other vendors. This tutorial will provide an in-depth introduction to Chapel’s concepts and features using a computation-driven approach: rather than simply lecturing on individual language features, we will introduce each Chapel concept by studying its use in a real computation taken from a motivating benchmark or proxy application. As time permits, we will also demonstrate Chapel interactively in lieu of a hands-on session. We’ll wrap up the tutorial by providing an overview of Chapel status and activities, and by soliciting participants for their feedback to improve Chapel’s utility for their parallel computing needs.

 

Sunday, 11/16/14

Half-Day
(8:30am-12pm)

How to Analyze the Performance of Parallel Codes 101
Presenters:
James E. Galarowicz (Krell Institute)
Mahesh Rajan (Sandia National Laboratories)
Martin Schulz (Lawrence Livermore National Laboratory)

Abstract: Performance analysis is an essential step in the development of HPC codes. It will only gain in importance with the rising complexity of the machines and applications we are seeing today. Many tools exist to help with this analysis, but the user is too often left alone to interpret the results. In this tutorial we will provide a practical road map for the performance analysis of HPC codes and will give users step-by-step advice on how to detect and address common performance problems in HPC codes. We will cover both on-node performance and communication optimization, and will also touch on threaded and accelerator-based architectures. Throughout this tutorial, we will show live demos using Open|SpeedShop, a comprehensive and easy-to-use performance analysis tool set, to demonstrate the individual analysis steps. All techniques will, however, apply broadly to any tool, and we will point out alternative tools where useful.

Sunday, 11/16/14

Half-Day
(8:30am-12pm)

MPI+X - Hybrid Programming on Modern Compute Clusters with Multicore Processors and Accelerators
Presenters:
Georg Hager (Erlangen Regional Computing Center)
Rolf Rabenseifner (High Performance Computing Center Stuttgart)

Abstract: Most HPC systems are clusters of shared memory nodes. Such SMP nodes range from small multi-core CPUs to large many-core CPUs. Parallel programming may combine the distributed memory parallelization on the node interconnect (e.g., with MPI) with the shared memory parallelization inside each node (e.g., with OpenMP or MPI-3.0 shared memory). This tutorial analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes. Multi-socket-multi-core systems in highly parallel environments are given special consideration. MPI-3.0 introduced a new shared memory programming interface, which can be combined with inter-node MPI communication. It can be used for direct neighbor accesses similar to OpenMP or for direct halo copies, and enables new hybrid programming models. These models are compared with various hybrid MPI+OpenMP approaches and pure MPI. This tutorial also includes a discussion on OpenMP support for accelerators. Benchmark results are presented for modern platforms such as Intel Xeon Phi and Cray XC30. Numerous case studies and micro-benchmarks demonstrate the performance-related aspects of hybrid programming. The various programming schemes and their technical and performance implications are compared. Tools for hybrid programming such as thread/process placement support and performance analysis are presented in a "how-to" section. Details: https://fs.hlrs.de/projects/rabenseifner/publ/SC2014-hybrid.html
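
To illustrate the MPI-3.0 shared memory interface mentioned above, here is a hedged sketch (buffer size and usage are illustrative) that splits off a node-local communicator and allocates a shared window that ranks on the same node access directly with loads and stores:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Comm shmcomm;
        MPI_Win  win;
        double  *local;                       /* this rank's slice of the window */
        int shmrank, shmsize;

        MPI_Init(&argc, &argv);

        /* Group the ranks that share a memory domain (typically one node). */
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &shmcomm);
        MPI_Comm_rank(shmcomm, &shmrank);
        MPI_Comm_size(shmcomm, &shmsize);

        /* One double per rank, allocated in node-local shared memory. */
        MPI_Win_allocate_shared(sizeof(double), sizeof(double),
                                MPI_INFO_NULL, shmcomm, &local, &win);

        MPI_Win_lock_all(0, win);             /* open a shared access epoch */
        *local = (double)shmrank;
        MPI_Win_sync(win);                    /* flush my store */
        MPI_Barrier(shmcomm);                 /* wait until everyone has written */
        MPI_Win_sync(win);                    /* pick up the other ranks' stores */

        if (shmrank > 0) {                    /* read the left neighbor directly */
            MPI_Aint sz; int disp; double *left;
            MPI_Win_shared_query(win, shmrank - 1, &sz, &disp, &left);
            printf("rank %d of %d sees neighbor value %f\n",
                   shmrank, shmsize, left[0]);
        }
        MPI_Win_unlock_all(win);

        MPI_Win_free(&win);
        MPI_Comm_free(&shmcomm);
        MPI_Finalize();
        return 0;
    }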

 

Sunday, 11/16/14

Half-Day
(1:30pm-5pm)

PGAS and Hybrid MPI+PGAS Programming Models on Modern HPC Clusters
Presenters:
Khaled Hamidouche (Ohio State University)
Dhabaleswar K. (DK) Panda (Ohio State University)

Abstract: Multi-core processors, accelerators (GPGPUs/MIC) and high-performance interconnects with RDMA are shaping the architecture for next-generation exascale clusters. Efficient programming models to design applications on these systems are still evolving. Partitioned Global Address Space (PGAS) models provide an attractive alternative to the traditional MPI model, owing to their easy-to-use shared memory abstractions and lightweight one-sided communication. Hybrid MPI+PGAS models are gaining attention as a possible solution to programming exascale systems. They help MPI applications to take advantage of PGAS models, without paying the prohibitive cost of redesigning complete applications. They also enable hierarchical design of applications using different models to suit modern architectures. In this tutorial, we provide an overview of the research and development taking place and discuss associated opportunities and challenges as we head toward exascale. We start with an in-depth overview of modern system architectures with multi-core processors, accelerators and high-performance interconnects. We present an overview of UPC and OpenSHMEM. We introduce MPI+PGAS hybrid programming models and highlight their advantages and challenges. We examine the challenges in designing high-performance UPC, OpenSHMEM and unified MPI+UPC/OpenSHMEM runtimes. We present application case-studies to demonstrate the productivity and performance of MPI+PGAS models, using the publicly available MVAPICH2-X software package.

 

Sunday, 11/16/14

Half-Day
(1:30pm-5pm)

Practical Fault Tolerance on Today's Supercomputing Systems
Presenters:
Nathan DeBardeleben (Los Alamos National Laboratory)
Laxmikant Kale (University of Illinois at Urbana-Champaign)
Kathryn Mohror (Lawrence Livermore National Laboratory)
Eric Roman (Lawrence Berkeley National Laboratory)

Abstract: Failure rates on high performance computing systems are increasing with component count. Applications running on these systems currently experience failures on the order of days; however, on future systems, predictions of failure rates range from minutes to hours. Developers need to defend their application runs from losing valuable data by using fault-tolerant techniques. These techniques range from changing algorithms, to checkpoint and restart, to programming model-based approaches. In this tutorial, we will present introductory material for developers who wish to learn fault-tolerant techniques available on today’s systems. We will give background information on the kinds of faults occurring on today’s systems and trends we expect going forward. Following this, we will give detailed information on several fault-tolerant approaches and how to incorporate them into applications. Our focus will be on scalable checkpoint and restart mechanisms and programming model-based approaches.
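
As a concrete starting point, here is a hedged sketch of the simplest of these techniques, application-level checkpoint and restart, where each rank periodically writes its own state and reloads it when the job is restarted (generic code, not one of the specific libraries covered in the tutorial):

    #include <mpi.h>
    #include <stdio.h>

    #define NSTEPS   1000
    #define CKPT_INT 100                /* checkpoint every 100 steps */

    int main(int argc, char **argv)
    {
        int rank, step = 0;
        double state = 0.0;             /* stand-in for the real simulation state */
        char fname[64];
        FILE *f;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        snprintf(fname, sizeof fname, "ckpt.%d", rank);

        /* On restart, resume from the last checkpoint if one exists. */
        if ((f = fopen(fname, "rb")) != NULL) {
            if (fread(&step, sizeof step, 1, f) != 1 ||
                fread(&state, sizeof state, 1, f) != 1) {
                step = 0;               /* unreadable checkpoint: start over */
                state = 0.0;
            }
            fclose(f);
        }

        for (; step < NSTEPS; step++) {
            state += 1.0;               /* one step of "computation" */

            if ((step + 1) % CKPT_INT == 0) {
                int next = step + 1;    /* restart resumes after this step */
                f = fopen(fname, "wb");
                fwrite(&next,  sizeof next,  1, f);
                fwrite(&state, sizeof state, 1, f);
                fclose(f);
            }
        }

        if (rank == 0) printf("final state = %f\n", state);
        MPI_Finalize();
        return 0;
    }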

Sunday, 11/16/14

Half-Day
(1:30pm-5pm)

Scaling I/O Beyond 100,000 Cores Using ADIOS
Presenters:
Scott Klasky (Oak Ridge National Laboratory)
Qing Liu (Oak Ridge National Laboratory)
Norbert Podhorszki (Oak Ridge National Laboratory)

Abstract: As concurrency and complexity continue to increase on high-end machines, in terms of both the number of cores and the depth of the storage hierarchy, managing I/O efficiently becomes more and more challenging. Moving forward, one of the major roadblocks to exascale is how to manipulate, write, and read big datasets quickly and efficiently on high-end machines. In this tutorial we will demonstrate I/O practices and techniques that are crucial to achieve high performance on 100,000+ cores. Part I of this tutorial will introduce parallel I/O and the ADIOS framework to the audience. Specifically, we will discuss the concept of the ADIOS I/O abstraction, the binary-packed file format, and I/O methods, along with the benefits to applications. Since version 1.4.1, ADIOS can operate on both files and data streams. Part II will include a session on how to write/read data, and how to use different I/O componentizations inside of ADIOS. Part III will show users how to take advantage of the ADIOS framework to do topology-aware data movement, compression and data staging/streaming using DIMES/FLEXPATH.
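
For orientation, here is a hedged sketch of the ADIOS 1.x write path as presented in the user manual; the group name, XML configuration, and exact call signatures below are recalled from that documentation and may differ slightly from the version used in the tutorial:

    #include <mpi.h>
    #include <adios.h>

    /* Each process writes one array into a shared .bp file; the I/O method
       (MPI, POSIX, staging, ...) is chosen in config.xml, not in the code. */
    void write_temperature(MPI_Comm comm, int rank, int NX, double *t)
    {
        int64_t  handle;
        uint64_t groupsize, totalsize;

        adios_init("config.xml", comm);
        adios_open(&handle, "temperature", "temperature.bp", "w", comm);

        groupsize = sizeof(int) + NX * sizeof(double);
        adios_group_size(handle, groupsize, &totalsize);

        adios_write(handle, "NX", &NX);
        adios_write(handle, "temperature", t);

        adios_close(handle);
        adios_finalize(rank);
    }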

 

Monday, 11/17/14

Full-Day
(8:30am-5pm)

Advanced MPI Programming
Presenters:
Pavan Balaji (Argonne National Laboratory)
William Gropp (University of Illinois at Urbana-Champaign)
Torsten Hoefler (ETH Zurich)
Rajeev Thakur (Argonne National Laboratory)

Abstract: The vast majority of production parallel scientific applications today use MPI and run successfully on the largest systems in the world. For example, several MPI applications are running at full scale on the Sequoia system (on ∼1.6 million cores) and achieving 12 to 14 petaflop/s of sustained performance. At the same time, the MPI standard itself is evolving (MPI-3 was released in late 2012) to address the needs and challenges of future extreme-scale platforms as well as applications. This tutorial will cover several advanced features of MPI, including new MPI-3 features, that can help users program modern systems effectively. Using code examples based on scenarios found in real applications, we will cover several topics including efficient ways of doing 2D and 3D stencil computation, derived datatypes, one-sided communication, hybrid (MPI + shared memory) programming, topologies and topology mapping, and neighborhood and nonblocking collectives. Attendees will leave the tutorial with an understanding of how to use these advanced features of MPI and guidelines on how they might perform on different platforms and architectures.
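
As one example of the features listed, here is a hedged sketch (grid dimensions and data are illustrative) of a halo exchange over a 2D Cartesian process topology using an MPI-3 neighborhood collective:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, dims[2] = {0, 0}, periods[2] = {1, 1};
        MPI_Comm cart;
        double sendhalo[4], recvhalo[4];   /* one value per neighbor: -x,+x,-y,+y */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Let MPI pick a balanced 2D decomposition and reorder ranks. */
        MPI_Dims_create(size, 2, dims);
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

        for (int i = 0; i < 4; i++) sendhalo[i] = rank;

        /* Exchange one double with each of the four Cartesian neighbors. */
        MPI_Neighbor_alltoall(sendhalo, 1, MPI_DOUBLE,
                              recvhalo, 1, MPI_DOUBLE, cart);

        printf("rank %d received %g %g %g %g\n", rank,
               recvhalo[0], recvhalo[1], recvhalo[2], recvhalo[3]);

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }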

 

Monday, 11/17/14

Full-Day
(8:30am-5pm)

Advanced OpenMP: Performance and 4.0 Features
Presenters:
Bronis R. de Supinski (Lawrence Livermore National Laboratory)
Michael Klemm (Intel Corporation)
Eric J. Stotzer (Texas Instruments)
Christian Terboven (RWTH Aachen University)
Ruud van der Pas (Oracle)

Abstract: With the increasing prevalence of multicore processors, shared-memory programming models are essential. OpenMP is a popular, portable, widely supported and easy-to-use shared-memory model. Developers usually find OpenMP easy to learn. However, they are often disappointed with the performance and scalability of the resulting code. This disappointment stems not from shortcomings of OpenMP but rather from the lack of depth with which it is employed. Our “Advanced OpenMP Programming” tutorial addresses this critical need by exploring the implications of possible OpenMP parallelization strategies, both in terms of correctness and performance. While we quickly review the basics of OpenMP programming, we assume attendees understand basic parallelization concepts and will easily grasp those basics. We focus on performance aspects, such as data and thread locality on NUMA architectures, false sharing, and exploitation of vector units. We discuss language features in-depth, with emphasis on advanced features like tasking and those recently added to OpenMP 4.0 such as cancellation. We close with the presentation of the new directives for attached compute accelerators.
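
A hedged sketch of two of the features mentioned, tasking with dependences and explicit SIMD vectorization (the arithmetic is a placeholder, not tutorial material):

    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) a[i] = i;

        #pragma omp parallel
        #pragma omp single
        {
            /* Two tasks with a dependence: the second may not start
               until the first has produced b. */
            #pragma omp task depend(out: b)
            for (int i = 0; i < N; i++) b[i] = 2.0 * a[i];

            #pragma omp task depend(in: b)
            {
                /* OpenMP 4.0 SIMD construct: vectorize this loop. */
                #pragma omp simd
                for (int i = 0; i < N; i++) c[i] = b[i] + 1.0;
            }
        }   /* tasks complete at the implicit barrier ending the single region */

        printf("c[10] = %f\n", c[10]);
        return 0;
    }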

 

Monday, 11/17/14

Full-Day
(8:30am-5pm)

Debugging and Performance Tools for MPI and OpenMP 4.0 Applications for CPUs and Accelerators/Coprocessors
Presenters:
Damian Alvarez (Forschungszentrum Jülich)
Mike Ashworth (STFC Daresbury Laboratory)
Vince Betro (University of Tennessee, Knoxville)
Chris Gottbrath (Rogue Wave Software)
Nikolay Piskun (Rogue Wave Software)
Sandra Wienke (RWTH Aachen University)

Abstract: With High Performance Computing trends heading towards increasingly heterogeneous solutions, scientific developers face challenges adapting software to leverage these new architectures. For instance, many systems feature nodes that couple multi-core processors with GPU-based computational accelerators, like the NVIDIA Kepler, or many-core coprocessors, like the Intel Xeon Phi. In order to effectively utilize these systems, programmers need to leverage an extreme level of parallelism in applications. Developers also need to juggle multiple programming paradigms including MPI, OpenMP, CUDA, and OpenACC. This tutorial provides an in-depth exploration of parallel debugging and optimization, focused on techniques that can be used with accelerators and coprocessors. We cover debugging techniques such as grouping, advanced breakpoints and barriers, and MPI message queue graphing. We discuss optimization techniques like profiling, tracing, and cache memory optimization with tools such as Vampir, Scalasca, TAU, CrayPAT, VTune, and the NVIDIA Visual Profiler. Participants have the opportunity to do hands-on GPU and Intel Xeon Phi debugging and profiling. Additionally, the OpenMP 4.0 standard, which introduces novel capabilities for both Xeon Phi and GPU programming, will be covered. We will discuss peculiarities of that specification with respect to error finding and optimization. A laptop will be required for the hands-on sessions.

 

Monday, 11/17/14

Full-Day
(8:30am-5pm)

Fault-tolerance for HPC: Theory and Practice
Presenters:
George Bosilca (University of Tennessee, Knoxville)
Aurélien Bouteiller (University of Tennessee, Knoxville)
Thomas Hérault (University of Tennessee, Knoxville)
Yves Robert (ENS Lyon)

Abstract: Resilience is a critical issue for large-scale platforms. This tutorial provides a comprehensive survey of fault-tolerant techniques for high-performance computing, with a fair balance between practice and theory. It is organized along four main topics: (i) An overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal); (ii) General-purpose techniques, which include several checkpoint and rollback recovery protocols, replication, prediction and silent error detection; (iii) Application-specific techniques, such as ABFT for grid-based algorithms or fixed-point convergence for iterative applications; and (iv) Practical deployment of fault-tolerant techniques with User Level Fault Mitigation (an MPI extension proposed to the MPI Forum). Relevant examples based on widespread computational solver routines will be protected with a mix of checkpoint-restart and advanced recovery techniques in a hands-on session. The tutorial is open to all SC'14 attendees who are interested in the current status and expected promise of fault-tolerant approaches for scientific applications. There are no audience prerequisites: background will be provided for all protocols and probabilistic models. However, basic knowledge of MPI will be helpful for the hands-on session.
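
To make topic (iv) concrete, here is a hedged sketch of the recovery pattern that User Level Fault Mitigation enables; the MPIX_ function names follow the ULFM proposal and its prototype implementations and may differ from the environment used in the hands-on session:

    #include <mpi.h>
    /* ULFM extensions are typically declared in an implementation-specific
       header such as <mpi-ext.h>; adjust for the MPI library in use. */

    /* A collective step that rebuilds the communicator when a rank has failed. */
    int robust_allreduce(double *in, double *out, MPI_Comm *comm)
    {
        MPI_Comm_set_errhandler(*comm, MPI_ERRORS_RETURN);

        int rc = MPI_Allreduce(in, out, 1, MPI_DOUBLE, MPI_SUM, *comm);
        if (rc != MPI_SUCCESS) {
            MPI_Comm shrunk;
            MPIX_Comm_revoke(*comm);           /* interrupt ranks still waiting */
            MPIX_Comm_shrink(*comm, &shrunk);  /* communicator of the survivors */
            MPI_Comm_free(comm);
            *comm = shrunk;
            /* ... redistribute work, roll back to the last checkpoint ... */
            return 1;                          /* caller retries the step */
        }
        return 0;
    }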

 

Monday, 11/17/14

Full-Day
(8:30am-5pm)

Node-Level Performance Engineering
Presenters:
Georg Hager (Erlangen Regional Computing Center)
Jan Treibig (Erlangen Regional Computing Center)
Gerhard Wellein (Erlangen Regional Computing Center)

Abstract: The advent of multi- and manycore chips has led to a further opening of the gap between peak and application performance for many scientific codes. This trend is accelerating as we move from petascale to exascale. Paradoxically, bad node-level performance helps to “efficiently” scale to massive parallelism, but at the price of increased overall time to solution. If the user cares about time to solution on any scale, optimal performance on the node level is often the key factor. We convey the architectural features of current processor chips, multiprocessor nodes, and accelerators, as far as they are relevant for the practitioner. Peculiarities like SIMD vectorization, shared vs. separate caches, bandwidth bottlenecks, and ccNUMA characteristics are introduced, and the influence of system topology and affinity on the performance of typical parallel programming constructs is demonstrated. Performance engineering and performance patterns are suggested as powerful tools that help the user understand the bottlenecks at hand and to assess the impact of possible code optimizations. A cornerstone of these concepts is the roofline model, which is described in detail, including useful case studies, limits of its applicability, and possible refinements.
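
For reference, the basic roofline bound mentioned above can be stated in one line: for a loop kernel with computational intensity I (flops per byte of memory traffic) on a machine with peak performance P_peak and memory bandwidth b_S, the attainable performance P is

    P = min( P_peak , I * b_S )

so kernels whose intensity lies below the machine balance P_peak / b_S are bandwidth-bound, and kernels above it are compute-bound.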

 

Monday, 11/17/14

Full-Day
(8:30am-5pm)

OpenACC: Productive, Portable Performance on Hybrid Systems Using High-Level Compilers and Tools
Presenters:
James Beyer (Cray Inc.)
Luiz DeRose (Cray Inc.)
Alistair Hart (Cray Inc.)
Heidi Poxon (Cray Inc.)

Abstract: Portability and programming difficulty are two critical hurdles in generating widespread adoption of accelerated computing in high performance computing. The dominant programming frameworks for accelerator-based systems (CUDA and OpenCL) offer the power to extract performance from accelerators, but with extreme costs in usability, maintenance, development, and portability. To be an effective HPC platform, hybrid systems need a high-level programming environment to enable widespread porting and development of applications that run efficiently on either accelerators or CPUs. In this hands-on tutorial we present the high-level OpenACC parallel programming model for accelerator-based systems, demonstrating compilers, libraries, and tools that support this cross-vendor initiative. Using personal experience in porting large-scale HPC applications, we provide development guidance, practical tricks, and tips to enable effective and efficient use of these hybrid systems, both in terms of runtime and energy efficiency.
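
To give a flavor of the model, here is a hedged sketch (the stencil and array names are illustrative, not from the tutorial's applications) of a Jacobi-style sweep expressed with OpenACC data and loop directives:

    #include <stdio.h>

    #define N     1024
    #define STEPS 100

    static double a[N][N], anew[N][N];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = (i == 0) ? 1.0 : 0.0;   /* fixed boundary at i == 0 */

        /* Keep both arrays resident on the accelerator for all iterations. */
        #pragma acc data copy(a) create(anew)
        {
            for (int step = 0; step < STEPS; step++) {
                #pragma acc parallel loop collapse(2)
                for (int i = 1; i < N - 1; i++)
                    for (int j = 1; j < N - 1; j++)
                        anew[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] +
                                             a[i][j-1] + a[i][j+1]);

                #pragma acc parallel loop collapse(2)
                for (int i = 1; i < N - 1; i++)
                    for (int j = 1; j < N - 1; j++)
                        a[i][j] = anew[i][j];
            }
        }

        printf("a[1][1] = %f\n", a[1][1]);
        return 0;
    }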

 

Monday, 11/17/14

Full-Day
(8:30am-5pm)

OpenCL: A Hands-On Introduction
Presenters:
Alice Koniges (Lawrence Berkeley National Laboratory)
Tim Mattson (Intel Corporation)
Simon McIntosh-Smith (University of Bristol)

Abstract: OpenCL is an open standard for programming heterogeneous parallel computers composed of CPUs, GPUs and other processors. OpenCL consists of a framework to manipulate the host CPU and one or more compute devices plus a C-based programming language for writing programs for the compute devices. Using OpenCL, a programmer can write parallel programs that harness all of the resources of a heterogeneous computer. In this hands-on tutorial, we introduce OpenCL using the more accessible C++ API. The tutorial format will be a 50/50 split between lectures and exercises. Students will use their own laptops (Windows, Linux, or OS X) and log into a remote server running an OpenCL platform. Alternatively, students can load OpenCL onto their own laptops prior to the course (Intel, AMD and NVIDIA provide OpenCL SDKs. Apple laptops with Xcode include OpenCL by default. Be sure to configure Xcode to use the command-line interface). The last segment of the tutorial will be spent visiting the “OpenCL zoo”: a diverse collection of OpenCL-conformant devices. Tutorial attendees will run their own programs on devices in the zoo to explore performance portability of OpenCL. The zoo should include a mix of CPU, GPU, FPGA, mobile, and DSP devices.
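
For a taste of the device-side language (the host-side API setup is omitted), here is a hedged sketch of the OpenCL C vector-add kernel usually used as a first exercise:

    /* OpenCL C kernel: each work-item adds one element. The host enqueues
       it with a global work size of at least n work-items. */
    __kernel void vadd(__global const float *a,
                       __global const float *b,
                       __global float       *c,
                       const unsigned int    n)
    {
        int i = get_global_id(0);
        if (i < n)                  /* guard against rounded-up global sizes */
            c[i] = a[i] + b[i];
    }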

 

Monday, 11/17/14

Full-Day
(8:30am-5pm)

Parallel I/O In Practice
Presenters:
Katie Antypas (National Energy Research Scientific Computing Center)
Robert J. Latham (Argonne National Laboratory)
Robert Ross (Argonne National Laboratory)
Brent Welch (Google)

Abstract: I/O on HPC systems is a black art. This tutorial sheds light on the state-of-the-art in parallel I/O and provides the knowledge necessary for attendees to best leverage I/O resources available to them. We cover the entire I/O software stack from parallel file systems at the lowest layer, to intermediate layers (such as MPI-IO), and finally high-level I/O libraries (such as HDF5). We emphasize ways to use these interfaces that result in high performance. Benchmarks on real systems are used throughout to show real-world results. This tutorial first discusses parallel file systems (PFSs) in detail. We cover general concepts and examine three examples: GPFS, Lustre, and PanFS. We examine the upper layers of the I/O stack, covering POSIX I/O, MPI-IO, Parallel netCDF, and HDF5. We discuss interface features, show code examples, and describe how application calls translate into PFS operations. Finally, we discuss I/O best practices.
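
As an example from the middle of that stack, here is a hedged sketch (file name and sizes are illustrative) of a collective MPI-IO write in which each rank writes a contiguous block at its own offset:

    #include <mpi.h>
    #include <stdio.h>

    #define COUNT 1024                  /* doubles written per rank */

    int main(int argc, char **argv)
    {
        int rank;
        double buf[COUNT];
        MPI_File fh;
        MPI_Offset offset;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < COUNT; i++) buf[i] = rank;

        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Collective write: all ranks participate, so the MPI-IO layer can
           aggregate requests before they reach the parallel file system. */
        offset = (MPI_Offset)rank * COUNT * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, COUNT, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }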

 

Monday, 11/17/14

Full-Day
(8:30am-5pm)

Python in HPC
Presenters:
Matt Knepley (University of Chicago)
Kurt Smith (Enthought, Inc)
Andy Terrel (Continuum Analytics)
Matthew Turk (Columbia University)

Abstract: The Python ecosystem empowers the HPC community with a stack of tools that are not only powerful but a joy to work with. It is consistently one of the top languages in HPC, with a growing, vibrant community of open-source tools. Proven to scale on the world's largest clusters, it is a language that has continued to innovate with a wealth of new data tools. This tutorial will survey the state-of-the-art tools and techniques used by HPC Python experts throughout the world. The first half of the day will include an introduction to the standard toolset used in HPC Python and techniques for speeding up Python and for using legacy codes by wrapping Fortran and C. The second half of the day will include discussion on using Python in a distributed workflow via MPI and tools for handling large-scale visualizations. Students should be familiar with basic Python syntax; we recommend the Python 2.7 tutorial on python.org. We will include hands-on demonstrations of building simulations, wrapping low-level code, executing on a cluster via MPI, and using visualization tools. Examples for a range of experience levels will be provided.

 

Monday, 11/17/14

Half-Day
(8:30am-12pm)

In Situ Data Analysis and Visualization with ParaView Catalyst
Presenters:
Andrew Bauer (Kitware, Inc.)
Thierry Carrard (French Alternative Energies and Atomic Energy Commission)
Jeffrey Mauldin (Sandia National Laboratories)
David H. Rogers (Los Alamos National Laboratory)

Abstract: As supercomputing moves towards exascale, scientists, engineers and medical researchers will look for efficient and cost-effective ways to enable data analysis and visualization for the products of their computational efforts. The ‘exa’ metric prefix stands for quintillion, and the proposed exascale computers would perform approximately as many operations per second as 50 million laptops. Clearly, typical spatial and temporal data reduction techniques employed for post-processing will not yield desirable results, since reductions of 10^3, 10^6, or 10^9 may still produce petabytes, terabytes, or gigabytes of data to transfer or store. Since transferring or storing data may no longer be viable for many simulation applications, data analysis and visualization must now be performed in situ. ParaView Catalyst is an open-source data analysis and visualization library, which aims to reduce I/O by tightly coupling simulation, data analysis and visualization codes. This tutorial presents the architecture of ParaView Catalyst and the fundamentals of in situ data analysis and visualization. Attendees will learn the basics of using ParaView Catalyst with hands-on exercises. The tutorial features detailed guidance in implementing C++, Fortran and Python examples. Attendees should bring laptops to install a VirtualBox image and follow along with the demonstrations.

 

Monday, 11/17/14

Half-Day
(8:30am-12pm)

InfiniBand and High-Speed Ethernet for Dummies
Presenters:
Dhabaleswar K. (DK) Panda (Ohio State University)
Hari Subramoni (Ohio State University)

Abstract: InfiniBand (IB) and High-speed Ethernet (HSE) technologies are generating a lot of excitement towards building next generation High-End Computing (HEC) systems including clusters, datacenters, file systems, storage, cloud computing and Big Data (Hadoop, HBase and Memcached) environments. RDMA over Converged Enhanced Ethernet (RoCE) technology is also emerging. This tutorial will provide an overview of these emerging technologies, their offered architectural features, their current market standing, and their suitability for designing HEC systems. It will start with a brief overview of IB and HSE. In-depth overview of the architectural features of IB and HSE (including iWARP and RoCE), their similarities and differences, and the associated protocols will be presented. Next, an overview of the emerging OpenFabrics stack which encapsulates IB, HSE and RoCE in a unified manner will be presented. Hardware/software solutions and the market trends behind IB, HSE and RoCE will be highlighted. Finally, sample performance numbers of these technologies and protocols for different environments will be presented.

 

Monday, 11/17/14

Half-Day
(8:30am-12pm)

Introducing R: From Your Laptop to HPC and Big Data
Presenters:
George Ostrouchov (Oak Ridge National Laboratory)
Drew Schmidt (University of Tennessee)

Abstract: The R language has been called the "lingua franca" of data analysis and statistical computing, and is quickly becoming the de facto standard for analytics. This tutorial will introduce attendees to the basics of the R language with a focus on its recent high performance extensions enabled by the ''Programming with Big Data in R'' (pbdR) project. Although R had a reputation for lacking scalability, our experiments with pbdR have easily scaled to 50 thousand cores. No background in R is assumed but even R veterans will benefit greatly from the session. We will cover only those basics of R that are needed for the HPC portion of the tutorial. The tutorial is very much example-oriented, with many opportunities for the engaged attendee to follow along. Examples on real data will utilize common data analytics techniques, such as principal components analysis and cluster analysis.

 

Monday, 11/17/14

Half-Day
(8:30am-12pm)

Parallel Programming with Charm++
Presenters:
Nikhil Jain (University of Illinois at Urbana-Champaign)
Laxmikant Kale (University of Illinois at Urbana-Champaign)
Michael Robson (University of Illinois at Urbana-Champaign)

Abstract: There are several challenges in programming applications for large supercomputers: exposing concurrency, data movement, load imbalance, heterogeneity, variations in application behavior, system failures, etc. Addressing these challenges requires more emphasis on the following important concepts during application development: overdecomposition, asynchrony, migratability, and adaptivity. At the same time, the runtime systems (RTS) will need to become introspective and provide automated support for several tasks, e.g., load balancing, that currently burden the programmer. This tutorial is aimed at exposing the attendees to the above-mentioned concepts. We will present details on how a concrete implementation of these concepts, in synergy with an introspective RTS, can lead to the development of applications that scale irrespective of the rough landscape. We will focus on Charm++ as the programming paradigm that encapsulates these ideas, and use examples from real-world applications to further the understanding. Charm++ provides an asynchronous, message-driven programming model via migratable objects and an adaptive RTS that guides execution. It automatically overlaps communication, balances loads, tolerates failures, checkpoints for split-execution, interoperates with MPI, and promotes modularity while allowing programming in C++. Several widely used Charm++ applications thrive in computational science domains including biomolecular modeling, cosmology, quantum chemistry, epidemiology, and stochastic optimization.

 

Monday, 11/17/14

Half-Day
(1:30pm-5pm)

Designing and Using High-End Computing Systems with InfiniBand and High-Speed Ethernet
Presenters:
Dhabaleswar K. (DK) Panda (Ohio State University)
Hari Subramoni (Ohio State University)

Abstract: As InfiniBand (IB) and High-Speed Ethernet (HSE) technologies mature, they are being used to design and deploy different kinds of High-End Computing (HEC) systems: HPC clusters with accelerators (GPGPUs and MIC) supporting MPI and PGAS (UPC and OpenSHMEM), Storage and Parallel File Systems, Cloud Computing with Virtualization, Big Data systems with Hadoop (HDFS, MapReduce and HBase), Multi-tier Datacenters with Web 2.0 (memcached) and Grid Computing systems. These systems are bringing new challenges in terms of performance, scalability, and portability. Many scientists, engineers, researchers, managers and system administrators are becoming interested in learning about these challenges, approaches being used to solve these challenges, and the associated impact on performance and scalability. This tutorial will start with an overview of these systems and a common set of challenges being faced while designing these systems. Advanced hardware and software features of IB and HSE and their capabilities to address these challenges will be emphasized. Next, case studies focusing on domain-specific challenges in designing these systems (including the associated software stacks), their solutions and sample performance numbers will be presented. The tutorial will conclude with a set of demos focusing on RDMA programming, network management infrastructure and tools to effectively use these systems.

 

Monday, 11/17/14

Half-Day
(1:30pm-5pm)

Effective HPC Visualization and Data Analysis using VisIt
Presenters:
Jean M. Favre (Swiss National Supercomputing Center)
Cyrus Harrison (Lawrence Livermore National Laboratory)
David Pugmire (Oak Ridge National Laboratory)
Rob Sisneros (National Center for Supercomputing Applications)
Brad Whitlock (Intelligent Light)

Abstract: Visualization and data analysis are an essential component of the scientific discovery process. Scientists and businesses running HPC simulations leverage visualization and analysis tools for data exploration, quantitative analysis, visual debugging, and communication of results. This half-day tutorial will provide attendees with a practical introduction to mesh-based HPC visualization and analysis using VisIt, an open source parallel scientific visualization and data analysis platform. We will provide a foundation in basic HPC visualization practices and couple this with hands-on experience creating visualizations and analyzing data. This tutorial includes: 1) An introduction to visualization techniques for mesh-based simulations. 2) A guided tour of VisIt. 3) Hands-on demonstrations of end-to-end visualizations of both a water flow simulation and blood flow (aneurysm) simulation. This tutorial builds on the past success of VisIt tutorials, updated and anchored with new concrete use cases. Attendees will gain practical knowledge and recipes to help them effectively use VisIt to analyze data from their own simulations. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PROP-652360

 

Monday, 11/17/14

Half-Day
(1:30pm-5pm)

Enhanced Campus Bridging via a Campus Data Service Using Globus and the Science DMZ
Presenters:
Raj Kettimuthu (Argonne National Laboratory)
Steve Tuecke (University of Chicago)
Vas Vasiliadis (University of Chicago)

Abstract: Existing campus data services are limited in their reach and utility due, in part, to unreliable tools and a wide variety of storage systems with sub-optimal user interfaces. An increasingly common solution to campus bridging comprises Globus operating within the Science DMZ, enabling reliable, secure file transfer and sharing, while optimizing use of existing high-speed network connections and campus identity infrastructures. Attendees will be introduced to Globus and have the opportunity for hands-on interaction installing and configuring the basic components of a campus data service. We will also describe how newly developed Globus services for public cloud storage integration and metadata management may be used as the basis for a campus publication system that meets an increasingly common need at many campus libraries. The tutorial will help participants answer these questions: What services can I offer to researchers for managing large datasets more efficiently? How can I integrate these services into existing campus computing infrastructure? What role can the public cloud play (and how does a service like Globus facilitate its integration)? How should such services be delivered to minimize the impact on my infrastructure? What issues should I expect to face (e.g. security) and how should I address them?

 

Monday, 11/17/14

Half-Day
(1:30pm-5pm)

Introductory and Advanced OpenSHMEM Programming
Presenters:
Tony Curtis (University of Houston)
Deepak Eachempati (University of Houston)
Oscar Hernandez (Oak Ridge National Laboratory)
Swaroop Pophale (University of Houston)
Pavel Shamis (Oak Ridge National Laboratory)
Aaron Welch (University of Houston)

Abstract: As high performance computing systems become larger, better mechanisms are required to communicate and coordinate processes within or across nodes. Communication libraries like OpenSHMEM can leverage hardware capabilities such as remote direct memory access (RDMA) to hide latencies through one-sided data transfers. The OpenSHMEM API provides data transfer routines, collective operations such as reductions and concatenations, synchronizations, and memory management. In this tutorial, we will present an introductory course on OpenSHMEM, its current state, and the community’s future plans. We will show how to use OpenSHMEM to add parallelism to programs via an exploration of its core features. We will demonstrate simple techniques to port applications to run at scale while improving the program performance using OpenSHMEM, and discuss how to migrate existing applications that use message passing techniques to equivalent OpenSHMEM programs that run more efficiently. We will then present a new low-level communication library called the Unified Common Communication Substrate (UCCS), and the second part of the tutorial will focus on how to use the OpenSHMEM-UCCS implementation in applications. UCCS is designed to sit underneath any message passing or PGAS user-oriented language or library.
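
A hedged sketch of the one-sided style described above, written against the OpenSHMEM C API (initialization and allocation call names have changed across OpenSHMEM versions; the newer shmem_init/shmem_malloc spellings are used here):

    #include <stdio.h>
    #include <shmem.h>

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Symmetric allocation: every PE gets a remotely accessible long. */
        long *dest = shmem_malloc(sizeof(long));
        long  src  = me;

        /* One-sided put: deposit my PE number into the next PE's buffer. */
        shmem_long_put(dest, &src, 1, (me + 1) % npes);
        shmem_barrier_all();            /* all outstanding puts are complete */

        printf("PE %d received %ld\n", me, *dest);

        shmem_free(dest);
        shmem_finalize();
        return 0;
    }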

 

Conference Dates: November 16-21, 2014
Tutorial Dates: November 16-17, 2014
Email Contact: tutorials@info.supercomputing.org

SC14 Tutorials Co-Chairs:
Pavan Balaji, Argonne National Laboratory
Bernd Mohr, Jülich Supercomputing Centre

SC14 Tutorials Committee: The committee list is available online at http://sc14.supercomputing.org/sc-planning-committee