HPC
This page needs expert review!
Introduction
HPC stands for High Performance Computing, which here is shorthand for the next generation of large-scale computing now being developed. (This is being written in 2017.) We expect mu2e to become more involved with these sites and advanced styles of programming, and to be taking full advantage of these resources by the time we take data.
As Moore's law and other scaling laws approach their physical limits, the economics of computing have evolved. Instead of many machines with large memory and few cores, the trend is towards fewer machines with limited memory and many cores. This forces users into reducing memory where possible and sharing as much of the rest as possible. It also motivates making programs with parts that can run in parallel on the cores.
On top of this trend is another strong trend for users to bring their operating system environment with them. Instead of just submitting a job to a machine with a known environment (such as SL6) coordinated with the site, the user submits their jobs as a virtual machine plus their workflow. The worker node runs the virtual machine, which then starts the workflow. In this way, the site can be much more generic and host almost any type of user. The virtual machine may or may not be visible to the user. The user may simply check a box for what VM to use, or may build a custom VM for the experiment, or even one specific to the workflow. A lightweight way to provide a virtual machine is with a Docker image. The image provides everything but the kernel: OS utilities and libraries, products, and collaboration code.
Methods
There are different levels of, loosely speaking, memory sharing and threading...
One method that is being used at several sites is the "container", such as Docker or Singularity. (See a FIFE summary.) This is a type of pseudo virtual machine that can maximize flexibility and memory sharing between your jobs. The worker node is usually running a Linux kernel, but the exact OS is not important. An image is built up from all the libraries and executables that the workflow will need. Contents such as the OS libraries, lab products, and art libraries might be put in an image by an expert. The experiment or user would then add the particular version of mu2e code libraries and utilities needed. This image is moved to the worker node, and then one or many copies of the job workflow (in this language, each is a container) can be started from the image. The memory sharing comes from the many jobs running on one set of shared libraries in memory, instead of each job running in its own virtual machine with its own copy of the libraries. The convenience and interoperability come from the fact that the user brings their whole environment to the worker node, so the job is very likely to run successfully and correctly.
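The memory-sharing argument above can be put into rough numbers. The following is a minimal sketch, not a measurement: the function names are made up for illustration, and the job count and memory sizes in the example are hypothetical. It only assumes the picture described above, where library memory is counted once for all containers started from one image but once per job when each job has its own virtual machine.

```python
def memory_separate_vms(n_jobs, private_gb, libs_gb):
    """Total memory when every job carries its own copy of the libraries."""
    return n_jobs * (private_gb + libs_gb)


def memory_shared_image(n_jobs, private_gb, libs_gb):
    """Total memory when all jobs share one in-memory copy of the libraries."""
    return libs_gb + n_jobs * private_gb


if __name__ == "__main__":
    # Hypothetical numbers: 16 jobs, 1.5 GB private data each, 0.5 GB of libraries.
    n_jobs, private_gb, libs_gb = 16, 1.5, 0.5
    print("separate VMs :", memory_separate_vms(n_jobs, private_gb, libs_gb), "GB")
    print("shared image :", memory_shared_image(n_jobs, private_gb, libs_gb), "GB")
```

The saving grows with the number of jobs per node and with the fraction of each job's memory that is read-only library code.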
There are different levels of parallelization with one run of an exe:
- bit-level - larger buses and FPU's, fixed by the hardware
- multithreading - the CPU decides at the level of a few instructions what can be computed in parallel on the fly
- vectorization of code - each line of code is examined by a programmer and threads are inserted where possible. For example, if an initialization of several objects is independent, then each can be done in a separate parallel thread.
- function level - some tasks are naturally independent and can be done in parallel, such as tracing a particle or fitting a track. This approach is being pursued in Geant, where each particle is naturally a thread.
- module level - art's rules enforce that analysis and output modules can be run independently and in parallel, for example
- event level - a central input module distributes whole events out to a set of threads ("multi-scheduling"). This is the focus of the art team in 2017. It will use Intel Threading Building Blocks (TBB).
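As an illustration of the event-level pattern in the list above, here is a minimal Python sketch in which a central loop hands whole events to a fixed pool of worker threads. It is not art or TBB code: `process_event` and `run_events` are hypothetical names, and the per-event work is a stand-in calculation. Python threads also do not speed up CPU-bound work because of the interpreter lock, so this shows only the scheduling pattern, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def process_event(event_id):
    """Hypothetical per-event work, standing in for a full module chain."""
    # Events are independent, so no state is shared between calls.
    return event_id, sum(i * i for i in range(event_id + 1))


def run_events(event_ids, n_threads=4):
    """Distribute whole events over a fixed pool of worker threads."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() hands events to idle workers and yields results in input order.
        return dict(pool.map(process_event, event_ids))


if __name__ == "__main__":
    print(run_events(range(5)))
```

The key requirement, as in the real framework, is that the per-event function touches no shared mutable state, so events can be processed in any order on any thread.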
Machines
See also Doc 9040. mu2e might be able to run on Edison and CORI I with little effort, and on MIRA with dedicated effort.
- NERSC (National Energy Research Scientific Computing Center)
- Edison - 2.57 Petaflops, Cray XC30, Intel "Ivy Bridge" processor, 2.5 GB of memory per core, uses Docker, no special compiler. Max wall time 36h. To be retired 3/31/19.
- CORI Phase I - Cray XC40, Intel Xeon Haswell processors, 2 GB of memory per thread, uses Docker. Max wall time 48h.
- CORI Phase II - second generation Intel Xeon Phi, Knights Landing (KNL), 4 hardware threads per core, 350 MB/thread, requires function-level parallel programming, prefers special compilation. The standard executable runs on this machine, but with considerable inefficiency. Max wall time 24h. KNL training slides
Relative run time of a simulation job (FermiGrid = 1.0; larger is slower). KNL is slower because the architecture is much more specialized and we do not take advantage of it. It may still be the most valuable because it has the most cores.
FermiGrid | 1.0 |
OSG | 1.2 |
Edison | 0.56 |
Cori I | 0.66 |
Cori II KNL | 2.6 |
- ALCF (Argonne Leadership Computing Facility)
ALCF is one of two DOE Leadership Computing Facilities in the US dedicated to open science. It provides supercomputing resources and expertise to the scientific and engineering community in a broad range of disciplines. ALCF is supported by the DOE Office of Science and the Advanced Scientific Computing Research (ASCR) program. Computing time on ALCF resources is open to anyone, but one must apply for an allocation: a Director's Discretionary Allocation (DD) [1], an ASCR Leadership Computing Challenge (ALCC) allocation, or an INCITE allocation. ALCF has several computing systems. Mira and Theta are the workhorses of the facility, each capable of performing at ~10 Petaflops. Cetus (used for debugging) and Vesta (used for testing and development) are smaller versions of Mira, and Cooley is the visualization cluster equipped with GPUs. Aurora is ALCF's next generation supercomputer. It will have a peak performance of greater than one Exaflop and is slated to be completed in 2021.
- Mira: 10 petaflops IBM Blue Gene/Q, 200 MB/core (requires special compilation)
- Theta: 9.65 petaflops, 4GB/core
- Aurora: future exascale machine
- LCRC (Laboratory Computing Resource Center) at Argonne
The LCRC provides mid-range supercomputing resources to Argonne employees and their collaborators. As of this writing (April 2019), Lisa Goodenough, Yuri Oksuzian, and Suvarna Ramachandran have access to LCRC computing resources, specifically Bebop. Bebop has a total of 1024 computing nodes, 664 Broadwell nodes (each with 32 cores) and 352 Knights Landing nodes (each with 64 cores). For more information on Bebop, see [2].
Mu2e has successfully run 'Stage 1' simulation jobs using the CRY event generator and multithreaded Geant4 on both Broadwell and KNL nodes on Bebop. These jobs were run using single-threaded art, and about half of the job time was spent running in single-threaded mode, i.e. about half of job time was devoted to emptying the EventStash. For these CRY simulation jobs, the Broadwell nodes typically process events ~4 times faster than the KNL nodes, most likely because these jobs are not using the threads efficiently. There is a scale factor of 0.585 for charging time on the KNL nodes, so a job that takes 1 core-hour to run is charged at 0.585 core-hours. Thus, the amount of time charged for the same amount of work on the KNL nodes is only ~2 times more than on the Broadwell nodes.
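The charging arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate in the text: the function names are illustrative, the factor of 4 is the approximate slowdown quoted above, and it assumes the slowdown translates directly into 4 times as many core-hours for the same work.

```python
KNL_CHARGE_FACTOR = 0.585  # charging scale factor for KNL nodes, from the text
KNL_SLOWDOWN = 4.0         # KNL is ~4 times slower for these jobs, from the text


def charged_core_hours(core_hours, charge_factor=1.0):
    """Core-hours billed for a job that used `core_hours` of real time."""
    return core_hours * charge_factor


def knl_to_broadwell_charge_ratio(slowdown=KNL_SLOWDOWN,
                                  charge_factor=KNL_CHARGE_FACTOR):
    """Ratio of KNL to Broadwell charges for the same amount of work."""
    broadwell = charged_core_hours(1.0)                  # reference: 1 core-hour
    knl = charged_core_hours(slowdown, charge_factor)    # same work on KNL
    return knl / broadwell


if __name__ == "__main__":
    print(f"KNL / Broadwell charge ratio: {knl_to_broadwell_charge_ratio():.2f}")
```

With these inputs the ratio is 4 x 0.585 = 2.34, which is the "~2 times more" quoted above.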
- SDSC (San Diego Supercomputer Center)
- COMET, similar to CORI I.
- XSEDE - a collaboration helping users with supercomputers
Running Jobs
NERSC
We (Ray Culbertson) ran 500K jobs at NERSC in 12/2016. Starting out, we used Edison, which has familiar Xeon nodes, then switched to Cori II, which is KNL, for the bulk of the processing. The former was fully efficient; the latter required more memory per process than the memory available per thread, so about half the threads were idle on these nodes. As I recall, we ran everything in the "singles" queues.
A Docker image was created. Several login, setup, and production scripts had to be modified due to running in a custom container. When the Docker image is run at NERSC, it is actually converted to a Shifter image. This conversion adds security, access to the aggregated disk system on the cluster (their dcache), and access to the internet. We were able to write output with ifdh. The Singularity conversion was done by the computing division and later by us. Installing a new image was a manual operation, and only one image could be run at a time. Submission is by a special fifebatch url, but otherwise acts like a regular batch job.
NERSC has several queues.
- whole nodes This will run your job on a node with 64 cores and 128 GB memory so you can plan your memory sharing.
- singles This will run on one of the same nodes, but with only about 2GB reserved for your job.
- debug Runs quickly, but only for 30min, and much of that time may be wasted by the glide-in.
The queue at NERSC is FIFO; there is no scheduling based on priority or "fair share". Wall clock time limits are 48h, but shorter jobs are likely to start sooner, because they may fit into one of a few limited time slots and so jump ahead in the queue.
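When planning for a whole node, a quick estimate of how many processes fit in memory is useful. The sketch below uses the whole-node figures quoted above (64 cores, 128 GB); the function name and the 4 GB per-process figure in the example are hypothetical, chosen only to show how a memory-hungry process leaves cores idle.

```python
def usable_slots(node_mem_gb, n_cores, mem_per_process_gb):
    """Number of processes that fit on a node, limited by cores and memory."""
    if mem_per_process_gb <= 0:
        raise ValueError("mem_per_process_gb must be positive")
    by_memory = int(node_mem_gb // mem_per_process_gb)
    return min(n_cores, by_memory)


if __name__ == "__main__":
    node_mem_gb, n_cores = 128, 64  # whole-node figures from the text
    for mem in (2.0, 4.0):          # hypothetical per-process memory in GB
        slots = usable_slots(node_mem_gb, n_cores, mem)
        idle = n_cores - slots
        print(f"{mem} GB/process: {slots} processes, {idle} idle cores")
```

A process that needs twice the memory available per core leaves half the cores idle, unless memory can be shared between processes or threads.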
Argonne
Activities and References
Geant4 with threading is a project called Geant4-MT. This version allows many Geant4 events to be generated in parallel with one call to the Geant event generator. Each set of Geant4 events would appear in one art event, so this will be a different type of event, which will have to be accommodated. Some recent talks: talk1 talk2.
The art team is developing the event-level and module-level threading capability for art executables. As of fall 2017, the infrastructure is in place and the process of rolling it out and matching user needs is underway. There is an art multi-threading forum. There is an associated mailing list "art-mt-forum@listserv.fnal.gov" (see listserv instructions).
You can see mu2e talks and docs on the topic of threading at this doc-db search link.
Some tools being investigated by SCD (2017)
- HDF5 - a unique suite of technologies and supporting services that make possible the management of extremely large and complex data collections. (Replacement for root ntuples, interface to industry machine learning.)
- Apache Spark - a fast and general engine for large-scale data processing. (In-memory computing.)
- Dask - flexible parallel computing library for analytic computing in Python
- Apache Flink - an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications