HPC

This page needs expert review!

Introduction

HPC stands for High Performance Computing, shorthand for the next generation of large-scale computing now being developed. (This is being written in 2017.) We expect mu2e to become more involved in these sites and advanced styles of programming, and to be taking full advantage of these resources by the time we take data.

As the physical limits behind Moore's law and other scaling laws are approached, the economics of computing has evolved. Instead of many machines with large memory and few cores, the trend is toward fewer machines with limited memory and many cores. This forces users to reduce memory use where possible and to share as much of the rest as possible. It also motivates writing programs whose parts can run in parallel on the many cores.

On top of this trend is another strong trend for users to bring their operating system environment with them. Instead of the user simply submitting a job to a machine with a known environment (such as SL6) coordinated with the site, the user submits their job as a virtual machine plus their workflow. The worker node runs the virtual machine, which then starts the workflow. In this way, the site can be much more generic and host almost any type of user. The virtual machine may or may not be visible to the user: the user may simply check a box for which VM to use, or may build a custom VM for the experiment, or even one specific to the workflow. A lightweight way to provide a virtual machine is with a Docker image. This object provides everything but the kernel: OS utilities and libraries, products, and collaboration code.

Methods

There are different levels of memory sharing.

One method being used at several sites is the "container", such as Docker (https://www.docker.com/what-docker) or Singularity (http://singularity.lbl.gov/); see also a FIFE summary (http://fife.fnal.gov/singularity-on-the-osg/). This is a type of pseudo virtual machine that can maximize flexibility and memory sharing between your jobs. The worker node is usually running a Linux kernel, but the exact OS is not important. An image is built up from all the libraries and executables that the workflow will need. Contents such as the OS libraries, lab products, and art libraries might be put in an image by an expert. The experiment or user would then add the particular version of the mu2e code libraries and utilities needed. This image is moved to the worker node, and then one or many copies of the job workflow (in this language, each is a container) can be started from the image. The memory sharing comes from the many jobs running on one set of shared libraries in memory, instead of each job running in its own virtual machine with its own copy of the libraries. The convenience and interoperability come from the fact that the user brings their whole environment to the worker node and therefore is very likely to run successfully and correctly.

There are different levels of parallelization.

  • bit-level - larger buses and FPUs, fixed by the hardware
  • multithreading - the CPU decides on the fly, at the level of a few instructions, what can be computed in parallel
  • vectorization of code - each line of code is examined by a programmer and threads inserted where possible. For example, if the initialization of several objects is independent, then each can be done in a separate parallel thread.
  • function level - some tasks are naturally independent and can be done in parallel, such as tracing a particle or fitting a track. This approach is being pursued in Geant4, where each particle is naturally a thread.
  • module level - art's rules ensure that, for example, analysis and output modules can be run independently and in parallel
  • event level - a central input module distributes whole events out to a set of threads ("multi-scheduling"). This is the focus of the art team in 2017 and will use Intel Threading Building Blocks (TBB); see the sketch after this list.
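
For concreteness, here is a minimal, standalone sketch of event-level parallelism with Intel TBB. It is not art's actual multi-scheduling code: the processEvent() function, the event count, and the per-event work are invented for illustration, and the real framework distributes art events rather than integers.

  // A minimal sketch (not art's actual scheduler): event-level parallelism
  // with Intel TBB.  processEvent() and the event count are invented for
  // illustration; compile with -ltbb.
  #include <tbb/blocked_range.h>
  #include <tbb/parallel_for.h>

  #include <cmath>
  #include <vector>

  // Stand-in for the reconstruction work done on one event (hypothetical).
  double processEvent(int eventId) {
    double sum = 0.0;
    for (int i = 1; i <= 1000; ++i) sum += std::sqrt(double(eventId + i));
    return sum;
  }

  int main() {
    const int nEvents = 10000;
    std::vector<double> results(nEvents);

    // TBB splits the event range into chunks and hands each chunk to a
    // worker thread from its shared pool ("multi-scheduling" in miniature).
    tbb::parallel_for(tbb::blocked_range<int>(0, nEvents),
                      [&](const tbb::blocked_range<int>& r) {
                        for (int ev = r.begin(); ev != r.end(); ++ev)
                          results[ev] = processEvent(ev);
                      });
    return 0;
  }

All of the worker threads run inside one process and share one copy of the program's code and libraries, in line with the memory-sharing discussion above.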

Geant4MT

Geant4 with multi-threading; development is in progress.

Machines

See also Doc 9040. Mu2e might be able to run on Edison and CORI Phase I with little effort, and on MIRA with dedicated effort.

  • NERSC (National Energy Research Scientific Computing Center)
    • Edison - 2.57 petaflops, Cray XC30, Intel "Ivy Bridge" processors, 2.5 GB of memory per core, uses Docker, no special compiler
    • CORI Phase I - Cray XC40, Intel Xeon Haswell processors, 4 GB of memory per core, uses Docker
    • CORI Phase II - second-generation Intel Xeon Phi, Knights Landing (KNL), 4 hardware threads per core, 50 MB/thread; requires function-level parallel programming and special compilation. It may be possible for a standard executable to run on this machine, but with considerable inefficiency.
  • ALCF (Argonne Leadership Computing Facility)
    • MIRA - 10 petaflops, IBM Blue Gene/Q, 200 MB/core, requires special compilation
    • Theta - 9.65 petaflops, 4 GB/core
    • Aurora - future
  • SDSC (San Diego Supercomputer Center)
    • COMET - similar to CORI Phase I
  • XSEDE - a collaboration helping users with supercomputers

Tools

Some tools being investigated by SCD (2017):

  • HDF5 - a suite of technologies and supporting services that makes possible the management of extremely large and complex data collections. (Replacement for ROOT ntuples; interface to industry machine learning; see the sketch after this list.)
  • Apache Spark - a fast and general engine for large-scale data processing. (In-memory computing.)
  • Dask - a flexible parallel computing library for analytic computing in Python
  • Apache Flink - an open-source stream-processing framework for distributed, high-performance, always-available, and accurate data streaming applications
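
As an illustration of the HDF5 item above, here is a minimal sketch that writes a small array to an HDF5 file using the HDF5 C API (usable from C++). The file name, dataset name, and values are invented for this example; a real replacement for ROOT ntuples would define a richer layout.

  // A minimal sketch of writing a small dataset with the HDF5 C API
  // (usable from C++).  The file name, dataset name, and values are
  // invented for this example.  Compile against the HDF5 library,
  // e.g. with the h5cc or h5c++ wrapper.
  #include "hdf5.h"

  int main() {
    // Create a new file, overwriting any existing one.
    hid_t file = H5Fcreate("toy_ntuple.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, H5P_DEFAULT);

    // A 4-row x 3-column block of doubles, e.g. (x, y, z) for four hits.
    hsize_t dims[2] = {4, 3};
    double hits[4][3] = {{0.1, 0.2, 0.3}, {1.1, 1.2, 1.3},
                         {2.1, 2.2, 2.3}, {3.1, 3.2, 3.3}};

    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/hits", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, hits);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
  }

The resulting file can be opened directly from Python or other tools that understand HDF5, which is part of its appeal as an interface to machine-learning workflows.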