ALCF: Theta
Running Jobs on Theta at ALCF
Login
Once you have an allocation on Theta, or if you are using an existing allocation, you can reference the Onboarding Guide for answers to most of your questions about how to get started.
To login to Theta from a terminal:
ssh <username>@theta.alcf.anl.gov
At the password prompt, enter your 4-digit PIN (given to you by ALCF when you receive an account) followed immediately by the one-time 8-digit cryptocard password, with no space between the two. You will then be in your home directory, /home/<username>, on a login node running the bash shell. Login nodes run a full SUSE Enterprise Linux-based CLE OS. You can change your login shell on your account web page.
Filesystems
There are two filesystems on Theta: the GPFS filesystem, which houses the /home/<username> directories (in /gpfs/mira-home), and the Lustre filesystem, which houses the /project/<projectname> directories (in /lus/theta-fs0/projects). The /home directories are backed up and are 50 GiB by default. The /project directories are NOT backed up and are 1 TiB by default. A /project directory is viewable by all members of the project, so common code and files should be placed there.
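As a quick orientation, the two locations described above can be sketched as follows. This is illustrative only: 'alice' is a hypothetical username, and mu2e_CRY is the project name used elsewhere on this page.

```shell
# Sketch only: the two filesystem locations on Theta.
# 'alice' is a hypothetical username; mu2e_CRY is the project used on this page.
username=alice
projectname=mu2e_CRY

echo "/gpfs/mira-home/$username"             # GPFS home: backed up, 50 GiB default
echo "/lus/theta-fs0/projects/$projectname"  # Lustre project: NOT backed up, 1 TiB default
```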
Environment
Your environment is controlled via 'modules'. There is a default set of modules set up for all users. Run
module list
to see what is loaded at any given time. For the work done on Theta to date (as of May 2019), users have not needed to modify their environment. As of May 2019, the output of the 'module list' command for a default environment is
Currently Loaded Modulefiles:
  1) modules/3.2.11.1
  2) intel/18.0.0.128
  3) craype-network-aries
  4) craype/2.5.15
  5) cray-libsci/18.07.1
  6) udreg/2.3.2-6.0.7.1_5.13__g5196236.ari
  7) ugni/6.0.14.0-6.0.7.1_3.13__gea11d3d.ari
  8) pmi/5.0.14
  9) dmapp/7.1.1-6.0.7.1_5.45__g5a674e0.ari
 10) gni-headers/5.0.12.0-6.0.7.1_3.11__g3b1768f.ari
 11) xpmem/2.2.15-6.0.7.1_5.11__g7549d06.ari
 12) job/2.2.3-6.0.7.1_5.43__g6c4e934.ari
 13) dvs/2.7_2.2.118-6.0.7.1_10.1__g58b37a2
 14) alps/6.6.43-6.0.7.1_5.45__ga796da32.ari
 15) rca/2.2.18-6.0.7.1_5.47__g2aa4f39.ari
 16) atp/2.1.3
 17) perftools-base/7.0.4
 18) PrgEnv-intel/6.0.4
 19) craype-mic-knl
 20) cray-mpich/7.7.3
 21) nompirun/nompirun
 22) darshan/3.1.5
 23) trackdeps
 24) xalt
Containers
The easiest way to run the Mu2e Offline code on Theta is to run it in a container. Docker is a common container platform, but because of security issues, ALCF does not allow users to run Docker containers on their systems. Singularity is another container platform that does not have the same security issues as Docker, and can be run on Theta. Singularity is capable of building containers from Docker images, so the Mu2e Offline code can be containerized as a Docker or Singularity container for use on Theta.
We built a Docker container of the Offline code and put it on Docker Hub. To pull a container to Theta and turn it into a Singularity container run the command
singularity pull docker://username/image_name:image_version
You will then have a container named 'image_name-image_version.simg' in the current directory.
For example, for the March 2019 jobs on Theta we used 'singularity pull docker://goodenou/mu2emt:v7_2_0-7.7.6' to create a container called mu2emt-v7_2_0-7.7.6.simg. We placed the container in /projects/mu2e_CRY so that all project members could access it. For more information on using Singularity containers on Theta, see the ALCF tutorial.
Running Jobs
The ALCF has a detailed webpage on running jobs on Theta. Theta uses the batch scheduler Cobalt. Jobs are run using the 'aprun' command.
Interactive Jobs
As a first test of any code, it is good practice to run an interactive job. To get
- one node (-n 1)
- for 15 minutes (-t 15)
- for interactive use (-I)
- charged to projectname (-A <projectname>)
run the following:
qsub -A <projectname> -t 15 -q debug-cache-quad -n 1 -I
This will put you on a service node from which you can launch your interactive job.
The 'debug' queue is a good place to test your code, since there is no minimum requirement on the number of nodes that you can request. The maximum number of nodes you can request in the debug queue is 16, in either cache-quad or flat-quad mode. The maximum job time in the debug queue is 1 hour, and a user may only have one job running at a time. For more information on the available queues, see the Job Scheduling Policy page.
Cache and Flat are the two memory modes: in Cache mode the high-bandwidth MCDRAM acts as a cache, while in Flat mode it is addressable as regular memory. Quad is the clustering mode. 'cache-quad' is the default configuration if none is specified.
Once you have requested an interactive session, you may have to wait. Usually the wait is no more than a few minutes. The output from the batch system looks something like this during the process:
goodenou@thetalogin6:~/Mu2eMT> qsub -A Comp_Perf_Workshop -t 15 -q debug-cache-quad -n 1 -I
Connecting to thetamom2 for interactive qsub...
Job routed to queue "debug-cache-quad".
Memory mode set to cache quad for queue debug-cache-quad
Wait for job 336989 to start...
Opening interactive session to 3830
goodenou@thetamom2:/gpfs/mira-home/goodenou>
The service nodes have the names 'thetamom#', so there is no mistaking when you are on one. Note that you are placed back in your home directory on the service node, regardless of where you made the job request from.
Interacting with the Singularity Container via a Shell
A user can interact with a Singularity container in many ways. To run a shell from within your container, type
aprun -n 1 singularity shell <containername>
The '-n 1' argument has a different meaning than the '-n' argument of qsub. Here it indicates that we want to run one instance of our job; it has nothing to do with the number of nodes we run on. See the aprun man page for more information on the various arguments to aprun.
On Theta, the container resides in the project directory so that everyone can access it. The command and output look like this:
goodenou@thetamom2:/gpfs/mira-home/goodenou/Mu2eMT> aprun -n 1 singularity shell /projects/mu2e_CRY/mu2emt-v7_2_0-7.7.6.simg
Singularity: Invoking an interactive shell within container...
An 'ls' will show you the contents of your $HOME directory, which is always mounted in the container. Other directories are also mounted in the container by default. An 'ls /' will show the contents of the top-level directory in the container. The mu2emt-v7_2_0-7.7.6.simg container was built with the following directory structure:
ls /
anaconda-post.log  DataFiles    etc           home   media     Offline  products  sbin              srv  usr
artexternals       dev          etc_bashrc    lib    mnt       opt      root      setupmu2e-art.sh  sys  var
bin                environment  graphicslibs  lib64  mu2egrid  proc     run       singularity       tmp
To run the mu2e executable over g4test_03.fcl in this container, we need to execute the following commands. The output is not shown here.
source /setupmu2e-art.sh
source /Offline/setup.sh
mu2e -c /Offline/Mu2eG4/fcl/g4test_03.fcl -n 100
Interacting with the Singularity Container via a Script
The next level of complexity is to run a shell inside the container from a script. The Singularity sub-command 'exec' allows you to spawn an arbitrary command within your container image as if it were running directly on the host system. The command
goodenou@thetamom2:/gpfs/mira-home/goodenou> singularity exec -B /projects/mu2e_CRY:/mnt /projects/mu2e_CRY/mu2emt-v7_2_0-7.7.6.simg bash -c "~/Mu2eMT/run_script.sh"
runs the script ~/Mu2eMT/run_script.sh in a shell inside the container.
The following runscript is a good example of a basic script that can be used to run a single job on Theta.
#!/bin/bash
JOB_DIR=~/Mu2eMT

### Setup the environment in the container
echo "setting up products and mu2e"
source /setupmu2e-art.sh
echo "setting up Mu2e Offline in directory /Offline"
source /Offline/setup.sh
echo "JOBDIR = $JOB_DIR"

### ALPS_APP_PE is an environment variable set up by the Cray Linux Environment utility.
### It is different for every instance of the mu2e executable
### that you start within a given job. The first instance is always '0', and the
### variable is incremented by 1 for each subsequent instance of the executable.
### When you run multiple processes within a job, each will be given a different
### value of ALPS_APP_PE, so the different processes can be identified.
echo "ALPS_APP_PE = $ALPS_APP_PE"
PROCESS_NUMBER=$(( ${ALPS_APP_PE} + 1 ))
echo "PROCESS_NUMBER=$PROCESS_NUMBER"

echo "cd $JOB_DIR" > $JOB_DIR/runfile_${COBALT_JOBID}_${PROCESS_NUMBER}.sh

### Make the output directory in /mnt inside the container.
### We mounted this to the /project directory through
### '-B /projects/mu2e_CRY:/mnt' in the call to singularity.
mkdir -p /mnt/000/00

echo "printenv > OUTFILE_${COBALT_JOBID}_${PROCESS_NUMBER}" >> $JOB_DIR/runfile_${COBALT_JOBID}_${PROCESS_NUMBER}.sh
echo "mu2e -c FCL_FILES/cnf.goodenou.my-test-s1.v0.000001_00000000_64_1000_${PROCESS_NUMBER}.fcl -n 10000 > OUTFILE_${COBALT_JOBID}_${PROCESS_NUMBER}" \
    >> $JOB_DIR/runfile_${COBALT_JOBID}_${PROCESS_NUMBER}.sh

source $JOB_DIR/runfile_${COBALT_JOBID}_${PROCESS_NUMBER}.sh
This script does four things:
- sets up the mu2e environment within the container
- defines the PROCESS_NUMBER environment variable, which we need to label the output
- creates the output directory 000/00 in the project directory /projects/mu2e_CRY. (The project directory was bind-mounted to the container in the '-B' argument in the call to singularity shown above. You need to define the location of your output path in your FHiCL file as well.)
- creates and sources another script called runfile_${COBALT_JOBID}_${PROCESS_NUMBER}.sh containing the following commands
- printenv > OUTFILE_${COBALT_JOBID}_${PROCESS_NUMBER}
- mu2e -c FCL_FILES/cnf.goodenou.my-test-s1.v0.000001_00000000_64_1000_${PROCESS_NUMBER}.fcl -n 10000 > OUTFILE_${COBALT_JOBID}_${PROCESS_NUMBER}
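The ALPS_APP_PE bookkeeping used in the script can be sketched in isolation. On Theta the Cray environment sets ALPS_APP_PE for each executable instance launched by aprun; here we set it by hand to show that the first instance (ALPS_APP_PE=0) is labeled process 1.

```shell
# Sketch of the labeling scheme above, run outside of aprun.
# On Theta, ALPS_APP_PE is set by the Cray Linux Environment; here we fake it.
ALPS_APP_PE=0
PROCESS_NUMBER=$(( ALPS_APP_PE + 1 ))
echo "PROCESS_NUMBER=$PROCESS_NUMBER"   # the first mu2e instance is labeled 1
```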
Batch Jobs
For production work comprising tens or hundreds of jobs, interactive sessions are inefficient. You must run batch jobs.
A special mu2egrid script, mu2ehpc, was developed by Andrei Gaponenko to run mu2e jobs on Theta and Bebop. The script is part of the mu2egrid product and is located in the mu2egrid/hpc directory. It is similar in spirit to mu2eprodsys, a script for running jobs on the grid. You must have the mu2egrid product on the HPC machine you are running on, either in your home directory or the project directory. It is available as a git repository which you can pull using
git clone http://cdcvs.fnal.gov/projects/mu2egrid
To see the documentation for this script, use the typical --help or -h argument:
goodenou@thetamom1:/gpfs/mira-home/goodenou/Mu2eMT> mu2egrid/hpc/mu2ehpc --help
Usage:
        mu2ehpc --cluster=<name> \
               --fcllist=<fcl-file-list> \
               --dsconf=<output-dataset-version-string> \
               [--dsowner=<name>] \
               --topdir=<directory> \
               --container=<file> \
               --batchopts=<file> \
               --nthreads-per-process=<integer> \
               --nprocs-per-node=<integer> \
               --time-limit=<string> \
               [-h|--help] \
               [--dry-run] \
               [--debug] \
               [--mu2e-setup=<inside-the-container-file>] \
               [--setup=<inside-the-container-file>]

where square brackets denote optional parameters.

    --cluster is one of bebop, theta

    - The set of fcl files specified via --fcllist will be processed by Mu2e Offline,
      one fcl file per job. The --fcllist argument should be a plain text file with
      one fcl file name per line. The submission process will take a snapshot of the
      content of fcl files, and later modifications to the files (or fcllist itself)
      will not affect any previously submitted jobs.

    - The configuration field in the names of output files should be specified
      via --dsconf.

    - The --dsowner option specifies the username field in the names of output files.
      The default is 'mu2e'.

    --topdir specifies where to create a new directory structure to place inputs
      and receive outputs of the jobs.

    --container is the Mu2e container to be used for the job

    --batchopts should point to a text file that will be merged in at the beginning
      of the submitted batch script. Each line must start with '#SBATCH' for Bebop
      or '#COBALT' for Theta, and define an option for the batch submission.

    --nthreads-per-process is the number of threads used by an individual Mu2e job
      to process a single fcl file in the submission.

    --nprocs-per-node must be provided. The number of nodes requested is deduced
      from --nprocs-per-node and the number of jobs in --fcllist.

    --time-limit is the maximal job duration.

    -h or --help: print out the documentation.

    --dry-run will run pre-submission job checks and print out the script,
      but not submit it.

    --debug requests more verbose logging by the worker node script

    --mu2e-setup defaults to /setupmu2e-art.sh, to be sourced by the
      inside-the-container working processes.

    --setup defaults to /Offline/setup.sh, another file to be sourced by the
      inside-the-container working processes.
The help is descriptive, but an example will prove useful. Consider the following command:
./mu2egrid/hpc/mu2ehpc -cluster theta -f /projects/mu2e_CRY/cry2/prod_v734/fcllist_theta.10 -dsc cry0319 -top /projects/mu2e_CRY/lisa/RUN_1 -cont /projects/mu2e_CRY/mu2emt-v7_2_0-7.7.6.simg -b cobalt_prod.conf -nth 8 --nprocs-per-node 32 --time 360
This would run one submission
- on Theta
- using fcl files from the list /projects/mu2e_CRY/cry2/prod_v734/fcllist_theta.10
- with cry0319 being the configuration field in the names of output files
- with the top directory of the new input and output files being /projects/mu2e_CRY/lisa/RUN_1
- using the container /projects/mu2e_CRY/mu2emt-v7_2_0-7.7.6.simg
- with the batch job configuration file cobalt_prod.conf located in the current directory
- running 8 threads per mu2e process (i.e. G4MT would run using 8 threads)
- running 32 mu2e processes per node
- for up to 360 minutes in total
As mentioned above, the spirit of mu2ehpc is similar to mu2eprodsys, so many of the arguments are the same. See the mu2eprodsys wiki for additional descriptions of these. The HPC-related arguments deserve more explanation here.
- --fcllist- The number of fcl files listed in this file divided by the number of processes per node (--nprocs-per-node) determines how many nodes are used in the job. Thus, if you want to use 256 nodes on Theta and plan to run 32 processes on each node, you must supply a fcllist file containing 8192 fcl file names.
- --batchopts- The file named here should contain batch directives, one per line, each starting with '#COBALT' on Theta; they are merged into the beginning of the submitted batch script.
- --nthreads-per-process- The number of threads-per-process * the number of processes-per-node for a job should not exceed the total number of threads on the node. On Theta, the KNL nodes each have 256 threads (64 cores each with 4 hardware threads). The configuration of threads-per-process and processes-per-node that maximizes job efficiency depends on the type of job that is run (e.g. a POT job will have a different configuration than a CRY cosmic ray job), and should be determined before running production jobs by running performance tests.
- --nprocs-per-node- The number of processes one can run per node may be limited by the memory available on the node. On Theta, the KNL nodes each have up to 208GB of memory, composed of 192GB of DDR4 memory and 16GB of MCDRAM high-bandwidth memory, which can be used as a cache or as normal memory. Additionally, there is the constraint listed above, i.e. that the number of threads-per-process * the number of processes-per-node for a job should not exceed the total number of threads on the node.
- --time-limit- On Theta, the maximum allowed wall-clock times for a job vary depending on the run queue and the number of nodes the job is using. See the ALCF's webpage on queues for up-to-date information.
- --mu2e-setup- With the current design of the container, the script setupmu2e-art.sh that sets up the Mu2e environment is inside the container in the top level directory. The default behavior is to source this script. You may change this behavior by specifying a different filename to this argument.
- --setup- With the current design of the container, the standard Mu2e Offline setup.sh script that sets up the Offline code is inside the container in the Offline directory. The default behavior is to source this script. You may change this behavior by specifying a different filename to this argument.
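The node-count and thread-budget arithmetic described in the bullets above can be checked with a quick sketch, using the numbers from the text: 8192 fcl files, 32 processes per node, 8 threads per process, and 256 hardware threads per KNL node (64 cores x 4 hardware threads).

```shell
# Sketch: how --fcllist and --nprocs-per-node determine the node count,
# and the thread-budget constraint on a Theta KNL node.
NFCL=8192              # fcl file names in the --fcllist file
NPROCS_PER_NODE=32     # --nprocs-per-node
NTHREADS_PER_PROC=8    # --nthreads-per-process

NODES=$(( NFCL / NPROCS_PER_NODE ))
echo "nodes requested: $NODES"                 # 8192 / 32 = 256

THREADS_USED=$(( NPROCS_PER_NODE * NTHREADS_PER_PROC ))
echo "threads per node: $THREADS_USED of 256"  # 32 * 8 = 256, exactly the budget
```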
File Transfer
There are a few ways to transfer files into and out of ALCF resources. See their website on [ https://www.alcf.anl.gov/user-guides/data-transfer Data Transfer] for more information. For the March 2019 CRY simulations on Theta, we used scp and Globus. scp is not recommended for transferring large files. To use scp to/from FNAL machines, establish a Kerberos ticket for the FNAL domain and use the scp option (-o) GSSAPIAuthentication=yes.
For example, to copy the file '336599.cobaltlog' from Theta to a user directory on /mu2e/app/, do the following:
goodenou@thetalogin6:~> kinit goodenou@FNAL.GOV
goodenou@thetalogin6:~> scp -o GSSAPIAuthentication=yes 336599.cobaltlog goodenou@mu2ebuild01.fnal.gov:/mu2e/app/users/goodenou/.