Grids

From Mu2eWiki
Jump to navigation Jump to search

Grids in General

The short answer is that computing grids are just the mother of all batch queues.The metaphor behind "Grid Computing" is that computing resources should be as available, as reliable and as easy to use as is the electric power grid. The big picture is that institutions, both large and small, can make their computing resources available to the grid and appropriately authorized users can use any resource that is available and is appropriate for their job. Priority schemes can be established to ensure that those who provide resources can have special access to their own resources while allowing others to have "as available" access. A large collection of software tools is needed to implement the full vision of the grid. These tools manage authentication, authorization, accounting, job submission, job scheduling, resource discovery, work-flow management and so on.

Head Nodes, Worker Nodes, Cores and Slots

A simplistic picture is that a grid installation consists of one head node and many worker nodes; typically there are hundreds of worker nodes per head node. In a typical grid installation today the worker nodes are multi-core machines and it is normal for there to be one batch slot per core. So, in an installation with 100 worker nodes, each of which has a dual quad-core processor, there would be 800 batch slots. This means that up to 800 grid jobs can run in parallel. There might be some grid installations with more than one batch slot per core, perhaps 9 or 10 slots on a dual quad-core machine. This makes sense if the expected job mix has long IO latencies.

When a job is submitted to the grid, it is submitted to a particular head node, which looks after queuing, resource matching, scheduling and so on. When a job has reached the front of the queue and is ready to run, the head node sends the job to a slot on a worker node.


FermiGrid Project

FermiGrid is a umbrella computing project, managed by the Fermilab Scientific Computing Division (SCD), deploying a collection of computing resources that the Division makes available via grid protocols. FermiGrid Project provides pools of Grid resources: the general purpose farm FermiGrid, separate resources for CMS, and access to remote grids at Universities. Note that the FermiGrid Project includes the FermiGrid compute farm, so the names can be confusing. Fermilab users are intended as the primary users of FermiGrid but unused resources can be used by other authorized grid users. Mu2e is among the authorized users of FermiGrid.

The FermiGrid worker nodes run our jobs in a pseudo virtual machine called an image. The user only sees the operating system that the image supplies, and not the machine's actual operating system. Since the host machine can switch between different images, or run several different ones at the same time, this is extremely flexible and eventually there will be several operating systems that can run on demand. As of now (2017), the nodes only run the CentOS 6 image. This operating system can be treated as equivalent to Scientific Linux 6 (SL6), which runs on our interactive nodes. Code compiled on the interactive nodes will run on the grid nodes.


All FermiGrid worker nodes have sufficient memory for our expected needs. FermiGrid has some nodes with a large memory and many cores. In typical running mode all cores would be running one job and use about 2GB memory. It is possible to partition these nodes so they run fewer jobs, but give each job much more memory. Some specialized mu2e jobs take advantage of this capability. This will happen automatically if you request large memory in the submission command.

By default you get 10 GB, but you can request up to 40 GB per job of local disk space on each worker node.


Another feature of FermiGrid is that certain disks are mounted on all worker nodes. For production jobs, we access our code as a pre-built release on the cvmfs distributed disk system. For personal jobs, you can compile your code on a mu2e disk which is visible on the FermiGrid nodes, where your executable can then run your code. We are moving away from this style of grid-mounted disk for personal jobs and moving toward moving a tarball of your code to the grid node, but do not have a plan in place now (fall 2017). Please consult the disk and data transfer documentation for more information about moving code and data to grid jobs.

Condor

In FermiGrid, the underlying job scheduler is CONDOR. This is the code the receives requests to run jobs, queues up the requests, prioritizes them, and then sends each job to a batch slot on an appropriate worker node.

CONDOR allows you, with a single command, to submit many related jobs. For example you can tell condor to run 100 instances of the same code, each running on a different input file. If enough worker nodes are available, then all 100 instances will run in parallel, each on its own worker node. In CONDOR-speak, each instance is referred to as a process (or job) and a collection of processes is called a "cluster" (or submission, or jobid'). A set of clusters with one goal can be called a project.

We do not use condor directly, we interact with a Fermilab product call jobsub, described more below, which translates your commands to the condor processes.

Virtual Organizations and roles

Access to grid resources is granted to members of various "Virtual Organizations" (VO); you can think of these as a strong authentication version of Unix user groups. One person may be a member of several VOs but when that person runs a job they must choose the VO under which they will submit the job. There are individual VOs for each of the large experiments at Fermilab and one general purpose VO; most of the small experiments, including Mu2e, have their own group within this general purpose VO.

VO's can also have roles. Most users will use the default Analysis role, but production uses the Production role.

Authentication

In order to use any Fermilab resource, the lab requires that you pass its strong authentication tests. When you want to log into a lab machine you first authenticate yourself on your own computer using kinit and then you log in to the lab computer using a kerberos aware version of ssh.

This is all you need to do to access the grid, or transfer data. When you issue commands for grid jobs, the system will convert your kerberos ticket to your CILogon identity, check your VO, and then derives a VOMS proxy for that identity, as needed (see Authentication). The CILogon cert and VOMS proxy is a stored in /tmp and reused as long as it is valid.

Usually only the user (or a SCD expert) can remove a user's job, however, jobsub allows each experiment to assign jobsub "superusers" who can delete or hold all user's jobs. For mu2e, the superusers are Rob, Andrei and Ray.

Monte Carlo

For running Monte Carlo grid jobs, please follow the instructions at Workflows.

non-Monte Carlo

Under some circumstances you may need to submit scripts outside of the standard procedures. In this case, you can use jobsub jobsub_litedirectly. Please read and understand disks and data transfer.

To start:

setup jobsub_client

to use basic submit commands.

If you want the environmental variable VAR set to a value in the grid environment, use the "-e" switch.

Files, such as tarballs and configuration files, which need to get to all your jobs can be sent with the "-f" option and they will appear in the grid job in the directory $CONDOR_DIR_INPUT. Several options exist which are efficient for large jobs:

  1. put them on resilient and send them with -f /pnfs/mu2e/resilient/<user>
  2. use -f dropbox://fullpath/file which moves the file to a temp area before starting your job.
  3. make and/or send tarballs with other jobsub switches

To move data files, use data transfer.

In your job there will be an environmental PROCESS which will have a value between 0 and N-1 (where N is the number of jobs in the cluster) which allows you to tell the jobs apart.

jobsub_submit --group=mu2e -N 10 \
--OS=SL6  --memory=2GB --disk=5GB --expected-lifetime=5h \
-e VAR=value -f dropbox://full/path/useful_small_file.txt   \
file://fullpath/myBashScript.sh

you will get back a message, includes a jobid, that identifies this submission or cluster:

...
Use job id 15414243.0@fifebatch1.fnal.gov to retrieve output

to check running jobs (also see Monitoring below, notice the "dot-blank" syntax to refer to a cluster ):

jobsub_q --jobid=15414243.@fifebatch1.fnal.gov

see the result of a job that has finished (results may be up to 10 minutes old, due to caching)

jobsub_history  | grep 15414243

remove a job or a whole cluster:

jobsub_rm --jobid=15414243.123@fifebatch1.fnal.gov
jobsub_rm --jobid=15414243@fifebatch1.fnal.gov

get condor log files in a tarball:

jobsub_fetchlog  --jobid=15414243.0@fifebatch1.fnal.gov

The condor log files do not usually include the printed output of any runs of a mu2e executable - those are usually redirected to separate files.

Please see the jobsub documentation and the "-h" output for more information.

Defined Resources and OSG

jobsub has a switch to control if you want to submit to certain resources. The main resources are FermiGrid farm using the mu2e allocation (DEDICATED), FermiGrid, using any other unused nodes (OPPORTUNISTIC), and sites other than FermiGrid farm (OFFSITE). Here is the expert's summary:

  • DEDICATED: access to quota-based FermiGrid slots. Handled by the first Condor negotiator.
  • OPPORTUNISTIC: These are FermiGrid-only slots available when other experiments aren't using their full quota. Handled by the second negotiator. These jobs have priority over other non-Fermilab VO jobs.
  • OFFSITE: Everything non-FermiGrid. Handled by the second negotiator.

We believe the default is

--resource-provides=usage_model=DEDICATED,OPPORTUNISTIC

The Open Science Grid (OSG) is a collaboration of Universities which have agreed to run similar software so the sites look similar to jobs submitted there. This enables us to submit to all these sites and get any resources that are not being used by their high-priority users. To submit to the OSG, switch the resource request:

--resource-provides=usage_model=OFFSITE

Submitting with this switch will cause the system to select which sites you run on based on what sites can provide your request for disk, memory and and time. Practically, we find that some sites perform so poorly that they cause a drag on the project, so we usually supply a list of sites that we find are working well, for example:

--resource-provides=usage_model=OFFSITE  --site=SU-OG,Wisconsin 

Here are a few thoughts on which sites to use based on experience from 2015-2107. Since this sort of information can get stale, please also consult the results of the daily ping jobs. and the lab's site summary. Note that certain sites, such as FZU can only be used by specific experiments.

  1. In all cases use BNL,Caltech,FNAL,FermiGrid,Nebraska,Omaha,SU-OG,Wisconsin,UCSD,NotreDame
  2. if expected time less than 8h add MIT,Michigan
  3. if time less than 5h, also add MWT2,UChicago
  4. if time less than 3h, also add Hyak_CE

"FNAL" is the name for the CMS Tier1 grid at Fermilab - we can use this farm opportunistically through the OFFSITE switch. But if you submit with "OFFSITE" only, then there is no way to get back to FermiGrid, even opportunistically. To get to the maximum resources, request all paths:

--resource-provides=usage_model=DEDICATED,OPPORTUNISTIC,OFFSITE

To submit this maximal way and include a site list, then FermiGrid should be on the site list (even though is is implied with DEDICATED and OPPORTUNISITC):

--resource-provides=usage_model=DEDICATED,OPPORTUNISTIC,OFFSITE  --site=SU-OG,Wisconsin,FermiGrid,...


Watch the conMon monitor for sites with lots of "churning" indicated by nearly as many "disc" (disconnected jobs) as "strt" (started jobs). You might want to drop these struggling sites since it is just delaying your jobs and creating noise. Also note that you can submit with no list of sites in which case the system will try to match you to any site that can meet your requirements. In the past we needed to understand individual sites, but this might work fine at times. See OfflineOps for more monitoring links.

Useful Tips

Setting limits

If you run a job locally on mu2egpvm, you can estimate the limits that you should use on submission. The log file has a memory report:

MemReport  VmPeak = 7567.69 VmHWM = 6734.57

The second number, VmHWM, is the max resident size in memory, which is relvant for the memory limit. Submit with a memory limit maybe 1.2 times this number.

The job has a CPU report

TimeReport CPU = 8805.605719 Real = 10422.012182

Note the CPU report includes initialization time which can be up to 1.5 minutes if it includes geant. The limit for submission are based on the wall time, not the CPU time and there is a large variation in the speed of grid nodes. As a rule of thumb, most jobs will use wall time between 1.2 and 1.6 times the CPU time you observe on the interactive nodes. If you use a wall time limit of 1.8 times the interactive CPU, you will typically get 98% of jobs done. You can use 2.0 times the CPU time to typically get 99.9% done.

High Priority

If your work has been approved as a high-priority project, you can submit a high-priority job by adding the following argument to the jobsub command:

--subgroup=highana

You can do this with mu2eprodsys by:

--jobsub-arg=" --subgroup=highana "

There are actually four subgroups, highpro, highana, lowpro, and lowana. The "pro" versions are used by production in the mu2epro account. The "ana" is for high-priority individual analysis, as defined by spokes or their appointees. The "low" versions have the reverse effect, to lower the priority, to run only when no much else is running.


The scheduling algorithm (see FIFE document) for each attempt of the system to to start jobs:

  • jobs submitted with a subgroup flag have the highest priority for the slots assigned to that subgroup, and are started first, until those slots are full
  • jobs submitted without a subgroup flag, or with a subgroup flag but analyzed when that subgroup is full, are scheduled for any remaining slots up to the Mu2e quota

within each of the above categories

  • jobs for one user are started sequentially
  • jobs from multiple users are started according the priority derived from recent CPU use

Keep in mind that the system may not match quotas immediately, since it does not eject opportunistic jobs. Practically speaking, this means the system may take up to 8 hours, or sometimes even longer, to asymptote to the quotas.

At present, the use of the high priority subgroup is on an honor system. Please do NOT use it unless you have been authorized to.

The full list of subgroups and the fraction of mu2e slots that they may take (10/2018).

GROUP_QUOTA_DYNAMIC_group_mu2e.highana = 0.06
GROUP_QUOTA_DYNAMIC_group_mu2e.highpro = 0.53
GROUP_QUOTA_DYNAMIC_group_mu2e.lowana = 0.01
GROUP_QUOTA_DYNAMIC_group_mu2e.lowpro = 0.04
GROUP_QUOTA_DYNAMIC_group_mu2e.mars = 0.21
GROUP_QUOTA_DYNAMIC_group_mu2e.monitor = 0.14
GROUP_QUOTA_DYNAMIC_group_mu2e.test = 0.01

See also experiment total quotas

Steering to Intel

Fermigrid consists of AMD (20%) and Intel (80%) CPUs. A large fraction of the time, simulation jobs run on one flavor will not reproduce on the other flavor. If you are debugging or validating, you might want to stick to one flavor. This switch for jobsub forces only Intel (the same as the interactive machines).

--append_condor_requirements='(TARGET.CpuFamily==6)'

and here is the way to also avoid a particular model of CPU. (There are about 5 different models on FermiGrid). The model number comes from "model:" in /proc/cpuinfo.

--append_condor_requirements='(TARGET.CpuFamily==6) && (TARGET.CpuModelNumber!=62)'

Steering to a node

Sometimes in debugging difficult grid issues the grid experts will set up test nodes. Here is how to target jobs to run on a list of nodes in a jobsub command.

--append_condor_requirements='(stringlistmember(UTSNAMENODENAME,\"node_name_1,node_name_2\"))'

with replacing node_name_1, etc., for example:

--append_condor_requirements='(stringlistmember(UTSNAMENODENAME,\"fnpc17159.fnal.gov,fnpc17113.fnal.gov\"))'

In a mu2eprodsys command, it will look like:

--jobsub-arg=" --append_condor_requirements='(stringlistmember(UTSNAMENODENAME,\"fnpc17159.fnal.gov,fnpc17113.fnal.gov\"))' "


Another example:

-l '+MU2E_TEST=true' --append_condor_requirements='(TARGET.MU2E_TEST==true)'

Test Nodes

Most jobs can be tested interactively, and they will run successfully on the grid, even the OSG. If there seems to be a problem which can only be understand by running on a grid node interactively, there a way to do that. You can request an account and log onto fnpctest1.fnal.gov which is setup like a FermiGrid node. (Select an OS - SL6 or 7 - at login).

As of 12/2022, after logging in to fnpctest1, you are not in the latest container, you have to start it explicitly

singularity exec --pid --ipc --contain --bind /cvmfs --bind /etc/hosts --home ./:/srv --pwd /srv /cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:latest /bin/bash

As of 2/2024, singularity has evolved into apptainer, here is an example of how to run the standard sl7 container, which is a way to run sl7 on al9 platform.

apptainer shell -B /cvmfs -B $PWD /cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:latest
apptainer shell -B /cvmfs -B $PWD /cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-el9:latest

When running the sl7 container on an al9 node, and using UPS, it might be necessary to set export UPS_OVERRIDE='-H Linux64bit+3.10-2.17-sl7-9', but try without it. The name of the container can be seen in /image-source-info.txt

To get a bash shell including running your login scripts, you can

apptainer exec <options> <container> /bin/bash

As of 12/13/18:

  • Nodes fnpc82* have three disks instead of the usual one.
  • Nodes fnpc91* have SSD's.
  • Nodes fnpc19101-fnpc19142 (new 2/2020) 4TB SSD data disk

core dumps

Occasionally, it is useful to retrieve a core dump from a grid job. For example, sometimes a crash only occurs when running on a grid node. Core dumps are turned off in grid jobs by default. To turn them on include

ulimit -c unlimited

in your script. The core file will appear as

core.NNNN

and you will have to arrange to copy it back.

Requesting an image

A mechanism to force your job to use a specific operating system image. See also general singularity notes

--singularity-image '/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:latest'

or for al9

--singularity-image '/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-el9:latest'

condor commands

Users usually submit jobs using jobsub commands. Behind that layer of code is a system called condor. In some cases the user can access this layer to do useful things.

change the submit parameters of an idle job cluster. Default units are KB for disk and MB for memory, or you can write 4G for 4 GB.

condor_qedit -name jobsub02.fnal.gov 66903511  memory=2.0
condor_qedit -name jobsub02.fnal.gov -pool gpcollector03.fnal.gov 66907803.0 'DesiredOS' '"SL7"'
  (-pool switch is likely optional)

See the tail of the stdout log file of a running job

condor_tail -name jobsub02.fnal.gov 66930712.0 -maxbytes 100000

Autorelease

It is possible, after some jobs sections have gone to hold, to release them to run with an increased resource request (time, memory, disk), see here.

Monitoring

Please see the grafana pages for the primary web monitor for anything running on FermiGrid. Please Also see operations page for more links.

A summary message for each submission will be sent to the YOUR_USERNAME@fnal.gov email address. You can change your preferences, for example, requesting only one daily summary email, or turning this report off, by following the links in the mail or going to the documentation.

Efficiency

Everyone wants the grid resources to be used efficiently so we can get the most out them. Some definitions or efficiency follow.

  • CPU. This is the CPU time used divided by the wall-clock time the slot was held by the job. If the job is sitting there and not using the CPU, that is a form of wasted resources. Some typical causes of this inefficiency would be a scripting error, or waiting for a database, or input or output files.
  • Memory. This is the memory used divided by the memory requested. If you block out a lot of extra memory for your job, it can't be used by other jobs. If the memory is predictable, the request should be set something like 10% over the largest memory used in testing. Sometimes the memory used is unpredictable in which case the only efficiency strategy is to set the memory so a few percent of jobs fail, and resubmit those jobs with a larger memory request.
  • Success This is the fraction of jobs with return code zero. Under the assumption that failed jobs will have to be re-run, those jobs were wasted resources.
  • Time. This is the wall-clock time used divided by the time requested. Your job's slot will be freed whenever you exit, but inappropriate time requests may cause the scheduling to be non-optimal (and delay your job start!).

The computing division has a policy on efficiency. If a job falls below certain target efficiency levels, the user is notified. If repeated notifications are ignored, the user's priority may be set down. Small jobs are exempt.

Transition to jobsub_lite

The use of VOMS proxies for grid Authentication was discontinued in March 2023 and the replacement authentication technology is known as tokens. As part of this transition, the UPS package jobsub_client was retired and its replacement is called jobsub_lite, which uses tokens. Unlike jobsub_client which is distributed as a UPS product, jobsub_lite is installed directly on each of our interactive nodes using rpms. A tutorial is available.

Starting March 27, 2024, jobsub_lite v1.7 is deployed on all of our interactive machines. This changes the default container from SL7 to AL9. If you submit jobs using mu2eprodsys, it overrides the default and selects an SL7 container. If you wish to override this and use the default AL9 container, add the following argument to your mu2eprodsys command:

--predefined-args=none

If you submit jobs using jobsub_submit and wish to get an SL7 container, you should add the following argument:

--singularity-image /cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:latest