Grids
Grids in General
The short answer is that computing grids are just the mother of all batch queues.The metaphor behind "Grid Computing" is that computing resources should be as available, as reliable and as easy to use as is the electric power grid. The big picture is that institutions, both large and small, can make their computing resources available to the grid and appropriately authorized users can use any resource that is available and is appropriate for their job. Priority schemes can be established to ensure that those who provide resources can have special access to their own resources while allowing others to have "as available" access. A large collection of software tools is needed to implement the full vision of the grid. These tools manage authentication, authorization, accounting, job submission, job scheduling, resource discovery, work-flow management and so on.
Head Nodes, Worker Nodes, Cores and Slots
A simplistic picture is that a grid installation consists of one head node and many worker nodes; typically there are hundreds of worker nodes per head node. In a typical grid installation today the worker nodes are multi-core machines and it is normal for there to be one batch slot per core. So, in an installation with 100 worker nodes, each of which has a dual quad-core processor, there would be 800 batch slots. This means that up to 800 grid jobs can run in parallel. There might be some grid installations with more than one batch slot per core, perhaps 9 or 10 slots on a dual quad-core machine. This makes sense if the expected job mix has long IO latencies.
When a job is submitted to the grid, it is submitted to a particular head node, which looks after queuing, resource matching, scheduling and so on. When a job has reached the front of the queue and is ready to run, the head node sends the job to a slot on a worker node.
FermiGrid Project
FermiGrid is a umbrella computing project, managed by the Fermilab Scientific Computing Division (SCD), deploying a collection of computing resources that the Division makes available via grid protocols. FermiGrid Project provides pools of Grid resources: the general purpose farm FermiGrid, separate resources for CMS, and access to remote grids at Universities. Note that the FermiGrid Project includes the FermiGrid compute farm, so the names can be confusing. Fermilab users are intended as the primary users of FermiGrid but unused resources can be used by other authorized grid users. Mu2e is among the authorized users of FermiGrid.
The FermiGrid worker nodes run our jobs in a pseudo virtual machine called an image. The user only sees the operating system that the image supplies, and not the machine's actual operating system. Since the host machine can switch between different images, or run several different ones at the same time, this is extremely flexible and eventually there will be several operating systems that can run on demand. As of now (2017), the nodes only run the CentOS 6 image. This operating system can be treated as equivalent to Scientific Linux 6 (SL6), which runs on our interactive nodes. Code compiled on the interactive nodes will run on the grid nodes.
All FermiGrid worker nodes have sufficient memory for our expected needs. FermiGrid has some nodes with a large memory and many cores. In typical running mode all cores would be running one job and use about 2GB memory. It is possible to partition these nodes so they run fewer jobs, but give each job much more memory. Some specialized mu2e jobs take advantage of this capability. This will happen automatically if you request large memory in the submission command.
We can use up to 40 GB per job of local disk space on each worker node.
Another feature of FermiGrid is that certain disks are mounted on all worker nodes. For production jobs, we access our code as a pre-built release on the cvmfs distributed disk system. For personal jobs, you can compile your code on a mu2e disk which is visible on the FermiGrid nodes, where your executable can then run your code. We are moving away from this style of grid-mounted disk for personal jobs and moving toward moving a tarball of your code to the grid node, but do not have a plan in place now (fall 2017).
Please consult the disk and data transfer documentation for more information about moving code and data to grid jobs.
Condor
In FermiGrid, the underlying job scheduler is CONDOR. This is the code the receives requests to run jobs, queues up the requests, prioritizes them, and then sends each job to a batch slot on an appropriate worker node.
CONDOR allows you, with a single command, to submit many related jobs. For example you can tell condor to run 100 instances of the same code, each running on a different input file. If enough worker nodes are available, then all 100 instances will run in parallel, each on its own worker node. In CONDOR-speak, each instance is referred to as a process (or job) and a collection of processes is called a "cluster" (or submission, or jobid'). A set of clusters with one goal can be called a project.
We do not use condor directly, we interact with a Fermilab product call jobsub, described more below, which translates your commands to the condor processes.
Virtual Organizations and roles
Access to grid resources is granted to members of various "Virtual Organizations" (VO); you can think of these as a strong authentication version of Unix user groups. One person may be a member of several VOs but when that person runs a job they must choose the VO under which they will submit the job. There are individual VOs for each of the large experiments at Fermilab and one general purpose VO; most of the small experiments, including Mu2e, have their own group within this general purpose VO.
VO's can also have roles. Most users will use the default Analysis role, but production uses the Production role.
Authentication
In order to use any Fermilab resource, the lab requires that you pass its strong authentication tests. When you want to log into a lab machine you first authenticate yourself on your own computer using kinit and then you log in to the lab computer using a kerberos aware version of ssh.
This is all you need to do to access the grid, or transfer data. When you issue commands for grid jobs, the system will convert your kerberos ticket to your CILogin identity, check your VO, and then derives a VOMS proxy for that identity, as needed (see Authentication). The CILogin cert and VOMS proxy is a stored in /tmp and reused as long as it is valid.
Usually only the user (or a SCD expert) can remove a user's job, however, jobsub allows each experiment to assign jobsub "superusers" who can delete or hold all user's jobs. For mu2e, the superusers are Rob, Andrei and Ray.
Monte Carlo
For running Monte Carlo grid jobs, please follow the instructions at Workflows.
non-Monte Carlo
Under some circumstances you may need to submit scripts outside of the standard procedures. In this case, you can use jobsub directly. Please read and understand disks and data transfer.
To start:
setup jobsub_client
basic submit commands. If you want the environmental variable VAR set to a value in the grid environment, use the "-e" switch. Various small files can be sent with the "-f" option and they will appear in the grid job in the directory $CONDOR_DIR_INPUT. To move data files use data transfer. IN your job there will be an environmental PROCESS which will have a value between 0 and N-1 (where N is the number of jobs in the cluster) which allows you to tell the jobs apart.
jobsub_submit --group=mu2e -N 10 \ --OS=SL6 --memory=2GB --disk=5GB --expected-lifetime=5h \ -e VAR=value -f /full/path/useful_small_file.txt \ file://$PWD/myBashScript.sh
you will get back a message, includes a jobid, that identifies this submission or cluster:
... Use job id 15414243.0@fifebatch1.fnal.gov to retrieve output
to check running jobs (also see Monitoring below, notice the "dot-blank" syntax to refer to a cluster ):
jobsub_q --jobid=15414243.@fifebatch1.fnal.gov
see the result of a job that has finished (results may be up to 10 minutes old, due to caching)
jobsub_history | grep 15414243
remove a job or a whole cluster:
jobsub_rm --jobid=15414243.123@fifebatch1.fnal.gov jobsub_rm --jobid=15414243@fifebatch1.fnal.gov
get condor log files in a tarball:
jobsub_fetchlog --jobid=15414243.0@fifebatch1.fnal.gov
The condor log files do not usually include the printed output of any runs of a mu2e executable - those are usually redirected to separate files.
Please see the jobsub documentation and the "-h" output for more information.
Defined Resources and OSG
jobsub has a switch to control if you want to submit to certain resources. The main resources are FermiGrid farm using the mu2e allocation (DEDICATED), FermiGrid, using any other unused nodes (OPPORTUNISTIC), and sites other than FermiGrid farm (OFFSITE). Here is the expert's summary:
- DEDICATED: access to quota-based FermiGrid slots. Handled by the first Condor negotiator.
- OPPORTUNISTIC: These are FermiGrid-only slots available when other experiments aren't using their full quota. Handled by the second negotiator. These jobs have priority over other non-Fermilab VO jobs.
- OFFSITE: Everything non-FermiGrid. Handled by the second negotiator.
We believe the default is
--resource-provides=usage_model=DEDICATED,OPPORTUNISTIC
The Open Science Grid (OSG) is a collaboration of Universities which have agreed to run similar software so the sites look similar to jobs submitted there. This enables us to submit to all these sites and get any resources that are not being used by their high-priority users. To submit to the OSG, switch the resource request:
--resource-provides=usage_model=OFFSITE
Submitting with this switch will cause the system to select which sites you run on based on what sites can provide your request for disk, memory and and time. Practically, we find that some sites perform so poorly that they cause a drag on the project, so we usually supply a list of sites that we find are working well, for example:
--resource-provides=usage_model=OFFSITE --site=SU-OG,Wisconsin
Here are a few thoughts on which sites to use based on experience from 2015-2107. Since this sort of information
can get stale, please also consult the results of the daily ping jobs.
and the lab's site summary.
Note that certain sites, such as FZU
can only be used by specific experiments.
- In all cases use
BNL,Caltech,FNAL,FermiGrid,Nebraska,Omaha,SU-OG,Wisconsin,UCSD
- if memory less than 1.5G, add
NotreDame
- if expected time less than 8h add
MIT,Michigan
- if time less than 5h, also add
MWT2,UChicago
- if time less than 3h, also add
Hyak_CE
"FNAL" is the name for the CMS Tier1 grid at Fermilab - we can use this farm opportunistically through the OFFSITE switch. But if you submit with "OFFSITE" only, then there is no way to get back to FermiGrid, even opportunistically. To get to the maximum resources, request all paths:
--resource-provides=usage_model=DEDICATED,OPPORTUNISTIC,OFFSITE
To submit this maximal way and include a site list, then FermiGrid should be on the site list (even though is is implied with DEDICATED and OPPORTUNISITC):
--resource-provides=usage_model=DEDICATED,OPPORTUNISTIC,OFFSITE --site=SU-OG,Wisconsin,FermiGrid,...
Watch the conMon monitor for sites with lots of "churning" indicated by
nearly as many "disc" (disconnected jobs) as "strt" (started jobs). You might want to drop these struggling sites since it is just delaying your jobs and creating noise. Also note that you can submit with no list of sites in which case the system will try to match you to any site that can meet your requirements. In the past we needed to understand individual sites, but this might work fine at times. See OfflineOps for more monitoring links.
Useful Ttips
High Priority
If your work has been approved as a high-priority project, you can submit a high-priority job by adding the following argument to the jobsub command:
--subgroup=highana
You can do this with mu2eprodsys by:
--jobsub-arg=" --subgroup=highana "
There are actually four subgroups, highpro, highana, lowpro, and lowana. The "pro" versions are used by production in the mu2epro account. The "ana" is for high-priority individual analysis, as defined by spokes or their appointees. The "low" versions have the reverse effect, to lower the priority, to run only when no much else is running.
The scheduling algorithm is (assuming a total mu2e quota of 1250 slots):
- If the scheduler says that the next job is a mu2e job using quota, then it will check to see if there are any processes in the queue that are in “high" subgroup. If there are, and if there are fewer than 650 such processes already running, then the scheduler will choose a process from the high subgroup.
- If there are already 650 high processes running, then additional high jobs in the queue have access the rest of the mu2e quota, competing equally with other mu2e jobs. Similarly for opportunistic slots.
- If there are no high jobs in the queue, then regular mu2e jobs have access to all 1250 slots.
At present, the use of the high priority subgroup is on an honor system. Please do NOT use it unless you have been authorized to.
Steering to Intel
Fermigrid consists of AMD (20%) and Intel (80%) CPUs. A large fraction of the time, simulation jobs run on one flavor will not reproduce on the other flavor. If you are debugging or validating, you might want to stick to one flavor. This switch for jobsub forces only Intel (the same as the interactive machines).
--append_condor_requirements='(TARGET.CpuFamily==6)'
Test Nodes
Most jobs can be tested interactively, and they will run successfully on the grid, even the OSG. If there seems to be a problem which can only be understand by running on a grid node interactively, there a way to do that. You can request an account and log onto fnpctest1,fnal.gov which is setup like a FermiGrid node. (Select an OS - SL6 or 7 - at login).
Monitoring
Please see the grafana pages for the primary web monitor for anything running on FermiGrid. Please Also see operations page for more links.
A summary message for each submission will be sent to the YOUR_USERNAME@fnal.gov
email address. You can change your preferences, for example, requesting only one daily summary email, or turning this report off, by following the links in the mail or going to the documentation.
Efficiency
Everyone wants the grid resources to be used efficiently so we can get the most out them. Some definitions or efficiency follow.
- CPU. This is the CPU time used divided by the wall-clock time the slot was held by the job. If the job is sitting there and not using the CPU, that is a form of wasted resources. Some typical causes of this inefficiency would be a scripting error, or waiting for a database, or input or output files.
- Memory. This is the memory used divided by the memory requested. If you block out a lot of extra memory for your job, it can't be used by other jobs. If the memory is predictable, the request should be set something like 10% over the largest memory used in testing. Sometimes the memory used is unpredictable in which case the only efficiency strategy is to set the memory so a few percent of jobs fail, and resubmit those jobs with a larger memory request.
- Success This is the fraction of jobs with return code zero. Under the assumption that failed jobs will have to be re-run, those jobs were wasted resources.
- Time. This is the wall-clock time used divided by the time requested. Your job's slot will be freed whenever you exit, but inappropriate time requests may cause the scheduling to be non-optimal (and delay your job start!).
The computing division has a policy on efficiency. If a job falls below certain target efficiency levels, the user is notified. If repeated notifications are ignored, the user's priority may be set down. Small jobs are exempt.