SubmitJobs
Introduction
This page is part of the standard simulation workflow. It uses the jobsub package through the mu2eprodsys script in mu2egrid. Before this step, you should have generated a set of fcl files for the jobs and split their names into list files (called fcllist.00, fcllist.01, etc., in the example below). Each list file will drive one submission, or cluster. You should also have decided which Offline version to use (specified as the filespec of its setup.sh file). This should be the same Offline version you set up during fcl generation.
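Each list file is plain text with one fcl file path per line. The directory and file names here are hypothetical, but a list of uploaded fcl files in /pnfs might look like:
/pnfs/mu2e/scratch/users/$USER/uploads/cnf.$USER.pion-test.v567.000001.fcl
/pnfs/mu2e/scratch/users/$USER/uploads/cnf.$USER.pion-test.v567.000002.fcl
/pnfs/mu2e/scratch/users/$USER/uploads/cnf.$USER.pion-test.v567.000003.fcl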
Setup
It is advisable to start with a clean shell (no Offline setup),
because the jobsub_client package requires a different
Python version than Mu2e Offline.
setup mu2e
setup mu2egrid
setup mu2efiletools
Job submission
The mu2eprodsys command is used to submit grid jobs.
A single execution of mu2eprodsys submits a single
Condor cluster of jobs. We have been asked to limit the size of
clusters to 10,000 jobs. If the total number of jobs to be run
exceeds 10,000, the list of fcl files should be split into chunks
with each chunk not exceeding 10,000 files. One can use the
Linux split command to do this.
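For example, assuming the complete list of fcl files is in a file named all_fcl.txt (a hypothetical name), the following produces chunk files fcllist.00, fcllist.01, etc., of at most 10,000 lines each:

split --lines=10000 --numeric-suffixes all_fcl.txt fcllist.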
The required parameters are:
--setup  the setup script that determines which version of Offline to use. To run jobs offsite, the Offline release must reside in cvmfs. For on-site running one can put code on the /mu2e/app disk.
--fcllist  the list of uploaded fcl files in /pnfs.
--dsconf  a user-defined string that will be used for the "configuration" field of output files.
Some of the options:
--dsowner  the "owner" (in the dataset naming convention sense) of output datasets. Defaults to the user who runs mu2eprodsys, unless the user is "mu2epro"; in the latter case --dsowner is set to "mu2e" to produce "official" datasets.
--wfproject  can be used to group results in the outstage area. The use of this option is highly recommended. For example, --wfproject=beam or --wfproject=cosmic.
--expected-lifetime  should be used to set an upper limit on the wallclock job duration. The value is passed directly to jobsub_submit; see its documentation for more information.
Run mu2eprodsys --help to see all the options.
Outstage location
By default the outputs will be placed into
/pnfs/mu2e/scratch/users/$USER/workflow/$WFPROJECT/outstage/
where $USER is the submitter user name, and $WFPROJECT is specified by
the --wfproject parameter (the default is "default").
If --role=Production (see authentication) is used (default for the mu2epro user), the outputs will go into
/pnfs/mu2e/persistent/users/$USER/workflow/$WFPROJECT/outstage/
instead.
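For example, to inspect the outputs of a submission that used --wfproject=pion-test (the workflow name from the example below; substitute your own), one could run:

ls /pnfs/mu2e/scratch/users/$USER/workflow/pion-test/outstage/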
Each submission, i.e. each call to mu2eprodsys, creates one cluster of jobs. The number of jobs in the cluster equals the number of fcl files in the list file. The script will print out a job id like
Use job id 18497021.0@fifebatch2.fnal.gov to retrieve output
It is a good idea to save this id so you can track the cluster.
Example
mu2eprodsys --setup=/cvmfs/mu2e.opensciencegrid.org/Offline/v5_6_7/SLF6/prof/Offline/setup.sh \
    --wfproject=pion-test \
    --fcllist=fcllist.00 \
    --dsconf=v567 \
    --expected-lifetime=5h
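If the fcl list was split into multiple chunks, each chunk needs its own submission. A simple sketch (list file names as in the split example above; the log file name is just a suggestion) loops over the list files and saves the printed job ids:

for f in fcllist.*; do
  mu2eprodsys --setup=/cvmfs/mu2e.opensciencegrid.org/Offline/v5_6_7/SLF6/prof/Offline/setup.sh \
      --wfproject=pion-test --fcllist=$f --dsconf=v567 --expected-lifetime=5h
done | tee submitted-jobids.txt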
Monitoring
You can use the mu2e_clusters script from the mu2egrid package, or
the jobsub_q command, to check the status of your jobs.
If you see that some jobs
are on hold (the "H" state), look for HoldReason in the output of
jobsub_q --long --jobid <jobid>. If you see a PERIODIC_HOLD message, that means the job
tried to use more resources (memory/disk/time) than you asked for.
Use jobsub_rm to remove the offending jobs, then re-submit after
adjusting the requirements. If the HoldReason is not a
PERIODIC_HOLD, open a servicedesk ticket.
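For illustration, using the job id from the earlier example (substitute your own), the checks described above might look like:

jobsub_q --user=$USER
jobsub_q --long --jobid 18497021.0@fifebatch2.fnal.gov | grep HoldReason
jobsub_rm --jobid 18497021.0@fifebatch2.fnal.gov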
See the operation links page for links to useful monitors, in particular the mu2e grafana dashboard. If your jobs go on hold, the jobs list there will show which resource limits were violated.