SubmitJobs
Revision as of 17:37, 1 September 2017
Introduction
This page is part of the standard simulation workflow. It uses the jobsub package through the mu2eprodsys
script in mu2egrid. Before this step, you should have prepared a set of fcl files for the job and listed them in a set of list files, called fcllist.00, etc., in the example. Each of these list files will drive one submission, or cluster. You should also have chosen the Offline version (the filespec of the setup.sh file) to use in the project. This should be the same Offline version you set up during fcl generation.
Setup
It is advisable to start with a clean shell (no Offline setup),
because the jobsub_client
package requires a different
Python version than Mu2e Offline.
setup mu2e
setup mu2egrid
setup mu2efiletools
Job submission
The mu2eprodsys
command is used to submit grid jobs.
A single execution of mu2eprodsys
submits a single
Condor cluster of jobs. We have been asked to limit the size of
clusters to 10,000 jobs. If the total number of jobs to be run
exceeds 10,000, the list of fcl files should be split into several smaller files
with each file not exceeding 10,000 lines. One can use the
Linux split -l
command to do this.
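As a minimal sketch of the splitting step, the commands below cut a master list into pieces of at most 10,000 lines named fcllist.00, fcllist.01, and so on. The master list name fcllist.all and the generated entries are hypothetical stand-ins, and the -d (numeric suffix) option assumes GNU split.

```shell
# Build a stand-in master list of 25,000 entries (hypothetical names)
# in a scratch directory, so the sketch can be tried safely.
workdir=$(mktemp -d)
cd "$workdir"
seq -f "job_%06g.fcl" 1 25000 > fcllist.all

# Split into chunks of at most 10,000 lines with two-digit numeric
# suffixes: fcllist.00, fcllist.01, ... (-d is a GNU split option).
split -l 10000 -d -a 2 fcllist.all fcllist.

# Show the line count of each chunk.
wc -l fcllist.[0-9][0-9]
```

Each resulting chunk can then be passed to a separate mu2eprodsys submission.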
The required parameters are:
--setup
The setup script that determines which version of Offline to use. To run jobs offsite, the Offline release must reside in cvmfs. For on-site running one can put code on the /mu2e/app disk.
--fcllist
The list of uploaded fcl files in /pnfs.
--dsconf
A user-defined string that will be used for the "configuration" field of output files.
Some of the options
--dsowner
The "owner" (in the dataset naming convention sense) of output datasets. Defaults to the user who runs mu2eprodsys, unless the user is "mu2epro". In the latter case --dsowner is set to "mu2e" to produce "official" datasets.
--wfproject
Can be used to group results in the outstage area. The use of this option is highly recommended. For example, --wfproject=beam or --wfproject=cosmic.
--expected-lifetime
Should be used to set an upper limit on the wallclock job duration. The value is passed directly to jobsub_submit; see its documentation for more information.
Run mu2eprodsys --help
to see all the options.
If you plan to register the fcl files in SAM, you can submit the fcl file list before they are registered.
If you are approved by the management, you can run as a high priority user.
Outstage location
By default the outputs will be placed into
/pnfs/mu2e/scratch/users/$USER/workflow/$WFPROJECT/outstage/
where $USER is the submitter user name, and $WFPROJECT is specified by
the --wfproject
parameter (the default is "default").
If --role=Production
(see authentication) is used (default for the mu2epro user), the outputs will go into
/pnfs/mu2e/persistent/users/$USER/workflow/$WFPROJECT/outstage/
instead.
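The rule above can be written as a small helper. This is only a sketch of the path convention as described on this page; the function name outstage_dir and the user name alice are hypothetical, and the real location is decided by mu2eprodsys itself.

```shell
# Print the expected outstage directory for a submission.
#   $1 = submitter user name
#   $2 = value of --wfproject (falls back to "default")
#   $3 = role; "Production" selects the persistent area
outstage_dir() {
    user=$1
    wfproject=${2:-default}
    role=${3:-}
    if [ "$role" = "Production" ]; then
        area=persistent
    else
        area=scratch
    fi
    printf '/pnfs/mu2e/%s/users/%s/workflow/%s/outstage\n' \
        "$area" "$user" "$wfproject"
}

# Hypothetical user "alice" with --wfproject=beam:
outstage_dir alice beam
outstage_dir alice beam Production
```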
Each submission, or call to mu2eprodsys, creates one cluster of jobs. The number of jobs in the cluster is determined by the number of lines in the fcl list file. The script will print out a jobId like
Use job id 18497021.0@fifebatch2.fnal.gov to retrieve output
It is a good idea to save this so you can track this cluster.
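One possible way to keep the job id is to extract it from the submission message and append it to a log file. The sketch below parses the message format shown above; the sample line stands in for real mu2eprodsys output, and the file name submitted-jobids.txt is a hypothetical choice.

```shell
# Stand-in for the line printed at submission time.
line='Use job id 18497021.0@fifebatch2.fnal.gov to retrieve output'

# Keep only the job id between "Use job id" and "to retrieve output".
jobid=$(printf '%s\n' "$line" |
    sed -n 's/^Use job id \([^ ]*\) to retrieve output$/\1/p')

# Append it to a hypothetical bookkeeping file in a scratch directory.
logdir=$(mktemp -d)
printf '%s\n' "$jobid" >> "$logdir/submitted-jobids.txt"
cat "$logdir/submitted-jobids.txt"
```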
Each submission will create a new directory in the outstage area, like
/pnfs/mu2e/scratch/users/$USER/workflow/$WFPROJECT/outstage/15503895.beam_p03_17_03_31_16_32_48_1490995968_0
Each such directory contains subdirectories like 00, which serve to reduce the number of files in each directory. Those in turn contain one directory per job, like 0000, which holds the job output.
Example
mu2eprodsys --setup=/cvmfs/mu2e.opensciencegrid.org/Offline/v5_6_7/SLF6/prof/Offline/setup.sh \
    --wfpro=pion-test \
    --fcllist=fcllist.00 \
    --dsconf=v567 \
    --expected-lifetime=5h
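When the fcl list has been split into several files, each one needs its own submission. The following is a dry-run sketch that only prints the command for each list file, using the option values from the example; the empty list files are stand-ins created in a scratch directory, and the leading echo would have to be removed to actually submit.

```shell
# Stand-in list files in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"
touch fcllist.00 fcllist.01 fcllist.02

setup=/cvmfs/mu2e.opensciencegrid.org/Offline/v5_6_7/SLF6/prof/Offline/setup.sh

# One mu2eprodsys call (one cluster) per list file.
# "echo" makes this a dry run: commands are printed, not executed.
for list in fcllist.[0-9][0-9]; do
    echo mu2eprodsys --setup="$setup" \
        --wfproject=pion-test \
        --fcllist="$list" \
        --dsconf=v567 \
        --expected-lifetime=5h
done > submit-commands.txt

cat submit-commands.txt
```

Using the same --dsconf and --wfproject values for every chunk keeps the outputs of all clusters in one place and under one configuration name.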
Monitoring
You can use the mu2e_clusters
script from the mu2egrid
package, or
the jobsub_q
command, to check the status of your jobs.
If you see that some jobs
are on hold (the "H" state), look for HoldReason
in the output of
jobsub_q --long --jobid <jobid>
. If you see a PERIODIC_HOLD message, that means the job
tried to use more resources (memory/disk/time) than you asked for.
jobsub_rm
the offending jobs and re-submit after
adjusting the requirements. If HoldReason is not a
PERIODIC_HOLD, open a servicedesk ticket.
See the operation links page for links to useful monitors, in particular mu2e grafana. If the jobs go on hold, the jobs list will tell you which resource limits you violated.
Checking output
The mu2eClusterCheckAndMove
script from the
mu2efiletools
package can be used to separate
"good" and "failed" jobs. One does not have to wait
until all jobs complete;
mu2eClusterCheckAndMove
can be run periodically
on job outputs in the outstage area.
cd .../workflow/$WFPROJECT/outstage
mu2eClusterCheckAndMove 15503895.beam_p03_17_03_31_16_32_48_1490995968_0 >& \
    $WORKDIR/cam_`date +"%y_%m_%d_%H_%M_%S_%s"`.log &
After running the mu2eClusterCheckAndMove
script,
there will be three directories in the output area:
.../workflow/$WFPROJECT/outstage
.../workflow/$WFPROJECT/good
.../workflow/$WFPROJECT/failed
As they are checked, job directories are moved from "outstage" into "good" and "failed" subdirectories, as needed. The script prints a summary of the results.
Two frequently used options:
--timecut
The script will not look at job outputs that are "too fresh" and may still be being written. The default minimum age is 7200 seconds. If you know that all jobs in the cluster have finished, you can set --timecut to a small value instead of waiting for 2 hours before checking the results.
--nosam
By default the script talks to SAM to ensure that there are no duplicate jobs. Because of glitches in grid running, one sometimes gets more than one copy of the files for the same job. However, if the original fcl files have not been registered with SAM, the uniqueness check cannot be performed. This means that if you are registering the fcl files in SAM, you should not run this script before that registration is done.
Recovering jobs
If the fcl dataset was registered in SAM, then you can use the following
method to make a list of jobs to recover.
After all jobs from the current submission have completed and been processed with the mu2eClusterCheckAndMove script, so that the outstage area is empty, SAM has a record of all "good" jobs from that attempt.
Continuing with the pion example, one can run
mu2eMissingJobs --fclds=cnf.gandr.my-test-s1.v0.fcl --dsconf=v567 > failed-jobs.txt
then use the list of failed jobs to re-submit them with mu2eprodsys. The --dsconf value is the one used for the same switch at submission (likewise --dsowner, if non-default). It is important to set these consistently throughout the submission and recovery process. The fcl dataset name is the dataset of the fcl files in the submission, and was set in the generate fcl step; it may or may not have the same dsconf.
If the fcl dataset was not registered in SAM, then you need to make
a list of job directories in the good
directory, find
missing entries, and construct a new fcl file list from that.
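The manual procedure might look like the sketch below, run here on a mock directory tree. It rests on an assumption that is not confirmed on this page: that the job directory named NNNN corresponds to line NNNN+1 of the submitted fcl list. Check how job directories map to fcl files for your submission before relying on it. The names cluster, a.fcl to d.fcl, and fcllist.retry are hypothetical.

```shell
# Mock setup in a scratch directory (all names are stand-ins).
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' a.fcl b.fcl c.fcl d.fcl > fcllist.00
mkdir -p good/cluster/00/0000 good/cluster/00/0002

# 1. Job directories present under good/ (cluster/NN/job layout).
find good -mindepth 3 -maxdepth 3 -type d | sed 's|.*/||' | sort > good-jobs.txt

# 2. Job numbers expected from the length of the fcl list.
#    ASSUMPTION: job directory NNNN ran line NNNN+1 of the list.
njobs=$(wc -l < fcllist.00)
seq -f '%04g' 0 $((njobs - 1)) | sort > all-jobs.txt

# 3. Entries that were expected but are not in good/.
comm -23 all-jobs.txt good-jobs.txt > missing-jobs.txt

# 4. New fcl list built from the matching lines of the original list.
awk 'FILENAME == ARGV[1] { miss[$1 + 0] = 1; next }
     (FNR - 1) in miss' missing-jobs.txt fcllist.00 > fcllist.retry

cat fcllist.retry
```

In this mock case jobs 0001 and 0003 are missing, so fcllist.retry contains b.fcl and d.fcl, and could be used as the --fcllist argument of a new submission.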
Return to workflow
At the end of this procedure, you should have a complete set of output files in the good
subdirectory of the output area.