SubmitJobs: Difference between revisions

From Mu2eWiki
Latest revision as of 18:11, 20 July 2024

Introduction

This page is part of the standard simulation workflow. It uses the jobsub product through the mu2eprodsys script from the mu2egrid product. Before this step, you should have prepared a set of fcl files for the job and listed them in one or more list files (called fcllist.00, etc., in the examples). Each of these list files will drive one submission, or cluster. You should also have defined the Offline version (the filespec of the setup.sh file) to use in the project. This should be the same as the Offline you set up during fcl generation. If you are running custom-built code (instead of a fixed release on cvmfs), you will need to make a tarball of the code using muse tarball and distribute it using RCDS.

Setup

It is advisable to start with a clean shell (no Offline setup), because the jobsub_client package requires a different Python version than Mu2e Offline.

mu2einit
setup mu2egrid 
setup mu2efiletools

Job submission

The mu2eprodsys command is used to submit grid jobs. A single execution of mu2eprodsys submits a single Condor cluster of jobs. We have been asked to limit the size of clusters to 10,000 jobs. If the total number of jobs to be run exceeds 10,000, the list of fcl files should be split into several smaller files, with each file not exceeding 10,000 lines. One can use the Linux split -l command to do this. If the fcl files are in tarballs, each tarball should have fewer than 10,000 files.
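
The splitting step described above can be sketched as follows. This is a minimal sketch, assuming GNU split (for the -d numeric-suffix option); the function name split_fcl_list, the master list name, and the output prefix are hypothetical and chosen only to match the fcllist.00 naming used in the examples on this page.

```shell
#!/usr/bin/env bash
# Split a long list of fcl files into chunks of at most 10,000 lines.
# Output files are named <prefix>00, <prefix>01, ... (GNU split -d gives
# numeric suffixes, -a 2 makes them two digits wide).
split_fcl_list() {
    local input="$1"                # master list, one fcl file per line (hypothetical name)
    local prefix="${2:-fcllist.}"   # output prefix, illustrative default
    local max_lines="${3:-10000}"   # cluster size limit quoted above
    split -l "$max_lines" -d -a 2 "$input" "$prefix"
}
```

For example, `split_fcl_list fcllist.all fcllist.` would write fcllist.00, fcllist.01, and so on, each of which can then be passed to a separate mu2eprodsys submission through --fcllist.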

The first required parameter points to the code using one of these two:

  • --setup the setup script on cvmfs that determines which version of Offline to use if you are running a pre-built release available on cvmfs.
  • --code the path to the code tarball if you are using a custom-built release.

The other required parameters are:

  • --fcllist The filespec of a file containing a list of fcl files, or the filespec of a tarball containing the fcl files.
  • --dsconf A user-defined string that will be used for the "configuration" field of output files.

Please review FileNames for guidance on choosing file names. The dsconf field here should reflect how the job is configured, in particular, if there are any non-standard detector or simulation settings. It may or may not be the same as the fcl dataset dsconf, since one fcl dataset can be used to produce many very different output datasets by changing the geometry or simulation parameters.

Some other commonly-used options are:

  • --dsowner the "owner" (in the dataset naming convention sense) of output datasets. Defaults to the user who runs mu2eprodsys, unless the user is "mu2epro". In the latter case --dsowner is set to "mu2e" to produce "official" datasets.
  • --wfproject can be used to group results in the outstage area. The use of this option is highly recommended. For example, --wfproject=beam or --wfproject=cosmic.
  • --expected-lifetime should be used to set an upper limit on the wallclock job duration. The value is passed directly to jobsub_submit, see its documentation for more information.
  • --transfer-all-files will transfer all created files at the end of the job. The default is to only transfer *.art, *.root, *.json, and the log file.

Run mu2eprodsys --help to see all the options.

If you want to access additional nationwide opportunistic resources (at the cost of increased latency and possible complications) you can access the Open Science Grid (OSG).

If you plan to register the fcl files in SAM, you can submit the fcl file list before they are registered.

If you are approved by the management, you can run as a high priority user.

Outstage location

By default the outputs will be placed into

/pnfs/mu2e/scratch/users/$USER/workflow/$WFPROJECT/outstage/

where $USER is the submitter user name, and $WFPROJECT is specified by the --wfproject parameter (the default is "default"). If --role=Production (see authentication) is used (default for the mu2epro user), the outputs will go into

/pnfs/mu2e/persistent/users/$USER/workflow/$WFPROJECT/outstage/

instead.

Each submission, or call to mu2eprodsys, creates one cluster of jobs. The number of jobs in the cluster is determined by the number of entries in the fcl list file. The script will print out a job ID like

Use job id 18497021.0@fifebatch2.fnal.gov to retrieve output

It is a good idea to save this so you can track this cluster.

Each submission will create a new directory in the outstage area, like /pnfs/mu2e/scratch/users/$USER/workflow/$WFPROJECT/outstage/15503895.beam_p03_17_03_31_16_32_48_1490995968_0. Inside it are subdirectories like 00, which serve to reduce the number of files in each directory. Inside those are directories for each job, like 0000, which hold the job output.

Examples

An example using a code tarball made by muse tarball, distributed using RCDS, and a tarball of fcl files:

mu2eprodsys  \
--fcllist=/pnfs/mu2e/scratch/users/rlc/${FCLLIST}.tgz \
--clustername=00  \
--wfproject=geant_181203_new_0 \
--dsconf=geant_181203 --dsowner=rlc  \
--disk=10GB --memory=1950MB --expected-lifetime=6h \
--code=/exp/mu2e/data/users/rlc/museTarball/tmp.RFLlub8qcy/Code.tar.bz2

An example using a pre-compiled release on cvmfs and a local text file of fcl file names:

  mu2eprodsys --setup=/cvmfs/mu2e.opensciencegrid.org/Offline/v5_6_7/SLF6/prof/Offline/setup.sh \
  --wfpro=pion-test \
  --fcllist=fcllist.00 \
  --dsconf=v567 \
  --expected-lifetime=5h

An example from production, using xrootd input:

mu2eprodsys  \
  --fcllist=fcllist.01 \
  --clustername=$FCLLIST  \
  --wfproject=MDC2018_DS-cosmic-mix_i-cat_1 \
  --setup=/cvmfs/mu2e.opensciencegrid.org/Offline/v7_5_4/SLF7/prof/Offline/setup.sh  \
  --dsconf=MDC2018i --dsowner=mu2e  \
  --disk=30GB --memory=2GB --expected-lifetime=3h \
  --xrootd

Note that for mu2eprodsys and jobsub, the memory is labeled, for example, as "2400MB", but the limit is actually applied in MiB, that is, 2400*1024*1024 bytes.

Monitoring

You can use the mu2e_clusters script from the mu2egrid package, or the jobsub_q command, to check the status of your jobs.

If you see that some jobs are on hold (the "H" state), look for HoldReason in the output of jobsub_q --long --jobid <jobid>. If you see a PERIODIC_HOLD message, that means the job tried to use more resources (memory/disk/time) than you asked for. jobsub_rm the offending jobs and re-submit after adjusting the requirements. If HoldReason is not a PERIODIC_HOLD, open a servicedesk ticket.
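
The check described above can be wrapped in two small helpers. This is a sketch only: it uses the jobsub_q --long --jobid form quoted in the text and assumes the long listing contains lines that mention HoldReason; the function names show_hold_reason and is_periodic_hold are hypothetical.

```shell
#!/usr/bin/env bash
# Print the lines that mention HoldReason in the long listing of one job.
# Assumption: the long listing contains such lines for held jobs.
show_hold_reason() {
    local jobid="$1"    # e.g. the job id printed by mu2eprodsys at submission
    jobsub_q --long --jobid "$jobid" | grep 'HoldReason'
}

# Read hold-reason text on stdin; succeed if it reports a PERIODIC_HOLD,
# i.e. the job exceeded the memory, disk, or time it asked for.
is_periodic_hold() {
    grep -q 'PERIODIC_HOLD'
}
```

A possible use is `show_hold_reason <jobid> | is_periodic_hold && echo "resource limit exceeded"`; any other hold reason would be a case for a servicedesk ticket.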

See the operation links page for links to useful monitors, in particular mu2e grafana. If the jobs go on hold, the pages called "Why are my jobs held?" will tell you which resource limits you violated.

Checking output

The mu2eClusterCheckAndMove script from the mu2efiletools package can be used to separate "good" and "failed" jobs. One does not have to wait until all jobs complete; mu2eClusterCheckAndMove can be run periodically on job outputs in the outstage area.

cd .../workflow/$WFPROJECT/outstage
mu2eClusterCheckAndMove 15503895.beam_p03_17_03_31_16_32_48_1490995968_0 >& \
    $WORKDIR/cam_`date +"%y_%m_%d_%H_%M_%S_%s"`.log &


After running the mu2eClusterCheckAndMove script, the output area will contain three directories:

.../workflow/$WFPROJECT/outstage
.../workflow/$WFPROJECT/good
.../workflow/$WFPROJECT/failed

As they are checked, job directories are moved from "outstage" into "good" and "failed" subdirectories, as needed. The script prints a summary of the results.

Two frequently used options:

  • --timecut The script will not look at job outputs that are "too fresh" and may still be written out. The default minimal age is 7200 seconds. If you know that all jobs in the cluster have finished, you can set --timecut to a small value instead of waiting for 2 hours before checking the results.
  • --nosam By default the script talks to SAM to ensure that there are no duplicate jobs. Because of glitches in grid running, sometimes one gets more than one copy of files for the same jobs. However, if the original fcl files have not been registered with SAM, the uniqueness check cannot be performed. This means that if you are registering the fcl files in SAM, you should not run this script before that registration is done.

Recovering jobs

If the fcl dataset was declared in SAM, then you can use the following method to make a list of jobs to recover. After all jobs from the current submission have completed and been processed with the mu2eClusterCheckAndMove script, so that the outstage area is empty, SAM will have a record of all "good" jobs from that attempt. Continuing with the pion example, one can run

mu2eMissingJobs --fclds=cnf.gandr.my-test-s1.v0.fcl  --dsconf=v567  > failed-jobs.txt

then use the list of failed jobs to re-submit them with mu2eprodsys. Here, --dsconf is the value used for the same switch at submission (and likewise --dsowner, if non-default). It is important to set these consistently throughout the submission and recovery process. The fcl dataset name is the dataset of the fcl files in the submission, and was set in the generate fcl step; it may or may not have the same dsconf.

If the fcl dataset was not registered in SAM, then you need to make a list of job directories in the good directory, find missing entries, and construct a new fcl file list from that.

Return to workflow

At the end of this procedure, you should have a complete set of output files in the good subdirectory of the output area.


Local test

This is an example of how to run one section of a mu2eprodsys grid job locally. It comes from v4_00_02, so other versions may need tweaks.

export EXPERIMENT="mu2e"
export MU2EGRID_MU2ESETUP="/cvmfs/mu2e.opensciencegrid.org/setupmu2e-art.sh"
export IFDH_VERSION="v2_1_0"
export MU2EGRID_ERRORDELAY="1"
export MU2EGRID_INPUTLIST="filelist-test-10fast"
export MU2EGRID_DIR="/mu2e/app/home/mu2epro/test/products/mu2egrid/v4_00_02"
export MU2EGRID_USERSETUP="/mu2e/app/users/rlc/head/Offline/setup.sh"
export MU2EGRID_CLUSTERNAME="00"
export MU2EGRID_DHTOOLS_VERSION="v1_13"
export MU2EGRID_SAM_WEB_CLIENT_VERSION="v2_0"
export MU2EGRID_DSOWNER="mu2e"
export MU2EGRID_TRANSFER_ALL="0"
export MU2EGRID_DSCONF="0000a"
export MU2EGRID_WFOUTSTAGE="/pnfs/mu2e/persistent/users/mu2epro/workflow/grid_pwd_test_2/outstage"
export MU2EGRID_MU2EBINTOOLS_VERSION="v1_01_06"
export MU2EGRID_SUBMITTER="mu2epro"
export PROCESS=0
export CLUSTER=100

export TMPDIR=$PWD
export CONDOR_DIR_INPUT="$PWD"

rm -rf $MU2EGRID_WFOUTSTAGE
rm -rf mu2egridInDir localFileDefs err.log out.log mu2eprodsys_errmsg* ifdh_*
./mu2eprodsys.sh > out.log 2> err.log

MiB, MB, 1000KiB and the Grid Memory Watchdog

This section has the secret decoder ring for the many different memory units you will encounter while managing your jobs. For extra confusion, not all software correctly describes the units it uses. First, the standard definitions from https://en.wikipedia.org/wiki/Kilobyte

1 kB = 1,000 bytes
1 MB = 1,000,000 bytes
1 KiB = 1024 bytes
1 MiB = 1024*1024 bytes

and so on for GB/GiB, TB/TiB etc.

There is another unit that appears in some places: 1000KiB which is sometimes erroneously written as either MB or MiB.

First the units that are done correctly. The art end-of-job memory report looks like:

MemReport  ---------- Memory summary [base-10 MB] ------
MemReport  VmPeak = 2964.03 VmHWM = 2240.32

It is really in MB, properly defined, and art goes out of its way to say so. Note that "VmHWM" means "Virtual Memory High Water Mark".

If you look in your grid log file, the first line after the output from art looks like:

136.42user 4.12system 2:25.97elapsed 96%CPU (0avgtext+0avgdata 2188076maxresident)k

Here "k" means 1024 bytes and "max resident" means "Maximum Resident set size". This means the same thing as VmHWM. You can cross-check this with the reported value of VmHWM; they usually agree to a few in the first decimal place. Why are they not exactly equal? While the two metrics measure the same thing they do so at different times.

Both of the above numbers come from /proc/self/status, which reports both the current value of the resident set size and the maximum value over the job. The reported value is the maximum over the job.
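
The cross-check between maxresident and VmHWM can be done with a one-line conversion from KiB to base-10 MB. This is a minimal sketch using awk for the floating-point arithmetic; the function name kib_to_mb is hypothetical.

```shell
#!/usr/bin/env bash
# Convert a size in KiB (1024 bytes) to base-10 MB (1,000,000 bytes),
# printed with two decimal places to match the art memory report.
kib_to_mb() {
    awk -v kib="$1" 'BEGIN { printf "%.2f\n", kib * 1024 / 1000000 }'
}
```

For the log line quoted above, `kib_to_mb 2188076` prints 2240.59, since 2188076 * 1024 = 2,240,589,824 bytes. That compares with VmHWM = 2240.32 in the art report, a difference of a few in the first decimal place.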

The Condor software used by the grid has a memory watchdog that periodically asks each running grid process for the current value of its resident set size (summed over all subprocesses and threads if that is relevant). It does not ask for the maximum resident set size over the processes so it will almost always miss the peak. If the value found by Condor is above the memory limit requested when you submit your job, then the grid management software will stop your job and put it into a hold state. At this time the only thing that you can do is to end the job using:

 jobsub_rm --jobid=<id of your grid process>

and resubmit it with a larger memory request, by adding this option to mu2eprodsys

  --memory="nnnnMB"

where nnnn is a number that the grid group tells us is actually in MiB; however, we have some counterexamples and are discussing this with the Fermilab grid support team.
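
If the --memory value is indeed interpreted in MiB (an assumption, given the counterexamples mentioned above), a peak reported by art in base-10 MB has to be converted before it is used as a request. The sketch below rounds up to a whole number of MiB; the function name mb_to_mib_ceil is hypothetical.

```shell
#!/usr/bin/env bash
# Convert base-10 MB (1,000,000 bytes) to MiB (1024*1024 bytes),
# rounded up to the next whole MiB.
mb_to_mib_ceil() {
    awk -v mb="$1" 'BEGIN {
        mib = mb * 1000000 / (1024 * 1024)
        whole = int(mib)
        if (mib > whole) whole++   # round up any fractional part
        print whole
    }'
}
```

With the VmHWM value from the art report above, `mb_to_mib_ceil 2240.32` prints 2137. This is only the measured peak for one job; a real request would normally add some headroom on top of it.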

As best I can reverse engineer, the watchdog looks up memory use in units of KiB. When your grid job completes, the email you receive includes a summary of the min, max and average memory use. These numbers are labeled as being in MiB, which I believe is correct. The number reported in the email is always consistent with the largest value found by the watchdog.

You can see the reports from the memory watchdog using jobsub_fetchlog. Note that the values are labelled as being in kB but they are actually in KiB.