JobPlan

Introduction

When running large jobs on the grids, you must take into account all the physical limits on resources.

The first question is always: is there a standard dataset that you can use instead of starting from scratch? Some are listed here, and you should also consult a mentor in the area of your work, who can answer this question definitively.


Total CPU resources

Currently (2017) a user may expect to get about 500-2000 slots (with spikes to 10K) on Fermigrid and 2000-5000 slots (with spikes to 15K) on OSG. While the resources on OSG are substantial, there is additional operational friction such as greater unpredictability and higher rates of jobs failing and restarting.

You can run a test job and scale up the time to see whether it is reasonable given the total CPU you can get in a given time frame. The interactive nodes are, on average, a little faster than grid nodes.
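
As a rough illustration of that scaling (all numbers below are hypothetical, not Mu2e measurements), a back-of-the-envelope estimate might look like:

 # hypothetical inputs: 10 s/event in a test job, 10 M events needed, ~1000 slots available
 SEC_PER_EVENT=10
 NEVENTS=10000000
 SLOTS=1000
 # total CPU-hours, and elapsed days if the slots stay busy
 echo "CPU-hours: $(( SEC_PER_EVENT * NEVENTS / 3600 ))"
 echo "days on ${SLOTS} slots: $(( SEC_PER_EVENT * NEVENTS / SLOTS / 3600 / 24 ))"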

Job time

It is recommended to have jobs longer than 15 minutes, and shorter than about 8 hours, from general considerations of per-job overhead and reliability.
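
For example (again with hypothetical numbers), if a test job shows about 10 s/event, choosing the number of events per job to land comfortably inside that window is simple arithmetic:

 # hypothetical: 10 s/event measured in a short test job
 SEC_PER_EVENT=10
 # events per job for a roughly 4 hour job, well inside the 15 min - 8 h window
 echo "events/job: $(( 4 * 3600 / SEC_PER_EVENT ))"   # -> 1440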

Fermigrid has a maximum time limit of about 4 days. If you are running on the OSG, each site has its own limits and if your job is short enough, it can run on more sites.

File Sizes

Worker nodes at Fermilab have 30GB of disk space per core. OSG sites have various defaults and limits, and the less you request, the more sites you can access.

Where to keep the output

From time to time, the collaboration decides to run major simulation campaigns, taking months, such as those performed for the TDR or CD3 reviews. All these datasets, from every stage, are written to tape. But simulation run by individuals in pursuit of their analysis is not usually written to tape. Most analyzers find it is sufficient to write their simulation samples to the scratch dCache area, and make ntuples which are saved on the data disks. After that, they can delete, or let dCache eventually delete, the simulation output. If an individual's project takes a long time to produce, or is very large, or requires long-term reproducibility, or needs to be passed to several other users, it might make sense to upload it to tape. Please consult your mu2e mentors in making this decision.

It is also possible to save a dataset to the persistent area of dCache. This area is disk only, not tape-backed, but it is not subject to automatic purging. Special permission is required to write here.

Output file size

As the primary concern on output file size, you should consider how the file will be used. A rule of thumb is to plan for files that take about 1h to operate on in the following stages of the analysis. It is easy for a job to read multiple files on input, to make it longer, but we don't have a reasonable way to split a file on input to make a job shorter. In addition, longer jobs are harder to schedule on the grid and are more likely to exceed memory limits. Planning for multiples of 1 h job pieces is efficient and flexible.
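
As a hypothetical illustration of the 1 h rule (the processing rate here is made up, not a Mu2e benchmark): if the downstream stage reads about 20 events/s, the target events per file follows directly:

 # hypothetical: the next analysis stage processes 20 events/s
 DOWNSTREAM_EVT_PER_SEC=20
 # events per output file so that one file is about 1 h of downstream work
 echo "events/file: $(( DOWNSTREAM_EVT_PER_SEC * 3600 ))"   # -> 72000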

Once the primary run-time concern is met, there is also a secondary concern on file sizes generally. For those output files that are intended for tape storage, it is inefficient to store small files because of per-file operational costs. If jobs are too slow to produce sufficiently large files, one should consider concatenating outputs before writing them to tape.

Very generally, good final file sizes are from 200 MB to 5 GB. Too many very small files start to limit your work through per-file costs such as writing a SAM record, simply accessing the file in dCache, or the ROOT file initialization. We also do not recommend writing files larger than 5 GB to tape, because of the increased operational risks when so much data is grouped in one object, such as a corrupt file, or complications in disk space and job management. The technical limits are on the scale of a TB, so if you have a few files over 5 GB, that's OK.
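
To get a feel for what that range means in events (the per-event size here is purely illustrative), a quick conversion:

 # hypothetical: 200 kB/event on disk for this output stream
 KB_PER_EVENT=200
 # events corresponding to the 200 MB and 5 GB guideline endpoints
 echo "min: $(( 200 * 1000 / KB_PER_EVENT )) events"       # -> 1000
 echo "max: $(( 5 * 1000 * 1000 / KB_PER_EVENT )) events"  # -> 25000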

Memory

Most mu2e art jobs use around 2 GB, so most jobs can be submitted with a request for 2 GB of memory. On Fermigrid, the max allowed is 16GB. On OSG all sites have various defaults and limits, and the less you request, the more sites you can access.

To find out how much memory you need, run a test job, then look for this line near the end of the log:

MemReport  VmPeak = 2757.39 VmHWM = 1863.99

The relevant number is the second, VmHWM, which is in MB (as opposed to MiB). Note that a few test events may not use more than 2 GB, but in the full set of jobs some rare events may use much more, for example due to physics that produces a very long particle list. Until you have experience with your exact job, start out with 1.1 times the test job memory.
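
A minimal sketch of turning that into a request, assuming a log file named mu2e.log (a hypothetical name) and a MemReport line formatted exactly as above:

 # extract VmHWM (MB) from the art MemReport line and pad it by 10%
 grep MemReport mu2e.log | awk '{printf "request: %.0f MB\n", $7 * 1.1}'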

The bash shell has a built-in command called time, but there is also /usr/bin/time. The latter also reports the memory size.

 /usr/bin/time mu2e -c ...
 48.25user 4.06system 1:15.79elapsed 69%CPU (0avgtext+0avgdata 1851712maxresident)k
 2347192inputs+8536outputs (4420major+1211874minor)pagefaults 0swaps

The relevant number is the one right before "maxresident", which is in KiB and agrees well with the art report.

In grid jobs we often use a custom Mu2e time binary,

/cvmfs/mu2e.opensciencegrid.org/bin/SLF6/time 

which corrects a bug in passing the executable's return code to the shell. Note: due to a historical bug, the number reported is actually 4 times the correct number in some OS versions; SLF6.10 has this fixed.

Note that for mu2eprodsys and jobsub, the requested job memory is labeled, for example, as "2400MB", but it is actually limited in MiB, i.e. 2400*1024*1024 bytes.
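
To make the distinction concrete, the enforced limit is about 5% larger than a literal reading of the "MB" label:

 echo $(( 2400 * 1024 * 1024 ))   # 2516582400 bytes: the limit actually enforced (2400 MiB)
 echo $(( 2400 * 1000 * 1000 ))   # 2400000000 bytes: what "2400MB" would literally mean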

Testing

Wallclock time and the output size scale linearly with the number of events for most Mu2e jobs. One can estimate the slope and the intercept by running jobs of increasing length (the -n option to mu2e). The /usr/bin/time utility (not just "time"!) is useful for measuring memory consumption:

/usr/bin/time mu2e -c test.fcl -n 10
...
Art has completed and will exit with status 0.
28.07user 2.41system 0:43.59elapsed 69%CPU (0avgtext+0avgdata 3836928maxresident)k
3003072inputs+112outputs (9463major+210992minor)pagefaults 0swaps

From the above output, we conclude that the test job used 3836928 / 4096 = 937 MB of memory. (SL6 is shipped with a buggy version of GNU time, this is why we need to divide the reported size by 4096 instead of 1024 for the purported "k" units.)

One caveat is that estimates can be affected by large rare events, which may not be seen in short jobs.
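
A minimal sketch of the linear estimate described at the start of this section, using two timed runs of different length (the timings here are hypothetical):

 # hypothetical: a 10-event run took 44 s elapsed, a 100-event run took 224 s
 N1=10;  T1=44
 N2=100; T2=224
 # slope = seconds per event, intercept = fixed per-job overhead
 echo "s/event:  $(( (T2 - T1) / (N2 - N1) ))"               # -> 2
 echo "overhead: $(( T1 - N1 * (T2 - T1) / (N2 - N1) )) s"   # -> 24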

See also grid limit tips.

Following stages

If the current job writes out framework files for use by subsequent job stages, the above criteria should be applied to the whole processing chain. It is important to remember that later stages can easily read multiple input files per job, but cannot "split" an existing art file. This implies smaller output files from the first stage, or a milder concatenation factor. For example, the configuration to produce standard g4s4 conversion electron datasets cnf.mu2e.....fcl runs only 1000 events per job, resulting in short (minutes) jobs that produce small (few MB) output files. However, digi+tracking jobs with background mixing would be too slow if the g4s4 inputs were significantly larger.
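
As a hypothetical sketch of choosing a concatenation (or multi-file input) factor for the next stage:

 # hypothetical: each g4s4 output file is 5 MB, and a ~1 h digi+mixing job
 # works through roughly 500 MB of input
 FILE_MB=5
 STAGE2_MB_PER_HOUR=500
 echo "input files per stage-2 job: $(( STAGE2_MB_PER_HOUR / FILE_MB ))"   # -> 100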

Prestaging

Before submitting any jobs to the grid, make sure that tape-based input datasets, if any, are pre-staged to disk.