Difference between revisions of "Concatenate"

From Mu2eWiki
(Created page with "It is inefficient to store small files on tape. If a job (e.g. "stage 1") produces small files, <b>and</b> if consumer job ("stage 2") can use multiple files per job, then "s...")
 
 
(5 intermediate revisions by one other user not shown)
Latest revision as of 18:00, 16 February 2021

Introduction

It can be inefficient to store small files in dCache and on tape. Manipulating small files in dCache can be dominated by the time it takes for dCache to read or update its underlying database. Small files are tarred up before they go on tape, and this procedure adds per-file latency, sometimes considerable. The solution is to merge the small files into a dataset of fewer, larger files, each approximately 200 MB to 5 GB.

It is important not to simply make files as large as possible, but to consider how the files will be used. It is easy for jobs to read multiple input files, but we do not have a scheme for a grid job to read part of an input file. A reasonable target is to plan the concatenation so that jobs which run on the new files take about 1 hour per file. See also job planning, dCache and enstore.
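As a rough illustration of the sizing argument above, the following shell sketch estimates how many small files to merge into each output file. All numbers and variable names are hypothetical placeholders, not measured Mu2e values. It takes the smaller of two limits: the number of inputs that keeps the downstream job near one hour, and the number that keeps the merged file under the 5 GB ceiling.

```shell
#!/bin/sh
# Hypothetical inputs: replace with values measured for your own dataset.
input_size_mb=20      # average size of one small input file, in MB
secs_per_input=120    # downstream processing time per small file, in seconds
target_secs=3600      # aim for about 1 hour of downstream processing per merged file
max_size_mb=5000      # upper end of the suggested merged-file size (5 GB)

# Number of inputs that fills about one hour of downstream processing.
n_by_time=$(( target_secs / secs_per_input ))
# Number of inputs that fits below the size ceiling.
n_by_size=$(( max_size_mb / input_size_mb ))

# Use the more restrictive limit, but always merge at least one file.
merge_factor=$(( n_by_time < n_by_size ? n_by_time : n_by_size ))
if [ "$merge_factor" -lt 1 ]; then
    merge_factor=1
fi

echo "merge factor: ${merge_factor} inputs per output file"
echo "expected output size: $(( merge_factor * input_size_mb )) MB"
```

With these placeholder values the time limit is the binding one. If the expected output size falls below roughly 200 MB, the inputs are probably too small or too slow to process for both targets to be met at once, and the trade-off has to be decided by hand.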

Production datasets of the first stage of a simulation project often have the string g4s1 in the dataset name, and we have been replacing this with cs1 to create the concatenated dataset name.
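The renaming convention can be sketched in a few lines of shell. The dataset name below is a made-up example used only for illustration; only the g4s1 to cs1 substitution comes from the convention described above.

```shell
#!/bin/sh
# Hypothetical stage-1 dataset name, for illustration only.
input_dataset="sim.mu2e.cd3-beam-g4s1-example.v0.art"

# Replace the first occurrence of "g4s1" with "cs1".
output_dataset=$(printf '%s\n' "$input_dataset" | sed 's/g4s1/cs1/')

echo "$output_dataset"
```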

Root files

Simple root files, such as those produced by TFileService (the ntuple or histogram file from an art job), can be concatenated with the root hadd utility:

(setup an appropriate Offline version)
hadd output.root input1.root input2.root ...

Art files

Art files are more complicated root files and can't be concatenated by the root hadd utility.

A few files can be concatenated by a mu2e command:

(setup an appropriate Offline version)
mu2e -c JobConfig/common/artcat.fcl -o output.art -s input1.art -s input2.art ...
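When there are more than a handful of inputs, typing each `-s` argument by hand becomes error-prone. The sketch below builds the same command from a list file. The file name `inputs.txt` and its contents are hypothetical, and the script only prints the command (a dry run) rather than executing it.

```shell
#!/bin/sh
# Hypothetical list file: one art file path per line.
printf '%s\n' input1.art input2.art input3.art > inputs.txt

# Build the argument list: one "-s <file>" pair per non-empty line.
set --
while IFS= read -r f; do
    if [ -n "$f" ]; then
        set -- "$@" -s "$f"
    fi
done < inputs.txt

# Dry run: print the command instead of executing it.
cmd="mu2e -c JobConfig/common/artcat.fcl -o output.art $*"
echo "$cmd"

# To run it for real, in a shell where Offline has been set up:
# mu2e -c JobConfig/common/artcat.fcl -o output.art "$@"
```

Printing the command first makes it easy to check the input order before committing to a long-running job.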

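Large datasets can be concatenated by following parts of the MC production workflow, which is also good for an overview. The relevant parts are:

* generating a fcl file set, using the concatenation example
* submitting those fcl files to the grid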

Notes

Note that reading and writing an art file can "update" it. If root version X reads a file written by root version Y, it will write it out as root version X. (Note that X can only be greater than Y.) This can have significant consequences if there is a major version difference between X and Y. Primarily, the file can no longer be read by root version Y, and this will block the use of some set of Offline versions.

This page needs expert review!

A similar effect can occur with the file's art products. If a product is written within a certain release X and then concatenated in a more advanced release Y, and there was a change in the definition of a product, such as adding a variable, then what will happen?

If there is no change in products, then root can copy the events without looking into the products; this mode is called "fast cloning". If there is a change in products, art will detect this and issue an error message. In this case, you have to turn off fast cloning and let art unpack the products and repack them in the newer format. You will need to set the switch:

outputs.out.fastCloning: false

Note that the output file can't be read by release X any more, since the advanced version of the product will be unrecognized.
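One way to apply the switch without editing the shared configuration is a small wrapper fcl file, sketched below. The wrapper file name `artcat_noclone.fcl` is hypothetical, and the sketch assumes that the `#include` path resolves in a shell where an appropriate Offline version has been set up. The final command is only printed, not run.

```shell
#!/bin/sh
# Write a two-line wrapper configuration (hypothetical file name).
cat > artcat_noclone.fcl <<'EOF'
#include "JobConfig/common/artcat.fcl"
outputs.out.fastCloning: false
EOF

# Dry run: show the command that would use the wrapper.
echo "mu2e -c artcat_noclone.fcl -o output.art -s input1.art -s input2.art"
```

Because the override comes after the `#include`, it replaces only the fast-cloning setting and leaves the rest of the included configuration unchanged.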