Difference between revisions of "Concatenate"

From Mu2eWiki

Revision as of 19:56, 11 April 2017

Introduction

It is inefficient to store small files on tape. Manipulating small files in dCache can be dominated by the time it takes for dCache to read or update its underlying database. Small files are tarred up before they go on tape, and this procedure adds per-file latency, sometimes considerable.

If a job (e.g. "stage 1") produces small files, and if the consumer job ("stage 2") can use multiple files per job, then the "stage 1" output files should be concatenated before writing them to tape. It is important not to simply make the files as large as possible, but to consider how the files will be used. It is easy for a job to read multiple files, but we do not have a scheme for a grid job to read part of a file. See also job planning, dCache and enstore.
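As a rough illustration of the sizing trade-off, the sketch below estimates how many input files to merge into each output file and how many output files result. The function names and all sizes are hypothetical; a suitable target size depends on how the consumer jobs will read the files.

```shell
# Sketch only: function names and example numbers are assumptions.

# merge_factor AVG_INPUT_MB TARGET_OUTPUT_MB
# Number of inputs per output so each output is at most the target size.
merge_factor() {
    n=$(( $2 / $1 ))
    # Never go below one input per output.
    if [ "$n" -lt 1 ]; then n=1; fi
    echo "$n"
}

# output_count N_INPUTS MERGE_FACTOR
# Number of output files, rounding up so no input is left out.
output_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}
```

For example, `merge_factor 40 2000` suggests 50 inputs per output, and `output_count 1234 50` then gives 25 output files, each of which a consumer job must read in full.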


Root files

Simple root files, such as those produced by TFileService (the ntuple or histogram file from an art job) can be concatenated with the root hadd utility:

(setup an appropriate Offline version)
hadd output.root input1.root input2.root ...
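When merging by hand it can help to check the inputs before calling hadd. The wrapper below is a minimal sketch, not part of the Mu2e tools: `safe_hadd` and the `HADD` override variable are hypothetical names. It refuses to run if the target already exists or an input is unreadable.

```shell
# Sketch only: safe_hadd and the HADD variable are hypothetical helpers.
# Usage: safe_hadd output.root input1.root input2.root ...
safe_hadd() {
    target="$1"
    shift
    # Do not clobber an existing merged file.
    if [ -e "$target" ]; then
        echo "refusing to overwrite $target" >&2
        return 1
    fi
    # Every input must exist and be readable before merging starts.
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "missing input: $f" >&2
            return 1
        fi
    done
    # HADD can point at a stand-in command for a dry run; default is hadd.
    "${HADD:-hadd}" "$target" "$@"
}
```

Setting `HADD=echo` prints the arguments that would be passed to hadd without running it.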

Art files

Art files are more complicated root files and can't be concatenated by the root hadd utility.

A few files can be concatenated by a mu2e command:

(setup an appropriate Offline version)
mu2e -c JobConfig/cd3/common/artcat.fcl -o output.art -s input1.art -s input2.art ...
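For more than a handful of inputs, typing each `-s` argument becomes tedious. The sketch below builds the same command from a text file with one input path per line. The function name `concat_art_list` and the `MU2E` override variable are hypothetical; the options are the ones used in the command above.

```shell
# Sketch only: concat_art_list and the MU2E variable are hypothetical helpers.
# Usage: concat_art_list FCL OUTPUT LISTFILE
concat_art_list() {
    fcl="$1"
    output="$2"
    list="$3"
    # Reuse the positional parameters to collect "-s <file>" pairs.
    set --
    while IFS= read -r f || [ -n "$f" ]; do
        # Skip blank lines in the list.
        if [ -n "$f" ]; then
            set -- "$@" -s "$f"
        fi
    done < "$list"
    # MU2E can point at a stand-in command for a dry run; default is mu2e.
    "${MU2E:-mu2e}" -c "$fcl" -o "$output" "$@"
}
```

A call such as `concat_art_list JobConfig/cd3/common/artcat.fcl output.art inputs.txt` then runs the same command as above, with one `-s` per line of `inputs.txt`.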

Large datasets can be concatenated by following parts of the MC production workflow; that page gives a good overview. The relevant parts are:

* generating a fcl file set, using the concatenation example
* submitting those fcl files to the grid

Notes

Note that reading and writing an art file can "update" it. If root version X reads a file written by root version Y, it will write it out as root version X. (Note that X can only be greater than Y.) This can have significant consequences if there is a major version difference between X and Y. Primarily, the file can no longer be read by root version Y, and this will block the use of some set of Offline versions.

This page needs expert review!

A similar effect can occur with the file's art products. If a product is written within a certain release X and then concatenated in a more advanced release Y, and there was a change in the definition of a product, such as adding a variable, then what will happen?

If there is no change in products, then root can copy the events without looking into the products; this mode is called "fast cloning". If there is a change in products, art will detect this and issue an error message. In this case, you have to turn off fast cloning and let art unpack the products and repack them in the newer format. You will need to set the switch:

outputs.out.fastCloning: false

Note that the output file can no longer be read by release X, since the newer version of the product will be unrecognized there.
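One way to apply the switch without editing the shared configuration is a small local fcl file that includes `artcat.fcl` and then overrides the parameter. The sketch below writes such a file. The helper name `write_noclone_fcl` and the file name in the usage note are hypothetical, and it assumes the output module in `artcat.fcl` is labeled `out`, as the switch above implies.

```shell
# Sketch only: write_noclone_fcl is a hypothetical helper.
# Usage: write_noclone_fcl artcat_noclone.fcl
write_noclone_fcl() {
    # Include the standard concatenation config, then disable fast cloning.
    cat > "$1" <<'EOF'
#include "JobConfig/cd3/common/artcat.fcl"
outputs.out.fastCloning: false
EOF
}
```

The generated file would then replace `JobConfig/cd3/common/artcat.fcl` in the `-c` argument of the mu2e command, for example `mu2e -c artcat_noclone.fcl -o output.art -s input1.art -s input2.art ...`.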