==Introduction== | |||
It can be inefficient to store small files in [[Dcache|dCache]] and on [[Enstore|tape]]. Manipulating small files in dCache can be dominated by the time it takes for dCache to read or update its underlying database. Small files are [[Enstore|tarred up]] before they go on tape, and this procedure adds per-file latency, sometimes considerable. The solution is to merge the small files into a dataset of fewer, larger files, each approximately 200 MB to 5 GB.
It is important not simply to make files as large as possible, but to consider how the files will be used. It is easy for jobs to read multiple input files, but we do not have a scheme for a grid job to read part of an input file. A reasonable target is to plan the concatenation so that jobs which run on the new files take about one hour per file. See also [[JobPlan|job planning]], [[Dcache|dCache]] and [[Enstore|enstore]].
Production datasets from the first stage of a simulation project often have the string '''g4s1''' in the dataset name, and we have been replacing this with '''cs1''' to create the concatenated dataset name.
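For example (the dataset names below are hypothetical; only the '''g4s1''' → '''cs1''' substitution is the point):
<pre>
sim.mu2e.example-beam-g4s1.1025a.art   (stage-1 output dataset)
sim.mu2e.example-beam-cs1.1025a.art    (concatenated dataset)
</pre>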
==Root files== | |||
Simple root files, such as those produced by TFileService (the ntuple or histogram file from an art job), can be concatenated with the root '''hadd''' utility:
<pre>
(setup an appropriate Offline version) | |||
hadd output.root input1.root input2.root ... | |||
</pre>
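For example, to merge a set of ntuple files into one (the file names here are hypothetical; the '''-f''' flag tells hadd to overwrite an existing output file):
<pre>
hadd -f nts.mu2e.example-merged.1025a.root nts.mu2e.example.1025a.*.root
</pre>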
==Art files== | |||
Art files are more complicated root files and can't be concatenated by the root hadd utility. | |||
A few files can be concatenated by a mu2e command:
<pre>
(setup an appropriate Offline version) | |||
mu2e -c JobConfig/common/artcat.fcl -o output.art -s input1.art -s input2.art ... | |||
</pre> | |||
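When there are more than a few inputs, it may be more convenient to put the file names in a text file and use art's '''-S''' (source list) option instead of repeated '''-s''' flags (the input path below is illustrative):
<pre>
(setup an appropriate Offline version)
ls /path/to/inputs/*.art > inputs.txt
mu2e -c JobConfig/common/artcat.fcl -o output.art -S inputs.txt
</pre>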
Large datasets can be concatenated by following parts of the [[MCProdWorkflow|MC production workflow]]; that page also gives a good overview. The relevant parts are:
* [[GenerateFcl|generating]] a fcl file set, using the concatenation example (a sketch of what such a fcl contains is shown below)
* [[SubmitJobs|submitting]] those fcl files to the grid
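For orientation, a concatenation fcl is essentially just an input source feeding an output module, with no processing in between. A minimal sketch of what '''JobConfig/common/artcat.fcl''' amounts to (the process name and output file name here are illustrative):
<pre>
# Minimal concatenation job: read events with RootInput and
# write them unchanged with RootOutput.
process_name : concat

source : { module_type : RootInput }

physics : {
  e1        : [ out ]
  end_paths : [ e1 ]
}

outputs : {
  out : {
    module_type : RootOutput
    fileName    : "concatenated.art"
  }
}
</pre>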
==Notes== | |||
Note that reading and writing an art file can "update" it. If root version X reads a file written by root version Y, it will write the output in root version X format (note that X can only be greater than Y, since older root cannot read files written by newer root). This can have significant consequences if there is a major version difference between X and Y. Primarily, the file can no longer be read by root version Y, and this will block the use of some set of Offline versions.
{{Expert}} | |||
A similar effect can occur with the file's art products. If a product is written with a certain release X and then concatenated in a more advanced release Y, and the definition of the product changed in between, such as adding a variable, what happens?
If there is no change in products, then root can copy the events without looking into the products; this mode is called "fast cloning". If there is a change in products, art will detect this and issue an error message. In that case, you have to turn off fast cloning and let art unpack the products and repack them in the newer format, by setting the switch:
<pre>
outputs.out.fastCloning : false
</pre>
Note that the output file can no longer be read by release X, since the newer version of the product will not be recognized.
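One way to set this switch (a sketch, assuming '''JobConfig/common/artcat.fcl''' is on your FHICL_FILE_PATH and that its output module is named '''out''', as above) is a small wrapper fcl:
<pre>
# wrapper.fcl: run the standard concatenation job with fast cloning disabled
#include "JobConfig/common/artcat.fcl"
outputs.out.fastCloning : false
</pre>
Then pass the wrapper fcl to '''mu2e -c''' in place of artcat.fcl.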
[[Category:Computing]] | |||
[[Category:Workflows]] | |||