Concatenate
Introduction
It is inefficient to store small files on tape. Manipulating small files in dCache can be dominated by the time it takes for dCache to read or update its underlying database. Small files are tarred up before they go on tape, and this procedure adds per-file latency, sometimes considerable.
If a job (e.g. "stage 1") produces small files, and if a consumer job ("stage 2") can use multiple files per job, then the "stage 1" output files should be concatenated before they are written to tape. It is important not to simply make files as large as possible, but to consider how the files will be used. It is easy for a job to read multiple files, but we do not have a scheme for a grid job to read part of a file. See also job planning, dCache and enstore.
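As a rough illustration of choosing a file size deliberately rather than maximizing it, the following Python sketch groups small files into concatenation batches near a chosen target size. The function name `plan_batches`, the file names, and the sizes are all hypothetical. The right target depends on how the "stage 2" jobs will consume the files.

```python
from typing import List, Tuple


def plan_batches(files: List[Tuple[str, int]], target_bytes: int) -> List[List[str]]:
    """Group (name, size) pairs, in order, into batches of roughly target_bytes.

    A file larger than the target ends up in a batch of its own.
    """
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical stage 1 outputs with made-up sizes (arbitrary units).
example = [("a.art", 40), ("b.art", 50), ("c.art", 30), ("d.art", 200), ("e.art", 10)]
batches = plan_batches(example, target_bytes=100)
```

Each inner list would then become the input list of one concatenation job.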
Root files
Simple root files, such as those produced by TFileService (the ntuple or histogram file from an art job), can be concatenated with the root hadd utility:
(setup an appropriate Offline version)
hadd output.root input1.root input2.root ...
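When the input list is long or produced by another script, it can help to build the hadd command programmatically. This is a minimal Python sketch, assuming hadd is on the PATH once an Offline version has been set up. `hadd_command` is a hypothetical helper, not part of root or Offline.

```python
import shlex
from typing import List


def hadd_command(output: str, inputs: List[str]) -> List[str]:
    """Build the argument list for: hadd output.root input1.root input2.root ..."""
    if not inputs:
        raise ValueError("need at least one input file")
    if output in inputs:
        raise ValueError("output file must not also be an input")
    return ["hadd", output, *inputs]


cmd = hadd_command("output.root", ["input1.root", "input2.root"])
print(shlex.join(cmd))  # shows the command without running it
# To run it in a configured environment:
#   import subprocess; subprocess.run(cmd, check=True)
```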
Art files
Art files are more complicated root files and can't be concatenated by the root hadd utility.
A few files can be concatenated by a mu2e command:
(setup an appropriate Offline version)
mu2e -c JobConfig/cd3/common/artcat.fcl -o output.art -s input1.art -s input2.art ...
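Unlike hadd, the mu2e command above takes each input file behind its own -s flag. The Python sketch below assembles that argument list for an arbitrary number of inputs. `artcat_command` is a hypothetical helper, and the default fcl path is simply the one quoted above.

```python
from typing import List


def artcat_command(
    output: str,
    inputs: List[str],
    fcl: str = "JobConfig/cd3/common/artcat.fcl",
) -> List[str]:
    """Build: mu2e -c <fcl> -o <output> -s <input1> -s <input2> ..."""
    if not inputs:
        raise ValueError("need at least one input file")
    cmd = ["mu2e", "-c", fcl, "-o", output]
    for path in inputs:
        # Every input file is passed with its own -s flag.
        cmd += ["-s", path]
    return cmd


art_cmd = artcat_command("output.art", ["input1.art", "input2.art"])
```

The resulting list can be passed to `subprocess.run(art_cmd, check=True)` in an environment where an appropriate Offline version is set up.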
Large datasets can be concatenated by following parts of the MC production workflow. The workflow documentation gives a good overview. The relevant parts are:
- generating a fcl file set, using the concatenation example
- submitting those fcl files to the grid
Notes
Note that reading and writing an art file can "update" it. If root version X reads a file written by root version Y, it will write it out as root version X. (Note that X can only be greater than Y.) This can have significant consequences if there is a major version difference between X and Y. Primarily, the file can no longer be read by root version Y, and this will block the use of some set of Offline versions.
This page needs expert review!
A similar effect can occur with the file's art products. If a product is written in a certain release X and then concatenated in a more advanced release Y, and the definition of the product changed in between, for example by adding a variable, then what will happen?
If there is no change in the products, then root can copy the events without looking into the products. This mode is called "fast cloning". If there is a change in the products, art will detect this and issue an error message. In this case, you have to turn off fast cloning and let art unpack the products and repack them in the newer format. You will need to set the switch:
outputs.out.fastCloning: false
Note that the output file can't be read by release X any more, since the advanced version of the product will be unrecognized.
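One way to apply the fastCloning switch without editing the shared configuration is a small wrapper fcl that includes the concatenation fcl and then appends the override. The Python sketch below only generates the text of such a wrapper. It assumes the base fcl can be found on the fcl search path and that the output module is labeled `out`, as in the switch above. `no_fast_cloning_fcl` is a hypothetical helper.

```python
def no_fast_cloning_fcl(
    base_fcl: str = "JobConfig/cd3/common/artcat.fcl",
    output_label: str = "out",
) -> str:
    """Return the text of a wrapper fcl that disables fast cloning.

    Assumption: the output module in base_fcl is named output_label.
    """
    return (
        f'#include "{base_fcl}"\n'
        # A setting placed after the include overrides the included value.
        f"outputs.{output_label}.fastCloning: false\n"
    )


wrapper_text = no_fast_cloning_fcl()
# To save it under a hypothetical name:
#   with open("artcat_noclone.fcl", "w") as f:
#       f.write(wrapper_text)
```

The saved wrapper would then be given to mu2e with -c in place of the original fcl.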