Concatenate

From Mu2eWiki

Introduction

It can be inefficient to store small files in dCache and on tape. Manipulating small files in dCache can be dominated by the time it takes for dCache to read or update its underlying database. Small files are tarred up before they go on tape, and this procedure adds per-file latency, which is sometimes considerable. The solution is to merge the small files into a dataset of fewer, larger files of approximately 200 MB to 5 GB each.

It is important not to simply make files as large as possible, but to consider how the files will be used. It is easy for jobs to read multiple input files, but we do not have a scheme for a grid job to read part of an input file. A reasonable target is to plan the concatenation so that jobs which run on the new files take about 1 h per file. See also job planning, dCache and enstore.
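The 1 h target can be turned into a merge factor with simple arithmetic. The following is a minimal sketch, not part of any Mu2e tool: the function name merge_factor is hypothetical, and it assumes you have already measured how many seconds a downstream job spends on one small input file.

```shell
# Estimate how many small files to merge into one concatenated file.
# Usage: merge_factor SECONDS_PER_SMALL_FILE [TARGET_SECONDS]
# (hypothetical helper, for illustration only)
merge_factor() {
    mf_per_file=$1
    mf_target=${2:-3600}   # default: the 1 h per file target
    if [ "$mf_per_file" -lt 1 ]; then
        echo "seconds per file must be a positive integer" >&2
        return 1
    fi
    # Integer division: how many small files fit into the target run time
    mf_n=$(( mf_target / mf_per_file ))
    # Always keep at least one input file per output file
    if [ "$mf_n" -lt 1 ]; then
        mf_n=1
    fi
    echo "$mf_n"
}

merge_factor 90    # a job needing 90 s per small file: prints 40
```

After choosing a merge factor, it is worth checking that the resulting file size (merge factor times the typical small-file size) still falls in the 200 MB to 5 GB range.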

Production datasets of the first stage of a simulation project often have the string g4s1 in the dataset name, and we have been replacing this with cs1 to create the concatenated dataset name.
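The renaming convention can be scripted. This is only a sketch: the function name concat_dataset_name and the dataset name in the example are hypothetical, chosen for illustration.

```shell
# Derive the concatenated dataset name by replacing the first
# occurrence of "g4s1" with "cs1" (hypothetical helper).
concat_dataset_name() {
    printf '%s\n' "$1" | sed 's/g4s1/cs1/'
}

# Made-up dataset name, for illustration only
concat_dataset_name "sim.mu2e.example-g4s1-beam.v0.art"
# prints sim.mu2e.example-cs1-beam.v0.art
```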

Root files

Simple root files, such as those produced by TFileService (the ntuple or histogram file from an art job) can be concatenated with the root hadd utility:

(setup an appropriate Offline version)
hadd output.root input1.root input2.root ...

Art files

Art files are more complicated root files and can't be concatenated by the root hadd utility.

A few files can be concatenated by a mu2e command:

(setup an appropriate Offline version)
mu2e -c JobConfig/common/artcat.fcl -o output.art -s input1.art -s input2.art ...
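With more than a handful of inputs, typing one -s option per file becomes tedious. The sketch below builds the same command from a text file that lists one input file per line. The function name artcat_command and the list file name are hypothetical. The function only prints the command so that it can be inspected before it is run.

```shell
# Print a mu2e concatenation command with one -s option per line
# of a file list (hypothetical helper; it does not run mu2e itself).
# Usage: artcat_command OUTPUT_FILE LIST_FILE
artcat_command() {
    artcat_out=$1
    artcat_list=$2
    # Start from the command form shown above
    set -- mu2e -c JobConfig/common/artcat.fcl -o "$artcat_out"
    while IFS= read -r artcat_f; do
        # Skip blank lines in the list
        [ -n "$artcat_f" ] || continue
        set -- "$@" -s "$artcat_f"
    done < "$artcat_list"
    echo "$@"
}

# Example (inputs.txt is an assumed file with one art file name per line):
#   artcat_command output.art inputs.txt
```

File names containing spaces would need quoting before the printed command is pasted into a shell.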

Large datasets can be concatenated by following parts of the MC production workflow. This workflow is good for an overview. The relevant parts are

Notes

Note that reading and writing an art file can "update" it. If root version X reads a file written by root version Y, it will write it out as root version X. (Note, X can only be greater than Y.) This can have significant consequences if there is a major version difference between X and Y. Primarily, the file can no longer be read by root version Y, and this will block the use of some set of Offline versions.

This page needs expert review!

A similar effect can occur with the file's art products. If a product is written within a certain release X and then concatenated in a more advanced release Y, and there was a change in the definition of the product, such as adding a variable, then what will happen?

If there is no change in products, then root can copy the events without looking into the products, and this mode is called "fast cloning". If there is a change in products, art will detect this and issue an error message. In this case, you have to turn off fast cloning and let art unpack the products and repack them in the newer format. You will need to set the switch:

outputs.out.fastCloning: false
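One way to apply this switch without editing the shared configuration is a small wrapper file that includes it and then overrides the parameter. This is a sketch under assumptions: the function name write_noclone_fcl and the wrapper file name are hypothetical, and it assumes that artcat.fcl can be found through the FHiCL search path under the same relative path used with -c, and that its output module is labeled out, as the switch implies.

```shell
# Write a wrapper FHiCL file that includes the concatenation
# configuration and then turns off fast cloning (hypothetical helper).
# Usage: write_noclone_fcl WRAPPER_FILE
write_noclone_fcl() {
    cat > "$1" <<'EOF'
#include "JobConfig/common/artcat.fcl"
outputs.out.fastCloning: false
EOF
}

# Example (the wrapper file name is arbitrary):
#   write_noclone_fcl artcat_noclone.fcl
#   mu2e -c artcat_noclone.fcl -o output.art -s input1.art -s input2.art
```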

Note that the output file can't be read by release X any more, since the advanced version of the product will be unrecognized.