FclIOModules
Latest revision as of 18:09, 16 September 2020
Introduction
This section describes how to configure input and output modules. This includes how to specify filenames, how to skip events from an input file, how to write multiple output files and how to write only selected data products to a particular output file. It also describes a special input source named EmptyEvent.
Reading from Files
When reading from an existing file, art allows one to select the input files, the starting event, the number of events to read, and so on, either from the command line or from the fcl file. If a particular quantity is controlled from both the command line and the fcl file, the value on the command line takes precedence.
The following code fragment tells art to read event data from the file named file01.root, to start at the beginning of the file and read until the end of file is reached:
  source : {
    module_type : RootInput
    fileNames   : [ "file01.root" ]
    maxEvents   : -1
  }
To tell art to read 100 events, or until the end of file, whichever comes first, change the parameter maxEvents to 100. One may also specify a list of input files:
  source : {
    module_type : RootInput
    fileNames   : [ "file01.root", "file02.root", "file03.root" ]
    maxEvents   : 100
  }
One may give an (essentially) unlimited number of files in the list of input files. One may also tell art to skip the first two events and start with the third:
  source : {
    module_type : RootInput
    fileNames   : [ "file01.root", "file02.root", "file03.root" ]
    maxEvents   : 100
    skipEvents  : 2
  }
The list below shows some other parameters that can be included in the source parameter set:
  firstRun      : 0
  firstSubRun   : 0
  firstEvent    : 0
  noEventSort   : false
  skipBadFiles  : false
  fileMatchMode : "permissive"
  inputCommands : ""
The first* parameters specify that the first event to be processed will be the first event that has an EventID greater than or equal to the specified EventID; if one of the first* parameters is not specified, it takes a default value of -1 and is excluded from the comparison.
If a file of unsorted events is read in, art will, by default, present the events for processing in order of increasing event number; a corollary of this is that the output file will contain the events in sorted order. This sorting occurs one input file at a time; art does not sort across file boundaries in a list of input files. If the noEventSort parameter is set to true, the sorting is disabled, which will, in most cases, yield a minor performance improvement.
I have not yet learned the precise meaning of the skipBadFiles and the fileMatchMode parameters.
The inputCommands parameter tells art to delete certain data products after reading the input file; that is, the input file itself is not modified but data products are removed from the copy of the event in memory before any modules are called. The syntax of this language is the same as for outputCommands, described below.
In the pre-art versions of the framework, there were methods to select ranges of events or ranges of SubRuns. This is not yet working in art; the art developers will add this feature back once we have decided exactly what we mean by "ranges of events".
Empty Source
In many simulation applications one wishes to start with an empty event, run one or more event generators, pass the generated particles through Geant4, and so on. In art the first step in this chain is accomplished using a source module named EmptyEvent, as follows:
  source : {
    module_type : EmptyEvent
    maxEvents   : 200
  }
Instead of reading event-data from a file, the empty source increments the event number and presents an empty event to the modules that will do the work. One may configure EmptyEvent to specify the EventID of the first event and the number of events in a run or in a SubRun:
  source : {
    module_type          : EmptyEvent
    firstRun             : 2
    firstSubRun          : 1
    firstEvent           : 1
    numberEventsInRun    : 1000
    numberEventsInSubRun : 100
    maxEvents            : 200
    resetEventOnSubRun   : true
  }
The last option tells art to reset event numbers to start at 1 whenever a new SubRun begins; this is the default behavior and is opposite to the behavior we inherited from CMS.
Configuring Output Modules
Writing all Data Products in All Events to an Output File
The code fragment below shows how to configure art to have one output module that writes every event to the file named "output.root":
  physics: {
    outputFiles: [ out ]
    end_paths:   [ outputFiles ]
  }
  outputs: {
    out: {
      module_type: RootOutput
      fileName:    "output.root"
    }
  }
At first glance this appears a little verbose, with some redundant information; later examples will show more powerful features that require a structure with this level of detail. In the above fragment the identifiers physics, end_paths, outputs and module_type all have special meaning to art. The name RootOutput is the name of a class, supplied by art, that writes event-data to root files. The identifier fileName has special meaning to the class RootOutput. The two other identifiers in this fragment, out and outputFiles, are arbitrary names; that is, the identifier out appears in two places and, so long as both occurrences are replaced by the same thing, the fragment will still work; similarly for the identifier outputFiles.
When art parses this fragment it looks for a parameter named physics.end_paths. This parameter must have a value that is a list of names of paths; it must be a list even though it is legal, as in this example, to have only one path name in the list. Art will then look to find the definition of the path physics.outputFiles. This must be a list of module labels; it must be a list even if it has a length of one. The module labels in the list may refer only to output modules or analyzer modules; it is an error if the label of a producer, a filter or a source module is found in the list. Art then looks to find a module with the label of out and finds it under outputs.
When the job starts, art will create an instance of the RootOutput module, which will open an output file named "output.root". All events from the input file will be written to the output file. All data products found in each event will be written to the output file.
Writing Selected Data Products to an Output File
In the next fragment the configuration of the output module has been altered so that some data products are not written to the output file.
  outputs: {
    out: {
      module_type: RootOutput
      fileName:    "output.root"
      outputCommands : [ "keep *_*_*_*",
                         "drop mu2e::PointTrajectorymv_*_*_*" ]
    }
  }
In the keep/drop commands, the names with the format DataType_ModuleLabel_InstanceName_ProcessName are the four part identifier for a data product. In the example above, the outputCommands parameter should be understood as follows: the output module will write out all data products unless the data product is of type mu2e::PointTrajectorymv. The outputCommands parameter can be an arbitrarily long list that is parsed from the top down using the logic: do the first rule, unless the second rule applies, unless the third rule applies, and so on for all rules. The logic is similar to the allow/deny logic in .htaccess files.
The first command in the list must be one of "keep *_*_*_*" or "drop *_*_*_*".
Having no outputCommands parameter is equivalent to "keep *_*_*_*". Having a blank outputCommands:
  outputCommands : [ ]
is equivalent to "drop *_*_*_*".
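Because each rule can be overridden by a later one, the order of the list matters. The fragment below is a minimal sketch that reuses the mu2e::StrawHits type and the makeSH label from the examples on this page; the combination is illustrative only. It keeps everything, then drops all mu2e::StrawHits collections, then restores only the collection made by makeSH:

```
outputCommands : [ "keep *_*_*_*",                     # rule 1: start by keeping everything
                   "drop mu2e::StrawHits_*_*_*",       # rule 2: ... unless it is a StrawHits collection
                   "keep mu2e::StrawHits_makeSH_*_*" ] # rule 3: ... unless that collection was made by makeSH
```

Swapping the second and third rules would drop every mu2e::StrawHits collection, because the drop rule would then be applied last.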
Writing Selected Events to an Output File
The code fragment below shows how to define a path that contains a filter and how to connect that path to an output module. All events that pass the filter will be written by this output module.
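A minimal sketch of such a configuration, assuming the same hypothetical Filter1 module and the makeSH producer that appear in the two-filter example that follows; the label selectMode0 and the output file name are illustrative choices:

```
physics: {
  producers: {
    makeSH: { module_type: MakeStrawHits }
  }
  filters: {
    selectMode0: {
      module_type: Filter1   # hypothetical filter module
      mode: 0
    }
  }
  path0:         [ makeSH, selectMode0 ]   # trigger path containing the filter
  outputFiles:   [ out1 ]
  trigger_paths: [ path0 ]
  end_paths:     [ outputFiles ]
}
outputs: {
  out1: {
    module_type:  RootOutput
    fileName:     "data01_Mode0.root"   # illustrative file name
    SelectEvents: [ path0 ]             # write only events that complete path0
  }
}
```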
The code fragment below shows how to define two filter modules and use them to direct some events to one output module and some events to another output module. The example also writes different data products to each file.
The first output module:
- Writes its output to the file named data02_Mode0.root
- Only writes out events that complete the path named path0.
- Drops any data product with data type mu2e::PointTrajectorymv.
The second output module:
- Writes its output to the file named data02_Mode1.root
- Only writes out events that complete the path named path1.
- Keeps only two groups of data products, mu2e::StrawHits that were made by the module with the label makeSH and mu2e::CaloHits that were made by any module.
  physics: {
    producers: {
      makeSH: { module_type: MakeStrawHits }
    }
    filters: {
      selectMode0: {
        module_type: Filter1
        mode: 0
      }
      selectMode1: {
        module_type: Filter1
        mode: 1
      }
    }
    path0: [ makeSH, selectMode0 ]
    path1: [ makeSH, selectMode1 ]
    outputFiles:   [ out1, out2 ]
    trigger_paths: [ path0, path1 ]
    end_paths:     [ outputFiles ]
  }
  outputs: {
    out1: {
      module_type:  RootOutput
      fileName:     "data02_Mode0.root"
      SelectEvents: [ path0 ]
      outputCommands : [ "keep *_*_*_*",
                         "drop mu2e::PointTrajectorymv_*_*_*" ]
    }
    out2: {
      module_type:  RootOutput
      fileName:     "data02_Mode1.root"
      SelectEvents: [ path1 ]
      outputCommands : [ "drop *_*_*_*",
                         "keep mu2e::StrawHits_makeSH_*_*",
                         "keep mu2e::CaloHits_*_*_*" ]
    }
  }
In the above, the module Filter1 is presumed to have two distinct modes selected by the mode parameter. The filter can send some events to just one of the files, some events to both files or some events to no files. The two identifiers path0 and path1 are arbitrary. They are the names of paths; that is they are lists of module labels. The parameter physics.trigger_paths is a special name known to art. It is a list of paths; the module labels on these paths must be either producer or filter modules. Art recognizes that the module label makeSH appears in both path0 and path1; it also recognizes that makeSH only needs to be executed once in order to satisfy the requirements of both paths.
Parent/Child Relationships Among Data Products
art maintains a notion of parent/child relationships among data products. Other language used in this context is that some products are "descendants of" others or "depend on" others.
Suppose I have a module that reads product A and writes product C; there is a parent/child relationship between them. If some other module reads C and writes E, then both C and E are descendants of A and E is a descendant of C. And so on for child products of E. Some of the art documentation and error messages say that C and E "depend on" A; this just means that they are descendants of A.
Now consider a module that reads products A and B and writes products C and D. As far as art's bookkeeping is concerned, both C and D are children of both A and B; this is true even if the true parent child relationship is narrower, perhaps A->C and B->D. art does not have a mechanism for us to tell it that the parent-child relationship is narrower so it labels both C and D as children of both A and B.
A special case: Filters/src/CompressDigiMCs_module.cc
In the standard Mu2e simulation workflow, one of the last steps is to produce data products that are targeted at analysis users. In order to reduce file sizes and reduce the time needed to read analysis-ready files, most of the data products are "thinned"; for example StepPointMCs that are not associated with any reco data product (e.g. straw digi, calorimeter cluster or CRV hit) are removed and SimParticles that are not on any chain connecting the primary GenParticle to a reco data product are removed. We also thin the intermediate reco domain data products.
This work is spread over several modules, one of which is Filters/src/CompressDigiMCs_module.cc; as of September 2020, this module reads up to 15 data products and writes 16 data products.
A side effect of this module is that all of the output data products are descendants of all of the input data products. This has the odd effect that reconstructed tracks are descendants of the original CrvDigiMCCollection.
Drop on Input
art's RootInput module supports dropping products on input. The syntax for doing this is very similar to that for selecting which products will be written to which output file: the only difference is that the parameter name is "inputCommands" not "outputCommands". However there is a critical difference in behaviour when compared to selecting products for output.
The default behaviour of the RootInput module is that if you drop a product on input, RootInput will also drop all of its descendants. You can override this behaviour by specifying an additional parameter to RootInput:
  dropDescendantsOfDroppedBranches : false
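In context, the parameter sits next to inputCommands in the source parameter set. A minimal sketch, reusing the file name and the mu2e::PointTrajectorymv type from the earlier examples on this page; it drops that type on input but keeps any products that art records as its descendants:

```
source : {
  module_type   : RootInput
  fileNames     : [ "file01.root" ]
  inputCommands : [ "keep *_*_*_*",
                    "drop mu2e::PointTrajectorymv_*_*_*" ]
  # Override the default: do not drop descendants of the dropped product.
  dropDescendantsOfDroppedBranches : false
}
```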
This default behaviour was chosen as a safety feature to protect against a certain type of inter-product inconsistency. Suppose that you create a parent/child pair of products and then drop the parent. If you recreate the parent in another job, it is possible that the parent/child pair is no longer self-consistent. Experience has shown that many uses of drop-on-input are to recreate the product, not simply to save disk space in the eventual output file. Conversely, most uses of drop-on-output are simply to save disk space. This observation informed the choice of the default behaviour.
Another subtlety of drop on input has to do with the interaction of wildcards with the tree of parent/child relationships. Suppose that you create a product A and its child B; then you write an output file that contains only B. Because product A still has at least one descendant in the output file, the output file retains the Provenance of product A. Now suppose that you read the file and ask to drop product A on input. Even though the payload of product A is no longer in the event, art still knows about product A and the default behaviour is to drop all of its descendants, including product B.
This case was encountered in https://mu2e-hnews.fnal.gov/HyperNews/Mu2e/get/HELPBug/355.html ; it was triggered by the use of a wildcard with the intention to drop a different data product. The issue was solved by specifying the product to drop with its full name and no wildcards.
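The difference between the two styles can be sketched as follows. The product names are hypothetical, assembled from labels used elsewhere on this page (makeSH, Exercise01), and are not the ones from the HyperNews thread; the empty instance name produces the double underscore. The two settings are alternatives, not meant to be used together:

```
# Wildcard form: matches every product made by the module labelled makeSH,
# of any type, instance name or process, including products whose payload
# is gone but whose provenance is still in the file. By default all of
# their descendants are dropped as well.
inputCommands : [ "keep *_*_*_*", "drop *_makeSH_*_*" ]

# Fully qualified form: names exactly one product, so only that product
# and its descendants are affected.
inputCommands : [ "keep *_*_*_*", "drop mu2e::StrawHits_makeSH__Exercise01" ]
```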
File Names
Files that are the output of collaboration production campaigns, or that are written to tape, must be named according to the Mu2e convention.
Schema Evolution and Fast Cloning
Suppose that you have some data product class, MyDP, defined in the file MyDP.h . You run some jobs and write some output files that contain collections of objects of type MyDP. Now suppose that, at a later date you edit MyDP.h, either adding or subtracting some data members.
This process is referred to as "schema evolution". "Schema" is a word borrowed from the database world: the schema of a root file describes, among other things, the data type of each data member of each type of object that is found in the root file. When the definition of one of these objects changes, the schema is said to "evolve".
If the changes are simple enough, then ROOT's automatic schema evolution will almost always do the right thing. If you removed some data members from MyDP.h, and if you read an old file with the new code, ROOT will read the disk file and will simply discard the data for the removed data members. The new-code objects in memory will be the correct subset of the old-code objects on disk.
On the other hand, your new code may contain some additional data members. When you make this change you should update the default constructor of MyDP so that it initializes the new data members appropriately. In this case, when you read old-code objects from disk, the new-code objects in memory will have their newly added data members set to the values given by the default constructor. If you neglect to initialize these new data members in the default constructor, the in-memory values may be uninitialized.
There is an additional complication when you have an input file that was written with one version of the schema, you read it with a program that has a different version of the schema, and then you write an output file. It is possible to write an output file in which objects written with the old schema coexist with objects written with the new schema, but there are limitations on this. The guaranteed safe way of doing things is to write an output file in which the old-schema objects have been translated into new-schema objects. To do this you need to fill the in-memory representation of the objects from the input file and then write those in-memory objects to the output file.
However, the default behaviour of art has a speed optimization that takes a shortcut. If a data product is in both the input and the output file, art's default behaviour is simply to copy the packed data from the input file to the output file. This is true even if the data product was unpacked into memory; this saves the time needed to repack the memory into the output file, which can be significant. This shortcut is called "fast cloning".
If the schema of the input file and the running program are the same, then fast cloning works properly. If, on the other hand, the schema of the input file and of the running program are different, then there may be problems. When this sort of problem happens, art throws an exception and attempts to shut down gracefully. The text of the exception message will look something like:
  %MSG-s ArtException: PostOpenFile 15-Apr-2013 09:48:35 CDT BeforeEvents
  cet::exception caught in art
  ---- FatalRootError BEGIN
    Fatal Root Error: @SUB=TTreeCloner::CollectBranches
    One of the export sub-branches (mu2e::CaloClusters_makeCaloCluster_AlgoCLOSESTSeededByENERGY_Exercise01.obj._distance) is not present in the import TTree.
    cet::exception caught in EventProcessor and rethrown
  ---- FatalRootError END
The name of the data product in the fifth line will differ from one instance of this problem to another.
To work around this you should add the following parameter to the parameter set for each output file in the job:
fastCloning : false
This tells art to do the following for every data product that is destined for an output file: unpack the data product from the input file into memory and repack it into the output file. This is done for every data product, not just those that have had schema evolution. Because fast cloning is usually safe and because it is much faster than slow cloning, the default is for fast cloning to be enabled.
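In context, the parameter sits alongside the other RootOutput parameters. A minimal sketch, reusing the out label and the file name from the earlier output examples on this page:

```
outputs: {
  out: {
    module_type : RootOutput
    fileName    : "output.root"
    # Unpack and repack every data product instead of copying packed data;
    # slower, but safe when the schema has changed since the input was written.
    fastCloning : false
  }
}
```

If a job has several output modules, the parameter must be repeated in the parameter set of each one.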
Aside for ROOT experts: the problem arises only for objects that have been split; the underlying limitation is that part of the schema is bound to the branch hierarchy.