IOModules: Difference between revisions

From Mu2eWiki
==  Introduction ==
This section describes how to configure input and output modules.
This includes how to specify filenames, how to skip events
from an input file, how to write multiple output files and
how to write only selected data products to a particular output file.
It also describes a special input source named EmptySource.


==Reading from Files==
When reading from an existing file, art allows one to select input files,
the starting event, the number of events to read, etc., either from
the command line or from the fcl file.  If a particular quantity is
controlled from both the command line and
the fcl file, the value on the command line takes precedence.
The following code fragment tells art to read event data from the
file named <code>file01.root</code>, to start at the beginning of the file and read until
the end of file is reached:
<pre>
source :{
  module_type : RootInput
  fileNames  : [ "file01.root" ]
  maxEvents  : -1
}
</pre>
To tell art to read 100 events, or until the end of file, whichever comes first,
change the parameter maxEvents to 100. One may also specify a list of input files:
<pre>
source : {
  module_type : RootInput
  fileNames  : [ "file01.root", "file02.root",  "file03.root" ]
  maxEvents  : 100
}
</pre>
One may give an (essentially) unlimited number of files in the list of input files.
One may also tell art to skip the first two events and start with the third:
<pre>
source : {
  module_type : RootInput
  fileNames  : [ "file01.root", "file02.root",  "file03.root" ]
  maxEvents  : 100
  skipEvents  : 2
}
</pre>
The list below shows some other parameters that can be included in the source parameter set:
<pre>
  firstRun            : 0
  firstSubRun          : 0
  firstEvent          : 0
  noEventSort          : false
  skipBadFiles        : false
  fileMatchMode        : "permissive"
  inputCommands        : ""
</pre>
The first* parameters specify that the first event to be processed will be the first event
that has an EventID greater than or equal to the specified event;
if one of the first* parameters is not specified, it takes a default value of -1 and
is excluded from the comparison.
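As an illustration of the first* parameters, the following sketch asks art to start processing at the first event whose EventID is greater than or equal to run 2, SubRun 1, event 50. The run, SubRun and event numbers are made up for this example; the parameter names are the ones listed above.
<pre>
source : {
  module_type : RootInput
  fileNames   : [ "file01.root" ]
  maxEvents   : -1
  # Hypothetical starting point: run 2, SubRun 1, event 50.
  firstRun    : 2
  firstSubRun : 1
  firstEvent  : 50
}
</pre>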
If a file of unsorted events is read in, art will, by default, present the events for
processing in order of increasing event number; a corollary of this is that the
output file will contain the events in sorted order. This sorting occurs one input file
at a time; art does not sort across file boundaries in a list of input files.
If the noEventSort parameter is set to true, the sorting is disabled, which will, in
most cases, yield a minor performance improvement.  I have not yet learned the precise
meaning of the skipBadFiles and the fileMatchMode parameters.
The inputCommands parameter tells art to delete certain data
products after reading the input file; that is, the input file itself is not modified but
data products are removed from the copy of the event in memory before any modules are called.
The syntax of this language is the same as for outputCommands, described below.
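A possible use of inputCommands is sketched below. It assumes that inputCommands accepts a list of keep/drop strings in the same form as outputCommands (see the section Writing Selected Data Products to an Output File); the data product type is borrowed from the examples in that section.
<pre>
source : {
  module_type   : RootInput
  fileNames     : [ "file01.root" ]
  maxEvents     : -1
  # Assumed form: keep everything, then drop one data type from
  # the in-memory copy of each event. The input file is not modified.
  inputCommands : [ "keep *_*_*_*"
                    ,"drop mu2e::PointTrajectorymv_*_*_*"
                  ]
}
</pre>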
In the pre-art versions of the framework, there were methods to select ranges of events
or ranges of SubRuns. This is not yet working in art; the art developers will
add this feature back once we have decided exactly what we mean by "ranges of events".
==Empty Source==
In many simulation applications one wishes to start with an empty event, run one or more
event generators, pass the generated particles through Geant4, and so on.  In art
the first step in this chain is accomplished using a source module, referred to here
as EmptySource, whose module_type is EmptyEvent, as follows:
<pre>
source :{
  module_type : EmptyEvent
  maxEvents  : 200
}
</pre>
Instead of reading event-data from a file, the empty source increments the event number
and presents an empty event to the modules that will do the work.
One may configure EmptySource to specify the EventID of the first event and
the maximum number of events in a Run or in a SubRun:
<pre>
source :{
  module_type          : EmptyEvent
  firstRun            : 2
  firstSubRun          : 1
  firstEvent          : 1
  numberEventsInRun    : 1000
  numberEventsInSubRun :  100
  maxEvents            : 200
  resetEventOnSubRun  : true
}
</pre>
The last option tells art to reset event numbers to start at 1 whenever
a new SubRun begins; this is the default behavior
and is opposite to the behavior we inherited from CMS.
==Configuring Output Modules ==
===Writing all Data Products in All Events to an Output File===
The code fragment below shows how to configure art to have one output module
that writes every event to the file named "output.root":
 <font color=blue>physics</font>: {
   <font color=red>outputFiles</font>:  [ <font color=red>out</font> ]
   <font color=blue>end_paths</font>:    [ <font color=red>outputFiles</font> ]
 }
 <font color=blue>outputs</font>: {
   <font color=red>out</font>: {
     <font color=blue>module_type</font>: <font color=green>RootOutput</font>
     <font color=green>fileName</font>: "output.root"
   }
 }
At first glance this appears a little verbose, with some redundant information;
later examples will show more powerful features that require a structure
with this level of detail.
In the above fragment the identifiers <font color=blue>physics</font>,
<font color=blue>end_paths</font>, <font color=blue>outputs</font>
and <font color=blue>module_type</font>
all have special meaning to art. The
name <font color=green>RootOutput</font> is the name of a class, supplied by art,
that writes event-data to root files.  The identifier
<font color=green>fileName</font> has special meaning to the class <font color=green>RootOutput</font>.
The two other identifiers in this fragment,
<font color=red>out</font> and <font color=red>outputFiles</font>, are arbitrary names; that is, the identifier
<font color=red>out</font> appears in two places and, so long as both occurrences are replaced by the same
thing, the fragment will still work; similarly for the identifier <font color=red>outputFiles</font>.
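To make the point about arbitrary names concrete, here is the same fragment with the output module label changed to rootout and the path name changed to e1. Both replacement names are invented for this illustration; the behavior should be unchanged.
<pre>
physics: {
  # e1 replaces outputFiles in both places where it appeared.
  e1:        [ rootout ]
  end_paths: [ e1 ]
}
outputs: {
  # rootout replaces out, here and in the path definition above.
  rootout: {
    module_type: RootOutput
    fileName: "output.root"
  }
}
</pre>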
When art parses this fragment it looks for a parameter named
physics.end_paths.  This parameter must have a value that is a list
of names of paths; it must be a list even though it is legal, as in this example, to have
only one path name in the list.
Art will then look to find the definition of the path physics.outputFiles.
This must be a list of module labels; it must be a list even if it has
a length of one.  The module labels in the list may refer only to output modules
or analyzer modules; it is an error if the label of a producer, a filter or
a source module is found in the list.  Art then looks to find a module with the
label of <font color=red>out</font> and finds it under outputs.
When the job starts, art will create an instance of the RootOutput module, which will open
an output file named "output.root".  All events from the input file will
be written to the output file. All data products found in each event will
be written to the output file.
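Since an end path may contain analyzer modules as well as output modules, a configuration along the following lines should also be legal. This is only a sketch: the analyzer class MyAnalyzer and the label ana are hypothetical names, and the analyzer is assumed to be declared in a physics.analyzers block.
<pre>
physics: {
  analyzers: {
    # Hypothetical analyzer module; substitute a real class name.
    ana: { module_type: MyAnalyzer }
  }
  # One end path holding an analyzer and an output module.
  outputFiles: [ ana, out ]
  end_paths:   [ outputFiles ]
}
outputs: {
  out: {
    module_type: RootOutput
    fileName: "output.root"
  }
}
</pre>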
===Writing Selected Data Products to an Output File===
In the next fragment the configuration of the output module has been altered
so that some data products are not written to the output file.
<pre>
outputs: {
  out: {
    module_type: RootOutput
    fileName: "output.root"
    outputCommands :  [ "keep *_*_*_*"
                        ,"drop mu2e::PointTrajectorymv_*_*_*"
                      ]
  }
}
</pre>
In the keep/drop commands, each pattern has the format DataType_ModuleLabel_InstanceName_ProcessName,
which is the four-part identifier of a data product; a * in any field matches any value.
The outputCommands parameter should be understood as follows:
the output module will write out all data products unless the data
product is of type mu2e::PointTrajectorymv.
The outputCommands parameter can be an arbitrarily long list that is parsed
from the top down using the logic: do the first rule, unless the second rule
applies, unless the third rule applies, and so on for all rules.
The logic is similar to the allow/deny logic in .htaccess files.
Bill Tanenbaum recommends that the first command always be drop * or keep *,
and then apply keep or drop relative to that state.
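The top-down logic can be illustrated with a three-rule list. In this sketch, which reuses the data product type and module label that appear in the examples on this page, everything is kept, then all mu2e::StrawHits are dropped, then those made by the module labeled makeSH are kept again because the later rule takes precedence.
<pre>
outputs: {
  out: {
    module_type: RootOutput
    fileName: "output.root"
    # Rule 1 sets the starting state; rules 2 and 3 refine it in order.
    outputCommands :  [ "keep *_*_*_*"
                        ,"drop mu2e::StrawHits_*_*_*"
                        ,"keep mu2e::StrawHits_makeSH_*_*"
                      ]
  }
}
</pre>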
===Writing Selected Events to an Output File===
The code fragment below shows how to define a path that contains a filter and
how to connect that path to an output module.  All events that pass the filter
will be written by this output module.
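A minimal sketch of such a configuration follows. It assumes a filter class named Filter1 with a mode parameter, as in the two-filter example in this section; the labels selectMode0, path0, outputFiles and out, and the file name selected.root, are arbitrary choices made for this illustration.
<pre>
physics: {
  filters: {
    selectMode0: {
      module_type: Filter1
      mode: 0
    }
  }
  # A trigger path that contains only the filter.
  path0:         [ selectMode0 ]
  outputFiles:   [ out ]
  trigger_paths: [ path0 ]
  end_paths:     [ outputFiles ]
}
outputs: {
  out: {
    module_type: RootOutput
    fileName: "selected.root"
    # Write only those events that complete path0.
    SelectEvents: { SelectEvents: [ path0 ] }
  }
}
</pre>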
The code fragment below shows how to define two filter modules and use them
to direct some events to one output module and some events to another output module.
The example also writes different data products to each file.
The first output module:
<ol>
<li> Writes its output to the file named <b>data02_Mode0.root</b>
<li> Only writes out events that complete the path named <b>path0</b>.
<li> Drops any data product with data type <b>mu2e::PointTrajectorymv</b>.
</ol>
The second output module:
<ol>
<li> Writes its output to the file named <b>data02_Mode1.root</b>
<li> Only writes out events that complete the path named <b>path1</b>.
<li> Keeps only two groups of data products, <b>mu2e::StrawHits</b> that were made by the module with the label <b>makeSH</b> and <b>mu2e::CaloHits</b> that were made by any module.
</ol>
<pre>
physics: {
  producers: {
    makeSH: { module_type: MakeStrawHits }
  }
  filters: {
    selectMode0: {
      module_type: Filter1
      mode: 0
    }
    selectMode1: {
      module_type: Filter1
      mode: 1
    }
  }
  path0: [ makeSH, selectMode0 ]
  path1: [ makeSH, selectMode1 ]
  outputFiles:  [ out1, out2 ]
  trigger_paths: [ path0, path1 ]
  end_paths:    [ outputFiles ]
}
outputs: {
  out1: {
    module_type: RootOutput
    fileName: "data02_Mode0.root"
    SelectEvents: { SelectEvents: [ path0 ] }
    outputCommands :  [ "keep *_*_*_*"
                        ,"drop mu2e::PointTrajectorymv_*_*_*"
                      ]
  }
  out2: {
    module_type: RootOutput
    fileName: "data02_Mode1.root"
    SelectEvents: { SelectEvents: [ path1 ] }
    outputCommands :  [ "drop *_*_*_*"
                        ,"keep mu2e::StrawHits_makeSH_*_*"
                        ,"keep mu2e::CaloHits_*_*_*"
                      ]
  }
}
</pre>
In the above, the module Filter1 is presumed to have two distinct modes selected by the
mode parameter.  The filter can send some events to just one of the files, some events
to both files or some events to no files.  The two identifiers path0 and path1
are arbitrary.  They are the names of paths; that is, they are lists of module labels.
The parameter physics.trigger_paths is a special name known to art.  It is a list of
paths; the module labels on these paths must be either producer or filter modules.
Art recognizes that the module label makeSH appears in both path0 and path1; it also
recognizes that makeSH only needs to be executed once in order to satisfy the requirements
of both paths.
==Schema Evolution and Fast Cloning==
Suppose that you have some data product class, MyDP, defined in
the file MyDP.h. You
run some jobs and write some output files that contain collections of
objects of type MyDP. Now suppose that, at
a later date, you edit MyDP.h, either adding or removing some data
members.
This process is referred to as "schema evolution".  "Schema" is a word
borrowed from the database world: the schema of a root file describes, among
other things, the data type of each data member of each type of object
that is found in the root file.  When the definition of one of these
objects changes, the schema is said to "evolve".
If the changes are simple enough, then ROOT's automatic schema evolution will
almost always do the right thing.  If you removed some data members from
MyDP.h, and if you read an old file with the new code, ROOT will read the disk
file and will simply discard the data for the removed data members. The
new-code objects in memory will be the correct subset of the old-code objects on disk.
On the other hand, your new code may contain some additional data members.
When you make this change you should update the default constructor of MyDP
so that it initializes the new data members appropriately.  In this case,
when you read old-code objects from disk, the new-code objects in memory will
have their newly added data members set to the values given by the default
constructor.  If you neglect to initialize these new data members in the
default constructor, it is possible that the in-memory values may contain
uninitialized memory.
There is an additional complication when you have an input file that
was written with one version of the schema, you read it with a program that
has a different version of the schema, and then you write an output file.
It is possible to write an output file in which objects written with the old schema
coexist with objects written with the new schema - but there are limitations on
this.  The guaranteed safe way of doing things is to write an output file in which
the old-schema objects have been translated into new schema objects.
To do this you need
to fill the in-memory representation of the objects from the input file and
then write those in-memory objects to the output file.
However the default behaviour of art has a speed optimization that takes
a shortcut.  If a data product is in both the input and the output file,
art's default behaviour is simply to copy the packed data from the input
file to the output file.  This is true even if the data product was
unpacked into memory; this saves the time needed to repack the memory into
the output file, which can be significant.  This shortcut is called
"fast cloning".  If the schema of the input file
and the running program are the same, then fast cloning works properly.
If, on the other hand, the schema of the input file and of the running program are
different, then there may be problems. When this sort of problem happens,
art throws an exception and attempts to shut down gracefully.
The text of the exception message will look something like:
<pre>
%MSG-s ArtException:  PostOpenFile 15-Apr-2013 09:48:35 CDT BeforeEvents
cet::exception caught in art
---- FatalRootError BEGIN
  Fatal Root Error: @SUB=TTreeCloner::CollectBranches
  One of the export sub-branches (mu2e::CaloClusters_makeCaloCluster_AlgoCLOSESTSeededByENERGY_Exercise01.obj._distance) is not present in the import TTree.
  cet::exception caught in EventProcessor and rethrown
---- FatalRootError END
</pre>
The name of the data product in the fifth line will differ from one instance of this problem to another.
To work around this you should add the following parameter to the parameter set
for each output file in the job:
<pre>
fastCloning : false
</pre>
This tells ROOT to do the following for every data product that is destined
for an output file: unpack the data product from the input file into memory and
repack it into the output file.  ROOT will do this for every data product, not
just those that have had schema evolution. Because fast cloning is usually
safe and because it is much faster than slow cloning, the default is for
fast cloning to be enabled.
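For a job with more than one output file, the parameter would be added to each output module. The sketch below does this for the two output modules of the earlier two-filter example; the SelectEvents and outputCommands parameters are left out only to keep it short.
<pre>
outputs: {
  out1: {
    module_type: RootOutput
    fileName: "data02_Mode0.root"
    # Disable fast cloning for this output file.
    fastCloning : false
  }
  out2: {
    module_type: RootOutput
    fileName: "data02_Mode1.root"
    # The setting is per output module, so repeat it here.
    fastCloning : false
  }
}
</pre>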
Aside for ROOT experts:
the problem arises only for objects that have been split; the underlying limitation
is that part of the schema is bound to the branch hierarchy.

Latest revision as of 17:07, 24 March 2017