Difference between revisions of "IOModules"

From Mu2eWiki
==  Introduction ==
 
This section describes how to configure input and output modules.
 
This includes how to specify filenames, how to skip events
 
from an input file, how to write multiple output files and
 
how to write only selected data products to a particular output file.
 
It also describes a special input source named EmptyEvent.
 
  
==Reading from Files==
 
 
When reading from an existing file, art allows one to select input files,
the starting event, the number of events to read, etc., either from
 
the command line or from the fcl file.  If a particular quantity is
 
controlled from both the command line and
 
the fcl file, the value on the command line takes precedence.
 
 
The following code fragment tells art to read event data from the
 
file named <code>file01.root</code>, to start at the beginning of the file and read until
 
the end of file is reached:
 
 
<pre>
 
source :{
 
  module_type : RootInput
 
  fileNames  : [ "file01.root" ]
 
  maxEvents  : -1
 
}
 
</pre>
 
 
To tell art to read 100 events, or until the end of file, whichever comes first,
 
change the parameter maxEvents to 100. One may also specify a list of input files:
 
 
<pre>
 
source : {
 
  module_type : RootInput
 
  fileNames  : [ "file01.root", "file02.root",  "file03.root" ]
 
  maxEvents  : 100
 
}
 
</pre>
 
 
One may give an (essentially) unlimited number of files in the list of input files.
 
One may also tell art to skip the first two events and start with the third:
 
 
<pre>
 
source : {
 
  module_type : RootInput
 
  fileNames  : [ "file01.root", "file02.root",  "file03.root" ]
 
  maxEvents  : 100
 
  skipEvents  : 2
 
}
 
</pre>
 
 
The list below shows some other parameters that can be included in the source parameter set:
 
<pre>
 
  firstRun            : 0
 
  firstSubRun          : 0
 
  firstEvent          : 0
 
  noEventSort          : false
 
  skipBadFiles        : false
 
  fileMatchMode        : "permissive"
 
  inputCommands        : ""
 
</pre>
 
 
The first* parameters specify that the first event to be processed will be the first event
 
that has an EventID greater than or equal to the specified event;
 
if one of the first* parameters is not specified, it takes a default value of -1 and
 
is excluded from the comparison.
 
If a file of unsorted events is read in, art will, by default, present the events for
 
processing in order of increasing event number; a corollary of this is that the
 
output file will contain the events in sorted order. This sorting occurs one input file
 
at a time; art does not sort across file boundaries in a list of input files.
 
If the noEventSort parameter is set to true, the sorting is disabled, which will, in
most cases, yield a minor performance improvement.  I have not yet learned the precise
 
meaning of the skipBadFiles and the fileMatchMode parameters.
 
The inputCommands parameter tells art to delete certain data
 
products after reading the input file; that is, the input file itself is not modified but
 
data products are removed from the copy of the event in memory before any modules are called.
 
The syntax of this language is the same as for outputCommands, described below.
 
In the pre-art versions of the framework, there were methods to select ranges of events
 
or ranges of SubRuns. This is not yet working in art; the art developers will
 
add this feature back once we have decided exactly what we mean by "ranges of events".
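As an illustration of the parameters discussed above, the following sketch starts reading at a chosen EventID, disables the per-file sorting and drops one data type on input. The run, SubRun and event numbers are arbitrary values chosen for the example, the data type is borrowed from the outputCommands examples later on this page, and writing inputCommands as a list of keep/drop strings assumes that it takes the same form as outputCommands:

<pre>
source : {
  module_type   : RootInput
  fileNames     : [ "file01.root" ]
  maxEvents     : -1

  # Begin with the first event whose EventID is >= run 2, SubRun 1, event 50
  # (illustrative numbers).
  firstRun      : 2
  firstSubRun   : 1
  firstEvent    : 50

  # Present events in the order in which they are stored in the file.
  noEventSort   : true

  # Remove one data type from the in-memory event; the input file is unchanged.
  inputCommands : [ "keep *_*_*_*",
                    "drop mu2e::PointTrajectorymv_*_*_*" ]
}
</pre>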
 
 
 
==Empty Source==
 
 
In many simulation applications one wishes to start with an empty event, run one or more
 
event generators, pass the generated particles through Geant4, and so on.  In art
the first step in this chain is accomplished using a source module named
EmptyEvent, as follows:
 
 
<pre>
 
source :{
 
  module_type : EmptyEvent
 
  maxEvents  : 200
 
}
 
</pre>
 
 
Instead of reading event-data from a file, the empty source increments the event number
 
and presents an empty event to the modules that will do the work.
 
One may configure EmptyEvent to specify the EventID of the first event and
the maximum number of events in a run or in a SubRun:
 
 
<pre>
 
source :{
 
  module_type          : EmptyEvent
 
  firstRun            : 2
 
  firstSubRun          : 1
 
  firstEvent          : 1
 
  numberEventsInRun    : 1000
 
  numberEventsInSubRun :  100
 
  maxEvents            : 200
 
  resetEventOnSubRun  : true
 
}
 
</pre>
 
 
The last option tells art to reset event numbers to start at 1 whenever
a new SubRun begins; this is the default behavior
 
and is opposite to the behavior we inherited from CMS.
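To make the numbering concrete, here is the same configuration annotated with the EventIDs that I would expect it to produce. The annotations are my reading of the parameters, not output copied from a job:

<pre>
source : {
  module_type          : EmptyEvent
  firstRun             : 2
  firstSubRun          : 1
  firstEvent           : 1
  numberEventsInRun    : 1000
  numberEventsInSubRun : 100
  maxEvents            : 200
  resetEventOnSubRun   : true
}

# Expected sequence of EventIDs (run, SubRun, event):
#   (2, 1, 1) ... (2, 1, 100)    the first 100 events fill SubRun 1
#   (2, 2, 1) ... (2, 2, 100)    the event number is reset to 1 in SubRun 2
# The job stops after 200 events, before numberEventsInRun is reached.
#
# With resetEventOnSubRun : false, the second SubRun would instead be
# expected to hold events 101 ... 200.
</pre>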
 
 
 
==Configuring Output Modules ==
 
 
===Writing all Data Products in All Events to an Output File===
 
 
The code fragment below shows how to configure art to have one output module
 
that writes every event to the file named "output.root":
 
 
<font color=blue>physics</font>: {
 
  <font color=red>outputFiles</font>:  [ <font color=red>out</font> ]
 
  <font color=blue>end_paths</font>:    [ <font color=red>outputFiles</font> ]
 
}
 
 
<font color=blue>outputs</font>: {
 
  <font color=red>out</font>: {
 
    <font color=blue>module_type</font>: <font color=green>RootOutput</font>
 
    <font color=green>fileName</font>: "output.root"
 
  }
 
}
 
 
At first glance this appears a little verbose, with some redundant information;
 
later examples will show more powerful features that require a structure
 
of this level of detail.
 
In the above fragment the identifiers <font color=blue>physics</font>,
 
<font color=blue>end_paths</font>, <font color=blue>outputs</font>
 
and <font color=blue>module_type</font>
 
all have special meaning to art. The
 
name <font color=green>RootOutput</font> is the name of a class, supplied by art,
 
that writes event-data to root files.  The identifier
 
<font color=green>fileName</font> has special meaning to the class <font color=green>RootOutput</font>.
 
The two other identifiers in this fragment,
<font color=red>out</font> and <font color=red>outputFiles</font>, are arbitrary names.  The identifier
<font color=red>out</font> appears in two places; so long as I replace both occurrences with the same
thing, the fragment will still work.  The same holds for the identifier <font color=red>outputFiles</font>.
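For example, the following sketch should be equivalent to the fragment above. The names fullOutput and writeAll are made up for this illustration; only the consistency between the places where each name appears matters:

<pre>
physics: {
  writeAll:  [ fullOutput ]   # arbitrary path name; lists module labels
  end_paths: [ writeAll ]     # must match the path name above
}

outputs: {
  fullOutput: {               # arbitrary module label; must match the label in the path
    module_type: RootOutput
    fileName:    "output.root"
  }
}
</pre>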
 
 
When art parses this fragment it looks for a parameter named
 
physics.end_paths.  This parameter must have a value that is a list
 
of names of paths; it must be a list even though it is legal, as in this example, to have
 
only one path name in the list.
 
Art will then look to find the definition of the path physics.outputFiles.
 
This must be a list of module labels; it must be a list even if it has
 
a length of one.  The module labels in the list may refer only to output modules
 
or analyzer modules; it is an error if the label of a producer, a filter or
 
a source module is found in the list.  Art then looks to find a module with the
 
label of <font color=red>out</font> and finds it under outputs.
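A sketch of an end path that holds both an analyzer and an output module is shown below. The analyzer class HitAnalyzer and the label hitAna are hypothetical names used only for illustration; the physics.analyzers block is where art expects analyzer modules to be defined:

<pre>
physics: {
  analyzers: {
    hitAna: { module_type: HitAnalyzer }   # hypothetical analyzer module
  }
  outputFiles: [ hitAna, out ]             # analyzers and output modules only
  end_paths:   [ outputFiles ]
}

outputs: {
  out: {
    module_type: RootOutput
    fileName:    "output.root"
  }
}
</pre>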
 
 
When the job starts, art will create an instance of the RootOutput module, which will open
 
an output file named "output.root".  All events from the input file will
 
be written to the output file. All data products found in each event will
 
be written to the output file.
 
 
===Writing Selected Data Products to an Output File===
 
 
In the next fragment the configuration of the output module has been altered
 
so that some data products are not written to the output file.
 
<pre>
 
outputs: {
 
  out: {
 
  module_type: RootOutput
 
  fileName: "output.root"
 
  outputCommands :  [ "keep *_*_*_*"
 
                      ,"drop mu2e::PointTrajectorymv_*_*_*"
 
                      ]
 
  }
 
}
 
</pre>
 
 
In the keep/drop commands, a name of the form DataType_ModuleLabel_InstanceName_ProcessName is
the four-part identifier for a data product.
 
The outputCommands parameter should be understood as follows:
 
the output module will write out all data products unless the data
 
product is of type mu2e::PointTrajectorymv.
 
The outputCommands parameter can be an arbitrarily long list that is parsed
 
from the top down using the logic: do the first rule, unless the second rule
 
applies, unless the third rule applies, and so on for all rules.
 
The logic is similar to the allow/deny logic in .htaccess files.
 
Bill Tanenbaum recommends that the first command always be drop * or keep *,
 
and then apply keep or drop relative to that state.
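As a sketch of how later rules refine earlier ones, consider the three rules below. The data type mu2e::StrawHits and the module label makeSH are borrowed from the example in the next section:

<pre>
outputs: {
  out: {
    module_type: RootOutput
    fileName: "output.root"
    outputCommands : [ "keep *_*_*_*"                     # start by keeping everything
                      ,"drop mu2e::StrawHits_*_*_*"       # ... except all StrawHits
                      ,"keep mu2e::StrawHits_makeSH_*_*"  # ... unless made by makeSH
                      ]
  }
}
</pre>

With these rules, StrawHits made by makeSH are written, StrawHits made by any other module are dropped, and all other data products are written.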
 
 
 
===Writing Selected Events to an Output File===
 
 
The code fragment below shows how to define a path that contains a filter and
 
how to connect that path to an output module.  All events that pass the filter
 
will be written by this output module.
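A minimal sketch of such a single-filter configuration is given below. It is a cut-down version of the two-filter example that follows, so the module Filter1, its mode parameter and the labels selectMode0, path0 and out1 are taken from that example rather than from a tested job:

<pre>
physics: {
  filters: {
    selectMode0: {
      module_type: Filter1
      mode: 0
    }
  }

  path0:         [ selectMode0 ]
  outputFiles:   [ out1 ]

  trigger_paths: [ path0 ]
  end_paths:     [ outputFiles ]
}

outputs: {
  out1: {
    module_type: RootOutput
    fileName: "data02_Mode0.root"
    # Write only the events that complete path0, that is, those that pass the filter.
    SelectEvents: { SelectEvents: [ path0 ] }
  }
}
</pre>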
 
 
The code fragment below shows how to define two filter modules and use them
to direct some events to one output module and some events to another output module.
The example also writes different data products to each file.
The first output module:
 
<ol>
 
<li> Writes its output to the file named <b>data02_Mode0.root</b>
 
<li> Only writes out events that complete the path named <b>path0</b>.
 
<li> Drops any data product with data type <b>mu2e::PointTrajectorymv</b>.
 
</ol>
 
The second output module:
 
<ol>
 
<li> Writes its output to the file named <b>data02_Mode1.root</b>
 
<li> Only writes out events that complete the path named <b>path1</b>.
 
<li> Keeps only two groups of data products, <b>mu2e::StrawHits</b> that were made by the module with the label <b>makeSH</b> and <b>mu2e::CaloHits</b> that were made by any module.
 
</ol>
 
 
<pre>
 
physics: {
 
 
  producers: {
 
    makeSH: { module_type: MakeStrawHits }
 
  }
 
 
  filters: {
 
    selectMode0: {
 
      module_type: Filter1
 
      mode: 0
 
    }
 
    selectMode1: {
 
      module_type: Filter1
 
      mode: 1
 
    }
 
  }
 
  path0: [ makeSH, selectMode0 ]
 
  path1: [ makeSH, selectMode1 ]
 
  outputFiles:  [ out1, out2 ]
 
 
  trigger_paths: [ path0, path1 ]
 
  end_paths:    [ outputFiles ]
 
}
 
 
outputs: {
 
  out1: {
 
  module_type: RootOutput
 
  fileName: "data02_Mode0.root"
 
  SelectEvents: { SelectEvents: [ path0 ] }
 
  outputCommands :  [ "keep *_*_*_*"
 
                      ,"drop mu2e::PointTrajectorymv_*_*_*"
 
                      ]
 
  }
 
 
  out2: {
 
  module_type: RootOutput
 
  fileName: "data02_Mode1.root"
 
  SelectEvents: { SelectEvents: [ path1 ] }
 
  outputCommands :  [ "drop *_*_*_*"
 
                      ,"keep mu2e::StrawHits_makeSH_*_*"
 
                      ,"keep mu2e::CaloHits_*_*_*"
 
                      ]
 
  }
 
}
 
</pre>
 
 
In the above, the module Filter1 is presumed to have two distinct modes selected by the
 
mode parameter.  The filter can send some events to just one of the files, some events
 
to both files or some events to no files.  The two identifiers path0 and path1
 
are arbitrary.  They are the names of paths; that is, they are lists of module labels.
The parameter physics.trigger_paths is a special name known to art.  It is a list of
paths; the module labels on these paths must refer to either producer or filter modules.
 
Art recognizes that the module label makeSH appears in both path0 and path1; it also
 
recognizes that makeSH only needs to be executed once in order to satisfy the requirements
 
of both paths.
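As a further sketch, an output module can list more than one path in SelectEvents. The label out3 and the file name below are invented for this illustration, and the statement that an event is written if it completes any of the listed paths is my understanding of SelectEvents rather than something demonstrated by the example above:

<pre>
  # To be placed inside the outputs block, next to out1 and out2.
  out3: {
    module_type: RootOutput
    fileName: "data02_AnyMode.root"     # hypothetical file name
    # Write events that complete path0 or path1 (or both).
    SelectEvents: { SelectEvents: [ path0, path1 ] }
  }
</pre>

To take effect, out3 would also have to be added to the outputFiles path.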
 
 
 
==Schema Evolution and Fast Cloning==
 
 
Suppose that you have some data product class, MyDP, defined in
 
the file MyDP.h . You
 
run some jobs and write some output files that contain collections of
 
objects of type MyDP. Now suppose that, at
a later date, you edit MyDP.h, either adding or removing some data
members.
 
 
This process is referred to as "schema evolution".  "Schema" is a word
 
borrowed from the database world: the schema of a root file describes, among
 
other things, the data type of each data member of each type of object
 
that is found in the root file.  When the definition of one of these
 
objects changes, the schema is said to "evolve".
 
 
 
If the changes are simple enough, then ROOT's automatic schema evolution will
 
almost always do the right thing.  If you removed some data members from
 
MyDP.h, and if you read an old file with the new code, ROOT will read the disk
file and will simply discard the data for the removed data members. The new-code
objects in memory will be the correct subset of the old-code objects on disk.
 
 
On the other hand, your new code may contain some additional data members.
 
When you make this change you should update the default constructor of MyDP
 
so that it initializes the new data members appropriately.  In this case,
 
when you read old-code objects from disk, the new-code objects in memory will
 
have their newly added data members set to the values given by the default
 
constructor.  If you neglect to initialize these new data members in the
default constructor, they may be left holding uninitialized memory.
 
 
There is an additional complication when you have an input file that
 
was written with one version of the schema, you read it with a program that
 
has a different version of the schema, and then you write an output file.
 
It is possible to write an output file in which objects written with the old schema
 
coexist with objects written with the new schema - but there are limitations on
 
this.  The guaranteed safe way of doing things is to write an output file in which
 
the old-schema objects have been translated into new schema objects.
 
To do this you need
 
to fill the in-memory representation of the objects from the input file and
 
then write those in-memory objects to the output file.
 
However, the default behaviour of art has a speed optimization that takes
 
a shortcut.  If a data product is in both the input and the output file,
 
art's default behaviour is simply to copy the packed data from the input
 
file to the output file.  This is true even if the data product was
 
unpacked into memory; this saves the time needed to repack the memory into
 
the output file, which can be significant.  This shortcut is called
 
"fast cloning".  If the schema of the input file
 
and the running program are the same, then fast cloning works properly.
 
If, on the other hand, the schema of the input file and of the running program are
 
different, then there may be problems. When this sort of problem happens,
 
art throws an exception and attempts to shut down gracefully.
The text of the exception message will look something like:
 
<pre>
 
%MSG-s ArtException:  PostOpenFile 15-Apr-2013 09:48:35 CDT BeforeEvents
 
cet::exception caught in art
 
---- FatalRootError BEGIN
 
  Fatal Root Error: @SUB=TTreeCloner::CollectBranches
 
  One of the export sub-branches (mu2e::CaloClusters_makeCaloCluster_AlgoCLOSESTSeededByENERGY_Exercise01.obj._distance) is not present in the import TTree.
 
  cet::exception caught in EventProcessor and rethrown
 
---- FatalRootError END
 
</pre>
 
The name of the data product in the fifth line will differ from one instance of this problem to another.
 
 
To work around this you should add the following parameter to the parameter set
 
for each output file in the job:
 
<pre>
 
fastCloning : false
 
</pre>
 
This tells ROOT to do the following for every data product that is destined
 
for an output file: unpack the data product from the input file into memory and
 
repack it into the output file.  ROOT will do this for every data product, not
 
just those that have had schema evolution. Because fast cloning is usually
 
safe and because it is much faster than slow cloning, the default is for
 
fast cloning to be enabled.
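In context, the parameter sits alongside the other parameters of the output module. The sketch below reuses the single output module from earlier on this page:

<pre>
outputs: {
  out: {
    module_type : RootOutput
    fileName    : "output.root"
    fastCloning : false   # unpack and repack every data product
  }
}
</pre>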
 
 
Aside for ROOT experts:
 
the problem arises only for objects that have been split; the underlying limitation
 
is that part of the schema is bound to the branch hierarchy.
 

Latest revision as of 17:07, 24 March 2017