FileNames

From Mu2eWiki
Latest revision as of 19:41, 22 January 2024

Introduction

For everyday personal use, file names can be anything that is convenient for the user. If the files

  • go to tape
  • were produced by a collaboration effort
  • gain some long-term status
  • are used by more than one or two people

they must be named by the fixed, six-field pattern described here. Files written to tape must follow this name pattern without exception, while the other criteria may have exceptions. The primary Monte Carlo workflow has this naming pattern built in.

File names should be relatively short, but include logical patterns to base searches on and contain some human-recognizable, useful information that helps you distinguish datasets, confirm you are running on the right files, or pick a file for testing code.

The user has some discretion in choosing names, and should embrace the following concepts for file names:

  • must be unique for each file
  • must contain only alphanumeric characters, hyphens, and underscores
  • should be mnemonic and helpful
  • must not be primarily designed as, or assumed to be, complete and clear documentation of the file contents. This can be restated: do not attempt to include all the metadata in the file name.

File Name Structure

Mu2e will name all files to be uploaded with the following pattern of six dot-separated fields:

data_tier.owner.description.configuration.sequencer.file_format

for example:

sim.mu2e.beam_g4s1_dsregion.0429a.123456_12345678.art

Each of these fields will be discussed more below. These fields all correspond to required SAM metadata database fields. With owner in the file name, potential name conflicts will only occur within one user's files. Fields may contain only alphanumeric characters, hyphens, and underscores.
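
The six-field structure and the character rule above can be checked mechanically. The following is a minimal sketch, not an official Mu2e tool: the names `Mu2eFileName` and `parse_file_name` are hypothetical, and the check covers only the field count and allowed characters, not whether the values are registered in SAM.

```python
import re
from typing import NamedTuple

# Allowed characters per field, as stated above:
# alphanumerics, hyphens, and underscores.
_FIELD_RE = re.compile(r"[A-Za-z0-9_-]+")


class Mu2eFileName(NamedTuple):
    """Hypothetical container for the six dot-separated fields."""
    data_tier: str
    owner: str
    description: str
    configuration: str
    sequencer: str
    file_format: str


def parse_file_name(name: str) -> Mu2eFileName:
    """Split a file name into its six fields; raise ValueError if malformed."""
    fields = name.split(".")
    if len(fields) != 6:
        raise ValueError(
            f"expected 6 dot-separated fields, got {len(fields)}: {name!r}"
        )
    for field in fields:
        # An empty field also fails here, because the pattern needs 1+ chars.
        if not _FIELD_RE.fullmatch(field):
            raise ValueError(f"invalid field {field!r} in {name!r}")
    return Mu2eFileName(*fields)


parsed = parse_file_name("sim.mu2e.beam_g4s1_dsregion.0429a.123456_12345678.art")
print(parsed.data_tier, parsed.owner, parsed.sequencer)
```

Because a dot separates fields, a dot inside any field (for example a compound extension such as tar.bz2) would change the field count, which is why the file_format list uses single tokens such as tbz.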

Datasets

If you remove the sequencer from a file name, you are left with five dot-separated fields:

data_tier.owner.description.configuration.file_format

for example

sim.mu2e.beam_g4s1_dsregion.0429a.art

This creates a string that is unique for this logical dataset, and that will be put in the "dh.dataset" SAM metadata field. A dataset is the set of all files with the same conceptual and actual metadata except for run numbers and other natural run dependence; it contains no duplicated event ID numbers. SAM has no concept of dataset-level metadata, so files are made into a conceptual dataset by giving them the same metadata for certain fields. All files in a logical dataset will have the same "dh.dataset" field content, which will be unique to this dataset. The average user will almost always run on a dataset, so will only need to refer to this dataset name.
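
As an illustration of the rule above, the dataset name can be derived from any file name by dropping the fifth field. This is a small sketch; the function name `dataset_name` is hypothetical and is not taken from the Mu2e tools.

```python
def dataset_name(file_name: str) -> str:
    """Return the five-field dataset name for a six-field file name."""
    fields = file_name.split(".")
    if len(fields) != 6:
        raise ValueError(f"expected 6 dot-separated fields: {file_name!r}")
    # Fields are: data_tier, owner, description, configuration,
    # sequencer, file_format. The sequencer is index 4.
    del fields[4]
    return ".".join(fields)


print(dataset_name("sim.mu2e.beam_g4s1_dsregion.0429a.123456_12345678.art"))
```

Two files belong to the same logical dataset exactly when this function returns the same string for both.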

Datasets that have similar operational patterns and lifetimes (raw vs reconstructed data, read by production vs read by users) are stored together on a set of tapes called a file family.

Name Fields

data_tier

The data tier describes the file product content, at a conceptual level. Some of these concepts include raw, simulated but not reconstructed, and fully reconstructed. The choices for this field are fixed, and must come from the list below. Please use the tier that is the most advanced one reached in typical processing. For example, the sim tier will contain StepPoints and SimParticles, but if those files are processed so they contain those and digis, then the new files should be dig tier. If reco products are also added, the tier would be mcs.

  • for physics data:
    • raw TDAQ output - digis and some trigger reco
    • rec reconstructed
    • ntd data ntuples
  • for ExtMon data:
    • ext ExtMon raw
    • rex ext production
    • xnt ext data ntuples
  • for simulation:
    • cnf set of config files fcl or txt, to drive MC jobs
    • sim result of geant, StepPointMC
    • dts detector steps
    • mix mixed sim files (has multiple generators)
    • dig detector hits, like raw data
    • mcs reconstructed data files
    • nts MC ntuples
  • other categories:
    • log for log files
    • bck for backups
    • etc for anything else
    • job for a production record

A list of valid values from the database can be generated:

samweb list-values  data_tiers
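
A script that builds file names can reject an unknown tier before upload. The sketch below simply copies the tier list from this page into a set; the SAM database queried by the command above remains the authoritative source, so the set may be out of date. The names `DATA_TIERS` and `check_data_tier` are hypothetical.

```python
# Data tiers as listed on this page. Assumption: this copy matches the
# SAM database; verify with "samweb list-values data_tiers".
DATA_TIERS = {
    "raw", "rec", "ntd",                                # physics data
    "ext", "rex", "xnt",                                # ExtMon data
    "cnf", "sim", "dts", "mix", "dig", "mcs", "nts",    # simulation
    "log", "bck", "etc", "job",                         # other categories
}


def check_data_tier(file_name: str) -> str:
    """Return the data tier of file_name, or raise ValueError if unknown."""
    tier = file_name.split(".", 1)[0]
    if tier not in DATA_TIERS:
        raise ValueError(f"unknown data tier {tier!r} in {file_name!r}")
    return tier


print(check_data_tier("dts.mu2e.CE-mix.tag100.0001.art"))
```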

owner

For official data samples and Monte Carlo datasets that are generated by a collaboration effort and are intended to be semi-permanent and used widely, the owner will be mu2e. For user files, it will be the username of the person most likely to understand how they were created and how they should be used if questions come up a year or two later - the intellectual owner of the data.

description

This is a mnemonic string which fundamentally indicates what this set of files contains and its intended purpose. It can also be thought of as a conceptual project and/or a high-level indication of the physics. This should be limited to 20 characters, but may be more if it is strongly motivated. Examples are "POT", "ConversionElectrons", "CE-mix", or "cosmics-CRY". It might contain the name of the responsible group. It should not contain a username or detailed configurations, such as code tags or geometry choices.

configuration

This field is intended to capture details of the configuration, or variations of the configurations of the physics indicated in the description field. It might indicate cvs tags/git branches of configurations, fcl file names, offline versions, geometry versions, magnet settings, filter settings, etc. Not all of this information needs to be included, this is just a list of the sort of information intended for this field. This should be limited to 20 characters, but may be more if it is strongly motivated. For complex configurations, in order to avoid a very lengthy string, you should capture all information in a simple string like "tag100" or "c427-m1-g5-v2" which is documented elsewhere.

For raw data, the configuration starts as something nominal, like 000. When the DAQ settings, readout chain downloads or detector configuration change to such an extent that a physicist would tend to think of the new data as a "different dataset" that should be treated separately from the previous data, beyond simply updated conditions database entries, then this configuration should be incremented. Some examples which might cause the configuration to change are non-trivial changes to the data product structure, or a summer shutdown. A change should be well-considered by the operations team.

In production, the configuration is a string like "pass1_000". The "pass1" phrase indicates that this data was produced in a pass1 job and is there for user convenience. The "000" represents the offline code tag and the conditions database version that went into this production, which must be documented elsewhere. When a repair to a dataset is needed due to a fix in code or conditions, the configuration field should be updated. For example, suppose data periods 1, 2 and 3 are produced using the pass1_000 configuration and a problem is then found in the period 3 conditions. A new set of conditions is defined and period 3 is reprocessed and labeled pass1_001. The final user dataset is then periods 1 and 2 from pass1_000 and period 3 from pass1_001. This can be captured in a SAM dataset definition.

sequencer

This field exists simply to give unique filenames to the files of a single dataset. It could be a counter 0000, 0001, etc. For art files it will be rrrrrr_ssssssss, where the r field is the run number and the s field is the subrun number of the lowest sorted subrun in the file. Subruns are sorted by run number first, then by subrun number within a run.

An art file contains records of all subruns that were input to the file, so subruns can be present even if no events in that subrun are present. The primary reason to track subruns this way is to securely know what luminosity a file represents. This principle also requires that all the data from a subrun be contained in one file in a dataset. Since a subrun will appear in only one file, the lowest subrun is unique for each file in a dataset, and therefore the file name will be unique.

When processing multiple input files into one output, for example in concatenation, the name of the output file will be the same as that of the input file with the lowest sorted subrun number. For sequencers with run and subrun in fixed-width fields, this is the input file name that sorts first alphanumerically.
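
The sequencer rules above can be sketched as follows. The field widths (six digits for the run, eight for the subrun) are read off the rrrrrr_ssssssss pattern; the function names are hypothetical, and `concatenated_name` assumes all inputs belong to the same dataset so that their names differ only in the sequencer.

```python
from typing import Iterable, Tuple


def art_sequencer(run: int, subrun: int) -> str:
    """Format a run and subrun as a fixed-width rrrrrr_ssssssss string."""
    return f"{run:06d}_{subrun:08d}"


def lowest_sequencer(subruns: Iterable[Tuple[int, int]]) -> str:
    """Sequencer for a file holding the given (run, subrun) pairs.

    Tuples compare by run first, then subrun, which is the sort order
    described above.
    """
    run, subrun = min(subruns)
    return art_sequencer(run, subrun)


def concatenated_name(input_names: Iterable[str]) -> str:
    """Name for a concatenation output: the input name that sorts first.

    Assumes fixed-width sequencers and inputs from a single dataset.
    """
    return min(input_names)


print(art_sequencer(1234, 56))
print(lowest_sequencer([(1002, 3), (1001, 17), (1001, 4)]))
```

Zero-padding is what makes plain string comparison agree with numeric run and subrun order; without fixed widths, "10" would sort before "9".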

Currently Mu2e does not generally support splitting a single file into multiple files. Splitting is possible, however, and should maintain the rule of one subrun fully contained in one file.

file_format

This is the commonly-recognized file type, one of a fixed list (that can be extended): art, root, txt, tar, tgz, tbz (tar.bz2), log, fcl, stn (Stntuple), mid (midas DAQ), enc (encrypted), dat (other binary), tka (TrkAna), pdf

A list of valid values from the database can be generated:

samweb list-values  file_formats

Coordinating names between datasets

An official Monte Carlo production may have datasets for cnf, sim, mix, dig, mcs, nts and log; examples of their file names might look like:

cnf.mu2e.tdr-beam.TS3ToDS23.001-0001.fcl
sim.mu2e.tdr-beam.TS3ToDS23.123456_12345678.art
mix.mu2e.tdr-beam.TS3ToDS23.123456_12345678.art
dig.mu2e.tdr-beam.TS3ToDS23.123456_12345678.art
mcs.mu2e.tdr-beam.TS3ToDS23.123456_12345678.art
nts.mu2e.tdr-beam.TS3ToDS23.123456_12345678.root
log.mu2e.tdr-beam.TS3ToDS23.001.tgz

A common situation is when one set of fcl files is used to produce multiple output datasets. For example, a set of fcl files that generate protons on target could be run in two different releases without modification. The fcl files cnf.mu2e.tdr-beam.TS3ToDS23.fcl could give rise to sim.mu2e.tdr-beam.TS3ToDS23.art and sim.mu2e.tdr-beam.TS3ToDS23-v521.art. This way, an output dataset can be referred to by a fcl dataset and a dsconf string.

If a new digitization (dig) file were to be made with a different mix file, then a derived name could be used. Since this is a new set of conditions, it makes sense to modify the configuration field:

dig.mu2e.tdr-beam.TS3ToDS23-v2.123456_12345678.art

When making variations there is a temptation to include all the information related to the change in the file name. For example, when switching the mix input from 2014.tag123 to 2014a.tag456, it is tempting to add that instead:

dig.mu2e.tdr-beam.TS3ToDS23-mix2014a-tag456.123456_12345678.art

This style can get out of hand quickly, leading to large, unwieldy names, so we should prefer (always with judgment and common sense) to simplify to just "v2", which must be documented elsewhere.

If a user created the change for his own purposes, he would make it a user dataset by including his user name:

dig.batman.tdr-beam.TS3ToDS23-v2.123456_12345678.art
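
The two derivations shown above (a version suffix on the configuration field, and a change of owner) can be expressed as one small helper. This is only a sketch; `derive_name` and its parameters are hypothetical and not part of the mu2efilenames product.

```python
from typing import Optional


def derive_name(
    name: str,
    owner: Optional[str] = None,
    config_suffix: Optional[str] = None,
) -> str:
    """Return a derived file name with a new owner and/or a suffix
    appended to the configuration field with a hyphen."""
    fields = name.split(".")
    if len(fields) != 6:
        raise ValueError(f"expected 6 dot-separated fields: {name!r}")
    if owner is not None:
        fields[1] = owner                               # owner field
    if config_suffix is not None:
        fields[3] = f"{fields[3]}-{config_suffix}"      # configuration field
    return ".".join(fields)


original = "dig.mu2e.tdr-beam.TS3ToDS23.123456_12345678.art"
print(derive_name(original, config_suffix="v2"))
print(derive_name(original, owner="batman", config_suffix="v2"))
```

Keeping the suffix short, as recommended above, leaves the meaning of "v2" to external documentation rather than to the file name.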

Raw, reconstructed and ntuple beam data might look like:

raw.mu2e.streamA.triggerTable123.123456_12345678.art
rec.mu2e.streamA.triggerTable123.123456_12345678.art
ntd.mu2e.streamA.triggerTable123.0001.root

A backup of an analysis project might look like:

bck.batman.node123.2014-06-04.aa.tgz

A Note on Rucio

The Rucio data manager will be used for data file locations and data transfer. Rucio has a strong philosophy that files should not be replaced and enforces that by disallowing updates to the CRC or size attached to a file name. Mu2e generally, but not strictly, follows this philosophy. This implies there might be an occasion when we want to update a file but can't due to the Rucio restriction. There is no expectation in normal operations (including dataset updates and repairs) that we will face this problem, but if it arises, we expect to upload the new file with a new name. The decision on how to form this new file name will be context dependent, but will probably be to append a version string such as "v1" to either the configuration field or the sequencer. Neither will cause downstream issues with data-handling policies or tools. Another option is to not declare the CRC and size to Rucio. These are recorded in metacat and dCache, but Rucio transfers might then not be able to execute a check against the CRC. It might also be possible to teach Rucio to ask metacat or dCache for the CRC before transfer, as there are places in Rucio to provide input from external databases.

Maintenance

To add a data_tier or a file_format (extension), you have to declare that these values are allowed in SAM. As user mu2epro:

setup sam_web_client
samweb add-value --help-categories
samweb add-value data_tiers raw "raw data"
samweb list-values data_tiers
samweb add-value file_formats enc
samweb list-values file_formats

The data_tiers and their descriptions are listed at https://samweb.fnal.gov:8483/sam/mu2e/api/values/data_tiers

Note: when adding a data_tier, the file family matrix must be updated, and the tools of the mu2efilenames product must also be updated so the new files can be sent to tape.