FileNames: Difference between revisions
Revision as of 19:58, 17 July 2020
Introduction
For everyday personal use, file names can be anything that is convenient for the user. If the files
- go to tape
- were produced by a collaboration effort
- gain some long-term status
- are used by more than one or two people
they must be named using the fixed, six-field pattern described here. Files written to tape must follow this pattern without exception; the other criteria may allow exceptions. The primary Monte Carlo workflow has this naming pattern built in.
File names should be relatively short, but should include logical patterns that searches can be based on, and should contain some human-recognizable information that helps someone distinguish datasets, confirm they are running on the right files, or pick a file for testing code.
The user has some discretion in choosing names, and should embrace the following concepts for file names:
- must be unique for each file
- must contain only alphanumeric characters, hyphens, and underscores
- should be mnemonic and helpful
- must not be primarily designed as, or assumed to be, complete and clear documentation of the file contents. This can be restated: do not attempt to include all the metadata in the file name.
Mu2e will name all uploaded files with the following pattern of six dot-separated fields:
data_tier.owner.description.configuration.sequencer.file_format
for example:
sim.mu2e.beam_g4s1_dsregion.0429a.123456_12345678.art
Each of these fields will be discussed more below. These fields all correspond to required SAM metadata database fields. With owner in the file name, potential name conflicts will only occur within one user's files.
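The six-field pattern and the character rule above can be checked mechanically. The following is a minimal sketch, not an official Mu2e tool: the names `Mu2eFileName` and `parse_file_name` are hypothetical, and the sketch assumes the character rule applies to each field separately, with the dot reserved as the field separator.

```python
import re
from typing import NamedTuple

# Characters allowed inside one field: alphanumerics, hyphens, underscores.
# (Assumption: the dot is reserved as the separator between fields.)
FIELD_RE = re.compile(r"[A-Za-z0-9_-]+")


class Mu2eFileName(NamedTuple):
    """Hypothetical container for the six dot-separated fields."""
    data_tier: str
    owner: str
    description: str
    configuration: str
    sequencer: str
    file_format: str


def parse_file_name(name: str) -> Mu2eFileName:
    """Split a file name into its six fields, or raise ValueError."""
    fields = name.split(".")
    if len(fields) != 6:
        raise ValueError(
            f"expected 6 dot-separated fields, got {len(fields)}: {name!r}"
        )
    for field in fields:
        # fullmatch also rejects empty fields, e.g. from a doubled dot.
        if not FIELD_RE.fullmatch(field):
            raise ValueError(f"invalid field {field!r} in {name!r}")
    return Mu2eFileName(*fields)


parsed = parse_file_name("sim.mu2e.beam_g4s1_dsregion.0429a.123456_12345678.art")
print(parsed.owner, parsed.description)
```

This only checks the shape of the name. Whether the data_tier and file_format values are allowed is a separate question answered by the SAM database.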
Datasets
If you remove the sequencer from a file name, you are left with five dot-separated fields:
data_tier.owner.description.configuration.file_format
for example:
sim.mu2e.beam_g4s1_dsregion.0429a.art
This creates a string that is unique for this logical dataset, and that will be put in the "dh.dataset" SAM metadata field. A dataset is the set of all files that have the same conceptual and actual metadata, except for run numbers and other natural run dependence, and that contain no duplicated event ID numbers. SAM has no concept of dataset-level metadata, so files are made into a conceptual dataset by giving them the same values for certain metadata fields. All files in a logical dataset will have the same "dh.dataset" field content, which will be unique to this dataset. The average user will almost always run on a whole dataset, so will only need to refer to this dataset name.
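Dropping the sequencer is a simple string operation. As an illustration only (the function names `dataset_name` and `group_by_dataset` are hypothetical, and this does not query SAM), the dataset name can be derived from a file name like this:

```python
from collections import defaultdict


def dataset_name(file_name):
    """Return the five-field dataset name for a six-field file name."""
    fields = file_name.split(".")
    if len(fields) != 6:
        raise ValueError(f"expected 6 dot-separated fields: {file_name!r}")
    del fields[4]  # the sequencer is the fifth field
    return ".".join(fields)


def group_by_dataset(file_names):
    """Group file names by the dataset name derived from each of them."""
    groups = defaultdict(list)
    for name in file_names:
        groups[dataset_name(name)].append(name)
    return dict(groups)


print(dataset_name("sim.mu2e.beam_g4s1_dsregion.0429a.123456_12345678.art"))
```

For the example file above this prints sim.mu2e.beam_g4s1_dsregion.0429a.art, the same string shown earlier in this section.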
Name Fields
data_tier
The data tier describes the type of data, conceptually. Some of these concepts include raw, simulated but not reconstructed, and fully reconstructed. The choices for this field are fixed, and must come from the list below. Please use the tier that corresponds to the most advanced processing step applied to the file. For example, the sim tier will contain StepPoints and SimParticles, but if those files are processed so that they contain those and digis, then the new files should be dig tier. If reco products are also added, the tier would be mcs.
- for physics data:
  - raw: raw data
  - rec: reconstructed
  - ntd: data ntuples
- for ExtMon data:
  - ext: ExtMon raw
  - rex: ext production
  - xnt: ext data ntuples
- for simulation:
  - cnf: set of config files (fcl or txt) to drive MC jobs
  - sim: result of geant, StepPointMC
  - mix: mixed sim files (has multiple generators)
  - dig: detector hits, like raw data
  - mcs: reconstructed data files
  - nts: MC ntuples
- other categories:
  - log: for log files
  - bck: for backups
  - etc: for anything else
  - job: for a production record
A list of valid values from the database can be generated:
samweb list-values data_tiers
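The "most advanced tier" rule for the sim, dig, mcs example can be written as a small lookup. This is a sketch under stated assumptions: `SIM_TIER_ORDER` and `most_advanced_tier` are hypothetical names, and only the three tiers ranked in the text are covered, because the text does not say where cnf, mix or nts fall in the ordering.

```python
# Tiers ranked in the text, from least to most advanced processing.
# Assumption: cnf, mix and nts are omitted because the text does not rank them.
SIM_TIER_ORDER = ("sim", "dig", "mcs")


def most_advanced_tier(content_tiers):
    """Pick the tier to use for a file, given the tiers of the products it holds.

    content_tiers lists one tier per kind of product in the file, for
    example ["sim", "dig"] for a file holding StepPoints and digis.
    """
    present = set(content_tiers)
    if not present:
        raise ValueError("no tiers given")
    unknown = present - set(SIM_TIER_ORDER)
    if unknown:
        raise ValueError(f"tiers not ranked in this sketch: {sorted(unknown)}")
    # The tier with the highest position in the ordering wins.
    return max(present, key=SIM_TIER_ORDER.index)


print(most_advanced_tier(["sim", "dig"]))
```

A file that holds both sim-level and dig-level products is therefore named with the dig tier, as described above.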
owner
For official data samples and Monte Carlo datasets that are generated by a collaboration effort and are intended to be semi-permanent and widely used, the owner will be mu2e. For user files, it will be the username of the person most likely to understand how they were created and how they should be used if questions come up a year or two later: the intellectual owner of the data.
description
This is a mnemonic string which fundamentally indicates what this set of files contains and its intended purpose. It can also be thought of as a conceptual project and/or a high-level indication of the physics. This should be limited to 20 characters, but may be more if it is strongly motivated. Examples are "tdr-beam" or "2014-cosmics". It might contain the name of the responsible group. It should not contain a username or detailed configurations.
configuration
This field is intended to capture details of the configuration, or variations of the configuration, of the physics indicated in the description field. It might indicate cvs tags or git branches of configurations, fcl file names, offline versions, geometry versions, magnet settings, filter settings, etc. Not all of this information needs to be included; this is just a list of the sort of information intended for this field. This should be limited to 20 characters, but may be more if it is strongly motivated. For complex configurations, to avoid a very lengthy string, you should capture all the information in a simple string like "tag100" or "c427-m1-g5-v2" which is documented elsewhere.
sequencer
This field exists simply to give unique file names to the files of a single dataset. It could be a counter: 0000, 0001, etc. For art files it will be rrrrrrrr_ssssss, where the r field is the run number and the s field is the subrun number of the lowest sorted subrun eventID in the file. A subrun should appear in only one file, so the sequencer is uniquely determined for a file within a dataset.
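The art-file rule (take the lowest sorted run and subrun, then zero-pad) can be sketched as follows. The function name `art_sequencer` is hypothetical. The default widths follow the rrrrrrrr_ssssss pattern as written in this section; they are left as parameters because not every example on this page is padded the same way, so treat the defaults as an assumption.

```python
def art_sequencer(subrun_ids, run_width=8, subrun_width=6):
    """Build a sequencer string from the (run, subrun) pairs in a file.

    subrun_ids is an iterable of (run, subrun) integer pairs. The pair
    that sorts lowest (by run, then by subrun) determines the sequencer.
    """
    # Tuples compare element by element, so min() sorts by run, then subrun.
    run, subrun = min(subrun_ids)
    return f"{run:0{run_width}d}_{subrun:0{subrun_width}d}"


# A file holding three subruns; (1002, 3) is the lowest sorted pair.
print(art_sequencer([(1002, 7), (1002, 3), (1003, 0)]))
```

Because each subrun appears in only one file of a dataset, two files of the same dataset cannot produce the same string.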
file_format
This is the commonly-recognized file type, one of a fixed list (that can be extended): art, root, txt, tar, tgz, tbz (tar.bz2), log, fcl, stn, mid (midas DAQ), enc (encrypted)
A list of valid values from the database can be generated:
samweb list-values file_formats
Coordinating names between datasets
An official Monte Carlo production may have datasets for cnf, sim, mix, dig, mcs, nts and log, and examples of their file names might look like:
cnf.mu2e.tdr-beam.TS3ToDS23.001-0001.fcl
sim.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
mix.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
dig.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
mcs.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
nts.mu2e.tdr-beam.TS3ToDS23.12345678_123456.root
log.mu2e.tdr-beam.TS3ToDS23.001.tgz
A common situation is when one set of fcl files is used to produce multiple output datasets. For example, a set of fcl files that generate protons on target could be run in two different releases without modification. The fcl dataset
cnf.mu2e.tdr-beam.TS3ToDS23.fcl
could give rise to
sim.mu2e.tdr-beam.TS3ToDS23.art
and
sim.mu2e.tdr-beam.TS3ToDS23-v521.art
This way, an output dataset can be referred to by a fcl dataset and a dsconf string (the value of the configuration field).
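The relationship between a fcl dataset, a dsconf string and the resulting output dataset name can be sketched as below. The helper `output_dataset` is hypothetical and only illustrates the string manipulation; it assumes the output keeps the owner and description of the fcl dataset, as in the example above.

```python
def output_dataset(cnf_dataset, data_tier, file_format, dsconf=None):
    """Derive an output dataset name from a five-field fcl dataset name.

    The owner and description are copied from the fcl dataset. The
    configuration is copied too, unless dsconf is given to replace it.
    """
    fields = cnf_dataset.split(".")
    if len(fields) != 5:
        raise ValueError(f"expected 5 dot-separated fields: {cnf_dataset!r}")
    _, owner, description, configuration, _ = fields
    if dsconf is not None:
        configuration = dsconf
    return ".".join([data_tier, owner, description, configuration, file_format])


cnf = "cnf.mu2e.tdr-beam.TS3ToDS23.fcl"
print(output_dataset(cnf, "sim", "art"))
print(output_dataset(cnf, "sim", "art", dsconf="TS3ToDS23-v521"))
```

The two calls reproduce the two sim dataset names given above for the same set of fcl files.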
If a new digitization (dig) file were to be made with a different mix file, then a derived name could be used. Since this is a new set of conditions, it makes sense to modify the configuration field:
dig.mu2e.tdr-beam.TS3ToDS23-v2.12345678_123456.art
When making variations there is a temptation to include all the information related to the change in the file name. For example, when switching the mix input from 2014.tag123 to 2014a.tag456, it is tempting to add that instead:
dig.mu2e.tdr-beam.TS3ToDS23-mix2014a-tag456.12345678_123456.art
This style can get out of hand quickly, leading to large, unwieldy names, so we should prefer (always with judgment and common sense) to simplify to just "v2", which must then be documented elsewhere.
If a user made the change for their own purposes, they would make it a user dataset by putting their username in the owner field:
dig.batman.tdr-beam.TS3ToDS23-v2.12345678_123456.art
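Turning an official name into a user-owned name only changes the owner field. A minimal sketch, with the hypothetical helper name `with_owner`:

```python
def with_owner(file_name, owner):
    """Return file_name with its owner (second field) replaced."""
    fields = file_name.split(".")
    if len(fields) != 6:
        raise ValueError(f"expected 6 dot-separated fields: {file_name!r}")
    fields[1] = owner  # the field order starts data_tier.owner...
    return ".".join(fields)


print(with_owner("dig.mu2e.tdr-beam.TS3ToDS23-v2.12345678_123456.art", "batman"))
```

Every other field is left untouched, so the user's files stay easy to relate to the official dataset they were derived from.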
Raw, reconstructed and ntuple beam data might look like:
raw.mu2e.streamA.triggerTable123.12345678_123456.art
rec.mu2e.streamA.triggerTable123.12345678_123456.art
ntd.mu2e.streamA.triggerTable123.0001.root
A backup of an analysis project might look like:
bck.batman.node123.2014-06-04.aa.tgz
Maintenance
To add a data_tier or a file_format (file extension), you have to declare in SAM that the new value is allowed. As user mu2epro:
setup sam_web_client
samweb add-value --help-categories
samweb add-value data_tiers raw
samweb list-values data_tiers
samweb add-value file_formats enc
samweb list-values file_formats