FileNames: Difference between revisions
Revision as of 19:27, 27 March 2017
Introduction
For everyday personal use, file names can be anything that is convenient for the user. If the files
- go to tape
- were produced by a collaboration effort
- gain some long-term status
- are used by more than one or two people
they must be named according to the fixed, six-field pattern described here. Files written to tape must follow this name pattern without exception, while the other criteria may have exceptions. The primary Monte Carlo workflow has this naming pattern built in.
File names should be relatively short, but should include logical patterns to base searches on and contain some human-recognizable, useful information that helps a user distinguish datasets, confirm they are running on the right files, pick a file for testing code, etc.
The user has some discretion in choosing names, and should embrace the following concepts for file names:
- must be unique for each file
- must contain only alphanumeric characters, hyphens, and underscores
- should be mnemonic and helpful
- must not be primarily designed as, or assumed to be, complete and clear documentation of the file contents. This can be restated: do not attempt to include all the metadata in the file name.
Mu2e will name all files to be uploaded with the following pattern of six dot-separated fields:
data_tier.owner.description.configuration.sequencer.file_format
for example:
sim.mu2e.beam_g4s1_dsregion.0429a.12345678_123456.art
Each of these fields will be discussed more below. These fields all correspond to required SAM metadata database fields. With owner in the file name, potential name conflicts will only occur within one user's files.
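The six-field pattern can be checked mechanically. The following is a minimal sketch, not an official Mu2e tool: the names `Mu2eFileName` and `parse_file_name` are hypothetical, and the check covers only the field count and the allowed characters stated above (alphanumerics, hyphens, and underscores within each field).

```python
import re
from typing import NamedTuple

# Characters allowed inside one field: alphanumerics, hyphens, underscores.
_FIELD = re.compile(r"[A-Za-z0-9_-]+")


class Mu2eFileName(NamedTuple):
    """Hypothetical container for the six dot-separated fields."""
    data_tier: str
    owner: str
    description: str
    configuration: str
    sequencer: str
    file_format: str


def parse_file_name(name: str) -> Mu2eFileName:
    """Split a file name into its six fields, or raise ValueError."""
    fields = name.split(".")
    if len(fields) != 6:
        raise ValueError(f"expected 6 dot-separated fields, got {len(fields)}: {name!r}")
    for field in fields:
        # fullmatch rejects empty fields and any character outside the set.
        if not _FIELD.fullmatch(field):
            raise ValueError(f"invalid field {field!r} in {name!r}")
    return Mu2eFileName(*fields)


parsed = parse_file_name("sim.mu2e.beam_g4s1_dsregion.0429a.12345678_123456.art")
print(parsed.owner, parsed.sequencer)
```

This sketch does not verify that the data_tier or file_format values are among those accepted by the database.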
Datasets
If you remove the sequencer from a file name, you are left with five dot-separated fields:
data_tier.owner.description.configuration.file_format
for example
sim.mu2e.beam_g4s1_dsregion.0429a.art
This creates a string that is unique to this logical dataset and that will be put in the "dh.dataset" SAM metadata field. A dataset consists of all files with the same conceptual and actual metadata, except for run numbers and other natural run dependence, and contains no duplicated event ID numbers. SAM has no concept of dataset-level metadata, so files are made into a conceptual dataset by giving them the same values for certain metadata fields. All files in a logical dataset will have the same "dh.dataset" field content, which will be unique to this dataset. The average user will almost always run on a whole dataset, so will only need to refer to this dataset name.
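As an illustration of the relationship between a file name and its dataset name, the sketch below drops the sequencer (the fifth field). The function name `dataset_name` is hypothetical and is not part of any Mu2e or SAM tool.

```python
def dataset_name(file_name: str) -> str:
    """Return the five-field dataset name for a six-field file name."""
    fields = file_name.split(".")
    if len(fields) != 6:
        raise ValueError(f"expected 6 dot-separated fields: {file_name!r}")
    tier, owner, description, configuration, _sequencer, file_format = fields
    # The dataset name is the file name with the sequencer removed.
    return ".".join([tier, owner, description, configuration, file_format])


print(dataset_name("sim.mu2e.beam_g4s1_dsregion.0429a.12345678_123456.art"))
```

Applied to every file of a dataset, this gives the same string, which is the value expected in the "dh.dataset" field.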
Name Fields
data_tier
The data tier describes the type of data, conceptually. Some of these concepts include raw, simulated but not reconstructed, and fully reconstructed. The choices for this field are fixed, and must come from the list below.
- for physics data:
- raw
- rec reconstructed
- ntd data ntuples
- for ExtMon data:
- ext ExtMon raw
- rex ext production
- xnt ext data ntuples
- for simulation:
- cnf set of config files fcl or txt, to drive MC jobs
- sim result of geant, StepPointMC
- mix mixed sim files (has multiple generators)
- dig detector hits, like raw data
- mcs reconstructed data files
- nts MC ntuples
- other categories:
- log for log files
- bck for backups
- etc for anything else
- job for a production record
A list of valid values from the database can be generated:
samweb list-values data_tiers
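For a quick local check without querying the database, the tiers listed above can be held in a small lookup table. This is only a sketch: the table is a hand-copied snapshot of the list on this page, the names `DATA_TIERS` and `tier_category` are hypothetical, and the database remains the authoritative source.

```python
# Snapshot of the tiers listed above, grouped by category.
# Illustrative copy only; the SAM database is authoritative.
DATA_TIERS = {
    "physics data": ("raw", "rec", "ntd"),
    "ExtMon data": ("ext", "rex", "xnt"),
    "simulation": ("cnf", "sim", "mix", "dig", "mcs", "nts"),
    "other": ("log", "bck", "etc", "job"),
}


def tier_category(tier: str) -> str:
    """Return the category of a data tier, or raise ValueError if unknown."""
    for category, tiers in DATA_TIERS.items():
        if tier in tiers:
            return category
    raise ValueError(f"unknown data_tier {tier!r}")


print(tier_category("dig"))
```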
owner
For official data samples and Monte Carlo datasets that are generated by a collaboration effort and are intended to be permanent and used widely, the owner will be mu2e. For user files, it will be the username of the person most likely to understand how they were created and how they should be used if questions come up a year or two later - the intellectual owner of the data.
description
This is a mnemonic string which fundamentally indicates what this set of files contains and its intended purpose. It can also be thought of as a conceptual project and/or a high-level indication of the physics. This should be limited to 20 characters, but may be more if it is strongly motivated. Examples are "tdr-beam" or "2014-cosmics". It might contain the name of the responsible group. It should not contain a username or detailed configurations.
configuration
This field is intended to capture details of the configuration, or variations of the configurations of the physics indicated in the description field. It might indicate cvs tags/git branches of configurations, fcl file names, offline versions, geometry versions, magnet settings, filter settings, etc. Not all of this information needs to be included; this is just a list of the sort of information intended for this field. This should be limited to 20 characters, but may be more if it is strongly motivated. For complex configurations, in order to avoid a very lengthy string, you should summarize the information in a short string like "tag100" or "c427-m1-g5-v2", which is documented elsewhere.
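The 20-character guideline for the description and configuration fields is a soft limit, so a check would more naturally warn than reject. A minimal sketch, with the hypothetical names `SOFT_LIMIT` and `length_warnings`:

```python
SOFT_LIMIT = 20  # guideline from the text; may be exceeded if strongly motivated


def length_warnings(description: str, configuration: str, limit: int = SOFT_LIMIT) -> list:
    """Return a warning message for each field longer than the soft limit."""
    warnings = []
    for label, value in (("description", description), ("configuration", configuration)):
        if len(value) > limit:
            warnings.append(
                f"{label} {value!r} is {len(value)} characters (soft limit {limit})"
            )
    return warnings


for message in length_warnings("tdr-beam", "TS3ToDS23-mix2014a-tag456"):
    print(message)
```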
sequencer
This field simply gives unique file names to the files of a single dataset. It could be a counter: 0000, 0001, etc. For art files it will be rrrrrrrr_ssssss, where the r field is the run number and the s field is the subrun number of the lowest-sorted subrun in the file. A subrun should appear in only one file, so this is uniquely determined for a file within a dataset.
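The art-file sequencer could be built as sketched below. Two points are assumptions rather than statements from this page: that the r and s fields are zero-padded decimal numbers of width 8 and 6 (inferred from the rrrrrrrr_ssssss pattern and the example 12345678_123456), and that "lowest sorted" means ordering by run number first and subrun number second. The function names are hypothetical.

```python
from typing import Iterable, Tuple


def art_sequencer(run: int, subrun: int) -> str:
    """Format run and subrun as rrrrrrrr_ssssss (assumed zero-padded decimal)."""
    if not (0 <= run < 10**8 and 0 <= subrun < 10**6):
        raise ValueError(f"run/subrun out of range: {run}, {subrun}")
    return f"{run:08d}_{subrun:06d}"


def sequencer_for_file(subruns: Iterable[Tuple[int, int]]) -> str:
    """Build the sequencer from the lowest (run, subrun) pair in a file."""
    # Tuples compare by run first, then subrun (assumed sort order).
    run, subrun = min(subruns)
    return art_sequencer(run, subrun)


print(sequencer_for_file([(1002, 40), (1002, 37)]))
```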
file_format
This is the commonly-recognized file type, one of a fixed list (that can be extended): art, root, txt, tar, tgz, tbz (tar.bz2), log, fcl
A list of valid values from the database can be generated:
samweb list-values file_formats
Coordinating names between datasets
An official Monte Carlo production may have datasets for cnf, sim, mix, dig, mcs, nts and log; examples of their file names might look like:
cnf.mu2e.tdr-beam.TS3ToDS23.001-0001.fcl
sim.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
mix.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
dig.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
mcs.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
nts.mu2e.tdr-beam.TS3ToDS23.12345678_123456.root
log.mu2e.tdr-beam.TS3ToDS23.001.tgz
If a new digitization (dig) file were to be made with a different mix file, then a derived name could be used. Since this is a new set of conditions, it makes sense to modify the configuration field:
dig.mu2e.tdr-beam.TS3ToDS23-v2.12345678_123456.art
When making variations there is a temptation to include all the information related to the change in the file name. For example, when switching the mix input from 2014.tag123 to 2014a.tag456, it is tempting to add that instead:
dig.mu2e.tdr-beam.TS3ToDS23-mix2014a-tag456.12345678_123456.art
This style can get out of hand quickly, leading to large, unwieldy names, so we should prefer (always with judgment and common sense) to simplify to just "v2", which must be documented elsewhere.
If a user made the change for their own purposes, they would make it user data (and put it in the appropriate file family) by including their username:
dig.batman.tdr-beam.TS3ToDS23-v2.12345678_123456.art
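The derived names in this section all follow from editing individual fields of an existing name. A minimal sketch of that operation, using a hypothetical helper `derive_name` (joining the suffix with a hyphen simply mirrors the "TS3ToDS23-v2" example and is not a stated rule):

```python
from typing import Optional


def derive_name(
    file_name: str,
    owner: Optional[str] = None,
    config_suffix: Optional[str] = None,
) -> str:
    """Return a new six-field name with the owner and/or configuration changed."""
    fields = file_name.split(".")
    if len(fields) != 6:
        raise ValueError(f"expected 6 dot-separated fields: {file_name!r}")
    if owner is not None:
        fields[1] = owner  # second field is the owner
    if config_suffix is not None:
        fields[3] = f"{fields[3]}-{config_suffix}"  # fourth field is the configuration
    return ".".join(fields)


official = "dig.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art"
print(derive_name(official, config_suffix="v2"))
print(derive_name(official, owner="batman", config_suffix="v2"))
```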
Raw, reconstructed and ntuple beam data might look like:
raw.mu2e.streamA.triggerTable123.12345678_123456.art
rec.mu2e.streamA.triggerTable123.12345678_123456.art
ntd.mu2e.streamA.triggerTable123.0001.root
A backup of an analysis project might look like:
bck.batman.node123.2014-06-04.aa.tgz