FileNames

From Mu2eWiki

Introduction

For everyday personal use, file names can be anything that is convenient for the user. If the files

  • go to tape
  • were produced by a collaboration effort
  • gain some long-term status
  • are used by more than one or two people

they must be named according to the fixed six-field pattern described here. Files written to tape must follow this pattern without exception; the other criteria may admit exceptions. The primary Monte Carlo workflow has this naming pattern built in.

File names should be relatively short, follow logical patterns that searches can be based on, and contain enough human-recognizable information to help someone distinguish datasets, confirm that they are running on the right files, or pick a file for testing code.

The user has some discretion in choosing names, and should embrace the following concepts for file names:

  • must be unique for each file
  • must contain only alphanumeric characters, hyphens, and underscores
  • should be mnemonic and helpful
  • must not be designed as, or assumed to be, complete documentation of the file contents. In other words, do not attempt to include all the metadata in the file name.
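The character rule in the list above can be checked mechanically. The following is a minimal sketch, assuming that the dot is reserved as the field separator (as in the pattern described below) and that empty fields are not allowed; the function name has_valid_characters is hypothetical and is not part of any Mu2e tool.

```python
import re

# Allowed characters inside a field: letters, digits, hyphens, underscores.
# Assumption: the dot only ever appears as the separator between fields.
_FIELD_RE = re.compile(r"[A-Za-z0-9_-]+")


def has_valid_characters(file_name: str) -> bool:
    """Return True if every dot-separated field is non-empty and uses only allowed characters."""
    fields = file_name.split(".")
    return all(_FIELD_RE.fullmatch(field) is not None for field in fields)


print(has_valid_characters("sim.mu2e.beam_g4s1_dsregion.0429a.12345678_123456.art"))
print(has_valid_characters("sim.mu2e.beam g4s1.0429a.0001.art"))  # contains a space
```

This check covers only the character set; it says nothing about uniqueness or about whether the name is helpful to a reader.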

Mu2e will name all uploaded files with the following pattern of six dot-separated fields:

data_tier.owner.description.configuration.sequencer.file_format

for example:

sim.mu2e.beam_g4s1_dsregion.0429a.12345678_123456.art

Each of these fields is discussed in more detail below. All of them correspond to required fields in the SAM metadata database. Because the owner is part of the file name, name conflicts can only occur within a single owner's files.
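As an illustration of the six-field pattern, a file name can be split into named fields. This is only a sketch: the class Mu2eFileName and the function parse_file_name are hypothetical names chosen for this example, and the code assumes that no field contains a dot.

```python
from typing import NamedTuple


class Mu2eFileName(NamedTuple):
    """The six dot-separated fields of a file name, in order."""
    data_tier: str
    owner: str
    description: str
    configuration: str
    sequencer: str
    file_format: str


def parse_file_name(file_name: str) -> Mu2eFileName:
    """Split a file name into its six fields, or raise ValueError."""
    fields = file_name.split(".")
    if len(fields) != 6:
        raise ValueError(
            f"expected 6 dot-separated fields, got {len(fields)}: {file_name!r}"
        )
    return Mu2eFileName(*fields)


parsed = parse_file_name("sim.mu2e.beam_g4s1_dsregion.0429a.12345678_123456.art")
print(parsed.owner, parsed.sequencer)  # prints: mu2e 12345678_123456
```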

Datasets

If you remove the sequencer from a file name, you are left with five dot-separated fields:

data_tier.owner.description.configuration.file_format

for example:

sim.mu2e.beam_g4s1_dsregion.0429a.art

This string is unique to the logical dataset and is stored in the "dh.dataset" SAM metadata field. A dataset consists of all files that share the same conceptual and actual metadata, except for run numbers and other natural run dependence, and that contain no duplicated event ID numbers. SAM has no concept of dataset-level metadata, so files are grouped into a conceptual dataset by giving them identical values for certain metadata fields. All files in a logical dataset have the same "dh.dataset" value, which is unique to that dataset. The average user will almost always run on a whole dataset, and so will only need to refer to this dataset name.
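The relationship between a file name and its dataset name can be sketched as dropping the fifth field. The function name dataset_name is hypothetical; the sketch assumes a well-formed six-field file name and does not query SAM.

```python
def dataset_name(file_name: str) -> str:
    """Return the five-field dataset name obtained by removing the sequencer."""
    fields = file_name.split(".")
    if len(fields) != 6:
        raise ValueError(
            f"expected 6 dot-separated fields, got {len(fields)}: {file_name!r}"
        )
    data_tier, owner, description, configuration, _sequencer, file_format = fields
    return ".".join([data_tier, owner, description, configuration, file_format])


print(dataset_name("sim.mu2e.beam_g4s1_dsregion.0429a.12345678_123456.art"))
# prints: sim.mu2e.beam_g4s1_dsregion.0429a.art
```

Two files belong to the same logical dataset exactly when this function returns the same string for both.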

Name Fields

data_tier

The data tier describes the type of data, conceptually. Some of these concepts include raw, simulated but not reconstructed, and fully reconstructed. The choices for this field are fixed and must come from the list below.

  • for physics data:
    • raw
    • rec reconstructed
    • ntd data ntuples
  • for ExtMon data:
    • ext ExtMon raw
    • rex ext production
    • xnt ext data ntuples
  • for simulation:
    • cnf set of config files (fcl or txt) to drive MC jobs
    • sim result of geant, StepPointMC
    • mix mixed sim files (has multiple generators)
    • dig detector hits, like raw data
    • mcs reconstructed data files
    • nts MC ntuples
  • other categories:
    • log for log files
    • bck for backups
    • etc for anything else
    • job for a production record

The list of valid values can be generated from the database:

samweb list-values data_tiers
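For a quick offline check, the data tier of a file name can be compared against the list on this page. This is a sketch under two assumptions: the set below is copied by hand from the list above and may lag behind the database (the samweb output remains authoritative), and the names KNOWN_DATA_TIERS and check_data_tier are hypothetical.

```python
# Snapshot of the data tiers listed on this page; the SAM database is authoritative.
KNOWN_DATA_TIERS = frozenset({
    "raw", "rec", "ntd",                        # physics data
    "ext", "rex", "xnt",                        # ExtMon data
    "cnf", "sim", "mix", "dig", "mcs", "nts",   # simulation
    "log", "bck", "etc", "job",                 # other categories
})


def check_data_tier(file_name: str) -> str:
    """Return the data tier (first field) of a file name, or raise ValueError."""
    tier = file_name.split(".", 1)[0]
    if tier not in KNOWN_DATA_TIERS:
        raise ValueError(f"unknown data tier {tier!r} in {file_name!r}")
    return tier


print(check_data_tier("dig.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art"))  # prints: dig
```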

owner

For official data samples and Monte Carlo datasets that are generated by a collaboration effort and are intended to be semi-permanent and widely used, the owner is mu2e. For user files, the owner is the username of the person most likely to understand how the files were created and how they should be used if questions come up a year or two later: the intellectual owner of the data. In a few cases, the most logical owner is a group smaller than the collaboration. For example, the calorimeter group should retain good institutional memory of its test beam data, so it may be more logical to point to the group than to a single user. In such cases, the group names "trk", "cal", and "crv" can be used.

description

This is a mnemonic string that indicates what the set of files contains and its intended purpose. It can also be thought of as a conceptual project or a high-level indication of the physics. It should be limited to 20 characters, but may be longer if strongly motivated. Examples are "tdr-beam" or "2014-cosmics". It might contain the name of the responsible group. It should not contain a username or detailed configurations.

configuration

This field is intended to capture details of the configuration, or variations of the configuration, of the physics indicated in the description field. It might indicate cvs tags or git branches of configurations, fcl file names, offline versions, geometry versions, magnet settings, filter settings, etc. Not all of this information needs to be included; this is just a list of the sort of information intended for this field. It should be limited to 20 characters, but may be longer if strongly motivated. For complex configurations, to avoid a very lengthy string, capture all the information in a simple string such as "tag100" or "c427-m1-g5-v2" that is documented elsewhere.
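The 20-character guideline for the description and configuration fields is a recommendation, not a hard rule, so a checker would reasonably report warnings and not reject the name. A minimal sketch, with the hypothetical names SOFT_LIMIT and length_warnings:

```python
from typing import List

SOFT_LIMIT = 20  # characters; longer strings are allowed when strongly motivated


def length_warnings(description: str, configuration: str) -> List[str]:
    """Return one warning per field that exceeds the soft length limit."""
    warnings = []
    for label, value in (("description", description), ("configuration", configuration)):
        if len(value) > SOFT_LIMIT:
            warnings.append(
                f"{label} {value!r} is {len(value)} characters (soft limit {SOFT_LIMIT})"
            )
    return warnings


print(length_warnings("tdr-beam", "c427-m1-g5-v2"))  # prints: []
```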

sequencer

This field simply gives unique names to the files of a single dataset. It could be a counter: 0000, 0001, etc. For art files it is rrrrrrrr_ssssss, where the r field is the run number and the s field is the subrun number, both taken from the lowest-sorted subrun ID in the file. A subrun should appear in only one file, so the sequencer is uniquely determined for a file within a dataset.
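The art sequencer format can be sketched as follows. The zero-padding to 8 and 6 digits is an assumption inferred from the rrrrrrrr_ssssss template and the example 12345678_123456; the function names art_sequencer and lowest_subrun_sequencer are hypothetical.

```python
from typing import Iterable, Tuple


def art_sequencer(run: int, subrun: int) -> str:
    """Format a run and subrun as rrrrrrrr_ssssss (zero-padded; padding is assumed)."""
    if not 0 <= run <= 99_999_999:
        raise ValueError(f"run {run} does not fit in 8 digits")
    if not 0 <= subrun <= 999_999:
        raise ValueError(f"subrun {subrun} does not fit in 6 digits")
    return f"{run:08d}_{subrun:06d}"


def lowest_subrun_sequencer(subruns: Iterable[Tuple[int, int]]) -> str:
    """Build the sequencer from the lowest-sorted (run, subrun) pair in a file."""
    run, subrun = min(subruns)  # tuples sort by run first, then by subrun
    return art_sequencer(run, subrun)


print(lowest_subrun_sequencer([(1002, 7), (1002, 5), (1003, 0)]))
# prints: 00001002_000005
```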

file_format

This is the commonly recognized file type, chosen from a fixed list (that can be extended): art, root, txt, tar, tgz, tbz (tar.bz2), log, fcl, stn, mid

The list of valid values can be generated from the database:

samweb list-values file_formats

Coordinating names between datasets

An official Monte Carlo production may have datasets for cnf, sim, mix, dig, mcs, nts, and log, with file names such as:

cnf.mu2e.tdr-beam.TS3ToDS23.001-0001.fcl
sim.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
mix.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
dig.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
mcs.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
nts.mu2e.tdr-beam.TS3ToDS23.12345678_123456.root
log.mu2e.tdr-beam.TS3ToDS23.001.tgz
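The coordination shown above amounts to holding the owner, description, and configuration fixed while the data tier and file format vary. The sketch below generates the corresponding five-field dataset names; the tier-to-format pairs are taken from the example above, and the names TIER_FORMATS and coordinated_datasets are hypothetical.

```python
from typing import List

# (data_tier, file_format) pairs as they appear in the example above.
TIER_FORMATS = [
    ("cnf", "fcl"),
    ("sim", "art"),
    ("mix", "art"),
    ("dig", "art"),
    ("mcs", "art"),
    ("nts", "root"),
    ("log", "tgz"),
]


def coordinated_datasets(owner: str, description: str, configuration: str) -> List[str]:
    """Return dataset names that share owner, description, and configuration."""
    return [
        f"{tier}.{owner}.{description}.{configuration}.{file_format}"
        for tier, file_format in TIER_FORMATS
    ]


for name in coordinated_datasets("mu2e", "tdr-beam", "TS3ToDS23"):
    print(name)
```

The first line printed is cnf.mu2e.tdr-beam.TS3ToDS23.fcl, the dataset that contains the cnf file in the example.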

If a new digitization (dig) file were to be made with a different mix file, then a derived name could be used. Since this is a new set of conditions, it makes sense to modify the configuration field:

dig.mu2e.tdr-beam.TS3ToDS23-v2.12345678_123456.art

When making variations, there is a temptation to include all the information related to the change in the file name. For example, when switching the mix input from 2014.tag123 to 2014a.tag456, it is tempting to encode that in the name:

dig.mu2e.tdr-beam.TS3ToDS23-mix2014a-tag456.12345678_123456.art

This style can quickly get out of hand and lead to long, unwieldy names, so the preferred approach (always applied with judgment and common sense) is to simplify to just "v2", which must then be documented elsewhere.

A user who made this change for their own purposes would mark the files as user data (and put them in the appropriate file family) by using their username as the owner:

dig.batman.tdr-beam.TS3ToDS23-v2.12345678_123456.art

Raw, reconstructed, and ntuple beam data might look like:

raw.mu2e.streamA.triggerTable123.12345678_123456.art
rec.mu2e.streamA.triggerTable123.12345678_123456.art
ntd.mu2e.streamA.triggerTable123.0001.root

A backup of an analysis project might look like:

bck.batman.node123.2014-06-04.aa.tgz