SamMetadata

From Mu2eWiki


Introduction

One of SAM's main purposes is to store metadata about our files. The mu2e instance of a SAM database has a unique set of metadata fields, listed below. We can add to them and, except for a few fundamental fields, we can use them as we see fit. We will require that useful fields be filled wherever possible, and try to make it convenient for users to fill those fields.

A user should be aware of the SAM metadata fields in general, and may find some of them useful, but detailed knowledge is not necessary.

When a file record is created in SAM, it should have the metadata defined. It can be changed after the record is created, but we want to avoid that. A SAM file record can be created no matter what the state of the file is, even if it is virtual. A real file may be in a temporary location or in a final location, and already copied to tape.

The file locations are part of the SAM record but are not usually considered metadata. These are usually paths in dCache, but may be almost anything. Locations are updated more easily, and are often recorded apart from the basic metadata.


SAM does not have the concept of dataset metadata, so all metadata has to be supplied for each file, even if it is the same for all the files in the dataset. See file names for a definition of a dataset.

All the metadata fields can be listed:

setup sam_web_client
samweb list-parameters
samweb list-parameters <parameter>
samweb list-values --help-categories
samweb list-values <category>

Metadata fields can be added at any time for files created in the future. New metadata fields for existing files can be added but may be quite hard to fill, depending on how the information needs to be gathered.

An important note on the reliability of metadata: the contents and validity of any file or dataset cannot be reliably determined from a database entry alone, if for no other reason than that we do not know whether the database has been well maintained. It is not uncommon to find obsolete or invalidated data, unmarked, among good data. Expert consultation, validation, peer review, and vigilance are always required when selecting and processing data for critical work.

Some metadata, such as the file name, is decided by the user. Much of the rest, such as the file size, is easily derived and is generated automatically in the upload procedures. Fields that are usually chosen by the user are marked "(user)".

Required metadata

  • file_size Integer, size in bytes
  • crc Integer
    Note, for debugging purposes, this crc can be computed by:
    setup encp v3_11 -q stken
    ecrc <filename>
    
  • create_user String - SAM user name (usually a group account)
  • create_date Date - when the SAM record was created
  • file_name String - (user) see file names
  • data_tier String - part of filename (user)
  • for physics data:
       raw   raw data from DAQ
       rec   reconstructed
       ntd   data ntuples
    for ExtMon data:
       ext   ExtMon raw
       rex   ext production
       xnt   ext data ntuples
    for simulation:
       cnf   set of config files fcl or txt, to drive MC jobs
       sim   result of geant, StepPointMC
       mix   mixed sim files (has multiple generators)
       dig   detector hits, like raw data
       mcs   reconstructed data files
       nts   MC ntuples
    other categories:
       log for log files
       bck for backups
       etc for anything else
       job for a production record
    
  • dh.owner String - part of filename (user)
  • For official data samples and Monte Carlo that go into the phy* file families, this will be "mu2e". For user files, it will be the username of the person most likely to understand how they were created and how they should be used if questions come up a year or two later - the intellectual owner of the data.
  • dh.description String - part of filename (user)
  • This is a mnemonic string which fundamentally indicates what this set of files contains and its intended purpose. It can also be thought of as a conceptual project and/or a high-level indication of the physics. This should be limited to 20 characters, but may be more if it is strongly motivated. Examples are "tdr-beam" or "2014-cosmics". It might contain the name of the responsible group. It should not contain a username or detailed configurations.
  • dh.configuration String - part of filename (user)
  • This field is intended to capture details of the configuration, or variations of the configuration, of the physics indicated in the description field. It might indicate cvs tags/git branches of configurations, fcl file names, offline versions, geometry versions, magnet settings, filter settings, etc. Not all of this information needs to be included; this is just a list of the sort of information intended for this field. This should be limited to 20 characters, but may be more if it is strongly motivated. For complex configurations, in order to avoid a very lengthy string, you should capture all information in a simple string like "tag100" or "c427-m1-g5-v2" which is documented elsewhere.
  • dh.sequencer String - part of filename (user sometimes)
  • This field is simply to give unique filenames to files of a single dataset. It could be a counter 0000, 0001, etc. For art files we will try to make it rrrrrrrr_ssssss, where the r field indicates the run number and the s field indicates the subrun of the lowest sorted subrun eventID in the file. A subrun should only appear in one file so this is uniquely determined for a file in a dataset.
  • dh.dataset String - part of filename (user)
  • a convenient search field made from the file name without the sequencer. It is unique for a logical dataset.
  • file_format String - part of filename (user)
  • This is the commonly-recognized file type, one of a fixed list (that can be extended): art, root, txt, tar, tgz, tbz (tar.bz2), log, fcl
  • content_status String
  • always "good" at upload, can be set to "bad" later to deprecate files without deleting them
  • file_type String
  • "data", "MC" or "other"
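
The relationship between the file name and the filename-derived fields above can be sketched in Python. This is only an illustration, not the upload tooling: the function name `parse_file_name` is hypothetical, and the field order (data_tier.owner.description.configuration.sequencer.file_format) is an assumption inferred from the example file name shown on this page.

```python
from typing import Dict

# Assumed field order, inferred from the example file name on this page:
# data_tier.owner.description.configuration.sequencer.file_format
FIELDS = (
    "data_tier",
    "dh.owner",
    "dh.description",
    "dh.configuration",
    "dh.sequencer",
    "file_format",
)


def parse_file_name(file_name: str) -> Dict[str, str]:
    """Split a six-field, dot-separated file name into metadata fields."""
    parts = file_name.split(".")
    if len(parts) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} dot-separated fields, "
            f"got {len(parts)}: {file_name!r}"
        )
    meta = dict(zip(FIELDS, parts))
    meta["file_name"] = file_name
    # dh.dataset is the file name with the sequencer field left out.
    meta["dh.dataset"] = ".".join(parts[:4] + parts[5:])
    return meta


if __name__ == "__main__":
    name = "sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art"
    for key, value in parse_file_name(name).items():
        print(f"{key}: {value}")
```

Because the sketch splits on every dot, it rejects names whose fields themselves contain dots; whether such names are allowed is not stated here.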

Required for all uploaded art files

  • event_count Integer
  • total physics events in the file

The following fields list the first and last event in the file, after sorting. If the file has no events, these fields should be omitted from the SAM file declaration.

  • dh.first_run_event Integer
  • run of the lowest sorted physics event ID
  • dh.first_subrun_event Integer
  • subrun of the lowest sorted physics event ID
  • dh.first_event Integer
  • event of the lowest sorted physics event ID
  • dh.last_run_event Integer
  • run of the highest sorted physics event ID
  • dh.last_subrun_event Integer
  • subrun of the highest sorted physics event ID
  • dh.last_event Integer
  • event of the highest sorted physics event ID
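
As a rough sketch of how the six first/last event fields could be derived, the Python below sorts event IDs and picks the two ends. It assumes that "sorted" means ordering by run, then subrun, then event number, and that event IDs are available as plain integer triplets; the function name `event_range_metadata` is hypothetical and is not part of any Mu2e tool.

```python
from typing import Dict, Iterable, Tuple

EventID = Tuple[int, int, int]  # (run, subrun, event)


def event_range_metadata(event_ids: Iterable[EventID]) -> Dict[str, int]:
    """Return event_count plus the first/last event fields, if any."""
    # Tuples sort lexicographically: by run, then subrun, then event.
    ids = sorted(event_ids)
    meta = {"event_count": len(ids)}
    if not ids:
        # No events: leave the first/last fields out of the declaration.
        return meta
    first, last = ids[0], ids[-1]
    meta.update({
        "dh.first_run_event": first[0],
        "dh.first_subrun_event": first[1],
        "dh.first_event": first[2],
        "dh.last_run_event": last[0],
        "dh.last_subrun_event": last[1],
        "dh.last_event": last[2],
    })
    return meta
```

For a file with no events the result contains only event_count, which matches the rule that the first/last fields are omitted in that case.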

The following fields list the first and last subrun in the file, after sorting (including subruns with no events).

  • dh.first_run_subrun Integer
  • run of the lowest sorted subrun ID
  • dh.first_subrun Integer
  • subrun of the lowest sorted subrun ID
  • dh.last_run_subrun Integer
  • run of the highest sorted subrun ID
  • dh.last_subrun Integer
  • subrun of the highest sorted subrun ID

A list of subruns. For Monte Carlo, if this list is too long (>100), it is not written because it takes too much time and can occasionally fail. We have a request to add ranges which will help speed this up.

  • runs List of lists
  • The list of subruns in this file, represented as a list of triplets of: run, subrun, run_type
  • run_type String
  • This parameter is not supplied once per file, but once per run in the "runs" parameter. Must be from a fixed list of "test", "MC", or "other". For data, values like "beam", "calib", "cosmic" will be added. Primarily used for data, all Monte Carlo will be called "MC". Different types of MC will be identified by the generator_type and primary_particle fields.
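
The shape of the "runs" parameter can be illustrated with a short Python sketch. The function name `runs_parameter` and its arguments are hypothetical; the 100-entry cutoff for Monte Carlo follows the description above, and returning `None` to mean "do not write the list" is a choice made only for this example.

```python
from typing import Iterable, List, Optional, Tuple


def runs_parameter(
    subruns: Iterable[Tuple[int, int]],
    run_type: str,
    is_mc: bool = False,
    max_mc_entries: int = 100,
) -> Optional[List[list]]:
    """Build the list of [run, subrun, run_type] triplets for one file."""
    # Each (run, subrun) pair is listed once, in sorted order.
    unique = sorted(set(subruns))
    if is_mc and len(unique) > max_mc_entries:
        # Long Monte Carlo lists are not written at all.
        return None
    return [[run, subrun, run_type] for run, subrun in unique]
```

For example, `runs_parameter([(1002, 5)], "MC", is_mc=True)` yields a single triplet for run 1002, subrun 5.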

Here is a dump of the metadata for a cd3 stage1 MC file:

> setup sam_web_client
> samweb get-metadata sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art 
          File Name: sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art
            File Id: 1307399
        Create Date: 2015-05-07T14:13:08+00:00
               User: mu2epro
          File Size: 4290572
           Checksum: enstore:339371671
     Content Status: good
          File Type: mc
        File Format: art
          Data Tier: sim
        Event Count: 3018
   dh.configuration: 0506a
         dh.dataset: sim.mu2e.cd3-beam-g4s1-dsregion.0506a.art
     dh.description: cd3-beam-g4s1-dsregion
     dh.first_event: 5
 dh.first_run_event: 1002
dh.first_run_subrun: 1002
    dh.first_subrun: 5
      dh.last_event: 10000
  dh.last_run_event: 1002
 dh.last_run_subrun: 1002
     dh.last_subrun: 5
           dh.owner: mu2e
       dh.sequencer: 001002_00000005
     dh.source_file: /export/data1/condor/execute/dir_3030/glide_4DEqbQ/execute/dir_28034/no_xfer/sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art
  mc.generator_type: beam
mc.primary_particle: proton
mc.simulation_stage: 1
               Runs: 1002.0005 (mc)
            Parents: cnf.mu2e.cd3-beam-g4s1.0506a.001002_00000005.fcl
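
When working with a dump like the one above in a script, it can be handy to turn the text into a dictionary. The Python below is a minimal sketch that assumes the "Key: value" text layout shown above, one field per line; the function name `parse_metadata_dump` is hypothetical, and fields that span several lines (for instance, several parents) are not handled.

```python
from typing import Dict


def parse_metadata_dump(text: str) -> Dict[str, str]:
    """Parse 'Key: value' lines from a metadata text dump into a dict."""
    meta: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and shell prompt lines such as "> samweb ...".
        if not line or line.startswith(">"):
            continue
        # Split on the first ": " only, so values that contain colons
        # (dates, "enstore:339371671") stay intact.
        key, sep, value = line.partition(": ")
        if not sep:
            continue
        meta[key.strip()] = value.strip()
    return meta
```

All values are kept as strings; converting fields such as the event count to integers is left to the caller.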

Optional metadata

  • mc.generator_type String - generator(user)
  • One of the pre-defined values: "beam", "stopped_particle", "cosmic", "mix", or "unknown"
  • mc.simulation_stage Integer - MC stage (user)
  • Which step in multi-step generation
  • mc.primary_particle String - primary particle in generation (user)
  • One of the pre-defined values: "proton", "pbar", "electron", "muon", "neutron", "mix", or "unknown"
  • dh.source_file String
  • The full file spec of the data file on disk, useful for understanding the history of the file and for identifying this file as a parent of other files.
  • dh.gencount int
  • Often in MC generation, the total number of generated events in a file is stored in a product called GenEventCount which can be extracted and stored here. The generated events may be rejected in the generator or in subsequent filters so the number of art events in the file can be much fewer. The number of generated events is kept to allow normalization.
  • parents List of Strings
  • For files derived from other specific SAM files, this contains the SAM names of the parent files
  • retire_date Date
  • When this field is filled, the file becomes permanently retired in the enstore system and may be overwritten
  • dh.sha256 String
  • An alternate checksum for the file contents. It can be computed with: sha256sum filename
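
For scripts that need the SHA-256 value without calling an external command, the standard Python `hashlib` module gives the same hexadecimal digest that `sha256sum` prints. This is a minimal sketch; the function name `file_sha256` is hypothetical, and it is an assumption that dh.sha256 holds the plain hexadecimal digest.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat for large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in fixed-size chunks avoids loading a multi-gigabyte file into memory at once.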

Metadata only for production records

Some production activity is stored as virtual SAM records, which are not of general interest.

  • job.cpu int
  • job cpu time in sec
  • job.maxres int
  • job max resident size in KB
  • job.site string
  • job grid site name
  • job.node string
  • job node
  • job.disk int
  • job disk space used, in KB

Metadata only for real data

  • start_time Date
  • Time the file was opened during data-taking
  • end_time Date
  • Time the file was closed during data-taking

Real data will require additional fields, such as run types, goodrun bits, detector configuration, etc.