SamMetadata
Revision as of 14:53, 2 August 2017
Introduction
One of SAM's main purposes is to store metadata about our files. The mu2e instance of a SAM database has a unique set of metadata fields, listed below. We can add to them and, except for a few fundamental fields, we can use them as we see fit. We will require that useful fields be filled wherever possible, and try to make it convenient for users to fill those fields.
A user should be aware of the SAM metadata fields in general, and may find some of them useful, but detailed knowledge is not necessary.
When a file record is created in SAM, it should have the metadata defined. It can be changed after the record is created, but we want to avoid that. A SAM file record can be created no matter what the state of the file is, even if it is virtual. A real file may be in a temporary location or in a final location, and already copied to tape.
Part of the SAM record, not usually considered metadata, is the list of file locations. These are usually paths in dCache, but may be almost anything. Locations are updated more easily, and are often recorded separately from the basic metadata.
SAM does not have the concept of dataset metadata,
so all metadata has to be supplied for each file, even if it is
the same for all the files in the dataset. See file names
for a definition of a dataset.
All the metadata fields can be listed:
  setup samweb
  samweb list-parameters
  samweb list-parameters <parameter>
  samweb list-values --help-categories
  samweb list-values <category>
Metadata fields can be added at any time for files created in the future. New metadata fields for existing files can be added but may be quite hard to fill, depending on how the information needs to be gathered.
An important note on the reliability of metadata: the contents and validity of any file or dataset cannot be reliably determined from a database entry alone, if for no other reason than that we cannot know whether the database has been well maintained. It is not uncommon to find obsolete or invalidated data, unmarked, among good data. Expert consultation, validation, peer review, and vigilance are always required when selecting and processing data for critical work.
Some metadata, such as the file name, is decided by the user. Much of the rest, such as the file size, is easily derived and is generated automatically by the upload procedures. Fields usually chosen by the user are marked "(user)".
Required metadata
- file_size Integer, size in bytes
- crc Integer
  Note: for debugging purposes, this crc can be computed by:
    setup encp v3_11 -q stken
    ecrc <filename>
- create_user String - SAM user name (usually a group account)
- create_date Date when uploaded
- file_name String - (user) see file names
- data_tier String - part of filename (user)
  for physics data:
    raw
    rec  reconstructed
    ntd  data ntuples
  for ExtMon data:
    ext  ExtMon raw
    rex  ext production
    xnt  ext data ntuples
  for simulation:
    cnf  set of config files, fcl or txt, to drive MC jobs
    sim  result of geant, StepPointMC
    mix  mixed sim files (has multiple generators)
    dig  detector hits, like raw data
    mcs  reconstructed data files
    nts  MC ntuples
  other categories:
    log  for log files
    bck  for backups
    etc  for anything else
    job  for a production record
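The required fields and data tiers listed above lend themselves to a simple pre-upload check. The following is a minimal sketch only: the function name `check_required_metadata`, the dictionary layout, and the idea of checking a record this way are illustrative assumptions, not part of SAM or of the mu2e upload tools. The example values are taken from the file dump shown later on this page, with the enstore checksum standing in for crc.

```python
# Required fields as listed in the "Required metadata" section.
REQUIRED_FIELDS = (
    "file_size", "crc", "create_user", "create_date", "file_name", "data_tier",
)

# Data tiers as listed under data_tier.
KNOWN_DATA_TIERS = {
    "raw", "rec", "ntd",                       # physics data
    "ext", "rex", "xnt",                       # ExtMon data
    "cnf", "sim", "mix", "dig", "mcs", "nts",  # simulation
    "log", "bck", "etc", "job",                # other categories
}


def check_required_metadata(metadata):
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper: `metadata` is assumed to be a plain dict keyed by
    the field names used on this page.
    """
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in metadata:
            problems.append(f"missing required field: {field}")
    tier = metadata.get("data_tier")
    if tier is not None and tier not in KNOWN_DATA_TIERS:
        problems.append(f"unknown data_tier: {tier}")
    size = metadata.get("file_size")
    if size is not None and (not isinstance(size, int) or size < 0):
        problems.append("file_size must be a non-negative integer")
    return problems


# Example record; the dict representation itself is an assumption.
example = {
    "file_name": "sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art",
    "file_size": 4290572,
    "crc": 339371671,
    "create_user": "mu2epro",
    "create_date": "2015-05-07T14:13:08+00:00",
    "data_tier": "sim",
}
print(check_required_metadata(example))
```

A check like this only confirms that fields are present and plausible; it says nothing about whether the values are correct.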
Required for all uploaded art files
- event_count Integer total physics events in the file
- dh.first_run_event Integer run of the lowest sorted physics event ID
- dh.first_event Integer event of the lowest sorted physics event ID
- dh.last_run_event Integer run of the highest sorted physics event ID
- dh.last_event Integer event of the highest sorted physics event ID
- dh.first_run_subrun Integer run of the lowest sorted subrun ID
- dh.first_subrun Integer subrun of the lowest sorted subrun ID
- dh.last_run_subrun Integer run of the highest sorted subrun ID
- dh.last_subrun Integer subrun of the highest sorted subrun ID
- runs List of lists The list of subruns in this file, represented as a list of triplets of: run, subrun, run_type
- run_type String This parameter is not supplied once per file, but once per run in the "runs" parameter. It must be one of a fixed list: "test", "MC", or "other". For data, values like "beam", "calib", and "cosmic" will be added. It is primarily used for data; all Monte Carlo will be called "MC". Different types of MC will be identified by the generator_type and primary_particle fields.
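As an illustration of how the event and subrun range fields above relate to the contents of a file, here is a minimal sketch. It assumes that event IDs are available as (run, subrun, event) integer triplets, that "lowest sorted" means ordering by run, then subrun, then event, and that the subruns in the file are exactly those that contain events. The function name `event_range_fields` is hypothetical; the real upload procedures may derive these values differently.

```python
def event_range_fields(event_ids, run_type):
    """Derive the art-file range fields from (run, subrun, event) triplets.

    Assumptions: IDs sort as (run, subrun, event) tuples, and every subrun
    in the file contains at least one event.
    """
    ids = sorted(event_ids)
    if not ids:
        raise ValueError("no events supplied")
    first, last = ids[0], ids[-1]
    # Unique (run, subrun) pairs, in sorted order.
    subruns = sorted({(run, subrun) for run, subrun, _ in ids})
    return {
        "event_count": len(ids),
        "dh.first_run_event": first[0],
        "dh.first_event": first[2],
        "dh.last_run_event": last[0],
        "dh.last_event": last[2],
        "dh.first_run_subrun": subruns[0][0],
        "dh.first_subrun": subruns[0][1],
        "dh.last_run_subrun": subruns[-1][0],
        "dh.last_subrun": subruns[-1][1],
        # One (run, subrun, run_type) triplet per subrun.
        "runs": [[run, subrun, run_type] for run, subrun in subruns],
    }


fields = event_range_fields([(1002, 5, 10000), (1002, 5, 5), (1002, 5, 17)], "MC")
print(fields["dh.first_event"], fields["dh.last_event"], fields["runs"])
```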
Here is a dump of the metadata for a cd3 stage1 MC file:
  > setup sam_web_client
  > samweb get-metadata sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art
  File Name: sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art
  File Id: 1307399
  Create Date: 2015-05-07T14:13:08+00:00
  User: mu2epro
  File Size: 4290572
  Checksum: enstore:339371671
  Content Status: good
  File Type: mc
  File Format: art
  Data Tier: sim
  Event Count: 3018
  dh.configuration: 0506a
  dh.dataset: sim.mu2e.cd3-beam-g4s1-dsregion.0506a.art
  dh.description: cd3-beam-g4s1-dsregion
  dh.first_event: 5
  dh.first_run_event: 1002
  dh.first_run_subrun: 1002
  dh.first_subrun: 5
  dh.last_event: 10000
  dh.last_run_event: 1002
  dh.last_run_subrun: 1002
  dh.last_subrun: 5
  dh.owner: mu2e
  dh.sequencer: 001002_00000005
  dh.source_file: /export/data1/condor/execute/dir_3030/glide_4DEqbQ/execute/dir_28034/no_xfer/sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art
  mc.generator_type: beam
  mc.primary_particle: proton
  mc.simulation_stage: 1
  Runs: 1002.0005 (mc)
  Parents: cnf.mu2e.cd3-beam-g4s1.0506a.001002_00000005.fcl
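In the dump above, several fields (Data Tier, dh.owner, dh.description, dh.configuration, dh.sequencer, File Format, and dh.dataset) match the dot-separated pieces of the file name. The sketch below shows that correspondence. It assumes a six-field name of the form tier.owner.description.configuration.sequencer.format, inferred from this one example; see the file names page for the actual convention. The function name `parse_file_name` and the key "file_format" are hypothetical.

```python
def parse_file_name(file_name):
    """Split a file name into the metadata fields it appears to encode.

    Assumes exactly six dot-separated fields, as in the example dump;
    this layout is inferred from that example, not taken from a spec.
    """
    parts = file_name.split(".")
    if len(parts) != 6:
        raise ValueError(f"expected 6 dot-separated fields, got {len(parts)}")
    tier, owner, description, configuration, sequencer, file_format = parts
    return {
        "data_tier": tier,
        "dh.owner": owner,
        "dh.description": description,
        "dh.configuration": configuration,
        "dh.sequencer": sequencer,
        "file_format": file_format,
        # In the example, the dataset name is the file name without the sequencer.
        "dh.dataset": ".".join([tier, owner, description, configuration, file_format]),
    }


info = parse_file_name("sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art")
print(info["dh.dataset"])
```

For the example file this reproduces the dh.dataset value shown in the dump, sim.mu2e.cd3-beam-g4s1-dsregion.0506a.art.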
Optional metadata
- mc.generator_type String - generator (user) One of the pre-defined values: "beam", "stopped_particle", "cosmic", "mix", or "unknown"
- mc.simulation_stage Integer - MC stage (user) Which step in multi-step generation
- mc.primary_particle String - primary particle in generation (user) One of the pre-defined values: "proton", "pbar", "electron", "muon", "neutron", "mix", or "unknown"
- dh.source_file String The full file spec of the data file on disk, useful for understanding the history of the file and for identifying this file as a parent of other files.
- parents List of Strings For files derived from other specific SAM files, this contains the SAM names of the parent files
- retire_date Date When this field is filled, the file becomes permanently retired in the enstore system and may be overwritten
- dh.sha256 String An alternate checksum for the file contents. It can be computed with: sha256sum <filename>
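Where running sha256sum is inconvenient, the same digest can be computed with Python's standard hashlib module. This is a minimal sketch; the function name `sha256_of_file` and the chunk size are arbitrary choices, and whether dh.sha256 stores the lowercase hex digest in exactly this form is an assumption.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat for large data files. The result should match the first column of the sha256sum output for the same file.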
Metadata only for production records
Some production activity is stored as virtual SAM records, not of general interest.
- job.cpu int job cpu time in sec
- job.maxres int job max resident size in KB
- job.site string job grid site name
- job.node string job node
- job.disk int job disk space used, in KB
Metadata only for real data
- start_time Date Time the file was opened during data-taking
- end_time Date Time the file was closed during data-taking
The real data will require others such as run types, goodrun bits, detector configuration, etc.