SAM
Introduction
SAM (Sequential Access via Metadata) is a Fermilab product consisting of databases and servers, designed to help manage large datasets or files. The system has several parts:
- a database of metadata for each file
- a database of locations for files, usually in dCache
- servers called SAM stations which help manage files
- SAM stations providing file names and locations to a grid job
- job submission and cache management features
Currently (2017) we are only using the first three features - file metadata and locations, and SAM stations as a way to manage files. mu2e has its own database and SAM station.
A side note: a big part of SAM, which we are not using yet, is its ability to provide file locations to jobs running on the grid. Processing data driven by SAM does several things that simple lists of files can't do. It can spread out the work by delivering more files to job sections which start sooner or are moving faster. It throttles file requests to avoid overloading dCache. It will stage (copy from tape to disk) the files you request at the start of the job, not just as each file is opened. It has the potential to deliver files which are on disk while it is staging the files that are only on tape. It keeps track of which files are completed and their status, and will store the job results forever. These features help jobs run efficiently. We do not need them yet, but we will as our datasets grow. We do not plan to use the job submission and cache management features.
SAM Interface
We will interact with SAM through samweb, an http-based API for SAM, specifically the samweb client, a python module which is a lightweight, convenient layer over the http API. It has a user guide and command reference.
Users must be registered with the SAM database before they can interact with it, but this should happen automatically when mu2e accounts are created. If you get a "not registered in SAM database" error, you will need to put in a servicedesk ticket.
When you interact with SAM through samweb, you will need to be identified and authenticated through a grid cert.
For interactive work (see Authentication), set up like this:
kinit
kx509
setup mu2e
setup dhtools
"setup mu2e" sets the environment variable SAM_EXPERIMENT=mu2e, so you don't need to specify the experiment any further.
dhtools is a mu2e product that adds some convenience scripts to the SAM functionality. It sets up the sam_web_client UPS product. Note that simply reading the SAM database does not require the authentication described above, but writing anything does.
All samweb commands look like
samweb <command> <args>
There is interactive help:
samweb -h
samweb <command> -h
Selecting and Examining Files
The user defines criteria for file selection based on the Upload information. As a practical matter, you will usually get these from the person who created the data or is otherwise expert in it.
The criteria might look like the following:
"dh.dataset=sim.mu2e.example-beam-g4s1.1812a.art"
The selection based on the dataset ("dh.dataset", a metadata field which is always filled) is the most common selection format and the only one most people will use.
To select a few specific files using wildcards ("%" matches any string, like "*" on the unix command line, and "?" matches any single non-null character), or a single file by its exact name:
"file_name like ???.mu2e.example-beam-g4s1.1812a.%"
"file_name=sim.mu2e.example-beam-g4s1.1812a.16638329_000016.art"
As the number of files recorded in SAM reaches the many millions, the use of wildcards can cause poor performance and possibly overloads, and should be avoided.
To select a specific few files:
"file_name in (sim.mu2e.example-beam-g4s1.1812a.16638329_000016.art,sim.mu2e.example-beam-g4s1.1812a.16638329_000014.art)"
or select on any metadata field
"data_tier=sim and mc.generator_type=beam"
You can use these criteria to look at the files they select.
You can run these example commands; the files exist.
samweb count-files "dh.dataset=sim.mu2e.example-beam-g4s1.1812a.art"
samweb list-files "dh.dataset=sim.mu2e.example-beam-g4s1.1812a.art"
samweb list-files --summary "dh.dataset=sim.mu2e.example-beam-g4s1.1812a.art"
samweb list-files --fileinfo "dh.dataset=sim.mu2e.example-beam-g4s1.1812a.art"
(The --fileinfo columns are: file_id file_size event_count)
or see a file's complete metadata
export SAM_FILE=sim.mu2e.example-beam-g4s1.1812a.16638329_000016.art
samweb get-metadata $SAM_FILE
or see a file's location in dCache (or elsewhere)
samweb locate-file $SAM_FILE
The location string contains some metadata on the location format and media.
> samweb locate-file sim.mu2e.example-beam-g4s1.1812a.16638329_000016.art
enstore:/pnfs/mu2e/usr-sim/sim/mu2e/example-beam-g4s1/1812a/000/000(30@vpe007)
The /pnfs file path is the location in tape-backed dCache. The string "30@vpe007" means it is file 30 on tape number vpe007. In the future there may be several locations listed for a file.
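As an illustration of the format, the location string can be taken apart with plain shell parameter expansion. This is only a sketch: parse_sam_location is a hypothetical helper, not part of samweb or dhtools, and it assumes the layout seen in the example above (a media prefix before the first colon, then a path, then an optional "(file@tape)" suffix).

```shell
# Hypothetical helper (not part of samweb or dhtools).
# Assumes a location of the form  media:/path(filenumber@tapelabel)
# as in the locate-file example; the "(...)" part may be absent.
parse_sam_location() (
    loc=$1
    media=${loc%%:*}        # text before the first ":"
    rest=${loc#*:}          # text after the first ":"
    path=${rest%%"("*}      # text before the first "("
    filenum=-               # "-" marks a field that is not present
    tape=-
    case $rest in
        (*"("*@*")")
            info=${rest##*"("}      # e.g. 30@vpe007)
            info=${info%")"}        # e.g. 30@vpe007
            filenum=${info%%@*}     # part before "@"
            tape=${info#*@}         # part after "@"
            ;;
    esac
    printf '%s %s %s %s\n' "$media" "$path" "$filenum" "$tape"
)

# Example with the string shown above; prints media, path, file number, tape label.
parse_sam_location 'enstore:/pnfs/mu2e/usr-sim/sim/mu2e/example-beam-g4s1/1812a/000/000(30@vpe007)'
```

The function body runs in a subshell, so its working variables do not leak into the calling shell.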
By default, SAM will only list files with a location in the SAM database. Some files (such as fcl datasets) might exist at a standard location on disk, but the location might not be recorded in SAM. In this case, you have to ask SAM to list all files, even those with no location:
samweb count-files "dh.dataset=cnf.mu2e.cd3-beam-g4s4-flate.v0.fcl and availability: anylocation"
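Criteria strings like the ones in this section can also be assembled in a script before being handed to samweb. The sketch below uses two hypothetical helpers, sam_and and sam_file_name_in (names invented here, not samweb or dhtools commands). They only build strings, and they assume that clauses are combined with "and" and that the "file_name in (...)" list is comma-separated, as in the examples above.

```shell
# Hypothetical helpers (not samweb or dhtools commands); they only build strings.

# Join any number of criteria clauses with " and ".
sam_and() (
    out=""
    for clause in "$@"; do
        if [ -z "$out" ]; then
            out=$clause
        else
            out="$out and $clause"
        fi
    done
    printf '%s\n' "$out"
)

# Build a "file_name in (a,b,...)" clause from a list of file names.
sam_file_name_in() (
    list=""
    for name in "$@"; do
        if [ -z "$list" ]; then
            list=$name
        else
            list="$list,$name"
        fi
    done
    printf 'file_name in (%s)\n' "$list"
)

# Build the any-location criteria used in the command above.
sam_and "dh.dataset=cnf.mu2e.cd3-beam-g4s4-flate.v0.fcl" "availability: anylocation"

# Possible use (requires samweb):
#   samweb count-files "$(sam_and "data_tier=sim" "mc.generator_type=beam")"
```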
All the available metadata fields can be listed:
samweb list-parameters
samweb list-parameters <parameter>
samweb list-values --help-categories
samweb list-values <category>
To retrieve a few files for local testing, please see "samGet" in the mu2e SAM utility scripts section. Note that this method of copying a file locally is provided only for copying one or two files for interactive testing. Grid jobs must use the full SAM access method described in "Running on the Grid with jobsub, SAM, and art".
Dataset Definitions and Snapshots
The SAM way to make a list of files is through the dataset definition and snapshot mechanisms. Users do not frequently need to do this since almost all operations are applied to a whole dataset or an individual file.
The file selection criteria are declared to SAM by creating a "dataset definition" with a name chosen by the user, here saved in SAM_DD:
export SAM_DD=${USER}_test_0
samweb create-definition $SAM_DD "dh.dataset=sim.mu2e.example-beam-g4s1.1812a.art"
The first argument of create-definition is your choice for the name of the dataset definition, the second is the selection criteria. You should start your dataset definition names with your username - it helps avoid conflicting names, which must be unique within mu2e. If we find dataset definitions that do not have user names, we will delete them.
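The naming rule can be applied automatically in a script. A minimal sketch, assuming $USER holds your mu2e username; sam_defname is a hypothetical helper name, not an existing samweb or dhtools command.

```shell
# Hypothetical helper: make sure a dataset definition name starts with "<username>_".
# Assumes $USER holds the mu2e username.
sam_defname() (
    user=${USER:?USER is not set}
    case $1 in
        "")
            echo "usage: sam_defname <name>" >&2
            exit 1
            ;;
        "${user}_"*)
            printf '%s\n' "$1"              # already prefixed, keep as is
            ;;
        *)
            printf '%s_%s\n' "$user" "$1"   # add the username prefix
            ;;
    esac
)

# Possible use (requires samweb):
#   export SAM_DD=$(sam_defname test_0)
#   samweb create-definition $SAM_DD "dh.dataset=sim.mu2e.example-beam-g4s1.1812a.art"
```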
You can examine your dataset definitions:
samweb list-definitions --user=${USER}
samweb describe-definition $SAM_DD
Since a dataset definition is a set of selection criteria, you can use it to list the selected files. The following command evaluates the selection criteria to produce a list of files.
samweb list-definition-files $SAM_DD
The following command does not evaluate the dataset definition; it lists the files that pass the criteria and are in any snapshot.
samweb list-files "defname:$SAM_DD"
The number of files selected by the criteria is determined by the metadata of the files in the database when the criteria are evaluated (not when the dataset definition is declared), and can change over time. This may be the desired behaviour if a dataset is known to be growing. In practice, this is rarely an issue since most datasets, except for ongoing data-taking, do not change. If you choose to, you can lock in which files are selected, which is done with a "snapshot", a permanent, fixed file list.
export SAM_SNAP_ID=$(samweb take-snapshot $SAM_DD)
echo $SAM_SNAP_ID
The argument is the dataset definition name. The command returns a "snapshot id". You can see the fixed list of files:
samweb list-files "snapshot_id=$SAM_SNAP_ID"
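Because a definition is re-evaluated each time while a snapshot is fixed, comparing the two lists shows which files have been added since the snapshot was taken. A sketch, assuming both lists have already been saved to text files with one file name per line (for example from the list-definition-files and list-files commands above); files_not_in_snapshot is a hypothetical helper, not a samweb or dhtools command.

```shell
# Hypothetical helper: print the names in the current list ($1)
# that are absent from the snapshot list ($2). One file name per line.
files_not_in_snapshot() {
    # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
    grep -F -x -v -f "$2" "$1"
}

# Possible use (requires samweb and the variables defined above):
#   samweb list-definition-files $SAM_DD > now.txt
#   samweb list-files "snapshot_id=$SAM_SNAP_ID" > snap.txt
#   files_not_in_snapshot now.txt snap.txt
```

Note that grep exits with a non-zero status when it prints nothing, which here simply means the two lists agree.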
The special dataset definition for each dataset
There is a special dataset definition created for each dataset. This is done by a mu2epro cron job. For each unique value of the dh.dataset field, a dataset definition is created as follows:
samweb create-definition \
  sim.mu2e.example-beam-g4s1.1812a.art \
  "dh.dataset=sim.mu2e.example-beam-g4s1.1812a.art"
So there is always a dataset definition with a name exactly equal to the name of each dataset. This is a special exception to the rule of including your username in the dataset definition name, which we allow because it is so overwhelmingly convenient. Please do not attempt to create these dataset definitions, or anything like them, without your username in the dataset definition name. Overall, we expect these dataset definitions will be, almost exclusively, the way users specify data to SAM. You may never need to actually create a dataset definition of your own.
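Since the definition name equals the dataset name, a script that starts from a file name can derive the matching definition name. The sketch below assumes the pattern visible in the examples on this page, where the file name has six dot-separated fields and the dataset name is the same string without the fifth field. sam_dataset_of_file is a hypothetical helper; the dh.dataset value in the file's metadata remains the authoritative source.

```shell
# Hypothetical helper: derive the dataset name from a file name.
# Assumes six dot-separated fields, with the dataset name being the
# file name without its fifth field, as in the examples on this page.
sam_dataset_of_file() (
    set -f          # no glob expansion while splitting
    IFS=.
    set -- $1       # split the file name on "."
    if [ $# -ne 6 ]; then
        echo "expected 6 dot-separated fields" >&2
        exit 1
    fi
    printf '%s.%s.%s.%s.%s\n' "$1" "$2" "$3" "$4" "$6"
)

# Example with the file name used throughout this page.
sam_dataset_of_file sim.mu2e.example-beam-g4s1.1812a.16638329_000016.art

# Possible use (requires samweb):
#   samweb list-definition-files "$(sam_dataset_of_file $SAM_FILE)"
```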
SAM projects
We do not use SAM projects generally, but they do appear in the prestaging procedures, so we include the concept here. A user directly or indirectly can start a project on the SAM station. A project is a process that keeps track of the files used in a job. Typically this would track the files that were used as input to a grid job (but mu2e does not use this functionality yet). In the prestage procedure, it tracks the files that have been prestaged. The project starts with a list of files, usually a dataset or dataset definition. Somewhere off the station, directly or indirectly, the user starts one or many consumers of the project. The consumers ask the project for a file, the project sends the file location, and the project records the file as "consumed". (There are handshakes etc. for precise record-keeping, but that's another story.) A consumer could be a grid job executable, and in fact art has the ability to use a SAM project to get input files. In the prestage process, the project is started by SAM on a server hidden from the user.
mu2e SAM utility scripts
These scripts are put in your path with
setup dhtools
They are utility scripts that simplify some multi-step SAM operations. They are only used interactively. They all have help with "-h".
- samDatasets: Print a list of mu2e datasets in SAM.
- samGet: Find specific files in SAM and copy them to a local directory. Good for getting examples for testing.
- samNoChildren: A file may have pointers to parent files in a parent dataset. This script reports the members of the parent dataset that have no members of the child dataset pointing to them. Useful in uploading procedures.
- samOnDisk: Pick random files from a dataset and check if they are on disk. Useful to see if you need to prestage your input files.
- samOnTape: Summarize how many files of a dataset have a location on tape. Useful in uploading files.
- samToPnfs: List the full dCache (/pnfs) filespec of all files in the request. This can be useful for grid jobs which need a list of files to run on.
- samSplit: Split a large dataset into smaller parts by using dataset definitions. Useful in prestaging.
Links
- SAM
- samweb
- samweb user guide
- samweb command reference
- http interface
- Query Dimensions
- file names
- uploading instructions
- SAM listing of existing datasets
- operations links for sam monitors
- TDR sample as SAM datasets