SAM: Difference between revisions

Revision as of 22:10, 22 August 2017

Introduction

SAM (Sequential Access via Metadata) is a Fermilab product containing databases and servers, designed to help manage large datasets of files. This system contains several parts:

  1. a database of metadata for each file
  2. a database of locations for files, usually in dCache
  3. servers called SAM stations which help manage files
  4. SAM stations providing file names and locations to a grid job
  5. job submission and cache management features

Currently (2017) we are only using the first three features - file metadata and locations, and SAM stations as a way to manage files. mu2e has its own database and SAM station.

A side note: a big part of SAM, which we are not using yet, is its ability to provide file locations to jobs running on the grid. Processing data driven by SAM does several things that simple lists of files can't do. It can spread out the work by delivering more files to job sections which start sooner or are moving faster. It throttles file requests to avoid overloading dCache. It will stage (copy from tape to disk) the files you request at the start of the job, not just as each file is opened. It has the potential to deliver files which are on disk while it is staging the files that are only on tape. It keeps track of which files are completed and their status, and will store the job results forever. These features help jobs run efficiently. We do not need them yet, but we will as our datasets grow. We do not plan to use the job submission and cache management features.


SAM Interface

We will interact with SAM through samweb, an http-based API for SAM, and specifically through the samweb client, a python module which is a lightweight, convenient layer over the http API. It has a user guide and command reference.


Users must be registered with the SAM database before they can interact with it, but this should happen automatically when mu2e accounts are created. If you get a "not registered in SAM database" error, you will need to put in a servicedesk ticket. When you interact with SAM through samweb, you will need to be identified and authenticated through a grid cert.

For interactive work (see Authentication), setup like this:

kinit
kx509
setup mu2e
setup dhtools

"setup mu2e" sets the environment variable SAM_EXPERIMENT=mu2e so you don't need to specify that any further. dhtools is a mu2e product that adds some convenience scripts to the SAM functionality. It also sets up the sam_web_client UPS product. Note that to simply read the SAM database you do not need authentication as specified above, but to write anything, you do.

All samweb commands look like

samweb <command> <args>

There is interactive help:

samweb -h
samweb <command> -h
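
Before issuing samweb commands, it can help to confirm that the setup steps took effect. The following is a minimal sketch, assuming a bash-like shell; sam_env_ok is a hypothetical helper name and is not part of dhtools.

```shell
# Hypothetical helper (not part of dhtools): report whether the
# environment described above appears to be in place.
sam_env_ok() {
    local ok=0
    # "setup mu2e" is expected to set SAM_EXPERIMENT=mu2e
    if [ "${SAM_EXPERIMENT:-}" != "mu2e" ]; then
        echo "SAM_EXPERIMENT is not set to mu2e; try 'setup mu2e'" >&2
        ok=1
    fi
    # "setup dhtools" is expected to make samweb available
    if ! command -v samweb >/dev/null 2>&1; then
        echo "samweb not found; try 'setup dhtools'" >&2
        ok=1
    fi
    return "$ok"
}
```

Calling "sam_env_ok && samweb -h" then runs the help command only when both checks pass.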

Selecting and Examining Files

The user defines criteria for file selection based on the Upload information. As a practical matter, you will usually get these from the person who created the data or is otherwise expert in it.

The criteria might look like the following.

"dh.dataset=sim.mu2e.example-beam-g4s1.1812a.art"

The selection based on the dataset ("dh.dataset", a metadata field which is always filled) is the most common selection format and the only one most people will use.

To select a few specific files using wildcards ("%" matches any string, like "*" on the unix command line, and "?" matches any single character):

"file_name like ???.mu2e.example-beam-g4s1.1812a.%"
"file_name=sim.mu2e.example-beam-g4s1.1812a.16638329_000016.art"

As the number of files recorded in SAM reaches the many millions, the use of wildcards can cause poor performance and possibly overloads, and should be avoided.
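
One way to avoid a wildcard is to select by dataset instead. In the examples on this page, a file name has six dot-separated fields and the dataset name is the same string with the fifth field removed. The sketch below relies on that observed pattern, which is an assumption drawn from these examples rather than a stated rule; sam_dataset_of is a hypothetical helper name, and a bash-like shell is assumed.

```shell
# Hypothetical helper: derive the dataset name from a file name,
# assuming six dot-separated fields with the fifth one being the
# per-file part (as in the examples on this page).
sam_dataset_of() {
    local IFS=.
    # Unquoted on purpose: split the name on dots into $1..$N
    set -- $1
    if [ "$#" -ne 6 ]; then
        echo "expected 6 dot-separated fields, got $#" >&2
        return 1
    fi
    echo "$1.$2.$3.$4.$6"
}
```

The result can be used directly in a criteria string, for example "dh.dataset=$(sam_dataset_of $SAM_FILE)".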

To select a specific few files:

"file_name in (sim.mu2e.example-beam-g4s1.1812a.16638329_000016.art,sim.mu2e.example-beam-g4s1.1812a.16638329_000014.art)"

or select on any metadata field

"data_tier=sim and mc.generator_type=beam"


You can use these criteria to look at the files they select. You can execute these example commands; the files exist.

samweb count-files "dh.dataset=sim.mu2e.example-beam-g4s1.1812a.art"
samweb list-files "dh.dataset=sim.mu2e.example-beam-g4s1.1812a.art"
samweb list-files --summary "dh.dataset=sim.mu2e.example-beam-g4s1.1812a.art"
samweb list-files --fileinfo "dh.dataset=sim.mu2e.example-beam-g4s1.1812a.art"
    (fileInfo columns are: file_id file_size event_count)
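
When only a handful of named files are of interest, the "file_name in (...)" criteria shown earlier can be assembled from a list of names rather than typed by hand. This is a minimal sketch, assuming a bash-like shell; sam_in_criteria is a hypothetical helper name.

```shell
# Hypothetical helper: build a "file_name in (a,b,...)" criteria
# string from the file names given as arguments.
sam_in_criteria() {
    local list="" f
    for f in "$@"; do
        # Append with a comma separator, except before the first name
        list="${list:+$list,}$f"
    done
    echo "file_name in ($list)"
}
```

The output can be passed to any of the commands above, for example samweb list-files --fileinfo "$(sam_in_criteria file1 file2)".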

To see a file's complete metadata:

export SAM_FILE=sim.mu2e.example-beam-g4s1.1812a.16638329_000016.art
samweb get-metadata $SAM_FILE

To see a file's location in dCache (or elsewhere):

samweb locate-file $SAM_FILE

The location string contains some metadata on the location format and media.

> samweb locate-file sim.mu2e.example-beam-g4s1.1812a.16638329_000016.art
enstore:/pnfs/mu2e/usr-sim/sim/mu2e/example-beam-g4s1/1812a/000/000(30@vpe007)

The /pnfs file path is the location in tape-backed dCache. The string "30@vpe007" means it is file 30 on tape number vpe007. In the future there may be several locations listed for a file.
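
The pieces of a location string can be separated with shell parameter expansion. The sketch below assumes the location has exactly the shape shown in the example above, a prefix ending in ":", a path, then "(file@tape)"; other location formats are not handled. The function names are hypothetical.

```shell
# Hypothetical helpers for a location string of the form
#   enstore:/pnfs/...(30@vpe007)

# Path: drop everything up to the first ":" and the "(...)" suffix.
sam_loc_path() {
    local s
    s=${1#*:}
    echo "${s%%\(*}"
}

# File position on the tape: the text between "(" and "@".
sam_loc_filenum() {
    local s
    s=${1##*\(}
    echo "${s%%@*}"
}

# Tape label: the text between "@" and the closing ")".
sam_loc_tape() {
    local s
    s=${1##*@}
    echo "${s%\)}"
}
```

For the example above, these return the /pnfs path, 30, and vpe007 respectively.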

By default, SAM will only list files with a location in the SAM database. Some files (such as fcl datasets) might exist at a standard location on disk, but the location might not be recorded in SAM. In this case, you have to ask SAM to list all files, even those with no location:

samweb count-files "dh.dataset=cnf.mu2e.cd3-beam-g4s4-flate.v0.fcl and availability: anylocation"
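
To see how many files in a dataset have no recorded location, the two counts can be compared. This is a sketch only, assuming a bash-like shell and that count-files prints a single integer; sam_count_unlocated is a hypothetical helper name, and the criteria string is copied from the command above.

```shell
# Hypothetical helper: number of files in a dataset that have no
# location recorded in SAM (all files minus located files).
sam_count_unlocated() {
    local dataset="$1" all located
    all=$(samweb count-files "dh.dataset=$dataset and availability: anylocation") || return 1
    located=$(samweb count-files "dh.dataset=$dataset") || return 1
    echo $((all - located))
}
```

A result of 0 means every file in the dataset has a location in the SAM database.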

All the available metadata fields can be listed:

samweb list-parameters
samweb list-parameters <parameter>
samweb list-values --help-categories
samweb list-values <category>

To retrieve a few files for local testing, please see "samGet" in the [#utility SAM utility scripts]. Note that this method of copying a file locally is provided only for copying one or two files for interactive testing. Grid jobs must use the full SAM access method described in [#jobsub Running on the Grid with jobsub, SAM, and art].

Dataset Definitions and Snapshots

The SAM way to make a list of files is through the dataset definition and snapshot mechanisms. Users do not frequently need to do this since almost all operations are applied to a whole dataset or an individual file.

The file selection criteria are declared to SAM by creating a "dataset definition" with a name chosen by the user, here saved in SAM_DD:

export SAM_DD=${USER}_test_0
samweb create-definition $SAM_DD "dh.dataset=sim.mu2e.example-beam-g4s1.1812a.art"

The first argument of create-definition is your choice for the name of the dataset definition, the second is the selection criteria. You should start your dataset definition names with your username - it helps avoid conflicting names, which must be unique within mu2e. If we find dataset definitions that do not have user names, we will delete them.
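
The naming rule can be enforced before anything is sent to SAM. The following is a minimal sketch, assuming a bash-like shell; sam_create_user_definition is a hypothetical wrapper name, and it only checks that the name begins with $USER.

```shell
# Hypothetical wrapper: refuse to create a dataset definition whose
# name does not start with the current username.
sam_create_user_definition() {
    local name="$1" criteria="$2"
    if [ -z "${USER:-}" ]; then
        echo "USER is not set" >&2
        return 1
    fi
    case "$name" in
        "${USER}"*) ;;
        *)
            echo "definition name '$name' must start with '${USER}'" >&2
            return 1
            ;;
    esac
    samweb create-definition "$name" "$criteria"
}
```

It takes the same two arguments as create-definition: the definition name and the selection criteria.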

You can examine your dataset definitions:

samweb list-definitions --user=${USER}
samweb describe-definition $SAM_DD


Since a dataset definition is a set of selection criteria, you can use it to list the selected files. The following command evaluates the selection criteria to produce a list of files.

samweb list-definition-files $SAM_DD

The following command does not evaluate the dataset definition; it lists the files that pass the criteria and are in any snapshot.

samweb list-files "defname:$SAM_DD"

The number of files selected by the criteria is determined by the metadata of the files in the database when the criteria is evaluated (not when the dataset definition is declared), and can change over time. This may be the desired behaviour, if a dataset is known to be growing. In practice, this is rarely an issue since most datasets, except for ongoing data-taking, do not change. If you choose to, you can lock in which files are selected, which is done with a "snapshot," a permanent, fixed file list.

export SAM_SNAP_ID=`samweb take-snapshot $SAM_DD`
echo $SAM_SNAP_ID

The argument is the dataset definition name. This will return a "snapshot id." You can see the fixed list of files.

samweb list-files "snapshot_id=$SAM_SNAP_ID"
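
To check whether a definition has grown since a snapshot was taken, the current file list can be compared with the snapshot's list. This is a sketch, assuming a bash-like shell with mktemp, sort, and comm available, and that both commands print one file name per line; sam_new_since_snapshot is a hypothetical helper name.

```shell
# Hypothetical helper: print files currently selected by a dataset
# definition that are not in the given snapshot.
sam_new_since_snapshot() {
    local defname="$1" snap_id="$2" tmp
    tmp=$(mktemp -d) || return 1
    # comm needs sorted input
    samweb list-definition-files "$defname" | sort > "$tmp/now"
    samweb list-files "snapshot_id=$snap_id" | sort > "$tmp/snap"
    # -23: keep only lines unique to the first file (the current list)
    comm -23 "$tmp/now" "$tmp/snap"
    rm -rf "$tmp"
}
```

For example, sam_new_since_snapshot $SAM_DD $SAM_SNAP_ID prints nothing if the definition still selects exactly the snapshot's files or fewer.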


The special dataset definition for each dataset

There is a special dataset definition created for each dataset. This is done by a mu2epro cron job. For each unique value of the dh.dataset field, a dataset definition is created like this:

samweb create-definition \
  sim.mu2e.example-beam-g4s1.1812a.art \
  "dh.dataset=sim.mu2e.example-beam-g4s1.1812a.art"

So there is always a dataset definition with a name exactly equal to the name of each dataset. This is a special exception to the rule of including your username in the dataset definition name, which we allow because it is so overwhelmingly convenient. Please do not attempt to create these dataset definitions, or anything like them, without your username in the dataset definition name. Overall, we expect these dataset definitions will be, almost exclusively, the way users specify data to SAM. You may never need to actually create a dataset definition of your own.

SAM projects

We do not use SAM projects generally, but they do appear in the prestaging procedures, so we include the concept here. A user, directly or indirectly, can start a project on the SAM station. A project is a process that keeps track of the files used in a job. Typically this would track the files that were used as input to a grid job (but mu2e does not use this functionality yet). In the prestage procedure, it tracks the files that have been prestaged. The project starts with a list of files, usually a dataset or dataset definition. Somewhere off the station, directly or indirectly, the user starts one or many consumers of the project. Each consumer asks the project for a file, the project sends the file location, and the project records the file as "consumed". (There are handshakes etc. for precise record-keeping, but that's another story.) A consumer could be a grid job executable, and in fact art has the ability to use a SAM project to get input files. In the prestage process, the project is started by SAM on a server hidden from the user.

mu2e SAM utility scripts

These scripts are put in your path with

setup dhtools

They are some utility scripts that simplify some multi-step SAM operations. They are only used interactively. They all have help with "-h".

  • samDatasets: Print a list of mu2e datasets in SAM.
  • samGet: Find specific files in SAM and copy them to a local directory. Good for getting examples for testing.
  • samNoChildren: A file may have pointers to parent files in a parent dataset. This script reports the members of the parent dataset that have no members of the child dataset pointing to them. Useful in uploading procedures.
  • samOnDisk: Pick random files from a dataset and check if they are on disk. Useful to see if you need to prestage your input files.
  • samOnTape: Summarize how many files of a dataset have a location on tape. Useful in uploading files.
  • samToPnfs: List the full dCache (/pnfs) filespec of all files in the request. This can be useful for grid jobs which need a list of files to run on.
  • samSplit: Split a large dataset into smaller parts by using dataset definitions. Useful in prestaging.

Links