FileTools: Difference between revisions

From Mu2eWiki

Revision as of 03:14, 12 November 2020

mu2etools

Tools to help setup fcl for a grid project. Setup with

setup mu2e
source an offline setup.sh script
setup mu2etools

generate_fcl

A project may consist of several grid submissions, and each of those submissions may have many jobs. Each job will typically have a unique fcl file to drive it. For simulation, this fcl will include a unique random number seed, run numbers, and output file names. generate_fcl takes a template fcl file, adds these unique parts, and writes out the complete set of fcl files for the project.
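To illustrate the idea only, here is a minimal Python sketch of "template plus per-job unique parts". It is not the generate_fcl implementation. The fcl parameter names (source.firstRun, source.firstSubRun, services.SeedService.baseSeed, outputs.out.fileName), the helper names, and the dataset prefix are all assumptions chosen for the example.

```python
import random


def unique_seeds(n_jobs: int, project_seed: int) -> list:
    """Draw n_jobs distinct seeds, reproducibly, from one project-level seed."""
    rng = random.Random(project_seed)
    # sample() never repeats an element, so every job gets a different seed
    return rng.sample(range(1, 2**31), n_jobs)


def job_fcl(template: str, run_number: int, subrun: int, seed: int,
            dataset_prefix: str) -> str:
    """Build the text of one job's fcl file.

    All parameter names written below are hypothetical placeholders.
    """
    out_name = f"{dataset_prefix}.{run_number:06d}_{subrun:08d}.art"
    lines = [
        f'#include "{template}"',
        f"source.firstRun: {run_number}",
        f"source.firstSubRun: {subrun}",
        f"services.SeedService.baseSeed: {seed}",
        f'outputs.out.fileName: "{out_name}"',
    ]
    return "\n".join(lines) + "\n"


def generate(template: str, run_number: int, n_jobs: int, project_seed: int,
             dataset_prefix: str) -> list:
    """Return one fcl text per job; each has its own subrun, seed and output name."""
    seeds = unique_seeds(n_jobs, project_seed)
    return [job_fcl(template, run_number, i, seeds[i], dataset_prefix)
            for i in range(n_jobs)]


if __name__ == "__main__":
    for text in generate("template.fcl", 1000, 2, 42, "sim.user.example.v0"):
        print(text)
```

In practice, run generate_fcl itself (see its help) rather than anything like this sketch.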

mu2efiletools

Tools to deal with files during grid operations, including methods to list, check, count, move, and upload all the files of a dataset. Some of these operations can be done with unix commands like find, but these scripts are recommended because they incorporate the most efficient methods, which are not always obvious. They all have "-h" help.

Official production datasets, and user datasets manipulated by the file tools, will appear under the following designated dataset areas, one for each dCache flavor (scratch, persistent, and tape):

  • /pnfs/mu2e/scratch/datasets
  • /pnfs/mu2e/persistent/datasets
  • /pnfs/mu2e/tape

Setup with

setup mu2e
setup mu2efiletools

mu2eClusterFileList

Given a dataset name and directories that are the output of clusters in a simulation project, find all the files in those directories that belong to the dataset, and print their relative path names. (Note that this is more efficient than unix "find", which runs stat() on each file and overloads the dCache database.)

mu2eClusterCheckAndMove

Given a directory that is the output of one cluster in a simulation project, go through each job directory and check for success. Move each job's files to a "good" or "failed" directory based on the results. Checks:

  • the logfile is present
  • zero exit code from the art process
  • correct transfer of all job outputs per the manifest (sha256sum check)
  • identify duplicate jobs, both within the cluster and globally
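The sha256sum part of these checks can be sketched in Python as follows. This is an illustration and not the mu2eClusterCheckAndMove code. The manifest format is not described here, so the sketch assumes the expected digests have already been read into a dictionary mapping file path to hex digest. The function names are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def failed_transfers(expected: dict) -> list:
    """Return the paths that are missing or whose digest does not match.

    `expected` maps file path -> expected hex digest (assumed to come
    from the job manifest; the parsing step is not shown).
    """
    bad = []
    for path, want in expected.items():
        try:
            got = sha256_of_file(path)
        except OSError:
            bad.append(path)  # missing or unreadable output file
            continue
        if got != want:
            bad.append(path)  # content differs from what the job wrote
    return bad


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        p = os.path.join(d, "example.art")
        with open(p, "wb") as f:
            f.write(b"payload")
        manifest = {p: hashlib.sha256(b"payload").hexdigest()}
        print(failed_transfers(manifest))  # empty list: the file matches
```

A job would be treated as good only if this list is empty and the other checks also pass.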

mu2eDatasetList

Print lists of datasets that are known to SAM - all or optionally with restrictions (see -h).

mu2eDatasetFileList

List the files in a SAM dataset and print either just the filenames, or the full file paths. The output is based on the SAM file listing and the automatic path name algorithm.

mu2eDatasetLocation

Given a SAM dataset name, loop over the files that do not have locations. If there is a file in the standard location and the CRC matches, then add that location to the SAM record.

mu2eDatasetDelete

Delete SAM records or physical disk files for a dataset.

mu2eFileDeclare

Reads names of json files from stdin, and uses each to declare a new file to SAM. If you see errors while declaring files, check that you have a valid certificate.

mu2eFileMoveToTape

Move a file to tape, declare its SAM record, and wait for the file to be confirmed on tape (may take 24h to complete).

mu2eFileUpload

Reads file names from stdin and tries to move each file to its standard location in dCache. Command options determine which flavor of dCache (scratch, persistent, or tape) is used.

mu2eFileUpload also assigns pnfs paths to files. The algorithm, described by Andrei in [https://mu2e-hnews.fnal.gov/HyperNews/Mu2e/get/Sim/941/3/2/1.html], is as follows:

The authoritative code that defines file paths is in the mu2efilename package. Try

perldoc /cvmfs/mu2e.opensciencegrid.org/artexternals/mu2efilename/v3_4/perllib/Mu2eFilename.pm

(also perldoc on other files in the same directory).  That still does
not document the exact algorithm, which is an implementation detail as
long as all file name manipulation is done via the mu2efilename package.

The spreader is obtained by computing a SHA-256 digest of the base
file name, and using the four initial characters in its hex
representation.  For example:

name = sim.mu2e.cd3-detmix-cut.1109a.000001_00001162.art
digest = aaa96d7c4800bdcc483aeb3b60e4393cb76a2570c8b4f1d6aa4967c04132d804
spreader = aa/a9
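The spreader rule described above can be sketched in Python. This is an illustration, not the Mu2eFilename.pm code, which remains authoritative. The function names are hypothetical, and the sketch assumes the base file name is hashed as plain ASCII bytes with no trailing newline.

```python
import hashlib


def spreader_from_digest(hex_digest: str) -> str:
    """Turn the first four hex characters of a digest into two directory levels."""
    return f"{hex_digest[:2]}/{hex_digest[2:4]}"


def spreader(file_name: str) -> str:
    """Hash the base file name with SHA-256 and derive its spreader.

    Assumption: the name is encoded as ASCII, without a trailing newline.
    """
    digest = hashlib.sha256(file_name.encode("ascii")).hexdigest()
    return spreader_from_digest(digest)


if __name__ == "__main__":
    # Digest quoted in the example above
    quoted = "aaa96d7c4800bdcc483aeb3b60e4393cb76a2570c8b4f1d6aa4967c04132d804"
    print(spreader_from_digest(quoted))  # aa/a9
    # Computed from the name itself, under the encoding assumption stated above
    print(spreader("sim.mu2e.cd3-detmix-cut.1109a.000001_00001162.art"))
```

Because the spreader depends only on the file name, any tool can recompute a file's location without a lookup, and files of one dataset are spread over many subdirectories.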

mu2eMissingJobs

Query the SAM database and print the fcl file names for which a set of mu2eprodsys jobs, submitted with the given --dsconf and --dsowner, did not complete successfully, as defined by mu2eClusterCheckAndMove.

mu2eClusterArchive

A script to archive log (and possibly other) files from a mu2eprodsys cluster to tape and register the archive in SAM.

File path tools

These tools, listed here

  • mu2eabsname_tape
  • mu2eabsname_disk
  • mu2eabsname_scratch

can be given a SAM (six-dot-field) file name, and will return the full path for this file in the respective dCache areas. The subdirectories in the path are all derived from the file name, so the path is unique. Files may be stored anywhere temporarily, but when they go to their permanent (or semi-permanent for scratch) location they should go here.

They read file names only from stdin:

 > ls sim.mu2e.example-beam-g4s1.1812a.16638329_000016.art | mu2eabsname_scratch
/pnfs/mu2e/scratch/datasets/phy-sim/sim/mu2e/example-beam-g4s1/1812a/art/f8/29/sim.mu2e.example-beam-g4s1.1812a.16638329_000016.art
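The layout visible in this output can be mimicked with a short Python sketch. It is a reading of the example path, not the code of the mu2eabsname_* tools, and the mu2efilename package remains authoritative. The function names are hypothetical. The file family ("phy-sim" above) is passed in as an argument because the rule that selects it is not described here, and the spreader is computed assuming the name is hashed as ASCII with no trailing newline.

```python
import hashlib

FIELD_NAMES = ("tier", "owner", "description",
               "configuration", "sequencer", "format")


def parse_sam_name(file_name: str) -> dict:
    """Split a six-dot-field file name into its named fields."""
    fields = file_name.split(".")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected 6 dot-separated fields: {file_name}")
    return dict(zip(FIELD_NAMES, fields))


def dataset_path(file_name: str, area_root: str, file_family: str) -> str:
    """Build area_root/family/tier/owner/description/configuration/format/xx/yy/name."""
    f = parse_sam_name(file_name)
    digest = hashlib.sha256(file_name.encode("ascii")).hexdigest()
    spreader = f"{digest[:2]}/{digest[2:4]}"  # first four hex characters
    return "/".join([area_root, file_family, f["tier"], f["owner"],
                     f["description"], f["configuration"], f["format"],
                     spreader, file_name])


if __name__ == "__main__":
    print(dataset_path("sim.mu2e.example-beam-g4s1.1812a.16638329_000016.art",
                       "/pnfs/mu2e/scratch/datasets", "phy-sim"))
```

For real work, pipe the names through the mu2eabsname_* tools so that the paths always agree with the official algorithm.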

jsonMaker

As of 7/2020, jsonMaker is deprecated and has been replaced by printJson.


All files to be uploaded to tape need to have a SAM file record. (Some other semi-permanent files in other locations may also have SAM records.) We create a SAM record by supplying a json file (which looks a lot like a Python dictionary or a fcl table) that contains keyword/value pairs. We include the keyword/value pairs for the file metadata that we want to supply for the file's SAM file record.

These json files could be written (or edited) by hand, but it is far easier to run jsonMaker on the file. This Python script is put in your path with the dhtools product:

setup dhtools

Running jsonMaker will produce all the mundane metadata like file size. For art files, it will run a fast art executable over the file to extract information like the number of events in the file. This means a version of offline must be set up to run jsonMaker. The code checks that certain required fields are present and that other rules are followed, checks the metadata for consistency, and writes the json in a known correct format.

jsonMaker has a help ("-h") option to show the optional switches. Many of these switches concern moving or copying files to the upload area; that functionality is obsolete. Please see the upload examples for how to use everything.

Here is an example of the json file output. Please do not use this for upload; let jsonMaker do the right thing.

{
    "dh.description": "cd3-beam-g4s1-dsregion", 
    "file_type": "mc", 
    "file_name": "sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art", 
    "dh.first_subrun": 5, 
    "file_size": 4290572, 
    "file_format": "art", 
    "dh.first_run_event": 1002, 
    "dh.last_event": 10000, 
    "dh.last_subrun": 5, 
    "dh.last_run_event": 1002, 
    "dh.last_run_subrun": 1002, 
    "dh.first_run_subrun": 1002, 
    "data_tier": "sim", 
    "dh.first_event": 5, 
    "dh.source_file": "/pnfs/mu2e/phy-sim/sim/mu2e/cd3-beam-g4s1-dsregion/0506a/001/307/sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art", 
    "runs": [
        [
            1002, 
            5, 
            "mc"
        ]
    ], 
    "dh.configuration": "0506a", 
    "event_count": 3018, 
    "dh.owner": "mu2e", 
    "content_status": "good", 
    "dh.dataset": "sim.mu2e.cd3-beam-g4s1-dsregion.0506a.art", 
    "dh.sha256": "e3b5b426ce6c6d4dd2b9fcf2bccb4663205235d3e3fb6011a8dc49ef2ff66dbb", 
    "dh.sequencer": "001002_00000005"
}
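Several fields in the record above repeat information that is already in the file name. As an illustration of a consistency check, here is a Python sketch that compares them. It is not the jsonMaker or printJson code and covers only the relations visible in this example. The function name is hypothetical.

```python
import json


def check_record(record: dict) -> list:
    """Return a list of inconsistencies between file_name and the other fields."""
    fields = record["file_name"].split(".")
    if len(fields) != 6:
        return ["file_name does not have six dot-separated fields"]
    tier, owner, description, configuration, sequencer, fmt = fields
    expected = {
        "data_tier": tier,
        "dh.owner": owner,
        "dh.description": description,
        "dh.configuration": configuration,
        "dh.sequencer": sequencer,
        "file_format": fmt,
        # the dataset name is the file name without its sequencer field
        "dh.dataset": ".".join([tier, owner, description, configuration, fmt]),
    }
    problems = []
    for key, want in expected.items():
        got = record.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, found {got!r}")
    return problems


if __name__ == "__main__":
    # A subset of the example record above
    text = """{
        "file_name": "sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art",
        "data_tier": "sim",
        "dh.owner": "mu2e",
        "dh.description": "cd3-beam-g4s1-dsregion",
        "dh.configuration": "0506a",
        "dh.sequencer": "001002_00000005",
        "file_format": "art",
        "dh.dataset": "sim.mu2e.cd3-beam-g4s1-dsregion.0506a.art"
    }"""
    print(check_record(json.loads(text)))  # empty list: the fields agree
```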