FileTools: Difference between revisions
Revision as of 03:30, 12 November 2020
mu2etools
Tools to help set up fcl for a grid project. Set up with

setup mu2e
source an offline setup.sh script
setup mu2etools
generate_fcl
A project may consist of several grid submissions, and each of those submissions may have many jobs. Each of these jobs will typically have a unique fcl file to drive it. For simulation, this fcl will include a unique random number seed, run numbers, and output file names. generate_fcl will take a template fcl file, add these unique parts, and write out the complete set of fcl files for the project.
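To illustrate the idea, here is a minimal sketch of what generate_fcl does conceptually: take template fcl text and append the per-job unique parts. This is not the actual mu2etools script; the function name, seed scheme, and fcl parameter names below are illustrative assumptions.

```python
# Illustrative sketch only: generate_fcl is a mu2etools script with its own
# options; the naming scheme and parameter names here are hypothetical.

def make_job_fcl(template: str, job_index: int, dsconf: str = "1812a") -> str:
    """Append the per-job unique lines (seed, run numbers, output name)
    that a generate_fcl-like tool would add to a template."""
    run = 1
    subrun = job_index
    seed = 1000000 + job_index  # hypothetical per-job seed scheme
    out = f"sim.mu2e.example.{dsconf}.{run:06d}_{subrun:08d}.art"
    unique = (
        f"services.SeedService.baseSeed: {seed}\n"
        f"source.firstRun: {run}\n"
        f"source.firstSubRun: {subrun}\n"
        f'outputs.out.fileName: "{out}"\n'
    )
    return template + unique

# One fcl file per job, each with unique seed/run/output:
fcl_files = [make_job_fcl('#include "template.fcl"\n', i) for i in range(3)]
```

Each job then runs with its own fcl file, so results are reproducible and output names never collide.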
mu2efiletools
Tools to deal with files during grid operations, including methods to list, check, count, move, and upload all the files of a dataset. Some of these operations can be done with unix commands like find, but these scripts are recommended because they incorporate the most efficient methods, which are not always obvious. They all have "-h" help.
Official production datasets, and user datasets manipulated by the file tools, will appear under the following designated dataset areas, corresponding to the dCache flavors (scratch, persistent, tape):
- /pnfs/mu2e/scratch/datasets
- /pnfs/mu2e/persistent/datasets
- /pnfs/mu2e/tape
Setup with
setup mu2e
setup mu2efiletools
mu2eClusterFileList
Given a dataset name and directories that are the output of clusters in a simulation project, find all the files in that directory that belong to the dataset, and print their relative path names. (Note that this is more efficient than unix "find", which runs stat() on each file and overloads the dCache database.)
mu2eClusterCheckAndMove
Given a directory that is the output of one cluster in a simulation project, sort through each job directory and check for success. Move each job's files to a "good" or "failed" directory based on the results. Checks:
- the logfile is present
- zero exit code from the art process
- correct transfer of all job outputs per the manifest (sha256sum check)
- identify duplicate jobs, both within the cluster and globally
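The manifest check recomputes each output file's checksum and compares it to the recorded value. A hedged sketch of that idea follows; the manifest format assumed here (`<sha256>  <name>` per line, as sha256sum writes it) is an assumption, not necessarily what mu2eClusterCheckAndMove actually parses.

```python
# Sketch of a sha256 manifest check, not mu2eClusterCheckAndMove itself.
# Assumed manifest format: one "<sha256>  <relative name>" pair per line.
import hashlib
from pathlib import Path

def verify_manifest(manifest_path: Path) -> bool:
    """Return True iff every file listed in the manifest matches its checksum."""
    for line in manifest_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        data = (manifest_path.parent / name).read_bytes()
        if hashlib.sha256(data).hexdigest() != expected:
            return False
    return True
```

A job whose manifest fails this check would be moved to the "failed" directory rather than "good".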
mu2eDatasetList
Print lists of datasets that are known to SAM - all or optionally with restrictions (see -h).
mu2eDatasetFileList
List the files in a SAM dataset and print either just the filenames, or the full file paths. The output is based on the SAM file listing and the automatic path name algorithm.
mu2eDatasetLocation
Given a SAM dataset name, loop over the files that do not have locations. If there is a file in the standard location and the CRC matches, add that location to the SAM record.
mu2eDatasetDelete
Delete SAM records or physical disk files for a dataset.
mu2eFileDeclare
Reads names of json files from stdin, and uses each to declare a new file to SAM. If you see errors while declaring files, check that you have a valid certificate.
mu2eFileMoveToTape
Move a file to tape, declare its SAM record, and wait for the file to be confirmed on tape (may take 24h to complete).
mu2eFileUpload
Will read file names on stdin and try to move each file to its standard location in dCache. Command options will determine which flavor of dCache (scratch, persistent, or tape).
mu2eFileUpload also assigns pnfs paths to files. The algorithm, described by Andrei in [https://mu2e-hnews.fnal.gov/HyperNews/Mu2e/get/Sim/941/3/2/1.html], is as follows.
The authoritative code that defines file paths is in the mu2efilename package. Try perldoc /cvmfs/mu2e.opensciencegrid.org/artexternals/mu2efilename/v3_4/perllib/Mu2eFilename.pm (also perldoc on other files in the same directory). That still does not document the exact algorithm, which is an implementation detail as long as all file name manipulation is done via the mu2efilename package. The spreader is obtained by computing a SHA-256 digest of the base file name and using the first four characters of its hex representation. For example:

name     = sim.mu2e.cd3-detmix-cut.1109a.000001_00001162.art
digest   = aaa96d7c4800bdcc483aeb3b60e4393cb76a2570c8b4f1d6aa4967c04132d804
spreader = aa/a9
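The spreader rule above can be sketched with hashlib. The authoritative implementation is the mu2efilename perl package; this just mirrors the documented rule (first four hex characters of the SHA-256 of the base name, split as "xx/yy").

```python
# Sketch of the documented spreader rule; mu2efilename is authoritative.
import hashlib

def spreader(basename: str) -> str:
    """Return the two-level spreader directory derived from a base file name."""
    digest = hashlib.sha256(basename.encode()).hexdigest()
    return f"{digest[:2]}/{digest[2:4]}"

print(spreader("sim.mu2e.cd3-detmix-cut.1109a.000001_00001162.art"))  # aa/a9, per the example above
```

Spreader directories exist so that a large dataset's files fan out over many subdirectories instead of overloading one.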
mu2eMissingJobs
Query the SAM database and print the fcl file names for which a set of mu2eprodsys jobs submitted with the given --dsconf and --dsowner did not complete successfully, as defined by mu2eClusterCheckAndMove.
mu2eClusterArchive
A script to archive log (and possibly other) files from a mu2eprodsys cluster to tape and register the archive in SAM.
printJson
All files to be uploaded to tape need to have a SAM file record. We create a SAM record by supplying a json file that contains keyword/value pairs. We include pairs for the file metadata that we want to supply for the file's SAM file record.
Running printJson will produce all the mundane metadata like file size. For art files, it will run a fast art executable over the file to extract information like the number of events in the file. This means a version of offline must be set up to run printJson.
SAM has parent-child links between files. printJson cannot divine the parents of a file, so you must supply the list of parents in a text file, or tell printJson there are no parents; see "-h".
If you are running standard production scripts in mu2egrid, printJson is embedded in the scripts, so you do not have to run it yourself.
Here is an example json file output.
{
  "dh.description": "cd3-beam-g4s1-dsregion",
  "file_type": "mc",
  "file_name": "sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art",
  "dh.first_subrun": 5,
  "file_size": 4290572,
  "file_format": "art",
  "dh.first_run_event": 1002,
  "dh.last_event": 10000,
  "dh.last_subrun": 5,
  "dh.last_run_event": 1002,
  "dh.last_run_subrun": 1002,
  "dh.first_run_subrun": 1002,
  "data_tier": "sim",
  "dh.first_event": 5,
  "dh.source_file": "/pnfs/mu2e/phy-sim/sim/mu2e/cd3-beam-g4s1-dsregion/0506a/001/307/sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art",
  "runs": [
    [
      1002,
      5,
      "mc"
    ]
  ],
  "dh.configuration": "0506a",
  "event_count": 3018,
  "dh.owner": "mu2e",
  "content_status": "good",
  "dh.dataset": "sim.mu2e.cd3-beam-g4s1-dsregion.0506a.art",
  "dh.sha256": "e3b5b426ce6c6d4dd2b9fcf2bccb4663205235d3e3fb6011a8dc49ef2ff66dbb",
  "dh.sequencer": "001002_00000005"
}
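The "mundane" fields in such a record (size, checksum, name) need no art executable and can be computed directly. A hedged sketch follows; field names are taken from the example record, but this is not printJson itself, which also runs art over the file to obtain event counts and run/subrun ranges.

```python
# Sketch of computing the non-art metadata fields; printJson additionally
# runs a fast art executable to fill event_count, run/subrun fields, etc.
import hashlib
import json
from pathlib import Path

def mundane_metadata(path: Path) -> dict:
    """Compute the metadata fields that need no art executable."""
    data = path.read_bytes()
    return {
        "file_name": path.name,
        "file_size": len(data),
        "dh.sha256": hashlib.sha256(data).hexdigest(),
    }

# Example: json.dumps(mundane_metadata(Path("some.file.art")), indent=2)
```

The resulting dictionary would be merged with the art-derived fields and written as the file's .json companion.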
File path tools
These tools, listed here
- mu2eabsname_tape
- mu2eabsname_disk
- mu2eabsname_scratch
can be given a SAM (six-dot-field) file name and will return the full path for this file in the respective dCache area. The subdirectories in the path are all derived from the file name, and the resulting path is unique. Files may be stored anywhere temporarily, but when they go to their permanent (or semi-permanent, for scratch) location, they should go here.
They only read stdin:

> ls sim.mu2e.example-beam-g4s1.1812a.16638329_000016.art | mu2eabsname_scratch
/pnfs/mu2e/scratch/datasets/phy-sim/sim/mu2e/example-beam-g4s1/1812a/art/f8/29/sim.mu2e.example-beam-g4s1.1812a.16638329_000016.art
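The six-dot-field name itself decomposes as tier.owner.description.configuration.sequencer.format, and those fields (plus the spreader) make up the path shown above. A sketch of the field split follows; the full path layout (the datasets root and the "phy-sim" group) is an implementation detail of mu2efilename and is not reproduced here.

```python
# Sketch of splitting a six-dot-field SAM file name into its fields.
# The authoritative parser is the mu2efilename package.

def parse_samname(name: str) -> dict:
    """Split a six-dot-field SAM file name into its named fields."""
    tier, owner, description, configuration, sequencer, fmt = name.split(".")
    return {
        "data_tier": tier,
        "owner": owner,
        "description": description,
        "configuration": configuration,
        "sequencer": sequencer,
        "format": fmt,
    }

fields = parse_samname("sim.mu2e.example-beam-g4s1.1812a.16638329_000016.art")
```

Note how each path component in the example output above (sim, mu2e, example-beam-g4s1, 1812a, art) is one of these fields.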
jsonMaker
As of 7/2020, jsonMaker is deprecated; it has been replaced by printJson.
All files to be uploaded to tape need to have a SAM file record.
(Some other semi-permanent files in other locations may also have SAM records.)
We create a SAM record by supplying a json file
(which looks a lot like a Python dictionary or a fcl table) that contains
keyword/value pairs. We include the keyword/value
pairs for the file metadata that we want to supply for
the file's SAM file record.
These json files could be written (or edited) by hand, but it is far easier to run jsonMaker on the file. This Python script is put in your path with the dhtools product:
setup dhtools
Running jsonMaker will produce all the mundane metadata like file size. For art files, it will run a fast art executable over the file to extract information like the number of events in the file. This means a version of offline must be set up to run jsonMaker. The code checks certain required fields are present and other rules, checks consistency, and writes in a known correct format.
jsonMaker has a help ("-h") option that shows the optional switches. Much of the help concerns moving or copying files to the upload area; this functionality is obsolete. Please see the upload examples for how to use everything.
Here is an example json file output. Please do not use this for upload, let jsonMaker do the right thing...
{
  "dh.description": "cd3-beam-g4s1-dsregion",
  "file_type": "mc",
  "file_name": "sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art",
  "dh.first_subrun": 5,
  "file_size": 4290572,
  "file_format": "art",
  "dh.first_run_event": 1002,
  "dh.last_event": 10000,
  "dh.last_subrun": 5,
  "dh.last_run_event": 1002,
  "dh.last_run_subrun": 1002,
  "dh.first_run_subrun": 1002,
  "data_tier": "sim",
  "dh.first_event": 5,
  "dh.source_file": "/pnfs/mu2e/phy-sim/sim/mu2e/cd3-beam-g4s1-dsregion/0506a/001/307/sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art",
  "runs": [
    [
      1002,
      5,
      "mc"
    ]
  ],
  "dh.configuration": "0506a",
  "event_count": 3018,
  "dh.owner": "mu2e",
  "content_status": "good",
  "dh.dataset": "sim.mu2e.cd3-beam-g4s1-dsregion.0506a.art",
  "dh.sha256": "e3b5b426ce6c6d4dd2b9fcf2bccb4663205235d3e3fb6011a8dc49ef2ff66dbb",
  "dh.sequencer": "001002_00000005"
}