FileTools

From Mu2eWiki
Latest revision as of 22:08, 19 July 2024

mu2etools

Tools to help set up fcl for a grid project. Set up with

mu2einit
source an offline setup.sh script
setup mu2etools

generate_fcl

A project may consist of several grid submissions, and each of those submissions may have many jobs. Each of these jobs will typically have a unique fcl file to drive it. For simulation, this fcl will include a unique random number seed, run numbers and output file names. generate_fcl will take a template fcl file, add these unique parts, and write out the complete set of fcl files for the project.

mu2efiletools

Tools to deal with files during grid operations, including methods to list, check, count, move and upload all the files of a dataset. Some of these operations can be done with unix commands like find, but these scripts are recommended because they incorporate the most efficient methods, which are not always obvious. They all have "-h" help.

Official production datasets, and user datasets manipulated by the file tools, will appear under the following designated dataset areas, corresponding to the three dCache flavors (scratch, persistent, and tape):

  • /pnfs/mu2e/scratch/datasets
  • /pnfs/mu2e/persistent/datasets
  • /pnfs/mu2e/tape

Set up with

mu2einit
setup mu2efiletools

mu2eClusterFileList

Given a dataset name and directories that are the output of clusters in a simulation project, find all the files in those directories that belong to the dataset, and print their relative path names. (Note that this is more efficient than unix "find", which runs stat() on each file and overloads the dCache database.)

mu2eClusterCheckAndMove

Given a directory that is the output of one cluster in a simulation project, sort through each job directory and check for success. Move each job's files to a "good" or "failed" directory based on the results. Checks:

  • the logfile is present
  • zero exit code from the art process
  • correct transfer of all job outputs per the manifest (sha256sum check)
  • identify duplicate jobs, both within the cluster and globally
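
The checksum part of these checks can be illustrated with a minimal Python sketch. This is not the code used by mu2eClusterCheckAndMove: the manifest format is not described on this page, so the `expected` dictionary (file name to SHA-256 hex digest) and the function names `sha256_of_file` and `check_outputs` are assumptions made only for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def check_outputs(job_dir, expected):
    """Compare files in job_dir against expected digests.

    expected is a hypothetical stand-in for the manifest: a dict that maps
    each output file name to its SHA-256 hex digest.
    Returns a list of (name, reason) tuples; an empty list means every
    listed file is present and matches.
    """
    problems = []
    for name, digest in expected.items():
        path = Path(job_dir) / name
        if not path.is_file():
            problems.append((name, "missing"))
        elif sha256_of_file(path) != digest.lower():
            problems.append((name, "checksum mismatch"))
    return problems
```

A job whose `check_outputs` result is empty would be a candidate for the "good" directory in this sketch; the real script also checks the logfile, the art exit code, and duplicate jobs.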

mu2eDatasetList

Print lists of datasets that are known to SAM - all or optionally with restrictions (see -h).

mu2eDatasetFileList

List the files in a SAM dataset and print either just the filenames, or the full file paths. The output is based on the SAM file listing and the automatic path name algorithm.

mu2eDatasetLocation

Given a SAM dataset name, loop over the files that do not have locations. If there is a file in the standard location and the CRC matches, then add that location to the SAM record.

mu2eDatasetDelete

Delete SAM records or physical disk files for a dataset.

mu2eFileDeclare

Reads names of json files from stdin, and uses each to declare a new file to SAM. If you see errors while declaring files, check that you have a valid certificate.

mu2eFileMoveToTape

Move a file to tape, declare its SAM record, and wait for the file to be confirmed on tape (may take 24h to complete).

mu2eFileUpload

Will read file names on stdin and try to move each file to its standard location in dCache. Command options will determine which flavor of dCache (scratch, persistent, or tape).

mu2eFileUpload also assigns pnfs paths to files. The algorithm, described by Andrei in [https://mu2e-hnews.fnal.gov/HyperNews/Mu2e/get/Sim/941/3/2/1.html], is as follows:

The authoritative code that defines file paths is in the mu2efilename package. Try

perldoc /cvmfs/mu2e.opensciencegrid.org/artexternals/mu2efilename/v3_4/perllib/Mu2eFilename.pm

(also perldoc on other files in the same directory).  That still does
not document the exact algorithm, which is an implementation detail as
long as all file name manipulation is done via the mu2efilename package.

The spreader is obtained by computing a SHA-256 digest of the base
file name, and using the four initial characters in its hex
representation.  For example:

name = sim.mu2e.cd3-detmix-cut.1109a.000001_00001162.art
digest = aaa96d7c4800bdcc483aeb3b60e4393cb76a2570c8b4f1d6aa4967c04132d804
spreader = aa/a9
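
The description above can be sketched in Python as follows. This is only an illustration of the stated recipe (SHA-256 of the base file name, first four hex characters, split into two directory levels as in "aa/a9"); the function name `spreader` is hypothetical, UTF-8 encoding of the name is an assumption, and the authoritative implementation remains the mu2efilename package.

```python
import hashlib


def spreader(base_name):
    """Return the two-level spreader directory for a base file name.

    Follows the description on this page: take the SHA-256 digest of the
    base file name and use the first four hex characters, written as two
    directory levels of two characters each.
    """
    # Assumption: the name is hashed as UTF-8 bytes with no trailing newline.
    digest = hashlib.sha256(base_name.encode("utf-8")).hexdigest()
    return digest[:2] + "/" + digest[2:4]


if __name__ == "__main__":
    # The example name quoted above; if the sketch matches the real
    # algorithm, this should print the spreader listed there.
    print(spreader("sim.mu2e.cd3-detmix-cut.1109a.000001_00001162.art"))
```

The input must be the base file name only, without any directory part, since the digest is computed over the name itself.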

mu2eMissingJobs

Query the SAM database and print the fcl file names for which mu2eprodsys jobs submitted with the given --dsconf and --dsowner did not complete successfully, as defined by mu2eClusterCheckAndMove.

mu2eClusterArchive

A script to archive log (and possibly other) files from a mu2eprodsys cluster to tape and register the archive in SAM.


printJson

All files to be uploaded to tape need to have a SAM file record. We create a SAM record by supplying a json file that contains keyword/value pairs. We include pairs for the file metadata that we want to supply for the file's SAM file record.

Running printJson will produce all the mundane metadata like file size. For art files, it will run a fast art executable over the file to extract information like the number of events in the file. This means a version of offline must be set up to run printJson.

SAM has parent-child links between files. printJson cannot determine the parents of a file, so you must either supply the list of parents in a text file or tell printJson that there are no parents; see "-h".

If you are running standard production scripts in mu2egrid, printJson is embedded in the scripts, so you do not have to run it yourself.

Here is an example json file output.

{
    "dh.description": "cd3-beam-g4s1-dsregion", 
    "file_type": "mc", 
    "file_name": "sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art", 
    "dh.first_subrun": 5, 
    "file_size": 4290572, 
    "file_format": "art", 
    "dh.first_run_event": 1002, 
    "dh.last_event": 10000, 
    "dh.last_subrun": 5, 
    "dh.last_run_event": 1002, 
    "dh.last_run_subrun": 1002, 
    "dh.first_run_subrun": 1002, 
    "data_tier": "sim", 
    "dh.first_event": 5, 
    "dh.source_file": "/pnfs/mu2e/phy-sim/sim/mu2e/cd3-beam-g4s1-dsregion/0506a/001/307/sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art", 
    "runs": [
        [
            1002, 
            5, 
            "mc"
        ]
    ], 
    "dh.configuration": "0506a", 
    "event_count": 3018, 
    "dh.owner": "mu2e", 
    "content_status": "good", 
    "dh.dataset": "sim.mu2e.cd3-beam-g4s1-dsregion.0506a.art", 
    "dh.sha256": "e3b5b426ce6c6d4dd2b9fcf2bccb4663205235d3e3fb6011a8dc49ef2ff66dbb", 
    "dh.sequencer": "001002_00000005"
}


File path tools

These tools, listed here

  • mu2eabsname_tape
  • mu2eabsname_disk
  • mu2eabsname_scratch

can be given a SAM (six-dot-field) file name, and will return the full path for this file in the respective dCache areas. The subdirectories in the path are all derived from the file name, and the resulting path is unique. Files may be stored anywhere temporarily, but when they go to their permanent (or semi-permanent for scratch) location they should go here.

They only read stdin:

 > ls sim.mu2e.example-beam-g4s1.1812a.16638329_000016.art | mu2eabsname_scratch
/pnfs/mu2e/scratch/datasets/phy-sim/sim/mu2e/example-beam-g4s1/1812a/art/f8/29/sim.mu2e.example-beam-g4s1.1812a.16638329_000016.art
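
To make the six-field naming convention concrete, here is a small Python sketch that splits a file name into its fields and derives the dataset name. The field names are taken from the metadata keys in the printJson example on this page (data_tier, dh.owner, dh.description, dh.configuration, dh.sequencer, file_format); the class and function names are hypothetical, and this is not the code inside the mu2eabsname tools, which should be used to obtain real paths.

```python
from typing import NamedTuple


class Mu2eFileName(NamedTuple):
    """The six dot-separated fields of a Mu2e SAM file name."""
    tier: str
    owner: str
    description: str
    configuration: str
    sequencer: str
    file_format: str


def parse_file_name(name):
    """Split a base file name into its six fields, or raise ValueError."""
    fields = name.split(".")
    if len(fields) != 6 or not all(fields):
        raise ValueError("expected six non-empty dot-separated fields: " + name)
    return Mu2eFileName(*fields)


def dataset_name(name):
    """Return the dataset name: the file name with the sequencer removed."""
    f = parse_file_name(name)
    return ".".join(
        [f.tier, f.owner, f.description, f.configuration, f.file_format]
    )
```

For the file in the printJson example, `dataset_name` reproduces the value of its "dh.dataset" key, and the `sequencer` field matches "dh.sequencer".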

jsonMaker

As of 7/2020, jsonMaker is deprecated and has been replaced by printJson.


All files to be uploaded to tape need to have a SAM file record. (Some other semi-permanent files in other locations may also have SAM records.) We create a SAM record by supplying a json file (which looks a lot like a Python dictionary or a fcl table) that contains keyword/value pairs. We include the keyword/value pairs for the file metadata that we want to supply for the file's SAM file record.

These json files could be written (or edited) by hand, but it is far easier to run jsonMaker on the file. This Python script is put in your path with the dhtools product:

setup dhtools

Running jsonMaker will produce all the mundane metadata like file size. For art files, it will run a fast art executable over the file to extract information like the number of events in the file. This means a version of offline must be set up to run jsonMaker. The code checks that certain required fields are present, applies other rules, checks consistency, and writes the output in a known correct format.

jsonMaker has a help ("-h") option to show the optional switches. There is a lot about moving or copying files to the upload area - this is obsolete functionality. Please see the upload examples for how to use everything.

Here is an example json file output. Please do not use this for upload, let jsonMaker do the right thing...

{
    "dh.description": "cd3-beam-g4s1-dsregion", 
    "file_type": "mc", 
    "file_name": "sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art", 
    "dh.first_subrun": 5, 
    "file_size": 4290572, 
    "file_format": "art", 
    "dh.first_run_event": 1002, 
    "dh.last_event": 10000, 
    "dh.last_subrun": 5, 
    "dh.last_run_event": 1002, 
    "dh.last_run_subrun": 1002, 
    "dh.first_run_subrun": 1002, 
    "data_tier": "sim", 
    "dh.first_event": 5, 
    "dh.source_file": "/pnfs/mu2e/phy-sim/sim/mu2e/cd3-beam-g4s1-dsregion/0506a/001/307/sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art", 
    "runs": [
        [
            1002, 
            5, 
            "mc"
        ]
    ], 
    "dh.configuration": "0506a", 
    "event_count": 3018, 
    "dh.owner": "mu2e", 
    "content_status": "good", 
    "dh.dataset": "sim.mu2e.cd3-beam-g4s1-dsregion.0506a.art", 
    "dh.sha256": "e3b5b426ce6c6d4dd2b9fcf2bccb4663205235d3e3fb6011a8dc49ef2ff66dbb", 
    "dh.sequencer": "001002_00000005"
}