FileTools

From Mu2eWiki
== mu2etools==
Tools to help set up fcl files for a grid project.  Setup with
<pre>
mu2einit
source an offline setup.sh script
setup mu2etools
</pre>
===generate_fcl===
A project may consist of several grid submissions, and each of those submissions may have many jobs.  Each of these jobs will typically have a unique fcl file to drive it.  For simulation, this fcl will include a unique random number seed, run numbers,
and output file names.  '''generate_fcl''' will take a template fcl file, add these unique parts, and write out the complete set of fcl files for the project.
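The mechanics can be sketched in Python. This is illustrative only: the template text, the parameter names it sets, and the <code>cnf.*.fcl</code> naming are stand-ins, not what generate_fcl actually emits.

```python
from pathlib import Path

# Illustrative template only: the parameter names (SeedService.baseSeed,
# source.firstSubRun, outputs.out.fileName) stand in for whatever the
# real template fcl sets.
TEMPLATE = """#include "template.fcl"
services.SeedService.baseSeed: {seed}
source.firstSubRun: {subrun}
outputs.out.fileName: "{outfile}"
"""

def write_fcl_set(njobs, outdir, dsname="sim.owner.example-job.v0"):
    """Write one fcl file per job, each with a unique seed, subrun,
    and output file name derived from a common template."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    written = []
    for j in range(njobs):
        outfile = "{}.{:06d}.art".format(dsname, j)
        fcl = outdir / "cnf.{:06d}.fcl".format(j)  # hypothetical naming
        fcl.write_text(TEMPLATE.format(seed=1000 + j, subrun=j, outfile=outfile))
        written.append(fcl)
    return written
```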
 
== mu2efiletools==
Tools to deal with files during grid operations, including methods to list, check, count, move, and upload all the files of a dataset.
Some of these operations can be done with unix commands like find, but these scripts are recommended because they incorporate the most efficient methods, which are not always obvious.  They all have "-h" help.
 
Official production datasets, and user datasets manipulated by the [[FileTools|file tools]], will appear under the following designated dataset areas, corresponding to the scratch, persistent, and tape flavors of [[Dcache|dCache]]:
* /pnfs/mu2e/scratch/datasets
* /pnfs/mu2e/persistent/datasets
* /pnfs/mu2e/tape
 
Setup with
<pre>
mu2einit
setup mu2efiletools
</pre>
 
===mu2eClusterFileList===
Given a dataset name and directories that are the output of clusters in a simulation project, find all the files in that directory that belong to the dataset, and print their relative path names.  (Note that this is more efficient than unix "find", which runs stat() on each file and overloads the dCache database.)
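The name-only matching that makes this cheap can be sketched in Python. This is a sketch under the usual naming convention (a five-field dataset name, and file names that add a sequencer field), not the real script:

```python
import os

def cluster_file_list(topdirs, dataset):
    """List files that belong to `dataset`, matching on the file name alone
    (tier.owner.description.configuration.format, with the file adding a
    sequencer field), so no per-file stat() is needed."""
    tier, owner, desc, conf, ext = dataset.split(".")
    found = []
    for top in topdirs:
        for root, _dirs, files in os.walk(top):
            for f in files:
                p = f.split(".")
                if len(p) == 6 and (p[0], p[1], p[2], p[3], p[5]) == (tier, owner, desc, conf, ext):
                    found.append(os.path.relpath(os.path.join(root, f), top))
    return sorted(found)
```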
 
===mu2eClusterCheckAndMove===
Given a directory that is the output of one cluster in a simulation project, sort through each job directory and check for success.  Move the job's files to a "good" or "failed" directory based on the results. Checks:
* the logfile is present
* zero exit code from the art process
* correct transfer of all job outputs per the manifest (sha256sum check)
* identify duplicate jobs, both within the cluster and globally
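The checks above can be sketched in Python. This is not the real script: the manifest file name and format, and the art completion message used as a success marker, are assumptions, and the duplicate-job check is omitted.

```python
import hashlib, shutil
from pathlib import Path

ART_OK = "Art has completed and will exit with status 0"  # assumed log marker

def check_and_move(jobdir, gooddir, faildir):
    """Per-job checks: a logfile must be present, the art process must have
    reported success in it, and every file listed in the (hypothetical)
    manifest of 'sha256  name' lines must match its checksum.  The job
    directory is then moved under gooddir or faildir accordingly."""
    jobdir = Path(jobdir)
    logs = list(jobdir.glob("*.log"))
    ok = bool(logs) and ART_OK in logs[0].read_text()
    manifest = jobdir / "manifest.txt"   # hypothetical manifest file
    if ok and manifest.exists():
        for line in manifest.read_text().split("\n"):
            if not line.strip():
                continue
            want, name = line.split()
            got = hashlib.sha256((jobdir / name).read_bytes()).hexdigest()
            ok = ok and (got == want)
    dest = Path(gooddir if ok else faildir)
    dest.mkdir(parents=True, exist_ok=True)
    shutil.move(str(jobdir), str(dest / jobdir.name))
    return ok
```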
 
===mu2eDatasetList===
Print lists of datasets that are known to SAM - all or optionally with restrictions (see -h).
 
===mu2eDatasetFileList===
List the files in a SAM dataset and print either just the filenames, or the full file paths.  The output is based on the SAM file listing and the automatic path name algorithm.
 
===mu2eDatasetLocation===
Given a SAM dataset name, loop over the files that do not have locations. If there is a file in the standard location and the CRC matches, then add that location to the SAM record.
 
===mu2eDatasetDelete===
Delete SAM records or physical disk files for a dataset.
 
===mu2eFileDeclare===
Reads names of json files from stdin, and uses each to declare a new file to SAM.
If you see errors while declaring files, check that you have
a [[Authentication|valid certificate]].
 
===mu2eFileMoveToTape===
Move a file to tape, declare its SAM record, and wait for the file to be confirmed on tape (may take 24h to complete).


===mu2eFileUpload===
Will read file names on stdin and try to move each file to its
standard location in dCache.  Command options will determine which flavor of [[Dcache|dCache]] (scratch, persistent, or tape).


mu2eFileUpload also assigns pnfs paths to files. The algorithm, described by Andrei in [https://mu2e-hnews.fnal.gov/HyperNews/Mu2e/get/Sim/941/3/2/1.html a HyperNews post], is as follows.

The authoritative code that defines file paths is in the mu2efilename package.  Try

<pre>
perldoc /cvmfs/mu2e.opensciencegrid.org/artexternals/mu2efilename/v3_4/perllib/Mu2eFilename.pm
</pre>

(also perldoc on other files in the same directory).  That still does
not document the exact algorithm, which is an implementation detail as
long as all file name manipulation is done via the mu2efilename package.

The spreader is obtained by computing a SHA-256 digest of the base
file name, and using the four initial characters in its hex
representation.  For example:

<pre>
name = sim.mu2e.cd3-detmix-cut.1109a.000001_00001162.art
digest = aaa96d7c4800bdcc483aeb3b60e4393cb76a2570c8b4f1d6aa4967c04132d804
spreader = aa/a9
</pre>
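The spreader step is simple enough to reproduce directly; this Python sketch mirrors the description above.

```python
import hashlib

def spreader(basename):
    """Two-level 'spreader' subdirectory: the first four hex characters of
    the SHA-256 digest of the base file name, split as xx/yy."""
    digest = hashlib.sha256(basename.encode()).hexdigest()
    return digest[:2] + "/" + digest[2:4]
```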
 
===mu2eMissingJobs===
Query the SAM database and print fcl file names for which a set of
mu2eprodsys jobs submitted with the given --dsconf and --dsowner
did not complete successfully, as defined by mu2eClusterCheckAndMove.
 
===mu2eClusterArchive===
A script to archive log (and possibly other) files from a mu2eprodsys cluster to tape and register the archive in SAM.
 
 
===printJson===
 
 
All files to be uploaded to tape need to have a [[SAM]] file record.
We create a SAM record by supplying a [http://json.org/ json file]
that contains keyword/value pairs.  We include pairs for the  
file [[SamMetadata|metadata]] that we want to supply for the file's SAM file record.
 
Running printJson will produce all the mundane metadata like file size.
For art files, it will run a fast art executable over the file to extract
information like the number of events in the file.
This means '''a version of offline must be set up to run printJson'''.
 
SAM has parent-child links between files. printJson cannot divine the parents of a file, so you must supply the list of parents in a text file, or tell printJson there are no parents; see "-h".
 
If you are running standard production scripts in mu2egrid, printJson is embedded in the scripts, so you do not have to run it yourself.
 
Here is an example json file output.
 
<pre>
{
    "dh.description": "cd3-beam-g4s1-dsregion",
    "file_type": "mc",
    "file_name": "sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art",
    "dh.first_subrun": 5,
    "file_size": 4290572,
    "file_format": "art",
    "dh.first_run_event": 1002,
    "dh.last_event": 10000,
    "dh.last_subrun": 5,
    "dh.last_run_event": 1002,
    "dh.last_run_subrun": 1002,
    "dh.first_run_subrun": 1002,
    "data_tier": "sim",
    "dh.first_event": 5,
    "dh.source_file": "/pnfs/mu2e/phy-sim/sim/mu2e/cd3-beam-g4s1-dsregion/0506a/001/307/sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art",
    "runs": [
        [
            1002,
            5,
            "mc"
        ]
    ],
    "dh.configuration": "0506a",
    "event_count": 3018,
    "dh.owner": "mu2e",
    "content_status": "good",  
    "dh.dataset": "sim.mu2e.cd3-beam-g4s1-dsregion.0506a.art",  
    "dh.sha256": "e3b5b426ce6c6d4dd2b9fcf2bccb4663205235d3e3fb6011a8dc49ef2ff66dbb",  
    "dh.sequencer": "001002_00000005"
}
</pre>
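The kind of consistency checking printJson does can be illustrated with a small validator over metadata like the example above. The required-key list here is a plausible subset, not the authoritative set.

```python
# Plausible subset of required SAM keys, not the authoritative list.
REQUIRED = ("file_name", "file_size", "file_type", "file_format",
            "data_tier", "dh.dataset", "dh.sha256")

def check_metadata(md):
    """Return a list of problems: missing required keys, or a file_name
    whose non-sequencer fields do not reproduce dh.dataset."""
    problems = [k for k in REQUIRED if k not in md]
    if not problems:
        f = md["file_name"].split(".")
        if len(f) != 6 or ".".join(f[:4] + f[5:]) != md["dh.dataset"]:
            problems.append("file_name inconsistent with dh.dataset")
    return problems
```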




===File path tools===
These tools, listed here
* '''mu2eabsname_tape'''
* '''mu2eabsname_disk'''
* '''mu2eabsname_scratch'''
can be given a [[SAM]] (six-dot-field) [[FileNames|file name]], and will return the full path for this file in the respective [[Dcache|dCache]] areas. The subdirectories in the path are all derived from the file name, and the path is unique.  Files may be stored anywhere temporarily, but when they go to their permanent (or semi-permanent for scratch) location they should go here.

They only read stdin:
<pre>
> ls sim.mu2e.example-beam-g4s1.1812a.16638329_000016.art | mu2eabsname_scratch
/pnfs/mu2e/scratch/datasets/phy-sim/sim/mu2e/example-beam-g4s1/1812a/art/f8/29/sim.mu2e.example-beam-g4s1.1812a.16638329_000016.art
</pre>
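The path construction can be sketched in Python. Two assumptions beyond the text above: the file family is formed as <code>phy-</code> or <code>usr-</code> plus the data tier (phy for owner mu2e, usr otherwise), and the spreader is the first four hex characters of the SHA-256 of the file name, as described under mu2eFileUpload. The authoritative code remains the mu2efilename package.

```python
import hashlib

def mu2e_absname_scratch(basename):
    """Sketch of the automatic path: every level is derived from the file
    name.  Assumptions: the file family is '<phy|usr>-<tier>' (phy for
    owner mu2e), and the spreader is sha256(basename)[:4] split as xx/yy."""
    tier, owner, desc, conf, _seq, ext = basename.split(".")
    family = ("phy-" if owner == "mu2e" else "usr-") + tier
    h = hashlib.sha256(basename.encode()).hexdigest()
    parts = ["/pnfs/mu2e/scratch/datasets", family, tier, owner,
             desc, conf, ext, h[:2], h[2:4], basename]
    return "/".join(parts)
```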
 
== jsonMaker==
'''as of 7/2020, jsonMaker is deprecated - replaced by printJson'''
 
 
All files to be uploaded to tape need to have a [[SAM]] file record.
(Some other semi-permanent files in other locations may also have SAM records.)
We create a SAM record by supplying a [http://json.org/ json file]
(which looks a lot like a Python dictionary or a fcl table) that contains
keyword/value pairs. We include the keyword/value
pairs for the file [[SamMetadata|metadata]] that we want to supply for
the file's SAM file record.
 
These json files could be written (or edited) by hand, but
it is far easier to run jsonMaker on the file. This Python script
is put in your path with the dhtools product:
<pre>
setup dhtools
</pre>

Running jsonMaker will produce all the mundane metadata like file size.
For art files, it will run a fast art executable over the file to extract
information like the number of events in the file.
This means '''a version of offline must be set up to run jsonMaker'''.
The code checks certain required fields are present and other rules,
checks consistency, and writes in a known correct format.

jsonMaker has a help ("-h") option to show the optional switches.
There is a lot about moving or copying files to the upload area -
this is obsolete functionality. Please see the [[Upload|upload]] examples
for how to use everything.

Here is an example json file output. Please do not use this for upload; let jsonMaker
do the right thing.


<pre>
{
    "dh.description": "cd3-beam-g4s1-dsregion",
    "file_type": "mc",
    "file_name": "sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art",
    "dh.first_subrun": 5,
    "file_size": 4290572,
    "file_format": "art",
    "dh.first_run_event": 1002,
    "dh.last_event": 10000,
    "dh.last_subrun": 5,
    "dh.last_run_event": 1002,
    "dh.last_run_subrun": 1002,
    "dh.first_run_subrun": 1002,
    "data_tier": "sim",
    "dh.first_event": 5,
    "dh.source_file": "/pnfs/mu2e/phy-sim/sim/mu2e/cd3-beam-g4s1-dsregion/0506a/001/307/sim.mu2e.cd3-beam-g4s1-dsregion.0506a.001002_00000005.art",
    "runs": [
        [
            1002,
            5,
            "mc"
        ]
    ],
    "dh.configuration": "0506a",
    "event_count": 3018,
    "dh.owner": "mu2e",
    "content_status": "good",
    "dh.dataset": "sim.mu2e.cd3-beam-g4s1-dsregion.0506a.art",
    "dh.sha256": "e3b5b426ce6c6d4dd2b9fcf2bccb4663205235d3e3fb6011a8dc49ef2ff66dbb",
    "dh.sequencer": "001002_00000005"
}
</pre>
[[Category:Computing]]
[[Category:Workflows]]
[[Category:DataHandling]]

Latest revision as of 22:08, 19 July 2024
