UploadFTS


This page is out of date, please help update it!

Notes

redmine docs

Introduction

We used the FTS system in 2015, but not since, so this information is out of date.

Keeping all Intensity Frontier data on disks is not practical, so large datasets must be written to tape. At the same time, the data must always be available and delivered efficiently. The solution is coordinating several subsystems:

  • dCache: a set of disk servers, a database of files on the servers, and services to deliver those files with high throughput
    • [scratchDcache.shtml scratch dCache]: a dCache where the least-used files are purged as space is needed.
    • tape-backed dCache: a dCache where all files are on tape and are cycled in and out of the dCache as needed
  • pnfs: an nfs server behind the /pnfs/mu2e partition, which looks like a file system to users but is actually an interface to the dCache file database.
  • Enstore: the Fermilab system of tape and tape drive management
  • SAM: Sequential Access via Metadata, a database of file metadata and a system for managing large-scale file delivery
  • FTS: File Transfer Service, a process which manages the intake of files into the tape-backed dCache and SAM.
  • jsonMaker: a piece of mu2e code which helps create and check metadata when creating a SAM record of a file
  • SFA: Small File Aggregation; Enstore can tar up small files into a single large file before it goes to tape, to increase tape efficiency.


The basic procedure is for the user to run the jsonMaker on a data file to make the json file, then copy both the data file and the json into an FTS area in [scratchDcache.shtml scratch dCache] called a dropbox. The json file is essentially a set of metadata fields with the corresponding values. The FTS will see the file with its json file, copy the file to a permanent location in the tape-backed dCache, and use the json to create a metadata record in SAM. The tape-backed dCache will migrate the file to tape quickly, and the SAM record will be updated with the tape location. Users will use [sam.shtml SAM] to read the files in tape-backed dCache.


Since there is some overhead in uploading, storing, and retrieving each file, the ideal file size is as large as reasonable. The size should be determined by how long an executable will typically take to read the file. This will vary according to exe settings and other factors, so use a conservative estimate. A file should be sized so that the longest jobs reading it take about 4 to 8 hours to run, which generally provides efficient large-scale job processing. A grid job that reads a few files in 4 hours is nearly as efficient, so you can err on the small side. You definitely want to avoid a single job section requiring only part of a large file. Generally, file sizes should not go over 20 GB in any case, because larger files become less convenient in several ways. Files can be concatenated to make them larger, or split to make them smaller. Note: we have agreed that a subrun will only appear in one file. Until we get more experience with data handling, and see how important these effects are, we will often upload files in the size we make them or find them.
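
As a rough worked example of that sizing rule (the numbers are invented for illustration, not measured Mu2e rates): if your executable processes 10 events per second and events average 50 kB on disk, a 6-hour job reads about 216,000 events, suggesting a target file size of roughly 11 GB, comfortably under the 20 GB ceiling.

# back-of-envelope target size: rate (ev/s) * job length (s) * event size (bytes)
echo "10 * 6*3600 * 50*1000" | bc    # -> 10800000000, about 11 GB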

Once files have been moved into the FTS directories, please do not try to move or delete them since this will confuse the FTS and require a hand cleanup. Once files are on tape, there is an expert procedure to delete them, and files of the same name can then be uploaded to replace the bad files.


Recipe

If you are about to run some new Monte Carlo in the official framework, then the upload will be built into the scripts and documented with the mu2egrid Monte Carlo submission process. This is under development; please ask Andrei for the status.


Existing files on local disks can be uploaded using the following steps. The best approach is to read quickly through the rest of this page for concepts, then focus on the [uploadExample.shtml upload examples] page. A minimal sketch of the full sequence follows the list below.

  • choose values for the [#metadata SAM Metadata], including the appropriate [#ff file family]
  • record the above items in a json file fragment that will apply to all the files in your dataset
  • [#name rename] your files by the upload convention (This can also be done by jsonMaker in the next step.)
  • setup an offline release and run the [#jsonMaker jsonMaker] to write the json file, which will include the fragment from the previous step
  • use "ifdh cp" to copy the data file and the full json file to the FTS area /pnfs/mu2e/scratch/fts (This step can also be done by jsonMaker.)
  • use [sam.shtml SAM] to access the file or its metadata
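
Putting the steps together, a minimal sketch of the whole recipe (the release, file family, rename string, and file names here are assumptions for illustration; see the worked examples later on this page for real cases):

mu2einit
source setup.sh        # setup a mu2e Offline release
setup dhtools          # puts jsonMaker on the path
kinit                  # needed to copy into dCache

# write a json file for each data file, rename by the upload
# convention (-r), and copy data plus json into the FTS dropbox (-x -c)
jsonMaker -f usr-sim -x -c -j generic.json \
-r mcs.batman.my-study.geom0..art \
*.art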

The following is some detail you should be aware of in general, but a detailed knowledge is not required.


File Families

A file family is a set of files which are grouped exclusively on the same subset of tapes. File families are used to indicate files that may be treated differently during data-handling operations. This might include tape library location; groupings for migration, deletion, or copying offsite; or groupings for access priority, dCache location, or lifetime.

Here are the mu2e file families. These should be used for all uploading.

  • phy-sim Monte Carlo simulated or reconstructed art files. These are official collaboration samples only: originated, produced, validated, and documented by physics groups, and intended for long-term use by many collaborators. Examples are the TDR and CD3 samples. The username associated with the files will be the production username "mu2e".
  • phy-nts non-art format ntuples of phy-sim
  • phy-etc configuration files, tarballs of log files, backups, and other files
  • usr-sim Monte Carlo simulated or reconstructed art files. These samples are produced by one or a few individuals for use in their personal studies. They are probably for short-term use, not documented publicly, and not used by many collaborators. The username associated with these files will be the person most likely to understand how they were created and how they should be used if questions come up a year or two later - the intellectual owner of the data.
  • usr-nts Non-art format ntuples of usr-sim
  • usr-etc Other user-created tarballs of log files, backups
  • tst-cos Testbeam and cosmic data created before commissioning. This would include raw data formats as well as various possible derived formats and tarballs

For real data taking, more file families will be created to hold raw data, reconstructed data, and ntuples, etc.

When uploading files, you will need to specify the file family. You will probably only use usr-sim (for Monte Carlo art files), usr-nts (for ntuples), and usr-etc (for tarballs and anything else).

SAM Metadata

One of SAM's main purposes is to store metadata about our files. The mu2e instance of a SAM database has a unique set of metadata fields, listed below. We can add to them and, except for a few fundamental fields, we can use them as we see fit. We will require that useful fields be filled wherever possible, and try to make it convenient for users to fill those fields.

SAM does not have the concept of dataset metadata, so all metadata has to be supplied for each file. See the [#name file name section] for a definition of a dataset.

All the metadata fields can be listed:

samweb list-parameters
samweb list-parameters <parameter>
samweb list-values --help-categories
samweb list-values <category>

The contents and validity of any file or dataset cannot be reliably determined only by a database entry if, for no other reason, you don't know if the database has been maintained. It is not uncommon to find obsolete or invalidated data, unmarked, in repositories. Expert consultations, validation, peer review, and vigilance are always required when selecting and processing data for critical work. In the following table, "json" refers to an optional json file the user supplies for every uploaded file. "generic json" refers to a file the user will provide, one per dataset uploaded. "[#jsonMaker jsonMaker]" refers to the jsonMaker executable that the user will run. Worked examples are available on the [uploadExample.shtml upload examples] page. The following metadata is required for all uploaded files

file_size Integer, size in bytes - from json or jsonMaker
crc Integer - supplied by FTS

Note, for debugging purposes, this crc can be computed by: setup encp v3_11; ecrc filename

create_user String - SAM user name (usually a group account) - from FTS
create_date Date, when uploaded - supplied by FTS
file_name String - supplied by running jsonMaker

See [#name file name documentation]

data_tier String - from filename or jsonMaker
   for physics data:
      raw
      rec   reconstructed
      ntd   data ntuples
   for ExtMon data:
      ext   ExtMon raw
      rex   ext production
      xnt   ext data ntuples
   for simulation:
      cnf   set of config files, fcl or txt, to drive MC jobs
      sim   result of geant, StepPointMC
      mix   mixed sim files (has multiple generators)
      dig   detector hits, like raw data
      mcs   reconstructed data files
      nts   MC ntuples
   other categories:
      log   for log files
      bck   for backups
      etc   for anything else
      job   for a production record

dh.owner String - from filename or jsonMaker

For official data samples and Monte Carlo that go into the phy* [#ff file families] , this will be "mu2e". For user files, it will be the username of the person most likely to understand how they were created and how they should be used if questions come up a year or two later - the intellectual owner of the data.

dh.description String - from filename or jsonMaker

This is a mnemonic string which fundamentally indicates what this set of files contains and its intended purpose. It can also be thought of as a conceptual project and/or a high-level indication of the physics. This should be limited to 20 characters, but may be more if it is strongly motivated. Examples are "tdr-beam" or "2014-cosmics". It might contain the name of the responsible group. It should not contain a username or detailed configurations.

dh.configuration String - from filename or jsonMaker

This field is intended to capture details of the configuration, or variations of the configurations of the physics indicated in the description field. It might indicate cvs tags/git branches of configurations, fcl file names, offline versions, geometry versions, magnet settings, filter settings, etc. Not all of this information needs to be included; this is just a list of the sort of information intended for this field. This should be limited to 20 characters, but may be more if it is strongly motivated. For complex configurations, in order to avoid a very lengthy string, you should capture all information in a simple string like "tag100" or "c427-m1-g5-v2" which is documented elsewhere.

dh.sequencer String - from filename or jsonMaker

This field simply gives unique filenames to files of a single dataset. It could be a counter 0000, 0001, etc. For art files we will try to make it rrrrrrrr_ssssss, where the r field indicates the run number and the s field indicates the subrun of the lowest sorted subrun eventID in the file. A subrun should only appear in one file, so this is uniquely determined for a file in a dataset.

dh.dataset String - from filename, jsonMaker

A convenient search field made from the file name without the sequencer. It is unique for a logical dataset.

file_format String - from filename, json, or jsonMaker

This is the commonly-recognized file type, one of a fixed list (that can be extended): art, root, txt, tar, tgz, tbz (tar.bz2), log, fcl

content_status String - from jsonMaker

always "good" at upload; can be set to "bad" later to deprecate files without deleting them

file_type String - supplied by running jsonMaker

"data", "MC" or "other"




The following metadata is required for all uploaded art files

event_count Integer - from jsonMaker

total physics events in the file

dh.first_run_event Integer - from jsonMaker

run of the lowest sorted physics event ID

dh.first_event Integer - from jsonMaker

event of the lowest sorted physics event ID

dh.last_run_event Integer - from jsonMaker

run of the highest sorted physics event ID

dh.last_event Integer - from jsonMaker

event of the highest sorted physics event ID

dh.first_run_subrun Integer - from jsonMaker

run of the lowest sorted subrun

dh.first_subrun Integer - from jsonMaker

subrun number of the lowest sorted subrun

dh.last_run_subrun Integer - from jsonMaker

run of the highest sorted subrun

dh.last_subrun Integer - from jsonMaker

subrun number of the highest sorted subrun

runs List of lists - from jsonMaker

The list of subruns in this file, represented as a list of triplets: run, subrun, run_type

run_type String - from jsonMaker

This parameter is not supplied once per file, but once per run entry in the "runs" parameter. Must be from a fixed list of "test", "MC", or "other". For data, values like "beam", "calib", "cosmic" will be added. Primarily used for data; all Monte Carlo will be called "MC". Different types of MC will be identified by the generator_type and primary_particle fields.
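
For example, the "runs" entry in the json for a hypothetical MC file holding run 1234, subruns 0 and 1, would look like:

"runs" : [ [1234, 0, "MC"],
           [1234, 1, "MC"] ]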



The following metadata is required for all uploaded Monte Carlo files

mc.generator_type String - from json or generic json

One of the pre-defined values: "beam", "stopped_particle", "cosmic", "mix", or "unknown"

mc.simulation_stage Integer - from json or generic json

Which step in multi-step generation

mc.primary_particle String - from json or generic json

One of the pre-defined values: "proton", "pbar", "electron", "muon", "neutron", "mix", or "unknown"



The following metadata is optional

dh.source_file String - from json or jsonMaker

The full file spec of the data file on disk, useful for understanding the history of the file and for identifying this file as a parent of other files.

parents List of Strings

For files derived from other specific SAM files, this contains the SAM names of the parent files

retire_date Date

When this field is filled, the file becomes permanently retired in the enstore system and may be overwritten


The following metadata is only for production records

job.cpu Integer

job cpu time in sec

job.maxres Integer

job max resident size in KB

job.site String

job grid site name

job.node String

job node

job.disk Integer

job disk space used, in KB


The following metadata may be created for real data

start_time Date

Time the file was opened during data-taking

end_time Date

Time the file was closed during data-taking



Real data will require other fields such as run types, good-run bits, detector configuration, etc.

Metadata fields can be added at any time for files created in the future. New metadata fields for existing files can be added but may be quite hard to fill, depending on how the information needs to be gathered.

File Names

File names should be relatively short, but include logical patterns to base searches on, and contain some human-recognizable, useful information to help someone distinguish datasets, be sure they are running on the right files, or pick a file for testing code. The file name must be unique, and should be mnemonic and helpful, but should not be primarily designed as, or assumed to be, complete and clear documentation of the file contents.


All fields of the file name should contain only alphanumeric characters, hyphens, and underscores.

Mu2e will name all files to be uploaded with the following pattern:

data_tier.owner.description.configuration.sequencer.file_format

These fields all correspond to required SAM metadata fields. If you remove the sequencer from a file name, you create a string that is unique for this logical dataset, and that will be put in the "dh.dataset" field. Datasets are all files with the same conceptual and actual metadata except for run numbers and other natural run dependence, and contain no duplicated event ID numbers. SAM does not have the concept of dataset metadata, so files are made into a conceptual dataset by giving the files the same metadata. All files in a logical dataset will have the same "dh.dataset" field content, which will be unique to this dataset. With the owner in the file name, potential name conflicts will only occur within one user's files.
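
As a quick sketch of that relationship, the dataset name is just the file name with the fifth (sequencer) field dropped:

echo "sim.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art" \
| awk -F. '{print $1"."$2"."$3"."$4"."$6}'    # -> sim.mu2e.tdr-beam.TS3ToDS23.art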


An official Monte Carlo run may have datasets for cnf, sim, mix, dig, mcs, nts, and log, and examples of their file names might look like:

    cnf.mu2e.tdr-beam.TS3ToDS23.001-0001.fcl
    sim.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
    mix.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
    dig.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
    mcs.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
    nts.mu2e.tdr-beam.TS3ToDS23.12345678_123456.root
    log.mu2e.tdr-beam.TS3ToDS23.001.tgz

If a new digitization (dig) file were to be made with a different mix file, then a derived name could be used. Since this is a new set of conditions, it makes sense to modify the configuration field:

    dig.mu2e.tdr-beam.TS3ToDS23-v2.12345678_123456.art

When making variations there is a temptation to include all the information related to the change in the file name. For example, when switching the mix input from 2014.tag123 to 2014a.tag456, it is tempting to add that instead:

    dig.mu2e.tdr-beam.TS3ToDS23-mix2014a-tag456.12345678_123456.art

This style can get out of hand quickly, leading to large, unwieldy names, so we should prefer (always with judgment and common sense) simplifying to just "v2", which must then be documented elsewhere.


If a user created the change for their own purposes, they would make it into user data (and put it in the appropriate file family) by including their username:

    dig.batman.tdr-beam.TS3ToDS23-v2.12345678_123456.art

Raw, reconstructed and ntuple beam data might look like:

    raw.mu2e.streamA.triggerTable123.12345678_123456.art
    rec.mu2e.streamA.triggerTable123.12345678_123456.art
    ntd.mu2e.streamA.triggerTable123.0001.root

A backup of an analysis project might look like:

    bck.batman.node123.2014-06-04.aa.tgz


pnfs

/pnfs/mu2e is an nfs server which looks like a file system, but is actually an interface to the dCache file database. Users may interact directly with the [scratchDcache.shtml scratch dCache], but users will typically never look into the tape-backed dCache area in /pnfs/mu2e. Users will only write to tape through the FTS, not directly to the tape-backed dCache. The user would typically read from tape-backed dCache using SAM only, but during the transition to SAM, and while data loads are manageable, it is OK to use file lists. Remember, /pnfs is a database, so you can overload it with demanding queries such as "find .", so please avoid that.

When files are copied into tape-backed dCache, the FTS will move them to a directory path made of the file family at the head, followed by the metadata fields of the file name, two counters, and the filename:

/pnfs/mu2e/<file_family>/<data_tier>/<user>/<description>/<configuration>/<counter1>/<counter0>/<filename>

For example if a file named

mcs.batman.2014-cosmic.tag001.00012345_000100.art

is uploaded, it would go into the file spec

/pnfs/mu2e/usr-sim/mcs/batman/2014-cosmic/tag001/000/000/mcs.batman.2014-cosmic.tag001.00012345_000100.art

Counter0 and counter1 are created from the SAM file ID and essentially increment when there are 1000 files in the directory, so datasets can have up to a billion files.
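
A sketch of how the counters might come out of a numeric SAM file ID (the exact arithmetic here is an assumption; the documented behavior is only that the counters come from the SAM file ID and roll over every 1000 files):

id=1234567890                           # hypothetical SAM file ID
counter0=$(( (id / 1000) % 1000 ))      # rolls over every 1,000 files
counter1=$(( (id / 1000000) % 1000 ))   # rolls over every 1,000,000 files
printf "%03d/%03d\n" $counter1 $counter0    # -> 234/567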




jsonMaker

The jsonMaker is a python script which lives in the dhtools product and should be available at the command line after "setup dhtools". Please see the [uploadExample.shtml upload examples] page for details.


All files to be uploaded should be processed by the jsonMaker, which writes the final json file to be included with the data file in the FTS input directory. Even if all the final json could be written by hand, the jsonMaker checks that certain required fields are present, enforces other rules, checks consistency, and writes in a known correct format.


Simply run the maker with all the data files and json fragment(s) as input. The help text is reproduced below. The most useful practical reference is the [uploadExample.shtml upload examples] page.

jsonMaker  [OPTIONS] ... [FILES] ...

  Create json files which hold metadata information about the files
to be uploaded. The file list can contain data files, and other types
of files (foo.bar), to be uploaded.  If foo.bar.json is in the list, 
its contents will be added to the json for foo.bar.
If a generic json file is supplied, its contents will be
added to all output json files.  Output is a json file for each input 
file, suitable for presenting to the upload FTS server together with 
the data file.
   If the input file is an art file, jsonMaker must run
a module over the file in order to extract run and event
information, so a mu2e offline release that contains the module
must be setup.

   -h 
       print help
   -v LEVEL
       verbose level, 0 to 10, default=1
   -x 
       perform write/copy of files.  Default is to evaluate the
       upload parameters, but not write or move anything.
   -c
       copy the data file to the upload area after processing.
       Will move the json file too, unless overridden by an explicit -d.
   -m
       mv the data file to the upload area after processing. 
       Useful if the data file is already in
       /pnfs/mu2e/scratch where the FTS is.
       Will move the json file too, unless overridden by an explicit -d.
   -e
       just rename the data file where it is
   -s FILE
       FILE contains a list of input files to operate on.
   -p METHOD
      How to match an input json file to a data file
      METHOD="none" for no json input file for each data file (default)
      METHOD="file" pair an input json file with a data file based on the 
      fact that if the file is foo, the json is foo.json.
      METHOD="dir" pair a json file and a data file based on the fact that 
      they are in the same directory, whatever their names are.
   -j FILE
       a json file fragment to add to the json for all files,
       typically used to supply MC parameters.
   -i PAR=VALUE
       a json file entry to add to the json for all files, like
        -i mc.primary_particle=neutron
        -i mc.primary_particle="neutron"  
        -i mc.simulation_stage=2 
       Can be repeated.  Will supersede values given in -j
   -a FILE
       a text file with parent file sam names - usually would only
       be used if there was one data file to be processed.
   -t TAG
       text to prepend to the sequencer field of the output filename.
       This can be useful for non-art datasets which have different
       components uploaded at different times with different jsonMaker 
       commands, but intended to be in the same dataset, such as a series
       of backup tarballs from different stages of processing.
   -d DIR
       directory to write the json files in.  Default is ".".
       If DIR="same" then write the json in the same directory as the 
       data file. If DIR="fts" then write it to the FTS directory. 
       If -m or -c is set, then -d "fts" is implied unless overridden by 
       an explicit -d.
   -f FILE_FAMILY
       the file_family for these files - required
   -r NAME
       this will trigger renaming the data files by the pattern in NAME
       example: -r mcs.batman.beam-2014.fcl-100..art
       The blank sequencer ".." will be replaced by a sequence number 
       like ".0001." or first run and subrun for art files.
   -l DIR
       write a file of the data file name and json file name
       followed by the fts directory where they should go, suitable
       for driving a "ifdh cp -f" command to move all files in one lock.
       This file will be named for the dataset plus "command" 
        plus a time string.
   -g 
       the command file will be written (implies -l) and then
       when all files are evaluated and json files written, execute
       the command file with "ifdh cp -f commandfile". Useful
       to use one lock file to execute all ifdh commands.
       Nullifies -c and -m.

  Requires python 2.7 or greater for subprocess.check_output and 
     2.6 or greater for json module.
  version 2.0



Links

[scratchDcache.shtml scratch dCache]
json.org
SAM (file access) for mu2e
SAM
samweb
samweb user guide
samweb command reference
SAMWEB default metadata fields
ifdh_art

FTS monitor 01, 02, 03
FTS listing
[sam-note.html Note] on FTS upload steps and timing

Introduction

These are examples of how to upload files to tape. The first example is a set of Monte Carlo art files from a single dataset, created with arbitrary names. Please read this example first because it is the most common use case and contains important overview information that is not repeated in the other examples. "jsonMaker -h" gives a useful summary help. If your dataset is large, containing more than 10,000 files or more than 500 GB, please see the section on [#big large datasets]. The section on [#tools tools] gives a few examples of commonly-used commands. A complete description of all SAM procedures and tools is available at the [sam.shtml SAM page]. If jsonMaker stops with an error like "subprocess.check_output does not exist", it means you are using the wrong version of python; please start a new window with the setup as recommended below.


MC Example 1

This is the most common use case. You have a set of MC files on disk and want to put them on tape. They are all from the same dataset. The files are not named according to the [tapeUpload.html#name naming convention]. The first step is to determine how to name the files by defining the description, configuration, etc. These fields are described in more detail on the [tapeUpload.html#metadata metadata] page.

For example, going through the decisions for the fields in the name:

data_tier.owner.description.configuration.sequencer.file_format
  • data_tier. This is reconstructed Monte Carlo, so it is "mcs". (Simulated but not reconstructed is "sim".)
  • owner. You have generated this yourself for a study, so the owner is your username.
  • description. You know these files are for a target geometry study, so that should go here. You know others have also generated MC for this purpose, but you won't conflict with them since you are using your username. The generator is stopped muons, and that is an important high-level physics description, so you should include that. You think you might do this whole study again in the near future, so you decide to add a version number; the description is therefore "trgt_geo_stopped_v0".
  • configuration. You are testing 10 geometries, so it is easy to simply call them "geom0" etc.
  • sequencer. Since these are art files, the sequencer will be generated by jsonMaker from the run and subrun numbers.
  • file_format. These are art files, so the extension will be "art".

Your "rename" string will look like:

mcs.batman.trgt_geo_stopped_v0.geom0..art

The ".." is intentional to let jsonMaker know to generate the missing sequencer. Note that by changing the ".." to "." you will have the string that is the name of your dataset

mcs.batman.trgt_geo_stopped_v0.geom0.art

This will be put in the dh.dataset field and is the most common way you will refer to this dataset.

Next you need to pick the [tapeUpload.html#ff file family]. In this case the files were not generated and documented by the collaboration, so the first part should be "usr", and the files are Monte Carlo art files, so they go in "sim"; therefore the file family is "usr-sim".

The next step is to write a little generic json file to provide the other required fields that the jsonMaker cannot supply. Call it temp.json:

{
"mc.generator_type"   : "stopped_particle",
"mc.simulation_stage" : 3,
"mc.primary_particle" : "muon"
}

Note that there are commas between the field-value pairs and that strings are quoted, but numbers are not. This information can also be provided directly on the command line with the "-i" switch.
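
For example, the same three fields could be supplied with "-i" instead of temp.json (a sketch; the file argument is a placeholder):

jsonMaker -f usr-sim -v 5 \
-i mc.generator_type=stopped_particle \
-i mc.simulation_stage=3 \
-i mc.primary_particle=muon \
one_of_your_data_files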

Then run the jsonMaker.

mu2einit
source setup.sh     [setup a mu2e Offline release]
setup dhtools       [add jsonMaker to the path, must be after setup.sh]
kinit               [in case copying to dcache]


Run a test (no -x switch) on one file to make sure the final command will work:

jsonMaker -f usr-sim -j temp.json -v 5 \
-r mcs.batman.trgt_geo_stopped_v0.geom0..art \
one_of_your_data_files

If there are any errors, they will be printed at the end. They will need to be fixed.

If OK, then commit to the full run. The switch "-c" asks for the data and the json file to be copied to the FTS area, under the appropriate subdirectory according to the file family.

jsonMaker -f usr-sim -x -c -j temp.json  \
-r mcs.batman.trgt_geo_stopped_v0.geom0..art \
*all_your_data_files*

There are other options for how to run the jsonMaker; please run "jsonMaker -h" or see the reference [tapeUpload.html#jsonMaker here]. For example, if your files are already in scratch dCache (/pnfs/mu2e/scratch/..) then you can "mv" them inside of the scratch dCache to the FTS, also in scratch dCache, which is more efficient than copying them. You can ask jsonMaker to just write out the json files (-x -d with no -c or -m). It can generate a file containing a list of move commands that can be given to ifdh, so they can be run with one lock. With -g, jsonMaker will also execute this command. You can always consult with the offline group if you have questions or a special case. Uploading errors can be fixed, but that can be complex, so it is far better to ask questions before rather than after.

For non-art files, jsonMaker will run very quickly. For art files, it has to run a mu2e executable to extract the run numbers. This takes 2 s per file, and can take up to 60 s if the file is large. In general, we recommend limiting single runs of jsonMaker to 10K files. Larger datasets can be broken into smaller subsets which can be run separately. It may be easiest to do this with the file list input style (-s) instead of command line wildcards.
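
For example, a sketch of the file-list style for a large art dataset (the directory and list file names are hypothetical):

ls /mu2e/data/users/batman/stage1/*.art | head -10000 > piece1.txt
jsonMaker -f usr-sim -x -c -j temp.json -s piece1.txt \
-r mcs.batman.trgt_geo_stopped_v0.geom0..art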

MC Example 2

In this example, the user has provided some additional metadata which is unique for each file. This could be an original file location in "dh.source_file," or parent file names (must be SAM file names). jsonMaker cannot probe anything but art files for run numbers. If you want to upload an ntuple and include run numbers in the SAM metadata, then you can do that by writing a json file for each data file. As a concrete example, suppose a json file like this for each datafile:

{
  "parents" : [  "mcs.batman.trgt_geo_stopped_vo.geom0.12345678_123456.art"   ]
}

The process in this case is the same as in example 1, with one item added. You need to tell jsonMaker how to determine which json file belongs with which data file. There are two methods: pairing by the fact that if the data file is foo, then the json file is foo.json; or pairing the json file to whatever data file is in the same directory. In this second case, there can only be one data file and json file in each directory.

The command is the same as in example 1, but with a pairing directive in "-p" and the json files added to the input on the command line.

jsonMaker -f usr-sim -x -c -j temp.json -p dir \
-r mcs.batman.trgt_geo_stopped_v0.geom0..art \
*all_your_data_files* *all_your_json_files*



MC Ntuple Example

You have a set of ntuple (root) files on disk and want to put them on tape. They are all from the same dataset. The files are not named according to the [tapeUpload.html#name naming convention]. The first step is to determine how to name the files by defining the description, configuration, etc., as in MC Example 1.

  • data_tier. These are root ntuple files, so it is "nts".
  • owner. You have generated this yourself for a study, so the owner is your username.
  • description. If these ntuples were made by reading an art file dataset, it might make sense to use the same description and configuration as this parent dataset. It will be distinguished by the different data_tier in the name. So use "trgt_geo_stopped_v0" (from MC example 1).
  • configuration. You are testing 10 geometries, so it is easy to simply call them "geom0" etc.
  • sequencer. Since these are not art files, the sequencer will not be run and subrun, but will be a sequential counter, generated by jsonMaker.
  • file_format. These are root files, but not art, so the extension will be "root".

Your "rename" string will look like:

nts.batman.trgt_geo_stopped_v0.geom0..root


Next you need to pick the [tapeUpload.html#ff file family] . In this case the files were not generated and documented by the collaboration, so the first part should be "usr" and the files are Monte Carlo root ntuple files, so they go in "nts", therefore the file family is "usr-nts".

The next step is to write a little generic json file to provide the other required fields that the jsonMaker cannot supply. jsonMaker will sense this is MC by the data_tier and require that you supply these fields. Call it temp.json:

{
"mc.generator_type"   : "stopped_particle",
"mc.simulation_stage" : 3,
"mc.primary_particle" : "muon"
}

Then run the jsonMaker.

jsonMaker -f usr-nts -x -c -j temp.json  \
-r nts.batman.trgt_geo_stopped_v0.geom0..root \
*all_your_data_files*



Grid Example

In this case, suppose you were generating files on the grid and wanted to upload those files efficiently. This might be Monte Carlo output art files or ntuple files. The best thing to do is to run jsonMaker on the grid node to produce the json file. Copy your data file and json file back to dCache; then, when you are ready, copy or mv them into the upload area.

Please see the other examples for details of how to run jsonMaker for your particular case, but in general there are a couple of options to point out here. One is "-e", which allows renaming of the data file in place. "-d" defaults to writing the json file in the local directory.

mu2einit
source setup.sh     [setup a mu2e Offline release]
setup dhtools       [add jsonMaker to the path, must be after setup.sh]

jsonMaker -f usr-sim -x -e -j generic.json  \
-r mcs.batman.trgt_geo_stopped_v0.geom0..art \
your_data_file

ifdh cp mcs* /pnfs/mu2e/scratch/users/batman/outdir

In this case, after all processes are done and you've checked the output in dCache, you can move the data files and their json to the FTS directory. To avoid putting too many files in one subdirectory, we have subdirectories below /pnfs/mu2e/scratch/fts/usr-sim. Please spread out the files among those directories. The data file and its json need to go into the same directory.

If you believe things are running smoothly, you can move the data and json directly into the uploader.

jsonMaker -f usr-sim -x -m -j generic.json \
-r mcs.batman.trgt_geo_stopped_v0.geom0..art \
your_data_file

If you are generating files that are not art files, then jsonMaker will not have the run and subrun to give the files a unique sequencer. One way to handle this is through the "-t" switch. You could add -t "${CLUSTER}_${PROCESS}" or a tag based on the first run and event in the ntuple. You could also rename the file and its json according to the rename scheme (then do not use -r or -e) and include your own sequencer. Finally, it might be easiest to write the ntuples to scratch dCache and then run jsonMaker on the full set of files interactively, so it can assign sequence numbers logically.
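
For example, a grid-side sketch for a non-art file, using the jobsub $CLUSTER and $PROCESS environment variables as the tag (an assumption for illustration):

jsonMaker -f usr-nts -x -e -j generic.json \
-t "${CLUSTER}_${PROCESS}" \
-r nts.batman.trgt_geo_stopped_v0.geom0..root \
your_ntuple_file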



Log File Tarball Example

You may also want to save the log files from this MC, which you have tarred up in a few tarballs. The file names will be the same, with a few logical changes. The file family changes to "usr-etc" since these are not data files and will not be read like data. The data_tier changes to "bck" and the file_format changes to "tgz". The command is:

jsonMaker -f usr-etc -x -c -j temp.json -v 5  \
-r bck.batman.trgt_geo_stopped_v0.geom0..tgz \
your_mc_tar_files*.tgz

The sequencer field is left blank in the rename string, which will cause jsonMaker to fill it in with a counter.

In the examples, the simulated data, the ntuples, and the tarballs of the log files were uploaded with coordinated dataset names - the same description and configuration fields. This can run into a little conflict when backing up tarballs. For example, suppose there are multiple steps in making the ntuple, each with their own set of log files. A reasonable solution is to keep adding to your backup dataset, keeping the same description and configuration fields, but modifying the sequencer with "-t".

jsonMaker -f usr-etc -x -c -j temp.json -v 5 -t "step2" \
-r bck.batman.trgt_geo_stopped_v0.geom0..tgz \
your_other_mc_tar_files*.tgz

Listing the bck.batman.trgt_geo_stopped_v0.geom0.tgz dataset will look like:

bck.batman.trgt_geo_stopped_v0.geom0.000.tgz
bck.batman.trgt_geo_stopped_v0.geom0.001.tgz
bck.batman.trgt_geo_stopped_v0.geom0.step2-000.tgz
bck.batman.trgt_geo_stopped_v0.geom0.step2-001.tgz

Your logically coordinated datasets are then all

*.batman.trgt_geo_stopped_v0.geom0.*


Config File Example

A run of Monte Carlo can be driven by a set of fcl files, one for each grid process. The fcl files could be generated before the job is submitted, and they could contain fixed random seeds, for example. This allows all stages of the MC to be driven by an input dataset and is maximally reproducible.

This example shows how to upload a set of MC fcl files. The file family is "usr-etc" since these are not art or root data files. The data_tier changes to "cnf" (for config) and the file_format changes to "fcl". Since these are part of an MC production chain, the MC parameters in the generic json are required. The command is:

jsonMaker -f usr-etc -x -c -j temp.json -v 5  \
-r cnf.batman.trgt_geo_stopped_v0.geom0..fcl \
your_fcl_files*.fcl



Backup Example

You have an analysis project you are done with and want to get it off disk, but also save it for the foreseeable future. The file family will be usr-etc since it is user data and not art files or ntuples.

It is a backup, so data_tier "bck". Since the dataset will include your user name, your description and configuration only have to be unique to you, so pick anything logical, say "target_analysis" for the description and "09_2014" for the configuration.

In this case you don't have to supply the generator info, so you don't need a generic json file at all. The command becomes:

jsonMaker -f usr-etc -x -c -v 5  \
-r bck.batman.target_analysis.09_2014..tgz \
your_dir_analysis_tar_files*.tgz


Common Tools

You can see how many of your files are in SAM with:

mu2einit
setup sam_web_client
samweb count-files "dh.dataset=sim.mu2e.example-beam-g4s1.1812a.art"

You can see how many of your files in SAM have actually gone to tape:

mu2einit
setup dhtools
samOnTape sim.mu2e.example-beam-g4s1.1812a.art

You can make a list of files in their permanent locations, suitable for feeding to mu2egrid:

mu2einit
setup dhtools
samToPnfs sim.mu2e.example-beam-g4s1.1812a.art > filelist.txt
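
Two more samweb commands are often handy: one to dump a file's SAM metadata, and one to show its locations, including the tape label ("samweb locate-file" is the 'samweb locate' check mentioned in the timing note below). The file name here is a hypothetical member of the dataset above:

mu2einit
setup sam_web_client
samweb get-metadata sim.mu2e.example-beam-g4s1.1812a.00001000_000000.art
samweb locate-file  sim.mu2e.example-beam-g4s1.1812a.00001000_000000.art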


Considerations for Large Datasets

Very generally, it takes about 8 hours to move 10,000 files or 500 GB to the FTS for upload. It might take longer if there is network load or dCache is slower than usual for any reason. The time it takes is important because the transfer to dCache requires a kerberos ticket or VOMS proxy. Your ticket will expire in less than 26 h, and the proxy in less than 48 h, and maybe much less if you created them a while ago. To help prevent the ticket expiring mid-transfer, you can kinit right before starting your jsonMaker command.

The transfers occur after all the metadata has been gathered. If the files are not art format, then this should run very quickly, less than 1s per file. If jsonMaker is running on art files, it will run the executable to extract run and event ranges, which can take up to 1 min for multi-GB files. You can see the rate by running jsonMaker without "-x" as a non-destructive dry run.

If your datasets are larger than the above limits, you probably want to split the upload into pieces and run them as separate jsonMaker commands. If you have named your files by their final dataset name, or if jsonMaker is renaming the files and the files are art format, then the following is not an issue. If jsonMaker is renaming the files and can't name them according to run and subrun, like it does with art files, then it has to name them with a sequencer which is just a counter. If you break your dataset into 1000-file sections, jsonMaker will want to name the first 1000 by the sequencer 0000-0999 and the second also by 0000-0999, and these names will be duplicates. In this case, you can rename the files with your own sequencer before giving them to jsonMaker, so it won't generate the sequencer, or you can add a digit to the sequencer with "-t 0" for the first set and "-t 1" for the second, etc.
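
For example, a sketch of splitting a non-art upload into two pieces with distinct sequencers (piece1.txt and piece2.txt are hypothetical file lists for -s):

jsonMaker -f usr-etc -x -c -t 0 -s piece1.txt \
-r bck.batman.target_analysis.09_2014..tgz
jsonMaker -f usr-etc -x -c -t 1 -s piece2.txt \
-r bck.batman.target_analysis.09_2014..tgz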

Note on upload timing


Michael Diesburg
Wednesday, April 29, 2015 2:52 PM

Here is an outline of what happens for files stored
with FTS:

1) The FTS daemon picks up new file from the dropbox and
    copies it to /pnfs

2) As soon as the copy returns a successful completion code
    FTS starts checking the sam database to see if a tape
    label exists on that pnfs location.   Until a tape label
    shows up, FTS lists the file as "Waiting for Enstore tape label".

3) Depending on how the pnfs destination is configured, one
    of two things can happen:

    1) If the location is not part of an aggregation cache, then
       Enstore queues the file for transfer to tape.   At that
       point it may take anywhere from a few minutes to many hours
       to actually get transferred (depending on how busy the
       tape drives are).

    2) If the pnfs location is part of an aggregation cache, the
       cache will be queued for transfer to tape either when it fills
       up, or when the oldest file in the cache has reached the maximum
       wait time for that cache.  This is usually set to 24 hours.
       So if one file is written to the cache and no more files come
       in, that file may sit on disk for 24 hours before it starts
       the transfer to tape.

4) Once a file is transferred to tape, Enstore makes an entry in
    the "RECENT_FILES_ON_TAPE" listing for that experiment.  I think
    that update is only done once per day around 9:00PM.   That listing
    then contains the actual tape label in Enstore that the file was
    written to.

5) There is a cron job which runs on samweb.fnal.gov and checks the
    'RECENT_FILES_ON_TAPE' listings for each experiment.   It runs once
     every 6 hours.   If it finds a file in the listing it checks the
     locations in the sam database and adds the tape label if it isn't
     already there.

6) The next time FTS checks the file it will find a tape label and update
    the file's status on the FTS web page from "Waiting for Enstore tape label"
    to Completed.


        So the bottom line is that once you see the tape label with
'samweb locate' you are assured that the transfer of the file to tape
has completed.   The downside of this is that in a worst timing
case, it might be up to 54 hours after the file actually makes it to
tape before sam or FTS know it is on tape.