Revision as of 15:40, 6 April 2018
This page is out of date, please help update it!
Introduction
Keeping all Intensity Frontier data on disks is not practical, so large datasets must be written to tape. At the same time, the data must always be available and delivered efficiently. The solution is coordinating several subsystems:
- dCache: a set of disk servers, a database of files on the servers, and services to deliver those files with high throughput
- [scratchDcache.shtml scratch dCache] : a dCache where least used files are purged as space is needed.
- tape-backed dCache: a dCache where all files are on tape and are cycled in and out of the dCache as needed
- pnfs: an nfs server behind the /pnfs/mu2e partition which looks like a file system to users, but is actually an interface to the dCache file database.
- Enstore: the Fermilab system of tape and tape drive management
- SAM: Sequential Access via Metadata, a database of file metadata and a system for managing large-scale file delivery
- FTS:File Transfer Service, a process which manages the intake of files into the tape-backed dCache and SAM.
- jsonMaker: a piece of mu2e code which helps create and check metadata when creating a SAM record of a file
- SFA: Small File Aggregation, enstore can tar up small files into a single large file before it goes to tape, to increase tape efficiency.
The basic procedure is for the user to run the jsonMaker on a data file to make the json file, then copy both the data file and the json into an FTS area in [scratchDcache.shtml scratch dCache] called a dropbox. The json file is essentially a set of metadata fields with the corresponding values. The FTS will see the file with its json file, copy the file to a permanent location in the tape-backed dCache, and use the json to create a metadata record in SAM. The tape-backed dCache will migrate the file to tape quickly, and the SAM record will be updated with the tape location. Users will use [sam.shtml SAM] to read the files in tape-backed dCache.
Since there is some overhead in uploading, storing, and retrieving each file, the ideal file size is as large as reasonable. This size limit should be determined by how long an executable will typically take to read the file. This will vary according to exe settings and other factors, so a conservative estimate should be used. A file should be sized so that the longest jobs reading it take about 4 to 8 hours to run, which generally provides efficient large-scale job processing. A grid job that reads a few files in 4 hours is nearly as efficient, so you can err on the small side. You definitely want to avoid a single job section requiring only part of a large file. Generally, file sizes should not go over 20 GB in any case, because larger files become less convenient in several ways. Files can be concatenated to make them larger, or split to make them smaller. Note - we have agreed that a subrun will only appear in one file. Until we get more experience with data handling, and see how important these effects are, we will often upload files in the size we make them or find them.
Once files have been moved into the FTS directories, please do not try to move or delete them since this will confuse the FTS and require a hand cleanup. Once files are on tape, there is an expert procedure to delete them, and files of the same name can then be uploaded to replace the bad files.
Recipe
If you are about to run some new Monte Carlo in the official framework, then the upload will be built into the scripts and documented with the mu2egrid [ Monte Carlo submission] process. This is under development; please ask Andrei for the status.
Existing files on local disks can be uploaded using the following steps.
The best approach would be to read quickly through the rest of this
page for concepts then focus on the
[uploadExample.shtml upload examples] page.
- choose values for the [#metadata SAM Metadata] , including the appropriate [#ff file family]
- record the above items in a json file fragment that will apply to all the files in your dataset
- [#name rename] your files by the upload convention (This can also be done by jsonMaker in the next step.)
- setup an offline release and run the [#jsonMaker jsonMaker] to write the json file, which will include the fragment from the previous step
- use "ifdh cp" to copy the data file and the full json file to the FTS area /pnfs/mu2e/scratch/fts (This step can also be done by jsonMaker.)
- use [sam.shtml SAM] to access the file or its metadata
The following is some detail you should be aware of in general, but a detailed knowledge is not required.
File Families
A file family is a set of files which are grouped exclusively on the same
subset of tapes. File families are used to indicate files that may be treated differently during data-handling operations. This might include tape library location, groupings for migration, deletion, or copy offsite, groupings for access priority or dcache location or lifetime.
Here are the mu2e file families. These should be used for all uploading.
- phy-sim Monte Carlo simulated or reconstructed art files. These are official collaboration samples only: originated, produced, validated, and documented by physics groups and intended for long-term use by many collaborators. Examples are the TDR and CD3 samples. The username associated with the files will be the production username "mu2e".
- phy-nts non-art format ntuples of phy-sim
- phy-etc configuration files, tarballs of log files, backups, and other files
- usr-sim Monte Carlo simulated or reconstructed art files. These samples are produced by one or a few individuals for use in their personal studies. They are probably for short-term use, not documented publically, and not used by many collaborators. The username associated with these files will be the person most likely to understand how they were created and how they should be used if questions come up a year or two later - the intellectual owner of the data.
- usr-nts Non-art format ntuples of usr-sim
- usr-etc Other user-created tarballs of log files, backups
- tst-cos Testbeam and cosmic data created before commissioning. This would include raw data formats as well as various possible derived formats and tarballs
For real data taking, more file families will be created to hold raw data, reconstructed data, and ntuples, etc.
When uploading files, you will need to specify the file family. You will probably only use usr-sim (for Monte Carlo art files), usr-nts (for ntuples), and usr-etc (for tarballs and anything else).
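The file family choice described above can be sketched in code. This is an illustrative sketch only, not an official tool (jsonMaker takes the family explicitly via -f); the `file_family` helper and the grouping of art data tiers are assumptions based on the descriptions in this section.

```python
# Sketch: choose a file family from the owner and data tier, following the
# descriptions above: "phy" families for official "mu2e"-owned samples,
# "usr" for personal samples; "sim" for Monte Carlo art tiers, "nts" for
# ntuples, "etc" for everything else. The ART_TIERS set is an assumption.
ART_TIERS = {"sim", "mix", "dig", "mcs"}

def file_family(owner: str, data_tier: str) -> str:
    prefix = "phy" if owner == "mu2e" else "usr"
    if data_tier in ART_TIERS:
        return f"{prefix}-sim"
    if data_tier == "nts":
        return f"{prefix}-nts"
    return f"{prefix}-etc"

print(file_family("batman", "mcs"))  # usr-sim
print(file_family("mu2e", "nts"))    # phy-nts
```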
SAM Metadata
One of SAM's main purposes is to store metadata about our files. The mu2e instance of a SAM database has a unique set of metadata fields, listed below. We can add to them and, except for a few fundamental fields, we can use them as we see fit. We will require that useful fields be filled wherever possible, and try to make it convenient for users to fill those fields.
SAM does not have the concept of dataset metadata, so all metadata has to be supplied for each file. See the [#name file name section] for a definition of a dataset.
All the metadata fields can be listed:
samweb list-parameters
samweb list-parameters < parameter >
samweb list-values --help-categories
samweb list-values < category >
The contents and validity of any file or dataset cannot be reliably determined only by a database entry, if for no other reason than that you don't know whether the database has been maintained. It is not uncommon to find obsolete or invalidated data, unmarked, in repositories. Expert consultations, validation, peer review, and vigilance are always required when selecting and processing data for critical work.
In the following table, "json" refers to an optional json file the user supplies for every uploaded file. "generic json" refers to a file the user will provide, one per dataset uploaded. "[#jsonMaker jsonMaker]" refers to the jsonMaker executable that the user will run. Worked examples are available on the [uploadExample.shtml upload examples] page. The following metadata is required for all uploaded files
<b>file_size</b> Integer, size in bytes - from json or jsonMaker
<b>crc</b> Integer - supplied by FTS
Note, for debugging purposes, this crc can be computed by: setup encp v3_11; ecrc filename
<b>create_user</b> String - SAM user name (usually a group account) - from FTS
<b>create_date</b> Date, when uploaded - supplied by FTS
<b>file_name</b> String - supplied by running jsonMaker
See [#name file name documentation]
<b>data_tier</b> String - from filename or jsonMaker
for physics data: raw (raw data), rec (reconstructed data), ntd (data ntuples)
for ExtMon data: ext (ExtMon raw), rex (ext production), xnt (ext data ntuples)
for simulation: cnf (set of config files, fcl or txt, to drive MC jobs), sim (result of geant, StepPointMC), mix (mixed sim files, has multiple generators), dig (detector hits, like raw data), mcs (reconstructed data files), nts (MC ntuples)
other categories: log (log files), bck (backups), etc (anything else), job (a production record)
<b>dh.owner</b> String - from filename or jsonMaker
For official data samples and Monte Carlo that go into the phy* [#ff file families], this will be "mu2e". For user files, it will be the username of the person most likely to understand how they were created and how they should be used if questions come up a year or two later - the intellectual owner of the data.
<b>dh.description</b> String - from filename or jsonMaker
This is a mnemonic string which fundamentally indicates what this set of files contains and its intended purpose. It can also be thought of as a conceptual project and/or a high-level indication of the physics. This should be limited to 20 characters, but may be more if it is strongly motivated. Examples are "tdr-beam" or "2014-cosmics". It might contain the name of the responsible group. It should not contain a username or detailed configurations.
<b>dh.configuration</b> String - from filename or jsonMaker
This field is intended to capture details of the configuration, or variations of the configurations of the physics indicated in the description field. It might indicate cvs tags/git branches of configurations, fcl file names, offline versions, geometry versions, magnet settings, filter settings, etc. Not all of this information needs to be included; this is just a list of the sort of information intended for this field. This should be limited to 20 characters, but may be more if it is strongly motivated. For complex configurations, in order to avoid a very lengthy string, you should capture all information in a simple string like "tag100" or "c427-m1-g5-v2" which is documented elsewhere.
<b>dh.sequencer</b> String - from filename or jsonMaker
This field is simply to give unique filenames to files of a single dataset. It could be a counter 0000, 0001, etc. For art files we will try to make it rrrrrrrr_ssssss, where the r field indicates the run number and the s field indicates the subrun of the lowest sorted subrun eventID in the file. A subrun should only appear in one file, so this is uniquely determined for a file in a dataset.
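For art files, the run/subrun form of the sequencer amounts to zero-padded formatting. A minimal sketch (jsonMaker fills this field itself for art files; the helper name is hypothetical and the 8/6 digit widths are inferred from the rrrrrrrr_ssssss pattern and the examples on this page):

```python
# Build a dh.sequencer string from the lowest run/subrun in an art file.
# Illustrative only; jsonMaker generates this field in practice.
def make_sequencer(run: int, subrun: int) -> str:
    return f"{run:08d}_{subrun:06d}"

print(make_sequencer(12345678, 123456))  # 12345678_123456
print(make_sequencer(12345, 100))        # 00012345_000100
```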
<b>dh.dataset</b> String - from filename or jsonMaker
A convenient search field made from the file name without the sequencer. It is unique for a logical dataset.
<b>file_format</b> String - from filename, json, or jsonMaker
This is the commonly-recognized file type, one of a fixed list (that can be extended): art, root, txt, tar, tgz, tbz (tar.bz2), log, fcl
<b>content_status</b> String - from jsonMaker
Always "good" at upload; can be set to "bad" later to deprecate files without deleting them.
<b>file_type</b> String - supplied by running jsonMaker
"data", "MC" or "other"
The following metadata is required for all uploaded art files
<b>event_count</b> Integer - from jsonMaker
total physics events in the file
<b>dh.first_run_event</b> Integer - from jsonMaker
run of the lowest sorted physics event ID
<b>dh.first_event</b> Integer - from jsonMaker
event of the lowest sorted physics event ID
<b>dh.last_run_event</b> Integer - from jsonMaker
run of the highest sorted physics event ID
<b>dh.last_event</b> Integer - from jsonMaker
event of the highest sorted physics event ID
<b>dh.first_run_subrun</b> Integer - from jsonMaker
run of the lowest sorted subrun
<b>dh.first_subrun</b> Integer - from jsonMaker
subrun of the lowest sorted subrun
<b>dh.last_run_subrun</b> Integer - from jsonMaker
run of the highest sorted subrun
<b>dh.last_subrun</b> Integer - from jsonMaker
subrun of the highest sorted subrun
<b>runs</b> List of lists - from jsonMaker
The list of subruns in this file, represented as a list of triplets of: run, subrun, run_type
<b>run_type</b> String - from jsonMaker
This parameter is not supplied once per file, but once per run in the "runs" parameter. Must be from a fixed list of "test", "MC", or "other". For data, values like "beam", "calib", "cosmic" will be added. Primarily used for data; all Monte Carlo will be called "MC". Different types of MC will be identified by the generator_type and primary_particle fields.
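As an illustration of the "runs" value described above, a minimal sketch building the list of triplets for a Monte Carlo file (the subrun numbers here are hypothetical):

```python
# Build the "runs" metadata value: one [run, subrun, run_type] triplet per
# subrun in the file. All Monte Carlo uses run_type "MC", per the text above.
subruns = [(1234, 1), (1234, 2), (1235, 7)]  # hypothetical subruns
runs = [[run, subrun, "MC"] for run, subrun in subruns]
print(runs)  # [[1234, 1, 'MC'], [1234, 2, 'MC'], [1235, 7, 'MC']]
```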
The following metadata is required for all uploaded Monte Carlo files
<b>mc.generator_type</b> String - from json or generic json
One of pre-defined values: "beam", "stopped_particle", "cosmic", "mix", or "unknown"
<b>mc.simulation_stage</b> Integer - from json or generic json
Which step in multi-step generation
<b>mc.primary_particle</b> String - from json or generic json
One of pre-defined values: "proton", "pbar", "electron", "muon", "neutron", "mix", or "unknown"
The following metadata is optional
<b>dh.source_file</b> String - from json or jsonMaker
The full file spec of the data file on disk, useful for understanding the history of the file and for identifying this file as a parent of other files.
<b>parents</b> List of Strings
For files derived from other specific SAM files, this contains the SAM names of the parent files
<b>retire_date</b> Date
When this field is filled, the file becomes permanently retired in the enstore system and may be overwritten
The following metadata is only for production records
<b>job.cpu</b> int - job cpu time in sec
<b>job.maxres</b> int - job max resident size in KB
<b>job.site</b> string - job grid site name
<b>job.node</b> string - job node
<b>job.disk</b> int - job disk space used, in KB
The following metadata may be created for real data
<b>start_time</b> Date - time the file was opened during data-taking
<b>end_time</b> Date - time the file was closed during data-taking
The real data will require others such as run types, goodrun bits, detector configuration, etc.
Metadata fields can be added at any time for files created in the future. New metadata fields for existing files can be added but may be quite hard to fill, depending on how the information needs to be gathered.
File Names
File names should be relatively short, but include logical patterns to base searches on, and contain some human-recognizable, useful information to help someone distinguish datasets, be sure they are running on the right files, pick a file for testing code, etc. The file name must be unique, and should be mnemonic and helpful, but should not be primarily designed as, or assumed to be, complete and clear documentation of the file contents.
All fields of the file name should contain only alphanumeric characters, hyphens, and underscores.
Mu2e will name all files to be uploaded with the following pattern:
<b><font size=+1>data_tier.owner.description.configuration.sequencer.file_format</font></b>
These fields all correspond to required SAM metadata fields. If you remove the sequencer from a file name, you create a string that is unique for this logical dataset, and that will be put in the "dh.dataset" field. Datasets are all files with the same conceptual and actual metadata except for run numbers and other natural run dependence, and contain no duplicated event ID numbers. SAM does not have the concept of dataset metadata, so files are made into a conceptual dataset by giving the files the same metadata. All files in a logical dataset will have the same "dh.dataset" field content, which will be unique to this dataset. With owner in the file name, potential name conflicts will only occur within one user's files.
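The naming convention can be checked mechanically: six dot-separated fields, each restricted to alphanumerics, hyphens, and underscores, with dh.dataset obtained by dropping the sequencer. A sketch under those assumptions (the `parse_name` helper is illustrative, not part of any Mu2e tool):

```python
import re

# Sketch: split an upload file name into its six dotted fields, validate
# the allowed characters, and derive dh.dataset by dropping the sequencer.
FIELD_RE = re.compile(r"^[A-Za-z0-9_-]+$")

def parse_name(filename: str) -> dict:
    fields = filename.split(".")
    if len(fields) != 6:
        raise ValueError("expected 6 dot-separated fields")
    if not all(FIELD_RE.match(f) for f in fields):
        raise ValueError("fields may contain only alphanumerics, '-' and '_'")
    tier, owner, desc, conf, seq, fmt = fields
    dataset = ".".join([tier, owner, desc, conf, fmt])  # name minus sequencer
    return {"data_tier": tier, "dh.owner": owner, "dh.dataset": dataset}

md = parse_name("sim.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art")
print(md["dh.dataset"])  # sim.mu2e.tdr-beam.TS3ToDS23.art
```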
An official Monte Carlo may have datasets for cnf, sim, mix, dig, mcs, nts and log, and their file names might look like:
cnf.mu2e.tdr-beam.TS3ToDS23.001-0001.fcl
sim.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
mix.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
dig.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
mcs.mu2e.tdr-beam.TS3ToDS23.12345678_123456.art
nts.mu2e.tdr-beam.TS3ToDS23.12345678_123456.root
log.mu2e.tdr-beam.TS3ToDS23.001.tgz
If a new digitization (dig) file were to be made with a different mix file, then a derived name could be used. Since this is a new set of conditions, it makes sense to modify the configuration field:
dig.mu2e.tdr-beam.TS3ToDS23-v2.12345678_123456.art
When making variations there is a temptation to include all the information related to the change in the file name. For example, when switching the mix input from 2014.tag123 to 2014a.tag456, it is tempting to add that instead:
dig.mu2e.tdr-beam.TS3ToDS23-mix2014a-tag456.12345678_123456.art
This style can get out of hand quickly, leading to large, unwieldy names, so we should favor (always with judgment and common sense) simplifying to just "v2", which must be documented elsewhere.
If a user created the change for his own purposes, he would make it into usr data (and put it in the appropriate file family) by including his user name:
dig.batman.tdr-beam.TS3ToDS23-v2.12345678_123456.art
Raw, reconstructed and ntuple beam data might look like:
raw.mu2e.streamA.triggerTable123.12345678_123456.art
rec.mu2e.streamA.triggerTable123.12345678_123456.art
ntd.mu2e.streamA.triggerTable123.0001.root
A backup of an analysis project might look like:
bck.batman.node123.2014-06-04.aa.tgz
pnfs
/pnfs/mu2e is an nfs server which looks like a file system, but is actually an interface to the dCache file database. Users may interact directly with the [scratchDcache.shtml scratch dCache], but users will typically never look into the tape-backed dCache area in /pnfs/mu2e. Users will only write to tape through the FTS, not directly to the tape-backed dCache. The user would typically read from tape-backed dCache using SAM only, but during the transition to SAM, and while data loads are manageable, it is OK to use file lists. Remember, /pnfs is a database, so you can overload it with demanding queries such as "find .", so please avoid that.
When files are copied into tape-backed dCache, the FTS will move them to a directory made of the file family at the head, followed by fields of the file name, two counters, and the filename:
/pnfs/mu2e/file_family/data_tier/user/description/configuration/counter1/counter0/filename
For example, if a file named
mcs.batman.2014-cosmic.tag001.00012345_000100.art
is uploaded, it would go into the file spec
/pnfs/mu2e/usr-sim/mcs/batman/2014-cosmic/tag001/000/000/mcs.batman.2014-cosmic.tag001.00012345_000100.art
Counter0 and counter1 are created from the SAM file ID and essentially increment when there are 1000 files in the directory, so datasets can have up to a billion files.
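The path layout can be sketched as follows. The exact derivation of the counters from the SAM file ID is internal to the FTS; this sketch assumes counter0 advances every 1,000 files and counter1 every 1,000,000, consistent with the billion-file capacity noted above.

```python
# Sketch of the tape-backed dCache path layout described above.
# The counter derivation from the SAM file ID is an assumption.
def upload_path(family: str, filename: str, sam_file_id: int) -> str:
    tier, owner, desc, conf, _seq, _fmt = filename.split(".")
    counter1 = (sam_file_id // 1_000_000) % 1000
    counter0 = (sam_file_id // 1000) % 1000
    return (f"/pnfs/mu2e/{family}/{tier}/{owner}/{desc}/{conf}/"
            f"{counter1:03d}/{counter0:03d}/{filename}")

print(upload_path("usr-sim",
                  "mcs.batman.2014-cosmic.tag001.00012345_000100.art", 42))
# /pnfs/mu2e/usr-sim/mcs/batman/2014-cosmic/tag001/000/000/mcs.batman.2014-cosmic.tag001.00012345_000100.art
```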
jsonMaker
The jsonMaker is a python script which lives in the dhtools product and should be available at the command line after "setup dhtools." Please see the [uploadExample.shtml upload examples] page for details.
All files to be uploaded should be processed by the jsonMaker, which writes the final json file to be included with the data file in the FTS input directory. Even if all the final json could be written by hand, the jsonMaker checks that required fields are present, enforces other rules, checks consistency, and writes in a known correct format.
Simply run the maker with all the data files and json fragment(s) as input. The help text of the program is below. The most useful practical reference is the [uploadExample.shtml upload examples] page.
jsonMaker [OPTIONS] ... [FILES] ...
Create json files which hold metadata information about the file to be uploaded. The file list can contain data, and other types, of files (foo.bar) to be uploaded. If foo.bar.json is in the list, its contents will be added to the json for foo.bar. If a generic json file is supplied, its contents will be added to all output json files. Output is a json file for each input file, suitable for presenting to the upload FTS server together with the data file. If the input file is an art file, jsonMaker must run a module over the file in order to extract run and event information, so a mu2e offline release that contains the module must be setup.
-h print help
-v LEVEL verbose level, 0 to 10, default=1
-x perform write/copy of files. Default is to evaluate the upload parameters, but not write or move anything.
-c copy the data file to the upload area after processing. Will move the json file too, unless overridden by an explicit -d.
-m mv the data file to the upload area after processing. Useful if the data file is already in /pnfs/mu2e/scratch where the FTS is. Will move the json file too, unless overridden by an explicit -d.
-e just rename the data file where it is
-s FILE FILE contains a list of input files to operate on.
-p METHOD how to match an input json file to a data file:
  METHOD="none" no json input file for each data file (default)
  METHOD="file" pair an input json file with a data file based on the fact that if the file is foo, the json is foo.json.
  METHOD="dir" pair a json file and a data file based on the fact that they are in the same directory, whatever their names are.
-j FILE a json file fragment to add to the json for all files, typically used to supply MC parameters.
-i PAR=VALUE a json file entry to add to the json for all files, like -i mc.primary_particle="neutron" -i mc.simulation_stage=2. Can be repeated. Will supersede values given in -j.
-a FILE a text file with parent file sam names - usually would only be used if there was one data file to be processed.
-t TAG text to prepend to the sequencer field of the output filename. This can be useful for non-art datasets which have different components uploaded at different times with different jsonMaker commands, but intended to be in the same dataset, such as a series of backup tarballs from different stages of processing.
-d DIR directory to write the json files in. Default is ".". If DIR="same" then write the json in the same directory as the data file. If DIR="fts" then write it to the FTS directory. If -m or -c is set, then -d "fts" is implied unless overridden by an explicit -d.
-f FILE_FAMILY the file_family for these files - required
-r NAME this will trigger renaming the data files by the pattern in NAME. Example: -r mcs.batman.beam-2014.fcl-100..art. The blank sequencer ".." will be replaced by a sequence number like ".0001." or first run and subrun for art files.
-l DIR write a file of the data file name and json file name followed by the fts directory where they should go, suitable for driving an "ifdh cp -f" command to move all files in one lock. This file will be named for the dataset plus "command" plus a time string.
-g the command file will be written (implies -l) and then, when all files are evaluated and json files written, execute the command file with "ifdh cp -f commandfile". Useful to use one lock file to execute all ifdh commands. Nullifies -c and -m.
Requires python 2.7 or greater for subprocess.check_output and 2.6 or greater for json module.
version 2.0
Links
[scratchDcache.shtml scratch dCache]
json.org
SAM (file access) for mu2e
SAM
samweb
samweb user guide
samweb command reference
SAMWEB default metadata fields
ifdh_art
FTS monitor 01   02   03
FTS listing
[sam-note.html Note] on FTS upload steps and timing
Introduction
These are examples of how to upload files to tape. The first example is a set of Monte Carlo art files from a single dataset, and created with arbitrary names. Please read this example first because it is the most common use case and contains important overview information that is not repeated in the other examples. "jsonMaker -h" gives a useful summary help. If your dataset is large, containing more than 10,000 files or more than 500GB, please see the section on [#big large datasets] . The section on [#tools tools] gives a few examples of commonly-used commands. A complete description of all SAM procedures and tools is available at the [sam.shtml SAM page] . If jsonMaker stops with an error like "subprocess.check_output does not exist", it means you are using the wrong version of python, please start a new window with the setup as recommended below.
MC Example 1
This is the most common use case. You have a set of MC files on disk and want to put them on tape. They are all the same dataset. The files are not named according to the [tapeUpload.html#name naming convention]. The first step is to determine how to name the files by defining the description, configuration, etc. These fields are described in more detail on the [tapeUpload.html#metadata metadata] page.
For example, going through the decisions for the fields in the name:
<b><font size=+1>data_tier.owner.description.configuration.sequencer.file_format</font></b>
- data_tier. This is reconstructed Monte Carlo, so it is "mcs." (Simulated but not reconstructed is "sim").
- owner. You have generated this yourself for a study, so the owner is your username
- description. You know these files are for a target geometry study, so that should go here. You know others have also generated MC for this purpose, but you won't conflict with them since you are using your username. The generator is stopped muons, and that is an important high-level physics description, so you should include that. You think you might do this whole study again in the near future so you decide to add a version number so description is "trgt_geo_stopped_v0"
- Configuration. You are testing 10 geometries so it is easy to simply call them "geom0" etc.
- sequencer. Since these are art files, the sequencer will be generated by jsonMaker from the run and subrun numbers.
- file_format. These are art files, so the extension will be "art".
Your "rename" string will look like:
mcs.batman.trgt_geo_stopped_v0.geom0..art
The ".." is intentional, to let jsonMaker know to generate the missing sequencer. Note that by changing the ".." to "." you will have the string that is the name of your dataset:
mcs.batman.trgt_geo_stopped_v0.geom0.art
This will be put in the dh.dataset field and is the most common way you will refer to this dataset.
Next you need to pick the [tapeUpload.html#ff file family]. In this case the files were not generated and documented by the collaboration, so the first part should be "usr", and the files are Monte Carlo art files, so they go in "sim"; therefore the file family is "usr-sim".
The next step is to write a little generic json file to provide the other required fields that the jsonMaker cannot supply. Call it temp.json:
{ "mc.generator_type" : "stopped_particle", "mc.simulation_stage" : 3, "mc.primary_particle" : "muon" }
Note there are commas between the field-value pairs and that strings are quoted, but numbers are not. This information can also be provided on the command line directly by the "-i" switch.
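Since a missing comma or a misquoted value is the usual slip when writing these fragments by hand, it can be worth parsing the file with a json library before running jsonMaker. A minimal check, using the fragment above:

```python
import json

# Sanity-check a generic json fragment before handing it to jsonMaker:
# valid JSON syntax, plus presence of the three required MC fields.
text = '''{
  "mc.generator_type"   : "stopped_particle",
  "mc.simulation_stage" : 3,
  "mc.primary_particle" : "muon"
}'''
md = json.loads(text)  # raises an error on a syntax slip (e.g. missing comma)
assert {"mc.generator_type", "mc.simulation_stage", "mc.primary_particle"} <= md.keys()
print(md["mc.simulation_stage"])  # 3
```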
Then run the jsonMaker.
setup mu2e
source setup.sh   [setup a mu2e Offline release]
setup dhtools     [add jsonMaker to the path, must be after setup.sh]
kinit             [in case copying to dcache]
Run a test (no -x switch) on one file to make sure the final command will work
jsonMaker -f usr-sim -j temp.json -v 5 \
 -r mcs.batman.trgt_geo_stopped_v0.geom0..art \
 one_of_your_data_files
If there are any errors, they will be printed at the end. They will need to be fixed.
If OK, then commit to the full run. The switch "-c" asks for the data and the json file to be copied to the FTS area, under the appropriate subdirectory according to the file family.
jsonMaker -f usr-sim -x -c -j temp.json \
 -r mcs.batman.trgt_geo_stopped_v0.geom0..art \
 *all_your_data_files*
There are other options for how to run the jsonMaker; please run "jsonMaker -h" or see the reference [tapeUpload.html#jsonMaker here]. For example, if your files are already in scratch dCache (/pnfs/mu2e/scratch/..) then you can "mv" them inside of the scratch dCache to the FTS, also in scratch dCache, which would be more efficient than copying them. You can ask jsonMaker to just write out the json files (-x -d with no -c or -m). It can generate a file containing a list of move commands that can be given to ifdh, so they can be run with one lock. With -g, jsonMaker will also execute this command. You can always consult with the offline group if you have questions or a special case. Uploading errors can be fixed, but that can be complex, so it is far better to ask questions before rather than after.
For non-art files, jsonMaker will run very quickly. For art files, it has to run a mu2e executable to extract the run numbers. This takes 2s per file, and can take up to 60s if the file is large. In general, we recommend limiting single runs of jsonMaker to 10K files. Larger datasets can be broken into smaller subsets which can be run separately. It may be easiest to do this with the file list input style (-s) instead of command line wildcards.
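Splitting a large dataset into jsonMaker-sized batches takes only a few lines. A sketch (the file names are hypothetical; each sublist would be written to a text file and passed via -s):

```python
# Sketch: split a long list of data files into sublists of at most 10,000,
# one per "-s" file-list input to jsonMaker, per the recommendation above.
def chunks(files, size=10_000):
    return [files[i:i + size] for i in range(0, len(files), size)]

# Hypothetical file names for illustration.
files = [f"mcs.batman.trgt_geo_stopped_v0.geom0.{i:08d}_000001.art"
         for i in range(25_000)]
lists = chunks(files)
print([len(c) for c in lists])  # [10000, 10000, 5000]
```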
MC Example 2
In this example, the user has provided some additional metadata which is unique for each file. This could be an original file location in "dh.source_file," or parent file names (must be SAM file names). jsonMaker cannot probe anything but art files for run numbers. If you want to upload an ntuple and include run numbers in the SAM metadata, then you can do that by writing a json file for each data file. As a concrete example, suppose a json file like this for each datafile:
{ "parents" : [ "mcs.batman.trgt_geo_stopped_v0.geom0.12345678_123456.art" ] }
The process in this case is the same as in example 1, with one item added. You need to tell jsonMaker how to determine which json file belongs with which data file. There are two methods: pairing by the fact that if the data file is foo, then the json file is foo.json, or pairing the json file to whatever data file is in the same directory. In this second case, there can only be one data file and one json file in each directory.
The command is the same as example 1, but with a pairing directive in "-p" and the json files added to the input on the command line.
jsonMaker -f usr-sim -x -c -j temp.json -p dir \
   -r mcs.batman.trgt_geo_stopped_v0.geom0..art \
   *all_your_data_files* *all_your_json_files*
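The per-file json files themselves can be generated with a small shell loop. This is a sketch; the data file names and the parent name are illustrative:

```shell
# For each data file foo, write a paired foo.json declaring its parent.
# Parent names must be SAM file names; this one is made up.
parent="mcs.batman.trgt_geo_stopped_v0.geom0.12345678_123456.art"
for f in nts.batman.mystudy.geom0.000.root nts.batman.mystudy.geom0.001.root; do
    touch "$f"    # stand-in for a real data file
    printf '{ "parents" : [ "%s" ] }\n' "$parent" > "$f.json"
done
ls *.json
```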
MC Ntuple Example
You have a set of ntuple (root) files on disk and want to put them on tape. They all belong to the same dataset. The files are not named according to the [tapeUpload.html#name naming convention]. The first step is to determine how to name the files by defining the description, configuration, etc., as in MC Example 1.
- data_tier. These are root ntuple files, so the data_tier is "nts".
- owner. You have generated these yourself for a study, so the owner is your username.
- description. If these ntuples were made by reading an art file dataset, it may make sense to use the same description and configuration as that parent dataset; the different data_tier in the name will distinguish them. So use "trgt_geo_stopped_v0" (from MC example 1).
- configuration. You are testing 10 geometries, so it is easy to simply call them "geom0", etc.
- sequencer. Since these are not art files, the sequencer will not be the run and subrun, but a sequential counter generated by jsonMaker.
- file_format. These are root files, but not art files, so the extension will be "root".
Your "rename" string will look like:
nts.batman.trgt_geo_stopped_v0.geom0..root
Next you need to pick the [tapeUpload.html#ff file family]. In this case the files were not generated and documented by the collaboration, so the first part should be "usr", and the files are Monte Carlo root ntuple files, so they go in "nts"; therefore the file family is "usr-nts". The next step is to write a small generic json file to provide the other required fields that jsonMaker cannot supply. jsonMaker will sense that this is MC from the data_tier and require that you supply these fields. Call it temp.json:
{ "generator_type" : "stopped_particle",
  "simulation_stage" : 3,
  "primary_particle" : "muon" }

Then run jsonMaker:
jsonMaker -f usr-nts -x -c -j temp.json \
   -r nts.batman.trgt_geo_stopped_v0.geom0..root \
   *all_your_data_files*
Grid Example
In this case, suppose you are generating files on the grid and want to upload them efficiently. These might be Monte Carlo output art files or ntuple files. The best thing to do is to run jsonMaker on the grid node to produce the json file. Copy your data file and json file back to dCache; then, when you are ready, copy or mv them into the upload area.
Please see the other examples for details of how to run jsonMaker for your particular case, but in general there are a couple of options to point out here. One is "-e", which allows renaming of the data file in place. "-d" defaults to writing the json file in the local directory.
setup mu2e
source setup.sh    [set up a mu2e Offline release]
setup dhtools      [add jsonMaker to the path; must be after setup.sh]
jsonMaker -f usr-sim -x -e -j generic.json \
   -r mcs.batman.trgt_geo_stopped_v0.geom0..art \
   your_data_file
ifdh cp mcs* /pnfs/mu2e/scratch/users/batman/outdir

In this case, after all processes are done and you've checked the output in dCache, you can move the data files and their json to the fts directory. To avoid putting too many files in one subdirectory, there are subdirectories below /pnfs/mu2e/scratch/fts/usr-sim; please spread the files out among them. A data file and its json need to go into the same directory.
If you believe things are running smoothly, you can move the data and json directly into the uploader:

jsonMaker -f usr-sim -x -m -j generic.json \
   -r mcs.batman.trgt_geo_stopped_v0.geom0..art \
   your_data_file
If you are generating files that are not art files, jsonMaker will not have the run and subrun to give the files a unique sequencer. One way to handle this is through the "-t" switch: you could add -t "${CLUSTER}_${PROCESS}", or a tag based on the first run and event in the ntuple. You could also rename the file and its json yourself according to the rename scheme (then do not use -r or -e) and include your own sequencer. Finally, it might be easiest to write the ntuples to scratch dCache and then run jsonMaker on the full set of files interactively, so it can assign sequence numbers logically.
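The rename-it-yourself option can be sketched as follows. All names here are illustrative, and in a real grid job the CLUSTER and PROCESS values would come from the job environment rather than being set by hand:

```shell
# Rename non-art files with a hand-made sequencer built from grid job
# variables, so every file in the dataset gets a unique name.
CLUSTER=123456   # illustrative; normally set by the grid environment
PROCESS=7
for f in myntuple_a.root myntuple_b.root; do
    touch "$f"   # stand-in for a real ntuple
    mv "$f" "$(printf 'nts.batman.trgt_geo_stopped_v0.geom0.%s_%03d.root' "$CLUSTER" "$PROCESS")"
    PROCESS=$((PROCESS + 1))
done
ls nts.*
```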
Log File Tarball Example
You may also want to save the log files from this MC, which you have tarred up in a few tarballs. The file names will be the same, with a few logical changes. The file family changes to "usr-etc", since these are not data files and will not be read like data. The data_tier changes to "bck" and the file_format to "tgz". The command is:
jsonMaker -f usr-etc -x -c -j temp.json -v 5 \
   -r bck.batman.trgt_geo_stopped_v0.geom0..tgz \
   your_mc_tar_files*.tgz

The sequencer field is left blank in the rename string, which causes jsonMaker to fill it in with a counter.
In the examples, the simulated data, ntuples, and tarballs of the log files were uploaded with coordinated dataset names - the same description and configuration fields. This can cause a small conflict when backing up tarballs. For example, suppose there are multiple steps in making the ntuple, each with its own set of log files. A reasonable solution is to keep adding to your backup dataset, keeping the same description and configuration fields, but modifying the sequencer with "-t":
jsonMaker -f usr-etc -x -c -j temp.json -v 5 -t "step2" \
   -r bck.batman.trgt_geo_stopped_v0.geom0..tgz \
   your_other_mc_tar_files*.tgz

Listing the bck.batman.trgt_geo_stopped_v0.geom0.tgz dataset will look like:
bck.batman.trgt_geo_stopped_v0.geom0.000.tgz
bck.batman.trgt_geo_stopped_v0.geom0.001.tgz
bck.batman.trgt_geo_stopped_v0.geom0.step2-000.tgz
bck.batman.trgt_geo_stopped_v0.geom0.step2-001.tgz

Your logically coordinated datasets are then all
*.batman.trgt_geo_stopped_v0.geom0.*
Config File Example
A run of Monte Carlo can be driven by a set of fcl files, one for each grid process. The fcl files could be generated before the job is submitted, and they could contain fixed random seeds, for example. This allows all stages of the MC to be driven by an input dataset and is maximally reproducible.
This example shows how to upload a set of MC fcl files. The file family is "usr-etc", since these are not art or root data files. The data_tier changes to "cnf" (for config) and the file_format to "fcl". Since these are part of an MC production chain, the MC parameters defined in the generic json file will be required. The command is:
jsonMaker -f usr-etc -x -c -j temp.json -v 5 \
   -r cnf.batman.trgt_geo_stopped_v0.geom0..fcl \
   your_fcl_files*.fcl
Backup Example
You have an analysis project you are done with and want to get it off disk, but also save it for the foreseeable future. The file family will be usr-etc, since it is user data and not art files or ntuples.
It is a backup, so data_tier "bck". Since the dataset will include your user name, your description and configuration only have to be unique to you, so pick anything logical, say "target_analysis" for the description and "09_2014" for the configuration.
In this case you don't have to supply the generator info, so you don't need a generic json file at all. The command becomes:
jsonMaker -f usr-etc -x -c -v 5 \
   -r bck.batman.target_analysis.09_2014..tgz \
   your_dir_analysis_tar_files*.tgz
Common Tools
You can see how many of your files are in SAM with:
setup mu2e
setup sam_web_client
samweb count-files "dh.dataset=sim.mu2e.example-beam-g4s1.1812a.art"

You can see how many of your files in SAM have actually gone to tape:
setup mu2e
setup dhtools
samOnTape sim.mu2e.example-beam-g4s1.1812a.art

You can make a list of files in their permanent locations, suitable for feeding to mu2egrid:
setup mu2e
setup dhtools
samToPnfs sim.mu2e.example-beam-g4s1.1812a.art > filelist.txt
Considerations for Large Datasets
Very generally, it takes about 8 hours to move 10,000 files or 500 GB to the FTS for upload. It may take longer if there is network load or dCache is slower than usual for any reason. The time it takes is important because the transfer to dCache requires a Kerberos ticket or VOMS proxy. Your ticket will expire in less than 26 h and the proxy in less than 48 h, and possibly much sooner if you created them a while ago. To keep the ticket from expiring mid-upload, kinit right before starting your jsonMaker command.
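As a rough sanity check, you can estimate the upload time from the approximate 8 h per 10,000 files rate before starting, and compare it to the ticket and proxy lifetimes. The rate is only a rule of thumb:

```shell
# Rough upload-time estimate: ~8 hours per 10,000 files (approximate).
nfiles=25000
est_hours=$(( (nfiles * 8 + 9999) / 10000 ))   # integer round up
echo "estimated upload time: ~${est_hours} h"
# If this approaches the ticket lifetime (<26 h) or proxy lifetime
# (<48 h), split the upload into pieces and kinit before each piece.
```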
The transfers occur after all the metadata has been gathered. If the files are not art format, then this should run very quickly, less than 1s per file. If jsonMaker is running on art files, it will run the executable to extract run and event ranges, which can take up to 1 min for multi-GB files. You can see the rate by running jsonMaker without "-x" as a non-destructive dry run.
If your datasets are larger than the above limits, you probably want to split the upload into pieces and run them as separate jsonMaker commands. If you have named your files by their final dataset name, or if jsonMaker is renaming the files and they are art format, the following is not an issue. If jsonMaker is renaming the files and cannot name them according to run and subrun, as it does with art files, it has to name them with a sequencer which is just a counter. If you break your dataset into 1000-file sections, jsonMaker will want to name the first 1000 with sequencers 0000-0999 and the second also with 0000-0999, and these names will collide. In this case, you can rename the files with your own sequencer before giving them to jsonMaker, so it does not generate the sequencer, or you can add a digit to the sequencer with "-t 0" for the first set, "-t 1" for the second, etc.
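The collision and the "-t" fix can be illustrated with a shell sketch. All file names here are hypothetical, and the exact sequencer format jsonMaker produces may differ:

```shell
# Two chunks would each get counter sequencers 0000-0999 and collide,
# so give each chunk its own tag, mimicking "-t 0", "-t 1" in jsonMaker.
for chunk in 0 1; do
    for i in 0 1 2; do    # 3 files per chunk for demonstration
        touch "$(printf 'bck.batman.example.v0.%s-%04d.tgz' "$chunk" "$i")"
    done
done
ls bck.*    # every name is unique because of the per-chunk tag
```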