Upload: Difference between revisions
Revision as of 03:47, 12 November 2020
Introduction
Mu2e has several forms of disk space available, including large aggregated data disk systems in dCache. We also write a large part of the files we produce to tape, which is less expensive and can hold much more data. We usually write data to tape for one or more of the following reasons:
- to make room for new activity
- to keep it safe for longer than a few months
- to make a permanent record
The tape system is called enstore and consists of several tape libraries and many tape drives with good connections to dCache. We write to tape by copying files into tape-backed dCache, from which they are copied to tape automatically. If the files are unused, then on the scale of weeks they will be deleted from disk so that they exist only on tape. We can have them copied from tape back to disk by prestaging them.
All data written to tape must follow conventions and must be written through production scripts. You may want to familiarize yourself with the links in this list. The recipes below will guide you through the steps. If you are using production scripts for simulation, most of these conventions are provided for you.
- all files are named by mu2e name conventions
- all files will have a SAM record with SAM metadata, including file location in dCache
- all files are uploaded using standard tools, see especially printJson.
Upload Concepts
Generally there are the following conceptual steps to upload a file. The practical recipes using the production scripts are in the following sections.
- rename the files by the standard convention
- use printJson to generate a json file containing the SAM metadata for each data file.
- declare the files to the SAM database using mu2eFileDeclare
- copy the files to tape-backed dCache using mu2eFileUpload
- include the final tape location in the SAM record using mu2eDatasetLocation
Please also see the comments about file sizes in the job planning page.
In the standard MC workflow, the first 2 steps are done for you by the grid job script. Please skip down to that specific workflow for the remainder of that recipe.
For uploading the log files in the standard MC workflow, the log files need to be tarred up first. Please skip down to that specific workflow for the remainder of that recipe.
For uploading random files not associated with the standard MC workflow, please see that workflow.
If you have never uploaded files of a particular type before, you may get a permission denied error during a mkdir command. In this case, please contact mu2eDataAdmin.
MC workflow, art files
In the standard MC workflow, there are three times you might upload files:
- after generating the fcl, uploading the fcl files is part of the generate fcl procedure
- after producing art files (including concatenation if needed), which is described in this section
- upload log files as an archive, which is handled in the following section
After the jobs have completed, and you have checked the output by running the mu2eCheckAndMove script, the output datasets will be below a "good" directory like the following, where you will be working:
cd /pnfs/mu2e/persistent/users/mu2epro/workflow/project_name/good
Below this directory, there are directories for each cluster, and below those, directories for each job. Each output art file named "a.b.c.d.e.f" should have an associated json file called "a.b.c.d.e.f.json", produced as part of the grid job and containing the SAM record metadata.
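A quick consistency check before declaring can catch art files whose json companion is missing. The following is only a minimal sketch and not part of the production scripts: missing_json is a hypothetical helper name, and it assumes that art files end in ".art" and that each json file sits next to its art file, as described above.

```shell
# Hypothetical helper (not a Mu2e tool): print every *.art file below
# the given directory that has no companion "<name>.json" next to it.
missing_json() {
    find "$1" -type f -name '*.art' | sort | while IFS= read -r art; do
        [ -f "$art.json" ] || printf '%s\n' "$art"
    done
}
```

Run it on a cluster directory, for example "missing_json 11986465". No output means that every art file found has its json file.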
There are three steps. First, declare the files to the SAM database:
mu2eClusterFileList --dsname <dataset> --json <cluster_directory> | mu2eFileDeclare
where <dataset> is the dataset name of the files to find and upload, and <cluster_directory> is one of the cluster subdirectories.
If you see errors while declaring files, check that you have a valid certificate.
The second step is to move the files to the final location in tape-backed dCache:
mu2eClusterFileList --dsname <dataset> <cluster_directory> | mu2eFileUpload --ifdh --tape
If you see a permission denied error during a mkdir command, please contact mu2eDataAdmin.
Currently (5/2018) we are seeing an increasing number of problems reading and writing files using the nfs interface to dCache. If you see extreme slowness, you can put in a ticket and ask to have the dCache nfs server restarted. Using the "--ifdh" switch will cause the data to be transferred by more reliable protocols.
If you want a list of the files in their final location, instead of an expensive ls with wildcards, please use a file tool
mu2eDatasetFileList --tape <dataset>
Don't forget that --tape is a binary option, so it doesn't take an "=".
The third step is to tell SAM where the files are in the tape system, to add their "location" to the SAM record.
mu2eDatasetLocation --add=tape <dataset>
Since it takes about a day, or sometimes more, for a file to migrate to tape and establish its tape location after being copied to tape-backed dCache, it makes sense to wait a day before running this command.
This command should be run as many times as needed to get the "Nothing to do" message, which means all the files in the dataset now have their location recorded:
> mu2eDatasetLocation --add=tape sim.mu2e.cd3-pions-cs1.v563.art
No virtual files in dataset sim.mu2e.cd3-pions-cs1.v563.art. Nothing to do on Mon Nov 21 18:11:29 2016.
SAMWeb times: query metadata = 0.00 s, update location = 0.00 s
Summary1: out of 0 virtual dataset files 0 were not found on tape.
Summary2: successfully verified 0 files, added locations for 0 files.
Summary3: found 0 corrupted files and 0 files without tape labels.
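The "re-run until Nothing to do" pattern can be sketched as a small retry loop. This is an illustration under stated assumptions, not a production tool: run_until_done is a hypothetical helper name, and it only assumes that the wrapped command prints the text "Nothing to do" once all locations are recorded, as in the output shown above.

```shell
# Hypothetical helper (not a Mu2e tool).
# Usage: run_until_done MAX_TRIES PAUSE_SECONDS command [args...]
# Re-runs the command until its output contains "Nothing to do".
run_until_done() {
    rud_max=$1
    rud_pause=$2
    shift 2
    rud_try=1
    while :; do
        # Capture stdout and stderr; a failing command is simply retried.
        rud_out=$("$@" 2>&1) || true
        printf '%s\n' "$rud_out"
        case $rud_out in
            *"Nothing to do"*) return 0 ;;
        esac
        [ "$rud_try" -lt "$rud_max" ] || break
        rud_try=$((rud_try + 1))
        sleep "$rud_pause"      # wait before the next attempt
    done
    echo "no \"Nothing to do\" message after $rud_max tries" >&2
    return 1
}
```

For example, "run_until_done 5 86400 mu2eDatasetLocation --add=tape sim.mu2e.cd3-pions-cs1.v563.art" would try once a day for up to five days. The one-day pause reflects the migration time mentioned above and can be adjusted.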
MC workflow, log files
After the desired datasets have been extracted from job outputs in a good area, the mu2eClusterArchive script can be used to save the rest of the files, usually logs and histograms, to tape.
The mu2eClusterArchive script by default archives job logs. "Non-interesting" files, such as the TFileService files with names like "nts.*.root", can either be deleted with e.g.
mu2eClusterFileList --dsname <nts dataset name> <cluster directory> | xargs rm -f
mu2eClusterFileList --dsname <nts dataset name> --json <cluster directory> | xargs rm -f
or archived together with the logs (the recommended production procedure):
> mu2eClusterArchive --allow nts.gandr.cd3-pions-g4s1.v567.root <cluster directory>
1 Mon Nov 21 17:59:05 2016 Working on /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
Mon Nov 21 17:59:06 2016 Try 1: archiving /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
Mon Nov 21 17:59:06 2016 Archiving /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
Mon Nov 21 17:59:06 2016 Registering /pnfs/mu2e/tape/usr-etc/bck/gandr/my-test-s1/v567/tbz/f4/9e/bck.gandr.my-test-s1.v567.002700_00000001.tbz in SAM
Creating a dataset definition for bck.gandr.my-test-s1.v567.tbz
Mon Nov 21 17:59:07 2016 Removing /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
Done archiving 1 directories. Encountered 0 tar errors.
If you are archiving a cluster whose art output was later concatenated before uploading you should also remove these redundant art files:
mu2eClusterFileList --dsname <art dataset name> <cluster directory> | xargs rm -f
Note that the directory to be archived is moved from good into a parallel subdirectory of archiving before any processing is done. This is to prevent race conditions with other scripts that can be working on the same files. If you get an error from mu2eClusterArchive, you can recover by moving the directory back into "good" before trying to archive it again. You may need to delete the output tar file that was being written to the tape-backed dCache area (usually if the error occurred during the creation of this file). The file name should be in the command's printed output.
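The recovery move can be sketched as follows. This is a hedged illustration, not a production script: recovery_command is a hypothetical helper, and it assumes the directory layout seen in the example output above, namely <project>/archiving/<subdirectory>/<cluster>, with the cluster directory belonging directly below <project>/good. It only prints the mv command so that you can inspect it before running anything.

```shell
# Hypothetical helper (not a Mu2e tool): print, without running it, the
# mv command that returns an interrupted cluster directory to "good".
# Assumes the layout <project>/archiving/<subdirectory>/<cluster>.
recovery_command() {
    rc_src=${1%/}                       # drop a trailing slash, if any
    rc_cluster=${rc_src##*/}            # last path component
    rc_project=${rc_src%/archiving/*}   # everything before /archiving/
    if [ "$rc_project" = "$rc_src" ]; then
        echo "not below an archiving directory: $1" >&2
        return 1
    fi
    printf 'mv %s %s\n' "$rc_src" "$rc_project/good/$rc_cluster"
}
```

Pass it the "Working on" path from the mu2eClusterArchive printout, check the printed command, and run it by hand if it looks right.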
If you see a permission denied error during a mkdir command, please contact mu2eDataAdmin.
To record tape label information for a recently archived dataset:
mu2eDatasetLocation --add=tape bck.gandr.my-test-s1.v567.tbz
If there is no tape label, re-run the command later. You may need to wait a day before a new file acquires a tape label.
If you want a list of the files in their final location, instead of an expensive ls with wildcards, please use a file tool
mu2eDatasetFileList --tape bck.gandr.my-test-s1.v567.tbz
Don't forget that --tape is a binary option, so it doesn't take an "=".
Random files
Random files are files that are not created in the process of simulation in the mu2egrid package. For simulation, which is almost all cases, please follow the workflows above.
There are two categories of files:
- data files which you expect to read back as ready-to-use data files. These are usually art files or ntuple files. This also includes fcl files which will be used to feed grid jobs.
- all other data such as txt, logs, scripts, analysis areas, etc. should be put into tarballs (tarred, extension tar; tarred and gzipped, extension tgz; or tarred and bzip2-compressed, extension tbz). Tarballs should be between 0.5 and 2 GB, as an important guideline, but there are no strict limits.
Test beam data is a gray area, where we have used both tarred and untarred files. The choice depends on whether the data will be read back as input data in grid jobs (which should be uploaded as individual files) or archives (upload as tarballs).
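The tarball size guideline above can be checked before upload with a few lines of shell. This is a minimal sketch: outside_size_range is a hypothetical helper name, the limits are passed in bytes, and reading "0.5 to 2 GB" as 500000000 to 2000000000 bytes is an assumption about which GB convention is meant.

```shell
# Hypothetical helper (not a Mu2e tool).
# Usage: outside_size_range MIN_BYTES MAX_BYTES file...
# Prints "<size> <file>" for each file whose size is outside the range.
outside_size_range() {
    osr_min=$1
    osr_max=$2
    shift 2
    for osr_file in "$@"; do
        # wc -c counts bytes; tr removes padding that some wc versions add
        osr_size=$(wc -c < "$osr_file" | tr -d ' ')
        if [ "$osr_size" -lt "$osr_min" ] || [ "$osr_size" -gt "$osr_max" ]; then
            printf '%s %s\n' "$osr_size" "$osr_file"
        fi
    done
}
```

For example, "outside_size_range 500000000 2000000000 *.tgz" lists the tarballs that fall outside the guideline. Since there are no strict limits, the output is a hint and not an error.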
Here are the steps to upload files.
- Please see the file names documentation. You should end up with a 5-field dataset name like these examples:
sim.batman.beam-mytarget.v0.art
bck.batman.node123.2014-06-04.tgz
Here "batman" represents your username.
- Setup tools:
(setup an appropriate Offline version)
setup mu2efiletools
setup dhtools
At this point, you have several more steps to do. You have a choice of two methods:
- use mu2eFileMoveToTape, which does them in one command, but the command takes up to one day to finish
- perform the 4 steps individually, but you have to perform the last one after waiting a day.
There is a one-day delay in both methods because the system will write files to tape as needed, or once a day. We have to wait for the file to go to tape before we can write its tape location in the SAM database, completing its record.
The mu2eFileMoveToTape method will only write the files to tape, but with the multi-step method you can also write the files to a location (determined by the file name) on persistent or scratch dCache.
Option 1, one command, block for a day
- Let this one command do all steps:
mu2eFileMoveToTape <files>
The command will not exit until the files are on tape and a location has been determined, finishing the record.
Option 2, perform individual steps
- Generate a json metadata file for each data file with jsonMaker. The command and examples are given after this list.
- Declare the files to SAM:
ls *.json | mu2eFileDeclare
If you see errors while declaring files, check that you have a valid certificate.
- Move the files to tape-backed dCache:
ls *.art | mu2eFileUpload --tape
If you see a permission denied error during a mkdir command, please contact mu2eDataAdmin.
- After a day or two, come back to the project. By this time, the files will have migrated to tape, and you can record the final tape location:
mu2eDatasetLocation --add=tape <dataset>
Since it is hard to predict exactly when all files will go to tape, you may need to re-run this command occasionally until you get the message "Nothing to do".
The jsonMaker command that generates the metadata files has the form:
jsonMaker -f <file family> -v 5 -e -x -r <dataset name with 6 fields> <datafile(s)>
You can control the verbosity with the number following the -v. Without the -x, the program will read the file and print results, but not make any changes. The "dataset name with 6 fields" means: take your dataset name, like "sim.batman.beam-mytarget.v0.art", and write it with an extra dot, like "sim.batman.beam-mytarget.v0..art", which indicates the missing sequencer. (This quirk in writing a dataset name may be changed to the normal dataset name in the future.)
The result for an art file will be like:
> ls -1
mysim_000.art
mysim_001.art
> jsonMaker -f usr-sim -v 0 -e -x -r sim.batman.beam-mytarget.v0..art *.art
...
> ls -1
sim.batman.beam-mytarget.v0.00001002_000005.art
sim.batman.beam-mytarget.v0.00001002_000005.art.json
sim.batman.beam-mytarget.v0.00001002_000016.art
sim.batman.beam-mytarget.v0.00001002_000016.art.json
The files have been renamed and the json files produced. The sequencer field has been filled appropriately according to the conventions.
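The 6-field form of the dataset name used with -r can be derived mechanically from the normal 5-field name. The sketch below is only an illustration: six_field_name is a hypothetical helper, not part of jsonMaker or mu2efiletools, and it assumes that the last field is the file extension and that the empty sequencer field goes directly before it, as in the examples above.

```shell
# Hypothetical helper (not a Mu2e tool): turn a 5-field dataset name
# into the 6-field form with an empty sequencer field.
six_field_name() {
    # Count the dots; a 5-field name has exactly 4 of them.
    sfn_dots=$(printf '%s' "$1" | tr -cd '.' | wc -c | tr -d ' ')
    if [ "$sfn_dots" -ne 4 ]; then
        echo "expected 5 dot-separated fields: $1" >&2
        return 1
    fi
    # Everything before the last dot, two dots, then the extension.
    printf '%s..%s\n' "${1%.*}" "${1##*.}"
}
```

For instance, "six_field_name sim.batman.beam-mytarget.v0.art" prints sim.batman.beam-mytarget.v0..art, which is the form passed to -r above.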
The command for a different type of file, like a backup tarball might look like:
> ls -1 my*
my_backup_disk0.tgz
my_backup_disk2.tgz
> jsonMaker -f usr-etc -v 5 -e -r bck.batman.node123.2014-06-04..tgz -x my*
...
> ls -1 bck*
bck.batman.node123.2014-06-04.0000.tgz
bck.batman.node123.2014-06-04.0000.tgz.json
bck.batman.node123.2014-06-04.0001.tgz
bck.batman.node123.2014-06-04.0001.tgz.json
In the above examples, the files were not named correctly for upload and we used the -e switch to ask jsonMaker to rename them for us. This is often easiest, particularly if they are art files and we don't know the first run and subrun to make the correct sequencer string. But if you have files named correctly for upload, you can tell jsonMaker to accept the file name as it is (the name still has to follow the conventions).
> ls -1 my*
my_backup_disk0.tgz
my_backup_disk1.tgz
> mv my_backup_disk0.tgz bck.batman.node123.2014-06-04.0.tgz
> mv my_backup_disk1.tgz bck.batman.node123.2014-06-04.1.tgz
> jsonMaker -f usr-etc -v 5 -x *.tgz
...
> ls -1 bck*
bck.batman.node123.2014-06-04.0.tgz
bck.batman.node123.2014-06-04.0.tgz.json
bck.batman.node123.2014-06-04.1.tgz
bck.batman.node123.2014-06-04.1.tgz.json
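Once files carry their final 6-field names, the dataset name needed by commands such as mu2eDatasetLocation can be recovered by dropping the sequencer field. The following is a sketch under that assumption only: dataset_of_file is a hypothetical helper, not a Mu2e tool, and it infers the rule (file name minus sequencer equals dataset name) from the examples on this page.

```shell
# Hypothetical helper (not a Mu2e tool): derive the 5-field dataset
# name from a 6-field file name by removing the sequencer field.
dataset_of_file() {
    dof_base=${1##*/}                   # strip any directory part
    dof_dots=$(printf '%s' "$dof_base" | tr -cd '.' | wc -c | tr -d ' ')
    if [ "$dof_dots" -ne 5 ]; then
        echo "expected 6 dot-separated fields: $dof_base" >&2
        return 1
    fi
    dof_ext=${dof_base##*.}             # last field: the extension
    dof_rest=${dof_base%.*}             # name without the extension
    dof_head=${dof_rest%.*}             # name without the sequencer
    printf '%s.%s\n' "$dof_head" "$dof_ext"
}
```

Applied to bck.batman.node123.2014-06-04.0.tgz this prints bck.batman.node123.2014-06-04.tgz. A name that does not have six fields, such as my_backup_disk0.tgz, is rejected with an error message.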