Upload

Introduction

Mu2e has several forms of disk space available, and large aggregated data disk systems are available in dCache. We also write a large part of the files we produce to tape, which is less expensive and can hold much more data. We usually write data to tape for one or more of the following reasons:

  • to make room for new activity
  • to keep it safe for longer than a few months
  • to make a permanent record

The tape system is called enstore and consists of several tape libraries and many tape drives with good connections to dCache. We write to tape by copying files into tape-backed dCache, from which they are copied automatically to tape. Files that go unused are deleted from disk on a scale of weeks, so that they exist only on tape. We can have them copied from tape back to disk by prestaging them.

All data written to tape must follow conventions and must be written through production scripts. Please familiarize yourself with the links in this list.

The following are the common workflow cases.

Upload steps

Generally there are the following steps:

  1. choose the appropriate tape file family
  2. rename the files by the standard convention
  3. use jsonMaker to generate a json file containing the SAM metadata for each data file.
  4. declare the files to SAM using mu2eFileDeclare
  5. copy the files to tape-backed dCache using mu2eFileUpload
  6. include the final tape location into the SAM record, using mu2eDatasetLocation

Please also see the comments about file sizes in the job planning page.

In the standard MC workflow, the first 3 steps are done for you by the scripts. Please skip down to that specific workflow for the remainder.

For uploading the log files in the standard MC workflow, the first 3 steps are done for you by the scripts. Please skip down to that specific workflow for the remainder.

For uploading random files, please see that workflow.

Random files

  1. Please see file family documentation to choose a file family. The result should be a string like "usr-sim".
  2. Please see file names documentation. You should end up with a five-field dataset name (five fields separated by four dots) like these examples:

    sim.batman.beam-mytarget.v0.art
    bck.batman.node123.2014-06-04.tgz

    Here "batman" represents your username.
  3. Generate the metadata file for the data file. Setup:

    (setup an appropriate Offline version)
    setup dhtools

    and generate the json files

    jsonMaker -f <file family> -v 5 -e -x -r <dataset name with 6 fields> <list of datafiles>

    You can control the verbosity with the number following the -v. Without the -x, the program will read the file and print results, but not make any changes. "Dataset name with 6 fields" means that you take your dataset name, like "sim.batman.beam-mytarget.v0.art", and write it with an extra dot, like "sim.batman.beam-mytarget.v0..art", which indicates the missing sequencer. (This quirk in writing a dataset name may be changed to the normal dataset name in the future.)

    The result for an art file will be like:

    > ls -1
    mysim_000.art
    mysim_001.art
    > jsonMaker -f usr-sim -v 0 -e -x -r sim.batman.beam-mytarget.v0..art  *.art
    ...
    > ls -1
    sim.batman.beam-mytarget.v0.00001002_000005.art
    sim.batman.beam-mytarget.v0.00001002_000005.art.json
    sim.batman.beam-mytarget.v0.00001002_000016.art
    sim.batman.beam-mytarget.v0.00001002_000016.art.json

    The files have been renamed and the json files produced. The sequencer field has been filled appropriately according to the conventions.

    The command for a different type of file, like a backup tarball, might look like:

    > ls -1 my*
    my_backup_disk0.tgz
    my_backup_disk2.tgz
    > jsonMaker -f usr-etc -v 5 -e -r bck.batman.node123.2014-06-04..tgz -x my*
    ...
    > ls -1 bck*
    bck.batman.node123.2014-06-04.0000.tgz
    bck.batman.node123.2014-06-04.0000.tgz.json
    bck.batman.node123.2014-06-04.0001.tgz
    bck.batman.node123.2014-06-04.0001.tgz.json

    In the above examples, the files were not named correctly for upload and we used the -e switch to ask jsonMaker to rename them for us. This is often easiest, particularly if they are art files and we don't know the first run and subrun needed to make the correct sequencer string. But if you have files that are already named correctly for upload, you can tell jsonMaker to accept the file name as it is (the name still has to follow the conventions).

    > ls -1 my*
    my_backup_disk0.tgz
    my_backup_disk1.tgz
    > mv my_backup_disk0.tgz bck.batman.node123.2014-06-04.0.tgz
    > mv my_backup_disk1.tgz bck.batman.node123.2014-06-04.1.tgz
    > jsonMaker -f usr-etc -v 5 -x *.tgz
    ...
    > ls -1 bck*
    bck.batman.node123.2014-06-04.0.tgz
    bck.batman.node123.2014-06-04.0.tgz.json
    bck.batman.node123.2014-06-04.1.tgz
    bck.batman.node123.2014-06-04.1.tgz.json

  4. Declare the files to SAM:

    ls *.json | mu2eFileDeclare

  5. Move the files to tape-backed dCache:

    ls *.art | mu2eFileUpload --tape

  6. After a day or two, come back to the project. By this time, the files will have migrated to tape, and you can record the final tape location:

    mu2eDatasetLocation --add=tape <dataset>

    Since it is hard to predict exactly when all files will go to tape, you may need to re-run this command occasionally until you get the message "Nothing to do".
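
The name given to jsonMaker with -r differs from the normal dataset name only by the empty sequencer field, so it can be derived mechanically rather than typed by hand. The following is a minimal sketch in plain shell; the function name dataset_to_six_fields is hypothetical and is not part of dhtools.

```shell
# Turn a normal five-field dataset name into the six-field form with an
# empty sequencer, e.g. a.b.c.d.e -> a.b.c.d..e
# (dataset_to_six_fields is a hypothetical helper, not a Mu2e tool.)
dataset_to_six_fields() {
    local dataset="$1" rest count=1

    # Reject names with empty fields (leading, trailing or doubled dots).
    case "$dataset" in
        *..*|.*|*.)
            echo "empty field in dataset name: $dataset" >&2
            return 1 ;;
    esac

    # Count the dot-separated fields by stripping one field at a time.
    rest="$dataset"
    while [ "${rest#*.}" != "$rest" ]; do
        rest="${rest#*.}"
        count=$((count + 1))
    done
    if [ "$count" -ne 5 ]; then
        echo "expected 5 fields, found $count: $dataset" >&2
        return 1
    fi

    # Everything before the last dot, an empty sequencer, then the extension.
    printf '%s..%s\n' "${dataset%.*}" "${dataset##*.}"
}
```

A dry run could then look like jsonMaker -f usr-sim -v 5 -e -r "$(dataset_to_six_fields sim.batman.beam-mytarget.v0.art)" *.art, which, without the -x, only prints what would be done.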

MC workflow, art files

In the standard MC workflow, there are three times you might upload files:

  • after generating the fcl, uploading the fcl files is part of that procedure
  • after producing art files (including concatenation if needed), which is described in this section
  • upload log files as an archive, which is handled here

After the jobs have completed, the output datasets will be below a directory like the following, where you will be working:

cd /pnfs/mu2e/persistent/users/mu2epro/workflow/project_name/good

Below this directory, there are directories for each cluster, and below those, directories for each job. Each output art file named "a.b.c.d.e.f" should have an associated json file called "a.b.c.d.e.f.json", produced as part of the grid job and containing the SAM record metadata.
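
Before declaring anything, it can be useful to confirm that every data file really has its json partner. The sketch below relies only on the "name" and "name.json" pairing described above; the function name list_missing_json is hypothetical, and the file name pattern is whatever matches your dataset.

```shell
# Print every data file below a directory that has no "<name>.json" partner.
# (list_missing_json is a hypothetical helper, not a Mu2e tool.)
# Usage: list_missing_json <top directory> <file name pattern>
list_missing_json() {
    local top="$1" pattern="$2" f

    # Match data files by base name, skipping the json files themselves.
    find "$top" -type f -name "$pattern" ! -name '*.json' -print |
    while IFS= read -r f; do
        [ -f "$f.json" ] || printf '%s\n' "$f"
    done
}
```

For example, list_missing_json . 'sim.*.art' run from the "good" directory prints nothing when every matching art file has its metadata file.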

There are three steps. First, declare the files to the SAM database:

 mu2eClusterFileList --dsname <dataset> --json <cluster_directory>  | mu2eFileDeclare

where <dataset> is the dataset name of the files to find and upload, and <cluster_directory> is one of the cluster subdirectories. If you see errors while declaring files, check that you have a valid certificate.

The second step is to move the files to the final location in tape-backed dCache:

mu2eClusterFileList --dsname <dataset> <cluster_directory>  | mu2eFileUpload --tape

The third step is to tell SAM where the files are in the tape system, to add their "location" to the SAM record.

mu2eDatasetLocation --add=tape <dataset>

Since it takes about a day, or sometimes more, for a file to migrate to tape and establish its tape location after being copied to tape-backed dCache, it makes sense to wait a day before running this command.

This command should be run as many times as needed to get the "Nothing to do" message, which means all the files in the dataset now have their location recorded:

> mu2eDatasetLocation --add=tape sim.mu2e.cd3-pions-cs1.v563.art
  No virtual files in dataset sim.mu2e.cd3-pions-cs1.v563.art. Nothing to do on Mon Nov 21 18:11:29 2016.
  SAMWeb times: query metadata = 0.00 s, update location = 0.00 s
  Summary1: out of 0 virtual dataset files 0 were not found on tape.
  Summary2: successfully verified 0 files, added locations for 0 files.
  Summary3: found 0 corrupted files and 0 files without tape labels.
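
Re-running the command by hand can be replaced by a small retry loop that stops once the output contains "Nothing to do". This is only a sketch: the function name record_until_done, the number of tries and the pause are hypothetical choices, and the loop assumes nothing about the command beyond that message appearing in its output.

```shell
# Re-run a command until its output contains "Nothing to do".
# (record_until_done is a hypothetical helper, not a Mu2e tool.)
# Usage: record_until_done <max tries> <pause in seconds> <command> [args...]
record_until_done() {
    local max_tries="$1" pause="$2" try=1 output
    shift 2

    while [ "$try" -le "$max_tries" ]; do
        # Capture stdout and stderr; a failing command just counts as a try.
        output="$("$@" 2>&1)" || true
        printf '%s\n' "$output"

        case "$output" in
            *"Nothing to do"*) return 0 ;;
        esac

        # No need to wait after the final attempt.
        if [ "$try" -lt "$max_tries" ]; then
            sleep "$pause"
        fi
        try=$((try + 1))
    done

    echo "gave up after $max_tries tries" >&2
    return 1
}
```

A call such as record_until_done 5 86400 mu2eDatasetLocation --add=tape <dataset> would try once a day for up to five days. Your certificate must stay valid for that long, so running the command from a fresh login each day may be more practical.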

MC workflow, log files

After the desired datasets have been extracted from the job outputs in a good area, the mu2eClusterArchive script can be used to save the rest of the files, usually logs and histograms, to tape.

By default, mu2eClusterArchive archives job logs. "Non-interesting" files, such as TFileService files with names like "nts.*.root", can either be deleted with, e.g.,

mu2eClusterFileList --dsname <nts dataset name>  <cluster directory> | xargs rm -f
mu2eClusterFileList --dsname <nts dataset name> --json <cluster directory> | xargs rm -f

or archived together with the logs (the recommended production procedure):

> mu2eClusterArchive   --allow nts.gandr.cd3-pions-g4s1.v567.root  <cluster directory>
1       Mon Nov 21 17:59:05 2016  Working on /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
Mon Nov 21 17:59:06 2016  Try 1: archiving /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
Mon Nov 21 17:59:06 2016  Archiving /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
Mon Nov 21 17:59:06 2016  Registering /pnfs/mu2e/tape/usr-etc/bck/gandr/my-test-s1/v567/tbz/f4/9e/bck.gandr.my-test-s1.v567.002700_00000001.tbz in SAM
Creating a dataset definition for bck.gandr.my-test-s1.v567.tbz
Mon Nov 21 17:59:07 2016  Removing  /pnfs/mu2e/scratch/users/gandr/workflow/pion-test/archiving/20161121-1759-bwOu/11986465
Done archiving 1 directories. Encountered 0 tar errors.

Note that the directory to be archived is moved from good into a parallel subdirectory of archiving before any processing is done. This is to prevent race conditions with other scripts that may be working on the same files. If you get an error from mu2eClusterArchive, you can recover by moving the directory back into "good" before trying to archive it again.
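
The recovery step is a plain directory move, but it is worth guarding against overwriting a cluster directory that already exists in "good". The following is a cautious sketch; the function name restore_to_good is hypothetical, and both paths are supplied by you, based on the "Working on" line printed by mu2eClusterArchive.

```shell
# Move a cluster directory left behind under "archiving" back into "good".
# (restore_to_good is a hypothetical helper, not a Mu2e tool.)
# Usage: restore_to_good <stuck cluster directory> <good directory>
restore_to_good() {
    local stuck_dir="${1%/}" good_dir="${2%/}" target

    # The cluster directory keeps its own name inside "good".
    target="$good_dir/$(basename "$stuck_dir")"

    [ -d "$stuck_dir" ] || { echo "not a directory: $stuck_dir" >&2; return 1; }
    [ -d "$good_dir" ] || { echo "not a directory: $good_dir" >&2; return 1; }
    if [ -e "$target" ]; then
        echo "refusing to overwrite: $target" >&2
        return 1
    fi

    mv -- "$stuck_dir" "$target"
}
```

After the move, mu2eClusterArchive can be run again on the restored cluster directory.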


To record tape label information for a recently archived dataset:

mu2eDatasetLocation --add=tape bck.gandr.my-test-s1.v567.tbz

If there is no tape label, re-run the command later. You may need to wait a day before a new file acquires a tape label.