Upload

From Mu2eWiki
Revision as of 19:39, 30 March 2017 by Rlc (talk | contribs) (Created page with "== Introduction == Keeping all Intensity Frontier data on disks is not practical, so large datasets must be written to tape. At the same time, the data must always be avail...")

Introduction

Keeping all Intensity Frontier data on disks is not practical, so large datasets must be written to tape. At the same time, the data must always be available and delivered efficiently. The solution is coordinating several subsystems:

  • dCache: a set of disk servers, a database of files on the servers, and services to deliver those files with high throughput
    • [scratchDcache.shtml scratch dCache]: a dCache where the least-used files are purged as space is needed.
    • tape-backed dCache: a dCache where all files are on tape and are cycled in and out of the dCache as needed
  • pnfs: an nfs server behind the /pnfs/mu2e partition, which looks like a file system to users but is actually an interface to the dCache file database.
  • Enstore: the Fermilab system of tape and tape drive management
  • SAM: Serial Access to Metadata, a database of file metadata and a system for managing large-scale file delivery
  • FTS: File Transfer Service, a process which manages the intake of files into the tape-backed dCache and SAM.
  • jsonMaker: a piece of mu2e code which helps create and check metadata when creating a SAM record of a file
  • SFA: Small File Aggregation; Enstore can tar up small files into a single large file before it goes to tape, to increase tape efficiency.


The basic procedure is for the user to run jsonMaker on a data file to make the json file, then copy both the data file and the json file into an FTS area in [scratchDcache.shtml scratch dCache] called a dropbox. The json file is essentially a set of metadata fields with their corresponding values. The FTS will see the data file with its json file, copy the data file to a permanent location in the tape-backed dCache, and use the json to create a metadata record in SAM. The tape-backed dCache will migrate the file to tape quickly, and the SAM record will then be updated with the tape location. Users can then use [sam.shtml SAM] to read the files in tape-backed dCache.
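As a concrete illustration, a json metadata fragment might look like the following. The field names shown (data_tier, file_type, file_format) are common SAM metadata fields, but the exact schema and required values for your dataset should be taken from the SAM metadata documentation; the values here are placeholders, not a working example.

```shell
# Write a hypothetical shared metadata fragment for a dataset.
# Field values below are illustrative only -- consult the SAM
# metadata docs for the fields your dataset actually requires.
cat > mcs.common.json <<'EOF'
{
  "data_tier": "mcs",
  "file_type": "mc",
  "file_format": "art"
}
EOF
```

jsonMaker merges a fragment like this with per-file quantities (size, checksum, event counts) to produce the full json file that accompanies each data file.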


Since there is some overhead in uploading, storing, and retrieving each file, the ideal file size is as large as is reasonable. The size limit should be determined by how long an executable will typically take to read the file. This will vary with executable settings and other factors, so use a conservative estimate. A file should be sized so that the longest jobs reading it take about 4 to 8 hours to run, which generally provides efficient large-scale job processing. A grid job that reads a few files in 4 hours is nearly as efficient, so you can err on the small side. You definitely want to avoid a single job section requiring only part of a large file. In any case, files should generally not exceed 20 GB, because beyond that size they become less convenient in several ways. Files can be concatenated to make them larger, or split to make them smaller. Note: we have agreed that a subrun will appear in only one file. Until we get more experience with data handling, and see how important these effects are, we will often upload files in the size we make them or find them.

Once files have been moved into the FTS directories, please do not try to move or delete them since this will confuse the FTS and require a hand cleanup. Once files are on tape, there is an expert procedure to delete them, and files of the same name can then be uploaded to replace the bad files.


Recipe

If you are about to run new Monte Carlo in the official framework, then the upload will be built into the scripts and documented with the mu2egrid [ Monte Carlo submission] process. This is under development; please ask Andrei for its status.


Existing files on local disks can be uploaded using the following steps. The best approach is to read quickly through the rest of this page for concepts, then focus on the [uploadExample.shtml upload examples] page.

  • choose values for the [#metadata SAM Metadata] , including the appropriate [#ff file family]
  • record the above items in a json file fragment that will apply to all the files in your dataset
  • [#name rename] your files by the upload convention (This can also be done by jsonMaker in the next step.)
  • setup an offline release and run the [#jsonMaker jsonMaker] to write the json file, which will include the fragment from the previous step
  • use "ifdh cp" to copy the data file and the full json file to the FTS area /pnfs/mu2e/scratch/fts (This step can also be done by jsonMaker.)
  • use [sam.shtml SAM] to access the file or its metadata
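The steps above can be sketched as a short shell session. The jsonMaker command-line options and the example file name are assumptions for illustration (see the upload examples page for the real invocation); setting DRYRUN=echo makes the sketch safe to run outside the Mu2e environment, where ifdh and jsonMaker are unavailable.

```shell
# Sketch of the upload workflow. The jsonMaker usage and the file name
# below are ILLUSTRATIVE assumptions, not the authoritative recipe.
DRYRUN=echo            # remove this (set DRYRUN="") to actually run
FILE=sim.mu2e.example.v0.000001_000001.art

$DRYRUN jsonMaker "$FILE"    # assumed usage: writes $FILE.json
$DRYRUN ifdh cp "$FILE"      /pnfs/mu2e/scratch/fts/"$FILE"
$DRYRUN ifdh cp "$FILE.json" /pnfs/mu2e/scratch/fts/"$FILE.json"
```

Once both files land in the dropbox, the FTS takes over: do not move or delete them by hand after this point.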

The following is some detail you should be aware of in general, though detailed knowledge is not required.