Upload: Difference between revisions
Revision as of 22:02, 11 April 2017
Introduction
mu2e has several forms of disk space, and large aggregated disks are available in the dCache system. We also write a large part of the files we produce to tape, which is less expensive and can hold much more data. We usually write data to tape for one or more of the following reasons:
- to make room for new activity
- to keep it safe
- to make a permanent record
The tape system is called enstore and consists of several tape libraries and many tape drives with good connections to dCache. We write to tape by copying files into tape-backed dCache, from which they are copied automatically to tape. If the files are unused, they will be deleted from disk on the scale of weeks, leaving them only on tape. We can get them copied from tape back to disk by prestaging them.
All data written to tape must follow certain conventions. Please familiarize yourself with the links in this list.
- all files are named by mu2e conventions
- all files will have a SAM record with SAM metadata, including file location
- all files are uploaded using standard tools, see especially jsonMaker.
The following are the common workflow cases.
Upload steps
Generate the metadata
MC production workflow, art files
In the standard MC workflow, there are three times you might upload files:
- after generating the fcl; uploading the fcl files is part of that procedure
- after producing art files (including concatenation if needed); this is described in this section
- after archiving the log files; uploading the archives is handled here
After the jobs have completed, the output datasets will be below a directory like the following, where you will be working:
cd /pnfs/mu2e/persistent/users/mu2epro/workflow/project_name/good
Below this directory, there are directories for each cluster, and below those, directories for each job. Each output art file named "a.b.c.d.e.f" should have an associated json file called "a.b.c.d.e.f.json", produced as part of the grid job and containing the SAM record metadata.
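Before declaring anything, it can help to confirm that every art file really has its json companion. The following is a minimal sketch, not one of the standard tools: the function name list_missing_json is hypothetical, and it assumes the art files end in ".art" (as in the dataset name sim.mu2e.cd3-pions-cs1.v563.art used later on this page).

```shell
# Hypothetical helper (not a Mu2e tool): print every *.art file below the
# given directory that has no companion "<name>.json" file next to it.
# Assumption: output art files end in ".art".
list_missing_json() {
    local top="${1:-.}"
    find "$top" -type f -name '*.art' | sort | while IFS= read -r art; do
        # The json file is expected to sit beside the art file.
        [ -f "${art}.json" ] || printf '%s\n' "$art"
    done
}
```

Run from the "good" directory, list_missing_json . prints nothing when every art file has its json file; any path it prints points at a job whose metadata was not produced.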
There are three steps. First, declare the files to the SAM database:
mu2eClusterFileList --dsname <dataset> --json <cluster_directory> | mu2eFileDeclare
where dataset is the dataset name of the files to find and upload, and cluster_directory is one of the cluster subdirectories.
The second step is to move the files to the final location in tape-backed dCache:
mu2eClusterFileList --dsname <dataset> <cluster_directory> | mu2eFileUpload --tape
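When a project has many cluster directories, the first two steps can be repeated for each one in a loop. The sketch below is only an illustration: the function name upload_all_clusters is hypothetical, it uses just the command forms shown above, and it assumes the tools accept a cluster directory given as a path relative to the "good" directory.

```shell
# Hypothetical wrapper (not a Mu2e tool): run the declare and upload steps
# for every cluster subdirectory below a top directory.
# Assumption: the tools accept the cluster directory as a path.
upload_all_clusters() {
    local dataset="$1" top="${2:-.}" cluster
    for cluster in "$top"/*/; do
        [ -d "$cluster" ] || continue      # skip if the glob matched nothing
        cluster="${cluster%/}"             # drop the trailing slash
        echo "Declaring files in $cluster"
        mu2eClusterFileList --dsname "$dataset" --json "$cluster" | mu2eFileDeclare || return 1
        echo "Uploading files in $cluster"
        mu2eClusterFileList --dsname "$dataset" "$cluster" | mu2eFileUpload --tape || return 1
    done
}
```

The loop stops at the first cluster for which the declare or upload command reports a failure, so that a problem is noticed before further files are moved. Only the exit status of the last command in each pipeline is checked.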
The third step is to tell SAM where the files are in the tape system, to add their "location" to the SAM record.
mu2eDatasetLocation --add=tape <dataset>
Since it takes about a day, or sometimes more, for a file to migrate to tape and establish its tape location after being copied to tape-backed dCache, it makes sense to wait a day before running this command.
This command should be run as many times as needed to get the "Nothing to do" message, which means all the files in the dataset now have their location recorded:
> mu2eDatasetLocation --add=tape sim.mu2e.cd3-pions-cs1.v563.art
No virtual files in dataset sim.mu2e.cd3-pions-cs1.v563.art. Nothing to do on Mon Nov 21 18:11:29 2016.
SAMWeb times: query metadata = 0.00 s, update location = 0.00 s
Summary1: out of 0 virtual dataset files 0 were not found on tape.
Summary2: successfully verified 0 files, added locations for 0 files.
Summary3: found 0 corrupted files and 0 files without tape labels.
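Rerunning the command by hand until "Nothing to do" appears can be automated with a small retry loop. This is a sketch under stated assumptions: the function name retry_until_nothing_to_do, the wait interval and the retry limit are all hypothetical, and the only thing taken from the tool is the "Nothing to do" text shown in the output above.

```shell
# Hypothetical helper (not a Mu2e tool): rerun a command until its output
# contains "Nothing to do", waiting between attempts.
# Usage: retry_until_nothing_to_do <interval_seconds> <max_tries> <command...>
retry_until_nothing_to_do() {
    local interval="$1" max_tries="$2" try=1 output
    shift 2
    while [ "$try" -le "$max_tries" ]; do
        # Capture stdout and stderr; a failing run still counts as one try.
        output="$("$@" 2>&1)" || true
        printf '%s\n' "$output"
        case "$output" in
            *"Nothing to do"*) return 0 ;;
        esac
        # Sleep only if another attempt will follow.
        [ "$try" -lt "$max_tries" ] && sleep "$interval"
        try=$((try + 1))
    done
    return 1
}
```

For example, retry_until_nothing_to_do 86400 7 mu2eDatasetLocation --add=tape followed by the dataset name would try once a day for up to a week, matching the advice to wait about a day between runs. A nonzero return means the limit was reached without seeing the message.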
MC workflow, Log Files
We usually upload the ntuple (TFileService) files along with the log files.