Prestage: Difference between revisions

Revision as of 18:58, 29 March 2017

Introduction

When files are uploaded to tape, they are first copied to a tape-backed dCache disk area. From there, they migrate automatically to tape; after that, the least-used files are deleted from dCache when disk space is needed.

Files that exist only on tape are copied back to the dCache disk when a user attempts to access them through their /pnfs filespec. It can take a minute or more to mount a tape and locate an arbitrary file. At times of high demand on the tape drives, backlogs can cause hours of wait time.

Prestaging is the process of making sure all the files in a dataset have been copied off tape and written back to disk in dCache so they are ready to be used in a grid job.

When to prestage

First, when in doubt, prestage. It is harmless, except for the delay, and you can be confident that tape response will not be a problem.

Next, you can check if the files are on disk. You can run the SAM utility script samOnDisk:

setup dhtools
samOnDisk DATASET

on a dataset name. This script selects some files at random and checks whether they are on disk. After a few minutes it should become clear what fraction is on disk. If it is nearly 100%, you don't need to prestage.

You do not need to prestage a dataset if it has fewer than a few hundred files. In this case the system should respond quickly enough for your grid job to succeed. Note that the prestage should also be quick for such a dataset, so it is still a good idea to run it.

Prestage less than 100K files

As a practical matter, it is better to handle larger datasets by splitting them up first. Smaller datasets, with fewer than 100K files, can be prestaged in one command. You can see how many files are in your dataset with:

samweb count-files "dh.dataset=DATASET"

where DATASET is a Mu2e SAM dataset name. Prestage with:

setup dhtools
samweb prestage-dataset --parallel=5 --defname=DATASET

The prestage will create a SAM project on the SAM station and create consumers that start requesting files from the project. The project on the SAM station knows all the files it will need, so it does two things:

  • gives out first the files it thinks are more likely to be on disk
  • looks ahead in the file list and starts prestaging upcoming files in a logical and efficient manner

The first point shows up as the prestage proceeding quickly as long as it keeps finding files on disk, then slowing down when it starts requesting files off tape.
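
The count-then-prestage steps above can be wrapped in a small script. The following is only a sketch: the function names choose_action and prestage_dataset are hypothetical, it assumes that setup dhtools has already been run so that samweb is available, and it assumes that samweb count-files prints a bare integer.

```shell
#!/bin/bash
# Sketch only: both function names below are hypothetical helpers,
# not part of dhtools or samweb.

# Decide what to do with a dataset, given its file count.
# The 100K limit is the one quoted in this section.
choose_action() {
    local count="$1"
    if [ "$count" -lt 100000 ]; then
        echo "prestage"
    else
        echo "split"
    fi
}

# Count the files in a dataset and prestage it if it is small enough.
# Assumes "setup dhtools" was run and that count-files prints a bare integer.
prestage_dataset() {
    local dataset="$1"
    local count
    count=$(samweb count-files "dh.dataset=${dataset}") || return 1
    if [ "$(choose_action "$count")" = "prestage" ]; then
        samweb prestage-dataset --parallel=5 --defname="${dataset}"
    else
        echo "${dataset} has ${count} files; split it up before prestaging" >&2
        return 2
    fi
}
```

After sourcing the script, running prestage_dataset DATASET would either start the prestage or report that the dataset should be split first.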

Prestage more than 100K files

Prestage speed

Overall, the prestage speed is about 100K files per day if things are going well. If the project has to get files off tape and enstore is very busy, it may slow down by a factor of 2 or 3. There may be periods of several minutes when the file count does not progress.
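
As a rough planning aid, the figures above can be turned into a time estimate. This sketch uses a hypothetical function, estimate_days, which takes a file count and prints a best-case duration (100K files per day) and a worst-case duration (three times slower). These are only the approximate rates quoted here, not measured values.

```shell
#!/bin/bash
# Sketch only: estimate_days is a hypothetical helper.
# Prints "<best> <worst>" in days for a given number of files,
# assuming 100K files/day at best and a factor-of-3 slowdown at worst.
estimate_days() {
    awk -v n="$1" 'BEGIN {
        rate = 100000                       # files per day when things go well
        printf "%.1f %.1f\n", n / rate, 3 * n / rate
    }'
}
```

For example, estimate_days 50000 suggests between half a day and a day and a half.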