DataTransfer
Introduction
Grid jobs that run on thousands of nodes can overwhelm a disk system if they all read or write without throttles. The procedures here are designed to maximize throughput by scheduling transfers in an orderly way. We recommend using ifdh for all non-trivial transfers. Please review this discussion for limitations on interactive use of dCache via /pnfs.
Monte Carlo
For running Monte Carlo grid jobs, please follow the instructions at mu2egrid scripts. This will use the proper tools in an optimized way.
General use
Whenever you are transferring a non-trivial amount of data, and always when running on a grid node, you should use ifdh. This product looks at the input and output destinations and chooses the best method to transfer the data, then schedules your transfer so that no servers or disks are overloaded and the system remains efficient. It usually picks the http protocol, which is the most efficient mechanism, but it can be forced to use other mechanisms.
If used interactively, make sure you have a kerberos ticket first. In a grid job, your authentication is established automatically for you. Then set up the product
setup ifdhc (the c is not a typo, it means "client")
and issue transfer commands:
ifdh cp my-local-file /pnfs/mu2e/scratch/users/$USER/remote-file
You can't use wildcards with ifdh; transfer each file by its specific name. You can use the -D switch to copy multiple files into a destination directory and the -f switch to transfer a list of files, as shown in the sketch below. ifdh knows about dCache (the /pnfs directory) wherever you are running it, even if /pnfs is not mounted on that grid node.
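For example, a hedged sketch of those two switches (the file names and scratch directory here are hypothetical; check the ifdh documentation for the exact list-file format):
ifdh cp -D file1.art file2.art /pnfs/mu2e/scratch/users/$USER/mydir
ifdh cp -f transfer_list.txt
where each line of transfer_list.txt is expected to hold a "source destination" pair.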
Try to keep the number of files in a directory under 1000 to maintain good response time, and avoid creating excessive numbers of small files or frequently renaming files.
Since 2017, we cannot read or write the /mu2e/data disks from a grid job. You can write to them interactively, where plain "cp" is appropriate for a thread or two; transfers using multiple threads should use ifdh.
We recommend always using dCache as the data source/sink when transferring non-trivial amounts of data, or files of any size to/from grid nodes. dCache is fundamentally designed to serve data to grid systems.
ifdh tips
To get extensive debug output:
export IFDH_DEBUG=10
To turn off retries:
export IFDH_CP_MAXRETRIES=0
A lot of ifdh behavior is configurable. When you set up ifdh, it also sets up a product called IFDHC_CONFIG which contains a config file that can be modified.
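For example, a minimal sketch, assuming the config product sets IFDHC_CONFIG_DIR to the directory containing ifdh.cfg (the local directory name is hypothetical):
mkdir -p /tmp/my_ifdh_config
cp $IFDHC_CONFIG_DIR/ifdh.cfg /tmp/my_ifdh_config/
export IFDHC_CONFIG_DIR=/tmp/my_ifdh_config
ifdh will then read the modified configuration from the new location.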
Our policy is to take the current ifdhc version from the Fermilab cvmfs partition. When we pull scisoft manifests, new versions of ifdhc will be installed on artexternals, but they should not be declared current, so they will be skipped over.
xrootd
root provides a client-server system which allows reading root files over a network. Since it is root-aware, it can read branches and baskets in an optimal way. It reads only what it needs, so if you read only a small part of the file or event, this may save time in transferring data. If local disk space or disk contention is a problem, this method streams the data and does not use the local disk at all, so it can completely cure those problems. dCache runs the servers on all of the dCache data areas. To use it in other places, an xrootd server needs to be set up.
- you must have a voms proxy; this is done for you automatically when you are running on the grid, but interactively you need to do:
kinit
kx509
voms-proxy-init -noregen -rfc -voms fermilab:/fermilab/mu2e/Role=Analysis
The latter two commands are provided as a script in your path:
vomsCert
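To verify that the proxy was created, the standard voms tool can be used:
voms-proxy-info -all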
- the file path must be in the dCache canonical form; a file name is transformed like:
/pnfs/mu2e/phy-sim/sim/mu2e/cd3-beam-g4s2-mubeam/0728a/031/427/sim.mu2e.cd3-beam-g4s2-mubeam.0728a.001002_00310968.art -> xroot://fndca1.fnal.gov/pnfs/fnal.gov/usr/mu2e/phy-sim/sim/mu2e/cd3-beam-g4s2-mubeam/0728a/031/427/sim.mu2e.cd3-beam-g4s2-mubeam.0728a.001002_00310968.art
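Since the transformation is mechanical, it can be scripted. A minimal sketch in the shell, using the fndca1.fnal.gov door shown above:
PNFS_FILE=/pnfs/mu2e/phy-sim/sim/mu2e/cd3-beam-g4s2-mubeam/0728a/031/427/sim.mu2e.cd3-beam-g4s2-mubeam.0728a.001002_00310968.art
XROOT_URL=$(echo $PNFS_FILE | sed 's|^/pnfs/|xroot://fndca1.fnal.gov/pnfs/fnal.gov/usr/|')
echo $XROOT_URL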
The xrdcp command can also copy files:
xrdcp xroot://fndca1.fnal.gov/pnfs/fnal.gov/usr/mu2e/phy-sim/sim/mu2e/cd3-beam-g4s2-mubeam/0728a/031/427/sim.mu2e.cd3-beam-g4s2-mubeam.0728a.001002_00310968.art .
The above methods are authenticated by voms proxies. An experimental (as of 7/2018) read-only system with no authentication is under development. Use URLs of the form:
xroot://fndcagpvm01.fnal.gov:1095//pnfs/fnal.gov/....
If authentication is unreliable or adds too much latency, this might be an option. Check with the dCache experts for its status and availability.
The xrdfs binary can do file management: check for files, delete them, create and remove directories, query the xrootd server, etc. Here is an example.
xrdfs root://fndca4a.fnal.gov stat /pnfs/fnal.gov/usr/mu2e/scratch/users/rlc/osgMon/10M
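Other xrdfs subcommands follow the same pattern; for example (the paths here are only illustrative):
xrdfs root://fndca4a.fnal.gov ls /pnfs/fnal.gov/usr/mu2e/scratch/users/rlc/osgMon
xrdfs root://fndca4a.fnal.gov rm /pnfs/fnal.gov/usr/mu2e/scratch/users/rlc/osgMon/10M
xrdfs root://fndca4a.fnal.gov mkdir /pnfs/fnal.gov/usr/mu2e/scratch/users/rlc/osgMon/newdir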
How to see the list of active xroot doors:
http://fndca.fnal.gov/info/doors?format=json
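For example, the list can be fetched from the command line with curl:
curl -s "http://fndca.fnal.gov/info/doors?format=json"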
A suggestion from Ken Herner on how to prevent timeouts (reference):
export XRD_CONNECTIONRETRY=32
export XRD_REQUESTTIMEOUT=3600
export XRD_REDIRECTLIMIT=255
Later (4/2022), further recommendations:
XRD_CONNECTIONRETRY=32
XRD_REQUESTTIMEOUT=14400
XRD_REDIRECTLIMIT=255
XRD_LOADBALANCERTTL=7200
XRD_STREAMTIMEOUT=1800
For extensive debugging printout:
export XRD_LOGLEVEL=Debug
Load balancer door
This was introduced (for testing only) in 9/2022. It is a high-availability load balancer for xrootd and for http (which is used with the gfal programs).
For xrootd
URLS="xroot://fndcadoor.fnal.gov//pnfs/fnal.gov/usr"
For the http protocol (gfal programs)
URLG="https://fndcadoor.fnal.gov:2880//pnfs/fnal.gov/usr"
These URLs are followed by "/mu2e/.." and the rest of the pnfs path.
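For example, a sketch assuming a valid voms proxy and using a hypothetical path under /mu2e:
xrdcp ${URLS}/mu2e/scratch/users/$USER/remote-file .
gfal-copy ${URLG}/mu2e/scratch/users/$USER/remote-file file:///tmp/remote-file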
cpn notes
Since 2017, we cannot read or write the /mu2e/data disks from a grid job, and the cpn system is largely obsolete. These notes are kept for general historical information. This system was used to prevent overloading the data disks when accessing them from many grid jobs.
setup cpn
cpn local-file /mu2e/data/users/$USER/remotefile
With one exception, cpn behaves just like the Unix cp command. The exception is that it first checks the size of the file. If the file is small, it just copies the file. If the file is large, it checks the number of ongoing copies of large files. If too many copies are happening at the same time, cpn waits for its turn before it does its copy. A side effect of this strategy is that there can be some dead time when your job occupies a worker node while doing nothing except waiting for a chance to copy; the experience of the MINOS experiment is that this loss is small compared to what occurs when /grid/data starts thrashing.
If you are certain that a file will always be small, just use cp. If the file size is variable and may sometimes be large, then use cpn.
We use the cpn program directly from MINOS:
/grid/fermiapp/minos/scripts/cpn
The locking mechanism inside cpn uses LOCK files that are maintained in /grid/data/mu2e/LOCK and in corresponding locations for other collaborations. The use of the file system to perform locking means that locking has some overhead. If a file is small enough, it is less work to just copy the file than to use the locking mechanism to serialize the copy. After some experience it was found that 5 MB is a good choice for the boundary between large and small files.
A cpn lock-cleaner script is required for cpn to operate; it is now maintained by the Scientific Computing Division. A failure of this process shows up as no locks being available for long periods.