DataTransfer
Introduction
Grid jobs that run on thousands of nodes can overwhelm a disk system if they are all reading or writing without throttles. The procedures here are all designed to maximize throughput by scheduling transfers in an orderly way.
Monte Carlo
For running Monte Carlo grid jobs, please follow the instructions at mu2egrid scripts. This will use the proper tools in an optimized way.
Non-Monte Carlo
Whenever you are transferring a non-trivial amount of data, and always when running on a grid, you should use ifdh. This product examines the source and destination of a transfer and chooses the best method to move the data, then schedules your transfer so that no servers or disks are overloaded and the system remains efficient. It usually picks gridftp, which is the most efficient mechanism.
If used interactively, you should make sure you have a Kerberos ticket. If used in a grid job, your authentication is established automatically for you. Then set up the product
setup ifdhc (the c is not a typo, it means "client")
and issue transfer commands:
ifdh cp my-local-file /pnfs/mu2e/scratch/users/$USER/remote-file
ifdh knows about dCache (the /pnfs directory) and other Fermilab disks wherever you are running it, even if the /pnfs directory is not mounted on that grid node.
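As a minimal sketch of how the copy command above might be wrapped in a job script, the following defines two helper functions. The names scratch_dest and stage_out and the DRY_RUN variable are hypothetical and chosen only for illustration; they are not part of ifdh. The only ifdh call used is the "ifdh cp source destination" form shown above, and the script assumes "setup ifdhc" has already been run in the same shell.

```shell
# Sketch only: copy one local file to your dCache scratch area with ifdh.
# Assumes "setup ifdhc" has already been run in this shell.

# Hypothetical helper (not part of ifdh): build the scratch destination path.
scratch_dest() {
    local user="$1" name="$2"
    echo "/pnfs/mu2e/scratch/users/${user}/${name}"
}

# Hypothetical helper: copy a local file to scratch under its own basename.
# With DRY_RUN=1 (an assumption for illustration) the command is only
# printed, so the path logic can be checked on a machine without ifdh.
stage_out() {
    local src="$1"
    local dest
    dest="$(scratch_dest "${USER}" "$(basename "${src}")")"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ifdh cp ${src} ${dest}"
        return 0
    fi
    ifdh cp "${src}" "${dest}"
}
```

For example, after setting DRY_RUN=1, running "stage_out my-local-file" prints the ifdh cp command that would be run, without transferring anything. Unset DRY_RUN to perform the real copy.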
You can also transfer data to/from /mu2e/data/users/$USER. In this case ifdh will use "cpn" locks to make sure no more than 5 grid nodes are writing at any time. Never try to defeat this mechanism by writing from a grid node directly to /mu2e/data. This will almost certainly block access to the disk for everyone else and may crash the system. The /mu2e/data disk is not designed to provide high bandwidth to many processes at once, so this transfer is inherently inefficient.
We recommend always reading from and writing to dCache when transferring non-trivial amounts of data, or files of any size to/from a non-trivial number of grid nodes. dCache has much better bandwidth than /mu2e/data and is fundamentally designed to serve data to grid systems.