Dcache
Introduction
dCache is a system of many disks aggregated across dozens of Linux disk servers; the total disk space is a few PB. The system lets all this hardware look like "one big disk" to the user and hides all the details of exactly where the files are and how they are being transferred. It supports load balancing and other optimizations. The entire system is designed to be high-bandwidth so it can serve data files to thousands of grid nodes simultaneously. Anytime you transfer data files between dCache and a grid node you must use one of the dedicated tools.
With a few exceptions, you should only put data files in dCache. Executables, libraries, and UPS products are made visible to grid nodes by using CVMFS or tarball. One exception is that you must put fcl files in dCache to make them visible to your grid jobs. Another is that all output created by grid jobs must be written to dCache: this includes log files as well as data files.
So far, when we refer to data distributed by dCache, we have been referring to event data, where every grid node gets a unique file. There is another case, intermediate between code on CVMFS and event data on dCache: a single large file that needs to be distributed to every node but is too large for CVMFS. An example might be a library of fit templates that is 5 GB. In this case, the ideal solution is StashCache.
When you read or write to dCache, the request goes through a server "head node", and the system decides which hardware to read from or write to. There is a database behind the system to track where all the files are logically and physically. Accessing this database adds latency to all commands accessing dCache, so using dCache interactively for code building or analysis, for example, is not efficient and is not recommended; please see the disk page for build and analysis areas.
You cannot overwrite or edit files in dCache. You can remove them and then replace them with a new file of the same name.
You can access the files through several protocols, including an NFS server, which makes dCache look like a simple file system mounted as the /pnfs file system. If you are moving data in and out of dCache using a few interactive processes, you can use simple Unix commands: cp, mv, rm. Once you need to move data using many parallel processes, such as to or from grid nodes, please use the tools here.
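For example (a minimal sketch; the file name and destination are illustrative, and ifdh is assumed to be set up as described on the data transfer page):

# interactive, small numbers of files: plain Unix commands are fine
cp myfile.art /pnfs/mu2e/scratch/users/$USER/myfile.art
# from a grid job, or for many parallel transfers: use the dedicated tools instead
ifdh cp myfile.art /pnfs/mu2e/scratch/users/$USER/myfile.art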
dCache has a home page, a lab home page, and monitors.
Flavors
There are four flavors of dCache.
- scratch Anyone can write here, and you should use this area as temporary output for your grid jobs. If space is needed, files are deleted according to a least-recently-used algorithm. Your files may last for as little as one week since the last time you wrote or read them. There is no guaranteed minimum lifetime, so plan ahead to move important files off of scratch to an area where they can remain for a longer time.
- persistent Files written here will stay on disk until the user deletes them, so this area can fill up. Only production files are written here - it is not for general use, though special cases might be considered. The one current exception is that users can write fcl files here instead of uploading them.
- resilient This is a special-purpose area. Files written here are copied to many (about 20) different dCache server nodes. When you read a file in resilient, a load balancer automatically directs you to one of the copies. This distributes the load of many grid jobs all reading the same file simultaneously. In the past this area was used for code tarballs made by an obsolete utility named gridexport. We now recommend that you not put code tarballs here; instead you should use RCDS. Files written to resilient will stay on disk until the user deletes them, so this area can fill up. Users should purge their own areas, and the collaboration reserves the right to purge users' files as needed. Only use resilient for files that will be read by many grid jobs simultaneously.
- tape-backed All files written to this area are copied to tape automatically. If space is needed, files are deleted off disk according to a least-recently-used algorithm. As files are requested, they are copied from tape to disk as needed, and a request will hang during tape access. This is the way collaborations "write to tape". Do not copy data to this area - it is carefully organized and only the production scripts can write here. Large datasets should be prestaged from tape to disk before they are read by grid jobs.
Official production datasets, and user datasets manipulated by the file tools, will appear under the following designated dataset areas, corresponding to the above flavors:
- /pnfs/mu2e/scratch/datasets
- /pnfs/mu2e/persistent/datasets
- /pnfs/mu2e/tape
Using the scratch area
On the interactive nodes, you can create your area in scratch dCache:
mkdir /pnfs/mu2e/scratch/users/$USER
If you are moving a few files using a few processes, you can use the unix commands in /pnfs: ls, rm, mv, mkdir, rmdir, chmod, cp, cat, more, less, etc.
/pnfs is not a physical directory; it is an interface to a database and servers, implemented as an NFS server. Because there is latency to database and server access, there are restrictions on this interface that don't usually apply to local disk systems.
- You should avoid commands which make large demands on the database: "find .", "ls -lr *", or similar. Also, on many gpvms (if not all) "ls" is aliased to "ls --color=auto", which is much slower. If you find your database access with "ls" is slow, check the alias with "alias ls"; if you get back something other than plain "ls", try "\ls" to bypass the alias. A plain "ls" makes a quick database query of the directory record, but "find" or "ls -l" makes a much slower query of the full file records, so a plain "ls" is always preferred.
- Try to keep the number of files in a directory under 1000 to maintain good response time - this is necessary even if you simply access the files directly by a known path. Avoid excessive numbers of small files, or frequent renaming of files.
- If you are writing or reading dCache with a large number of processes, such as from a grid job, please use the data transfer tools.
- dCache does not allow overwriting files or modifying existing files (such as with an editor); you can only operate on whole files, for example by moving them.
The scratch area is purged regularly using a least-recently-used algorithm based on the time the file was last transferred to a user ("touch" doesn't update the time). The time between the last use and deletion is not guaranteed and while it is often on the scale of a month or more, it may be as little as a week.
The LRU algorithm is applied per pool (one piece of the total disk), which means there may be some variation in how long a file lasts: some pools are larger than others, changing the churn rate, and pools may have one-off events.
Other Access Protocols
While we can use the NFS protocol (the Unix disk interface) for small numbers of accesses, and the data transfer tools to read or write files from grid nodes, there are other protocols to access dCache files.
The NFS interface uses Unix file permissions to determine who can write where, but the other protocols require a proxy, most easily obtained with:
mu2einit
kinit
vomsCert
This section is for reference and debugging; users should not normally need these commands. Please use ifdh or xrootd.
- dcap the native protocol, only available onsite
dccp dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/mu2e/scratch/users/$USER/filename .
dccp is installed on mu2egpvm, but it may have to be installed or set up (the product is named dcap) on other nodes.
- root protocol
root [0] ff = TFile::Open("/pnfs/mu2e/scratch/users/$USER/file")
There are other versions of this access, through different plugins which trigger different authentication, protocols and transfer queues.
- xrootd protocol
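A minimal sketch of an xrootd copy, assuming xrdcp is available and using the root door that appears in the http example below (the path is illustrative):

# copy a file out of scratch dCache via the xrootd door
xrdcp root://fndcadoor.fnal.gov:1094/pnfs/fnal.gov/usr/mu2e/scratch/users/$USER/filename .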
- gridFtp
kinit
vomsCert
export X509_USER_CERT=/tmp/x509up_u`id -u`
export X509_USER_KEY=$X509_USER_CERT
export X509_USER_PROXY=$X509_USER_CERT
globus-url-copy gsiftp://fndca1.fnal.gov:2811/scratch/users/$USER/file-name file:///$PWD/file-name
- http
In 2022, http became the preferred transfer protocol and the default in ifdh. Underneath ifdh is gfal, a CERN data handling product which is installed on our nodes. It can handle many protocols, but the lab is using it to transfer files by the http protocol (see the gfal2 doc).

gfal-copy https://fndcadoor.fnal.gov:2880/pnfs/fnal.gov/usr/mu2e/tape/phy-raw/raw/mu2e/CRV_wideband_cosmics/crvaging-006/dat/a7/b4/raw.mu2e.CRV_wideband_cosmics.crvaging-006.001241_000.dat <output>
gfal-copy root://fndcadoor.fnal.gov:1094/pnfs/fnal.gov/usr/mu2e/tape/phy-raw/raw/mu2e/CRV_wideband_cosmics/crvaging-006/dat/a7/b4/raw.mu2e.CRV_wideband_cosmics.crvaging-006.001241_000.dat <output>

When using it in a root macro, use:
root://fndcadoor.fnal.gov:1094/pnfs/fnal.gov/usr/mu2e/tape/phy-raw/raw/mu2e/CRV_wideband_cosmics/crvaging-006/dat/a7/b4/raw.mu2e.CRV_wideband_cosmics.crvaging-006.001241_000.dat
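As a hedged sketch, writing works the same way with source and destination reversed; the scratch destination here is only an illustration and assumes you have a valid proxy or token:

# upload a local file to your scratch area over WebDAV
gfal-copy myfile.art https://fndcadoor.fnal.gov:2880/pnfs/fnal.gov/usr/mu2e/scratch/users/$USER/myfile.art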
gfal timeouts:
- client side (gfal-copy): consult the gfal-copy documentation. It may be 1800 seconds, but this is not confirmed.
- On dCache end: pools allow 10K active transfers. Job idle timeout 72 hours.
- On connection to the door, idle time is 300 seconds.
- connection backlog 4096 (more than that and connections rejected)
- Timeout on connect can happen if there are too many connections to the door.
- Timeout may happen if transfer is queued on pool.
You can also transfer files over http using curl.
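A minimal curl sketch for a single-file download through the WebDAV door, assuming the voms proxy above (the path is illustrative):

curl -L --capath /etc/grid-security/certificates \
     --cert /tmp/x509up_u`id -u` --cacert /tmp/x509up_u`id -u` --key /tmp/x509up_u`id -u` \
     -o filename \
     https://fndcadoor.fnal.gov:2880/pnfs/fnal.gov/usr/mu2e/scratch/users/$USER/filename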
Ports:
- WebDAV is on port 2880 and used for transfers
- REST API is on port 3880
- currently 1094 is the default for SAM, and is an OK default for dCache
- 1097 experimental
- fnfcaitb4.fnal.gov:2880 experimental
Pool structure
The disk in dCache is served from about 50 servers. Each server serves disk arrays which are further divided into pools. Disk space for a particular purpose, such as Mu2e persistent, is a group of pools. In general the pools corresponding to one logical area, such as Mu2e persistent, are spread out over many disks on many servers.
(Fixme: need info on how tape-backed pools are connected to tape queues. Comments seem to say each pool with read requests is treated equally when filling the tape queue.)
A list of all pools, with activity stats, is available here.
The examples below show some typical pool names:
- p-mu2e-stkendca1901-8
- p-minos-stkendca2001-8
- rw-stkendca1901-1
- rw-gm2-stkendca2011-7
- v-stkendca1901-3
Pools with names beginning with p-experiment-name are part of the persistent space of that experiment. Pools beginning with rw- are tape-backed; if no experiment name is present, the pool is part of the general read-write tape-backed space shared by all experiments. Pools beginning with rw-experiment-name are tape-backed pools dedicated to one experiment. At present Mu2e does not have any of these dedicated tape-backed pools, but we will once data taking starts. Pools beginning with v- are part of scratch dCache.
The other parts of the name identify a server and a pool number on that server.
There are also pools in which the name "stkendca" is replaced with something else, like "pubstor": CSAID management decided to give newly acquired hardware different names. The original motivation for the "stk" in the server names was that tape-backed dCache used StorageTek tape hardware; the company no longer exists.
Access metadata for dCache file
Note that some commands will work with the Mu2e file path /pnfs/mu2e/... and some only work on the canonical path /pnfs/fnal.gov/usr/mu2e/...
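If you need the canonical form, a simple prefix translation is enough (a sketch; the file path is illustrative):

# rewrite the Mu2e short path into the canonical dCache path
fpath=/pnfs/mu2e/tape/phy-raw/some/file.art
cpath=$(echo $fpath | sed 's|^/pnfs/mu2e/|/pnfs/fnal.gov/usr/mu2e/|')
echo $cpath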
Mu2e tools
The WebDAV interface to the dCache database has been wrapped in a more convenient tool:
setup dhtools
dcacheFileInfo -h
dcacheFileInfo -l -t dig.mu2e.FlatGammaOnSpillConv.MDC2020z_sm3_perfect_v1_0.001210_00009051.art
NEARLINE
WebDav
There is a CS-doc-5050 reference and an API page.
The "creationTime" is icrtime(creation time), "mtime" is imtime (modify time). If you divide it by 1000, you could get the epoch time.
Two WebDAV scripts from Dmitri Litvintsev. They require a voms proxy to run.
#!/bin/sh
# get properties of a file
if [ ${1:-x} = "x" ]; then
   echo "please provide full file name" 1>&2;
   echo "usage: $0 <file_path>"
   exit 1
fi
X509_USER_PROXY=/tmp/x509up_u`id -u`
WEBDAV_HOST=https://fndca4a.fnal.gov:2880
FILE_PATH=${1}
echo $FILE_PATH
curl -L --capath /etc/grid-security/certificates \
     --cert ${X509_USER_PROXY} \
     --cacert ${X509_USER_PROXY} \
     --key ${X509_USER_PROXY} \
     -s -X PROPFIND -H Depth:1 \
     ${WEBDAV_HOST}${FILE_PATH} \
     --data '<?xml version="1.0" encoding="utf-8"?>
             <D:propfind xmlns:D="DAV:">
                 <D:prop xmlns:R="http://www.dcache.org/2013/webdav"
                         xmlns:S="http://srm.lbl.gov/StorageResourceManager">
                     <R:Checksums/>
                     <S:AccessLatency/>
                     <S:RetentionPolicy/><S:FileLocality/>
                 </D:prop>
             </D:propfind>' | xmllint -format -
#!/bin/sh
# get properties of the files in a directory
if [ ${1:-x} = "x" ]; then
   echo "please provide full file name" 1>&2;
   echo "usage: $0 <file_path>"
   exit 1
fi
X509_USER_PROXY=/tmp/x509up_u`id -u`
WEBDAV_HOST=https://fndca4a.fnal.gov:2880
FILE_PATH=${1}
curl -L --capath /etc/grid-security/certificates \
     --cert ${X509_USER_PROXY} \
     --cacert ${X509_USER_PROXY} \
     --key ${X509_USER_PROXY} \
     -s -X PROPFIND -H Depth:1 \
     ${WEBDAV_HOST}${FILE_PATH} \
     --data '<?xml version="1.0" encoding="utf-8"?>
             <D:propfind xmlns:D="DAV:">
                 <D:prop xmlns:R="http://www.dcache.org/2013/webdav"
                         xmlns:S="http://srm.lbl.gov/StorageResourceManager">
                     <R:Checksums/>
                     <S:FileLocality/>
                 </D:prop>
             </D:propfind>' | xmllint -format -
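A hypothetical usage, assuming the first script has been saved as dcache-props.sh and made executable (note that it wants the canonical /pnfs/fnal.gov/usr path):

./dcache-props.sh /pnfs/fnal.gov/usr/mu2e/tape/phy-raw/raw/mu2e/CRV_wideband_cosmics/crvaging-006/dat/a7/b4/raw.mu2e.CRV_wideband_cosmics.crvaging-006.001241_000.dat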
Another example (I don't think this works without authentication any more)
curl -k -X GET "https://fndca3a.fnal.gov:3880/api/v1/namespace//pnfs/fnal.gov/usr/nova/rawdata/FarDet/000328/32818/fardet_r00032818_s55_DDHmu.raw?locality=true"
{
  "fileMimeType" : "application/octet-stream",
  "fileLocality" : "NEARLINE",
  "pnfsId" : "000081FAE6A3E9DA48E297F8E0FF66A6CA77",
  "fileType" : "REGULAR",
  "nlink" : 1,
  "mtime" : 1557510913687,
  "size" : 69473584,
  "creationTime" : 1557510913684
}
and another:
> curl -L --capath /etc/grid-security/certificates --cert /tmp/x509up_u`id -u` --cacert /tmp/x509up_u`id -u` --key /tmp/x509up_u`id -u` -X GET "https://fndca1.fnal.gov:3880/api/v1/namespace/pnfs/fnal.gov/usr/mu2e/tape/phy-rec/rec/mu2e/TRK_VST-cosmics/vst003-r02/art/27/02/rec.mu2e.TRK_VST-cosmics.vst003-r02.100033_00000000.art?checksum=true&locality=true"
{
  "fileMimeType" : "image/x-jg",
  "fileLocality" : "ONLINE_AND_NEARLINE",
  "labels" : [ ],
  "size" : 3353865863,
  "creationTime" : 1641516239880,
  "fileType" : "REGULAR",
  "pnfsId" : "0000F4771A46FA0D431DB725DAD227DDA1C2",
  "checksums" : [ {
    "type" : "ADLER32",
    "value" : "da68ce6c"
  } ],
  "nlink" : 1,
  "mtime" : 1641516280517,
  "mode" : 420
}
one with token authentication:
$ curl -k --header "Authorization: Bearer $(< /var/run/user/6233/bt_u6233)" https://fndca1.fnal.gov:3880/api/v1/namespace/nova/production/keepup/R21-07-27-testbeam-production.a/tb/001020/102022/testbeam_r00102022_s12_beamline_R21-07-27-testbeam-production.a_v1_data_keepup.tbreco.root?locality=true
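The token path in that example is hard-coded for one user; a hedged generic form for your own account (assuming a bearer token already exists in the standard location) is:

# read the bearer token for the current uid and query the REST API (illustrative path)
TOKEN=$(< /var/run/user/$(id -u)/bt_u$(id -u))
curl -k --header "Authorization: Bearer $TOKEN" "https://fndca1.fnal.gov:3880/api/v1/namespace/pnfs/fnal.gov/usr/mu2e/scratch/users/$USER/filename?locality=true"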
WLCG API
The WLCG tape API is available here:
https://fndcadoor.fnal.gov:3880/api/v1
discovery url:
https://fndcadoor.fnal.gov:3880/.well-known/wlcg-tape-rest-api
The help is available here; scroll to "tape Support for the TAPE API (bulk)". The description of the WLCG tape API is here.
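As a hedged sketch (check the exact request schema against the API help above), a bulk STAGE request with a bearer token might look like this, reusing the file and diskLifetime that appear in the pinning example later on this page:

curl -k --header "Authorization: Bearer $(< /var/run/user/$(id -u)/bt_u$(id -u))" \
     -H "Content-Type: application/json" \
     -X POST https://fndcadoor.fnal.gov:3880/api/v1/tape/stage \
     --data '{"files" : [ {"path" : "/mu2e/tape/phy-sim/dig/mu2e/CeEndpointOnSpillTriggerable/MDC2020ae_best_v1_3/art/84/c0/dig.mu2e.CeEndpointOnSpillTriggerable.MDC2020ae_best_v1_3.001210_00000002.art", "diskLifetime" : "P7D"} ]}'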
dot commands
These commands can overwhelm the system, so they must not be used in parallel threads or at a large scale. They are documented in CS doc 5399; see also Small File Aggregation.
> fpath="/pnfs/mu2e/tape/phy-sim/dig/mu2e/DS-cosmic-mix-cat/MDC2018i/art/cc/5b/dig.mu2e.DS-cosmic-mix- cat.MDC2018i.001002_00039638.art" > cat `dirname $fpath`/".(get)(`basename $fpath`)(locality)" ONLINE_AND_NEARLINE
ONLINE means the file is only on disk, NEARLINE means it is only on tape, and ONLINE_AND_NEARLINE means it is both on disk and on tape. This command may sometimes be time-consuming.
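If you check localities often, a small wrapper around this dot command can be convenient (a sketch; the same low-rate caveats apply):

# print the dCache locality of one file given its /pnfs path
locality() { cat "$(dirname "$1")/.(get)($(basename "$1"))(locality)"; }
locality /pnfs/mu2e/tape/phy-sim/dig/mu2e/DS-cosmic-mix-cat/MDC2018i/art/cc/5b/dig.mu2e.DS-cosmic-mix-cat.MDC2018i.001002_00039638.art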
Check if level 4 exists. This only exists if the file has a tape location.
> fpath="/pnfs/mu2e/tape/phy-sim/dig/mu2e/DS-cosmic-mix-cat/MDC2018i/art/cc/5b/dig.mu2e.DS-cosmic-mix-cat.MDC2018i.001002_00039638.art"
> cat `dirname $fpath`/".(use)(4)(`basename $fpath`)"
If the file is not on tape the cat command fails with a "No such file" error.
Print format is fixed, from the example:
VR4440M8
0000_000000000_0000445
5650536421
phy-sim
/pnfs/fnal.gov/usr/mu2e/tape/phy-sim/dig/mu2e/DS-cosmic-mix-cat/MDC2018i/art/cc/5b/dig.mu2e.DS-cosmic-mix-cat.MDC2018i.001002_00039638.art
00004371A6EEFADF4578934301C5C4D337F1
CDMS156660817000002
fmv18025:/dev/rmt/tps11d0n:000780350B
2587942076
the format:
tape label
location_cookie on tape
file size
file family
path
empty line
pnfsid
empty line
bfid
tape drive
CRC
> fpath="/pnfs/mu2e/tape/phy-sim/dig/mu2e/DS-cosmic-mix-cat/MDC2018i/art/cc/5b/dig.mu2e.DS-cosmic-mix-cat.MDC2018i.001002_00039638.art" > cat `dirname $fpath`/".(get)(`basename $fpath`)(locality)" NEARLINE
Tags for a directory, which describe the tape file family:
cd /pnfs/mu2e/tape/phy-raw
> cat ".(tags)()"
.(tag)(file_family)
.(tag)(file_family_width)
.(tag)(file_family_wrapper)
.(tag)(library)
.(tag)(OSMTemplate)
.(tag)(sGroup)
.(tag)(storage_group)
> cat ".(tag)(library)"
CD-LTO8G2,CD-LTO8F1
Be aware that the library field in this printout may be wrong. There is a higher level of control that can override this setting. The higher level is the "dCache HSM interface" which is controlled by the dCache admins. We have asked how to query these settings and are waiting for a reply.
See the Enstore#Libraries section to learn the current library configuration.
All the file family widths
for DD in $(ls -1d /pnfs/mu2e/tape/*); do (cd $DD; pwd; cat ".(tag)(file_family_width)"; echo ); done
enstore commands
note: no fnal.gov/usr in file path
setup encp v3_11c -q stken
fspec=/pnfs/mu2e/tape/phy-sim/dig/mu2e/DS-cosmic-mix-cat/MDC2018i/art/cc/5b/dig.mu2e.DS-cosmic-mix-cat.MDC2018i.001002_00039638.art
fpath=$(dirname $fspec)
fname=$(basename $fspec)
bfid=$( cat $fpath/".(use)(1)($fname)" )
enstore info --bfid $bfid
{'active_package_files_count': None,
 'archive_mod_time': None,
 'archive_status': None,
 'bfid': 'CDMS156660817000002',
 'cache_location': None,
 'cache_mod_time': None,
 'cache_status': None,
 'complete_crc': 2587942076L,
 'deleted': 'no',
 'drive': 'fmv18025:/dev/rmt/tps11d0n:000780350B',
 'external_label': 'VR4440M8',
 'file_family': 'phy-sim',
 'file_family_width': 2,
 'gid': 0,
 'library': 'CD-LTO8F1',
 'location_cookie': '0000_000000000_0000445',
 'original_library': 'CD-LTO8F1',
 'package_files_count': None,
 'package_id': None,
 'pnfs_name0': '/pnfs/fnal.gov/usr/mu2e/tape/phy-sim/dig/mu2e/DS-cosmic-mix-cat/MDC2018i/art/cc/5b/dig.mu2e.DS-cosmic-mix-cat.MDC2018i.001002_00039638.art',
 'pnfsid': '00004371A6EEFADF4578934301C5C4D337F1',
 'r_a': (('131.225.240.49', 52527),
         1L,
         '131.225.240.49-52527-1670623050.950488-3311-139918693267264'),
 'sanity_cookie': (65536L, 1834246494L),
 'size': 5650536421L,
 'storage_group': 'mu2e',
 'tape_label': 'VR4440M8',
 'uid': 0,
 'update': '2019-08-23 19:56:11.152691',
 'wrapper': 'cpio_odc'}
The library information in this printout is always correct; it says where the file really is. Note that new files in this file family may be written to a different library.
Assorted notes
Kafka stream
dCache publishes storage events (store/restore/remove) to a Kafka server maintained at Fermilab. You can subscribe to it, receive events, and act upon them programmatically. Example client:
https://github.com/DmitryLitvintsev/scripts/blob/master/python/dcache/kafka/kafka_example.py
It uses python3.
python3 kafka_example.py
will start it. This is just an example; you need to modify it to make it useful (e.g. to have a thread pool and a produce/consume pattern to execute something upon receiving each event).
Note on SFA files
Getting the physical tape location for a file that has been tarred with other files in Small File Aggregation (SFA) is more complicated.
SAM expects a tape label and a location cookie for each file on tape, like
enstore:/pnfs/mu2e/tape/usr-etc/bck/gandr/dhtest-large/v1/tar/c7/45(3@vr0846m8)
For files in SFA archive dCache uses non-numeric cookies: instead of the "3" in the example above they look like
/volumes/aggwrite/cache/6ad/0c5/0000E730EAE13FA64B76BB38C7D0530C56FE
which, if I remember correctly, SAM does not accept. So for files in SFA I have to look up the containing archive, and put into SAM the physical location of the archive file instead. The information is used by SAM to optimize file pre-staging, so giving it the tape location of the file that has to be actually read from tape makes sense.
Note on checksums
enstore uses adler32 "0-seeded". dCache uses adler32 "1-seeded". The ecrc exe can do either. SAM calls the 0-seeded algorithm "enstore" and the 1-seeded algorithm "adler32". In 2019, enstore started using the "1-seeded" version, and is migrating file checksums when the files are migrated on tape.
$ setup encp v3_11 -q stken
# enstore
$ ecrc -0 raw.mu2e.file_test.v0.000.art
CRC 3565947252
# dCache
$ ecrc -1 -h raw.mu2e.file_test.v0.000.art
CRC 0xc9a20975
note the SCD systems store the hex number without the "0x"
$ samweb get-metadata raw.mu2e.file_test.v0.000.art
Checksum: enstore:3565947252
          adler32:c9a20975
Single-file prestage
Forcing files to be prestaged using the command line. For large-scale processing, SAM prestage needs to be used. Attempting to access a file that has not been prestaged through the NFS mount (/pnfs) will give an Input/Output error and will not cause the file to be prestaged.
For a random file:
1) dccp -P (man dccp)
   dccp -P dcap://fndca1.fnal.gov:24125/pnfs/fnal.gov/usr/mu2e/tape/phy-sim/sim/mu2e/flsh0s31b0/su2020/art/40/b6/sim.mu2e.flsh0s31b0.su2020.001000_00000000.art
2) touch ".(fset)()()()()"
   e.g. touch ".(fset)(foo)(pin)(2)(MINUTES)"
3) globus-url-copy -t .....
4) REST QoS transition:
   curl -L --capath /etc/grid-security/certificates --cert /tmp/x509up_u`id -u` --cacert /tmp/x509up_u`id -u` --key /tmp/x509up_u`id -u` -X POST -H "Accept: application/json" -H "Content-Type: application/json" https://fndca1.fnal.gov:3880/api/v1/namespace/pnfs/fnal.gov/usr/mu2e/foo --data '{"action" : "qos", "target" : "disk+tape"}'
5) gfal-bringonline
Some examples using gfal
> gfal-bringonline https://fndcadoor.fnal.gov:2880/pnfs/fnal.gov/usr/cdf/cdfsoft/tarfiles/rel/1/rel-V7.2.1wmass-Linux2.6-GCC_4_4.tar.gz
https://fndcadoor.fnal.gov:2880/pnfs/fnal.gov/usr/cdf/cdfsoft/tarfiles/rel/1/rel-V7.2.1wmass-Linux2.6-GCC_4_4.tar.gz QUEUED
> gfal-xattr https://fndcadoor.fnal.gov:2880/pnfs/fnal.gov/usr/cdf/cdfsoft/tarfiles/rel/1/rel-V7.2.1wmass-Linux2.6-GCC_4_4.tar.gz user.status
NEARLINE
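A hedged sketch that combines these two commands to request a stage and then wait for the file to come online (keep the polling interval long so you do not load the door):

# request a stage, then poll user.status until the file is on disk
url=https://fndcadoor.fnal.gov:2880/pnfs/fnal.gov/usr/cdf/cdfsoft/tarfiles/rel/1/rel-V7.2.1wmass-Linux2.6-GCC_4_4.tar.gz
gfal-bringonline $url
until gfal-xattr $url user.status | grep -q ONLINE; do sleep 600; done
echo "file is staged"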
Pin files
Pinning a file means staging it to disk and then requiring that it stay there, immune to LRU purging. This is not normally needed and should only be used in exceptional cases.
curl -L --capath /etc/grid-security/certificates --cert /tmp/x509up_u`id -u` --cacert /tmp/x509up_u`id -u` --key /tmp/x509up_u`id -u` -X POST -H "Accept: application/json" -H "Content-Type: application/json" https://fndcadoor.fnal.gov:3880/api/v1/namespace/full/path/of/the/file --data '{"action" : "pin", "lifetime" : "7", "lifetime-unit" : "DAYS"}'
In python
stage_url https://fndcadoor.fnal.gov:3880/api/v1/tape/stage
header {'Authorization': 'Bearer <token text>' }
filesdd {'path': '/mu2e/tape/phy-sim/dig/mu2e/CeEndpointOnSpillTriggerable/MDC2020ae_best_v1_3/art/84/c0/dig.mu2e.CeEndpointOnSpillTriggerable.MDC2020ae_best_v1_3.001210_00000002.art', 'diskLifetime': 'P7D'}
response <Response [201]>
response text {
  "requestId" : "eb7d9d62-7821-4935-be81-b534e2ce1bd5"
}
Checking pins. The two dates are the start and stop times:
build02 c0 > cat ".(get)(dig.mu2e.CeEndpointOnSpillTriggerable.MDC2020ae_best_v1_3.001210_00000002.art)(pins)"
411362918 2024-10-25T16:48:16.710Z 2024-10-27T16:48:16.720Z "eb7d9d62-7821-4935-be81-b534e2ce1bd5" REMOVABLE PINNED