Dcache

From Mu2eWiki

Introduction

dCache is a system of many disks aggregated across dozens of Linux disk servers. The system lets all this hardware look like "one big disk" to the user and hides all the details of exactly where the files are and how they are being transferred. It allows load balancing and optimization, such as serving a popular file from several different machines. The entire system is designed to be high-bandwidth so it can serve data files to thousands of grid nodes simultaneously. For example, all the disk servers have special network paths to the grid nodes. Any time you transfer data files to/from a grid node, it should be to/from dCache using the dedicated tools. Executables, libraries, and UPS products are transferred by CVMFS or by tarball.

So far, when we refer to data distributed by dCache, we have been referring to event data, where every grid node gets a unique file. There is another case, intermediate between code on CVMFS and event data on dCache: a single large file that needs to be distributed to every node but is too large for CVMFS. An example might be a library of fit templates that is 5 GB. In this case, the ideal solution is stashCache.

When you read from or write to dCache, the request goes through a server "head node", and the system decides which hardware to read or write. There is a database behind the system to track where all the files are logically and physically. Accessing this database adds latency to all commands accessing dCache, so using dCache interactively, for example for code building or analysis, is inefficient and not recommended. Please see the disk page for build and analysis areas.

You can access the files through several protocols, including an NFS server, which makes dCache look like a simple file system mounted as the /pnfs file system. If you are moving data in and out of dCache using a few interactive processes, you can use simple unix commands: cp, mv, rm. Once you need to move data using many parallel processes, such as to or from grid nodes, please use the data transfer tools.

dCache has a project home page, a lab home page, and monitoring pages.

Flavors

There are four flavors of dCache.

  • scratch Anyone can write here, and you should use this area as temporary output for your grid jobs. If space is needed, files are deleted according to a least-recently-used algorithm. Your files may last for as little as one week since the last time you wrote or read them, and technically there is no guaranteed minimum lifetime, so plan ahead.
  • persistent Files written here will stay on disk until the user deletes them, so this area can fill up. Only production files are written here - it is not for general use, though special cases might be considered. The one current exception is that users can write fcl files here instead of uploading them.
  • resilient This is a special-purpose area. Files written here are copied to many (about 20) different dCache server nodes. This allows the files to be read efficiently by many grid nodes simultaneously. We expect this area will only be used for gridexport code tarballs, and perhaps other limited cases. Files written here will stay on disk until the user deletes them, so this area can fill up. Users should purge their own areas, and the collaboration reserves the right to purge users' files as needed.
  • tape-backed All files written to this area are copied to tape automatically. If space is needed, files are deleted off disk according to a least-recently-used algorithm. As files are requested, they are copied from tape to disk as needed, and a request will hang during tape access. This is the way collaborations "write to tape". Do not copy data to this area - it is carefully organized and only the production scripts can write here. Large datasets should be prestaged from tape to disk before they are read from grid jobs.

Official production datasets, and user datasets manipulated by the file tools will appear under the following designated dataset areas, corresponding to the above flavors:

  • /pnfs/mu2e/scratch/datasets
  • /pnfs/mu2e/persistent/datasets
  • /pnfs/mu2e/tape
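
Because the flavor determines how long a file lives and who may write it, it can help to check which flavor a path belongs to before a script writes there. The following is a minimal sketch, not an official tool: the function name dcache_flavor is hypothetical, and the /pnfs/mu2e/resilient prefix is an assumption, since only the scratch, persistent and tape prefixes are listed on this page.

```shell
#!/bin/sh
# Print the dCache flavor implied by a /pnfs/mu2e path (hypothetical helper).
# The scratch, persistent and tape prefixes come from the list above;
# the resilient prefix is an assumption.
dcache_flavor() {
    case "$1" in
        /pnfs/mu2e/scratch|/pnfs/mu2e/scratch/*)       echo scratch ;;
        /pnfs/mu2e/persistent|/pnfs/mu2e/persistent/*) echo persistent ;;
        /pnfs/mu2e/resilient|/pnfs/mu2e/resilient/*)   echo resilient ;;  # assumed prefix
        /pnfs/mu2e/tape|/pnfs/mu2e/tape/*)             echo tape-backed ;;
        *) echo unknown; return 1 ;;
    esac
}
```

For example, a job script could refuse to write its output unless dcache_flavor reports scratch for the destination.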

Using the scratch area

On the interactive nodes, you can create your area in scratch dCache:

mkdir /pnfs/mu2e/scratch/users/$USER

If you are moving a few files using a few processes, you can use the unix commands in /pnfs: ls, rm, mv, mkdir, rmdir, chmod, cp, cat, more, less, etc.

/pnfs is not a physical directory; it is an interface to a database and servers, implemented as an NFS server. Because there is latency to database and server access, there are restrictions to this interface that don't usually apply to local disk systems.

  • You should avoid commands which make large demands on the database: "find .", "ls -lr *" or similar. Also, on many gpvms (if not all) "ls" is aliased to "ls --color=auto", which provides much slower access. If you find your database access with "ls" is slow, you can check the alias of "ls" with "alias ls". If you get back something other than plain "ls", then try using "\ls" to bypass the alias. When you use "ls" there is a quick database access of the directory record, but if you use "find" or "ls -l" there is a much slower database access of the full file records, so a plain "ls" is always preferred.
  • Try to keep the number of files in a directory under 1000 to maintain good response time - this is necessary even if you simply access the files directly by a known path. Avoid excessive numbers of small files, or frequent renaming of files.
  • If you are writing or reading dCache with a large number of processes, such as from a grid job, please use the data transfer tools
  • dCache does not allow overwriting files, or modifying existing files (such as by an editor), only moving full files.
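
The first two guidelines above can be combined into a small check that counts directory entries with a plain "ls" and warns when a directory grows past the suggested size. This is only a sketch under stated assumptions: the function names count_entries and check_dir_size are hypothetical, the default limit of 1000 is taken from the guideline above, and file names containing newlines would be miscounted.

```shell
#!/bin/sh
# Count directory entries with a plain "ls" (one name per line).
# "command ls" bypasses any alias or function named ls, and avoids the
# slower full-record lookups that "ls -l" or "find" would trigger.
count_entries() {
    command ls -1 "$1" | wc -l | tr -d ' '
}

# Warn if a directory holds more entries than a limit (default 1000).
# Returns 0 if within the limit, 1 if over it, 2 if not a directory.
check_dir_size() {
    dir=$1
    limit=${2:-1000}
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 2
    fi
    n=$(count_entries "$dir")
    if [ "$n" -gt "$limit" ]; then
        echo "WARNING: $dir has $n entries (limit $limit)"
        return 1
    fi
    echo "OK: $dir has $n entries"
}
```

A typical call would be check_dir_size /pnfs/mu2e/scratch/users/$USER/mydir, where mydir is a placeholder for one of your own output directories.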

The scratch area is purged regularly using a least-recently-used algorithm based on the time the file was last transferred to a user ("touch" doesn't update the time). The time between the last use and deletion is not guaranteed and while it is often on the scale of a month or more, it may be as little as a week.

The LRU algorithm is applied per pool (one piece of the total disk), so there may be some variation in how long a file lasts: some pools are larger than others, which changes the churn rate, and a pool may see one-off events.

Other Access Protocols

While we will typically use the NFS protocol for small numbers of accesses, and the data transfer tools to read or write files from grid nodes, there are other protocols to access dCache files.

  • dcap the native protocol
    dccp /pnfs/mu2e/scratch/users/$USER/filename .
    

    dccp is installed on mu2egpvm, but it may have to be installed or set up (the product is named dcap) on other nodes.

  • root protocol
    root [0] ff = TFile::Open("/pnfs/mu2e/scratch/users/$USER/file")
    

    There are other versions of this access, through different plugins which trigger different authentication, protocols and transfer queues.

  • xrootd protocol

  • gridFtp
    kinit
    getcert
    export X509_USER_CERT=/tmp/x509up_u`id -u`
    export X509_USER_KEY=$X509_USER_CERT
    export X509_USER_PROXY=$X509_USER_CERT
    grid-proxy-init
    voms-proxy-init -noregen -rfc -voms fermilab:/fermilab/mu2e/Role=Analysis
    globus-url-copy  gsiftp://fndca1.fnal.gov:2811/scratch/users/$USER/file-name file:///$PWD/file-name
    

Access metadata for dCache file

Note that some commands will work with the Mu2e file path /pnfs/mu2e/... and some only work on the canonical path /pnfs/fnal.gov/usr/mu2e/....
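
When a command only accepts one of the two forms, the path can be converted by swapping the prefix. The sketch below assumes that the two forms differ only in that prefix, as they do in the examples on this page; the function names to_canonical and to_mu2e_path are hypothetical.

```shell
#!/bin/sh
# Convert /pnfs/mu2e/... to the canonical /pnfs/fnal.gov/usr/mu2e/... form.
# Paths that do not start with /pnfs/mu2e/ are printed unchanged.
to_canonical() {
    case "$1" in
        /pnfs/mu2e/*) printf '%s\n' "/pnfs/fnal.gov/usr/mu2e/${1#/pnfs/mu2e/}" ;;
        *)            printf '%s\n' "$1" ;;
    esac
}

# Convert the canonical form back to the shorter /pnfs/mu2e/... form.
to_mu2e_path() {
    case "$1" in
        /pnfs/fnal.gov/usr/mu2e/*) printf '%s\n' "/pnfs/mu2e/${1#/pnfs/fnal.gov/usr/mu2e/}" ;;
        *)                         printf '%s\n' "$1" ;;
    esac
}
```

For instance, a path could be passed to a command that needs the canonical form as "$(to_canonical "$fpath")".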

  • webdav See CS-doc-5050 for reference.
    curl -1 -L  --cacert $X509_USER_PROXY  --capath /etc/grid-security/certificates  --cert /tmp/x509up_u1311  https://fndca1.fnal.gov:2880/pnfs/fnal.gov/usr/mu2e/scratch/users/rlc/s1
    (not currently working)
    

    Two webdav scripts from Dmitri Litvinsev. They require a VOMS proxy to run.

    #!/bin/sh
    # get properties of a file
    if [ ${1:-x} = "x" ]; then
        echo "please provide full file name" 1>&2;
        echo "usage: $0 <file_path>"
        exit 1
    fi
    
    X509_USER_PROXY=/tmp/x509up_u`id -u`
    WEBDAV_HOST=https://fndca4a.fnal.gov:2880
    FILE_PATH=${1}
    
    echo $FILE_PATH
    
    curl  -L --capath /etc/grid-security/certificates \
             --cert ${X509_USER_PROXY} \
             --cacert ${X509_USER_PROXY} \
             --key ${X509_USER_PROXY} \
    	 -s -X PROPFIND -H Depth:1 \
             ${WEBDAV_HOST}${FILE_PATH} \
      --data '<?xml version="1.0" encoding="utf-8"?>
              <D:propfind xmlns:D="DAV:">
                  <D:prop xmlns:R="http://www.dcache.org/2013/webdav"
                          xmlns:S="http://srm.lbl.gov/StorageResourceManager">
                      <R:Checksums/>
                      <S:AccessLatency/>
                      <S:RetentionPolicy/><S:FileLocality/>
                  </D:prop>
              </D:propfind>' | xmllint -format -
    
    


    #!/bin/sh
    # get properties of the files in a directory
    
    if [ ${1:-x} = "x" ]; then
        echo "please provide full file name" 1>&2;
        echo "usage: $0 <file_path>"
        exit 1
    fi
    
    X509_USER_PROXY=/tmp/x509up_u`id -u`
    WEBDAV_HOST=https://fndca4a.fnal.gov:2880
    FILE_PATH=${1}
    
    curl  -L --capath /etc/grid-security/certificates \
             --cert ${X509_USER_PROXY} \
             --cacert ${X509_USER_PROXY} \
             --key ${X509_USER_PROXY} \
    	 -s -X PROPFIND -H Depth:1 \
             ${WEBDAV_HOST}${FILE_PATH} \
      --data '<?xml version="1.0" encoding="utf-8"?>
              <D:propfind xmlns:D="DAV:">
                  <D:prop xmlns:R="http://www.dcache.org/2013/webdav"
                          xmlns:S="http://srm.lbl.gov/StorageResourceManager">
                      <R:Checksums/>
                      <S:FileLocality/>
                  </D:prop>
              </D:propfind>' | xmllint -format -
    
    

    Another example, which queries the locality of a file over HTTPS and returns JSON:

    curl -k -X GET "https://fndca3a.fnal.gov:3880/api/v1/namespace//pnfs/fnal.gov/usr/nova/rawdata/FarDet/000328/32818/fardet_r00032818_s55_DDHmu.raw?locality=true"
    {
    "fileMimeType" : "application/octet-stream",
    "fileLocality" : "NEARLINE",
    "pnfsId" : "000081FAE6A3E9DA48E297F8E0FF66A6CA77",
    "fileType" : "REGULAR",
    "nlink" : 1,
    "mtime" : 1557510913687,
    "size" : 69473584,
    "creationTime" : 1557510913684
    }
    
  • dot commands
    These are documented in CS doc 5399.
    > fpath="/pnfs/mu2e/tape/phy-sim/dig/mu2e/DS-cosmic-mix-cat/MDC2018i/art/cc/5b/dig.mu2e.DS-cosmic-mix-cat.MDC2018i.001002_00039638.art"
    > cat `dirname $fpath`/".(get)(`basename $fpath`)(locality)"
    ONLINE_AND_NEARLINE
    

    ONLINE means the file is only on disk, NEARLINE means it is only on tape, and ONLINE_AND_NEARLINE means it is both on disk and on tape. This command can sometimes take a long time.

    Check if level 4 exists. This only exists if the file has a tape location.
    > fpath="/pnfs/mu2e/tape/phy-sim/dig/mu2e/DS-cosmic-mix-cat/MDC2018i/art/cc/5b/dig.mu2e.DS-cosmic-mix-cat.MDC2018i.001002_00039638.art"
    > cat `dirname $fpath`/".(use)(4)(`basename $fpath`)"
    
    If the file is not on tape, the cat command fails with a "No such file" error.
    
    Print format is fixed, from the example:
    
    VR4440M8
    0000_000000000_0000445
    5650536421
    phy-sim
    /pnfs/fnal.gov/usr/mu2e/tape/phy-sim/dig/mu2e/DS-cosmic-mix-cat/MDC2018i/art/cc/5b/dig.mu2e.DS-cosmic-mix-cat.MDC2018i.001002_00039638.art
    
    00004371A6EEFADF4578934301C5C4D337F1
    
    CDMS156660817000002
    fmv18025:/dev/rmt/tps11d0n:000780350B
    2587942076
    
    The format, line by line:
    
    tape label
    location_cookie on tape
    file size
    file family
    path
    empty line
    pnfsid
    empty line
    bfid
    tape drive
    CRC
    


    > fpath="/pnfs/mu2e/tape/phy-sim/dig/mu2e/DS-cosmic-mix-cat/MDC2018i/art/cc/5b/dig.mu2e.DS-cosmic-mix-cat.MDC2018i.001002_00039638.art"
    > cat `dirname $fpath`/".(get)(`basename $fpath`)(locality)"
    NEARLINE
    
    
  • enstore commands
    setup encp v3_11c -q stken
     > enstore info --file /pnfs/fnal.gov/usr/mu2e/tape/phy-sim/dig/mu2e/DS-cosmic-mix-cat/MDC2018i/art/cc/5b/dig.mu2e.DS-cosmic-mix-cat.MDC2018i.001002_00039638.art
    {'active_package_files_count': None,
     'archive_mod_time': None,
     'archive_status': None,
     'bfid': 'CDMS156660817000002',
     'cache_location': None,
     'cache_mod_time': None,
     'cache_status': None,
     'complete_crc': 2587942076L,
     'deleted': 'no',
     'drive': 'fmv18025:/dev/rmt/tps11d0n:000780350B',
     'external_label': 'VR4440M8',
     'file_family': 'phy-sim',
     'file_family_width': 2,
     'gid': 0,
     'library': 'CD-LTO8F1',
     'location_cookie': '0000_000000000_0000445',
     'original_library': 'CD-LTO8F1',
     'package_files_count': None,
     'package_id': None,
     'pnfs_name0': '/pnfs/fnal.gov/usr/mu2e/tape/phy-sim/dig/mu2e/DS-cosmic-mix-cat/MDC2018i/art/cc/5b/dig.mu2e.DS-cosmic-mix-cat.MDC2018i.001002_00039638.art',
     'pnfsid': '00004371A6EEFADF4578934301C5C4D337F1',
     'r_a': (('131.225.240.49', 58869),
             1L,
             '131.225.240.49-58869-1568218341.393365-453-140514482403136'),
     'sanity_cookie': (65536L, 1834246494L),
     'size': 5650536421L,
     'storage_group': 'mu2e',
     'tape_label': 'VR4440M8',
     'uid': 0,
     'update': '2019-08-23 19:56:11.152691',
     'wrapper': 'cpio_odc'}
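
The dot-command lookups shown earlier in this section can be wrapped in small helpers so that the special file names do not have to be typed by hand. This is a sketch, not part of any official tool set: the function names locality_cmd_path, level4_cmd_path and parse_level4 and the output labels are hypothetical, and the parser assumes the level 4 record always has the fixed line order listed above.

```shell
#!/bin/sh
# Build the dot-command path that reports a file's locality.
locality_cmd_path() {
    printf '%s/.(get)(%s)(locality)\n' "$(dirname "$1")" "$(basename "$1")"
}

# Build the dot-command path for the level 4 (tape location) record.
level4_cmd_path() {
    printf '%s/.(use)(4)(%s)\n' "$(dirname "$1")" "$(basename "$1")"
}

# Label the fixed-format level 4 record read from stdin.
# Line positions follow the layout listed above; lines 6 and 8 are empty.
parse_level4() {
    awk '
        NR == 1  { print "tape_label=" $0 }
        NR == 2  { print "location_cookie=" $0 }
        NR == 3  { print "size=" $0 }
        NR == 4  { print "file_family=" $0 }
        NR == 5  { print "path=" $0 }
        NR == 7  { print "pnfsid=" $0 }
        NR == 9  { print "bfid=" $0 }
        NR == 10 { print "drive=" $0 }
        NR == 11 { print "crc=" $0 }
    '
}
```

On a node with /pnfs mounted, usage would look like cat "$(locality_cmd_path "$fpath")" for the locality, and cat "$(level4_cmd_path "$fpath")" | parse_level4 for the tape record. The second command fails at the cat step if the file has no tape location.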
    
    

Note on SFA files

Getting the physical tape location for a file that has been tarred with other files in Small File Aggregation (SFA) is more complicated.

SAM expects a tape label and a location cookie for each file on tape, like

enstore:/pnfs/mu2e/tape/usr-etc/bck/gandr/dhtest-large/v1/tar/c7/45(3@vr0846m8) 

For files in an SFA archive, dCache uses non-numeric cookies. Instead of the "3" in the example above, they look like

/volumes/aggwrite/cache/6ad/0c5/0000E730EAE13FA64B76BB38C7D0530C56FE 

which, if I remember correctly, SAM does not accept. So for files in SFA, I have to look up the containing archive and put the physical location of the archive file into SAM instead. SAM uses this information to optimize file pre-staging, so giving it the tape location of the file that actually has to be read from tape makes sense.
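
To make the location format concrete, the sketch below splits a SAM-style location string of the form shown above into its path, cookie and tape label, and tests whether a cookie is purely numeric. The function names sam_location_parts and is_numeric_cookie are hypothetical, and the rule that SAM accepts only numeric cookies is the recollection stated above, not something verified here.

```shell
#!/bin/sh
# Split "enstore:/path(cookie@label)" into its three parts.
# Variables are global, since POSIX sh has no "local".
sam_location_parts() {
    loc=$1
    inner=${loc##*\(}          # text after the last "("
    inner=${inner%\)}          # drop the trailing ")"
    loc_path=${loc%\(*}        # text before the last "("
    loc_path=${loc_path#enstore:}
    printf 'path=%s\ncookie=%s\nlabel=%s\n' \
        "$loc_path" "${inner%%@*}" "${inner#*@}"
}

# Succeed only if the cookie consists of digits alone.
# SFA-style cookies such as /volumes/aggwrite/cache/... fail this check.
is_numeric_cookie() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *)           return 0 ;;
    esac
}
```

A script that fills in SAM locations could use is_numeric_cookie to detect SFA files and then fall back to looking up the containing archive.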