Cvmfs
Latest revision as of 22:57, 1 November 2024
Introduction
CERN Virtual Machine File System (CVMFS) is a distributed disk system for providing an experiment's code and libraries to interactive nodes and grids worldwide. It is used by CMS and ATLAS as well as most experiments at FNAL.
The code manager copies a code release to a CVMFS work space and "publishes" it. This process examines the code, compresses it, and inserts it in a database. The original database is called the tier 0 copy. Large remote sites, such as grids, may support tier 1 copies of the database, synced to the tier 0. Small (university group) sites can connect to tier 0 directly.
The user's grid job sees a CVMFS disk mounted and containing a copy of the experiment's code, which can be accessed in any way the code would be accessed on a standard disk. The disk is actually a custom nfs server with a small (~8 GB) local cache on the node and a backend that sends file requests to a squid web cache. The squid may get its data from the tier 1 database, if available, or from the tier 0. As a practical matter, most grid jobs do not access much in a release, usually just a small set of shared object libraries, and these end up cached on the worker node, or on the squid, thereby avoiding a long-distance network transfer.
CVMFS is efficient only for distributing code and small data files which are required by a large number of nodes on the grid. On the other hand, datasets, such as event data files, consist of many large files which are each sent to only one node during a grid job. CVMFS is not efficient for this type of data distribution or for this sort of data volume. Data files should be distributed through dCache, which is designed to deliver each file to one node, and to handle the data volume. A single large file which is to be distributed to all nodes also needs to be avoided since it would churn or overflow the small local caches. Examples of this sort of file are the Genie flux files or a large analysis fit template library. The lab has developed a stashCache feature for CVMFS, to address this sort of file, and we have our own stashCache area. The mu2e B-field files and stopped muon ntuples are distributed by CVMFS, but are about at the limit of file size that is appropriate.
Mu2e has two partitions:
mu2e.opensciencegrid.org
mu2e-development.opensciencegrid.org
The first is for the tagged releases which update slowly. The second, mu2e-development, is intended for rapid-turnaround uses, such as loading a build every day, or even more often. The difference is that the mu2e partition does not do garbage collection, but the dev partition does.
The mu2e CVMFS partitions are available on all grid sites that mu2e can submit to. The Fermigrid local farm and interactive nodes have the same setup, so you do not need to modify your script for onsite versus offsite. CVMFS is mounted on the mu2e interactive nodes, and is the default source for code and products.
CVMFS at NERSC
The treatment of cvmfs at NERSC is different from that at other sites. At most sites newly published material is automatically visible within an hour after publication, often faster than that. If a publication contains only a few small files it may be visible within 5 minutes. NERSC maintains a local copy of the cvmfs content and syncs it only once or twice per day.
Also, the Rapid Code Distribution Service (RCDS) content is not visible at NERSC.
Using CVMFS
Cvmfs is standard at all universities and grid sites, so we use it to provide all code. Please see Code or ComputingTutorials for details of accessing code.
Occasionally, on OSG remote sites, we can land on a node that knows of the latest version of cvmfs contents, but still returns the next-to-latest version of the contents for one minute. In order to avoid this (rare) case, you can run these commands before accessing cvmfs:
/cvmfs/grid.cern.ch/util/cvmfs-uptodate /cvmfs/fermilab.opensciencegrid.org
/cvmfs/grid.cern.ch/util/cvmfs-uptodate /cvmfs/mu2e.opensciencegrid.org
Installing CVMFS
It is preferred that all mu2e sites use this as a local code disk.
If you have root access, it is straightforward to install a readonly CVMFS client on a remote linux system. For a small load (a few desktops), please use this recipe.
Tip: if
yum install cvmfs cvmfs-config-default
fails to find the package, try this first:
yum install https://ecsft.cern.ch/dist/cvmfs/cvmfs-release/cvmfs-release-latest.noarch.rpm
For the /etc/cvmfs/default.local variables, you want
CVMFS_HTTP_PROXY=DIRECT
CVMFS_REPOSITORIES="fermilab.opensciencegrid.org,mu2e.opensciencegrid.org"
For a large installation like a farm, you will need to use these instructions and investigate a local "squid" cache for CVMFS_HTTP_PROXY.
You can size your local cache so that everything you normally use will sit in the cache. You then pay network latency only on the first use of any new ups product; on subsequent uses, the products are accessed from the cache. It is therefore possible to work on your laptop even when it is not connected to a network.
There was a Jan 2017 discussion on hypernews about installing cvmfs on SL7.
Adding CVMFS content
These are the steps to copy code into the cvmfs repository, and publish it so that it goes out to all sites. Since only collaboration tagged releases and Fermilab static products are distributed this way, this procedure is only performed by the code manager.
Here is some background information on maintaining cvmfs and its limitations.
Download instructions is good for an overview of how to pull art products, DataFiles and setupmu2e-art.sh to a local disk, which is useful background information.
There are three categories of files that make up the content of mu2e environment and are copied to cvmfs.
- Base releases are built by hand or on the Jenkins build server and installed on cvmfs.
- The art product and all the products it depends on are distributed as a set defined by a manifest. The manifests are listed on scisoft.fnal.gov. These will tell you the arguments to the pullProducts script below. We use relocatable ups products which can also be pulled individually from scisoft as tarballs.
- magnetic field maps and other special data files
- Note! G4beamline is not pulled with pullProducts, it is not a real product. You will need to copy that separately below.
- Note! If you have ups setup from anywhere while running pullProducts, it will not pull ups, even if it is on the manifest (unsetup ups to avoid this).
The most common procedure is to load a new tagged release, built on Jenkins, followed by updating a mu2e-maintained product and then loading a whole new art product suite.
To publish files on the mu2e CVMFS server, your kerberos principal will need to be in the k5login for cvmfsmu2e on oasiscfs.fnal.gov. An offline manager can do this.
All the recipes follow this pattern:
- login to the cvmfs server node (permission through the .k5login there). The username differs depending on whether you are going to load mu2e or mu2e-development.
ssh -X cvmfsmu2e@oasiscfs.fnal.gov
- or -
ssh -X cvmfsmu2edev@oasiscfs.fnal.gov
- tell the cvmfs server to record the changes you are about to make, and go to the write area. Use mu2e-development if that's appropriate.
cvmfs_server transaction mu2e.opensciencegrid.org
cd /cvmfs/mu2e.opensciencegrid.org
- maintain files, rsync, scp, pullProducts, wget, cp, find, rm etc.
- must cd out of the data directory or the next step will fail! Finish the transaction and publish.
cd ~
cvmfs_server publish mu2e.opensciencegrid.org
- to abort the transaction. For now (9/2024), after an abort command, you must perform a normal transaction and publish or the permissions on the partition will be incorrect on the cvmfs server machine.
cd ~
cvmfs_server abort mu2e.opensciencegrid.org
- run garbage collection (only on mu2e-development)
cd ~
cvmfs_server gc mu2e-development.opensciencegrid.org
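The transaction and publish steps of this pattern can be wrapped in a small shell function. This is only a sketch under stated assumptions: the function name cvmfs_update and the override variables CVMFS_SERVER and CVMFS_ROOT are hypothetical, introduced here so that the flow can be dry-run without touching a real repository. It deliberately does not call abort on failure, because of the abort caveat noted above.

```shell
# Sketch: open a transaction, run one command in the write area, then publish.
# cvmfs_update, CVMFS_SERVER and CVMFS_ROOT are illustrative names, not part of cvmfs.
cvmfs_update() {
    repo=$1
    shift
    server=${CVMFS_SERVER:-cvmfs_server}
    root=${CVMFS_ROOT:-/cvmfs}

    $server transaction "$repo" || return 1

    # Run the maintenance command in a subshell so that this shell's working
    # directory never moves into the write area; publish fails otherwise.
    if ( cd "$root/$repo" && "$@" ); then
        $server publish "$repo"
    else
        echo "maintenance command failed; transaction on $repo left open" >&2
        return 1
    fi
}
```

For a dry run that only prints the server commands, set CVMFS_SERVER="echo cvmfs_server" and point CVMFS_ROOT at a scratch directory that contains a subdirectory named after the repository.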
Examples
All the examples will follow the above pattern for opening and publishing a cvmfs transaction. These examples only include the commands to move files.
Jenkins build
Here is an example of pulling a release built on Jenkins directly on the cvmfs host machine. This will try to copy SLF6 prof,debug
source /cvmfs/fermilab.opensciencegrid.org/products/common/etc/setups
setup codetools
REL=v5_5_1
cd /cvmfs/mu2e.opensciencegrid.org/Offline
copyFromJenkins.sh $REL
# delete the intermediate .os and other temp files
cleanupRelease.sh $PWD/$REL/SLF6/prof/Offline
cleanupRelease.sh $PWD/$REL/SLF6/debug/Offline
cleanupRelease.sh $PWD/$REL/SLF7/prof/Offline
cleanupRelease.sh $PWD/$REL/SLF7/debug/Offline
rsync
Here is an example of using rsync. In the rsync command, if the source is the name of a directory and the name ends with a trailing slash, rsync copies the contents of the directory rather than the directory itself.
cd /cvmfs/mu2e.opensciencegrid.org
rsync -aur rlc@mu2egpvm01:/mu2e/app/Offline/tmp/DataFiles .
rsync -aur rlc@mu2egpvm01:/mu2e/app/Offline/tmp/setupmu2e-art.sh .
rsync -aur rlc@mu2egpvm01:/grid/fermiapp/products/mu2e/artexternals/G4beamline artexternals
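The trailing-slash rule is easy to check on throwaway local directories before running rsync against the repository. The sketch below uses a hypothetical function name and made-up directory and file names; it only illustrates the rule and is not part of the publishing procedure.

```shell
# Demonstrate the rsync trailing-slash rule on temporary local directories.
# demo_trailing_slash and the directory names are illustrative only.
demo_trailing_slash() {
    work=$(mktemp -d)
    mkdir -p "$work/DataFiles" "$work/dest_a" "$work/dest_b"
    touch "$work/DataFiles/map.dat"

    # No trailing slash: the directory itself is copied into dest_a.
    rsync -a "$work/DataFiles" "$work/dest_a"
    # Trailing slash: only the directory contents are copied into dest_b.
    rsync -a "$work/DataFiles/" "$work/dest_b"

    # Show where the file ended up in each case.
    ( cd "$work" && find dest_a dest_b -type f | sort )
    rm -rf "$work"
}
```

Calling demo_trailing_slash lists dest_a/DataFiles/map.dat for the first form and dest_b/map.dat for the second.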
scisoft product
Here is an example of installing a single product from Scisoft packages directly on the cvmfs host machine. The Scisoft package list page may take a minute to load and open. Note that pulling a product can update the version that is declared "current", and activate the product through default setups.
cd
wget http://scisoft.fnal.gov/scisoft/bundles/tools/pullPackage
chmod a+x pullPackage
cd /cvmfs/mu2e.opensciencegrid.org/artexternals
#pullPackage <options> <product_topdir> <OS> <prod-spec> [<qual_set> [<build-spec>]]
~/pullPackage -r $PWD slf6 ifdhc-v2_1_0 e14-p2713d prof
In the above block, the wget and chmod lines only need to be run when the pullPackage script has been updated, not every time a product is installed.
art, geant4, or mu manifest
Here is an example of pulling the products from an art manifest directly to the cvmfs host machine. Please see Scisoft for details. Note that pulling a product can update the version that is declared "current", and activate the product through default setups.
cd
wget http://scisoft.fnal.gov/scisoft/bundles/tools/pullProducts
chmod a+x pullProducts
# pullProducts <options> <product_topdir> <OS> <bundle-spec> <qual_set> <build-spec>
cd /cvmfs/mu2e.opensciencegrid.org/artexternals
~/pullProducts -r $PWD slf7 mu-v3_14_01 e28 prof
~/pullProducts -r $PWD slf7 mu-v3_14_01 e20 debug
~/pullProducts -r $PWD slf7 geant4-v4_10_7_p01a e28-mt prof
~/pullProducts -r $PWD slf7 geant4-v4_10_7_p01a e28-qt prof
NOTE: ifdh and ifdhc_config are part of the mu manifest, but our policy is to take the "current" version from the fermilab cvmfs partition. If we download the manifest before the ifdh products are updated on the fermilab partition, then the products will be loaded on the mu2e partition. If they are declared current on the mu2e partition, then they might start to override the current from the fermilab partition (mu2e is before fermilab in the products search path). Bottom line, check that there is no
/cvmfs/mu2e.opensciencegrid.org/artexternals/ifdhc*/current.chain
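This check can be scripted and run before publishing. The following is a minimal sketch; the function name check_no_ifdhc_current is hypothetical, and the default directory is simply the artexternals path used above.

```shell
# Print any ifdhc*/current.chain entries under an artexternals directory.
# Returns 0 when none are found and 1 otherwise. The function name is illustrative.
check_no_ifdhc_current() {
    dir=${1:-/cvmfs/mu2e.opensciencegrid.org/artexternals}
    found=0
    for chain in "$dir"/ifdhc*/current.chain; do
        # An unmatched glob stays literal, so test that the entry really exists.
        if [ -e "$chain" ]; then
            echo "unexpected: $chain"
            found=1
        fi
    done
    return "$found"
}
```

A non-zero return means that a current chain was declared on the mu2e partition and should be looked at before the transaction is published.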
link release
Here is an example of creating a link release
cd /cvmfs/mu2e.opensciencegrid.org/OfflineSpecial
mkdir -p v5_7_9-cosmic_target5-v3/SLF6/prof/Offline
cd v5_7_9-cosmic_target5-v3/SLF6/prof/Offline
# -mindepth 1 skips the source directory itself, so only its entries are linked
find /cvmfs/mu2e.opensciencegrid.org/Offline/v5_7_9/SLF6/prof/Offline -mindepth 1 -maxdepth 1 | while read FF; do ln -s "$FF"; done
rm JobControl
mkdir JobControl
cd JobControl
find /cvmfs/mu2e.opensciencegrid.org/Offline/v5_7_9/SLF6/prof/Offline/JobControl -mindepth 1 -maxdepth 1 | while read FF; do ln -s "$FF"; done
rm cd3
mkdir cd3
# continue with links and replacing with real files where needed
# check that real files are there
find /cvmfs/mu2e.opensciencegrid.org/OfflineSpecial/v5_7_9-cosmic_target5-v3 -type f
Muse envset Files
Here is an example of adding a new envset file to the /Muse/ directory in /cvmfs/. Make sure that you start a transaction and that you have run kinit with your username both on the cvmfs server and on the machine from which you are copying the envset file. In the example, kinit was executed on the cvmfs server and on mu2ebuild02, then a transaction was started.
cd /cvmfs/mu2e.opensciencegrid.org/DataFiles/Muse/
scp macndev@mu2ebuild02:/mu2e/app/users/macndev/Offline_PR527/Muse/config/p010 .
unbundled products
Here are some products which are pulled occasionally by hand because they are not part of the art suite and its manifest.
- allinea - a debugger
- historoot - needed for g4beamline
- artdaq - not clear if we need this
- mu2egrid - we only need the most recent version
- mpich - this is for testing with multithreading - used by Mike Wang in trigger studies - we need it.
- toyExperiment - take only the most recent version - used by art_workbook
- upd - probably should add this so that we can install from kits
- .updfiles - same
- valgrind - memory checker
Pulled by hand for G4Beamline:
- g4radiative v3_6
- g4photon v2_3
- root v5_28_00c
Trouble Shooting
Corrupt Cache
If you suspect that a file in your local cache is corrupt, you can force cvmfs to flush it from cache and to reload it on the next use with:
sudo cvmfs_talk -i <reponame> evict <path>
For example
sudo cvmfs_talk -i /cvmfs/mu2e.opensciencegrid.org evict /Offline/v09_06_00/SLF7/prof/Offline/setup.sh
Note that the path must start with a slash (/). You may only evict files, not directories. If you try to evict a directory you will get an error message: "No such regular file."
If this fails, you can try unmounting and remounting /cvmfs. The last resort is to reboot your machine.
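When the file is known by its full /cvmfs path, the repository name and the path inside the repository can be split off automatically. The sketch below is an assumption-laden convenience, not an official tool: split_cvmfs_path and evict_file are hypothetical names, paths containing spaces are not handled, and the bare repository name is passed to -i, as in the cache list command in the Notes section.

```shell
# Split a full /cvmfs path into the repository name and the path inside it.
# split_cvmfs_path and evict_file are illustrative helper names.
split_cvmfs_path() {
    rest=${1#/cvmfs/}      # drop the leading /cvmfs/
    repo=${rest%%/*}       # the first component is the repository name
    inner=/${rest#*/}      # the remainder, restored with its leading slash
    echo "$repo $inner"
}

# Evict one file from the local cache, given its full /cvmfs path.
evict_file() {
    set -- $(split_cvmfs_path "$1")
    sudo cvmfs_talk -i "$1" evict "$2"
}
```

For example, evict_file /cvmfs/mu2e.opensciencegrid.org/Offline/v09_06_00/SLF7/prof/Offline/setup.sh evicts the same file as the example above.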
Large Catalog Error
The Mu2e cvmfs catalog is nested into many smaller catalogs; a catalog that is too large stresses the transport infrastructure. The rules for nesting are specified in the file /cvmfs/mu2e.opensciencegrid.org/.cvmfsdirtab . As of September 2023, this file has the content:
/Offline/*/*/*
/artexternals
/artexternals/*/*
! *.version
! *.chain
/DataFiles
/Musings/*/*
If any catalog gets too big, you will get an error like the following:
FATAL: catalog at / has more than 200000 entries (201857). Large catalogs stress the CernVM-FS transport infrastructure. Please split it into nested catalogs or increase the limit.
terminate called after throwing an instance of 'ECvmfsException'
what(): PANIC: /builddir/build/BUILD/cvmfs-2.10.0/cvmfs/catalog_mgr_rw.cc : 1198
catalog at / has more than 200000 entries (201857).
/usr/bin/cvmfs_server: line 4105: 31610 Killed $user_shell "$sync_command"
Synchronization failed
To resolve this, add another level of nesting to .cvmfsdirtab . Note: the affected files may have their create date reset.
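To decide where another level of nesting is needed, it helps to count the entries under each subdirectory of the catalog named in the error. The following is a rough sketch; count_entries is a hypothetical helper name, and each count includes the subdirectory itself.

```shell
# List the immediate subdirectories of a directory with their entry counts,
# largest first. count_entries is an illustrative name.
count_entries() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue
        # Count every file and directory below (and including) this subdirectory.
        n=$(find "$d" | wc -l | tr -d ' ')
        echo "$n ${d%/}"
    done | sort -rn
}
```

Run inside an open transaction, for example as count_entries /cvmfs/mu2e.opensciencegrid.org, the subdirectories at the top of the list are the candidates for a new line in .cvmfsdirtab.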
Rollback
Older versions of cvmfs content are kept for about two weeks and you can roll back to any available version. When rolling back, versions between the current head and the rollback version are lost. To see the available versions
cvmfs_server tag -lx mu2e.opensciencegrid.org
The version tags are the "generic-<date>" strings. To roll back one version
cvmfs_server rollback mu2e.opensciencegrid.org
or this command can also take a specific tag from the list. During a recovery process, you can tag a version with the "tag" command and look at differences with the "diff" command.
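A rollback to a specific tag might be sketched as follows. Treat this as an assumption to check against the cvmfs_server help on the server: the sketch relies on the -t option of rollback, on the -s and -d options of diff, and on "trunk" naming the current head. The function name rollback_to and the CVMFS_SERVER override are hypothetical.

```shell
# Sketch: show what would be discarded, then roll back to a chosen tag.
# rollback_to and CVMFS_SERVER are illustrative names; the tag comes from "tag -lx".
rollback_to() {
    repo=$1
    tag=$2
    server=${CVMFS_SERVER:-cvmfs_server}

    # Differences between the chosen tag and the current head.
    $server diff -s "$tag" -d trunk "$repo" || return 1
    # Roll back; everything published after the tag is lost.
    $server rollback -t "$tag" "$repo"
}
```

Setting CVMFS_SERVER="echo cvmfs_server" prints the two commands without running them, which is a cheap way to review them first.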
Rapid Code Distribution Service (RCDS)
There are two main ways for the standard Mu2e tools to run code on Fermigrid and OSG: you can run code that is found in the Mu2e cvmfs repository /cvmfs/mu2e.opensciencegrid.org/ or you can package your code as a compressed tar file and tell the grid to use that tar file. If you are running Mu2e Offline using mu2eprodsys you make the tar file using muse tarball.
Starting in July 2020 mu2eprodsys changed to using the Rapid Code Distribution Service (RCDS; see https://cdcvs.fnal.gov/redmine/projects/jobsub/wiki/Rapid_Code_Distribution_Service_via_CVMFS_using_Jobsub). You enable this feature with the --use-cvmfs-dropbox option to jobsub. You can look at the generated jobsub command that is in the output of the mu2eprodsys command to see if this flag is present.
When you submit a job using RCDS, jobsub does the following:
- It copies your tar file and unpacks it into one of several cvmfs repositories that are dedicated to hosting these tar files.
- All of these cvmfs repos are visible on all grid worker nodes, just like /cvmfs/mu2e.opensciencegrid.org/ .
- It sets an environment variable INPUT_TAR_DIR that is visible inside your running grid job and that points to your code in one of these cvmfs repos
- Your grid job can run this code directly from cvmfs
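Inside the grid job it is worth checking that the variable is set and that the directory is visible before using it. The following is a minimal sketch; find_job_code is a hypothetical helper name, and INPUT_TAR_DIR is the variable named in the list above.

```shell
# Sketch of a job-side check before using code delivered through RCDS.
# find_job_code is an illustrative name.
find_job_code() {
    if [ -z "${INPUT_TAR_DIR:-}" ]; then
        echo "INPUT_TAR_DIR is not set; was the job submitted with RCDS?" >&2
        return 1
    fi
    if [ ! -d "$INPUT_TAR_DIR" ]; then
        echo "INPUT_TAR_DIR=$INPUT_TAR_DIR is not visible on this node" >&2
        return 1
    fi
    # Print the directory so the caller can cd to it or source files from it.
    echo "$INPUT_TAR_DIR"
}
```

A job script could then use code_dir=$(find_job_code) || exit 1 and continue from "$code_dir".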
Some comments:
- When you submit your job there will be a pause while the tar file is uploaded to /cvmfs and untarred.
- Files in the RCDS cvmfs space are purged when they are older than some cut-off (at one point it was one month but that may change).
- Before copying your tar file, jobsub computes a checksum for your tar file. If that checksum matches that of a previously-used tar file, then jobsub assumes that the files are the same; it does not do the copy and reuses the unpacked code from the earlier time. In the early releases of RCDS this did not extend the lifetime of your cached files; that may (or may not) change.
- Presumably there is some sort of load balancing among these repositories but I don't know details.
- jobsub is smart enough that it does not start any of your grid processes until it knows that the process will be able to see the code in cvmfs.
If you read documentation that says to copy your tar file to /pnfs/mu2e/resilient, do NOT follow those instructions. Use RCDS, as described above, instead. Also tell Mu2e computing management where you read the documentation about resilient so that we can remove or update it.
Notes
To see the size of the cvmfs cache on a node,
df -h /cvmfs/mu2e.opensciencegrid.org/
this is shared by all cvmfs partitions. Also useful is the config file:
cat /etc/cvmfs/default.local
which contains CVMFS_QUOTA_LIMIT, in MB. The actual cache is usually in /var/cache. mu2egpvm seems to have 8GB, grid nodes have 3.6GB.
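The quota can be read out of the config file and converted to GB with a short helper. This is a sketch: quota_limit_gb is a hypothetical name, it assumes a plain NAME=value line in the file, and it divides by 1024 to convert MB to GB.

```shell
# Print the configured cache quota in GB, given that CVMFS_QUOTA_LIMIT is in MB.
# quota_limit_gb is an illustrative name; the last assignment in the file wins.
quota_limit_gb() {
    conf=${1:-/etc/cvmfs/default.local}
    awk -F= '$1 == "CVMFS_QUOTA_LIMIT" { gsub(/[" ]/, "", $2); mb = $2 }
             END { if (mb == "") exit 1; printf "%.1f\n", mb / 1024 }' "$conf"
}
```

The function returns a non-zero status if the file does not set CVMFS_QUOTA_LIMIT, in which case the client default applies.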
The compressed database of files is under
/exp/cvmfs/mu2e.opensciencegrid.org
on oasiscfs.fnal.gov.
To list the contents of the cache on a machine:
sudo cvmfs_talk -i mu2e.opensciencegrid.org cache list
Guidelines on Usage
On September 6 2016, we asked Dave Dykstra for information about quotas and/or other usage guidelines. The short version of his reply is that we have a quota of about 191 GB on:
/exp/cvmfs/mu2e.opensciencegrid.org
and no quota (ie usage limited only by the free space on the file system) on:
/cvmfs/mu2e.opensciencegrid.org
Throughout this discussion 1 GB means 1024^3 bytes.
On Sep 6, 2016 the Mu2e quota on /exp/cvmfs/mu2e.opensciencegrid.org was doubled from 95 GB to 191 GB. This directory holds the de-duplicated and compressed database and is visible only on oasiscfs.fnal.gov. The partition that contains this directory also contains databases for other experiments and, as of Sept 6, 2016, the partition has about 4TB of which about 50% is used.
To check the Mu2e quota and the used fraction of the quota:
quota -s /exp/cvmfs/mu2e.opensciencegrid.org
and look for the information about:
/dev/mapper/vgData-srv_cvmfs
You can also check the usage on this disk using du -sh; its answer should agree with the answer given by quota. If necessary, we may ask for an increase in the quota.
On oasiscfs.fnal.gov the directory:
/cvmfs/mu2e.opensciencegrid.org
contains the source from which the de-duplicated compressed database is created. On other machines it is a cached image of that database. On oasiscfs.fnal.gov there is no Mu2e quota on this disk.
The following table gives the disk space used on /cvmfs/mu2e.opensciencegrid.org/ (Uncompressed) and /exp/cvmfs/mu2e.opensciencegrid.org (Compressed) at various times:
Date | Uncompressed (GB) | Compressed (GB) |
---|---|---|
Sep 3, 2015 | 80 | 32 |
Dec 20, 2015 | 144 | -- |
Sep 6, 2016 | 236 | 96 |
The ratio of compressed to uncompressed is roughly constant with time.
Some other details about the compressed and de-duplicated database: /exp/cvmfs/mu2e.opensciencegrid.org.
- The total space used by all experiments is 2.0 TB and there is 2.0 TB free.
- Another 4TB file system is available to handle overflow.
- NOvA is the big user with 880TB (on Sep 3, 2015; probably more now)
- If the present ratio of compressed to uncompressed remains unchanged, 191GB on the compressed de-duplicated database corresponds to about 500GB on the published repository.
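The projection in the last item can be reproduced from the rows of the table above. The sketch below assumes, as the text does, that the ratio of uncompressed to compressed size stays constant; project_uncompressed is a hypothetical helper name.

```shell
# Project the uncompressed size that fits in a compressed quota,
# assuming the uncompressed/compressed ratio stays constant.
project_uncompressed() {
    # $1 = quota (GB), $2 = uncompressed (GB), $3 = compressed (GB)
    awk -v q="$1" -v u="$2" -v c="$3" 'BEGIN { printf "%.1f\n", q * u / c }'
}

project_uncompressed 191 80 32     # ratio from the Sep 3, 2015 row
project_uncompressed 191 236 96    # ratio from the Sep 6, 2016 row
```

Both rows give a ratio near 2.5, so the projection lands a little under 500 GB, consistent with the rounded figure quoted above.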
Other factoids:
- The compressed and de-duplicated repository is what is copied to the stratum 1 servers. The owners of these servers prefer that usage on the servers does not grow "too fast"; at this time the BNL server is tight for space.
- Removing files from the published repo does not automatically recover space in the compressed and de-duplicated repo.
- Space is only recovered if a snapshot is taken, which is a manual process and only happens rarely.
- The snapshot process needs to be done at each stratum 1 server.
Hard links
In March 2017 on INC000000837206, we investigated this error:
...OverlayFS has copied-up a file (/cvmfs/mu2e.opensciencegrid.org/artexternals/root/v6_08_04e/Linux64bit+2.6-2.12-e10-prof/bin/genreflex) with existing hardlinks in lowerdir (linkcount 3). OverlayFS cannot handle hardlinks and would produce inconsistencies.
Consider running this command:
cvmfs_server eliminate-hardlinks
On the node that uploads the file, oasiscfs.fnal.gov, the filesystem we are interacting with is actually OverlayFS. It used to be aufs, but when it changed to OverlayFS, this problem with hard links was introduced. We found hard links in the gcc and root bin areas. You can fix this by aborting the transaction, writing the products on a local disk, then running rsync (without -H, which preserves hard links) to the upload area. The rsync will remove each hard link and replace it with separate identical files. If you want to fix the problem in place on the upload disk, you have to delete all of the hard-linked files and replace them. We are not sure if this is still a problem today. We asked that hard links be removed from products before being put on scisoft.
Update 6/14/17: This seems to be fixed; the message now looks like
Warning: Found file with linkcount > 1 (/cvmfs/mu2e.opensciencegrid.org/artexternals/root/v6_08_06g/Linux64bit+2.6-2.12-e14-prof/bin/rootcint). We will break up these hardlinks.
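To check a product area on a local disk for hard links before copying it to the upload area, find can select files by link count. The following is a small sketch; list_hardlinked is a hypothetical helper name.

```shell
# List regular files that have more than one hard link under a directory.
# list_hardlinked is an illustrative name.
list_hardlinked() {
    find "$1" -type f -links +1 | sort
}
```

An empty result means that there are no hard-linked files under the directory.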
Cvmfs is not backed up
Our files in /cvmfs are not backed up by /cvmfs -- but this is OK.
The underlying cvmfs data is replicated onto at least 8 stratum 1s worldwide, so there are a lot of copies of it. Also it is possible to look at all older publication numbers using a cvmfs client option, and it is possible to roll back to previous publications on the cvmfs server. So data is not likely to be lost once it is published in cvmfs, especially on a repository that has no garbage collection.
Most of the Mu2e content on /cvmfs is source that lives in a GitHub repository and binaries built from that source. If needed, the art-suite binaries can be downloaded again from scisoft.fnal.gov and our own code can be rebuilt from source. The art-suite binaries could even be rebuilt from source if needed. One obvious caveat is that it would be difficult to regenerate files built on an old OS that is no longer readily available. But if we needed to do that, we would need to find such a machine in order to run the jobs; so rebuilding the code is possible, just tedious.
This leaves the magnetic field map files, whose authors have copies of the original source.