Disks
Introduction
There are several categories of disk space available at Fermilab. There are limited home areas, BlueArc disks for building code and small datasets, and dCache for large datasets and sending data to tape.
When reading this section pay careful attention to which disks are backed up. It is your responsibility to ensure that files you require to be backed up are kept on an appropriate disk. It is equally your responsibility to use the backed up space wisely and not fill it with files that can easily be regenerated, such as root files, event-data files, object files, shared libraries and binary executables.
If you need to use one of the other disks, please make yourself a directory within the users area of that disk. You must name the directory after your Kerberos principal so that our management scripts know who you are.
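For example, if your Kerberos principal were jsmith (a hypothetical name), you would create your area on /mu2e/data as shown below. On the interactive nodes the environment variable $USER normally matches your Kerberos principal, so you can usually use it instead of typing the name:

mkdir /mu2e/data/users/$USER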
The table below summarizes the information found in the sections that follow. An entry of the form (1) or (2) indicates that you should read the numbered note below the table.
Name | Quota (GB) | Backed up? | Worker | Interactive | Purpose/Comments |
---|---|---|---|---|---|
Mu2e Project Disk on BlueArc | |||||
/grid/data/mu2e | 2,560 | No | rw- | rw- | Event-data files, log files, ROOT files. |
/mu2e/data | 71,860 | No | rw- | rw- | Event-data files, log files, ROOT files. |
/grid/fermiapp/mu2e | 232 | Yes | r-x | rwx | Grid accessible executables and shared libraries. No data/log/root files. |
/mu2e/app | 1,024 | No | r-x | rwx | Grid accessible executables and shared libraries. No data/log/root files. |
/grid/app/mu2e | 30 | Yes | rwx | rw- | See the discussion below. |
Home Disks | |||||
/afs/fnal.gov/files/home/room* | 0.5 | Yes | --- | rwx | mu2egpvm* and FNALU only |
/sim1 | 20 | Yes | --- | rwx | detsim only; in BlueArc space |
Local Scratch | |||||
/scratch/mu2e | 954 | No | --- | rwx | mu2egpvm* only; NFS mounted from gpcf015.
/scratch/mu2e | 568 | No | --- | rwx | detsim only; local disk.
Mu2e web Site | |||||
/web/sites/mu2e.fnal.gov | 8 | Yes | --- | rwx | mounted on mu2egpvm* and FNALU (not detsim); see the Mu2e website instructions.
Marsmu2e Project disk on BlueArc | |||||
/grid/data/marsmu2e | 400 | No | rw- | rw- | Event-data files, log files, ROOT files. |
/grid/fermiapp/marsmu2e | 30 | Yes | r-x | rwx | Grid accessible executables and shared libraries |
Obsolete AFS Space | |||||
/afs/fnal.gov/files/code/mu2e/d | 58 | Yes | --- | rwx | Obsolete. See below. |
/afs/fnal.gov/files/data/mu2e/d | 20 | No | --- | rwx | Obsolete. See below. |
Notes on the table:
- The project and scratch spaces each have a subdirectory named users. To use these disks, make a subdirectory users/your_kerberos_principal and put your files under that subdirectory.
- The home disks have individual quotas. All others have only group quotas.
- The afs home disks have per user quotas and the default quota is small, 500 MB for most new users and smaller for older users. Fermilab policy is that large files such as ROOT files, event-data files and builds of our code, should live in project space, not in our home areas. Therefore the introductory instructions tell you to build your code either on our project disks or in scratch space. You can contact the Service Desk to request additional quota on your home disk.
- Despite the name asymmetry, /mu2e/app is intended as additional space for the role of /grid/fermiapp/mu2e/, not for that of /grid/app/mu2e.
- The columns headed Worker and Interactive show the permissions with which each disk is mounted on, respectively, the grid worker nodes and the interactive nodes (detsim, mu2egpvm*). Full permissions are rwx, denoting read, write and execute, respectively. If one of rwx is replaced with a -, then that permission is missing on the indicated machines. If the permissions are given as ---, then that disk is not mounted on the indicated machines. Withholding w or x permission from some partitions is a security measure, discussed below.
BlueArc Disks
Fermilab operates a large disk pool that is mounted over the network on many different machines, including detsim, the GPCF interactive nodes, the GPCF local batch nodes and the GP Grid worker nodes. It is not mounted on most grid worker nodes outside of GP Grid and it is not mounted on FNALU. The pool is built using Network Attached Storage systems from the BlueArc Corporation and has RAID 6 level error detection and correction.
This pool is shared by all Intensity Frontier experiments. As of January 2012 Mu2e has a quota of about 37 TB, distributed as shown in the table above. Each year the Computing Division purchases additional BlueArc systems and each year Mu2e gets additional quota on the new systems.
The following admonition is taken from the GPCF Getting Started page:
It is very important not to have hundreds of grid jobs all accessing the BlueArc disks at the same time. Use the MVN and CPN commands (just like the Unix mv and cp commands, except that they queue up to spare BlueArc the trauma of too many concurrent accesses) to copy data onto and off of the BlueArc disks.
Additional information about this is available on the Mu2e Fermigrid page. See, in particular, the sections on CPN, staging input files, and staging output files.
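As a sketch of the pattern inside a grid job, assuming the cpn command has been set up as described on the Fermigrid page and using hypothetical file names, the copies would look something like this:

cpn /mu2e/data/users/$USER/input/events.root .     # copy the input from BlueArc to the local worker disk
# ... run the job on the local copy, writing output.root locally ...
cpn output.root /mu2e/data/users/$USER/output/     # copy the results back to BlueArc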
The disk spaces /grid/data/mu2e and /mu2e/data are intended as our primary disk space for event-data files, log files, ROOT files and so on. These disks are mounted noexec on all machines; if you put a script or an executable file in this disk space and attempt to execute it, you will get a file permission error. Why are there two separate file systems? When we needed disk space beyond our initial allocation, the server holding the first block of space was full, so we were given space on a new disk server. Neither of these areas is backed up.
If you want to run an application on the grid, the executable file(s) and the shared libraries for that application must reside on /grid/fermiapp/mu2e or /mu2e/app; this includes both the standard software releases of the experiment and any personal code that will be run on the grid. The recommended use is to compile code on one of the interactive nodes and place the executables and .so files in either /grid/fermiapp/mu2e or /mu2e/app. Because this disk space is executable on all of detsim, GPCF, and the GP Grid worker nodes, it is straightforward to develop and debug jobs interactively and then to submit the long jobs to the grid.
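The difference shows up immediately if you try to run a binary from the wrong disk. In the illustrative sketch below the file name, the paths and the exact wording of the error are assumptions; the point is only the behavior (noexec on the data disks, executable on the app disks):

cp mu2e_analysis /mu2e/data/users/$USER/
/mu2e/data/users/$USER/mu2e_analysis            # fails with a permission error: the data disks are mounted noexec

cp mu2e_analysis /mu2e/app/users/$USER/bin/
/mu2e/app/users/$USER/bin/mu2e_analysis         # runs: /mu2e/app is mounted executable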
/grid/app/mu2e
For the foreseeable future, Mu2e will not use /grid/app/mu2e for its intended purpose. This file system is intended for users who are authorized to use FermiGrid but who do not have access to interactive machines that mount the equivalent of /grid/fermiapp for their group. Such users can, within a grid job, copy their executables to their space on /grid/app and then execute those applications. Or they can compile and link an executable during one grid job and leave it on /grid/app for future grid jobs to use. Under most circumstances we should develop and test our code on detsim or GPCF, put the debugged executable on either /grid/fermiapp/mu2e or /mu2e/app, and then submit grid jobs that use those executables.
BlueArc Execute and Write Permissions
In the table above, one can see that some disk partitions are either not executable or not writable on certain nodes; this is a primitive security precaution. Suppose that an unauthorized user gains access to a grid worker node; that person cannot write malware onto /grid/fermiapp/mu2e or /mu2e/app, both of which are write protected on grid worker nodes. That person can write malware onto the data disks or onto /grid/app/mu2e, but none of those disks is executable on the interactive nodes. Therefore, if an unauthorized user gains access to a worker node, they cannot deposit executable malware into a place from which it can be executed on one of the interactive nodes.
BlueArc Snapshots
In the table above, some of the BlueArc disks are shown as backed up. The full policy for backup to tape is available at the Fermilab Backup FAQ.
In addition to backup to tape, the bluearc file system supports a feature known as snapshots, which works as follows. Each night the snapshot code runs and it effectively makes a hard link to every file in the filesystem. If you delete a file the next day, the blocks allocated to the file are still allocated to the snapshot version of the file. When the snapshot is deleted, the blocks that make up the file will be returned to the free list. So you have a window, after deleting a file, during which you can recover the file. If the file is small, you can simply copy it out of the snapshot. If the file is very large you can ask for it to be recreated in place.
On the data disks, a snapshot is taken nightly and then deleted the next night; so once a file has been deleted it will be recoverable for the remainder of the working day. On /grid/fermiapp and /mu2e/app, a snapshot is taken nightly and retained for 4 nights; so a deleted file can be recovered for up to 4 calendar days.
If you create a file during the working day, it will not be protected until the next snapshot is taken, on the following night. If you delete the file before the snapshot is taken, it is not recoverable.
After a file has been deleted, but while it is still present in a snapshot, the space occupied by the file is not charged to the Mu2e quota. This works because the disks typically have free space beyond that allocated to the various experiments. However, it is always possible for an atypical usage pattern to eat up all available space; in such a case we can request that snapshots be removed.
How does this work? While the BlueArc file system looks to us like an NFS-mounted Unix filesystem, it is actually a much more powerful system. It has a front end that supports a variety of features such as journaling and some amount of transaction processing. The snapshots take place in this front-end layer of BlueArc.
You can view the snapshots of the file systems at, for example, /mu2e/app/.snapshot/ or /grid/fermiapp/.snapshot/ . Snapshots are read-only to us.
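For example, to recover a small file that you deleted earlier today from /mu2e/app, you could look for it under the snapshot directory and copy it back. The snapshot directory names vary, so list them first; the user and file names below are hypothetical:

ls /mu2e/app/.snapshot/
cp /mu2e/app/.snapshot/<snapshot-name>/users/$USER/myfile.txt /mu2e/app/users/$USER/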
Home Disks
The interactive nodes in GPCF and FNALU, and the batch nodes in GPCF, share the same home disks. These home disks are mounted over the network using AFS; see the Mu2e notes on AFS. Fermilab policy is that large files such as ROOT files, event-data files and builds of our code should live in project space, not in our home areas. Therefore these home disks have small quotas, about 500 MB by default, and the Mu2e getting started instructions tell you to build your code on our project disks. You can contact the Service Desk to request additional quota, but you will not get multiple GB.
The home disks on detsim are different from those mounted on GPCF and FNALU. They are mounted only on detsim and nowhere else; these are the same home disks that were previously mounted on ilcsim and ilcsim2. Giving detsim its own home disks, separate from those of GPCF and FNALU, was a deliberate decision. Because AFS is a wide-area filesystem, it intermittently suffers from very long latencies; to avoid these latency issues, detsim's home disks are mounted over a more local network with shorter latency.
The grid worker nodes do not see either of these home disks. When your job lands on a grid worker node, it lands in an empty directory.
The home disks do not have enough space to build a release of Mu2e Offline.
Local Scratch Disks
On both GPCF and detsim there is scratch space available for general Mu2e use. However, different physical disks are mounted on the two facilities: on mu2egpvm*, /scratch/mu2e is NFS-mounted from gpcf015 and has about 954 GB of available space; on detsim, /scratch/mu2e is a local disk with a size of about 568 GB. The mu2egpvm* scratch disk is not visible on detsim and vice versa. Neither scratch disk is visible on the grid worker nodes or FNALU.
dCache
This is a cache disk system that is described in dcache.shtml.
Mu2e website
The mu2e web site lives at /web/sites/mu2e.fnal.gov; this is visible from mu2egpvm* and from FNALU but not from detsim. All Mu2e members have read and write access to this disk space. For additional information see the instructions for the Mu2e web site.
Disks for the group marsmu2e
There are two additional disks that are available only to members of the group marsmu2e; only a few Mu2e collaborators are members of this group. The group marsmu2e was created to satisfy access restrictions on the MCNP software that is used by MARS. Only authorized users may have read access to the MARS executable and its associated cross-section databases. This access control is enforced by creating the group marsmu2e, limiting membership in the group and making the critical files readable only by marsmu2e.
The two disks discussed here are /grid/fermiapp/marsmu2e, which has the same role as /grid/fermiapp/mu2e, and /grid/data/marsmu2e, which has the same role as /grid/data/mu2e.
This is discussed further on the pages that describe running MARS for Mu2e.
Disk Quotas
Non-AFS Disks
On the project and scratch disks, the servers are configured to enforce quotas on a per-group basis; there are no individual user quotas. The only way to look at file usage by individuals is to do a du -s on their user areas. To examine the usage and quotas for the mu2e group, you can issue the following command on any node that mounts our disks:
quota -gs mu2e
Disk quotas for group mu2e (gid 9914):
Filesystem                       blocks   quota   limit   grace   files   quota   limit   grace
blue3.fnal.gov:/mu2e/data         4374G       0  10240G            206k        0       0
gpcf015.fnal.gov:/scratch/mu2e   66936M    916G    954G           56844        0       0
blue2:/fermigrid-fermiapp        50594M       0  61440M          16123k        0       0
blue2:/fermigrid-app              2206M       0  30720M           3494k        0       0
blue2:/fermigrid-data             1993G       0   2048G           6295k        0       0
blue3.fnal.gov:/mu2e-app         46452M       0   1024G            461k        0       0
The first data line, for example, reads as follows: on /mu2e/data we have a quota of 10 TB, of which we have used about 4.4 TB in 206,000 files. The disk described by the second line is not a BlueArc-served disk and its quota system is configured differently: when the usage reaches 916 GB we will get a warning; there is a hard limit at 954 GB. The BlueArc disks have only a hard limit.
Aside: on some systems I have worked on, when the quota was exceeded but the hard limit was not, it was possible to continue writing to files that were already open but not to create new files. I don't know how this system is configured.
When the -s flag is not present, the quota command populates the blocks and limit columns in units of 1K blocks. When the -s flag is present, quota will choose human friendly units.
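For example, to see how much of the group quota each user area on /mu2e/data is consuming (this can take a while on a large, busy disk):

du -sh /mu2e/data/users/*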
The group marsmu2e has its own quotas on /grid/fermiapp/marsmu2e and /grid/data/marsmu2e. People who are members of both mu2e and marsmu2e may copy or move files among all of the disk spaces.
Don't be confused by the following. If you do df -h on, for example, /grid/fermiapp/mu2e, you will see that it has a size of 1.1 TB, much more than the Mu2e quota shown in the table above. The additional space is allocated to other experiments. To find the space available to mu2e, you must use the quota command.
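For example, the two commands below answer different questions: the df output describes the whole shared partition, while the quota output describes the Mu2e share of it.

df -h /grid/fermiapp/mu2e      # size and free space of the entire shared partition
quota -gs mu2e                 # usage and quota for the mu2e group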
AFS Disks
On the AFS disks the quotas are inspected using the command fs listquota. For example,
fs listquota /afs/fnal.gov/files/code/mu2e/d1/
Volume Name                    Quota       Used %Used   Partition
c.mu2e.d1                    5000000    4604170   92%<<       63%
The columns Quota and Used are given in units of 1K blocks. Additional information is available on the Mu2e AFS page.
Obsolete AFS Space
When the Mu2e project began at Fermilab, we were assigned AFS disk space to use as project space for code and data. This disk space is mounted on both FNALU and GPCF but not on detsim. Mu2e is no longer actively using this space and its only purpose is to archive some earlier work. Additional details are available on the Mu2e AFS page.