Disks

Introduction

There are several categories of disk space available at Fermilab. These include limited home areas, Mu2e project disks for building code and small datasets, dCache (/pnfs) for large datasets and sending data to tape, and a wide-area read-only disk (/cvmfs) for distribution of code and some auxiliary data files.

When reading this section pay careful attention to which disks are backed up. It is your responsibility to ensure that files you require to be backed up are kept on an appropriate disk. It is equally your responsibility to use the backed up space wisely and not fill it with files that can easily be regenerated, such as root files, event-data files, object files, shared libraries and binary executables.

To learn where you may create your own directory on the project and dCache disks, see #Recommended_use_patterns and the data transfer page. When you do make your own directory, you must name it using your kerberos principal (your Fermilab username).

The table below summarizes the information found in the sections that follow.

Name | Quota (GB) | Backed up? | Worker | Interactive | Purpose/Comments
User Home Disks
/nashome | 5.2 | Yes | --- | rwx | mu2egpvm*, mu2ebuild*, and FNALU only
Mu2e Project Disk on Ceph (Phased in during fall 2023 - please start new work here)
/exp/mu2e/data | 87,961 | No | --- | rwx | Event-data files, log files, ROOT files.
/exp/mu2e/app | 3,848 | No | --- | rwx | Exe's and shared libraries. No data/log/root files.
/grid/fermiapp/mu2e | 232 | Yes | --- | rwx | Deprecated. Not for use by general users.
Special Disks
/cvmfs | - | Indirectly | r-x | r-x | readonly code distribution - all interactive and grid nodes
/pnfs | - | No/Yes | --- | rwx | distributed data disks - all interactive nodes
Mu2e web Site
/web/sites/mu2e.fnal.gov/htdocs | 8 | Yes | --- | rwx | mounted on mu2egpvm* and FNALU; see website instructions
Marsmu2e Project disk on NAS
/grid/data/marsmu2e | 400 | No | rw- | rw- | Event-data files, log files, ROOT files.
/grid/fermiapp/marsmu2e | 30 | Yes | r-x | rwx | Grid accessible executables and shared libraries

Notes on the table:

  1. The project and scratch spaces each have a subdirectory named users. To use these disks, make a subdirectory users/your_kerberos_principal and put your files under that subdirectory.
  2. The Ceph disks have directory tree based quotas.
  3. The columns headed Worker and Interactive show the permissions with which each disk is mounted on, respectively, the grid worker nodes and the interactive nodes (mu2egpvm*, mu2ebuild02). In the table above, full permissions are rwx, which denote read, write and execute, respectively. If one of rwx is replaced with a -, then that permission is missing on the indicated machines. If the permission is given as ---, then that disk is not mounted on the indicated machines. Some partitions are mounted without w or x permission as a security measure, discussed below.
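
As a rough illustration of the rwx notation in note 3, the following sketch reports how a directory looks from the node you are logged in to. The function name disk_access is made up for this example. It reports the access that your own account has, as seen by the shell's test operators, which is not necessarily identical to the mount options of the disk.

```shell
# Report how a directory is accessible to the current user on this node,
# in the same notation as the table: "---" means the path is not present.
# disk_access is a hypothetical helper name, not an existing Mu2e command.
disk_access() {
    local dir="$1" perms=""
    if [ ! -d "$dir" ]; then
        echo "---"
        return 0
    fi
    if [ -r "$dir" ]; then perms="${perms}r"; else perms="${perms}-"; fi
    if [ -w "$dir" ]; then perms="${perms}w"; else perms="${perms}-"; fi
    if [ -x "$dir" ]; then perms="${perms}x"; else perms="${perms}-"; fi
    echo "$perms"
}
```

For example, disk_access /cvmfs would be expected to print r-x on an interactive node if the table above is accurate.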

Ceph Transition

You will be responsible for moving your own files from /mu2e/data and /mu2e/data2 to /exp/mu2e/data. The reason for two data areas was an accident of what disk space was available when we asked that /mu2e/data be extended. We are consolidating both into /exp/mu2e/data.

Before copying your files to /exp/mu2e we ask that you audit your files to identify old files that you can delete or archive to tape. Please do not copy these files to /exp/mu2e. You can find files older than, say 4 years (1460 days), with the command:

find /mu2e/app/users/<your-username>  -type f -not -mtime -1460 -exec ls -ld {} \; 

If your old files have no archival value, please delete them. If they do have archival value, please archive them to tape; contact the Mu2e computing leadership if you need help archiving files to tape. Please complete this by Jan 12, 2024.
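
To judge how much space an audit could free, it may help to total the size of the old files rather than list them one by one. The sketch below assumes GNU find (for the -printf option); the function name old_file_summary is made up for this example, and the default of 1460 days is taken from the text above.

```shell
# Print the number and total size (in GiB) of regular files under a
# directory that were last modified more than a given number of days ago.
# Assumes GNU find; old_file_summary is a hypothetical helper name.
old_file_summary() {
    local dir="$1" days="${2:-1460}"
    # %s prints the size of each matching file in bytes, one per line.
    find "$dir" -type f -not -mtime "-$days" -printf '%s\n' |
        awk '{ n++; bytes += $1 }
             END { printf "%d files, %.2f GiB\n", n, bytes / (1024 * 1024 * 1024) }'
}
```

It would be called with the directory to audit, for example old_file_summary /mu2e/data/users/<your-username>.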

The recommended way to copy files from a /mu2e disk to a /exp/mu2e disk is:

cd /exp/mu2e/data/users/<yourname>
rsync -ar /mu2e/data/users/<yourname>/<directory_name> .
rm -rf /mu2e/data/users/<yourname>/<directory_name>

Check that rsync completed correctly before deleting the original. This rsync command recursively (-r) copies the directory named as the first positional argument into the current working directory, and it transfers the files in archive mode (-a), which preserves file metadata such as permissions and dates. Archive mode already implies recursion, so the -r is redundant but harmless. In one recent test it took about 5 minutes to copy 8 GB.
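
One possible way to check the copy before deleting the original is to compare the two directory trees. This is a minimal sketch, not an official procedure; the function name same_tree is made up for this example, and it compares file names and contents only, not permissions or dates.

```shell
# Succeed (exit status 0) only if two directory trees contain the same
# file names with the same contents. same_tree is a hypothetical name.
same_tree() {
    # -r recurses into subdirectories; -q reports only whether files differ.
    diff -rq "$1" "$2" > /dev/null
}

# Example use after the rsync step (the paths are placeholders):
#   same_tree /mu2e/data/users/<yourname>/<directory_name> \
#             /exp/mu2e/data/users/<yourname>/<directory_name> \
#     && rm -rf /mu2e/data/users/<yourname>/<directory_name>
```

Because the rm runs only when the comparison succeeds, a failed or partial copy leaves the original in place. Reading every file back takes time for large directories.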

Everyone with a quota on the /mu2e disks has a similarly sized quota on the /exp/mu2e disks.

The NAS disks had user based quotas. The Ceph disks have directory based quotas. That means that /exp/mu2e/app has a quota and we can set smaller quotas at any directory level. For example each user directory has a quota and each project directory has a quota.

Existing directories on /exp/mu2e/app

If you do not already have a directory /exp/mu2e/app/users/<yourname>, then the migration on Nov 15 will be simple. When the interactive machines are rebooted following the new downtime, your files will be in the new location. Some people do already have a directory /exp/mu2e/app/users/<yourname>; for those people, their migrated files will be at

/exp/mu2e/app/sync/users/<yourname>

Please use mv to move directories and files from the sync area to /exp/mu2e/app/users/<yourname>, taking care to not overwrite existing files. When done, delete your directory in the sync area.
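
The following sketch shows one way to move entries out of the sync area without overwriting anything that already exists in the destination. The function name move_without_overwrite is made up for this example. It only looks at top-level entries and skips hidden (dot) files, so anything it reports as skipped still has to be merged by hand.

```shell
# Move every top-level entry of a source directory into a destination
# directory, skipping any entry whose name already exists there.
# move_without_overwrite is a hypothetical helper name.
move_without_overwrite() {
    local src="$1" dest="$2" entry name
    # Note: the * glob does not match hidden (dot) entries.
    for entry in "$src"/*; do
        # Skip the unexpanded pattern when the source directory is empty.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        name=$(basename "$entry")
        if [ -e "$dest/$name" ] || [ -L "$dest/$name" ]; then
            echo "skipped (already exists): $name"
        else
            mv "$entry" "$dest/"
        fi
    done
}
```

A call would look like move_without_overwrite /exp/mu2e/app/sync/users/<yourname> /exp/mu2e/app/users/<yourname>.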

Reseating Symbolic Links

Many people have used the following pattern to make it easy to keep source code and binaries on the app disk while providing low-keystroke access to related files on the data disk:

cd /mu2e/app/users/<yourname>/<my project>
mkdir -p /mu2e/data/users/<yourname>/<my project>
ln -s /mu2e/data/users/<yourname>/<my project> out

Different people have used different names for the symbolic link, with the two most common being "data" and "out".

After you move your files from /mu2e/data(2) to /exp/mu2e/data, you will need to reseat your symbolic links, as follows:

cd /exp/mu2e/app/users/<yourname>/<my project>
rm out
ln -s /exp/mu2e/data/users/<yourname>/<my project> out
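
If you have many projects, reseating each link by hand is tedious. The sketch below finds every symbolic link under a directory whose target starts with an old prefix and points it at the same path under a new prefix. The function name reseat_links is made up for this example, and the sketch assumes GNU ln (for the -n option) and link targets written as absolute paths.

```shell
# Re-point symbolic links under a directory whose targets start with an
# old prefix so that they start with a new prefix instead.
# reseat_links is a hypothetical helper name.
reseat_links() {
    local top="$1" old="$2" new="$3" link target
    find "$top" -type l | while IFS= read -r link; do
        target=$(readlink "$link")
        case "$target" in
            "$old"/*)
                # -f replaces the existing link; -n stops ln from
                # following a link that points to a directory.
                ln -sfn "$new${target#"$old"}" "$link"
                echo "reseated: $link"
                ;;
        esac
    done
}
```

For the pattern above, the call would be reseat_links /exp/mu2e/app/users/<yourname> /mu2e/data /exp/mu2e/data. Links that point anywhere else are left untouched.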

Reseating Symbolic Links For the Computing Tutorials

Many people worked on the ComputingTutorials at the Mu2e Tutorial Day, Saturday Oct 4, 2023, or soon after. At that time the CEPH disks were named differently than they are now:

 /srv/mu2e/app
 /srv/mu2e/data

The tutorial instructions told you to use the symbolic link pattern described in the previous section.

Since that time, these directories have been renamed to start with /exp instead of /srv. Your files are now in the newly named locations.

If you worked on the tutorials at that time, when you return to your working area you will need to reseat the symbolic links to the data area.

 cd /exp/mu2e/app/users/<yourname>/Tutorial
 rm out
 ln -s /exp/mu2e/data/users/<yourname>/Tutorial out

Ceph Disk Notes

Route SNOW Tickets Directly to Ceph

https://fermi.servicenowservices.com/nav_to.do?uri=%2Fservice_offering.do%3Fsys_id%3Df3907a4e1b1321906ee0ea42f54bcb0e%26sysparm_view%3Dess%26sysparm_affiliation%3D

Quotas

To see your quota and used space on /exp/mu2e/app/users/<yourname>, /exp/mu2e/data/users/<yourname> and ~<yourname>, use the commands:

mu2einit
mu2eQuota

You can also look at the quotas and space used on the ceph disks for another user:

 mu2eQuota <other_user_name>

For more details, see the next section.


Quotas And Other Attributes

The Ceph disks have directory based quotas. For example, /exp/mu2e/app has a quota and each directory in /exp/mu2e/app/users has a quota. If a directory does not explicitly set a quota, then you should walk up the directory tree to find the first directory for which a quota is set; that quota is controlling. The default quota for a user directory in /exp/mu2e/app/users is 25 GiB and the default quota for a user directory in /exp/mu2e/data/users is 150 GiB. You may ask the Mu2e Offline Computing Coordinators to increase your quota; you will need to provide a good reason for your request.

To see the quota for a directory that has a quota:

getfattr -n ceph.quota.max_bytes /exp/mu2e/data/projects/tracker

On SL7 you can see all of the attributes of a directory using:

getfattr -d -m 'ceph.*' /exp/mu2e/data/projects/tracker

However, on AL9 the wildcard feature has been turned off and you can only get individual attributes by name, for example:

 getfattr -n "ceph.dir.rbytes" /exp/mu2e/data/projects/tracker

where the named attribute is the total number of bytes in the directory tree descended from the specified directory.

Here is the full set of named attributes that match the wildcard on SL7:

Name | Meaning
ceph.dir.entries | Number of entries in the specified directory, including files and directories.
ceph.dir.files | Number of files in the specified directory.
ceph.dir.rbytes | Number of bytes allocated to files in the directory tree (recursive).
ceph.dir.rctime | Intended to be the latest modification time of anything in the directory tree. It is known to be buggy.
ceph.dir.rentries | Number of entries in the directory tree (recursive), including files and directories.
ceph.dir.rfiles | Number of files in the directory tree (recursive).
ceph.dir.rsubdirs | Number of subdirectories in the directory tree (recursive).
ceph.dir.subdirs | Number of subdirectories in the specified directory.
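
The two byte counts, ceph.dir.rbytes and ceph.quota.max_bytes, can be combined into a short usage report. This is a sketch only: the function name quota_report is made up for this example, and treating a quota of 0 as "no quota set" is an assumption about how an unset quota is reported.

```shell
# Print usage as "used / quota GiB (percent)".
# Arguments: bytes used (ceph.dir.rbytes) and quota (ceph.quota.max_bytes).
# quota_report is a hypothetical helper name.
quota_report() {
    awk -v used="$1" -v max="$2" 'BEGIN {
        gib = 1024 * 1024 * 1024
        if (max + 0 <= 0) {
            # Assumption: 0 means that no quota is set on this directory.
            print "no quota set"
        } else {
            printf "%.1f / %.1f GiB (%.0f%%)\n", used / gib, max / gib, 100 * used / max
        }
    }'
}

# Example on a Ceph disk (directory name taken from the text above):
#   dir=/exp/mu2e/data/projects/tracker
#   quota_report "$(getfattr --only-values -n ceph.dir.rbytes "$dir")" \
#                "$(getfattr --only-values -n ceph.quota.max_bytes "$dir")"
```

The --only-values option makes getfattr print just the attribute value, which is what the function expects. For everyday use the mu2eQuota command described above is simpler.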

Ceph Snapshots

Ceph supports snapshots. See the general discussion of #Snapshots. There are two details of snapshots that are unique to the Ceph disks.

The snapshots exist at each level, for example:

/exp/mu2e/app/users/mu2epro/nightly/secondary/repo/.snap/_scheduled-2023-12-13-00_00_00_UTC_1099511627788/REve.log

There is a small glitch with ceph snapshots. If you ls a directory that is below a snapshot directory, the command will sometimes hang. But, if you open a file in that directory, it will work correctly. After you have opened the file, then the ls will work, perhaps with a small delay the first time.


Sharing Files

If you wish for your colleagues to be able to read or write files in Ceph diskspace that you own, use the normal unix group permissions. All members of mu2e are in the unix group named "mu2e".
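
As a small sketch of what this can look like in practice, the function below gives the owning group read access to a whole directory tree. The function name share_tree_with_group is made up for this example, and it assumes that the files already belong to the group mu2e (check with ls -l) and that the parent directories can be traversed by the group.

```shell
# Give the owning unix group read access to a directory tree.
# g+rX adds read permission, plus execute (search) permission on
# directories and on files that are already executable by someone.
# share_tree_with_group is a hypothetical helper name.
share_tree_with_group() {
    chmod -R g+rX "$1"
}
```

To let colleagues write as well, g+w would be added in the same way.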

Moving Files Across Quota Domains

Using the unix mv command to move files from one quota domain to another will actually do a copy and delete, not a true mv. Use rsync instead. In the example below, a and b stand for two different user directories, each with its own quota.

Instead of:

 mv /exp/mu2e/app/users/a/my_directory /exp/mu2e/app/users/b

Use:

 rsync -ar /exp/mu2e/app/users/a/my_directory /exp/mu2e/app/users/b
 rm -rf /exp/mu2e/app/users/a/my_directory

NAS Disks

Fermilab operates a large disk pool that is mounted over the network on many different interactive machines. It is not mounted on grid nodes. The pool is built using Network Attached Storage (NAS) systems from the BlueArc Corporation. This system has RAID 6 level error detection and correction.

As of 2023, Mu2e has a quota of about 90 TB, distributed as shown in the Mu2e Project disk section of the table above.

The disk space on /mu2e/data and /mu2e/data2 is intended as our primary disk space for event-data files, log files, ROOT files and so on. This space is not backed up.

If you want to run an application on the grid, the executable file(s) and the shared libraries can be delivered in two ways. If it is a pre-built release of the code, it will be available, read-only, on cvmfs, which is mounted on all grid nodes. If you are building your own custom code, that should be built on /mu2e/app, which is available on all the interactive nodes. See Muse for code building and making tarballs for submissions to the grid.

In the summer of 2023, these disks will be replaced with new disks based on the Ceph technology.

Snapshots

In the table above, some of the NAS disks are shown to be backed up. The full policy for backup to tape is available at the Fermilab Backup FAQ.

In addition to backup to tape, the NAS file system supports a feature known as snapshots, which works as follows. Each night the snapshot code runs and it effectively makes a hard link to every file in the filesystem. If you delete a file the next day, the blocks allocated to the file are still allocated to the snapshot version of the file. When the snapshot is deleted, the blocks that make up the file will be returned to the free list. So you have a window, after deleting a file, during which you can recover the file. If the file is small, you can simply copy it out of the snapshot. If the file is very large you can ask for it to be recreated in place.

On /mu2e/app and /exp/mu2e/app a snapshot is taken nightly and retained for 14 nights; so a deleted file can be recovered for up to 14 calendar days. Many years ago snapshots were also used on /mu2e/data but that is no longer done.

If you create a file during the working day, it will not be protected until the next snapshot is taken, on the following night. If you delete the file before the snapshot is taken, it is not recoverable.

After a file has been deleted, but while it is still present in a snapshot, space occupied by the file is not charged to the mu2e quota. This works because the disks typically have free space beyond that allocated to the various experiments. However it is always possible for an atypical usage pattern to eat up all available space. In such a case we can request that snapshots be removed.

How does this work? While the NAS file system looks to us as an nfs mounted unix filesystem, it is actually a much more powerful system. It has a front end that allows a variety of actions such as journaling and some amount of transaction processing. The snapshots take place in the front end layer.

You can view the snapshots of the file systems at, for example, /mu2e/app/.snapshot/, /grid/fermiapp/.snapshot/ and /exp/mu2e/app/.snap. Snapshots are read-only to us.

There are some features of snapshots that are unique to the Ceph disks; see #Ceph Snapshots.
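
To recover a small file you first need to find which snapshots still hold a copy. The sketch below lists the snapshot copies of one file in one directory. The function name snapshot_copies is made up for this example, and it assumes that the directory you name has a .snap or .snapshot subdirectory; on a disk where snapshots are only visible at the top level, pass the top-level directory and the relative path of the file.

```shell
# List the snapshot copies of one file. The snapshot directory is named
# ".snap" on the Ceph disks and ".snapshot" on the NAS disks.
# snapshot_copies is a hypothetical helper name.
snapshot_copies() {
    local dir="$1" file="$2" snapdir copy
    for snapdir in "$dir/.snap" "$dir/.snapshot"; do
        for copy in "$snapdir"/*/"$file"; do
            # The test also discards the unexpanded pattern when no
            # snapshot contains the file.
            [ -e "$copy" ] && echo "$copy"
        done
    done
    return 0
}
```

Once you have picked a copy, cp -p <snapshot copy> <destination> copies it back while keeping its modification time.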

Home Disks

The interactive nodes in GPCF and FNALU share the same home disks. Fermilab policy is that large files such as ROOT files, event-data files and builds of our code, should live in project space, not in our home areas. Therefore these home disks have small quotas. The home disks do not have enough disk space to build a release of mu2e Offline. Therefore the Mu2e getting started instructions tell you to build your code on our project disks. You can contact the Service desk to request additional quota but you will not get multiple GB.

The grid worker nodes do not see the home disk. When your job lands on a grid worker node, it lands in an empty directory.

As of the SL7 OS version, access to the home disk requires a kerberos ticket. The nfs system can cache your ticket, so you can continue to access the home area in an old window even after the ticket has expired, and you probably won't notice. This comes up more often in cron jobs, which may fail because they do not use the interactive kerberos patterns. If you have a cron job that sometimes can't access the home disk, see this article. You can also set up your cron job to use kcron, a kerberos-aware cron command.

Snapshots

Snapshots of your home disk are visible at:

ls ~/.snapshot

You will see that snapshots are made 4 times per day and kept for 30 days.

Sharing files

By default all of your home area is private, which makes it hard to share files with collaborators. You can copy files to /mu2e/app, or make a mu2e directory:

cd $HOME
mkdir mu2e
chmod 750 mu2e

This directory can remain group-readable, but other areas will revert to private automatically. The same can be done with other experiments, like "nova".

cvmfs

This is a distributed disk system that is described in cvmfs. It is used to provide pre-built releases of the code and UPS products to all users, interactive nodes, and grids.

dCache

This is a distributed disk system that is described in dCache. It has a very large capacity and is used for high-volume and high-throughput data interactively or in grid jobs. All grid jobs may read and write event data to/from dCache; it is not possible for grid jobs to move data to/from the NAS disks.


stashCache

There are cases where rather large files (more than a GB) have to be sent to every grid node. This might be a library of fit or simulation templates, or a set of pre-computed simulation distributions. CVMFS is best for many small files, but has a size limit. For this case stashCache is the ideal solution.

Mu2e website

The mu2e web site lives at /web/sites/mu2e.fnal.gov; this is visible from mu2egpvm*. Selected Mu2e members have read and write access to this area - ask offline management if you need to get access. For additional information see the instructions for the Mu2e web site. The space is run by the central web services and space is monitored here.

Disks for the group marsmu2e

There are two additional disks that are available only to members of the group marsmu2e; only a few Mu2e collaborators are members of this group. The group marsmu2e was created to satisfy access restrictions on the MCNP software that is used by MARS. Only authorized users may have read access to the MARS executable and its associated cross-section databases. This access control is enforced by creating the group marsmu2e, limiting membership in the group and making the critical files readable only by marsmu2e.

The two disks discussed here are /grid/fermiapp/marsmu2e, which has the same role as /grid/fermiapp/mu2e, and /grid/data/marsmu2e, which has the same role as /grid/data/mu2e.

This is discussed further on the pages that discuss running MARS for Mu2e.


Recommended use patterns

Here is a summary of recommended use patterns:

  • Personal utility scripts, analysis scripts, histograms and documents should go on the home area.
  • Builds of the offline code should go under /mu2e/app/users/$USER.
  • Small (<100 GB) datasets, such as analysis ntuples, should go under /mu2e/data/users/$USER.
  • Large datasets (>100 GB), and any dataset that is written or read in parallel from a grid job, should reside on scratch dCache: /pnfs/mu2e/scratch/users/$USER. This area will purge your old files without warning.
  • Datasets of widespread interest or semi-permanent usefulness should be uploaded to tape.