Disks

Introduction

There are several categories of disk space available at Fermilab: limited home areas, BlueArc disks for building code and holding small datasets, and dCache for large datasets and for sending data to tape.

When reading this section, pay careful attention to which disks are backed up. It is your responsibility to ensure that files you require to be backed up are kept on an appropriate disk. It is equally your responsibility to use the backed-up space wisely and not fill it with files that can easily be regenerated, such as root files, event-data files, object files, shared libraries and binary executables.

For recommendations on how to use these disks, please see recommendations and the data transfer page.

If you need to use one of the other disks, please make yourself a directory within the users area of that disk (/path/users, where /path is the disk's mount point). You must name the directory after your kerberos principal so that our management scripts know who you are.
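For example, a minimal sketch of creating such a directory, here using /mu2e/data as a hypothetical choice of disk (substitute the disk you actually need; $USER normally matches your kerberos principal):

cd /mu2e/data/users    # hypothetical choice of disk; use the users area of the disk you need
mkdir $USER            # $USER is normally the same as your kerberos principal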

The table below summarizes the information found in the sections that follow. An entry of the form (1) or (2) indicates that you should read the numbered note below the table.

Name                             Quota (GB)  Backed up?  Worker  Interactive  Purpose/Comments
Home Disks
/nashome                         5.2         Yes         ---     rwx          mu2egpvm*, mu2ebuild*, and FNALU only
/sim1                            20          Yes         ---     rwx          detsim only; in BlueArc space
Mu2e Project Disk on BlueArc
/mu2e/data                       75,161      No          ---     rwx          Event-data files, log files, ROOT files.
/mu2e/data2                      10,737      No          ---     rw-          Event-data files, log files, ROOT files.
/mu2e/app                        1,024       No          r-x     ---          Exe's and shared libraries. No data/log/root files.
/grid/fermiapp/mu2e              232         Yes         ---     rwx          Exe's and shared libraries. No data/log/root files.
Special Disks
/cvmfs                           -           Yes         r-x     r-x          readonly code distribution - all interactive and grid nodes
/pnfs                            -           No/Yes      ---     rwx          distributed data disks - all interactive nodes
Local Scratch
/scratch/mu2e/                   568         No          ---     rwx          detsim only; local disk.
Mu2e web Site
/web/sites/mu2e.fnal.gov/htdocs  8           Yes         ---     rwx          mounted on mu2egpvm* and FNALU (not detsim); see website instructions
Marsmu2e Project disk on BlueArc
/grid/data/marsmu2e              400         No          rw-     rw-          Event-data files, log files, ROOT files.
/grid/fermiapp/marsmu2e          30          Yes         r-x     rwx          Grid accessible executables and shared libraries

Notes on the table:

  1. The project and scratch spaces each have a subdirectory named users. To use these disks, make a subdirectory users/your_kerberos_principal and put your files under that subdirectory.
  2. The home disks have individual quotas. All others have only group quotas.
  3. The columns headed Worker and Interactive show the permissions with which each disk is mounted on, respectively, the grid worker nodes and the interactive nodes (detsim, mu2egpvm*). Full permissions are rwx, denoting read, write, and execute, respectively. If one of r, w, or x is replaced with a -, that permission is missing on the indicated machines. If the permission is given as ---, the disk is not mounted on the indicated machines. Withholding w or x permission from some partitions is a security measure, discussed below.

BlueArc Disks

Fermilab operates a large disk pool that is mounted over the network on many different machines, including the interactive nodes. It is not mounted on grid worker nodes. The pool is built using Network Attached Storage (NAS) systems from the BlueArc Corporation, with RAID 6 level error detection and correction.

This pool is shared by all Intensity Frontier experiments. As of 2017, Mu2e has a quota of about 90 TB, distributed as shown in the table above. Each year the Computing Division purchases additional BlueArc systems, and each year Mu2e gets additional quota on the new systems.

The disk space on /mu2e/data is intended as our primary disk space for event-data files, log files, ROOT files, and so on. These disks are mounted noexec on all machines; if you put a script or an executable file in this disk space and attempt to execute it, you will get a file permission error. This space is not backed up.
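For example (a hypothetical session, with a made-up script name), trying to run a script that lives under /mu2e/data fails even if the file has its execute bit set:

chmod +x /mu2e/data/users/$USER/hello.sh   # hypothetical script; chmod itself succeeds
/mu2e/data/users/$USER/hello.sh            # the shell reports "Permission denied" because of the noexec mount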

If you want to run an application on the grid, the executable file(s) and shared libraries can be delivered in two ways. A pre-built release of the code is available, read-only, on cvmfs, which is mounted on all grid nodes. If you are building your own custom code, it should be built on /mu2e/app, which is available on all the interactive nodes; to run this custom code on the grid, it must be made into a tarball that is sent with the submission.
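A minimal sketch of the tarball step, assuming a hypothetical build directory named MyWorkDir under your /mu2e/app area (the grid submission tools may provide their own mechanism for shipping code, so check their documentation):

cd /mu2e/app/users/$USER
tar czf MyWorkDir.tar.gz MyWorkDir   # hypothetical directory name; pass the tarball to your grid submission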


BlueArc Execute and Write Permissions

In the table above, one can see that some disk partitions are either not executable or not writable on certain nodes; this is a primitive security precaution, designed in the days when these disks were mounted on the grid nodes. Suppose that an unauthorized user gained access to a grid worker node; that person could not write malware onto /grid/fermiapp/mu2e or /mu2e/app, both of which were write-protected on grid worker nodes. That person could write malware onto the data disks; however, none of those disks are executable on the interactive nodes. Therefore, an unauthorized user who gained access to a worker node could not deposit executable malware into a place from which it could be executed on one of the interactive nodes.

BlueArc Snapshots

In the table above, some of the BlueArc disks are shown to be backed up. The full policy for backup to tape is available in the Fermilab Backup FAQ.

In addition to backup to tape, the BlueArc file system supports a feature known as snapshots, which works as follows. Each night the snapshot code runs and effectively makes a hard link to every file in the filesystem. If you delete a file the next day, the blocks allocated to the file are still allocated to the snapshot version of the file; only when the snapshot is deleted are those blocks returned to the free list. So you have a window, after deleting a file, during which you can recover it. If the file is small, you can simply copy it out of the snapshot. If the file is very large, you can ask for it to be recreated in place.

On the data disks, a snapshot is taken nightly and then deleted the next night; so once a file has been deleted it will be recoverable for the remainder of the working day. On /grid/fermiapp and /mu2e/app, a snapshot is taken nightly and retained for 4 nights; so a deleted file can be recovered for up to 4 calendar days.

If you create a file during the working day, it will not be protected until the next snapshot is taken, on the following night. If you delete the file before the snapshot is taken, it is not recoverable.

After a file has been deleted, but while it is still present in a snapshot, the space occupied by the file is not charged to the Mu2e quota. This works because the disks typically have free space beyond that allocated to the various experiments. However, it is always possible for an atypical usage pattern to eat up all available space; in such a case we can request that snapshots be removed.

How does this work? While the BlueArc file system looks to us like an NFS-mounted Unix filesystem, it is actually a much more powerful system. It has a front end that supports features such as journaling and some amount of transaction processing; the snapshots take place in this front-end layer.

You can view the snapshots of the file systems at, for example, /mu2e/app/.snapshot/ or /grid/fermiapp/.snapshot/. Snapshots are read-only to us.
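For example, a small deleted file can be recovered by copying it back out of a snapshot (the snapshot and file names below are hypothetical; list the .snapshot directory to see what actually exists):

ls /mu2e/app/.snapshot/                                        # see which snapshots are available
cp /mu2e/app/.snapshot/some_snapshot/users/$USER/lost_file .   # hypothetical snapshot and file names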

Home Disks

The interactive nodes in GPCF and FNALU share the same home disks. Fermilab policy is that large files, such as ROOT files, event-data files, and builds of our code, should live in project space, not in our home areas; therefore these home disks have small quotas. The home disks do not have enough space to build a release of Mu2e Offline, so the Mu2e getting-started instructions tell you to build your code on our project disks. You can contact the Service Desk to request additional quota, but you will not get more than a few additional GB.

The grid worker nodes do not see the home disk. When your job lands on a grid worker node, it lands in an empty directory.

As of the SL7 OS version, access to the home disk requires a kerberos ticket. The NFS system caches your ticket, so you can continue to access the home area in an old window even after the ticket has expired; you probably won't notice. This comes up more often in cron jobs, which may fail because they do not use the interactive kerberos patterns. If you have a cron job that sometimes can't access the home disk, see this article. You can also set up your cron job to use kcron, a kerberos-aware cron command.
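A minimal sketch of the kcron approach, assuming the kcroninit and kcron utilities are available on the node (the script path is a made-up placeholder; see the linked article for current details):

kcroninit                                 # one-time setup: creates a keytab for kcron to use
crontab -e                                # then prefix the command in your crontab with kcron, e.g.:
# 0 3 * * * kcron /path/to/myscript.sh    # example crontab entry with a hypothetical script path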

Sharing files

By default all of your home area is private, which makes it hard to share files with collaborators. You can copy files to /mu2e/app, or make a mu2e directory in your home area:

cd $HOME
mkdir mu2e
chmod 750 mu2e

This directory can remain group-readable, while other areas of your home directory will be reset to private automatically. The same can be done for other experiment groups, such as "nova".

cvmfs

This is a distributed disk system, described on the cvmfs page. It provides pre-built releases of the code and UPS products to all users, on both the interactive nodes and the grid nodes.
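For example, on any node with cvmfs mounted you can browse the Mu2e repository directly (the repository path below is an assumption; check the cvmfs page for the current name):

ls /cvmfs/mu2e.opensciencegrid.org/    # pre-built releases and UPS products appear here, read-only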

dCache

This is a distributed disk system, described on the dCache page. It has a very large capacity and is used for high-volume, high-throughput data access, both interactively and in grid jobs. All grid jobs should read and write event data to/from dCache and avoid the BlueArc data disks.
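For example, a sketch of copying a file to your scratch dCache area with the ifdh tool, assuming ifdh is set up in your environment (the file name is hypothetical; see the dCache page for the recommended transfer tools):

ifdh cp myoutput.root /pnfs/mu2e/scratch/users/$USER/myoutput.root   # hypothetical file name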

stashCache

Sometimes a rather large file (more than a GB) has to be sent to every grid node; this might be a library of fit or simulation templates, or a set of pre-computed simulation distributions. CVMFS is best suited to many small files and has a size limit, so for this case stashCache is the ideal solution.

Mu2e website

The Mu2e web site lives at /web/sites/mu2e.fnal.gov; this is visible from mu2egpvm*. Selected Mu2e members have read and write access to this area - ask offline management if you need access. For additional information see the instructions for the Mu2e web site. The space is run by the central web services and its usage is monitored here.

Disks for the group marsmu2e

There are two additional disks that are available only to members of the group marsmu2e; only a few Mu2e collaborators are members of this group. The group marsmu2e was created to satisfy access restrictions on the MCNP software that is used by MARS. Only authorized users may have read access to the MARS executable and its associated cross-section databases. This access control is enforced by creating the group marsmu2e, limiting membership in the group, and making the critical files readable only by marsmu2e.

The two disks discussed here are /grid/fermiapp/marsmu2e, which has the same role as /grid/fermiapp/mu2e, and /grid/data/marsmu2e, which has the same role as /grid/data/mu2e.

This is discussed further on the pages that discuss running MARS for Mu2e.


Recommended use patterns

Here is a summary of recommended use patterns:

  • Personal utility scripts, analysis scripts, histograms, and documents should go in your home area.
  • Builds of the offline code should go under /mu2e/app/users/$USER (see the sketch after this list for creating these personal directories).
  • Small (<100 GB) datasets, such as analysis ntuples, should go under /mu2e/data/users/$USER.
  • Large datasets (>100 GB), and any dataset that is written or read in parallel from a grid job, should reside on scratch dCache: /pnfs/mu2e/scratch/users/$USER. This area will purge your old files without warning.
  • Datasets of widespread interest or semi-permanent usefulness should be uploaded to tape.
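A minimal sketch of creating these personal areas, assuming the parent users directories already exist (the normal situation) and that none of your own directories have been made yet:

mkdir -p /mu2e/app/users/$USER             # code builds
mkdir -p /mu2e/data/users/$USER            # small datasets, analysis ntuples
mkdir -p /pnfs/mu2e/scratch/users/$USER    # large or grid-accessed datasets; purged without warning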