Disks

Revision as of 20:31, 30 March 2017

This page is a draft, please help complete it!

(how to use code disk on grid)


Introduction

There are several categories of disk space available at Fermilab. There are limited home areas, BlueArc disks for building code and small datasets, and dCache for large datasets and sending data to tape.

When reading this section, pay careful attention to which disks are backed up. It is your responsibility to ensure that files you require to be backed up are kept on an appropriate disk. It is equally your responsibility to use the backed-up space wisely and not fill it with files that can easily be regenerated, such as ROOT files, event-data files, object files, shared libraries and binary executables.

If you need to use one of the other disks, please make yourself a directory within the /path/users area of that disk. You must name the directory with your kerberos principal so that our management scripts know who you are.
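As a minimal sketch of the directory convention described above, the helper below creates a per-user area under a disk's users directory. The function name make_user_area is hypothetical, and the default of using the login name from id -un assumes that your login name matches your kerberos principal; pass the principal explicitly if it does not.

```shell
# Create <disk>/users/<principal> and print the resulting path.
# Assumption: the login name (id -un) equals the kerberos principal
# unless a second argument is given.
make_user_area() {
    disk="$1"                      # e.g. /mu2e/data
    principal="${2:-$(id -un)}"
    if [ ! -d "$disk/users" ]; then
        echo "no users area under $disk" >&2
        return 1
    fi
    mkdir -p "$disk/users/$principal" && echo "$disk/users/$principal"
}
```

For example, make_user_area /mu2e/data would create /mu2e/data/users/your_login_name if the users area exists on that disk.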


The table below summarizes the information found in the sections that follow. An entry of the form (1) or (2) indicates that you should read the numbered note below the table.

| Name | Quota (GB) | Backed up? | Worker | Interactive | Purpose/Comments |
|---|---|---|---|---|---|
| Home Disks | | | | | |
| /nashome | 5.2 | Yes | --- | rwx | mu2egpvm* and FNALU only |
| /sim1 | 20 | Yes | --- | rwx | detsim only; in BlueArc space |
| Mu2e Project Disk on BlueArc | | | | | |
| /mu2e/data | 75,161 | No | rw- | rw- | Event-data files, log files, ROOT files. |
| /mu2e/data2 | 10,737 | No | rw- | rw- | Event-data files, log files, ROOT files. |
| /mu2e/app | 1,024 | No | r-x | rwx | Grid accessible executables and shared libraries. No data/log/root files. |
| /grid/fermiapp/mu2e | 232 | Yes | r-x | rwx | Grid accessible executables and shared libraries. No data/log/root files. |
| /grid/app/mu2e | 30 | Yes | rwx | rw- | See the discussion below. |
| Special Disks | | | | | |
| /cvmfs | - | Yes | r-x | r-x | read-only code distribution; all interactive and grid nodes |
| /pnfs | - | No/Yes | --- | rwx | distributed data disks; all interactive nodes |
| Local Scratch | | | | | |
| /scratch/mu2e/ | 10,737 | No | --- | rwx | mu2egpvm* only; NFS mounted from gpcf015. |
| /scratch/mu2e/ | 568 | No | --- | rwx | detsim only; local disk. |
| Mu2e Web Site | | | | | |
| /web/sites/mu2e.fnal.gov/htdocs | 8 | Yes | --- | rwx | mounted on mu2egpvm* and FNALU (not detsim); see Website instructions. |
| Marsmu2e Project Disk on BlueArc | | | | | |
| /grid/data/marsmu2e | 400 | No | rw- | rw- | Event-data files, log files, ROOT files. |
| /grid/fermiapp/marsmu2e | 30 | Yes | r-x | rwx | Grid accessible executables and shared libraries |

Notes on the table:

  1. The project and scratch spaces each have a subdirectory named users. To use these disks, make a subdirectory users/your_kerberos_principal and put your files under that subdirectory.
  2. The home disks have individual quotas. All others have only group quotas.
  3. Despite the name asymmetry, /mu2e/app is intended as additional space for the role of /grid/fermiapp/mu2e/, not for that of /grid/app/mu2e.
  4. The columns headed Worker and Interactive show the permissions with which each disk is mounted on the grid worker nodes and on the interactive nodes (detsim, mu2egpvm*), respectively. Full permissions are rwx, denoting read, write, and execute. If one of rwx is replaced with a -, that permission is missing on the indicated machine. If the permission is given as ---, the disk is not mounted on the indicated machine. Withholding w or x permission on some partitions is a security measure, discussed below.
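
The permission notation in note 4 can be illustrated with a small decoder. This is only a sketch for reading the table; the function name describe_perms is hypothetical and it does not query any real mount.

```shell
# Translate a permission triple from the table (e.g. "r-x") into words.
# "---" is treated as "not mounted", following note 4.
describe_perms() {
    perms="$1"
    if [ "$perms" = "---" ]; then
        echo "not mounted"
        return 0
    fi
    out=""
    case "$perms" in r??) out="read" ;; esac
    case "$perms" in ?w?) out="${out:+$out, }write" ;; esac
    case "$perms" in ??x) out="${out:+$out, }execute" ;; esac
    echo "$out"
}
```

For example, describe_perms r-x prints "read, execute", which is how /mu2e/app appears on the grid worker nodes.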

BlueArc Disks

Fermilab operates a large disk pool that is mounted over the network on many different machines, including detsim, the GPCF interactive nodes, the GPCF local batch nodes and the GPGrid worker nodes. It is not mounted on most grid worker nodes outside of GPGrid and it is not mounted on FNALU. The pool is built using Network Attached Storage systems from the BlueArc Corporation. This system has RAID 6 level error detection and correction.

This pool is shared by all Intensity Frontier experiments. As of 2017, Mu2e has a quota of about 90 TB, distributed as shown in the table above. Each year the Computing Division purchases additional BlueArc systems, and each year Mu2e gets additional quota on the new systems.

The following admonition is taken from the GPCF Getting Started page:

It is very important to not have all of your hundreds of grid jobs all accessing the BlueArc disk at the same time. Use the MVN and CPN commands (just like the unix mv and cp commands, except they queue up to spare BlueArc the trauma of too many concurrent accesses) to copy data on to and off of the BlueArc disks.

Additional information about this is available on the Mu2e Fermigrid page. See, in particular, the sections on CPN, staging input files, and staging output files.
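
As a hedged illustration of the admonition above, the wrapper below prefers the queueing copy command when it is present. It assumes the command is installed under the lowercase name cpn and takes source and destination arguments like cp, as the quoted text describes; how cpn is made available on a given node is not covered here. The function name throttled_copy is hypothetical.

```shell
# Copy one file, preferring the queueing cpn command when it is available.
# Assumption: the command is on PATH under the lowercase name "cpn"
# and accepts "source destination" like cp.
throttled_copy() {
    src="$1"
    dest="$2"
    if command -v cpn >/dev/null 2>&1; then
        cpn "$src" "$dest"
    else
        # Fallback for machines without cpn; do not rely on this
        # inside grid jobs that read or write BlueArc disks.
        echo "warning: cpn not found, using plain cp" >&2
        cp "$src" "$dest"
    fi
}
```

The warning on the fallback path is deliberate: a grid job that silently uses plain cp would defeat the purpose of the queue.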

The disk spaces /grid/data/mu2e and /mu2e/data are intended as our primary disk space for event-data files, log files, ROOT files and so on. These disks are mounted as noexec on all machines; therefore, if you put a script or an executable file in this disk space, it cannot be executed; if you attempt to execute a file in this disk space you will get a file permission error. Why are there two separate file systems? When we needed disk space beyond our initial allocation, the server holding the first block of space was full so we were given space on a new disk server. Neither of these areas is backed up.

If you want to run an application on the grid, the executable file(s) and the shared libraries for that application must reside on /mu2e/app or /grid/fermiapp/mu2e; this includes both the standard software releases of the experiment and any personal code that will be run on the grid. The recommended use is to compile code on one of the interactive nodes and place the executables and .so files in either /mu2e/app or /grid/fermiapp/mu2e. Because this disk space is executable on all of detsim, GPCF, and the GP Grid worker nodes, it is straightforward to develop and debug jobs interactively and then to submit the long jobs to the grid.
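
The recommended workflow can be sketched as a small staging step run on an interactive node. The function name stage_for_grid and the bin/ and lib/ layout of the build area are assumptions made for illustration; they are not the layout of any particular Mu2e release.

```shell
# Copy executables and shared libraries from a build area to a
# grid-visible area such as /mu2e/app/users/<principal>/<project>.
# Assumption: the build area has bin/ and lib/ subdirectories.
stage_for_grid() {
    build_dir="$1"
    dest_dir="$2"
    mkdir -p "$dest_dir/bin" "$dest_dir/lib" || return 1
    for f in "$build_dir"/bin/*; do
        # Only regular files that carry an execute bit are staged.
        if [ -f "$f" ] && [ -x "$f" ]; then
            cp -p "$f" "$dest_dir/bin/" || return 1
        fi
    done
    for f in "$build_dir"/lib/*.so; do
        if [ -f "$f" ]; then
            cp -p "$f" "$dest_dir/lib/" || return 1
        fi
    done
    return 0
}
```

The -p option keeps the mode bits, so files that were executable in the build area remain executable in the destination.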

/grid/app/mu2e

For the foreseeable future, Mu2e will not use /grid/app/mu2e for its intended purpose. This file system is intended for users who are authorized to use FermiGrid but who do not have access to interactive machines that mount the equivalent of /grid/fermiapp for their group. Such users can, within a grid job, copy their executables to their space on /grid/app and then execute those applications. Or they can compile and link an executable during one grid job and leave it on /grid/app for future grid jobs to use. Under most circumstances we should develop and test our code on detsim or GPCF; then put the debugged executable on either /grid/fermiapp/mu2e or /mu2e/app; then submit grid jobs that use those executables.

BlueArc Execute and Write Permissions

In the table above, one can see that some disk partitions are either not executable or not writable on certain nodes; this is a primitive security precaution. Suppose that an unauthorized user gains access to a grid worker node; that person cannot write malware onto /grid/fermiapp/mu2e or /mu2e/app, both of which are write protected on grid worker nodes. That person can write malware onto the data disks or onto /grid/app/mu2e; however, none of those disks is executable on the interactive nodes. Therefore, if an unauthorized user gains access to a worker node, they cannot deposit executable malware into a place from which it can be executed on one of the interactive nodes.

BlueArc Snapshots

In the table above, some of the BlueArc disks are shown to be backed up. The full policy for backup to tape is available at the Fermilab Backup FAQ.

In addition to backup to tape, the BlueArc file system supports a feature known as snapshots, which works as follows. Each night the snapshot code runs and effectively makes a hard link to every file in the filesystem. If you delete a file the next day, the blocks allocated to the file are still allocated to the snapshot version of the file. When the snapshot is deleted, the blocks that make up the file are returned to the free list. So you have a window, after deleting a file, during which you can recover the file. If the file is small, you can simply copy it out of the snapshot. If the file is very large, you can ask for it to be recreated in place.

On the data disks, a snapshot is taken nightly and then deleted the next night; so once a file has been deleted it will be recoverable for the remainder of the working day. On /grid/fermiapp and /mu2e/app, a snapshot is taken nightly and retained for 4 nights; so a deleted file can be recovered for up to 4 calendar days.

If you create a file during the working day, it will not be protected until the next snapshot is taken, on the following night. If you delete the file before the snapshot is taken, it is not recoverable.

After a file has been deleted, but while it is still present in a snapshot, the space occupied by the file is not charged to the mu2e quota. This works because the disks typically have free space beyond that allocated to the various experiments. However, it is always possible for an atypical usage pattern to eat up all available space. In such a case we can request that snapshots be removed.

How does this work? While the BlueArc file system looks to us like an NFS-mounted Unix filesystem, it is actually a much more powerful system. It has a front end that allows a variety of actions such as journaling and some amount of transaction processing. The snapshots take place in the front-end layer of BlueArc.

You can view the snapshots of the file systems at, for example, /mu2e/app/.snapshot/ or /grid/fermiapp/.snapshot/. Snapshots are read-only to us.
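
Copying a small file back out of a snapshot, as described above, might look like the following sketch. The function name restore_from_snapshot is hypothetical, and the names of the directories inside .snapshot/ are not given on this page, so the snapshot directory is passed in as an argument rather than guessed.

```shell
# Copy a deleted file out of a read-only snapshot directory.
#   $1: a snapshot directory, e.g. one entry listed under /mu2e/app/.snapshot/
#   $2: path of the file relative to the top of the disk
#   $3: where to put the recovered copy
restore_from_snapshot() {
    snap_dir="$1"
    rel_path="$2"
    dest="$3"
    src="$snap_dir/$rel_path"
    if [ ! -f "$src" ]; then
        echo "not present in snapshot: $src" >&2
        return 1
    fi
    mkdir -p "$(dirname "$dest")" && cp -p "$src" "$dest"
}
```

List the .snapshot/ directory first to see which snapshots exist, then pass one of them as the first argument. This only applies to small files; for very large files the text above recommends asking for the file to be recreated in place.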

Home Disks

The interactive nodes in GPCF and FNALU share the same home disks. Fermilab policy is that large files such as ROOT files, event-data files and builds of our code should live in project space, not in our home areas. These home disks therefore have small quotas, which is why the Mu2e getting started instructions tell you to build your code on our project disks. You can contact the Service Desk to request additional quota, but you will not get multiple GB.

The home disks on detsim are different from those mounted on GPCF and FNALU. They are mounted only on detsim and nowhere else; these are the same home disks that were previously mounted on ilcsim and ilcsim2.

The grid worker nodes do not see either of these home disks. When your job lands on a grid worker node, it lands in an empty directory.

The home disks do not have enough disk space to build a release of Mu2e Offline.

Local Scratch Disks

On both GPCF and detsim there is scratch space available for general Mu2e use. The mu2egpvm* scratch disk is not visible on detsim and vice-versa. Neither scratch disk is visible on the grid worker nodes or FNALU.

cvmfs

This is a distributed disk system that is described in cvmfs. It is used to provide pre-built releases of the code and UPS products to all users, interactive nodes, and grids.

dCache

This is a distributed disk system that is described in dCache. It has a very large capacity and is used for high-volume and high-throughput data access, interactively or in grid jobs. It is preferred over any BlueArc data disk.

Mu2e website

The mu2e web site lives at /web/sites/mu2e.fnal.gov; this is visible from mu2egpvm* and from FNALU but not from detsim. All Mu2e members have read and write access to this disk space. For additional information see the instructions for the Mu2e web site.

Disks for the group marsmu2e

There are two additional disks that are available only to members of the group marsmu2e; only a few Mu2e collaborators are members of this group. The group marsmu2e was created to satisfy access restrictions on the MCNP software that is used by MARS. Only authorized users may have read access to the MARS executable and its associated cross-section databases. This access control is enforced by creating the group marsmu2e, limiting membership in the group and making the critical files readable only by marsmu2e.

The two disks discussed here are /grid/fermiapp/marsmu2e, which has the same role as /grid/fermiapp/mu2e, and /grid/data/marsmu2e, which has the same role as /grid/data/mu2e.

This is discussed further on the pages that discuss running MARS for Mu2e.


Under construction!

In the above table, full permissions are rwx, which denote read, write, execute, respectively. If any of rwx is missing in a cell in the table, then that permission is absent.


The disk space /grid/data/mu2e and /mu2e/data are intended as our primary disk space for data and MC events. Why are there two separate file systems? When we wanted more disk space, the server holding the first block of space was full so we were given space on a new disk server.

If you want to run an application on the grid, the executable file(s) and the shared libraries for that application should reside on /grid/fermiapp/mu2e; this includes both the standard software releases of the experiment and any personal code that will be run on the grid. Since this disk space is executable on all of detsim, GPCF, and the worker nodes, it is relatively straightforward to develop and debug jobs interactively and then to submit the long jobs to the grid.

The gymnastics with the x permission is a security precaution; any file system that is writable from the grid is NOT executable on detsim. The scenario against which this protects is a rogue grid user writing malware to /grid/data; that malware will not be executable on detsim and, therefore, cannot do damage on detsim (unless you copy the executable file to another disk and then execute it).


Mu2e will not normally use /grid/app/mu2e. The /grid/app file system is intended for users who are authorized to use FermiGrid but who do not have the equivalent of /grid/fermiapp for their group. Such users can, within a grid job, copy their executables to their space on /grid/app and then execute those applications. Or they can compile and link an executable during one grid job and leave it on /grid/app for future grid jobs to use. Under most circumstances we should be developing and testing our code on detsim, putting the executable on /grid/fermiapp/mu2e and then submitting grid jobs that use the application on /grid/fermiapp/mu2e.


For these disks, the servers are configured to enforce quotas on a per-group basis; there are no individual quotas. To examine the usage and quotas for mu2e you can issue the following command on detsim or any of mu2egpvm*:

quota -gs mu2e

The -s option tells quota to display sizes in convenient units rather than always choosing bytes. On mu2egpvm02 the output will look like: