Disks


===Introduction===
There are several categories of disk space available at Fermilab. These include limited home areas, Mu2e project disks for building code and small datasets, [[Dcache | dCache]] (/pnfs) for large datasets and sending data to tape, and a wide-area read-only disk ([[Cvmfs | /cvmfs]]) for distribution of code and some auxiliary data files.


When reading this section pay careful attention to which disks are backed up. It is your responsibility to ensure that files you require to be backed up are kept on an appropriate disk. It is equally your responsibility to use the backed up space wisely and not fill it with files that can easily be regenerated, such as root files, event-data files, object files, shared libraries and binary executables.


To learn where you may create your own directory on the project and dCache disks, see [[#Recommended_use_patterns]] and the [[DataTransfer| data transfer page]].  When you do make your own directory, you must name it using your kerberos principal (your Fermilab username).


The table below summarizes the information found in the sections that follow.


{| class="wikitable"
! Name ||  Quota (GB) ||  Backed up? ||  Worker   || Interactive ||    Purpose/Comments
|-
! colspan="6" scope="row" | User Home Disks
|-
| /nashome || 5.2 || Yes || --- || rwx || mu2egpvm*, mu2ebuild*, and FNALU only
|-
! colspan="6" scope="row" | Mu2e Project Disk on Ceph (Phased in during fall 2023 - please start new work here)
|-
| /exp/mu2e/data || 87,961 || No || --- || rwx || Event-data files, log files, ROOT files.
|-
| /exp/mu2e/app || 3,848 || No || --- || rwx || Exe's and shared libraries. No data/log/root files.
|-
| /grid/fermiapp/mu2e || 232 || Yes || --- || rwx || Deprecated. Not for use by general users.
|-
! colspan="6" scope="row" | Special Disks
|-
| /cvmfs || - || [[Cvmfs#Cvmfs_is_not_backed_up | Indirectly]] || r-x || r-x || read-only code distribution - all interactive and grid nodes
|-
| /pnfs || - || No/Yes || --- || rwx || distributed data disks - all interactive nodes
|-
! colspan="6" scope="row" | Mu2e web Site
|-
| /web/sites/mu2e.fnal.gov/htdocs || 8 || Yes || --- || rwx || mounted on mu2egpvm* and FNALU; see [[#website|website instructions]]
|-
! colspan="6" scope="row" | Marsmu2e Project disk on NAS
|-
| /grid/data/marsmu2e || 400 || No || rw- || rw- || Event-data files, log files, ROOT files.
|-
| /grid/fermiapp/marsmu2e || 30 || Yes || r-x || rwx || Grid accessible executables and shared libraries
|}




Notes on the table:
#The project and scratch spaces each have a subdirectory named users. To use these disks, make a subdirectory users/your_kerberos_principal and put your files under that subdirectory.
#The Ceph disks have directory-tree-based quotas.
#The columns headed Worker and Interactive show the permission with which each disk is mounted on, respectively, the grid worker nodes and the interactive nodes (mu2egpvm*, mu2ebuild02). In the above table, full permissions are rwx, which denote read, write and execute, respectively. If one of rwx is replaced with a - then that permission is missing on the indicated machine. If the permission is given as ---, then that disk is not mounted on the indicated machine. Some partitions deliberately lack w or x permission as a security measure, discussed below.
 
== Ceph Transition ==


You will be responsible for moving your own files from /mu2e/data and /mu2e/data2 to /exp/mu2e/data.  The reason for two data areas was an accident of what disk space was available when we asked that /mu2e/data be extended.  We are consolidating both into /exp/mu2e/data.


Before copying your files to /exp/mu2e we ask that you audit your files to identify old files that you can delete or archive to tape. Please do not copy these files to /exp/mu2e. You can find files older than, say 4 years (1460 days), with the command:


  find /mu2e/app/users/<your-username>  -type f -not -mtime -1460 -exec ls -ld {} \;


If your old files have no archival value, please delete them.  If they do have archival value, please archive them to tape; contact the Mu2e computing leadership if you need help archiving files to tape. Please complete this by Jan 12, 2024.
Additional information about this is available on the [[DataTransfer| data transfer page]].
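Before deleting anything, it can help to total the space that the old files occupy. A minimal sketch, run here against a throwaway directory standing in for your real user area:

```shell
# Size up old files; a temporary directory stands in for /mu2e/data/users/<yourname>.
d=$(mktemp -d)
dd if=/dev/zero of="$d/old.dat" bs=1024 count=4 2>/dev/null
touch -d "5 years ago" "$d/old.dat"   # backdate the file so it looks old
# Same age cut as above: files not modified in the last 1460 days
find "$d" -type f -not -mtime -1460 -printf '%s\n' | awk '{s+=$1} END {print s " bytes in old files"}'
```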


The recommended way to copy files from a /mu2e disk to a /exp/mu2e disk is:


  cd /exp/mu2e/data/users/<yourname>
  rsync -ar /mu2e/data/users/<yourname>/<directory_name> .
  rm -rf /mu2e/data/users/<yourname>/<directory_name>


Check that rsync completed correctly before deleting the original.  This rsync command will recursively (-r) copy the directory named as the first positional argument to the current working directory, and it will transfer the files in archive mode (-a), which preserves file metadata such as permissions and dates.  In one recent test it took about 5 minutes to copy 8 GB.


Everyone with a quota on the /mu2e disks has a similarly sized quota on the /exp/mu2e disks.


The NAS disks had user based quotas.  The Ceph disks have directory based quotas. That means that /exp/mu2e/app has a quota and we can set smaller quotas at any directory level.   For example each user directory has a quota and each project directory has a quota.


=== Existing directories on /exp/mu2e/app ===


If you do not already have a directory /exp/mu2e/app/users/<yourname>, then the migration on Nov 15 will be simple.  When the interactive machines are rebooted following the downtime, your files will be in the new location. Some people do already have a directory /exp/mu2e/app/users/<yourname>; for those people, their migrated files will be at


  /exp/mu2e/app/sync/users/<yourname>


Please use mv to move directories and files from the sync area to /exp/mu2e/app/users/<yourname>, taking care to not overwrite existing files.  When done, delete your directory in the sync area.
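The "move without overwriting" step can be sketched as follows, with temporary directories standing in for the sync area and your user area:

```shell
# Empty the sync area without clobbering existing files; throwaway directories
# stand in for /exp/mu2e/app/sync/users/<yourname> and /exp/mu2e/app/users/<yourname>.
sync_area=$(mktemp -d); user_area=$(mktemp -d)
echo migrated > "$sync_area/a.txt"
echo existing > "$user_area/b.txt"
mv -n "$sync_area"/* "$user_area"/   # -n: never overwrite an existing destination
ls "$user_area" | sort
```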
 
=== Reseating Symbolic Links===
 
Many people have used the following pattern to make it easy to keep source code and binaries on the app disk while providing low-keystroke access to related files on the data disk:
 
  cd /mu2e/app/users/<yourname>/<my project>
  mkdir -p /mu2e/data/users/<yourname>/<my project>
  ln -s /mu2e/data/users/<yourname>/<my project> out
 
Different people have used different names for the symbolic link, with the two most common being "data" and "out".
 
After you move your files from /mu2e/data(2) to /exp/mu2e/data, you will need to reseat your symbolic links, as follows:
 
  cd /exp/mu2e/app/users/<yourname>/<my project>
  rm out
  ln -s /exp/mu2e/data/users/<yourname>/<my project> out
 
=== Reseating Symbolic Links For the Computing Tutorials ===
 
Many people worked on the [[ComputingTutorials]] at the Mu2e Tutorial Day, Saturday Oct 4, 2023, or soon after.  At that time the CEPH disks were named differently than they are now:
 
  /srv/mu2e/app
  /srv/mu2e/data
 
The tutorial instructions told you to use the symbolic link pattern described in the previous section.
 
Since that time, these directories have been renamed: /exp replaces /srv.  Your files are now in the newly named locations.
 
If you worked on the tutorials at that time, when you return to your working area you will need to reseat the symbolic links to the data area.
 
  cd /exp/mu2e/app/users/<yourname>/Tutorial
  rm out
  ln -s /exp/mu2e/data/users/<yourname>/Tutorial out
 
==Ceph Disk Notes==
 
===Route SNOW Tickets Directly to Ceph===
 
https://fermi.servicenowservices.com/nav_to.do?uri=%2Fservice_offering.do%3Fsys_id%3Df3907a4e1b1321906ee0ea42f54bcb0e%26sysparm_view%3Dess%26sysparm_affiliation%3D
 
===Quotas===
 
To see your quota and used space on /exp/mu2e/app/users/<yourname>, /exp/mu2e/data/users/<yourname> and ~<yourname>, use the commands:
  mu2einit
  mu2eQuota
 
You can also look at the quotas and space used on the ceph disks for another user:
 
  mu2eQuota <other_user_name>
 
For more details, see the next section.
 
 
===Quotas And Other Attributes===
 
The Ceph disks have directory-based quotas. For example, /exp/mu2e/app has a quota and each directory in /exp/mu2e/app/users has a quota.  If a directory does not explicitly set a quota, then you should walk up the directory tree to find the first directory for which a quota is set; that quota is controlling.  The default quota for a user directory in /exp/mu2e/app/users is 25 GiB and the default quota for a user directory in /exp/mu2e/data/users is 150 GiB.  A Mu2e collaborator may ask the Mu2e Offline Computing Coordinators to increase their quota; you will need to provide a good reason for your request.
 
To see the quota for a directory that has one:
  getfattr -n ceph.quota.max_bytes /exp/mu2e/data/projects/tracker
 
On SL7 you can see all of the attributes of a directory using:
  getfattr -d -m 'ceph.*' /exp/mu2e/data/projects/tracker
 
However on AL9 the wildcard feature has been turned off and you can only get individual attributes by name, for example:
  getfattr -n "ceph.dir.rbytes" /exp/mu2e/data/projects/tracker
where the named attribute is the total number of bytes in the directory tree descending from the specified directory.
 
Here is the full set of named attributes that match the wildcard on SL7:
{| class="wikitable"
! Name || Meaning
|-
| ceph.dir.entries || Number of entries in the specified directory, including files and directories.
|-
| ceph.dir.files || Number of files in the specified directory.
|-
| ceph.dir.rbytes || Number of bytes allocated to files in the directory tree (recursive).
|-
| ceph.dir.rctime || Intended to be the latest modification time of anything in the directory tree; known to be buggy.
|-
| ceph.dir.rentries || Number of entries in the directory tree (recursive), files and directories.
|-
| ceph.dir.rfiles || Number of files in the directory tree (recursive).
|-
| ceph.dir.rsubdirs || Number of subdirectories in the directory tree (recursive).
|-
| ceph.dir.subdirs || Number of subdirectories in the specified directory.
|}
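The raw byte counts these attributes return can be converted to human-friendly units; a minimal sketch (on a real CephFS mount the raw value would come from <code>getfattr --only-values -n ceph.dir.rbytes &lt;directory&gt;</code>):

```shell
# Convert a raw ceph.dir.rbytes value to whole GiB (assumes a 64-bit shell).
rbytes_to_gib() {
  echo $(( $1 / 1073741824 ))   # 1 GiB = 2^30 bytes, rounded down
}
rbytes_to_gib 161061273600      # a sample raw byte count
```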
 
===Ceph Snapshots===
 
Ceph supports snapshots.  See the general discussion of [[#Snapshots]].  There are two details of snapshots that are unique to the ceph disks.
 
The snapshots exist at each level, for example:
 
  /exp/mu2e/app/users/mu2epro/nightly/secondary/repo/.snap/_scheduled-2023-12-13-00_00_00_UTC_1099511627788/REve.log
 
There is a small glitch with ceph snapshots.  If you ls a directory that is below a snapshot directory, the command will sometimes hang.  But if you open a file in that directory, it will work correctly.  After you have opened the file, the ls will work, perhaps with a small delay the first time.
 
 
===Sharing Files===
 
If you wish for your colleagues to be able to read or write files in Ceph diskspace that you own, use the normal unix group permissions.  All members of mu2e are in the unix group named "mu2e".
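As a sketch of the group-permission pattern, with a throwaway directory standing in for a directory you own under /exp/mu2e/data/users/<yourname> (on the real disks you would also confirm with ls -l that the group is "mu2e"):

```shell
# Share a directory and a file with your unix group using ordinary permissions.
d=$(mktemp -d)
touch "$d/result.root"
chmod 775 "$d"               # group may enter, list, and add files
chmod 664 "$d/result.root"   # group may read (and overwrite) this file
stat -c '%a' "$d"
```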
 
===Moving Files Across Quota Domains===
 
Using the unix mv command to move files from one quota domain to another will actually do a copy and delete, not a true mv.  Instead use rsync.

Instead of:
  mv /exp/mu2e/app/users/a/my_directory /exp/mu2e/data/users/a
Use:
  rsync -ar /exp/mu2e/app/users/a/my_directory /exp/mu2e/data/users/a
  rm -rf /exp/mu2e/app/users/a/my_directory
 
==NAS Disks==
 
Fermilab operates a large disk pool that is mounted over the network on many different interactive machines. It is not mounted on grid nodes. The pool is built using Network Attached Storage (NAS) systems from the BlueArc Corporation. This system has RAID 6 level error detection and correction.
 
As of 2023, Mu2e has a quota of about 90 TB, distributed as shown in the Mu2e Project disk section of the table above.
 
The disk space on /mu2e/data and /mu2e/data2 is intended as our primary disk space for event-data files, log files, ROOT files and so on.  This space is not backed up.
 
If you want to run an application on the grid, the executable file(s) and the shared libraries might be delivered in two ways.  If it is a pre-built release of the code, it will be available, read-only, on [[Cvmfs|cvmfs]], which is mounted on all grid nodes.  If you are building your own custom code, it should be built on /mu2e/app, which is available on all the interactive nodes. See [[Muse]] for code building and making tarballs for submission to the grid.
 
Starting in the summer of 2023, these disks are being replaced with new disks based on the CEPH technology: [https://fifewiki.fnal.gov/wiki/Ceph].
 
===Snapshots===
 
In the table above, some of the  NAS disks are shown to be backed up. The full policy for backup to tape is available at the [http://computing.fnal.gov/site-backups/faq.html Fermilab Backup FAQ].
 
In addition to backup to tape, the NAS file system supports a feature known as snapshots, which works as follows. Each night the snapshot code runs and it effectively makes a hard link to every file in the filesystem. If you delete a file the next day, the blocks allocated to the file are still allocated to the snapshot version of the file. When the snapshot is deleted, the blocks that make up the file will be returned to the free list. So you have a window, after deleting a file, during which you can recover the file. If the file is small, you can simply copy it out of the snapshot. If the file is very large you can ask for it to be recreated in place.
 
On /mu2e/app and /exp/mu2e/app a snapshot is taken nightly and retained for 14 nights; so a deleted file can be recovered for up to 14 calendar days.  Many years ago snapshots were also used on /mu2e/data but that is no longer done.


If you create a file during the working day, it will not be protected until the next snapshot is taken, on the following night. If you delete the file before the snapshot is taken, it is not recoverable.
After a file has been deleted, but while it is still present in a snapshot, space occupied by the file is not charged to the mu2e quota. This works because the disks typically have free space beyond that allocated to the various experiments. However it is always possible for an atypical usage pattern to eat up all available space. In such a case we can request that snapshots be removed.


How does this work? While the NAS file system looks to us like an NFS-mounted unix filesystem, it is actually a much more powerful system. It has a front end that allows a variety of actions such as journaling and some amount of transaction processing. The snapshots take place in the front-end layer.


You can view the snapshots of the file systems at, for example, /mu2e/app/.snapshot/, /grid/fermiapp/.snapshot/ and /exp/mu2e/app/.snap. Snapshots are read-only to us.
 
There are some features of snapshots that are unique to the ceph disks; see [[#Ceph Snapshots]].


==Home Disks==


The interactive nodes in GPCF and FNALU share the same home disks. Fermilab policy is that large files, such as ROOT files, event-data files and builds of our code, should live in project space, not in our home areas, so these home disks have small quotas. In particular, the home disks do not have enough disk space to build a release of Mu2e Offline, which is why the Mu2e getting started instructions tell you to build your code on our project disks. You can contact the Service Desk to request additional quota but you will not get multiple GB.


The grid worker nodes do not see the home disk. When your job lands on a grid worker node, it lands in an empty directory.


As of the SL7 OS version, access to the home disk requires a kerberos ticket.  The nfs system can cache your ticket so you can continue to access the home area in an old window even after the ticket has expired, so you probably won't notice.  But this does come up more in cron jobs, which may fail because they do not use the interactive kerberos patterns.  If you have a cron job that sometimes can't access the home disk, see [https://fermi.servicenowservices.com/nav_to.do?uri=kb_knowledge.do?sys_id=050871931b1e9c90ced962cfe54bcb8e this article]. You can also set up your cron job to use [[Authentication#Kerberos|kcron, a kerberos-aware cron command]].


===Snapshots===
 
Snapshots of your home disk are visible at:
 
  ls ~/.snapshot
 
You will see that snapshots are made 4 times per day and kept for 30 days. 
 
===Sharing files===
 
By default all of your home area is private, which makes it hard to share files with collaborators. You can copy files to <code>/mu2e/app</code>, or make a mu2e directory:
  cd $HOME
  mkdir mu2e
  chmod 750 mu2e
This directory can remain group-readable, but other areas will revert to private automatically.  The same can be done with other experiments, like "nova".


==cvmfs==


This is a distributed disk system that is described in [[Dcache|dCache]].  It has a very large capacity and is used for high-volume and high-throughput data interactively or in grid jobs.  All grid jobs may read and write event data to/from dCache; it is not possible for grid jobs to move data to/from the NAS disks.
 


==stashCache==
library of fit or simulation templates, or a set of pre-computed simulation distributions.  CVMFS is best for many small files, but has a size limit.  For this case [[StashCache|stashCache]] is the ideal solution.
<div id="website"></div>


==Mu2e website==


The mu2e web site lives at /web/sites/mu2e.fnal.gov; this is visible from mu2egpvm*. Selected Mu2e members have read and write access to this area - ask offline management if you need access. For additional information see the instructions for the [http://mu2e.fnal.gov/atwork/general/webinfo/intro.shtml Mu2e web site].  The space is run by the central web services and its usage is [http://metrics.fnal.gov/cws/apache.html monitored here].


==Disks for the group marsmu2e==
<ul>
<li> Personal utility scripts, analysis scripts, histograms and documents should go on the home area.
<li> Builds of the offline code should go under /mu2e/app/users/$USER.
<li> Small (<100 GB) datasets, such as analysis ntuples, should go under /mu2e/data/user/$USER.
<li> Large datasets (>100GB), and any dataset that is written or read in parallel from a [[Workflows|grid job]]

Latest revision as of 23:36, 18 July 2024

Introduction

There are several categories of disk space available at Fermilab. Thsee include limited home areas, Mu2e project disks for building code and small datasets, dcache (/pnfs) for large datasets and sending data to tape, and a wide area readonly disk ( /cvmfs) for distribution of code and some auxillary data files.

When reading this section pay careful attention to which disks are backed up. It is your responsibility to ensure that files you require to be backed up are kept on an appropriate disk. It is equally your responsibility to use the backed up space wisely and not fill it with files that can easily be regenerated, such as root files, event-data files, object files, shared libraries and binary executables.

To learn how to where you may create your own directory on the project and dCache disks, see #Recommended_use_patterns and the data transfer page. When you do make your own directory, you must name it with using your kerberos principal (your Fermilab username).

The table below summarizes the information found in the sections that follow.

Name Quota (GB) Backed up? Worker Interactive Purpose/Comments
User Home Disks
/nashome 5.2 Yes --- rwx mu2egpvm*, mu2ebuild*, and FNALU only
Mu2e Project Disk on Ceph (Phased in during fall 2023 - please start new work here)
/exp/mu2e/data 87,961 No --- rwx Event-data files, log files, ROOT files.
/exp/mu2e/app 3,848 No --- rwx Exe's and shared libraries. No data/log/root files.
/grid/fermiapp/mu2e 232 Yes --- rwx Deprecated. Not for use by general users.
Special Disks
/cvmfs - Indirectly r-x r-x readonly code distribution - all interactive and grid nodes
/pnfs - No/Yes --- rwx distributed data disks - all interactive nodes
Mu2e web Site
/web/sites/mu2e.fnal.gov/htdocs 8 Yes --- rwx mounted on mu2egpvm* and FNALU; see website instructions
Marsmu2e Project disk on NAS
/grid/data/marsmu2e 400 No rw- rw- Event-data files, log files, ROOT files.
/grid/fermiapp/marsmu2e 30 Yes r-x rwx Grid accessible executables and shared libraries

Notes on the table:

  1. The project and scratch spaces each have a subdirectory named users. To use these disks, make a subdirectory users/your_kerberos_principal and put your files under that subdirectory.
  2. The Ceph disks have directory tree based quotas.
  3. The columns headed Worker and Interactive show the permission with which each disk is mounted on, respectively, the grid worker nodes and the interactive nodes (mu2egpvm*, mu2ebuild02). In the above table, full permissions are rwx, which denote read, write, execute, respectively. If one of rwx is replaced with a - then that permission is missing on the indicated machine. If the the permission is given as ---, then that disk is not mounted on the indicated machine. The point of some partitions not having w or x permission is a security measure, discussed below.

Ceph Transition

You will be responsible for moving your own files from /mu2e/data and /mu2e/data2 to /exp/mu2e/data. The reason for two data areas was an accident of what disk space was available when we asked that /mu2e/data be extended. We are consolidating both into /exp/mu2e/data.

Before copying your files to /exp/mu2e we ask that you audit your files to identify old files that you can delete or archive to tape. Please do not copy these files to /exp/mu2e. You can find files older than, say, 4 years (1460 days), with the command:

find /mu2e/app/users/<your-username>  -type f -not -mtime -1460 -exec ls -ld {} \; 

If your old files have no archival value, please delete them. If they do have archival value, please archive them to tape; contact the Mu2e computing leadership if you need help archiving files to tape. Please complete this by Jan 12, 2024.

The recommended way to copy files from a /mu2e disk to a /exp/mu2e disk is:

cd /exp/mu2e/data/users/<yourname>
rsync -ar /mu2e/data/users/<yourname>/<directory_name> .
rm -rf /mu2e/data/users/<yourname>/<directory_name>

Check that rsync completed correctly before deleting the original. This rsync command will recursively (-r) copy the directory named as the first positional argument to the current working directory, and will transfer the files in archive mode (-a), which preserves file metadata such as permissions and dates. In one recent test it took about 5 minutes to copy 8 GB.

Everyone with a quota on the /mu2e disks has a similarly sized quota on the /exp/mu2e disks.

The NAS disks had user based quotas. The Ceph disks have directory based quotas. That means that /exp/mu2e/app has a quota and we can set smaller quotas at any directory level. For example each user directory has a quota and each project directory has a quota.

Existing directories on /exp/mu2e/app

If you do not already have a directory /exp/mu2e/app/users/<yourname>, then the migration on Nov 15 will be simple. When the interactive machines are rebooted following the downtime, your files will be in the new location. Some people already have a directory /exp/mu2e/app/users/<yourname>; for those people, their migrated files will be at

/exp/mu2e/app/sync/users/<yourname>

Please use mv to move directories and files from the sync area to /exp/mu2e/app/users/<yourname>, taking care to not overwrite existing files. When done, delete your directory in the sync area.

Reseating Symbolic Links

Many people have used the following pattern to make it easy to keep source code and binaries on the app disk while providing low-keystroke access to related files on the data disk:

cd /mu2e/app/users/<yourname>/<my project>
mkdir -p /mu2e/data/users/<yourname>/<my project>
ln -s /mu2e/data/users/<yourname>/<my project> out

Different people have used different names for the symbolic link, with the two most common being "data" and "out".

After you move your files from /mu2e/data(2) to /exp/mu2e/data, you will need to reseat your symbolic links, as follows:

cd /exp/mu2e/app/users/<yourname>/<my project>
rm out
ln -s /exp/mu2e/data/users/<yourname>/<my project> out

Reseating Symbolic Links For the Computing Tutorials

Many people worked on the ComputingTutorials at the Mu2e Tutorial Day, Saturday Oct 4, 2023, or soon after. At that time the Ceph disks were named differently than they are now:

 /srv/mu2e/app
 /srv/mu2e/data

The tutorial instructions told you to use the symbolic link pattern described in the previous section.

Since that time, these directories have been renamed, with /exp replacing /srv. Your files are now in the newly named locations.

If you worked on the tutorials at that time, when you return to your working area you will need to reseat the symbolic links to the data area.

 cd /exp/mu2e/app/users/<yourname>/Tutorial
 rm out
 ln -s /exp/mu2e/data/users/<yourname>/Tutorial out

Ceph Disk Notes

Route SNOW Tickets Directly to Ceph

https://fermi.servicenowservices.com/nav_to.do?uri=%2Fservice_offering.do%3Fsys_id%3Df3907a4e1b1321906ee0ea42f54bcb0e%26sysparm_view%3Dess%26sysparm_affiliation%3D

Quotas

To see your quota and used space on /exp/mu2e/app/users/<yourname>, /exp/mu2e/data/users/<yourname> and ~<yourname>, use the commands:

mu2einit
mu2eQuota

You can also look at the quotas and space used on the Ceph disks for another user:

 mu2eQuota <other_user_name>

For more details, see the next section.


Quotas And Other Attributes

The Ceph disks have directory based quotas. For example, /exp/mu2e/app has a quota and each directory in /exp/mu2e/app/users has a quota. If a directory does not explicitly set a quota, walk up the directory tree to find the first directory for which a quota is set; that quota is controlling. The default quota for a user directory in /exp/mu2e/app/users is 25 GiB and the default quota for a user directory in /exp/mu2e/data/users is 150 GiB. A Mu2e collaborator may ask the Mu2e Offline Computing Coordinators to increase their quota; you will need to provide a good reason for your request.

To see the quota for a directory that has a quota:

getfattr -n ceph.quota.max_bytes /exp/mu2e/data/projects/tracker

On SL7 you can see all of the attributes of a directory using:

getfattr -d -m 'ceph.*' /exp/mu2e/data/projects/tracker

However on AL9 the wildcard feature has been turned off and you can only get individual attributes by name, for example:

 getfattr -n "ceph.dir.rbytes" /exp/mu2e/data/projects/tracker

where the named attribute is the total number of bytes in the directory tree descending from the specified directory.

Here is the full set of named attributes that match the wildcard on SL7:

Name               Meaning
ceph.dir.entries   Number of entries in the specified directory, including files and directories.
ceph.dir.files     Number of files in the specified directory.
ceph.dir.rbytes    Number of bytes allocated to files in the directory tree (recursive).
ceph.dir.rctime    Intended to be the latest modification time of anything in the directory tree. It is known to be buggy.
ceph.dir.rentries  Number of entries in the directory tree (recursive), files and directories.
ceph.dir.rfiles    Number of files in the directory tree (recursive).
ceph.dir.rsubdirs  Number of subdirectories in the directory tree (recursive).
ceph.dir.subdirs   Number of subdirectories in the specified directory.
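As an illustration, the quota and usage attributes can be combined into a small report. The helper name is ours; it assumes getfattr (from the attr package) and the attribute names listed above:

```shell
# Illustrative helper: report used space versus quota for a Ceph directory,
# using the ceph.dir.rbytes and ceph.quota.max_bytes attributes.
ceph_usage() {
  local dir=$1 used quota
  used=$(getfattr --only-values -n ceph.dir.rbytes "$dir" 2>/dev/null) || return 1
  quota=$(getfattr --only-values -n ceph.quota.max_bytes "$dir" 2>/dev/null) || return 1
  # 1073741824 bytes per GiB
  printf '%s: %d GiB used of %d GiB (%d%%)\n' "$dir" \
    $((used / 1073741824)) $((quota / 1073741824)) $((100 * used / quota))
}

# Example (path hypothetical):
# ceph_usage /exp/mu2e/data/users/$USER
```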

Ceph Snapshots

Ceph supports snapshots. See the general discussion in #Snapshots. Two details of snapshots are unique to the Ceph disks.

The snapshots exist at each level, for example:

/exp/mu2e/app/users/mu2epro/nightly/secondary/repo/.snap/_scheduled-2023-12-13-00_00_00_UTC_1099511627788/REve.log

There is a small glitch with Ceph snapshots. If you ls a directory below a snapshot directory, the command will sometimes hang. However, if you open a file in that directory, the open will work correctly. After you have opened the file, the ls will work, perhaps with a small delay the first time.


Sharing Files

If you wish for your colleagues to be able to read or write files in Ceph diskspace that you own, use the normal unix group permissions. All members of mu2e are in the unix group named "mu2e".
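For example, to open a directory tree you own to the mu2e group for reading (the helper name and path are illustrative):

```shell
# Illustrative helper: give a unix group read access to a directory tree.
# g+rX adds group read on files and group execute (search) on directories.
share_with_group() {
  local dir=$1 group=${2:-mu2e}
  chgrp -R "$group" "$dir" && chmod -R g+rX "$dir"
}

# Example (path hypothetical):
# share_with_group /exp/mu2e/data/users/$USER/myproject
```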

Moving Files Across Quota Domains

Using the unix mv command to move files from one quota domain to another will actually do a copy and delete, not a true mv. Instead, use rsync followed by rm.

Instead of:

 mv /exp/mu2e/app/users/a/my_directory /exp/mu2e/app/users/b

Use:

 rsync -ar /exp/mu2e/app/users/a/my_directory /exp/mu2e/app/users/b
 rm -rf /exp/mu2e/app/users/a/my_directory

NAS Disks

Fermilab operates a large disk pool that is mounted over the network on many different interactive machines. It is not mounted on grid nodes. The pool is built using Network Attached Storage (NAS) systems from the BlueArc Corporation and has RAID 6 error detection and correction.

As of 2023, Mu2e has a quota of about 90 TB, distributed as shown in the Mu2e Project disk section of the table above.

The disk space on /mu2e/data and /mu2e/data2 is intended as our primary disk space for event-data files, log files, ROOT files and so on. This space is not backed up.

If you want to run an application on the grid, the executable file(s) and the shared libraries can be delivered in two ways. If it is a pre-built release of the code, it will be available, read-only, on cvmfs, which is mounted on all grid nodes. If you are building your own custom code, it should be built on /mu2e/app, which is available on all the interactive nodes. See Muse for building code and making tarballs for grid submission.

In the summer of 2023, these disks were replaced with new disks based on the Ceph technology: [1].

Snapshots

In the table above, some of the NAS disks are shown to be backed up. The full policy for backup to tape is available at the Fermilab Backup FAQ.

In addition to backup to tape, the NAS file system supports a feature known as snapshots, which works as follows. Each night the snapshot code runs and it effectively makes a hard link to every file in the filesystem. If you delete a file the next day, the blocks allocated to the file are still allocated to the snapshot version of the file. When the snapshot is deleted, the blocks that make up the file will be returned to the free list. So you have a window, after deleting a file, during which you can recover the file. If the file is small, you can simply copy it out of the snapshot. If the file is very large you can ask for it to be recreated in place.

On /mu2e/app and /exp/mu2e/app a snapshot is taken nightly and retained for 14 nights; so a deleted file can be recovered for up to 14 calendar days. Many years ago snapshots were also used on /mu2e/data but that is no longer done.

If you create a file during the working day, it will not be protected until the next snapshot is taken, on the following night. If you delete the file before the snapshot is taken, it is not recoverable.
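As a sketch, a deleted file can be copied back out of the newest snapshot. The helper below is our own illustration; it assumes snapshot names sort chronologically, as the dated names used on these disks do:

```shell
# Illustrative helper: restore one file from the newest snapshot.
# $1 = snapshot root (e.g. /exp/mu2e/app/.snap on Ceph, or a .snapshot
#      directory on NAS), $2 = live root, $3 = relative path of the file.
restore_latest() {
  local snaproot=$1 liveroot=$2 rel=$3 latest
  latest=$(ls -1 "$snaproot" | tail -n 1) || return 1
  [ -n "$latest" ] || return 1
  cp "$snaproot/$latest/$rel" "$liveroot/$rel"
}

# Example (paths hypothetical):
# restore_latest /exp/mu2e/app/.snap /exp/mu2e/app users/$USER/notes.txt
```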

After a file has been deleted, but while it is still present in a snapshot, the space occupied by the file is not charged to the mu2e quota. This works because the disks typically have free space beyond that allocated to the various experiments. However, it is always possible for an atypical usage pattern to eat up all available space; in such a case we can request that snapshots be removed.

How does this work? While the NAS file system looks to us like an nfs-mounted unix filesystem, it is actually a much more powerful system. It has a front end that allows a variety of actions such as journaling and some amount of transaction processing. The snapshots take place in the front-end layer.

You can view the snapshots of the file systems at, for example, /mu2e/app/.snapshot/, /grid/fermiapp/.snapshot/ and /exp/mu2e/app/.snap . Snapshots are readonly to us.

There are some features of snapshots that are unique to the Ceph disks; see #Ceph Snapshots.

Home Disks

The interactive nodes in GPCF and FNALU share the same home disks. Fermilab policy is that large files such as ROOT files, event-data files and builds of our code, should live in project space, not in our home areas. Therefore these home disks have small quotas. The home disks do not have enough disk space to build a release of mu2e Offline. Therefore the Mu2e getting started instructions tell you to build your code on our project disks. You can contact the Service desk to request additional quota but you will not get multiple GB.

The grid worker nodes do not see the home disk. When your job lands on a grid worker node, it lands in an empty directory.

As of the SL7 OS version, access to the home disk requires a kerberos ticket. The nfs system caches your ticket, so you can continue to access the home area in an old window even after the ticket has expired; you probably won't notice. But this does come up more in cron jobs, which may fail because they do not use the interactive kerberos patterns. If you have a cron job that sometimes can't access the home disk, see this article. You can also set up your cron job to use kcron, a kerberos-aware cron command.
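For example, a cron job can be wrapped in kcron so that it runs with valid credentials. This is a sketch: run kcroninit once to create your keytab, and check the kcron path on your node, which may differ from the one shown:

```shell
# One-time setup on the node where the cron job will run:
#   kcroninit
# Then, in crontab -e, wrap the job in kcron (script path hypothetical):
# 0 3 * * * /usr/krb5/bin/kcron /exp/mu2e/app/users/$USER/bin/nightly.sh
```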

Snapshots

Snapshots of your home disk are visible at:

ls ~/.snapshot

You will see that snapshots are made 4 times per day and kept for 30 days.

Sharing files

By default all of your home area is private, which makes it hard to share files with collaborators. You can copy files to /mu2e/app, or make a mu2e directory:

cd $HOME
mkdir mu2e
chmod 750 mu2e

This directory can remain group-readable, but other areas will revert to private automatically. The same can be done with other experiments, like "nova".

cvmfs

This is a distributed disk system that is described in cvmfs. It is used to provide pre-built releases of the code and UPS products to all users, interactive nodes, and grids.

dCache

This is a distributed disk system that is described in dCache. It has a very large capacity and is used for high-volume and high-throughput data interactively or in grid jobs. All grid jobs may read and write event data to/from dCache; it is not possible for grid jobs to move data to/from the NAS disks.


stashCache

There exists the case of rather large files (more than a GB) that have to be sent to every grid node. This might be a library of fit or simulation templates, or a set of pre-computed simulation distributions. CVMFS is best for many small files, but has a size limit. For this case stashCache is the ideal solution.

Mu2e website

The mu2e web site lives at /web/sites/mu2e.fnal.gov; this is visible from mu2egpvm*. Selected Mu2e members have read and write access to this area - ask offline management if you need to get access. For additional information see the instructions for the Mu2e web site. The space is run by the central web services and space is monitored here.

Disks for the group marsmu2e

There are two additional disks that are available only to members of the group marsmu2e; only a few Mu2e collaborators are members of this group. The group marsmu2e was created to satisfy access restrictions on the MCNP software that is used by MARS. Only authorized users may have read access to the MARS executable and its associated cross-section databases. This access control is enforced by creating the group marsmu2e, limiting membership in the group and making the critical files readable only by marsmu2e.

The two disks discussed here are /grid/fermiapp/marsmu2e, which has the same role as /grid/fermiapp/mu2e, and /grid/data/marsmu2e, which has the same role as /grid/data/mu2e.

This is discussed further on the pages that discuss running MARS for Mu2e.


Recommended use patterns

Here is a summary of recommended use patterns

  • Personal utility scripts, analysis scripts, histograms and documents should go on the home area.
  • Builds of the offline code should go under /exp/mu2e/app/users/$USER.
  • Small (<100 GB) datasets, such as analysis ntuples, should go under /exp/mu2e/data/users/$USER.
  • Large datasets (>100 GB), and any dataset that is written or read in parallel from a grid job, should reside on scratch dCache: /pnfs/mu2e/scratch/users/$USER. This area will purge your old files without warning.
  • Datasets of widespread interest or semi-permanent usefulness should be uploaded to tape.
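Putting these patterns together, a first-time setup might look like the sketch below; the helper name and loop are illustrative:

```shell
# Illustrative first-time setup: create your personal directory, named after
# your kerberos principal, under the users area of each disk you plan to use.
make_user_area() {
  mkdir -p "$1/users/$2"
}

# Example (run once; $USER is your kerberos principal on Fermilab machines):
# for root in /exp/mu2e/app /exp/mu2e/data /pnfs/mu2e/scratch; do
#   make_user_area "$root" "$USER"
# done
```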