Enstore

Introduction

The Scientific Computing Division maintains a system of data tapes called enstore (manual, project), which allows us to "write data to tape" and read files back from tape. The tapes are held in robotic libraries in Feynman Computing Center (FCC) and the Grid Computing Center (GCC) buildings. We access these files using the /pnfs file system, which is part of dCache. This page presumes basic familiarity with dCache.

This page gives an end-user view of enstore. It also describes a critical detail that you must understand as a user of enstore: you must pre-stage files from tape to disk prior to using them. If you fail to do so, your jobs may take 100 times longer to run and your waiting jobs will block others from using shared resources. Not only is this a waste of high-demand shared resources but it also wears out equipment more quickly.

Instructions for prestaging files are available on the Prestage wiki page.


Mu2e Conventions

Before you write any files, read and understand the following:

If you are running standard Mu2e grid jobs to create standardly named Mu2e datasets, much of this is automated for you; see Upload and other information on the Workflows page.

Mental Model

To write a file to tape, copy the file to tape-backed dCache disk; the path will begin with /pnfs/mu2e/tape/. Follow the Mu2e file naming conventions, for example

/pnfs/mu2e/tape/phy-sim/dig/mu2e/NoPrimary-mix/MDC2018e/art/01/15/dig.mu2e.NoPrimary-mix.MDC2018e.001002_00000800.art
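
For concreteness, here is a minimal Python sketch that splits this example file name into its dot-separated fields. It is illustrative only: the field labels are informal, and the two-level "spreader" directories (01/15 in the path above) are chosen by the upload tools and are not computed here.

  # Illustrative only: split a Mu2e file name into its dot-separated fields.
  import os

  def parse_mu2e_filename(name):
      fields = name.split(".")
      if len(fields) != 6:
          raise ValueError("expected 6 dot-separated fields: %s" % name)
      labels = ("tier", "owner", "description", "configuration", "sequencer", "format")
      return dict(zip(labels, fields))

  path = ("/pnfs/mu2e/tape/phy-sim/dig/mu2e/NoPrimary-mix/MDC2018e/art/01/15/"
          "dig.mu2e.NoPrimary-mix.MDC2018e.001002_00000800.art")
  print(parse_mu2e_filename(os.path.basename(path)))
  # {'tier': 'dig', 'owner': 'mu2e', 'description': 'NoPrimary-mix',
  #  'configuration': 'MDC2018e', 'sequencer': '001002_00000800', 'format': 'art'}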

To read a file in tape backed dCache, prestage the file to disk and then read the file using the same path.

There are some old Mu2e files that omit the tape/ subdirectory in the path but these will soon be deleted.

When you copy a file to tape-backed dCache, the file is copied into a large dCache disk cache and queued to be copied to tape; the actual copy to tape might take place in a few minutes or a few days, depending on competing demands for resources. As of November 2020, the size of the cache is a few PB. When space is needed in the disk cache for new files, old files that have already been copied to tape are deleted to make room. Roughly, the algorithm is first created, first deleted, regardless of intermediate use; but there are details and corner cases. As of November 2020, files typically live in the cache for about 30 days before they expire, but a small fraction may expire in as little as 7 days.

Until your file expires from the cache, you can read it using its /pnfs path. After your file expires from disk cache, it will still be visible in the /pnfs file system and you can access some meta-data about the file, such as its size and creation date. However if you try to read the file, there will be a delay, between minutes and days, while it is restored from tape. The file will again remain on disk until it naturally expires from the cache.
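
Before reading a file through its /pnfs path, you can check whether it is currently on disk. The sketch below is a convenience only and assumes the dCache ".(get)(<name>)(locality)" dot command is available on the /pnfs mount; a reply of NEARLINE means the file is only on tape, so it should be prestaged (see the Prestage page) rather than read directly.

  # Sketch, assuming the /pnfs NFS mount supports dCache "dot commands".
  # Typical replies are ONLINE, NEARLINE or ONLINE_AND_NEARLINE; NEARLINE
  # means the file is only on tape and reading it now would trigger a slow
  # tape restore.
  import os
  import sys

  def dcache_locality(path):
      directory, name = os.path.split(path)
      dot_file = os.path.join(directory, ".(get)(%s)(locality)" % name)
      with open(dot_file) as f:
          return f.read().strip()

  if __name__ == "__main__":
      for p in sys.argv[1:]:
          print(p, dcache_locality(p))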

When a file is needed, a robot arm retrieves the tape containing the file and inserts it into a tape drive. The file is read from the tape and copied back to its original location in the /pnfs file system. The latency for a tape drive to become available is somewhere between a few minutes and a few days, depending on the load on the system. The latency for the robot arm motion and tape mount is of order 1 minute. The tape has to be fast-forwarded to the start of the requested file, which can take several minutes. Finally the file is read, which takes of order 10 seconds. Then the tape is rewound, dismounted and returned to its slot in the library; this may also take a few minutes.

If you prestage all of the files in a dataset from tape to disk, the software knows which files live on which tapes and the order in which they appear on each tape. This allows the software to organize the file retrieval to minimize the expensive steps in the above process: every required tape will be mounted exactly once and the files will be read from each tape in the order in which they appear on the tape.

On the other hand, if you give a grid job a list of files to process, as we do using generate_fcl, the system does not know how to do any of this optimization, which may result in a large number of wasted operations.

Tape volumes, tape drives and robot arms have a limited lifespan. Enstore tracks how often each tape has been mounted, how much tape motion has occurred on each tape, what operations each drive has performed and what operations each arm has performed. Once the operation count on a tape volume has passed a given threshold, all files still active will be copied from that tape to a new one and the old tape will be retired. When the operation counts on drives and arms cross thresholds, maintenance is scheduled, which may include downtime for field-replaceable parts.

There are similar optimization considerations for writing files to tape but end users are not exposed to any of the details.

Generations of Tape Media

There are currently (9/2018) two types of tapes and we are transitioning from the first to the second.

T10K (StorageTek): The T10KC and T10KD tapes hold 5 TB and 8 TB, respectively. A tape drive can read at 250 MB/s, so one tape can be read in about 5 hours. We currently (2017) share about 20 tape drives with all of the Intensity Frontier.

LTO (Linear Tape Open): The LTO8 drives have a quoted maximum speed between 300 MB/s and 350 MB/s, depending on media type. The LTO8 tapes have a capacity of 12 TB and there is a total of 56 drives for the Intensity Frontier, but a large fraction are used for converting T10K data to LTO8. During 2018 and 2019 there was a global shortage of LTO8 tapes, so Fermilab used "M8" tapes (LTO7 media formatted for use in LTO8 drives), which have a capacity of 9 TB.
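
As a sanity check on the quoted rates, the following back-of-the-envelope calculation gives the time to read a full tape at the nominal drive speed; real throughput varies with streaming efficiency and file sizes, so treat these as rough figures.

  # Rough time to read a full tape end-to-end at the nominal drive rate.
  def hours_to_read(capacity_tb, rate_mb_per_s):
      return capacity_tb * 1.0e6 / rate_mb_per_s / 3600.0

  print("T10KC (5 TB at 250 MB/s): %.1f h" % hours_to_read(5, 250))    # ~5.6 h
  print("T10KD (8 TB at 250 MB/s): %.1f h" % hours_to_read(8, 250))    # ~8.9 h
  print("LTO8 (12 TB at 300 MB/s): %.1f h" % hours_to_read(12, 300))   # ~11.1 h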

File Families

Mu2e tapes are divided into several file families depending on the type of data: production or user data, raw or reco, beam data or sim. The files assigned to a file family will go to one set of tapes for that file family. It can be useful to segregate data like this so it can be treated specially. For example, raw data might be stored with two copies in different buildings, while this is unnecessary for sim data.

The maximum read rates of tapes are hundreds of MB/s, but in reality tapes are rarely read all the way through or efficiently; we typically access one file at a time. Typical access times are:

  • 1m to find and mount the tape
  • 1m to seek to the file
  • 10s to read the file
  • 1m to dismount and replace the tape

If requests for multiple files from a single tape are queued, then those requests will be grouped and ordered to improve the drive efficiency and reduce wear.
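
To see why this grouping matters, here is an illustrative calculation using the rough latencies listed above. It assumes 1000 files spread over 10 tapes and treats seeks between consecutively ordered files as negligible, so the numbers are order-of-magnitude only.

  # Illustrative comparison: unordered single-file requests vs. requests
  # grouped so that each tape is mounted once and read in file order.
  MOUNT, SEEK, READ, DISMOUNT = 60.0, 60.0, 10.0, 60.0   # seconds (rough)

  n_files, n_tapes = 1000, 10

  # Unordered access: every file pays a mount, a seek and a dismount.
  naive = n_files * (MOUNT + SEEK + READ + DISMOUNT)

  # Grouped access: each tape is mounted once; seeks between consecutive
  # files are treated as negligible here.
  grouped = n_tapes * (MOUNT + SEEK + DISMOUNT) + n_files * READ

  print("unordered: %.1f h" % (naive / 3600.0))    # ~52.8 h
  print("grouped:   %.1f h" % (grouped / 3600.0))  # ~3.3 h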

Once the file is off the tape, it has to be copied to tape-backed dCache. If the file is over 300 MB, it was written to tape as a lone file and the copy to dCache is immediate. If it is smaller, then it was rolled into a tarball with other small files by a system called Small File Aggregation (SFA). In this case, it has to be extracted from the tarball before being written to dCache, which can add up to 15 s of latency.

We can mix large and small files in our data. The largest file that can be written is something like 1 TB, but as a practical matter all Mu2e files should be less than about 20 GB, and when there is an option, they should be 2-5 GB.

We write to tape through the tape-backed dCache. We only write files that are properly named, organized and documented by the mu2egrid scripts.

To read files from tape, we usually access them via the mu2egrid scripts or the /pnfs mount of dCache. See also SAM for help with file names and locations. Large numbers of files, such as those required by grid jobs, require prestaging to make sure they are off tape and on disk before reading them.

A complete file listing is available, updated each day (use wget; it is too big for a browser).

SFA (Small File Aggregation)

The tape capacities mentioned above are only correct if we write only large files to tape; if we write small files, the inter-file gaps use a larger fraction of the tape space, reducing the capacity. The smaller the files, the smaller the total tape capacity. The advice from the Fermilab ENSTORE experts is to aim for about 1000 files per tape in order to get nominal capacity. For the LTO8 generation of tape media, this means that the target file size should be about 12 GB. On the other hand, we design our data processing campaigns so that each grid job runs for a few hours; usually this means that we want input files to be much, much smaller than 12 GB.

To balance these competing needs, SCD provides an automated service called Small File Aggregation (SFA). See the links on the OfflineOps page for the SFA documentation and a link to the active SFA configuration. SFA attributes are organized by File Family and all of the Mu2e FileFamilies have an SFA configuration.

The mental model for writing to tape via SFA is as follows. If a file written to tape-backed dCache exceeds a specified size (max_member_size), it will be written to tape as a single file. See the table below for the value of max_member_size and the other parameters that control SFA. If a file is smaller than max_member_size, the SFA system puts it into a queue. As more files are written to the same File Family, they are added to the queue. When the files in the queue satisfy one of a set of trigger conditions, all of the files in the queue are processed as follows:

  1. All of the files in the queue will be written into a single tar file
  2. The tar file will be written to tape
  3. All of the required record keeping will be done.

The trigger condition is the first of the following to be satisfied (a sketch of the combined logic follows this list):

  1. The total size of the files in the queue exceeds minimal_file_size
  2. The number of files in the queue exceeds max_files_in_pack.
  3. The oldest file in the queue is older than max_waiting_time
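
A minimal sketch of this bookkeeping, assuming the parameter names and units from the table below; the real SFA is a server-side enstore service, and this toy model only shows when a file bypasses aggregation and when a pending pack would be flushed to tape.

  # Toy model of the SFA decision logic described above.  Parameter names
  # follow the table below; minimal_file_size is treated as 1024-byte kB.
  import time

  class SfaQueue:
      def __init__(self, max_member_size, minimal_file_size_kb,
                   max_files_in_pack, max_waiting_time):
          self.max_member_size = max_member_size                  # bytes
          self.minimal_file_size = minimal_file_size_kb * 1024    # bytes
          self.max_files_in_pack = max_files_in_pack
          self.max_waiting_time = max_waiting_time                # seconds
          self.queue = []                                         # (size, arrival time)

      def add(self, size_bytes, now=None):
          now = time.time() if now is None else now
          if size_bytes >= self.max_member_size:
              return "written to tape as a single file"
          self.queue.append((size_bytes, now))
          if self._triggered(now):
              n = len(self.queue)
              self.queue = []
              return "pack of %d files tarred and written to tape" % n
          return "queued for aggregation"

      def _triggered(self, now):
          total = sum(size for size, _ in self.queue)
          oldest = min(arrival for _, arrival in self.queue)
          return (total >= self.minimal_file_size
                  or len(self.queue) >= self.max_files_in_pack
                  or now - oldest >= self.max_waiting_time)

  # Example with the phy-sim values from the table below.
  q = SfaQueue(300000000, 7812500, 3000, 86400)
  print(q.add(2000000000))   # -> written to tape as a single file
  print(q.add(150000000))    # -> queued for aggregation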


The values of these parameters are available at [1]. As of Sept 13, 2022 the values were:

File Family   max_member_size (b)   minimal_file_size (kB)   max_files_in_pack   max_waiting_time (s)   Width
phy-raw       300000000             7812500                  2000                86400                  2
phy-rec       300000000             7812500                  2000                86400                  2
phy-ntd       300000000             7812500                  2000                86400                  2
phy-sim       300000000             7812500                  3000                86400                  2
phy-nts       300000000             4882812                  3000                86400                  2
phy-etc       300000000             4882812                  3000                86400                  2
usr-dat       300000000             7812500                  2000                86400                  2
usr-sim       300000000             4882812                  3000                86400                  2
usr-nts       400000000             9765625                  2000                86400                  2
usr-etc       300000000             4882812                  3000                86400                  2
tat-cos       300000000             4882812                  3000                86400                  2

Some of these values were reset during the migration to LTO8 media but some are stale. The meaning of the file family width is discussed on the wiki page about FileFamilies.
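
To decode the units in the table: max_member_size is in bytes and minimal_file_size appears to be in 1024-byte kB, so the values correspond to single-file thresholds of 300-400 MB and pack targets of roughly 5, 8 or 10 GB. A quick conversion under that assumption:

  # Convert selected rows of the SFA table to human-readable sizes,
  # assuming minimal_file_size is in units of 1024 bytes.
  def gb(n_bytes):
      return n_bytes / 1.0e9

  for family, member_b, minimal_kb in [("phy-raw", 300000000, 7812500),
                                       ("phy-nts", 300000000, 4882812),
                                       ("usr-nts", 400000000, 9765625)]:
      print("%s: single-file threshold %.1f GB, pack target %.1f GB"
            % (family, gb(member_b), gb(minimal_kb * 1024)))
  # phy-raw: single-file threshold 0.3 GB, pack target 8.0 GB
  # phy-nts: single-file threshold 0.3 GB, pack target 5.0 GB
  # usr-nts: single-file threshold 0.4 GB, pack target 10.0 GB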

SFA Issues

As of September 2022 there have been many reports of very slow restore-from-tape operations. I looked carefully into some of these and they turned out to represent opposite extremes.

Case 1: Ray remembers from some years ago that SFA always restores the requested files from a pack one file at a time; therefore restoring N files requires N separate tar commands. When you want to restore all of the files from a pack, this is very inefficient.


Case 2: Roberto prestaged the SAM dataset dts.mu2e.EleBeamFlashCat.MDC2020p.art. The prestage took about 18 days. This dataset contained 9700 files, each about 22 MB, for a total of 210 GB. These were packed into 28 SFA packs that were contiguous files on a single volume. The total size of this dataset is small enough that it should have been possible to copy all 28 SFA packs to temporary disk in a single tape mount and process everything from there. Marc Mengel explained that there is a 1000-file limit when restoring SFA datasets.
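
A rough check of the Case 2 arithmetic, for orientation only (actual pack sizes will not be exactly uniform):

  # Back-of-the-envelope numbers for Case 2.
  n_files, file_mb, n_packs = 9700, 22, 28
  total_gb = n_files * file_mb / 1000.0
  print("total size : %.0f GB" % total_gb)             # ~213 GB (quoted as 210 GB)
  print("per pack   : %.1f GB" % (total_gb / n_packs)) # ~7.6 GB per pack
  print("files/pack : %d" % (n_files // n_packs))      # ~346 files per pack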

Case 3: Fixme: add the discussion of Yuri's case.

Status on 9/23/2020

TDR- and CD3-era tapes were written on T10K media.

  • Since Sept 28, 2018 at ~11:50 AM writing to LTO8 media
    • 12 TB per volume
    • Recommended minimum file size 1.2 GB
  • Selected files on T10K media are being migrated to LTO8
    • Other files have been
  • SCD has asked us to identify which files on T10K media need to be migrated to LTO8 and which we can allow to expire.

In 10/2020, as part of the migration from T10KC to LTO8 tape formats, Rob deleted about 700 TB of datasets, including most of the largest from CD3 (Google spreadsheet list).

As part of the migration, we changed the SFA parameters: the max file size went from 300 MB to 600 MB and the target tarball size from 5 GB to 8 GB.