Rucio: Difference between revisions

Latest revision as of 21:54, 16 October 2024

Introduction

In spring 2024, Mu2e is planning to migrate from the SAM file catalog and tools to a new set of tools. The new system would consist of these parts:

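  • Metacat - a database of file metadata (docs: https://metacat.readthedocs.io/en/latest/, GUI: https://metacat.fnal.gov:9443/mu2e_meta_prod/app/gui/index)
  • Rucio - a database of file locations, and servers which can move and track data, responding to user rules (docs: https://rucio.cern.ch/documentation/)
  • Data Dispatcher - a replacement for SAM project file delivery (docs: https://data-dispatcher.readthedocs.io/en/latest/, GUI: https://metacat.fnal.gov:9443/mu2e_dd_prod/gui/index)
  • Shrek - the name for the three packages Rucio, Metacat and Data Dispatcher (Introduction to Shrek (mp4): https://mu2e-docdb.fnal.gov/cgi-bin/sso/ShowDocument?docid=50705)
  • mdh - Mu2e data-handling commands added to supplement the above systems (see mdh -h)
  • we expect little to no interaction between users and Rucio or Data Dispatcher; almost all user work can be done with metacat and mdh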


A few overarching concepts to keep in mind

  • metacat recognizes authentication with tokens, but Rucio uses x509 proxies. mdh will make either from a kerberos ticket, as needed.
  • metacat requires you to be authenticated to write to the database (create files, datasets)
  • all files belong to a namespace, also known as the scope. Namespaces can't be deleted.
    • the combination of namespace:filename uniquely identifies a file and is called a did
    • if you create a namespace, it must start with your username, like ${USER}_myana (by policy)
    • Rucio scopes and standard dataset names will follow the metacat conventions (by policy)
  • files must be named by the naming convention (by policy)
  • all files are readable by all users
  • metacat has roles. Users can be members of a role, and then the role can create and own objects
  • file records are never deleted; they are either retired (after which no new file of the same name can be created) or modified
  • when declared, a file must belong to at least one existing dataset (which might not be in the same namespace)
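
The did and namespace rules above can be sketched with a few small shell helpers. This is only an illustration: the function names did_namespace, did_name and user_namespace_ok are hypothetical and are not part of metacat or mdh. The last helper checks only the "must start with your username" policy for user-created namespaces; role-owned namespaces such as sim are outside its scope.

```shell
# Hypothetical helpers (not part of metacat or mdh) illustrating the did rules.

# did_namespace DID : print the namespace, i.e. the text before the first ':'
did_namespace() {
    printf '%s\n' "${1%%:*}"
}

# did_name DID : print the file or dataset name, i.e. the text after the first ':'
did_name() {
    printf '%s\n' "${1#*:}"
}

# user_namespace_ok USERNAME NAMESPACE : succeed if NAMESPACE starts with USERNAME
user_namespace_ok() {
    case "$2" in
        "$1"*) return 0 ;;
        *)     return 1 ;;
    esac
}
```

For example, did_namespace rlc:mcs.rlc.dh_test.001.001200_000000.art prints rlc, and user_namespace_ok "$USER" "${USER}_myana" succeeds.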

Quick start

setup

mu2einit
muse setup ops

This will set up all related data-handling tools.

Authentication

As of 1/2024, all collaboration members should have a metacat account, but rucio accounts have to be made by hand.

To use metacat commands to list files and other database content, you do not need authentication. To use metacat commands that involve a database write, you must authenticate yourself to metacat.

metacat auth login -m token $USER
  • if you get a token file not found, please run getToken, or see token docs
  • if you get Authentication failed, you might not have an account

Your authentication lasts as long as your token's validity period. To check your authentication:

metacat auth list

There is no logout. If your authentication has expired and you attempt a write command, the only error you get may be Connection reset by peer.
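
Because an expired authentication can show up only as Connection reset by peer, it may help to wrap write commands so that this message triggers a reminder. The sketch below is a hypothetical wrapper, not an mdh or metacat feature. It assumes the message is written to standard error, which the text above does not confirm.

```shell
# run_write CMD [ARGS...] : hypothetical wrapper around any write command.
# Assumption: the "Connection reset by peer" text appears on standard error.
run_write() {
    local err_file status=0
    err_file=$(mktemp) || return 1
    # Run the command, capturing its standard error for inspection.
    "$@" 2>"$err_file" || status=$?
    # Replay the captured standard error (it appears only after the command ends).
    cat "$err_file" >&2
    if grep -q 'Connection reset by peer' "$err_file"; then
        # Single quotes keep $USER literal so the hint shows the command as typed.
        echo 'hint: authentication may have expired; try: metacat auth login -m token $USER' >&2
    fi
    rm -f "$err_file"
    return "$status"
}
```

It would be used as run_write metacat namespace create $USER; the exit status of the wrapped command is passed through unchanged.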

Data Dispatcher uses the same authentication scheme.

To use mdh commands, you only need a kerberos ticket and mdh will manage your authentication.

Rucio uses an x509 proxy, which you can get from a kerberos ticket using vomsCert. Rucio requires the proxy for all commands. Rucio should change to tokens at some point.

Listing

 metacat namespace list             # list all namespaces
 metacat namespace list -u $USER    # list your namespaces
 metacat namespace list -u mu2epro  # list production namespaces

 metacat dataset list              # list all datasets
 metacat dataset list $USER:*      # list your datasets
 metacat dataset list mu2e:dig.*MDC2020*      # list production datasets

 metacat file show rlc:mcs.rlc.dh_test.001.001200_000000.art -m -p    # print metadata about one file

 metacat query 'datasets matching mu2e:dig.mu2e*'  # best to use single quotes for query strings

 metacat query 'files from rlc:mcs.rlc.dh_test.001.art'
 metacat query 'files from rlc:mcs.rlc.dh_test.001.art where rs.first_subrun=0'
 metacat query 'files from rlc:etc.rlc.dh_test.000.txt where fn.sequencer ~ ".*2"' # regex string match
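
When queries are built inside scripts, keeping the quoting straight is easier if the query string is assembled in one place. A minimal sketch, assuming only the "files from ... where ..." form shown in the examples above; the function name files_query is hypothetical.

```shell
# files_query DATASET_DID [CONDITION] : print a "files from" query string.
# Hypothetical helper; it only formats text and does not contact metacat.
files_query() {
    if [ -n "${2:-}" ]; then
        printf 'files from %s where %s\n' "$1" "$2"
    else
        printf 'files from %s\n' "$1"
    fi
}
```

The result can then be passed as a single argument, for example metacat query "$(files_query rlc:mcs.rlc.dh_test.001.art 'rs.first_subrun=0')".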

creating

metacat namespace create $USER    # your personal namespace
metacat namespace create -o pro sim  # a namespace owned by a role

metacat dataset create rlc:test1 -M -m @ds.json "some comment"   # create an ad-hoc dataset
   ds.json contains:
  {
    "ds.myMetadata" : "myValue"
  }

metacat dataset create rlc: -M -m @ds.json "some comment"   # create an ad-hoc dataset
metacat dataset create rlc:mcs.rlc.dh_test.001.art "first test"  # a formal file dataset
metacat dataset create sim:mcs.mu2e.dh_test.002.art "test files for a role"  # a new pro dataset

# make json metadata for a file: -s sets the scope (namespace); the -f, -n, -v and -p options add to the metadata
mdh file-json -s rlc -f production -n Reco -v 000-000-000  -p mcs.mu2e.dh_test_parent.001.001200_000000.art   mcs.mu2e.dh_test.001.001200_000000.art

# all files must be added to a dataset when they are created
# the dataset must have an official dataset name, by policy
metacat file declare  -f mcs.rlc.dh_test.001.001200_000000.art.json rlc:mcs.rlc.dh_test.001.art
for FF in *.json; do metacat file declare -f "$FF" sim:mcs.mu2e.dh_test.002.art; done
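
In the examples above, the formal dataset name is the file name with its fifth dot-separated field removed (mcs.rlc.dh_test.001.001200_000000.art belongs to mcs.rlc.dh_test.001.art). The sketch below derives the dataset name on that assumption. The helper dataset_name is hypothetical, and the six-field layout is inferred from the examples rather than taken from the naming convention page.

```shell
# dataset_name FILENAME : print the dataset name for a six-field file name
# by dropping the fifth field. Fails (non-zero status) for any other field count.
# Assumption: file names have exactly six dot-separated fields, as in the examples.
dataset_name() {
    printf '%s\n' "$1" | awk -F. '
        NF != 6 { exit 1 }
        { print $1 "." $2 "." $3 "." $4 "." $6 }'
}
```

A declare command could then be written as metacat file declare -f "$FF" "rlc:$(dataset_name "${FF%.json}")", where FF is the json file made by mdh file-json.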

Implementation


export METACAT_SERVER_URL=https://metacat.fnal.gov:9443/mu2e_meta_prod/app
export METACAT_AUTH_SERVER_URL=https://metacat.fnal.gov:8143/auth/mu2e
export DATA_DISPATCHER_URL=https://metacat.fnal.gov:9443/mu2e_dd_prod/data
export DATA_DISPATCHER_AUTH_URL=https://metacat.fnal.gov:8143/auth/mu2e

Rucio request
/pnfs/mu2e/tape
/pnfs/mu2e/persistent/datasets
/pnfs/mu2e/scratch/datasets  (expect to have a greedy cleanup of two weeks)

Rucio 6/21/23
two nondeterministic RSEs
FNAL_DCACHE_SCRATCH
FNAL_DCACHE_PERSISTENT
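
If a command cannot find its server, a first check is whether the four environment variables listed above are set. A minimal sketch; the function name missing_dh_env is hypothetical, and the check only tests that each variable is non-empty, not that the URL is reachable.

```shell
# missing_dh_env : print the name of each data-handling variable that is unset
# or empty; returns non-zero if any is missing. Hypothetical helper.
missing_dh_env() {
    local name value missing=0
    for name in METACAT_SERVER_URL METACAT_AUTH_SERVER_URL \
                DATA_DISPATCHER_URL DATA_DISPATCHER_AUTH_URL; do
        # Indirect lookup of the variable called $name (names are fixed above).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            printf '%s\n' "$name"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, missing_dh_env || echo "run the setup commands first" prints the missing variable names followed by the reminder.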

Admin

create metacat account

  • create new metacat accounts via the GUI (https://metacat.fnal.gov:9443/mu2e_meta_prod/app/gui/datasets)
    • "anonymized user Id" is for token access and is the text from the "sub" field of the user's token. Enabling token access is required.
    • DN is for proxy access; it is easiest to obtain by running metacat auth mydn -c /tmp/x509up_u$UID from the user's account. It can also be left blank.
    • password can be left blank since we expect only token access
  • create new metacat roles via the GUI
    • add the role on the role page of the metacat GUI
    • add a user to a role via the user's page (not the role page)

create Rucio account

rucio-admin -a root account add   --type USER --email kutschke@fnal.gov kutschke
rucio-admin scope add --account kutschke --scope kutschke
rucio-admin identity add --account kutschke --email kutschke@fnal.gov --type X509 --id "/DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Rob Kutschke/CN=UID:kutschke"
rucio-admin account list-identities kutschke
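
The four commands above differ only in the account name, email address and certificate DN, so an admin may prefer to generate them from a template and review them before running. The sketch below only prints the commands and runs nothing. The function name rucio_account_cmds is hypothetical, and it assumes, as in the example above, that the scope has the same name as the account.

```shell
# rucio_account_cmds ACCOUNT EMAIL DN : print (do not run) the account-creation
# commands shown above, filled in for one user. Hypothetical helper.
rucio_account_cmds() {
    local user="$1" email="$2" dn="$3"
    printf 'rucio-admin -a root account add --type USER --email %s %s\n' "$email" "$user"
    printf 'rucio-admin scope add --account %s --scope %s\n' "$user" "$user"
    printf 'rucio-admin identity add --account %s --email %s --type X509 --id "%s"\n' "$user" "$email" "$dn"
    printf 'rucio-admin account list-identities %s\n' "$user"
}
```

The DN must be supplied exactly as it appears in the user's certificate; the printed lines can be inspected and then pasted into a shell.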


create user metacat namespace

First check whether your namespace exists:

   metacat namespace list $USER

If you don't see your namespace, you can create it with these commands (this only ever needs to be done once):

   kinit
   getToken
   metacat auth login -m token $USER
   metacat namespace create $USER

Rucio commands

list datasets

rucio list-dids --filter type=dataset mu2e:*
 (the filter excludes containers)

count the files in a dataset

rucio list-files mu2e:bck.mu2e.beams1-g4-10-5.g4-10-5.tbz | grep Total

list the files in a dataset

rucio list-files mu2e:bck.mu2e.beams1-g4-10-5.g4-10-5.tbz

count files from dataset in each replica

rucio list-dataset-replicas --deep mu2e:dig.mu2e.CeEndpointMixTriggered.MDC2020k.art

list replicas for a file

rucio list-file-replicas mu2e:dig.mu2e.CeEndpointMixTriggered.MDC2020k.001210_00000000.art
