Elastic Analysis Facility (EAF)
Latest revision as of 16:29, 15 November 2024
Introduction
This page is intended as a guide to the Fermilab Elastic Analysis Facility (EAF) for Mu2e collaborators. The official EAF documentation from SCD can be found at https://eafjupyter.readthedocs.io/en/latest/.
EAF is a web-based platform intended for Python analysis and ML tasks. The key to EAF is the container-based infrastructure, which distinguishes it from traditional virtual machines. The benefit of this approach is that underlying resources (the hardware) can be swapped without breaking the container, adding elasticity.
EAF is a powerful and flexible platform, ideal for running Mu2e Python analyses and ML tasks.
Accessing EAF
EAF is entirely web-based at https://analytics-hub.fnal.gov; that is, there is no interactive ssh access at present. This means that in order to access EAF from outside the FNAL network you will need to either use the Fermilab VPN or set up a proxy. You will also need an active services account.
There are a number of ways to set up a proxy. One method, which is endorsed by SCD and should be applicable to all operating systems, is to use Firefox with some modified network settings. Instructions on how to do this are given below.
- Ensure you have a valid kerberos ticket: check with klist, or run kinit <username>@FNAL.GOV to make a new one. Make sure to replace <username> with your FNAL username.
- Open a terminal and start an ssh tunnel to an FNAL machine on which you have an account (such as mu2egpvm01).
  ssh -f -N -D 9999 <username>@mu2egpvm01.fnal.gov
- Open Firefox and type about:config into the address bar, then click OK to ignore the warning.
- Use the search bar to change the parameters to the values shown in the table below.
Parameter | Value |
---|---|
network.proxy.socks | 127.0.0.1 |
network.proxy.socks_port | 9999 |
network.proxy.socks_remote_dns | true |
network.proxy.type | 1 |
To stop the proxy, change network.proxy.type back to its default value by pressing the reset button to the right of the edit button.
See https://library.fnal.gov/off-site-electronic-access-instructions for more information.
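If the browser cannot reach EAF after these steps, it can help to confirm that the ssh tunnel is actually listening on the local port. The following is a minimal sketch, assuming the tunnel was started with -D 9999 as above; the function name port_is_open is purely illustrative. It only checks that something accepts TCP connections on the port, not that it is a working SOCKS proxy.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 9999, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError if nothing is listening
        # or if the attempt times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Port 9999 matches the -D option used for the ssh tunnel above.
    if port_is_open():
        print("Something is listening on 127.0.0.1:9999")
    else:
        print("Nothing is listening on 127.0.0.1:9999; is the ssh tunnel running?")
```

Run this on the same machine as the browser, since the tunnel listens on the local loopback address.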
Starting an EAF server
Once you have started your proxy server and ssh tunnel (if you are off-site), go to the EAF home page at https://analytics-hub.fnal.gov, where you should see a welcome page inviting you to sign in with your Fermilab Services (SSO) account.
Make sure you're using the same browser as the one you have configured to use a proxy! You should see a page with a Start My Server button. Clicking this button will take you to a Server Options page.
Follow the instructions below to start an AL9 server.
- Go to the FIFE server box:
- Click CPU Interactives.
- Select AL9.
- Scroll down to the bottom of the page and click Start.
The server may take a few minutes to start up. You will see a page like this
There is also an option to start a server from a Mu2e AL9 image, which is now deprecated. Instead, it is recommended to use the mu2e_env environment described in The Mu2e environment.
JupyterHub and the EAF area
On loading an EAF server you will land on a JupyterHub launcher page. From here you should see various options to launch applications such as a terminal, a Python notebook, a Python file editor, or an interactive Python console. It also provides options to run the interactive applications with different Python kernels, where kernels represent both the execution engine and the packages, libraries, and dependencies available to the Python interpreter. You can also access these options by clicking the blue "+" button on the top right.
If you open a terminal and run pwd, you will see that a user area has been automatically created for you in /home. From here, you have direct access to the /exp/mu2e/app and /exp/mu2e/data areas. Access to /pnfs requires xroot, which is included in the mu2e_env environment described in The Mu2e environment.
Each user has eight guaranteed cores, a 64 GB memory limit, and 23 GB of storage on their EAF user area.
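To see what your own session reports, a short script along the following lines can be run from a notebook or terminal. This is a sketch only: resource_summary is an illustrative name, and the numbers come from the operating system's view. The CPU count may reflect the host rather than your guaranteed allocation, and the disk figures describe the filesystem holding the given path, which is not necessarily your per-user quota.

```python
import os
import shutil
from pathlib import Path


def resource_summary(path: Path) -> dict:
    """Summarise the CPUs visible to this process and disk usage for `path`."""
    # sched_getaffinity is Linux-only; fall back to cpu_count elsewhere.
    if hasattr(os, "sched_getaffinity"):
        cpus = len(os.sched_getaffinity(0))
    else:
        cpus = os.cpu_count()
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "cpu_count": cpus,
        "disk_total_gib": usage.total / gib,
        "disk_used_gib": usage.used / gib,
        "disk_free_gib": usage.free / gib,
    }


if __name__ == "__main__":
    # Report on the filesystem that holds the home directory.
    for key, value in resource_summary(Path.home()).items():
        print(f"{key}: {value}")
```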
Conda/Mamba
Conda and Mamba are open-source package and environment management systems. Mamba is a C++ reimplementation of Conda: it has the same command syntax but is intended to be more efficient. Mamba is the tool used in this example to set up our Mu2e environment, and can be used to create and manage custom user environments.
Upon initialising, Mamba will write some lines to your $HOME/.bashrc file, so first make sure that your $HOME/.bash_profile is set up to read from the .bashrc by adding the following lines to your .bash_profile using any command line text editor (such as emacs or vim).
# Get aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi
To start using Mamba, open a terminal from the JupyterHub launcher again and initialise Mamba as shown below.
mamba init
This will prompt you to open a new shell, so close the current session and start a new one.
You only need to do this once.
The Mu2e environment
Custom environments allow for flexibility beyond the base image, which is managed by SCD. To provide users with the tools needed to conduct analysis, we have installed a Python environment on /exp/data that can be used by EAF and the virtual machines. This will eventually be moved to the /cvmfs area.
To use this environment, create a symlink in your EAF ~/.conda directory that points to the environment (currently on v1.0.0), as shown below.
ln -s /cvmfs/mu2e.opensciencegrid.org/env/ana/current ~/.conda/envs/mu2e_env
You can then activate the environment using mamba.
mamba activate mu2e_env
Once inside the environment, you have access to the libraries listed below.
matplotlib
pandas
uproot
scipy
scikit-learn
pytorch
tensorflow
jupyterlab
notebook
statsmodels
awkward
urllib3==1.26.16
ipykernel
conda-pack
fsspec-xrootd
htop
anapytools==2.0.0
To use these packages interactively, you can install an ipython kernel in your user area with the following command.
python -m ipykernel install --user --name mu2e_env --display-name "mu2e_env"
After refreshing the page, you should then see mu2e_env appear as an option when launching a notebook or interactive console.
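If the kernel does not appear, one way to check whether the install step wrote anything is to look for the kernel.json file it creates. The sketch below assumes the usual Linux location for --user kernels, ~/.local/share/jupyter/kernels, which may differ if JUPYTER_DATA_DIR is set; list_user_kernels is an illustrative helper, not part of Jupyter. The command jupyter kernelspec list gives the same information from the terminal.

```python
import json
from pathlib import Path


def list_user_kernels(kernels_dir: Path) -> dict:
    """Map kernel name -> display name for each kernel.json under kernels_dir."""
    kernels = {}
    if not kernels_dir.is_dir():
        return kernels
    # Each kernel lives in its own sub-directory containing a kernel.json file.
    for spec_file in sorted(kernels_dir.glob("*/kernel.json")):
        with spec_file.open() as handle:
            spec = json.load(handle)
        name = spec_file.parent.name
        kernels[name] = spec.get("display_name", name)
    return kernels


if __name__ == "__main__":
    # Assumed default location for kernels installed with --user on Linux.
    default_dir = Path.home() / ".local" / "share" / "jupyter" / "kernels"
    for name, display in list_user_kernels(default_dir).items():
        print(f"{name}: {display}")
```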
You can also install the mu2e_env environment in your user area from a YAML file, for example:
mamba env create -f /exp/mu2e/data/users/sgrant/EAF/env/yml/mu2e_env.v1.0.0.yml
mamba activate mu2e_env.v1.0.0
In any case, you should now see (mu2e_env), or similar, as a prefix on your command line.
(mu2e_env) [<username>@jupyter-<username> ~]$
To use this environment on the Mu2e virtual machines, you can source the activate/deactivate scripts in the environment directory directly. So, to activate:
source /cvmfs/mu2e.opensciencegrid.org/env/ana/current/bin/activate
and to deactivate:
source /cvmfs/mu2e.opensciencegrid.org/env/ana/current/bin/deactivate
It might also be convenient to alias these commands in your .my_bashrc file, to something like
alias activate_mu2e_env="source /cvmfs/mu2e.opensciencegrid.org/env/ana/current/bin/activate"
alias deactivate_mu2e_env="source /cvmfs/mu2e.opensciencegrid.org/env/ana/current/bin/deactivate"
Once you have activated your environment, you can run a brief test to check that your packages are working:
$ python
>>> import numpy
The first time you import a package, it may take a minute; subsequent imports should be faster (thanks to file caching).
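To check several packages at once and see how long each first import takes, something like the following can be used. It is a sketch rather than an official test: check_packages is an illustrative name, and the package list in the example is just a sample of the libraries listed earlier. Note that import names can differ from package names (for example, scikit-learn is imported as sklearn and pytorch as torch).

```python
import importlib
import importlib.util
import time


def check_packages(names):
    """Return {name: seconds taken to import, or None if not installed}."""
    results = {}
    for name in names:
        # find_spec returns None when a top-level package cannot be found.
        if importlib.util.find_spec(name) is None:
            results[name] = None
            continue
        start = time.perf_counter()
        importlib.import_module(name)
        results[name] = time.perf_counter() - start
    return results


if __name__ == "__main__":
    for name, seconds in check_packages(["numpy", "uproot", "awkward"]).items():
        if seconds is None:
            print(f"{name}: not found")
        else:
            print(f"{name}: imported in {seconds:.2f} s")
```

Running it a second time in a fresh interpreter gives a rough idea of how much the file caching helps.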
Custom environments
To start entirely from scratch, you can create a clean custom environment with the commands below.
mamba create -q -y -n my_env
mamba activate my_env
You can then install whatever packages you need using mamba install <package_name>.
anapytools
Along with some standard Python libraries, mu2e_env comes with some additional utilities from https://github.com/Mu2e/anapytools.git.
At present, anapytools allows users to interface with SAM and /pnfs from EAF, and provides a multithreading tool. These can be imported from anapytools as the modules read_data and parallelise.
Before reading files, you will need a valid access token/certificate. Run the following:
source /cvmfs/mu2e.opensciencegrid.org/setupmu2e-art.sh
kinit ${USER}@FNAL.GOV
/cvmfs/mu2e.opensciencegrid.org/bin/vomsCert
To create a file list from a SAM dataset:
$ python
>>> from anapytools.read_data import DataReader
>>> reader = DataReader()
>>> file_list = reader.get_file_list(defname='nts.mu2e.CeEndpointMix1BBSignal.Tutorial_2024_03.tka')
To read a file from /pnfs using xroot:
$ python
>>> from anapytools.read_data import DataReader
>>> reader = DataReader()
>>> file = reader.read_file(filename='nts.sgrant.CosmicCRYExtractedCatTriggered.MDC2020ae_best_v1_3.001205_00000000.root')
An example of read_data in use is shown below. The read_data functions also include a quiet flag to suppress their printouts, if needed.
To run parallel reads on a list of files using multithreading, run:
$ python
>>> from anapytools.read_data import DataReader
>>> from anapytools.parallelise import ParallelProcessor
>>> reader = DataReader()
>>> processor = ParallelProcessor()
>>> file_list = reader.get_file_list(defname='nts.sgrant.CosmicCRYExtractedCatTriggered.MDC2020ae_best_v1_3.root', quiet=False)
>>> def process_function(filename):
...     file = reader.read_file(filename, quiet=True)
...     return
...
>>> processor.multithread(process_function, file_list)
An example of parallelise in use is shown below.
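For readers curious about the general pattern behind multithreaded reads, the following generic sketch uses Python's standard concurrent.futures module. It is not the anapytools implementation, and run_in_threads, max_workers, and the demo file names are all illustrative. Threads tend to suit this kind of work because reading remote files is mostly spent waiting on I/O.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_threads(process_function, file_list, max_workers=4):
    """Apply process_function to each file name using a thread pool.

    Returns a dict mapping file name -> result, or the exception raised
    while processing that file.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit one task per file and remember which future belongs to which file.
        futures = {executor.submit(process_function, name): name for name in file_list}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # keep going if one file fails
                results[name] = exc
    return results


if __name__ == "__main__":
    # Stand-in for a real read: just report the length of each file name.
    demo_files = ["a.root", "bb.root", "ccc.root"]
    print(run_in_threads(len, demo_files))
```

Storing exceptions alongside results, rather than raising on the first failure, is one possible design choice; it lets a long file list finish even if a few reads fail.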
More tools will be added in future.