Ntuples

Introduction

Our primary data is stored in the art format. This format uses root I/O, but embeds it in a framework with restrictive rules for data access. These framework rules are important during primary processing in order to track the provenance of the data precisely. Later in the analysis process, the dominant problem becomes accessing the data in a convenient way, rather than a very controlled way. The solution is to copy the highest-level parts of the data (such as the number of hits on a track, and its momentum) into a smaller and faster format - an ntuple. Once the data is in this format, usually a root tree, the user can make histograms with simple cuts. Since only the high-level data values are stored, the dataset is very small and access is very fast.
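
As a quick illustration of that convenience, the ROOT snippet below fills a histogram from an ntuple with a single cut. This is only a sketch: the file name, tree path, and branch names are made-up placeholders, so substitute the names used by whichever ntuple format you adopt.

  // sketch_ntuple.C - minimal sketch of interactive ntuple access in ROOT.
  // All file, tree, and branch names below are hypothetical placeholders.
  #include "TFile.h"
  #include "TTree.h"
  #include <cstdio>

  void sketch_ntuple() {
    TFile* f = TFile::Open("my_ntuple.root");          // placeholder file name
    if (!f || f->IsZombie()) { printf("could not open file\n"); return; }
    TTree* t = nullptr;
    f->GetObject("ntdir/nt", t);                       // placeholder tree path
    if (!t) { printf("tree not found\n"); return; }
    // Histogram the track momentum for well-reconstructed tracks:
    // one line, one cut - this is the convenience an ntuple buys you.
    t->Draw("trkMom>>hMom(100,90,110)", "trkNHits>20");
  }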

Ideally the collaboration would choose a primary ntuple format and officially support it. Official support would include code and documentation support, and priority in support and processing. With one central format, everyone's work in creating datasets and tools could be shared. At this time (8/2018) the selection of a primary ntuple format has not occurred, but it is still the plan. Meanwhile, there are several ntuple systems that are supported by small groups. The files in these formats can be uploaded to tape and documented like art datasets.

In choosing an ntuple to match a goal, several factors can be considered. One factor is personal preference - do you value security or speed? Do you want to take the time to learn a new system? If the problem is simple and limited, a custom root ntuple or art histogramming may be most appropriate. If the project is highly specialized, you may need to make your own ntuple. If the project is larger or more complex, then the primary concern may be to coordinate with the other people you will be working with - your mentor, your working group, detector group, university group, etc. What tools and datasets are available and who supports them?

Stntuple

This procedure can copy large parts of the data, essentially all of it if you want, into a more compact, fast-access format. It copies the art products into more compact C++ objects, which are then written to root branches. Access is through a custom lightweight framework which supports user analysis modules; these access the data as objects and make histograms. Because access is modular (the user modules can be shared and chained into paths) and the data is always accessed as an object (like a track or cluster), it is easy to share and build up tools. The branches are not split, which usually makes access faster, but the tree is not easily browsable. All access is by compiled code.

Currently this package is maintained in its own git repo by Pasha Murat and Giani Pezzullo.

  • Stntuple docs: https://sites.google.com/view/stntuple/home
  • github docs: https://github.com/Mu2e/Stntuple/blob/main/doc/Stntuple.org
  • git url: https://github.com/Mu2e/Stntuple.git

TrkAna

This ntuple is a root tree with each entry corresponding to a track. More details can be found on the TrkAna wiki page.

Custom root tree

You can make root trees in your own format, which is pretty straightforward if you follow an example. You write a module which creates a root tree and fills it by reading the art products and copying variables to the tree. The tree is written to the TFileService output file. The biggest advantage of this approach is that the ntuple is completely custom, so it holds only what you want. Access is very fast and interactive browsing is easy. The downside is that you generally can't share tools and datasets with other users.

Some examples of this form of ntuple are:

  • Analyses/src/ReadBack_module.cc
  • Analyses/src/SimParticleAnalyzer_module.cc
  • Analyses/src/CosmicAnalysis_module.cc
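
In addition to these full examples, here is a minimal sketch of the pattern, assuming a recent art release (in older releases the TFileService header lives under art/Framework/Services/Optional). The module name, the branch contents, and the commented-out product type and module label are placeholders, not real collaboration code.

  // MyNtupleMaker_module.cc - sketch of an analyzer that writes a custom root tree.
  #include "art/Framework/Core/EDAnalyzer.h"
  #include "art/Framework/Core/ModuleMacros.h"
  #include "art/Framework/Principal/Event.h"
  #include "art/Framework/Services/Registry/ServiceHandle.h"
  #include "art_root_io/TFileService.h"   // older releases: art/Framework/Services/Optional/TFileService.h
  #include "fhiclcpp/ParameterSet.h"
  #include "TTree.h"

  namespace mu2e {
    class MyNtupleMaker : public art::EDAnalyzer {
    public:
      explicit MyNtupleMaker(fhicl::ParameterSet const& pset) : art::EDAnalyzer{pset} {}
      void beginJob() override {
        // Book the tree through TFileService; it is saved to its output file automatically.
        art::ServiceHandle<art::TFileService> tfs;
        nt_ = tfs->make<TTree>("nt", "custom ntuple");
        nt_->Branch("evt", &evt_, "evt/I");
        nt_->Branch("p",   &p_,   "p/F");
      }
      void analyze(art::Event const& event) override {
        evt_ = event.event();
        // In real code, read an art product and copy the high-level variables, e.g.
        //   auto const& tracks = *event.getValidHandle<SomeTrackCollection>("someModuleLabel");
        // SomeTrackCollection and someModuleLabel are placeholders.
        p_ = 0.f;
        nt_->Fill();
      }
    private:
      TTree* nt_  = nullptr;
      int    evt_ = 0;
      float  p_   = 0.f;
    };
  }
  DEFINE_ART_MODULE(mu2e::MyNtupleMaker)

The job's fcl file then only needs TFileService configured with an output file name and this analyzer added to an end path.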

art histogramming

In the previous section, we gave some examples of how to write a module that creates a root tree for analysis. You can also create histograms directly in the art module, and they will appear in the TFileService output file. Any module can ask for the TFileService handle, create and fill a histogram, and it will be saved automatically. An advantage of this method is that it doesn't require any code except art code, which is collaboration-supported. All art-based tools are available at all times while making histograms, and there is no additional dataset to make or manage. A disadvantage is that code compilation and job startup are slower, and there is no interactive or browsing access.

Some examples of this form of histogramming are:

  • Analyses/src/ReadBack_module.cc
  • Analyses/src/StatusG4Analyzer_module.cc
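
A minimal sketch of this pattern is shown below, under the same assumptions as the tree-writing sketch above; the module name, histogram definition, and the filled value are placeholders.

  // MyHistMaker_module.cc - sketch of an analyzer that books histograms via TFileService.
  #include "art/Framework/Core/EDAnalyzer.h"
  #include "art/Framework/Core/ModuleMacros.h"
  #include "art/Framework/Principal/Event.h"
  #include "art/Framework/Services/Registry/ServiceHandle.h"
  #include "art_root_io/TFileService.h"   // older releases: art/Framework/Services/Optional/TFileService.h
  #include "fhiclcpp/ParameterSet.h"
  #include "TH1F.h"

  namespace mu2e {
    class MyHistMaker : public art::EDAnalyzer {
    public:
      explicit MyHistMaker(fhicl::ParameterSet const& pset) : art::EDAnalyzer{pset} {}
      void beginJob() override {
        // Anything made through TFileService is written to its output file automatically.
        art::ServiceHandle<art::TFileService> tfs;
        hMom_ = tfs->make<TH1F>("hMom", "track momentum;p [MeV/c];tracks", 100, 90., 110.);
      }
      void analyze(art::Event const& /*event*/) override {
        // In real code, read an art product here and fill from its contents;
        // the value filled below is just a placeholder.
        hMom_->Fill(100.);
      }
    private:
      TH1F* hMom_ = nullptr;
    };
  }
  DEFINE_ART_MODULE(mu2e::MyHistMaker)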

gallery

The art team provides basic tools to support this data access method. The idea is that you write a small piece of code that can access the art products in the art root file in a more lightweight and fast way. gallery strips off the framework, provides only the class definitions, and connects them to the art root branches. It effectively turns the art production file into a form of ntuple. You should be able to use all art-based tools with this method. I believe this method does not provide services, so you can't automatically load the geometry, for example. gallery does not support writing art files; it may be possible to write art products in a pseudo-art format, but this has not been demonstrated.
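
Below is a minimal sketch of gallery-style access, assuming the gallery and canvas products are set up in your environment. The input file name, module label, and product type are placeholders; a real job would read an actual Mu2e data product that has a root dictionary.

  // read_gallery.cc - sketch of framework-free access to art products with gallery.
  #include "gallery/Event.h"
  #include "canvas/Utilities/InputTag.h"
  #include <iostream>
  #include <string>
  #include <vector>

  int main() {
    std::vector<std::string> files = {"dig.owner.description.version.sequencer.art"}; // placeholder
    art::InputTag tag("someModuleLabel");                    // placeholder module label
    for (gallery::Event ev(files); !ev.atEnd(); ev.next()) {
      // getValidHandle reads the product straight from the root branches;
      // there is no framework, no services, and no geometry available here.
      auto const& digis = *ev.getValidHandle<std::vector<double>>(tag);  // placeholder product type
      std::cout << "event " << ev.eventAuxiliary().event()
                << " has " << digis.size() << " entries\n";
    }
    return 0;
  }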

Other formats

Tools other than root have been explored at times, but there is no major effort on mu2e at this writing.

  • HDF5: https://www.hdfgroup.org/
  • r: https://www.r-project.org/

Elastic Analysis Facility

The EAF is an SCD project to provide powerful resources for analysis of large datasets, especially for machine learning. It is controlled through JupyterHub notebooks. The "elastic" part is accessing compute resources on demand. The available resources include graphics cards (GPUs). Anyone can log on, provided they are on the lab network or VPN.

  • login: https://analytics-hub.fnal.gov
  • docs: https://eafjupyter.readthedocs.io/en/latest/
  • intro talk: https://indico.fnal.gov/event/53944/contributions/239939/attachments/156199/203704/Acosta_UsersMeeting22EAF.pdf
  • note on version control of Jupyter notebooks: https://nextjournal.com/schmudde/how-to-version-control-jupyter