Ntuples: Difference between revisions

From Mu2eWiki
No edit summary
No edit summary
Line 4: Line 4:
Our primary data is stored in the [[Code|art]] format.  This format uses root I/O, but embeds it in a framework with restrictive rules for data access.  These framework rules are important during primary processing in order to precisely track the provenance of the data.  Later in the analysis process, the dominant problem becomes accessing the data in a convenient way, rather than a very controlled way.  The solution is to copy the most high-level parts of the data (such as the number of hits on a track, and its momentum) into a smaller and faster format - an ntuple.  Once the data is in this format, usually a root tree, the user can make histograms with simple cuts.  Since only the high-level data values are stored, the dataset is very small and access is very fast.


Ideally the collaboration would choose a primary ntuple format and officially support this format.  The official support would include code and document support, and priority in support and processing.  With one central format, everyone's work in creating datasets and tools could be shared.  At this time (8/2018) the selection of a primary ntuple format is not done, but it is still the plan.  Meanwhile, there are several methods
Ideally the collaboration would choose a primary ntuple format and officially support this format.  The official support would include code and document support, and priority in support and processing.  With one central format, everyone's work in creating datasets and tools could be shared.  At this time (8/2018) the selection of a primary ntuple format has not occurred, but it is still the plan.  Meanwhile, there are several ntuple systems that are supported by small groups.  The files in these formats can be uploaded to tape and [[Workflows|documented]] like art datasets.
 
In choosing an ntuple to match a goal, several factors can be considered.  One factor is personal preference - do you value security or speed? Do you want to take the time to learn a new system?  If the problem is simple and limited, a custom root ntuple or art histogramming may be most appropriate.  If the project is unusual, you may need to make your own ntuple. If the project is larger or more complex, then the primary concern may be to coordinate with the other people you will be working with - your mentor, your working group, detector group, university group, etc.  What tools and datasets are available and who supports them?


==Stntuple==
** [https://sites.google.com/view/stntuple/home Stntuple] [ssh://p-mu2eofflinesoftwarestntuple@cdcvs.fnal.gov/cvs/projects/mu2eofflinesoftwarestntuple/Stntuple.git git url] - one choice of user ntuple
 
This procedure can copy large parts of the data, essentially all of it if you want, into a more compact and fast-access format.  It copies the art products into more compact C++ objects, which are then written to root branches.  Access is through a custom lightweight framework that supports user analysis modules, which access the data as objects and make histograms.  Because access is modular (the user modules can be shared and chained into paths) and data is always accessed as an object (like a track or cluster), it is easy to share and build up tools.  The branches are not split, which usually makes access faster, but the tree is not easily browsable.
All access is by compiled code.
 
Currently this package is maintained in its own git repo by Pasha Murat and Giani Pezzullo.
 
** [https://sites.google.com/view/stntuple/home Stntuple]
**[ssh://p-mu2eofflinesoftwarestntuple@cdcvs.fnal.gov/cvs/projects/mu2eofflinesoftwarestntuple/Stntuple.git git url]


==TrkAna==


This ntuple is a root tree with branches that represent aspects of the data: hits, tracks, MC info.  Since mu2e events will rarely have more than one track, the fundamental loop basis of this ntuple is a track, not an event.  The branches are fully split, so it is easy to browse and to make histograms interactively.  It is part of the main Offline git repo because it is the basis of track monitoring.
This package is maintained by Dave Brown (LBL).
TrkAna [https://mu2e-docdb.fnal.gov/cgi-bin/private/ShowDocument?docid=7775 docdb 7775]


==Custom root tree==
You can make your own root trees, which is pretty straightforward if you follow an example.  You write a module which creates a root tree and fills it by reading the art products and copying variables to the tree.  The tree will be written to the TFileService file.
The biggest advantage of this approach is that the ntuple is completely custom, so it holds only what you want.  Access is very fast and interactive browsing is easy. The downside is that you generally can't share tools and datasets with other users.


==gallery==

Revision as of 18:25, 2 August 2018

Introduction

Our primary data is stored in the art format. This format uses root I/O, but embeds it in a framework with restrictive rules for data access. These framework rules are important during primary processing in order to precisely track the provenance of the data. Later in the analysis process, the dominant problem becomes accessing the data in a convenient way, rather than a very controlled way. The solution is to copy the most high-level parts of the data (such as the number of hits on a track, and its momentum) into a smaller and faster format - an ntuple. Once the data is in this format, usually a root tree, the user can make histograms with simple cuts. Since only the high-level data values are stored, the dataset is very small and access is very fast.
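
To make the idea of "histograms with simple cuts" concrete, here is a minimal sketch in plain C++. It deliberately avoids the root and art APIs: the TrackRow struct, the momentumHistogram function, and the MeV/c unit are assumptions made up for illustration, not part of any Mu2e package. It only shows how little work is needed once the data is a flat list of high-level values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat ntuple row: one entry per track, high-level values only.
struct TrackRow {
    int nHits;        // number of hits on the track
    double momentum;  // track momentum (MeV/c assumed for illustration)
};

// Fill a fixed-bin histogram of momentum over [lo, hi) using only rows
// that pass the cut nHits >= minHits. Out-of-range values are dropped.
std::vector<int> momentumHistogram(const std::vector<TrackRow>& rows,
                                   int minHits, std::size_t nBins,
                                   double lo, double hi) {
    assert(nBins > 0 && hi > lo);  // precondition of this sketch
    std::vector<int> bins(nBins, 0);
    const double width = (hi - lo) / static_cast<double>(nBins);
    for (const TrackRow& r : rows) {
        if (r.nHits < minHits) continue;                    // the simple cut
        if (r.momentum < lo || r.momentum >= hi) continue;  // outside the axis
        std::size_t i = static_cast<std::size_t>((r.momentum - lo) / width);
        if (i >= nBins) i = nBins - 1;  // guard against rounding at the edge
        ++bins[i];
    }
    return bins;
}
```

Calling momentumHistogram(rows, 20, 5, 100.0, 105.0) would histogram the momentum of tracks with at least 20 hits in five bins. In a real root tree the same selection is typically a one-line interactive command, which is the convenience the ntuple format buys.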

Ideally the collaboration would choose a primary ntuple format and officially support this format. The official support would include code and document support, and priority in support and processing. With one central format, everyone's work in creating datasets and tools could be shared. At this time (8/2018) the selection of a primary ntuple format has not occurred, but it is still the plan. Meanwhile, there are several ntuple systems that are supported by small groups. The files in these formats can be uploaded to tape and documented like art datasets.

In choosing an ntuple to match a goal, several factors can be considered. One factor is personal preference - do you value security or speed? Do you want to take the time to learn a new system? If the problem is simple and limited, a custom root ntuple or art histogramming may be most appropriate. If the project is unusual, you may need to make your own ntuple. If the project is larger or more complex, then the primary concern may be to coordinate with the other people you will be working with - your mentor, your working group, detector group, university group, etc. What tools and datasets are available and who supports them?

Stntuple

This procedure can copy large parts of the data, essentially all of it if you want, into a more compact and fast-access format. It copies the art products into more compact C++ objects, which are then written to root branches. Access is through a custom lightweight framework that supports user analysis modules, which access the data as objects and make histograms. Because access is modular (the user modules can be shared and chained into paths) and data is always accessed as an object (like a track or cluster), it is easy to share and build up tools. The branches are not split, which usually makes access faster, but the tree is not easily browsable. All access is by compiled code.

Currently this package is maintained in its own git repo by Pasha Murat and Giani Pezzullo.

TrkAna

This ntuple is a root tree with branches that represent aspects of the data: hits, tracks, MC info. Since mu2e events will rarely have more than one track, the fundamental loop basis of this ntuple is a track, not an event. The branches are fully split, so it is easy to browse and to make histograms interactively. It is part of the main Offline git repo because it is the basis of track monitoring.
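
The difference between a track loop basis and an event loop basis can be sketched as follows. This is plain C++ with made-up types (Track, Event, TrackEntry, flattenByTrack are all hypothetical and are not the TrkAna code or its branch layout); it only illustrates what "one entry per track" means.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical event model: an event holds zero or more tracks.
struct Track { int nHits; double momentum; };
struct Event { int eventId; std::vector<Track> tracks; };

// Hypothetical ntuple entry: one per track, carrying the event id along.
struct TrackEntry { int eventId; int trackIndex; int nHits; double momentum; };

// Flatten events into track-based entries. In this sketch an event with
// no tracks contributes no entries, and an event with two tracks
// contributes two entries that share the same eventId.
std::vector<TrackEntry> flattenByTrack(const std::vector<Event>& events) {
    std::size_t nTracks = 0;
    for (const Event& ev : events) nTracks += ev.tracks.size();

    std::vector<TrackEntry> entries;
    entries.reserve(nTracks);
    for (const Event& ev : events) {
        for (std::size_t i = 0; i < ev.tracks.size(); ++i) {
            const Track& t = ev.tracks[i];
            entries.push_back({ev.eventId, static_cast<int>(i), t.nHits, t.momentum});
        }
    }
    assert(entries.size() == nTracks);  // exactly one entry per track
    return entries;
}
```

With this layout a histogram loop runs directly over tracks, and per-event quantities have to be recovered through the carried event id.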


This package is maintained by Dave Brown (LBL). TrkAna docdb 7775

Custom root tree

You can make your own root trees, which is pretty straightforward if you follow an example. You write a module which creates a root tree and fills it by reading the art products and copying variables to the tree. The tree will be written to the TFileService file. The biggest advantage of this approach is that the ntuple is completely custom, so it holds only what you want. Access is very fast and interactive browsing is easy. The downside is that you generally can't share tools and datasets with other users.
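
The create, fill, and write pattern described above can be sketched without the framework. The example below is plain C++ and is not an art module: RecoTrack stands in for an art product, CustomNtuple stands in for the root tree, and the text output stands in for the root file. All of these names are hypothetical. A real module would follow one of the existing examples instead.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical stand-in for a detailed reconstructed-track data product.
struct RecoTrack {
    int nHits;
    double momentum;
    double t0;
    std::vector<double> hitResiduals;  // low-level detail, not copied
};

// The custom ntuple keeps only the variables the analysis wants.
struct NtupleRow { int nHits; double momentum; };

class CustomNtuple {
public:
    // "Fill": copy the selected variables out of the product.
    void fill(const RecoTrack& trk) { rows_.push_back({trk.nHits, trk.momentum}); }

    std::size_t size() const { return rows_.size(); }

    // "Write": one header line, then one line per row (text stand-in
    // for writing the tree to the output file).
    void write(std::ostream& out) const {
        out << "nHits,momentum\n";
        for (const NtupleRow& r : rows_) {
            out << r.nHits << ',' << r.momentum << '\n';
        }
    }

private:
    std::vector<NtupleRow> rows_;
};
```

Because the row holds only two numbers, the output stays small no matter how much detail the original product carried, which is the main point of the custom approach.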


gallery

Other formats

Tools other than root have been explored at times, but there is no major effort on mu2e at this writing.