DQM

From Mu2eWiki

Introduction

Data quality monitoring (DQM) is the process of running checks on new data as it comes in and recording the results. For example, as pass1 reconstruction is run, we plan to also create and save a set of histograms. To evaluate the status of the data quality, we expect two general approaches. The first is to extract a set of simple numbers (such as the mean number of hits on a track) that are sensitive to overall detector performance and data quality. These quantities can be saved in a database and plotted as a function of time or run number. The second is to compare the histograms to those from previous runs, perhaps including a quantitative comparison such as a chi2. Generally, the offline operations and online shift crews would be responsible for reviewing these monitors to spot unexpected changes and drive fixes. Eventually, the DQM content may be used to determine good runs.
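As a rough illustration of the two approaches, the sketch below computes a simple metric (the mean of a binned distribution) and a chi2 comparison of two histograms given as plain lists of bin counts. The function names and the list representation are assumptions made for illustration and are not part of the DQM code. The chi2 shown is the usual homogeneity test for two unweighted histograms, which is only one possible choice of comparison.

```python
def hist_mean(counts, centers):
    """Mean of a binned distribution, from bin counts and bin centers."""
    total = sum(counts)
    if total == 0:
        return None  # no entries, so the mean is undefined
    return sum(c * x for c, x in zip(counts, centers)) / total


def chi2_compare(current, reference):
    """Chi2 and degrees of freedom comparing two unweighted histograms.

    Bins that are empty in both histograms are skipped.
    """
    if len(current) != len(reference):
        raise ValueError("histograms must have the same binning")
    n_tot, m_tot = sum(current), sum(reference)
    if n_tot == 0 or m_tot == 0:
        raise ValueError("cannot compare an empty histogram")
    chi2, used = 0.0, 0
    for n, m in zip(current, reference):
        if n + m == 0:
            continue
        # Each term compares the two bins after scaling to a common total.
        chi2 += (m_tot * n - n_tot * m) ** 2 / (n + m)
        used += 1
    return chi2 / (n_tot * m_tot), used - 1
```

Two histograms with the same shape give a chi2 of zero regardless of their total number of entries, so a run can be compared to a reference of a different size.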

In a little more detail, we expect the data to be moved from the DAQ system to offline as soon as possible. The data will come in several datasets, such as electron triggers or off-spill cosmics. The offline system will run pass1 reconstruction on these streams of data. Pass1 will provide an initial look at the data and prepare histograms and ntuples for the calibration process. After the pass1 reco file is produced, a DQM exe will be run on it. The DQM exe will produce a DQM histogram file, which will primarily contain histograms of fundamental quantities. These DQM histogram files will be saved and written to tape.

When the DQM files are available, a few simple executables, called extractors, will run on the histogram file and write out text files containing DQM metrics, such as the average cluster energy, the average number of hits on a track, or the number of defective channels detected. Another executable (dqmTool) will insert these metrics into the permanent database.

When it is possible, an aggregator will add together all the histogram files related to one run, producing a DQM histogram file for the run. This file will also be saved, the extractors run on it, and the metrics saved. The same sort of aggregation might be done for a weekly report or for other purposes. The DQM histograms and the extracted metrics, presented as timelines, can be monitored by operations and shift crews to detect problems in the detector, data format, conditions or code, and lead to quick corrections. The same procedure will occur for pass2 reconstruction, and possibly for other procedures, such as nightly code validation.
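The aggregation step can be pictured as a bin-by-bin sum of histograms that share a name and binning. The sketch below is a minimal illustration using plain dictionaries of bin counts. The data layout, the histogram names, and the function name `aggregate` are assumptions; the real aggregator operates on DQM histogram files.

```python
def aggregate(files):
    """Sum histograms bin by bin across several per-file histogram sets.

    `files` is a list of dicts mapping histogram name -> list of bin counts.
    """
    total = {}
    for hists in files:
        for name, counts in hists.items():
            if name not in total:
                total[name] = list(counts)  # copy so the inputs stay unchanged
                continue
            if len(total[name]) != len(counts):
                raise ValueError(f"binning mismatch for {name}")
            total[name] = [a + b for a, b in zip(total[name], counts)]
    return total


# Hypothetical histograms from two files of the same run.
run_hists = aggregate([
    {"clusterEnergy": [1, 2, 3]},
    {"clusterEnergy": [4, 5, 6], "trackHits": [7, 8]},
])
```

Because the sum keeps the full bin contents, extractors can be run on the aggregated result exactly as they are on a single file.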

File names

DQM file names will have a specific pattern:

ntd.mu2e.DQM_stream.process_aggregation_version.run_subrun.root
  • the fields, such as the data_tier, owner, and sequencer, should follow the standard file name pattern
  • DQM string is always first in the description
  • stream typically represents a logical stream of files written by the DAQ, expected to be named as a dataset, and fed into the offline processing. This field is expected to be something like "ele" or "cosmic".
  • process is the procedure which produced these DQM plots, for example, "pass1" or "pass2"
  • aggregation allows for the fact that smaller DQM files are likely to be added together. For example, the 20 or so files that go into a stream during a run can be added together so that a single DQM result is recorded for the whole run. Some aggregation key words might be "file" for a single file, or "run" or "week".
  • version is an integer and should reflect the version of the process which produced the file which this DQM file represents. For example, we expect pass1 will occasionally be stopped, repaired or improved, and then re-run on some subset of data. This will advance the pass1 version number and DQM should follow that version.
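To make the pattern concrete, here is a minimal sketch that splits a standard DQM file name into its source fields and starting run and subrun. The class and function names are hypothetical, and the sketch assumes that the process and aggregation words contain no underscores.

```python
from typing import NamedTuple


class DqmSource(NamedTuple):
    process: str
    stream: str
    aggregation: str
    version: int
    run: int
    subrun: int


def parse_dqm_name(filename):
    """Split a standard DQM file name into its source and run fields."""
    fields = filename.split(".")
    if len(fields) != 6:
        raise ValueError(f"expected 6 dot-separated fields: {filename}")
    _tier, _owner, description, config, sequencer, _ext = fields
    # The description is "DQM_<stream>"; the stream itself may contain "_".
    tag, _, stream = description.partition("_")
    if tag != "DQM" or not stream:
        raise ValueError(f"description must be DQM_<stream>: {filename}")
    process, aggregation, version = config.split("_")
    run, subrun = sequencer.split("_")
    return DqmSource(process, stream, aggregation, int(version),
                     int(run), int(subrun))


src = parse_dqm_name("ntd.mu2e.DQM_ele.pass1_file_0.100000_00000100.root")
```

Here `src.process` is "pass1", `src.stream` is "ele", `src.aggregation` is "file", and the sequencer gives run 100000, subrun 100.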

Database

The database is used to record numerical metrics derived by the extractors from a DQM histogram file. These metrics can then be plotted in timelines. Each entry needs to know

  • the source of the metrics. Logically this is the combination of "process, stream, aggregation, version".
  • the relevant run period and/or time period for the metrics.
  • the values of the metrics


The source values can be derived directly from the standard DQM file name, so it will be very useful to maintain this pattern. If the source is not a file with a standard name, it can be represented by the equivalent 4 words. The more these words are standardized, the more straightforward it will be to organize and search for the metrics. They should be treated as case-sensitive, and all should be lower-case.

We expect that timelines can usually be adequately plotted using only the run or time of the start of the period when the metric is relevant. For example, a DAQ file might contain run 100000, subrun 0 to subrun 100. The following file might represent subruns 101 to 150. The metrics extracted from these files could be plotted at the points 100000:0 and 100000:101. However, the database allows for the start and stop of the relevant period to be recorded. So the relevant period of the first file can be represented by the run range 100000:0-100000:100, using the standard run range format. We can also enter the start and stop times of the relevant period. The entry requires at least a start to the run range OR a start to the time period. The end run range and end time are optional. Both run range and time period may be present.
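As an illustration of the run range notation used above, the following sketch parses a start point with an optional end point. It handles only the two forms that appear in this text ("run:subrun" and "run:subrun-run:subrun"); the full standard run range format may allow other forms, and the function name is hypothetical.

```python
def parse_run_range(text):
    """Parse 'run:subrun' or 'run:subrun-run:subrun'.

    Returns (start, end), each a (run, subrun) pair; end is None
    when only a start is given.
    """
    def point(s):
        run, subrun = s.split(":")
        return int(run), int(subrun)

    start, sep, end = text.partition("-")
    return point(start), (point(end) if sep else None)


first_file = parse_run_range("100000:0-100000:100")
second_file = parse_run_range("100000:101")
```

Sorting metrics by the start pair is enough to order the points of a timeline, since Python compares (run, subrun) tuples element by element.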

The numerical metric is labeled by three fields: "group, subgroup, name". For example, for the metrics derived from histograms made in pass1, the groups might be "cal", "trk", or "crv". The subgroups might be "digi", "track", or "cluster". The "name" might be "meanEnergy", "rmsEnergy", or "zeroERate". While it is possible to write a metric name like "Average momentum (MeV) for tracks with p>10 and MeV 20 hits and cal cluster E>10.0 MeV", ultimately this will be more annoying than helpful. If short names need to be documented, it is probably best to do that in a parallel system, not in the database name. No commas are allowed (csv format is used internally), and some other special characters might also fail.

The numerical metric is represented by a float for the value, a float for its uncertainty (0.0 if N/A), and an integer status code. The values of the status code are determined by an enum in DQM/inc/DqmValue.hh. Zero means normal, success.

If a source and interval apply to many metrics being entered, the metrics can be listed, one per line, in a text file and committed in one command.


In this example, an extractor has produced a set of metrics in a text file. The extractor was run on a properly-named histogram file. The run and subrun of the start of the period are taken from the file name. The time range will be null.

cat myvalues.txt
 cal,digi,meanE,0.125,0.001,0
 cal,digi,fracZeroE,0.001,0.0,0
 cal,cluster,meanE,25.2,0.12,0
 cal,cluster,rmsE,2.2,0.25,0
dqmTool commit-value \
 --source ntd.mu2e.DQM_ele.pass1_file_0.100000_00000100.root \
 --value myvalues.txt
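The lines of the text file follow the order "group, subgroup, name, value, uncertainty, code". A minimal sketch of reading and writing such lines is shown here; the class and function names are hypothetical and are not taken from the DQM code.

```python
from typing import NamedTuple


class Metric(NamedTuple):
    group: str
    subgroup: str
    name: str
    value: float
    sigma: float
    code: int


def parse_metric(line):
    """Parse one 'group,subgroup,name,value,uncertainty,code' line."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 comma-separated fields: {line!r}")
    group, subgroup, name, value, sigma, code = fields
    return Metric(group, subgroup, name, float(value), float(sigma), int(code))


def format_metric(m):
    """Write a Metric back out as one line of the text file."""
    return f"{m.group},{m.subgroup},{m.name},{m.value},{m.sigma},{m.code}"


m = parse_metric(" cal,digi,meanE,0.125,0.001,0")
line = format_metric(m)
```

Checking the field count before committing catches a stray comma in a metric name, which would otherwise shift every later field.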

A single value can be committed with text:

dqmTool commit-value \
 --source ntd.mu2e.DQM_ele.pass1_file_0.100000_00000100.root \
 --value "cal,digi,meanE,0.125,0.001,0"

and the source can be explicit if it is not a file.

dqmTool commit-value \
 --source "valNightly,reco,day,0" \
 --value myvalues.txt

The commit can contain full period information. Times are in ISO 8601 format. If time zone is missing, the current time zone will be taken from the computer running the exe, and will be saved in the database in UTC.

dqmTool commit-value \
 --source ntd.mu2e.DQM_ele.pass1_file_0.100000_00000100.root \
 --runs "100000:100-100000:999999" \
 --start "2022-01-01T16:04:10.255-06:00" \
 --end   "2022-01-01T22:32:44.908-06:00" \
 --value myvalues.txt
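The time handling described above can be mirrored in a few lines of Python. This is only a sketch of the stated behavior (local time zone assumed when none is given, result in UTC), not the dqmTool implementation, and the function name `to_utc` is hypothetical.

```python
from datetime import datetime, timezone


def to_utc(text):
    """Convert an ISO 8601 time string to an aware datetime in UTC.

    A string without a zone offset is interpreted in the local time zone
    of the machine running this code.
    """
    stamp = datetime.fromisoformat(text)
    # astimezone() treats a naive datetime as local time before converting.
    return stamp.astimezone(timezone.utc)


start_utc = to_utc("2022-01-01T16:04:10.255-06:00")
```

The sketch expects the zone offset to be written with a two-digit hour, such as -06:00.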


To read the database, list the known sources of metrics

> dqmTool print-sources
...
5,valNightly,reco,day,0
...

The initial integer is the database lookup index, or "SID" of this source. The other fields are "process, stream, aggregation, version".

List the known value (metric) names

> dqmTool print-values 
...
6,ops,stats,CPU

The first integer is the database lookup index, or "VID", of the value name and the rest are the value's "group, subgroup, name".

Now list all the known intervals (not too useful)

> dqmTool print-intervals | head
45,4,0,0,0,0,2022-01-23 00:01:00-06:00,2022-01-23 00:01:00-06:00
...

The first integer is the database lookup index, or "IID", of each interval, and the second is the SID of the source for the interval. The next 4 integers are the run and subrun range, if applicable, and the last two fields are the start and stop times, if applicable.

List all the numerical metrics for source SID 5 and value VID 6

> dqmTool print-numbers --source 5 --value 6 --heading

The last three numbers are the number, its uncertainty, and the status code.

Or, in more readable format

> dqmTool print-numbers --source 5 --value 6 --heading --expand | tail
45214,1095,150.0,0  5,valNightly,reco,day,0  6,ops,stats,CPU  7549,5,0,0,0,0,2024-07-24 00:01:00-05:00,2024-07-24 00:01:00-05:00

The metric here is 1095+-150, code 0.
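For building a timeline outside the tool, an expanded line like the one above can be split into its four groups. The sketch below assumes that the groups are separated by two or more whitespace characters, as in the line shown, and that heading lines have already been skipped; the first field of the number group is ignored. The function name and the dictionary keys are hypothetical.

```python
import re


def parse_expanded(line):
    """Split one expanded print-numbers line into its four groups."""
    # Timestamps contain a single space, so split only on runs of 2 or more.
    number, source, value, interval = re.split(r"\s{2,}", line.strip())
    num = number.split(",")
    src = source.split(",")
    val = value.split(",")
    itv = interval.split(",")
    return {
        "value": float(num[1]),
        "sigma": float(num[2]),
        "code": int(num[3]),
        "source": tuple(src[1:]),  # process, stream, aggregation, version
        "metric": tuple(val[1:]),  # group, subgroup, name
        "start_time": itv[6],
        "end_time": itv[7],
    }


row = parse_expanded(
    "45214,1095,150.0,0  5,valNightly,reco,day,0  6,ops,stats,CPU  "
    "7549,5,0,0,0,0,2024-07-24 00:01:00-05:00,2024-07-24 00:01:00-05:00"
)
```

Plotting `row["value"]` with error bar `row["sigma"]` against `row["start_time"]` for each line gives the timeline for one source and metric.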

Monitors