CodeDebugging

From Mu2eWiki
Revision as of 03:12, 26 August 2020 by Genser (talk | contribs)
Jump to navigation Jump to search

Introduction

The two main tools we use to debug code are the standard gnu debugger gdb and the memory debugger valgrind. Some people use gprof or openspeedshop as a profiler.


art tools

art has some debugging tools built in art tools, including time and memory checking systems.

gdb

gdb is the gnu standard text-only debugger. We set up a compatible version when you set up an Offline version.

Here are a few simple commands to get started, there is an online manual.

If your command is

 mu2e -s somefile.art -c Print/fcl/print.fcl

Evoke gdb with:

 gdb --args mu2e -s somefile.art -c Print/fcl/print.fcl

Start execution:

(gdb) run

You can also restart execution by typing "run" again. If you run once, the libraries won't all be loaded, but when you re-run they will be. Show the call stack:

(gdb) where

Select second frame in stack:

(gdb) frame 2

Typically, the exe will have been built on another machine, so gdb can't find the source code. You can tell it about source directories like:

(gdb) dir  /cvmfs/mu2e.opensciencegrid.org/Offline/v7_0_4/SLF6/prof/Offline/Print/src

These commands can be put in a .gdbinit file. To remap path of libraries built elsewhere use:

(gdb) set substitute-path from to

Step one line, stepping over function calls:

(gdb) n

Step one line, stepping into function calls:

(gdb) s

Set break by function (tab completion available if libraries are loaded)

(gdb) break 'mu2e::CaloHitPrinter::Print(art::Event const& event, std::ostream& os)'

Set break by line number

(gdb) break 'CaloHitPrinter.cc:102'

Continue after a break:

(gdb) cont

Run to the end of a stack call:

(gdb) finish

Print local variable "x":

(gdb) p x

Print stl vector "myVector"

(gdb) print *(myVector._M_impl._M_start)@myVector.size()

list code line n

(gdb) list n

Catch art throws

(gdb) catch throw

It is possible to to do very much more such as setting break on a memory location write, attach to a running process, examine threads, call functions, set values, break after certain conditions, etc.

ddt

ddt

valgrind

valgrind is a memory debugger which largely works by inserting itself into heap memory access. It can detect:

  • use of unintialized variables
  • accessing memory freed or never allocated
  • double deletion
  • memory leaks

It is installed on all interactive machines, or you can UPS setup a particular version.

Here is a typical command

setup valgrind v3_13_0
valgrind --leak-check=yes --error-limit=no -v  \
        --demangle=yes --show-reachable=yes  --num-callers=20 mu2e -c my.fcl

The additional memory checking causes the exe to run much. much slower.

You may find that packages like root libraries may have so many (probably not consequential) errors detected that it drowns out the useful messages. "Conditional jump or move depends on uninitialised value(s)" is a common ROOT error. valgrind allows you to suppress errors that are not important to you. We have an example suppressions file which can be added with

valgrind --leak-check=yes --error-limit=no -v  \
        --suppressions=$MU2E_BASE_RELEASE/scripts/valgrind/all_including_geant_callbacks.supp \
        --demangle=yes --show-reachable=yes  --num-callers=20 mu2e -c my.fcl

This file is aggressive in removing benign errors, and it may also remove some useful errors. For example, the easiest way to suppress the thousands of errors from geant, is to suppress errors that include the geant paths in the call stack. Unfortunately, the way geant is setup, we register our simulation code with geant, and geant calls our code at appropriate times. To valgrind, this makes our code look like a part of geant and the naive suppression method may also suppress errors from our simulation code. There is no easy way around this, but it may be possible with sufficient effort.

art floating point service

This section discusses the management of floating point exceptions. The for those who need background information, Wikipedia has a good article about floating point arithmetic: [1]; of most relevance is the section that discusses floating point exceptions: [2]

art provides an interface that let's us customize how the processsor responds to floating point exceptions. The art wiki includes documentation about the art FloatingPointControl service https://cdcvs.fnal.gov/redmine/projects/art/wiki/FloatingPointControl ]. You can check if the documentation is current using art's command line help:

 mu2e --print-description FloatingPointControl

More complete information is available in the gnu documentation.

The recommended configuration is:

services.FloatingPointControl: {
    enableDivByZeroEx  : true
    enableInvalidEx    : true
    enableOverFlowEx   : true
    enableUnderFlowEx  : false   # See note below
    setPrecisionDouble : false   # see note below
    reportSettings     : true
}
  • When one of these parameters is set to true, if the named condition occurs, the floating point until will trap and raise the signal SIGFPE. This will immediately stop execution of the program and, if enabled, dump core. If you are running inside of gdb you can ask gdb to show you the line of code that produced the error. If the variable is false, the code will set the result to a value described in the gnu documentation and continue execution.
  • enableUnderFlowEx. Setting it to false is usually safe. Setting it true is likely to produce many false positives but it can be used to identify poorly written code in which a different order of operations would give a more precise result.
  • setPrecisonDouble.
    • Inside the FPU, the registers are 80 bits long but normal CPU registers are the size of a double, 64 bits.
    • setPrecisionDouble : true tells the FPU to round results to the precision of a double after each operation; the alternative is to retain all 80 bits for use in the next operation. This results in a loss of precision that is usually unimportant. One good use case for setting it true is when validating a new compiler or validating an architecture with AVX vs one without. In both cases it improves the chances of getting exact bit-for-bit identical results.
    • From Marc Paterno: "The IEEE-754 double has an effective 53 bit mantissa (including an implied leading ‘1’ bit, for all except subnormals). The Intel FPU has 80-bit extended precision that uses a 64 bit mantissa (no implied leading ‘1’. The SSE and AVX units do not have the extended precision. setPrecisionDouble: true in effect makes the FPU more like the SSE and AVX units. I am not sure what effect it has on speed, but it does have an effect on accuracy. In particular, it can help avoid some catastrophic cancellations (although careful coding can also help avoid them)."

When you are developing your code we recommend that you enable floating point exceptions as described above. One issue with this system is that it only permits configuration of FPEs on a whole job basis; if there are buggy modules ahead of yours in the art job, the work around is to run those modules in a separate job, without enabling FPEs, and write an output file. Then do your development, with FPEs enabled, using that file as your input.

gprof

openspeedshop

This profiling package needs to be installed from documents.

osspcsamp "mu2e -s somefile.art -c Print/fcl/print.fcl" >& log

Arm forger profile

arm forge's map program:

setup forge_tools
map --profile --start --nompi $(type -p mu2e) -c full.fcl <some file>.root

which will generate a map output file. If you then run

map <generated file>.map