CodeDebugging
Latest revision as of 12:15, 9 October 2024
Introduction
The two main tools we use to debug code are the standard gnu debugger gdb and the memory debugger valgrind. Some people use gprof or openspeedshop as a profiler.
The gcc manual is useful for understanding flags.
art tools
art has some debugging tools built in (see art tools), including time and memory checking systems.
gdb
gdb is the gnu standard text-only debugger. We set up a compatible version when you set up an Offline version.
getting started
Here are a few simple commands to get started; there is also an online manual (https://sourceware.org/gdb/current/onlinedocs/gdb/).
If your command is
mu2e -s somefile.art -c Print/fcl/print.fcl
Invoke gdb with:
gdb --args mu2e -s somefile.art -c Print/fcl/print.fcl
Start execution (intercepting art exceptions):
(gdb) catch throw
(gdb) run
You can also restart execution by typing "run" again. The first time you run, the libraries won't all be loaded yet, but when you re-run they will be.
Show the call stack:
(gdb) where
Select second frame in stack:
(gdb) frame 2
Typically, the exe will have been built on another machine, so gdb can't find the source code. You can tell it about source directories like:
(gdb) dir /cvmfs/mu2e.opensciencegrid.org/Offline/v7_0_4/SLF6/prof/Offline/Print/src
These commands can be put in a .gdbinit file. To remap paths of libraries built elsewhere, use:
(gdb) set substitute-path from to
Step one line, stepping over function calls:
(gdb) n
Step one line, stepping into function calls:
(gdb) s
Set break by function (tab completion available if libraries are loaded):
(gdb) break 'mu2e::CaloHitPrinter::Print(art::Event const& event, std::ostream& os)'
Set break by line number:
(gdb) break 'CaloHitPrinter.cc:102'
Continue after a break:
(gdb) cont
Run until the current function (stack frame) returns:
(gdb) finish
Print local variable "x":
(gdb) p x
Print stl vector "myVector":
(gdb) print *(myVector._M_impl._M_start)@myVector.size()
List code around line n:
(gdb) list n
Catch art throws:
(gdb) catch throw
It is possible to do very much more, such as setting a break on a memory location write, attaching to a running process, examining threads, calling functions, setting values, breaking after certain conditions, etc.
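The startup commands above (catch throw, dir, set substitute-path) can be collected in a .gdbinit file. The following is a minimal Python sketch that assembles such a file's text. The function name make_gdbinit and the substitution paths in the example are hypothetical and only illustrate the idea; the gdb commands themselves are the ones listed above.

```python
def make_gdbinit(source_dirs, substitutions):
    """Return .gdbinit text: catch C++ throws, add source dirs, remap build paths."""
    lines = ["catch throw"]  # stop whenever a C++ exception (e.g. from art) is thrown
    for directory in source_dirs:
        lines.append(f"dir {directory}")  # extra directory to search for source files
    for old, new in substitutions:
        lines.append(f"set substitute-path {old} {new}")  # remap paths baked into the libraries
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = make_gdbinit(
        ["/cvmfs/mu2e.opensciencegrid.org/Offline/v7_0_4/SLF6/prof/Offline/Print/src"],
        [("/hypothetical/build/path", "/hypothetical/local/path")],  # placeholder paths
    )
    print(text, end="")
```

Redirecting the printed text into .gdbinit in the working directory would make gdb apply these commands at startup, assuming local .gdbinit files are allowed by your gdb auto-load settings.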
running GDB with emacs
https://www.gnu.org/software/emacs/manual/html_node/emacs/GDB-Graphical-Interface.html
ddt
valgrind
valgrind is a memory debugger which largely works by inserting itself into heap memory access. It can detect:
- use of uninitialized variables
- accessing memory freed or never allocated
- double deletion
- memory leaks
It is installed on all interactive machines, or you can UPS setup a particular version.
Here is a typical command:
setup valgrind v3_21_0
valgrind --leak-check=yes --error-limit=no -v \
  --demangle=yes --show-reachable=yes --track-origins=yes --num-callers=20 \
  mu2e -c my.fcl
The additional memory checking causes the exe to run much, much slower.
You may find that packages like the ROOT libraries have so many (probably inconsequential) errors detected that they drown out the useful messages. "Conditional jump or move depends on uninitialised value(s)" is a common ROOT error. valgrind allows you to suppress errors that are not important to you. We have an example suppressions file which can be added with
valgrind --leak-check=yes --error-limit=no -v \
  --suppressions=$MU2E_BASE_RELEASE/scripts/valgrind/all_including_geant_callbacks.supp \
  --demangle=yes --show-reachable=yes --track-origins=yes --num-callers=20 \
  mu2e -c my.fcl
This file is aggressive in removing benign errors, and it may also remove some useful errors. For example, the easiest way to suppress the thousands of errors from geant is to suppress errors that include the geant paths in the call stack. Unfortunately, the way geant is set up, we register our simulation code with geant, and geant calls our code at appropriate times. To valgrind, this makes our code look like a part of geant, and the naive suppression method may also suppress errors from our simulation code. There is no easy way around this, but it may be possible with sufficient effort.
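Before deciding what to suppress, it can help to see which messages dominate a valgrind log. The sketch below tallies headline messages in a saved log. It assumes the usual layout in which each line starts with "==PID== " and stack frames or follow-up details carry extra indentation after that prefix; the sample log text and the function name summarize are made up for illustration.

```python
import re
from collections import Counter

# Assumption: headline lines look like "==PID== message"; frames ("at ...",
# "by ...") and detail lines have additional leading spaces and are skipped.
HEADLINE = re.compile(r"^==\d+== (\S.*)$")


def summarize(log_text):
    """Count how often each headline message appears in a valgrind log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = HEADLINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = "\n".join([
        "==30413== Conditional jump or move depends on uninitialised value(s)",
        "==30413==    at 0x4C2E0E0: hypothetical_function (hypothetical.cc:10)",
        "==30413== ",
        "==30413== Invalid read of size 4",
        "==30413==    at 0x400544: main (toy.cc:5)",
        "==30413== ",
        "==30413== Conditional jump or move depends on uninitialised value(s)",
        "==30413==    at 0x4C2E0E0: hypothetical_function (hypothetical.cc:10)",
    ])
    for message, count in summarize(sample).most_common():
        print(count, message)
```

The tally also picks up banner and summary lines, so treat it as a rough guide to what is flooding the log rather than an exact error count.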
One known problem is a large stack item, which leads to this warning, along with a way to suppress it:
==30413== Warning: client switching stacks? SP change: 0x1ffeff2ba8 --> 0x1ffed07d30
==30413== to suppress, use: --max-stackframe=3059320 or greater
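The suggested --max-stackframe value is simply the size of the stack-pointer jump reported in the warning. As a worked check, this small sketch parses the two addresses from the warning line quoted above and takes their difference; the function name stack_jump is hypothetical.

```python
import re

WARNING = "==30413== Warning: client switching stacks? SP change: 0x1ffeff2ba8 --> 0x1ffed07d30"


def stack_jump(warning_line):
    """Return the absolute stack-pointer change, in bytes, from a valgrind warning line."""
    old, new = (int(value, 16) for value in re.findall(r"0x[0-9a-fA-F]+", warning_line))
    return abs(old - new)


if __name__ == "__main__":
    print(stack_jump(WARNING))  # 3059320, matching the value valgrind suggests
```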
Valgrind can also assist with timing and performance studies, since it can profile all the routines used by the executable. To do so, use the dedicated tool callgrind as follows:
valgrind --tool=callgrind mu2e -c file.fcl ...
Callgrind will produce a file `callgrind.out.xxxx`. To inspect it, the simplest way is to use `qcachegrind`. A dedicated setup sequence needs to be run to load the qcachegrind libraries:
mu2einit
spack load qcachegrind
Then you can launch the software in the following way:
qcachegrind callgrind.out.XXXX
vector bounds check
Overrunning the bounds of a vector is a common problem which might show up as seg faults, wrong results, or unstable results.
The standard method for checking bounds is to define _GLIBCXX_DEBUG. We would do this in the mergeFlags(mu2eOpts) method of Muse/python/sconstruct_helper.py, where it would be added as
-D_GLIBCXX_DEBUG
to the list of flags. In our code, this doesn't work because the define replaces std::vector with std::__debug::vector. This works, except at the interface to a third-party package, such as root or fhiclcpp. If those interfaces include passing a vector, the signatures will no longer match, because those packages are not compiled with this define.
Here is another method, which worked as of 9/2021.
1. create a Muse work dir containing Offline, Production, Muse, and MuseConfig
2. run codetools/python/vectorCheck.py
vectorCheck.py Offline
You can also run it on subsets of directories
vectorCheck.py Offline/Validation
The effect will be to change each instance of
#include <vector>
to
#include "vectorCheck.h"
3. create your new vector include file
cp /cvmfs/mu2e.opensciencegrid.org/artexternals/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/ ./vectorCheck.h
Inside vectorCheck.h, find the line
#include <bits/stl_vector.h>
comment it out, and paste in the contents of that file itself:
/cvmfs/mu2e.opensciencegrid.org/artexternals/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/stl_vector.h
To be sure you have the right files, you can remove -o and add -E to the compile command to dump the include sources.
Now put in the range check. Find the const and non-const operator[] and edit in the range check. (Do not call at(), since that calls [].)
reference
operator[](size_type __n) _GLIBCXX_NOEXCEPT
{
  __glibcxx_requires_subscript(__n);
  _M_range_check(__n);
  return *(this->_M_impl._M_start + __n);
}
4. Remove the error flag from the build. (This could be made a switch.) Edit Muse/python/sconstruct_helper.py and remove the -DWerror flag from the list in the mergeFlags method. This was necessary because a header in fhiclcpp was giving a "misleading indentation" warning. The code is in boost, far from our mods, and the warning seems to be incorrect, so we are assuming it is a spurious warning.
5. Create a fake Muse setup
export MUSE_DIR=$PWD/Muse
export PATH=$MUSE_DIR/bin:$PATH
export MUSE_ENVSET_DIR=$PWD/MuseConfig/envset
alias muse="source muse"
and execute it.
6. Build the area and run in gdb. If there is a range error, gdb will stop and you can issue where to see the code location.
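The include substitution described in step 2 can be sketched in a few lines of Python. This is not the actual codetools/python/vectorCheck.py script, only an illustration of the text replacement it is described as performing; the function name swap_vector_include is hypothetical.

```python
import re

# Matches an '#include <vector>' directive, allowing leading spaces or tabs
# and optional spacing inside the directive.
VECTOR_INCLUDE = re.compile(r"^([ \t]*)#[ \t]*include[ \t]*<vector>", re.MULTILINE)


def swap_vector_include(source_text):
    """Replace each '#include <vector>' with '#include "vectorCheck.h"'."""
    return VECTOR_INCLUDE.sub(r'\1#include "vectorCheck.h"', source_text)


if __name__ == "__main__":
    example = "#include <string>\n#include <vector>\n\nstd::vector<int> v;\n"
    print(swap_vector_include(example), end="")
```

A real script would also walk the chosen directory tree and rewrite each source and header file in place; that part is left out here.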
fsanitize bounds check
gcc provides a switch to insert address checking code in many places in a build. It seems to be a little fragile and not very well maintained, but it has found vector bounds errors and a local C-style array overrun. You can read more about sanitize in the gcc manual (https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gcc.pdf) and the AddressSanitizer docs (https://github.com/google/sanitizers/wiki/AddressSanitizer).
This method is different from the compiler-define method in the section above. That method replaces std::vector with std::__debug::vector, but the sanitize method seems to work more like valgrind, putting hooks in allocation calls such as malloc.
To turn on sanitize, you have to build with certain switches set:
muse build ... --mu2eSanitize
and set the following environment variable, which prevents the exe from exiting at the first warning:
export ASAN_OPTIONS="verify_asan_link_order=0:alloc_dealloc_mismatch=0:detect_leaks=0"
and then run the exe. It will run several times slower than a normal exe.
Sanitize will only print one line of traceback. If that line is in an STL method, for example, you will want a full traceback. One way to get it is to set a break in gdb:
break '__ubsan::ScopedReport::ScopedReport(__ubsan::ReportOptions, __ubsan::Location, __ubsan::ErrorType)'
Historical notes
To activate this method, add the following lines to SConstruct in Muse/python
env.PrependUnique(CCFLAGS=['-fsanitize=address'])
env.PrependUnique(LINKFLAGS=['-fsanitize=address'])
after MergeFlags, and build.
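Since SConstruct files are Python, the two lines above could be gated by a switch. The sketch below shows one way this might look; it is not the actual Muse implementation of --mu2eSanitize. The helper add_sanitize_flags and the RecordingEnv stand-in class are hypothetical, and the stand-in only mimics the PrependUnique call so the example runs without SCons.

```python
def add_sanitize_flags(env, enabled):
    """Prepend -fsanitize=address to compile and link flags when the switch is on."""
    if enabled:
        env.PrependUnique(CCFLAGS=["-fsanitize=address"])
        env.PrependUnique(LINKFLAGS=["-fsanitize=address"])
    return env


class RecordingEnv:
    """Hypothetical stand-in for an SCons environment; stores prepended flags."""

    def __init__(self):
        self.flags = {}

    def PrependUnique(self, **kwargs):
        for key, values in kwargs.items():
            current = self.flags.setdefault(key, [])
            for value in reversed(values):
                if value not in current:  # keep each flag only once
                    current.insert(0, value)


if __name__ == "__main__":
    env = add_sanitize_flags(RecordingEnv(), enabled=True)
    print(env.flags)
```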
A mu2e exe command will probably result in the following error:
==19064==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
which I think is due to the fact that the "mu2e" exe is inside art and is therefore not compiled with this switch. In a local toy example, I don't get this error. The error can be avoided:
LD_PRELOAD=/cvmfs/mu2e.opensciencegrid.org/artexternals/gcc/v9_3_0/Linux64bit+3.10-2.17/lib64/libasan.so.5 \
  mu2e -n 1000 -c Production/Validation/ceSimReco.fcl &>check4.log
This can also be solved with
ASAN_OPTIONS=verify_asan_link_order=0
If you need to set multiple ASAN options, separate them with colons:
ASAN_OPTIONS=strict_string_checks=1:detect_stack_use_after_return=1:check_initialization_order=1:strict_init_order=1
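When a job is launched from a script, the colon-separated ASAN_OPTIONS string can be built from a dictionary instead of being typed by hand. This is a minimal sketch; the helper names asan_options and sanitized_env are hypothetical, and the option names are the ones used in this section.

```python
import os


def asan_options(options):
    """Join option names and values into the colon-separated ASAN_OPTIONS form."""
    return ":".join(f"{name}={value}" for name, value in options.items())


def sanitized_env(options, base_env=None):
    """Return a copy of the environment with ASAN_OPTIONS set from the dictionary."""
    env = dict(os.environ if base_env is None else base_env)
    env["ASAN_OPTIONS"] = asan_options(options)
    return env


if __name__ == "__main__":
    settings = {
        "verify_asan_link_order": 0,
        "alloc_dealloc_mismatch": 0,
        "detect_leaks": 0,
    }
    print(asan_options(settings))
```

The dictionary returned by sanitized_env can be passed as the env argument of subprocess.run when starting the sanitized executable.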
I also hit this error
==29458==ERROR: AddressSanitizer: alloc-dealloc-mismatch (malloc vs operator delete) on 0x611005473a00
inside a root routine. The root code was not compiled with this flag, but this was in a header file, so it might be compiled locally. Luckily, it comes with this advice:
==29458==HINT: if you don't care about these errors you may set ASAN_OPTIONS=alloc_dealloc_mismatch=0
so adding
export ASAN_OPTIONS="alloc_dealloc_mismatch=0"
caused it to no longer stop at this point.
This system is supposed to have flags which prevent the code from stopping on errors,
-fsanitize-recover=all or -fsanitize-recover=address
but neither worked for me, even in a toy example.
art floating point service
This section discusses the management of floating point exceptions. For those who need background information, Wikipedia has a good article about floating point arithmetic: [1]; of most relevance is the section that discusses floating point exceptions: [2]
art provides an interface that lets us customize how the processor responds to floating point exceptions. The art wiki includes documentation about the art FloatingPointControl service: https://cdcvs.fnal.gov/redmine/projects/art/wiki/FloatingPointControl. You can check if the documentation is current using art's command line help:
mu2e --print-description FloatingPointControl
More complete information is available in the gnu documentation.
The recommended configuration is:
services.FloatingPointControl: {
  enableDivByZeroEx  : true
  enableInvalidEx    : true
  enableOverFlowEx   : true
  enableUnderFlowEx  : false  # See note below
  setPrecisionDouble : false  # see note below
  reportSettings     : true
}
- When one of these parameters is set to true, if the named condition occurs, the floating point unit will trap and raise the signal SIGFPE. This will immediately stop execution of the program and, if enabled, dump core. If you are running inside of gdb you can ask gdb to show you the line of code that produced the error. If the variable is false, the code will set the result to a value described in the gnu documentation and continue execution.
- enableUnderFlowEx. Setting it to false is usually safe. Setting it true is likely to produce many false positives but it can be used to identify poorly written code in which a different order of operations would give a more precise result.
- setPrecisionDouble.
- Inside the FPU, the registers are 80 bits long but normal CPU registers are the size of a double, 64 bits.
- setPrecisionDouble : true tells the FPU to round results to the precision of a double after each operation; the alternative is to retain all 80 bits for use in the next operation. This results in a loss of precision that is usually unimportant. One good use case for setting it true is when validating a new compiler or validating an architecture with AVX vs one without. In both cases it improves the chances of getting exact bit-for-bit identical results.
- From Marc Paterno: "The IEEE-754 double has an effective 53 bit mantissa (including an implied leading ‘1’ bit, for all except subnormals). The Intel FPU has 80-bit extended precision that uses a 64 bit mantissa (no implied leading ‘1’). The SSE and AVX units do not have the extended precision. setPrecisionDouble: true in effect makes the FPU more like the SSE and AVX units. I am not sure what effect it has on speed, but it does have an effect on accuracy. In particular, it can help avoid some catastrophic cancellations (although careful coding can also help avoid them)."
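The 53-bit mantissa and the untrapped behavior described in the notes above can be illustrated with ordinary Python floats, which are IEEE-754 doubles on common platforms (an assumption about the interpreter build). This sketch does not touch the art FloatingPointControl service or the x87 80-bit registers; it only shows double-precision rounding and the special values produced when no trap is taken. The function names are hypothetical.

```python
import math
import sys


def mantissa_bits():
    """Mantissa width of a Python float, counting the implied leading 1 bit."""
    return sys.float_info.mant_dig


def absorbed_increment():
    """Adding 1 to 2**53 is lost to rounding, so the difference comes back as 0."""
    big = 2.0 ** 53
    return (big + 1.0) - big


def untrapped_results():
    """Without a trap, overflow gives inf and an invalid operation gives NaN."""
    overflow = 1e308 * 10.0
    invalid = math.inf - math.inf
    return overflow, invalid


if __name__ == "__main__":
    print(mantissa_bits())
    print(absorbed_increment())
    print(untrapped_results())
```

With the overflow and invalid exceptions enabled as recommended, the corresponding operations in a C++ job would raise SIGFPE instead of silently continuing with inf or NaN.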
When you are developing your code we recommend that you enable floating point exceptions as described above. One issue with this system is that it only permits configuration of FPEs on a whole-job basis; if there are buggy modules ahead of yours in the art job, the workaround is to run those modules in a separate job, without enabling FPEs, and write an output file. Then do your development, with FPEs enabled, using that file as your input.
stack corruption
Interesting blog post: https://rkd.me.uk/posts/2020-04-11-stack-corruption-and-how-to-debug-it.html
gprof
openspeedshop
This profiling package needs to be installed from documents.
osspcsamp "mu2e -s somefile.art -c Print/fcl/print.fcl" >& log
Arm Forge profile
Arm Forge's map program:
setup forge_tools
map --profile --start --nompi $(type -p mu2e) -c full.fcl <some file>.root
which will generate a map output file. If you then run
map <generated file>.map