ErrorRecovery

From Mu2eWiki
Revision as of 23:00, 26 February 2020 by Rlc

Introduction

Some errors occur regularly, such as when authorization expires, when dCache is struggling, or when a procedure that cannot be repeated is run again. Some common situations are recorded here with advice on how to handle them. Errors that can easily be googled, such as syntax errors, are not covered here.

logins

When ssh-ing into an interactive node (observed while connecting to SL7 machines), you see the following warning:

-bash: warning: setlocale: LC_CTYPE: cannot change locale (UTF-8): No such file or directory
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = (unset),
	LC_ALL = (unset),
	LC_CTYPE = "UTF-8",
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
The user who reported this noted that the warning had not been present before and was still not present on mu2egpvm01. It came with the following runtime error in art:

---- StdException BEGIN
  ServiceCreation std::exception caught during construction of service type art::TFileService: locale::facet::_S_create_c_locale name not valid
---- StdException END

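The solution is to comment out the following line in the file /etc/ssh/ssh_config on your local machine (the machine you ssh from):

SendEnv LANG LC_*

With this line active, ssh forwards your local locale settings to the remote node, which may not have that locale installed.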

See also stackoverflow
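
To see which locale variables the remote node cannot honor, something like the following sketch can be run on that node. The function names are illustrative and not part of any Mu2e tool. It only asks the C library whether each locale name is accepted, which is the same check that produces the perl warning.

```python
import locale
import os


def locale_is_usable(name):
    """Return True if the C library on this machine accepts the locale name."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore what we found


def unusable_locale_vars(environ=os.environ):
    """List (variable, value) pairs whose locale value is not accepted here."""
    names = ("LANG", "LC_ALL", "LC_CTYPE")
    return [
        (name, environ[name])
        for name in names
        if environ.get(name) and not locale_is_usable(environ[name])
    ]


if __name__ == "__main__":
    for var, value in unusable_locale_vars():
        print(f"{var}={value} is not supported on this node")
```

A value such as LC_CTYPE=UTF-8 showing up in this list suggests it was forwarded by your ssh client.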

git

KeyError on git push

Traceback (most recent call last):
  File "hooks/post-receive", line 73, in <module>
    main()
  File "hooks/post-receive", line 62, in main
    os.environ["GL_USER"] = os.environ["REMOTEUSER"]
  File "/usr/lib64/python2.6/UserDict.py", line 22, in __getitem__
    raise KeyError(key)
KeyError: 'REMOTEUSER'
error: hooks/post-receive exited with error code 1
To ssh://p-mu2eofflinesoftwaremu2eoffline@cdcvs.fnal.gov/cvs/projects/mu2eofflinesoftwaremu2eoffline/Offline.git

This is due to inconsistent ssh settings; please see this page.

Could not resolve hostname

On any git command, you see:

> git pull
ssh: Could not resolve hostname cdcvs.fnal.gov:: Name or service not known
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Check which git you are using

> which git
/usr/bin/git

This result is wrong: /usr/bin/git is an old version that is inconsistent with our settings. You should be using a ups version:

> which git
/cvmfs/mu2e.opensciencegrid.org/artexternals/git/v2_20_1/Linux64bit+3.10-2.17/bin/git

The exact git version may change. You get the ups version when you setup mu2e. This error may also have other causes.
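
The check above can be scripted. This is a minimal sketch that assumes the ups version of git is always served from a path under /cvmfs/, as in the example output; the function name is hypothetical.

```python
import shutil

# Assumption: ups products, including git, live under /cvmfs/.
CVMFS_PREFIX = "/cvmfs/"


def is_ups_git(git_path):
    """Return True if the given git executable path looks like a ups version."""
    return git_path is not None and git_path.startswith(CVMFS_PREFIX)


if __name__ == "__main__":
    path = shutil.which("git")  # same lookup as `which git`
    if is_ups_git(path):
        print(f"OK: using {path}")
    else:
        print(f"Suspicious git ({path}); did you setup mu2e?")
```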

fatal: not a git repository

When issuing any of several git commands, you see:

fatal: not a git repository (or any parent up to mount point /mu2e)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).

Probably you are not in a git repository directory. Every git repo directory (such as our Offline) has a .git subdirectory at its top level.
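
To find out which repository, if any, contains your current directory, you can look for .git in the directory and each of its parents. The sketch below does that. It is a simplification of what git itself does, since git also stops at file system boundaries, as the error message says. The function name is illustrative.

```python
from pathlib import Path


def find_git_root(start="."):
    """Return the nearest directory at or above `start` that holds .git, or None."""
    path = Path(start).resolve()
    for candidate in (path, *path.parents):
        if (candidate / ".git").exists():
            return candidate
    return None


if __name__ == "__main__":
    root = find_git_root()
    print(root if root else "not inside a git repository")
```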

scons

art and fcl

art exit codes

See the art exit codes page.

cling::AutoloadingVisitor

At the beginning of run time you see a spew of errors like this:

Error in cling::AutoloadingVisitor::InsertIntoAutoloadingState:
   Missing FileEntry for MCDataProducts/inc/ExtMonFNALPatRecTruthAssns.hh
   requested to autoload type mu2e::ExtMonFNALTrkFit

These are caused by the cling system as implemented by root. Eventually this system will pre-compile header files (and maybe dictionaries?) so that compiling source files is very fast. In root 6, this is not fully implemented, and cling needs to find the header file for reasons that are not clear to us. It searches the path given by ROOT_INCLUDE_PATH, which should be set adequately by the setups. If, for some reason, the path is not correct or you have removed the header files, you see this error message. By itself the message appears to be harmless; we know of no consequence except the spew.

Update (1/2020): we have also seen this error correlated with crashes; in other words, the missing header files might be providing something root needs. We are trying to figure out what root might be getting from the headers and provide it all in the libraries.

An example crash (Lisa, 1/28/20, in a docker container, ceSimReco):

%MSG-s ArtException:  PostEndJob 28-Jan-2020 19:53:16 UTC ModuleEndJob
---- FatalRootError BEGIN
  Fatal Root Error: @SUB=
  ! (prop&kIsClass) && "Impossible code path" violated at line 462 of `/scratch/workspace/canvas-products/v3_09_00-/SLF7/e19-prof/build/root/v6_18_04c/source/root-6.18.04/io/io/src/TGenCollectionProxy.cxx'
---- FatalRootError END
%MSG

Another from Dave (LBL) 12/23/19. This seems to be a real missing dictionary.

%MSG-s ArtException: FileDumperOutput:dumper@Construction 23-Dec-2019 10:00:54 CST ModuleConstruction
cet::exception caught in art
---- FileReadError BEGIN
---- FatalRootError BEGIN
Fatal Root Error: @SUB=TBufferFile::ReadClassBuffer
Could not find the StreamerInfo for version 2 of the class art::ProcessHistory, object skipped at offset 108
---- FatalRootError END

---- FileReadError END
%MSG

cannot access private key

When attempting to read a root or xroot input file spec in art or root, you see this error:

180518 15:04:27 27909 secgsi_InitProxy: cannot access private key file: /nashome/r/rlc/.globus/userkey.pem
%MSG-s ArtException:  Early 18-May-2018 15:04:27 CDT JobSetup
cet::exception caught in art
---- FileOpenError BEGIN
  ---- FatalRootError BEGIN
    Fatal Root Error: @SUB=TNetXNGFile::Open
    [FATAL] Auth failed
  ---- FatalRootError END

The problem is that you need a voms proxy. See authentication. A kx509 cert or proxy identifies you, but does not show that you are a member of mu2e. The voms proxy adds this information (the "VO" in voms stands for Virtual Organization).

'present' is not actually present

  Principal::getForOutput
    A product with a status of 'present' is not actually present.
    The branch name is mu2e::CaloShowerSteps_CaloShowerStepFromStepPt_calorimeterRO_photon.
    Contact a framework developer.

This needs a newer version of art (2_11), but it can be worked around; see this bug report.

Unsuccessful attempt to convert FHiCL parameter

At the start of the job you see an art error message:

  Unsuccessful attempt to convert FHiCL parameter 'inputFiles' to type 'std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >'.

You could expand your fcl file:

 fhicl-dump myfile.fcl > temp.txt

and look for the fcl parameter named in the message ("inputFiles" in this example), which will probably be set to null. You can see what stanza it is part of and take it from there.

A common case is that the stopped muon input file was not defined:

 physics.producers.generate.muonStops.inputFiles

needs to be set to a value, currently defined in EventGenerator/fcl/prolog.fcl.
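
Searching the expanded file can be scripted. The sketch below assumes that the fhicl-dump output puts each parameter on its own line in the form "name: value"; the sample text, including the @nil value, is made up for illustration and is not real output.

```python
# Hypothetical excerpt of fhicl-dump output, for illustration only.
SAMPLE_DUMP = """\
physics: {
   producers: {
      generate: {
         muonStops: {
            inputFiles: @nil
         }
      }
   }
}
"""


def find_parameter(dump_text, name):
    """Return (line_number, stripped_line) pairs for lines that set `name`."""
    hits = []
    for number, line in enumerate(dump_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith(name + ":"):
            hits.append((number, stripped))
    return hits


if __name__ == "__main__":
    for number, line in find_parameter(SAMPLE_DUMP, "inputFiles"):
        print(f"line {number}: {line}")
```

With the line number in hand, you can open temp.txt there and read upward to find the enclosing stanza.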

unable to find the service

At the beginning of an art executable run you see:

  ---- ServiceNotFound BEGIN
    art::ServicesManager unable to find the service of type 'art::TFileService'.
  ---- ServiceNotFound END

art has a service called TFileService which opens a root file to hold histograms and ntuples made by user modules. In some fcl files, the service is not configured with a default file name; if a module then asks the service for the root file, this error is generated. You can fix it by adding a line to your fcl:

 services.TFileService.fileName: "myfile.root"

If you are modifying this .fcl for grid running, you would write:

services.TFileService.fileName: "nts.owner.desc.version.sequencer.root"

where "desc" is the field you enter in generate_fcl.

If running interactively, you can add the file name on the command line:

mu2e -T myfile.root -c ...

There doesn't need to be any specific fcl text in the services stanza to create the service; art will create it automatically if the root file name is defined.

Found zero products matching all criteria

In MDC there were a number of changes made to data product names. A standard job to read MDC output is TrkDiag/fcl/TrkAnaDigis.fcl. Let's say you wanted to weight the events from an MDC output (say JobConfig/primary/flateminus.fcl) using the DecayInOrbitWeight module. A natural thing to do is change the trigger_path to:

           physics.TrkAnaTriggerPath : [ @sequence::TrkAna.TrkCaloRecoSequence,DIOWeight ]

which weights the generated DIO electron, but this will fail with the "zero product error". You need to add:

           physics.producers.DIOWeight.inputModule: "compressDigiMCs"

How would you know this? Run Print/fcl/print.fcl on the output of flateminus.fcl, perhaps dig.owner.flateminus.version.sequencer.art. Then you will see the data products listed and something like:

                   Friendly Class Name        Module Label    Instance Name  Process Name     Product ID
                    mu2e::GenParticles     compressDigiMCs                     flateminus  2502071937
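
If the product listing is long, a small script can pull out the module label for a given class. This sketch assumes rows are whitespace-separated as in the excerpt above, with the instance name column possibly empty. It will not handle class names that contain spaces, and the function names are illustrative.

```python
def parse_product_line(line):
    """Split one row of the product listing into named fields, or return None.

    The instance name column is often empty, so a row has 4 or 5 tokens.
    """
    tokens = line.split()
    if len(tokens) == 5:
        cls, module, instance, process, product_id = tokens
    elif len(tokens) == 4:
        cls, module, process, product_id = tokens
        instance = ""
    else:
        return None  # header line or a row this sketch cannot parse
    return {
        "class": cls,
        "module": module,
        "instance": instance,
        "process": process,
        "id": product_id,
    }


def module_labels_for(listing, class_name):
    """Return the module labels of all rows whose class matches `class_name`."""
    rows = (parse_product_line(line) for line in listing.splitlines())
    return [row["module"] for row in rows if row and row["class"] == class_name]
```

The label returned for mu2e::GenParticles is the value to use for inputModule.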

error writing all requested bytes

Your jobs start failing at a rate of a few percent, and you see this error in the log file:

%MSG-s ArtException:  PostEndJob 22-Sep-2018 02:06:49 UTC ModuleEndJob
cet::exception caught in art
---- OtherArt BEGIN
  ---- FatalRootError BEGIN
    Fatal Root Error: @SUB=TFile::WriteBuffer
    error writing all requested bytes to file ./RootOutput-ed23-499f-b549-ae0f.root, wrote 374 of 1272
  ---- FatalRootError END
  ---- OtherArt BEGIN
    ---- FatalRootError BEGIN
      Fatal Root Error: @SUB=TTree::SetEntries
      Tree branches have different numbers of entries, eg EventAuxiliary has 918 entries while art::TriggerResults_TriggerResults__flateminus. has 919 entries.
    ---- FatalRootError END
  ---- OtherArt END
---- OtherArt END
%MSG

This seems to be new around 9/25/18. It is probably a random Docker-related error; just resubmit the job.


Proditions service not found

---- ServiceNotFound BEGIN
  Unable to create ServiceHandle.
  Perhaps the FHiCL configuration does not specify the necessary service?
  The class of the service is noted below...
  ---- ServiceNotFound BEGIN
    art::ServicesManager unable to find the service of type 'mu2e::ProditionsService'.
  ---- ServiceNotFound END
---- ServiceNotFound END
%MSG

This service provides access to the database for conditions data. It was introduced in Feb 2019. Adding this dependence means that when you run modules that need this service (such as straw reco), you need to configure the service. Either use the default services:

services : @local::Services.SimAndReco

or if you need to use your own explicit services fcl stanza, add these lines:

   DbService : @local::DbEmpty
   ProditionsService: @local::Proditions

getByLabel: Found zero products matching all criteria

On the first event, this error occurs:

          getByLabel: Found zero products matching all criteria
          Looking for type: std::map<art::Ptr<mu2e::SimParticle>,double>
          Looking for module label: compressDigiMCs
          Looking for productInstanceName: cosmicTimeMap
          cet::exception going through module ComboHitDiag/CHD run: 1002 subRun: 39 event: 10

The details of which product is missing from which module may be different.

The first possibility is that you are requesting a product that doesn't exist in the file. Read about products, or work with an expert on debugging the fcl or the path.

In Summer 2019, there was an additional, temporary cause of this error: a change in art behavior in art 3.02 (v7_5_1 to v7_5_4), probably fixed in art 3.04. The changed behavior is triggered by products with similar names. See hypernews and Kyle's talk. Dave Brown has work-arounds which involve making the product references more explicit in fcl.

Assertion an overridden cannot be out-of-date

/scratch/workspace/canvas-products/vdevelop-/SLF7/e17-debug/build/root/v6_16_00/source/root-6.16.00/interpreter/llvm/src/tools/clang/include/clang/Serialization/Module.h:72:
clang::serialization::InputFile::InputFile(const clang::FileEntry*,
bool, bool): Assertion `!(isOverridden && isOutOfDate) && "an overridden
cannot be out-of-date"' failed.

See this discussion.

Grid Workflows

Hold codes

6 sub 0,2 could not execute glidein, and/or docker did not run
9 not enough memory; remember that you need to include the size of the code tarball/release
12 sub 2 could not execute glidein
13
26 sub 8 wall time (also in the Docker era)
26 sub 1 memory limits
26 sub 4 SYSTEM_PERIODIC_HOLD  Starts/limit 31/10 - too many restarts?
28 sub -10000,512,768,256 sandbox
30 sub -10000,768 Job put on hold by remote host,  job proxy is not valid
34 sub 0 - memory limits (also in the Docker era)
35 sub 0 Error from slot - probably a condor (Docker?) failure on the worker node
??? disk limits - probably 26 sub ?

See also condor docs

A subset of holds will be released automatically. If the job has started fewer than 10 times, was held for reason 6 or 35, and has been on hold for more than 1200 seconds (20 minutes), it will be released and will restart automatically:

(NumJobStarts < 10) && (HoldReasonCode=?=6 || HoldReasonCode=?=35) && ((time()-EnteredCurrentStatus) > 1200)
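
To predict whether a held job will be released on its own, the expression can be mirrored in a few lines. This is only a restatement of the condition above in Python, with hypothetical argument names standing in for the condor attributes. It does not query condor.

```python
import time

# Hold reason codes that qualify for automatic release, from the expression.
AUTO_RELEASE_CODES = {6, 35}


def will_auto_release(num_job_starts, hold_reason_code,
                      entered_current_status, now=None):
    """Mirror the release expression; times are seconds since the epoch."""
    now = time.time() if now is None else now
    return (
        num_job_starts < 10
        and hold_reason_code in AUTO_RELEASE_CODES
        and (now - entered_current_status) > 1200
    )
```

For example, a job held with code 26 (wall time or memory limits) will never satisfy this condition and has to be fixed and resubmitted.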

mu2eFileDeclare: Metadata is invalid

 3610  OK:       /pnfs/mu2e/scratchError: got server response 400 Bad Request.
Metadata is invalid:
Parent file cnf.rhbob.pbarTracksFromAscii.pbarTracksFromAscii.001000_00006374.fcl not found

This error can occur when you are trying to declare a file to SAM. The declaration includes submitting a metadata file to SAM. This file contains all the information you want to store in the SAM database about the file. One of the useful bits is the "parents" of the file: the fcl files and the input files used to create it. If a parent file is "not found", that parent file was not declared in the SAM database. In the normal workflow, you would have declared the fcl files and the input files before you even started the job which produced the file at hand. Here are two recovery procedures:

  • go back to the area and the log file for when you declared the file in the error message. In this case it was the fcl file which drove the creation of the file being declared. It might be possible to see what went wrong there, and fix it by re-declaring the fcl file, for example.
  • if you do not need every file, and you do not see this error too often, you can move or delete the directory containing the result from this job, and restart mu2eFileDeclare.
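
When many declarations fail, it helps to collect the names of the missing parents from the mu2eFileDeclare output first. The sketch below assumes every such error has the form "Parent file NAME not found", as in the example above; the function name is illustrative.

```python
import re

# Matches the server message shown in the example error.
PARENT_NOT_FOUND = re.compile(r"Parent file (\S+) not found")


def missing_parents(declare_output):
    """Return the parent file names reported as not found, in order of appearance."""
    return PARENT_NOT_FOUND.findall(declare_output)
```

The resulting list tells you which fcl or input files to re-declare, or which job directories to set aside.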

SSL negotiation

Error creating dataset definition for ...
500 SSL negotiation failed: .

Your certificate is not of the right form


dCache hangs

A simple access to dCache (accessing filespecs like /pnfs/mu2e) can sometimes hang for a long time. This is difficult to deal with because there are legitimate reasons dCache could respond slowly. First, please read the dCache page for background information.

dCache could be operating normally yet respond slowly because

  • your request was excessive, such as running find or ls -l on a large number (more than a few hundred) of files. If thousands of files are queried, this could take minutes, and much longer for larger numbers of files. Use file tools and plain ls where possible.
  • you, or other users, or even other experiments could be overloading dCache. This is difficult to determine, see operations page for some monitors. dCache has several choke points and not all are easily monitored.
  • the files you are accessing are on tape and you have to wait for them to come off tape. The solution is to prestage the files.

It is difficult to tell if dCache is overloaded, but if it is not, your problem could be caused by any of several failure modes inside dCache, and these failures are relatively common. Here are some guidelines for when to put in a ticket:

  • if a simple ls on a directory which does not contain many files or subdirectories hangs for more than 2 min.
  • if file access in your MC workflow seems normal, then suddenly hangs for more than 1h.
  • if you are accessing random files that were not recently accessed but are known to be prestaged, and access hangs for more than 8h.
  • if prestaging does not progress after 8h.
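
The first guideline can be checked without tying up your terminal. The sketch below runs a plain ls with a time limit; the 120 second default matches the 2 minute guideline, and the function name is illustrative. It does not wait for a hung ls to exit, because a process stuck in a file system call may not respond to a kill promptly.

```python
import subprocess


def ls_responds(path, timeout_s=120):
    """Return True if a plain `ls` of `path` finishes within `timeout_s` seconds.

    A non-zero exit (for example, a missing directory) still counts as a
    response; only a hang is reported as False.
    """
    proc = subprocess.Popen(
        ["ls", path], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # Ask the OS to kill ls, but do not wait for it to exit.
        proc.kill()
        return False
```

Running the same check on a second node helps to separate a dCache problem from a problem with one node's nfs mount.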

Sometimes a hang is occurring only on one node, due to a problem with its nfs server. In this case, you can check several nodes, put the bad node in a ticket, and work on another node.

Generally, dCache has a lot of moving parts and is fragile in some ways. There is no real cost to putting in a ticket and the dCache maintainers are responsive, so when in doubt, put in a ticket. You will always learn something about dCache.

mkdir failed mu2eFileUpload

The first time you invoke mu2eFileUpload for a new dataset, you get a permission denied error in mkdir:

mkdir /pnfs/mu2e/tape/usr-sim/sim/rlc: 
Permission denied at /cvmfs/mu2e.opensciencegrid.org/artexternals/mu2efiletools/v3_7/bin/mu2eFileUpload line 71

This is because you are writing to the dCache tape or persistent areas. By default, no one can write to those areas; permission to write to specific directories is given as needed. Please write to mu2eDataAdmin.

jobsub_submit cigetcert try

During a jobsub_submit command, you see

 cigetcert try 1 of 3 failed
 Command '/cvmfs/fermilab.opensciencegrid.org/products/common/db/../prd/cigetcert/v1_16_1/Linux64bit-2-6-2-12/bin/cigetcert -s fifebatch.fnal.gov -n -o /tmp/x509up_u1311' returned non-zero exit status 1: cigetcert: Kerberos initialization failed: GSSError: (('Unspecified GSS failure.  Minor code may provide more information', 851968), ('Ticket expired', -1765328352))

jobsub could not create a voms proxy from your kerberos ticket. Probably your ticket has just expired. Please kinit and try again. See the authentication reference.

ifdh authentication error

This error seems to be saying that ifdh could not finish the authentication needed to write a file. In this case, it turned out that the user was trying to use ifdh to write to the /mu2e/data disk from a grid job. This is not allowed, and is apparently enforced by blocking authentication. If the output disk is not the problem, put in a ticket.

> error: globus_ftp_control: gss_init_sec_context failedGSS Major Status:
> Unexpected Gatekeeper or Service Nameglobus_gsi_gssapi: Authorization
> denied: The name of the remote entity (/DC=org/DC=opensciencegrid/O=Open
> Science Grid/OU=Services/CN=bestgftp1.fnal.gov), and the expected name
> for the remote entity (/CN=fg-bestman1.fnal.gov) do not matchprogram:
> globus-url-copy -rst-retries 1 -gridftp2 -nodcau -restart -stall-timeout
> 14400 gsi
> ftp://fg-bestman1.fnal.gov:2811/mu2e/data/users/dlin/mu2eSimu/caloSimu18/fcl/job/000/cnf.dlin.caloSimu.v02.000002_00000199.fcl
> file:////storage/local/data1/condor/execute/dir_30333/no_xfer/./cnf.dlin.caloSimu.v02.000002_00000199.fcl
> 2>/storage/local/data1/condor/execute/dir_30333/no_xfer/ifdh_53271_1/errtxtexited
> status 1 delaying 12 ... retrying...

Command terminated by signal 9

When running on the grid, the log file shows that the exe started but ended with:

----->Command terminated by signal 9<-----

Signal 9 is an intentional kill command, so something in the grid system stopped your job on purpose. If the grid system kills your job for going over limits, it should put your job on hold, and in that case you won't see the log file. Occasionally, however, we see a log file with a kill command; we don't know why this gets through. It might be fine if you just resubmit. One case we have seen is a memory request much smaller than needed (which should have gone to hold).
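
To find the affected jobs among many log files, you can search for the termination line. This sketch assumes the line always has the form shown above; the function name is illustrative.

```python
import re

# Matches the line written to the job log when the exe is killed.
SIGNAL_LINE = re.compile(r"Command terminated by signal (\d+)")


def terminating_signal(log_text):
    """Return the signal number reported in the log, or None if there is none."""
    match = SIGNAL_LINE.search(log_text)
    return int(match.group(1)) if match else None
```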

Strange root I/O errors while using xrootd

While using xrootd to read input files, a small fraction fail with strange root I/O errors:

[ERROR] Invalid session

or

[FATAL] Socket timeout

or

Fatal Root Error: @SUB=TUnixSystem::GetHostByName
getaddrinfo failed for 'fndca1.fnal.gov': Temporary failure in name resolution 

or

Fatal Root Error: @SUB=TNetXNGFile::Open
[FATAL] Invalid address

or

   Fatal Root Error: @SUB=
   ! (prop&kIsClass) && "Impossible code path" violated at line 445 of `/scratch/workspace/canvas-products/vdevelop/e17/SLF6/prof/build/root/v6_12_06a/source/root-6.12.06/io/io/src/TGenCollectionProxy.cxx'

The only solution is to resubmit. There seems to be something flaky deep in the art/FermiGrid/xrootd/dCache path that has bugs or load issues.
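
Since the remedy is the same for all of these, a script can flag the logs of jobs that are worth resubmitting. The sketch below simply looks for the message fragments listed above; the list is not exhaustive, and the function name is illustrative.

```python
# Message fragments taken from the examples in this section.
TRANSIENT_PATTERNS = (
    "[ERROR] Invalid session",
    "[FATAL] Socket timeout",
    "Temporary failure in name resolution",
    "[FATAL] Invalid address",
    '"Impossible code path" violated',
)


def transient_errors(log_text):
    """Return the known transient error fragments that appear in the log."""
    return [pattern for pattern in TRANSIENT_PATTERNS if pattern in log_text]
```

An empty result does not mean the job is healthy, only that none of these particular messages was found.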

samweb SSL error: [SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED]

A samweb command gives the ssl error:

> samweb list-definition-files dig.mu2e.NoPrimary-mix-det.MDC2018e.art
SSL error: [SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:618)

This might be caused by an expired certificate:

 > voms-proxy-info
 ...
 timeleft  : 0:00:00

samweb can handle the case of no cert, but it doesn't seem to handle the case of an expired cert. The fix is to delete the cert:

voms-proxy-delete

or get a new one:

 kx509
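
A script that calls samweb can check the remaining lifetime first. This sketch parses text in the format of the voms-proxy-info output shown above and assumes the timeleft line always has the form h:mm:ss; it does not run voms-proxy-info itself, and the function name is illustrative.

```python
import re

# Matches a line such as "timeleft  : 0:00:00".
TIMELEFT = re.compile(r"^\s*timeleft\s*:\s*(\d+):(\d+):(\d+)\s*$", re.MULTILINE)


def proxy_seconds_left(info_text):
    """Return the remaining proxy lifetime in seconds, or None if not reported."""
    match = TIMELEFT.search(info_text)
    if match is None:
        return None
    hours, minutes, seconds = (int(group) for group in match.groups())
    return 3600 * hours + 60 * minutes + seconds
```

A result of 0 corresponds to the expired case described above.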

Mac

When attempting to do graphics (often geometry browsing in root), you see "libGL error: failed to load driver: swrast". The solution is discussed here.