ErrorRecovery
Revision as of 16:44, 19 September 2018
Introduction
Some errors occur regularly, such as when authorization expires, dCache is struggling, or a procedure that cannot be repeated is run twice. Some common situations are recorded here with advice on how to handle them. Errors that can easily be googled, such as syntax errors, will not appear here.
scons
art and fcl
art exit codes
cling::AutoloadingVisitor
At the beginning of run time you see a spew of errors like this:
Error in cling::AutoloadingVisitor::InsertIntoAutoloadingState: Missing FileEntry for MCDataProducts/inc/ExtMonFNALPatRecTruthAssns.hh requested to autoload type mu2e::ExtMonFNALTrkFit
These are caused by the cling system as implemented by ROOT. Eventually this system will pre-compile header files (and maybe dictionaries?) so that compiling source files is very fast. In ROOT 6 this is not fully implemented, and cling needs to find the header file for reasons that are not clear to us. It searches the path given by ROOT_INCLUDE_PATH, which should be set adequately by the setups. The error message is harmless; we know of no consequence except the spew.
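If you want to see where cling is searching, you can print the path one entry per line. A minimal sketch; the sample value stands in for the real ROOT_INCLUDE_PATH set by the setup scripts:

```shell
# Print a colon-separated header search path one entry per line.
# In a real session use "$ROOT_INCLUDE_PATH"; the sample value below
# just makes the snippet self-contained.
sample="/cvmfs/mu2e/include:/usr/local/include"
entries=$(echo "$sample" | tr ':' '\n')
echo "$entries"
```

If the header named in the spew is not under any of these directories, the setup that should provide it is probably missing or stale.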
cannot access private key
Attempting to read a root or xroot input file spec in art or root, you see the error:
180518 15:04:27 27909 secgsi_InitProxy: cannot access private key file: /nashome/r/rlc/.globus/userkey.pem
%MSG-s ArtException: Early 18-May-2018 15:04:27 CDT JobSetup
cet::exception caught in art
---- FileOpenError BEGIN
  ---- FatalRootError BEGIN
    Fatal Root Error: @SUB=TNetXNGFile::Open
    [FATAL] Auth failed
  ---- FatalRootError END
The problem is that you need a voms proxy; see authentication. A kx509 cert or proxy identifies you, but does not establish that you are a member of mu2e. The voms proxy adds this information (the VO, or Virtual Organization, in voms).
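A quick way to check whether you already have a usable voms proxy, and what to run if not. The voms-proxy commands are the standard VOMS client tools; the exact VO string shown in the comment is an assumption, so check the authentication page for the current one:

```shell
# Check for a valid voms proxy; the command guard keeps this runnable
# on nodes without the VOMS client tools installed.
if command -v voms-proxy-info >/dev/null 2>&1; then
    if voms-proxy-info -exists -valid 1:00 2>/dev/null; then
        msg="proxy OK"
    else
        # Typical recovery (VO string is an assumption; see the authentication page):
        #   kinit && kx509 && voms-proxy-init -noregen -rfc -voms fermilab:/fermilab/mu2e/Role=Analysis
        msg="no valid proxy; run kinit, kx509, then voms-proxy-init"
    fi
else
    msg="VOMS client tools not available on this node"
fi
echo "$msg"
```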
'present' is not actually present
Principal::getForOutput A product with a status of 'present' is not actually present. The branch name is mu2e::CaloShowerSteps_CaloShowerStepFromStepPt_calorimeterRO_photon. Contact a framework developer.
This needs a new version of art (2_11), but it can be worked around; see this bug report.
Unsuccessful attempt to convert FHiCL parameter
At the start of the job you see an art error message:
Unsuccessful attempt to convert FHiCL parameter 'inputFiles' to type 'std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >'.
You could expand your fcl file:
fhicl-dump myfile.fcl > temp.txt
and look for the fcl parameter, "inputFiles", in this message, where it will probably be set to null. You can see which stanza it was part of and take it from there.
A common case is that the stopped muon input file was not defined:
physics.producers.generate.muonStops.inputFiles
needs to be set to a value, currently defined in EventGenerator/fcl/prolog.fcl
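The lookup step can be sketched like this; the temp.txt contents are a made-up sample standing in for real fhicl-dump output:

```shell
# In a real session: fhicl-dump myfile.fcl > temp.txt
# Here we fake one line of the expanded output so the snippet is self-contained.
cat > temp.txt <<'EOF'
physics.producers.generate.muonStops.inputFiles: []
EOF
# Search for the parameter named in the error, with line numbers.
match=$(grep -n "inputFiles" temp.txt)
echo "$match"
rm -f temp.txt
```

An empty value like `[]` in the real dump is the signature of this error: the parameter exists but was never assigned.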
unable to find the service
At the beginning of an art executable run you see:
---- ServiceNotFound BEGIN art::ServicesManager unable to find the service of type 'art::TFileService'. ---- ServiceNotFound END
art has a service called TFileService which opens a root file to hold histograms and ntuples made by user modules. In some fcl, the service is not configured with a default file name. If a module asks the service for the root file, this error is generated. You can fix it by adding a line to your fcl:
services.TFileService.fileName: "myfile.root"
or by adding the file name to the command line:
mu2e -T myfile.root -c ...
Grid Workflows
Hold codes
Hold code, subcode, and likely meaning:
- 6 sub 0,2: could not execute glidein, and/or docker did not run
- 12 sub 2: could not execute glidein
- 13, 26 sub 8: wall time (also in the Docker era)
- 26 sub 1: memory limits
- 26 sub 4: SYSTEM_PERIODIC_HOLD Starts/limit 31/10, too many restarts?
- 28 sub -10000,512,768,256: sandbox
- 30 sub -10000,768: job put on hold by remote host, job proxy is not valid
- 34 sub 0: memory limits (also in the Docker era)
- 35 sub 0: error from slot, probably a condor (Docker?) failure on the worker node
- ???: disk limits (probably 26 sub ?)
See also condor docs
mu2eFileDeclare: Metadata is invalid
3610 OK: /pnfs/mu2e/scratch
Error: got server response 400 Bad Request. Metadata is invalid: Parent file cnf.rhbob.pbarTracksFromAscii.pbarTracksFromAscii.001000_00006374.fcl not found
This error can occur when you are trying to declare a file to SAM. The declaration includes submitting a metadata file to SAM. This file contains all the information you want to store in the SAM database about the file. One of the useful bits is the "parents" of the file: the fcl files and input files used to create it. If a parent file is "not found", that means the parent file was never declared in the SAM database. In the normal workflow, you would have declared the fcl files and the input files before you even started the job which produced the file at hand. Here are two recovery procedures:
- go back to the area and the log file for when you declared the file named in the error message. In this case it was the fcl file which drove the creation of the file being declared. It might be possible to see what went wrong there and fix it, for example by re-declaring the fcl file.
- if you do not need every file, and you do not see this error too often, you can move or delete the directory containing the result of this job, and restart mu2eFileDeclare.
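Before re-declaring anything, you can check whether the parent is actually missing from SAM. A hedged sketch using samweb, the standard Fermilab SAM command-line client (assumed to be set up in your environment):

```shell
# Ask SAM for the parent file's metadata; a failure means it was never declared.
parent="cnf.rhbob.pbarTracksFromAscii.pbarTracksFromAscii.001000_00006374.fcl"
if command -v samweb >/dev/null 2>&1; then
    status=$(samweb get-metadata "$parent" 2>/dev/null \
             || echo "parent not declared: $parent")
else
    status="samweb not available in this environment"
fi
echo "$status"
```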
SSL negotiation
Error creating dataset definition for ... 500 SSL negotiation failed: .
Your certificate is not of the right form.
dCache hangs
A simple access to dCache (accessing filespecs like /pnfs/mu2e) can sometimes hang for a long time. This is difficult to deal with because there are legitimate reasons dCache could respond slowly. First, please read the dCache page for background information.
dCache could be operating normally yet respond slowly because
- your request was excessive, such as running find or ls -l on a large number (more than a few hundred) of files. If thousands of files are queried, this could take minutes, and much longer for larger numbers of files. Use file tools and plain ls where possible.
- you, or other users, or even other experiments could be overloading dCache. This is difficult to determine; see the operations page for some monitors. dCache has several choke points and not all are easily monitored.
- the files you are accessing are on tape and you have to wait for them to come off tape. The solution is to prestage the files.
It is difficult to tell if dCache is overloaded; if it is not, your problem could be caused by any of several failure modes inside dCache, and these failures are relatively common. Here are some guidelines for when to put in a ticket:
- if a simple ls on a directory which does not contain many files or subdirectories hangs for more than 2 minutes
- if file access in your MC workflow seems normal, then suddenly hangs for more than 1 hour
- if access to random files that were not recently accessed, but are known to be prestaged, hangs for more than 8 hours
- if prestaging does not progress after 8 hours
Sometimes a hang is occurring only on one node, due to a problem with its nfs server. In this case, you can check several nodes, put the bad node in a ticket, and work on another node.
Generally, dCache has a lot of moving parts and is fragile in some ways. There is no real cost to putting a ticket and the dCache maintainers are responsive, so when in doubt, put in a ticket. You will always learn something about dCache.
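Before filing a ticket, a bounded probe can distinguish a hang from a fast error, and lets you compare several nodes. The path and timeout here are examples:

```shell
# Probe a dCache-mounted directory, bounding a possible hang with coreutils
# timeout so the check always returns within 120 seconds.
dir="/pnfs/mu2e"
if timeout 120 ls "$dir" >/dev/null 2>&1; then
    status="listing responded: $dir"
else
    status="listing failed or exceeded 120s: $dir"
fi
echo "$status"
```

Running the same probe from a second node tells you whether the problem is one node's nfs mount or dCache itself.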
mkdir failed mu2eFileUpload
The first time you invoke mu2eFileUpload for a new dataset, you get a permission denied error in mkdir:
mkdir /pnfs/mu2e/tape/usr-sim/sim/rlc: Permission denied at /cvmfs/mu2e.opensciencegrid.org/artexternals/mu2efiletools/v3_7/bin/mu2eFileUpload line 71
This is because you are writing to the dCache tape or persistent areas. By default no one can write to those areas; as needed, permission is given to write to specific directories. Please write to mu2eDataAdmin.
jobsub_submit cigetcert try
During a jobsub_submit command, you see:
cigetcert try 1 of 3 failed Command '/cvmfs/fermilab.opensciencegrid.org/products/common/db/../prd/cigetcert/v1_16_1/Linux64bit-2-6-2-12/bin/cigetcert -s fifebatch.fnal.gov -n -o /tmp/x509up_u1311' returned non-zero exit status 1: cigetcert: Kerberos initialization failed: GSSError: (('Unspecified GSS failure. Minor code may provide more information', 851968), ('Ticket expired', -1765328352))
jobsub could not create a voms proxy from your kerberos ticket. Probably your ticket has just expired; kinit and try again. See the authentication reference.
ifdh authentication error
This error seems to be saying that ifdh could not complete the authentication needed to write a file. In this case it turned out that the user was trying to use ifdh to write to the /mu2e/data disk from a grid job. This is not allowed, apparently enforced by blocking authentication. If the output disk is not the problem, put in a ticket.
error: globus_ftp_control: gss_init_sec_context failed
GSS Major Status: Unexpected Gatekeeper or Service Name
globus_gsi_gssapi: Authorization denied: The name of the remote entity (/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=bestgftp1.fnal.gov), and the expected name for the remote entity (/CN=fg-bestman1.fnal.gov) do not match
program: globus-url-copy -rst-retries 1 -gridftp2 -nodcau -restart -stall-timeout 14400 gsiftp://fg-bestman1.fnal.gov:2811/mu2e/data/users/dlin/mu2eSimu/caloSimu18/fcl/job/000/cnf.dlin.caloSimu.v02.000002_00000199.fcl file:////storage/local/data1/condor/execute/dir_30333/no_xfer/./cnf.dlin.caloSimu.v02.000002_00000199.fcl 2>/storage/local/data1/condor/execute/dir_30333/no_xfer/ifdh_53271_1/errtxt
exited status 1 delaying 12 ... retrying...
Mac
Attempting to do graphics (often geometry browsing in root) you see "libGL error: failed to load driver: swrast". The solution is discussed here.