ErrorRecovery: Difference between revisions

===SAM SSL certificate error===
When running sam commands, you see the following:
400 The SSL certificate error
400 Bad Request
The SSL certificate error
This is because you have a voms (kx509) proxy file, but the proxy has expired. Run vomsCert.


Latest revision as of 03:03, 6 November 2024

Introduction

Some errors occur regularly, such as when authorization expires, dCache is struggling, or a procedure is repeated when it can't be repeated. Some common situations are recorded here with advice on how to handle them. Errors that can easily be googled, such as syntax errors, will not appear here.

UPS

If you type "muse setup" or issue a UPS setup command and get the following error message:

You are attempting to run "setup" which requires administrative
privileges, but more information is needed in order to do so.
Password for root:

the solution is discussed at: UPS#A_Common_Error_Message.

logins

Locale failed

When ssh-ing into an interactive node, you see the following error (reported while connecting to SL7 machines):
-bash: warning: setlocale: LC_CTYPE: cannot change locale (UTF-8): No such file or directory
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = (unset),
	LC_ALL = (unset),
	LC_CTYPE = "UTF-8",
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
A user reported that the warning had not been present before, was still not present on mu2egpvm01, and was accompanied by the following run-time error:

---- StdException BEGIN
  ServiceCreation std::exception caught during construction of service type art::TFileService: locale::facet::_S_create_c_locale name not valid
---- StdException END

The solution is to comment out the following line in the file /etc/ssh/ssh_config on your local machine:

SendEnv LANG LC_*

See also stackoverflow
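
Before editing, it can help to confirm that the local ssh client really forwards locale variables. The following is a minimal sketch, not part of the original advice: the helper name find_sendenv is hypothetical, and the default path is the file quoted above.

```shell
# Print active (non-commented) SendEnv lines from an ssh client config file.
# find_sendenv is a hypothetical helper name; the default path is the file
# named in the text above.
find_sendenv() {
    cfg="${1:-/etc/ssh/ssh_config}"
    # -n prints line numbers; lines starting with '#' do not match the pattern
    grep -nE '^[[:space:]]*SendEnv[[:space:]]' "$cfg"
}

# Usage:
#   find_sendenv                 # inspect /etc/ssh/ssh_config
#   find_sendenv ~/.ssh/config   # inspect a per-user config instead
```

If the function prints a line such as "SendEnv LANG LC_*", that is the line to comment out.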

Could not chdir to home directory

When you log into one of the Mu2e interactive machines, your kerberos ticket must be forwarded to the machine that serves the home directories. If this does not work properly, you will see an error message similar to:

Could not chdir to home directory /nashome/k/kutschke: Permission denied.

The issue could either be at your end or it could be a problem with the services provided by the lab. First follow the discussion below to ensure that you are doing everything correctly. If you are, and if the problem remains, then open a service desk ticket; be sure to mention that you have followed these instructions (you can send them the url to this subsection).

To diagnose this situation, follow this checklist:

  1. Check if you can log into some of the other Mu2e interactive machines and see your home disk.
    1. If you can, then there is one weird corner case to check; make sure that your ~/.ssh/config file does not enable forwarding to some of the Mu2e interactive machines but not others.
    2. If your ~/.ssh/config file is good, then the problem is definitely on the lab side; open a service desk ticket and tell them which machines have the problem and which do not.
  2. On your laptop/desktop, check that your kerberos ticket is still valid and is forwardable. Use the command "klist -f"; check the expiration date and whether it has "forwardable" under the "Ticket flags". If you do not have a ticket, or if it is no longer valid or not forwardable, do another kinit and try again. For details see the discussion on the Mu2e wiki page about logging in to the interactive machines.
  3. If you started on your laptop/desktop and logged into machine A and are trying to log into machine B from machine A, make sure that your kerberos ticket is still valid and is forwardable on the intermediate machines. If not, you can kinit on the intermediate machines.
  4. Check that you are correctly forwarding your kerberos ticket when you use the ssh command. See the Mu2e wiki page about logging in to the interactive machines for how to do this either with command line options or with configuration files.
  5. On 4/12/21, the admins added a cron job to restart the NFS permission daemons if they crash, which they do occasionally. So it is reasonable to wait 30 minutes to see whether access recovers.
  6. If none of this works, open a service desk ticket.
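
The forwardable check in step 2 can be scripted. This is a sketch under stated assumptions: it accepts either MIT-style output, where "klist -f" prints a "Flags:" line in which a capital F means forwardable, or output that spells out the word "forwardable". Other klist variants may format flags differently, and the function name is hypothetical.

```shell
# Read `klist -f` output on stdin; succeed if a forwardable ticket is listed.
# Assumed formats: an MIT-style "Flags: F..." line (capital F = forwardable),
# or a line that spells out "forwardable". Other layouts are not recognized.
has_forwardable_ticket() {
    grep -Eq 'Flags:[[:space:]]*[A-Za-z]*F|[Ff]orwardable'
}

# Usage:
#   klist -f | has_forwardable_ticket || echo "no forwardable ticket found; run kinit again"
```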

Can't log in to mu2egpvm

Logins hang when you ssh into mu2egpvm01. Existing processes have trouble accessing kerberized disk (/nashome and /web). This might also hit other nodes.

  • (from Tim Skirvin, 4/2023) At least one case of this being fixed by his "fix gssproxy" script
  • (from Ed Simmons, 4/2023) There is another gssproxy-related bug. There is an automatically-generated 'machine ticket' that is stored in /tmp which can also expire without being renewed by the kernel when it should be. When this happens, restarting nfs-secure, or gss-proxy fixes it. Believed rare.

Remote Host Identification has changed

When you try to log in to one of the Mu2e interactive machines you may occasionally see the following message:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:ZushUU76+0vOnVrChUmJKzHUmvA/cogAyR6p3L0jpcQ.
Please contact your system administrator.
Add correct host key in /Users/kutschke/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /Users/kutschke/.ssh/known_hosts:32

If you see this message, issue the following command:

 ssh-keygen -R <hostname>

This will remove all lines referring to <hostname> from ~/.ssh/known_hosts. The first time that you log in after this you will see a message like:

The authenticity of host 'mu2egpvm03 (131.225.67.36)' can't be established.
ED25519 key fingerprint is SHA256:BedPAEFKfEy8WvBbVLr+IATD0meotoBkdX1OWETJIyI.
This key is not known by any other names.
Are you sure you want to continue connecting (yes/no/[fingerprint])? 

Reply yes and continue.

The backstory is that the machine in question has been issued a new kerberos host principal. This often happens when a machine has a major OS upgrade but it can happen at other times.

You can perform the same operation as ssh-keygen by hand. Edit the file ~/.ssh/known_hosts and remove all lines containing the name of the machine. You may need to remove several lines, for example "mu2egpvm03" and "mu2egpvm03.fnal.gov".
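
The by-hand removal can be sketched as a small filter. Assumptions: the helper name drop_host is hypothetical, the ".fnal.gov" suffix is taken from the example above, and hashed known_hosts entries (those beginning with "|1|") are not recognized by this sketch; ssh-keygen -R handles those.

```shell
# Print a known_hosts file without the entries for one host.
# Matches both the short name and "<name>.fnal.gov" (suffix assumed from the
# example above). Hashed entries ("|1|...") are not handled by this sketch.
drop_host() {
    host="$1"; file="$2"
    awk -v h="$host" '{
        # the first field may hold several comma-separated names or addresses
        n = split($1, names, ",")
        for (i = 1; i <= n; i++)
            if (names[i] == h || names[i] == h ".fnal.gov") next
        print
    }' "$file"
}

# Usage: write to a new file, inspect it, then move it into place.
#   drop_host mu2egpvm03 ~/.ssh/known_hosts > ~/.ssh/known_hosts.new
```

Writing to a separate file first keeps the original intact until you have checked the result.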

git

KeyError on git push

Traceback (most recent call last):
  File "hooks/post-receive", line 73, in <module>
    main()
  File "hooks/post-receive", line 62, in main
    os.environ["GL_USER"] = os.environ["REMOTEUSER"]
  File "/usr/lib64/python2.6/UserDict.py", line 22, in __getitem__
    raise KeyError(key)
KeyError: 'REMOTEUSER'
error: hooks/post-receive exited with error code 1
To ssh://p-mu2eofflinesoftwaremu2eoffline@cdcvs.fnal.gov/cvs/projects/mu2eofflinesoftwaremu2eoffline/Offline.git

This is due to inconsistent ssh settings; please see this page.

Could not resolve hostname

Any git command fails with:

> git pull
ssh: Could not resolve hostname cdcvs.fnal.gov:: Name or service not known
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Check which git you are using

> which git
/usr/bin/git

This result is wrong: /usr/bin/git is an old version that is inconsistent with our settings. You should be using a ups version:

> which git
/cvmfs/mu2e.opensciencegrid.org/artexternals/git/v2_20_1/Linux64bit+3.10-2.17/bin/git

The exact git version may change. You get the ups version when you run mu2einit. This error may also have other causes.
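
The path check above can be turned into a one-line test. This sketch only assumes that the ups-provided git lives under /cvmfs, as in the example; the function name is hypothetical.

```shell
# Succeed if the given git path lives under /cvmfs (the ups build shown above).
git_from_cvmfs() {
    case "$1" in
        /cvmfs/*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage:
#   git_from_cvmfs "$(command -v git)" || echo "system git in use; run mu2einit first"
```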

fatal: not a git repository

When issuing any of several git commands, you see:

fatal: not a git repository (or any parent up to mount point /mu2e)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).

Probably you are not in a git repository directory. All git repo directories (such as our Offline), will have a .git subdirectory.

SSL certificate problem

clone command fails with

fatal: unable to access 'https://github.com/Mu2e/Offline.git/': SSL certificate problem: unable to get local issuer certificate

The likely problem is that the CA certs on the local machine are behind github's. A probable solution, which disables certificate verification for this clone, is:

 git clone -c http.sslverify=false https://github.com/Mu2e/Offline.git

Also, if you are on a central node, the certs will likely be updated on the node, and this might fix it.

root

Missing Dictionaries

If you are on an AL9 machine, using root to read a root file, and see error messages like:

Processing proc.c...
Warning in <TClass::Init>: no dictionary for class mu2e::EventInfo is available
Warning in <TClass::Init>: no dictionary for class mu2e::EventInfoMC is available

you need to define LD_LIBRARY_PATH. See Spack#Using_root_with_Pre_Built_Dictionaries. This is a known issue with Stntuple and TrkAna.

scons

spack

matches multiple packages

A spack load command complains that the spec matches multiple packages:

 build02 spackm > spack load python
 ==> Error: python matches multiple packages.
  Matching packages:
    kz52h4k python@3.9.15%gcc@4.8.5 arch=linux-scientific7-x86_64_v2
    mssmz2i python@3.9.15%gcc@4.8.5 arch=linux-scientific7-x86_64_v2

The solution is to pick one by its hash, in this case, for example:

spack load python/kz52h4k

If you'd like to investigate the versions, you can look at differences in the dependencies of the builds:

spack diff python/kz52h4k python/mssmz2i 
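
To avoid retyping hashes, the candidate specs can be pulled out of the error text. This is a sketch that assumes the layout shown above (hash in the first column, "name@version..." in the second); the function name is hypothetical.

```shell
# Read a "matches multiple packages" error on stdin and print "name/hash" specs.
# Assumes the layout shown above: hash in column 1, "name@version..." in column 2.
spack_candidates() {
    awk '$2 ~ /@/ && $1 ~ /^[a-z0-9]+$/ { split($2, s, "@"); print s[1] "/" $1 }'
}

# Usage (2>&1 in case the error is written to stderr):
#   spack load python 2>&1 | spack_candidates
```

Each printed spec can be passed directly to spack load or spack diff.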


art and fcl

art exit codes

art exit codes

cling::AutoloadingVisitor

At beginning of run time you see a spew of errors like this.

Error in cling::AutoloadingVisitor::InsertIntoAutoloadingState:
   Missing FileEntry for MCDataProducts/inc/ExtMonFNALPatRecTruthAssns.hh
   requested to autoload type mu2e::ExtMonFNALTrkFit

These are caused by the cling system as implemented by root. Eventually this system will pre-compile header files (and maybe dictionaries?) so that compiling source files is very fast. In root 6, this is not fully implemented and cling needs to find the header file for reasons that are not clear to us. It searches the path given by ROOT_INCLUDE_PATH, which should be set adequately by the setups. If, for some reason, the path is not correct or you have removed the header files, the error message is harmless: we know of no consequence except the spew.

Update (1/2020): we have seen this error also be correlated with crashes; in other words, the missing header files might be providing something root needs. We are trying to figure out what root might be getting from the headers and provide it all in the libraries.

An example crash (Lisa, 1/28/20, in a docker container, ceSimReco):

%MSG-s ArtException:  PostEndJob 28-Jan-2020 19:53:16 UTC ModuleEndJob
---- FatalRootError BEGIN
  Fatal Root Error: @SUB=
  ! (prop&kIsClass) && "Impossible code path" violated at line 462 of `/scratch/workspace/canvas-products/v3_09_00-/SLF7/e19-prof/build/root/v6_18_04c/source/root-6.18.04/io/io/src/TGenCollectionProxy.cxx'
---- FatalRootError END
%MSG

Another from Dave (LBL) 12/23/19. This seems to be a real missing dictionary.

%MSG-s ArtException: FileDumperOutput:dumper@Construction 23-Dec-2019 10:00:54 CST ModuleConstruction
cet::exception caught in art
---- FileReadError BEGIN
---- FatalRootError BEGIN
Fatal Root Error: @SUB=TBufferFile::ReadClassBuffer
Could not find the StreamerInfo for version 2 of the class art::ProcessHistory, object skipped at offset 108
---- FatalRootError END

---- FileReadError END
%MSG

cannot access private key

When attempting to read a root or xroot input file spec in art or root, you see the error:

180518 15:04:27 27909 secgsi_InitProxy: cannot access private key file: /nashome/r/rlc/.globus/userkey.pem
%MSG-s ArtException:  Early 18-May-2018 15:04:27 CDT JobSetup
cet::exception caught in art
---- FileOpenError BEGIN
  ---- FatalRootError BEGIN
    Fatal Root Error: @SUB=TNetXNGFile::Open
    [FATAL] Auth failed
  ---- FatalRootError END

The problem is that you need a voms proxy. See authentication. A kx509 cert or proxy identifies you, but does not identify that you are a member of mu2e. The voms proxy adds this information (the VO or Virtual Organization in the voms).

'present' is not actually present

  Principal::getForOutput
    A product with a status of 'present' is not actually present.
    The branch name is mu2e::CaloShowerSteps_CaloShowerStepFromStepPt_calorimeterRO_photon.
    Contact a framework developer.

The fix needs a new version of art (2_11), but the problem can be worked around; see this bug report.

Unsuccessful attempt to convert FHiCL parameter

At the start of the job you see an art error message:

  Unsuccessful attempt to convert FHiCL parameter 'inputFiles' to type 'std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >'.

You could expand your fcl file:

 fhicl-dump myfile.fcl > temp.txt

and look for the fcl parameter named in this message ("inputFiles" here), which will probably be set to null. You can see what stanza it was part of and take it from there.

A common case is that the stopped muon input file was not defined:

 physics.producers.generate.muonStops.inputFiles

needs to be set to a value, which is currently defined in EventGenerator/fcl/prolog.fcl.
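
Searching the expanded dump can be done with grep. A minimal sketch: find_fcl_param is a hypothetical helper, and "temp.txt" is the dump file name used above.

```shell
# List every line of an expanded fcl dump that mentions a parameter name.
# find_fcl_param is a hypothetical helper; the default file is the dump
# written by the fhicl-dump command above.
find_fcl_param() {
    grep -n -- "$1" "${2:-temp.txt}"
}

# Usage:
#   fhicl-dump myfile.fcl > temp.txt
#   find_fcl_param inputFiles
```

The line numbers let you jump to the match in the dump and read upward to find the enclosing stanza.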

unable to find the service

At the beginning of an art executable run you see:

  ---- ServiceNotFound BEGIN
    art::ServicesManager unable to find the service of type 'art::TFileService'.
  ---- ServiceNotFound END

art has a service called TFileService, which opens a root file to hold histograms and ntuples made by user modules. In some fcl files the service is not configured with a default file name; if a module then asks the service for the root file, this error is generated. You can fix it by adding a line to your fcl:

 services.TFileService.fileName: "myfile.root"

If you are modifying this .fcl for grid running, you would write:

services.TFileService.fileName: "nts.owner.desc.version.sequencer.root"

where "desc" is the field you enter in generate_fcl.

If running interactively, you can add the file name to the command line:

mu2e -T myfile.root -c ...

There doesn't need to be any specific fcl text in the services stanza to create the service, art will create it automatically if the root file name is defined.

Found zero products matching all criteria

In MDC there were a number of changes made to data product names. A standard job to read MDC output is TrkDiag/fcl/TrkAnaDigis.fcl. Let's say you wanted to weight the events from an MDC output (say JobConfig/primary/flateminus.fcl) using the DecayInOrbitWeight module. A natural thing to do is change the trigger_path to:

           physics.TrkAnaTriggerPath : [ @sequence::TrkAna.TrkCaloRecoSequence,DIOWeight ]

which weights the generated DIO electron, but this will fail with the "zero product error". You need to add:

           physics.producers.DIOWeight.inputModule: "compressDigiMCs"

How would you know this? Run Print/fcl/print.fcl on the output of flateminus.fcl, perhaps dig.owner.flateminus.version.sequencer.art. Then you will see the data products listed and something like:

                   Friendly Class Name        Module Label    Instance Name  Process Name     Product ID
                    mu2e::GenParticles     compressDigiMCs                     flateminus  2502071937

error writing all requested bytes

Your jobs start failing at a few percent rate and you see this error in the log file.

%MSG-s ArtException:  PostEndJob 22-Sep-2018 02:06:49 UTC ModuleEndJob
cet::exception caught in art
---- OtherArt BEGIN
  ---- FatalRootError BEGIN
    Fatal Root Error: @SUB=TFile::WriteBuffer
    error writing all requested bytes to file ./RootOutput-ed23-499f-b549-ae0f.root, wrote 374 of 1272
  ---- FatalRootError END
  ---- OtherArt BEGIN
    ---- FatalRootError BEGIN
      Fatal Root Error: @SUB=TTree::SetEntries
      Tree branches have different numbers of entries, eg EventAuxiliary has 918 entries while art::TriggerResults_TriggerResults__flateminus. has 919 entries.
    ---- FatalRootError END
  ---- OtherArt END
---- OtherArt END
%MSG

This seems to be new around 9/25/18. It is probably a random Docker-related error; just resubmit the job.


Proditions service not found

---- ServiceNotFound BEGIN
  Unable to create ServiceHandle.
  Perhaps the FHiCL configuration does not specify the necessary service?
  The class of the service is noted below...
  ---- ServiceNotFound BEGIN
    art::ServicesManager unable to find the service of type 'mu2e::ProditionsService'.
  ---- ServiceNotFound END
---- ServiceNotFound END
%MSG

This service provides access to the database for conditions data. It was introduced in Feb 2019. Adding this dependence means that when you run modules that need this service (such as straw reco), you need to configure the service. Either use the default services:

services : @local::Services.SimAndReco

or if you need to use your own explicit services fcl stanza, add these lines:

   DbService : @local::DbEmpty
   ProditionsService: @local::Proditions

DbHandle no TID

An exe gives this error (table name and module name may be different) on starting up

   ---- DBHANDLE_NO_TID BEGIN
     DbHandle could not get TID (Table ID) from DbEngine for TrkPreampStraw at first use The above exception was thrown while processing module StrawHitReco/makeSH run: 1000
   ---- DBHANDLE_NO_TID END

Every executable may or may not load a set of conditions read from the database, depending on its fcl configuration. This error says that a database table was requested by the code, but was not available in the conditions set. So either the conditions set was not loaded or what was loaded was the wrong conditions set for this code. Look for the following lines in the fcl:

 services.DbService.purpose: NOMINAL
 services.DbService.version: v1_0

These define what conditions set to load. See also general database info and [calibration sets]. Since it is dynamic, and hard to document fully, only an expert can fully verify the correct conditions set for any particular data and exe.

getByLabel: Found zero products matching all criteria

On first event, the error occurs

          getByLabel: Found zero products matching all criteria
          Looking for type: std::map<art::Ptr<mu2e::SimParticle>,double>
          Looking for module label: compressDigiMCs
          Looking for productInstanceName: cosmicTimeMap
          cet::exception going through module ComboHitDiag/CHD run: 1002 subRun: 39 event: 10

The details of which product is missing from which module may be different.

The first possibility is that you are requesting a product that doesn't exist in the file. Read about products, or work with an expert on debugging the fcl or the path.

In Summer 2019, there was an additional, temporary cause of this error. It was caused by a change in art behavior with art 3.02 (v7_5_1 to v7_5_4), probably fixed in art 3.04. The changed behavior is triggered by products with similar names. See hypernews and Kyle's talk. Dave Brown has work-arounds which involve making the product references more explicit in fcl.

Assertion an overridden cannot be out-of-date

/scratch/workspace/canvas-products/vdevelop-/SLF7/e17-debug/build/root/v6_16_00/source/root-6.16.00/interpreter/llvm/src/tools/clang/include/clang/Serialization/Module.h:72:
clang::serialization::InputFile::InputFile(const clang::FileEntry*,
bool, bool): Assertion `!(isOverridden && isOutOfDate) && "an overridden
cannot be out-of-date"' failed.

See this discussion

Product Dropped Unexpectedly by RootInput

When you read an art data file with RootInput, you can drop products from the file using the inputCommands parameter. The default behaviour is to drop all descendant products as well. This is reported with a warning message like the following, one for each descendant data product that is dropped:

%MSG-w RootInputFile:  RootOutput:defaultOutput@Construction  06-Sep-2020 19:33:53 CDT ModuleConstruction
Branch 'mu2e::CaloShowerStepROs_compressDigiMCs__cosmics3.' is being dropped from the input
of file 's8.art' because it is dependent on a branch
that was explicitly dropped.

This behaviour is described in FclIOModules#Drop_on_Input; that page describes how to tell art to keep the descendant data products.

Missing rpms for art v3_06_xx

When you migrate to an Offline based on art v3_06_xx from one based on art v3_05_xx, you may need to install additional rpms:

  pcre2 xxhash-libs libzstd libzstd-devel
  ftgl libGLEW gl2ps root-graf-asimage

These must be installed on all machines on which you plan to build or run the code. The rpms on the first line were needed by machines configured as are the Mu2e interactive machines. Bertrand Echenard reports that he had to install some additional rpms on his machine; those are listed on the second line.

One of the symptoms of this is the following error message at build time:

ImportError: libxxhash.so.0: cannot open shared object file: No such file or directory:

If you build on one machine, and run on another machine, and if the run-time machine is not updated, you will get a different error:

---- Configuration BEGIN
  Unable to load requested library /storage/local/data1/condor/execute/dir_14461/no_xfer/Code/Offline/lib/libmu2e_GeometryService_GeometryService_service.so
  libzstd.so.1: cannot open shared object file: No such file or directory
---- Configuration END

If you get the above error in a grid job, you are most likely using an older version of mu2egrid. Be sure to use v6_09_00 or greater, which tells the grid jobs to run using the most recent available singularity container. That container has the above rpms installed. For bare jobsub, here is the incantation. 11/2021 - we are told the append line is no longer required.

--singularity-image '/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:latest'

<top_level>.fileNames

When running a fcl file from Production, such as Production/JobConfig/CeEndPoint.fcl, you quickly see the error

The supplied value of the parameter: <top_level>.fileNames does not represent a sequence

This error means that the fcl is requiring an input file, but you did not provide any. You will have to follow up with experts or documentation.

Grid Workflows

What to do if a grid segment is reported in a HELD state

murat@mu2egpvm06:/mu2e/app/users/murat/su2020_prof>jobsub_q --jobid 51739456.0@jobsub01.fnal.gov
JOBSUBJOBID                           OWNER           SUBMITTED     RUN_TIME   ST PRI SIZE CMD
51739456.0@jobsub01.fnal.gov          murat           01/02 13:41   0+00:37:47 H   0   2.2 mu2eprodsys.sh_20220102_134146_431218_0_1_wrap.sh 

1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended

To figure out the reason, run jobsub_q --long:

murat@mu2egpvm06:/mu2e/app/users/murat/su2020_prof>jobsub_q --long --jobid 51739456.0@jobsub01.fnal.gov | grep -i hold
HoldReason = "Error from slot1_4@fnpc7672.fnal.gov: Docker job has gone over memory limit of 3000 Mb"
HoldReasonCode = 34
HoldReasonSubCode = 0
NumSystemHolds = 0
OnExitHold = false
PeriodicHold = false

Note that the memory appears to be booked per docker container, not per user job: the user job is reported to use only 2.2 GBytes...

Hold codes

1 user put jobs in hold
6 sub 0,2 could not execute glidein, and/or docker did not run
9 not enough memory; remember that you need to include the size of the code tarball/release
12 sub 2 could not execute glidein
13
26 sub 1 memory limits
26 sub 2 exceeded disk limits
26 sub 4 SYSTEM_PERIODIC_HOLD  Starts/limit 31/10 - too many restarts?
26 sub 8 wall time
28 sub -10000,512,768,256 sandbox
30 sub -10000,768 Job put on hold by remote host,  job proxy is not valid
34 sub 0 - memory limits
35 sub 0 Error from slot - probably a condor (Docker?) failure on the worker node

See also condor docs

A subset of holds will be released automatically. If the job has started less than 10 times and was held for reasons 6 or 35 the job will restart automatically.

(NumJobStarts < 10) && (HoldReasonCode=?=6 || HoldReasonCode=?=35) && ((time()-EnteredCurrentStatus) > 1200)
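
The hold-code table above can be wrapped in a small lookup. This is a sketch, not an official tool: the function names are hypothetical, the descriptions are copied from the table, and the table's sub-code lists for codes 6, 28 and 30 are collapsed so that any sub-code matches.

```shell
# Translate a HoldReasonCode (and optional sub-code) into the short description
# from the table above. Codes not in the table fall through to "unknown".
hold_reason() {
    code="$1"; sub="${2:-}"
    case "$code:$sub" in
        1:*)  echo "user put jobs in hold" ;;
        6:*)  echo "could not execute glidein, and/or docker did not run" ;;
        9:*)  echo "not enough memory" ;;
        12:*) echo "could not execute glidein" ;;
        26:1) echo "memory limits" ;;
        26:2) echo "exceeded disk limits" ;;
        26:4) echo "SYSTEM_PERIODIC_HOLD, possibly too many restarts" ;;
        26:8) echo "wall time" ;;
        28:*) echo "sandbox" ;;
        30:*) echo "job put on hold by remote host, job proxy is not valid" ;;
        34:*) echo "memory limits" ;;
        35:*) echo "error from slot, probably a condor (Docker?) failure on the worker node" ;;
        *)    echo "unknown" ;;
    esac
}

# Extract "code subcode" from `jobsub_q --long` output read on stdin.
hold_codes_from_long() {
    awk -F' = ' '$1 == "HoldReasonCode" { c = $2 }
                 $1 == "HoldReasonSubCode" { s = $2 }
                 END { print c, s }'
}

# Usage (replace JOBID with your own job id):
#   set -- $(jobsub_q --long --jobid "$JOBID" | hold_codes_from_long)
#   hold_reason "$1" "$2"
```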

mu2eFileDeclare: Metadata is invalid

 3610  OK:       /pnfs/mu2e/scratchError: got server response 400 Bad Request.
Metadata is invalid:
Parent file cnf.rhbob.pbarTracksFromAscii.pbarTracksFromAscii.001000_00006374.fcl not found

This error can occur when you are trying to declare a file to SAM. The declaration includes submitting a metadata file to SAM. This file contains all the information you want to store in the SAM database about the file. One of the useful bits is the "parents" of the file. These are the fcl files and the input files used to create this file. If a parent file is "not found", that means that the parent file was not declared in the SAM database. In the normal workflow, you would have declared the fcl files and the input file before you even started the job which produced the file at hand. Here are two recovery procedures:

  • go back to the area and the log file for when you declared the file in the error message. In this case it was the fcl file which drove the creation of the file being declared. It might be possible to see what went wrong there, and fix it by re-declaring the fcl file, for example.
  • if you do not need every file, and you do not see this error too often, you can move or delete the directory containing the result from this job, and restart mu2eFileDeclare.

SSL negotiation

Error creating dataset definition for ...
500 SSL negotiation failed: .

Your certificate is not of the right form


dCache hangs

A simple access to dCache (accessing filespecs like /pnfs/mu2e) can sometimes hang for a long time. This is difficult to deal with because there are legitimate reasons dCache could respond slowly. First, please read dCache page for background information.

dCache could be operating normally yet respond slowly because

  • your request was excessive, such as running find or a ls -l on a large number (>few hundred) files. If there are 1000's of files queried, this could take minutes, and much longer for larger numbers of files. Use file tools and plain ls where possible.
  • you, or other users, or even other experiments could be overloading dCache. This is difficult to determine, see operations page for some monitors. dCache has several choke points and not all are easily monitored.
  • the files you are accessing are on tape and you have to wait for them to come off tape. The solution is to prestage files

It is difficult to tell if dCache is overloaded, but if it is not, your problem could be caused by any of several failure modes inside dCache, and these failures are relatively common. Here are some guidelines for when to put in a ticket

  • if a simple ls on a directory which does not contain many files or subdirectories hangs for more than 2 min.
  • if file access in your MC workflow seems normal then suddenly hangs for more than 1 h
  • if prestaging does not progress. See Prestage

Sometimes a hang is occurring only on one node, due to a problem with its nfs server. In this case, you can check several nodes, put the bad node in a ticket, and work on another node.
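
The two-minute guideline for a simple ls can be probed with a time limit. This sketch uses the GNU coreutils timeout command; the function name is hypothetical. Note that a process stuck in uninterruptible I/O on a hung mount may not respond to the signal, in which case the probe itself can stall.

```shell
# Run a plain ls with a time limit; report whether it finished, failed, or hung.
# The 120 s default mirrors the two-minute guideline above.
probe_dir() {
    dir="$1"; limit="${2:-120}"
    rc=0
    timeout "$limit" ls "$dir" > /dev/null 2>&1 || rc=$?
    case "$rc" in
        0)   echo "ok" ;;
        124) echo "hung" ;;    # GNU timeout exits with 124 when the limit is reached
        *)   echo "error" ;;
    esac
}

# Usage: run the same probe on several nodes to see whether only one is affected.
#   probe_dir /pnfs/mu2e
```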

Generally, dCache has a lot of moving parts and is fragile in some ways. There is no real cost to putting in a ticket and the dCache maintainers are responsive, so when in doubt, put in a ticket. You will always learn something about dCache.

dCache Input/Output error

When attempting to access a file in tape-backed dCache, you see an error like

cp: error reading /pnfs/mu2e/tape/phy-etc/bck/mu2e/beams1phase1/g4-10-4/tbz/66/08/bck.mu2e.beams1phase1.g4-10-4.002701_00000004.tbz: Input/output error

This error can have two causes. The most likely is that the file is simply not prestaged to disk. This behavior of reporting an infrastructure error when the file is not available is not user-friendly and it may be updated in the future. Note that attempting to read a non-prestaged file in tape-backed dCache via the NFS directory (/pnfs) will not trigger the file to be prestaged.

The second possibility is that there is an actual problem with the /pnfs mount. This usually requires a ticket to ask for the NFS client to be restarted, or perhaps the node to be rebooted.
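
To tell the two causes apart, you can ask dCache where the file currently resides. The sketch below builds the dCache "dot command" path for a file's locality. It assumes the NFS mount supports the ".(get)(name)(locality)" form, which typically returns values such as ONLINE or NEARLINE; treat both the availability and the exact values as assumptions to verify on your node. The function name is hypothetical.

```shell
# Build the dCache "dot command" path that reports a file's locality.
# Assumption: the /pnfs NFS mount supports ".(get)(<name>)(locality)".
locality_path() {
    printf '%s/.(get)(%s)(locality)\n' "$(dirname "$1")" "$(basename "$1")"
}

# Usage (f holds the full /pnfs path of the file):
#   cat "$(locality_path "$f")"
```

A result that does not include ONLINE would point to the file needing to be prestaged rather than to a broken mount.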

dCache timeout errors

Multiple grid jobs have the following message in log files:

[ERROR] Server responded with an error: [3012] Internal timeout

This can be due to the requested file not being prestaged, so that is the first thing to check. If you can establish that isn't the problem, put in a ticket. As of 4/2023, timeout is 1.5 hours for FTP. 30 minutes for webdav, xroot.

mkdir failed mu2eFileUpload

The first time you invoke mu2eFileUpload for a new dataset, you get a permission denied error in mkdir:

mkdir /pnfs/mu2e/tape/usr-sim/sim/rlc: 
Permission denied at /cvmfs/mu2e.opensciencegrid.org/artexternals/mu2efiletools/v3_7/bin/mu2eFileUpload line 71

This is because you are writing to the dCache tape or persistent areas. By default no one can write to those areas and as needed, permission is given to write to specific directories. Please write to mu2eDataAdmin.

jobsub_submit cigetcert try

During a jobsub_submit command, you see

 cigetcert try 1 of 3 failed
 Command '/cvmfs/fermilab.opensciencegrid.org/products/common/db/../prd/cigetcert/v1_16_1/Linux64bit-2-6-2-12/bin/cigetcert -s fifebatch.fnal.gov -n -o /tmp/x509up_u1311' returned non-zero exit status 1: cigetcert: Kerberos initialization failed: GSSError: (('Unspecified GSS failure.  Minor code may provide more information', 851968), ('Ticket expired', -1765328352))

jobsub could not create a voms proxy from your kerberos ticket. Probably your ticket has just expired. Please kinit and try again. Authentication reference.

ifdh authentication error

This error seems to be saying that ifdh could not finish the authentication to write a file. In this case it turned out that the user was trying to use ifdh to write to the /mu2e/data disk from a grid job. This is not allowed, apparently by blocking authentication. If the output disk is not the problem, put in a ticket.

> error: globus_ftp_control: gss_init_sec_context failedGSS Major Status:
> Unexpected Gatekeeper or Service Nameglobus_gsi_gssapi: Authorization
> denied: The name of the remote entity (/DC=org/DC=opensciencegrid/O=Open
> Science Grid/OU=Services/CN=bestgftp1.fnal.gov), and the expected name
> for the remote entity (/CN=fg-bestman1.fnal.gov) do not matchprogram:
> globus-url-copy -rst-retries 1 -gridftp2 -nodcau -restart -stall-timeout
> 14400 gsi
> ftp://fg-bestman1.fnal.gov:2811/mu2e/data/users/dlin/mu2eSimu/caloSimu18/fcl/job/000/cnf.dlin.caloSimu.v02.000002_00000199.fcl
> file:////storage/local/data1/condor/execute/dir_30333/no_xfer/./cnf.dlin.caloSimu.v02.000002_00000199.fcl
> 2>/storage/local/data1/condor/execute/dir_30333/no_xfer/ifdh_53271_1/errtxtexited
> status 1 delaying 12 ... retrying...

Command terminated by signal 9

Running on the grid, the log file shows the exe started but ended with

----->Command terminated by signal 9<-----

A signal 9 is an intentional kill command, so something in the grid system stopped your job on purpose. If the grid system kills your job for going over limits, it should put your job on hold. In that case you won't see the log file, but occasionally we see a log file with a kill command. We don't know why this gets through. It might be fine if you just resubmit. One case we have seen is a memory request much smaller than needed (which should have gone to hold).

Strange root I/O errors while using xrootd

While using xrootd to read input files, a small fraction fail with strange root I/O errors:

[ERROR] Invalid session

or

[FATAL] Socket timeout

or

Fatal Root Error: @SUB=TUnixSystem::GetHostByName
getaddrinfo failed for 'fndca1.fnal.gov': Temporary failure in name resolution 

or

Fatal Root Error: @SUB=TNetXNGFile::Open
[FATAL] Invalid address

or

   Fatal Root Error: @SUB=
   ! (prop&kIsClass) && "Impossible code path" violated at line 445 of `/scratch/workspace/canvas-products/vdevelop/e17/SLF6/prof/build/root/v6_12_06a/source/root-6.12.06/io/io/src/TGenCollectionProxy.cxx'

The only solution is to resubmit. There seems to be something flaky deep in the art/FermiGrid/xrootd/dCache path that has bugs or load issues.

Here is another failure message:

Fatal Root Error: TNetXNGFile::ReadBuffer
[ERROR] Server responded with an error: [3010] org.dcache.uuid is no longer valid.
ROOT severity: 3000

In 3/2021, dCache experts suggested that setting this might help:

export XRD_STREAMTIMEOUT=300
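Since the only remedy is to resubmit, it can help to find which jobs hit one of these transient messages. The following is a minimal sketch that lists log files containing the messages quoted in this section. The helper name xrootd_failed_logs, the "*.log" file naming, and the directory layout are assumptions about your outstage area. It relies on GNU grep for the -r and --include options.

```shell
# List log files that contain one of the transient xrootd/dCache messages
# quoted in this section; these jobs are candidates for resubmission.
# xrootd_failed_logs is a hypothetical helper; "*.log" naming is an assumption.
# $1 is the directory to search (default: current directory).
xrootd_failed_logs() {
    logdir=${1:-.}
    grep -r -l -F --include='*.log' \
        -e '[ERROR] Invalid session' \
        -e '[FATAL] Socket timeout' \
        -e '[FATAL] Invalid address' \
        -e 'Temporary failure in name resolution' \
        -e 'org.dcache.uuid is no longer valid' \
        "$logdir"
}
```

The function prints one matching file per line and returns non-zero when no log file matches.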

auth and handshake errors while using xrootd

Many jobs end with errors like:

       Fatal Root Error: TNetXNGFile::Open
       [FATAL] Hand shake failed
       ROOT severity: 3000

or:

     ---- FatalRootError BEGIN
       Fatal Root Error: TNetXNGFile::Open
       [FATAL] Auth failed: No protocols left to try
       ROOT severity: 3000
     ---- FatalRootError END
     Unable to open specified secondary event stream file root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/mu2e/tape/phy-

The current thinking is that this is a problem with authentication on the grid node. So far it has not been persistent, but if it is, put in a ticket.

samweb SSL error: [SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED]

A samweb command gives the ssl error:

> samweb list-definition-files dig.mu2e.NoPrimary-mix-det.MDC2018e.art
SSL error: [SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:618)

This might be caused by an expired certificate:

 > voms-proxy-info
 ...
 timeleft  : 0:00:00

samweb can handle the case of no cert, but it doesn't seem to handle the case of an expired cert. The fix is to delete the cert:

voms-proxy-destroy

or get a new one:

 kx509
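The check and the fix can be combined in a small script. The following is a minimal sketch, assuming that "voms-proxy-info -timeleft" prints the remaining proxy lifetime in seconds and prints nothing on standard output when there is no proxy. The helper name proxy_state is hypothetical.

```shell
# Classify a proxy from the number of seconds left.
# Intended input: the output of "voms-proxy-info -timeleft" (seconds left);
# empty or non-numeric input is treated as "no proxy".
# proxy_state is a hypothetical helper name.
proxy_state() {
    secs=$1
    case $secs in
        ''|*[!0-9]*) echo "none" ;;      # no proxy, or unparsable output
        *) if [ "$secs" -gt 0 ]; then echo "valid"; else echo "expired"; fi ;;
    esac
}

# Example use (destroys the proxy only when it is expired):
#   state=$(proxy_state "$(voms-proxy-info -timeleft 2>/dev/null)")
#   [ "$state" = expired ] && voms-proxy-destroy
```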

mu2eprodsys Error - python future

If you setup both Offline and mu2egrid in the same window we recommend that you setup Offline first and mu2egrid second:

mu2einit
source <path to an Offline installation>/setup.sh
setup mu2egrid
mu2eprodsys <other arguments>

If you swap the order of lines 2 and 3, then you will get the following error message from mu2eprodsys:

Traceback (most recent call last):
  File "/cvmfs/fermilab.opensciencegrid.org/products/common/db/../prd/jobsub_client/v1_3_2_1/NULL/jobsub_submit", line 20, in <module>
    from future import standard_library
  File "/cvmfs/fermilab.opensciencegrid.org/products/common/prd/python_future_six_request/v1_3/Linux64bit-3-10-2-17-python2-7-ucs4/future/standard_library/__init__.py", line 64, in <module>
    import logging
  File "/cvmfs/mu2e.opensciencegrid.org/artexternals/python/v3_8_3b/Linux64bit+3.10-2.17/lib/python3.8/logging/__init__.py", line 26, in <module>
    import sys, os, time, io, re, traceback, warnings, weakref, collections.abc
  File "/cvmfs/mu2e.opensciencegrid.org/artexternals/python/v3_8_3b/Linux64bit+3.10-2.17/lib/python3.8/re.py", line 127, in <module>
    import functools
  File "/cvmfs/mu2e.opensciencegrid.org/artexternals/python/v3_8_3b/Linux64bit+3.10-2.17/lib/python3.8/functools.py", line 18, in <module>
    from collections import namedtuple
  File "/cvmfs/mu2e.opensciencegrid.org/artexternals/python/v3_8_3b/Linux64bit+3.10-2.17/lib/python3.8/collections/__init__.py", line 27, in <module>
    from reprlib import recursive_repr as _recursive_repr
  File "/cvmfs/fermilab.opensciencegrid.org/products/common/prd/python_future_six_request/v1_3/Linux64bit-3-10-2-17-python2-7-ucs4/reprlib/__init__.py", line 7, in <module>
    raise ImportError('This package should not be accessible on Python 3. '
ImportError: This package should not be accessible on Python 3. Either you are trying to run from the python-future src folder or your installation of python-future is corrupted.

To fix this error continue with the following commands:

setup jobsub_client
mu2eprodsys <other arguments>

and mu2eprodsys will work correctly.


Or you can always setup mu2egrid in a window in which Offline has not been, and will not be, setup:

mu2einit
setup mu2egrid
mu2eprodsys <other arguments>

For those who are interested, the explanation is below.

The ups package mu2egrid implicitly sets up the UPS-current version of jobsub_client. Both Offline (via root) and jobsub_client require that python be in the environment. jobsub_client is designed to work with many different versions of python and will use whatever python it finds in the environment. Starting with a recent version of jobsub_client, it sets up a helper UPS product named python_future_six_request and, at setup-time, chooses different UPS qualifiers for python_future_six_request depending on which python it finds in its environment.

When you start a new session, there is no UPS based python in your environment so the command "python" finds the system python, which, as of Sept 4, 2020, is 2.7.5. When you setup Offline, it will setup a specific version of root and that version of root will setup a specific version of python, which, as of Sept 4, 2020, is 3.8.3.

If you setup the products in the recommended order then setting up Offline will put python 3.8.3 in the environment; when you setup mu2egrid, it will discover that version of python and will setup python_future_six_request -qpython3.8 . At this point your environment is self consistent and mu2eprodsys will work.

If you setup mu2egrid first, it will discover python 2.7.5 in the environment and will setup python_future_six_request -qpython2.7-ucs4 . When you setup Offline it will change python to 3.8.3 but it will not touch python_future_six_request because it does not know about that product. The result is that the environment is not self consistent and mu2eprodsys will fail with the error above. If you subsequently "setup jobsub_client" it will detect the new python, remove the old python_future_six_request from the environment and setup the new one. At this point the environment is again self consistent and mu2eprodsys will work. Because Offline does not use python_future_six_request, Offline will continue to work correctly.

The core problem is that jobsub_client is attempting to be smart enough to "just work" with both the system python and with any python from UPS. UPS does not have the features needed to manage all of the corner cases of this use pattern. I have asked the developers of jobsub_client to check, at run-time, for a self-consistent environment and to issue a user friendly error message if the environment is not self-consistent.
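Until such a check exists in jobsub_client, a user-side check along the same lines is possible. The following is a minimal sketch. It assumes that the qualifier of python_future_six_request contains "python" followed by the major.minor version, as in the two qualifiers quoted above, and that UPS exports the setup string in an environment variable named SETUP_PYTHON_FUTURE_SIX_REQUEST. Both are assumptions, and the helper name future_env_consistent is hypothetical.

```shell
# Check that the python in the environment matches the python qualifier of
# python_future_six_request.
#   $1  python version as major.minor, e.g. "3.8"
#   $2  the UPS setup string of python_future_six_request
# Assumption: the qualifier contains "python<major.minor>", as in
# python3.8 and python2.7-ucs4.
# future_env_consistent is a hypothetical helper name.
future_env_consistent() {
    case $2 in
        *"python$1"*) echo "consistent" ;;
        *) echo "inconsistent: run 'setup jobsub_client' again"; return 1 ;;
    esac
}

# Example use, assuming UPS exports the setup string in
# SETUP_PYTHON_FUTURE_SIX_REQUEST (an assumption about the UPS convention):
#   pyver=$(python -c 'import sys; print("%d.%d" % sys.version_info[:2])')
#   future_env_consistent "$pyver" "$SETUP_PYTHON_FUTURE_SIX_REQUEST"
```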


Code NOT_FOUND in job output

The symptom of this problem is that you get the following error message in log files in your outstage directory:

mu2eprodsys Thu Oct 1 15:17:56 UTC 2020 -- 1601565476 pointing MU2EGRID_USERSETUP to the tarball extract: NOT_FOUND/Code/setup.sh
./mu2eprodsys.sh: line 281: NOT_FOUND/Code/setup.sh: No such file or directory
Error sourcing setup script NOT_FOUND/Code/setup.sh: status code 1

This means that you submitted a grid job with the tarball option but the tarball could not be unpacked by the Rapid Code Distribution Service. For reasons that I don't understand, the grid system might not detect this error and your grid jobs will start to run. When they try to access the code distributed by the tarball they will fail with the message above.

In almost all cases, this error is caused by one of two things:

  1. The tar file is too big. In the one example to date, the tarball was 5 GB instead of the typical ~800 MB. We don't know where the threshold is.
  2. The tar file contains one or more files larger than 1 GB

In all cases to date this was caused by the user putting inappropriate files into the tarball: art event-data files, root files, graphics files or log files. These files should be on /mu2e/data, not on /mu2e/app. When you are looking for large files, remember to check for hidden files, files whose name begins with a dot; some tools make temporary files named .nfsxxx where xxx is a long hex number. You can see hidden files with "ls -a".
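Before building the tarball, the working area can be scanned for oversized files. The following is a minimal sketch using GNU find, which includes hidden files by default. The helper name large_files is hypothetical, and the default threshold of 1 GiB follows the second cause listed above.

```shell
# List files under a directory that exceed a size limit, hidden files included.
#   $1  directory to scan (default: current directory)
#   $2  size threshold in find(1) syntax (default +1G, larger than 1 GiB)
# large_files is a hypothetical helper name.
large_files() {
    find "${1:-.}" -type f -size "${2:-+1G}" -print
}
```

The total size of the area can be checked separately with "du -sh".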

If you are within the guidelines above and the problem still occurs, please report the error using a service desk ticket.

There is additional information in an old ticket: INC000001105610.

Can not get BFID

While running mu2eDatasetLocation to add tape locations to the SAM record of files which you uploaded, you see this error:

Can not get BFID from pnfs layer 1  /pnfs/mu2e/tape/usr-sim/dig/oksuzian/CRY-cosmic-general/cry3-digi-hi/art/71/45/.(use)(1)(dig.oksuzian.CRY-cosmic-general.cry3-digi-hi.001002_00009067.art):  on Mon Mar 22 12:37:30 2021

We believe this occurs when the script is run before some items in the dCache database have been inserted. It usually runs correctly after waiting an hour or two.
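Since the remedy is to wait and run the script again, the retry can be automated. The following is a minimal, generic sketch. The helper name retry_later is hypothetical, and the arguments to mu2eDatasetLocation are left as a placeholder because they depend on your dataset.

```shell
# Run a command, retrying after a delay if it fails.
#   $1  maximum number of attempts
#   $2  seconds to wait between attempts
#   remaining arguments: the command to run
# retry_later is a hypothetical helper name.
retry_later() {
    max=$1; delay=$2; shift 2
    n=1
    until "$@"; do
        if [ "$n" -ge "$max" ]; then
            echo "giving up after $n attempts" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example use (three attempts, one hour apart):
#   retry_later 3 3600 mu2eDatasetLocation <your arguments>
```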

jobsub dropbox_upload failed

On jobsub job submission,

Submitting....
dropbox_upload failed

This error is caused by the underlying mechanisms for transferring your local files to the grid being broken. It is only in jobsub_client version v1_3_3. A solution is to setup the previous version:

setup jobsub_client v1_3_2_1

Do this after setup mu2egrid.

Error: OperationalError: no such table: keycache error

This error happens when you submit a grid job (typically using mu2eprodsys or jobsub_submit). It started to occur in mid-November 2023. The underlying issue is a problem with the tokens system used for grid authentication. The following is the suggested workaround:

ls -l ${XDG_CACHE_HOME:-$HOME/.cache}/scitokens/scitokens_keycache.sqllite

If this file exists and is zero length, then remove it and try to submit again.
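The check and the removal can be done in one step. The following is a minimal sketch that removes the file only when it exists and has zero length. The helper name clear_empty_keycache is hypothetical; the default path is the one quoted above.

```shell
# Remove the scitokens key cache only if it exists and is empty.
#   $1  optional path to the cache file (default: the path quoted above)
# clear_empty_keycache is a hypothetical helper name.
clear_empty_keycache() {
    f=${1:-${XDG_CACHE_HOME:-$HOME/.cache}/scitokens/scitokens_keycache.sqllite}
    if [ -f "$f" ] && [ ! -s "$f" ]; then
        rm -f -- "$f" && echo "removed empty $f"
    else
        echo "nothing to do for $f"
    fi
}
```

A non-empty cache file is left untouched, so the function is safe to run before every submission.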

Disk quota exceeded during jobsub_submit or mu2eprodsys

If you get the following error during the execution of a jobsub_submit command

Unable to get key triggered by next update: disk I/O error
 
Error: OSError: [Errno 122] Disk quota exceeded

the issue is that your home directory has no quota remaining. mu2eprodsys is just a wrapper around jobsub_submit so it will issue the same error in the same circumstances.

The issue is that jobsub_submit writes small files in your home directory under $HOME/.cache and $HOME/.config. If there is no available disk quota, jobsub_submit will fail.

The solution is to remove some files from your home directory; either delete them or move them to a more appropriate location.
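To decide what to remove, it helps to see which entries in the home directory take the most space. The following is a minimal sketch using GNU du and sort. The helper name biggest_dirs is hypothetical.

```shell
# Show the largest entries directly under a directory, sorted by size.
#   $1  directory to inspect (default: $HOME)
#   $2  number of entries to show (default: 10)
# Hidden files and directories are included; one of the lines shown is the
# total for the directory itself.
# Uses GNU "du --max-depth" and "sort -h"; biggest_dirs is a hypothetical name.
biggest_dirs() {
    du -ah --max-depth=1 "${1:-$HOME}" 2>/dev/null | sort -h | tail -n "${2:-10}"
}
```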

'Available' is not in list

2/2024 During jobsub submission

Submitting....
Error: ValueError: 'Available' is not in list

This is caused by a bug in a disk space check. The temporary solution is to skip the check with

--skip-check=disk_space
or for mu2eprodsys
--jobsub-arg='--skip-check=disk_space'

Hold test evaluated true (hold code 26/0)

Your jobs go to hold, but instead of telling you why (memory, time) it simply says that this whole long expression evaluated true. If you go to the "why are my jobs held" monitoring page, it will say the cause is "code 26 subcode 0" which is "other". The experts say this is usually a hold due to running over lifetime, but here is a command which might be able to pinpoint which part of the test is failing:

> condor_config_val -schedd -name jobsub05.fnal.gov SYSTEM_PERIODIC_HOLD | awk '{gsub("JobStatus","LastJobStatus"); gsub("time\\(\\)","EnteredCurrentStatus"); split($0,s," *\\|\\| *"); for(x in s) printf("%s\0",s[x])}' | xargs -0 condor_q -name jobsub05.fnal.gov 11758936.0 -af:ln
( LastJobStatus == 2 && JobUniverse == 5 && MemoryUsage > 1.0 * RequestMemory ) = undefined
( LastJobStatus == 2 && JobUniverse == 5 && DiskUsage > 1.0 * RequestDisk ) = false
( LastJobStatus == 2 && JobUniverse == 5 && NumJobStarts > 10 ) = false
( JobUniverse == 5 && NumShadowStarts > 10 ) = false
( LastJobStatus == 2 && JobUniverse == 5 && EnteredCurrentStatus - JobCurrentStartDate > JOB_EXPECTED_MAX_LIFETIME && JOB_EXPECTED_MAX_LIFETIME > 0 ) = true
( JobUniverse == 5 && JOB_EXPECTED_MAX_LIFETIME > 345600 ) = false
( LastJobStatus == 2 && JobUniverse == 5 && ((EnteredCurrentStatus - JobCurrentStartDate) > 345600 )) = false
( LastJobStatus == 2 && JobUniverse == 7 && ((EnteredCurrentStatus - EnteredCurrentStatus) > 2592000 )) = false

> /usr/bin/condor_q -name jobsub05.fnal.gov 11758936.0 -af:ln MemoryUsage RequestMemory
MemoryUsage = undefined
RequestMemory = 2499
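The first command prints one line per clause of the hold expression, so the clause responsible for the hold is the one ending in "= true". The following is a minimal sketch of a filter that keeps only those lines. The helper name true_clauses is hypothetical, and it is meant to be appended to the end of the first pipeline shown above.

```shell
# Keep only the hold-expression clauses that evaluated to true.
# Reads the per-clause output of the condor_q pipeline on standard input.
# true_clauses is a hypothetical helper name.
true_clauses() {
    grep -E '= true[[:space:]]*$'
}
```

In the example output above, this would leave only the clause that compares the run time against JOB_EXPECTED_MAX_LIFETIME.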

SAM SSL certificate error

When running sam commands, you see the following:

400 The SSL certificate error
400 Bad Request
The SSL certificate error

This happens when you have a VOMS (kx509) proxy file but the proxy has expired. The fix is to run vomsCert.

Mac

When attempting to do graphics (often geometry browsing in root), you see "libGL error: failed to load driver: swrast". The solution is discussed here