ErrorRecovery: Difference between revisions
(54 intermediate revisions by 4 users not shown) | |||
Line 3: | Line 3: | ||
Some errors occur regularly, such as when authorization expires, dCache is struggling, or a procedure is repeated when it can't be repeated. Some common situations are recorded here with advice on how to handle them. Errors that can easily be googled, such as syntax errors, will not appear here. | Some errors occur regularly, such as when authorization expires, dCache is struggling, or a procedure is repeated when it can't be repeated. Some common situations are recorded here with advice on how to handle them. Errors that can easily be googled, such as syntax errors, will not appear here. | ||
==UPS== | |||
If you type "muse setup" or issue a UPS setup command and get the following error message: | |||
<pre> | |||
You are attempting to run "setup" which requires administrative | |||
privileges, but more information is needed in order to do so. | |||
Password for root: | |||
</pre> | |||
the solution is discussed at: [[UPS#A_Common_Error_Message]]. | |||
==logins== | ==logins== | ||
===Locale failed=== | |||
When ssh-ing into an interactive node, you see the following error | When ssh-ing into an interactive node, you see the following error | ||
<pre> | <pre> | ||
Line 31: | Line 44: | ||
</pre> | </pre> | ||
See also [https://stackoverflow.com/questions/2499794/how-to-fix-a-locale-setting-warning-from-perl stackoverflow] | See also [https://stackoverflow.com/questions/2499794/how-to-fix-a-locale-setting-warning-from-perl stackoverflow] | ||
===Could not chdir to home directory=== | |||
When you log into one of the [[ComputingLogin#Machines | Mu2e interactive machines]], your kerberos ticket must be forwarded to the machine that serves the home directories. If this does not work properly, you will see an error message similar to: | |||
<pre> | |||
Could not chdir to home directory /nashome/k/kutschke: Permission denied. | |||
</pre> | |||
The issue could either be at your end or it could be a problem with the services provided by the lab. First follow the discussion below to ensure that you are doing everything correctly. If you are, and if the problem remains, then open a service desk ticket; be sure to mention that you have followed these instructions (you can send them the [[ErrorRecovery#Could_not_chdir_to_home_directory |url to this subsection]]). | |||
To diagnose this situation, follow this checklist: | |||
# Check if you can log into some of the other [[ComputingLogin#Machines | Mu2e interactive machines]] and see your home disk. | |||
## If you can, then there is one weird corner case to check; make sure that your ~/.ssh/config file does not enable forwarding to some of the Mu2e interactive machines but not others. | |||
## If your ~/.ssh/config file is good, then the problem is definitely on the lab side; open a service desk ticket and tell them which machines have the problem and which do not. | |||
# On your laptop/desktop, check that your kerberos ticket is still valid and is forwardable. Use the command "klist -f"; check the expiration date and whether "forwardable" appears under the "Ticket flags". If you do not have a ticket, or if it is no longer valid or not forwardable, do another kinit and try again. For details see the discussion on the [[ComputingLogin | Mu2e wiki page about logging in to the interactive machines]]. | |||
# If you started on your laptop/desktop and logged into machine A and are trying to log into machine B from machine A, make sure that your kerberos ticket is still valid and is forwardable on the intermediate machines. If not, you can kinit on the intermediate machines. | |||
# Check that you are correctly forwarding your kerberos ticket when you use the ssh command. See the [[ComputingLogin | Mu2e wiki page about logging in to the interactive machines]] for how to do this either with command line options or with configuration files. | |||
# On 4/12/21, the admins added a cron job to restart the NFS permissions daemons if they crash, which happens occasionally. So it is reasonable to wait 30 min to see if this restores access. | |||
# If none of this works, open a service desk ticket. | |||
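The ticket check in the list above can be scripted. The following is a minimal sketch, not an official tool: the function name <code>ticket_is_forwardable</code> is made up for illustration, and it assumes that "klist -f" prints either an MIT-style "Flags:" line (where an uppercase F means forwardable) or a Heimdal-style "Ticket flags:" line containing the word "forwardable". Other klist variants may format the flags differently.

```shell
# Reads the output of "klist -f" on stdin and succeeds (exit status 0)
# if a forwardable ticket is listed.
# Assumption: only these two output styles are recognised:
#   MIT Kerberos:  "Flags: FPRIA"  (uppercase F means forwardable)
#   Heimdal:       "Ticket flags: forwardable, renewable, ..."
ticket_is_forwardable() {
    local text
    text=$(cat)
    # Heimdal style: the flag is spelled out.
    if printf '%s\n' "$text" | grep -qi 'forwardable'; then
        return 0
    fi
    # MIT style: look for an uppercase F among the flag letters.
    printf '%s\n' "$text" | grep -Eq 'Flags:[[:space:]]*[A-Za-z]*F'
}
```

For example, <code>klist -f | ticket_is_forwardable || echo "no forwardable ticket, run kinit again"</code>. The function only inspects the flags; the expiration date still has to be checked by eye.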
===Can't log in to mu2egpvm=== | |||
Logins hang when you ssh into mu2egpvm01. Existing processes have trouble accessing kerberized disks (/nashome and /web). This could potentially affect other nodes. | |||
* (from Tim Skirvin, 4/2023) At least one case of this being fixed by his "fix gssproxy" script | |||
* (from Ed Simmons, 4/2023) There is another gssproxy-related bug. There is an automatically-generated 'machine ticket' that is stored in /tmp which can also expire without being renewed by the kernel when it should be. When this happens, restarting nfs-secure, or gss-proxy fixes it. Believed rare. | |||
===Remote Host Identification has changed=== | |||
When you try to log in to one of the Mu2e interactive machines you may occasionally see the following message: | |||
<pre> | |||
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |||
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @ | |||
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |||
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY! | |||
Someone could be eavesdropping on you right now (man-in-the-middle attack)! | |||
It is also possible that a host key has just been changed. | |||
The fingerprint for the ECDSA key sent by the remote host is | |||
SHA256:ZushUU76+0vOnVrChUmJKzHUmvA/cogAyR6p3L0jpcQ. | |||
Please contact your system administrator. | |||
Add correct host key in /Users/kutschke/.ssh/known_hosts to get rid of this message. | |||
Offending ECDSA key in /Users/kutschke/.ssh/known_hosts:32 | |||
</pre> | |||
If you see this message, issue the following command: | |||
ssh-keygen -R <hostname> | |||
This will remove all lines referring to <hostname> from ~/.ssh/known_hosts. The first time that you log in after this you will see a message like: | |||
<pre> | |||
The authenticity of host 'mu2egpvm03 (131.225.67.36)' can't be established. | |||
ED25519 key fingerprint is SHA256:BedPAEFKfEy8WvBbVLr+IATD0meotoBkdX1OWETJIyI. | |||
This key is not known by any other names. | |||
Are you sure you want to continue connecting (yes/no/[fingerprint])? | |||
</pre> | |||
Reply yes and continue. | |||
The backstory is that the machine in question has been issued a new kerberos host principal. This often happens when a machine has a major OS upgrade but it can happen at other times. | |||
You can perform the same operation as ssh-keygen by hand. Edit the file ~/.ssh/known_hosts and remove all lines containing the name of the machine. You may need to remove several lines, for example "mu2egpvm03" and "mu2egpvm03.fnal.gov". | |||
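The by-hand edit described above can also be done with a one-line filter. This is a minimal sketch; the function name <code>strip_host_lines</code> is made up for illustration, and it assumes that host names are stored in plain text in known_hosts (that is, HashKnownHosts is not enabled). If the entries are hashed, use ssh-keygen -R instead.

```shell
# Print a known_hosts file with every line that mentions HOST removed.
# Assumption: host names are stored in plain text; hashed entries
# cannot be found this way.
strip_host_lines() {
    local host="$1" file="$2"
    # -F: match HOST literally; -v: keep only lines that do not contain it.
    # grep exits with status 1 when nothing is left to print, which is
    # not an error here; any other failure status is passed on.
    grep -vF -- "$host" "$file" || [ $? -eq 1 ]
}
```

For example, <code>strip_host_lines mu2egpvm03 ~/.ssh/known_hosts > ~/.ssh/known_hosts.new</code> writes a filtered copy that you can inspect before moving it into place. Because the match is on a substring, lines for "mu2egpvm03.fnal.gov" are removed as well.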
==git== | ==git== | ||
Line 64: | Line 142: | ||
> which git | > which git | ||
/cvmfs/mu2e.opensciencegrid.org/artexternals/git/v2_20_1/Linux64bit+3.10-2.17/bin/git | /cvmfs/mu2e.opensciencegrid.org/artexternals/git/v2_20_1/Linux64bit+3.10-2.17/bin/git | ||
The exact git version may change. You get the UPS version when you <code> | The exact git version may change. You get the UPS version when you <code>mu2einit</code>. This error may also have other causes. | ||
===fatal: not a git repository=== | ===fatal: not a git repository=== | ||
When issuing any of several git commands | When issuing any of several git commands | ||
Line 72: | Line 150: | ||
</pre> | </pre> | ||
Probably you are not in a git repository directory. All git repo directories (such as our Offline), will have a <code>.git</code> subdirectory. | Probably you are not in a git repository directory. All git repo directories (such as our Offline), will have a <code>.git</code> subdirectory. | ||
===SSL certificate problem=== | |||
The clone command fails with | |||
fatal: unable to access 'https://github.com/Mu2e/Offline.git/': SSL certificate problem: unable to get local issuer certificate | |||
The likely cause is that the CA certificates on the local machine are older than GitHub's. A probable workaround, which turns off SSL certificate verification, is: | |||
git clone -c http.sslverify=false https://github.com/Mu2e/Offline.git | |||
If you are on a central node, the certificates on the node will likely be updated eventually, which may also fix the problem. | |||
==root== | |||
===Missing Dictionaries=== | |||
If you are on an AL9 machine, using root to read a root file, and see error messages like: | |||
Processing proc.c... | |||
Warning in <TClass::Init>: no dictionary for class mu2e::EventInfo is available | |||
Warning in <TClass::Init>: no dictionary for class mu2e::EventInfoMC is available | |||
you need to define LD_LIBRARY_PATH. See [[Spack#Using_root_with_Pre_Built_Dictionaries]]. This is a known issue with Stntuple and TrkAna. | |||
==scons== | ==scons== | ||
==spack== | |||
===matches multiple packages=== | |||
During a spack load command, spack complains that the name matches multiple packages: | |||
<pre> | |||
build02 spackm > spack load python | |||
==> Error: python matches multiple packages. | |||
Matching packages: | |||
kz52h4k python@3.9.15%gcc@4.8.5 arch=linux-scientific7-x86_64_v2 | |||
mssmz2i python@3.9.15%gcc@4.8.5 arch=linux-scientific7-x86_64_v2 | |||
</pre> | |||
The solution is to pick one by its hash; in this case, for example: | |||
spack load python/kz52h4k | |||
If you'd like to investigate the versions, you can look at the differences in the dependencies of the builds: | |||
spack diff python/kz52h4k python/mssmz2i | |||
Line 224: | Line 339: | ||
DbService : @local::DbEmpty | DbService : @local::DbEmpty | ||
ProditionsService: @local::Proditions | ProditionsService: @local::Proditions | ||
=== DbHandle no TID === | |||
An exe gives this error (table name and module name may be different) on starting up | |||
---- DBHANDLE_NO_TID BEGIN | |||
DbHandle could not get TID (Table ID) from DbEngine for TrkPreampStraw at first use The above exception was thrown while processing module StrawHitReco/makeSH run: 1000 | |||
---- DBHANDLE_NO_TID END | |||
Every executable may or may not load a set of conditions read from the database, depending on its fcl configuration. This error says that a database table was requested by the code, but was not available in the conditions set. So either the conditions set was not loaded or what was loaded was the wrong conditions set for this code. Look for the following lines in the fcl: | |||
services.DbService.purpose: NOMINAL | |||
services.DbService.version: v1_0 | |||
These define what conditions set to load. See also [[ConditionsData|general database info]] and [https://mu2einternalwiki.fnal.gov/wiki/CalibrationSets calibration sets]. Since it is dynamic, and hard to document fully, only an expert can fully verify the correct conditions set for any particular data and exe. | |||
===getByLabel: Found zero products matching all criteria=== | ===getByLabel: Found zero products matching all criteria=== | ||
Line 249: | Line 375: | ||
See [https://mu2e-hnews.fnal.gov/HyperNews/Mu2e/get/HELPBug/243.html this discussion] | See [https://mu2e-hnews.fnal.gov/HyperNews/Mu2e/get/HELPBug/243.html this discussion] | ||
===Product Dropped Unexpectedly by RootInput=== | |||
When you read an art data file with RootInput, you can drop products from the file using the inputCommands parameter. The default behaviour is to drop all descendant products as well. This is reported with a warning message like the following, one for each descendant data product that is dropped: | |||
<pre> | |||
%MSG-w RootInputFile: RootOutput:defaultOutput@Construction 06-Sep-2020 19:33:53 CDT ModuleConstruction | |||
Branch 'mu2e::CaloShowerStepROs_compressDigiMCs__cosmics3.' is being dropped from the input | |||
of file 's8.art' because it is dependent on a branch | |||
that was explicitly dropped. | |||
</pre> | |||
This behaviour is described in [[FclIOModules#Drop_on_Input]]; that page describes how to tell art to keep the descendant data products. | |||
===Missing rpms for art v3_06_xx=== | |||
When you migrate to an Offline based on art v3_06_xx from one based on art v3_05_xx, you may need to install additional rpms: | |||
pcre2 xxhash-libs libzstd libzstd-devel | |||
ftgl libGLEW gl2ps root-graf-asimage | |||
These must be installed on all machines on which you plan to build or run the code. The rpms on the first line were needed by machines configured as are the Mu2e interactive machines. Bertrand Echenard reports that he had to install some additional rpms on his machine; those are listed on the second line. | |||
One of the symptoms of this is the following error message at build time: | |||
ImportError: libxxhash.so.0: cannot open shared object file: No such file or directory: | |||
If you build on one machine, and run on another machine, and if the run-time machine is not updated, you will get a different error: | |||
<pre> | |||
---- Configuration BEGIN | |||
Unable to load requested library /storage/local/data1/condor/execute/dir_14461/no_xfer/Code/Offline/lib/libmu2e_GeometryService_GeometryService_service.so | |||
libzstd.so.1: cannot open shared object file: No such file or directory | |||
---- Configuration END | |||
</pre> | |||
If you get the above error in a grid job, you are most likely using an older version of mu2egrid. Be sure to use v6_09_00 or greater, which tells the grid jobs to run using the most recent available singularity container. That container has the above rpms installed. For bare jobsub, here is the incantation. 11/2021 - we are told the append line is no longer required. | |||
--singularity-image '/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:latest' | |||
===<top_level>.fileNames=== | |||
When running a fcl file from Production, such as Production/JobConfig/CeEndPoint.fcl, you quickly see the error | |||
The supplied value of the parameter: <top_level>.fileNames does not represent a sequence | |||
This error means that the fcl is requiring an input file, but you did not provide any. You will have to follow up with experts or documentation. | |||
==Grid Workflows== | ==Grid Workflows== | ||
=== What to do if a grid segment is reported in a HELD state === | |||
<pre> | |||
murat@mu2egpvm06:/mu2e/app/users/murat/su2020_prof>jobsub_q --jobid 51739456.0@jobsub01.fnal.gov | |||
JOBSUBJOBID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD | |||
51739456.0@jobsub01.fnal.gov murat 01/02 13:41 0+00:37:47 H 0 2.2 mu2eprodsys.sh_20220102_134146_431218_0_1_wrap.sh | |||
1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended | |||
</pre> | |||
To figure out the reason, run <b>jobsub_q --long</b>: | |||
<pre> | |||
murat@mu2egpvm06:/mu2e/app/users/murat/su2020_prof>jobsub_q --long --jobid 51739456.0@jobsub01.fnal.gov | grep -i hold | |||
HoldReason = "Error from slot1_4@fnpc7672.fnal.gov: Docker job has gone over memory limit of 3000 Mb" | |||
HoldReasonCode = 34 | |||
HoldReasonSubCode = 0 | |||
NumSystemHolds = 0 | |||
OnExitHold = false | |||
PeriodicHold = false | |||
</pre> | |||
Note that the memory appears to be accounted per Docker container, not per user job: the user job is reported to use only 2.2 GBytes. | |||
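To condense the hold attributes into a single line, the output of jobsub_q --long can be passed through a small awk filter. This is a minimal sketch; the function name <code>hold_summary</code> is made up for illustration, and it assumes the "Name = value" layout shown in the listing above, with each attribute starting in the first column.

```shell
# Reads "jobsub_q --long" output on stdin and prints one line of the form
#   <HoldReasonCode>/<HoldReasonSubCode>: <HoldReason>
# Prints nothing if no HoldReasonCode attribute is present.
hold_summary() {
    awk -F' = ' '
        $1 == "HoldReason" {
            # Take everything after the first " = " and strip the quotes.
            reason = substr($0, index($0, " = ") + 3)
            gsub(/^"|"$/, "", reason)
        }
        $1 == "HoldReasonCode"    { code = $2 }
        $1 == "HoldReasonSubCode" { sub_code = $2 }
        END { if (code != "") printf "%s/%s: %s\n", code, sub_code, reason }
    '
}
```

For example, <code>jobsub_q --long --jobid <jobid> | hold_summary</code>. The code and subcode pair printed first can then be looked up in the hold codes table.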
===Hold codes=== | ===Hold codes=== | ||
1 user put jobs in hold | |||
6 sub 0,2 could not execute glidein, and/or docker did not run | 6 sub 0,2 could not execute glidein, and/or docker did not run | ||
9 not enough memory; remember that you need to include the size of the code tarball/release | 9 not enough memory; remember that you need to include the size of the code tarball/release | ||
12 sub 2 could not execute glidein | 12 sub 2 could not execute glidein | ||
13 | 13 | ||
26 sub 1 memory limits | 26 sub 1 memory limits | ||
26 sub 2 exceeded disk limits | |||
26 sub 4 SYSTEM_PERIODIC_HOLD Starts/limit 31/10 - too many restarts? | 26 sub 4 SYSTEM_PERIODIC_HOLD Starts/limit 31/10 - too many restarts? | ||
26 sub 8 wall time | |||
28 sub -10000,512,768,256 sandbox | 28 sub -10000,512,768,256 sandbox | ||
30 sub -10000,768 Job put on hold by remote host, job proxy is not valid | 30 sub -10000,768 Job put on hold by remote host, job proxy is not valid | ||
34 sub 0 - memory limits | 34 sub 0 - memory limits | ||
35 sub 0 Error from slot - probably a condor (Docker?) failure on the worker node | 35 sub 0 Error from slot - probably a condor (Docker?) failure on the worker node | ||
See also [http://research.cs.wisc.edu/htcondor/manual/v8.7/JobClassAdAttributes.html#x170-1249000A.2 condor docs] | See also [http://research.cs.wisc.edu/htcondor/manual/v8.7/JobClassAdAttributes.html#x170-1249000A.2 condor docs] | ||
Line 300: | Line 492: | ||
* if a simple <code>ls</code> on a directory which does not contain many files or subdirectories hangs for more than 2 min. | * if a simple <code>ls</code> on a directory which does not contain many files or subdirectories hangs for more than 2 min. | ||
* if file access in your MC workflow seems normal then suddenly hangs for more than 1h | * if file access in your MC workflow seems normal then suddenly hangs for more than 1h | ||
* if | * if prestaging does not progress. See [[Prestage]] | ||
Sometimes a hang is occurring only on one node, due to a problem with its nfs server. In this case, you can check several nodes, put the bad node in a ticket, and work on another node. | Sometimes a hang is occurring only on one node, due to a problem with its nfs server. In this case, you can check several nodes, put the bad node in a ticket, and work on another node. | ||
Generally, dCache has a lot of moving parts and is fragile in some ways. There is no real cost to putting in a ticket and the dCache maintainers are responsive, so when in doubt, put in a ticket. You will always learn something about dCache. | Generally, dCache has a lot of moving parts and is fragile in some ways. There is no real cost to putting in a ticket and the dCache maintainers are responsive, so when in doubt, put in a ticket. You will always learn something about dCache. | ||
===dCache Input/Output error=== | |||
When attempting to access a file in tape-backed dCache, you see an error like | |||
cp: error reading /pnfs/mu2e/tape/phy-etc/bck/mu2e/beams1phase1/g4-10-4/tbz/66/08/bck.mu2e.beams1phase1.g4-10-4.002701_00000004.tbz: Input/output error | |||
This error can have two causes. The most likely is that the file is simply not prestaged to disk. Reporting an infrastructure error when the file is merely unavailable is not user-friendly, and this behavior may be changed in the future. Note that attempting to read a non-prestaged file in tape-backed dCache via the NFS directory (/pnfs) will not trigger the file to be prestaged. | |||
The second possibility is that there is an actual problem with the /pnfs mount. This usually requires a ticket to ask for the NFS client to be restarted, or perhaps the node to be rebooted. | |||
===dCache timeout errors=== | |||
Multiple grid jobs have the following message in log files: | |||
[ERROR] Server responded with an error: [3012] Internal timeout | |||
This can be due to the requested file not being prestaged, so that is the first thing to check. If you can establish that isn't the problem, put in a ticket. | |||
As of 4/2023, the timeout is 1.5 hours for FTP and 30 minutes for webdav and xroot. | |||
===mkdir failed mu2eFileUpload=== | ===mkdir failed mu2eFileUpload=== | ||
Line 365: | Line 569: | ||
The only solution is to resubmit. There seems to be something flaky deep in the art/FermiGrid/xrootd/dCache path that has bugs or load issues. | The only solution is to resubmit. There seems to be something flaky deep in the art/FermiGrid/xrootd/dCache path that has bugs or load issues. | ||
Here is another failure message: | |||
Fatal Root Error: TNetXNGFile::ReadBuffer | |||
[ERROR] Server responded with an error: [3010] org.dcache.uuid is no longer valid. | |||
ROOT severity: 3000 | |||
3/2021: dCache experts suggested that setting this might help: | |||
export XRD_STREAMTIMEOUT=300 | |||
=== auth and handshake errors while using xrootd === | |||
Many jobs end with errors like: | |||
Fatal Root Error: TNetXNGFile::Open | |||
[FATAL] Hand shake failed | |||
ROOT severity: 3000 | |||
or: | |||
---- FatalRootError BEGIN | |||
Fatal Root Error: TNetXNGFile::Open | |||
[FATAL] Auth failed: No protocols left to try | |||
ROOT severity: 3000 | |||
---- FatalRootError END | |||
Unable to open specified secondary event stream file root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/mu2e/tape/phy- | |||
The current thinking is that this is a problem with authentication on the grid node. So far it has not been persistent, but if it is, put in a ticket. | |||
===samweb SSL error: [SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED]=== | ===samweb SSL error: [SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED]=== | ||
Line 383: | Line 612: | ||
kx509 | kx509 | ||
=== mu2eprodsys Error === | === mu2eprodsys Error - python future === | ||
If you setup both Offline and mu2egrid in the same window we recommend that you setup Offline first and mu2egrid second: | If you setup both Offline and mu2egrid in the same window we recommend that you setup Offline first and mu2egrid second: | ||
<pre> | <pre> | ||
mu2einit | |||
source <path to an Offline installation>/setup.sh | source <path to an Offline installation>/setup.sh | ||
setup mu2egrid | setup mu2egrid | ||
Line 423: | Line 652: | ||
Or you can always setup mu2egrid in a window in which Offline has not been, and will not be, setup: | Or you can always setup mu2egrid in a window in which Offline has not been, and will not be, setup: | ||
<pre> | <pre> | ||
mu2einit | |||
setup mu2egrid | setup mu2egrid | ||
mu2eprodsys <other arguments> | mu2eprodsys <other arguments> | ||
Line 439: | Line 668: | ||
The core problem is that jobsub_client is attempting to be smart enough to "just work" with both the system python and with any python from UPS. UPS does not have the features needed to manage all of the corner cases of this use pattern. I have asked the developers of jobsub_client to check, at run-time, for a self-consistent environment and to issue a user friendly error message if the environment is not self-consistent. | The core problem is that jobsub_client is attempting to be smart enough to "just work" with both the system python and with any python from UPS. UPS does not have the features needed to manage all of the corner cases of this use pattern. I have asked the developers of jobsub_client to check, at run-time, for a self-consistent environment and to issue a user friendly error message if the environment is not self-consistent. | ||
===Code NOT_FOUND in job ouput=== | |||
The symptom of this problem is that you get the following error message in log files in your outstage directory: | |||
mu2eprodsys Thu Oct 1 15:17:56 UTC 2020 -- 1601565476 pointing MU2EGRID_USERSETUP to the tarball extract: NOT_FOUND/Code/setup.sh | |||
./mu2eprodsys.sh: line 281: NOT_FOUND/Code/setup.sh: No such file or directory | |||
Error sourcing setup script NOT_FOUND/Code/setup.sh: status code 1 | |||
This means that you submitted a grid job with tarball option but the tarball could not be unpacked by the [[Cvmfs#Rapid_Code_Distribution_Service_.28RCDS.29|Rapid Code Distribution Service]]. For reasons that I don't understand, the grid system might not detect this error and your grid jobs will start to run. When they try to access the code distributed by the tarball they will fail with the message above. | |||
In almost all cases, this error is caused by one of two things: | |||
#The tar file is too big. In the one example to date, the tarball was 5 GB instead of the typical ~800 MB. We don't know where the threshold is. | |||
#The tar file contains one or more files larger than 1 GB | |||
In all cases to date this was caused by the user putting inappropriate files into the tarball: art event-data files, root files, graphics files or log files. These files should be on /mu2e/data, not on /mu2e/app. When you are looking for large files, remember to check for hidden files, files whose name begins with a dot; some tools make temporary files named .nfsxxx where xxx is a long hex number. You can see hidden files with "ls -a". | |||
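A quick way to look for oversized files, hidden ones included, before making the tarball is sketched below. The function name <code>list_large_files</code> and its optional threshold argument are made up for illustration; the default threshold of 1 GB (1048576 kB) follows the guideline above.

```shell
# List regular files under DIR that are larger than MIN_KB kilobytes.
# MIN_KB defaults to 1048576 (1 GB). find descends into dot-directories
# and reports dot-files such as .nfsxxx without any extra options.
list_large_files() {
    local dir="$1" min_kb="${2:-1048576}"
    find "$dir" -type f -size +"${min_kb}"k | sort
}
```

For example, <code>list_large_files /path/to/your/code</code> lists files over 1 GB, and <code>list_large_files /path/to/your/code 102400</code> lowers the threshold to 100 MB, which helps spot event-data or log files that add up to an oversized tarball.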
If you are within the guidelines above and the problem still occurs, please report the error using a service desk ticket. | |||
There is additional information in an old ticket: INC000001105610. | |||
===Can not get BFID=== | |||
While running mu2eDatasetLocation to add tape locations to the SAM record of files which you uploaded, you see this error: | |||
Can not get BFID from pnfs layer 1 /pnfs/mu2e/tape/usr-sim/dig/oksuzian/CRY-cosmic-general/cry3-digi-hi/art/71/45/.(use)(1)(dig.oksuzian.CRY-cosmic-general.cry3-digi-hi.001002_00009067.art): on Mon Mar 22 12:37:30 2021 | |||
We believe this occurs when the script is run before some items in the dCache database have been inserted. It usually runs correctly after waiting an hour or two. | |||
===jobsub dropbox_upload failed=== | |||
On jobsub job submission, | |||
Submitting.... | |||
dropbox_upload failed | |||
This error is caused by the underlying mechanisms for transferring your local files to the grid being broken. It is only in jobsub_client version v1_3_3. A solution is to setup the previous version: | |||
setup jobsub_client v1_3_2_1 | |||
Do this after setup mu2egrid. | |||
=== Error: OperationalError: no such table: keycache error === | |||
This error happens when you submit a grid job (typically using mu2eprodsys or jobsub_submit). It started to occur in mid-November 2023. The underlying issue is a problem with the tokens system used for grid authentication. The following is the suggested workaround: | |||
ls -l ${XDG_CACHE_HOME:-$HOME/.cache}/scitokens/scitokens_keycache.sqllite | |||
If this file exists and is zero length, then remove it and try to submit again. | |||
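The check and removal described above can be combined into one step. This is a minimal sketch; the function name <code>remove_empty_keycache</code> is made up for illustration, and the default path is the one given above.

```shell
# Remove the scitokens key cache only if it exists and has zero length.
# An optional argument overrides the default path.
remove_empty_keycache() {
    local cache="${1:-${XDG_CACHE_HOME:-$HOME/.cache}/scitokens/scitokens_keycache.sqllite}"
    if [ -f "$cache" ] && [ ! -s "$cache" ]; then
        rm -f -- "$cache"
        echo "removed empty key cache: $cache"
    else
        echo "nothing to do: $cache is missing or not empty"
    fi
}
```

Run <code>remove_empty_keycache</code> with no arguments and then try the submission again. A cache file that is not empty is left untouched.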
===Disk quota exceeded during jobsub_submit or mu2eprodsys=== | |||
If you get the following error during the execution of a jobsub_submit command | |||
<pre> | |||
Unable to get key triggered by next update: disk I/O error | |||
Error: OSError: [Errno 122] Disk quota exceeded | |||
</pre> | |||
the issue is that your home directory has no quota remaining. mu2eprodsys is just a wrapper around jobsub_submit so it will issue the same error in the same circumstances. | |||
The issue is that jobsub_submit writes small files in your home directory under $HOME/.cache and $HOME/.config. If there is no available disk quota, jobsub_submit will fail. | |||
The solution is to remove some files from your home directory; either delete them or move them to a more appropriate location. | |||
==='Available' is not on the list=== | |||
2/2024 During jobsub submission | |||
Submitting.... | |||
Error: ValueError: 'Available' is not in list | |||
This is caused by a bug in a disk space check. The temporary solution is to skip the check with | |||
--skip-check=disk_space | |||
or for mu2eprodsys | |||
--jobsub-arg='--skip-check=disk_space' | |||
=== Hold test evaluated true (hold code 26/0) === | |||
Your jobs go to hold, but instead of telling you why (memory, time) it simply says that this whole long expression evaluated true. If you go to the "why are my jobs held" monitoring page, it will say the cause is "code 26 subcode 0" which is "other". The experts say this is usually a hold due to running over lifetime, but here is a command which might be able to pinpoint which part of the test is failing: | |||
<pre> | |||
> condor_config_val -schedd -name jobsub05.fnal.gov SYSTEM_PERIODIC_HOLD | awk '{gsub("JobStatus","LastJobStatus"); gsub("time\\(\\)","EnteredCurrentStatus"); split($0,s," *\\|\\| *"); for(x in s) printf("%s\0",s[x])}' | xargs -0 condor_q -name jobsub05.fnal.gov 11758936.0 -af:ln | |||
( LastJobStatus == 2 && JobUniverse == 5 && MemoryUsage > 1.0 * RequestMemory ) = undefined | |||
( LastJobStatus == 2 && JobUniverse == 5 && DiskUsage > 1.0 * RequestDisk ) = false | |||
( LastJobStatus == 2 && JobUniverse == 5 && NumJobStarts > 10 ) = false | |||
( JobUniverse == 5 && NumShadowStarts > 10 ) = false | |||
( LastJobStatus == 2 && JobUniverse == 5 && EnteredCurrentStatus - JobCurrentStartDate > JOB_EXPECTED_MAX_LIFETIME && JOB_EXPECTED_MAX_LIFETIME > 0 ) = true | |||
( JobUniverse == 5 && JOB_EXPECTED_MAX_LIFETIME > 345600 ) = false | |||
( LastJobStatus == 2 && JobUniverse == 5 && ((EnteredCurrentStatus - JobCurrentStartDate) > 345600 )) = false | |||
( LastJobStatus == 2 && JobUniverse == 7 && ((EnteredCurrentStatus - EnteredCurrentStatus) > 2592000 )) = false | |||
</pre> | |||
> /usr/bin/condor_q -name jobsub05.fnal.gov 11758936.0 -af:ln MemoryUsage RequestMemory | |||
MemoryUsage = undefined | |||
RequestMemory = 2499 | |||
===SAM SSL certifcate error=== | |||
When running sam commands, you see the following: | |||
400 The SSL certificate error | |||
400 Bad Request | |||
The SSL certificate error | |||
This happens because you have a voms (kx509) proxy file, but the proxy has expired. Run vomsCert. | |||
==Mac== | ==Mac== |
Latest revision as of 03:03, 6 November 2024
Introduction
Some errors occur regularly, such as when authorization expires, dCache is struggling, or a procedure is repeated when it can't be repeated. Some common situations are recorded here with advice on how to handle them. Errors that can easily be googled, such as syntax errors, will not appear here.
UPS
If you type "muse setup" or issue a UPS setup command and get the following error message:
You are attempting to run "setup" which requires administrative
privileges, but more information is needed in order to do so.
Password for root:
the solution is discussed at: UPS#A_Common_Error_Message.
logins
Locale failed
When ssh-ing into an interactive node, you see the following error
while connecting to SL7 machines:
-bash: warning: setlocale: LC_CTYPE: cannot change locale (UTF-8): No such file or directory
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LC_CTYPE = "UTF-8",
LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
The warning was not present before and it is still not present on mu2egpvm01. Does someone know who generates this? I am having a running error connected to it:
---- StdException BEGIN
ServiceCreation std::exception caught during construction of service type art::TFileService:
locale::facet::_S_create_c_locale name not valid
---- StdException END
The solution is:
fixed it by commenting out the line (on my local machine):
SendEnv LANG LC_*
from the file: /etc/ssh/ssh_config
See also stackoverflow
Could not chdir to home directory
When you log into one of the Mu2e interactive machines, your kerberos ticket must be forwarded to the machine that serves the home directories. If this does not work properly, you will see an error message similar to:
Could not chdir to home directory /nashome/k/kutschke: Permission denied.
The issue could either be at your end or it could be a problem with the services provided by the lab. First follow the discussion below to ensure that you are doing everything correctly. If you are, and if the problem remains, then open a service desk ticket; be sure to mention that you have followed these instructions (you can send them the url to this subsection).
To diagnose this situation, follow this checklist:
- Check if you can log into some of the other Mu2e interactive machines and see your home disk.
- If you can, then there is one weird corner case to check; make sure that your ~/.ssh/config file does not enable forwarding to some of the Mu2e interactive machines but not others.
- If your ~/.ssh/config file is good, then the problem is definitely on the lab side; open a service desk ticket and tell them which machines have the problem and which do not.
- On your laptop/desktop, check that your kerberos ticket is still valid and is forwardable. Use the command "klist -f"; check the expiration date and whether "forwardable" appears under the "Ticket flags". If you do not have a ticket, or if it is no longer valid or not forwardable, do another kinit and try again. For details see the discussion on the Mu2e wiki page about logging in to the interactive machines.
- If you started on your laptop/desktop and logged into machine A and are trying to log into machine B from machine A, make sure that your kerberos ticket is still valid and is forwardable on the intermediate machines. If not, you can kinit on the intermediate machines.
- Check that you are correctly forwarding your kerberos ticket when you use the ssh command. See the Mu2e wiki page about logging in to the interactive machines for how to do this either with command line options or with configuration files.
- On 4/12/21, the admins added a cron job to restart the NFS permissions daemons if they crash, which they do occasionally. So it is reasonable to wait 30 min to see if access recovers.
- If none of this works, open a service desk ticket.
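The forwardable check in the list above can be scripted. The following is a minimal sketch, assuming the text of "klist -f" has already been captured into a string. It assumes MIT-style output, where the letter F in the "Flags:" field means forwardable, and also accepts output that spells the flag out in words. The function name and the sample text are hypothetical and are not taken from a real session.

```python
import re

def is_forwardable(klist_output: str) -> bool:
    """Return True if any ticket in `klist -f` output carries the forwardable flag."""
    for line in klist_output.splitlines():
        # MIT-style output: "... Flags: FRIA", where the letter F means forwardable.
        m = re.search(r"Flags:\s*(\w+)", line)
        if m and "F" in m.group(1):
            return True
        # Word-style output: "Ticket flags: forwardable, renewable, ..."
        if re.search(r"flags", line, re.IGNORECASE) and "forwardable" in line.lower():
            return True
    return False

# Hand-written sample text for illustration only.
SAMPLE = (
    "Valid starting       Expires              Service principal\n"
    "04/12/2021 10:00:00  04/13/2021 12:00:00  krbtgt/FNAL.GOV@FNAL.GOV\n"
    "\trenew until 04/19/2021 10:00:00, Flags: FRIA\n"
)

print(is_forwardable(SAMPLE))
```

The expiration date still needs to be checked by eye; this sketch only looks at the flags.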
Can't log in to mu2egpvm
Logins hang when you ssh into mu2egpvm01. Existing processes have trouble accessing kerberized disks (/nashome and /web). This could potentially affect other nodes.
- (from Tim Skirvin, 4/2023) At least one case of this being fixed by his "fix gssproxy" script
- (from Ed Simmons, 4/2023) There is another gssproxy-related bug. There is an automatically-generated 'machine ticket' that is stored in /tmp which can also expire without being renewed by the kernel when it should be. When this happens, restarting nfs-secure, or gss-proxy fixes it. Believed rare.
Remote Host Identification has changed
When you try to log in to one of the Mu2e interactive machines you may occasionally see the following message:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:ZushUU76+0vOnVrChUmJKzHUmvA/cogAyR6p3L0jpcQ.
Please contact your system administrator.
Add correct host key in /Users/kutschke/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /Users/kutschke/.ssh/known_hosts:32
If you see this message, issue the following command:
ssh-keygen -R <hostname>
This will remove all lines referring to <hostname> from ~/.ssh/known_hosts. The first time that you log in after this you will see a message like:
The authenticity of host 'mu2egpvm03 (131.225.67.36)' can't be established.
ED25519 key fingerprint is SHA256:BedPAEFKfEy8WvBbVLr+IATD0meotoBkdX1OWETJIyI.
This key is not known by any other names.
Are you sure you want to continue connecting (yes/no/[fingerprint])?
Reply yes and continue.
The backstory is that the machine in question has been issued a new kerberos host principal. This often happens when a machine has a major OS upgrade but it can happen at other times.
You can perform the same operation as ssh-keygen by hand. Edit the file ~/.ssh/known_hosts and remove all lines containing the name of the machine. You may need to remove several lines, for example "mu2egpvm03" and "mu2egpvm03.fnal.gov".
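The by-hand edit can be sketched in a few lines of Python. This is only an illustration of the filtering logic, assuming plain (unhashed) known_hosts entries whose first field is a comma-separated list of host names. The function name and the key data are hypothetical. Hashed entries cannot be matched this way, which is one reason ssh-keygen -R is the preferred route.

```python
def drop_host(lines, hostname):
    """Return the known_hosts lines that do not refer to `hostname`."""
    kept = []
    for line in lines:
        fields = line.split()
        # Keep blank lines and comments untouched.
        if not fields or fields[0].startswith("#"):
            kept.append(line)
            continue
        names = fields[0].split(",")
        # Match the bare name and any fully qualified form such as name.fnal.gov.
        if any(n == hostname or n.startswith(hostname + ".") for n in names):
            continue
        kept.append(line)
    return kept

# Placeholder entries; the key material is not real.
entries = [
    "mu2egpvm03 ssh-ed25519 AAAA",
    "mu2egpvm03.fnal.gov,131.225.67.36 ecdsa-sha2-nistp256 BBBB",
    "mu2egpvm01 ssh-ed25519 CCCC",
]
for line in drop_host(entries, "mu2egpvm03"):
    print(line)
```

Both the short and the fully qualified name are removed, matching the "mu2egpvm03" and "mu2egpvm03.fnal.gov" example above.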
git
KeyError on git push
Traceback (most recent call last):
  File "hooks/post-receive", line 73, in <module>
    main()
  File "hooks/post-receive", line 62, in main
    os.environ["GL_USER"] = os.environ["REMOTEUSER"]
  File "/usr/lib64/python2.6/UserDict.py", line 22, in __getitem__
    raise KeyError(key)
KeyError: 'REMOTEUSER'
error: hooks/post-receive exited with error code 1
To ssh://p-mu2eofflinesoftwaremu2eoffline@cdcvs.fnal.gov/cvs/projects/mu2eofflinesoftwaremu2eoffline/Offline.git
This is due to inconsistent ssh settings, please see this page.
Could not resolve hostname
On any git command
> git pull
ssh: Could not resolve hostname cdcvs.fnal.gov:: Name or service not known
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
Check which git you are using
> which git
/usr/bin/git
This result is wrong: it is an old version that is inconsistent with our settings. You should be using a UPS version:
> which git
/cvmfs/mu2e.opensciencegrid.org/artexternals/git/v2_20_1/Linux64bit+3.10-2.17/bin/git
The exact git version may change. You get the UPS version when you run mu2einit. This error may also have other causes.
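The same check can be done from a script. The sketch below assumes, based only on the example path above, that the UPS git lives under /cvmfs; the function name is hypothetical.

```python
import shutil

def looks_like_ups_git(git_path):
    """True if the path is under /cvmfs, as in the UPS example path above."""
    return git_path is not None and git_path.startswith("/cvmfs/")

# shutil.which returns the first git on PATH, or None if there is none.
git_path = shutil.which("git")
if looks_like_ups_git(git_path):
    print("using git from cvmfs:", git_path)
else:
    print("git is not from cvmfs:", git_path)
```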
fatal: not a git repository
When issuing any of several git commands
fatal: not a git repository (or any parent up to mount point /mu2e)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
Probably you are not in a git repository directory. All git repo directories (such as our Offline) will have a .git subdirectory.
SSL certificate problem
clone command fails with
fatal: unable to access 'https://github.com/Mu2e/Offline.git/': SSL certificate problem: unable to get local issuer certificate
Likely the problem is that the CA certs on the local machine are behind github's. Probable solution (note that this turns off certificate verification for the clone):
git clone -c http.sslverify=false https://github.com/Mu2e/Offline.git
Also, if you are on a central node, the certs on the node will likely be updated, which might fix the problem.
root
Missing Dictionaries
If you are on an AL9 machine, using root to read a root file, and see error messages like:
Processing proc.c...
Warning in <TClass::Init>: no dictionary for class mu2e::EventInfo is available
Warning in <TClass::Init>: no dictionary for class mu2e::EventInfoMC is available
you need to define LD_LIBRARY_PATH. See Spack#Using_root_with_Pre_Built_Dictionaries. This is a known issue with Stntuple and TrkAna.
scons
spack
matches multiple packages
During a spack load command it complains about multiple packages
build02 spackm > spack load python
==> Error: python matches multiple packages.
  Matching packages:
    kz52h4k python@3.9.15%gcc@4.8.5 arch=linux-scientific7-x86_64_v2
    mssmz2i python@3.9.15%gcc@4.8.5 arch=linux-scientific7-x86_64_v2
The solution is to pick one by its hash; in this case, for example:
spack load python/kz52h4k
If you'd like to investigate the versions, you can look at differences in the dependencies of the builds:
spack diff python/kz52h4k python/mssmz2i
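If the error lists many candidates, the hashes can be pulled out of the message mechanically. The sketch below parses the error text shown above and prints one candidate "spack load" command per match. It assumes the 7-character hash and the spec appear at the start of each indented line, as in the example; the function name is hypothetical.

```python
import re

def matching_hashes(spack_error: str):
    """Extract (hash, spec) pairs from a 'matches multiple packages' error."""
    pattern = re.compile(r"^[ \t]*([a-z0-9]{7})[ \t]+(\S+@\S+)", re.MULTILINE)
    return pattern.findall(spack_error)

ERROR_TEXT = """\
==> Error: python matches multiple packages.
  Matching packages:
    kz52h4k python@3.9.15%gcc@4.8.5 arch=linux-scientific7-x86_64_v2
    mssmz2i python@3.9.15%gcc@4.8.5 arch=linux-scientific7-x86_64_v2
"""

for short_hash, spec in matching_hashes(ERROR_TEXT):
    name = spec.split("@")[0]  # package name without version and compiler
    print(f"spack load {name}/{short_hash}")
```

Which of the printed commands to run is still a judgment call; spack diff, as above, helps decide.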
art and fcl
art exit codes
cling::AutoloadingVisitor
At beginning of run time you see a spew of errors like this.
Error in cling::AutoloadingVisitor::InsertIntoAutoloadingState: Missing FileEntry for MCDataProducts/inc/ExtMonFNALPatRecTruthAssns.hh requested to autoload type mu2e::ExtMonFNALTrkFit
These are caused by the cling system as implemented by root. Eventually this system will pre-compile header files (and maybe dictionaries?) so that compiling source files is very fast. In root 6, this is not fully implemented and cling needs to find the header file for reasons that are not clear to us. It is searching the path given by ROOT_INCLUDE_PATH, which should be set adequately by the setups. If, for some reason, the path is not correct or you have removed the header files, the error message is harmless - we know of no consequence except the spew.
Update (1/2020): we have seen this error also be correlated with crashes; in other words, missing header files might be providing something root needs. We are trying to figure out what root might be getting from the headers and provide it all in the libraries.
An example crash (Lisa, 1/28/20 in a docker container, ceSimReco)
%MSG-s ArtException: PostEndJob 28-Jan-2020 19:53:16 UTC ModuleEndJob
---- FatalRootError BEGIN
  Fatal Root Error: @SUB=
  ! (prop&kIsClass) && "Impossible code path" violated at line 462 of `/scratch/workspace/canvas-products/v3_09_00-/SLF7/e19-prof/build/root/v6_18_04c/source/root-6.18.04/io/io/src/TGenCollectionProxy.cxx'
---- FatalRootError END
%MSG
Another from Dave (LBL) 12/23/19. This seems to be a real missing dictionary.
%MSG-s ArtException: FileDumperOutput:dumper@Construction 23-Dec-2019 10:00:54 CST ModuleConstruction
cet::exception caught in art
---- FileReadError BEGIN
  ---- FatalRootError BEGIN
    Fatal Root Error: @SUB=TBufferFile::ReadClassBuffer
    Could not find the StreamerInfo for version 2 of the class art::ProcessHistory, object skipped at offset 108
  ---- FatalRootError END
---- FileReadError END
%MSG
cannot access private key
When attempting to read a root or xroot input file spec in art or root, you see the error:
180518 15:04:27 27909 secgsi_InitProxy: cannot access private key file: /nashome/r/rlc/.globus/userkey.pem
%MSG-s ArtException: Early 18-May-2018 15:04:27 CDT JobSetup
cet::exception caught in art
---- FileOpenError BEGIN
  ---- FatalRootError BEGIN
    Fatal Root Error: @SUB=TNetXNGFile::Open
    [FATAL] Auth failed
  ---- FatalRootError END
The problem is that you need a voms proxy. See authentication. A kx509 cert or proxy identifies you, but does not identify that you are a member of mu2e. The voms proxy adds this information (the VO or Virtual Organization in the voms).
'present' is not actually present
Principal::getForOutput A product with a status of 'present' is not actually present. The branch name is mu2e::CaloShowerSteps_CaloShowerStepFromStepPt_calorimeterRO_photon. Contact a framework developer.
Needs a new version of art (2_11), but can be worked around, see this bug report
Unsuccessful attempt to convert FHiCL parameter
At the start of the job you see an art error message:
Unsuccessful attempt to convert FHiCL parameter 'inputFiles' to type 'std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >'.
You could expand your fcl file:
fhicl-dump myfile.fcl > temp.txt
and look for the fcl parameter, "inputFiles" in this message, which will probably be set to null. You can see what stanza this was part of and take it from there.
A common case is that the stopped muon input file was not defined:
physics.producers.generate.muonStops.inputFiles
needs to be set to a value, currently defined in EventGenerator/fcl/prolog.fcl
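Finding which stanza a parameter belongs to can be automated on the expanded file. The sketch below walks a fhicl-dump style text and reports the fully qualified path of every key with a given name. It assumes one key per line, with "name: {" opening a nested table and "}" closing it; real output may differ in detail. The function name is hypothetical, and the sample text, including the "@nil" spelling of a null value, is hand-written for illustration.

```python
import re

def find_parameter_paths(dump_text: str, name: str):
    """Return the fully qualified path of every key called `name`."""
    stack, found = [], []
    for raw in dump_text.splitlines():
        line = raw.strip()
        key = re.match(r"^([A-Za-z_]\w*)\s*:", line)
        if key and key.group(1) == name:
            found.append(".".join(stack + [name]))
        if key and line.endswith("{"):
            stack.append(key.group(1))  # entering a nested table
        elif line.startswith("}") and stack:
            stack.pop()                 # leaving a nested table
    return found

# Hand-written sample in the assumed layout.
SAMPLE_DUMP = """\
physics: {
   producers: {
      generate: {
         muonStops: {
            inputFiles: @nil
         }
      }
   }
}
"""

print(find_parameter_paths(SAMPLE_DUMP, "inputFiles"))
```

On the sample this reports the same path as the stopped muon case above, physics.producers.generate.muonStops.inputFiles.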
unable to find the service
At the beginning of an art executable run you see:
---- ServiceNotFound BEGIN
  art::ServicesManager unable to find the service of type 'art::TFileService'.
---- ServiceNotFound END
art has a service called TFileService which opens a root file to hold histograms and ntuples made by user modules. In some fcl files, the service is not configured with a default file name; if a module then asks the service for the root file, this error is generated. You can fix it by adding a line to your fcl:
services.TFileService.fileName: "myfile.root"
If you are modifying this .fcl for grid running you would write:
services.TFileService.fileName: "nts.owner.desc.version.sequencer.root"
where "desc" is the field you enter in generate_fcl.
If running interactively, you can add the file name to the command line:
mu2e -T myfile.root -c ...
There doesn't need to be any specific fcl text in the services stanza to create the service, art will create it automatically if the root file name is defined.
Found zero products matching all criteria
In MDC there were a number of changes made to data product names. A standard job to read MDC output is TrkDiag/fcl/TrkAnaDigis.fcl. Let's say you wanted to weight the events from an MDC output (say JobConfig/primary/flateminus.fcl) using the DecayInOrbitWeight module. A natural thing to do is change the trigger_path to:
physics.TrkAnaTriggerPath : [ @sequence::TrkAna.TrkCaloRecoSequence,DIOWeight ]
which weights the generated DIO electron, but this will fail with the "zero product error". You need to add:
physics.producers.DIOWeight.inputModule: "compressDigiMCs"
How would you know this? Run Print/fcl/print.fcl on the output of flateminus.fcl, perhaps dig.owner.flateminus.version.sequencer.art. Then you will see the data products listed and something like:
Friendly Class Name   Module Label      Instance Name   Process Name   Product ID
mu2e::GenParticles    compressDigiMCs                   flateminus     2502071937
error writing all requested bytes
Your jobs start failing at a few percent rate and you see this error in the log file.
%MSG-s ArtException: PostEndJob 22-Sep-2018 02:06:49 UTC ModuleEndJob
cet::exception caught in art
---- OtherArt BEGIN
  ---- FatalRootError BEGIN
    Fatal Root Error: @SUB=TFile::WriteBuffer
    error writing all requested bytes to file ./RootOutput-ed23-499f-b549-ae0f.root, wrote 374 of 1272
  ---- FatalRootError END
  ---- OtherArt BEGIN
    ---- FatalRootError BEGIN
      Fatal Root Error: @SUB=TTree::SetEntries
      Tree branches have different numbers of entries, eg EventAuxiliary has 918 entries while art::TriggerResults_TriggerResults__flateminus. has 919 entries.
    ---- FatalRootError END
  ---- OtherArt END
---- OtherArt END
%MSG
This seems to be new around 9/25/18. It is probably a random Docker-related error, just resubmit the job.
Proditions service not found
---- ServiceNotFound BEGIN
  Unable to create ServiceHandle.
  Perhaps the FHiCL configuration does not specify the necessary service?
  The class of the service is noted below...
  ---- ServiceNotFound BEGIN
    art::ServicesManager unable to find the service of type 'mu2e::ProditionsService'.
  ---- ServiceNotFound END
---- ServiceNotFound END
%MSG
This service provides access to the database for conditions data. It was introduced in Feb 2019. Adding this dependence means that when you run modules that need this service (such as straw reco), you need to configure the service. Either use the default services:
services : @local::Services.SimAndReco
or if you need to use your own explicit services fcl stanza, add these lines:
DbService : @local::DbEmpty
ProditionsService: @local::Proditions
DbHandle no TID
An exe gives this error (table name and module name may be different) on starting up
---- DBHANDLE_NO_TID BEGIN
  DbHandle could not get TID (Table ID) from DbEngine for TrkPreampStraw at first use
  The above exception was thrown while processing module StrawHitReco/makeSH run: 1000
---- DBHANDLE_NO_TID END
Every executable may or may not load a set of conditions read from the database, depending on its fcl configuration. This error says that a database table was requested by the code, but was not available in the conditions set. So either the conditions set was not loaded or what was loaded was the wrong conditions set for this code. Look for the following lines in the fcl:
services.DbService.purpose: NOMINAL
services.DbService.version: v1_0
These define which conditions set to load. See also general database info and calibration sets. Since the conditions are dynamic and hard to document fully, only an expert can fully verify the correct conditions set for any particular data and exe.
getByLabel: Found zero products matching all criteria
On first event, the error occurs
getByLabel: Found zero products matching all criteria
Looking for type: std::map<art::Ptr<mu2e::SimParticle>,double>
Looking for module label: compressDigiMCs
Looking for productInstanceName: cosmicTimeMap
cet::exception going through module ComboHitDiag/CHD run: 1002 subRun: 39 event: 10
The details of which product is missing from which module may be different.
The first possibility is that you are requesting a product that doesn't exist in the file. Read about products, or work with an expert on debugging the fcl or the path.
In Summer 2019, there was an additional, temporary cause of this error. It was caused by a change in art behavior in art 3.02 (v7_5_1 to v7_5_4), probably fixed in art 3.04. The change is triggered by products with similar names. See hypernews and Kyle's talk. Dave Brown has work-arounds which involve making the product references more explicit in fcl.
Assertion an overridden cannot be out-of-date
/scratch/workspace/canvas-products/vdevelop-/SLF7/e17-debug/build/root/v6_16_00/source/root-6.16.00/interpreter/llvm/src/tools/clang/include/clang/Serialization/Module.h:72: clang::serialization::InputFile::InputFile(const clang::FileEntry*, bool, bool): Assertion `!(isOverridden && isOutOfDate) && "an overridden cannot be out-of-date"' failed.
See this discussion
Product Dropped Unexpectedly by RootInput
When you read an art data file with RootInput, you can drop products from the file using the inputCommands parameter. The default behaviour is to drop all descendant products as well. This is reported with a warning message like the following, one for each descendant data product that is dropped:
%MSG-w RootInputFile: RootOutput:defaultOutput@Construction 06-Sep-2020 19:33:53 CDT ModuleConstruction
Branch 'mu2e::CaloShowerStepROs_compressDigiMCs__cosmics3.' is being dropped from the input of file 's8.art' because it is dependent on a branch that was explicitly dropped.
This behaviour is described in FclIOModules#Drop_on_Input; that page describes how to tell art to keep the descendant data products.
Missing rpms for art v3_06_xx
When you migrate from an Offline based on art v3_05_xx to one based on art v3_06_xx, you may need to install additional rpms:
pcre2 xxhash-libs libzstd libzstd-devel
ftgl libGLEW gl2ps root-graf-asimage
These must be installed on all machines on which you plan to build or run the code. The rpms on the first line were needed by machines configured as are the Mu2e interactive machines. Bertrand Echenard reports that he had to install some additional rpms on his machine; those are listed on the second line.
One of the symptoms of this is the following error message at build time:
ImportError: libxxhash.so.0: cannot open shared object file: No such file or directory:
If you build on one machine, and run on another machine, and if the run-time machine is not updated, you will get a different error:
---- Configuration BEGIN
  Unable to load requested library /storage/local/data1/condor/execute/dir_14461/no_xfer/Code/Offline/lib/libmu2e_GeometryService_GeometryService_service.so
  libzstd.so.1: cannot open shared object file: No such file or directory
---- Configuration END
If you get the above error in a grid job, you are most likely using an older version of mu2egrid. Be sure to use v6_09_00 or greater, which tells the grid jobs to run using the most recent available singularity container. That container has the above rpms installed. For bare jobsub, here is the incantation. 11/2021 - we are told the append line is no longer required.
--singularity-image '/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:latest'
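A quick way to see whether a machine is missing the shared libraries named in the two error messages above is to ask the system library search for them. The sketch below uses Python's ctypes.util.find_library; the library names "xxhash" and "zstd" are inferred from libxxhash.so.0 and libzstd.so.1 in those messages, and the function name is hypothetical. The result reflects the system library search on the machine where it runs, which may not be identical to what the art job sees.

```python
from ctypes.util import find_library

def missing_libraries(names):
    """Return the names that the system library search cannot locate."""
    return [n for n in names if find_library(n) is None]

# Names inferred from libxxhash.so.0 and libzstd.so.1 in the errors above.
print(missing_libraries(["xxhash", "zstd"]))
```

An empty list suggests both libraries are present; any name printed points to an rpm that still needs to be installed on that machine.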
<top_level>.fileNames
When running a fcl file from Production, such as Production/JobConfig/CeEndPoint.fcl, you quickly see the error
The supplied value of the parameter: <top_level>.fileNames does not represent a sequence
This error means that the fcl is requiring an input file, but you did not provide any. You will have to follow up with experts or documentation.
Grid Workflows
What to do if a grid segment is reported in a HELD state
murat@mu2egpvm06:/mu2e/app/users/murat/su2020_prof>jobsub_q --jobid 51739456.0@jobsub01.fnal.gov
JOBSUBJOBID                    OWNER   SUBMITTED     RUN_TIME    ST  PRI  SIZE  CMD
51739456.0@jobsub01.fnal.gov   murat   01/02 13:41   0+00:37:47  H   0    2.2   mu2eprodsys.sh_20220102_134146_431218_0_1_wrap.sh
1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
To figure out the reason, run jobsub_q --long:
murat@mu2egpvm06:/mu2e/app/users/murat/su2020_prof>jobsub_q --long --jobid 51739456.0@jobsub01.fnal.gov | grep -i hold
HoldReason = "Error from slot1_4@fnpc7672.fnal.gov: Docker job has gone over memory limit of 3000 Mb"
HoldReasonCode = 34
HoldReasonSubCode = 0
NumSystemHolds = 0
OnExitHold = false
PeriodicHold = false
Note that the memory appears to be booked per Docker container, not per user job: the user job is reported to use only 2.2 GBytes.
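When many jobs are held, it can help to pull the hold attributes out of the "jobsub_q --long" text with a script. The sketch below parses "Key = value" lines into a dictionary, assuming the output has been captured into a string in the form shown above; the function name is hypothetical.

```python
def parse_classad_lines(text: str):
    """Turn 'Key = value' lines into a dict, stripping surrounding quotes."""
    attrs = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:  # skip lines that are not of the form Key = value
            attrs[key.strip()] = value.strip().strip('"')
    return attrs

# Lines copied from the jobsub_q --long example above.
SAMPLE_HOLD = (
    'HoldReason = "Error from slot1_4@fnpc7672.fnal.gov: '
    'Docker job has gone over memory limit of 3000 Mb"\n'
    "HoldReasonCode = 34\n"
    "HoldReasonSubCode = 0\n"
)

attrs = parse_classad_lines(SAMPLE_HOLD)
print(attrs["HoldReasonCode"], attrs["HoldReasonSubCode"], attrs["HoldReason"])
```

The values are kept as strings; convert HoldReasonCode with int() before comparing it to the hold codes listed below.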
Hold codes
1 user put jobs in hold
6 sub 0,2 could not execute glidein, and/or docker did not run
9 not enough memory; remember that you need to include the size of the code tarball/release
12 sub 2 could not execute glidein
13
26 sub 1 memory limits
26 sub 2 exceeded disk limits
26 sub 4 SYSTEM_PERIODIC_HOLD Starts/limit 31/10 - too many restarts?
26 sub 8 wall time
28 sub -10000,512,768,256 sandbox
30 sub -10000,768 Job put on hold by remote host, job proxy is not valid
34 sub 0 - memory limits
35 sub 0 Error from slot - probably a condor (Docker?) failure on the worker node
See also condor docs
A subset of holds will be released automatically. If the job has started less than 10 times and was held for reasons 6 or 35 the job will restart automatically.
(NumJobStarts < 10) && (HoldReasonCode=?=6 || HoldReasonCode=?=35) && ((time()-EnteredCurrentStatus) > 1200)
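The release expression above reads as three conditions that must all hold: fewer than 10 starts, hold code 6 or 35, and more than 1200 seconds (20 minutes) in the held state. A minimal Python rendering of the same logic, with hypothetical function and argument names:

```python
def released_automatically(num_job_starts, hold_reason_code, seconds_held):
    """Mirror the automatic release expression quoted above."""
    return (num_job_starts < 10
            and hold_reason_code in (6, 35)
            and seconds_held > 1200)

# A job held with code 35 for 25 minutes after 3 starts qualifies.
print(released_automatically(3, 35, 1500))
# A memory hold (code 34) does not, however long it waits.
print(released_automatically(3, 34, 1500))
```

So a job held for memory limits, as in the example earlier in this section, stays held until you act on it.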
mu2eFileDeclare: Metadata is invalid
3610 OK: /pnfs/mu2e/scratchError: got server response 400 Bad Request. Metadata is invalid: Parent file cnf.rhbob.pbarTracksFromAscii.pbarTracksFromAscii.001000_00006374.fcl not found
This error can occur when you are trying to declare a file to SAM. The declaration includes submitting a metadata file to SAM. This file contains all the information you want to store in the SAM database about the file. One of the useful bits is the "parents" of the file. These are the fcl files and the input files used to create this file. If a parent file is "not found", that means that the parent file was not declared in the SAM database. In the normal workflow, you would have declared the fcl files and the input file before you even started the job which produced the file at hand. Here are two recovery procedures:
- go back to the area and the log file for when you declared the file in the error message. In this case it was the fcl file which drove the creation of the file being declared. It might be possible to see what went wrong there, and fix it by re-declaring the fcl file, for example.
- if you do not need every file, and you do not see this error too often, you can move or delete the directory containing the result from this job, and restart mu2eFileDeclare.
SSL negotiation
Error creating dataset definition for ... 500 SSL negotiation failed: .
Your certificate is not of the right form
dCache hangs
A simple access to dCache (accessing filespecs like /pnfs/mu2e) can sometimes hang for a long time. This is difficult to deal with because there are legitimate reasons dCache could respond slowly. First, please read the dCache page for background information.
dCache could be operating normally yet respond slowly because
- your request was excessive, such as running find or ls -l on a large number (more than a few hundred) of files. If thousands of files are queried, this could take minutes, and much longer for larger numbers of files. Use file tools and plain ls where possible.
- you, or other users, or even other experiments could be overloading dCache. This is difficult to determine; see the operations page for some monitors. dCache has several choke points and not all are easily monitored.
- the files you are accessing are on tape and you have to wait for them to come off tape. The solution is to prestage files.
It is difficult to tell if dCache is overloaded, but if it is not, your problem could be caused by any of several failure modes inside dCache, and these failures are relatively common. Here are some guidelines for when to put in a ticket
- if a simple ls on a directory which does not contain many files or subdirectories hangs for more than 2 min.
- if file access in your MC workflow seems normal and then suddenly hangs for more than 1 h.
- if prestaging does not progress. See Prestage.
Sometimes a hang is occurring only on one node, due to a problem with its nfs server. In this case, you can check several nodes, put the bad node in a ticket, and work on another node.
Generally, dCache has a lot of moving parts and is fragile in some ways. There is no real cost to putting in a ticket and the dCache maintainers are responsive, so when in doubt, put in a ticket. You will always learn something about dCache.
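To apply the 2 minute guideline without watching a terminal, a command can be run with a timeout. The sketch below is a generic helper, with a hypothetical name, that reports whether a command is still running after a given number of seconds. For dCache one would pass something like ["ls", directory] and 120; the self-check here uses a harmless Python command so that it runs anywhere.

```python
import subprocess
import sys

def command_hangs(cmd, timeout_s):
    """Return True if `cmd` has not finished after `timeout_s` seconds."""
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    try:
        proc.wait(timeout=timeout_s)
        return False
    except subprocess.TimeoutExpired:
        # Best effort: a process blocked on a stuck NFS mount may not exit promptly.
        proc.kill()
        return True

# Self-check with a command that returns immediately.
print(command_hangs([sys.executable, "-c", "pass"], timeout_s=30))
```

Running the same check on several nodes helps separate a dCache problem from a problem with one node's NFS mount.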
dCache Input/Output error
When attempting to access a file in tape-backed dCache, you see an error like
cp: error reading /pnfs/mu2e/tape/phy-etc/bck/mu2e/beams1phase1/g4-10-4/tbz/66/08/bck.mu2e.beams1phase1.g4-10-4.002701_00000004.tbz: Input/output error
This error can have two causes. The most likely is that the file is simply not prestaged to disk. This behavior of reporting an infrastructure error when the file is not available is not user-friendly and it may be updated in the future. Note that attempting to read a non-prestaged file in tape-backed dCache via the NFS directory (/pnfs) will not trigger the file to be prestaged.
The second possibility is that there is an actual problem with the /pnfs mount. This usually requires a ticket to ask for the NFS client to be restarted, or perhaps the node to be rebooted.
dCache timeout errors
Multiple grid jobs have the following message in log files:
[ERROR] Server responded with an error: [3012] Internal timeout
This can be due to the requested file not being prestaged, so that is the first thing to check. If you can establish that this isn't the problem, put in a ticket. As of 4/2023, the timeout is 1.5 hours for FTP and 30 minutes for webdav and xroot.
mkdir failed mu2eFileUpload
The first time you invoke mu2eFileUpload for a new dataset, you get a permission denied error in mkdir
mkdir /pnfs/mu2e/tape/usr-sim/sim/rlc: Permission denied at /cvmfs/mu2e.opensciencegrid.org/artexternals/mu2efiletools/v3_7/bin/mu2eFileUpload line 71
This is because you are writing to the dCache tape or persistent areas. By default no one can write to those areas and as needed, permission is given to write to specific directories. Please write to mu2eDataAdmin.
jobsub_submit cigetcert try
During a jobsub_submit command, you see
cigetcert try 1 of 3 failed
Command '/cvmfs/fermilab.opensciencegrid.org/products/common/db/../prd/cigetcert/v1_16_1/Linux64bit-2-6-2-12/bin/cigetcert -s fifebatch.fnal.gov -n -o /tmp/x509up_u1311' returned non-zero exit status 1: cigetcert: Kerberos initialization failed: GSSError: (('Unspecified GSS failure. Minor code may provide more information', 851968), ('Ticket expired', -1765328352))
jobsub could not create a voms proxy from your kerberos ticket. Probably your ticket has just expired. Please kinit and try again. See the authentication reference.
ifdh authentication error
This error seems to be saying that ifdh could not finish the authentication to write a file. In this case it turned out that the user was trying to use ifdh to write to the /mu2e/data disk from a grid job. This is not allowed, apparently by blocking authentication. If the output disk is not the problem, put in a ticket.
error: globus_ftp_control: gss_init_sec_context failed
GSS Major Status: Unexpected Gatekeeper or Service Name
globus_gsi_gssapi: Authorization denied: The name of the remote entity (/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=bestgftp1.fnal.gov), and the expected name for the remote entity (/CN=fg-bestman1.fnal.gov) do not match
program: globus-url-copy -rst-retries 1 -gridftp2 -nodcau -restart -stall-timeout 14400 gsiftp://fg-bestman1.fnal.gov:2811/mu2e/data/users/dlin/mu2eSimu/caloSimu18/fcl/job/000/cnf.dlin.caloSimu.v02.000002_00000199.fcl file:////storage/local/data1/condor/execute/dir_30333/no_xfer/./cnf.dlin.caloSimu.v02.000002_00000199.fcl 2>/storage/local/data1/condor/execute/dir_30333/no_xfer/ifdh_53271_1/errtxt
exited status 1
delaying 12 ...
retrying...
Command terminated by signal 9
Running on the grid, the log file shows the exe started but ended with
----->Command terminated by signal 9<-----
A signal 9 is an intentional kill command, so something in the grid system stopped your job on purpose. If the grid system kills your job for going over limits, it should put your job on hold. In that case you won't see the log file, but occasionally we see a log file with a kill command. We don't know why this gets through. It might be fine if you just resubmit. One case we have seen is a memory request much smaller than needed (which should have gone to hold).
Strange root I/O errors while using xrootd
While using xrootd to read input files, a small fraction fail with strange root I/O errors:
[ERROR] Invalid session
or
[FATAL] Socket timeout
or
Fatal Root Error: @SUB=TUnixSystem::GetHostByName getaddrinfo failed for 'fndca1.fnal.gov': Temporary failure in name resolution
or
Fatal Root Error: @SUB=TNetXNGFile::Open [FATAL] Invalid address
or
Fatal Root Error: @SUB= ! (prop&kIsClass) && "Impossible code path" violated at line 445 of `/scratch/workspace/canvas-products/vdevelop/e17/SLF6/prof/build/root/v6_12_06a/source/root-6.12.06/io/io/src/TGenCollectionProxy.cxx'
The only solution is to resubmit. There seems to be something flaky deep in the art/FermiGrid/xrootd/dCache path that has bugs or load issues.
Here is another failure message:
Fatal Root Error: TNetXNGFile::ReadBuffer [ERROR] Server responded with an error: [3010] org.dcache.uuid is no longer valid. ROOT severity: 3000
3/2021: dCache experts suggested that setting this might help
export XRD_STREAMTIMEOUT=300
auth and handshake errors while using xrootd
Many jobs end with errors like:
Fatal Root Error: TNetXNGFile::Open [FATAL] Hand shake failed ROOT severity: 3000
or:
---- FatalRootError BEGIN
  Fatal Root Error: TNetXNGFile::Open
  [FATAL] Auth failed: No protocols left to try
  ROOT severity: 3000
---- FatalRootError END
Unable to open specified secondary event stream file root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/mu2e/tape/phy-
The current thinking is that this is a problem with authentication on the grid node. So far it has not been persistent, but if it is, put in a ticket.
samweb SSL error: [SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED]
A samweb command gives the ssl error:
> samweb list-definition-files dig.mu2e.NoPrimary-mix-det.MDC2018e.art
SSL error: [SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:618)
This might be caused by an expired certificate:
> voms-proxy-info
...
timeleft : 0:00:00
samweb can handle the case of no cert, but it doesn't seem to handle the case of an expired cert. The fix is to delete the cert:
voms-proxy-destroy
or get a new one:
kx509
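The expiry check can be scripted by reading the timeleft line from the voms-proxy-info text. The sketch below assumes the "timeleft : H:MM:SS" form shown above and that the output has already been captured into a string; the function name is hypothetical.

```python
import re

def proxy_seconds_left(info_text: str):
    """Return the remaining lifetime in seconds, or None if no timeleft line is found."""
    m = re.search(r"^timeleft\s*:\s*(\d+):(\d+):(\d+)\s*$", info_text, re.MULTILINE)
    if m is None:
        return None
    hours, minutes, seconds = (int(g) for g in m.groups())
    return hours * 3600 + minutes * 60 + seconds

# The expired case from the example above.
left = proxy_seconds_left("timeleft  : 0:00:00\n")
print("expired" if left == 0 else "seconds left: %s" % left)
```

A result of 0 corresponds to the expired case that samweb does not handle; None means no timeleft line was found in the text.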
mu2eprodsys Error - python future
If you setup both Offline and mu2egrid in the same window we recommend that you setup Offline first and mu2egrid second:
mu2einit
source <path to an Offline installation>/setup.sh
setup mu2egrid
mu2eprodsys <other arguments>
If you swap the order of lines 2 and 3, then you will get the following error message from mu2eprodsys:
Traceback (most recent call last):
  File "/cvmfs/fermilab.opensciencegrid.org/products/common/db/../prd/jobsub_client/v1_3_2_1/NULL/jobsub_submit", line 20, in <module>
    from future import standard_library
  File "/cvmfs/fermilab.opensciencegrid.org/products/common/prd/python_future_six_request/v1_3/Linux64bit-3-10-2-17-python2-7-ucs4/future/standard_library/__init__.py", line 64, in <module>
    import logging
  File "/cvmfs/mu2e.opensciencegrid.org/artexternals/python/v3_8_3b/Linux64bit+3.10-2.17/lib/python3.8/logging/__init__.py", line 26, in <module>
    import sys, os, time, io, re, traceback, warnings, weakref, collections.abc
  File "/cvmfs/mu2e.opensciencegrid.org/artexternals/python/v3_8_3b/Linux64bit+3.10-2.17/lib/python3.8/re.py", line 127, in <module>
    import functools
  File "/cvmfs/mu2e.opensciencegrid.org/artexternals/python/v3_8_3b/Linux64bit+3.10-2.17/lib/python3.8/functools.py", line 18, in <module>
    from collections import namedtuple
  File "/cvmfs/mu2e.opensciencegrid.org/artexternals/python/v3_8_3b/Linux64bit+3.10-2.17/lib/python3.8/collections/__init__.py", line 27, in <module>
    from reprlib import recursive_repr as _recursive_repr
  File "/cvmfs/fermilab.opensciencegrid.org/products/common/prd/python_future_six_request/v1_3/Linux64bit-3-10-2-17-python2-7-ucs4/reprlib/__init__.py", line 7, in <module>
    raise ImportError('This package should not be accessible on Python 3. '
ImportError: This package should not be accessible on Python 3. Either you are trying to run from the python-future src folder or your installation of python-future is corrupted.
To fix this error continue with the following commands:
setup jobsub_client
mu2eprodsys <other arguments>
and mu2eprodsys will work correctly.
Or you can always setup mu2egrid in a window in which Offline has not been, and will not be, setup:
mu2einit
setup mu2egrid
mu2eprodsys <other arguments>
For those who are interested, the explanation is below.
The UPS package mu2egrid implicitly sets up the UPS-current version of jobsub_client. Both Offline (via root) and jobsub_client require that python be in the environment. jobsub_client is designed to work with many different versions of python and will use whatever python it finds in the environment. Starting with a recent version of jobsub_client, it sets up a helper UPS product named python_future_six_request and, at setup-time, chooses different UPS qualifiers for python_future_six_request depending on which python it finds in its environment.
When you start a new session, there is no UPS-based python in your environment so the command "python" finds the system python, which, as of Sept 4, 2020, is 2.7.5. When you setup Offline, it will setup a specific version of root and that version of root will setup a specific version of python, which, as of Sept 4, 2020, is 3.8.3.
If you setup the products in the recommended order then setting up Offline will put python 3.8.3 in the environment; when you setup mu2egrid, it will discover that version of python and will setup python_future_six_request -qpython3.8 . At this point your environment is self-consistent and mu2eprodsys will work.
If you setup mu2egrid first, it will discover python 2.7.5 in the environment and will setup python_future_six_request -qpython2.7-ucs4 . When you setup Offline it will change python to 3.8.3 but it will not touch python_future_six_request because it does not know about that product. The result is that the environment is not self-consistent and mu2eprodsys will fail with the error above. If you subsequently "setup jobsub_client" it will detect the new python, remove the old python_future_six_request from the environment and setup the new one. At this point the environment is again self-consistent and mu2eprodsys will work. Because Offline does not use python_future_six_request, Offline will continue to work correctly.
The core problem is that jobsub_client is attempting to be smart enough to "just work" with both the system python and with any python from UPS. UPS does not have the features needed to manage all of the corner cases of this use pattern. I have asked the developers of jobsub_client to check, at run-time, for a self-consistent environment and to issue a user-friendly error message if the environment is not self-consistent.
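A consistency check of the kind described above could look roughly like the following sketch. It is not part of jobsub_client. The function name is hypothetical, and it assumes that UPS records the active instance of the helper product in an environment variable named SETUP_PYTHON_FUTURE_SIX_REQUEST whose qualifier contains "python2" or "python3", as in the qualifiers quoted above.

```shell
# Hypothetical helper, not part of jobsub_client or UPS.
# $1 = major version of the python found on PATH (2 or 3)
# $2 = UPS setup string of python_future_six_request
#      (ASSUMED to contain a qualifier such as python3.8 or python2.7-ucs4)
check_python_future_consistency() {
    py_major="$1"
    setup_string="$2"
    if [ -z "$setup_string" ]; then
        echo "not set up"
        return 2
    fi
    case "$setup_string" in
        *python"$py_major"*) echo "consistent"; return 0 ;;
        *) echo "inconsistent"; return 1 ;;
    esac
}

# Example call (the variable name is an assumption, see above):
#   check_python_future_consistency \
#       "$(python -c 'import sys; print(sys.version_info[0])')" \
#       "$SETUP_PYTHON_FUTURE_SIX_REQUEST"
```

If the function reports "inconsistent", running "setup jobsub_client" again, as described above, should bring the environment back to a self-consistent state.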
Code NOT_FOUND in job output
The symptom of this problem is that you get the following error message in log files in your outstage directory:
<pre>
mu2eprodsys Thu Oct 1 15:17:56 UTC 2020 -- 1601565476 pointing MU2EGRID_USERSETUP to the tarball extract: NOT_FOUND/Code/setup.sh
./mu2eprodsys.sh: line 281: NOT_FOUND/Code/setup.sh: No such file or directory
Error sourcing setup script NOT_FOUND/Code/setup.sh: status code 1
</pre>
This means that you submitted a grid job with the tarball option but the tarball could not be unpacked by the Rapid Code Distribution Service. For reasons that I don't understand, the grid system might not detect this error and your grid jobs will start to run. When they try to access the code distributed by the tarball, they will fail with the message above.
In almost all cases, this error is caused by one of two things:
- The tar file is too big. In the one example to date, the tarball was 5 GB instead of the typical ~800 MB. We don't know where the threshold is.
- The tar file contains one or more files larger than 1 GB
In all cases to date this was caused by the user putting inappropriate files into the tarball: art event-data files, root files, graphics files or log files. These files should be on /mu2e/data, not on /mu2e/app. When you are looking for large files, remember to check for hidden files, files whose name begins with a dot; some tools make temporary files named .nfsxxx where xxx is a long hex number. You can see hidden files with "ls -a".
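One way to look for oversized files before making the tarball is sketched below. The function name is hypothetical. It relies on find, which descends into hidden files and directories by default; the "G" and "M" size suffixes are an extension found in GNU and BSD find, so this may need adjusting elsewhere.

```shell
# Hypothetical helper: list regular files (hidden ones included) under a
# directory that exceed a size threshold.
# $1 = directory to scan
# $2 = size test in find(1) syntax; defaults to +1G (larger than 1 GiB)
list_large_files() {
    find "$1" -type f -size "${2:-+1G}" -print
}

# Example: list everything over 1 GiB in the area to be tarred
#   list_large_files <path to your working area>
```

Running "du -sh" on the same directory gives the total size, which can be compared against the typical ~800 MB mentioned above.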
If you are within the guidelines above and the problem still occurs, please report the error using a service desk ticket.
There is additional information in an old ticket: INC000001105610.
Can not get BFID
While running mu2eDatasetLocation to add tape locations to the SAM record of files which you uploaded, you see this error:
Can not get BFID from pnfs layer 1 /pnfs/mu2e/tape/usr-sim/dig/oksuzian/CRY-cosmic-general/cry3-digi-hi/art/71/45/.(use)(1)(dig.oksuzian.CRY-cosmic-general.cry3-digi-hi.001002_00009067.art): on Mon Mar 22 12:37:30 2021
We believe this occurs when the script is run before some items in the dCache database have been inserted. It usually runs correctly after waiting an hour or two.
jobsub dropbox_upload failed
On jobsub job submission, you see:
Submitting.... dropbox_upload failed
This error occurs because the underlying mechanism for transferring your local files to the grid is broken. It occurs only in jobsub_client version v1_3_3. A solution is to setup the previous version:
setup jobsub_client v1_3_2_1
Do this after running "setup mu2egrid", because mu2egrid implicitly sets up the UPS-current version of jobsub_client.
Error: OperationalError: no such table: keycache error
This error happens when you submit a grid job (typically using mu2eprodsys or jobsub_submit). It started to occur in mid-November 2023. The underlying issue is a problem with the tokens system used for grid authentication. The following is the suggested workaround:
ls -l ${XDG_CACHE_HOME:-$HOME/.cache}/scitokens/scitokens_keycache.sqllite
If this file exists and is zero length, then remove it and try to submit again.
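The check and removal can be combined in one step, as in the sketch below. The function name is hypothetical; the default path is the one quoted above. The file is deleted only when it exists and has zero length, so a healthy cache is left untouched.

```shell
# Hypothetical helper: remove the scitokens key cache only if it is empty.
# $1 = path of the cache file (optional; defaults to the path quoted above)
remove_empty_keycache() {
    keycache="${1:-${XDG_CACHE_HOME:-$HOME/.cache}/scitokens/scitokens_keycache.sqllite}"
    # -f: exists and is a regular file; ! -s: has zero length
    if [ -f "$keycache" ] && [ ! -s "$keycache" ]; then
        rm -f -- "$keycache"
        echo "removed"
    else
        echo "left alone"
    fi
}
```

Calling remove_empty_keycache with no argument acts on the default location; after it prints "removed", try the submission again.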
Disk quota exceeded during jobsub_submit or mu2eprodsys
If you get the following error during the execution of a jobsub_submit command
<pre>
Unable to get key triggered by next update: disk I/O error
Error: OSError: [Errno 122] Disk quota exceeded
</pre>
the issue is that your home directory has no quota remaining. mu2eprodsys is just a wrapper around jobsub_submit so it will issue the same error in the same circumstances.
The underlying cause is that jobsub_submit writes small files in your home directory under $HOME/.cache and $HOME/.config. If there is no available disk quota, these writes fail and so does jobsub_submit.
The solution is to remove some files from your home directory; either delete them or move them to a more appropriate location.
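To decide what to remove, it helps to see which entries in the home directory take the most space. The sketch below is one way to do that with standard tools (find, du, sort, head). The function name is hypothetical, and the -mindepth and -maxdepth options are assumed to be available, as they are in GNU and BSD find.

```shell
# Hypothetical helper: print the N largest top-level entries of a directory,
# hidden ones included, with sizes in kB (largest first).
# $1 = directory (defaults to $HOME), $2 = number of entries (defaults to 10)
largest_entries() {
    dir="${1:-$HOME}"
    count="${2:-10}"
    find "$dir" -mindepth 1 -maxdepth 1 -exec du -sk {} + \
        | sort -rn \
        | head -n "$count"
}

# Example: the ten largest files and directories directly under $HOME
#   largest_entries
```

Hidden directories are included, so a large $HOME/.cache shows up in the list as well.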
'Available' is not on the list
(February 2024) During jobsub submission you see:
Submitting.... Error: ValueError: 'Available' is not in list
This is caused by a bug in a disk space check. The temporary solution is to skip the check with
<pre>
--skip-check=disk_space
</pre>
or, for mu2eprodsys,
<pre>
--jobsub-arg='--skip-check=disk_space'
</pre>
Hold test evaluated true (hold code 26/0)
Your jobs go to hold, but instead of telling you why (memory, time) it simply says that this whole long expression evaluated true. If you go to the "why are my jobs held" monitoring page, it will say the cause is "code 26 subcode 0" which is "other". The experts say this is usually a hold due to running over lifetime, but here is a command which might be able to pinpoint which part of the test is failing:
<pre>
> condor_config_val -schedd -name jobsub05.fnal.gov SYSTEM_PERIODIC_HOLD | awk '{gsub("JobStatus","LastJobStatus"); gsub("time\\(\\)","EnteredCurrentStatus"); split($0,s," *\\|\\| *"); for(x in s) printf("%s\0",s[x])}' | xargs -0 condor_q -name jobsub05.fnal.gov 11758936.0 -af:ln
( LastJobStatus == 2 && JobUniverse == 5 && MemoryUsage > 1.0 * RequestMemory ) = undefined
( LastJobStatus == 2 && JobUniverse == 5 && DiskUsage > 1.0 * RequestDisk ) = false
( LastJobStatus == 2 && JobUniverse == 5 && NumJobStarts > 10 ) = false
( JobUniverse == 5 && NumShadowStarts > 10 ) = false
( LastJobStatus == 2 && JobUniverse == 5 && EnteredCurrentStatus - JobCurrentStartDate > JOB_EXPECTED_MAX_LIFETIME && JOB_EXPECTED_MAX_LIFETIME > 0 ) = true
( JobUniverse == 5 && JOB_EXPECTED_MAX_LIFETIME > 345600 ) = false
( LastJobStatus == 2 && JobUniverse == 5 && ((EnteredCurrentStatus - JobCurrentStartDate) > 345600 )) = false
( LastJobStatus == 2 && JobUniverse == 7 && ((EnteredCurrentStatus - EnteredCurrentStatus) > 2592000 )) = false
</pre>
<pre>
> /usr/bin/condor_q -name jobsub05.fnal.gov 11758936.0 -af:ln MemoryUsage RequestMemory
MemoryUsage = undefined
RequestMemory = 2499
</pre>
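The awk stage of the first command can be tried on its own, without access to a schedd. The sketch below applies the same two substitutions and the same split on "||" to a made-up expression; the function name and the sample input are hypothetical. The substitutions appear to be there so that each clause is evaluated as of the moment the job was held (its previous status, and the time it entered the held state) rather than as of now. Unlike the original one-liner, it prints one clause per line and walks the array by index so the output order is predictable.

```shell
# Hypothetical helper: split a hold expression into its "||" clauses,
# rewriting JobStatus and time() as the one-liner above does.
split_hold_clauses() {
    awk '{
        gsub("JobStatus", "LastJobStatus")
        gsub("time\\(\\)", "EnteredCurrentStatus")
        n = split($0, s, " *\\|\\| *")
        for (i = 1; i <= n; i++) print s[i]
    }'
}

# Made-up sample input, not the real SYSTEM_PERIODIC_HOLD:
#   echo '( JobStatus == 2 && NumJobStarts > 10 ) || ( JobUniverse == 5 )' \
#       | split_hold_clauses
```

In the output shown earlier, the clause that evaluates to true is the one comparing the run time to JOB_EXPECTED_MAX_LIFETIME, which matches the experts' statement that this hold is usually caused by running over the requested lifetime.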
SAM SSL certificate error
When running SAM commands, you see the following:
<pre>
400 The SSL certificate error
400 Bad Request
The SSL certificate error
</pre>
This happens because you have a VOMS (kx509) proxy file, but the proxy has expired. Run vomsCert to fix it.
Mac
When attempting to do graphics (often geometry browsing in root), you see "libGL error: failed to load driver: swrast". The solution is discussed here