ErrorRecovery: Difference between revisions

Revision as of 19:22, 20 March 2018

Introduction

Some errors occur regularly, such as when authorization expires, dCache is struggling, or a procedure is rerun when it cannot be repeated. Some common situations are recorded here with advice on how to handle them. Errors that are easy to look up online, such as syntax errors, are not covered here.

scons

art and fcl

art exit codes

Grid Workflows

SSL negotiation

Error creating dataset definition for ...
500 SSL negotiation failed: .

Your certificate is not of the right form

dCache hangs

A simple access to dCache (accessing filespecs like /pnfs/mu2e) can sometimes hang for a long time. This is difficult to deal with because there are legitimate reasons dCache could respond slowly. First, please read the dCache page for background information.

dCache could be operating normally yet respond slowly because:

your request was excessive, such as running find or ls -l on a large number of files (more than a few hundred). If thousands of files are queried, this could take minutes, and much longer for larger numbers of files. Use file tools and plain ls where possible.
you, other users, or even other experiments could be overloading dCache. This is difficult to determine; see the operations page for some monitors. dCache has several choke points and not all are easily monitored.
the files you are accessing are on tape and you have to wait for them to come off tape. The solution is to prestage files.
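As an illustration of the "plain ls" advice above, here is a minimal Python sketch that lists a directory by name only. It relies on the fact that os.listdir reads the directory itself and does not request per-file metadata, which is the expensive part of ls -l. The function name list_names is hypothetical and is not part of any Mu2e or dCache tool.

```python
import os


def list_names(directory):
    """Return the sorted entry names in `directory`.

    os.listdir only reads the directory listing; it does not stat each
    entry, so it avoids the per-file metadata lookups that `ls -l` makes.
    (list_names is a hypothetical helper used for illustration.)
    """
    return sorted(os.listdir(directory))


def count_entries(directory):
    """Count entries without collecting any per-file metadata."""
    return len(os.listdir(directory))
```

For example, list_names("/pnfs/mu2e") would return only the names in that directory. Sizes, ownership, and timestamps are not fetched, so those should be requested only for the few files where they are actually needed.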

It is difficult to tell if dCache is overloaded, but if it is not, your problem could be caused by any of several failure modes inside dCache, and these failures are relatively common. Here are some guidelines for when to put in a ticket:

if a simple ls on a directory which does not contain many files or subdirectories hangs for more than 2 min.
if file access in your MC workflow seems normal, then suddenly hangs for more than 1 h.
if access to random files that were not recently accessed, but are known to be prestaged, hangs for more than 8 h.
if prestaging does not progress after 8 h.
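The first guideline can be checked with a timed listing. The sketch below runs plain ls under a timeout and reports whether it finished. The function name ls_responds and the constant LS_HANG_LIMIT_S are hypothetical; only the 2-minute threshold comes from the guidelines above. Treat it as a rough check: a process blocked inside the kernel on a hung mount may not exit promptly when killed, so the call itself can take longer than the timeout to return.

```python
import subprocess

# 2 minutes, the threshold from the first guideline above (in seconds).
LS_HANG_LIMIT_S = 2 * 60


def ls_responds(path, timeout_s=LS_HANG_LIMIT_S, command=("ls",)):
    """Return True if listing `path` finishes within `timeout_s` seconds.

    Only the elapsed time is checked; a non-zero exit status from the
    command still counts as a response. `command` is overridable mainly
    so the function can be exercised without a real dCache mount.
    """
    try:
        subprocess.run(
            [*command, path],
            stdout=subprocess.DEVNULL,  # the listing itself is not needed
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
    except subprocess.TimeoutExpired:
        return False
    return True
```

A call such as ls_responds("/pnfs/mu2e") returning False on a small directory would match the first condition for putting in a ticket.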

Sometimes a hang occurs on only one node, due to a problem with its NFS server. In this case, you can put in a ticket and then work on another node.

Generally, dCache has a lot of moving parts and is fragile in some ways. There is no real cost to putting in a ticket and the dCache maintainers are responsive, so when in doubt, put in a ticket. You will always learn something about dCache.

