==Introduction==
A discussion of production logic patterns, with the goal of running patterns that generate complete and correct output.


==Job model==

We assume that the jobs are run by POMS and data_dispatcher (ddisp) and that output is recorded in metacat and Rucio.  We assume that all stages are driven by ddisp projects and an input dataset.  The model for discussion is that stage0 takes dataset A, with files A0, A1, ..., consumed on input, and produces files for two datasets, B and C.  A job in stage0 receives a file A0 and produces B0 and C0.  Stage1 consumes dataset B via another ddisp project.  If a job produces files with a version tag in the name (unique to a particular recovery iteration), they are labeled B0-T0, B0-T1, etc.


We assume that POMS will not allow a file in B to be passed to the stage1 ddisp project until it is a metacat child of a file from A which was consumed by a successful stage0 project worker. <font color="red">needs to be confirmed or developed</font>
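A minimal sketch of how such a gate could be checked with the metacat Python client follows; the server URL is a placeholder, the <code>with_provenance</code> option and the shape of the parent records are assumptions, and the way the set of successfully consumed A files is obtained from ddisp is left abstract. This is not a confirmed POMS mechanism.
<syntaxhighlight lang="python">
# Sketch only: the URL is a placeholder and the client calls are assumptions.
from metacat.webapi import MetaCatClient

mc = MetaCatClient("https://metacat.example.gov:9443/app")   # placeholder URL

def ready_for_stage1(b_did, consumed_ok):
    # consumed_ok: set of A dids already verified (via ddisp, not shown)
    # to have been consumed by a successful stage0 project worker.
    finfo = mc.get_file(did=b_did, with_provenance=True)     # option assumed
    parents = {p["namespace"] + ":" + p["name"]              # record shape assumed
               for p in finfo.get("parents", [])}
    return bool(parents & consumed_ok)   # B0 has a successfully consumed A parent
</syntaxhighlight>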
 
By default, when POMS starts a recovery job for a ddisp project which is done but in which not all files were consumed successfully, it creates a new ddisp project with the smaller set of files which still need to be run.  A simplification would be to submit recovery jobs against the original project, not a new one.  This allows more flexibility in when and how the recovery jobs are submitted, since POMS does not have to wait for the original project to be fully resolved before starting recoveries.  It requires all failures to be sorted into retries by ddisp.  This idea is called "OneP" and <font color="red">has not been implemented yet</font>.


==Patterns==

These are two "sub-patterns" which will be used below.


In the "default" jobs submisison pattern
In the "default" jobs submissions pattern
* a job contacts the ddisp project to get a file from dataset A
* a job contacts the ddisp project to get a file from dataset A
* ddisp will not allow any other jobs to operate on A0 until this job has completed (failed or succeeded) or timed-out.
* ddisp will not allow any other jobs to operate on A0 until this job has completed (failed or succeeded) or timed-out.
Line 22: Line 24:
* a file from job which is reports a failure puts the file in the "retry" category (retry can be set by the job)
* a file from job which is reports a failure puts the file in the "retry" category (retry can be set by the job)
* POMS runs recovery jobs
* POMS runs recovery jobs
* it is possible for two copies of one job to be active
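A sketch of a worker in this pattern, using the data_dispatcher Python API, follows; the <code>DataDispatcherClient</code> calls reflect that API as understood here, the project id is illustrative, the exact return convention of <code>next_file</code> is an assumption, and <code>process_file()</code> is a hypothetical payload.
<syntaxhighlight lang="python">
# Sketch of a default-pattern worker; process_file() is hypothetical.
from data_dispatcher.api import DataDispatcherClient

dd = DataDispatcherClient()        # server and auth configuration omitted
project_id = 1234                  # illustrative ddisp project id

reply = dd.next_file(project_id)   # reserve one file, e.g. A0, for this job
if isinstance(reply, dict):        # a dict is a file handle; else done/timeout
    did = reply["namespace"] + ":" + reply["name"]
    try:
        process_file(reply)        # hypothetical payload producing B0 and C0
        dd.file_done(project_id, did)                 # A0 marked done
    except Exception:
        dd.file_failed(project_id, did, retry=True)   # A0 back to "retry"
</syntaxhighlight>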


In the "strict" sub-pattern we add the following requirement,  
In the "strict" sub-pattern we add the following requirement,  
* the job must not run past the ddisp timeout. This can be achieved by jobsub expected-lifetime, jobsub script timeout switch, and timeout internal to the job.
* the job must not run past the ddisp timeout. This can be achieved by jobsub expected-lifetime, jobsub script timeout switch, and timeout internal to the job. With this step, then any job, under almost any conceivable conditions, is only active when the ddisp file reservation is active.
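For the job-internal part, a minimal sketch using a SIGALRM alarm follows; the timeout values are illustrative, and <code>run_payload()</code> and <code>fail_and_exit()</code> are hypothetical stand-ins.
<syntaxhighlight lang="python">
# Sketch of a job-internal timeout; values and payload names are illustrative.
import signal

DDISP_TIMEOUT = 8 * 3600        # assumed ddisp reservation timeout, seconds
SAFETY_MARGIN = 15 * 60         # stop well before the reservation lapses

def on_alarm(signum, frame):
    raise TimeoutError("payload exceeded its share of the ddisp reservation")

signal.signal(signal.SIGALRM, on_alarm)
signal.alarm(DDISP_TIMEOUT - SAFETY_MARGIN)
try:
    run_payload()               # hypothetical payload
    signal.alarm(0)             # cancel the alarm on success
except TimeoutError:
    fail_and_exit()             # hypothetical: report failure; file goes to retry
</syntaxhighlight>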




===Proposed/OneP===


* the strict sub-pattern is run
* ddisp runs a "virtual project", for which it does not contact Rucio for locations; the job script knows, using conventions, how to find a file based on the file name and POMS settings
* the job writes output to the B0 and C0 final locations, overwriting previous output, if any (see the sketch after this list)
* the job creates a metacat record, overwriting the previous record, if it exists
* the job does not write Rucio records
* jobs are recovered until the ddisp project sees success for all files
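A sketch of the tail of such a job follows; the path convention, the <code>declare_file</code> signature, and whether metacat allows re-declaring an existing record are assumptions.
<syntaxhighlight lang="python">
# Sketch of the OneP job tail: fixed names, metacat overwrite, no Rucio.
import shutil
from metacat.webapi import MetaCatClient

mc = MetaCatClient("https://metacat.example.gov:9443/app")  # placeholder URL

def finish_onep_job(local_out, final_path, did, dataset, metadata):
    # The final location is derived from the file name by convention;
    # overwrite any output left behind by an earlier attempt.
    shutil.copy(local_out, final_path)
    # Re-declare the metacat record; the signature and overwrite behavior
    # are assumptions about the metacat client.
    mc.declare_file(did=did, dataset_did=dataset, metadata=metadata)
    # Deliberately no Rucio calls: locations are declared in post-processing.
</syntaxhighlight>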


post-processing:
* a cron job searches recent ddisp projects, determines the files A0 which were run by successful ddisp workers, and declares the child files B0 and C0 to Rucio (sketched below).
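A sketch of that cron pass follows; <code>add_replicas</code> and <code>attach_dids</code> are Rucio client calls, while the RSE name, the dataset, and the ddisp/metacat query that produces the child-file list are assumptions.
<syntaxhighlight lang="python">
# Sketch of the OneP post-processing cron; the child list comes from a
# ddisp/metacat query that is not shown, and the RSE name is a placeholder.
from rucio.client import Client

rucio = Client()

def declare_children(children, rse="PLACEHOLDER_DCACHE", dataset="mu2e:B"):
    # children: dicts with scope, name, bytes, adler32 for B0, C0, ...
    rucio.add_replicas(rse=rse, files=children)      # register the replicas
    scope, name = dataset.split(":")
    rucio.attach_dids(scope=scope, name=name,        # attach them to the dataset
                      dids=[{"scope": c["scope"], "name": c["name"]}
                            for c in children])
</syntaxhighlight>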


Notes
* only one dCache i/o for output (plus recoveries)
* only one final version of output files, with fixed names
* Rucio is updated a few hours after processing
* stage1 ddisp can't count on using Rucio locations unless it is delayed, or it is also virtual
* some user code logic in the job, minimal post-processing




===Mixed===

* the strict sub-pattern is run
* the ddisp project runs a default project with Rucio locations
* the job searches for output from previous iterations, B0-T0, B0-T1, and if they exist, the job deprecates the files (possibly metacat retired, Rucio removed from dataset and RSE, or files deleted); see the sketch after this list.  Metacat records are created before Rucio records, so if earlier versions are not in metacat, they are not in Rucio.
* the job writes unique output files for each recovery iteration, B0-T1 and C0-T1, in their final location
* the job creates metacat and Rucio records for the T1 versions
* jobs are recovered until the ddisp project sees success for all files
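A sketch of that version handling follows; <code>retire_file</code> and the <code>declare_file</code> signature are assumptions about the metacat client, and the helpers that find, deprecate, and declare versions are hypothetical names. The point is the ordering: metacat before Rucio.
<syntaxhighlight lang="python">
# Sketch of the Mixed pattern's version handling; helper names are hypothetical.
from metacat.webapi import MetaCatClient

mc = MetaCatClient("https://metacat.example.gov:9443/app")  # placeholder URL

def deprecate_then_write(base_name, this_tag, metadata):
    # First remove survivors of earlier iterations (B0-T0, ...).
    for old_did in find_previous_versions(base_name, this_tag):  # hypothetical
        mc.retire_file(did=old_did)        # 'retire_file' is an assumed call
        rucio_deprecate(old_did)           # hypothetical: detach from dataset/RSE
    new_did = base_name + "-" + this_tag   # e.g. B0-T1
    copy_to_final_location(new_did)        # hypothetical output copy
    # metacat before Rucio: a file absent from metacat is never in Rucio.
    mc.declare_file(did=new_did, metadata=metadata)  # signature assumed
    rucio_declare(new_did)                 # hypothetical: add_replicas + attach
</syntaxhighlight>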


post-processing:
* None


Notes
* only one dCache i/o for output (plus recoveries)
* only one version of output files at a time, with non-fixed names
* Rucio features may be used as soon as the file is created
* stage1 ddisp can count on no duplicates and correct Rucio locations
* some user code logic in the job, no post-processing




===Afterburner===

* the default sub-pattern is run
* the job writes only files unique to the recovery iteration, like B0-T0
* the job always copies the files to a unique location
* the job always creates new metacat and Rucio records
* jobs are recovered until the ddisp project sees success for all files


post-processing:
* a cron job searches recent ddisp projects and determines the files A0 which were run by successful ddisp workers.  If multiple copies of an output exist, the latest is declared the correct copy and earlier copies are deprecated (possibly metacat retired, Rucio removed from dataset and RSE, or files deleted); a sketch follows.
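A sketch of the duplicate-resolution step follows; the tag-parsing convention is taken from the file-naming scheme above, while <code>list_tagged_copies()</code> and <code>deprecate_file()</code> are hypothetical helpers.
<syntaxhighlight lang="python">
# Sketch of the Afterburner cron: keep the latest tag, deprecate the rest.
import re

def resolve_duplicates(base_name):
    copies = list_tagged_copies(base_name)  # hypothetical, e.g. ["B0-T0", "B0-T1"]
    def tag_number(did):
        m = re.search(r"-T(\d+)$", did)
        return int(m.group(1)) if m else -1
    copies.sort(key=tag_number)
    keep, stale = copies[-1], copies[:-1]
    for did in stale:                        # earlier copies are deprecated
        deprecate_file(did)                  # hypothetical: metacat retire,
                                             # Rucio detach/erase, or delete
    return keep                              # the copy declared correct
</syntaxhighlight>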
Notes
* no fixed file names
* for a time, multiple versions of output files may exist
* Rucio features may be used immediately after file upload
* stage1 ddisp can't proceed on stage0 ddisp success, since there may be multiple files B0-T0, B0-T1 until post-processing is done
* no user code logic in the job, minimal post-processing




[[Category:Computing]]
[[Category:Workflows]]
