ProductionProcedures
Latest revision as of 20:40, 31 July 2025
Introduction
Running jobs locally
In production log files, there are configuration dump stanzas like:
************** control summary exe ***************
MOO_CAMPAIGN_STAGE=Reco
MOO_SOURCE=v00_03_02
MOO_DATASET=CRVWB-000
MOO_VERBOSE=1
MOO_OUTDIR=production
MOO_APPEND_NAME=none
MOO_CFG=CRVWB-008
MOO_CONFIG=CRVWB-000-008-000
MOO_CAMPAIGN=CRVWB-000-0
MOO_SCRIPT=CRVWB/reco.sh
MOO_CRVTESTSTAND=v17
************** control summary exe ***************
Jobs are run in a generic wrapper script and are completely controlled by these variables and the input files. The variables are set through the process of interpreting the POMS campaign configuration, the cfg files, and the wrapper script itself. The easiest way to get a complete set of control variables is from a log file, but if that's not available, there is currently no simple verified way to extract them from the sources (if, for example, no jobs have run). To rerun a job locally, you only need to write a little script that sets these variables, then provide one more:
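Since the log file is the recommended source of the control variables, one way to recover them is a small text filter. This is a hypothetical helper, not part of the production tools; it assumes the "control summary exe" stanza format shown above, and the sample log here is made up for illustration.

```shell
# Hypothetical helper: pull the MOO_* control variables out of a job log
# and turn them into export statements that can be sourced before a local
# rerun. The sample stanza mimics the format shown above.
cat > job.log <<'EOF'
************** control summary exe *************** MOO_CAMPAIGN_STAGE=Reco MOO_DATASET=CRVWB-000 MOO_VERBOSE=1 ************** control summary exe ***************
EOF

grep -o 'MOO_[A-Z_]*=[^ ]*' job.log | sort -u | sed 's/^/export /' > moo_env.sh
cat moo_env.sh
```

The resulting moo_env.sh can then be sourced in the shell where the wrapper is run.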
export MOO_LOCAL_INPUT=https://fndcadoor.fnal.gov:2880/pnfs/fnal.gov/usr/mu2e/tape/phy-raw/raw/mu2e/CRV_wideband_cosmics/crvled-001/dat/3b/a7/raw.mu2e.CRV_wideband_cosmics.crvled-001.001303_056.dat
Go to an area with some space:
cd /exp/mu2e/data/users/mu2epro/production_recovery
# pick a subdirectory
cd 1
# clean up, make it look like a grid dir
rm -f * jsb_tmp/*
mkdir -p jsb_tmp
If needed, to run in SL7:
mu2einit
sl7container
mu2einit
Run the job script:
export MOO_CAMPAIGN_STAGE=Reco
export MOO_SOURCE=v00_03_02
export MOO_DATASET=CRVWB-000
export MOO_VERBOSE=1
export MOO_OUTDIR=production
export MOO_APPEND_NAME=none
export MOO_CFG=CRVWB-008
export MOO_CONFIG=CRVWB-000-008-000
export MOO_CAMPAIGN=CRVWB-000-0
export MOO_SCRIPT=CRVWB/reco.sh
export MOO_CRVTESTSTAND=v17
export MOO_LOCAL_INPUT=https://fndcadoor.fnal.gov:2880/pnfs/fnal.gov/usr/mu2e/tape/phy-raw/raw/mu2e/CRV_wideband_cosmics/crvled-001/dat/3b/a7/raw.mu2e.CRV_wideband_cosmics.crvled-001.001303_056.dat

nice /cvmfs/mu2e.opensciencegrid.org/bin/OfflineOps/wrapper.sh \
  1> jsb_tmp/JOBSUB_LOG_FILE 2> jsb_tmp/JOBSUB_ERR_FILE
# optionally put in the background
Keepup scripts
The keepup scripts drive production scripts that have to run constantly. The keepup technique keeps the scripts running continuously and uses a cron job only to check that all the scripts are still running. This is a little easier to maintain and has the nice feature that if a script runs past its repeat interval because the workload is heavy, it can be rerun again immediately.
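The check-and-restart idea can be sketched as follows. This is a minimal illustration, not the real keepup.sh; the service name and restart line are made up.

```shell
# Minimal sketch of the keepup check (illustrative only): the cron side
# never does the work itself, it only verifies that the worker process
# exists and flags it for restart when it is missing.
check_service() {
    service="$1"
    if pgrep -f "$service" > /dev/null 2>&1; then
        echo "$service: running"
    else
        echo "$service: missing, needs restart"
        # nohup "./$service" >> wrapper.log 2>&1 &   # restart step (sketch)
    fi
}
check_service "keepup-demo-instance"
```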
The framework scripts are in ~mu2epro/cron/production. A cron job on mu2eprodgpvm01 running keepup.sh checks that the scripts are running and have recent heartbeats. Each procedure is called a "service", and each service can have multiple independent instances labeled by a different "name" and can run on a requested node. The master list of services and instances is in keepup.txt, which controls which services are running. There are notes at the top of the file on the meaning of each row, which represents an instance of a service. The keepup scripts monitor the heartbeat of the service and will trigger an alarm (currently just an email) if a service is missing or the script is stalled.
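The staleness test behind the alarm can be sketched like this. It is an illustration with a made-up threshold, not the actual keepup.sh logic, and it assumes GNU stat for the file modification time.

```shell
# Sketch of a heartbeat staleness check (illustrative; MAX_AGE is a
# made-up threshold, and GNU stat is assumed for the mtime query).
MAX_AGE=600                                  # seconds
date > heartbeat.txt                         # normally written by the wrapper
age=$(( $(date +%s) - $(stat -c %Y heartbeat.txt) ))
if [ "$age" -gt "$MAX_AGE" ]; then
    echo "ALARM: heartbeat is ${age}s old"   # real version sends an email
else
    echo "heartbeat ok (${age}s old)"
fi
```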
Each service has a code and configuration subdirectory (named for the service) with a main worker script called run.sh. The service can customize how to name and configure its instances. Each instance of each service has its own working area under /exp/mu2e/data/users/mu2epro/production/logs. In the working directory, there are standard files and directories:

- log files by date - the output of the service script
- an executable file named "<service>-<name>" - a wrapper script calling run.sh, renamed for convenience
- work directory - scratch space for the service
- heartbeat.txt - heartbeats from the wrapper confirming the service script is active
- wrapper.log - output from the wrapper script
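One cycle of the wrapper's job can be sketched as follows. This is a toy illustration, not the real wrapper (which lives under ~mu2epro/cron/production and repeats this cycle forever at the service's repeat interval); the run.sh here is a stand-in.

```shell
# Sketch of one wrapper cycle (illustrative only; the production wrapper
# loops forever). A stand-in run.sh is created just for the demo.
mkdir -p work                                    # scratch space for the service
printf '#!/bin/sh\necho doing work\n' > run.sh
chmod +x run.sh

one_cycle() {
    date > heartbeat.txt                         # heartbeat checked by keepup.sh
    ./run.sh >> "log.$(date +%Y-%m-%d)" 2>&1     # log file by date
}
one_cycle
```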