Konstantin Androsov edited this page Aug 8, 2017 · 11 revisions

Production v3 instructions

Framework installation

  1. Install framework on lxplus in a prod workspace directory without creating CMSSW release area

    curl -s https://raw.githubusercontent.com/hh-italian-group/hh-bbtautau/master/Run/install_framework.sh | bash -s prod
  2. Check framework production functionality interactively for a few samples

    cd CMSSW_8_0_28/src
    cmsenv
    # Radion 250 sample
    echo /store/mc/RunIISummer16MiniAODv2/GluGluToRadionToHHTo2B2Tau_M-250_narrow_13TeV-madgraph/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/110000/A06FE4CA-27DB-E611-9246-549F3525C380.root > Radion_250.txt
    cmsRun h-tautau/Production/python/Production.py fileList=Radion_250.txt applyTriggerMatch=True sampleType=Summer16MC ReRunJEC=True globalTag=80X_mcRun2_asymptotic_2016_TrancheIV_v6 tupleOutput=eventTuple_Radion_250.root maxEvents=1000
    # Single electron Run2016B
    echo /store/data/Run2016B/SingleElectron/MINIAOD/03Feb2017_ver2-v2/110000/003B2C1F-50EB-E611-A8F1-002590E2D9FE.root > SingleElectron_B.txt
    cmsRun h-tautau/Production/python/Production.py fileList=SingleElectron_B.txt anaChannels=eTau applyTriggerMatch=True sampleType=Run2016 ReRunJEC=True globalTag=80X_dataRun2_2016SeptRepro_v7 tupleOutput=eventTuple_SingleElectronB.root saveGenTopInfo=False saveGenBosonInfo=False saveGenJetInfo=False energyScales=Central lumiFile=h-tautau/Production/json/Cert_271036-284044_13TeV_PromptReco_Collisions16_JSON.txt maxEvents=1000
  3. Install framework on the stage out site (e.g. Pisa)

    curl -s https://raw.githubusercontent.com/hh-italian-group/hh-bbtautau/master/Run/install_framework.sh | bash -s prod
    cd CMSSW_8_0_28/src
    cmsenv
    ./run.sh MergeRootFiles --help

Setup CRAB working environment

Each time after login:

source /cvmfs/cms.cern.ch/crab3/crab.sh
voms-proxy-init --voms cms --valid 168:00
cd CMSSW_DIR/src
cmsenv
cd h-tautau/Production/crab

Production spreadsheet legend

  • done: all crab jobs are successfully finished
  • 99p: at least 99% (but not all) of crab jobs are successfully finished
  • tuple: tuples for a task whose crab jobs have all successfully finished are merged and copied to the central storage (on the stage out site or in cernbox)
  • 99t: tuples for a task with at least 99% (but not all) of its crab jobs successfully finished are merged and copied to the central storage (on the stage out site or in cernbox)
  • done 99t: a combination of done and 99t: all crab jobs are successfully finished, but the tuples in the central storage were produced from outputs collected while the task was still in the 99p state
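
The job-status part of the legend above can be sketched as a small helper that maps a task's finished-job count to its spreadsheet label (a sketch; the function name is hypothetical, the labels come from the legend):

```shell
#!/bin/bash
# Map the finished-job count of a task to the spreadsheet label used in
# the legend above: 'done' for 100%, '99p' for >=99% (but not all) of
# the jobs finished. Function name is hypothetical.
status_label() {
    local finished=$1 total=$2
    if [ "$finished" -eq "$total" ]; then
        echo "done"
    elif [ $(( finished * 100 )) -ge $(( total * 99 )) ]; then
        echo "99p"
    else
        echo "in progress"
    fi
}

status_label 999 1000   # at least 99% finished -> 99p
```

The tuple, 99t and done 99t labels additionally track whether merged tuples have been copied to the central storage, which cannot be derived from job counts alone.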

Production workflow

Steps 2-6 should be repeated periodically, 1-2 times per day, until the production is complete.

  1. Submit jobs

    ./submit.py --work-area work-area --cfg ../python/Production.py --site T2_IT_Pisa --output hh_bbtautau_prod_v3 config1 [config2] ...
  2. Check jobs status

    ./multicrab.py --workArea work-area --crabCmd status
    # if 99p directory exists
    ./multicrab.py --workArea 99p --crabCmd status

    Analyze the output of the status command for each task:

    1. If a few jobs failed without any persistent pattern, resubmit them:

      crab resubmit -d work-area/TASK_AREA
      # or
      crab resubmit -d 99p/TASK_AREA
    2. If a significant number of jobs are failing, investigate the reason and take action accordingly. For more details, see the CRAB troubleshoot section below.

    3. If all jobs have successfully finished, move the task area from 'work-area' (or '99p') into the 'finished' directory (create it if needed).

      # mkdir -p finished
      mv work-area/TASK_AREA finished/
      # or
      mv 99p/TASK_AREA finished/
    4. If at least 99% (but not all) of the jobs have successfully finished, move the task area from 'work-area' into the '99p' directory (create it if needed).

      # mkdir -p 99p
      mv work-area/TASK_AREA 99p/
    5. If there is no reasonable hope that the remaining jobs will ever finish successfully, move the task area from 'work-area' into the 'finished-partial' directory (create it if needed). Before moving the directory, make sure that every job is in either the 'failed' or 'finished' state; otherwise wait (use the kill command, if necessary).

      • Create recovery task for the failed jobs (see CRAB troubleshoot section).
      # mkdir -p finished-partial
      mv work-area/TASK_AREA finished-partial/
      # or
      mv 99p/TASK_AREA finished-partial/
    6. For the tasks in 'finished-partial' and '99p', create crab reports:

      # mkdir -p crab_results
      for NAME in finished-partial 99p ; do for JOB in $(ls $NAME) ; do if [ ! -f "$NAME/$JOB/results/processedLumis.json" ] ; then echo "$NAME/$JOB" ; crab report -d "$NAME/$JOB" ; fi ; done ; done
    7. Create task lists for the tasks in the 'finished', 'finished-partial' and '99p' directories and transfer them to the stage out server. For each task from 'finished-partial' and '99p', the list of unprocessed jobs should be specified (use create_job_list_ex.py).

      if [ -d current-check ] ; then rm -rf prev-check ; mv current-check prev-check ; fi
      mkdir current-check
      for NAME in finished* 99p ; do ./create_job_list.sh $NAME | sort > current-check/$NAME.txt ; done
      for NAME in finished-partial 99p ; do ./create_job_list_ex.py --job-list current-check/$NAME.txt --work-area $NAME --crab-results-out crab_results --prev-output prev-check/$NAME.txt ; done
    8. Update the prod_v3 spreadsheet accordingly. A 'finished-partial' task should be considered incomplete, so it needs no update in the spreadsheet.

      for NAME in finished* 99p ; do echo "$NAME:" ; if [ -f prev-check/$NAME.txt ] ; then diff current-check/$NAME.txt prev-check/$NAME.txt ; else cat current-check/$NAME.txt ; fi ; done
    9. For the tasks for which 99t tuples were created, transfer processed crab results from crab_results directory to /eos/user/k/kandroso/cms-it-hh-bbtautau/Tuples2016_v3/crab_results.

      cp crab_results/TASK_NAME.tar.bz2 /eos/user/k/kandroso/cms-it-hh-bbtautau/Tuples2016_v3/crab_results/
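
The spreadsheet check in step 8 above boils down to listing the tasks that appear in the current check but not in the previous one. A comm-based sketch of that comparison (the function name is hypothetical; it assumes the current-check/prev-check layout used above):

```shell
#!/bin/bash
# Print tasks present in CURRENT but not in PREV, i.e. the spreadsheet
# rows that need updating; print the full current list when no previous
# check exists. Function name is hypothetical.
new_tasks() {
    local current=$1 prev=$2
    if [ -f "$prev" ]; then
        sort "$current" > "$current.sorted"
        sort "$prev" > "$prev.sorted"
        # comm -23 keeps lines unique to the first (current) file
        comm -23 "$current.sorted" "$prev.sorted"
        rm -f "$current.sorted" "$prev.sorted"
    else
        cat "$current"
    fi
}

# Usage: new_tasks current-check/finished.txt prev-check/finished.txt
```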
  3. Submit merge jobs for the output files on the stage out server (the framework has to be installed there beforehand, see above).

    • For partially finished tasks, wait for the recovery task to finish before starting the merge.
    • Support for merging partially finished tasks is not implemented yet.
    1. If some merge jobs were already created during a previous iteration, use find_new_jobs.sh to create the list of new jobs to submit.

      • N.B. The current-check/*.txt files have to be transferred from lxplus to the src directory on the stage out server before running find_new_jobs.sh.
      ./h-tautau/Instruments/find_new_jobs.sh current-check/finished.txt output/tuples > finished.txt
      ./h-tautau/Instruments/find_new_jobs.sh current-check/finished-partial.txt output/tuples_partial output/tuples output/tuples_99p > finished-partial.txt
      ./h-tautau/Instruments/find_new_jobs.sh current-check/99p.txt output/tuples_99p output/tuples output/tuples_partial > 99p.txt
      • N.B. Check that the created job lists contain no jobs whose merge is still running in the batch system queue (use the bjobs command on gridui). If there are any, remove them from the lists.
    2. Submit merge jobs in the local queue, where CRAB_OUTPUT_PATH is the crab output path specified in the submit.py command. For example, for the Pisa stage out server it is /gpfs/ddn/srm/cms/store/user/#YOUR_USERNAME/hh_bbtautau_prod_v3/.

      ./h-tautau/Instruments/submit_tuple_hadd.sh cms finished.txt output/merge CRAB_OUTPUT_PATH
      ./h-tautau/Instruments/submit_tuple_hadd.sh cms finished-partial.txt output/merge_partial CRAB_OUTPUT_PATH
      ./h-tautau/Instruments/submit_tuple_hadd.sh cms 99p.txt output/merge_99p CRAB_OUTPUT_PATH
    3. Collect finished jobs (this script can be run as many times as needed).

      ./h-tautau/Instruments/collect_tuple_hadd.sh output/merge output/tuples
      ./h-tautau/Instruments/collect_tuple_hadd.sh output/merge_partial output/tuples_partial
      ./h-tautau/Instruments/collect_tuple_hadd.sh output/merge_99p output/tuples_99p
    4. For large samples that were split into several 'sub' tasks, and for samples with recovery tasks, once all 'sub' and 'recovery' tuples are merged, merge them all together into output/tuples or output/tuples_99p (depending on the overall job status of the sample).

      For example,

      hadd -f9 output/tuples/TTToSemilepton_TuneCUETP8M2_ttHtranche3.root output/tuples_partial/TTToSemilepton_TuneCUETP8M2_ttHtranche3_sub*.root output/tuples/TTToSemilepton_TuneCUETP8M2_ttHtranche3_recovery1.root
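
The note in step 1 above, about merge jobs still running in the batch queue, can be checked mechanically: filter the new-job list against the list of running job names (which would normally be derived from the bjobs output on gridui). A sketch with hypothetical file names:

```shell
#!/bin/bash
# Drop from JOB_LIST every job name that appears in RUNNING_LIST
# (one name per line). RUNNING_LIST is assumed to be extracted from
# the bjobs output. Function name is hypothetical.
filter_running() {
    local job_list=$1 running_list=$2
    # -F fixed strings, -x whole-line match, -v invert (keep non-running)
    grep -v -x -F -f "$running_list" "$job_list" || true
}

# Usage: filter_running finished.txt running_jobs.txt > finished_clean.txt
```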
  4. Split large merged files into several parts in order to satisfy the cernbox requirement that each file be smaller than 8 GB.

    # mkdir -p output/tuples_split
    ./h-tautau/Instruments/python/split_tuple_file.py --input output/tuples/SAMPLE.root --output output/tuples_split/SAMPLE.root
    # copy split files back into the original directory
    mv output/tuples_split/SAMPLE*.root output/tuples
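
Which merged files actually need splitting can be listed first with a small helper (a sketch; the function name is hypothetical and GNU stat is assumed, as on lxplus/gridui):

```shell
#!/bin/bash
# Print the ROOT files in DIR whose size is at least LIMIT bytes;
# these are the candidates for split_tuple_file.py. Uses GNU stat.
files_over_limit() {
    local dir=$1 limit=$2 f size
    for f in "$dir"/*.root; do
        [ -e "$f" ] || continue
        size=$(stat -c %s "$f")
        if [ "$size" -ge "$limit" ]; then
            echo "$f"
        fi
    done
}

# Usage: files_over_limit output/tuples $((8 * 1024 * 1024 * 1024))
```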
  5. Transfer tuple files into the local tuple storage. Pisa: /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples2016_v3/.

    • For 100% complete tasks use the Full sub-directory; for >=99% (but not 100%) complete tasks use the Full_99p sub-directory.
    # Full
    rsync -auv --chmod=g+rw --exclude '*sub[0-9].root' --exclude '*recovery[0-9].root' --dry-run output/tuples/*.root /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples2016_v3/Full
    # if everything ok
    rsync -auv --chmod=g+rw --exclude '*sub[0-9].root' --exclude '*recovery[0-9].root' output/tuples/*.root /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples2016_v3/Full
    
    # Full_99p
    rsync -auv --chmod=g+rw --exclude '*sub[0-9].root' --exclude '*recovery[0-9].root' --dry-run output/tuples_99p/*.root /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples2016_v3/Full_99p
    # if everything ok
    rsync -auv --chmod=g+rw --exclude '*sub[0-9].root' --exclude '*recovery[0-9].root' output/tuples_99p/*.root /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples2016_v3/Full_99p
    • Update prod_v3 spreadsheet accordingly.
    • If the 100% tuple is ready, the 99% version should be removed.
    • The tuples will then be transferred by the production coordinator into the central prod_v3 cernbox directory: /eos/user/k/kandroso/cms-it-hh-bbtautau/Tuples2016_v3.
  6. When the production is over, after a few weeks of safety delay, delete the remaining crab output directories and root files in your area to reduce unnecessary storage usage.
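
The cleanup rule from step 5, removing the 99% version once the 100% tuple is ready, can be sketched as a pass over the two storage sub-directories (the function name is hypothetical; it prints each file before removing it):

```shell
#!/bin/bash
# For every sample that has a 100% tuple in FULL_DIR, remove the stale
# copy of the same sample from FULL99_DIR. Function name is hypothetical.
cleanup_99p() {
    local full=$1 full99=$2 f name
    for f in "$full"/*.root; do
        [ -e "$f" ] || continue
        name=$(basename "$f")
        if [ -e "$full99/$name" ]; then
            echo "removing $full99/$name"
            rm -f "$full99/$name"
        fi
    done
}

# Usage (Pisa paths from step 5):
# cleanup_99p /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples2016_v3/Full \
#             /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples2016_v3/Full_99p
```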

CRAB troubleshoot

  1. Common failure reasons

    • Jobs are failing due to excessive memory usage.

      Solution: resubmit the jobs requesting more memory per job, e.g.:

      crab resubmit --maxmemory 4000 -d work-area/TASK_AREA
    • Jobs are failing on some sites.

      Solution: resubmit the jobs using a site blacklist or whitelist, e.g.:

      crab resubmit --siteblacklist=T2_IT_Pisa -d work-area/TASK_AREA
      # OR
      crab resubmit --sitewhitelist=T2_IT_Pisa -d work-area/TASK_AREA
  2. How to create a recovery task. Do this only if the problem cannot be solved using crab resubmit and the recipes suggested in the previous points. Possible reasons to create a recovery task are:

    • Some jobs persistently exceed the execution time limit, so smaller jobs should be created.
    • There are bugs in the code that are relevant only under rare conditions met by some jobs in the task.
      • If the bug can also affect the successfully finished jobs, the entire task should be re-run from scratch.

    Here are the steps to create a recovery task:

    1. Fix all bugs in the code, if there are any.
    2. Wait until every job has either 'finished' or 'failed' status.
    3. Retrieve crab report:
      crab report -d finished-partial/TASK_AREA
    4. Use the file 'results/notFinishedLumis.json' in the task area as the lumi mask for the recovery task. Create the recovery task using submit.py:
      ./submit.py --work-area work-area --cfg ../python/Production.py --site T2_IT_Pisa --output hh_bbtautau_prod_v3 --jobNames FAILED_TASK_NAME --lumiMask finished-partial/TASK_AREA/results/notFinishedLumis.json --jobNameSuffix _recovery1 FAILED_TASK_CFG
    5. Follow the production workflow procedure above.