
Production v5 instructions

Framework installation

  1. Install the framework on lxplus6 in a production workspace directory, without creating a CMSSW release area first

    curl -s https://raw.githubusercontent.com/hh-italian-group/h-tautau/prod_v5/install_framework.sh | bash -s prod 8
  2. Check framework production functionality interactively for a few samples

    cd CMSSW_10_2_16/src
    cmsenv
    # Run interactively over a few signal events
    cmsRun h-tautau/Production/python/Production.py inputFiles=file:/eos/home-k/kandroso/cms-it-hh-bbtautau/prod_v5/test/miniAOD/GluGluToHHTo2B2Tau_node_SM_13TeV_2017.root sampleType=MC_17 applyTriggerMatch=True saveGenTopInfo=False saveGenJetInfo=True applyTriggerCut=False storeLHEinfo=True saveGenParticleInfo=True tupleOutput=signal_try.root maxEvents=100
    # check the output
    root -l signal_try.root
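
    As a quick sanity check of the produced tuple (the tree name below is only an example; the actual names depend on the configuration), one can list the file content and count entries at the ROOT prompt:

    root [0] .ls                        // list the trees stored in signal_try.root
    root [1] eventTuple->GetEntries()   // replace eventTuple with one of the names printed by .ls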
  3. Install the framework in Pisa (using a fai machine with the slc6 image)

    curl -s https://raw.githubusercontent.com/hh-italian-group/h-tautau/prod_v5/install_framework.sh | bash -s prod 4
    cd CMSSW_10_2_16/src
    cmsenv
    ./run.sh TupleMerger --help

Set up the CRAB working environment on lxplus6

Each time after login:

source /cvmfs/cms.cern.ch/crab3/crab.sh  # or crab.csh
voms-proxy-init --voms cms --valid 168:00
cd CMSSW_DIR/src/h-tautau/Production/crab
cmsenv

Production spreadsheet legend

  • done: all crab jobs are successfully finished.
  • tuple: the tuples of a task whose crab jobs have all finished successfully are merged and copied to the central storage in Pisa.
  • 99p and 99t: same as done and tuple above, but for tasks in which at least 99% of the jobs finished successfully while the few remaining jobs failed. Such tasks should be considered "finished", so follow the full procedure explained below. The only exceptions are DATA and embedded samples, for which all jobs must be 100% finished.

Production workflow

Steps 0, 2-6 should be repeated periodically, 1-2 times per day, until the end of the production.

  1. Define YEAR variable

     export YEAR=VALUE  # where VALUE is 2016, 2017 or 2018
     
  2. Submit jobs

    ./submit.py --work-area work-area_$YEAR --cfg ../python/Production.py --site T2_IT_Legnaro --output hh_bbtautau_prod_v5_$YEAR config/$YEAR/config1 [config/$YEAR/config2] ...
    • run ./submit.py --help to get more details about available parameters.
    • Submit each year separately and be careful to use a different output folder for each year.
    • To avoid saturating the CRAB scheduler, it is better to submit one year at a time and wait until about 50% of its jobs are finished before submitting the other years.
    • The embedded samples are published in the phys03 DBS instance, therefore --inputDBS phys03 should be specified during the submission (see the sketch below).
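
    For example, a submission of an embedded dataset might look like the sketch below, assuming that submit.py forwards --inputDBS to the CRAB configuration (check ./submit.py --help for the exact option name); config/$YEAR/embedded is a placeholder for the actual embedded config file:

      ./submit.py --work-area work-area_$YEAR --cfg ../python/Production.py --site T2_IT_Legnaro --output hh_bbtautau_prod_v5_$YEAR --inputDBS phys03 config/$YEAR/embedded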
  3. Check jobs status

    ./multicrab.py --workArea work-area_$YEAR --crabCmd status

    Analyze the output of the status command for each task:

    1. If a few jobs failed without any persistent pattern, resubmit them:
      crab resubmit -d work-area_$YEAR/TASK_AREA
    2. If a significant fraction of the jobs is failing, investigate the reason and act accordingly. For more details see the CRAB troubleshoot section below.
    3. If all jobs (or at least 99% of them) have finished successfully, move the task area from "work-area_$YEAR" into the "finished_$YEAR" directory (create it if needed).
      # mkdir -p finished_$YEAR
      mv work-area_$YEAR/TASK_AREA finished_$YEAR

    Before moving the directory, make sure that all jobs have status FAILED or FINISHED; otherwise wait (use the crab kill command, if necessary).
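
    The corresponding standard CRAB client commands are, e.g.:

      # detailed per-job status, useful when investigating a problematic task
      crab status --long -d work-area_$YEAR/TASK_AREA
      # kill the remaining jobs of a task if you decide not to wait for them
      crab kill -d work-area_$YEAR/TASK_AREA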

    1. Create task lists for the tasks in the "finished_$YEAR" directory and transfer them to the Pisa server.

      if [ -f current-check_$YEAR.txt ] ; then rm -f prev-check_$YEAR.txt ; mv current-check_$YEAR.txt prev-check_$YEAR.txt ; fi
      ./create_job_list.sh finished_$YEAR | sort > current-check_$YEAR.txt
    2. For all tasks in "finished_$YEAR" (especially the 99% ones), create crab reports:

      # mkdir -p crab_results_$YEAR
      for JOB in $(ls finished_$YEAR) ; do if [ ! -f "crab_results_$YEAR/$JOB.tar.gz" ] ; then echo "finished_$YEAR/$JOB" ; crab report  -d "finished_$YEAR/$JOB" ; tar -czvf crab_results_$YEAR/$JOB.tar.gz finished_$YEAR/$JOB/results/ ; fi ; done

      These reports will allow partially rerunning a task, should it be needed in the future.

    3. Update the prod_v5/$YEAR spreadsheet accordingly, following the notation defined in the Production spreadsheet legend section above. You can use the following command line to get a list of the newly finished tasks:

      if [ -f prev-check_$YEAR.txt ] ; then diff current-check_$YEAR.txt prev-check_$YEAR.txt ; else cat current-check_$YEAR.txt ; fi
  4. The stage-out is done at Legnaro, so before merging, the files have to be copied from Legnaro to Pisa. To check the files and transfer them to Pisa, perform the following steps in Pisa:

    1. run voms-proxy-init --voms cms --valid 168:00
    2. check the directory:
       ./AnalysisTools/Run/python/gfal-ls.py T2_IT_Legnaro:/store/user/LXPLUS_USER/hh_bbtautau_prod_v5_YEAR
    3. then copy the files to gridui (where you will run the merging code); create a destination folder and copy the samples into it, folder by folder:
      ./AnalysisTools/Run/python/gfal-cp.py --exclude '.*\.log\.tar\.gz$' T2_IT_Legnaro:/store/user/LXPLUS_USER/hh_bbtautau_prod_v5_YEAR/SAMPLE_FOLDER OUTPUT_FOLDER_IN_GRIDUI_PER_YEAR

    P.S. If the copy fails before all files have been downloaded, delete the last downloaded file (not all of them) and rerun the script, as in the sketch below.
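
    A minimal sketch of this cleanup, assuming the partially copied sample lives in OUTPUT_FOLDER_IN_GRIDUI_PER_YEAR/SAMPLE_FOLDER on gridui:

      # remove only the most recently written (potentially truncated) file, then rerun gfal-cp.py
      LAST_FILE=$(ls -t OUTPUT_FOLDER_IN_GRIDUI_PER_YEAR/SAMPLE_FOLDER | head -n 1)
      rm "OUTPUT_FOLDER_IN_GRIDUI_PER_YEAR/SAMPLE_FOLDER/$LAST_FILE"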

  5. Submit merge jobs for the output files on the Pisa server.

    1. If some merge jobs were already created during the previous iteration, use find_new_jobs.sh to create a list of new jobs to submit.

      • N.B. The current-check_$YEAR/*.txt file has to be transferred from lxplus into the src directory on the Pisa server in order to run find_new_jobs.sh.
      ./h-tautau/Instruments/find_new_jobs.sh current-check_$YEAR/finished.txt output_$YEAR/tuples > finished_$YEAR.txt
      • N.B. Check that the created job lists contain no jobs that are still running in the merge batch queue (bjobs command on gridui); if there are any, remove them from the list. See the sketch below.
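
      A hedged way to cross-check, assuming the merge job names contain the sample name (adapt the grep pattern otherwise):

        # wide-format listing of your batch jobs; look for the sample of interest
        bjobs -w | grep SAMPLE_NAME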
    2. Submit merge jobs in the interactive queue of the Pisa server on a fai machine (bsub -Is -n 1 -q fai -a "docker-sl6" /bin/bash), using the following command line, where CRAB_OUTPUT_PATH is the folder in Pisa to which you copied the files from Legnaro:

       ./h-tautau/Instruments/submit_tuple_hadd.sh interactive finished_$YEAR.txt output_$YEAR/merge CRAB_OUTPUT_PATH
    3. Collect finished jobs (this script can be run as many times as you want).

      ./h-tautau/Instruments/collect_tuple_hadd.sh output_$YEAR/merge output_$YEAR/tuples
  6. Split large merged files into several parts in order to satisfy the CERNBox requirement that each file be smaller than 50 GB.

    # mkdir -p output_$YEAR/tuples_split
    ./h-tautau/Instruments/python/split_tuple_file.py --input output_$YEAR/tuples/SAMPLE.root --output output_$YEAR/tuples_split/SAMPLE.root
    # copy split files back into the original directory
    mv output_$YEAR/tuples_split/SAMPLE*.root output_$YEAR/tuples
  7. Transfer tuple files into the central tuple storage in Pisa: /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples${YEAR}_v5/Full.

    # Full
    rsync -auv --chmod=g+rw --exclude '*sub[0-9].root' --exclude '*recovery[0-9].root' --dry-run output_$YEAR/tuples/*.root /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples${YEAR}_v5/Full
    # if everything ok
    rsync -auv --chmod=g+rw --exclude '*sub[0-9].root' --exclude '*recovery[0-9].root' output_$YEAR/tuples/*.root /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples${YEAR}_v5/Full
    • Update prod_v5_$YEAR spreadsheet accordingly.
    • The tuples will then be transferred by the production coordinator into the central prod_v5_$YEAR CERNBox directory: /eos/home-k/kandroso/cms-it-hh-bbtautau/Tuples${YEAR}_v5.
  8. Transfer the crab results into /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples${YEAR}_v5/crab_results:

     rsync -auv --chmod=g+rw LXPLUS_USER@lxplus.cern.ch:/PATH_FROM_LXPLUS/crab_results_$YEAR/ /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples${YEAR}_v5/crab_results
  9. When the whole production is over, after a few weeks of safety delay, delete the remaining crab output directories and ROOT files in your areas to reduce unnecessary storage usage.
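
     A minimal cleanup sketch, assuming everything has already been transferred and validated (double-check every path before deleting):

     # on lxplus: remove the local crab work areas and report archives
     rm -r work-area_$YEAR finished_$YEAR crab_results_$YEAR
     # on the Pisa server: remove the local copies of the crab output and of the merged tuples
     rm -r CRAB_OUTPUT_PATH output_$YEAR
     # the crab output area on the Legnaro storage can be cleaned with standard grid tools (e.g. gfal-rm -r) once nothing there is needed anymore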

CRAB troubleshoot

  1. Common failure reasons

    • Jobs are failing because they exceed the memory limit.

      Solution: resubmit the jobs requesting more memory per job, e.g.:

      crab resubmit --maxmemory 4000 -d work-area_$YEAR/TASK_AREA
    • Jobs are failing on some sites.

      Solution: resubmit the jobs using a site blacklist or whitelist, e.g.:

      crab resubmit --siteblacklist=T2_IT_Pisa -d work-area_$YEAR/TASK_AREA
      # OR
      crab resubmit --sitewhitelist=T2_IT_Pisa -d work-area_$YEAR/TASK_AREA
  2. How to create a recovery task. Do this only if the problem can't be solved using crab resubmit and the recipes suggested in the previous points. Possible reasons to create a recovery task are:

    • Some jobs persistently exceed the execution time limit, so smaller jobs should be created.
    • Bugs in the code that are relevant only in rare conditions met by some jobs in the task.
      • If the bug can also affect the successfully finished jobs, the entire task should be re-run from scratch.
    • Other non-reproducible crab issues.

    Here are the steps to create a recovery task:

    1. Fix all bugs in the code, if there are any.
    2. Wait until all jobs have either 'finished' or 'failed' status.
    3. Retrieve crab report:
      crab report -d finished-partial/TASK_AREA
    4. Use the file 'results/notFinishedLumis.json' in the task area as the lumi mask for the recovery task. Create the recovery task using submit.py:
      ./submit.py --work-area work-area --cfg ../python/Production.py --site T2_IT_Legnaro --output hh_bbtautau_prod_v5 --jobNames FAILED_TASK_NAME --lumiMask finished-partial/TASK_AREA/results/notFinishedLumis.json --jobNameSuffix _recovery1 FAILED_TASK_CFG
    5. Follow the production workflow procedure.
  3. Prepare local jobs (when only a few jobs have failed):

    1. Once you have set up the environment, you need the CRAB project directory for your task. If you have already submitted the task, you can simply cd to the project directory created at submission time, or recreate it with the crab remake command (see the sketch below);
    2. if you have not yet submitted the task, do it with the --dryrun option. Once the CRAB project directory is available, execute
    crab preparelocal --dir=<PROJECTDIR>

    and then execute locally the script that it creates.
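
    If the project directory has to be recreated (point 1 above), a sketch of the crab remake command is:

    crab remake --task=<TASK_NAME>

    where <TASK_NAME> is the full CRAB task name, as reported at submission time or by the CRAB monitoring pages.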
