
Production v5 instructions

Framework installation

  1. Install the framework on lxplus6 in a production workspace directory, without creating a CMSSW release area beforehand

    curl -s https://raw.githubusercontent.com/hh-italian-group/h-tautau/prod_v5/install_framework.sh | bash -s prod 8
  2. Check the framework production functionality interactively on a few samples

    cd CMSSW_10_2_16/src
    cmsenv
    # Run a few signal events interactively
    cmsRun h-tautau/Production/python/Production.py inputFiles=file:/eos/home-k/kandroso/cms-it-hh-bbtautau/prod_v5/test/miniAOD/GluGluToHHTo2B2Tau_node_SM_13TeV_2017.root sampleType=MC_17 applyTriggerMatch=True saveGenTopInfo=False saveGenJetInfo=True applyTriggerCut=False storeLHEinfo=True saveGenParticleInfo=True tupleOutput=signal_try.root maxEvents=100
    # check the output
    root -l signal_try.root
  3. Install the framework in Pisa (using a fai machine with the slc6 image)

    curl -s https://raw.githubusercontent.com/hh-italian-group/h-tautau/prod_v5/install_framework.sh | bash -s prod 4
    cd CMSSW_10_2_16/src
    cmsenv
    ./run.sh TupleMerger --help

Set up the CRAB working environment on lxplus6

Each time after login:

source /cvmfs/cms.cern.ch/crab3/crab.sh  # or crab.csh for csh-like shells
voms-proxy-init --voms cms --valid 168:00
cd CMSSW_DIR/src/h-tautau/Production/crab
cmsenv

Production spreadsheet legend

  • done: all CRAB jobs of the task have finished successfully.
  • tuple: the tuples of a task whose CRAB jobs have all finished successfully have been merged and copied to the central storage in Pisa.
  • 99p and 99t: same as done and tuple above, but for samples where ~99% of the jobs finished and the few remaining jobs failed. These samples should be considered "finished", so the full procedure explained below should be followed. The only exceptions are DATA and embedded samples, for which all jobs must be 100% finished.

Production workflow

Steps 2-6 (together with the CRAB environment setup above) should be repeated periodically, 1-2 times per day, until the end of the production.

  1. Define the YEAR variable

     export YEAR=VALUE  # where VALUE is 2016, 2017 or 2018
     
  2. Submit jobs

    ./submit.py --work-area work-area_$YEAR --cfg ../python/Production.py --site T2_IT_Legnaro --output hh_bbtautau_prod_v5_$YEAR config/$YEAR/config1 [config/$YEAR/config2] ...
    • Submit each year separately and be careful to use a different output folder for each (see the example below).
    • N.B. it is better to submit one year first and wait until about 50% of its jobs have finished before submitting the others.
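    • For illustration only, assuming the config layout shown above, a possible submission sequence with distinct work areas and output folders per year would be:

      ./submit.py --work-area work-area_2017 --cfg ../python/Production.py --site T2_IT_Legnaro --output hh_bbtautau_prod_v5_2017 config/2017/config1
      # once roughly 50% of the 2017 jobs have finished, submit the next year
      ./submit.py --work-area work-area_2018 --cfg ../python/Production.py --site T2_IT_Legnaro --output hh_bbtautau_prod_v5_2018 config/2018/config1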
  3. Check the job status

    ./multicrab.py --workArea work-area_$YEAR --crabCmd status

    Analyze the output of the status command for each task:

    1. If a few jobs failed without any persistent pattern, resubmit them:
      crab resubmit -d work-area_$YEAR/TASK_AREA
    2. If a significant number of jobs are failing, investigate the reason and take action accordingly. For more details see the CRAB troubleshooting section below.
    3. If all jobs have finished successfully (or >=99% of them), move the task area from the "work-area_$YEAR" directory into the "finished_$YEAR" directory (create it if needed).
      # mkdir -p finished_$YEAR
      mv work-area_$YEAR/TASK_AREA finished_$YEAR

    Before moving the directory, make sure that all jobs have status FAILED or FINISHED; otherwise wait (use the crab kill command, if necessary).

    1. Create task lists for the tasks in the "finished_$YEAR" directory and transfer them to the Pisa server (a transfer sketch is given at the end of this step).

      if [ -f current-check_$YEAR.txt ] ; then rm -f prev-check_$YEAR.txt ; mv current-check_$YEAR.txt prev-check_$YEAR.txt ; fi
      ./create_job_list.sh finished_$YEAR | sort > current-check_$YEAR.txt
    2. For all tasks in 'finished_$YEAR' (especially the ones at 99%), create CRAB reports

      # mkdir -p crab_results_$YEAR
      for JOB in $(ls finished_$YEAR) ; do if [ ! -f "finished_$YEAR/$JOB/results/$JOB.tar.gz" ] ; then echo "finished_$YEAR/$JOB" ; crab report  -d "finished_$YEAR/$JOB" ; tar -czvf crab_results_$YEAR/$JOB.tar.gz finished_$YEAR/$JOB/results/ ; fi ; done

      These reports will allow the task to be partially re-run, should that be required in the future.

    3. Update the prod_v5/$YEAR spreadsheet accordingly, following the notation defined in the Production spreadsheet legend section above. You can use the following command to get a list of the newly finished tasks:

      if [ -f prev-check_$YEAR.txt ] ; then diff current-check_$YEAR.txt prev-check_$YEAR.txt ; else cat current-check_$YEAR.txt ; fi
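    A minimal sketch of the task-list transfer to Pisa mentioned in point 1 above, assuming scp access to the Pisa gridui machine (PISA_USERNAME, PISA_GRIDUI_HOST and PISA_CMSSW_DIR are placeholders for your account, login node and framework area):

      # copy the check list into the src directory of the Pisa installation, where find_new_jobs.sh expects it (see step 5)
      scp current-check_$YEAR.txt $PISA_USERNAME@$PISA_GRIDUI_HOST:$PISA_CMSSW_DIR/src/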
  4. The stage out is done at Legnaro, so before merging, the files must be copied from Legnaro to Pisa. To check the files and transfer them to Pisa, perform the following steps in Pisa:

    1. run voms-proxy-init --voms cms --valid 168:00
    2. Check the output directory on the Legnaro storage element:
      gfal-ls -l "srm://t2-srm-02.lnl.infn.it:8443/srm/managerv2?SFN=/pnfs/lnl.infn.it/data/cms/store/user/LXPLUS_USER/hh_bbtautau_prod_v5_YEAR"
    3. Then copy the files to gridui (where you expect to run the merging code); create a destination folder and copy the samples into it folder by folder (see the loop sketch at the end of this step):
      ./AnalysisTools/Run/python/gfal-cp.py T2_IT_Legnaro:/store/user/mgrippo/hh_bbtautau_prod_v5_$YEAR/$SAMPLE_FOLDER $OUTPUT_FOLDER_IN_GRIDUI_PER_YEAR

    N.B. if the transfer fails before the download is complete, delete the last downloaded file (not all of them) and rerun the script.
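    A minimal sketch of the per-sample copy loop, assuming the Legnaro path shown above (with LXPLUS_USER replaced by the grid username under which the samples were produced) and an already existing $OUTPUT_FOLDER_IN_GRIDUI_PER_YEAR:

      # list the sample folders on the Legnaro storage and copy them one by one
      SRM_PREFIX="srm://t2-srm-02.lnl.infn.it:8443/srm/managerv2?SFN=/pnfs/lnl.infn.it/data/cms/store/user/LXPLUS_USER/hh_bbtautau_prod_v5_$YEAR"
      for SAMPLE_FOLDER in $(gfal-ls "$SRM_PREFIX") ; do
        ./AnalysisTools/Run/python/gfal-cp.py T2_IT_Legnaro:/store/user/LXPLUS_USER/hh_bbtautau_prod_v5_$YEAR/$SAMPLE_FOLDER $OUTPUT_FOLDER_IN_GRIDUI_PER_YEAR
      done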

  5. Submit merge jobs for the output files on the stage out server (the framework must be installed there before running this procedure).

    1. If some merge jobs were already created during a previous iteration, use find_new_jobs.sh to create the list of new jobs to submit.

      • N.B. the current-check_$YEAR/*.txt file(s) have to be transferred from lxplus to the stage out server, into the src directory, in order to run find_new_jobs.sh (see the transfer sketch in step 3).
      ./h-tautau/Instruments/find_new_jobs.sh current-check_$YEAR/finished.txt output_$YEAR/tuples > finished_$YEAR.txt
      • N.B. check that the created job lists do not contain jobs whose merges are still running in the batch system queue (bjobs command on gridui). If there are any, remove them from the list.
    2. Submit the merge jobs in the interactive queue on the Pisa stage out server, on a fai machine (bsub -Is -n 1 -q fai -a "docker-sl6" /bin/bash), using the following command, where CRAB_OUTPUT_PATH is the folder in Pisa into which you copied the files from Legnaro:

       ./h-tautau/Instruments/submit_tuple_hadd.sh interactive finished_$YEAR.txt output_$YEAR/merge CRAB_OUTPUT_PATH
    3. Collect finished jobs (this script can be run as many times as you want).

      ./h-tautau/Instruments/collect_tuple_hadd.sh output_$YEAR/merge output_$YEAR/tuples
  6. Split large merged files into several parts in order to satisfy the CERNBox requirement that file sizes be less than 50 GB (a sketch automating this over all large files is given below).

    # mkdir -p output_$YEAR/tuples_split
    ./h-tautau/Instruments/python/split_tuple_file.py --input output_$YEAR/tuples/SAMPLE.root --output output_$YEAR/tuples_split/SAMPLE.root
    # move the split files back into the original directory
    mv output_$YEAR/tuples_split/SAMPLE*.root output_$YEAR/tuples
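    The splitting can be automated over all oversized samples; a minimal sketch, assuming GNU find and the file layout from above:

      # split only the merged files that exceed the 50 GB limit, then move the pieces back
      mkdir -p output_$YEAR/tuples_split
      for FILE in $(find output_$YEAR/tuples -maxdepth 1 -name '*.root' -size +50G) ; do
        SAMPLE=$(basename "$FILE" .root)
        ./h-tautau/Instruments/python/split_tuple_file.py --input output_$YEAR/tuples/$SAMPLE.root --output output_$YEAR/tuples_split/$SAMPLE.root
        mv output_$YEAR/tuples_split/$SAMPLE*.root output_$YEAR/tuples
      done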
  7. Transfer the tuple files into the local tuple storage in Pisa: /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples${YEAR}_v5/.

    • For 100% complete tasks use Full sub-directory.
    # Full
    rsync -auv --chmod=g+rw --exclude '*sub[0-9].root' --exclude '*recovery[0-9].root' --dry-run output_$YEAR/tuples/*.root /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples${YEAR}_v5/Full
    # if everything is ok, repeat without --dry-run
    rsync -auv --chmod=g+rw --exclude '*sub[0-9].root' --exclude '*recovery[0-9].root' output_$YEAR/tuples/*.root /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples${YEAR}_v5/Full
    • Update prod_v5_$YEAR spreadsheet accordingly.
    • The tuples will then be transferred by the production coordinator into the central prod_v5_$YEAR CERNBox directory: /eos/user/k/kandroso/cms-it-hh-bbtautau/Tuples$YEAR_v4.
  8. Transfer the CRAB results to /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples${YEAR}_v5/crab_results_$YEAR

     scp -r $LXPLUS_USERNAME@lxplus6.cern.ch:/$PATH_FROM_LXPLUS/crab_results_$YEAR /gpfs/ddn/cms/user/androsov/store/cms-it-hh-bbtautau/Tuples${YEAR}_v5/crab_results_$YEAR
  9. When the whole production is over, after a few weeks of safety delay, delete the remaining CRAB output directories and ROOT files in your area to reduce unnecessary storage usage.

CRAB troubleshooting

  1. Common failure reasons

    • Jobs are failing because they exceed the memory limit.

      Solution: resubmit the jobs requesting more memory per job, e.g.:

      crab resubmit --maxmemory 4000 -d work-area_$YEAR/TASK_AREA
    • Jobs are failing on some sites.

      Solution: resubmit the jobs using a site blacklist or whitelist, e.g.:

      crab resubmit --siteblacklist=T2_IT_Pisa -d work-area_$YEAR/TASK_AREA
      # OR
      crab resubmit --sitewhitelist=T2_IT_Pisa -d work-area_$YEAR/TASK_AREA
  2. How to create a recovery task. Do this only if the problem cannot be solved using crab resubmit and the recipes suggested in the previous points. Possible reasons to create a recovery task are:

    • Some jobs persistently exceed the execution time limit, so smaller jobs should be created.
    • There are bugs in the code that are relevant only in rare conditions met by some jobs of the task.
      • If the bug can also affect the successfully finished jobs, the entire task should be re-run from scratch.

    Here are the steps to create a recovery task:

    1. Fix all bugs in the code, if there are any.
    2. Wait until all jobs have either 'finished' or 'failed' status.
    3. Retrieve crab report:
      crab report -d finished-partial/TASK_AREA
    4. Use the file 'results/notFinishedLumis.json' in the task area as the lumi mask for the recovery task. Create the recovery task using submit.py:
      ./submit.py --work-area work-area --cfg ../python/Production.py --site T2_IT_Legnaro --output hh_bbtautau_prod_v5 --jobNames FAILED_TASK_NAME --lumiMask finished-partial/TASK_AREA/results/notFinishedLumis.json --jobNameSuffix _recovery1 FAILED_TASK_CFG
    5. Follow the production workflow procedure above.
  3. Prepare local jobs (when only a few jobs have failed):

    1. Once you have set up the environment, you need the CRAB project directory for your task. If you have already submitted the task, you can simply cd to the project directory created at submission time, or recreate it with the crab remake command.
    2. If you have not yet submitted the task, do it with the --dryrun option. Once the CRAB project directory is created, execute
    crab preparelocal --dir <PROJECTDIR>

    and then execute locally the script that is created.
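
    A minimal sketch of the full sequence, assuming TASK_NAME is the task name reported by CRAB and PROJECTDIR is the corresponding project directory (the name of the runner script that preparelocal generates inside the project directory may vary between CRAB versions, so check its output):

      # recreate the project directory only if it is missing
      crab remake --task <TASK_NAME>
      # prepare the inputs and the script needed to run the jobs locally
      crab preparelocal --dir <PROJECTDIR>
      # then execute the generated script from within <PROJECTDIR>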
