All running is done at the Minnesota Supercomputing Institute (MSI). Connect to the Mangi (V100 nodes) or Agate (A100 nodes) cluster. Please visit https://www.msi.umn.edu/ for more information on connecting.
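# Create a working area and install Anaconda (the herestring supplies the answers to the installer prompts)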
mkdir -p Train
cd Train
curl -O https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
bash Anaconda3-2019.10-Linux-x86_64.sh <<< $'\nyes\n~/anaconda3\nyes\n'
rm Anaconda3-2019.10-Linux-x86_64.sh
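# Source the LPC GPU setup, activate Anaconda, and build the tf conda environment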
source /cvmfs/cms-lpc.opensciencegrid.org/sl7/gpu/Setup.sh
source ~/anaconda3/bin/activate
conda update -n base -c defaults conda <<< $'y\n'
conda create -n tf python=3.7 anaconda <<< $'y\n'
conda activate tf
conda install -n tf libgcc pandas scikit-learn tensorboard tensorflow=2.2.0 tensorflow-gpu Keras=2.4.3 matplotlib numpy=1.18.5 dask h5py protobuf pydot pytorch torchvision cudatoolkit <<< $'y\n'
conda install -c conda-forge shap
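# Install the remaining python dependencies with pip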
pip install uproot
pip install coffea
pip install mplhep==0.1.35
pip install pypi
pip install matplotlib==3.3.0
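As a quick sanity check that the environment is functional and that TensorFlow can see a GPU (run this on a GPU node; login nodes typically have no GPU):
python -c "import tensorflow as tf; print(tf.__version__); print(tf.config.list_physical_devices('GPU'))"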
Get the analysis code from GitHub and rsync over the ROOT file NN inputs from the LPC:
cd Train
git clone [email protected]:StealthStop/DeepESM.git
cd DeepESM
rsync -r <lpcuser>@cmslpc120.fnal.gov:/uscmst1b_scratch/lpc1/3DayLifetime/<some path> .
# Sets environment parameters
source deepenv.sh
On the MSI system, one can run interactive jobs, which run on GPU nodes and whose output is returned to the user's terminal. This is most useful for debugging and testing the code. Interactive running is performed with the srun command, which lets the user allocate a custom amount of CPU/GPU/RAM resources, as well as a time limit, for running their program. An example srun call would be of the form:
srun -u \
-t 0:40:00 \
-p interactive-gpu \
--gres=gpu:k40:1 \
--mem-per-cpu=30G \
python train.py --saveAndPrint --procCats --njetsCats --massCats --minMass 350 --maxMass 1150 --evalMass 550 --trainModel RPV --evalModel RPV --trainYear 2016preVFP --seed 527725 --tree myMiniTree_1l --nJets 7 --inputs UL_NN_inputs/
where 40 minutes of time are requested on an interactive GPU node in the Mangi cluster (k40; for the Agate cluster one would use a40), along with 30 GB of RAM per CPU for loading events from disk. The final argument is the full python call to the executable, in this case train.py.
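Alternatively, one can request an interactive shell with the same resources and iterate on the training by hand (a sketch assuming standard Slurm options; --pty attaches a pseudo-terminal):
srun -u -t 0:40:00 -p interactive-gpu --gres=gpu:k40:1 --mem-per-cpu=30G --pty bash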
The train.py arguments are detailed below:
usage: usage: %prog [options] [-h] [--quickVal] [--json JSON]
[--minMass MINMASS] [--maxMass MAXMASS]
[--evalMass EVALMASS] [--evalModel EVALMODEL]
[--evalYear EVALYEAR] [--trainModel TRAINMODEL]
[--replay] [--trainYear TRAINYEAR]
[--inputs INPUTS] [--tree TREE] [--saveAndPrint]
[--seed SEED] [--nJets NJETS] [--debug]
[--scaleJetPt] [--useJECs]
[--maskNjet MASKNJET [MASKNJET ...]]
[--procCats] [--massCats] [--njetsCats]
[--outputDir OUTPUTDIR]
optional arguments:
-h, --help show this help message and exit
--quickVal Do quick (partial) validation
--json JSON JSON config file
--minMass MINMASS Minimum stop mass to train on
--maxMass MAXMASS Maximum stop mass to train on
--evalMass EVALMASS Stop mass to evaluate on
--evalModel EVALMODEL
Signal model to evaluate on
--evalYear EVALYEAR Year(s) to eval on
--trainModel TRAINMODEL
Signal model to train on
--replay Replay saved model
--trainYear TRAINYEAR
Year(s) to train on
--inputs INPUTS Path to input files
--tree TREE TTree to load events from
--saveAndPrint Save pb and print model
--seed SEED Use specific seed for env
--nJets NJETS Minimum number of jets
--debug Debug with small set of events
--scaleJetPt Scale Jet pt by HT
--useJECs Use JEC/JER variations
--maskNjet MASKNJET [MASKNJET ...]
mask Njet bin(s) in training
--procCats Balance batches bkg/sig
--massCats Balance batches among masses
--njetsCats Balance batches among njets
--outputDir OUTPUTDIR
Output directory path
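For a fast end-to-end check before requesting a long allocation, the --debug and --quickVal options above can be combined with the usual arguments (a sketch; the values mirror the earlier example and are illustrative):
python train.py --debug --quickVal \
    --minMass 350 --maxMass 1150 --evalMass 550 \
    --trainModel RPV --evalModel RPV \
    --trainYear 2016preVFP --evalYear 2016preVFP \
    --seed 527725 --tree myMiniTree_1l --nJets 7 --inputs UL_NN_inputs/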
The most powerful use case is submitting NN training jobs in batch to the Mangi or Agate GPU clusters. This is achieved using the boboTrain.py script, whose arguments are detailed as follows:
usage: boboTrain.py [-h] [--trainBkgd TRAINBKGD [TRAINBKGD ...]]
[--trainModel TRAINMODEL]
[--evalBkgd EVALBKGD [EVALBKGD ...]]
[--evalModel EVALMODEL]
[--trainMass TRAINMASS [TRAINMASS ...]]
[--evalMass EVALMASS] [--tag TAG]
[--bcorr BCORR [BCORR ...]] [--disc DISC [DISC ...]]
[--abcd ABCD [ABCD ...]] [--reg REG [REG ...]]
[--nodes NODES [NODES ...]] [--reglr REGLR [REGLR ...]]
[--disclr DISCLR [DISCLR ...]]
[--factors FACTORS [FACTORS ...]]
[--epochs EPOCHS [EPOCHS ...]] [--trainYear TRAINYEAR]
[--evalYear EVALYEAR] [--seed SEED] [--channel CHANNEL]
[--noSubmit] [--cluster CLUSTER] [--memory MEMORY]
[--walltime WALLTIME] [--useJECs] [--nJets NJETS]
[--maskNjet MASKNJET [MASKNJET ...]] [--procCats]
[--massCats] [--njetsCats] [--saveAndPrint]
[--inputs INPUTS]
optional arguments:
-h, --help show this help message and exit
--trainBkgd TRAINBKGD [TRAINBKGD ...]
which bkgd to train on
--trainModel TRAINMODEL
which sig to train on
--evalBkgd EVALBKGD [EVALBKGD ...]
which bkgd to validate on
--evalModel EVALMODEL
which model to validate on
--trainMass TRAINMASS [TRAINMASS ...]
lower and upper mass range bounds
--evalMass EVALMASS which mass point to validate on
--tag TAG tag to use in output
--bcorr BCORR [BCORR ...]
list of bcorr lambda values
--disc DISC [DISC ...]
list of disc lambda values
--abcd ABCD [ABCD ...]
list of abcd lambda values
--reg REG [REG ...] list of reg lambda values
--nodes NODES [NODES ...]
list of nodes values
--reglr REGLR [REGLR ...]
regression lr
--disclr DISCLR [DISCLR ...]
disc lr
--factors FACTORS [FACTORS ...]
list of factors to multiply
--epochs EPOCHS [EPOCHS ...]
how many epochs
--trainYear TRAINYEAR
which year(s) to train on
--evalYear EVALYEAR which year to eval on
--seed SEED which seed to init with
--channel CHANNEL which decay channel
--noSubmit do not submit to cluster
--cluster CLUSTER which cluster to run on
--memory MEMORY how much mem to request
--walltime WALLTIME how much time to request
--useJECs use JEC/JER variation events
--nJets NJETS Minimum number of jets
--maskNjet MASKNJET [MASKNJET ...]
mask Njet bin/bins in training
--procCats Balance batches bkg/sig
--massCats Balance batches among masses
--njetsCats Balance batches among njets
--saveAndPrint Save pb and print model
--inputs INPUTS which inputs files to use
An example call to boboTrain.py would be:
python boboTrain.py --saveAndPrint \
--procCats \
--njetsCats \
--useJECs \
--channel 1l \
--epochs 15 20 25 \
--bcorr 1000 2000 \
--disc 1.0 3.0 2.0 5.0 \
--abcd 1.0 2.0 3.0 5.0 \
--disclr 0.001 \
--reg 0.0001 \
--reglr 1.0 \
--trainYear Run2 \
--trainModel RPV \
--evalModel RPV \
--evalMass 550 \
--evalYear 2016preVFP \
--tag Run2_RPV \
--inputs UL_NN_inputs/ \
--memory 50gb \
--walltime 01:30:00 \
--cluster a100-4 \
--noSubmit
Running this command will generate a Run2_RPV_<unique_timestamp> folder in ./batch. No jobs have been submitted yet; that is achieved by going to ./batch/Run2_RPV_<unique_timestamp> and running qsub job_submit.pbs. The status of the jobs can be checked with the command qstat -a -f -M -u $USER, as summarized below.
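Putting these steps together:
cd batch/Run2_RPV_<unique_timestamp>
qsub job_submit.pbs
qstat -a -f -M -u $USER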
Occasionally, some jobs may encounter a segmentation violation or another problem (e.g. insufficient resource allocation) and stop running. A resubmit.py script is provided to generate a new .pbs submission file containing only the jobs that did not complete successfully. Its arguments are detailed below:
usage: resubmit.py [-h] [--jobDir JOBDIR] [--cluster CLUSTER]
[--memory MEMORY] [--walltime WALLTIME]
optional arguments:
-h, --help show this help message and exit
--jobDir JOBDIR Directory where jobs submitted from
--cluster CLUSTER which cluster to run on
--memory MEMORY how much mem to request
--walltime WALLTIME how much time to request
At this juncture, the user may also request a different cluster, or a different amount of memory or walltime, when doing the resubmission.
An example call would be of the form:
python resubmit.py --jobDir Run2_RPV_<unique_timestamp> --memory 75gb
where the user resubmits the jobs from the job directory used above, now requesting 75 GB of memory. Again, no jobs have been submitted yet; the user navigates to the respective job directory and calls qsub job_resubmit.pbs, as summarized below.
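Putting the resubmission steps together (assuming resubmit.py is run from the same directory as boboTrain.py):
python resubmit.py --jobDir Run2_RPV_<unique_timestamp> --memory 75gb
cd batch/Run2_RPV_<unique_timestamp>
qsub job_resubmit.pbs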
A plotting script, ttVsSigNN_mini.py, is provided to make plots of the NN inputs from the ntuple files. Arguments to the script are:
--approved : is the plot approved?
--path : Path to ntuples files
--tree : TTree name to use
--year : which year
--mass1 : mass 1 to show
--mass2 : mass 2 to show
--model1 : model 1 to show
--model2 : model 2 to show
An example call to the script could be:
python ttVsSigNN_mini.py --year 2016 --path /path/to/ntuples/files --mass1 350 --model1 RPV --mass2 500 --model2 StealthSYY
A python script, parseNNjobs.py, is provided to grab the plots for each neural network job and make a two-slide summary; the per-job summaries are concatenated into one set of LaTeX slides. Some primitive logic is available to sort the trainings by a metric; currently the metric is a chi2 calculation comparing the ABCD-predicted number of events in region A to the actual number of events in A, based on fixed ABCD region boundaries. Thus, the first NN jobs in the slides demonstrate the best ABCD closure.
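As a rough point of reference, the metric is of the following form (a minimal sketch; the function name and the Poisson uncertainty treatment here are assumptions, and the actual calculation lives in parseNNjobs.py):
def abcd_closure_chi2(nA, nB, nC, nD):
    # Predict the yield in region A from the other three regions
    # using the standard ABCD relation: A_pred = B * C / D
    nA_pred = nB * nC / nD
    # Simple Poisson variance on observed plus predicted yields
    # (an assumption, not necessarily what parseNNjobs.py uses)
    sigma2 = nA + nA_pred
    # Chi2 comparing prediction to observation in region A
    return (nA - nA_pred) ** 2 / sigma2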
The script expects a certain folder structure for the NN jobs, of the form:
<main_folder_with_tex_file>/<collection_of_NN_jobs>/<individual_NN_job>
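For example (the individual job names here are illustrative):
main_folder_with_tex_file/
    collection_of_NN_jobs/
        NN_job_1/
        NN_job_2/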
An example call to the script is:
python parseNNjobs.py --inputDir main_folder_with_tex_file --subdir collection_of_NN_jobs --title "Fancy title for slides"
Once a NN training configuration has been chosen for use in the StealthStop analysis framework, a release needs to be made in DeepESMCfg (see that repository for further information on making a release). A script, make_DoubleDisCo_cfgFile.py, is provided here to help make the .cfg files and tar up the .pb file, ready for sending to DeepESMCfg. The script has the arguments:
usage: usage: %prog [options] [-h] --year YEAR --path PATH --model MODEL --channel CHANNEL --version VERSION
optional arguments:
-h, --help show this help message and exit
--year YEAR year that NN is trained for
--path PATH Input dir with pb and json from training
--model MODEL signal model that NN is trained for
--channel CHANNEL channel that NN is trained for
--version VERSION versioning tag for local organization
where an example call to the script would be
python make_DoubleDisCo_cfgFile.py --year Run2 --path Output/atag_1l_MyFavRPV_lots_of_hyperparams/ --channel 1l --model RPV --version v1.2
This would make a folder DoubleDisCo_Reg_1l_RPV_Run2_v1.2 that contains two .cfg files and a .tar file. The .cfg files are to be pushed to DeepESMCfg, while the tar file should be uploaded when a new tag is made.