Minimalistic Python framework for training neural networks on HEP data.
Developed to read ROOT files (via uproot) and pass the data to Keras (with a TensorFlow backend) for machine learning.
The framework uses a single text file to quickly prototype different hyperparameter configurations. Plots relevant to our analysis work are generated with hepPlotter. Hopefully this framework is useful for getting started and requires little modification for other users.
Clone the repository and the hepPlotter repository (Asimov uses hepPlotter to make plots):
git clone https://github.com/demarley/asimov.git
git clone https://github.com/demarley/hepPlotter.git
cd hepPlotter/
git checkout tags/v0.4.2 # current compatibility
Please see the examples/ directory for an example of using this framework with data from the Higgs Boson Machine Learning Challenge.
This (simple) framework serves as an interface between HEP data and machine learning libraries (keras). The uproot package allows us to open ROOT data files in a python environment and port the data directly to a pandas dataframe. The dataframe can then be passed to keras as needed to do the training. Using hepPlotter, we can make plots of the features, correlations, etc. to visualize the ML performance.
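As a rough sketch of the first step (not the framework's own code), reading a flat TTree into a dataframe could look like the following. It assumes the uproot 3 interface, matching the uproot version listed in this README; `missing_features` and `load_features` are hypothetical helper names introduced only for illustration.

```python
def missing_features(available, requested):
    """Return the requested feature names that are not among the available branches."""
    # uproot 3 reports branch names as bytes, so decode before comparing
    names = {b.decode() if isinstance(b, bytes) else b for b in available}
    return [f for f in requested if f not in names]


def load_features(path, tree_name, features):
    """Read the requested branches of a flat TTree into a pandas DataFrame."""
    import uproot  # imported lazily so missing_features works without uproot

    tree = uproot.open(path)[tree_name]
    missing = missing_features(tree.keys(), features)
    if missing:
        raise KeyError("branches not found in {}: {}".format(tree_name, missing))
    # uproot 3 interface (assumption); newer uproot releases use
    # tree.arrays(features, library="pd") instead
    return tree.pandas.df(features)
```

The returned dataframe holds one column per feature and one row per TTree entry, which is the shape Keras expects once converted to a numpy array (e.g., with `df.values`).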
The input ROOT file is assumed to be flat: each 'event' in the TTree contains the branches necessary for training
(the branch names in the TTree must match the feature names provided in the configuration file).
If you're designing an algorithm that discriminates objects, e.g., jets, then each 'event' in the TTree must represent each jet, rather than the actual physics event.
Please inspect example.root in the examples/ directory for more information.
NB: Supporting other input structures is an area for future development, but the current simplicity of this setup prevents it. Furthermore, the author uses their existing workflow (a C++ environment) to generate flat ntuples.
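To illustrate what "one entry per jet" means, here is a small sketch that flattens per-event jet lists into per-jet rows. It is purely illustrative: the author produces flat ntuples in a C++ workflow, and the function name `flatten_jets` and the keys `event_number`, `jet_pt`, and `jet_eta` are hypothetical.

```python
def flatten_jets(events):
    """Turn a list of events, each holding several jets, into one row per jet."""
    rows = []
    for event in events:
        # jet_pt and jet_eta are parallel lists: index i describes jet i
        for pt, eta in zip(event["jet_pt"], event["jet_eta"]):
            rows.append({
                "event_number": event["event_number"],
                "jet_pt": pt,
                "jet_eta": eta,
            })
    return rows


# Two physics events (hypothetical values) containing three jets in total
events = [
    {"event_number": 1, "jet_pt": [50.0, 30.0], "jet_eta": [0.1, -1.2]},
    {"event_number": 2, "jet_pt": [75.0], "jet_eta": [2.0]},
]
rows = flatten_jets(events)
```

Each element of `rows` then plays the role of one 'event' in the flat TTree, with `jet_pt` and `jet_eta` as the branches used for training.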
Files | Description |
---|---|
foundation.py (training.py and inference.py inherit from this) | Base class |
empire.py | Plotting class |
config.py | Configuration class (reads text file and sets NN framework) |
util.py | Misc. utility functions |
A single text file dictates the NN architecture, what data to process, where to store outputs, and what features to use, among other things.
An example is provided here: examples/example_config.txt.
In this file, the list of features is comma-separated and matches the branches in example.root that we want to use for the training (noted above).
The class python/config.py reads the text file and stores the relevant data for use by the various NN classes.
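The idea of a text-driven configuration can be sketched as follows. This is not the code in python/config.py: it assumes a hypothetical layout of one `key value` pair per line, and the keys `features`, `epochs`, and `output_path` are invented for illustration. The real format is the one used in examples/example_config.txt.

```python
def parse_config(text):
    """Parse 'key value' lines into a dict; blank lines and '#' comments are skipped."""
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        key, _, value = line.partition(" ")   # split on the first space only
        config[key] = value.strip()
    return config


example = """
# hypothetical keys, for illustration only
features mass_MMC,feature_b   # comma-separated branch names
epochs 10
output_path results/
"""

config = parse_config(example)
features = [name.strip() for name in config["features"].split(",")]
epochs = int(config["epochs"])
```

Keeping every value as a string in the dictionary and converting at the point of use (as done for `features` and `epochs`) keeps the parser small and leaves type decisions to the class that consumes each option.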
To apply further selection on the dataframe, you can create a list of strings that will be parsed and then used to select events from the dataframe.
# slices = ['BRANCH <OP> VALUE',...]
# where : 'BRANCH' is the branch name in the root file
# : <OP> is the mathematical operator, e.g., '>' or '<='
# : 'VALUE' is the value the branch is being compared to
# e.g., for the examples directory:
slices = ['mass_MMC > 0'] # this would ensure you don't train on events with mass_MMC=-999.
dnn.preprocess_data(slices)
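For readers curious how such strings can be turned into a selection, here is a minimal standalone sketch. It is not the implementation behind `preprocess_data`; the names `parse_slice` and `apply_slices` are hypothetical, rows are plain dictionaries rather than a dataframe, and it assumes the three parts of each string are separated by whitespace as in the example above.

```python
import operator

# Supported comparison operators, keyed by their text form
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def parse_slice(text):
    """Split 'BRANCH <OP> VALUE' into (branch, comparison function, float value)."""
    branch, op, value = text.split()
    if op not in OPS:
        raise ValueError("unsupported operator: {}".format(op))
    return branch, OPS[op], float(value)


def apply_slices(rows, slices):
    """Keep only the rows that pass every selection in `slices`."""
    cuts = [parse_slice(s) for s in slices]
    return [row for row in rows
            if all(compare(row[branch], value) for branch, compare, value in cuts)]


# Hypothetical values; -999 marks an undefined mass_MMC
rows = [{"mass_MMC": 125.3}, {"mass_MMC": -999.0}, {"mass_MMC": 91.2}]
selected = apply_slices(rows, ["mass_MMC > 0"])
```

Mapping operator strings to functions from the `operator` module avoids calling `eval` on user-provided text, which is a common design choice for this kind of small selection language.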
The model and figures are saved for further inspection and use in a C++ production environment. The LWTNN framework is the default output option, but it is possible to save the model in other forms.
This software has been developed for a custom computing environment. The Anaconda python installation is used to manage python libraries.
Module | Version |
---|---|
conda | 4.4.10 |
matplotlib | 2.2.2 |
numpy | 1.14.1 |
keras | 2.0.8 (with Tensorflow backend) |
tensorflow | 1.4.1 (tensorflow-gpu) |
uproot | 3.2.5 (used in hepPlotter) |
hepPlotter | v0.4.2 (developed using Goldilocks) |
cuda | V9.0.176 |
Furthermore, this setup has access to an NVIDIA 1080Ti (with NVIDIA-SMI 390.87).
NB: For those interested, it is possible to run this setup using Google Colab.
Feel free to have a look at asimov_demo.ipynb.
This is modeled after the example notebook 0-simple.ipynb that uses the data from the Higgs Boson ML Challenge.
There are a few known issues (e.g., LaTeX isn't available for plot labels), but the code runs rather successfully on both TPUs and GPUs.
Please submit an issue or PR.