Merge pull request #2 from yaourtpourtoi/dev
[Upd] Generalisation to single model training
Oleg Filatov authored Aug 8, 2021
2 parents 93b0db5 + 4e4d8c7 commit 9d213c4
Showing 8 changed files with 127 additions and 71 deletions.
4 changes: 0 additions & 4 deletions .gitignore

This file was deleted.

25 changes: 19 additions & 6 deletions README.md
@@ -59,16 +59,18 @@ To train and track the model create an experiment (unless already done) and run
* a corresponding entry point (with `-e` option, defaults to `main`)
* name of the experiment for the run to be assigned to (`--experiment-name test`)
* `--no-conda` to avoid creating new conda environment and running from there
* mlflow params with their values (`-P num_iterations=5`, optional, see `MLproject` for all of them and their default values)
* mlflow params with their values (e.g. `-P num_iterations=5` or `-P n_splits=2`, optional, see `MLproject` for all of them and their default values)
* project directory (`.` - current)

```bash
mlflow run --experiment-name test -P year=2018 -P num_iterations=5 --no-conda .
mlflow run --experiment-name test -P year=2018 -P num_iterations=5 -P n_splits=2 --no-conda .
```

*Note*: Oppositely to the manual installation, running `mlflow run` without `--no-conda` flag automatically creates a conda environment from `conda.yaml` cfg file and runs the code from there.

`mlflow` takes care of logging and saving all the basic information about the training, including the model and optional metrics/artifacts (if specified in `train.py`). By default, this is logged into the `mlruns/{experiment_ID}/{run_ID}` folder inside the framework directory.
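As a quick sanity check that a run was logged, the trained model can be pulled back out of the tracking directory by its run ID. A minimal sketch, assuming the model was logged under the artifact path `model` (check the run's artifacts in the UI for the actual path); the run ID below is a placeholder:

```python
import os
import mlflow
import mlflow.lightgbm

# point mlflow to the local tracking directory used by the framework
mlflow.set_tracking_uri(f"file://{os.path.abspath('mlruns')}")

run_id = "abcd"  # placeholder, insert the corresponding run ID here
model = mlflow.lightgbm.load_model(f"runs:/{run_id}/model")  # artifact path 'model' is an assumption
```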

_P.S.:_ Unlike the manual installation, running `mlflow run` without the `--no-conda` flag automatically creates a conda environment from the `conda.yaml` cfg file and runs the code from there.
It is important to note that the training is implemented in an **N-fold manner** (also referred to as *cross-training*). The input data set will be split into `n_splits` folds (as defined in the training cfg file) and `n_splits` models will be trained, where `model_{i}` uses `fold_{i}` only for metric validation during the training, not for the training itself. The folds are indexed by the remainder of the division of the `xtrain_split_feature` column in the input data set by `n_splits`. If `n_splits=1` is set, only one model will be trained, on a `train_size` fraction of the input data set, while the rest of it will be used for validation of loss/metrics during the training.
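For illustration, the split logic described above can be sketched on a toy DataFrame (column names and values below are made up for the example; the actual implementation lives in `train.py`):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut, ShuffleSplit

# toy stand-in for the training DataFrame; 'evt' plays the role of xtrain_split_feature
df = pd.DataFrame({'evt': np.arange(20), 'x': np.random.randn(20)})

n_splits = 2
if n_splits > 1:
    # fold index = remainder of the split feature divided by n_splits
    df['fold_id'] = (df['evt'] % n_splits).astype('int32')
    splitter = LeaveOneGroupOut()
    idx_yielder = splitter.split(df, groups=df['fold_id'])
else:
    # single model: random split on a train_size fraction, the rest is used for validation
    splitter = ShuffleSplit(n_splits=1, train_size=0.9, random_state=1357)
    idx_yielder = splitter.split(df)

for i_fold, (train_idx, validation_idx) in enumerate(idx_yielder):
    # model_{i_fold} uses validation_idx only for metric validation, not for training
    print(i_fold, len(train_idx), len(validation_idx))
```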

## Tracking results
Once the training is done, `mlflow` provides a [UI](https://www.mlflow.org/docs/latest/tracking.html#tracking-ui) to inspect and compare the logged results across experiments/runs. First, in case the code is run on a remote machine, find out its hostname with:
@@ -94,13 +94,17 @@ ssh -N -f -L localhost:${LOCAL_PORT_ID}:localhost:${REMOTE_PORT_ID} ${SERVER}
Then one can access the `mlflow` UI locally by going to http://localhost:5010 in a browser (here, `5010` is the local port id taken from the code snippet example above).

## Making predictions
Given the trained model, one can now produce predictions for further inference for the given set of `hdf5` files (skimmed by `preprocess.py`). This is performed with `predict.py` script which loads the model with `mlflow` given its `experiment_ID` and `run_ID`, opens each of the input fold files with `FoldYielder` and passes the data to the model. The output in the form of _maximum class probability_ and the _corresponding class_ along with `misc_features` is saved into the output ROOT file through an [`RDataFrame`](https://root.cern/doc/master/classROOT_1_1RDataFrame.html) class. Lastly, `predict.py` uses the configuration file `configs/predict.yaml` to fetch the necessary parameters, e.g. the list input files or `run_ID`. Note, that the default name of the config file is specified in `@hydra.main()` decorator inside of `predict.py` and not required to be passed in the command line. That is, to produce predictions corresponding to 2018 year, mlflow run "abcd" and other parameters from `configs/predict.yaml` as default, execute:
Given the trained model, one can now produce predictions for further inference for a given set of `hdf5` files (skimmed by `preprocess.py`). This is performed with the `predict.py` script, which loads the model(s) with `mlflow` given the corresponding `experiment_ID` and `run_ID`, opens each of the input fold files with `FoldYielder` and passes the data to the model(s).

The prediction workflow is also implemented in an N-fold fashion, which should be transparent to the user, similarly to the training step. The number of splits is inferred from the `mlflow` logs for the corresponding run ID, so the prediction split strategy automatically adapts to the training split strategy. Conceptually, only the `mlflow` run ID and the path to the input data are needed to produce predictions (_maximum class probability_ and the _corresponding class_, plus `misc_features`).
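The fold-wise inference itself is done by `predict_folds` from `utils.inference`, which is not shown in this commit; conceptually it behaves roughly like the sketch below (assuming `models` is a list indexed by fold ID and that each model returns per-class probabilities):

```python
import numpy as np

def cross_predict(df, train_features, fold_id_column, models):
    """Conceptual stand-in for utils.inference.predict_folds:
    each event is scored by the model that did NOT see its fold during training."""
    pred_class_proba = np.zeros(len(df))
    pred_class = np.zeros(len(df), dtype='int32')
    for i_fold, model in enumerate(models):
        mask = (df[fold_id_column] == i_fold).to_numpy()
        proba = model.predict(df.loc[mask, train_features])  # shape: (n_events_in_fold, n_classes)
        pred_class_proba[mask] = proba.max(axis=1)
        pred_class[mask] = proba.argmax(axis=1)
    return {'pred_class_proba': pred_class_proba, 'pred_class': pred_class}
```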

There are two possible outputs (each configured with its own cfg file) which can be created at the prediction step: one of the kind `for_datacards` and the other `for_evaluation`. For both of them the predictions are produced in the same way, but they are saved in different file formats. For example, in case of the `for_datacards` option:

```bash
python predict.py year=2018 mlflow_runID=abcd # insert the corresponding run ID here
python predict.py --config-name for_datacards.yaml year=2018 mlflow_runID=None # insert the corresponding run ID here
```

This will produce in `output_path` ROOT files with predictions, which can be now used in the next steps of the analysis. For example, using [`TTree` friends](https://root.cern.ch/root/htmldoc/guides/users-guide/Trees.html#example-3-adding-friends-to-trees) they can be easily augment the original input ROOT files as a new branch based on a common index variable (`evt` in the example below):
It will produce in `output_path` one ROOT file per `sample_name`, with the predictions saved therein to a TTree named `output_tree_name`. To do that, the [`RDataFrame`](https://root.cern/doc/master/classROOT_1_1RDataFrame.html) class is used to snapshot a python dictionary of prediction arrays into ROOT files. These files can then be used in the next steps of the analysis, e.g. to produce datacards. Using [`TTree` friends](https://root.cern.ch/root/htmldoc/guides/users-guide/Trees.html#example-3-adding-friends-to-trees) might be especially helpful here to augment the original input ROOT files with the predictions as a new branch (`evt` in the example below serves as the common index):
```cpp
TFile *f = new TFile("file_pred.root","READ");
TFile *F = new TFile("file_main.root","READ");
// (collapsed in the diff) retrieve the trees from the two files; tree names below are assumed for illustration
TTree *t = nullptr; f->GetObject("output_tree", t);
TTree *T = nullptr; F->GetObject("main_tree", T);
T->BuildIndex("evt");
T->AddFriend(t);
T->Scan("pred_class_proba:evt");
```
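For reference, the snapshot step in `predict.py` amounts to converting the dictionary of prediction arrays into an `RDataFrame` and writing it out; a minimal sketch with placeholder branch, tree and file names:

```python
import numpy as np
import ROOT as R

# hypothetical prediction dictionary; in predict.py it comes from predict_folds()
pred_dict = {
    'evt': np.arange(10, dtype='int64'),
    'pred_class': np.random.randint(0, 3, 10).astype('int32'),
    'pred_class_proba': np.random.rand(10),
}

r_df = R.RDF.MakeNumpyDataFrame(pred_dict)
r_df.Snapshot('output_tree', 'file_pred.root')  # tree/file names are placeholders
```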
The second option, `for_evaluation`, is implemented in order to run the next step of estimating the model performance:
```bash
python predict.py --config-name for_evaluation.yaml year=2018 mlflow_runID=None # insert the corresponding run ID here
```

Here, for the given input files (from `input_path`, as always preprocessed with `preprocess.py`) and a given training (from `mlflow_runID`), predictions will be saved into `.csv` files under `output_path` and also logged under exactly the same `mlflow` run ID. After that, by referring to a single `mlflow` run ID, the predictions can be fetched automatically from the `mlflow` logs and a dashboard with various metrics and plots can be produced with the `evaluate.py` script (WIP).
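A minimal sketch of how the logged predictions could later be fetched back from the run (the run ID is a placeholder, and `evaluate.py` itself is still WIP):

```python
import os
import pandas as pd
from mlflow.tracking import MlflowClient

run_id = "abcd"  # placeholder, insert the corresponding run ID here
client = MlflowClient(tracking_uri=f"file://{os.path.abspath('mlruns')}")

# predictions were logged with artifact_path='pred' in predict.py
local_dir = client.download_artifacts(run_id, "pred")
dfs = {f: pd.read_csv(os.path.join(local_dir, f)) for f in os.listdir(local_dir)}
```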
2 changes: 2 additions & 0 deletions configs/predict.yaml → configs/predict/for_datacards.yaml
@@ -1,3 +1,5 @@
kind: 'for_datacards'

# input path & info
year: 2018
input_path: 'data/{year}/multi/skims'
20 changes: 20 additions & 0 deletions configs/predict/for_evaluation.yaml
@@ -0,0 +1,20 @@
kind: 'for_evaluation'

# input path & info
year: 2018
input_path: 'data/{year}/multi/skims'
input_filename_template: '{sample_name}.hdf5'
sample_names:
- train
- test

misc_features: # will be added to the output files along with the prediction columns
- evt

# model & scaling pipe to make predictions
mlflow_experimentID: 1
mlflow_runID: 13e050761d8f40b69e2379b1c04da6ba

# output path & names
output_path: 'data/{year}/multi/pred'
output_filename_template: '{sample_name}_{year}_pred.csv'
2 changes: 1 addition & 1 deletion configs/train.yaml
@@ -1,8 +1,8 @@
# path to the train/test data & scaling pipe
year: 2018
train_file: 'data/{year}/multi/skims/train.hdf5'
xtrain_split_feature: 'evt' # xtrain = cross-training
n_splits: 2 # will split training data into len(set(xtrain_split_feature % n_splits)) parts and train a separate model on each
xtrain_split_feature: 'evt' # used only if n_splits > 1; xtrain = cross-training
train_size: 0.9 # used only if n_splits=1, otherwise will use a left-out fold for testing

# features to be used for training
65 changes: 38 additions & 27 deletions predict.py
@@ -6,14 +6,14 @@
from omegaconf import OmegaConf, DictConfig

import ROOT as R
import uproot
from lumin.nn.data.fold_yielder import FoldYielder
import numpy as np
import pandas as pd
import mlflow

from utils.processing import fill_placeholders
from utils.inference import load_models, predict_folds

@hydra.main(config_path="configs", config_name="predict")
@hydra.main(config_path="configs/predict")
def main(cfg: DictConfig) -> None:
# fill placeholders in the cfg parameters
input_path = to_absolute_path(fill_placeholders(cfg.input_path, {'{year}': cfg.year}))
@@ -31,31 +31,42 @@ def main(cfg: DictConfig) -> None:
model_cfg = yaml.safe_load(f)
for s in model_cfg['signature']['inputs'].split('{')[1:]:
train_features.append(s.split("\"")[3])
fold_id_column = 'fold_id'

    # loop over input fold files
    for sample_name in cfg.sample_names:
        print(f'\n--> Predicting {sample_name}')
        print(f"  loading data set")
        input_filename = fill_placeholders(cfg.input_filename_template, {'{sample_name}': sample_name, '{year}': cfg.year})

        # extract DataFrame from fold file
        fy = FoldYielder(f'{input_path}/{input_filename}')
        df = fy.get_df(inc_inputs=True, deprocess=False, nan_to_num=False, verbose=False, suppress_warn=True)
        for f in misc_features: # add misc features
            df[f] = fy.get_column(f)
        df['fold_id'] = df[xtrain_split_feature] % n_splits

        # run cross-inference for folds
        pred_dict = predict_folds(df, train_features, misc_features, 'fold_id', models)

        # store predictions in RDataFrame and snapshot it into output ROOT file
        print(f"  storing to output file")
        output_filename = fill_placeholders(cfg.output_filename_template, {'{sample_name}': sample_name, '{year}': cfg.year})
        if os.path.exists(f'{output_path}/{output_filename}'):
            os.system(f'rm {output_path}/{output_filename}')
        R_df = R.RDF.MakeNumpyDataFrame(pred_dict)
        R_df.Snapshot(cfg.output_tree_name, f'{output_path}/{output_filename}')
        del(df, R_df); gc.collect()
    mlflow.set_tracking_uri(f"file://{to_absolute_path('mlruns')}")
    with mlflow.start_run(run_id=cfg.mlflow_runID):
        # loop over input fold files
        for sample_name in cfg.sample_names:
            print(f'\n--> Predicting {sample_name}')
            print(f"  loading data set")
            input_filename = fill_placeholders(cfg.input_filename_template, {'{sample_name}': sample_name, '{year}': cfg.year})

            # extract DataFrame from fold file
            fy = FoldYielder(f'{input_path}/{input_filename}')
            df = fy.get_df(inc_inputs=True, deprocess=False, nan_to_num=False, verbose=False, suppress_warn=True)
            for f in misc_features: # add misc features
                df[f] = fy.get_column(f)
            df[fold_id_column] = (df[xtrain_split_feature] % n_splits).astype('int32')

            # run cross-inference for folds
            pred_dict = predict_folds(df, train_features, misc_features, fold_id_column=fold_id_column, models=models)

            print(f"  storing to output file")
            output_filename = fill_placeholders(cfg.output_filename_template, {'{sample_name}': sample_name, '{year}': cfg.year})
            if os.path.exists(f'{output_path}/{output_filename}'):
                os.system(f'rm {output_path}/{output_filename}')
            if cfg.kind == 'for_datacards':
                # store predictions in RDataFrame and snapshot it into output ROOT file
                R_df = R.RDF.MakeNumpyDataFrame(pred_dict)
                R_df.Snapshot(cfg.output_tree_name, f'{output_path}/{output_filename}')
                del(df, R_df); gc.collect()
            elif cfg.kind == 'for_evaluation':
                df_pred = pd.DataFrame(pred_dict)
                df_pred.to_csv(f'{output_path}/{output_filename}')
                mlflow.log_artifact(f'{output_path}/{output_filename}', artifact_path='pred')
                del(df_pred); gc.collect()
            else:
                raise Exception(f'Unknown kind for prediction: {cfg.kind}')

if __name__ == '__main__':
main()
19 changes: 9 additions & 10 deletions train.py
@@ -3,7 +3,6 @@
import matplotlib.pyplot as plt

from sklearn.model_selection import LeaveOneGroupOut, ShuffleSplit
from sklearn.metrics import accuracy_score, log_loss
import lightgbm as lgb
import mlflow
import mlflow.lightgbm
@@ -16,39 +15,39 @@
from omegaconf import OmegaConf, DictConfig

from utils.processing import fill_placeholders
from utils.plotting import plot_class_score

@hydra.main(config_path="configs", config_name="train")
def main(cfg: DictConfig) -> None:
# prepare train data
# load training data into DataFrame + add necessary columns
print('\n--> Loading training data')
train_file = fill_placeholders(to_absolute_path(cfg.train_file), {'{year}': cfg.year})
train_fy = FoldYielder(train_file)
train_df = train_fy.get_df(inc_inputs=True, deprocess=False, nan_to_num=False, verbose=False, suppress_warn=True)
train_df['w_cp'] = train_fy.get_column('w_cp')
train_df['w_class_imbalance'] = train_fy.get_column('w_class_imbalance')

# fetch feature/weight/target names
# define feature/weight/target names
train_features = cfg.cont_features + cfg.cat_features # features to be used in training
weight_name = cfg.weight_name
target_name = 'gen_target' # internal target name defined inside of lumin library
fold_id_column = 'fold_id'

if cfg.n_splits > 1:
assert type(cfg.n_splits)==int
train_df[cfg.xtrain_split_feature] = train_fy.get_column(cfg.xtrain_split_feature)
train_df['fold_id'] = (train_df[cfg.xtrain_split_feature] % cfg.n_splits).astype('int32')
split_feature_values = train_fy.get_column(cfg.xtrain_split_feature)
train_df[fold_id_column] = (split_feature_values % cfg.n_splits).astype('int32')

# check that there is no more than 5% difference between folds in terms of number of entries
fold_id_count_diff = np.std(train_df['fold_id'].value_counts()) / np.mean(train_df['fold_id'].value_counts())
fold_id_count_diff = np.std(train_df[fold_id_column].value_counts()) / np.mean(train_df[fold_id_column].value_counts())
if fold_id_count_diff > 0.05:
raise Exception(f'Observed {fold_id_count_diff * 100}% relative difference in number of entries across folds. Please check that the split is done equally.')

print(f'\n[INFO] Will split training data set into ({cfg.n_splits}) folds over values of ({cfg.xtrain_split_feature}) feature to perform cross-training')
splitter = LeaveOneGroupOut()
idx_yielder = splitter.split(train_df, groups=train_df['fold_id'])
idx_yielder = splitter.split(train_df, groups=train_df[fold_id_column])
elif cfg.n_splits == 1:
print(f'\n[INFO] Will train a single model on ({cfg.train_size}) part of the training data set with the rest used for validation')
train_df['fold_id'] = 0
train_df[fold_id_column] = 0
splitter = ShuffleSplit(n_splits=1, train_size=cfg.train_size, random_state=1357)
idx_yielder = splitter.split(train_df)
else:
@@ -71,7 +70,7 @@ def main(cfg: DictConfig) -> None:
validation_fold_df = train_df.iloc[validation_idx]

# check that `i_fold` is the same as fold ID corresponding to a validation fold
validation_fold_idx = set(validation_fold_df['fold_id'])
validation_fold_idx = set(validation_fold_df[fold_id_column])
assert len(validation_fold_idx)==1 and i_fold in validation_fold_idx

# construct lightgbm dataset