Merge pull request #2 from yaourtpourtoi/dev
[Upd] Generalisation to single model training
Oleg Filatov authored Aug 8, 2021
2 parents 93b0db5 + 4e4d8c7 commit 9d213c4
Showing 8 changed files with 127 additions and 71 deletions.
4 changes: 0 additions & 4 deletions .gitignore

This file was deleted.

25 changes: 19 additions & 6 deletions README.md
@@ -59,16 +59,18 @@ To train and track the model create an experiment (unless already done) and run
* a corresponding entry point (with `-e` option, defaults to `main`)
* name of the experiment for the run to be assigned to (`--experiment-name test`)
* `--no-conda` to avoid creating new conda environment and running from there
* mlflow params with their values (`-P num_iterations=5`, optional, see `MLproject` for all of them and their default values)
* mlflow params with their values (e.g. `-P num_iterations=5` or `-P n_splits=2`, optional, see `MLproject` for all of them and their default values)
* project directory (`.` - current)

```bash
mlflow run --experiment-name test -P year=2018 -P num_iterations=5 --no-conda .
mlflow run --experiment-name test -P year=2018 -P num_iterations=5 -P n_splits=2 --no-conda .
```

*Note*: Oppositely to the manual installation, running `mlflow run` without `--no-conda` flag automatically creates a conda environment from `conda.yaml` cfg file and runs the code from there.

`mlflow` takes care of logging and saving all the basic information about the training, including the model and optional metrics/artifacts (if specified in `train.py`). By default, this is logged into the `mlruns/{experiment_ID}/{run_ID}` folder inside the framework directory.
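As a quick sanity check that a run was logged, the trained model can be pulled back out of the tracking directory by its run ID. A minimal sketch, assuming the model was logged under the artifact path `model` (check the run's artifacts in the UI for the actual path); the run ID below is a placeholder:

```python
import os
import mlflow
import mlflow.lightgbm

# point mlflow to the local tracking directory used by the framework
mlflow.set_tracking_uri(f"file://{os.path.abspath('mlruns')}")

run_id = "abcd"  # placeholder, insert the corresponding run ID here
model = mlflow.lightgbm.load_model(f"runs:/{run_id}/model")  # artifact path 'model' is an assumption
```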

_P.S.:_ Unlike the manual installation, running `mlflow run` without the `--no-conda` flag automatically creates a conda environment from the `conda.yaml` cfg file and runs the code from there.
It is important to note that the training is implemented in an **N-fold manner** (also referred to as *cross-training*). The input data set will be split into `n_splits` folds (as defined in the training cfg file) and `n_splits` models will be trained, where `model_{i}` uses `fold_{i}` only for metric validation during the training, not for the training itself. The folds are indexed by the remainder of the division of the `xtrain_split_feature` column in the input data set by `n_splits`. If `n_splits=1` is set, only one model will be trained, on a `train_size` fraction of the input data set, while the rest of it will be used for validation of loss/metrics during the training.
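For illustration, the split logic described above can be sketched on a toy DataFrame (column names and values below are made up for the example; the actual implementation lives in `train.py`):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut, ShuffleSplit

# toy stand-in for the training DataFrame; 'evt' plays the role of xtrain_split_feature
df = pd.DataFrame({'evt': np.arange(20), 'x': np.random.randn(20)})

n_splits = 2
if n_splits > 1:
    # fold index = remainder of the split feature divided by n_splits
    df['fold_id'] = (df['evt'] % n_splits).astype('int32')
    splitter = LeaveOneGroupOut()
    idx_yielder = splitter.split(df, groups=df['fold_id'])
else:
    # single model: random split on a train_size fraction, the rest is used for validation
    splitter = ShuffleSplit(n_splits=1, train_size=0.9, random_state=1357)
    idx_yielder = splitter.split(df)

for i_fold, (train_idx, validation_idx) in enumerate(idx_yielder):
    # model_{i_fold} uses validation_idx only for metric validation, not for training
    print(i_fold, len(train_idx), len(validation_idx))
```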

## Tracking results
Once the training is done, `mlflow` provides a [UI](https://www.mlflow.org/docs/latest/tracking.html#tracking-ui) to inspect and compare the logged results across experiments/runs. First, in case the code is run on a remote machine, find out its hostname with:
@@ -94,13 +94,17 @@ ssh -N -f -L localhost:${LOCAL_PORT_ID}:localhost:${REMOTE_PORT_ID} ${SERVER}
Then one can access the `mlflow` UI locally by going to http://localhost:5010 in a browser (here, `5010` is the local port id taken from the code snippet example above).

## Making predictions
Given the trained model, one can now produce predictions for further inference for the given set of `hdf5` files (skimmed by `preprocess.py`). This is performed with `predict.py` script which loads the model with `mlflow` given its `experiment_ID` and `run_ID`, opens each of the input fold files with `FoldYielder` and passes the data to the model. The output in the form of _maximum class probability_ and the _corresponding class_ along with `misc_features` is saved into the output ROOT file through an [`RDataFrame`](https://root.cern/doc/master/classROOT_1_1RDataFrame.html) class. Lastly, `predict.py` uses the configuration file `configs/predict.yaml` to fetch the necessary parameters, e.g. the list input files or `run_ID`. Note, that the default name of the config file is specified in `@hydra.main()` decorator inside of `predict.py` and not required to be passed in the command line. That is, to produce predictions corresponding to 2018 year, mlflow run "abcd" and other parameters from `configs/predict.yaml` as default, execute:
Given the trained model, one can now produce predictions for further inference for a given set of `hdf5` files (skimmed by `preprocess.py`). This is performed with the `predict.py` script, which loads the model(s) with `mlflow` given the corresponding `experiment_ID` and `run_ID`, opens each of the input fold files with `FoldYielder` and passes the data to the model(s).

The prediction workflow is also implemented in an N-fold fashion, which should be transparent to the user, similarly to the training step. The number of splits is inferred from the `mlflow` logs for the corresponding run ID, so the prediction split strategy automatically adapts to the training split strategy. Conceptually, only the `mlflow` run ID and the path to the input data are needed to produce predictions (_maximum class probability_ and the _corresponding class_, plus `misc_features`).
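The fold-wise inference itself is done by `predict_folds` from `utils.inference`, which is not shown in this commit; conceptually it behaves roughly like the sketch below (assuming `models` is a list indexed by fold ID and that each model returns per-class probabilities):

```python
import numpy as np

def cross_predict(df, train_features, fold_id_column, models):
    """Conceptual stand-in for utils.inference.predict_folds:
    each event is scored by the model that did NOT see its fold during training."""
    pred_class_proba = np.zeros(len(df))
    pred_class = np.zeros(len(df), dtype='int32')
    for i_fold, model in enumerate(models):
        mask = (df[fold_id_column] == i_fold).to_numpy()
        proba = model.predict(df.loc[mask, train_features])  # shape: (n_events_in_fold, n_classes)
        pred_class_proba[mask] = proba.max(axis=1)
        pred_class[mask] = proba.argmax(axis=1)
    return {'pred_class_proba': pred_class_proba, 'pred_class': pred_class}
```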

There are two possible outputs (each configured with its own cfg file) which can be created at the prediction step: one of the kind `for_datacards` and the other `for_evaluation`. For both of them the predictions are produced in the same way, but they are saved in different file formats. For example, in case of the `for_datacards` option:

```bash
python predict.py year=2018 mlflow_runID=abcd # insert the corresponding run ID here
python predict.py --config-name for_datacards.yaml year=2018 mlflow_runID=None # insert the corresponding run ID here
```

This will produce in `output_path` ROOT files with predictions, which can be now used in the next steps of the analysis. For example, using [`TTree` friends](https://root.cern.ch/root/htmldoc/guides/users-guide/Trees.html#example-3-adding-friends-to-trees) they can be easily augment the original input ROOT files as a new branch based on a common index variable (`evt` in the example below):
It will produce in `output_path` one ROOT file per `sample_name`, with the predictions saved therein to a TTree named `output_tree_name`. To do that, the [`RDataFrame`](https://root.cern/doc/master/classROOT_1_1RDataFrame.html) class is used to snapshot a python dictionary of prediction arrays into ROOT files. These files can then be used in the next steps of the analysis, e.g. to produce datacards. Using [`TTree` friends](https://root.cern.ch/root/htmldoc/guides/users-guide/Trees.html#example-3-adding-friends-to-trees) might be especially helpful here to augment the original input ROOT files with the predictions as a new branch (`evt` in the example below serves as the common index):
```cpp
TFile *f = new TFile("file_pred.root","READ");
TFile *F = new TFile("file_main.root","READ");
// (collapsed in the diff) retrieve the trees from the two files; tree names below are assumed for illustration
TTree *t = nullptr; f->GetObject("output_tree", t);
TTree *T = nullptr; F->GetObject("main_tree", T);
T->BuildIndex("evt");
T->AddFriend(t);
T->Scan("pred_class_proba:evt");
```
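For reference, the snapshot step in `predict.py` amounts to converting the dictionary of prediction arrays into an `RDataFrame` and writing it out; a minimal sketch with placeholder branch, tree and file names:

```python
import numpy as np
import ROOT as R

# hypothetical prediction dictionary; in predict.py it comes from predict_folds()
pred_dict = {
    'evt': np.arange(10, dtype='int64'),
    'pred_class': np.random.randint(0, 3, 10).astype('int32'),
    'pred_class_proba': np.random.rand(10),
}

r_df = R.RDF.MakeNumpyDataFrame(pred_dict)
r_df.Snapshot('output_tree', 'file_pred.root')  # tree/file names are placeholders
```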
The second option, `for_evaluation`, is implemented in order to run the next step of estimating the model performance:
```bash
python predict.py --config-name for_evaluation.yaml year=2018 mlflow_runID=None # insert the corresponding run ID here
```

Here, for the given input files (from `input_path`, as always preprocessed with `preprocess.py`) and a given training (from `mlflow_runID`), predictions will be saved into `.csv` files under `output_path` and also logged under exactly the same `mlflow` run ID. After that, by referring to a single `mlflow` run ID, the predictions can be fetched automatically from the `mlflow` logs and a dashboard with various metrics and plots can be produced with the `evaluate.py` script (WIP).
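A minimal sketch of how the logged predictions could later be fetched back from the run (the run ID is a placeholder, and `evaluate.py` itself is still WIP):

```python
import os
import pandas as pd
from mlflow.tracking import MlflowClient

run_id = "abcd"  # placeholder, insert the corresponding run ID here
client = MlflowClient(tracking_uri=f"file://{os.path.abspath('mlruns')}")

# predictions were logged with artifact_path='pred' in predict.py
local_dir = client.download_artifacts(run_id, "pred")
dfs = {f: pd.read_csv(os.path.join(local_dir, f)) for f in os.listdir(local_dir)}
```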
2 changes: 2 additions & 0 deletions configs/predict.yaml → configs/predict/for_datacards.yaml
@@ -1,3 +1,5 @@
kind: 'for_datacards'

# input path & info
year: 2018
input_path: 'data/{year}/multi/skims'
20 changes: 20 additions & 0 deletions configs/predict/for_evaluation.yaml
@@ -0,0 +1,20 @@
kind: 'for_evaluation'

# input path & info
year: 2018
input_path: 'data/{year}/multi/skims'
input_filename_template: '{sample_name}.hdf5'
sample_names:
- train
- test

misc_features: # will be added to the output files along with the prediction columns
- evt

# model & scaling pipe to make predictions
mlflow_experimentID: 1
mlflow_runID: 13e050761d8f40b69e2379b1c04da6ba

# output path & names
output_path: 'data/{year}/multi/pred'
output_filename_template: '{sample_name}_{year}_pred.csv'
2 changes: 1 addition & 1 deletion configs/train.yaml
@@ -1,8 +1,8 @@
# path to the train/test data & scaling pipe
year: 2018
train_file: 'data/{year}/multi/skims/train.hdf5'
xtrain_split_feature: 'evt' # xtrain = cross-training
n_splits: 2 # will split training data into len(set(xtrain_split_feature % n_splits)) parts and train a separate model on each
xtrain_split_feature: 'evt' # used only if n_splits > 1; xtrain = cross-training
train_size: 0.9 # used only if n_splits=1, otherwise will use a left-out fold for testing

# features to be used for training
65 changes: 38 additions & 27 deletions predict.py
@@ -6,14 +6,14 @@
from omegaconf import OmegaConf, DictConfig

import ROOT as R
import uproot
from lumin.nn.data.fold_yielder import FoldYielder
import numpy as np
import pandas as pd
import mlflow

from utils.processing import fill_placeholders
from utils.inference import load_models, predict_folds

@hydra.main(config_path="configs", config_name="predict")
@hydra.main(config_path="configs/predict")
def main(cfg: DictConfig) -> None:
# fill placeholders in the cfg parameters
input_path = to_absolute_path(fill_placeholders(cfg.input_path, {'{year}': cfg.year}))
@@ -31,31 +31,42 @@ def main(cfg: DictConfig) -> None:
model_cfg = yaml.safe_load(f)
for s in model_cfg['signature']['inputs'].split('{')[1:]:
train_features.append(s.split("\"")[3])
fold_id_column = 'fold_id'

    # loop over input fold files
    for sample_name in cfg.sample_names:
        print(f'\n--> Predicting {sample_name}')
        print(f"  loading data set")
        input_filename = fill_placeholders(cfg.input_filename_template, {'{sample_name}': sample_name, '{year}': cfg.year})

        # extract DataFrame from fold file
        fy = FoldYielder(f'{input_path}/{input_filename}')
        df = fy.get_df(inc_inputs=True, deprocess=False, nan_to_num=False, verbose=False, suppress_warn=True)
        for f in misc_features: # add misc features
            df[f] = fy.get_column(f)
        df['fold_id'] = df[xtrain_split_feature] % n_splits

        # run cross-inference for folds
        pred_dict = predict_folds(df, train_features, misc_features, 'fold_id', models)

        # store predictions in RDataFrame and snapshot it into output ROOT file
        print(f"  storing to output file")
        output_filename = fill_placeholders(cfg.output_filename_template, {'{sample_name}': sample_name, '{year}': cfg.year})
        if os.path.exists(f'{output_path}/{output_filename}'):
            os.system(f'rm {output_path}/{output_filename}')
        R_df = R.RDF.MakeNumpyDataFrame(pred_dict)
        R_df.Snapshot(cfg.output_tree_name, f'{output_path}/{output_filename}')
        del(df, R_df); gc.collect()
    mlflow.set_tracking_uri(f"file://{to_absolute_path('mlruns')}")
    with mlflow.start_run(run_id=cfg.mlflow_runID):
        # loop over input fold files
        for sample_name in cfg.sample_names:
            print(f'\n--> Predicting {sample_name}')
            print(f"  loading data set")
            input_filename = fill_placeholders(cfg.input_filename_template, {'{sample_name}': sample_name, '{year}': cfg.year})

            # extract DataFrame from fold file
            fy = FoldYielder(f'{input_path}/{input_filename}')
            df = fy.get_df(inc_inputs=True, deprocess=False, nan_to_num=False, verbose=False, suppress_warn=True)
            for f in misc_features: # add misc features
                df[f] = fy.get_column(f)
            df[fold_id_column] = (df[xtrain_split_feature] % n_splits).astype('int32')

            # run cross-inference for folds
            pred_dict = predict_folds(df, train_features, misc_features, fold_id_column=fold_id_column, models=models)

            print(f"  storing to output file")
            output_filename = fill_placeholders(cfg.output_filename_template, {'{sample_name}': sample_name, '{year}': cfg.year})
            if os.path.exists(f'{output_path}/{output_filename}'):
                os.system(f'rm {output_path}/{output_filename}')
            if cfg.kind == 'for_datacards':
                # store predictions in RDataFrame and snapshot it into output ROOT file
                R_df = R.RDF.MakeNumpyDataFrame(pred_dict)
                R_df.Snapshot(cfg.output_tree_name, f'{output_path}/{output_filename}')
                del(df, R_df); gc.collect()
            elif cfg.kind == 'for_evaluation':
                df_pred = pd.DataFrame(pred_dict)
                df_pred.to_csv(f'{output_path}/{output_filename}')
                mlflow.log_artifact(f'{output_path}/{output_filename}', artifact_path='pred')
                del(df_pred); gc.collect()
            else:
                raise Exception(f'Unknown kind for prediction: {cfg.kind}')

if __name__ == '__main__':
main()
19 changes: 9 additions & 10 deletions train.py
@@ -3,7 +3,6 @@
import matplotlib.pyplot as plt

from sklearn.model_selection import LeaveOneGroupOut, ShuffleSplit
from sklearn.metrics import accuracy_score, log_loss
import lightgbm as lgb
import mlflow
import mlflow.lightgbm
@@ -16,39 +15,39 @@
from omegaconf import OmegaConf, DictConfig

from utils.processing import fill_placeholders
from utils.plotting import plot_class_score

@hydra.main(config_path="configs", config_name="train")
def main(cfg: DictConfig) -> None:
# prepare train data
# load training data into DataFrame + add necessary columns
print('\n--> Loading training data')
train_file = fill_placeholders(to_absolute_path(cfg.train_file), {'{year}': cfg.year})
train_fy = FoldYielder(train_file)
train_df = train_fy.get_df(inc_inputs=True, deprocess=False, nan_to_num=False, verbose=False, suppress_warn=True)
train_df['w_cp'] = train_fy.get_column('w_cp')
train_df['w_class_imbalance'] = train_fy.get_column('w_class_imbalance')

# fetch feature/weight/target names
# define feature/weight/target names
train_features = cfg.cont_features + cfg.cat_features # features to be used in training
weight_name = cfg.weight_name
target_name = 'gen_target' # internal target name defined inside of lumin library
fold_id_column = 'fold_id'

if cfg.n_splits > 1:
assert type(cfg.n_splits)==int
train_df[cfg.xtrain_split_feature] = train_fy.get_column(cfg.xtrain_split_feature)
train_df['fold_id'] = (train_df[cfg.xtrain_split_feature] % cfg.n_splits).astype('int32')
split_feature_values = train_fy.get_column(cfg.xtrain_split_feature)
train_df[fold_id_column] = (split_feature_values % cfg.n_splits).astype('int32')

# check that there is no more than 5% difference between folds in terms of number of entries
fold_id_count_diff = np.std(train_df['fold_id'].value_counts()) / np.mean(train_df['fold_id'].value_counts())
fold_id_count_diff = np.std(train_df[fold_id_column].value_counts()) / np.mean(train_df[fold_id_column].value_counts())
if fold_id_count_diff > 0.05:
raise Exception(f'Observed {fold_id_count_diff * 100}% relative difference in number of entries across folds. Please check that the split is done equally.')

print(f'\n[INFO] Will split training data set into ({cfg.n_splits}) folds over values of ({cfg.xtrain_split_feature}) feature to perform cross-training')
splitter = LeaveOneGroupOut()
idx_yielder = splitter.split(train_df, groups=train_df['fold_id'])
idx_yielder = splitter.split(train_df, groups=train_df[fold_id_column])
elif cfg.n_splits == 1:
print(f'\n[INFO] Will train a single model on ({cfg.train_size}) part of the training data set with the rest used for validation')
train_df['fold_id'] = 0
train_df[fold_id_column] = 0
splitter = ShuffleSplit(n_splits=1, train_size=cfg.train_size, random_state=1357)
idx_yielder = splitter.split(train_df)
else:
@@ -71,7 +70,7 @@ def main(cfg: DictConfig) -> None:
validation_fold_df = train_df.iloc[validation_idx]

# check that `i_fold` is the same as fold ID corresponding to a validation fold
validation_fold_idx = set(validation_fold_df['fold_id'])
validation_fold_idx = set(validation_fold_df[fold_id_column])
assert len(validation_fold_idx)==1 and i_fold in validation_fold_idx

# construct lightgbm dataset