diff --git a/general_advice/after/after.html b/general_advice/after/after.html
index 185d52a..10486e1 100644
--- a/general_advice/after/after.html
+++ b/general_advice/after/after.html
@@ -1 +1 @@
- After training - CMS Machine Learning Documentation
After the necessary steps to design the ML experiment have been taken and the training has been performed and verified to be stable and consistent, there are still a few things to check to further solidify confidence in the model performance.
Before the training, the initial data set should be split into train and test parts, where the former is used to train the model (possibly with cross-validation), while the latter remains blinded. Once all optimisations of the model architecture have been made and the model is "frozen", one proceeds to evaluate the metrics on the test set. This is the very last check of the model for overfitting: if there is none, one expects to see little or no difference compared to the values obtained on the (cross-)validation set used throughout the training. In turn, any discrepancy could point to overfitting happening at the training stage (or possibly to data leakage), which requires further investigation.
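As a minimal sketch of this check (toy data and scikit-learn estimators are used purely for illustration, not as a prescription), one can compare the metric obtained with cross-validation on the training part against the same metric on the blinded test part:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score

# Toy stand-in for the analysis data set
X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingClassifier(random_state=0)

# Metric estimated with cross-validation on the training part
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")

# Metric on the blinded test part, evaluated once the model is "frozen"
model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f"CV ROC AUC:   {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
print(f"Test ROC AUC: {test_auc:.3f}")
# A test value well below the CV band hints at overfitting or data leakage.
```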
The next step is to check the output score of the model (probability1) for each class. This can be done, for example, in the form of a TMVA-like overtraining check (see Figure 1), which also allows one to spot overtraining:
In general, what is important to look at is that in the category for class C (defined as argmax(score_i)), the score for class C peaks at values close to 1, whereas the other classes do not have this property: their distributions peak further away from 1 and fall smoothly towards zero as the model score in that category approaches 1. In other words, the distributions of the model score for the various classes should overlap as little as possible and be as far apart as possible. This is an indication that the model indeed distinguishes between the classes.
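A minimal sketch of such an overtraining check for a binary classifier (toy data and illustrative settings): overlay the normalised score distributions per class for the training and test sets, and look both for class separation and for train/test consistency.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
score_train = clf.predict_proba(X_train)[:, 1]
score_test = clf.predict_proba(X_test)[:, 1]

for label, name in [(1, "signal"), (0, "background")]:
    # Filled histograms: training set; step histograms: test set
    plt.hist(score_train[y_train == label], bins=30, range=(0, 1), density=True,
             alpha=0.4, label=f"{name} (train)")
    plt.hist(score_test[y_test == label], bins=30, range=(0, 1), density=True,
             histtype="step", linewidth=2, label=f"{name} (test)")

plt.xlabel("model score")
plt.ylabel("normalised entries")
plt.legend()
plt.show()
# Well-separated classes with matching train/test shapes are the desired outcome;
# a visible train/test discrepancy within a class is a symptom of overtraining.
```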
Another thing to look at is the data/simulation agreement for the class categories. Since it is the output of the model in each category that is used in the subsequent statistical inference step, it is important to verify that the data/simulation agreement of the input features is properly propagated through the model into the categories' distributions. This can be achieved by producing a plot similar to the one shown in Figure 2: the stacked templates for background processes are fitted and compared with the actual data for the set of events classified into the given category (jet fakes in the example). If the data/simulation agreement at the output is worse than at the input, it might point to a bias in the way the model treats data and simulation events.
Once there is high confidence that the model isn't overtrained and no distortion of the input-feature data/MC agreement is introduced, one can consider studying the robustness of the model to parameter/input variations. Effectively, the model can be considered as a "point estimate", and any variations are helpful to understand the variance of the model outputs - hence, the model's robustness to changes.
A simple example would be hyperparameter optimisation, where various model parameters are varied to find the best configuration in terms of performance. Moreover, in HEP there is the helpful (for this particular case) notion of systematic uncertainties, which is a perfect tool to study model robustness to input data variations.
Since these uncertainties need to be incorporated into the final statistical fit (performed on some interpretation of the model score) in any case, they have to be "propagated" through the model. A sizeable fraction of them are so-called "up/down" (or shape) variations, which makes them a good opportunity to study how the model output responds to up/down changes of the input features. If a high sensitivity is observed, one should consider removing the most influential feature from the training, or trying decorrelation techniques to decrease the impact of the systematics-affected feature on the model output.
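A rough sketch of such a robustness study (the choice of feature and the ±1σ shift are illustrative assumptions, not a prescription): re-evaluate the trained model on copies of the input where a systematics-affected feature is shifted up and down, and compare the resulting scores with the nominal ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X, y)

feature_idx = 3                      # index of the systematics-affected feature (illustrative)
sigma = X[:, feature_idx].std()      # toy stand-in for the size of the systematic variation

scores = {}
for name, shift in [("nominal", 0.0), ("up", +1.0), ("down", -1.0)]:
    X_var = X.copy()
    X_var[:, feature_idx] += shift * sigma   # toy stand-in for an up/down template
    scores[name] = clf.predict_proba(X_var)[:, 1]

# Quantify the response of the output to the input variation, e.g. via the mean shift
for name in ("up", "down"):
    delta = scores[name].mean() - scores["nominal"].mean()
    print(f"{name:>5}: mean score shift = {delta:+.4f}")
# Large shifts relative to the nominal score spread motivate dropping or
# decorrelating the feature.
```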
Lastly, possible systematic biases arising from the ML approach should be estimated. Since this is a broad and not fully formalised topic, a few examples are given below to outline possible sources of such biases.
The first one could be a domain shift, i.e. the situation where the model is trained on one data domain but applied to a different one (e.g. trained on simulated data, applied to real data). To account for that, corresponding scale factor corrections are traditionally derived, and those come with some uncertainty as well.
Another example would be the case of undertraining. Consider fitting complex polynomial data with a simple linear function: the model then has high bias (and low variance), which results in a systematic shift of its predictions that needs to be taken into account.
Care needs to be taken in cases where a cut is applied on the model output. Cuts might potentially introduce shifts, and since the model score is a variable with a complex and non-linear relationship to the input features, this might create undesirable biases. For example, when cutting on the output score and looking at an invariant mass distribution (e.g. of two jets), one can observe an effect known as mass sculpting (see Figure 3). In that case, the background distribution starts peaking at the mass of the resonance used as the signal in the classification task. After applying such a cut, the signal and background shapes overlap and become very similar, which dilutes the discrimination power between the two hypotheses if the invariant mass were to be used as the observable in the fit.
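A toy sketch of how such sculpting can be spotted (all distributions, the resonance mass and the tightness of the cut are invented for illustration): train a classifier on an input correlated with the mass and compare the background mass shape before and after a tight score cut.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 20000

# Toy background: smoothly falling mass spectrum; toy signal: resonance at 125 GeV
m_bkg = 50 + rng.exponential(scale=60, size=n)
m_sig = rng.normal(loc=125, scale=10, size=n)

# Input feature deliberately correlated with the mass (the cause of the sculpting)
X = np.concatenate([m_bkg, m_sig])[:, None] + rng.normal(0, 5, size=(2 * n, 1))
y = np.concatenate([np.zeros(n), np.ones(n)])

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
score_bkg = clf.predict_proba(X[:n])[:, 1]     # scores of the background events

# Background mass shape before and after a tight cut on the score
cut = np.quantile(score_bkg, 0.95)             # keep only the 5% highest-score events
bins = np.linspace(50, 300, 60)
plt.hist(m_bkg, bins=bins, density=True, histtype="step", label="background, no cut")
plt.hist(m_bkg[score_bkg > cut], bins=bins, density=True, histtype="step",
         label="background, tight score cut")
plt.xlabel("invariant mass [GeV]")
plt.ylabel("normalised entries")
plt.legend()
plt.show()
# After the cut the background peaks near 125 GeV and mimics the signal shape:
# this is the mass sculpting one wants to spot (and possibly decorrelate away).
```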
Here it is assumed that the model score can be treated as the probability of belonging to a given class. This is mostly the case if a sigmoid/softmax activation is used in the output layer of the neural network and the model is trained with a cross-entropy loss function. ↩
Last update: December 5, 2023
\ No newline at end of file
+ After training - CMS Machine Learning Documentation (content identical to the version above)
diff --git a/general_advice/before/domains.html b/general_advice/before/domains.html
index 3319f51..43183b7 100644
--- a/general_advice/before/domains.html
+++ b/general_advice/before/domains.html
@@ -1 +1 @@
- Domains - CMS Machine Learning Documentation
Data plays a crucial role in the process of training any ML model. It is something from which the model learns to solve a given task and therefore care needs to be taken with its handling. There are two main considerations when collecting and preparing data for an ML task:
The data set should be relevant to the problem and should represent the underlying structure of the problem without containing potential biases and irrelevant deviations (e.g. MC simulation artefacts).
A proper preprocessing of the data set should be performed so that the training step goes smoothly.
In this section a general domain perspective on data will be covered. In the following sections a more granular look will be taken from the side of features and construction of inputs to the model.
To begin with, one needs to bear in mind that the training data should be as close as possible to the data one expects to encounter in the context of the analysis. Speaking in more formal terms,
Domains of training (used to train the model) and inference (used to make final predictions) data sets should not sizeably diverge.
Examples
In most cases the model is trained on MC simulated data and later applied to data to produce predictions, which are then passed on to the statistical inference step. MC simulation isn't perfect, so there are always differences between the simulation and data domains. This can lead to cases where the model learns simulation artefacts, coming e.g. from detector response mismodelling. Its performance on data may then be at best suboptimal and at worst meaningless.
Consider a model which is trained to predict the energy of a hadron given its energy deposits in the calorimeter (represented e.g. in the form of an image or a graph). The data consist of showers initiated by particles produced with a particle gun at discrete energies (e.g. 1 GeV, 10 GeV, 20 GeV, etc.). However, in a real-world setting the model will be applied to showers produced by particles with an underlying continuous energy spectrum. Although ML models are capable of interpolating between training points, without appropriate tests the model performance in the parts of the energy spectrum outside of its training domain is not a priori clear.
It is not easy to build a model that is entirely robust to domain shift, and there is no general framework yet to address and correct for discrepancies between training and inference domains altogether. However, research in this direction is ongoing, and several methods to correct for specific deviations have already been proposed.
It is a widely known practice to introduce scale factor (SF) corrections to account for possible discrepancies between data and MC simulation. Effectively, that means that the model is probed on a part of the domain on which it wasn't trained (data) and then corrected for any differences by using a meaningful set of observables to derive SFs. One particularly promising approach to remedy the data/MC domain difference is to use adversarial techniques to fully leverage the multidimensionality of the problem, as described in the DeepSF note.
Another solution would be to incorporate methods of domain adaptation into the ML pipeline, which essentially guide the model to be invariant and robust towards domain shift. Particularly in HEP, the Learning to Pivot with Adversarial Networks paper was one of the pioneering works investigating how a pile-up dependency can be mitigated, and it can also easily be extended to building a model robust to domain shift1.
Last but not least, the usage of Bayesian neural networks has the great advantage of providing an uncertainty estimate along with each prediction. If these uncertainties are significantly larger for some samples, this could indicate that they come from a domain beyond the training one (so-called out-of-distribution samples). Such post hoc analysis of prediction uncertainties can, for example, point to inconsistencies in, or incompleteness of, the MC simulation or the data-driven methods of background estimation.
Furthermore, analyses nowadays search for very rare processes and are therefore interested in sparsely populated regions of the phase space. Even though the domain of interest may be covered by the training data set, it may not be sufficiently covered in terms of the number of training samples populating those regions. That makes the model behaviour on events falling into those regions unpredictable, because the model couldn't learn how to generalise there due to a lack of data to learn from. Therefore,
It is important to make sure that the phase space of interest is well-represented in the training data set.
Example
This is what is often called in HEP jargon "little statistics in the tails": too few events can be found in the tails of the corresponding distribution, e.g. in the high-pt region. This might be important because the topology of events changes as one enters the high-pt region of the phase space (aka the boosted regime), which means that the model should be able to capture this change in the event signature. However, it might fail to do so because of the little data available to learn from compared to the low-pt region.
Clearly, a way out in that case would be to provide enough training data to cover those regions (also ensuring that the model has enough capacity to embrace diverse and complex topologies).
Another solution would be to communicate to the model the importance of specific topologies, which can be done for example by upweighting those events' contribution to the loss function (a minimal sketch follows after this list).
Lastly, it might be worth trying to train several models, each targeting its specific region, instead of a general-purpose one (e.g. a low-pt tagger plus a boosted/merged-topology tagger). Effectively, factorisation into various regions disentangles the problem of their separation for a single model and delegates it to an ensemble of dedicated models, each targeting its own region.
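Coming back to the upweighting option above, a minimal sketch of passing per-event weights into the loss with a scikit-learn-style estimator (the pt proxy, the threshold and the weight value are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=10000, n_features=10, random_state=0)

# Pretend column 0 is the (standardised) pt of the leading object
pt = X[:, 0]

# Give events in the sparsely populated high-pt tail a larger say in the loss
weights = np.ones(len(y))
weights[pt > np.quantile(pt, 0.95)] = 5.0   # illustrative upweighting factor

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X, y, sample_weight=weights)        # weighted contributions to the loss
```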
From that paper on, the HEP community started to explore a similar topic of model decorrelation, i.e. how to build a model which would be invariant to a particular variable or property of data. For a more detailed overview please refer to Section 2 of this paper. ↩
Last update: December 5, 2023
\ No newline at end of file
+ Domains - CMS Machine Learning Documentation (content identical to the version above)
diff --git a/general_advice/before/features.html b/general_advice/before/features.html
index 09d60e4..3721c49 100644
--- a/general_advice/before/features.html
+++ b/general_advice/before/features.html
@@ -1 +1 @@
- Features - CMS Machine Learning Documentation
In the previous section, the data was considered from a general "domain" perspective, while in this section a more low-level view will be outlined. In particular, the emphasis will be on features (input variables), as they play a crucial role in the training of any ML model. Essentially being the handle on, and the gateway into, the data for the model, they are expected to reflect the data from the perspective which is important to the problem at hand and therefore define the model performance on the task.
The topic of feature engineering is too extensive and complex to be covered in this section, so the emphasis will be primarily on the general aspects relevant to the HEP context. Broadly speaking, one should ask themselves the following questions during data preparation:
Clearly, one should motivate for themselves (and then possibly for analysis reviewers) why this exact set of features, and not another, has been selected1. Aside from physical understanding and intuition, it is good if a priori expert knowledge is supplemented by running further experiments.
Here one can consider studies done either prior to the training or after it. As for the former, studying feature correlations (also with the target variable), e.g. by computing Pearson and/or Spearman correlation coefficients and plotting several histogram/scatter plots, could bring some helpful insights. As for the latter, exploring feature importances as estimated by the trained model can boost the understanding of both the data and the model.
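A minimal sketch of such a pre-training correlation study (toy data and illustrative feature names):

```python
import pandas as pd
from sklearn.datasets import make_classification

# Toy stand-in for the analysis ntuple: a few features plus the target label
X, y = make_classification(n_samples=5000, n_features=5, random_state=0)
df = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(5)])
df["target"] = y

# Linear (Pearson) and rank (Spearman) correlations, including with the target
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

print(pearson["target"].sort_values(ascending=False))
print(spearman["target"].sort_values(ascending=False))
# Highly correlated feature pairs are candidates for pruning; features with
# negligible correlation to the target deserve extra scrutiny (though they may
# still matter through non-linear combinations).
```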
Although seemingly obvious, for the sake of completeness the point of achieving good data/MC agreement should be mentioned. It has always been a mandatory check in cut-based approaches, and ML-based ones are no different: the principle "garbage in, garbage out" still holds.
Example
For example, a classical feed-forward neural network is just a continuous function mapping the input space to the output one, so any discrepancies in the input might propagate to the output. The same applies to boosted decision trees: any (domain) differences in the shape of the input (training) distribution w.r.t. the true "data" distribution might sizeably affect the construction of the decision boundary in the feature space.
Since features are the handle on the data, one option is to check for each input feature that the ratio of the data to MC histograms is close to 1 within uncertainties (aka "by eye"). For a more formal approach, one can perform goodness-of-fit (GoF) tests in 1D and 2D, as was done for example in the analysis of the Higgs boson decaying into tau leptons.
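As a rough sketch of such a check (here a two-sample Kolmogorov-Smirnov test stands in for whatever GoF test the analysis actually uses, and event weights are ignored for simplicity), one can loop over the input features and flag those where data and MC disagree:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Toy stand-ins for data and MC arrays of shape (n_events, n_features)
mc = rng.normal(size=(10000, 3))
data = rng.normal(size=(8000, 3))
data[:, 2] += 0.2          # introduce a deliberate mismodelling in feature 2

for i in range(mc.shape[1]):
    stat, p_value = ks_2samp(data[:, i], mc[:, i])
    flag = "  <-- check modelling" if p_value < 0.05 else ""
    print(f"feature {i}: KS p-value = {p_value:.3g}{flag}")
# Features failing the test should be investigated and fixed, or dropped.
```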
If the modelling is shown to be insufficient, the corresponding feature should either be removed, or the mismodelling needs to be investigated and resolved.
Feature preprocessing can also be understood from the broader perspective of data preprocessing, i.e. transformations which need to be performed on the data prior to training a model. Another way to look at it is as the step where raw data is converted into prepared data. That makes it an important part of any ML pipeline, since it ensures smooth convergence and stability of the training.
Example
In fact, the training process might not even begin (presence of NaN values) or might break in the middle (an outlier causing the gradients to explode). Furthermore, the data can be completely misunderstood by the model, which can potentially cause undesirable interpretation and performance (e.g. treatment of categorical variables as numerical).
Therefore, below is a non-exhaustive list of the most common items to be addressed during the preprocessing step to ensure good quality of training. For a more comprehensive overview and further code examples, please refer to the detailed documentation of the sklearn package, including the discussion of possible pitfalls which can arise at this point.
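A minimal sketch of a typical preprocessing pipeline built with sklearn (the choice of columns and transformations is purely illustrative): imputation of missing values, standardisation of continuous features and one-hot encoding of a categorical one.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Columns are illustrative: continuous kinematic variables plus a categorical one
numeric_cols = [0, 1, 2]          # e.g. pt, eta, mass
categorical_cols = [3]            # e.g. a jet flavour label

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # handle NaN placeholders
        ("scale", StandardScaler()),                    # zero mean, unit variance
    ]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Toy input with a missing value; fit on training data only to avoid leakage
X = np.array([[50.0, 1.2, 10.0, 0],
              [np.nan, -0.3, 12.0, 1],
              [120.0, 2.1, 9.0, 0]])
X_prepared = preprocessor.fit_transform(X)
print(X_prepared)
```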
Finally, these are the items which are worth considering in the preprocessing of data in general. However, one can also apply transformations at the level of batches as they are passed through the model. This will be briefly covered in the following section.
Here it is already assumed that a proper data representation has been chosen, i.e. the way to vectorise the data into a particular structure (e.g. image -> tensor, social network -> graph, text -> embeddings). Being a big topic on its own, it is left for the curious reader to dive into. ↩
Depending on the library and on how a particular model is implemented there, these values can be handled automatically under the hood. ↩
Last update: December 5, 2023
\ No newline at end of file
+ Features - CMS Machine Learning Documentation (content identical to the version above)
diff --git a/general_advice/before/inputs.html b/general_advice/before/inputs.html
index 2ccdf52..cc02737 100644
--- a/general_advice/before/inputs.html
+++ b/general_advice/before/inputs.html
@@ -1,4 +1,4 @@
- Inputs - CMS Machine Learning Documentation
After the data is preprocessed as a whole, there is the question of how it should be supplied to the model. On its way there it potentially needs to undergo a few splits, which will be described below. In addition, a few comments about training weights and the motivation for their choice will be given.
The first thing to consider is a split of the entire data set into train/validation(/test) data sets. This is an important step because it serves the purpose of diagnosing overfitting. The topic will be covered in more detail in the corresponding section; here only a brief introduction is given.
A trained model is said to be overfitted (or overtrained) when it fails to generalise to solve a given problem.
One example would be a model which learns to predict exactly the training data, but once given new unseen data drawn from the same distribution it fails to predict the target correctly (right plot in Figure 1). Obviously, this is undesirable behaviour, since one wants the model to be "universal" and to provide robust and correct decisions regardless of the data subset sampled from the same population.
Hence the solution to check the ability to generalise and to spot overfitting: test the trained model on a separate data set, which is the same1 as the training one. If the model performance gets significantly worse there, it is a sign that something went wrong and that the model's predictive power isn't generalising to the same population.
Clearly, the simplest way to obtain such a data set is to put aside a part of the original one and leave it untouched until the final model is trained - this is what was called the "test" data set in the first paragraph of this subsection. When the model has been finalised and optimised, this data set is "unblinded" and the model performance on it is evaluated. Practically, this split can easily be performed with the train_test_split() method of the sklearn library.
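A minimal sketch of such a split (the chosen fractions are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=10, random_state=0)

# Hold out 20% as the blinded test set; stratify to keep the class fractions equal
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Optionally carve a validation set (here 20% of the original) out of the rest
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42
)
```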
But it might not be that simple
Indeed, there are a few things to be aware of. Firstly, there is the question of how much data needs to be set aside for testing. It is common to take the test fraction in the range [0.1, 0.4]; however, it is mostly up to the analysers to decide. The important trade-off to take into account here is between the robustness of the test metric estimate (too small a test data set - poorly estimated metric) and the robustness of the trained model (too little training data - less performant model).
Secondly, note that the split should be done in such a way that each subset is as close as possible to the data which the model will face at the final inference stage. But since it usually isn't feasible to bridge the gap between domains, the split should at least be uniform between training and testing, so that the model performance can be judged fairly.
Lastly, in extreme cases there might not be a sufficient amount of data to perform the training, let alone to set aside a part of it for testing. Here a way out would be to go for few-shot learning, to use cross-validation during the training, to regularise the model to avoid overfitting, or to try to find/generate more (possibly similar) data.
In addition, one can also consider putting aside yet another fraction of the original data set, which was called the "validation" data set above. This can be used to monitor the model during the training; more details on that will follow in the overfitting section.
Usually the training/validation/testing data set can't fit entirely into memory due to its large size. That is why it gets split into batches (chunks) of a given size, which are then fed one by one into the model during the training/testing.
While forming the batches, it is important to keep in mind that they should be sampled uniformly (i.e. from the same underlying PDF as that of the original data set).
That means that each batch is populated similarly to the others with respect to the features which are important to the given task (e.g. particles' pt/eta, number of jets, etc.). This is needed to ensure that the gradients computed for the various batches aren't too different from each other, so that gradient descent doesn't encounter sizeable stochasticity during the optimisation step.2
Lastly, it was already mentioned that one should perform preprocessing of the data set prior to training. However, this step can be substituted and/or complemented by adding a layer to the architecture which performs a specified part of the preprocessing on every batch as it goes through the model. One of the most prominent examples is the addition of batch/group normalization, coupled with weight standardization layers, which turned out to sizeably boost the performance on a large variety of benchmarks.
Next, one can zoom into the batch and consider the level of its single entries (e.g. events). This is where training weights come into play. Since the value of the loss function for a given batch is represented as a sum over all the entries in the batch, this sum can naturally be turned into a weighted sum. For example, in the case of a cross-entropy loss with y_pred, y_true, w being the vectors of predicted labels, true labels and weights respectively, the weighted loss takes the form shown below:
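For a binary cross-entropy, for instance, the weighted per-batch loss can be written as follows (a standard form is reconstructed here, up to the overall normalisation convention, since the original equation is not reproduced in this text):

\[
L_\text{batch} \;=\; - \frac{1}{\sum_{i=1}^{N} w_i} \sum_{i=1}^{N} w_i \Big[\, y^\text{true}_i \log y^\text{pred}_i \;+\; \big(1 - y^\text{true}_i\big) \log\big(1 - y^\text{pred}_i\big) \Big],
\]

where the sum runs over the N entries of the batch.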
It is important to disentangle here two factors which define the weight to be applied on a per-event basis because of the different motivations behind them:
The first point is related to the fact that in classification we may have significantly more (>O(1) times) training data for one class than for another. Since the training data usually comes from MC simulation, this corresponds to the case where more events are generated for one physical process than for another. Therefore, here we want to make sure that the model is presented equally with instances of each class - this may have a significant impact on the model performance, depending on the choice of loss/metric.
Example
Consider the case where there are 1M events with target = 0 and 100 events with target = 1 in the training data set, and a model is fitted by minimising cross-entropy to distinguish between those classes. The resulting model can then easily turn out to be a constant function predicting the majority class target = 0, simply because this is the optimal solution in terms of loss function minimisation. If accuracy is used as a validation metric, this will result in a value close to 1 on the training data.
To account for this type of imbalance, a weight simply needs to be introduced according to the target label of each object, for example as shown below:
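One common convention (the one used by sklearn's "balanced" class-weight heuristic; other normalisations exist) is the inverse class frequency:

\[
w_i \;=\; \frac{N_\text{total}}{N_\text{classes} \cdot N_{\text{class}(y_i)}} .
\]

For the example above, this can be obtained directly from sklearn:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 1_000_000 + [1] * 100)
w_class = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(w_class)   # ~[0.5, 5000.5]: the rare class is upweighted accordingly
```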
A metric is a function which evaluates the model's performance given true labels and model predictions for a particular data set.
That makes it an important ingredient in model training, being a measure of the model's quality. However, metrics as estimators can be sensitive to certain effects (e.g. class imbalance) and provide biased or over/under-optimistic results. Additionally, they might not be relevant to the physical problem in mind and to the understanding of what a "good" model is1. This in turn can result in suboptimally tuned hyperparameters or, more generally, in a suboptimally trained model.
Therefore, it is important to choose metrics wisely, so that they reflect the physical problem to be solved and additionally don't introduce any biases in the performance estimate. The whole topic of metrics is too broad to be covered in this section, so please refer to the corresponding sklearn documentation, which provides an exhaustive list of available metrics with additional material and can be used as a good starting point.
Essentially being an estimate of the expected signal sensitivity, and hence closely related to the final result of the analysis, the significance can be used not only as a metric but also as a loss function to be directly optimised in the training.
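For reference, a commonly used approximation of the expected significance for signal and background yields s and b in a selected region is shown below, together with the Asimov form (quoted here for illustration; the exact figure of merit is analysis-dependent):

\[
Z \;\approx\; \frac{s}{\sqrt{s + b}}, \qquad
Z_A \;=\; \sqrt{\,2\left[(s + b)\,\ln\!\left(1 + \frac{s}{b}\right) - s\right]} .
\]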
In fact, metrics and loss functions are very similar to each other: they both give an estimate of how well (or badly) the model performs, and both are used to monitor its quality. So the same comments as in the metrics section apply to loss functions too. However, the loss function plays a crucial role because it is additionally used during the training as the functional to be optimised. That makes its choice a handle to explicitly steer the training process towards a more optimal and relevant solution.
Example of things going wrong
It is known that the L2 loss (MSE) is sensitive to outliers in the data, while the L1 loss (MAE) is robust to them. Therefore, if outliers were overlooked in the training data set and the model was fitted anyway, this may result in a significant bias in its predictions. As an illustration, this toy example compares Huber and Ridge regressors, where the former shows the more robust behaviour.
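A minimal sketch reproducing the spirit of that comparison with sklearn (toy data with injected outliers; numbers are illustrative):

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, Ridge

rng = np.random.default_rng(0)

# Toy linear data y = 2x + noise, with a handful of large outliers at high x
X = rng.uniform(-3, 3, size=(200, 1))
y = 2.0 * X.ravel() + rng.normal(0, 0.5, size=200)
idx = np.argsort(X.ravel())[-10:]
y[idx] += 30.0                              # corrupt a few targets

ridge = Ridge(alpha=1.0).fit(X, y)          # squared (L2) loss: pulled by outliers
huber = HuberRegressor().fit(X, y)          # Huber loss: robust to outliers

print(f"true slope: 2.00, Ridge: {ridge.coef_[0]:.2f}, Huber: {huber.coef_[0]:.2f}")
```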
A simple example of that was already mentioned in the domains section - namely, one can emphasise specific regions of the phase space by attributing a larger weight in the loss function to events there. Intuitively, for the same fraction of mispredicted events in the training data set, the class with the larger attributed weight should bring a larger penalty to the loss function. This way the model should learn to pay more attention to those "upweighted" events2.
Examples in HEP beyond classical MSE/MAE/cross entropy
DeepTau, a model deployed in CMS for tau identification, uses several focal loss terms to give a higher weight to more severely misclassified cases (a minimal sketch of the focal loss follows below).
However, one can go further than that and consider the training procedure from a larger, statistical inference perspective. From there, one can try to construct a loss function which directly optimises the end goal of the analysis. INFERNO is an example of such an approach, with the loss function being the expected uncertainty on the parameter of interest. Moreover, one can also try to make the model aware of nuisance parameters which affect the analysis by incorporating them into the training procedure; please see this review for a comprehensive overview of the corresponding methods.
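For the focal loss mentioned above, a minimal numpy sketch of its binary form is given below (DeepTau's actual implementation and its multi-class generalisation differ in details):

```python
import numpy as np

def binary_focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss: down-weights well-classified events by (1 - p_t)^gamma."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    p_t = np.where(y_true == 1, y_pred, 1.0 - y_pred)     # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)   # optional class balancing
    return -np.mean(alpha_t * (1.0 - p_t) ** gamma * np.log(p_t))

y_true = np.array([1, 1, 0, 0])
y_easy = np.array([0.95, 0.9, 0.1, 0.05])   # confident, correct predictions
y_hard = np.array([0.55, 0.6, 0.4, 0.45])   # barely correct predictions
print(binary_focal_loss(y_true, y_easy), binary_focal_loss(y_true, y_hard))
# The "hard" batch dominates the loss: misclassified or uncertain events get a
# relatively larger weight than in plain cross-entropy.
```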
For example, that corresponds to asking oneself a question: "what is more suitable for the purpose of the analysis: F1-score, accuracy, recall or ROC AUC?" ↩
However, these are expectations one may have in theory. In practice, the optimisation procedure depends on many variables and can go in different ways. Therefore, the weighting scheme should be studied by running experiments on a case-by-case basis. ↩
Last update: December 5, 2023
\ No newline at end of file
+ Metrics & Losses - CMS Machine Learning Documentation
(content identical to the Metrics & Losses material above)
diff --git a/general_advice/before/model.html b/general_advice/before/model.html
index 29a456f..4e615e2 100644
--- a/general_advice/before/model.html
+++ b/general_advice/before/model.html
@@ -1 +1 @@
- Model - CMS Machine Learning Documentation
There is an enormous variety of ML models available on the market, which makes the choice of a suitable one for a given problem not entirely straightforward. Since ML is still to a large extent an experimental field, the general advice here is to try various models and pick the one giving the best physical result.
However, there are in any case several common remarks to be pointed out, all glued together with a simple underlying idea:
Start off from a simple baseline, then gradually increase the complexity to improve upon it.
In the first place, one needs to carefully consider whether there is a need for training an ML model at all. There might be problems where this approach would be a (time-consuming) overkill, and simple conventional statistical methods would deliver results faster and even better.
If ML methods are expected to bring improvement, then it makes sense to try out simple models first. Assuming a proper set of high-level features has been selected, an ensemble of trees (random forest/boosted decision trees) or a simple feedforward neural network might be a good choice here. If time and resources permit, it might be beneficial to compare the results of these trainings to a no-ML approach (e.g. cut-based) to get a feeling for how large the gain in performance is. In most use cases, such models will already be sufficient to solve a given classification/regression problem when dealing with high-level variables.
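A minimal sketch of such a baseline-vs-no-ML comparison (toy data; the best single-feature cut is used here as a crude stand-in for a cut-based reference):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Simple baseline: a random forest on high-level features, default-ish settings
baseline = RandomForestClassifier(n_estimators=200, random_state=0)
baseline.fit(X_train, y_train)
auc_ml = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])

# "Cut-based" reference: the single most discriminating feature used as the score
aucs = [roc_auc_score(y_test, X_test[:, i]) for i in range(X.shape[1])]
auc_cut = max(max(a, 1.0 - a) for a in aucs)   # allow for inverted cuts

print(f"baseline ML ROC AUC: {auc_ml:.3f}  vs  best single-feature cut: {auc_cut:.3f}")
```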
If it feels like there is still room for improvement, try hyperparameter tuning first to see whether it is possible to squeeze more performance out of the current model and data. It can easily be that the model is sensitive to the hyperparameter choice and has a sizeable variance in performance across the hyperparameter space.
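A minimal sketch of a randomised hyperparameter scan with sklearn (parameter ranges are illustrative):

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

param_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(2, 6),
    "learning_rate": uniform(0.01, 0.3),
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=20, cv=3, scoring="roc_auc", random_state=0, n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, f"best CV ROC AUC: {search.best_score_:.3f}")
```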
If the hyperparameter space has been thoroughly explored and an optimal point has been found, one can additionally try to play around with the data, for example by augmenting the current data set with more samples. Since in general the model performance profits from having more training data, augmentation might also boost the overall performance.
Lastly, more advanced architectures can be probed. At this point the choice of data representation plays a crucial role, since more complex architectures are designed to capture more sophisticated patterns in the data. While research in ML to unify all the complexity of such models is still ongoing (promisingly, also using an effective field theory approach), in HEP there is an ongoing effort to probe various architectures and see which type fits the field best.
Models in HEP
One of the most prominent benchmarks so far is the one performed by G. Kasieczka et al. on the top tagging data set, where in particular ParticleNet turned out to be the state of the art. This was yet another solid argument in favour of using graph neural networks in HEP due to their natural suitability in terms of data representation.
Last update: December 5, 2023
\ No newline at end of file
+ Model - CMS Machine Learning Documentation
There is an enormous variety of ML models available on the market, which makes the choice of a suitable one for a given problem not entirely straightforward. Since the field is still to a large extent experimental, the general advice here would be to try various models and pick the one giving the best physical result.
However, there are in any case several common remarks to be pointed out, all glued together with a simple underlying idea:
Start off from a simple baseline, then gradually increase the complexity to improve upon it.
In the first place, one needs to carefully consider whether there is a need to train an ML model at all. There might be problems where this approach would be (time-consuming) overkill, and simple conventional statistical methods would deliver results faster and possibly even better.
If ML methods are expected to bring improvement, then it makes sense to try out simple models first. Assuming a proper set of high-level features has been selected, an ensemble of trees (random forest/boosted decision tree) or a simple feedforward neural network might be a good choice here. If time and resources permit, it might be beneficial to compare the results of these trainings to a no-ML approach (e.g. cut-based) to get a feeling for how large the gain in performance is. In most use cases dealing with high-level variables, those models will already be sufficient to solve a given classification/regression problem.
If it feels like there is still room for improvement, try hyperparameter tuning first to see if it is possible to squeeze more performance out of the current model and data. It can easily be that the model is sensitive to the hyperparameter choice and has a sizeable variance in performance across the hyperparameter space.
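As a minimal sketch of such a scan (the model, parameter ranges and metric below are arbitrary illustrative choices, not a recommendation), one could use scikit-learn's RandomizedSearchCV:
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Toy data standing in for the analysis inputs
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# Scan a few BDT hyperparameters and keep the configuration with the
# best cross-validated ROC AUC
param_distributions = {
    "n_estimators": [100, 300, 500],
    "max_depth": [2, 3, 4, 5],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,
    scoring="roc_auc",
    cv=3,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)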
If the hyperparameter space has been thoroughly explored and an optimal point has been found, one can additionally try to play around with the data, for example by augmenting the current data set with more samples. Since in general the model performance profits from having more training data, augmentation might also boost the overall performance.
Lastly, more advanced architectures can be probed. At this point the choice of data representation plays a crucial role, since more complex architectures are designed to capture more sophisticated patterns in the data. While research in ML to unify all the complexity of such models is still ongoing (promisingly, also using an effective field theory approach), in HEP there is an ongoing effort to probe various architectures and see which type fits the field best.
Models in HEP
One of the most prominent benchmarks so far is the one performed by G. Kasieczka et al. on the top tagging data set, where in particular ParticleNet turned out to be the state of the art. This was yet another solid argument in favour of using graph neural networks in HEP due to their natural suitability in terms of data representation.
Last update: December 5, 2023
\ No newline at end of file
diff --git a/general_advice/during/opt.html b/general_advice/during/opt.html
index b56221b..5cd020b 100644
--- a/general_advice/during/opt.html
+++ b/general_advice/during/opt.html
@@ -1 +1 @@
- Optimisation problems - CMS Machine Learning Documentation
However, it might be that for a given task overfitting is of no concern, but there are still instabilities in the loss function convergence during the training1. The loss landscape is a complex object with multiple local minima, which is moreover poorly understood due to the high dimensionality of the problem. That makes the gradient descent procedure of finding a minimum not so simple. However, if instabilities are observed, there are a few common things which could explain them:
The main candidate for a problem might be the learning rate (LR). Being an important hyperparameter which steers the optimisation, setting it too high may cause extremely stochastic behaviour which will likely make the optimisation get stuck in some random minimum far away from the optimum. Conversely, setting it too low may cause the convergence to take a very long time. An optimal value in between those extremes can still be problematic due to the chance of getting stuck in a local minimum on the way towards a better one. That is why several approaches to LR scheduling (e.g. cosine annealing) and also adaptive LR methods (with Adam being the most prominent one) have been developed to provide more flexibility during the training, as opposed to keeping the LR fixed from the very beginning of the training until its end.
Another possibility is that NaN/inf values or uniformities/outliers appear in the input batches. These can cause the gradient updates to go beyond the normal scale and therefore dramatically affect the stability of the loss optimisation. This can be avoided by careful data preprocessing and batch formation.
Last but not least, there is a chance that gradients will explode or vanish during the training, which reveals itself as a rapid increase/stagnation in the loss function values. This is largely a feature of deep architectures, where during backpropagation gradients are accumulated from one layer to another, and therefore any minor deviations in scale can exponentially amplify/diminish as they get multiplied. Since it is the scale of the trainable weights themselves which defines the weight gradients, a proper weight initialisation can foster smooth and consistent gradient updates. Also, batch normalisation together with weight standardisation has been shown to be a powerful technique to consistently improve performance across various domains. Finally, the choice of activation function is particularly important since it directly enters the gradient computation. For example, the sigmoid function is known to cause gradients to vanish due to its gradient being close to 0 at large input values. Therefore, it is often suggested to stick to the classical ReLU or to try other alternatives to see if they bring an improvement in performance.
However, it might be that for a given task overfitting is of no concern, but there are still instabilities in the loss function convergence during the training1. The loss landscape is a complex object with multiple local minima, which is moreover poorly understood due to the high dimensionality of the problem. That makes the gradient descent procedure of finding a minimum not so simple. However, if instabilities are observed, there are a few common things which could explain them:
The main candidate for a problem might be the learning rate (LR). Being an important hyperparameter which steers the optimisation, setting it too high may cause extremely stochastic behaviour which will likely make the optimisation get stuck in some random minimum far away from the optimum. Conversely, setting it too low may cause the convergence to take a very long time. An optimal value in between those extremes can still be problematic due to the chance of getting stuck in a local minimum on the way towards a better one. That is why several approaches to LR scheduling (e.g. cosine annealing) and also adaptive LR methods (with Adam being the most prominent one) have been developed to provide more flexibility during the training, as opposed to keeping the LR fixed from the very beginning of the training until its end.
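A minimal sketch of a cosine-annealing LR schedule combined with the Adam optimiser in PyTorch (the toy model, data and schedule length are placeholders):
import torch
from torch import nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # adaptive per-parameter LR
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

x, y = torch.randn(128, 20), torch.randn(128, 1)
for epoch in range(50):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()   # decay the global LR along a cosine curve, once per epoch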
Another possibility is that NaN/inf values or uniformities/outliers appear in the input batches. These can cause the gradient updates to go beyond the normal scale and therefore dramatically affect the stability of the loss optimisation. This can be avoided by careful data preprocessing and batch formation.
Last but not least, there is a chance that gradients will explode or vanish during the training, which reveals itself as a rapid increase/stagnation in the loss function values. This is largely a feature of deep architectures, where during backpropagation gradients are accumulated from one layer to another, and therefore any minor deviations in scale can exponentially amplify/diminish as they get multiplied. Since it is the scale of the trainable weights themselves which defines the weight gradients, a proper weight initialisation can foster smooth and consistent gradient updates. Also, batch normalisation together with weight standardisation has been shown to be a powerful technique to consistently improve performance across various domains. Finally, the choice of activation function is particularly important since it directly enters the gradient computation. For example, the sigmoid function is known to cause gradients to vanish due to its gradient being close to 0 at large input values. Therefore, it is often suggested to stick to the classical ReLU or to try other alternatives to see if they bring an improvement in performance.
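As an illustration of these ingredients (a sketch only, with arbitrary layer sizes), a small PyTorch network combining He/Kaiming initialisation, batch normalisation and ReLU activations could look like:
import torch
from torch import nn

class MLP(nn.Module):
    def __init__(self, n_features=20, n_hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.BatchNorm1d(n_hidden),   # keeps activations (and hence gradients) well scaled
            nn.ReLU(),
            nn.Linear(n_hidden, n_hidden),
            nn.BatchNorm1d(n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, 1),
        )
        for m in self.net:
            if isinstance(m, nn.Linear):
                # Kaiming initialisation is tailored to ReLU-like activations
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
                nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.net(x)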
Given that the training experiment has been set up correctly (with some of the most common problems described in the before training section), only a few things can actually go wrong during the training process itself. Broadly speaking, they fall into two categories: overfitting-related and optimisation-related. Both of them can be easily spotted by closely monitoring the training procedure, as described in the following.
The concept of overfitting (also called overtraining) was previously introduced in the inputs section, and here we will elaborate a bit more on it. In essence, overfitting, understood as the situation where the model fails to generalise to a given problem, can have several underlying explanations:
The first one would be the case where the model complexity is way too large for a problem and a data set being considered.
Example
A simple example would be fitting linearly distributed data with a polynomial of large degree, or, in general, any situation where the number of trainable parameters is significantly larger than the size of the training data set.
This can be addressed prior to training by applying regularisation to the model, which in its essence means constraining its capacity to learn the data representation. This is also somewhat related to the concept of Ockham's razor: namely, the less complex an ML model is, the more likely it is that a good empirical result is not just due to the peculiarities of the data sample. As for the practical side of regularisation, please have a look at this webpage for a detailed overview and implementation examples.
Furthermore, a recipe for training neural networks by A. Karpathy is a highly-recommended guideline not only on regularisation, but on training ML models in general.
The second case is a more general idea that any reasonable model at some point starts to overfit.
Example
Here one can look at overfitting as the point where the model considers noise to be of the same relevance as the genuine patterns and starts to "focus" on it far too much. Since data almost always contains noise, this makes it in principle highly probable that overfitting is reached at some point.
Both of the cases outlined above can be spotted simply by tracking the evolution of the loss/metrics on the validation data set. This means that, in addition to the train/test split done prior to training (as described in the inputs section), one needs to set aside some fraction of the training data to perform validation throughout the training. By plotting the values of the loss function/metric on both the train and validation sets as the training proceeds, overfitting manifests itself as an increase in the value of the metric on the validation set while it still continues to decrease on the training set:
Essentially, it means that from that turning point onwards the model is learning the noise in the training data better and better, at the expense of generalisation power. Therefore, it doesn't make sense to train the model beyond that point, and the training should be stopped.
To automate the process of finding this "sweet spot", many ML libraries include early stopping as one of the parameters of their fit() function. If early stopping is set to, for example, 10 iterations, the training will automatically stop once the validation metric has not improved for the last 10 iterations.
Last update: December 5, 2023
\ No newline at end of file
+ Overfitting - CMS Machine Learning Documentation
Given that the training experiment has been set up correctly (with some of the most common problems described in the before training section), only a few things can actually go wrong during the training process itself. Broadly speaking, they fall into two categories: overfitting-related and optimisation-related. Both of them can be easily spotted by closely monitoring the training procedure, as described in the following.
The concept of overfitting (also called overtraining) was previously introduced in the inputs section, and here we will elaborate a bit more on it. In essence, overfitting, understood as the situation where the model fails to generalise to a given problem, can have several underlying explanations:
The first one would be the case where the model complexity is way too large for a problem and a data set being considered.
Example
A simple example would be fitting linearly distributed data with a polynomial of large degree, or, in general, any situation where the number of trainable parameters is significantly larger than the size of the training data set.
This can be addressed prior to training by applying regularisation to the model, which in its essence means constraining its capacity to learn the data representation. This is also somewhat related to the concept of Ockham's razor: namely, the less complex an ML model is, the more likely it is that a good empirical result is not just due to the peculiarities of the data sample. As for the practical side of regularisation, please have a look at this webpage for a detailed overview and implementation examples.
Furthermore, a recipe for training neural networks by A. Karpathy is a highly-recommended guideline not only on regularisation, but on training ML models in general.
The second case is a more general idea that any reasonable model at some point starts to overfit.
Example
Here one can look at overfitting as the point where the model considers noise to be of the same relevance as the genuine patterns and starts to "focus" on it far too much. Since data almost always contains noise, this makes it in principle highly probable that overfitting is reached at some point.
Both of the cases outlined above can be spotted simply by tracking the evolution of the loss/metrics on the validation data set. This means that, in addition to the train/test split done prior to training (as described in the inputs section), one needs to set aside some fraction of the training data to perform validation throughout the training. By plotting the values of the loss function/metric on both the train and validation sets as the training proceeds, overfitting manifests itself as an increase in the value of the metric on the validation set while it still continues to decrease on the training set:
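A minimal sketch of such a check, assuming per-epoch loss values have already been collected (e.g. from a Keras History object or a custom training loop):
import matplotlib.pyplot as plt

def plot_learning_curves(train_losses, val_losses):
    # Overlay the training and validation loss to spot the turning point
    epochs = range(1, len(train_losses) + 1)
    plt.plot(epochs, train_losses, label="training loss")
    plt.plot(epochs, val_losses, label="validation loss")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.show()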
Essentially, it means that from that turning point onwards the model is learning the noise in the training data better and better, at the expense of generalisation power. Therefore, it doesn't make sense to train the model beyond that point, and the training should be stopped.
To automate the process of finding this "sweet spot", many ML libraries include early stopping as one of the parameters of their fit() function. If early stopping is set to, for example, 10 iterations, the training will automatically stop once the validation metric has not improved for the last 10 iterations.
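For example, in Keras this can be sketched with the EarlyStopping callback (the toy model and data below are placeholders):
import numpy as np
import tensorflow as tf

x_train = np.random.rand(1000, 20)
y_train = np.random.randint(0, 2, size=1000)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # metric evaluated on the held-out validation split
    patience=10,                 # stop after 10 epochs without improvement
    restore_best_weights=True,   # roll back to the best-performing epoch
)
model.fit(x_train, y_train, validation_split=0.2, epochs=200,
          callbacks=[early_stopping])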
Last update: December 5, 2023
\ No newline at end of file
diff --git a/general_advice/during/xvalidation.html b/general_advice/during/xvalidation.html
index 2c5bde1..82c7fb3 100644
--- a/general_advice/during/xvalidation.html
+++ b/general_advice/during/xvalidation.html
@@ -1 +1 @@
- Cross-validation - CMS Machine Learning Documentation
However, in practice what one often deals with is hyperparameter optimisation: running several trainings to find the optimal hyperparameters for a given family of models (e.g. BDT or feed-forward NN).
The number of trials in the hyperparameter space can easily reach hundreds or thousands, and in that case the naive approach of training the model for each hyperparameter set on the same train data set and evaluating its performance on the same test data set is very likely prone to overfitting. In that case, the experimentalist overfits to the test data set by choosing the best value of the metric and effectively adapting the model to suit the test data set best, thereby losing the model's ability to generalise.
In order to prevent that, a cross-validation (CV) technique is often used:
The idea behind it is that instead of a single split of the data into train/validation sets, the training data set is split into N folds. Then, the model with the same fixed hyperparameter set is trained N times in a way that at the i-th iteration the i-th fold is left out of the training and used only for validation, while the other N-1 folds are used for the training.
In this fashion, after training N models one ends up with N values of the metric, one computed on each fold. These values can then be averaged to give a more robust estimate of the model performance for a given hyperparameter set. A variance can also be computed to estimate the spread of the metric values. After having completed the N-fold CV training, the same approach is repeated for other hyperparameter values, and the best set is picked based on the best fold-averaged metric value.
Further insights
Effectively, with the CV approach the whole training data set plays the role of a validation set, which makes overfitting to a single chunk of it (as in a naive train/val split) less likely to happen. Complementary to that, more training data is used to train a single model than with a single and fixed train/val split, which moreover makes the model less dependent on the choice of the split.
Alternatively, one can think of this procedure as building a model ensemble, which is inherently an approach more robust to overfitting and in general performs better than a single model.
Last update: December 5, 2023
\ No newline at end of file
+ Cross-validation - CMS Machine Learning Documentation
However, in practice what one often deals with is hyperparameter optimisation: running several trainings to find the optimal hyperparameters for a given family of models (e.g. BDT or feed-forward NN).
The number of trials in the hyperparameter space can easily reach hundreds or thousands, and in that case the naive approach of training the model for each hyperparameter set on the same train data set and evaluating its performance on the same test data set is very likely prone to overfitting. In that case, the experimentalist overfits to the test data set by choosing the best value of the metric and effectively adapting the model to suit the test data set best, thereby losing the model's ability to generalise.
In order to prevent that, a cross-validation (CV) technique is often used:
The idea behind it is that instead of a single split of the data into train/validation sets, the training data set is split into N folds. Then, the model with the same fixed hyperparameter set is trained N times in a way that at the i-th iteration the i-th fold is left out of the training and used only for validation, while the other N-1 folds are used for the training.
In this fashion, after training N models one ends up with N values of the metric, one computed on each fold. These values can then be averaged to give a more robust estimate of the model performance for a given hyperparameter set. A variance can also be computed to estimate the spread of the metric values. After having completed the N-fold CV training, the same approach is repeated for other hyperparameter values, and the best set is picked based on the best fold-averaged metric value.
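A minimal sketch of a 5-fold CV evaluation for one hyperparameter set, using scikit-learn (the model, data and metric are illustrative placeholders):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# Train N = 5 models, each time leaving one fold out for validation,
# and collect the per-fold metric values
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    GradientBoostingClassifier(max_depth=3, random_state=42),
    X, y, cv=cv, scoring="roc_auc",
)
print(f"ROC AUC = {scores.mean():.3f} +/- {scores.std():.3f}")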
Further insights
Effectively, with the CV approach the whole training data set plays the role of a validation set, which makes overfitting to a single chunk of it (as in a naive train/val split) less likely to happen. Complementary to that, more training data is used to train a single model than with a single and fixed train/val split, which moreover makes the model less dependent on the choice of the split.
Alternatively, one can think of this procedure as building a model ensemble, which is inherently an approach more robust to overfitting and in general performs better than a single model.
Last update: December 5, 2023
\ No newline at end of file
diff --git a/general_advice/intro.html b/general_advice/intro.html
index a489928..b24f271 100644
--- a/general_advice/intro.html
+++ b/general_advice/intro.html
@@ -1,4 +1,4 @@
- Introduction - CMS Machine Learning Documentation
In general, ML models don't really work out of the box. For example, most often it is not sufficient to simply instantiate the model class, call its fit() method followed by predict(), and then proceed straight to the inference step of the analysis.
In general, ML models don't really work out of the box. For example, most often it is not sufficient to simply instantiate the model class, call its fit() method followed by predict(), and then proceed straight to the inference step of the analysis.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
diff --git a/images/BDTscores_EXO19020.png b/images/BDTscores_EXO19020.png
new file mode 100644
index 0000000..bd10ed5
Binary files /dev/null and b/images/BDTscores_EXO19020.png differ
diff --git a/images/DisCoPresentation_MLForum.png b/images/DisCoPresentation_MLForum.png
new file mode 100644
index 0000000..77c35ed
Binary files /dev/null and b/images/DisCoPresentation_MLForum.png differ
diff --git a/images/ML_Forum_talk_May8_2019.png b/images/ML_Forum_talk_May8_2019.png
new file mode 100644
index 0000000..1a24294
Binary files /dev/null and b/images/ML_Forum_talk_May8_2019.png differ
diff --git a/images/doublediscoNN.png b/images/doublediscoNN.png
new file mode 100644
index 0000000..aca7296
Binary files /dev/null and b/images/doublediscoNN.png differ
diff --git a/images/hig21002_bdtscores.png b/images/hig21002_bdtscores.png
new file mode 100644
index 0000000..4501360
Binary files /dev/null and b/images/hig21002_bdtscores.png differ
diff --git a/index.html b/index.html
index a224a59..b2aa791 100644
--- a/index.html
+++ b/index.html
@@ -1 +1 @@
- CMS Machine Learning Documentation
Welcome to the documentation hub for the CMS Machine Learning Group! The goal of this page is to provide CMS analyzers a centralized place to gather machine learning information relevant to their work. However, we are not seeking to rewrite external documentation. Whenever applicable, we will link to external documentation, such as the iML groups HEP Living Review or their ML Resources repository. What you will find here are pages covering:
ML best practices
How to optimize a NN
Common pitfalls for CMS analyzers
Direct and indirect inferencing using a variety of ML packages
How to get a model integrated into CMSSW
And much more!
If you think we are missing some important information, please contact the ML Knowledge Subgroup!
Last update: December 5, 2023
\ No newline at end of file
+ CMS Machine Learning Documentation
Welcome to the documentation hub for the CMS Machine Learning Group! The goal of this page is to provide CMS analyzers a centralized place to gather machine learning information relevant to their work. However, we are not seeking to rewrite external documentation. Whenever applicable, we will link to external documentation, such as the iML groups HEP Living Review or their ML Resources repository. What you will find here are pages covering:
ML best practices
How to optimize a NN
Common pitfalls for CMS analyzers
Direct and indirect inferencing using a variety of ML packages
How to get a model integrated into CMSSW
And much more!
If you think we are missing some important information, please contact the ML Knowledge Subgroup!
Last update: December 5, 2023
\ No newline at end of file
diff --git a/inference/checklist.html b/inference/checklist.html
index f1d7831..4f2d9ac 100644
--- a/inference/checklist.html
+++ b/inference/checklist.html
@@ -1 +1 @@
- Integration checklist - CMS Machine Learning Documentation
conifer is a Python package developed by the Fast Machine Learning Lab for the deployment of Boosted Decision Trees in FPGAs for Level 1 Trigger applications. Documentation, examples, and tutorials are available from the conifer website, GitHub, and the hls4ml tutorial respectively. conifer is on the Python Package Index and can be installed like pip install conifer. Targeting FPGAs requires Xilinx's Vivado/Vitis suite of software. Here's a brief summary of features:
conversion from common BDT training frameworks: scikit-learn, XGBoost, Tensorflow Decision Forests (TF DF), TMVA, and ONNX
conversion to FPGA firmware with backends: HLS (C++ for FPGA), VHDL, C++ (for CPU)
utilities for bit- and cycle-accurate firmware simulation, and interface to FPGA synthesis tools for evaluation and deployment from Python
All L1T algorithms require bit-exact emulation for performance studies and validation of the hardware system. For conifer this is provided with a single header file at L1Trigger/Phase2L1ParticleFlow/interface/conifer.h. The user must also provide the BDT JSON file exported from the conifer Python tool for their model. JSON loading in CMSSW uses the nlohmann/json external.
Both the conifer FPGA firmware and C++ emulation use Xilinx's arbitrary precision types for fixed-point arithmetic (hls external of CMSSW). This is cheaper and faster in the FPGA fabric than floating-point types. An important part of the model preparation process is choosing the proper fixed-point data types to avoid loss of performance compared to the trained model. Input preprocessing, in particular scaling, can help constrain the input variables to a smaller numerical range, but may also have a hardware cost to implement. In C++ the arbitrary precision types are specified like: ap_fixed<width, integer, rounding mode, saturation mode>.
conifer is a Python package developed by the Fast Machine Learning Lab for the deployment of Boosted Decision Trees in FPGAs for Level 1 Trigger applications. Documentation, examples, and tutorials are available from the conifer website, GitHub, and the hls4ml tutorial respectively. conifer is on the Python Package Index and can be installed like pip install conifer. Targeting FPGAs requires Xilinx's Vivado/Vitis suite of software. Here's a brief summary of features:
conversion from common BDT training frameworks: scikit-learn, XGBoost, Tensorflow Decision Forests (TF DF), TMVA, and ONNX
conversion to FPGA firmware with backends: HLS (C++ for FPGA), VHDL, C++ (for CPU)
utilities for bit- and cycle-accurate firmware simulation, and interface to FPGA synthesis tools for evaluation and deployment from Python
All L1T algorithms require bit-exact emulation for performance studies and validation of the hardware system. For conifer this is provided with a single header file at L1Trigger/Phase2L1ParticleFlow/interface/conifer.h. The user must also provide the BDT JSON file exported from the conifer Python tool for their model. JSON loading in CMSSW uses the nlohmann/json external.
Both the conifer FPGA firmware and C++ emulation use Xilinx's arbitrary precision types for fixed-point arithmetic (hls external of CMSSW). This is cheaper and faster in the FPGA fabric than floating-point types. An important part of the model preparation process is choosing the proper fixed-point data types to avoid loss of performance compared to the trained model. Input preprocessing, in particular scaling, can help constrain the input variables to a smaller numerical range, but may also have a hardware cost to implement. In C++ the arbitrary precision types are specified like: ap_fixed<width, integer, rounding mode, saturation mode>.
Minimal preparation from Python:
import conifer
model = conifer. ... # convert or load a conifer model
# e.g. model = conifer.converters.convert_from_xgboost(xgboost_model)
model.save('my_bdt.json')
diff --git a/inference/hls4ml.html b/inference/hls4ml.html
index 7764c9f..9cae5a1 100644
--- a/inference/hls4ml.html
+++ b/inference/hls4ml.html
@@ -1 +1 @@
- hls4ml - CMS Machine Learning Documentation
hls4ml is a Python package developed by the Fast Machine Learning Lab. Its primary purpose is to create firmware implementations of machine learning (ML) models to be run on FPGAs. The package interfaces with a high-level synthesis (HLS) backend (i.e. Xilinx Vivado HLS) to transpile the ML model into hardware description language (HDL). The primary hls4ml documentation, including API reference pages, is located here.
The main hls4ml tutorial code is kept on GitHub. Users are welcome to walk through the notebooks at their own pace. There is also a set of slides linked to the README.
That said, there have been several cases where the hls4ml developers have given live demonstrations and tutorials. Below is a non-exhaustive list of tutorials given in the last few years (newest on top).
hls4ml is a Python package developed by the Fast Machine Learning Lab. Its primary purpose is to create firmware implementations of machine learning (ML) models to be run on FPGAs. The package interfaces with a high-level synthesis (HLS) backend (i.e. Xilinx Vivado HLS) to transpile the ML model into hardware description language (HDL). The primary hls4ml documentation, including API reference pages, is located here.
The main hls4ml tutorial code is kept on GitHub. Users are welcome to walk through the notebooks at their own pace. There is also a set of slides linked to the README.
That said, there have been several cases where the hls4ml developers have given live demonstrations and tutorials. Below is a non-exhaustive list of tutorials given in the last few years (newest on top).
ONNX is an open format built to represent machine learning models. It is designed to improve interoperability across a variety of frameworks and platforms in the AI tools community—most deep learning frameworks (e.g. XGBoost, TensorFlow, PyTorch which are frequently used in CMS) support converting their model into the ONNX format or loading a model from an ONNX format.
ONNX Runtime is a tool aiming at the acceleration of machine learning inference across a variety of deployment platforms. It allows one to "run any ONNX model using a single set of inference APIs that provide access to the best hardware acceleration available". It includes "built-in optimization features that trim and consolidate nodes without impacting model accuracy."
The CMSSW interface to ONNX Runtime is available since CMSSW_11_1_X (cmssw#28112, cmsdist#5020). Its functionality is improved in CMSSW_11_2_X. The final implementation is also backported to CMSSW_10_6_X to facilitate Run 2 UL data reprocessing. The inference of a number of deep learning tagger models (e.g. DeepJet, DeepTauID, ParticleNet, DeepDoubleX, etc.) is performed with ONNX Runtime as part of the routine UL processing and has gained a substantial speedup.
On this page, we will use a simple example to show how to use ONNX Runtime for deep learning model inference in the CMSSW framework, both in C++ (e.g. to process the MiniAOD file) and in Python (e.g. using NanoAOD-tools to process the NanoAODs). This may help readers who will deploy an ONNX model into their analyses or in the CMSSW framework.
We use CMSSW_11_2_5_patch2 to show a simple example of ONNX Runtime inference. The example also works under the newer 12_X releases (note that inference with C++ can also run on CMSSW_10_6_X).
ONNX is an open format built to represent machine learning models. It is designed to improve interoperability across a variety of frameworks and platforms in the AI tools community—most deep learning frameworks (e.g. XGBoost, TensorFlow, PyTorch which are frequently used in CMS) support converting their model into the ONNX format or loading a model from an ONNX format.
ONNX Runtime is a tool aiming at the acceleration of machine learning inference across a variety of deployment platforms. It allows one to "run any ONNX model using a single set of inference APIs that provide access to the best hardware acceleration available". It includes "built-in optimization features that trim and consolidate nodes without impacting model accuracy."
The CMSSW interface to ONNX Runtime is available since CMSSW_11_1_X (cmssw#28112, cmsdist#5020). Its functionality is improved in CMSSW_11_2_X. The final implementation is also backported to CMSSW_10_6_X to facilitate Run 2 UL data reprocessing. The inference of a number of deep learning tagger models (e.g. DeepJet, DeepTauID, ParticleNet, DeepDoubleX, etc.) is performed with ONNX Runtime as part of the routine UL processing and has gained a substantial speedup.
On this page, we will use a simple example to show how to use ONNX Runtime for deep learning model inference in the CMSSW framework, both in C++ (e.g. to process the MiniAOD file) and in Python (e.g. using NanoAOD-tools to process the NanoAODs). This may help readers who will deploy an ONNX model into their analyses or in the CMSSW framework.
We use CMSSW_11_2_5_patch2 to show a simple example of ONNX Runtime inference. The example also works under the newer 12_X releases (note that inference with C++ can also run on CMSSW_10_6_X).
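For the Python side, a minimal hedged sketch of running an exported network with the onnxruntime package looks as follows ("model.onnx" and the dummy input shape are placeholders for the user's own model):
import numpy as np
import onnxruntime as ort

# Load the exported model and run inference on a dummy batch
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
batch = np.random.rand(1, 10).astype(np.float32)   # shape must match the model input
outputs = session.run(None, {input_name: batch})
print(outputs[0])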
ParticleNet [arXiv:1902.08570] is an advanced neural network architecture that has many applications in CMS, including heavy flavour jet tagging, jet mass regression, etc. The network is fed by various low-level point-like objects as input, e.g., the particle-flow candidates, to predict a feature of a jet.
On this page, we introduce several user-specific aspects of the ParticleNet model. We cover the following items in three sections:
build three network models and understand them from the technical side; use the out-of-the-box commands to run these examples on a benchmark task. The three networks are (1) a simple feed-forward NN, (2) a DeepAK8 model (based on 1D CNN), and eventually (3) the ParticleNet model (based on DGCNN).
try to reproduce the original performance and make the ROC plots.
This section is friendly to the ML newcomers. The goal is to help readers understand the underlying structure of the "ParticleNet".
tips for readers who are using/modifying the ParticleNet model to achieve a better performance
This section can be helpful in practice. It provides tips on model training, tuning, validation, etc. It targets situations where readers apply their own ParticleNet (or ParticleNet-like) model to a custom task.
ParticleNet [arXiv:1902.08570] is an advanced neural network architecture that has many applications in CMS, including heavy flavour jet tagging, jet mass regression, etc. The network is fed by various low-level point-like objects as input, e.g., the particle-flow candidates, to predict a feature of a jet.
On this page, we introduce several user-specific aspects of the ParticleNet model. We cover the following items in three sections:
build three network models and understand them from the technical side; use the out-of-the-box commands to run these examples on a benchmark task. The three networks are (1) a simple feed-forward NN, (2) a DeepAK8 model (based on 1D CNN), and eventually (3) the ParticleNet model (based on DGCNN).
try to reproduce the original performance and make the ROC plots.
This section is friendly to the ML newcomers. The goal is to help readers understand the underlying structure of the "ParticleNet".
tips for readers who are using/modifying the ParticleNet model to achieve a better performance
This section can be helpful in practice. It provides tips on model training, tuning, validation, etc. It targets situations where readers apply their own ParticleNet (or ParticleNet-like) model to a custom task.
Corresponding persons:
Huilin Qu, Loukas Gouskos (original developers of ParticleNet)
ParticleNet is a graph neural net (GNN) model. The key ingredient of ParticleNet is the graph convolutional operation, i.e., the edge convolution (EdgeConv) and the dynamic graph CNN (DGCNN) method [arXiv:1801.07829] applied on the "point cloud" data structure.
We will disassemble the ParticleNet model and provide a detailed exploration in the next section, but here we briefly explain the key features of the model.
Intuitively, ParticleNet treats all candidates inside an object as a "point cloud", which is a permutation-invariant set of points (e.g. a set of PF candidates), each carrying a feature vector (η, φ, pT, charge, etc.). The DGCNN uses the EdgeConv operation to exploit their spatial correlations (two-dimensional on the η-φ plane) by finding the k nearest neighbours of each point and generating a new latent graph layer where points are scattered in a high-dimensional latent space. This is a graph-type analogue of the classical 2D convolution operation, which acts on a regular 2D grid (e.g., a picture) using a 3×3 local patch to explore the relations of a single pixel with its 8 nearest pixels, and then generates a new 2D grid.
As a consequence, the EdgeConv operation transforms the graph into a new graph, which has a changed spatial relationship among the points. It can then act on the second graph to produce a third graph, showing the stackability of the convolution operation. This also illustrates the "dynamic" property, as the graph topology changes after each EdgeConv layer.
Conceptually, the advantage of the network may come from exploiting the permutation-invariant symmetry of the points, which is intrinsic to our physics objects. This symmetry is naturally preserved in a point cloud representation.
In recent studies on jet physics or event-based analyses using ML techniques, there is increasing interest in exploring the point cloud data structure. We explain here conceptually why a "point cloud" representation outperforms the classical ones, including the variable-length 2D vector structure passed to a 1D CNN or any type of RNN, and the image-based representation passed through a 2D CNN. With a 1D CNN, the points (PF candidates) are most often ordered by pT to fix them on the 1D grid, and only correlations with neighbouring points of similar pT are learned by the network through the convolution operation. The Long Short-Term Memory (LSTM) type of recurrent neural network (RNN) provides the flexibility to feed in a variable-length sequence and has a "memory" mechanism to carry the information it learns from an early node to the latest node. The concern is that such an ordering of the sequence is somewhat artificial, and not an underlying property that an NN must learn to accomplish the classification task. As a comparison, in natural language processing, where LSTMs have a huge advantage, the order of words is an important characteristic of the language itself (reflecting the "grammar" in some circumstances) and is a feature the NN must learn to master the language. The image-based data explored by a 2D CNN stems from the image recognition task: a jet image with proper standardization is usually produced before being fed into the network. In this sense, it lacks the local features which the 2D local patch is better at capturing, e.g. the ear of a cat that a local patch can capture by scanning over the entire image, whereas the jet image appears to hold its features globally (e.g. the two-prong structure for W-tagging). The sparsity of the data is another concern, in that presenting a jet on a regular grid introduces redundant information, making it hard for the network to capture the key properties.
Here we briefly summarize the applications and ongoing works on ParticleNet. Public CMS results include
large-R jet with R=0.8 tagging (for W/Z/H/t) using ParticleNet [CMS-DP-2020/002]
regression on the large-R jet mass based on the ParticleNet model [CMS-DP-2021/017]
The ParticleNet architecture is also applied to small-radius R=0.4 jets for b/c-tagging and quark/gluon classification (see this talk (CMS internal)). A recent ongoing work applies the ParticleNet architecture to heavy flavour tagging at the HLT (see this talk (CMS internal)). The ParticleNet model has recently been updated to ParticleNeXt, showing further improvement (see the ML4Jets 2021 talk).
Recent works in the joint field of HEP and ML also shed light on exploiting the point cloud data structure and GNN-based architectures, with very active progress in recent years. Here we list some useful materials for the reader's reference.
Some pheno-based works are summarized in the HEP × ML living review, especially in the "graph" and "sets" categories.
Weaver is a machine learning R&D framework for high energy physics (HEP) applications. It trains the neural net with PyTorch and is capable of exporting the model to the ONNX format for fast inference. A detailed guide is presented on the Weaver README page.
Now we walk through three solid examples to get you familiar with Weaver. We use the benchmark of the top tagging task [arXiv:1707.08966] in the following example. Some useful information can be found in the "top tagging" section in the IML public datasets webpage (the gDoc).
Our goal is to do some warm-up with Weaver, and more importantly, to explore from a technical side the neural net architectures: a simple multi-layer perceptron (MLP) model, a more complicated "DeepAK8 tagger" model based on 1D CNN with ResNet, and the "ParticleNet model," which is based on DGCNN. We will dig deeper into their implementations in Weaver and try to illustrate as many details as possible. Finally, we compare their performance and see if we can reproduce the benchmark record with the model. Please clone the repo weaver-benchmark and we'll get started. The Weaver repo will be cloned as a submodule.
git clone --recursive https://github.com/colizz/weaver-benchmark.git
# Create a soft link inside weaver so that it can find data/model cards
diff --git a/inference/performance.html b/inference/performance.html
index 0ec2551..ad74892 100644
--- a/inference/performance.html
+++ b/inference/performance.html
@@ -1 +1 @@
- Performance - CMS Machine Learning Documentation
Geometric deep learning (GDL) is an emerging field focused on applying machine learning (ML) techniques to non-Euclidean domains such as graphs, point clouds, and manifolds. The PyTorch Geometric (PyG) library extends PyTorch to include GDL functionality, for example classes necessary to handle data with irregular structure. PyG is introduced at a high level in Fast Graph Representation Learning with PyTorch Geometric and in detail in the PyG docs.
A complete review of GDL is available in the following recently-published (and freely-available) textbook: Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. The authors specify several key GDL architectures including convolutional neural networks (CNNs) operating on grids, Deep Sets architectures operating on sets, and graph neural networks (GNNs) operating on graphs, collections of nodes connected by edges. PyG is focused in particular on graph-structured data, which naturally encompasses set-structured data. In fact, many state-of-the-art GNN architectures are implemented in PyG (see the docs)! A review of the landscape of GNN architectures is available in Graph Neural Networks: A Review of Methods and Applications.
Graphs are data structures designed to encode data structured as a set of objects and relations. Objects are embedded as graph nodes \(u\in\mathcal{V}\), where \(\mathcal{V}\) is the node set. Relations are represented by edges \((i,j)\in\mathcal{E}\) between nodes, where \(\mathcal{E}\) is the edge set. Denote the sizes of the node and edge sets as \(|\mathcal{V}|=n_\mathrm{nodes}\) and \(|\mathcal{E}|=n_\mathrm{edges}\) respectively. The choice of edge connectivity determines the local structure of a graph, which has important downstream effects on graph-based learning algorithms. Graph construction is the process of embedding input data onto a graph structure. Graph-based learning algorithms are correspondingly imbued with a relational inductive bias based on the choice of graph representation; a graph's edge connectivity defines its local structure. The simplest graph construction routine is to construct no edges, yielding a permutation invariant set of objects. On the other hand, fully-connected graphs connect every node-node pair with an edge, yielding \(n_\mathrm{edges}=n_\mathrm{nodes}(n_\mathrm{nodes}-1)/2\) edges. This representation may be feasible for small inputs like particle clouds corresponding to a jet, but is intractable for large-scale applications such as high-pileup tracking datasets. Notably, dynamic graph construction techniques operate on input point clouds, constructing edges on them dynamically during inference. For example, EdgeConv and GravNet GNN layers dynamically construct edges between nodes projected into a latent space; multiple such layers may be applied in sequence, yielding many intermediate graph representations on an input point cloud.
In general, nodes can have positions \(\{p_i\}_{i=1}^{n_\mathrm{nodes}}\), \(p_i\in\mathbb{R}^{n_\mathrm{space\_dim}}\), and features (attributes) \(\{x_i\}_{i=1}^{n_\mathrm{nodes}}\), \(x_i\in\mathbb{R}^{n_\mathrm{node\_dim}}\). In some applications like GNN-based particle tracking, node positions are taken to be the features. In others, e.g. jet identification, positional information may be used to seed dynamic graph construction while kinematic features are propagated as edge features. Edges, too, can have features \(\{e_{ij}\}_{(i,j)\in\mathcal{E}}\), \(e_{ij}\in\mathbb{R}^{n_\mathrm{edge\_dim}}\), but do not have positions; instead, edges are defined by the nodes they connect, and may therefore be represented by, for example, the distance between the respective node-node pair. In PyG, graphs are stored as instances of the data class, whose fields fully specify the graph:
data.y: training target with arbitrary shape (\(y\in\mathbb{R}^{n_\mathrm{nodes}\times n_\mathrm{out}}\) for node-level targets, \(y\in\mathbb{R}^{n_\mathrm{edges}\times n_\mathrm{out}}\) for edge-level targets or \(y\in\mathbb{R}^{1\times n_\mathrm{out}}\) for graph-level targets).
data.pos: Node position matrix, \(P\in\mathbb{R}^{n_\mathrm{nodes}\times n_\mathrm{space\_dim}}\)
The PyG Introduction By Example tutorial covers the basics of graph creation, batching, transformation, and inference using this data class.
Geometric deep learning (GDL) is an emerging field focused on applying machine learning (ML) techniques to non-Euclidean domains such as graphs, point clouds, and manifolds. The PyTorch Geometric (PyG) library extends PyTorch to include GDL functionality, for example classes necessary to handle data with irregular structure. PyG is introduced at a high level in Fast Graph Representation Learning with PyTorch Geometric and in detail in the PyG docs.
A complete review of GDL is available in the following recently-published (and freely-available) textbook: Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. The authors specify several key GDL architectures including convolutional neural networks (CNNs) operating on grids, Deep Sets architectures operating on sets, and graph neural networks (GNNs) operating on graphs, collections of nodes connected by edges. PyG is focused in particular on graph-structured data, which naturally encompasses set-structured data. In fact, many state-of-the-art GNN architectures are implemented in PyG (see the docs)! A review of the landscape of GNN architectures is available in Graph Neural Networks: A Review of Methods and Applications.
Graphs are data structures designed to encode data structured as a set of objects and relations. Objects are embedded as graph nodes \(u\in\mathcal{V}\), where \(\mathcal{V}\) is the node set. Relations are represented by edges \((i,j)\in\mathcal{E}\) between nodes, where \(\mathcal{E}\) is the edge set. Denote the sizes of the node and edge sets as \(|\mathcal{V}|=n_\mathrm{nodes}\) and \(|\mathcal{E}|=n_\mathrm{edges}\) respectively. The choice of edge connectivity determines the local structure of a graph, which has important downstream effects on graph-based learning algorithms. Graph construction is the process of embedding input data onto a graph structure. Graph-based learning algorithms are correspondingly imbued with a relational inductive bias based on the choice of graph representation; a graph's edge connectivity defines its local structure. The simplest graph construction routine is to construct no edges, yielding a permutation invariant set of objects. On the other hand, fully-connected graphs connect every node-node pair with an edge, yielding \(n_\mathrm{edges}=n_\mathrm{nodes}(n_\mathrm{nodes}-1)/2\) edges. This representation may be feasible for small inputs like particle clouds corresponding to a jet, but is intractable for large-scale applications such as high-pileup tracking datasets. Notably, dynamic graph construction techniques operate on input point clouds, constructing edges on them dynamically during inference. For example, EdgeConv and GravNet GNN layers dynamically construct edges between nodes projected into a latent space; multiple such layers may be applied in sequence, yielding many intermediate graph representations on an input point cloud.
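As a small sketch of (static) k-nearest-neighbour graph construction with PyG (this uses torch_geometric.nn.knn_graph, which requires the torch-cluster extension; the point cloud and the choice of k are arbitrary placeholders):
import torch
from torch_geometric.nn import knn_graph

pos = torch.randn(100, 2)          # 100 points in a 2D (eta, phi)-like space
edge_index = knn_graph(pos, k=8)   # connect each point to its 8 nearest neighbours
print(edge_index.shape)            # [2, n_edges]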
In general, nodes can have positions \(\{p_i\}_{i=1}^{n_\mathrm{nodes}}\), \(p_i\in\mathbb{R}^{n_\mathrm{space\_dim}}\), and features (attributes) \(\{x_i\}_{i=1}^{n_\mathrm{nodes}}\), \(x_i\in\mathbb{R}^{n_\mathrm{node\_dim}}\). In some applications like GNN-based particle tracking, node positions are taken to be the features. In others, e.g. jet identification, positional information may be used to seed dynamic graph construction while kinematic features are propagated as edge features. Edges, too, can have features \(\{e_{ij}\}_{(i,j)\in\mathcal{E}}\), \(e_{ij}\in\mathbb{R}^{n_\mathrm{edge\_dim}}\), but do not have positions; instead, edges are defined by the nodes they connect, and may therefore be represented by, for example, the distance between the respective node-node pair. In PyG, graphs are stored as instances of the data class, whose fields fully specify the graph:
data.y: training target with arbitrary shape (\(y\in\mathbb{R}^{n_\mathrm{nodes}\times n_\mathrm{out}}\) for node-level targets, \(y\in\mathbb{R}^{n_\mathrm{edges}\times n_\mathrm{out}}\) for edge-level targets or \(y\in\mathbb{R}^{1\times n_\mathrm{out}}\) for graph-level targets).
data.pos: Node position matrix, \(P\in\mathbb{R}^{n_\mathrm{nodes}\times n_\mathrm{space\_dim}}\)
The PyG Introduction By Example tutorial covers the basics of graph creation, batching, transformation, and inference using this data class.
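For orientation, a toy graph might be assembled roughly as follows (all tensors below are random placeholders):
import torch
from torch_geometric.data import Data

x = torch.randn(4, 3)                     # node features   [n_nodes, n_node_dim]
pos = torch.randn(4, 2)                   # node positions  [n_nodes, n_space_dim]
edge_index = torch.tensor([[0, 1, 2, 3],  # source nodes
                           [1, 2, 3, 0]]) # target nodes
edge_attr = torch.randn(4, 1)             # edge features   [n_edges, n_edge_dim]
y = torch.tensor([1])                     # graph-level target

graph = Data(x=x, edge_index=edge_index, edge_attr=edge_attr, pos=pos, y=y)
print(graph)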
PyTorch is an open source ML library developed by Facebook's AI Research lab. Initially released in late-2016, PyTorch is a relatively new tool, but has become increasingly popular among ML researchers (in fact, some analyses suggest it's becoming more popular than TensorFlow in academic communities!). PyTorch is written in idiomatic Python, so its syntax is easy to parse for experienced Python programmers. Additionally, it is highly compatible with graphics processing units (GPUs), which can substantially accelerate many deep learning workflows. To date PyTorch has not been integrated into CMSSW. Trained PyTorch models may be evaluated in CMSSW via ONNX Runtime, but model construction and training workflows must currently exist outside of CMSSW. Given the considerable interest in PyTorch within the HEP/ML community, we have reason to believe it will soon be available, so stay tuned!
The following documentation surrounds a set of code snippets designed to highlight some important ML features made available in PyTorch. In the following sections, we'll break down snippets from this script, highlighting specifically the PyTorch objects in it.
The fundamental PyTorch object is the tensor. At a glance, tensors behave similarly to NumPy arrays. For example, they are broadcasted, concatenated, and sliced in exactly the same way. The following examples highlight some common numpy-like tensor transformations:
PyTorch is an open source ML library developed by Facebook's AI Research lab. Initially released in late-2016, PyTorch is a relatively new tool, but has become increasingly popular among ML researchers (in fact, some analyses suggest it's becoming more popular than TensorFlow in academic communities!). PyTorch is written in idiomatic Python, so its syntax is easy to parse for experienced Python programmers. Additionally, it is highly compatible with graphics processing units (GPUs), which can substantially accelerate many deep learning workflows. To date PyTorch has not been integrated into CMSSW. Trained PyTorch models may be evaluated in CMSSW via ONNX Runtime, but model construction and training workflows must currently exist outside of CMSSW. Given the considerable interest in PyTorch within the HEP/ML community, we have reason to believe it will soon be available, so stay tuned!
The following documentation surrounds a set of code snippets designed to highlight some important ML features made available in PyTorch. In the following sections, we'll break down snippets from this script, highlighting specifically the PyTorch objects in it.
The fundamental PyTorch object is the tensor. At a glance, tensors behave similarly to NumPy arrays. For example, they are broadcasted, concatenated, and sliced in exactly the same way. The following examples highlight some common numpy-like tensor transformations:
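For instance (a few hedged one-liners, not taken from the script referenced above):
import numpy as np
import torch

a = torch.ones(2, 3)
b = torch.arange(3, dtype=torch.float32)

print(a + b)                        # broadcasting, as with NumPy arrays
print(torch.cat([a, a], dim=0))     # concatenation along the first axis
print(a[:, 1:])                     # slicing
print(torch.from_numpy(np.zeros((2, 3))))  # conversion from a NumPy array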
While it is technically still possible to use TensorFlow 1, this version of TensorFlow is quite old and is no longer supported by CMSSW. We highly recommend that you update your model to TensorFlow 2 and follow the integration guide in the Inference/Direct inference/TensorFlow 2 documentation.
Last update: December 5, 2023
\ No newline at end of file
+ TensorFlow 1 - CMS Machine Learning Documentation
While it is technically still possible to use TensorFlow 1, this version of TensorFlow is quite old and is no longer supported by CMSSW. We highly recommend that you update your model to TensorFlow 2 and follow the integration guide in the Inference/Direct inference/TensorFlow 2 documentation.
Last update: December 5, 2023
\ No newline at end of file
diff --git a/inference/tensorflow2.html b/inference/tensorflow2.html
index 15a60b2..6621c48 100644
--- a/inference/tensorflow2.html
+++ b/inference/tensorflow2.html
@@ -1,4 +1,4 @@
- TensorFlow 2 - CMS Machine Learning Documentation
At this time, only CPU support is provided. While GPU support is generally possible, it is currently disabled due to some interference with production workflows but will be enabled once they are resolved.
To run the examples shown below, create a minimal inference setup with the following snippet. Adapt the SCRAM_ARCH according to your operating system and desired compiler.
At this time, only CPU support is provided. While GPU support is generally possible, it is currently disabled due to some interference with production workflows but will be enabled once they are resolved.
To run the examples shown below, create a minimal inference setup with the following snippet. Adapt the SCRAM_ARCH according to your operating system and desired compiler.
TensorFlow as a Service (TFaaS) was developed as a general-purpose service which can be deployed on any infrastructure, from a personal laptop or VM to cloud infrastructure, including Kubernetes/Docker-based ones. The main repository contains all details about the service, including installation, an end-to-end example, and a demo.
TensorFlow as a Service (TFaaS) was developed as a general-purpose service which can be deployed on any infrastructure, from a personal laptop or VM to cloud infrastructure, including Kubernetes/Docker-based ones. The main repository contains all details about the service, including installation, an end-to-end example, and a demo.
+ ML Hackathons - CMS Machine Learning Documentation
Welcome to the CMS ML Hackathons! Here we encourage the exploration of cutting-edge ML methods applied to particle physics problems through multi-day focused work. Form hackathon teams and work together with the ML Innovation group to get support with organization and announcements, hardware/software infrastructure, follow-up meetings and ML-related technical advice.
If you are interested in proposing a hackathon, please send an e-mail to the CMS ML Innovation conveners with a potential topic and we will get in touch!
Below follows a list of previous successful hackathons.
Abstract: The HGCAL reconstruction relies on “The Iterative CLustering” (TICL) framework. It follows an iterative approach: energy deposits in the same layer are first clustered (layer clusters), and these layer clusters are then connected to reconstruct the particle shower by forming 3-D objects, the “tracksters”. There are multiple areas that could benefit from advanced ML techniques to further improve the reconstruction performance.
In this project we plan to tackle the following topics using ML:
trackster identification (i.e., identification of the type of particle initiating the shower) and energy regression
linking of tracksters stemming from the same particle to reconstruct the full shower and/or use a high-purity trackster as a seed and collect 2D (i.e., layer clusters) and/or 3D (i.e., tracksters) energy deposits in the vicinity of the seed trackster to fully reconstruct the particle shower
tuning of the existing pattern recognition algorithms
reconstruction under HL-LHC pile-up scenarios (e.g., PU=150-200)
trackster characterization, i.e., predict whether a trackster is a sound object in itself or determine if it is more likely to be a composite one.
Abstract: The identification of the initial particle (quark, gluon, W/Z boson, etc.) responsible for the formation of the jet, also known as jet tagging, provides a powerful handle in both standard model (SM) measurements and searches for physics beyond the SM (BSM). In this project we propose the development of jet tagging algorithms both for small-radius (i.e., AK4) and large-radius (i.e., AK8) jets, using the PF candidates as inputs.
Using the PF candidates and local pixel tracks reconstructed in the scouting streams as inputs, the main goals of this project are the following:
Develop a jet-tagging baseline for scouting and compare the performance with the offline reconstruction
Understand the importance of the different input variables and the impact of various configurations (e.g., on pixel track reconstruction) on the performance
Compare different jet tagging approaches with both performance and inference time in mind
Proof of concept: ggF H->bb, ggF HH->4b, VBF HH->4b
Using as input the newly developed particle flow candidates of Seeded Cone jets in the Level1 Correlator trigger, the following tasks will be worked on:
Develop a quark, gluon, b, and pileup jet classifier for Seeded Cone R=0.4 jets using a combination of tt, VBF(H) and Drell-Yan Level1 samples
Develop tools to demonstrate the gain of such a jet tagging algorithm on a signal sample (like q vs g on VBF jets)
Study tagging performance as a function of the number of jet constituents
Study tagging performance for a "real" input vector (zero-paddes, perhaps unsorted)
Optimise jet constituent list of SeededCone Jets (N constituents, zero-removal, sorting etc)
Develop q/g/W/Z/t/H classifier for Seeded Cone R=0.8 jets
Abstract: The aim of this hackathon is to integrate graph neural nets (GNNs) for particle tracking into CMSSW.
The hackathon will make use of a GNN model reported in the paper Charged particle tracking via edge-classifying interaction networks by Gage DeZoort, Savannah Thais, et al. They used a GNN to predict connections between detector pixel hits, and achieved accurate track building. They did this with the TrackML dataset, which uses a generic detector designed to be similar to CMS or ATLAS. Work is ongoing to apply this GNN approach to CMS data.
Tasks: The hackathon aims to create a workflow that allows graph building and GNN inference within the framework of CMSSW. This would enable accurate testing of future GNN models and comparison to existing CMSSW track building methods. The hackathon will be divided into the following subtasks:
Task 1: Create a package for extracting graph features and building graphs in CMSSW.
Task 2: GNN inference on Sonic servers
Task 3: Track fitting after GNN track building
Task 4: Performance evaluation for the new track collection
In this four-day Machine Learning Hackathon, we will develop new anomaly detection algorithms for New Physics detection, intended for deployment in the two main stages of the CMS data acquisition system: the Level-1 Trigger and the High Level Trigger.
There are two main projects:
Event-based anomaly detection algorithms for the Level-1 Trigger
Jet-based anomaly detection algorithms for the High Level Trigger, specifically targeting Run 3 scouting
A list of projects can be found in this document. Instructions for fetching the data and example code for the two projects can be found at Level-1 Anomaly Detection.
Last update: December 5, 2023
\ No newline at end of file
diff --git a/innovation/journal_club.html b/innovation/journal_club.html
index 436eac8..a0a5489 100644
--- a/innovation/journal_club.html
+++ b/innovation/journal_club.html
@@ -1 +1 @@
- ML Journal Club - CMS Machine Learning Documentation
Welcome to the CMS Machine Learning Journal Club (JC)! Here we read and discuss new cutting-edge ML papers, with an emphasis on how these can be used within the collaboration. Below you can find a summary of each JC as well as some code examples demonstrating how to use the tools or methods introduced.
Below follows a complete list of all the previous CMS ML Journal Clubs, together with relevant documentation and code examples.
Dealing with Nuisance Parameters using Machine Learning in High Energy Physics: a Review
Tommaso Dorigo, Pablo de Castro
Abstract: In this work we discuss the impact of nuisance parameters on the effectiveness of machine learning in high-energy physics problems, and provide a review of techniques that allow to include their effect and reduce their impact in the search for optimal selection criteria and variable transformations. The introduction of nuisance parameters complicates the supervised learning task and its correspondence with the data analysis goal, due to their contribution degrading the model performances in real data, and the necessary addition of uncertainties in the resulting statistical inference. The approaches discussed include nuisance-parameterized models, modified or adversary losses, semi-supervised learning approaches, and inference-aware techniques.
Mapping Machine-Learned Physics into a Human-Readable Space
Taylor Faucett, Jesse Thaler, Daniel Whiteson
Abstract: We present a technique for translating a black-box machine-learned classifier operating on a high-dimensional input space into a small set of human-interpretable observables that can be combined to make the same classification decisions. We iteratively select these observables from a large space of high-level discriminants by finding those with the highest decision similarity relative to the black box, quantified via a metric we introduce that evaluates the relative ordering of pairs of inputs. Successive iterations focus only on the subset of input pairs that are misordered by the current set of observables. This method enables simplification of the machine-learning strategy, interpretation of the results in terms of well-understood physical concepts, validation of the physical model, and the potential for new insights into the nature of the problem itself. As a demonstration, we apply our approach to the benchmark task of jet classification in collider physics, where a convolutional neural network acting on calorimeter jet images outperforms a set of six well-known jet substructure observables. Our method maps the convolutional neural network into a set of observables called energy flow polynomials, and it closes the performance gap by identifying a class of observables with an interesting physical interpretation that has been previously overlooked in the jet substructure literature. - Indico - Paper
Identifying the relevant dependencies of the neural network response on characteristics of the input space
Stefan Wunsch, Raphael Friese, Roger Wolf, Günter Quast
Abstract: The relation between the input and output spaces of neural networks (NNs) is investigated to identify those characteristics of the input space that have a large influence on the output for a given task. For this purpose, the NN function is decomposed into a Taylor expansion in each element of the input space. The Taylor coefficients contain information about the sensitivity of the NN response to the inputs. A metric is introduced that allows for the identification of the characteristics that mostly determine the performance of the NN in solving a given task. Finally, the capability of this metric to analyze the performance of the NN is evaluated based on a task common to data analyses in high-energy particle physics experiments.
Maximilian Alber, Sebastian Lapuschkin, Philipp Seegerer, Miriam Hägele, Kristof T. Schütt, Grégoire Montavon, Wojciech Samek, Klaus-Robert Müller, Sven Dähne, Pieter-Jan Kindermans
In recent years, deep neural networks have revolutionized many application domains of machine learning and are key components of many critical decision or predictive processes. Therefore, it is crucial that domain specialists can understand and analyze actions and predictions, even of the most complex neural network architectures. Despite these arguments neural networks are often treated as black boxes. In the attempt to alleviate this shortcoming many analysis methods were proposed, yet the lack of reference implementations often makes a systematic comparison between the methods a major effort. The presented library iNNvestigate addresses this by providing a common interface and out-of-the-box implementation for many analysis methods, including the reference implementation for PatternNet and PatternAttribution as well as for LRP-methods. To demonstrate the versatility of iNNvestigate, we provide an analysis of image classifications for variety of state-of-the-art neural network architectures.
Simulation-based inference in particle physics and beyond (and beyond)
Johann Brehmer, Kyle Cranmer
Abstract: Our predictions for particle physics processes are realized in a chain of complex simulators. They allow us to generate high-fidelity simulated data, but they are not well-suited for inference on the theory parameters with observed data. We explain why the likelihood function of high-dimensional LHC data cannot be explicitly evaluated, why this matters for data analysis, and reframe what the field has traditionally done to circumvent this problem. We then review new simulation-based inference methods that let us directly analyze high-dimensional data by combining machine learning techniques and information from the simulator. Initial studies indicate that these techniques have the potential to substantially improve the precision of LHC measurements. Finally, we discuss probabilistic programming, an emerging paradigm that lets us extend inference to the latent process of the simulator.
C. Badiali, F.A. Di Bello, G. Frattari, E. Gross, V. Ippolito, M. Kado, J. Shlomi
Abstract: Multidimensional efficiency maps are commonly used in high energy physics experiments to mitigate the limitations in the generation of large samples of simulated events. Binned multidimensional efficiency maps are however strongly limited by statistics. We propose a neural network approach to learn ratios of local densities to estimate in an optimal fashion efficiencies as a function of a set of parameters. Graph neural network techniques are used to account for the high dimensional correlations between different physics objects in the event. We show in a specific toy model how this method is applicable to produce accurate multidimensional efficiency maps for heavy flavor tagging classifiers in HEP experiments, including for processes on which it was not trained. - Indico - Paper - Code
A General Framework for Uncertainty Estimation in Deep Learning
Antonio Loquercio, Mattia Segù, Davide Scaramuzza
Neural networks predictions are unreliable when the input sample is out of the training distribution or corrupted by noise. Being able to detect such failures automatically is fundamental to integrate deep learning algorithms into robotics. Current approaches for uncertainty estimation of neural networks require changes to the network and optimization process, typically ignore prior knowledge about the data, and tend to make over-simplifying assumptions which underestimate uncertainty. To address these limitations, we propose a novel framework for uncertainty estimation. Based on Bayesian belief networks and Monte-Carlo sampling, our framework not only fully models the different sources of prediction uncertainty, but also incorporates prior data information, e.g. sensor noise. We show theoretically that this gives us the ability to capture uncertainty better than existing methods. In addition, our framework has several desirable properties: (i) it is agnostic to the network architecture and task; (ii) it does not require changes in the optimization process; (iii) it can be applied to already trained architectures. We thoroughly validate the proposed framework through extensive experiments on both computer vision and control tasks, where we outperform previous methods by up to 23% in accuracy.
\ No newline at end of file
diff --git a/optimization/data_augmentation.html b/optimization/data_augmentation.html
index 4e82432..ef54d3a 100644
--- a/optimization/data_augmentation.html
+++ b/optimization/data_augmentation.html
@@ -1,4 +1,4 @@
- Data augmentation - CMS Machine Learning Documentation
With the increasing complexity and sizes of neural networks one needs huge amounts of data in order to train a state-of-the-art model. However, generating this data is often very resource and time intensive. Thus, one might either augment the existing data with more descriptive variables or combat the data scarcity problem by artificially increasing the size of the dataset by adding new instances without the resource-heavy generation process. Both processes are known in machine learning (ML) applications as data augmentation (DA) methods.
The first type of these methods is more widely known as feature generation or feature engineering and is done on instance level. Feature engineering focuses on crafting informative input features for the algorithm, often inspired or derived from first principles specific to the algorithm's application domain.
The second type of method is done on the dataset level. These types of techniques can generally be divided into two main categories: real data augmentation (RDA) and synthetic data augmentation (SDA). As the name suggests, RDA makes minor changes to the already existing data in order to generate new samples, whereas SDA generates new data from scratch. Examples of RDA include rotating (especially useful if we expect the event to be rotationally symmetric) and zooming, among a plethora of other methods detailed in this overview article. Examples of SDA include traditional sampling methods and more complex generative models like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). Going further, the generative methods used for synthetic data augmentation could also be used in fast simulation, which is a notable bottleneck in the overall physics analysis workflow.
Dataset augmentation may lead to more successful algorithm outcomes. For example, introducing noise into data to form additional data points improves the learning ability of several models which otherwise performed relatively poorly, as shown by Freer & Yang, 2020. This finding implies that this form of DA creates variations that the model may see in the real world. If done right, preprocessing the data with DA will result in superior training outcomes. This improvement in performance is due to the fact that DA methods act as a regularizer, reducing overfitting during training. In addition to simulating real-world variations, DA methods can also even out categorical data with imbalanced classes.
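As a hedged illustration of this idea (a toy sketch, not taken from the references above; the function name and dataset are made up), noise injection can be as simple as appending jittered copies of the training set:

import numpy as np

def augment_with_noise(X, y, n_copies=2, sigma=0.05, seed=0):
    # Append n_copies noisy replicas of (X, y) to the original dataset.
    # sigma is a fraction of each feature's standard deviation, so the
    # jitter respects the scale of the inputs.
    rng = np.random.default_rng(seed)
    scale = sigma * X.std(axis=0, keepdims=True)
    X_aug = [X] + [X + rng.normal(0.0, scale, size=X.shape) for _ in range(n_copies)]
    y_aug = [y] * (n_copies + 1)
    return np.concatenate(X_aug), np.concatenate(y_aug)

# Toy dataset: 1000 events with 5 features and a binary label
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)
X_big, y_big = augment_with_noise(X, y)
print(X_big.shape, y_big.shape)   # (3000, 5) (3000,)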
Fig. 1: Generic pipeline of a heuristic DA (figure taken from Li, 2020)
Before diving more in depth into the various DA methods and applications in HEP, here is a list of the most notable benefits of using DA methods in your ML workflow:
Improvement of model prediction precision
More training data for the model
Preventing data scarcity for state-of-the-art models
Reduction of overfitting and creation of data variability
Increased model generalization properties
Help in resolving class imbalance problems in datasets
Reduced cost of data collection and labeling
Enabling rare event prediction
And some words of caution:
There is no 'one size fits all' in DA. Each dataset and use case should be considered separately.
Don't trust the augmented data blindly
Make sure that the augmented data is representative of the problem at hand, otherwise it will negatively affect the model performance.
There must be no unnecessary duplication of existing data; only by adding unique information do we gain more insight.
Ensure the validity of the augmented data before using it in ML models.
If a real dataset contains biases, data augmented from it will contain biases too, so identifying an optimal data augmentation strategy is important: double-check your DA strategy.
Feature engineering (FE) is one of the key components of a machine learning workflow. This process transforms and augments training data with additional features in order to make the training more effective.
With multivariate analyses (MVAs), such as boosted decision trees (BDTs) and neural networks, one could start with raw, "low-level" features, like four-momenta, and the algorithm can learn higher-level patterns, correlations, metrics, etc. However, using "high-level" variables, in many cases, leads to outcomes superior to the use of low-level variables. As such, features used in MVAs are handcrafted from physics first principles.
Still, it has been shown that a deep neural network (DNN) can perform better if it is trained with both specifically constructed variables and low-level variables. This observation suggests that the network extracts additional information from the training data.
For the purposes of FE in HEP, a novel ML architecture called a Lorentz Boost Network (LBN) (see Fig. 2) was proposed and implemented by Erdmann et al., 2018. It is a multipurpose method that uses Lorentz transformations to exploit and uncover structures in particle collision events. LBN is the first stage of a two-stage neural network (NN) model that enables a fully autonomous and comprehensive characterization of collision events by exploiting exclusively the four-momenta of the final-state particles.
Within LBN, particles are combined to create rest-frame representations, which enables the formation of further composite particles. These combinations are realized via linear combinations of the N input four-vectors into a number of M particles and rest frames. These composite particles are then transformed into said rest frames by Lorentz transformations in an efficient and fully vectorized implementation.
The properties of the composite, transformed particles are compiled in the form of characteristic variables like masses, angles, etc. that serve as input for a subsequent network - the second stage, which has to be configured for a specific analysis task, like classification.
The authors observed leading performance with the LBN and demonstrated that LBN forms physically meaningful particle combinations and generates suitable characteristic variables.
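To make the idea of "linear combinations of input four-vectors" concrete, here is a hedged toy sketch in plain numpy (it is not the actual LBN implementation; the weights, shapes and variable names are illustrative, and the rest-frame boost step is omitted):

import numpy as np

N, M = 6, 4                       # N input particles, M composite particles / rest frames
fourvecs = np.random.rand(N, 4)   # toy four-momenta, columns: (E, px, py, pz)
fourvecs[:, 0] += 2.0             # ensure E^2 > |p|^2 in this toy example

# Trainable combination weights (random here): each composite particle is a
# linear combination of the N input four-vectors
W = np.random.rand(M, N)
composites = W @ fourvecs         # shape (M, 4)

# Characteristic variables of the composites, e.g. their invariant masses,
# would then feed the second-stage network
metric = np.diag([1.0, -1.0, -1.0, -1.0])
masses = np.sqrt(np.einsum("ij,jk,ik->i", composites, metric, composites))
print(masses)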
The usual ML workflow, employing LBN, is as follows:
Step-1: LBN(M, F)
1.0: Input hyperparameters: number of combinations M; number of features F
1.0: Choose: number of incoming particles, N, according to the research
diff --git a/optimization/importance.html b/optimization/importance.html
index 70f31a5..516529f 100644
--- a/optimization/importance.html
+++ b/optimization/importance.html
@@ -1,4 +1,4 @@
- Feature importance - CMS Machine Learning Documentation
Feature importance is the impact a specific input field has on a prediction model's output. In general, these impacts can range from no impact (i.e. a feature with no variance) to perfect correlation with the output. There are several reasons to consider feature importance:
Important features can be used to create simplified models, e.g. to mitigate overfitting.
Using only important features can reduce the latency and memory requirements of the model.
The relative importance of a set of features can yield insight into the nature of an otherwise opaque model (improved interpretability).
If a model is sensitive to noise, rejecting irrelevant inputs may improve its performance.
In the following subsections, we detail several strategies for evaluating feature importance. We begin with a general discussion of feature importance at a high level before offering a code-based tutorial on some common techniques. We conclude with additional notes and comments in the last section.
Most feature importance methods fall into one of three broad categories: filter methods, embedded methods, and wrapper methods. Here we give a brief overview of each category with relevant examples:
Filter methods do not rely on a specific model, instead considering features in the context of a given dataset. In this way, they may be considered to be pre-processing steps. In many cases, the goal of feature filtering is to reduce high dimensional data. However, these methods are also applicable to data exploration, wherein an analyst simply seeks to learn about a dataset without actually removing any features. This knowledge may help interpret the performance of a downstream predictive model. Relevant examples include:
Domain Knowledge: Perhaps the most obvious strategy is to select features relevant to the domain of interest.
Variance Thresholding: One basic filtering strategy is to simply remove features with low variance. In the extreme case, features with zero variance do not vary from example to example, and will therefore have no impact on the model's final prediction. Likewise, features with variance below a given threshold may not affect a model's downstream performance.
Fisher Scoring: Fisher scoring can be used to rank features; the analyst would then select the highest scoring features as inputs to a subsequent model.
Correlations: Correlated features introduce a certain degree of redundancy to a dataset, so reducing the number of strongly correlated variables may not impact a model's downstream performance.
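As a brief, hedged illustration of the variance-thresholding and correlation filters above (standard scikit-learn and numpy calls on a made-up dataset):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.random.rand(500, 8)
X[:, 0] = 1.0                                     # a constant (zero-variance) feature
X[:, 1] = X[:, 2] + 0.01 * np.random.rand(500)    # a strongly correlated pair

# Variance thresholding: drop features whose variance is below the cut
selector = VarianceThreshold(threshold=1e-3)
X_reduced = selector.fit_transform(X)
print(selector.get_support())                     # boolean mask of the kept features

# Correlations: inspect the absolute correlation matrix to spot redundancy
corr = np.abs(np.corrcoef(X_reduced, rowvar=False))
np.fill_diagonal(corr, 0.0)
print(np.argwhere(corr > 0.95))                   # pairs of strongly correlated features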
Embedded methods are specific to a prediction model and independent of the dataset. Examples:
L1 Regularization (LASSO): L1 regularization directly penalizes large model weights. In the context of linear regression, for example, this amounts to enforcing sparsity in the learned weights; weights corresponding to less relevant features will be driven to 0, nullifying the feature's effect on the output.
Wrapper methods iterate on prediction models in the context of a given dataset. In general they may be computationally expensive when compared to filter methods. Examples:
Permutation Importance: Direct interpretation isn't always feasible, so other methods have been developed to inspect a feature's importance. One common and broadly-applicable method is to randomly shuffle a given feature's input values and test the degradation of model performance (a minimal sketch implementing this definition is shown after this list). This process allows us to measure permutation importance as follows. First, fit a model (\(f\)) to training data, yielding \(f(X_\mathrm{train})\), where \(X_\mathrm{train}\in\mathbb{R}^{n\times d}\) for \(n\) input examples with \(d\) features. Next, measure the model's performance on testing data for some loss \(\mathcal{L}\), i.e. \(s=\mathcal{L}\big(f(X_\mathrm{test}), y_\mathrm{test}\big)\). For each feature \(j\in[1\ ..\ d]\), randomly shuffle the corresponding column in \(X_\mathrm{test}\) to form \(X_\mathrm{test}^{(j)}\). Repeat this process \(K\) times, so that for \(k\in [1\ ..\ K]\) each random shuffling of feature column \(j\) gives a corrupted input dataset \(X_\mathrm{test}^{(j,k)}\). Finally, define the permutation importance of feature \(j\) as the difference between the un-corrupted validation score and average validation score over the corrupted \(X_\mathrm{test}^{(j,k)}\) datasets:
\[\texttt{PI}_j = s - \frac{1}{K}\sum_{k=1}^{K} \mathcal{L}[f(X_\mathrm{test}^{(j,k)}), y_\mathrm{test}]\]
Recursive Feature Elimination (RFE): Given a prediction model and test/train dataset splits with \(D\) initial features, RFE returns the set of \(d < D\) features that maximize model performance. First, the model is trained on the full set of features. The importance of each feature is ranked depending on the model type (e.g. for regression, the slopes are a sufficient ranking measure; permutation importance may also be used). The least important feature is rejected and the model is retrained. This process is repeated until the most significant \(d\) features remain.
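The definition above can be written out directly; the following is a hedged sketch with hypothetical model, loss and data objects (any estimator exposing a predict method would do):

import numpy as np

def permutation_importance_manual(model, X_test, y_test, loss, K=10, seed=0):
    # Implements PI_j = s - (1/K) * sum_k L(f(X^{(j,k)}), y) as defined above
    rng = np.random.default_rng(seed)
    s = loss(model.predict(X_test), y_test)        # un-corrupted score
    n_examples, n_features = X_test.shape
    pi = np.zeros(n_features)
    for j in range(n_features):
        corrupted = []
        for _ in range(K):
            X_corr = X_test.copy()
            rng.shuffle(X_corr[:, j])              # shuffle only column j
            corrupted.append(loss(model.predict(X_corr), y_test))
        pi[j] = s - np.mean(corrupted)
    return pi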
Linear regression is particularly interpretable because the prediction coefficients themselves can be interpreted as a measure of feature importance. Here we will compare this direct interpretation to several model inspection techniques. In the following examples we use the Diabetes Dataset available as a Scikit-learn toy dataset. This dataset maps 10 biological markers to a 1-dimensional quantitative measure of diabetes progression:
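The original code for this tutorial is not reproduced here; a minimal stand-in using standard scikit-learn calls (the train/test fractions and settings are arbitrary choices) could be:

from sklearn.datasets import load_diabetes
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 10 normalized biological markers -> a quantitative measure of disease progression
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Direct interpretation: the fitted coefficients (the inputs are already
# normalized, so their magnitudes are roughly comparable)
for name, coef in zip(X.columns, model.coef_):
    print(f"{name:>6s}  {coef:+9.1f}")

# Model inspection: permutation importance evaluated on the held-out set
result = permutation_importance(model, X_test, y_test, n_repeats=30, random_state=0)
for name, mean, std in zip(X.columns, result.importances_mean, result.importances_std):
    print(f"{name:>6s}  {mean:.3f} +/- {std:.3f}")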
What we talk about when we talk about model optimization
Given some data \(x\) and a family of functionals parameterized by (a vector of) parameters \(\theta\) (e.g., the training weights of a DNN), the problem of learning consists in finding \(\mathrm{argmin}_\theta\,\mathrm{Loss}(f_\theta(x) - y_\mathrm{true})\). The treatment below focuses on gradient descent, but the formalization is completely general, i.e. it can also be applied to methods that are not explicitly formulated in terms of gradient descent (e.g. BDTs). The mathematical formalism for the problem of learning is briefly explained in a contribution on statistical learning to the ML forum: for the purposes of this documentation we will proceed through two illustrations.
The first illustration, elaborated from an image on the Huawei forums, shows the general idea behind learning through gradient descent in a multidimensional parameter space, where the minimum of a loss function is found by following the function's gradient until the minimum is reached.
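As a hedged, minimal sketch of this idea (a toy least-squares fit rather than a DNN; the data, learning rate and number of steps are made up):

import numpy as np

# Toy data: y = 3*x - 2 plus noise; theta = (slope, intercept)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y_true = 3.0 * x - 2.0 + 0.1 * rng.normal(size=200)

def loss(theta):
    residual = theta[0] * x + theta[1] - y_true
    return np.mean(residual ** 2)

def gradient(theta):
    residual = theta[0] * x + theta[1] - y_true
    return np.array([2.0 * np.mean(residual * x), 2.0 * np.mean(residual)])

theta = np.zeros(2)        # starting point in parameter space
lr = 0.1                   # step size along the negative gradient
for step in range(500):
    theta -= lr * gradient(theta)

print(theta, loss(theta))  # close to (3, -2) and the noise floor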
The model to be optimized via a loss function is typically a parametric function, where the set of parameters (e.g. the network weights in neural networks) corresponds to a certain fixed structure of the network. For example, a network with two inputs, two inner layers of two neurons each, and one output neuron has a fixed set of weights whose values are changed until the loss function reaches its minimum.
When we talk about model optimization we refer to the fact that often we are interested in finding which model structure is the best to describe our data. The main concern is to design a model that has sufficient complexity to store all the information contained in the training data. We can therefore think of parameterizing the network structure itself, e.g. in terms of the number of inner layers and the number of neurons per layer: these hyperparameters define a space where we want to again minimize a loss function. Formally, the parametric function \(f_\theta\) is also a function of these hyperparameters \(\lambda\): \(f_{(\theta, \lambda)}\), and the \(\lambda\) can be optimized in turn.
The second illustration, also elaborated from an image on the Huawei forums, broadly illustrates this concept: for each point in the hyperparameter space (that is, for each configuration of the model), the individual model is optimized as usual. The global minimum over the hyperparameter space is then sought.
Caveat: which data should you use to optimize your model
In typical machine learning studies, you should divide your dataset into three parts. One is used for training the model (training sample), one is used for testing the performance of the model (test sample), and the third one is the one where you actually use your trained model, e.g. for inference (application sample). Sometimes you may get away with using test data as application data: Helge Voss (Chap 5 of Behnke et al.) states that this is acceptable under three conditions that must be simultaneously valid:
no hyperparameter optimization is performed;
no overtraining is found;
the amount of training data is high enough to make statistical fluctuations negligible.
If you are doing any kind of hyperparameter optimization, thou shalt NOT use the test sample as application sample. You should have at least three distinct sets, and ideally you should use four (training, testing, hyperparameter optimization, application).
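A hedged sketch of such a three-way split (with arbitrary fractions and a toy dataset; in practice the application sample is often simply the collision data the model is deployed on):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(10000, 5)
y = np.random.randint(0, 2, size=10000)

# First split off the application (hold-out) sample, then split the remainder
X_rest, X_app, y_rest, y_app = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# 60% training, 20% testing, 20% kept blind for the application step
print(len(X_train), len(X_test), len(X_app))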
The simplest hyperparameter optimization algorithm is the grid search, where you train all the models in the hyperparameter space to build the full landscape of the global loss function, as illustrated in Goodfellow, Bengio, Courville: "Deep Learning".
To perform a meaningful grid search, you have to provide a set of values within the acceptable range of each hyperparameter; then, for each point in the cross-product space, you have to train the corresponding model (a minimal sketch of such an exhaustive scan is given after the iterative example below).
The main issue with grid search is that when there are unimportant hyperparameters (i.e. hyperparameters whose value doesn't influence the model performance much) the algorithm spends an exponentially large time (in the number of unimportant hyperparameters) on uninteresting configurations: having \(m\) parameters and testing \(n\) values for each of them leads to \(\mathcal{O}(n^m)\) tested configurations. While the issue may be mitigated by parallelization, when the number of hyperparameters (the dimension of the hyperparameter space) surpasses a handful, even parallelization can't help.
Another issue is that the search is binned: depending on the granularity in the scan, the global minimum may be invisible.
Despite these issues, grid search is sometimes still a feasible choice, and gives its best when done iteratively. For example, if you start from the interval \(\{-1, 0, 1\}\):
if the best parameter is found to be at the boundary (1), then extend the range (\(\{1, 2, 3\}\)) and do the search in the new range;
if the best parameter is e.g. at 0, then maybe zoom in and do a search in the range \(\{-0.1, 0, 0.1\}\).
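Here is the hedged sketch promised above: an exhaustive scan over the cross-product of hyperparameter values, where validation_loss is a made-up stand-in for "train the model with this configuration and return the validation-set error":

import itertools
import numpy as np

def validation_loss(learning_rate, depth):
    # Stand-in for a full training + validation-set evaluation
    return (np.log10(learning_rate) + 2) ** 2 + 0.1 * (depth - 5) ** 2

grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
    "depth": [3, 5, 7, 9],
}

best = None
for values in itertools.product(*grid.values()):      # n^m configurations
    config = dict(zip(grid.keys(), values))
    score = validation_loss(**config)
    if best is None or score < best[1]:
        best = (config, score)

print(best)   # {'learning_rate': 0.01, 'depth': 5} for this toy loss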
An improvement of the grid search is the random search, which proceeds like this:
you provide a marginal p.d.f. for each hyperparameter;
you sample from the joint p.d.f. a certain number of training configurations;
you train for each of these configurations to build the loss function landscape.
This procedure has significant advantages over a simple grid search: random search is not binned, because you are sampling from a continuous p.d.f., so the pool of explorable hyperparameter values is larger; random search is exponentially more efficient, because it tests a unique value for each influential hyperparameter on nearly every trial.
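A hedged sketch of this procedure, reusing the same made-up validation_loss idea (the marginal p.d.f.s and the number of trials are arbitrary):

import numpy as np

rng = np.random.default_rng(42)

def validation_loss(learning_rate, depth):
    # Stand-in for a full training + validation-set evaluation
    return (np.log10(learning_rate) + 2) ** 2 + 0.1 * (depth - 5) ** 2

best = None
for _ in range(20):
    # Sample each hyperparameter from its own marginal p.d.f. (no binning)
    config = {
        "learning_rate": 10 ** rng.uniform(-4, -1),   # log-uniform
        "depth": int(rng.integers(2, 10)),            # uniform over integers
    }
    score = validation_loss(**config)
    if best is None or score < best[1]:
        best = (config, score)

print(best)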
Now that we have looked at the most basic model optimization techniques, we are ready to look into using gradient descent to solve a model optimization problem. We will proceed by recasting the problem as one of model selection, where the hyperparameters are the input (decision) variables, and the model selection criterion is a differentiable validation set error. The validation set error attempts to describe the complexity of the network by a single hyperparameter (details in a contribution on statistical learning to the ML forum). The problem may be solved with standard gradient descent, as illustrated above, if we assume that the training criterion \(C\) is continuous and differentiable with respect to both the parameters \(\theta\) (e.g. weights) and the hyperparameters \(\lambda\). Unfortunately, the gradient is seldom available (either because it has a prohibitive computational cost, or because it is non-differentiable, as is the case when there are discrete variables).
Sequential Model-based Global Optimization (SMBO) consists in replacing the loss function with a surrogate model of it, when the loss function (i.e. the validation set error) is not available. The surrogate is typically built as a Bayesian regression model, when one estimates the expected value of the validation set error for each hyperparameter together with the uncertainty in this expectation. The pseudocode for the SMBO algorithm is illustrated by Bergstra et al.
This procedure results in a tradeoff between: exploration, i.e. proposing hyperparameters with high uncertainty, which may result in substantial improvement or not; and exploitation (propose hyperparameters that will likely perform as well as the current proposal---usually this mean close to the current ones). The disadvantage is that the whole procedure must run until completion before giving as an output any usable information. By comparison, manual or random searches tend to give hints on the location of the minimum faster.
We are now ready to tackle in full what is referred to as Bayesian optimization.
Bayesian optimization assumes that the unknown function \(f(\theta, \lambda)\) was sampled from a Gaussian process (GP), and that after the observations it maintains the corresponding posterior. In this context, observations are the various validation set errors for different values of the hyperparameters \(\lambda\). In order to pick the next value to probe, one maximizes some estimate of the expected improvement (see below). To understand the meaning of "sampled from a Gaussian process", we need to define what a Gaussian process is.
Gaussian processes (GPs) generalize the concept of Gaussian distribution over discrete random variables to the concept of Gaussian distribution over continuous functions. Given some data and an estimate of the Gaussian noise, by fitting a function one can estimate also the noise at the interpolated points. This estimate is made by similarity with contiguous points, adjusted by the distance between points. A GP is therefore fully described by its mean and its covariance function. An illustration of Gaussian processes is given in Kevin Jamieson's CSE599 lecture notes.
GPs are great for Bayesian optimization because they out-of-the-box provide the expected value (i.e. the mean of the process) and its uncertainty (covariance function).
Gradient descent methods are intrinsically local: the decision on the next step is taken based on the local gradient and Hessian approximations- Bayesian optimization (BO) with GP priors uses a model that uses all the information from the previous steps by encoding it in the model giving the expectation and its uncertainty. The consequence is that GP-based BO can find the minimum of difficult nonconvex functions in relatively few evaluations, at the cost of performing more computations to find the next point to try in the hyperparameters space.
The BO prior is a prior over the space of the functions. GPs are especially suited to play the role of BO prior, because marginals and conditionals can be computed in closed form (thanks to the properties of the Gaussian distribution).
There are several methods to choose the acquisition function (the function that selects the next step for the algorithm), but there is no omnipurpose recipe: the best approach is problem-dependent. The acquisition function involves an accessory optimization to maximize a certain quantity; typical choices are:
maximize the probability of improvement over the current best value: can be calculated analytically for a GP;
maximize the expected improvement over the current best value: can also be calculated analytically for a GP;
maximize the GP Upper confidence bound: minimize "regret" over the course of the optimization.
Gaussian process regression is also called kriging in geostatistics, after Daniel G. Krige (1951) who pioneered the concept later formalized by Matheron (1962)
The figure below, taken by a tutorial on BO by Martin Krasser, clarifies rather well the procedure. The task is to approximate the target function (labelled noise free objective in the figure), given some noisy samples of it (the black crosses). At the first iteration, one starts from a flat surrogate function, with a given uncertainty, and fits it to the noisy samples. To choose the next sampling location, a certain acquisition function is computed, and the value that maximizes it is chosen as the next sampling location At each iteration, more noisy samples are added, until the distance between consecutive sampling locations is minimized (or, equivalently, a measure of the value of the best selected sample is maximized).
Limitations (and some workaround) of Bayesian Optimization¶
There are three main limitations to the BO approach. A good overview of these limitations and of possible solutions can be found in arXiv:1206.2944.
First of all, it is unclear what is an appropriate choice for the covariance function and its associated hyperparameters. In particular, the standard squared exponential kernel is often too smooth. As a workaround, alternative kernels may be used: a common choice is the Matérn 5/2 kernel, which is similar to the squared exponential one but allows for non-smoothness.
Another issue is that, for certain problems, the function evaluation may take very long to compute. To overcome this, often one can replace the function evaluation with the Monte Carlo integration of the expected improvement over the GP hyperparameters, which is faster.
The third main issue is that for complex problems one would ideally like to take advantage of parallel computation. The procedure is iterative, however, and it is not easy to come up with a scheme to make it parallelizable. The referenced paper proposed sampling over the expected acquisition, conditioned on all the pending evaluations: this is computationally cheap and is intrinsically parallelizable.
Alternatives to Gaussian processes: Tree-based models¶
Gaussian Processes model directly \(P(hyperpar | data)\) but are not the only suitable surrogate models for Bayesian optimization
The so-called Tree-structured Parzen Estimator (TPE), described in Bergstra et al, models separately \(P(data | hyperpar)\) and \(P(hyperpar)\), to then obtain the posterior by explicit application of the Bayes theorem TPEs exploit the fact that the choice of hyperparameters is intrinsically graph-structured, in the sense that e.g. you first choose the number of layers, then choose neurons per layer, etc. TPEs run over this generative process by replacing the hyperparameters priors with nonparametric densities. These generative nonparametric densities are built by classifying them into those that result in worse/better loss than the current proposal.
Several expansions and improvements (particularly targeted at HPC clusters) are available, see e.g. this talk by Eric Wulff.
Caveats: don't get too obsessed with model optimization¶
In general, optimizing model structure is a good thing. F. Chollet e.g. says "If you want to get to the very limit of what can be achieved on a given task, you can't be content with arbitrary choices made by a fallible human". On the other side, for many problems hyperparameter optimization does result in small improvements, and there is a tradeoff between improvement and time spent on the task: sometimes the time spent on optimization may not be worth, e.g. when the gradient of the loss in hyperparameters space is very flat (i.e. different hyperparameter sets give more or less the same results), particularly if you already know that small improvements will be eaten up by e.g. systematic uncertainties. On the other side, before you perform the optimization you don't know if the landscape is flat or if you can expect substantial improvements. Sometimes broad grid or random searches may give you a hint on whether the landscape of hyperparameters space is flat or not.
Sometimes you may get good (and faster) improvements by model ensembling rather than by model optimization. To do model ensembling, you first train a handful models (either different methods---BDT, SVM, NN, etc---or different hyperparameters sets): \(pred\_a = model\_a.predict(x)\), ..., \(pred\_d = model\_d.predict(x)\). You then pool the predictions: \(pooled\_pred = (pred\_a + pred\_b + pred\_c + pred\_d)/4.\). THis works if all models are kind of good: if one is significantly worse than the others, then \(pooled\_pred\) may not be as good as the best model of the pool.
You can also find ways of ensembling in a smarter way, e.g. by doing weighted rather than simple averages: \(pooled\_pred = 0.5\cdot pred\_a + 0.25\cdot pred\_b + 0.1\cdot pred\_c + 0.15\cdot pred\_d)/4.\). Here the idea is to give more weight to better classifiers. However, you transfer the problem to having to choose the weights. These can be found empirically empirically by using random search or other algorithms like Nelder-Mead (result = scipy.optimize.minimize(objective, pt, method='nelder-mead'), where you build simplexes (polytope with N+1 vertices in N dimensions, generalization of triangle) and stretch them towards higher values of the objective. Nelder-Mead can converge to nonstationary points, but there are extensions of the algorithm that may help.
What we talk about when we talk about model optimization¶
Given some data \(x\) and a family of functions parameterized by (a vector of) parameters \(\theta\) (e.g. the weights of a DNN), the problem of learning consists in finding \(\mathrm{argmin}_\theta \, \mathrm{Loss}(f_\theta(x), y_{true})\). The treatment below focuses on gradient descent, but the formalization is completely general, i.e. it can be applied also to methods that are not explicitly formulated in terms of gradient descent (e.g. BDTs). The mathematical formalism for the problem of learning is briefly explained in a contribution on statistical learning to the ML forum; for the purposes of this documentation we will proceed through two illustrations.
The first illustration, elaborated from an image from the Huawei forums, shows the general idea behind learning through gradient descent in a multidimensional parameter space: the minimum of a loss function is found by following the negative of the function's gradient down to the minimum.
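As a minimal sketch of this idea (NumPy only; the toy linear model, the learning rate, and the number of steps are arbitrary illustrative choices, not part of the original documentation):

import numpy as np

def loss(theta, x, y_true):
    # toy model: linear prediction f_theta(x) = theta[0] + theta[1] * x, mean squared error loss
    y_pred = theta[0] + theta[1] * x
    return np.mean((y_pred - y_true) ** 2)

def grad_loss(theta, x, y_true):
    # analytic gradient of the mean squared error above
    y_pred = theta[0] + theta[1] * x
    residual = y_pred - y_true
    return np.array([2 * residual.mean(), 2 * (residual * x).mean()])

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y_true = 0.5 + 2.0 * x + rng.normal(scale=0.1, size=200)

theta = np.zeros(2)            # initial parameters
learning_rate = 0.1            # hypothetical choice
for step in range(500):        # follow the negative gradient towards the minimum
    theta -= learning_rate * grad_loss(theta, x, y_true)

print(theta, loss(theta, x, y_true))   # theta should approach [0.5, 2.0]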
The model to be optimized via a loss function is typically a parametric function, where the set of parameters (e.g. the network weights in neural networks) corresponds to a certain fixed structure of the network. For example, a network with two inputs, one inner layer of two neurons, and one output neuron has six weights (ignoring bias terms), whose values are changed until the loss function reaches its minimum.
When we talk about model optimization we refer to the fact that often we are interested in finding which model structure best describes our data. The main concern is to design a model that has sufficient complexity to capture the information contained in the training data. We can therefore think of parameterizing the network structure itself, e.g. in terms of the number of inner layers and the number of neurons per layer: these hyperparameters define a space in which we again want to minimize a loss function. Formally, the parametric function \(f_\theta\) is also a function of these hyperparameters \(\lambda\): \(f_{(\theta, \lambda)}\), and the \(\lambda\) can be optimized.
The second illustration, also elaborated from an image from the Huawei forums, broadly illustrates this concept: for each point in the hyperparameter space (that is, for each configuration of the model), the individual model is optimized as usual. The global minimum over the hyperparameter space is then sought.
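One way to express this nesting is sketched below; scikit-learn is used purely as an example, and the dataset, the hyperparameters, and the helper name validation_error are illustrative assumptions (the same helper is reused in later sketches):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def validation_error(n_layers, n_neurons):
    # inner optimization: train the parameters theta for a fixed structure lambda
    model = MLPClassifier(hidden_layer_sizes=(n_neurons,) * n_layers,
                          max_iter=500, random_state=0)
    model.fit(X_train, y_train)
    # outer objective: the validation-set error as a function of the hyperparameters
    return 1.0 - model.score(X_val, y_val)

print(validation_error(n_layers=2, n_neurons=16))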
Caveat: which data should you use to optimize your model¶
In typical machine learning studies, you should divide your dataset into three parts. One is used for training the model (training sample), one is used for testing the performance of the model (test sample), and the third one is the one where you actually use your trained model, e.g. for inference (application sample). Sometimes you may get away with using test data as application data: Helge Voss (Chap 5 of Behnke et al.) states that this is acceptable under three conditions that must be simultaneously valid:
no hyperparameter optimization is performed;
no overtraining is found;
the number of training data is high enough to make statistical fluctuations negligible.
If you are doing any kind of hyperparameter optimization, you must NOT use the test sample as the application sample. You should have at least three distinct sets, and ideally you should use four (training, testing, hyperparameter optimization, application).
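A hedged sketch of such a split with scikit-learn (the fractions are arbitrary; X and y are the arrays from the sketch above, and the application sample is simply held out and never touched during training or optimization):

from sklearn.model_selection import train_test_split

# split off the application (holdout) sample first
X_rest, X_app, y_rest, y_app = train_test_split(X, y, test_size=0.25, random_state=0)
# then split the remainder into training, testing and hyperparameter-optimization sets
X_train, X_tmp, y_train, y_tmp = train_test_split(X_rest, y_rest, test_size=0.4, random_state=0)
X_test, X_hpo, y_test, y_hpo = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)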
The simplest hyperparameter optimization algorithm is grid search, where you train all the models in the hyperparameter space to build the full landscape of the global loss function, as illustrated in Goodfellow, Bengio, Courville: "Deep Learning".
To perform a meaningful grid search, you have to provide a set of values within the acceptable range of each hyperparameter, and then train the corresponding model for each point in the cross-product space.
The main issue with grid search is that when there are unimportant hyperparameters (i.e. hyperparameters whose value does not influence the model performance much), the algorithm spends an exponentially large amount of time (in the number of unimportant hyperparameters) on uninteresting configurations: having \(m\) parameters and testing \(n\) values for each of them leads to \(\mathcal{O}(n^m)\) tested configurations. While the issue may be mitigated by parallelization, once the number of hyperparameters (the dimension of the hyperparameter space) exceeds a handful, even parallelization can't help.
Another issue is that the search is binned: depending on the granularity of the scan, the global minimum may be missed.
Despite these issues, grid search is sometimes still a feasible choice, and works best when done iteratively (a minimal sketch follows the list). For example, if you start from the interval \(\{-1, 0, 1\}\):
if the best parameter is found to be at the boundary (1), extend the range (\(\{1, 2, 3\}\)) and repeat the search in the new range;
if the best parameter is e.g. at 0, zoom in and search in the range \(\{-0.1, 0, 0.1\}\).
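A minimal grid-search sketch over two hypothetical hyperparameters, reusing the validation_error helper from the sketch above (the grids themselves are arbitrary):

import itertools

layer_grid = [1, 2, 3]
neuron_grid = [8, 16, 32]

results = {}
for n_layers, n_neurons in itertools.product(layer_grid, neuron_grid):
    results[(n_layers, n_neurons)] = validation_error(n_layers, n_neurons)

best = min(results, key=results.get)
print(best, results[best])
# if the best value sits at the edge of a grid, extend that grid and repeat;
# if it sits in the middle, zoom in with a finer grid around it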
An improvement of the grid search is the random search, which proceeds like this:
you provide a marginal p.d.f. for each hyperparameter;
you sample from the joint p.d.f. a certain number of training configurations;
you train for each of these configurations to build the loss function landscape.
This procedure has significant advantages over a simple grid search: random search is not binned, because you are sampling from a continuous p.d.f., so the pool of explorable hyperparameter values is larger; random search is exponentially more efficient, because it tests a unique value for each influential hyperparameter on nearly every trial.
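A corresponding random-search sketch, sampling each hyperparameter from a marginal p.d.f. (the distributions and the number of trials below are purely illustrative, and validation_error is again the helper from the earlier sketch):

import numpy as np

rng = np.random.default_rng(0)
n_trials = 20
trials = []
for _ in range(n_trials):
    # marginal p.d.f.s: a discrete uniform for the number of layers,
    # a log-uniform for the number of neurons per layer
    n_layers = int(rng.integers(1, 4))
    n_neurons = int(10 ** rng.uniform(0.5, 2))
    trials.append(((n_layers, n_neurons), validation_error(n_layers, n_neurons)))

best_config, best_error = min(trials, key=lambda t: t[1])
print(best_config, best_error)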
Now that we have looked at the most basic model optimization techniques, we are ready to look into using gradient descent to solve a model optimization problem. We proceed by recasting the problem as one of model selection, where the hyperparameters are the input (decision) variables and the model selection criterion is a differentiable validation set error. The validation set error attempts to describe the complexity of the network by a single hyperparameter (details in a contribution on statistical learning to the ML forum). The problem may be solved with standard gradient descent, as illustrated above, if we assume that the training criterion \(C\) is continuous and differentiable with respect to both the parameters \(\theta\) (e.g. weights) and the hyperparameters \(\lambda\). Unfortunately, this gradient is seldom available (either because computing it has a prohibitive cost, or because the criterion is non-differentiable, as is the case when there are discrete variables).
Sequential Model-based Global Optimization (SMBO) consists in replacing the loss function with a surrogate model of it when the gradient of the loss function (i.e. the validation set error) is not available. The surrogate is typically built as a Bayesian regression model, in which one estimates the expected value of the validation set error for each set of hyperparameters together with the uncertainty in this expectation. The pseudocode for the SMBO algorithm is given by Bergstra et al.
This procedure results in a tradeoff between exploration (proposing hyperparameters with high uncertainty, which may or may not result in a substantial improvement) and exploitation (proposing hyperparameters that will likely perform as well as the current proposal, which usually means hyperparameters close to the current ones). The disadvantage is that the whole procedure must run until completion before giving any usable output. By comparison, manual or random searches tend to give hints on the location of the minimum faster.
We are now ready to tackle in full what is referred to as Bayesian optimization.
Bayesian optimization assumes that the unknown function \(f(\theta, \lambda)\) was sampled from a Gaussian process (GP), and that after the observations it maintains the corresponding posterior. In this context, observations are the various validation set errors for different values of the hyperparameters \(\lambda\). In order to pick the next value to probe, one maximizes some estimate of the expected improvement (see below). To understand the meaning of "sampled from a Gaussian process", we need to define what a Gaussian process is.
Gaussian processes (GPs) generalize the concept of a Gaussian distribution over discrete random variables to a Gaussian distribution over continuous functions. Given some data and an estimate of the Gaussian noise, by fitting a function one can also estimate the uncertainty at the interpolated points. This estimate is made by similarity with contiguous points, adjusted by the distance between points. A GP is therefore fully described by its mean and its covariance function. An illustration of Gaussian processes is given in Kevin Jamieson's CSE599 lecture notes.
GPs are great for Bayesian optimization because they out-of-the-box provide the expected value (i.e. the mean of the process) and its uncertainty (covariance function).
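A sketch of this with scikit-learn (a Matérn kernel is used here, anticipating the discussion of kernels below; the observed points stand in for hypothetical validation-error measurements):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# hypothetical noisy observations of a 1D validation-error curve
X_obs = np.array([[0.1], [0.4], [0.5], [0.9]])
y_obs = np.array([0.30, 0.18, 0.15, 0.35])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-4, normalize_y=True)
gp.fit(X_obs, y_obs)

X_grid = np.linspace(0, 1, 101).reshape(-1, 1)
mean, std = gp.predict(X_grid, return_std=True)   # expectation and its uncertainty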
Gradient descent methods are intrinsically local: the decision on the next step is taken based on local gradient and Hessian approximations. Bayesian optimization (BO) with GP priors, by contrast, uses a model that encodes all the information from the previous steps, giving both the expectation and its uncertainty. The consequence is that GP-based BO can find the minimum of difficult nonconvex functions in relatively few evaluations, at the cost of performing more computations to find the next point to try in the hyperparameter space.
The BO prior is a prior over the space of the functions. GPs are especially suited to play the role of BO prior, because marginals and conditionals can be computed in closed form (thanks to the properties of the Gaussian distribution).
There are several methods to choose the acquisition function (the function that selects the next step for the algorithm), but there is no omnipurpose recipe: the best approach is problem-dependent. The acquisition function involves an accessory optimization to maximize a certain quantity; typical choices are:
maximize the probability of improvement over the current best value: can be calculated analytically for a GP;
maximize the expected improvement over the current best value: can also be calculated analytically for a GP;
maximize the GP Upper confidence bound: minimize "regret" over the course of the optimization.
Gaussian process regression is also called kriging in geostatistics, after Daniel G. Krige (1951), who pioneered the concept that was later formalized by Matheron (1962).
The figure below, taken from a tutorial on BO by Martin Krasser, clarifies the procedure rather well. The task is to approximate the target function (labelled "noise free objective" in the figure), given some noisy samples of it (the black crosses). At the first iteration, one starts from a flat surrogate function with a given uncertainty and fits it to the noisy samples. To choose the next sampling location, a certain acquisition function is computed, and the value that maximizes it is chosen as the next sampling location. At each iteration, more noisy samples are added, until the distance between consecutive sampling locations is minimized (or, equivalently, a measure of the value of the best selected sample is maximized).
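A hedged sketch of one such iteration in the spirit of that tutorial, using the expected improvement as acquisition function on the GP fitted in the sketch above (the formula below is for a minimization problem; gp, X_grid and y_obs are the objects defined there):

import numpy as np
from scipy.stats import norm

def expected_improvement(X_cand, gp, y_best, xi=0.01):
    mean, std = gp.predict(X_cand, return_std=True)
    # improvement over the current best observed value (we are minimizing)
    imp = y_best - mean - xi
    z = imp / np.maximum(std, 1e-12)
    return imp * norm.cdf(z) + std * norm.pdf(z)

ei = expected_improvement(X_grid, gp, y_best=y_obs.min())
x_next = X_grid[np.argmax(ei)]   # next sampling location
# evaluate the true objective at x_next, append it to (X_obs, y_obs),
# refit the GP and repeat until the sampling locations stop moving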
Limitations (and some workaround) of Bayesian Optimization¶
There are three main limitations to the BO approach. A good overview of these limitations and of possible solutions can be found in arXiv:1206.2944.
First of all, it is unclear what is an appropriate choice for the covariance function and its associated hyperparameters. In particular, the standard squared exponential kernel is often too smooth. As a workaround, alternative kernels may be used: a common choice is the Matérn 5/2 kernel, which is similar to the squared exponential one but allows for non-smoothness.
Another issue is that, for certain problems, the function evaluation may take a very long time. To overcome this, one can often replace the function evaluation with a Monte Carlo integration of the expected improvement over the GP hyperparameters, which is faster.
The third main issue is that for complex problems one would ideally like to take advantage of parallel computation. The procedure is iterative, however, and it is not easy to come up with a scheme to make it parallelizable. The referenced paper proposed sampling over the expected acquisition, conditioned on all the pending evaluations: this is computationally cheap and is intrinsically parallelizable.
Alternatives to Gaussian processes: Tree-based models¶
Gaussian processes directly model \(P(hyperpar | data)\), but they are not the only suitable surrogate models for Bayesian optimization.
The so-called Tree-structured Parzen Estimator (TPE), described in Bergstra et al, models \(P(data | hyperpar)\) and \(P(hyperpar)\) separately, and then obtains the posterior by explicit application of Bayes' theorem. TPEs exploit the fact that the choice of hyperparameters is intrinsically graph-structured, in the sense that e.g. you first choose the number of layers, then the number of neurons per layer, etc. TPEs run over this generative process by replacing the hyperparameter priors with nonparametric densities. These generative nonparametric densities are built by splitting the observed hyperparameter configurations into those that result in a better loss than the current proposal and those that result in a worse one.
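A minimal sketch using the hyperopt implementation of TPE (the search space is an illustrative assumption, and validation_error is again the helper from the earlier sketch):

from hyperopt import fmin, tpe, hp

space = {
    "n_layers": hp.choice("n_layers", [1, 2, 3]),
    "n_neurons": hp.quniform("n_neurons", 8, 64, 4),
}

def objective(params):
    # params contains the sampled hyperparameter values
    return validation_error(int(params["n_layers"]), int(params["n_neurons"]))

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=30)
# note: for hp.choice parameters, fmin reports the index of the chosen option
print(best)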
Several expansions and improvements (particularly targeted at HPC clusters) are available, see e.g. this talk by Eric Wulff.
Caveats: don't get too obsessed with model optimization¶
In general, optimizing the model structure is a good thing. F. Chollet, for example, says: "If you want to get to the very limit of what can be achieved on a given task, you can't be content with arbitrary choices made by a fallible human". On the other hand, for many problems hyperparameter optimization yields only small improvements, and there is a tradeoff between the improvement and the time spent on the task: sometimes the time spent on optimization is not worth it, e.g. when the gradient of the loss in hyperparameter space is very flat (i.e. different hyperparameter sets give more or less the same results), particularly if you already know that small improvements will be eaten up by e.g. systematic uncertainties. On the other hand, before you perform the optimization you don't know whether the landscape is flat or whether you can expect substantial improvements. Sometimes broad grid or random searches may give you a hint on whether the landscape of the hyperparameter space is flat or not.
Sometimes you may get good (and faster) improvements by model ensembling rather than by model optimization. To do model ensembling, you first train a handful of models (either different methods---BDT, SVM, NN, etc.---or different hyperparameter sets): \(pred\_a = model\_a.predict(x)\), ..., \(pred\_d = model\_d.predict(x)\). You then pool the predictions: \(pooled\_pred = (pred\_a + pred\_b + pred\_c + pred\_d)/4\). This works if all models are reasonably good: if one is significantly worse than the others, then \(pooled\_pred\) may not be as good as the best model of the pool.
You can also ensemble in a smarter way, e.g. by taking weighted rather than simple averages: \(pooled\_pred = 0.5\cdot pred\_a + 0.25\cdot pred\_b + 0.1\cdot pred\_c + 0.15\cdot pred\_d\). Here the idea is to give more weight to the better classifiers. However, you have transferred the problem to choosing the weights. These can be found empirically by using random search or other algorithms like Nelder-Mead (result = scipy.optimize.minimize(objective, pt, method='nelder-mead')), where you build simplexes (polytopes with N+1 vertices in N dimensions, the generalization of a triangle) and move them towards better values of the objective. Nelder-Mead can converge to nonstationary points, but there are extensions of the algorithm that may help.
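A sketch of both ideas follows, with hypothetical per-model predictions generated on the fly and the weights fitted with Nelder-Mead against a simple mean squared error (the toy data and the choice of objective are assumptions for illustration):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=1000).astype(float)     # hypothetical validation labels
# hypothetical predictions of four trained models (noisy versions of the truth)
preds = np.vstack([np.clip(y_val + rng.normal(scale=s, size=y_val.size), 0, 1)
                   for s in (0.2, 0.3, 0.4, 0.5)])

pooled_pred = preds.mean(axis=0)                         # simple average

def objective(weights):
    w = np.abs(weights) / np.abs(weights).sum()          # positive weights summing to 1
    return np.mean((w @ preds - y_val) ** 2)

result = minimize(objective, x0=np.full(4, 0.25), method='nelder-mead')
print(np.abs(result.x) / np.abs(result.x).sum())         # better models get larger weights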
Welcome to the CMS-ML Dataset tab! This tab is designed to provide accurate, up-to-date, and relevant data for various purposes. We strive to make it a useful resource for your analysis and decision-making needs. We are working on benchmarking more datasets and presenting them in a user-friendly format. This tab will be continuously updated to reflect the latest developments. Explore, analyze, and derive insights with ease!
JetNet is a project aimed at enhancing accessibility and reproducibility in jet-based machine learning. It offers easy-to-access and standardized interfaces for several datasets, including JetNet, TopTagging, and QuarkGluon. Additionally, JetNet provides standard implementations of various generative evaluation metrics such as Fréchet Physics Distance (FPD), Kernel Physics Distance (KPD), Wasserstein-1 (W1), Fréchet ParticleNet Distance (FPND), coverage, and Minimum Matching Distance (MMD). Beyond these, it includes a differentiable implementation of the energy mover's distance and other general jet utilities, making it a comprehensive resource for researchers and practitioners in the field.
Structure: Each file has particle_features and jet_features arrays, containing the list of particles' features per jet and the corresponding jet's features, respectively. particle_features is of shape [N, 30, 4], where N is the total number of jets, 30 is the maximum number of particles per jet, and 4 is the number of particle features, in order: [\(\eta\), \(\varphi\), \(p_T\), mask]. See Zenodo for the definitions of these. jet_features is of shape [N, 4], where 4 is the number of jet features, in order: [\(p_T\), \(\eta\), mass, # of particles].
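A hedged sketch of reading these arrays with h5py (the file name is a placeholder, and the exact dataset keys should be checked against the Zenodo record):

import h5py

with h5py.File("g.hdf5", "r") as f:                  # hypothetical gluon-jet file name
    particle_features = f["particle_features"][:]    # shape (N, 30, 4): eta, phi, pT, mask
    jet_features = f["jet_features"][:]              # shape (N, 4): pT, eta, mass, n_particles

mask = particle_features[..., 3] > 0                 # mask flags the real (non-padded) particles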
A set of MC simulated training/testing events for the evaluation of top quark tagging architectures:
- 14 TeV, hadronic tops for signal, QCD dijets for background, Delphes ATLAS detector card with Pythia8
- No MPI/pile-up included
- Clustering of particle-flow entries (produced by Delphes E-flow) into anti-kT 0.8 jets in the pT range [550, 650] GeV
- All top jets are matched to a parton-level top within ∆R = 0.8, and to all top decay partons within 0.8
- Jets are required to have |eta| < 2
- The leading 200 jet constituent four-momenta are stored, with zero-padding for jets with fewer than 200
- Constituents are sorted by pT, with the highest-pT one first
- The truth top four-momentum is stored as truth_px etc.
- A flag (1 for top, 0 for QCD) is kept for each jet; it is called is_signal_new
- The variable "ttv" (= test/train/validation) is kept for each jet and indicates to which dataset the jet belongs. It is redundant, as the different sets are already distributed as different files.
Structure: Use “train” for training, “val” for validation during the training and “test” for final testing and reporting results. For details, see the Zenodo link
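A hedged sketch of reading one of these files with pandas (the file name, the HDF5 key, and the constituent column pattern are assumptions; check the Zenodo record for the actual layout):

import pandas as pd

train = pd.read_hdf("train.h5", key="table")                       # assumed key
constituents = train.filter(regex="^(E|PX|PY|PZ)_").to_numpy()     # assumed column pattern for the 200 constituents
labels = train["is_signal_new"].to_numpy()                         # 1 for top, 0 for QCD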
In general, HTCondor supports GPU jobs if there are some worker nodes which are configured with GPU devices. CMS Connect and lxplus both have access to worker nodes equipped with GPUs.
There are good materials providing detailed documentation on how to run HTCondor jobs with GPU support on both machines.
The configuration of the software environment for lxplus-gpu and HTCondor is described in the Software Environments page. Moreover, the Using containers page explains step by step how to build a Docker image to be run in HTCondor jobs.
Complete documentation can be found in the GPUs section of the CERN Batch Docs, where a TensorFlow example is supplied. This documentation also contains instructions on advanced HTCondor configuration, for instance constraining the GPU device or CUDA version.
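As a hedged sketch, a minimal submit description requesting a GPU might look like the following (the executable is a placeholder, and the +MaxRuntime line is a CERN-specific convention; consult the CERN Batch Docs for the exact attributes available, e.g. for constraining the device or CUDA version):

# hypothetical wrapper script that sets up the environment and runs the training
executable   = train.sh
output       = train.$(ClusterId).out
error        = train.$(ClusterId).err
log          = train.$(ClusterId).log
request_gpus = 1
request_cpus = 1
# CERN-specific runtime request, in seconds
+MaxRuntime  = 86400
queue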
A good example of submitting a GPU HTCondor job on lxplus is the weaver-benchmark project. It provides a concrete example of how to set up the environment for the weaver framework and run the training and testing process within a single job. A detailed description can be found in the ParticleNet section of this documentation.
In principle, this example can be run elsewhere as HTCondor jobs. However, the paths to the datasets should be modified accordingly.
CMS Connect also provides documentation on GPU job submission, which includes a TensorFlow example as well.
When submitting GPU jobs at CMS Connect, especially for machine learning purposes, the CERN EOS space is not accessible as a directory, so one should consider using xrootd utilities, as documented in this page.
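For instance, a hedged sketch of streaming a file over xrootd from Python with uproot (the EOS path, tree name, and branch names are placeholders):

import uproot

# EOS is not mounted as a directory on the workers, so stream the file via xrootd
with uproot.open("root://eosuser.cern.ch//eos/user/u/username/sample.root") as f:
    tree = f["Events"]                                    # placeholder tree name
    arrays = tree.arrays(["pt", "eta"], library="np")     # placeholder branch names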
ml.cern.ch - CMS Machine Learning Documentation
Kubeflow is a Kubernetes-based ML toolkit aiming to make deployments of ML workflows simple, portable, and scalable. In Kubeflow, the pipeline is an important concept: machine learning workflows are described as Kubeflow pipelines for execution.
ml.cern.ch only accepts connections from within the CERN network. Therefore, if you are outside of CERN, you will need to use a network tunnel (e.g. via ssh -D dynamic port forwarding as a SOCKS5 proxy). The main website is shown below.
After logging into the main website, you can click on the Examples entry to browse a GitLab repository containing many examples. For instance, below are two examples from that repository with well-documented readme files.
mnist-kfp is an example of how to use Jupyter notebooks to create a Kubeflow pipeline (kfp) and how to access CERN EOS files.
katib gives an example of how to use Katib to run hyperparameter tuning for jet tagging with ParticleNet.
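For orientation, a hedged sketch of a minimal Kubeflow pipeline with the kfp SDK (v1-style syntax; the image and command are placeholders, and the actual examples in the repository should be preferred):

import kfp
from kfp import dsl

@dsl.pipeline(name="mnist-training", description="Minimal example pipeline")
def mnist_pipeline():
    # placeholder container step; the real examples mount EOS and run the actual training
    dsl.ContainerOp(
        name="train",
        image="registry.cern.ch/placeholder/train:latest",
        command=["python", "train.py"],
    )

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(mnist_pipeline, "mnist_pipeline.yaml")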
Pytorch mnist - CMS Machine Learning Documentation