diff --git a/404.html b/404.html index b5098ff..b6dc7a0 100644 --- a/404.html +++ b/404.html @@ -1 +1 @@ - CMS Machine Learning Documentation

404 - Not found

\ No newline at end of file + CMS Machine Learning Documentation

404 - Not found

\ No newline at end of file diff --git a/general_advice/after/after.html b/general_advice/after/after.html index 185d52a..10486e1 100644 --- a/general_advice/after/after.html +++ b/general_advice/after/after.html @@ -1 +1 @@ - After training - CMS Machine Learning Documentation
Skip to content

After training

After the necessary steps to design the ML experiment has been made, the training has been performed and verified to be stable and consistent, there are still a few things to be checked to further solidify the confidence in the model performance.

Final evaluation

Before the training, initial data set is to be split into the train and test parts, where the former is used to train the model (possibly, with cross-validation), while the latter remains blinded. Once all the optimisations to the model architecture have been made and the model is "frozen", one proceeds to the evaluation of the metrics' values on the test set. This would be the very last check of the model for overfitting and in case there is none, one expects to see little or no difference comparing to the values on (cross)validation set used throughout the training. In turn, any discrepancies could point to possible overfitting happening in the training stage (or also possibly data leakage), which requires further investigation.

The next step to check is the output score of the model (probability1) for each class. It can be done, for example, in the form of a TMVA-like overtraining check (see Figure 1) which also allows to spot overtraining:

Figure 1. Comparison of model output for signal and background classes overlaid for train and test data sets. [source: root-forum.cern.ch]

In general, what is important to look at is that in the category for class C (defined as argmax(score_i)), the score for a class C peaks at values closer to 1. Whereas the other classes doesn't have such property with peaking on the left side of 1 and smoothly falling down to zero as the model score in the category approaches 1. Or, in other words, that the distributions of the model score for various classes are not overlapping and are as far apart as possible. This would be an indication that the model indeed distinguishes between the classes.

Another thing to look at is the data/simulation agreement for class categories. Since it is the output of the model for each category which is used in further statistical inference step, it is important to verify that data/simulation agreement of input features is properly propagated through the model into categories' distribution. This can be achieved by producing the plot similar to the one shown on Figure 2: the stacked templates for backround processes are fitted and compared with the actual predictions for the data for the set of events classified to be in the given category (jet-fakes in the example). If the output data/simulation agreement is worse than the input one, it might point to an existing bias of the model in the way it treats data and simulation events.

Figure 2. Postfit jet-fake NN score for the mutau channel. Note that the distribution for jet-fakes class is dominant in this category and also peaks at value 1 (mind the log scale), which is an indication of good identification of this background process by the model. Furthermore, ratio of data and MC templates is equal to 1 within uncertainties. [source: CMS-PAS-HIG-20-006]

Robustness

Once there is high confidence that the model isn't overtrained and no distortion in the input feature data/MC agreement is introduced, one can consider studying the robustness of the model to the parameter/input variations. Effectively, the model can be considered as a "point estimate", and any variations are helpful to understand the variance of the model outputs - hence, the model's robustness to changes.

A simple example would be a hyperparameter optimisation, where various model parameters a varied to find the best one in terms of performance. Moreover, in HEP there is a helpful (for this particular case) notion of systematic uncertainties, which is a perfect tool to study model robustness to input data variations.

Since in any case they need to be incorporated into the final statistical fit (to be performed on some interpretation of the model score), it implies that these uncertainties need to be "propagated" through the model. A sizeable fraction of those uncertainties are so-called "up/down" (or shape) variations, and therefore it is a good opportunity to study, how the model output responds to those up/down input feature changes. If there is a high sensitivity observed, one need to consider removing the most influencing feature from the training, or trying decorrelation techniques to decrease the impact of systematic-affected feature on the model output.

Systematic biases

Lastly, possible systematic biases arising the ML approach should be estimated. Being a broad and not fully formalised topic, a few examples will be given below to outline the possible sources of those.

  • The first one could be a domain shift, that is the situation where the model is trained on one data domain, but is apllied to a different one (e.g. trained on simulated data, applied on real one). In order to account for that, corresponding scale factor corrections are traditionally derived, and those will come with some uncertainty as well.
  • Another example would be the case of undertraining. Consider the case of fitting a complex polynomial data with a simple linear function. In that case, the model has high bias (and low variance) which results in a systematic shift of its prediction to be taken into account.
  • Care needs to be taken in cases where a cut is applied on the model output. Cuts might potentially introduce shifts and in case of the model score, which is a variable with a complex and non-linear relationship with input features, it might create undesirable biases. For example, in case of cutting on the output score and looking at the invariant mass distribution (e.g. of two jets), one can observe an effect which is known as mass sculpting (see Figure 3). In that case, the background distribution peaks at the mass of the signal resonance used as a signal in the classification task. After applying such cut, signal and background shapes overlap and become very similar, which dillutes the discrimination power between two hypotheses if invariant mass was to be used as the observable to be fitted.
Figure 3. Left: Distributions of signal and background events without selection. Right: Background distributions at 50% signal efficiency (true positive rate) for different classifiers. The unconstrained classifier sculpts a peak at the W-boson mass, while other classifiers do not. [source: arXiv:2010.09745]

  1. Here it is assumed that it can be treated as probability to be assigned to a given class. This is mostly the case if there is a sigmoid/softmax used on the output layer of the neural network and the model is trained with a cross-entropy loss function. 


Last update: December 5, 2023
\ No newline at end of file + After training - CMS Machine Learning Documentation

After training

After the necessary steps to design the ML experiment have been made, and the training has been performed and verified to be stable and consistent, there are still a few things to check to further solidify confidence in the model performance.

Final evaluation

Before the training, the initial data set is to be split into train and test parts, where the former is used to train the model (possibly with cross-validation), while the latter remains blinded. Once all the optimisations to the model architecture have been made and the model is "frozen", one proceeds to evaluate the metrics on the test set. This is the very last check of the model for overfitting: if there is none, one expects to see little or no difference compared to the values on the (cross-)validation set used throughout the training. In turn, any discrepancies could point to overfitting in the training stage (or possibly to data leakage), which requires further investigation.

The next thing to check is the output score of the model (probability[1]) for each class. This can be done, for example, in the form of a TMVA-like overtraining check (see Figure 1), which also makes it possible to spot overtraining:

Figure 1. Comparison of model output for signal and background classes overlaid for train and test data sets. [source: root-forum.cern.ch]

In general, what is important to look at is that, in the category for class C (defined by argmax(score_i)), the score for class C peaks at values close to 1, whereas the scores for the other classes do not: they peak at lower values and fall smoothly to zero as the model score in the category approaches 1. In other words, the distributions of the model score for the various classes should overlap as little as possible and be as far apart as possible. This would be an indication that the model indeed distinguishes between the classes.
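
The train/test comparison described above can also be quantified. The following is a minimal sketch, assuming toy scores drawn from Beta distributions (the samples and their parameters are hypothetical, not taken from any analysis). It computes the two-sample Kolmogorov-Smirnov statistic, i.e. the largest distance between the empirical cumulative distributions of the train and test scores for one class:

```python
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    # The supremum of the ECDF difference is attained at one of the sample points.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


rng = random.Random(0)

# Hypothetical signal-class scores: train and test drawn from the same distribution.
train_scores = [rng.betavariate(5, 2) for _ in range(2000)]
test_scores = [rng.betavariate(5, 2) for _ in range(2000)]

# Hypothetical overtrained case: train scores pushed closer to 1 than test scores.
overtrained_train_scores = [rng.betavariate(8, 2) for _ in range(2000)]

ks_consistent = ks_statistic(train_scores, test_scores)
ks_overtrained = ks_statistic(overtrained_train_scores, test_scores)

print(f"KS (consistent train/test):  {ks_consistent:.3f}")
print(f"KS (overtrained train/test): {ks_overtrained:.3f}")
```

A small statistic suggests compatible train and test distributions, while a large one hints at overtraining. What counts as "large" depends on the sample sizes and should be judged against the corresponding p-value or by eye on the overlaid histograms.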

Another thing to look at is the data/simulation agreement for the class categories. Since it is the output of the model for each category that is used in the subsequent statistical inference step, it is important to verify that the data/simulation agreement of the input features is properly propagated through the model into the categories' distributions. This can be achieved by producing a plot similar to the one shown in Figure 2: the stacked templates for background processes are fitted and compared with the actual predictions for the data, for the set of events classified into the given category (jet-fakes in the example). If the output data/simulation agreement is worse than the input one, it might point to a bias in the way the model treats data and simulation events.

Figure 2. Postfit jet-fake NN score for the mutau channel. Note that the distribution for the jet-fakes class is dominant in this category and also peaks at the value 1 (mind the log scale), which is an indication of good identification of this background process by the model. Furthermore, the ratio of data and MC templates is equal to 1 within uncertainties. [source: CMS-PAS-HIG-20-006]

Robustness

Once there is high confidence that the model isn't overtrained and that no distortion of the input feature data/MC agreement is introduced, one can consider studying the robustness of the model to parameter/input variations. Effectively, the model can be considered a "point estimate", and any variations help to understand the variance of the model outputs and hence the model's robustness to changes.

A simple example would be a hyperparameter optimisation, where various model parameters are varied to find the best-performing configuration. Moreover, in HEP there is the notion of systematic uncertainties, which is helpful in this particular case as a tool to study model robustness to input data variations.

Since these uncertainties need to be incorporated into the final statistical fit in any case (to be performed on some interpretation of the model score), they need to be "propagated" through the model. A sizeable fraction of those uncertainties are so-called "up/down" (or shape) variations, and therefore it is a good opportunity to study how the model output responds to those up/down input feature changes. If a high sensitivity is observed, one needs to consider removing the most influential feature from the training, or trying decorrelation techniques to decrease the impact of the systematic-affected feature on the model output.
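
As an illustration of such a study, here is a minimal sketch assuming a toy model and a 5% up/down scale variation. The function `toy_model`, its weights, and the size of the variation are hypothetical stand-ins for a trained classifier and a real systematic uncertainty. Each feature is varied in turn and the mean absolute change of the model score is recorded:

```python
import math
import random


def toy_model(features):
    """Hypothetical stand-in for a trained classifier: logistic function of a weighted sum."""
    weights = (2.0, 0.1)  # assumption: the score depends strongly on feature 0 only
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def mean_score_shift(events, feature_index, relative_shift):
    """Mean absolute change of the score when one feature is scaled by (1 + relative_shift)."""
    total = 0.0
    for event in events:
        varied = list(event)
        varied[feature_index] *= 1.0 + relative_shift
        total += abs(toy_model(varied) - toy_model(event))
    return total / len(events)


rng = random.Random(1)
events = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]

# Propagate an assumed +-5% variation of each feature through the model.
shifts = {}
for index in range(2):
    up = mean_score_shift(events, index, +0.05)
    down = mean_score_shift(events, index, -0.05)
    shifts[index] = (up, down)
    print(f"feature {index}: up={up:.4f}, down={down:.4f}")
```

In this toy setup the score responds far more strongly to variations of feature 0, which would make it the first candidate for removal or decorrelation.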

Systematic biases

Lastly, possible systematic biases arising from the ML approach should be estimated. Since this is a broad and not fully formalised topic, a few examples are given below to outline possible sources of such biases.

  • The first one could be a domain shift, that is, the situation where the model is trained on one data domain but applied to a different one (e.g. trained on simulated data, applied to real data). To account for that, corresponding scale factor corrections are traditionally derived, and those come with some uncertainty as well.
  • Another example would be the case of undertraining. Consider fitting data that follow a complex polynomial with a simple linear function. In that case, the model has high bias (and low variance), which results in a systematic shift of its predictions that has to be taken into account.
  • Care needs to be taken in cases where a cut is applied on the model output. Cuts might introduce shifts, and since the model score is a variable with a complex and non-linear relationship with the input features, cutting on it might create undesirable biases. For example, when cutting on the output score and looking at the invariant mass distribution (e.g. of two jets), one can observe an effect known as mass sculpting (see Figure 3). In that case, the background distribution peaks at the mass of the resonance used as the signal in the classification task. After applying such a cut, signal and background shapes overlap and become very similar, which dilutes the discrimination power between the two hypotheses if the invariant mass were to be used as the observable to be fitted.
Figure 3. Left: Distributions of signal and background events without selection. Right: Background distributions at 50% signal efficiency (true positive rate) for different classifiers. The unconstrained classifier sculpts a peak at the W-boson mass, while other classifiers do not. [source: arXiv:2010.09745]
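
The undertraining example from the list above can be made concrete with a small sketch. It assumes toy data following y = x^2 with small Gaussian noise (all numbers are illustrative) and fits a straight line by least squares. The residuals are then not scattered around zero but shifted systematically, negative in the centre and positive at the edges:

```python
import random
import statistics

rng = random.Random(2)

# Toy data: a quadratic dependence with small noise (illustrative assumption).
xs = [i / 50 - 1 for i in range(101)]  # 101 points from -1 to 1
ys = [x**2 + rng.gauss(0, 0.02) for x in xs]

# Least-squares straight-line fit, i.e. a model with too little capacity for this data.
mean_x = statistics.fmean(xs)
mean_y = statistics.fmean(ys)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Average residual in the centre and at the edges of the x range.
centre_bias = statistics.fmean(r for x, r in zip(xs, residuals) if abs(x) < 0.3)
edge_bias = statistics.fmean(r for x, r in zip(xs, residuals) if abs(x) > 0.8)

print(f"mean residual, centre: {centre_bias:+.3f}")
print(f"mean residual, edges:  {edge_bias:+.3f}")
```

The region-dependent sign of the mean residual is the systematic shift referred to above: it does not average out with more data and would need to be corrected for or covered by an uncertainty.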

  1. Here it is assumed that the score can be treated as the probability of being assigned to a given class. This is mostly the case if a sigmoid/softmax is used on the output layer of the neural network and the model is trained with a cross-entropy loss function.


Last update: December 5, 2023
\ No newline at end of file diff --git a/general_advice/before/domains.html b/general_advice/before/domains.html index 3319f51..43183b7 100644 --- a/general_advice/before/domains.html +++ b/general_advice/before/domains.html @@ -1 +1 @@ - Domains - CMS Machine Learning Documentation
Skip to content

Domains

Data plays a crucial role in the process of training any ML model. It is something from which the model learns to solve a given task and therefore care needs to be taken with its handling. There are two main considerations when collecting and preparing data for an ML task:

  1. The data set should be relevant to the problem and should represent the underlying structure of the problem without containing potential biases and irrelevant deviations (e.g. MC simulation artefacts).
  2. A proper preprocessing of the data set should be performed so that the training step goes smoothly.

In this section a general domain perspective on data will be covered. In the following sections a more granular look will be taken from the side of features and construction of inputs to the model.

Coverage

To begin with, one needs to bear in mind that training data should be as close as possible to data they expect to have in the context of analysis. Speaking in more formal terms,

Domains of training (used to train the model) and inference (used to make final predictions) data sets should not sizeably diverge.

Examples
  • In most of the cases the model is usually trained on MC simulated data and later on applied to data to produce predictions which are then passed on to statistical inference step. MC simulation isn't perfect and therefore there are always differences between simulation and data domains. This can lead to the cases when model learns simulation artefacts which come e.g. from detector response mismodelling. Thus, its performance on data may be at least suboptimal and at most meaningless.
  • Consider the model which is trained to predict the energy of a hadron given its energy deposits in the calorimeter (represented e.g. in the form of image or graph). Data consists of the showers initiated by a particle generated by a particle gun and having discrete values of energies (e.g. 1 GeV, 10 GeV, 20 GeV, etc.). However, in the real world settings, the model will be applied to showers produced by particles with underlying continuous energy spectrum. Although ML models are known for their capability to interpolate beyond their training domain, without apropriate tests model performance in the parts of the energy spectrum outside of its training domain is not a priori clear.

Solution

It is particularly not easy to build a model entirely robust to domain shift, so there is no general framework yet to approach and recover for discrepancies between training and inference domains altogether. However, there is research ongoing in this direction and several methods to recover for specific deviations have been already proposed.

It is a widely known practice to introduce scale factor (SF) corrections to account for possible discrepancies between data and MC simulation. Effectively, that means that the model is probed on some part of the domain on which it wasn't trained on (data) and then corrected for any differences by using a meaningful set of observables to derive SFs. One particularly promising approaches to remedy for data/MC domain difference is to use adversarial approaches to fully leverage the multidimensionality of the problem, as described in a DeepSF note.

Another solution would be to incorporate methods of domain adaptation into an ML pipeline, which essentially guide the model to be invariant and robust towards domain shift. Particularly in HEP, a Learning to Pivot with Adversarial Networks paper was one of the pioneers to investigate how a pile-up dependency can be mitigated, which can also be easily expanded to building a model robust to domain shift1.

Last but not the least, a usage of Bayesian neural networks has a great advantage of getting uncertainties estimate along with each prediction. If these uncertainties are significantly larger for some samples, this could indicate that they come from the domain beyond the training one (a so-called out-of-distribution samples). This post hoc analysis of prediction uncertainties, for example, can point to inconsistencies in or incompleteness of MC simulation/ data-driven methods of the background estimation.

Population

Furthermore, nowadays analyses are searching for very rare processes and therefore are interested in low-populated regions of the phase space. And even though the domain of interest may be covered in the training data set, it may also not be sufficiently covered in terms of the number of samples in the training data set, which populate those regions. That makes the model behaviour on an event which falls into those regions unpredictable - because it couldn't learn how to generalise in those areas due to a lack of data to learn from. Therefore,

It is important to make sure that the phase space of interest is well-represented in the training data set.

Example

This is what is often called in HEP jargon "little statistics in the tails": meaning that too few events can be found in the tails of the corresponding distribution, e.g. in the high-pt region. This might be important because the topology of events changes when one enters high-pt areas of the phase space (aka boosted regime). This further means that the model should be able to capture this change in the event signature. However, it might fail to do so due to a little available data to learn from comparing to a low-pt region.

Solution

Clearly, a way out in that case would be to provide enough training data to cover those regions (also ensuring that the model has enough capacity to embrace diverse and complex topologies).

Another solution would be to communicate to the model importance of specific topologies, which can be done for example by upweighting those events' contribution to the loss function.

Lastly, it might be worth trying to train several models, each targeting its specific region, instead of a general-purpose one (e.g. low-pt & boosted/merged topology tagger). Effectively, factorisation of various regions disentangle the problem of their separation for a single model and delegates it to an ensemble of dedicated models, each targeting its specific region.


  1. From that paper on, the HEP community started to explore a similar topic of model decorrelation, i.e. how to build a model which would be invariant to a particular variable or property of data. For a more detailed overview please refer to Section 2 of this paper


Last update: December 5, 2023
\ No newline at end of file + Domains - CMS Machine Learning Documentation

Domains

Data plays a crucial role in the process of training any ML model. It is something from which the model learns to solve a given task and therefore care needs to be taken with its handling. There are two main considerations when collecting and preparing data for an ML task:

  1. The data set should be relevant to the problem and should represent the underlying structure of the problem without containing potential biases and irrelevant deviations (e.g. MC simulation artefacts).
  2. A proper preprocessing of the data set should be performed so that the training step goes smoothly.

In this section a general domain perspective on data will be covered. In the following sections a more granular look will be taken from the side of features and construction of inputs to the model.

Coverage

To begin with, one needs to bear in mind that the training data should be as close as possible to the data one expects to have in the context of the analysis. In more formal terms,

Domains of training (used to train the model) and inference (used to make final predictions) data sets should not sizeably diverge.

Examples
  • In most cases the model is trained on MC simulated data and later applied to data to produce predictions, which are then passed on to the statistical inference step. MC simulation isn't perfect and therefore there are always differences between the simulation and data domains. This can lead to cases where the model learns simulation artefacts which come e.g. from detector response mismodelling. Thus, its performance on data may be at best suboptimal and at worst meaningless.
  • Consider a model which is trained to predict the energy of a hadron given its energy deposits in the calorimeter (represented e.g. in the form of an image or a graph). The training data consist of showers initiated by a particle generated by a particle gun and having discrete energy values (e.g. 1 GeV, 10 GeV, 20 GeV, etc.). However, in a real-world setting, the model will be applied to showers produced by particles with an underlying continuous energy spectrum. Although ML models are known for their capability to interpolate beyond their training domain, without appropriate tests the model performance in the parts of the energy spectrum outside of its training domain is not a priori clear.

Solution

It is not easy to build a model entirely robust to domain shift, and there is no general framework yet to address and correct for discrepancies between training and inference domains altogether. However, research is ongoing in this direction and several methods to correct for specific deviations have already been proposed.

It is a widely known practice to introduce scale factor (SF) corrections to account for possible discrepancies between data and MC simulation. Effectively, that means that the model is probed on some part of the domain on which it wasn't trained (data) and then corrected for any differences by using a meaningful set of observables to derive SFs. One particularly promising way to remedy the data/MC domain difference is to use adversarial approaches to fully leverage the multidimensionality of the problem, as described in a DeepSF note.

Another solution would be to incorporate methods of domain adaptation into an ML pipeline, which essentially guide the model to be invariant and robust towards domain shift. In HEP in particular, the Learning to Pivot with Adversarial Networks paper was one of the first to investigate how a pile-up dependency can be mitigated, an approach which can also be extended to building a model robust to domain shift[1].

Last but not least, the use of Bayesian neural networks has the great advantage of providing an uncertainty estimate along with each prediction. If these uncertainties are significantly larger for some samples, this could indicate that they come from a domain beyond the training one (so-called out-of-distribution samples). This post hoc analysis of prediction uncertainties can, for example, point to inconsistencies in or incompleteness of the MC simulation or the data-driven methods of background estimation.

Population

Furthermore, analyses nowadays search for very rare processes and are therefore interested in sparsely populated regions of the phase space. Even though the domain of interest may be covered in the training data set, it may not be sufficiently covered in terms of the number of training samples that populate those regions. That makes the model behaviour on an event which falls into those regions unpredictable, because the model couldn't learn how to generalise in those areas due to a lack of data to learn from. Therefore,

It is important to make sure that the phase space of interest is well-represented in the training data set.

Example

This is what is often called in HEP jargon "little statistics in the tails", meaning that too few events can be found in the tails of the corresponding distribution, e.g. in the high-pt region. This might be important because the topology of events changes when one enters high-pt areas of the phase space (aka the boosted regime), so the model should be able to capture this change in the event signature. However, it might fail to do so because little data is available to learn from compared to the low-pt region.

Solution

Clearly, a way out in that case would be to provide enough training data to cover those regions (also ensuring that the model has enough capacity to embrace diverse and complex topologies).

Another solution would be to communicate to the model the importance of specific topologies, which can be done, for example, by upweighting the contribution of those events to the loss function.
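
A minimal sketch of such an upweighting is given below, assuming a binary classification task with a cross-entropy loss. The events, the pt threshold and the weight value are hypothetical and chosen only for illustration. Events above the threshold receive a larger per-event weight, so their mistakes contribute more to the loss:

```python
import math


def weighted_bce(labels, predictions, weights):
    """Weighted binary cross-entropy, normalised by the sum of weights."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for y, p, w in zip(labels, predictions, weights):
        p = min(max(p, eps), 1.0 - eps)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)


# Hypothetical events: (true label, predicted probability, pt in GeV).
events = [(1, 0.9, 40.0), (0, 0.2, 60.0), (1, 0.4, 450.0), (0, 0.7, 520.0)]
labels = [label for label, _, _ in events]
predictions = [prob for _, prob, _ in events]

HIGH_PT_THRESHOLD = 300.0  # assumption: start of the sparsely populated region
HIGH_PT_WEIGHT = 5.0  # assumption: arbitrary upweighting factor

uniform_weights = [1.0] * len(events)
upweighted = [HIGH_PT_WEIGHT if pt > HIGH_PT_THRESHOLD else 1.0 for _, _, pt in events]

loss_uniform = weighted_bce(labels, predictions, uniform_weights)
loss_upweighted = weighted_bce(labels, predictions, upweighted)

print(f"loss with uniform weights: {loss_uniform:.4f}")
print(f"loss with high-pt events upweighted: {loss_upweighted:.4f}")
```

Since the high-pt events are the poorly predicted ones in this toy example, the upweighted loss is larger, and an optimiser minimising it would be pushed to improve on exactly those events. Most ML libraries accept such per-event weights directly in their training interfaces.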

Lastly, it might be worth trying to train several models, each targeting its specific region, instead of a general-purpose one (e.g. a low-pt tagger and a boosted/merged topology tagger). Effectively, factorising the regions removes from a single model the problem of separating them and delegates it to an ensemble of dedicated models.


  1. From that paper on, the HEP community started to explore a similar topic of model decorrelation, i.e. how to build a model which would be invariant to a particular variable or property of data. For a more detailed overview please refer to Section 2 of this paper


Last update: December 5, 2023
\ No newline at end of file diff --git a/general_advice/before/features.html b/general_advice/before/features.html index 09d60e4..3721c49 100644 --- a/general_advice/before/features.html +++ b/general_advice/before/features.html @@ -1 +1 @@ - Features - CMS Machine Learning Documentation
Skip to content

Features

In the previous section, the data was considered from a general "domain" perspective and in this section a more low level view will be outlined. In particular, an emphasis will be made on features (input variables) as they play a crucial role in the training of any ML model. Essentially being the handle on and the gateway into data for the model, they are expected to reflect the data from the perspective which is important to the problem at hand and therefore define the model performance on the task.

The topic of feature engineering is very extensive and complex to be covered in this section, so the emphasis will be made primarily on the general aspects relevant to the HEP context. Broadly speaking, one should ask themselves the following questions during the data preparation:

  • Are features understood?
  • Are features correctly modelled?
  • Are features appropriately processed?

Understanding

Clearly one should motivate for themselves (and then possibly for analysis reviewers) why this exact set of features and not the other one has been selected1. Aside from physical understanding and intuition it would be good if a priori expert knowledge is supplemented by running further experiments.

Here one can consider either studies done prior to the training or after it. As for the former, studying feature correlations (with the target variable as well) e.g. by computing Pearson and/or Spearman correlation coefficients and plotting several histogram/scatter plots could bring some helpful insights. As for the latter, exploring feature importances as the trained model deems it important can boost the understanding of both the data and the model altogether.

Modelling

Although seemingly obvious, for the sake of completeness the point of achieving good data/MC agreement should be mentioned. It has always been a must to be checked in a cut-based approach and ML-based one is of no difference: the principle "garbage in, garbage out" still holds.

Example

For example, classical feed-forward neural network is just a continuous function mapping the input space to the output one, so any discrepancies in the input might propagate to the output. In case of boosted decision trees it is also applicable: any (domain) differences in the shape of input (training) distribution w.r.t. true "data" distribution might sizeably affect the construction of decision boundary in the feature space.

Figure 1. Control plot for a visible mass of tau lepton pair in emu final state. [source: CMS-TAU-18-001]

Since features are the handle on the data, checking for each input feature that the ratio of data to MC features' histograms is close to 1 within uncertainties (aka by eye) is one of the options. For a more formal approach, one can perform goodness of fit (GoF) tests in 1D and 2D, checking that as it was used for example in the analysis of Higgs boson decaying into tau leptons.

If the modelling is shown to be insufficient, the corresponding feature should be either removed, or mismodelling needs to be investigated and resolved.

Processing

Feature preprocessing can also be understood from a broader perspective of data preprocessing, i.e. transformations which need to be performed with data prior to training a model. Another way to look at this is of a step where raw data is converted into prepared data. That makes it an important part of any ML pipeline since it ensures that a smooth convergence and stability of the training is reached.

Example

In fact, the training process might not even begin (presence of NaN values) or break in the middle (outlier causing the gradients to explode). Furthermore, data can be completely misunderstood by the model which can potentially caused undesirable interpretation and performance (treatment of categorical variables as numerical).

Therefore, below there is a non-exhaustive list of the most common items to be addressed during the preprocessing step to ensure the good quality of training. For a more comprehensive overview and also code examples please refer to a detailed documentation of sklearn package and also on possible pitfalls which can arise at this point.

  • Feature encoding
  • NaN/inf/missing values2
  • Outliers & noisy data
  • Standartisation & transformations

Finally, these are the items which are worth considering in the preprocessing of data in general. However, one can also apply transformations at the level of batches as they are passed through the model. This will be briefly covered in the following section.


  1. Here it is already assumed that a proper data representation has been chosen, i.e. the way to vectorize the data to form a particular structure (e.g. image -> tensor, social network -> graph, text -> embeddings). Being on its own a whole big topic, it is left for a curious reader to dive into. 

  2. Depending on the library and how particular model is implemented there, these values can be handled automatically under the hood. 


Last update: December 5, 2023
\ No newline at end of file + Features - CMS Machine Learning Documentation

Features

In the previous section, the data was considered from a general "domain" perspective; this section outlines a lower-level view. In particular, the emphasis is on features (input variables), as they play a crucial role in the training of any ML model. Being essentially the model's handle on and gateway into the data, they are expected to reflect the data from the perspective which is important to the problem at hand, and they therefore define the model performance on the task.

The topic of feature engineering is too extensive and complex to be covered in this section, so the emphasis will be primarily on the general aspects relevant to the HEP context. Broadly speaking, one should ask oneself the following questions during the data preparation:

  • Are features understood?
  • Are features correctly modelled?
  • Are features appropriately processed?

Understanding

Clearly, one should justify to oneself (and then possibly to analysis reviewers) why this exact set of features, and not another, has been selected1. Aside from physical understanding and intuition, it is good to supplement a priori expert knowledge with further experiments.

Here one can consider studies done either prior to the training or after it. As for the former, studying feature correlations (including with the target variable), e.g. by computing Pearson and/or Spearman correlation coefficients and plotting several histograms/scatter plots, can bring helpful insights. As for the latter, exploring which features the trained model deems important can improve the understanding of both the data and the model.
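As a rough illustration of the correlation study mentioned above, the sketch below compares the Pearson and Spearman coefficients on a toy feature/target pair. The data are synthetic and the variable names are hypothetical; the point is only that a monotonic but non-linear dependence is picked up more strongly by the rank-based Spearman coefficient than by the linear Pearson one.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(seed=0)

# Hypothetical input feature and a target that depends on it
# monotonically but non-linearly (toy data, not a physics sample).
feature = rng.uniform(0.0, 3.0, size=2000)
target = np.exp(feature) + rng.normal(0.0, 0.1, size=feature.size)

pearson_r, _ = pearsonr(feature, target)      # linear correlation
spearman_rho, _ = spearmanr(feature, target)  # rank (monotonic) correlation

print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
```

Looking at both coefficients side by side, together with a scatter plot, can hint at whether a transformation of the feature might be worth trying.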

Modelling

Although seemingly obvious, for the sake of completeness the point of achieving good data/MC agreement should be mentioned. It has always been a must in a cut-based approach, and an ML-based one is no different: the principle "garbage in, garbage out" still holds.

Example

For example, a classical feed-forward neural network is just a continuous function mapping the input space to the output one, so any discrepancies in the input might propagate to the output. The same applies to boosted decision trees: any (domain) differences in the shape of the input (training) distribution w.r.t. the true "data" distribution might sizeably affect the construction of the decision boundary in the feature space.

Figure 1. Control plot for a visible mass of tau lepton pair in emu final state. [source: CMS-TAU-18-001]

Since features are the handle on the data, one option is to check for each input feature that the ratio of the data and MC histograms is close to 1 within uncertainties (i.e. by eye). For a more formal approach, one can perform goodness of fit (GoF) tests in 1D and 2D, as was done, for example, in the analysis of Higgs boson decaying into tau leptons.

If the modelling is shown to be insufficient, the corresponding feature should either be removed, or the mismodelling should be investigated and resolved.
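A minimal sketch of such a per-feature check is given below, assuming toy "data" and "MC" samples drawn from the same distribution. It builds the data/MC ratio in common bins and a simple chi-square per bin; this is a stand-in for illustration and not the GoF test used in the analysis cited above. All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Toy samples: "data" and a ten times larger "MC" sample of one feature.
data = rng.normal(90.0, 10.0, size=5000)
mc = rng.normal(90.0, 10.0, size=50000)
mc_weights = np.full(mc.size, data.size / mc.size)  # scale MC to the data yield

bins = np.linspace(60.0, 120.0, 13)  # 12 bins common to data and MC
n_data, _ = np.histogram(data, bins=bins)
n_mc, _ = np.histogram(mc, bins=bins, weights=mc_weights)
var_mc, _ = np.histogram(mc, bins=bins, weights=mc_weights**2)

ratio = n_data / n_mc  # expected to fluctuate around 1 if the modelling is good

# Simple chi-square with Poisson variance for data and sum of w^2 for MC.
chi2 = np.sum((n_data - n_mc) ** 2 / (n_data + var_mc))
n_bins = len(n_data)

print("data/MC ratio per bin:", np.round(ratio, 2))
print(f"chi2 / n_bins = {chi2 / n_bins:.2f}")
```

A chi2 per bin far above 1, or a ratio with a visible trend, would be a reason to look closer at the feature before using it in the training.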

Processing

Feature preprocessing can also be understood from the broader perspective of data preprocessing, i.e. transformations which need to be applied to the data prior to training a model. Another way to look at it is as the step where raw data is converted into prepared data. That makes it an important part of any ML pipeline, since it helps the training converge smoothly and remain stable.

Example

In fact, the training process might not even begin (presence of NaN values) or might break in the middle (an outlier causing the gradients to explode). Furthermore, data can be completely misunderstood by the model, which can potentially cause undesirable interpretation and performance (e.g. treatment of categorical variables as numerical).

Therefore, below there is a non-exhaustive list of the most common items to be addressed during the preprocessing step to ensure a good quality of training. For a more comprehensive overview and code examples, please refer to the detailed documentation of the sklearn package, which also covers possible pitfalls that can arise at this point.

  • Feature encoding
  • NaN/inf/missing values2
  • Outliers & noisy data
  • Standardisation & transformations
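The items in the list above can be sketched on a toy table as follows. This is only an illustration with hypothetical column names and arbitrary choices (median imputation, clipping at the 90% quantile, one-hot encoding); a real pipeline would derive the imputation, clipping and scaling constants from the training set only and reuse them for validation and test data.

```python
import numpy as np
import pandas as pd

# Toy table with an inf, a NaN, an outlier and a categorical column.
df = pd.DataFrame({
    "pt": [25.0, 40.0, np.inf, 31.0, 5000.0, 28.0],
    "eta": [0.1, np.nan, -1.2, 2.0, 0.4, -0.7],
    "decay_mode": [0, 1, 10, 1, 0, 10],
})
num_cols = ["pt", "eta"]

# NaN/inf/missing values: map inf to NaN, then impute with the median.
df = df.replace([np.inf, -np.inf], np.nan)
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Outliers: clip the tail of pt at a chosen quantile.
df["pt"] = df["pt"].clip(upper=df["pt"].quantile(0.9))

# Standardisation: zero mean and unit variance per numerical feature.
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()

# Feature encoding: one-hot encode the categorical variable so that
# its integer codes are not interpreted as an ordered numerical quantity.
df = pd.get_dummies(df, columns=["decay_mode"], dtype=float)

print(df.round(2))
```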

Finally, these are the items which are worth considering in the preprocessing of data in general. However, one can also apply transformations at the level of batches as they are passed through the model. This will be briefly covered in the following section.


  1. Here it is already assumed that a proper data representation has been chosen, i.e. the way to vectorize the data to form a particular structure (e.g. image -> tensor, social network -> graph, text -> embeddings). Being a big topic on its own, it is left for the curious reader to dive into. 

  2. Depending on the library and how a particular model is implemented there, these values can be handled automatically under the hood. 


Last update: December 5, 2023
\ No newline at end of file diff --git a/general_advice/before/inputs.html b/general_advice/before/inputs.html index 2ccdf52..cc02737 100644 --- a/general_advice/before/inputs.html +++ b/general_advice/before/inputs.html @@ -1,4 +1,4 @@ - Inputs - CMS Machine Learning Documentation

Inputs

After data is preprocessed as a whole, there is a question of how this data should be supplied to the model. On its way there it potentially needs to undergo a few splits which will be described below. Plus, a few additional comments about training weights and motivation for their choice will be outlined.

Data split

The first thing one should consider to do is to perform a split of the entire data set into train/validation(/test) data sets. This is an important one because it serves the purpose of diagnosis for overfitting. The topic will be covered in more details in the corresponding section and here a brief introduction will be given.

Figure 1. Decision boundaries for underfitted, optimal and overfitted models. [source: ibm.com/cloud/learn/overfitting]

The trained model is called to be overfitted (or overtrained) when it fails to generalise to solve a given problem.

One of examples would be that the model learns to predict exactly the training data and once given a new unseen data drawn from the same distribution it fails to predict the target corrrectly (right plot on Figure 1). Obviously, this is an undesirable behaviour since one wants their model to be "universal" and provide robust and correct decisions regardless of the data subset sampled from the same population.

Hence the solution to check for ability to generalise and to spot overfitting: test a trained model on a separate data set, which is the same1 as the training one. If the model performance gets significantly worse there, it is a sign that something went wrong and the model's predictive power isn't generalising to the same population.

Figure 2. Data split worflow before the training. Also cross-validation is shown as the technique to find optimal hyperparameters. [source: scikit-learn.org/stable/modules/cross_validation.html]

Clearly, the simplest way to find this data set is to put aside a part of the original one and leave it untouched until the final model is trained - this is what is called "test" data set in the first paragraph of this subsection. When the model has been finalised and optimised, this data set is "unblinded" and model performance on it is evaluated. Practically, this split can be easily performed with train_test_split() method of sklearn library.

But it might be not that simple

Indeed, there are few things to be aware of. Firstly, there is a question of how much data needs to be left for validation. Usually it is common to take the test fraction in the range [0.1, 0.4], however it is mostly up for analyzers to decide. The important trade-off which needs to be taken into account here is that between robustness of the test metric estimate (too small test data set - poorly estimated metric) and robustness of the trained model (too little training data - less performative model).

Secondly, note that the split should be done in a way that each subset is as close as possible to the one which the model will face at the final inference stage. But since usually it isn't feasible to bridge the gap between domains, the split at least should be uniform between training/testing to be able to judge fairly the model performance.

Lastly, in extreme case there might be no sufficient amount of data to perform the training, not even speaking of setting aside a part of it for validation. Here a way out would be to go for a few-shot learning, using cross-validation during the training, regularising the model to avoid overfitting or to try to find/generate more (possibly similar) data.

Lastly, one can also considering to put aside yet another fraction of original data set, what was called "validation" data set. This can be used to monitor the model during the training and more details on that will follow in the overfitting section.

Batches

Usually it is the case the training/validation/testing data set can't entirely fit into the memory due to a large size. That is why it gets split into batches (chunks) of a given size which are then fed one by one into the model during the training/testing.

While forming the batches it is important to keep in mind that batches should be sampled uniformly (i.e. from the same underlying PDF as of the original data set).

That means that each batch is populated similarly to the others according to features which are important to the given task (e.g. particles' pt/eta, number of jets, etc.). This is needed to ensure that gradients computed for each batch aren't different from each other and therefore the gradient descent doesn't encounter any sizeable stochasticities during the optimisation step.2

Lastly, it was already mentioned that one should perform preprocessing of the data set prior to training. However, this step can be substituted and/or complemented with an addition of a layer into the architecture, which will essentially do a specified part of preprocessing on every batch as they go through the model. One of the most prominent examples could be an addition of batch/group normalization, coupled with weight standardization layers which turned out to sizeably boost the performance on the large variety of benchmarks.

Training weights

Next, one can zoom into the batch and consider the level of single entries there (e.g. events). This is where the training weights come into play. Since the value of a loss function for a given batch is represented as a sum over all the entries in the batch, this sum can be naturally turned into a weighted sum. For example, in case of a cross-entropy loss with y_pred, y_true, w being vectors of predicted labels, true labels and weights respectively:

def CrossEntropy(y_pred, y_true, w): # assuming y_true = {0, 1}
+ Inputs - CMS Machine Learning Documentation       

Inputs

After data is preprocessed as a whole, there is a question of how this data should be supplied to the model. On its way there it potentially needs to undergo a few splits which will be described below. Plus, a few additional comments about training weights and motivation for their choice will be outlined.

Data split

The first thing one should consider doing is to split the entire data set into train/validation(/test) data sets. This split is important because it makes it possible to diagnose overfitting. The topic will be covered in more detail in the corresponding section; here a brief introduction will be given.

Figure 1. Decision boundaries for underfitted, optimal and overfitted models. [source: ibm.com/cloud/learn/overfitting]

The trained model is said to be overfitted (or overtrained) when it fails to generalise to solve a given problem.

One example would be that the model learns to predict exactly the training data and, once given new unseen data drawn from the same distribution, fails to predict the target correctly (right plot in Figure 1). Obviously, this is an undesirable behaviour since one wants the model to be "universal" and provide robust and correct decisions regardless of the data subset sampled from the same population.

Hence the solution to check for ability to generalise and to spot overfitting: test a trained model on a separate data set, which is the same1 as the training one. If the model performance gets significantly worse there, it is a sign that something went wrong and the model's predictive power isn't generalising to the same population.

Figure 2. Data split workflow before the training. Cross-validation is also shown as the technique to find optimal hyperparameters. [source: scikit-learn.org/stable/modules/cross_validation.html]

Clearly, the simplest way to find this data set is to put aside a part of the original one and leave it untouched until the final model is trained - this is what is called "test" data set in the first paragraph of this subsection. When the model has been finalised and optimised, this data set is "unblinded" and model performance on it is evaluated. Practically, this split can be easily performed with the train_test_split() function of the sklearn library.
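A minimal sketch of such a split with train_test_split() is shown below on a toy data set. The test fraction of 0.2, the stratification by label and the fixed random seed are choices made for this illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(1000, 5))     # toy feature matrix
y = rng.integers(0, 2, size=1000)  # toy binary labels

# Put aside 20% of the data as the blinded test set; stratify=y keeps
# the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(f"signal fraction: train={y_train.mean():.3f}, test={y_test.mean():.3f}")
```

Fixing random_state makes the split reproducible, which helps to ensure that the same events stay blinded across reruns.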

But it might be not that simple

Indeed, there are a few things to be aware of. Firstly, there is the question of how much data needs to be set aside for testing. It is common to take the test fraction in the range [0.1, 0.4], however it is mostly up to the analysers to decide. The important trade-off to take into account here is between the robustness of the test metric estimate (too small a test data set gives a poorly estimated metric) and the robustness of the trained model (too little training data gives a less performant model).

Secondly, note that the split should be done in a way that each subset is as close as possible to the one which the model will face at the final inference stage. But since usually it isn't feasible to bridge the gap between domains, the split at least should be uniform between training/testing to be able to judge fairly the model performance.

Thirdly, in the extreme case there might not be enough data to perform the training at all, let alone to set aside a part of it for validation. Here a way out would be to go for few-shot learning, to use cross-validation during the training, to regularise the model to avoid overfitting, or to try to find/generate more (possibly similar) data.

Lastly, one can also consider putting aside yet another fraction of the original data set, which was called the "validation" data set above. This can be used to monitor the model during the training; more details on that will follow in the overfitting section.

Batches

Usually the training/validation/testing data set can't fit entirely into memory due to its large size. That is why it gets split into batches (chunks) of a given size, which are then fed one by one into the model during the training/testing.

While forming the batches it is important to keep in mind that batches should be sampled uniformly (i.e. from the same underlying PDF as of the original data set).

That means that each batch is populated similarly to the others according to features which are important to the given task (e.g. particles' pt/eta, number of jets, etc.). This is needed to ensure that gradients computed for different batches don't differ too much from each other and therefore the gradient descent doesn't encounter any sizeable stochasticities during the optimisation step.2

Lastly, it was already mentioned that one should perform preprocessing of the data set prior to training. However, this step can be substituted and/or complemented by adding a layer to the architecture which will essentially do a specified part of the preprocessing on every batch as it goes through the model. One of the most prominent examples is the addition of batch/group normalization, coupled with weight standardization layers, which turned out to sizeably boost the performance on a large variety of benchmarks.
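One simple way to approach uniformly populated batches is to shuffle the data set before slicing it, as in the sketch below. The helper iterate_batches is hypothetical and written for this illustration; the toy data set is deliberately ordered by class to show that, after shuffling, each batch has a signal fraction close to the overall one.

```python
import numpy as np

def iterate_batches(X, y, batch_size, rng):
    """Yield shuffled mini-batches covering the data set exactly once."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(seed=0)

# Toy data set stored class by class: 900 background, then 100 signal.
y = np.concatenate([np.zeros(900), np.ones(100)])
X = rng.normal(size=(y.size, 3))

signal_fractions = [yb.mean() for _, yb in iterate_batches(X, y, 100, rng)]
print(np.round(signal_fractions, 2))  # each value scatters around 0.1
```

Without the permutation, the first nine batches would contain only background and the last one only signal, which is exactly the situation the text above warns against.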

Training weights

Next, one can zoom into the batch and consider the level of single entries there (e.g. events). This is where the training weights come into play. Since the value of a loss function for a given batch is represented as a sum over all the entries in the batch, this sum can be naturally turned into a weighted sum. For example, in case of a cross-entropy loss with y_pred, y_true, w being vectors of predicted labels, true labels and weights respectively:

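import numpy as np
def CrossEntropy(y_pred, y_true, w): # assuming y_true = {0, 1}
     # per-event weighted cross-entropy terms; y_pred must lie strictly in (0, 1)
     return -w*(y_true*np.log(y_pred) + (1-y_true)*np.log(1-y_pred))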
 

It is important to disentangle here two factors which define the weight to be applied on a per-event basis because of the different motivations behind them:

  • accounting for imbalance in training data
  • accounting for imbalance in nature

Imbalance in training data

The first point is related to the fact that in the case of classification we may have significantly more (>O(1) times) training data for one class than for the other. Since the training data usually comes from MC simulation, that corresponds to the case when there are more events generated for one physical process than for another. Therefore, here we want to make sure that the model is presented equally with instances of each class - this may have a significant impact on the model performance depending on the loss/metric choice.

Example

Consider the case when there are 1M events of target = 0 and 100 events of target = 1 in the training data set and a model is fitted by minimising cross-entropy to distinguish between those classes. In that case the resulting model can easily turn out to be a constant function predicting the majority target = 0, simply because this would be the optimal solution in terms of the loss function minimisation. If accuracy is used as a metric for validation, this will result in a value close to 1 on the training data.
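The numbers in this example can be checked directly. The sketch below evaluates a constant "always predict 0" classifier on labels with the same class counts; it is an illustration of the metric behaviour only, no model is trained.

```python
import numpy as np

# 1M events of target = 0 and 100 events of target = 1, as in the example.
y_true = np.concatenate([np.zeros(1_000_000, dtype=int), np.ones(100, dtype=int)])
y_pred = np.zeros_like(y_true)  # constant prediction of the majority class

accuracy = np.mean(y_pred == y_true)
recall_minority = np.mean(y_pred[y_true == 1] == 1)

print(f"accuracy               = {accuracy:.6f}")
print(f"recall for target = 1  = {recall_minority:.1f}")
```

The accuracy is close to 1 although not a single event of the minority class is identified, which is why accuracy alone can be misleading for imbalanced data.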

To account for this type of imbalance, the following weight simply needs to be introduced according to the target label of an object:

train_df['weight'] = 1.0  # float column, so the in-place division below keeps the dtype
 train_df.loc[train_df.target == 0, 'weight'] /= np.sum(train_df.loc[train_df.target == 0, 'weight'])
diff --git a/general_advice/before/metrics.html b/general_advice/before/metrics.html
index b2fc778..0d9c746 100644
--- a/general_advice/before/metrics.html
+++ b/general_advice/before/metrics.html
@@ -1 +1 @@
- Metrics & Losses - CMS Machine Learning Documentation       

Metrics & Losses

Metric

Metric is a function which evaluates model's performance given true labels and model predictions for a particular data set.

That makes it an important ingredient in the model training as being a measure of the model's quality. However, metrics as estimators can be sensitive to some effects (e.g. class imbalance) and provide biased or over/underoptimistic results. Additionally, they might not be relevant to a physical problem in mind and to the undestanding of what is a "good" model1. This in turn can result in suboptimally tuned hyperparameters or in general to suboptimally trained model.

Therefore, it is important to choose metrics wisely, so that they reflect the physical problem to be solved and additionaly don't introduce any biases in the performance estimate. The whole topic of metrics would be too broad to get covered in this section, so please refer to a corresponding documentation of sklearn as it provides an exhaustive list of available metrics with additional materials and can be used as a good starting point.

Examples of HEP-specific metrics

Speaking of those metrics which were developed in the HEP field, the most prominent one is approximate median significance (AMS), firstly introduced in Asymptotic formulae for likelihood-based tests of new physics and then adopted in the HiggsML challenge on Kaggle.

Essentially being an estimate of the expected signal sensitivity and hence being closely related to the final result of analysis, it can also be used not only as a metric but also as a loss function to be directly optimised in the training.

Loss function

In fact, metrics and loss functions are very similar to each other: they both give an estimate of how well (or bad) model performs and both used to monitor the quality of the model. So the same comments as in the metrics section apply to loss functions too. However, loss function plays a crucial role because it is additionally used in the training as a functional to be optimised. That makes its choice a handle to explicitly steer the training process towards a more optimal and relevant solution.

Example of things going wrong

It is known that L2 loss (MSE) is sensitive to outliers in data and L1 loss (MAE) on the other hand is robust to them. Therefore, if outliers were overlooked in the training data set and the model was fitted, it may result in significant bias in its predictions. As an illustration, this toy example compares Huber vs Ridge regressors, where the latter shows a more robust behaviour.

A simple example of that was already mentioned in domains section - namely, one can emphasise specific regions in the phase space by attributing events there a larger weight in the loss function. Intuitively, for the same fraction of mispredicted events in the training data set, the class with a larger attributed weight should bring more penalty to the loss function. This way model should be able to learn to pay more attention to those "upweighted" events2.

Examples in HEP beyond classical MSE/MAE/cross entropy
  • b-jet energy regression, being a part of nonresonant HH to bb gamma gamma analysis, uses Huber and two quantile loss terms for simultaneous prediction of point and dispersion estimators of the target disstribution.
  • DeepTau, a CMS deployed model for tau identification, uses several focal loss terms to give higher weight to more misclassified cases

However, one can go further than that and consider the training procedure from a larger, statistical inference perspective. From there, one can try to construct a loss function which would directly optimise the end goal of the analysis. INFERNO is an example of such an approach, with a loss function being an expected uncertainty on the parameter of interest. Moreover, one can try also to make the model aware of nuisance parameters which affect the analysis by incorporating those into the training procedure, please see this review for a comprehensive overview of the corresponding methods.


  1. For example, that corresponds to asking oneself a question: "what is more suitable for the purpose of the analysis: F1-score, accuracy, recall or ROC AUC?" 

  2. However, these are expectations one may have in theory. In practise, optimisation procedure depends on many variables and can go in different ways. Therefore, the weighting scheme should be studied by running experiments on the case-by-case basis. 


Last update: December 5, 2023
\ No newline at end of file + Metrics & Losses - CMS Machine Learning Documentation

Metrics & Losses

Metric

A metric is a function which evaluates a model's performance given true labels and model predictions for a particular data set.

That makes it an important ingredient in the model training as a measure of the model's quality. However, metrics as estimators can be sensitive to some effects (e.g. class imbalance) and provide biased or over/underoptimistic results. Additionally, they might not be relevant to the physical problem in mind and to the understanding of what is a "good" model1. This in turn can result in suboptimally tuned hyperparameters or, in general, in a suboptimally trained model.

Therefore, it is important to choose metrics wisely, so that they reflect the physical problem to be solved and additionally don't introduce any biases in the performance estimate. The whole topic of metrics is too broad to be covered in this section, so please refer to the corresponding documentation of sklearn, as it provides an exhaustive list of available metrics with additional materials and can be used as a good starting point.

Examples of HEP-specific metrics

Speaking of those metrics which were developed in the HEP field, the most prominent one is the approximate median significance (AMS), first introduced in Asymptotic formulae for likelihood-based tests of new physics and then adopted in the HiggsML challenge on Kaggle.

Being essentially an estimate of the expected signal sensitivity, and hence closely related to the final result of the analysis, it can be used not only as a metric but also as a loss function to be directly optimised in the training.
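For reference, a sketch of the AMS in the form used by the HiggsML challenge is given below, where s and b are the (weighted) numbers of selected signal and background events and b_reg is a regularisation term (set to 10 in the challenge). The function name and argument names are chosen for this illustration.

```python
import numpy as np

def ams(s, b, b_reg=10.0):
    """Approximate median significance for signal yield s and background yield b.

    AMS = sqrt(2 * ((s + b + b_reg) * ln(1 + s / (b + b_reg)) - s))
    """
    return np.sqrt(2.0 * ((s + b + b_reg) * np.log1p(s / (b + b_reg)) - s))

# For s << b the AMS approaches the familiar s / sqrt(b) estimate.
print(ams(10.0, 1000.0, b_reg=0.0), 10.0 / np.sqrt(1000.0))
```

The regularisation term reduces the variance of the metric when the selected region contains very few background events.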

Loss function

In fact, metrics and loss functions are very similar to each other: they both give an estimate of how well (or badly) the model performs and both are used to monitor the quality of the model. So the same comments as in the metrics section apply to loss functions too. However, the loss function plays a crucial role because it is additionally used in the training as a functional to be optimised. That makes its choice a handle to explicitly steer the training process towards a more optimal and relevant solution.

Example of things going wrong

It is known that L2 loss (MSE) is sensitive to outliers in data, while L1 loss (MAE) is robust to them. Therefore, if outliers were overlooked in the training data set and the model was fitted with an L2 loss, it may result in a significant bias in its predictions. As an illustration, this toy example compares Huber vs Ridge regressors, where the former shows a more robust behaviour.

A simple example of that was already mentioned in the domains section - namely, one can emphasise specific regions in the phase space by attributing events there a larger weight in the loss function. Intuitively, for the same fraction of mispredicted events in the training data set, the class with a larger attributed weight should bring more penalty to the loss function. This way the model should be able to learn to pay more attention to those "upweighted" events2.
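The effect can be reproduced with a few lines, in the spirit of the toy example referenced above (this is a separate sketch, not a copy of it). A linear relation with slope 2 is contaminated with ten large outliers, and the slopes fitted by HuberRegressor and Ridge from sklearn are compared; the data and the outlier size are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, Ridge

rng = np.random.default_rng(seed=0)

# Toy linear data with true slope 2 and intercept 1.
x = np.linspace(0.0, 10.0, 100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=x.size)
y[-10:] += 100.0  # contaminate the last ten points with large outliers
X = x.reshape(-1, 1)

huber = HuberRegressor(max_iter=1000).fit(X, y)  # Huber loss
ridge = Ridge(alpha=1.0).fit(X, y)               # squared (L2) loss

huber_slope = huber.coef_[0]
ridge_slope = ridge.coef_[0]
print(f"true slope = 2.0, Huber = {huber_slope:.2f}, Ridge = {ridge_slope:.2f}")
```

The squared loss lets the outliers dominate the fit, whereas the Huber loss grows only linearly for large residuals and therefore stays closer to the true slope.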

Examples in HEP beyond classical MSE/MAE/cross entropy
  • b-jet energy regression, being a part of the nonresonant HH to bb gamma gamma analysis, uses Huber and two quantile loss terms for simultaneous prediction of point and dispersion estimators of the target distribution.
  • DeepTau, a CMS deployed model for tau identification, uses several focal loss terms to give higher weight to more misclassified cases.
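To illustrate how a focal loss gives a higher weight to misclassified cases, below is a minimal sketch of the standard binary focal loss, -(1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. This is a generic form for illustration and not the DeepTau implementation; the function name, gamma = 2 and the clipping constant are assumptions.

```python
import numpy as np

def binary_focal_loss(y_true, y_pred, gamma=2.0, eps=1e-7):
    """Per-event binary focal loss; gamma = 0 recovers the cross-entropy."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)          # avoid log(0)
    p_t = np.where(y_true == 1, y_pred, 1.0 - y_pred)  # probability of the true class
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

y_true = np.array([1, 1])
y_pred = np.array([0.9, 0.1])  # a well classified and a badly classified event

focal = binary_focal_loss(y_true, y_pred)
cross_entropy = binary_focal_loss(y_true, y_pred, gamma=0.0)

print("cross-entropy:", np.round(cross_entropy, 4))
print("focal loss   :", np.round(focal, 4))
```

Compared to the cross-entropy, the well classified event is strongly down-weighted, so the badly classified one dominates the total loss even more.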

However, one can go further than that and consider the training procedure from a larger, statistical inference perspective. From there, one can try to construct a loss function which would directly optimise the end goal of the analysis. INFERNO is an example of such an approach, with the loss function being the expected uncertainty on the parameter of interest. Moreover, one can also try to make the model aware of nuisance parameters which affect the analysis by incorporating those into the training procedure. Please see this review for a comprehensive overview of the corresponding methods.


  1. For example, that corresponds to asking oneself a question: "what is more suitable for the purpose of the analysis: F1-score, accuracy, recall or ROC AUC?" 

  2. However, these are expectations one may have in theory. In practice, the optimisation procedure depends on many variables and can go in different ways. Therefore, the weighting scheme should be studied by running experiments on a case-by-case basis. 


Last update: December 5, 2023
\ No newline at end of file diff --git a/general_advice/before/model.html b/general_advice/before/model.html index 29a456f..4e615e2 100644 --- a/general_advice/before/model.html +++ b/general_advice/before/model.html @@ -1 +1 @@ - Model - CMS Machine Learning Documentation

There is definitely an enormous variety of ML models available on the market, which makes the choice of a suitable one for a given problem at hand not entirely straightforward. So far being to a large extent an experimental field, the general advice here would be to try various and pick the one giving the best physical result.

However, there are in any case several common remarks to be pointed out, all glued together with a simple underlying idea:

Start off from a simple baseline, then gradually increase the complexity to improve upon it.

  1. In the first place, one need to carefully consider whether there is a need for training an ML model at all. There might be problems where this approach would be a (time-consuming) overkill and a simple conventional statistical methods would deliver results faster and even better.

  2. If ML methods are expected to bring improvement, then it makes sense to try out simple models first. Assuming a proper set of high-level features has been selected, ensemble of trees (random forest/boosted decision tree) or simple feedforward neural networks might be a good choice here. If time and resources permit, it might be beneficial to compare the results of these trainings to a no-ML approach (e.g. cut-based) to get the feeling of how much the gain in performance is. In most of the use cases, those models will be already sufficient to solve a given classification/regression problem in case of dealing with high-level variables.

  3. If it feels like there is still room for improvement, try hyperparameter tuning first to see if it is possible to squeeze more performance out of the current model and data. It can easily be that the model is sensitive to a hyperparameter choice and a have a sizeable variance in performance across hyperparameter space.

  4. If the hyperparameter space has been thoroughly explored and optimal point has been found, one can additionally try to play around with the data, for example, by augmenting the current data set with more samples. Since in general the model performance profits from having more training data, augmentation might also boost the overall performance.

  5. Lastly, more advanced architectures can be probed. At this point the choice of data representation plays a crucial role since more complex architectures are designed to adopt more sophisticated patterns in data. While in ML research is still ongoing to unify together all the complexity of such models (and promisingly, also using effective field theory approach), in HEP there's an ongoing process of probing various architectures to see which type fits the most in HEP field.

Models in HEP

One of the most prominent benchmarks so far is the one done by G. Kasieczka et. al on the top tagging data set, where in particular ParticleNet turned out to be a state of the art. This had been a yet another solid argument in favour of using graph neural networks in HEP due to its natural suitability in terms of data representation.

Illustration from G. Kasieczka et. al showing ROC curves for all evaluated algorithms.


Last update: December 5, 2023
\ No newline at end of file + Model - CMS Machine Learning Documentation

There is an enormous variety of ML models available on the market, which makes the choice of a suitable one for a given problem not entirely straightforward. Since ML is so far to a large extent an experimental field, the general advice here would be to try various models and pick the one giving the best physical result.

However, there are in any case several common remarks to be pointed out, all glued together with a simple underlying idea:

Start off from a simple baseline, then gradually increase the complexity to improve upon it.

  1. In the first place, one needs to carefully consider whether there is a need for training an ML model at all. There might be problems where this approach would be a (time-consuming) overkill and simple conventional statistical methods would deliver results faster and even better.

  2. If ML methods are expected to bring improvement, then it makes sense to try out simple models first. Assuming a proper set of high-level features has been selected, an ensemble of trees (random forest/boosted decision tree) or a simple feedforward neural network might be a good choice here. If time and resources permit, it might be beneficial to compare the results of these trainings to a no-ML approach (e.g. cut-based) to get a feeling for how large the gain in performance is. In most use cases, those models will already be sufficient to solve a given classification/regression problem when dealing with high-level variables.

  3. If it feels like there is still room for improvement, try hyperparameter tuning first to see if it is possible to squeeze more performance out of the current model and data. It can easily be that the model is sensitive to the hyperparameter choice and has a sizeable variance in performance across the hyperparameter space.

  4. If the hyperparameter space has been thoroughly explored and an optimal point has been found, one can additionally try to play around with the data, for example, by augmenting the current data set with more samples. Since in general the model performance profits from having more training data, augmentation might also boost the overall performance.

  5. Lastly, more advanced architectures can be probed. At this point the choice of data representation plays a crucial role since more complex architectures are designed to capture more sophisticated patterns in data. While research in ML is still ongoing to unify all the complexity of such models (and promisingly, also using an effective field theory approach), in HEP there is an ongoing process of probing various architectures to see which type fits the field best.

Models in HEP

One of the most prominent benchmarks so far is the one done by G. Kasieczka et al. on the top tagging data set, where in particular ParticleNet turned out to be the state of the art. This has been yet another solid argument in favour of using graph neural networks in HEP due to their natural suitability in terms of data representation.

Illustration from G. Kasieczka et al. showing ROC curves for all evaluated algorithms.


Last update: December 5, 2023
\ No newline at end of file diff --git a/general_advice/during/opt.html b/general_advice/during/opt.html index b56221b..5cd020b 100644 --- a/general_advice/during/opt.html +++ b/general_advice/during/opt.html @@ -1 +1 @@ - Optimisation problems - CMS Machine Learning Documentation
Figure 1. The loss surfaces of ResNet-56 with/without skip connections. [source: "Visualizing the Loss Landscape of Neural Nets" paper]

However, it might be that for a given task overfitting is of no concern, but there are still instabilities in loss function convergence happening during the training1. The loss landscape is a complex object having multiple local minima and which is moreover not at all understood due to the high dimensionality of the problem. That makes the gradient descent procedure of finding a minimum not that simple. However, if instabilities are observed, there are a few common things which could explain that:

  • The main candidate for a problem might be the learning rate (LR). Being an important hyperparameter which steers the optimisation, setting it too high may cause extremely stochastic behaviour, which will likely cause the optimisation to get stuck in some random minimum far from the optimum. Conversely, setting it too low may cause the convergence to take a very long time. The optimal value in between those extremes can still be problematic due to a chance of getting stuck in a local minimum on the way towards a better one. That is why several approaches to LR scheduling (e.g. cosine annealing) and also adaptive LR (e.g. Adam being the most prominent one) have been developed to have more flexibility during the training, as opposed to keeping the LR fixed from the very beginning of the training until its end.

  • Another possibility is that there are NaN/inf values or uniformities/outliers appearing in the input batches. This can cause the gradient updates to go beyond the normal scale and therefore dramatically affect the stability of the loss optimisation. This can be avoided by careful data preprocessing and batch formation.

  • Last but not least, there is a chance that gradients will explode or vanish during the training, which will reveal itself as a rapid increase/stagnation in the loss function values. This is largely a feature of deep architectures, where during backpropagation gradients are accumulated from one layer to another, and therefore any minor deviations in scale can exponentially amplify/diminish as they get multiplied. Since it is the scale of the trainable weights themselves which defines the weight gradients, a proper weight initialisation can foster smooth and consistent gradient updates. Also, batch normalisation together with weight standardisation has been shown to be a powerful technique to consistently improve performance across various domains. Finally, the choice of activation function is particularly important since it directly contributes to the gradient computation. For example, a sigmoid function is known to cause gradients to vanish due to its gradient approaching 0 at input values of large magnitude. Therefore, it is often suggested to stick to the classical ReLU or try other alternatives to see if they bring an improvement in performance.


  1. Sometimes particularly peculiar


Last update: December 5, 2023
\ No newline at end of file + Optimisation problems - CMS Machine Learning Documentation
Figure 1. The loss surfaces of ResNet-56 with/without skip connections. [source: "Visualizing the Loss Landscape of Neural Nets" paper]

However, it might be that for a given task overfitting is of no concern, but there are still instabilities in loss function convergence happening during the training1. The loss landscape is a complex object having multiple local minima and which is moreover not at all understood due to the high dimensionality of the problem. That makes the gradient descent procedure of finding a minimum not that simple. However, if instabilities are observed, there are a few common things which could explain that:

  • The main candidate for a problem might be the learning rate (LR). Being an important hyperparameter which steers the optimisation, setting it too high may cause extremely stochastic behaviour, which will likely cause the optimisation to get stuck in some random minimum far from the optimum. Conversely, setting it too low may cause the convergence to take a very long time. The optimal value in between those extremes can still be problematic due to a chance of getting stuck in a local minimum on the way towards a better one. That is why several approaches to LR scheduling (e.g. cosine annealing) and also adaptive LR (e.g. Adam being the most prominent one) have been developed to have more flexibility during the training, as opposed to keeping the LR fixed from the very beginning of the training until its end.

  • Another possibility is that there are NaN/inf values or uniformities/outliers appearing in the input batches. This can cause the gradient updates to go beyond the normal scale and therefore dramatically affect the stability of the loss optimisation. This can be avoided by careful data preprocessing and batch formation.

  • Last but not least, there is a chance that gradients will explode or vanish during the training, which will reveal itself as a rapid increase/stagnation in the loss function values. This is largely a feature of deep architectures, where during backpropagation gradients are accumulated from one layer to another, and therefore any minor deviations in scale can exponentially amplify/diminish as they get multiplied. Since it is the scale of the trainable weights themselves which defines the weight gradients, a proper weight initialisation can foster smooth and consistent gradient updates. Also, batch normalisation together with weight standardisation has been shown to be a powerful technique to consistently improve performance across various domains. Finally, the choice of activation function is particularly important since it directly contributes to the gradient computation. For example, a sigmoid function is known to cause gradients to vanish due to its gradient approaching 0 at input values of large magnitude. Therefore, it is often suggested to stick to the classical ReLU or try other alternatives to see if they bring an improvement in performance.
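
As an illustration of the LR scheduling mentioned in the first item, the following is a minimal sketch of a cosine annealing schedule written in plain Python. The function name `cosine_annealing_lr` and the values of `lr_max`, `lr_min` and `total_steps` are assumptions chosen for the example, not settings recommended by this documentation; in practice one would normally rely on the scheduler shipped with the training framework in use.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max=1e-3, lr_min=1e-5):
    """Return the learning rate at a given step of a cosine annealing schedule.

    The LR starts at lr_max and decays smoothly to lr_min over total_steps,
    following half a period of a cosine. All default values are illustrative.
    """
    # Fraction of the schedule completed, clipped to 1 after the last step
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Learning rate for each of the 101 steps (0..100) of a hypothetical training
schedule = [cosine_annealing_lr(s, total_steps=100) for s in range(101)]
```

The schedule takes large steps early on and progressively smaller ones towards the end of the training, which is one way to gain the flexibility that a fixed LR lacks.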


  1. Sometimes particularly peculiar


Last update: December 5, 2023
\ No newline at end of file diff --git a/general_advice/during/overfitting.html b/general_advice/during/overfitting.html index 5f5f8b1..de1065f 100644 --- a/general_advice/during/overfitting.html +++ b/general_advice/during/overfitting.html @@ -1 +1 @@ - Overfitting - CMS Machine Learning Documentation

Overfitting

Given that the training experiment has been set up correctly (with some of the most common problems described in the before training section), a few things can still go wrong during the training process itself. Broadly speaking, they fall into two categories: overfitting related and optimisation problem related. Both of them can be easily spotted by closely monitoring the training procedure, as will be described in the following.

Overfitting

The concept of overfitting (also called overtraining) was previously introduced in inputs section and here we will elaborate a bit more on that. In its essence, overfitting as the situation where the model fails to generalise to a given problem can have several underlying explanations:

The first one would be the case where the model complexity is way too large for a problem and a data set being considered.

Example

A simple example would be fitting some linearly distributed data with a polynomial function of a large degree. Or, in general, when the number of trainable parameters is significantly larger than the size of the training data set.

This can be solved prior to training by applying regularisation to the model, which in its essence means constraining its capacity to learn the data representation. This is somewhat related also to the concept of Ockham's razor: namely that the less complex an ML model, the more likely that a good empirical result is not just due to the peculiarities of the data sample. As for the practical side of regularisation, please have a look at this webpage for a detailed overview and implementation examples.

Furthermore, a recipe for training neural networks by A. Karpathy is a highly-recommended guideline not only on regularisation, but on training ML models in general.

The second case is a more general idea that any reasonable model at some point starts to overfit.

Example

Here one can look at overfitting as the point where the model considers noise to be of the same relevance and starts to "focus" on it way too much. Since data almost always contains noise, this makes it in principle highly probable to reach overfitting at some point.

Both of the cases outlined above can be spotted simply by tracking the evolution of loss/metrics on the validation data set. This means that, in addition to the train/test split done prior to training (as described in the inputs section), one also needs to set aside some fraction of the training data to perform validation throughout the training. By plotting the values of the loss function/metric on both the train and validation sets as the training proceeds, overfitting manifests itself as an increase in the value of the metric on the validation set while it still continues to decrease on the training set:

Figure 1. Error metric as a function of number of iterations for train and validation sets. Vertical dashed line represents the separation between the region of underfitting (model hasn't captured the data complexity well enough to solve the problem) and overfitting (model no longer generalises to unseen data). The point between these two regions is the optimal moment when the training should stop. [source: ibm.com/cloud/learn/overfitting]

Essentially, it means that from that turning point onwards the model is trying to learn better and better the noise in training data at the expense of generalisation power. Therefore, it doesn't make sense to train the model from that point on and the training should be stopped.

To automate the process of finding this "sweet spot", many ML libraries include early stopping as one of the parameters of the fit() function. If early stopping is set to, for example, 10 iterations, the training will automatically stop once the validation metric has not improved for the last 10 iterations.


Last update: December 5, 2023
\ No newline at end of file + Overfitting - CMS Machine Learning Documentation

Overfitting

Given that the training experiment has been set up correctly (with some of the most common problems described in the before training section), a few things can still go wrong during the training process itself. Broadly speaking, they fall into two categories: overfitting related and optimisation problem related. Both of them can be easily spotted by closely monitoring the training procedure, as will be described in the following.

Overfitting

The concept of overfitting (also called overtraining) was previously introduced in inputs section and here we will elaborate a bit more on that. In its essence, overfitting as the situation where the model fails to generalise to a given problem can have several underlying explanations:

The first one would be the case where the model complexity is way too large for a problem and a data set being considered.

Example

A simple example would be fitting some linearly distributed data with a polynomial function of a large degree. Or, in general, when the number of trainable parameters is significantly larger than the size of the training data set.

This can be solved prior to training by applying regularisation to the model, which in its essence means constraining its capacity to learn the data representation. This is somewhat related also to the concept of Ockham's razor: namely that the less complex an ML model, the more likely that a good empirical result is not just due to the peculiarities of the data sample. As for the practical side of regularisation, please have a look at this webpage for a detailed overview and implementation examples.

Furthermore, a recipe for training neural networks by A. Karpathy is a highly-recommended guideline not only on regularisation, but on training ML models in general.

The second case is a more general idea that any reasonable model at some point starts to overfit.

Example

Here one can look at overfitting as the point where the model considers noise to be of the same relevance and starts to "focus" on it way too much. Since data almost always contains noise, this makes it in principle highly probable to reach overfitting at some point.

Both of the cases outlined above can be spotted simply by tracking the evolution of loss/metrics on the validation data set. This means that, in addition to the train/test split done prior to training (as described in the inputs section), one also needs to set aside some fraction of the training data to perform validation throughout the training. By plotting the values of the loss function/metric on both the train and validation sets as the training proceeds, overfitting manifests itself as an increase in the value of the metric on the validation set while it still continues to decrease on the training set:

Figure 1. Error metric as a function of number of iterations for train and validation sets. Vertical dashed line represents the separation between the region of underfitting (model hasn't captured the data complexity well enough to solve the problem) and overfitting (model no longer generalises to unseen data). The point between these two regions is the optimal moment when the training should stop. [source: ibm.com/cloud/learn/overfitting]

Essentially, it means that from that turning point onwards the model is trying to learn better and better the noise in training data at the expense of generalisation power. Therefore, it doesn't make sense to train the model from that point on and the training should be stopped.

To automate the process of finding this "sweet spot", many ML libraries include early stopping as one of the parameters of the fit() function. If early stopping is set to, for example, 10 iterations, the training will automatically stop once the validation metric has not improved for the last 10 iterations.
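
The early stopping rule described above can be sketched in a few lines of plain Python. This is only an illustration of the logic under simple assumptions (a lower validation loss is better, and the loss is recorded once per iteration); the function name `early_stopping_iteration` and the example loss values are hypothetical, and real libraries expose this behaviour through their own parameters.

```python
def early_stopping_iteration(val_losses, patience=10):
    """Return (stop_iteration, best_iteration) for a sequence of validation losses.

    Training stops at the first iteration where the validation loss has not
    improved for `patience` consecutive iterations. If that never happens,
    the last iteration is returned as the stopping point.
    """
    best_loss = float("inf")
    best_iter = -1
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best value: remember it and reset the patience counter
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:
            # No improvement for `patience` iterations: stop here
            return i, best_iter
    return len(val_losses) - 1, best_iter


# Hypothetical validation losses: improving until iteration 3, then rising
val_losses = [1.0, 0.8, 0.7, 0.65, 0.66, 0.68, 0.70, 0.75]
stop_iter, best_iter = early_stopping_iteration(val_losses, patience=3)
```

Here the best validation loss is reached at iteration 3 and training stops at iteration 6, so one would typically restore the model weights saved at `best_iter` rather than keep those from the last iteration.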


Last update: December 5, 2023
\ No newline at end of file diff --git a/general_advice/during/xvalidation.html b/general_advice/during/xvalidation.html index 2c5bde1..82c7fb3 100644 --- a/general_advice/during/xvalidation.html +++ b/general_advice/during/xvalidation.html @@ -1 +1 @@ - Cross-validation - CMS Machine Learning Documentation

However, in practice what one often deals with is a hyperparameter optimisation - running of several trainings to find the optimal hyperparameter for a given family of models (e.g. BDT or feed-forward NN).

The number of trials in the hyperparameter space can easily reach hundreds or thousands, and in that case the naive approach of training the model for each hyperparameter set on the same train data set and evaluating its performance on the same test data set is very likely prone to overfitting. In that case, an experimentalist overfits to the test data set by choosing the best value of the metric and effectively adapting the model to suit the test data set best, therefore losing the model's ability to generalise.

In order to prevent that, a cross-validation (CV) technique is often used:

Figure 1. Illustration of the data set split for cross-validation. [source: scikit-learn.org/stable/modules/cross_validation.html]

The idea behind it is that instead of a single split of the data into train/validation sets, the training data set is split into N folds. Then, the model with the same fixed hyperparameter set is trained N times in a way that at the i-th iteration the i-th fold is left out of the training and used only for validation, while the other N-1 folds are used for the training.

In this fashion, after the training of N models there are in the end N values of a metric, one computed on each fold. The values can now be averaged to give a more robust estimate of model performance for a given hyperparameter set. A variance can also be computed to estimate the spread of metric values. After having completed the N-fold CV training, the same approach is to be repeated for other hyperparameter values, and the best set of those is picked based on the best fold-averaged metric value.

Further insights

Effectively, with the CV approach the whole training data set plays the role of a validation one, which makes overfitting to a single chunk of it (as in a naive train/val split) less likely to happen. Complementary to that, more training data is used to train a single model as opposed to a single and fixed train/val split, which moreover makes the model less dependent on the choice of the split.

Alternatively, one can think of this procedure as building a model ensemble, which is inherently an approach more robust to overfitting and in general performs better than a single model.


Last update: December 5, 2023
\ No newline at end of file + Cross-validation - CMS Machine Learning Documentation

However, in practice what one often deals with is a hyperparameter optimisation - running of several trainings to find the optimal hyperparameter for a given family of models (e.g. BDT or feed-forward NN).

The number of trials in the hyperparameter space can easily reach hundreds or thousands, and in that case the naive approach of training the model for each hyperparameter set on the same train data set and evaluating its performance on the same test data set is very likely prone to overfitting. In that case, an experimentalist overfits to the test data set by choosing the best value of the metric and effectively adapting the model to suit the test data set best, therefore losing the model's ability to generalise.

In order to prevent that, a cross-validation (CV) technique is often used:

Figure 1. Illustration of the data set split for cross-validation. [source: scikit-learn.org/stable/modules/cross_validation.html]

The idea behind it is that instead of a single split of the data into train/validation sets, the training data set is split into N folds. Then, the model with the same fixed hyperparameter set is trained N times in a way that at the i-th iteration the i-th fold is left out of the training and used only for validation, while the other N-1 folds are used for the training.

In this fashion, after the training of N models there are in the end N values of a metric, one computed on each fold. The values can now be averaged to give a more robust estimate of model performance for a given hyperparameter set. A variance can also be computed to estimate the spread of metric values. After having completed the N-fold CV training, the same approach is to be repeated for other hyperparameter values, and the best set of those is picked based on the best fold-averaged metric value.
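
The N-fold procedure can be sketched with the standard library alone. The example below is a toy under stated assumptions: the "model" simply predicts the mean of the training targets, the metric is the mean squared error, and the helper names `k_fold_indices` and `cross_validate` are made up for this illustration. In a real analysis one would typically shuffle the data first and use a library utility such as scikit-learn's `KFold`.

```python
import statistics


def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each of the n_folds folds."""
    indices = list(range(n_samples))
    # Assign samples to folds in a round-robin fashion (no shuffling here)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, val_idx


def cross_validate(y, n_folds):
    """Return the list of per-fold MSE values for a toy mean-predictor model."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(y), n_folds):
        # "Training": the toy model just memorises the mean of the train targets
        prediction = statistics.mean(y[j] for j in train_idx)
        # "Validation": mean squared error on the held-out fold
        mse = statistics.mean((y[j] - prediction) ** 2 for j in val_idx)
        scores.append(mse)
    return scores


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # hypothetical regression targets
scores = cross_validate(y, n_folds=3)   # one metric value per fold
mean_score = statistics.mean(scores)    # fold-averaged metric
spread = statistics.stdev(scores)       # spread of the metric across folds
```

Repeating this for each hyperparameter set and comparing `mean_score` (with `spread` as a rough uncertainty) corresponds to the selection step described above.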

Further insights

Effectively, with the CV approach the whole training data set plays the role of a validation one, which makes overfitting to a single chunk of it (as in a naive train/val split) less likely to happen. Complementary to that, more training data is used to train a single model as opposed to a single and fixed train/val split, which moreover makes the model less dependent on the choice of the split.

Alternatively, one can think of this procedure as building a model ensemble, which is inherently an approach more robust to overfitting and in general performs better than a single model.


Last update: December 5, 2023
\ No newline at end of file diff --git a/general_advice/intro.html b/general_advice/intro.html index a489928..b24f271 100644 --- a/general_advice/intro.html +++ b/general_advice/intro.html @@ -1,4 +1,4 @@ - Introduction - CMS Machine Learning Documentation

Introduction

In general, ML models don't really work out of the box. For example, most often it is not sufficient to simply instantiate the model class, call its fit() method followed by predict(), and then proceed straight to the inference step of the analysis.

from sklearn.datasets import make_circles
+ Introduction - CMS Machine Learning Documentation       

Introduction

In general, ML models don't really work out of the box. For example, most often it is not sufficient to simply instantiate the model class, call its fit() method followed by predict(), and then proceed straight to the inference step of the analysis.

from sklearn.datasets import make_circles
 from sklearn.model_selection import train_test_split
 from sklearn.svm import SVC
 
diff --git a/images/BDTscores_EXO19020.png b/images/BDTscores_EXO19020.png
new file mode 100644
index 0000000..bd10ed5
Binary files /dev/null and b/images/BDTscores_EXO19020.png differ
diff --git a/images/DisCoPresentation_MLForum.png b/images/DisCoPresentation_MLForum.png
new file mode 100644
index 0000000..77c35ed
Binary files /dev/null and b/images/DisCoPresentation_MLForum.png differ
diff --git a/images/ML_Forum_talk_May8_2019.png b/images/ML_Forum_talk_May8_2019.png
new file mode 100644
index 0000000..1a24294
Binary files /dev/null and b/images/ML_Forum_talk_May8_2019.png differ
diff --git a/images/doublediscoNN.png b/images/doublediscoNN.png
new file mode 100644
index 0000000..aca7296
Binary files /dev/null and b/images/doublediscoNN.png differ
diff --git a/images/hig21002_bdtscores.png b/images/hig21002_bdtscores.png
new file mode 100644
index 0000000..4501360
Binary files /dev/null and b/images/hig21002_bdtscores.png differ
diff --git a/index.html b/index.html
index a224a59..b2aa791 100644
--- a/index.html
+++ b/index.html
@@ -1 +1 @@
- CMS Machine Learning Documentation      

Welcome to the documentation hub for the CMS Machine Learning Group! The goal of this page is to provide CMS analyzers a centralized place to gather machine learning information relevant to their work. However, we are not seeking to rewrite external documentation. Whenever applicable, we will link to external documentation, such as the iML groups HEP Living Review or their ML Resources repository. What you will find here are pages covering:

  • ML best practices
  • How to optimize a NN
  • Common pitfalls for CMS analyzers
  • Direct and indirect inferencing using a variety of ML packages
  • How to get a model integrated into CMSSW

And much more!

If you think we are missing some important information, please contact the ML Knowledge Subgroup!


Last update: December 5, 2023
\ No newline at end of file + CMS Machine Learning Documentation

Welcome to the documentation hub for the CMS Machine Learning Group! The goal of this page is to provide CMS analyzers a centralized place to gather machine learning information relevant to their work. However, we are not seeking to rewrite external documentation. Whenever applicable, we will link to external documentation, such as the iML groups HEP Living Review or their ML Resources repository. What you will find here are pages covering:

  • ML best practices
  • How to optimize a NN
  • Common pitfalls for CMS analyzers
  • Direct and indirect inferencing using a variety of ML packages
  • How to get a model integrated into CMSSW

And much more!

If you think we are missing some important information, please contact the ML Knowledge Subgroup!


Last update: December 5, 2023
\ No newline at end of file diff --git a/inference/checklist.html b/inference/checklist.html index f1d7831..4f2d9ac 100644 --- a/inference/checklist.html +++ b/inference/checklist.html @@ -1 +1 @@ - Integration checklist - CMS Machine Learning Documentation

Integration checklist

Todo.


Last update: December 5, 2023
\ No newline at end of file + Integration checklist - CMS Machine Learning Documentation

Integration checklist

Todo.


Last update: December 5, 2023
\ No newline at end of file diff --git a/inference/conifer.html b/inference/conifer.html index 79e7383..9455816 100644 --- a/inference/conifer.html +++ b/inference/conifer.html @@ -1,4 +1,4 @@ - conifer - CMS Machine Learning Documentation

Direct inference with conifer

drawing

Introduction

conifer is a Python package developed by the Fast Machine Learning Lab for the deployment of Boosted Decision Trees in FPGAs for Level 1 Trigger applications. Documentation, examples, and tutorials are available from the conifer website, GitHub, and the hls4ml tutorial respectively. conifer is on the Python Package Index and can be installed like pip install conifer. Targeting FPGAs requires Xilinx's Vivado/Vitis suite of software. Here's a brief summary of features:

  • conversion from common BDT training frameworks: scikit-learn, XGBoost, Tensorflow Decision Forests (TF DF), TMVA, and ONNX
  • conversion to FPGA firmware with backends: HLS (C++ for FPGA), VHDL, C++ (for CPU)
  • utilities for bit- and cycle-accurate firmware simulation, and interface to FPGA synthesis tools for evaluation and deployment from Python

Emulation in CMSSW

All L1T algorithms require bit-exact emulation for performance studies and validation of the hardware system. For conifer this is provided with a single header file at L1Trigger/Phase2L1ParticleFlow/interface/conifer.h. The user must also provide the BDT JSON file exported from the conifer Python tool for their model. JSON loading in CMSSW uses the nlohmann/json external.

Both the conifer FPGA firmware and C++ emulation use Xilinx's arbitrary precision types for fixed-point arithmetic (hls external of CMSSW). This is cheaper and faster in the FPGA fabric than floating-point types. An important part of the model preparation process is choosing the proper fixed-point data types to avoid loss of performance compared to the trained model. Input preprocessing, in particular scaling, can help constrain the input variables to a smaller numerical range, but may also have a hardware cost to implement. In C++ the arbitrary precision types are specified like: ap_fixed<width, integer, rounding mode, saturation mode>.

Minimal preparation from Python:

import conifer
+ conifer - CMS Machine Learning Documentation       

Direct inference with conifer

drawing

Introduction

conifer is a Python package developed by the Fast Machine Learning Lab for the deployment of Boosted Decision Trees in FPGAs for Level 1 Trigger applications. Documentation, examples, and tutorials are available from the conifer website, GitHub, and the hls4ml tutorial respectively. conifer is on the Python Package Index and can be installed like pip install conifer. Targeting FPGAs requires Xilinx's Vivado/Vitis suite of software. Here's a brief summary of features:

  • conversion from common BDT training frameworks: scikit-learn, XGBoost, Tensorflow Decision Forests (TF DF), TMVA, and ONNX
  • conversion to FPGA firmware with backends: HLS (C++ for FPGA), VHDL, C++ (for CPU)
  • utilities for bit- and cycle-accurate firmware simulation, and interface to FPGA synthesis tools for evaluation and deployment from Python

Emulation in CMSSW

All L1T algorithms require bit-exact emulation for performance studies and validation of the hardware system. For conifer this is provided with a single header file at L1Trigger/Phase2L1ParticleFlow/interface/conifer.h. The user must also provide the BDT JSON file exported from the conifer Python tool for their model. JSON loading in CMSSW uses the nlohmann/json external.

Both the conifer FPGA firmware and C++ emulation use Xilinx's arbitrary precision types for fixed-point arithmetic (hls external of CMSSW). This is cheaper and faster in the FPGA fabric than floating-point types. An important part of the model preparation process is choosing the proper fixed-point data types to avoid loss of performance compared to the trained model. Input preprocessing, in particular scaling, can help constrain the input variables to a smaller numerical range, but may also have a hardware cost to implement. In C++ the arbitrary precision types are specified like: ap_fixed<width, integer, rounding mode, saturation mode>.
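
To get a feeling for how the choice of `width` and `integer` bits affects the values a model sees, the effect of a signed fixed-point type can be imitated in plain Python. The sketch below is a simplified illustration, not the bit-exact Xilinx implementation: it assumes truncation towards minus infinity as the rounding mode and offers either saturation or wrap-around on overflow, and the helper name `quantize_fixed` is made up for this example.

```python
import math


def quantize_fixed(x, width, integer, saturate=True):
    """Approximate the value stored by a signed fixed-point type.

    width   : total number of bits
    integer : number of integer bits, including the sign bit
    The remaining (width - integer) bits hold the fractional part.
    """
    frac_bits = width - integer
    scale = 2 ** frac_bits
    # Truncation: drop everything below the least significant fractional bit
    raw = math.floor(x * scale)
    lo = -(2 ** (width - 1))      # most negative representable integer code
    hi = 2 ** (width - 1) - 1     # most positive representable integer code
    if saturate:
        # Clamp out-of-range values to the nearest representable one
        raw = max(lo, min(hi, raw))
    else:
        # Wrap around as in two's complement arithmetic
        raw = (raw - lo) % (2 ** width) + lo
    return raw / scale


in_range = quantize_fixed(1.3, width=8, integer=3)                    # 1.28125
clipped = quantize_fixed(10.0, width=8, integer=3)                    # 3.96875
wrapped = quantize_fixed(10.0, width=8, integer=3, saturate=False)    # 2.0
```

With 8 bits in total and 3 integer bits, the representable range is [-4, 3.96875] in steps of 1/32, so an input of 10 is either clipped or wraps to an unrelated value. This is why scaling the inputs into a small numerical range, as mentioned above, can matter for the performance of the quantised model.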

Minimal preparation from Python:

import conifer
 model = conifer. ... # convert or load a conifer model
 # e.g. model = conifer.converters.convert_from_xgboost(xgboost_model)
 model.save('my_bdt.json')
diff --git a/inference/hls4ml.html b/inference/hls4ml.html
index 7764c9f..9cae5a1 100644
--- a/inference/hls4ml.html
+++ b/inference/hls4ml.html
@@ -1 +1 @@
- hls4ml - CMS Machine Learning Documentation       

Direct inference with hls4ml

drawing

hls4ml is a Python package developed by the Fast Machine Learning Lab. Its primary purpose is to create firmware implementations of machine learning (ML) models to be run on FPGAs. The package interfaces with a high-level synthesis (HLS) backend (i.e. Xilinx Vivado HLS) to transpile the ML model into hardware description language (HDL). The primary hls4ml documentation, including API reference pages, is located here.

drawing

The main hls4ml tutorial code is kept on GitHub. Users are welcome to walk through the notebooks at their own pace. There is also a set of slides linked to the README.

That said, there have been several cases where the hls4ml developers have given live demonstrations and tutorials. Below is a non-exhaustive list of tutorials given in the last few years (newest on top).

Workshop/Conference Date Links
23rd Virtual IEEE Real Time Conference August 03, 2022 Indico
2022 CMS ML Town Hall July 22, 2022 Contribution Link
a3d3 hls4ml @ Snowmass CSS 2022: Tutorial July 21, 2022 Slides, Recording, JupyterHub
Fast Machine Learning for Science Workshop December 3, 2020 Indico, Slides, GitHub, Interactive Notebooks
hls4ml @ UZH ML Workshop November 17, 2020 Indico, Slides
ICCAD 2020 November 5, 2020 https://events-siteplex.confcats.io/iccad2022/wp-content/uploads/sites/72/2021/12/2020_ICCAD_ConferenceProgram.pdf, GitHub
4th IML Workshop October 19, 2020 Indico, Slides, Instructions, Notebooks, Recording
22nd Virtual IEEE Real Time Conference October 15, 2020 Indico, Slides, Notebooks
30th International Conference on Field-Programmable Logic and Applications September 4, 2020 Program
hls4ml tutorial @ CERN June 3, 2020 Indico, Slides, Notebooks
Fast Machine Learning September 12, 2019 Indico
1st Real Time Analysis Workshop, Université Paris-Saclay July 16, 2019 Indico, Slides, Autoencoder Tutorial

Last update: December 5, 2023
\ No newline at end of file + hls4ml - CMS Machine Learning Documentation

Direct inference with hls4ml

drawing

hls4ml is a Python package developed by the Fast Machine Learning Lab. Its primary purpose is to create firmware implementations of machine learning (ML) models to be run on FPGAs. The package interfaces with a high-level synthesis (HLS) backend (e.g. Xilinx Vivado HLS) to transpile the ML model into hardware description language (HDL). The primary hls4ml documentation, including API reference pages, is located here.

drawing

The main hls4ml tutorial code is kept on GitHub. Users are welcome to walk through the notebooks at their own pace. There is also a set of slides linked to the README.

That said, there have been several cases where the hls4ml developers have given live demonstrations and tutorials. Below is a non-exhaustive list of tutorials given in the last few years (newest on top).

Workshop/Conference Date Links
23rd Virtual IEEE Real Time Conference August 03, 2022 Indico
2022 CMS ML Town Hall July 22, 2022 Contribution Link
a3d3 hls4ml @ Snowmass CSS 2022: Tutorial July 21, 2022 Slides, Recording, JupyterHub
Fast Machine Learning for Science Workshop December 3, 2020 Indico, Slides, GitHub, Interactive Notebooks
hls4ml @ UZH ML Workshop November 17, 2020 Indico, Slides
ICCAD 2020 November 5, 2020 https://events-siteplex.confcats.io/iccad2022/wp-content/uploads/sites/72/2021/12/2020_ICCAD_ConferenceProgram.pdf, GitHub
4th IML Workshop October 19, 2020 Indico, Slides, Instructions, Notebooks, Recording
22nd Virtual IEEE Real Time Conference October 15, 2020 Indico, Slides, Notebooks
30th International Conference on Field-Programmable Logic and Applications September 4, 2020 Program
hls4ml tutorial @ CERN June 3, 2020 Indico, Slides, Notebooks
Fast Machine Learning September 12, 2019 Indico
1st Real Time Analysis Workshop, Université Paris-Saclay July 16, 2019 Indico, Slides, Autoencoder Tutorial

Last update: December 5, 2023
\ No newline at end of file diff --git a/inference/onnx.html b/inference/onnx.html index 7d46bf4..989c601 100644 --- a/inference/onnx.html +++ b/inference/onnx.html @@ -1,4 +1,4 @@ - ONNX - CMS Machine Learning Documentation

Direct inference with ONNX Runtime

ONNX is an open format built to represent machine learning models. It is designed to improve interoperability across a variety of frameworks and platforms in the AI tools community—most deep learning frameworks (e.g. XGBoost, TensorFlow, PyTorch which are frequently used in CMS) support converting their model into the ONNX format or loading a model from an ONNX format.

The figure showing the ONNX interoperability. (Source from website.)

ONNX Runtime is a tool that aims to accelerate machine learning inference across a variety of deployment platforms. It allows users to "run any ONNX model using a single set of inference APIs that provide access to the best hardware acceleration available". It includes "built-in optimization features that trim and consolidate nodes without impacting model accuracy."

The CMSSW interface to ONNX Runtime has been available since CMSSW_11_1_X (cmssw#28112, cmsdist#5020). Its functionality was improved in CMSSW_11_2_X. The final implementation was also backported to CMSSW_10_6_X to facilitate Run 2 UL data reprocessing. A number of deep learning tagger models (e.g. DeepJet, DeepTauID, ParticleNet, DeepDoubleX, etc.) have been run with ONNX Runtime in routine UL processing, with a substantial speedup.

On this page, we will use a simple example to show how to use ONNX Runtime for deep learning model inference in the CMSSW framework, both in C++ (e.g. to process the MiniAOD file) and in Python (e.g. using NanoAOD-tools to process the NanoAODs). This may help readers who will deploy an ONNX model into their analyses or in the CMSSW framework.

Software Setup

We use CMSSW_11_2_5_patch2 to show the simple example for ONNX Runtime inference. The example also works in the newer CMSSW_12_X releases (note that inference with C++ can also run in CMSSW_10_6_X).

+ ONNX - CMS Machine Learning Documentation       

Direct inference with ONNX Runtime

ONNX is an open format built to represent machine learning models. It is designed to improve interoperability across a variety of frameworks and platforms in the AI tools community—most deep learning frameworks (e.g. XGBoost, TensorFlow, PyTorch which are frequently used in CMS) support converting their model into the ONNX format or loading a model from an ONNX format.

The figure showing the ONNX interoperability. (Source from website.)

ONNX Runtime is a tool that aims to accelerate machine learning inference across a variety of deployment platforms. It allows users to "run any ONNX model using a single set of inference APIs that provide access to the best hardware acceleration available". It includes "built-in optimization features that trim and consolidate nodes without impacting model accuracy."

The CMSSW interface to ONNX Runtime has been available since CMSSW_11_1_X (cmssw#28112, cmsdist#5020). Its functionality was improved in CMSSW_11_2_X. The final implementation was also backported to CMSSW_10_6_X to facilitate Run 2 UL data reprocessing. A number of deep learning tagger models (e.g. DeepJet, DeepTauID, ParticleNet, DeepDoubleX, etc.) have been run with ONNX Runtime in routine UL processing, with a substantial speedup.

On this page, we will use a simple example to show how to use ONNX Runtime for deep learning model inference in the CMSSW framework, both in C++ (e.g. to process the MiniAOD file) and in Python (e.g. using NanoAOD-tools to process the NanoAODs). This may help readers who will deploy an ONNX model into their analyses or in the CMSSW framework.

Software Setup

We use CMSSW_11_2_5_patch2 to show the simple example for ONNX Runtime inference. The example also works in the newer CMSSW_12_X releases (note that inference with C++ can also run in CMSSW_10_6_X).

diff --git a/inference/particlenet.html b/inference/particlenet.html
index 09038eb..19818ae 100644
--- a/inference/particlenet.html
+++ b/inference/particlenet.html
@@ -1,4 +1,4 @@
- ParticleNet - CMS Machine Learning Documentation       

ParticleNet

ParticleNet [arXiv:1902.08570] is an advanced neural network architecture that has many applications in CMS, including heavy flavour jet tagging, jet mass regression, etc. The network is fed by various low-level point-like objects as input, e.g., the particle-flow candidates, to predict a feature of a jet.

The full architecture of the ParticleNet model. We'll walk through the details in the following sections.

On this page, we introduce several user-specific aspects of the ParticleNet model. We cover the following items in three sections:

  1. An introduction to ParticleNet, including

    • a general description of ParticleNet
    • the advantages brought from the architecture by concept
    • a sketch of ParticleNet applications in CMS and other relevant works
  2. An introduction to Weaver and model implementations, introduced in a step-by-step manner:

    • build three network models and understand them from the technical side; use the out-of-the-box commands to run these examples on a benchmark task. The three networks are (1) a simple feed-forward NN, (2) a DeepAK8 model (based on 1D CNN), and eventually (3) the ParticleNet model (based on DGCNN).
    • try to reproduce the original performance and make the ROC plots.

    This section is friendly to the ML newcomers. The goal is to help readers understand the underlying structure of the "ParticleNet".

  3. Tuning the ParticleNet model, including

    • tips for readers who are using/modifying the ParticleNet model to achieve a better performance

    This section can be helpful in practice. It provides tips on model training, tuning, validation, etc. It targets situations in which readers apply their own ParticleNet (or ParticleNet-like) model to a custom task.

cms-ml/documentation

ParticleNet

ParticleNet [arXiv:1902.08570] is an advanced neural network architecture that has many applications in CMS, including heavy flavour jet tagging, jet mass regression, etc. The network is fed by various low-level point-like objects as input, e.g., the particle-flow candidates, to predict a feature of a jet.

The full architecture of the ParticleNet model. We'll walk through the details in the following sections.

On this page, we introduce several user-specific aspects of the ParticleNet model. We cover the following items in three sections:

  1. An introduction to ParticleNet, including

    • a general description of ParticleNet
    • the advantages brought from the architecture by concept
    • a sketch of ParticleNet applications in CMS and other relevant works
  2. An introduction to Weaver and model implementations, introduced in a step-by-step manner:

    • build three network models and understand them from the technical side; use the out-of-the-box commands to run these examples on a benchmark task. The three networks are (1) a simple feed-forward NN, (2) a DeepAK8 model (based on 1D CNN), and eventually (3) the ParticleNet model (based on DGCNN).
    • try to reproduce the original performance and make the ROC plots.

    This section is friendly to the ML newcomers. The goal is to help readers understand the underlying structure of the "ParticleNet".

  3. Tuning the ParticleNet model, including

    • tips for readers who are using/modifying the ParticleNet model to achieve a better performance

    This section can be helpful in practice. It provides tips on model training, tuning, validation, etc. It targets situations in which readers apply their own ParticleNet (or ParticleNet-like) model to a custom task.


Corresponding persons:

  • Huilin Qu, Loukas Gouskos (original developers of ParticleNet)
  • Congqiao Li (author of the page)

Introduction to ParticleNet

1. General description

ParticleNet is a graph neural net (GNN) model. The key ingredient of ParticleNet is the graph convolutional operation, i.e., the edge convolution (EdgeConv) and the dynamic graph CNN (DGCNN) method [arXiv:1801.07829] applied on the "point cloud" data structure.

We will disassemble the ParticleNet model and provide a detailed exploration in the next section, but here we briefly explain the key features of the model.

Intuitively, ParticleNet treats all candidates inside an object as a "point cloud", which is a permutation-invariant set of points (e.g. a set of PF candidates), each carrying a feature vector (η, φ, pT, charge, etc.). The DGCNN uses the EdgeConv operation to exploit their spatial correlations (two-dimensional on the η-φ plane) by finding the k-nearest neighbours of each point and generating a new latent graph layer where points are scattered in a high-dimensional latent space. This is a graph-type analogue of the classical 2D convolution operation, which acts on a regular 2D grid (e.g., a picture) using a 3×3 local patch to explore the relations of a single pixel with its 8 nearest pixels, then generates a new 2D grid.

The cartoon illustrates the convolutional operation acting on the regular grid and on the point cloud (plot from ML4Jets 2018 talk).

As a consequence, the EdgeConv operation transforms the graph to a new graph, which has a changed spatial relationship among points. It then acts on the second graph to produce the third graph, showing the stackability of the convolution operation. This illustrates the "dynamic" property as the graph topology changes after each EdgeConv layer.
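The neighbour search and aggregation described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the Weaver or ParticleNet implementation: the names `knn_indices`, `edge_conv` and `toy_edge_fn` are hypothetical, the learned MLP of a real EdgeConv layer is replaced by a fixed toy function, and mean aggregation over the neighbours is assumed.

```python
import math


def knn_indices(coords, k):
    """For each point, return the indices of its k nearest neighbours (itself excluded)."""
    neighbours = []
    for i, p in enumerate(coords):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(coords) if j != i)
        neighbours.append([j for _, j in dists[:k]])
    return neighbours


def edge_conv(features, coords, k, edge_fn):
    """One EdgeConv-like step: average edge_fn(x_i, x_j - x_i) over the k nearest neighbours."""
    nbrs = knn_indices(coords, k)
    out = []
    for i, x_i in enumerate(features):
        msgs = [edge_fn(x_i, [b - a for a, b in zip(x_i, features[j])]) for j in nbrs[i]]
        out.append([sum(col) / len(msgs) for col in zip(*msgs)])
    return out


def toy_edge_fn(x_i, diff):
    """Stand-in for the learned MLP: simply reconstructs the neighbour's features."""
    return [a + d for a, d in zip(x_i, diff)]


# Four "particles": (eta, phi) coordinates and a one-dimensional feature each
coords = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0)]
features = [[1.0], [2.0], [10.0], [12.0]]

layer1 = edge_conv(features, coords, k=1, edge_fn=toy_edge_fn)
print(layer1)

# "Dynamic" graph: the next layer searches for neighbours in the new feature space
layer2 = edge_conv(layer1, layer1, k=1, edge_fn=toy_edge_fn)
print(layer2)
```

The second call passes the output features as coordinates, which is the sense in which the graph topology is recomputed after each layer.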

2. Advantage

Conceptually, the advantage of the network may come from exploiting the permutation symmetry of the points, which is intrinsic to our physics objects. This symmetry holds naturally in a point cloud representation.

In recent studies of jet physics and event-based analyses using ML techniques, there is increasing interest in exploring the point cloud data structure. We explain here conceptually why a "point cloud" representation outperforms the classical ones, including the variable-length 2D vector structure passed to a 1D CNN or any type of RNN, and the image-based representation passed through a 2D CNN. With a 1D CNN, the points (PF candidates) are most often ordered by pT to place them on the 1D grid, so the convolution operation only learns correlations between neighbouring points with similar pT. The Long Short-Term Memory (LSTM) type recurrent neural network (RNN) provides the flexibility to feed in a variable-length sequence and has a "memory" mechanism to carry the information it learns from an early node to the latest node. The concern is that such an ordering of the sequence is somewhat artificial, and not an underlying property that an NN must learn to accomplish the classification task. By comparison, in natural language processing, where LSTM has a huge advantage, the order of words is an important characteristic of the language itself (it reflects the "grammar" in some circumstances) and is a feature the NN must learn to master the language. The image-based data explored by a 2D CNN stems from the image recognition task. A jet image is usually standardized before being fed into the network. Such an image lacks the local features that a 2D local patch is good at capturing, e.g. the ear of a cat that a local patch can capture by scanning over the entire image; instead, the jet image appears to hold its features globally (e.g. the two-prong structure for W-tagging). The sparsity of the data is another concern, in that representing a jet on a regular grid introduces redundant information, making it hard for the network to capture the key properties.

Here we briefly summarize the applications and ongoing works on ParticleNet. Public CMS results include

  • large-R jet with R=0.8 tagging (for W/Z/H/t) using ParticleNet [CMS-DP-2020/002]
  • regression on the large-R jet mass based on the ParticleNet model [CMS-DP-2021/017]

The ParticleNet architecture is also applied to small-radius R=0.4 jets for b/c-tagging and quark/gluon classification (see this talk (CMS internal)). Ongoing work applies the ParticleNet architecture to heavy flavour tagging at the HLT (see this talk (CMS internal)). The ParticleNet model has recently been updated to ParticleNeXt, which shows further improvement (see the ML4Jets 2021 talk).

Recent work in the joint field of HEP and ML also sheds light on exploiting the point cloud data structure and GNN-based architectures, with very active progress in recent years. Here we list some useful materials for the reader's reference.

  • Some pheno-based work is summarized in the HEP × ML living review, especially in the "graph" and "sets" categories.
  • An overview of GNN applications to CMS, see CMS ML forum (CMS internal). Also see more recent GNN application progress in ML forums: Oct 20, Nov 3.
  • At the time of writing, various novel GNN-based models are explored and introduced in the recent ML4Jets2021 meeting.

Introduction to Weaver and model implementations

Weaver is a machine learning R&D framework for high energy physics (HEP) applications. It trains the neural net with PyTorch and is capable of exporting the model to the ONNX format for fast inference. A detailed guide is presented on the Weaver README page.

Now we walk through three solid examples to get you familiar with Weaver. We use the benchmark of the top tagging task [arXiv:1707.08966] in the following example. Some useful information can be found in the "top tagging" section in the IML public datasets webpage (the gDoc).

Our goal is to do some warm-up with Weaver, and more importantly, to explore from a technical side the neural net architectures: a simple multi-layer perceptron (MLP) model, a more complicated "DeepAK8 tagger" model based on 1D CNN with ResNet, and the "ParticleNet model," which is based on DGCNN. We will dig deeper into their implementations in Weaver and try to illustrate as many details as possible. Finally, we compare their performance and see if we can reproduce the benchmark record with the model. Please clone the repo weaver-benchmark and we'll get started. The Weaver repo will be cloned as a submodule.

git clone --recursive https://github.com/colizz/weaver-benchmark.git
 
 # Create a soft link inside weaver so that it can find data/model cards
diff --git a/inference/performance.html b/inference/performance.html
index 0ec2551..ad74892 100644
--- a/inference/performance.html
+++ b/inference/performance.html
@@ -1 +1 @@
- Performance - CMS Machine Learning Documentation       

Performance of inference tools


Last update: December 5, 2023
\ No newline at end of file + Performance - CMS Machine Learning Documentation

Performance of inference tools


Last update: December 5, 2023
\ No newline at end of file diff --git a/inference/pyg.html b/inference/pyg.html index d77fa47..a0e9d9d 100644 --- a/inference/pyg.html +++ b/inference/pyg.html @@ -1,4 +1,4 @@ - PyTorch Geometric - CMS Machine Learning Documentation

PyTorch Geometric

Geometric deep learning (GDL) is an emerging field focused on applying machine learning (ML) techniques to non-Euclidean domains such as graphs, point clouds, and manifolds. The PyTorch Geometric (PyG) library extends PyTorch to include GDL functionality, for example classes necessary to handle data with irregular structure. PyG is introduced at a high level in Fast Graph Representation Learning with PyTorch Geometric and in detail in the PyG docs.

GDL with PyG

A complete review of GDL is available in the following recently published (and freely available) textbook: Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. The authors specify several key GDL architectures including convolutional neural networks (CNNs) operating on grids, Deep Sets architectures operating on sets, and graph neural networks (GNNs) operating on graphs, collections of nodes connected by edges. PyG is focused in particular on graph-structured data, which naturally encompasses set-structured data. In fact, many state-of-the-art GNN architectures are implemented in PyG (see the docs)! A review of the landscape of GNN architectures is available in Graph Neural Networks: A Review of Methods and Applications.

The Data Class: PyG Graphs

Graphs are data structures designed to encode data structured as a set of objects and relations. Objects are embedded as graph nodes \(u\in\mathcal{V}\), where \(\mathcal{V}\) is the node set. Relations are represented by edges \((i,j)\in\mathcal{E}\) between nodes, where \(\mathcal{E}\) is the edge set. Denote the sizes of the node and edge sets as \(|\mathcal{V}|=n_\mathrm{nodes}\) and \(|\mathcal{E}|=n_\mathrm{edges}\) respectively. The choice of edge connectivity determines the local structure of a graph, which has important downstream effects on graph-based learning algorithms. Graph construction is the process of embedding input data onto a graph structure. Graph-based learning algorithms are correspondingly imbued with a relational inductive bias based on the choice of graph representation; a graph's edge connectivity defines its local structure. The simplest graph construction routine is to construct no edges, yielding a permutation invariant set of objects. On the other hand, fully-connected graphs connect every node-node pair with an edge, yielding \(n_\mathrm{edges}=n_\mathrm{nodes}(n_\mathrm{nodes}-1)/2\) edges. This representation may be feasible for small inputs like particle clouds corresponding to a jet, but is intractable for large-scale applications such as high-pileup tracking datasets. Notably, dynamic graph construction techniques operate on input point clouds, constructing edges on them dynamically during inference. For example, EdgeConv and GravNet GNN layers dynamically construct edges between nodes projected into a latent space; multiple such layers may be applied in sequence, yielding many intermediate graph representations on an input point cloud.

In general, nodes can have positions \(\{p_i\}_{i=1}^{n_\mathrm{nodes}}\), \(p_i\in\mathbb{R}^{n_\mathrm{space\_dim}}\), and features (attributes) \(\{x_i\}_{i=1}^{n_\mathrm{nodes}}\), \(x_i\in\mathbb{R}^{n_\mathrm{node\_dim}}\). In some applications like GNN-based particle tracking, node positions are taken to be the features. In others, e.g. jet identification, positional information may be used to seed dynamic graph construction while kinematic features are propagated as edge features. Edges, too, can have features \(\{e_{ij}\}_{(i,j)\in\mathcal{E}}\), \(e_{ij}\in\mathbb{R}^{n_\mathrm{edge\_dim}}\), but do not have positions; instead, edges are defined by the nodes they connect, and may therefore be represented by, for example, the distance between the respective node-node pair. In PyG, graphs are stored as instances of the data class, whose fields fully specify the graph:

  • data.x: node feature matrix, \(X\in\mathbb{R}^{n_\mathrm{nodes}\times n_\mathrm{node\_dim}}\)
  • data.edge_index: node indices at each end of each edge, \(I\in\mathbb{R}^{2\times n_\mathrm{edges}}\)
  • data.edge_attr: edge feature matrix, \(E\in\mathbb{R}^{n_\mathrm{edges}\times n_\mathrm{edge\_dim}}\)
  • data.y: training target with arbitrary shape (\(y\in\mathbb{R}^{n_\mathrm{nodes}\times n_\mathrm{out}}\) for node-level targets, \(y\in\mathbb{R}^{n_\mathrm{edges}\times n_\mathrm{out}}\) for edge-level targets or \(y\in\mathbb{R}^{1\times n_\mathrm{out}}\) for graph-level targets).
  • data.pos: Node position matrix, \(P\in\mathbb{R}^{n_\mathrm{nodes}\times n_\mathrm{space\_dim}}\)

The PyG Introduction By Example tutorial covers the basics of graph creation, batching, transformation, and inference using this data class.

As an example, consider the ZINC chemical compounds dataset, which is available as a built-in dataset in PyG:

from torch_geometric.datasets import ZINC
+ PyTorch Geometric - CMS Machine Learning Documentation       

PyTorch Geometric

Geometric deep learning (GDL) is an emerging field focused on applying machine learning (ML) techniques to non-Euclidean domains such as graphs, point clouds, and manifolds. The PyTorch Geometric (PyG) library extends PyTorch to include GDL functionality, for example classes necessary to handle data with irregular structure. PyG is introduced at a high level in Fast Graph Representation Learning with PyTorch Geometric and in detail in the PyG docs.

GDL with PyG

A complete review of GDL is available in the following recently published (and freely available) textbook: Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. The authors specify several key GDL architectures including convolutional neural networks (CNNs) operating on grids, Deep Sets architectures operating on sets, and graph neural networks (GNNs) operating on graphs, collections of nodes connected by edges. PyG is focused in particular on graph-structured data, which naturally encompasses set-structured data. In fact, many state-of-the-art GNN architectures are implemented in PyG (see the docs)! A review of the landscape of GNN architectures is available in Graph Neural Networks: A Review of Methods and Applications.

The Data Class: PyG Graphs

Graphs are data structures designed to encode data structured as a set of objects and relations. Objects are embedded as graph nodes \(u\in\mathcal{V}\), where \(\mathcal{V}\) is the node set. Relations are represented by edges \((i,j)\in\mathcal{E}\) between nodes, where \(\mathcal{E}\) is the edge set. Denote the sizes of the node and edge sets as \(|\mathcal{V}|=n_\mathrm{nodes}\) and \(|\mathcal{E}|=n_\mathrm{edges}\) respectively. The choice of edge connectivity determines the local structure of a graph, which has important downstream effects on graph-based learning algorithms. Graph construction is the process of embedding input data onto a graph structure. Graph-based learning algorithms are correspondingly imbued with a relational inductive bias based on the choice of graph representation; a graph's edge connectivity defines its local structure. The simplest graph construction routine is to construct no edges, yielding a permutation invariant set of objects. On the other hand, fully-connected graphs connect every node-node pair with an edge, yielding \(n_\mathrm{edges}=n_\mathrm{nodes}(n_\mathrm{nodes}-1)/2\) edges. This representation may be feasible for small inputs like particle clouds corresponding to a jet, but is intractable for large-scale applications such as high-pileup tracking datasets. Notably, dynamic graph construction techniques operate on input point clouds, constructing edges on them dynamically during inference. For example, EdgeConv and GravNet GNN layers dynamically construct edges between nodes projected into a latent space; multiple such layers may be applied in sequence, yielding many intermediate graph representations on an input point cloud.

In general, nodes can have positions \(\{p_i\}_{i=1}^{n_\mathrm{nodes}}\), \(p_i\in\mathbb{R}^{n_\mathrm{space\_dim}}\), and features (attributes) \(\{x_i\}_{i=1}^{n_\mathrm{nodes}}\), \(x_i\in\mathbb{R}^{n_\mathrm{node\_dim}}\). In some applications like GNN-based particle tracking, node positions are taken to be the features. In others, e.g. jet identification, positional information may be used to seed dynamic graph construction while kinematic features are propagated as edge features. Edges, too, can have features \(\{e_{ij}\}_{(i,j)\in\mathcal{E}}\), \(e_{ij}\in\mathbb{R}^{n_\mathrm{edge\_dim}}\), but do not have positions; instead, edges are defined by the nodes they connect, and may therefore be represented by, for example, the distance between the respective node-node pair. In PyG, graphs are stored as instances of the data class, whose fields fully specify the graph:
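To make the edge count of a fully-connected graph concrete, here is a small standard-library sketch. The helper names `fully_connected_edges` and `n_edges_fully_connected` are hypothetical and not part of PyG, and the node counts are illustrative sizes only; edges are treated as unordered pairs, matching the formula above.

```python
from itertools import combinations


def fully_connected_edges(n_nodes):
    """Return every unordered node pair (i, j) with i < j."""
    return list(combinations(range(n_nodes), 2))


def n_edges_fully_connected(n_nodes):
    """Closed-form count n(n-1)/2 for a fully-connected graph."""
    return n_nodes * (n_nodes - 1) // 2


edges = fully_connected_edges(4)
print(edges)                                     # explicit edge list for 4 nodes
print(len(edges) == n_edges_fully_connected(4))  # enumeration matches the formula

# Quadratic growth: a jet-sized input versus a much larger point cloud
for n in (50, 10_000):
    print(n, n_edges_fully_connected(n))
```

The quadratic growth is what makes the fully-connected construction manageable for a few tens of nodes but impractical for inputs with many thousands of nodes.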

  • data.x: node feature matrix, \(X\in\mathbb{R}^{n_\mathrm{nodes}\times n_\mathrm{node\_dim}}\)
  • data.edge_index: node indices at each end of each edge, \(I\in\mathbb{R}^{2\times n_\mathrm{edges}}\)
  • data.edge_attr: edge feature matrix, \(E\in\mathbb{R}^{n_\mathrm{edges}\times n_\mathrm{edge\_dim}}\)
  • data.y: training target with arbitrary shape (\(y\in\mathbb{R}^{n_\mathrm{nodes}\times n_\mathrm{out}}\) for node-level targets, \(y\in\mathbb{R}^{n_\mathrm{edges}\times n_\mathrm{out}}\) for edge-level targets or \(y\in\mathbb{R}^{1\times n_\mathrm{out}}\) for graph-level targets).
  • data.pos: Node position matrix, \(P\in\mathbb{R}^{n_\mathrm{nodes}\times n_\mathrm{space\_dim}}\)

The PyG Introduction By Example tutorial covers the basics of graph creation, batching, transformation, and inference using this data class.

As an example, consider the ZINC chemical compounds dataset, which is available as a built-in dataset in PyG:

from torch_geometric.datasets import ZINC
 train_dataset = ZINC(root='/tmp/ZINC', subset=True, split='train')
 test_dataset =  ZINC(root='/tmp/ZINC', subset=True, split='test')
 len(train_dataset)
diff --git a/inference/pytorch.html b/inference/pytorch.html
index 1180c9b..a983961 100644
--- a/inference/pytorch.html
+++ b/inference/pytorch.html
@@ -1,4 +1,4 @@
- PyTorch - CMS Machine Learning Documentation       

PyTorch Inference

PyTorch is an open source ML library developed by Facebook's AI Research lab. Initially released in late-2016, PyTorch is a relatively new tool, but has become increasingly popular among ML researchers (in fact, some analyses suggest it's becoming more popular than TensorFlow in academic communities!). PyTorch is written in idiomatic Python, so its syntax is easy to parse for experienced Python programmers. Additionally, it is highly compatible with graphics processing units (GPUs), which can substantially accelerate many deep learning workflows. To date PyTorch has not been integrated into CMSSW. Trained PyTorch models may be evaluated in CMSSW via ONNX Runtime, but model construction and training workflows must currently exist outside of CMSSW. Given the considerable interest in PyTorch within the HEP/ML community, we have reason to believe it will soon be available, so stay tuned!

Introductory References

The Basics

The following documentation surrounds a set of code snippets designed to highlight some important ML features made available in PyTorch. In the following sections, we'll break down snippets from this script, highlighting specifically the PyTorch objects in it.

Tensors

The fundamental PyTorch object is the tensor. At a glance, tensors behave similarly to NumPy arrays. For example, they are broadcast, concatenated, and sliced in exactly the same way. The following examples highlight some common NumPy-like tensor transformations:

a = torch.randn(size=(2,2))
+ PyTorch - CMS Machine Learning Documentation       

PyTorch Inference

PyTorch is an open source ML library developed by Facebook's AI Research lab. Initially released in late-2016, PyTorch is a relatively new tool, but has become increasingly popular among ML researchers (in fact, some analyses suggest it's becoming more popular than TensorFlow in academic communities!). PyTorch is written in idiomatic Python, so its syntax is easy to parse for experienced Python programmers. Additionally, it is highly compatible with graphics processing units (GPUs), which can substantially accelerate many deep learning workflows. To date PyTorch has not been integrated into CMSSW. Trained PyTorch models may be evaluated in CMSSW via ONNX Runtime, but model construction and training workflows must currently exist outside of CMSSW. Given the considerable interest in PyTorch within the HEP/ML community, we have reason to believe it will soon be available, so stay tuned!

Introductory References

The Basics

The following documentation surrounds a set of code snippets designed to highlight some important ML features made available in PyTorch. In the following sections, we'll break down snippets from this script, highlighting specifically the PyTorch objects in it.

Tensors

The fundamental PyTorch object is the tensor. At a glance, tensors behave similarly to NumPy arrays. For example, they are broadcast, concatenated, and sliced in exactly the same way. The following examples highlight some common NumPy-like tensor transformations:

a = torch.randn(size=(2,2))
 >>> tensor([[ 1.3552, -0.0204],
             [ 1.2677, -0.8926]])
 a.view(-1, 1)
diff --git a/inference/sonic_triton.html b/inference/sonic_triton.html
index dfba50a..b456fed 100644
--- a/inference/sonic_triton.html
+++ b/inference/sonic_triton.html
@@ -1 +1 @@
- Sonic/Triton - CMS Machine Learning Documentation       

Service-based inference with Triton/Sonic

This page is still under construction. For the moment, please see the Sonic+Triton tutorial given as part of the Machine Learning HATS@LPC 2021.


Last update: December 5, 2023
\ No newline at end of file + Sonic/Triton - CMS Machine Learning Documentation

Service-based inference with Triton/Sonic

This page is still under construction. For the moment, please see the Sonic+Triton tutorial given as part of the Machine Learning HATS@LPC 2021.


Last update: December 5, 2023
\ No newline at end of file
diff --git a/inference/standalone.html b/inference/standalone.html
index ff0c58c..66473da 100644
--- a/inference/standalone.html
+++ b/inference/standalone.html
@@ -1 +1 @@
- Standalone framework - CMS Machine Learning Documentation

Todo.

Idea: Working w/ TF+ROOT standalone (outside of CMSSW)


Last update: December 5, 2023
\ No newline at end of file
+ Standalone framework - CMS Machine Learning Documentation

Todo.

Idea: Working w/ TF+ROOT standalone (outside of CMSSW)


Last update: December 5, 2023
\ No newline at end of file
diff --git a/inference/swan_aws.html b/inference/swan_aws.html
index 66dcc4a..a381f82 100644
--- a/inference/swan_aws.html
+++ b/inference/swan_aws.html
@@ -1 +1 @@
- SWAN + AWS - CMS Machine Learning Documentation

Todo.

Ideas: best practices, cost model, instance pricing, need to log out, monitoring mandatory


Last update: December 5, 2023
\ No newline at end of file
+ SWAN + AWS - CMS Machine Learning Documentation

Todo.

Ideas: best practices, cost model, instance pricing, need to log out, monitoring mandatory


Last update: December 5, 2023
\ No newline at end of file
diff --git a/inference/tensorflow1.html b/inference/tensorflow1.html
index aaeabf4..06a5a80 100644
--- a/inference/tensorflow1.html
+++ b/inference/tensorflow1.html
@@ -1 +1 @@
- TensorFlow 1 - CMS Machine Learning Documentation

Direct inference with TensorFlow 1

While it is technically still possible to use TensorFlow 1, this version of TensorFlow is quite old and is no longer supported by CMSSW. We highly recommend that you update your model to TensorFlow 2 and follow the integration guide in the Inference/Direct inference/TensorFlow 2 documentation.


Last update: December 5, 2023
\ No newline at end of file
+ TensorFlow 1 - CMS Machine Learning Documentation

Direct inference with TensorFlow 1

While it is technically still possible to use TensorFlow 1, this version of TensorFlow is quite old and is no longer supported by CMSSW. We highly recommend that you update your model to TensorFlow 2 and follow the integration guide in the Inference/Direct inference/TensorFlow 2 documentation.


Last update: December 5, 2023
\ No newline at end of file
diff --git a/inference/tensorflow2.html b/inference/tensorflow2.html
index 15a60b2..6621c48 100644
--- a/inference/tensorflow2.html
+++ b/inference/tensorflow2.html
@@ -1,4 +1,4 @@
- TensorFlow 2 - CMS Machine Learning Documentation

Direct inference with TensorFlow 2


TensorFlow 2 has been available since CMSSW_11_1_X (cmssw#28711, cmsdist#5525). The integration into the software stack can be found in cmsdist/tensorflow.spec and the interface is located in cmssw/PhysicsTools/TensorFlow.

Available versions

TensorFlow el8_amd64_gcc10 el8_amd64_gcc11
v2.6.0 ≥ CMSSW_12_3_4 -
v2.6.4 ≥ CMSSW_12_5_0 ≥ CMSSW_12_5_0
TensorFlow slc7_amd64_gcc900 slc7_amd64_gcc10 slc7_amd64_gcc11
v2.1.0 ≥ CMSSW_11_1_0 - -
v2.3.1 ≥ CMSSW_11_2_0 - -
v2.4.1 ≥ CMSSW_11_3_0 - -
v2.5.0 ≥ CMSSW_12_0_0 ≥ CMSSW_12_0_0 -
v2.6.0 ≥ CMSSW_12_1_0 ≥ CMSSW_12_1_0 ≥ CMSSW_12_3_0
v2.6.4 - ≥ CMSSW_12_5_0 ≥ CMSSW_13_0_0
TensorFlow slc7_amd64_gcc900
v2.1.0 ≥ CMSSW_11_1_0
v2.3.1 ≥ CMSSW_11_2_0

At this time, only CPU support is provided. While GPU support is generally possible, it is currently disabled due to interference with production workflows, and will be enabled once these issues are resolved.

Software setup

To run the examples shown below, create a minimal inference setup with the following snippet. Adapt the SCRAM_ARCH according to your operating system and desired compiler.

+ TensorFlow 2 - CMS Machine Learning Documentation       

Direct inference with TensorFlow 2


TensorFlow 2 has been available since CMSSW_11_1_X (cmssw#28711, cmsdist#5525). The integration into the software stack can be found in cmsdist/tensorflow.spec and the interface is located in cmssw/PhysicsTools/TensorFlow.

Available versions

TensorFlow el8_amd64_gcc10 el8_amd64_gcc11
v2.6.0 ≥ CMSSW_12_3_4 -
v2.6.4 ≥ CMSSW_12_5_0 ≥ CMSSW_12_5_0
TensorFlow slc7_amd64_gcc900 slc7_amd64_gcc10 slc7_amd64_gcc11
v2.1.0 ≥ CMSSW_11_1_0 - -
v2.3.1 ≥ CMSSW_11_2_0 - -
v2.4.1 ≥ CMSSW_11_3_0 - -
v2.5.0 ≥ CMSSW_12_0_0 ≥ CMSSW_12_0_0 -
v2.6.0 ≥ CMSSW_12_1_0 ≥ CMSSW_12_1_0 ≥ CMSSW_12_3_0
v2.6.4 - ≥ CMSSW_12_5_0 ≥ CMSSW_13_0_0
TensorFlow slc7_amd64_gcc900
v2.1.0 ≥ CMSSW_11_1_0
v2.3.1 ≥ CMSSW_11_2_0

At this time, only CPU support is provided. While GPU support is generally possible, it is currently disabled due to interference with production workflows, and will be enabled once these issues are resolved.

Software setup

To run the examples shown below, create a minimal inference setup with the following snippet. Adapt the SCRAM_ARCH according to your operating system and desired compiler.

diff --git a/inference/tfaas.html b/inference/tfaas.html
index 374019a..890c5c9 100644
--- a/inference/tfaas.html
+++ b/inference/tfaas.html
@@ -1,4 +1,4 @@
- TFaaS - CMS Machine Learning Documentation       

TFaaS

TensorFlow as a Service

TensorFlow as a Service (TFaaS) was developed as a general-purpose service that can be deployed on any infrastructure, from a personal laptop or VM to cloud infrastructure, including Kubernetes/Docker-based ones. The main repository contains all details about the service, including installation, an end-to-end example, and a demo.

For CERN users, TFaaS is already deployed at the following URL: https://cms-tfaas.cern.ch

It can be used by CMS members with any HTTP-based client. For example, here is a basic request using the curl client:

curl -k https://cms-tfaas.cern.ch/models
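
Since any HTTP-based client can be used, the same request can also be sketched in Python with only the standard library. This is a minimal sketch, not an official TFaaS client: the function names `build_models_request` and `list_models` are hypothetical, and it assumes that the `/models` endpoint returns a JSON document.

```python
import json
import ssl
import urllib.request

TFAAS_URL = "https://cms-tfaas.cern.ch"  # service URL quoted in the text above


def build_models_request(base_url: str = TFAAS_URL) -> urllib.request.Request:
    """Build a GET request for the /models endpoint used in the curl example.

    Hypothetical helper introduced for illustration only.
    """
    return urllib.request.Request(f"{base_url}/models", method="GET")


def list_models(base_url: str = TFAAS_URL, verify_tls: bool = True):
    """Send the request and parse the response, assuming it is JSON."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Mirrors `curl -k`: certificate verification is skipped.
        # check_hostname must be disabled before verify_mode is relaxed.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    request = build_models_request(base_url)
    with urllib.request.urlopen(request, context=context, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `list_models(verify_tls=False)` corresponds to the `curl -k` invocation above; keeping certificate verification enabled is preferable whenever the local certificate setup allows it.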
+ TFaaS - CMS Machine Learning Documentation       

TFaaS

TensorFlow as a Service

TensorFlow as a Service (TFaaS) was developed as a general-purpose service that can be deployed on any infrastructure, from a personal laptop or VM to cloud infrastructure, including Kubernetes/Docker-based ones. The main repository contains all details about the service, including installation, an end-to-end example, and a demo.

For CERN users, TFaaS is already deployed at the following URL: https://cms-tfaas.cern.ch

It can be used by CMS members with any HTTP-based client. For example, here is a basic request using the curl client:

curl -k https://cms-tfaas.cern.ch/models
 [
   {
     "name": "luca",
diff --git a/inference/xgboost.html b/inference/xgboost.html
index 2844ede..89caac9 100644
--- a/inference/xgboost.html
+++ b/inference/xgboost.html
@@ -1,4 +1,4 @@
- XGBoost - CMS Machine Learning Documentation       

Direct inference with XGBoost

General

XGBoost has been available since at least CMSSW_9_2_4 (cmssw#19377).

In the CMSSW environment, XGBoost can be used via its Python API.

For the UL era, different versions are available for different SCRAM_ARCH values:

  1. For slc7_amd64_gcc700 and above, ver.0.80 is available.

  2. For slc7_amd64_gcc900 and above, ver.1.3.3 is available.

  3. Please note that different major versions behave differently (see the Caveat section).

Existing Examples

There are some good existing examples of using XGBoost under CMSSW, as listed below:

  1. Official sample for testing the integration of the XGBoost library with CMSSW.

  2. Useful code created by Dr. Huilin Qu for inference with an existing trained model.

  3. C/C++ interface for inference with an existing trained model.

We will provide examples for both the C/C++ interface and the Python interface of XGBoost in the CMSSW environment.

Example: Classification of points from joint-Gaussian distribution.

In this specific example, you will use XGBoost to classify data points generated from two 8-dimensional joint-Gaussian distributions.

Feature Index 0 1 2 3 4 5 6 7
μ1 1 2 3 4 5 6 7 8
μ2 0 1.9 3.2 4.5 4.8 6.1 8.1 11
σ1 = σ2 = σ 1 1 1 1 1 1 1 1
|μ1 - μ2| / σ 1 0.1 0.2 0.5 0.2 0.1 1.1 3

The generated data points for training (10000 per class) and testing (1000 per class) are stored as Train_data.csv and Test_data.csv.
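
A data set of this kind can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the script used to produce the CSV files: the features are assumed to be uncorrelated, the 0/1 label encoding and the random seed are arbitrary choices, and the helper name `sample` is hypothetical.

```python
import numpy as np

# Means and common width taken from the table above.
mu1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
mu2 = np.array([0, 1.9, 3.2, 4.5, 4.8, 6.1, 8.1, 11], dtype=float)
sigma = 1.0

# Per-feature separation |mu1 - mu2| / sigma (last row of the table).
separation = np.abs(mu1 - mu2) / sigma

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only


def sample(n_per_class):
    """Draw n_per_class points from each Gaussian (assumed uncorrelated features)."""
    x1 = rng.normal(loc=mu1, scale=sigma, size=(n_per_class, mu1.size))
    x2 = rng.normal(loc=mu2, scale=sigma, size=(n_per_class, mu2.size))
    features = np.vstack([x1, x2])
    # Assumed label encoding: 0 for the first distribution, 1 for the second.
    labels = np.concatenate([np.zeros(n_per_class, dtype=int),
                             np.ones(n_per_class, dtype=int)])
    return features, labels


X_train, y_train = sample(10000)  # 10000 points per class
X_test, y_test = sample(1000)     # 1000 points per class
```

The `separation` array shows that feature 7 carries most of the discriminating power, while features 1 and 5 differ by only a tenth of a standard deviation.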

Preparing Model

The training process of an XGBoost model can be done outside of CMSSW. We provide a Python script for illustration.

# Import the necessary modules
+ XGBoost - CMS Machine Learning Documentation       

Direct inference with XGBoost

General

XGBoost has been available since at least CMSSW_9_2_4 (cmssw#19377).

In the CMSSW environment, XGBoost can be used via its Python API.

For the UL era, different versions are available for different SCRAM_ARCH values:

  1. For slc7_amd64_gcc700 and above, ver.0.80 is available.

  2. For slc7_amd64_gcc900 and above, ver.1.3.3 is available.

  3. Please note that different major versions behave differently (see the Caveat section).

Existing Examples

There are some good existing examples of using XGBoost under CMSSW, as listed below:

  1. Official sample for testing the integration of the XGBoost library with CMSSW.

  2. Useful code created by Dr. Huilin Qu for inference with an existing trained model.

  3. C/C++ interface for inference with an existing trained model.

We will provide examples for both the C/C++ interface and the Python interface of XGBoost in the CMSSW environment.

Example: Classification of points from joint-Gaussian distribution.

In this specific example, you will use XGBoost to classify data points generated from two 8-dimensional joint-Gaussian distributions.

Feature Index 0 1 2 3 4 5 6 7
μ1 1 2 3 4 5 6 7 8
μ2 0 1.9 3.2 4.5 4.8 6.1 8.1 11
σ1 = σ2 = σ 1 1 1 1 1 1 1 1
|μ1 - μ2| / σ 1 0.1 0.2 0.5 0.2 0.1 1.1 3

The generated data points for training (10000 per class) and testing (1000 per class) are stored as Train_data.csv and Test_data.csv.

Preparing Model

The training process of an XGBoost model can be done outside of CMSSW. We provide a Python script for illustration.

# Import the necessary modules
 import numpy as np
 import pandas as pd
 from xgboost import XGBClassifier # Or XGBRegressor for regression tasks
diff --git a/innovation/hackathons.html b/innovation/hackathons.html
index 18b03eb..70da7f2 100644
--- a/innovation/hackathons.html
+++ b/innovation/hackathons.html
@@ -1 +1 @@
- ML Hackathons - CMS Machine Learning Documentation       

CMS Machine Learning Hackathons

Welcome to the CMS ML Hackathons! Here we encourage the exploration of cutting-edge ML methods for particle physics problems through multi-day focused work. Form hackathon teams and work together with the ML Innovation group to get support with organization and announcements, hardware/software infrastructure, follow-up meetings and ML-related technical advice.

If you are interested in proposing a hackathon, please send an e-mail to the CMS ML Innovation conveners with a potential topic and we will get in touch!

Below follows a list of previous successful hackathons.

HGCAL TICL reconstruction

20 Jun 2022 - 24 Jun 2022
https://indico.cern.ch/e/ticlhack

Abstract: The HGCAL reconstruction relies on “The Iterative CLustering” (TICL) framework. It follows an iterative approach, first clustering energy deposits in the same layer (layer clusters) and then connecting these layer clusters to reconstruct the particle shower by forming 3-D objects, the “tracksters”. There are multiple areas that could benefit from advanced ML techniques to further improve the reconstruction performance.

In this project we plan to tackle the following topics using ML:

  • trackster identification (i.e., identification of the type of particle initiating the shower) and energy regression
  • linking of tracksters stemming from the same particle to reconstruct the full shower, and/or using a high-purity trackster as a seed and collecting 2D (i.e., layer clusters) and/or 3D (i.e., tracksters) energy deposits in the vicinity of the seed trackster to fully reconstruct the particle shower
  • tuning of the existing pattern recognition algorithms
  • reconstruction under HL-LHC pile-up scenarios (e.g., PU=150-200)
  • trackster characterization, i.e., predicting whether a trackster is a sound object in itself or is more likely to be a composite one

Material:

A CodiMD document has been created with an overview of the topics and to keep track of the activities during the hackathon:

https://codimd.web.cern.ch/s/hMd74Yi7J

Jet tagging

8 Nov 2021 - 11 Nov 2021
https://indico.cern.ch/e/jethack

Abstract: The identification of the initial particle (quark, gluon, W/Z boson, etc.) responsible for the formation of the jet, also known as jet tagging, provides a powerful handle in both standard model (SM) measurements and searches for physics beyond the SM (BSM). In this project we propose the development of jet tagging algorithms both for small-radius (i.e., AK4) and large-radius (i.e., AK8) jets using the PF candidates as inputs.

Two main projects are covered:

  • Jet tagging for scouting
  • Jet tagging for Level-1

Jet tagging for scouting

Using as inputs the PF candidates and local pixel tracks reconstructed in the scouting streams, the main goals of this project are the following:

  • Develop a jet-tagging baseline for scouting and compare the performance with the offline reconstruction
  • Understand the importance of the different input variables and the impact of various configurations (e.g., on pixel track reconstruction) on the performance
  • Compare different jet tagging approaches, keeping in mind performance as well as inference time
  • Proof of concept: ggF H->bb, ggF HH->4b, VBF HH->4b

Jet tagging for Level-1

Using as input the newly developed particle flow candidates of Seeded Cone jets in the Level1 Correlator trigger, the following tasks will be worked on:

  • Develop a quark, gluon, b, pileup jet classifier for Seeded Cone R=0.4 jets using a combination of tt, VBF(H) and Drell-Yan Level1 samples
  • Develop tools to demonstrate the gain of such a jet tagging algorithm on a signal sample (like q vs g on VBF jets)
  • Study tagging performance as a function of the number of jet constituents
  • Study tagging performance for a "real" input vector (zero-padded, perhaps unsorted)
  • Optimise jet constituent list of SeededCone Jets (N constituents, zero-removal, sorting, etc.)
  • Develop q/g/W/Z/t/H classifier for Seeded Cone R=0.8 jets

GNN-4-tracking

27 Sept 2021 - 1 Oct 2021

https://indico.cern.ch/e/gnn4tracks

Abstract: The aim of this hackathon is to integrate graph neural nets (GNNs) for particle tracking into CMSSW.

The hackathon will make use of a GNN model reported in the paper Charged particle tracking via edge-classifying interaction networks by Gage DeZoort, Savannah Thais, et al. They used a GNN to predict connections between detector pixel hits, and achieved accurate track building. They did this with the TrackML dataset, which uses a generic detector designed to be similar to CMS or ATLAS. Work is ongoing to apply this GNN approach to CMS data.

Tasks: The hackathon aims to create a workflow that allows graph building and GNN inference within the framework of CMSSW. This would enable accurate testing of future GNN models and comparison to existing CMSSW track building methods. The hackathon will be divided into the following subtasks:

  • Task 1: Create a package for extracting graph features and building graphs in CMSSW
  • Task 2: GNN inference on Sonic servers
  • Task 3: Track fitting after GNN track building
  • Task 4: Performance evaluation for the new track collection

Material:

Code is provided at this GitHub organisation. Projects are listed here.

Anomaly detection

In this four-day Machine Learning Hackathon, we will develop new anomaly detection algorithms for New Physics detection, intended for deployment in the two main stages of the CMS data acquisition system: the Level-1 trigger and the High Level Trigger.

There are two main projects:

Event-based anomaly detection algorithms for the Level-1 Trigger

Jet-based anomaly detection algorithms for the High Level Trigger, specifically targeting Run 3 scouting

Material:

A list of projects can be found in this document. Instructions for fetching the data and example code for the two projects can be found at Level-1 Anomaly Detection.


Last update: December 5, 2023
\ No newline at end of file
+ ML Hackathons - CMS Machine Learning Documentation

CMS Machine Learning Hackathons

Welcome to the CMS ML Hackathons! Here we encourage the exploration of cutting-edge ML methods for particle physics problems through multi-day focused work. Form hackathon teams and work together with the ML Innovation group to get support with organization and announcements, hardware/software infrastructure, follow-up meetings and ML-related technical advice.

If you are interested in proposing a hackathon, please send an e-mail to the CMS ML Innovation conveners with a potential topic and we will get in touch!

Below follows a list of previous successful hackathons.

HGCAL TICL reconstruction

20 Jun 2022 - 24 Jun 2022
https://indico.cern.ch/e/ticlhack

Abstract: The HGCAL reconstruction relies on “The Iterative CLustering” (TICL) framework. It follows an iterative approach, first clustering energy deposits in the same layer (layer clusters) and then connecting these layer clusters to reconstruct the particle shower by forming 3-D objects, the “tracksters”. There are multiple areas that could benefit from advanced ML techniques to further improve the reconstruction performance.

In this project we plan to tackle the following topics using ML:

  • trackster identification (i.e., identification of the type of particle initiating the shower) and energy regression
  • linking of tracksters stemming from the same particle to reconstruct the full shower, and/or using a high-purity trackster as a seed and collecting 2D (i.e., layer clusters) and/or 3D (i.e., tracksters) energy deposits in the vicinity of the seed trackster to fully reconstruct the particle shower
  • tuning of the existing pattern recognition algorithms
  • reconstruction under HL-LHC pile-up scenarios (e.g., PU=150-200)
  • trackster characterization, i.e., predicting whether a trackster is a sound object in itself or is more likely to be a composite one

Material:

A CodiMD document has been created with an overview of the topics and to keep track of the activities during the hackathon:

https://codimd.web.cern.ch/s/hMd74Yi7J

Jet tagging

8 Nov 2021 - 11 Nov 2021
https://indico.cern.ch/e/jethack

Abstract: The identification of the initial particle (quark, gluon, W/Z boson, etc.) responsible for the formation of the jet, also known as jet tagging, provides a powerful handle in both standard model (SM) measurements and searches for physics beyond the SM (BSM). In this project we propose the development of jet tagging algorithms both for small-radius (i.e., AK4) and large-radius (i.e., AK8) jets using the PF candidates as inputs.

Two main projects are covered:

  • Jet tagging for scouting
  • Jet tagging for Level-1

Jet tagging for scouting

Using as inputs the PF candidates and local pixel tracks reconstructed in the scouting streams, the main goals of this project are the following:

  • Develop a jet-tagging baseline for scouting and compare the performance with the offline reconstruction
  • Understand the importance of the different input variables and the impact of various configurations (e.g., on pixel track reconstruction) on the performance
  • Compare different jet tagging approaches, keeping in mind performance as well as inference time
  • Proof of concept: ggF H->bb, ggF HH->4b, VBF HH->4b

Jet tagging for Level-1

Using as input the newly developed particle flow candidates of Seeded Cone jets in the Level1 Correlator trigger, the following tasks will be worked on:

  • Develop a quark, gluon, b, pileup jet classifier for Seeded Cone R=0.4 jets using a combination of tt, VBF(H) and Drell-Yan Level1 samples
  • Develop tools to demonstrate the gain of such a jet tagging algorithm on a signal sample (like q vs g on VBF jets)
  • Study tagging performance as a function of the number of jet constituents
  • Study tagging performance for a "real" input vector (zero-padded, perhaps unsorted)
  • Optimise jet constituent list of SeededCone Jets (N constituents, zero-removal, sorting, etc.)
  • Develop q/g/W/Z/t/H classifier for Seeded Cone R=0.8 jets

GNN-4-tracking

27 Sept 2021 - 1 Oct 2021

https://indico.cern.ch/e/gnn4tracks

Abstract: The aim of this hackathon is to integrate graph neural nets (GNNs) for particle tracking into CMSSW.

The hackathon will make use of a GNN model reported in the paper Charged particle tracking via edge-classifying interaction networks by Gage DeZoort, Savannah Thais, et al. They used a GNN to predict connections between detector pixel hits, and achieved accurate track building. They did this with the TrackML dataset, which uses a generic detector designed to be similar to CMS or ATLAS. Work is ongoing to apply this GNN approach to CMS data.

Tasks: The hackathon aims to create a workflow that allows graph building and GNN inference within the framework of CMSSW. This would enable accurate testing of future GNN models and comparison to existing CMSSW track building methods. The hackathon will be divided into the following subtasks:

  • Task 1: Create a package for extracting graph features and building graphs in CMSSW
  • Task 2: GNN inference on Sonic servers
  • Task 3: Track fitting after GNN track building
  • Task 4: Performance evaluation for the new track collection

Material:

Code is provided at this GitHub organisation. Projects are listed here.

Anomaly detection

In this four-day Machine Learning Hackathon, we will develop new anomaly detection algorithms for New Physics detection, intended for deployment in the two main stages of the CMS data acquisition system: the Level-1 trigger and the High Level Trigger.

There are two main projects:

Event-based anomaly detection algorithms for the Level-1 Trigger

Jet-based anomaly detection algorithms for the High Level Trigger, specifically targeting Run 3 scouting

Material:

A list of projects can be found in this document. Instructions for fetching the data and example code for the two projects can be found at Level-1 Anomaly Detection.


Last update: December 5, 2023
\ No newline at end of file
diff --git a/innovation/journal_club.html b/innovation/journal_club.html
index 436eac8..a0a5489 100644
--- a/innovation/journal_club.html
+++ b/innovation/journal_club.html
@@ -1 +1 @@
- ML Journal Club - CMS Machine Learning Documentation

CMS Machine Learning Journal Club

Welcome to the CMS Machine Learning Journal Club (JC)! Here we read and discuss new cutting-edge ML papers, with an emphasis on how these can be used within the collaboration. Below you can find a summary of each JC as well as some code examples demonstrating how to use the tools or methods introduced.

To vote for or to propose new papers for discussion, go to https://cms-ml-journalclub.web.cern.ch/.

Below follows a complete list of all the previous CMS ML Journal Clubs, together with relevant documentation and code examples.

Dealing with Nuisance Parameters using Machine Learning in High Energy Physics: a Review

Tommaso Dorigo, Pablo de Castro

Abstract: In this work we discuss the impact of nuisance parameters on the effectiveness of machine learning in high-energy physics problems, and provide a review of techniques that allow to include their effect and reduce their impact in the search for optimal selection criteria and variable transformations. The introduction of nuisance parameters complicates the supervised learning task and its correspondence with the data analysis goal, due to their contribution degrading the model performances in real data, and the necessary addition of uncertainties in the resulting statistical inference. The approaches discussed include nuisance-parameterized models, modified or adversary losses, semi-supervised learning approaches, and inference-aware techniques.

Mapping Machine-Learned Physics into a Human-Readable Space

Taylor Faucett, Jesse Thaler, Daniel Whiteson

Abstract: We present a technique for translating a black-box machine-learned classifier operating on a high-dimensional input space into a small set of human-interpretable observables that can be combined to make the same classification decisions. We iteratively select these observables from a large space of high-level discriminants by finding those with the highest decision similarity relative to the black box, quantified via a metric we introduce that evaluates the relative ordering of pairs of inputs. Successive iterations focus only on the subset of input pairs that are misordered by the current set of observables. This method enables simplification of the machine-learning strategy, interpretation of the results in terms of well-understood physical concepts, validation of the physical model, and the potential for new insights into the nature of the problem itself. As a demonstration, we apply our approach to the benchmark task of jet classification in collider physics, where a convolutional neural network acting on calorimeter jet images outperforms a set of six well-known jet substructure observables. Our method maps the convolutional neural network into a set of observables called energy flow polynomials, and it closes the performance gap by identifying a class of observables with an interesting physical interpretation that has been previously overlooked in the jet substructure literature. - Indico - Paper

Model Interpretability (2 papers):

Identifying the relevant dependencies of the neural network response on characteristics of the input space

Stefan Wunsch, Raphael Friese, Roger Wolf, Günter Quast

Abstract: The relation between the input and output spaces of neural networks (NNs) is investigated to identify those characteristics of the input space that have a large influence on the output for a given task. For this purpose, the NN function is decomposed into a Taylor expansion in each element of the input space. The Taylor coefficients contain information about the sensitivity of the NN response to the inputs. A metric is introduced that allows for the identification of the characteristics that mostly determine the performance of the NN in solving a given task. Finally, the capability of this metric to analyze the performance of the NN is evaluated based on a task common to data analyses in high-energy particle physics experiments.

iNNvestigate neural networks!

Maximilian Alber, Sebastian Lapuschkin, Philipp Seegerer, Miriam Hägele, Kristof T. Schütt, Grégoire Montavon, Wojciech Samek, Klaus-Robert Müller, Sven Dähne, Pieter-Jan Kindermans

In recent years, deep neural networks have revolutionized many application domains of machine learning and are key components of many critical decision or predictive processes. Therefore, it is crucial that domain specialists can understand and analyze actions and predictions, even of the most complex neural network architectures. Despite these arguments neural networks are often treated as black boxes. In the attempt to alleviate this shortcoming many analysis methods were proposed, yet the lack of reference implementations often makes a systematic comparison between the methods a major effort. The presented library iNNvestigate addresses this by providing a common interface and out-of-the-box implementation for many analysis methods, including the reference implementation for PatternNet and PatternAttribution as well as for LRP-methods. To demonstrate the versatility of iNNvestigate, we provide an analysis of image classifications for variety of state-of-the-art neural network architectures.

Simulation-based inference in particle physics and beyond (and beyond)

Johann Brehmer, Kyle Cranmer

Abstract: Our predictions for particle physics processes are realized in a chain of complex simulators. They allow us to generate high-fidelity simulated data, but they are not well-suited for inference on the theory parameters with observed data. We explain why the likelihood function of high-dimensional LHC data cannot be explicitly evaluated, why this matters for data analysis, and reframe what the field has traditionally done to circumvent this problem. We then review new simulation-based inference methods that let us directly analyze high-dimensional data by combining machine learning techniques and information from the simulator. Initial studies indicate that these techniques have the potential to substantially improve the precision of LHC measurements. Finally, we discuss probabilistic programming, an emerging paradigm that lets us extend inference to the latent process of the simulator.

Efficiency Parameterization with Neural Networks

C. Badiali, F.A. Di Bello, G. Frattari, E. Gross, V. Ippolito, M. Kado, J. Shlomi

Abstract: Multidimensional efficiency maps are commonly used in high energy physics experiments to mitigate the limitations in the generation of large samples of simulated events. Binned multidimensional efficiency maps are however strongly limited by statistics. We propose a neural network approach to learn ratios of local densities to estimate in an optimal fashion efficiencies as a function of a set of parameters. Graph neural network techniques are used to account for the high dimensional correlations between different physics objects in the event. We show in a specific toy model how this method is applicable to produce accurate multidimensional efficiency maps for heavy flavor tagging classifiers in HEP experiments, including for processes on which it was not trained. - Indico - Paper - Code

A General Framework for Uncertainty Estimation in Deep Learning

Antonio Loquercio, Mattia Segù, Davide Scaramuzza

Neural networks predictions are unreliable when the input sample is out of the training distribution or corrupted by noise. Being able to detect such failures automatically is fundamental to integrate deep learning algorithms into robotics. Current approaches for uncertainty estimation of neural networks require changes to the network and optimization process, typically ignore prior knowledge about the data, and tend to make over-simplifying assumptions which underestimate uncertainty. To address these limitations, we propose a novel framework for uncertainty estimation. Based on Bayesian belief networks and Monte-Carlo sampling, our framework not only fully models the different sources of prediction uncertainty, but also incorporates prior data information, e.g. sensor noise. We show theoretically that this gives us the ability to capture uncertainty better than existing methods. In addition, our framework has several desirable properties: (i) it is agnostic to the network architecture and task; (ii) it does not require changes in the optimization process; (iii) it can be applied to already trained architectures. We thoroughly validate the proposed framework through extensive experiments on both computer vision and control tasks, where we outperform previous methods by up to 23% in accuracy.


Last update: December 5, 2023
\ No newline at end of file
+ ML Journal Club - CMS Machine Learning Documentation

CMS Machine Learning Journal Club

Welcome to the CMS Machine Learning Journal Club (JC)! Here we read and discuss new cutting-edge ML papers, with an emphasis on how these can be used within the collaboration. Below you can find a summary of each JC as well as some code examples demonstrating how to use the tools or methods introduced.

To vote for or to propose new papers for discussion, go to https://cms-ml-journalclub.web.cern.ch/.

Below follows a complete list of all the previous CMS ML Journal Clubs, together with relevant documentation and code examples.

Dealing with Nuisance Parameters using Machine Learning in High Energy Physics: a Review

Tommaso Dorigo, Pablo de Castro

Abstract: In this work we discuss the impact of nuisance parameters on the effectiveness of machine learning in high-energy physics problems, and provide a review of techniques that allow to include their effect and reduce their impact in the search for optimal selection criteria and variable transformations. The introduction of nuisance parameters complicates the supervised learning task and its correspondence with the data analysis goal, due to their contribution degrading the model performances in real data, and the necessary addition of uncertainties in the resulting statistical inference. The approaches discussed include nuisance-parameterized models, modified or adversary losses, semi-supervised learning approaches, and inference-aware techniques.

Mapping Machine-Learned Physics into a Human-Readable Space

Taylor Faucett, Jesse Thaler, Daniel Whiteson

Abstract: We present a technique for translating a black-box machine-learned classifier operating on a high-dimensional input space into a small set of human-interpretable observables that can be combined to make the same classification decisions. We iteratively select these observables from a large space of high-level discriminants by finding those with the highest decision similarity relative to the black box, quantified via a metric we introduce that evaluates the relative ordering of pairs of inputs. Successive iterations focus only on the subset of input pairs that are misordered by the current set of observables. This method enables simplification of the machine-learning strategy, interpretation of the results in terms of well-understood physical concepts, validation of the physical model, and the potential for new insights into the nature of the problem itself. As a demonstration, we apply our approach to the benchmark task of jet classification in collider physics, where a convolutional neural network acting on calorimeter jet images outperforms a set of six well-known jet substructure observables. Our method maps the convolutional neural network into a set of observables called energy flow polynomials, and it closes the performance gap by identifying a class of observables with an interesting physical interpretation that has been previously overlooked in the jet substructure literature. - Indico - Paper

Model Interpretability (2 papers):

Identifying the relevant dependencies of the neural network response on characteristics of the input space

Stefan Wunsch, Raphael Friese, Roger Wolf, Günter Quast

Abstract: The relation between the input and output spaces of neural networks (NNs) is investigated to identify those characteristics of the input space that have a large influence on the output for a given task. For this purpose, the NN function is decomposed into a Taylor expansion in each element of the input space. The Taylor coefficients contain information about the sensitivity of the NN response to the inputs. A metric is introduced that allows for the identification of the characteristics that mostly determine the performance of the NN in solving a given task. Finally, the capability of this metric to analyze the performance of the NN is evaluated based on a task common to data analyses in high-energy particle physics experiments.

iNNvestigate neural networks!

Maximilian Alber, Sebastian Lapuschkin, Philipp Seegerer, Miriam Hägele, Kristof T. Schütt, Grégoire Montavon, Wojciech Samek, Klaus-Robert Müller, Sven Dähne, Pieter-Jan Kindermans

Abstract: In recent years, deep neural networks have revolutionized many application domains of machine learning and are key components of many critical decision or predictive processes. Therefore, it is crucial that domain specialists can understand and analyze actions and predictions, even of the most complex neural network architectures. Despite these arguments neural networks are often treated as black boxes. In the attempt to alleviate this shortcoming many analysis methods were proposed, yet the lack of reference implementations often makes a systematic comparison between the methods a major effort. The presented library iNNvestigate addresses this by providing a common interface and out-of-the-box implementation for many analysis methods, including the reference implementation for PatternNet and PatternAttribution as well as for LRP-methods. To demonstrate the versatility of iNNvestigate, we provide an analysis of image classifications for variety of state-of-the-art neural network architectures.

Simulation-based inference in particle physics and beyond (and beyond)

Johann Brehmer, Kyle Cranmer

Abstract: Our predictions for particle physics processes are realized in a chain of complex simulators. They allow us to generate high-fidelity simulated data, but they are not well-suited for inference on the theory parameters with observed data. We explain why the likelihood function of high-dimensional LHC data cannot be explicitly evaluated, why this matters for data analysis, and reframe what the field has traditionally done to circumvent this problem. We then review new simulation-based inference methods that let us directly analyze high-dimensional data by combining machine learning techniques and information from the simulator. Initial studies indicate that these techniques have the potential to substantially improve the precision of LHC measurements. Finally, we discuss probabilistic programming, an emerging paradigm that lets us extend inference to the latent process of the simulator.

Efficiency Parameterization with Neural Networks

C. Badiali, F.A. Di Bello, G. Frattari, E. Gross, V. Ippolito, M. Kado, J. Shlomi

Abstract: Multidimensional efficiency maps are commonly used in high energy physics experiments to mitigate the limitations in the generation of large samples of simulated events. Binned multidimensional efficiency maps are however strongly limited by statistics. We propose a neural network approach to learn ratios of local densities to estimate in an optimal fashion efficiencies as a function of a set of parameters. Graph neural network techniques are used to account for the high dimensional correlations between different physics objects in the event. We show in a specific toy model how this method is applicable to produce accurate multidimensional efficiency maps for heavy flavor tagging classifiers in HEP experiments, including for processes on which it was not trained. - Indico - Paper - Code

A General Framework for Uncertainty Estimation in Deep Learning

Antonio Loquercio, Mattia Segù, Davide Scaramuzza

Neural networks predictions are unreliable when the input sample is out of the training distribution or corrupted by noise. Being able to detect such failures automatically is fundamental to integrate deep learning algorithms into robotics. Current approaches for uncertainty estimation of neural networks require changes to the network and optimization process, typically ignore prior knowledge about the data, and tend to make over-simplifying assumptions which underestimate uncertainty. To address these limitations, we propose a novel framework for uncertainty estimation. Based on Bayesian belief networks and Monte-Carlo sampling, our framework not only fully models the different sources of prediction uncertainty, but also incorporates prior data information, e.g. sensor noise. We show theoretically that this gives us the ability to capture uncertainty better than existing methods. In addition, our framework has several desirable properties: (i) it is agnostic to the network architecture and task; (ii) it does not require changes in the optimization process; (iii) it can be applied to already trained architectures. We thoroughly validate the proposed framework through extensive experiments on both computer vision and control tasks, where we outperform previous methods by up to 23% in accuracy.


Last update: December 5, 2023
\ No newline at end of file
diff --git a/optimization/data_augmentation.html b/optimization/data_augmentation.html
index 4e82432..ef54d3a 100644
--- a/optimization/data_augmentation.html
+++ b/optimization/data_augmentation.html
@@ -1,4 +1,4 @@
- Data augmentation - CMS Machine Learning Documentation

Data augmentation

Introduction

This introduction is based on papers by Shorten & Khoshgoftaar, 2019 and Rebuffi et al., 2021 among others

With the increasing complexity and sizes of neural networks one needs huge amounts of data in order to train a state-of-the-art model. However, generating this data is often very resource and time intensive. Thus, one might either augment the existing data with more descriptive variables or combat the data scarcity problem by artificially increasing the size of the dataset by adding new instances without the resource-heavy generation process. Both processes are known in machine learning (ML) applications as data augmentation (DA) methods.

The first type of these methods is more widely known as feature generation or feature engineering and is done on instance level. Feature engineering focuses on crafting informative input features for the algorithm, often inspired or derived from first principles specific to the algorithm's application domain.

The second type of method is done on the dataset level. These types of techniques can generally be divided into two main categories: real data augmentation (RDA) and synthetic data augmentation (SDA). As the name suggests, RDA makes minor changes to the already existing data in order to generate new samples, whereas SDA generates new data from scratch. Examples of RDA include rotating (especially useful if we expect the event to be rotationally symmetric) and zooming, among a plethora of other methods detailed in this overview article. Examples of SDA include traditional sampling methods and more complex generative models like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). Going further, the generative methods used for synthetic data augmentation could also be used for fast simulation, addressing a notable bottleneck in the overall physics analysis workflow.

Dataset augmentation may lead to more successful algorithm outcomes. For example, introducing noise into data to form additional data points improves the learning ability of several models which otherwise performed relatively poorly, as shown by Freer & Yang, 2020. This finding implies that this form of DA creates variations that the model may see in the real world. If done right, preprocessing the data with DA will result in superior training outcomes. This improvement in performance is due to the fact that DA methods act as a regularizer, reducing overfitting during training. In addition to simulating real-world variations, DA methods can also even out categorical data with imbalanced classes.

Data Augmentation
Fig. 1: Generic pipeline of a heuristic DA (figure taken from Li, 2020)

Before diving more in depth into the various DA methods and applications in HEP, here is a list of the most notable benefits of using DA methods in your ML workflow:

  • Improvement of model prediction precision
  • More training data for the model
  • Preventing data scarcity for state-of-the-art models
  • Reduction of overfitting and creation of data variability
  • Increased model generalization properties
  • Help in resolving class imbalance problems in datasets
  • Reduced cost of data collection and labeling
  • Enabling rare event prediction

And some words of caution:

  • There is no 'one size fits all' in DA. Each dataset and use case should be considered separately.
  • Don't trust the augmented data blindly.
  • Make sure that the augmented data is representative of the problem at hand; otherwise it will negatively affect the model performance.
  • There must be no unnecessary duplication of existing data; only by adding unique information do we gain more insights.
  • Ensure the validity of the augmented data before using it in ML models.
  • If a real dataset contains biases, data augmented from it will contain biases, too. Identifying an optimal data augmentation strategy is therefore important, so double-check your DA strategy.

Feature Engineering

This part is based mostly on Erdmann et al., 2018

Feature engineering (FE) is one of the key components of a machine learning workflow. This process transforms and augments training data with additional features in order to make the training more effective.

With multivariate analyses (MVAs), such as boosted decision trees (BDTs) and neural networks, one could start with raw, "low-level" features, like four-momenta, and the algorithm can learn higher-level patterns, correlations, metrics, etc. However, using "high-level" variables in many cases leads to outcomes superior to the use of low-level variables. As such, features used in MVAs are often handcrafted from physics first principles.

Still, it is shown that a deep neural network (DNN) can perform better if it is trained with both specifically constructed variables and low-level variables. This observation suggests that the network extracts additional information from the training data.

HEP Application - Lorentz Boost Network

For the purposes of FE in HEP, a novel ML architecture called a Lorentz Boost Network (LBN) (see Fig. 2) was proposed and implemented by Erdmann et al., 2018. It is a multipurpose method that uses Lorentz transformations to exploit and uncover structures in particle collision events. LBN is the first stage of a two-stage neural network (NN) model that enables a fully autonomous and comprehensive characterization of collision events by exploiting exclusively the four-momenta of the final-state particles.

Within LBN, particles are combined to create rest-frame representations, which enables the formation of further composite particles. These combinations are realized via linear combinations of N input four-vectors to a number of M particles and rest frames. Subsequently, these composite particles are transformed into said rest frames by Lorentz transformations in an efficient and fully vectorized implementation.

The properties of the composite, transformed particles are compiled in the form of characteristic variables like masses, angles, etc. that serve as input for a subsequent network - the second stage, which has to be configured for a specific analysis task, like classification.

The authors observed leading performance with the LBN and demonstrated that LBN forms physically meaningful particle combinations and generates suitable characteristic variables.

The usual ML workflow, employing LBN, is as follows:

Step-1: LBN(M, F)
+ Data augmentation - CMS Machine Learning Documentation       

Data augmentation

Introduction

This introduction is based on papers by Shorten & Khoshgoftaar, 2019 and Rebuffi et al., 2021 among others

With the increasing complexity and sizes of neural networks one needs huge amounts of data in order to train a state-of-the-art model. However, generating this data is often very resource and time intensive. Thus, one might either augment the existing data with more descriptive variables or combat the data scarcity problem by artificially increasing the size of the dataset by adding new instances without the resource-heavy generation process. Both processes are known in machine learning (ML) applications as data augmentation (DA) methods.

The first type of these methods is more widely known as feature generation or feature engineering and is done on instance level. Feature engineering focuses on crafting informative input features for the algorithm, often inspired or derived from first principles specific to the algorithm's application domain.

The second type of method is done on the dataset level. These types of techniques can generally be divided into two main categories: real data augmentation (RDA) and synthetic data augmentation (SDA). As the name suggests, RDA makes minor changes to the already existing data in order to generate new samples, whereas SDA generates new data from scratch. Examples of RDA include rotating (especially useful if we expect the event to be rotationally symmetric) and zooming, among a plethora of other methods detailed in this overview article. Examples of SDA include traditional sampling methods and more complex generative models like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). Going further, the generative methods used for synthetic data augmentation could also be used for fast simulation, addressing a notable bottleneck in the overall physics analysis workflow.
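As an illustration of real data augmentation by rotation, the following is a minimal sketch and not taken from the cited papers. It assumes events stored as a NumPy array of (pt, eta, phi) triplets; the function name `rotate_in_phi` and the array layout are hypothetical choices made for this example.

```python
import numpy as np


def rotate_in_phi(events, rng):
    """Rotate each event by one random azimuthal angle.

    events: array of shape (n_events, n_particles, 3) holding (pt, eta, phi).
    All particles of an event receive the same shift, so the relative
    angles inside the event are preserved.
    """
    n_events = events.shape[0]
    shift = rng.uniform(-np.pi, np.pi, size=(n_events, 1))
    rotated = events.copy()
    # Shift phi and wrap the result back into [-pi, pi).
    rotated[..., 2] = (events[..., 2] + shift + np.pi) % (2 * np.pi) - np.pi
    return rotated


rng = np.random.default_rng(seed=0)
# Toy dataset: 4 events with 3 particles each (values are arbitrary).
events = np.stack(
    [
        rng.uniform(20.0, 200.0, size=(4, 3)),   # pt
        rng.uniform(-2.5, 2.5, size=(4, 3)),     # eta
        rng.uniform(-np.pi, np.pi, size=(4, 3)), # phi
    ],
    axis=-1,
)

# The augmented dataset contains the original and the rotated copies.
augmented = np.concatenate([events, rotate_in_phi(events, rng)], axis=0)
```

Such a transformation is only appropriate if the physics is expected to be symmetric under it; the same pattern applies to other label-preserving transformations.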

Dataset augmentation may lead to more successful algorithm outcomes. For example, introducing noise into data to form additional data points improves the learning ability of several models which otherwise performed relatively poorly, as shown by Freer & Yang, 2020. This finding implies that this form of DA creates variations that the model may see in the real world. If done right, preprocessing the data with DA will result in superior training outcomes. This improvement in performance is due to the fact that DA methods act as a regularizer, reducing overfitting during training. In addition to simulating real-world variations, DA methods can also even out categorical data with imbalanced classes.

Data Augmentation
Fig. 1: Generic pipeline of a heuristic DA (figure taken from Li, 2020)

Before diving more in depth into the various DA methods and applications in HEP, here is a list of the most notable benefits of using DA methods in your ML workflow:

  • Improvement of model prediction precision
  • More training data for the model
  • Preventing data scarcity for state-of-the-art models
  • Reduction of overfitting and creation of data variability
  • Increased model generalization properties
  • Help in resolving class imbalance problems in datasets
  • Reduced cost of data collection and labeling
  • Enabling rare event prediction

And some words of caution:

  • There is no 'one size fits all' in DA. Each dataset and use case should be considered separately.
  • Don't trust the augmented data blindly.
  • Make sure that the augmented data is representative of the problem at hand; otherwise it will negatively affect the model performance.
  • There must be no unnecessary duplication of existing data; only by adding unique information do we gain more insights.
  • Ensure the validity of the augmented data before using it in ML models.
  • If a real dataset contains biases, data augmented from it will contain biases, too. Identifying an optimal data augmentation strategy is therefore important, so double-check your DA strategy.

Feature Engineering

This part is based mostly on Erdmann et al., 2018

Feature engineering (FE) is one of the key components of a machine learning workflow. This process transforms and augments training data with additional features in order to make the training more effective.

With multivariate analyses (MVAs), such as boosted decision trees (BDTs) and neural networks, one could start with raw, "low-level" features, like four-momenta, and the algorithm can learn higher-level patterns, correlations, metrics, etc. However, using "high-level" variables in many cases leads to outcomes superior to the use of low-level variables. As such, features used in MVAs are often handcrafted from physics first principles.

Still, it is shown that a deep neural network (DNN) can perform better if it is trained with both specifically constructed variables and low-level variables. This observation suggests that the network extracts additional information from the training data.

HEP Application - Lorentz Boost Network

For the purposes of FE in HEP, a novel ML architecture called a Lorentz Boost Network (LBN) (see Fig. 2) was proposed and implemented by Erdmann et al., 2018. It is a multipurpose method that uses Lorentz transformations to exploit and uncover structures in particle collision events. LBN is the first stage of a two-stage neural network (NN) model that enables a fully autonomous and comprehensive characterization of collision events by exploiting exclusively the four-momenta of the final-state particles.

Within LBN, particles are combined to create rest-frame representations, which enables the formation of further composite particles. These combinations are realized via linear combinations of N input four-vectors to a number of M particles and rest frames. Subsequently, these composite particles are transformed into said rest frames by Lorentz transformations in an efficient and fully vectorized implementation.
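The combination and boosting steps described above can be sketched in plain NumPy. This is an illustrative sketch only, not the authors' implementation: the weights are random non-negative numbers standing in for trainable parameters, particle m is paired with rest frame m, and the names `boost_to_rest_frame`, `mass`, `w_particles` and `w_frames` are hypothetical.

```python
import numpy as np


def boost_to_rest_frame(p, q):
    """Boost four-vectors p into the rest frames of four-vectors q.

    Both arrays have shape (..., 4) with components (E, px, py, pz).
    Assumes every q is timelike and not already at rest (beta != 0).
    """
    beta = q[..., 1:] / q[..., :1]                    # frame velocity
    beta2 = np.sum(beta**2, axis=-1, keepdims=True)
    gamma = 1.0 / np.sqrt(1.0 - beta2)
    bp = np.sum(beta * p[..., 1:], axis=-1, keepdims=True)
    energy = gamma * (p[..., :1] - bp)
    momentum = p[..., 1:] + ((gamma - 1.0) * bp / beta2 - gamma * p[..., :1]) * beta
    return np.concatenate([energy, momentum], axis=-1)


def mass(p):
    """Invariant mass of four-vectors of shape (..., 4)."""
    m2 = p[..., 0] ** 2 - np.sum(p[..., 1:] ** 2, axis=-1)
    return np.sqrt(np.maximum(m2, 0.0))


rng = np.random.default_rng(seed=1)
N, M = 4, 2  # N input four-vectors, M particle / rest-frame combinations

# Toy input: N particles of unit mass with random momenta.
momenta = rng.normal(scale=50.0, size=(N, 3))
energies = np.sqrt(np.sum(momenta**2, axis=1) + 1.0)
inputs = np.column_stack([energies, momenta])        # shape (N, 4)

# Non-negative weights keep the combinations physical (timelike).
w_particles = rng.uniform(size=(N, M))
w_frames = rng.uniform(size=(N, M))
particles = w_particles.T @ inputs                   # shape (M, 4)
frames = w_frames.T @ inputs                         # shape (M, 4)

# Transform particle m into rest frame m and extract simple features.
boosted = boost_to_rest_frame(particles, frames)
features = np.column_stack([mass(boosted), boosted[:, 0]])
```

Two quick sanity checks on such a sketch are that the invariant mass is unchanged by the boost and that a frame boosted into its own rest frame has zero momentum.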

The properties of the composite, transformed particles are compiled in the form of characteristic variables like masses, angles, etc. that serve as input for a subsequent network - the second stage, which has to be configured for a specific analysis task, like classification.

The authors observed leading performance with the LBN and demonstrated that LBN forms physically meaningful particle combinations and generates suitable characteristic variables.

The usual ML workflow, employing LBN, is as follows:

Step-1: LBN(M, F)
 
     1.0: Input hyperparameters: number of combinations M; number of features F
     1.0: Choose: number of incoming particles, N, according to the research
diff --git a/optimization/importance.html b/optimization/importance.html
index 70f31a5..516529f 100644
--- a/optimization/importance.html
+++ b/optimization/importance.html
@@ -1,4 +1,4 @@
- Feature importance - CMS Machine Learning Documentation       

Feature Importance

Feature importance is the impact a specific input field has on a prediction model's output. In general, these impacts can range from no impact (e.g. a feature with no variance) to perfect correlation with the output. There are several reasons to consider feature importance:

  • Important features can be used to create simplified models, e.g. to mitigate overfitting.
  • Using only important features can reduce the latency and memory requirements of the model.
  • The relative importance of a set of features can yield insight into the nature of an otherwise opaque model (improved interpretability).
  • If a model is sensitive to noise, rejecting irrelevant inputs may improve its performance.

In the following subsections, we detail several strategies for evaluating feature importance. We begin with a general discussion of feature importance at a high level before offering a code-based tutorial on some common techniques. We conclude with additional notes and comments in the last section.

General Discussion

Most feature importance methods fall into one of three broad categories: filter methods, embedded methods, and wrapper methods. Here we give a brief overview of each category with relevant examples:

Filter Methods

Filter methods do not rely on a specific model, instead considering features in the context of a given dataset. In this way, they may be considered to be pre-processing steps. In many cases, the goal of feature filtering is to reduce high-dimensional data. However, these methods are also applicable to data exploration, wherein an analyst simply seeks to learn about a dataset without actually removing any features. This knowledge may help interpret the performance of a downstream predictive model. Relevant examples include:

  • Domain Knowledge: Perhaps the most obvious strategy is to select features relevant to the domain of interest.

  • Variance Thresholding: One basic filtering strategy is to simply remove features with low variance. In the extreme case, features with zero variance do not vary from example to example, and will therefore have no impact on the model's final prediction. Likewise, features with variance below a given threshold may not affect a model's downstream performance.

  • Fisher Scoring: Fisher scoring can be used to rank features; the analyst would then select the highest scoring features as inputs to a subsequent model.

  • Correlations: Correlated features introduce a certain degree of redundancy to a dataset, so reducing the number of strongly correlated variables may not impact a model's downstream performance.

Embedded Methods

Embedded methods are specific to a prediction model and independent of the dataset. Examples:

  • L1 Regularization (LASSO): L1 regularization directly penalizes large model weights. In the context of linear regression, for example, this amounts to enforcing sparsity in the output prediction; weights corresponding to less relevant features will be driven to 0, nullifying the feature's effect on the output.

Wrapper Methods

Wrapper methods iterate on prediction models in the context of a given dataset. In general they may be computationally expensive when compared to filter methods. Examples:

  • Permutation Importance: Direct interpretation isn't always feasible, so other methods have been developed to inspect a feature's importance. One common and broadly applicable method is to randomly shuffle a given feature's input values and test the degradation of model performance. This process allows us to measure permutation importance as follows. First, fit a model (\(f\)) to training data, yielding \(f(X_\mathrm{train})\), where \(X_\mathrm{train}\in\mathbb{R}^{n\times d}\) for \(n\) input examples with \(d\) features. Next, measure the model's performance on testing data for some loss \(\mathcal{L}\), i.e. \(s=\mathcal{L}\big(f(X_\mathrm{test}), y_\mathrm{test}\big)\). For each feature \(j\in[1\ ..\ d]\), randomly shuffle the corresponding column in \(X_\mathrm{test}\) to form \(X_\mathrm{test}^{(j)}\). Repeat this process \(K\) times, so that for \(k\in [1\ ..\ K]\) each random shuffling of feature column \(j\) gives a corrupted input dataset \(X_\mathrm{test}^{(j,k)}\). Finally, define the permutation importance of feature \(j\) as the difference between the uncorrupted test score and the average test score over the corrupted \(X_\mathrm{test}^{(j,k)}\) datasets:
\[\texttt{PI}_j = s - \frac{1}{K}\sum_{k=1}^{K} \mathcal{L}[f(X_\mathrm{test}^{(j,k)}), y_\mathrm{test}]\]
  • Recursive Feature Elimination (RFE): Given a prediction model and test/train dataset splits with \(D\) initial features, RFE returns the set of \(d < D\) features that maximize model performance. First, the model is trained on the full set of features. The importance of each feature is ranked depending on the model type (e.g. for regression, the slopes are a sufficient ranking measure; permutation importance may also be used). The least important feature is rejected and the model is retrained. This process is repeated until the most significant \(d\) features remain.

Introduction by Example

Direct Interpretation

Linear regression is particularly interpretable because the prediction coefficients themselves can be interpreted as a measure of feature importance. Here we will compare this direct interpretation to several model inspection techniques. In the following examples we use the Diabetes Dataset available as a Scikit-learn toy dataset. This dataset maps 10 biological markers to a 1-dimensional quantitative measure of diabetes progression:

from sklearn.datasets import load_diabetes
+ Feature importance - CMS Machine Learning Documentation       

Feature Importance

Feature importance is the impact a specific input field has on a prediction model's output. In general, these impacts can range from no impact (e.g. a feature with no variance) to perfect correlation with the output. There are several reasons to consider feature importance:

  • Important features can be used to create simplified models, e.g. to mitigate overfitting.
  • Using only important features can reduce the latency and memory requirements of the model.
  • The relative importance of a set of features can yield insight into the nature of an otherwise opaque model (improved interpretability).
  • If a model is sensitive to noise, rejecting irrelevant inputs may improve its performance.

In the following subsections, we detail several strategies for evaluating feature importance. We begin with a general discussion of feature importance at a high level before offering a code-based tutorial on some common techniques. We conclude with additional notes and comments in the last section.

General Discussion

Most feature importance methods fall into one of three broad categories: filter methods, embedded methods, and wrapper methods. Here we give a brief overview of each category with relevant examples:

Filter Methods

Filter methods do not rely on a specific model, instead considering features in the context of a given dataset. In this way, they may be considered to be pre-processing steps. In many cases, the goal of feature filtering is to reduce high-dimensional data. However, these methods are also applicable to data exploration, wherein an analyst simply seeks to learn about a dataset without actually removing any features. This knowledge may help interpret the performance of a downstream predictive model. Relevant examples include:

  • Domain Knowledge: Perhaps the most obvious strategy is to select features relevant to the domain of interest.

  • Variance Thresholding: One basic filtering strategy is to simply remove features with low variance. In the extreme case, features with zero variance do not vary from example to example, and will therefore have no impact on the model's final prediction. Likewise, features with variance below a given threshold may not affect a model's downstream performance.

  • Fisher Scoring: Fisher scoring can be used to rank features; the analyst would then select the highest scoring features as inputs to a subsequent model.

  • Correlations: Correlated features introduce a certain degree of redundancy to a dataset, so reducing the number of strongly correlated variables may not impact a model's downstream performance.
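Two of the filter methods listed above, variance thresholding and correlation filtering, can be sketched on a toy dataset as follows. The data, the thresholds (`1e-3` and `0.95`) and the rule of dropping the later feature of each strongly correlated pair are assumptions made for illustration; only `VarianceThreshold` is an existing scikit-learn class.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(seed=0)
n = 500
x0 = rng.normal(size=n)
x1 = 2.0 * x0 + 0.01 * rng.normal(size=n)  # almost a rescaled copy of x0
x2 = np.full(n, 3.0)                       # constant, i.e. zero variance
x3 = rng.normal(size=n)                    # independent feature
X = np.column_stack([x0, x1, x2, x3])

# Step 1: remove features whose variance is below the threshold.
selector = VarianceThreshold(threshold=1e-3)
X_var = selector.fit_transform(X)
kept = np.flatnonzero(selector.get_support())  # indices into the columns of X

# Step 2: among the remaining features, drop the later one of every
# pair whose absolute Pearson correlation exceeds 0.95.
corr = np.abs(np.corrcoef(X_var, rowvar=False))
upper = np.triu(corr, k=1)                     # pairs (i, j) with i < j only
drop = np.flatnonzero((upper > 0.95).any(axis=0))
selected = np.delete(kept, drop)

print("kept after variance filter:", kept)
print("kept after correlation filter:", selected)
```

Neither step involves a model, which is what makes these filters cheap pre-processing steps.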

Embedded Methods

Embedded methods are specific to a prediction model and independent of the dataset. Examples:

  • L1 Regularization (LASSO): L1 regularization directly penalizes large model weights. In the context of linear regression, for example, this amounts to enforcing sparsity in the output prediction; weights corresponding to less relevant features will be driven to 0, nullifying the feature's effect on the output.
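The sparsity induced by L1 regularization can be seen in a small sketch using scikit-learn's `Lasso` on synthetic data. The dataset and the regularization strength `alpha=0.1` are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(500, 5))
# Only the first two features influence the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=500)

lasso = Lasso(alpha=0.1).fit(X, y)
coef = lasso.coef_

# Weights of irrelevant features are driven to zero, while the
# relevant ones are kept (slightly shrunk towards zero).
relevant = np.flatnonzero(coef != 0.0)
print("coefficients:", coef)
print("features with non-zero weight:", relevant)
```

Increasing `alpha` strengthens the penalty and eventually removes the relevant features as well, so in practice the value is usually tuned, e.g. by cross-validation.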

Wrapper Methods

Wrapper methods iterate on prediction models in the context of a given dataset. In general they may be computationally expensive when compared to filter methods. Examples:

  • Permutation Importance: Direct interpretation isn't always feasible, so other methods have been developed to inspect a feature's importance. One common and broadly applicable method is to randomly shuffle a given feature's input values and test the degradation of model performance. This process allows us to measure permutation importance as follows. First, fit a model (\(f\)) to training data, yielding \(f(X_\mathrm{train})\), where \(X_\mathrm{train}\in\mathbb{R}^{n\times d}\) for \(n\) input examples with \(d\) features. Next, measure the model's performance on testing data for some loss \(\mathcal{L}\), i.e. \(s=\mathcal{L}\big(f(X_\mathrm{test}), y_\mathrm{test}\big)\). For each feature \(j\in[1\ ..\ d]\), randomly shuffle the corresponding column in \(X_\mathrm{test}\) to form \(X_\mathrm{test}^{(j)}\). Repeat this process \(K\) times, so that for \(k\in [1\ ..\ K]\) each random shuffling of feature column \(j\) gives a corrupted input dataset \(X_\mathrm{test}^{(j,k)}\). Finally, define the permutation importance of feature \(j\) as the difference between the uncorrupted test score and the average test score over the corrupted \(X_\mathrm{test}^{(j,k)}\) datasets:
\[\texttt{PI}_j = s - \frac{1}{K}\sum_{k=1}^{K} \mathcal{L}[f(X_\mathrm{test}^{(j,k)}), y_\mathrm{test}]\]
  • Recursive Feature Elimination (RFE): Given a prediction model and test/train dataset splits with \(D\) initial features, RFE returns the set of \(d < D\) features that maximize model performance. First, the model is trained on the full set of features. The importance of each feature is ranked depending on the model type (e.g. for regression, the slopes are a sufficient ranking measure; permutation importance may also be used). The least important feature is rejected and the model is retrained. This process is repeated until the most significant \(d\) features remain.
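The two wrapper methods above can be sketched on a synthetic regression problem. The helper `permutation_importance_manual` is a hypothetical name that follows the permutation importance definition above step by step; it is evaluated here with the \(R^2\) score, for which higher is better, so important features obtain positive values (with a loss, where lower is better, the sign flips). The RFE part uses scikit-learn's `RFE`, which ranks features by the fitted coefficients of the estimator.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def permutation_importance_manual(model, X_test, y_test, score, K=10, seed=0):
    """Return PI_j = s - mean_k score(y_test, f(X_test^(j,k))) for each feature j."""
    rng = np.random.default_rng(seed)
    s = score(y_test, model.predict(X_test))      # uncorrupted score
    importances = np.zeros(X_test.shape[1])
    for j in range(X_test.shape[1]):
        corrupted_scores = []
        for _ in range(K):
            X_perm = X_test.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # shuffle column j only
            corrupted_scores.append(score(y_test, model.predict(X_perm)))
        importances[j] = s - np.mean(corrupted_scores)
    return importances


rng = np.random.default_rng(seed=0)
X = rng.normal(size=(1000, 3))
# Feature 0 dominates, feature 1 contributes weakly, feature 2 is unused.
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.normal(size=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pi = permutation_importance_manual(model, X_test, y_test, r2_score)
print("permutation importances:", pi)

# Recursive feature elimination down to the d = 2 most important features.
rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X_train, y_train)
print("features kept by RFE:", rfe.support_)
```

Note that the shuffling is done on the test set only, so the model is never retrained for permutation importance, whereas RFE retrains the model after each elimination.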

Introduction by Example

Direct Interpretation

Linear regression is particularly interpretable because the prediction coefficients themselves can be interpreted as a measure of feature importance. Here we will compare this direct interpretation to several model inspection techniques. In the following examples we use the Diabetes Dataset available as a Scikit-learn toy dataset. This dataset maps 10 biological markers to a 1-dimensional quantitative measure of diabetes progression:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
diff --git a/optimization/model_optimization.html b/optimization/model_optimization.html
index b9918bb..865fea3 100644
--- a/optimization/model_optimization.html
+++ b/optimization/model_optimization.html
@@ -1 +1 @@
- Model optimization - CMS Machine Learning Documentation       

Model optimization

This page summarizes the concepts shown in a contribution on Bayesian Optimization to the ML Forum and may be edited and published elsewhere by the author.

What we talk about when we talk about model optimization

Given some data \(x\) and a family of functionals parameterized by (a vector of) parameters \(\theta\) (e.g. for DNN training weights), the problem of learning consists in finding \(argmin_\theta Loss(f_\theta(x) - y_{true})\). The treatment below focusses on gradient descent, but the formalization is completely general, i.e. it can be applied also to methods that are not explicitly formulated in terms of gradient descent (e.g. BDTs). The mathematical formalism for the problem of learning is briefly explained in a contribution on statistical learning to the ML forum: for the purposes of this documentation we will proceed through two illustrations.

The first illustration, elaborated from an image by the huawei forums, shows the general idea behind learning through gradient descent in a multidimensional parameter space, where the minimum of a loss function is found by following the function's gradient until the minimum is reached.

The cartoon illustrates the general idea behind gradient descent to find the minimum of a function in a multidimensional parameter space (figure elaborated from an image by the huawei forums).

The model to be optimized via a loss function typically is a parametric function, where the set of parameters (e.g. the network weights in neural networks) corresponds to a certain fixed structure of the network. For example, a fully connected network with two inputs, two inner layers of two neurons each, and one output neuron has ten weights (4 + 4 + 2), plus five biases if bias terms are included, whose values will be changed until the loss function reaches its minimum.

When we talk about model optimization we refer to the fact that often we are interested in finding which model structure is the best to describe our data. The main concern is to design a model that has sufficient complexity to store all the information contained in the training data. We can therefore think of parameterizing the network structure itself, e.g. in terms of the number of inner layers and number of neurons per layer: these hyperparameters define a space where we want to again minimize a loss function. Formally, the parametric function \(f_\theta\) is also a function of these hyperparameters \(\lambda\): \(f_{(\theta, \lambda)}\), and the \(\lambda\) can be optimized.

The second illustration, also elaborated from an image by the huawei forums, broadly illustrates this concept: for each point in the hyperparameters space (that is, for each configuration of the model), the individual model is optimized as usual. The global minimum over the hyperparameters space is then sought.

The cartoon illustrates the general idea behind gradient descent to optimize the model complexity (in terms of the choice of hyperparameters) in a multidimensional parameter and hyperparameter space (figure elaborated from an image by the huawei forums).

Caveat: which data should you use to optimize your model

In typical machine learning studies, you should divide your dataset into three parts. One is used for training the model (training sample), one is used for testing the performance of the model (test sample), and the third one is the one where you actually use your trained model, e.g. for inference (application sample). Sometimes you may get away with using test data as application data: Helge Voss (Chap 5 of Behnke et al.) states that this is acceptable under three conditions that must be simultaneously valid:

  • no hyperparameter optimization is performed;
  • no overtraining is found;
  • the number of training data is high enough to make statistical fluctuations negligible.

If you are doing any kind of hyperparameter optimization, thou shalt NOT use the test sample as application sample. You should have at least three distinct sets, and ideally you should use four (training, testing, hyperparameter optimization, application).

The simplest hyperparameter optimization algorithm is the grid search, where you train all the models in the hyperparameters space to build the full landscape of the global loss function, as illustrated in Goodfellow, Bengio, Courville: "Deep Learning".

The cartoon illustrates the general idea behind grid search (image taken from Goodfellow, Bengio, Courville: "Deep Learning").

To perform a meaningful grid search, you have to provide a set of values within the acceptable range of each hyperparameter; then, for each point in the cross-product space, you have to train the corresponding model.

The main issue with grid search is that when there are nonimportant hyperparameters (i.e. hyperparameters whose value doesn't influence the model performance much) the algorithm spends an exponentially large time (in the number of nonimportant hyperparameters) in the noninteresting configurations: having \(m\) hyperparameters and testing \(n\) values for each of them leads to \(\mathcal{O}(n^m)\) tested configurations. While the issue may be mitigated by parallelization, when the number of hyperparameters (the dimension of the hyperparameters space) surpasses a handful, even parallelization can't help.
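To make the \(n^m\) scaling concrete, here is a minimal grid search sketch. The search space, the hyperparameter names, and the `validation_loss` function are hypothetical stand-ins for "train the model with this configuration and return its validation loss"; in the toy loss, `dropout` is deliberately made almost irrelevant to mimic a nonimportant hyperparameter.

```python
import itertools

# Hypothetical search space: m = 3 hyperparameters, n = 4 values each.
search_space = {
    "n_layers": [1, 2, 3, 4],
    "n_neurons": [8, 16, 32, 64],
    "dropout": [0.0, 0.1, 0.2, 0.3],  # assumed to be a nonimportant hyperparameter
}

def validation_loss(n_layers, n_neurons, dropout):
    """Toy stand-in for training a model and evaluating its validation loss."""
    return (n_layers - 3) ** 2 + (n_neurons - 32) ** 2 / 256.0 + 0.001 * dropout

# Every point of the cross-product space is one configuration to train.
names = list(search_space)
configurations = [
    dict(zip(names, values))
    for values in itertools.product(*search_space.values())
]

results = [(validation_loss(**cfg), cfg) for cfg in configurations]
best_loss, best_config = min(results, key=lambda item: item[0])

print(len(configurations))  # n**m = 4**3 configurations
print(best_config)
```

Note that three quarters of the trainings here only vary `dropout`, which barely changes the loss: this is the wasted effort described above.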

Another issue is that the search is binned: depending on the granularity in the scan, the global minimum may be invisible.

Despite these issues, grid search is sometimes still a feasible choice, and gives its best when done iteratively. For example, if you start from the interval \(\{-1, 0, 1\}\):

  • if the best parameter is found to be at the boundary (1), then extend range (\(\{1, 2, 3\}\)) and do the search in the new range;
  • if the best parameter is e.g. at 0, then maybe zoom in and do a search in the range \(\{-0.1, 0, 0.1\}\).
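The iterative strategy can be sketched for a single hyperparameter as follows. This is a simplified variant under stated assumptions: instead of jumping to a fully new range, the three-point grid is re-centred on the best point when that point sits at a boundary, and the step is shrunk by a factor of ten when the best point is the centre. The function name `iterative_grid_search` and the toy `loss` are hypothetical.

```python
def iterative_grid_search(loss, center=0.0, step=1.0, n_rounds=50, min_step=1e-3):
    """Repeat a three-point grid search, moving or zooming the grid each round."""
    for _ in range(n_rounds):
        grid = [center - step, center, center + step]
        best = min(grid, key=loss)
        if best == center:
            # Best point is inside the grid: zoom in around it.
            step /= 10.0
            if step < min_step:
                break
        else:
            # Best point is at a boundary: shift the grid in that direction.
            center = best
    return center

def loss(x):
    """Toy one-dimensional validation loss with its minimum at x = 2.34."""
    return (x - 2.34) ** 2

best_value = iterative_grid_search(loss)
print(best_value)
```

Starting from \(\{-1, 0, 1\}\), the grid first moves towards larger values and then zooms in once the minimum is bracketed.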

An improvement of the grid search is the random search, which proceeds like this:

  • you provide a marginal p.d.f. for each hyperparameter;
  • you sample from the joint p.d.f. a certain number of training configurations;
  • you train for each of these configurations to build the loss function landscape.
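A minimal random search sketch following these three steps is shown below. The marginal distributions, the hyperparameter names, and the toy `validation_loss` are assumptions chosen for illustration; sampling each marginal independently amounts to sampling from a joint p.d.f. that factorizes.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def sample_configuration():
    """Draw one configuration from independent marginal p.d.f.s."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),  # log-uniform in [1e-5, 1e-1]
        "n_neurons": rng.randint(8, 128),            # uniform over integers
        "dropout": rng.uniform(0.0, 0.5),            # uniform, continuous
    }

def validation_loss(cfg):
    """Toy stand-in for training a model and evaluating its validation loss."""
    return (
        (math.log10(cfg["learning_rate"]) + 3) ** 2
        + (cfg["n_neurons"] - 64) ** 2 / 1024.0
        + 0.001 * cfg["dropout"]
    )

trials = [sample_configuration() for _ in range(50)]
losses = [validation_loss(cfg) for cfg in trials]

best_index = min(range(len(trials)), key=losses.__getitem__)
best_config = trials[best_index]
print(best_config, losses[best_index])
```

Unlike a grid, every trial here probes a different learning rate, so no training is spent repeating the same value of an influential hyperparameter.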

This procedure has significant advantages over a simple grid search: random search is not binned, because you are sampling from a continuous p.d.f., so the pool of explorable hyperparameter values is larger; random search is exponentially more efficient, because it tests a unique value for each influential hyperparameter on nearly every trial.

Random search also works best when done iteratively. The differences between grid and random search are again illustrated in Goodfellow, Bengio, Courville: "Deep Learning".

The cartoon illustrates the general idea behind random search, as opposed to grid search (image taken from Goodfellow, Bengio, Courville: "Deep Learning").

Model-based optimization by gradient descent

Now that we have looked at the most basic model optimization techniques, we are ready to look into using gradient descent to solve a model optimization problem. We will proceed by recasting the problem as one of model selection, where the hyperparameters are the input (decision) variables, and the model selection criterion is a differentiable validation set error. The validation set error attempts to describe the complexity of the network by a single hyperparameter (details in a contribution on statistical learning to the ML forum). The problem may be solved with standard gradient descent, as illustrated above, if we assume that the training criterion \(C\) is continuous and differentiable with respect to both the parameters \(\theta\) (e.g. weights) and hyperparameters \(\lambda\). Unfortunately, the gradient is seldom available (either because it has a prohibitive computational cost, or because the criterion is non-differentiable, as is the case when there are discrete variables).

A diagram illustrating the way gradient-based model optimization works has been prepared by Bengio, doi:10.1162/089976600300015187.

The diagram illustrates the way model optimization can be recast as a model selection problem, where a model selection criterion involves a differentiable validation set error (image taken from Bengio, doi:10.1162/089976600300015187).

Model-based optimization by surrogates

Sequential Model-based Global Optimization (SMBO) consists in replacing the loss function with a surrogate model of it, when the loss function (i.e. the validation set error) is not available. The surrogate is typically built as a Bayesian regression model, in which one estimates the expected value of the validation set error for each hyperparameter together with the uncertainty in this expectation. The pseudocode for the SMBO algorithm is illustrated by Bergstra et al.

The diagram illustrates the pseudocode for the Sequential Model-based Global Optimization (image taken from Bergstra et al).

This procedure results in a tradeoff between exploration (proposing hyperparameters with high uncertainty, which may or may not result in substantial improvement) and exploitation (proposing hyperparameters that will likely perform as well as the current proposal; usually this means values close to the current ones). The disadvantage is that the whole procedure must run until completion before giving any usable information as an output. By comparison, manual or random searches tend to give hints on the location of the minimum faster.

Bayesian Optimization

We are now ready to tackle in full what is referred to as Bayesian optimization.

Bayesian optimization assumes that the unknown function \(f(\theta, \lambda)\) was sampled from a Gaussian process (GP), and that after the observations it maintains the corresponding posterior. In this context, observations are the various validation set errors for different values of the hyperparameters \(\lambda\). In order to pick the next value to probe, one maximizes some estimate of the expected improvement (see below). To understand the meaning of "sampled from a Gaussian process", we need to define what a Gaussian process is.

Gaussian processes

Gaussian processes (GPs) generalize the concept of Gaussian distribution over discrete random variables to the concept of Gaussian distribution over continuous functions. Given some data and an estimate of the Gaussian noise, by fitting a function one can estimate also the noise at the interpolated points. This estimate is made by similarity with contiguous points, adjusted by the distance between points. A GP is therefore fully described by its mean and its covariance function. An illustration of Gaussian processes is given in Kevin Jamieson's CSE599 lecture notes.

The diagram illustrates the evolution of a Gaussian process, when adding interpolating points (image taken from Kevin Jamieson's CSE599 lecture notes).

GPs are great for Bayesian optimization because they out-of-the-box provide the expected value (i.e. the mean of the process) and its uncertainty (covariance function).

The basic idea behind Bayesian optimization

Gradient descent methods are intrinsically local: the decision on the next step is taken based on the local gradient and Hessian approximations. Bayesian optimization (BO) with GP priors instead uses all the information from the previous steps, by encoding it in the model that gives the expectation and its uncertainty. The consequence is that GP-based BO can find the minimum of difficult nonconvex functions in relatively few evaluations, at the cost of performing more computations to find the next point to try in the hyperparameters space.

The BO prior is a prior over the space of the functions. GPs are especially suited to play the role of BO prior, because marginals and conditionals can be computed in closed form (thanks to the properties of the Gaussian distribution).

There are several methods to choose the acquisition function (the function that selects the next step for the algorithm), but there is no omnipurpose recipe: the best approach is problem-dependent. The acquisition function involves an accessory optimization to maximize a certain quantity; typical choices are:

  • maximize the probability of improvement over the current best value: can be calculated analytically for a GP;
  • maximize the expected improvement over the current best value: can also be calculated analytically for a GP;
  • maximize the GP Upper confidence bound: minimize "regret" over the course of the optimization.
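The loop "fit the GP, maximize the acquisition function, evaluate, repeat" can be sketched with plain NumPy. This is a minimal illustration under several assumptions, not a production implementation: a zero-mean GP prior with a squared exponential kernel of fixed length scale, expected improvement (for minimization) as the acquisition function, a fixed grid of candidate points instead of a proper inner optimizer, and a toy one-dimensional `objective` standing in for the validation set error as a function of a single hyperparameter \(\lambda\). All function names are hypothetical.

```python
import math
import numpy as np

def rbf_kernel(a, b, length_scale=0.3):
    """Squared exponential covariance between two sets of 1D points."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf_kernel(x_obs, x_new)
    mean = K_s.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mean, std, best):
    """Closed-form expected improvement over `best` when minimizing."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf

def objective(lam):
    """Toy stand-in for the validation set error as a function of lambda."""
    return np.sin(3.0 * lam) + 0.5 * lam**2

x_obs = np.array([-1.5, 0.0, 1.5])       # initial hyperparameter values
y_obs = objective(x_obs)                  # their (toy) validation errors
candidates = np.linspace(-2.0, 2.0, 201)  # where the acquisition is evaluated

for _ in range(10):
    mean, std = gp_posterior(x_obs, y_obs, candidates)
    ei = expected_improvement(mean, std, y_obs.min())
    x_next = candidates[np.argmax(ei)]    # accessory optimization step
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

best_lambda = x_obs[np.argmin(y_obs)]
print(best_lambda, y_obs.min())
```

The expected improvement is large both where the posterior mean is low (exploitation) and where the posterior uncertainty is large (exploration), which is how the acquisition function balances the two.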

Historical note

Gaussian process regression is also called kriging in geostatistics, after Daniel G. Krige (1951), who pioneered the concept later formalized by Matheron (1962).

Bayesian optimization in practice

The figure below, taken from a tutorial on BO by Martin Krasser, clarifies the procedure rather well. The task is to approximate the target function (labelled noise free objective in the figure), given some noisy samples of it (the black crosses). At the first iteration, one starts from a flat surrogate function, with a given uncertainty, and fits it to the noisy samples. To choose the next sampling location, a certain acquisition function is computed, and the value that maximizes it is chosen as the next sampling location. At each iteration, more noisy samples are added, until the distance between consecutive sampling locations is minimized (or, equivalently, a measure of the value of the best selected sample is maximized).

Practical illustration of Bayesian Optimization (images taken from a tutorial on BO by Martin Krasser).

Limitations (and some workarounds) of Bayesian Optimization

There are three main limitations to the BO approach. A good overview of these limitations and of possible solutions can be found in arXiv:1206.2944.

First of all, it is unclear what is an appropriate choice for the covariance function and its associated hyperparameters. In particular, the standard squared exponential kernel is often too smooth. As a workaround, alternative kernels may be used: a common choice is the Matérn 5/2 kernel, which is similar to the squared exponential one but allows for non-smoothness.

Another issue is that, for certain problems, the function evaluation may take very long to compute. To overcome this, often one can replace the function evaluation with the Monte Carlo integration of the expected improvement over the GP hyperparameters, which is faster.

The third main issue is that for complex problems one would ideally like to take advantage of parallel computation. The procedure is iterative, however, and it is not easy to come up with a scheme to make it parallelizable. The referenced paper proposed sampling over the expected acquisition, conditioned on all the pending evaluations: this is computationally cheap and is intrinsically parallelizable.

Alternatives to Gaussian processes: Tree-based models

Gaussian Processes directly model \(P(hyperpar | data)\), but they are not the only suitable surrogate models for Bayesian optimization.

The so-called Tree-structured Parzen Estimator (TPE), described in Bergstra et al, models separately \(P(data | hyperpar)\) and \(P(hyperpar)\), to then obtain the posterior by explicit application of Bayes' theorem. TPEs exploit the fact that the choice of hyperparameters is intrinsically graph-structured, in the sense that e.g. you first choose the number of layers, then choose neurons per layer, etc. TPEs run over this generative process by replacing the hyperparameters priors with nonparametric densities. These generative nonparametric densities are built by classifying the observed configurations into those that result in a worse or a better loss than the current proposal.

TPEs have been used in CMS already around 2017 in a VHbb analysis (see repository by Sean-Jiun Wang) and in a charged Higgs to tb search (HIG-18-004, doi:10.1007/JHEP01(2020)096).

Implementations of Bayesian Optimization

Caveats: don't get too obsessed with model optimization

In general, optimizing model structure is a good thing. F. Chollet e.g. says "If you want to get to the very limit of what can be achieved on a given task, you can't be content with arbitrary choices made by a fallible human". On the other hand, for many problems hyperparameter optimization results only in small improvements, and there is a tradeoff between improvement and time spent on the task: sometimes the time spent on optimization may not be worth it, e.g. when the loss landscape in hyperparameters space is very flat (i.e. different hyperparameter sets give more or less the same results), particularly if you already know that small improvements will be eaten up by e.g. systematic uncertainties. Before you perform the optimization, however, you don't know if the landscape is flat or if you can expect substantial improvements. Sometimes broad grid or random searches may give you a hint on whether the landscape of hyperparameters space is flat or not.

Sometimes you may get good (and faster) improvements by model ensembling rather than by model optimization. To do model ensembling, you first train a handful of models (either different methods---BDT, SVM, NN, etc---or different hyperparameters sets): \(pred\_a = model\_a.predict(x)\), ..., \(pred\_d = model\_d.predict(x)\). You then pool the predictions: \(pooled\_pred = (pred\_a + pred\_b + pred\_c + pred\_d)/4\). This works if all models are reasonably good: if one is significantly worse than the others, then \(pooled\_pred\) may not be as good as the best model of the pool.

You can also find ways of ensembling in a smarter way, e.g. by doing weighted rather than simple averages: \(pooled\_pred = 0.5\cdot pred\_a + 0.25\cdot pred\_b + 0.1\cdot pred\_c + 0.15\cdot pred\_d\) (the weights already sum to one, so no further normalization is needed). Here the idea is to give more weight to better classifiers. However, you transfer the problem to having to choose the weights. These can be found empirically by using random search or other algorithms like Nelder-Mead (result = scipy.optimize.minimize(objective, pt, method='nelder-mead')), where you build simplexes (polytopes with N+1 vertices in N dimensions, a generalization of the triangle) and move them towards better values of the objective (lower values, when calling scipy.optimize.minimize). Nelder-Mead can converge to nonstationary points, but there are extensions of the algorithm that may help.
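The comparison between simple and weighted averaging can be sketched as follows. Everything here is an assumption made for illustration: the four "models" are emulated by adding Gaussian noise of different sizes to a synthetic truth, the figure of merit is the mean squared error, and the weights are chosen by random search over the simplex (Dirichlet samples) rather than by Nelder-Mead. In a real study the weights should be tuned on a sample that is not reused for the final evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic validation targets and four emulated model predictions (a..d).
y_val = rng.normal(size=500)
noise_levels = [0.2, 0.4, 0.8, 0.6]  # hypothetical quality of models a..d
preds = np.stack(
    [y_val + rng.normal(scale=s, size=y_val.size) for s in noise_levels]
)

def mse(weights):
    """Mean squared error of the weighted pooled prediction."""
    pooled_pred = weights @ preds
    return float(np.mean((pooled_pred - y_val) ** 2))

# Simple average: all weights equal to 1/4.
uniform = np.full(len(preds), 1.0 / len(preds))
uniform_mse = mse(uniform)

# Random search over non-negative weights that sum to one.
best_weights, best_mse = uniform, uniform_mse
for _ in range(2000):
    candidate = rng.dirichlet(np.ones(len(preds)))
    candidate_mse = mse(candidate)
    if candidate_mse < best_mse:
        best_weights, best_mse = candidate, candidate_mse

print(uniform_mse, best_mse)
print(best_weights)
```

Because the simple average is used as the starting point, the weighted pool can only match or improve on it for this sample; the largest weight ends up on the least noisy model.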


This page summarizes the concepts shown in a contribution on Bayesian Optimization to the ML Forum. Content may be edited and published elsewhere by the author. Page author: Pietro Vischia, 2022


Last update: December 5, 2023
\ No newline at end of file + Model optimization - CMS Machine Learning Documentation

Model optimization

This page summarizes the concepts shown in a contribution on Bayesian Optimization to the ML Forum and may be edited and published elsewhere by the author.

What we talk about when we talk about model optimization

Given some data \(x\) and a family of functionals parameterized by (a vector of) parameters \(\theta\) (e.g. for DNN training weights), the problem of learning consists in finding \(argmin_\theta Loss(f_\theta(x) - y_{true})\). The treatment below focusses on gradient descent, but the formalization is completely general, i.e. it can be applied also to methods that are not explicitly formulated in terms of gradient descent (e.g. BDTs). The mathematical formalism for the problem of learning is briefly explained in a contribution on statistical learning to the ML forum: for the purposes of this documentation we will proceed through two illustrations.

The first illustration, elaborated from an image by the huawei forums, shows the general idea behind learning through gradient descent in a multidimensional parameter space, where the minimum of a loss function is found by following the function's gradient until the minimum is reached.

The cartoon illustrates the general idea behind gradient descent to find the minimum of a function in a multidimensional parameter space (figure elaborated from an image by the huawei forums).

The model to be optimized via a loss function typically is a parametric function, where the set of parameters (e.g. the network weights in neural networks) corresponds to a certain fixed structure of the network. For example, a fully connected network with two inputs, two inner layers of two neurons each, and one output neuron has ten weights (4 + 4 + 2), plus five biases if bias terms are included, whose values will be changed until the loss function reaches its minimum.

When we talk about model optimization we refer to the fact that often we are interested in finding which model structure is the best to describe our data. The main concern is to design a model that has sufficient complexity to store all the information contained in the training data. We can therefore think of parameterizing the network structure itself, e.g. in terms of the number of inner layers and number of neurons per layer: these hyperparameters define a space where we want to again minimize a loss function. Formally, the parametric function \(f_\theta\) is also a function of these hyperparameters \(\lambda\): \(f_{(\theta, \lambda)}\), and the \(\lambda\) can be optimized.

The second illustration, also elaborated from an image by the huawei forums, broadly illustrates this concept: for each point in the hyperparameters space (that is, for each configuration of the model), the individual model is optimized as usual. The global minimum over the hyperparameters space is then sought.

The cartoon illustrates the general idea behind gradient descent to optimize the model complexity (in terms of the choice of hyperparameters) in a multidimensional parameter and hyperparameter space (figure elaborated from an image by the huawei forums).

Caveat: which data should you use to optimize your model

In typical machine learning studies, you should divide your dataset into three parts. One is used for training the model (training sample), one is used for testing the performance of the model (test sample), and the third one is the one where you actually use your trained model, e.g. for inference (application sample). Sometimes you may get away with using test data as application data: Helge Voss (Chap 5 of Behnke et al.) states that this is acceptable under three conditions that must be simultaneously valid:

  • no hyperparameter optimization is performed;
  • no overtraining is found;
  • the number of training data is high enough to make statistical fluctuations negligible.

If you are doing any kind of hyperparameter optimization, thou shalt NOT use the test sample as application sample. You should have at least three distinct sets, and ideally you should use four (training, testing, hyperparameter optimization, application).

The simplest hyperparameter optimization algorithm is the grid search, where you train all the models in the hyperparameters space to build the full landscape of the global loss function, as illustrated in Goodfellow, Bengio, Courville: "Deep Learning".

The cartoon illustrates the general idea behind grid search (image taken from Goodfellow, Bengio, Courville: "Deep Learning").

To perform a meaningful grid search, you have to provide a set of values within the acceptable range of each hyperparameter; then, for each point in the cross-product space, you have to train the corresponding model.

The main issue with grid search is that when there are nonimportant hyperparameters (i.e. hyperparameters whose value doesn't influence the model performance much) the algorithm spends an exponentially large time (in the number of nonimportant hyperparameters) in the noninteresting configurations: having \(m\) hyperparameters and testing \(n\) values for each of them leads to \(\mathcal{O}(n^m)\) tested configurations. While the issue may be mitigated by parallelization, when the number of hyperparameters (the dimension of the hyperparameters space) surpasses a handful, even parallelization can't help.

Another issue is that the search is binned: depending on the granularity in the scan, the global minimum may be invisible.

Despite these issues, grid search is sometimes still a feasible choice, and gives its best when done iteratively. For example, if you start from the interval \(\{-1, 0, 1\}\):

  • if the best parameter is found to be at the boundary (1), then extend range (\(\{1, 2, 3\}\)) and do the search in the new range;
  • if the best parameter is e.g. at 0, then maybe zoom in and do a search in the range \(\{-0.1, 0, 0.1\}\).

An improvement of the grid search is the random search, which proceeds like this:

  • you provide a marginal p.d.f. for each hyperparameter;
  • you sample from the joint p.d.f. a certain number of training configurations;
  • you train for each of these configurations to build the loss function landscape.

This procedure has significant advantages over a simple grid search: random search is not binned, because you are sampling from a continuous p.d.f., so the pool of explorable hyperparameter values is larger; random search is exponentially more efficient, because it tests a unique value for each influential hyperparameter on nearly every trial.

Random search also works best when done iteratively. The differences between grid and random search are again illustrated in Goodfellow, Bengio, Courville: "Deep Learning".

The cartoon illustrates the general idea behind random search, as opposed to grid search (image taken from Goodfellow, Bengio, Courville: "Deep Learning").

Model-based optimization by gradient descent

Now that we have looked at the most basic model optimization techniques, we are ready to look into using gradient descent to solve a model optimization problem. We will proceed by recasting the problem as one of model selection, where the hyperparameters are the input (decision) variables, and the model selection criterion is a differentiable validation set error. The validation set error attempts to describe the complexity of the network by a single hyperparameter (details in a contribution on statistical learning to the ML forum). The problem may be solved with standard gradient descent, as illustrated above, if we assume that the training criterion \(C\) is continuous and differentiable with respect to both the parameters \(\theta\) (e.g. weights) and hyperparameters \(\lambda\). Unfortunately, the gradient is seldom available (either because it has a prohibitive computational cost, or because the criterion is non-differentiable, as is the case when there are discrete variables).

A diagram illustrating the way gradient-based model optimization works has been prepared by Bengio, doi:10.1162/089976600300015187.

The diagram illustrates the way model optimization can be recast as a model selection problem, where a model selection criterion involves a differentiable validation set error (image taken from Bengio, doi:10.1162/089976600300015187).

Model-based optimization by surrogates

Sequential Model-based Global Optimization (SMBO) consists in replacing the loss function with a surrogate model of it, when the loss function (i.e. the validation set error) is not available. The surrogate is typically built as a Bayesian regression model, in which one estimates the expected value of the validation set error for each hyperparameter together with the uncertainty in this expectation. The pseudocode for the SMBO algorithm is illustrated by Bergstra et al.

The diagram illustrates the pseudocode for the Sequential Model-based Global Optimization (image taken from Bergstra et al).

This procedure results in a tradeoff between exploration (proposing hyperparameters with high uncertainty, which may or may not result in substantial improvement) and exploitation (proposing hyperparameters that will likely perform as well as the current proposal; usually this means values close to the current ones). The disadvantage is that the whole procedure must run until completion before giving any usable information as an output. By comparison, manual or random searches tend to give hints on the location of the minimum faster.

Bayesian Optimization

We are now ready to tackle in full what is referred to as Bayesian optimization.

Bayesian optimization assumes that the unknown function \(f(\theta, \lambda)\) was sampled from a Gaussian process (GP), and that after the observations it maintains the corresponding posterior. In this context, observations are the various validation set errors for different values of the hyperparameters \(\lambda\). In order to pick the next value to probe, one maximizes some estimate of the expected improvement (see below). To understand the meaning of "sampled from a Gaussian process", we need to define what a Gaussian process is.

Gaussian processes

Gaussian processes (GPs) generalize the concept of Gaussian distribution over discrete random variables to the concept of Gaussian distribution over continuous functions. Given some data and an estimate of the Gaussian noise, by fitting a function one can estimate also the noise at the interpolated points. This estimate is made by similarity with contiguous points, adjusted by the distance between points. A GP is therefore fully described by its mean and its covariance function. An illustration of Gaussian processes is given in Kevin Jamieson's CSE599 lecture notes.

The diagram illustrates the evolution of a Gaussian process, when adding interpolating points (image taken from Kevin Jamieson's CSE599 lecture notes).

GPs are great for Bayesian optimization because they out-of-the-box provide the expected value (i.e. the mean of the process) and its uncertainty (covariance function).

The basic idea behind Bayesian optimization

Gradient descent methods are intrinsically local: the decision on the next step is taken based on the local gradient and Hessian approximations. Bayesian optimization (BO) with GP priors instead uses all the information from the previous steps, by encoding it in the model that gives the expectation and its uncertainty. The consequence is that GP-based BO can find the minimum of difficult nonconvex functions in relatively few evaluations, at the cost of performing more computations to find the next point to try in the hyperparameters space.

The BO prior is a prior over the space of the functions. GPs are especially suited to play the role of BO prior, because marginals and conditionals can be computed in closed form (thanks to the properties of the Gaussian distribution).

There are several methods to choose the acquisition function (the function that selects the next step for the algorithm), but there is no omnipurpose recipe: the best approach is problem-dependent. The acquisition function involves an accessory optimization to maximize a certain quantity; typical choices are:

  • maximize the probability of improvement over the current best value: can be calculated analytically for a GP;
  • maximize the expected improvement over the current best value: can also be calculated analytically for a GP;
  • maximize the GP Upper confidence bound: minimize "regret" over the course of the optimization.

Historical note

Gaussian process regression is also called kriging in geostatistics, after Daniel G. Krige (1951), who pioneered the concept later formalized by Matheron (1962).

Bayesian optimization in practice

The figure below, taken from a tutorial on BO by Martin Krasser, clarifies the procedure rather well. The task is to approximate the target function (labelled noise free objective in the figure), given some noisy samples of it (the black crosses). At the first iteration, one starts from a flat surrogate function, with a given uncertainty, and fits it to the noisy samples. To choose the next sampling location, a certain acquisition function is computed, and the value that maximizes it is chosen as the next sampling location. At each iteration, more noisy samples are added, until the distance between consecutive sampling locations is minimized (or, equivalently, a measure of the value of the best selected sample is maximized).

Practical illustration of Bayesian Optimization (images taken from a tutorial on BO by Martin Krasser]).

Limitations (and some workaround) of Bayesian Optimization

There are three main limitations to the BO approach. A good overview of these limitations and of possible solutions can be found in arXiv:1206.2944.

First of all, it is unclear what an appropriate choice is for the covariance function and its associated hyperparameters. In particular, the standard squared exponential kernel is often too smooth. As a workaround, alternative kernels may be used: a common choice is the Matérn 5/2 kernel, which is similar to the squared exponential one but allows for non-smoothness.
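To make the comparison concrete, the two kernels can be written as functions of the distance r between two points. This is a minimal sketch using the standard textbook forms of both kernels; the parameter names `length_scale` and `variance` are illustrative. Sample functions drawn from a GP with the Matérn 5/2 kernel are only twice differentiable, whereas those drawn with the squared exponential kernel are infinitely differentiable.

```python
import math


def squared_exponential(r, length_scale=1.0, variance=1.0):
    """k(r) = variance * exp(-r^2 / (2 * length_scale^2))."""
    return variance * math.exp(-0.5 * (r / length_scale) ** 2)


def matern52(r, length_scale=1.0, variance=1.0):
    """k(r) = variance * (1 + s + s^2 / 3) * exp(-s), with s = sqrt(5) * |r| / length_scale."""
    s = math.sqrt(5.0) * abs(r) / length_scale
    return variance * (1.0 + s + s * s / 3.0) * math.exp(-s)


# Compare the two covariances at a few distances.
for r in (0.0, 0.5, 1.0, 2.0):
    print(r, squared_exponential(r), matern52(r))
```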

Another issue is that, for certain problems, the function evaluation may take very long to compute. To overcome this, often one can replace the function evaluation with the Monte Carlo integration of the expected improvement over the GP hyperparameters, which is faster.

The third main issue is that for complex problems one would ideally like to take advantage of parallel computation. The procedure is iterative, however, and it is not easy to come up with a scheme to make it parallelizable. The referenced paper proposed sampling over the expected acquisition, conditioned on all the pending evaluations: this is computationally cheap and is intrinsically parallelizable.

Alternatives to Gaussian processes: Tree-based models

Gaussian Processes model directly \(P(hyperpar | data)\) but are not the only suitable surrogate models for Bayesian optimization.

The so-called Tree-structured Parzen Estimator (TPE), described in Bergstra et al, models separately \(P(data | hyperpar)\) and \(P(hyperpar)\), to then obtain the posterior by explicit application of the Bayes theorem. TPEs exploit the fact that the choice of hyperparameters is intrinsically graph-structured, in the sense that e.g. you first choose the number of layers, then choose neurons per layer, etc. TPEs run over this generative process by replacing the hyperparameter priors with nonparametric densities. These generative nonparametric densities are built by classifying them into those that result in worse/better loss than the current proposal.

TPEs have been used in CMS already around 2017 in a VHbb analysis (see repository by Sean-Jiun Wang) and in a charged Higgs to tb search (HIG-18-004, doi:10.1007/JHEP01(2020)096).

Implementations of Bayesian Optimization

Caveats: don't get too obsessed with model optimization

In general, optimizing model structure is a good thing. F. Chollet e.g. says "If you want to get to the very limit of what can be achieved on a given task, you can't be content with arbitrary choices made by a fallible human". On the other hand, for many problems hyperparameter optimization results only in small improvements, and there is a tradeoff between improvement and time spent on the task: sometimes the time spent on optimization may not be worth it, e.g. when the loss landscape in hyperparameter space is very flat (i.e. different hyperparameter sets give more or less the same results), particularly if you already know that small improvements will be eaten up by e.g. systematic uncertainties. Then again, before you perform the optimization you don't know whether the landscape is flat or whether you can expect substantial improvements. Sometimes broad grid or random searches may give you a hint on whether the landscape of the hyperparameter space is flat or not.

Sometimes you may get good (and faster) improvements by model ensembling rather than by model optimization. To do model ensembling, you first train a handful of models (either different methods, such as BDT, SVM, NN, etc., or different hyperparameter sets): \(pred\_a = model\_a.predict(x)\), ..., \(pred\_d = model\_d.predict(x)\). You then pool the predictions: \(pooled\_pred = (pred\_a + pred\_b + pred\_c + pred\_d)/4\). This works if all models are reasonably good: if one is significantly worse than the others, then \(pooled\_pred\) may not be as good as the best model of the pool.

You can also ensemble in a smarter way, e.g. by taking weighted rather than simple averages: \(pooled\_pred = 0.5\cdot pred\_a + 0.25\cdot pred\_b + 0.1\cdot pred\_c + 0.15\cdot pred\_d\). Here the idea is to give more weight to better classifiers. However, this transfers the problem to choosing the weights. These can be found empirically by using random search or other algorithms like Nelder-Mead (result = scipy.optimize.minimize(objective, pt, method='nelder-mead')), which builds simplexes (polytopes with N+1 vertices in N dimensions, a generalization of the triangle) and moves them towards better values of the objective (scipy.optimize.minimize searches for a minimum). Nelder-Mead can converge to nonstationary points, but there are extensions of the algorithm that may help.
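The weighted ensembling described above can be sketched as follows. This is a minimal illustration under stated assumptions: the toy predictions and labels are invented for the example, the mean squared error is just one possible objective, and plain random search over normalized weights is used so that the snippet needs only the standard library. The same `mean_squared_error` objective could instead be handed to scipy.optimize.minimize with the Nelder-Mead method.

```python
import random


def pooled_prediction(predictions, weights):
    """Weighted average of the per-model predictions for one event."""
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)


def mean_squared_error(all_predictions, labels, weights):
    """Average squared distance between the pooled prediction and the label."""
    errors = [
        (pooled_prediction(preds, weights) - label) ** 2
        for preds, label in zip(all_predictions, labels)
    ]
    return sum(errors) / len(errors)


def random_search_weights(all_predictions, labels, n_models, n_trials=2000, seed=0):
    """Random search over weights that sum to one, starting from the simple average."""
    rng = random.Random(seed)
    best_weights = [1.0 / n_models] * n_models
    best_loss = mean_squared_error(all_predictions, labels, best_weights)
    for _ in range(n_trials):
        # Small offset keeps the normalization well defined.
        raw = [rng.random() + 1e-12 for _ in range(n_models)]
        norm = sum(raw)
        trial = [r / norm for r in raw]
        loss = mean_squared_error(all_predictions, labels, trial)
        if loss < best_loss:
            best_weights, best_loss = trial, loss
    return best_weights, best_loss


# Toy validation set: each tuple holds the scores of three hypothetical models.
labels = [1, 0, 1, 1, 0, 0]
all_predictions = [
    (0.9, 0.6, 0.7),
    (0.2, 0.5, 0.1),
    (0.8, 0.4, 0.9),
    (0.7, 0.7, 0.6),
    (0.1, 0.6, 0.3),
    (0.3, 0.4, 0.2),
]

weights, loss = random_search_weights(all_predictions, labels, n_models=3)
print("weights:", [round(w, 3) for w in weights], "loss:", round(loss, 4))
```

The weights should be tuned on a validation set that is separate from both the training set and the final test set, otherwise the pooled performance will be optimistically biased.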


This page summarizes the concepts shown in a contribution on Bayesian Optimization to the ML Forum. Content may be edited and published elsewhere by the author. Page author: Pietro Vischia, 2022


Last update: December 5, 2023
\ No newline at end of file diff --git a/resources/cloud_resources/index.html b/resources/cloud_resources/index.html index b891d86..76bb801 100644 --- a/resources/cloud_resources/index.html +++ b/resources/cloud_resources/index.html @@ -1 +1 @@ - Cloud Resources - CMS Machine Learning Documentation

Work in progress.


Last update: December 5, 2023
\ No newline at end of file + Cloud Resources - CMS Machine Learning Documentation

Work in progress.


Last update: December 5, 2023
\ No newline at end of file diff --git a/resources/dataset_resources/index.html b/resources/dataset_resources/index.html index acd293a..639ab07 100644 --- a/resources/dataset_resources/index.html +++ b/resources/dataset_resources/index.html @@ -1,4 +1,4 @@ - Dataset Resources - CMS Machine Learning Documentation

CMS-ML Dataset Tab

Introduction

Welcome to the CMS-ML Dataset tab! This tab is designed to provide accurate, up-to-date, and relevant data for a variety of purposes. We strive to make it a useful resource for your analysis and decision-making needs. We are working on benchmarking more datasets and presenting them in a user-friendly format. This tab will be continuously updated to reflect the latest developments. Explore, analyze, and derive insights with ease!

1. JetNet

Github Repository

Zenodo

Description

JetNet is a project aimed at enhancing accessibility and reproducibility in jet-based machine learning. It offers easy-to-access and standardized interfaces for several datasets, including JetNet, TopTagging, and QuarkGluon. Additionally, JetNet provides standard implementations of various generative evaluation metrics such as Fréchet Physics Distance (FPD), Kernel Physics Distance (KPD), Wasserstein-1 (W1), Fréchet ParticleNet Distance (FPND), coverage, and Minimum Matching Distance (MMD). Beyond these, it includes a differentiable implementation of the energy mover's distance and other general jet utilities, making it a comprehensive resource for researchers and practitioners in the field.

Nature of Objects

  • Objects: Gluon (g), Top Quark (t), Light Quark (q), W boson (w), and Z boson (z) jets of ~1 TeV transverse momentum (\(p_T\))
  • Number of Objects: N = 177252, 177945, 170679, 177172, 176952 for g, t, q, w, z jets respectively.

Format of Dataset

  • File Type: HDF5
  • Structure: Each file has particle_features and jet_features arrays, containing the list of particles' features per jet and the corresponding jet's features, respectively. particle_features is of shape [N, 30, 4], where N is the total number of jets, 30 is the maximum number of particles per jet, and 4 is the number of particle features, in order: [\(\eta\), \(\varphi\), \(p_T\), mask]. See Zenodo for definitions of these. jet_features is of shape [N, 4], where 4 is the number of jet features, in order: [\(p_T\), \(\eta\), mass, number of particles].

2. Top Tagging Benchmark Dataset

Zenodo

Description

A set of MC simulated training/testing events for the evaluation of top quark tagging architectures.

  • 14 TeV, hadronic tops for signal, QCD dijets background, Delphes ATLAS detector card with Pythia8
  • No MPI/pile-up included
  • Clustering of particle-flow entries (produced by Delphes E-flow) into anti-kT 0.8 jets in the pT range [550,650] GeV
  • All top jets are matched to a parton-level top within ∆R = 0.8, and to all top decay partons within 0.8
  • Jets are required to have |eta| < 2
  • The leading 200 jet constituent four-momenta are stored, with zero-padding for jets with fewer than 200
  • Constituents are sorted by pT, with the highest pT one first
  • The truth top four-momentum is stored as truth_px etc.
  • A flag (1 for top, 0 for QCD) is kept for each jet. It is called is_signal_new
  • The variable "ttv" (= test/train/validation) is kept for each jet. It indicates to which dataset the jet belongs. It is redundant as the different sets are already distributed as different files.

Nature of Objects

  • Objects: 14 TeV, hadronic tops for signal, QCD dijets background, Delphes ATLAS detector card with Pythia8
  • Number of Objects: In total 1.2M training events, 400k validation events and 400k test events.

Format of Dataset

  • File Type: HDF5
  • Structure: Use “train” for training, “val” for validation during the training and “test” for final testing and reporting results. For details, see the Zenodo link
  • Butter, Anja; Kasieczka, Gregor; Plehn, Tilman and Russell, Michael (2017). Based on data from 10.21468/SciPostPhys.5.3.028 (1707.08966)
  • Kasieczka, Gregor et al (2019). Dataset used for arXiv:1902.09914 (The Machine Learning Landscape of Top Taggers)

More datasets coming soon!

Have any questions? Want your dataset shown on this page? Contact the ML Knowledge Subgroup!

cms-ml/documentation

CMS-ML Dataset Tab

Introduction

Welcome to the CMS-ML Dataset tab! This tab is designed to provide accurate, up-to-date, and relevant data for a variety of purposes. We strive to make it a useful resource for your analysis and decision-making needs. We are working on benchmarking more datasets and presenting them in a user-friendly format. This tab will be continuously updated to reflect the latest developments. Explore, analyze, and derive insights with ease!

1. JetNet

Github Repository

Zenodo

Description

JetNet is a project aimed at enhancing accessibility and reproducibility in jet-based machine learning. It offers easy-to-access and standardized interfaces for several datasets, including JetNet, TopTagging, and QuarkGluon. Additionally, JetNet provides standard implementations of various generative evaluation metrics such as Fréchet Physics Distance (FPD), Kernel Physics Distance (KPD), Wasserstein-1 (W1), Fréchet ParticleNet Distance (FPND), coverage, and Minimum Matching Distance (MMD). Beyond these, it includes a differentiable implementation of the energy mover's distance and other general jet utilities, making it a comprehensive resource for researchers and practitioners in the field.

Nature of Objects

  • Objects: Gluon (g), Top Quark (t), Light Quark (q), W boson (w), and Z boson (z) jets of ~1 TeV transverse momentum (\(p_T\))
  • Number of Objects: N = 177252, 177945, 170679, 177172, 176952 for g, t, q, w, z jets respectively.

Format of Dataset

  • File Type: HDF5
  • Structure: Each file has particle_features and jet_features arrays, containing the list of particles' features per jet and the corresponding jet's features, respectively. particle_features is of shape [N, 30, 4], where N is the total number of jets, 30 is the maximum number of particles per jet, and 4 is the number of particle features, in order: [\(\eta\), \(\varphi\), \(p_T\), mask]. See Zenodo for definitions of these. jet_features is of shape [N, 4], where 4 is the number of jet features, in order: [\(p_T\), \(\eta\), mass, number of particles].

2. Top Tagging Benchmark Dataset

Zenodo

Description

A set of MC simulated training/testing events for the evaluation of top quark tagging architectures.

  • 14 TeV, hadronic tops for signal, QCD dijets background, Delphes ATLAS detector card with Pythia8
  • No MPI/pile-up included
  • Clustering of particle-flow entries (produced by Delphes E-flow) into anti-kT 0.8 jets in the pT range [550,650] GeV
  • All top jets are matched to a parton-level top within ∆R = 0.8, and to all top decay partons within 0.8
  • Jets are required to have |eta| < 2
  • The leading 200 jet constituent four-momenta are stored, with zero-padding for jets with fewer than 200
  • Constituents are sorted by pT, with the highest pT one first
  • The truth top four-momentum is stored as truth_px etc.
  • A flag (1 for top, 0 for QCD) is kept for each jet. It is called is_signal_new
  • The variable "ttv" (= test/train/validation) is kept for each jet. It indicates to which dataset the jet belongs. It is redundant as the different sets are already distributed as different files.

Nature of Objects

  • Objects: 14 TeV, hadronic tops for signal, QCD dijets background, Delphes ATLAS detector card with Pythia8
  • Number of Objects: In total 1.2M training events, 400k validation events and 400k test events.

Format of Dataset

  • File Type: HDF5
  • Structure: Use “train” for training, “val” for validation during the training and “test” for final testing and reporting results. For details, see the Zenodo link
  • Butter, Anja; Kasieczka, Gregor; Plehn, Tilman and Russell, Michael (2017). Based on data from 10.21468/SciPostPhys.5.3.028 (1707.08966)
  • Kasieczka, Gregor et al (2019). Dataset used for arXiv:1902.09914 (The Machine Learning Landscape of Top Taggers)

More datasets coming soon!

Have any questions? Want your dataset shown on this page? Contact the ML Knowledge Subgroup!

cms-ml/documentation

Work in progress.


Last update: December 5, 2023
\ No newline at end of file + FPGA Resource - CMS Machine Learning Documentation

Work in progress.


Last update: December 5, 2023
\ No newline at end of file diff --git a/resources/gpu_resources/cms_resources/lxplus_gpu.html b/resources/gpu_resources/cms_resources/lxplus_gpu.html index aa428ca..b37c390 100644 --- a/resources/gpu_resources/cms_resources/lxplus_gpu.html +++ b/resources/gpu_resources/cms_resources/lxplus_gpu.html @@ -1,2 +1,2 @@ - lxplus-gpu - CMS Machine Learning Documentation

lxplus-gpu.cern.ch

How to use it?

lxplus-gpu are special lxplus nodes with GPU support. You can access these nodes by executing

ssh <your_user_name>@lxplus-gpu.cern.ch
+ lxplus-gpu - CMS Machine Learning Documentation       

lxplus-gpu.cern.ch

How to use it?

lxplus-gpu are special lxplus nodes with GPU support. You can access these nodes by executing

ssh <your_user_name>@lxplus-gpu.cern.ch
 

Untitled

The configuration of the software environment for lxplus-gpu is described in the Software Environments page.


Last update: December 5, 2023
\ No newline at end of file diff --git a/resources/gpu_resources/cms_resources/lxplus_htcondor.html b/resources/gpu_resources/cms_resources/lxplus_htcondor.html index 8f1a029..b49cc80 100644 --- a/resources/gpu_resources/cms_resources/lxplus_htcondor.html +++ b/resources/gpu_resources/cms_resources/lxplus_htcondor.html @@ -1,2 +1,2 @@ - CERN HTCondor - CMS Machine Learning Documentation

HTCondor With GPU resources

In general, HTCondor supports GPU jobs if there are some worker nodes which are configured with GPU devices. CMS Connect and lxplus both have access to worker nodes equipped with GPUs.

How to require GPUs in HTCondor

You can request GPU support for your jobs by adding the following line to the HTCondor submit file.

request_gpus = n # n equal to the number of GPUs required
+ CERN HTCondor - CMS Machine Learning Documentation       

HTCondor With GPU resources

In general, HTCondor supports GPU jobs if there are some worker nodes which are configured with GPU devices. CMS Connect and lxplus both have access to worker nodes equipped with GPUs.

How to require GPUs in HTCondor

You can request GPU support for your jobs by adding the following line to the HTCondor submit file.

request_gpus = n # n equal to the number of GPUs required
 

Further documentation

There are good materials providing detailed documentation on how to run HTCondor jobs with GPU support on both systems.

The configuration of the software environment for lxplus-gpu and HTCondor is described in the Software Environments page. Moreover, the page Using container explains step by step how to build a Docker image to be run in HTCondor jobs.

More available resources

  1. Complete documentation can be found in the GPUs section of the CERN Batch Docs, where a TensorFlow example is supplied. This documentation also contains instructions on advanced HTCondor configuration, for instance constraining the GPU device or CUDA version.
  2. A good example of submitting a GPU HTCondor job on lxplus is the weaver-benchmark project. It provides a concrete example of how to set up the environment for the weaver framework and run the training and testing process within a single job. A detailed description can be found in the ParticleNet section of this documentation.

    In principle, this example can be run elsewhere as HTCondor jobs. However, paths to the datasets should be modified to meet the requirements.

  3. CMS Connect also provides documentation on GPU job submission, which also includes a TensorFlow example.

    When submitting GPU jobs on CMS Connect, especially for machine learning purposes, the EOS space at CERN is not accessible as a directory; one should therefore consider using the xrootd utilities as documented in this page.


Last update: December 5, 2023
\ No newline at end of file diff --git a/resources/gpu_resources/cms_resources/ml_cern_ch.html b/resources/gpu_resources/cms_resources/ml_cern_ch.html index 3ce4618..0698540 100644 --- a/resources/gpu_resources/cms_resources/ml_cern_ch.html +++ b/resources/gpu_resources/cms_resources/ml_cern_ch.html @@ -1 +1 @@ - ml.cern.ch - CMS Machine Learning Documentation

ml.cern.ch

ml.cern.ch is a Kubeflow based ML solution provided by CERN.

Kubeflow

Kubeflow is a Kubernetes-based ML toolkit aiming at making deployments of ML workflows simple, portable and scalable. The pipeline is an important concept in Kubeflow: machine learning workflows are described as Kubeflow pipelines for execution.

How to access

ml.cern.ch only accepts connections from within the CERN network. Therefore, if you are outside of CERN, you will need to use a network tunnel (e.g. via ssh -D dynamic port forwarding as a SOCKS5 proxy). The main website is shown below.

Untitled

Examples

After logging into the main website, you can click on the Examples entry to browse a GitLab repository containing many examples. For instance, below are two examples from that repository, each with a well-documented README file.

  1. mnist-kfp is an example of how to use Jupyter notebooks to create a Kubeflow pipeline (kfp) and how to access CERN EOS files.
  2. katib gives an example of how to use Katib to perform hyperparameter tuning for jet tagging with ParticleNet.

Last update: December 5, 2023
\ No newline at end of file + ml.cern.ch - CMS Machine Learning Documentation

ml.cern.ch

ml.cern.ch is a Kubeflow based ML solution provided by CERN.

Kubeflow

Kubeflow is a Kubernetes-based ML toolkit aiming at making deployments of ML workflows simple, portable and scalable. The pipeline is an important concept in Kubeflow: machine learning workflows are described as Kubeflow pipelines for execution.

How to access

ml.cern.ch only accepts connections from within the CERN network. Therefore, if you are outside of CERN, you will need to use a network tunnel (e.g. via ssh -D dynamic port forwarding as a SOCKS5 proxy). The main website is shown below.

Untitled

Examples

After logging into the main website, you can click on the Examples entry to browse a GitLab repository containing many examples. For instance, below are two examples from that repository, each with a well-documented README file.

  1. mnist-kfp is an example of how to use Jupyter notebooks to create a Kubeflow pipeline (kfp) and how to access CERN EOS files.
  2. katib gives an example of how to use Katib to perform hyperparameter tuning for jet tagging with ParticleNet.

Last update: December 5, 2023
\ No newline at end of file diff --git a/resources/gpu_resources/cms_resources/notebooks/pytorch_mnist.html b/resources/gpu_resources/cms_resources/notebooks/pytorch_mnist.html index af8114c..5f77079 100644 --- a/resources/gpu_resources/cms_resources/notebooks/pytorch_mnist.html +++ b/resources/gpu_resources/cms_resources/notebooks/pytorch_mnist.html @@ -1,4 +1,4 @@ - Pytorch mnist - CMS Machine Learning Documentation
from __future__ import print_function
+ Pytorch mnist - CMS Machine Learning Documentation      
from __future__ import print_function
 import argparse
 import torch
 import torch.nn as nn
diff --git a/resources/gpu_resources/cms_resources/notebooks/toptagging_mlp.html b/resources/gpu_resources/cms_resources/notebooks/toptagging_mlp.html
index be8cdd0..78f037a 100644
--- a/resources/gpu_resources/cms_resources/notebooks/toptagging_mlp.html
+++ b/resources/gpu_resources/cms_resources/notebooks/toptagging_mlp.html
@@ -1,4 +1,4 @@
- Toptagging mlp - CMS Machine Learning Documentation      

import torch
+ Toptagging mlp - CMS Machine Learning Documentation      

import torch
 import torch.nn as nn
 from torch.utils.data.dataset import Dataset
 import pandas as pd
diff --git a/resources/gpu_resources/cms_resources/swan.html b/resources/gpu_resources/cms_resources/swan.html
index 471f6a7..3b1eb50 100644
--- a/resources/gpu_resources/cms_resources/swan.html
+++ b/resources/gpu_resources/cms_resources/swan.html
@@ -1 +1 @@
- SWAN - CMS Machine Learning Documentation       

SWAN

Preparation

  1. Registration:

    To request GPU resources for SWAN: according to this thread, one can create a ticket through this link to ask for GPU support in SWAN. It is currently in beta and limited to a small scale.

  2. Setup SWAN with GPU resources:

    1. Once the registration is done, one can log in to SWAN with Kerberos support and then create a SWAN environment.

Untitled

Untitled

Another important option is the environment script, which will be discussed later in this document.

Working with SWAN

  1. After creation, one will browse the SWAN main directory My Project where all existing projects are displayed. A new project can be created by clicking the upper right "+" button. After creation one will be redirected to the newly created project, at which point the "+" button on the upper right panel can be used for creating new notebook.

    Untitled

    Untitled

  2. It is possible to use the terminal for installing new packages or monitoring computational resources.

    1. For package installation, one can use package management tools, e.g. pip for Python. To use the installed packages, you will need to wrap the environment configuration in a script, which will be executed by SWAN. Detailed documentation can be found by clicking the upper right "?" button.

    2. In addition to using top and htop to monitor ordinary resources, you can use nvidia-smi to monitor GPU usage.

    Untitled

Examples

After installing the packages, you can use GPU-based machine learning algorithms. Two examples are supplied.

  1. The first example aims at using a CNN to perform handwritten digits classification with MNIST dataset. The whole notebook can be found at pytorch_mnist. This example is modified from an official pytorch example.

  2. The second example is modified from the simple MLP example from weaver-benchmark. The whole notebook can be found at toptagging_mlp.


Last update: December 5, 2023
\ No newline at end of file + SWAN - CMS Machine Learning Documentation

SWAN

Preparation

  1. Registration:

    To request GPU resources for SWAN: according to this thread, one can create a ticket through this link to ask for GPU support in SWAN. It is currently in beta and limited to a small scale.

  2. Setup SWAN with GPU resources:

    1. Once the registration is done, one can log in to SWAN with Kerberos support and then create a SWAN environment.

Untitled

Untitled

Another important option is the environment script, which will be discussed later in this document.

Working with SWAN

  1. After creation, one will browse the SWAN main directory My Project where all existing projects are displayed. A new project can be created by clicking the upper right "+" button. After creation one will be redirected to the newly created project, at which point the "+" button on the upper right panel can be used for creating new notebook.

    Untitled

    Untitled

  2. It is possible to use the terminal for installing new packages or monitoring computational resources.

    1. For package installation, one can use package management tools, e.g. pip for Python. To use the installed packages, you will need to wrap the environment configuration in a script, which will be executed by SWAN. Detailed documentation can be found by clicking the upper right "?" button.

    2. In addition to using top and htop to monitor ordinary resources, you can use nvidia-smi to monitor GPU usage.

    Untitled

Examples

After installing the packages, you can use GPU-based machine learning algorithms. Two examples are supplied.

  1. The first example aims at using a CNN to perform handwritten digits classification with MNIST dataset. The whole notebook can be found at pytorch_mnist. This example is modified from an official pytorch example.

  2. The second example is modified from the simple MLP example from weaver-benchmark. The whole notebook can be found at toptagging_mlp.


Last update: December 5, 2023
\ No newline at end of file diff --git a/search/search_index.json b/search/search_index.json index 970fcfb..d0682af 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"index.html","title":"Home","text":"

Welcome to the documentation hub for the CMS Machine Learning Group! The goal of this page is to provide CMS analyzers a centralized place to gather machine learning information relevant to their work. However, we are not seeking to rewrite external documentation. Whenever applicable, we will link to external documentation, such as the iML groups HEP Living Review or their ML Resources repository. What you will find here are pages covering:

  • ML best practices
  • How to optimize a NN
  • Common pitfalls for CMS analyzers
  • Direct and indirect inferencing using a variety of ML packages
  • How to get a model integrated into CMSSW

And much more!

If you think we are missing some important information, please contact the ML Knowledge Subgroup!

"},{"location":"general_advice/intro.html","title":"Introduction","text":"

In general, ML models don't really work out of the box. For example, most often it is not sufficient to simply instantiate the model class, call its fit() method followed by predict(), and then proceed straight to the inference step of the analysis.

from sklearn.datasets import make_circles\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.svm import SVC\n\nX, y = make_circles(noise=0.2, factor=0.5, random_state=1)\nX_train, X_test, y_train, y_test = \\\n        train_test_split(X, y, test_size=.4, random_state=42)\n\nclf = SVC(kernel=\"linear\", C=0.025)\nclf.fit(X_train, y_train)\nprint(f'Accuracy: {clf.score(X_test, y_test)}')\n# Accuracy: 0.4\n

Being an extremely simplified and naive example, one would be lucky to have the code above produce a valid and optimal model. This is because it explicitly doesn't check for those things which could've gone wrong and therefore is prone to producing undesirable results. Indeed, there are several pitfalls which one may encounter on the way towards implementation of ML into their analysis pipeline. These can be easily avoided by being aware of those and performing a few simple checks here and there.

Therefore, this section is intended to review potential issues on the ML side and how they can be approached in order to train a robust and optimal model. The section is designed to be, to a large extent, analysis-agnostic. It will focus on common, generalized validation steps from the ML perspective, without placing particular emphasis on the physical context. However, for illustrative purposes, it will be supplemented with some examples from HEP and additional links for further reading. As a last remark, in the following the emphasis will mostly be on the validation items specific to supervised learning. This includes classification and regression problems, which are so far the most common use cases amongst HEP analysts.

The General Advice chapter is divided into 3 sections. Things become logically aligned if presented from the perspective of the training procedure (fitting/loss minimisation part). That is, the sections will group validation items as they need to be investigated:

  • Before training
  • During training
  • After training

Authors: Oleg Filatov

"},{"location":"general_advice/after/after.html","title":"After training","text":"

After the necessary steps to design the ML experiment have been made, and the training has been performed and verified to be stable and consistent, there are still a few things to be checked to further solidify the confidence in the model performance.

"},{"location":"general_advice/after/after.html#final-evaluation","title":"Final evaluation","text":"

Before the training, the initial data set is to be split into the train and test parts, where the former is used to train the model (possibly with cross-validation), while the latter remains blinded. Once all the optimisations to the model architecture have been made and the model is \"frozen\", one proceeds to the evaluation of the metrics' values on the test set. This would be the very last check of the model for overfitting and, in case there is none, one expects to see little or no difference compared to the values on the (cross)validation set used throughout the training. In turn, any discrepancies could point to possible overfitting happening in the training stage (or possibly also to data leakage), which requires further investigation.

The next step is to check the output score of the model (probability) for each class. It can be done, for example, in the form of a TMVA-like overtraining check (see Figure 1):

Figure 1. Comparison of model output for signal and background classes overlaid for train and test data sets. [source: root-forum.cern.ch]

In general, what is important to look at is that in the category for class C (defined as argmax(score_i)), the score for class C peaks at values close to 1, whereas the scores of the other classes do not: they peak to the left of 1 and fall smoothly to zero as the model score in the category approaches 1. In other words, the distributions of the model score for the various classes should overlap as little as possible and be as far apart as possible. This would be an indication that the model indeed distinguishes between the classes.

Another thing to look at is the data/simulation agreement for class categories. Since it is the output of the model for each category which is used in the further statistical inference step, it is important to verify that the data/simulation agreement of the input features is properly propagated through the model into the categories' distributions. This can be achieved by producing a plot similar to the one shown in Figure 2: the stacked templates for background processes are fitted and compared with the actual predictions for the data for the set of events classified to be in the given category (jet-fakes in the example). If the output data/simulation agreement is worse than the input one, it might point to an existing bias of the model in the way it treats data and simulation events.

Figure 2. Postfit jet-fake NN score for the mutau channel. Note that the distribution for jet-fakes class is dominant in this category and also peaks at value 1 (mind the log scale), which is an indication of good identification of this background process by the model. Furthermore, ratio of data and MC templates is equal to 1 within uncertainties. [source: CMS-PAS-HIG-20-006]"},{"location":"general_advice/after/after.html#robustness","title":"Robustness","text":"

Once there is high confidence that the model isn't overtrained and no distortion in the input feature data/MC agreement is introduced, one can consider studying the robustness of the model to the parameter/input variations. Effectively, the model can be considered as a \"point estimate\", and any variations are helpful to understand the variance of the model outputs - hence, the model's robustness to changes.

A simple example would be a hyperparameter optimisation, where various model parameters are varied to find the best configuration in terms of performance. Moreover, in HEP there is the notion of systematic uncertainties, which in this particular case is a convenient tool to study model robustness to input data variations.

Since in any case they need to be incorporated into the final statistical fit (to be performed on some interpretation of the model score), these uncertainties need to be \"propagated\" through the model. A sizeable fraction of those uncertainties are so-called \"up/down\" (or shape) variations, and therefore it is a good opportunity to study how the model output responds to those up/down input feature changes. If a high sensitivity is observed, one needs to consider removing the most influential feature from the training, or trying decorrelation techniques to decrease the impact of the systematics-affected feature on the model output.
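As an illustration of propagating up/down variations through a model, the sketch below shifts one input feature and compares the binned output score with the nominal one. Everything here is assumed for the example: `toy_model` is a hypothetical stand-in for a trained classifier, and the 3% scale variation on the first feature is an arbitrary choice.

```python
import numpy as np

def toy_model(x):
    # Hypothetical stand-in for a trained classifier: logistic function
    # of a fixed linear combination of two input features.
    w = np.array([1.5, -0.5])
    return 1.0 / (1.0 + np.exp(-(x @ w)))

rng = np.random.default_rng(seed=1)
x_nominal = rng.normal(size=(10000, 2))

# Assumed +/-3% scale uncertainty on the first feature only
x_up, x_down = x_nominal.copy(), x_nominal.copy()
x_up[:, 0] *= 1.03
x_down[:, 0] *= 0.97

bins = np.linspace(0.0, 1.0, 11)
h_nom, _ = np.histogram(toy_model(x_nominal), bins=bins)
h_up, _ = np.histogram(toy_model(x_up), bins=bins)
h_down, _ = np.histogram(toy_model(x_down), bins=bins)

# Per-bin ratio to nominal: large deviations from 1 flag high sensitivity
ratio_up = h_up / np.maximum(h_nom, 1)
ratio_down = h_down / np.maximum(h_nom, 1)
print(np.round(ratio_up, 3))
print(np.round(ratio_down, 3))
```

Repeating this for each systematic source gives a quick ranking of which input features drive the variation of the output score.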

"},{"location":"general_advice/after/after.html#systematic-biases","title":"Systematic biases","text":"

Lastly, possible systematic biases arising from the ML approach should be estimated. Since this is a broad and not fully formalised topic, a few examples are given below to outline their possible sources.

  • The first one could be a domain shift, that is the situation where the model is trained on one data domain but is applied to a different one (e.g. trained on simulated data, applied to real data). In order to account for that, corresponding scale factor corrections are traditionally derived, and those come with some uncertainty as well.
  • Another example would be the case of undertraining. Consider the case of fitting complex polynomial data with a simple linear function. In that case, the model has high bias (and low variance), which results in a systematic shift of its predictions that needs to be taken into account.
  • Care needs to be taken in cases where a cut is applied on the model output. Cuts might introduce shifts, and in the case of the model score, which is a variable with a complex and non-linear relationship with the input features, they might create undesirable biases. For example, when cutting on the output score and looking at the invariant mass distribution (e.g. of two jets), one can observe an effect known as mass sculpting (see Figure 3). In that case, the background distribution after the cut peaks at the mass of the resonance used as the signal in the classification task. Signal and background shapes then overlap and become very similar, which dilutes the discrimination power between the two hypotheses if the invariant mass is used as the observable to be fitted.
Figure 3. Left: Distributions of signal and background events without selection. Right: Background distributions at 50% signal efficiency (true positive rate) for different classifiers. The unconstrained classifier sculpts a peak at the W-boson mass, while other classifiers do not. [source: arXiv:2010.09745]
  1. Here it is assumed that it can be treated as probability to be assigned to a given class. This is mostly the case if there is a sigmoid/softmax used on the output layer of the neural network and the model is trained with a cross-entropy loss function.\u00a0\u21a9

"},{"location":"general_advice/before/domains.html","title":"Domains","text":"

Data plays a crucial role in the process of training any ML model. It is something from which the model learns to solve a given task and therefore care needs to be taken with its handling. There are two main considerations when collecting and preparing data for an ML task:

  1. The data set should be relevant to the problem and should represent the underlying structure of the problem without containing potential biases and irrelevant deviations (e.g. MC simulation artefacts).
  2. A proper preprocessing of the data set should be performed so that the training step goes smoothly.

In this section a general domain perspective on data will be covered. In the following sections a more granular look will be taken from the side of features and construction of inputs to the model.

"},{"location":"general_advice/before/domains.html#coverage","title":"Coverage","text":"

To begin with, one needs to bear in mind that the training data should be as close as possible to the data one expects to have in the context of the analysis. Speaking in more formal terms,

Domains of training (used to train the model) and inference (used to make final predictions) data sets should not sizeably diverge.

Examples
  • In most cases the model is trained on MC simulated data and later applied to data to produce predictions, which are then passed on to the statistical inference step. MC simulation isn't perfect and therefore there are always differences between the simulation and data domains. This can lead to cases where the model learns simulation artefacts which come e.g. from detector response mismodelling. Thus, its performance on data may be suboptimal at best and meaningless at worst.
  • Consider a model which is trained to predict the energy of a hadron given its energy deposits in the calorimeter (represented e.g. in the form of an image or a graph). The training data consists of showers initiated by a particle generated by a particle gun and having discrete values of energy (e.g. 1 GeV, 10 GeV, 20 GeV, etc.). However, in a real-world setting the model will be applied to showers produced by particles with an underlying continuous energy spectrum. Although ML models are known for their capability to interpolate between training points, without appropriate tests the model performance in the parts of the energy spectrum not covered by the training data is not a priori clear.
"},{"location":"general_advice/before/domains.html#solution","title":"Solution","text":"

It is not easy to build a model which is entirely robust to domain shift, and there is no general framework yet to address all discrepancies between training and inference domains at once. However, there is ongoing research in this direction and several methods to correct for specific deviations have already been proposed.

It is a widely known practice to introduce scale factor (SF) corrections to account for possible discrepancies between data and MC simulation. Effectively, that means that the model is probed on some part of the domain on which it wasn't trained (data) and then corrected for any differences by using a meaningful set of observables to derive SFs. One particularly promising way to remedy the data/MC domain difference is to use adversarial approaches to fully leverage the multidimensionality of the problem, as described in a DeepSF note.

Another solution would be to incorporate methods of domain adaptation into an ML pipeline, which essentially guide the model to be invariant and robust towards domain shift. Particularly in HEP, a Learning to Pivot with Adversarial Networks paper was one of the pioneers to investigate how a pile-up dependency can be mitigated, which can also be easily expanded to building a model robust to domain shift1.

Last but not least, Bayesian neural networks have the great advantage of providing an uncertainty estimate along with each prediction. If these uncertainties are significantly larger for some samples, this could indicate that they come from a domain beyond the training one (so-called out-of-distribution samples). Such a post hoc analysis of prediction uncertainties can, for example, point to inconsistencies in or incompleteness of the MC simulation or of data-driven methods of background estimation.

"},{"location":"general_advice/before/domains.html#population","title":"Population","text":"

Furthermore, nowadays analyses are searching for very rare processes and therefore are interested in low-populated regions of the phase space. And even though the domain of interest may be covered in the training data set, it may also not be sufficiently covered in terms of the number of samples in the training data set, which populate those regions. That makes the model behaviour on an event which falls into those regions unpredictable - because it couldn't learn how to generalise in those areas due to a lack of data to learn from. Therefore,

It is important to make sure that the phase space of interest is well-represented in the training data set.

Example

This is what is often called in HEP jargon \"little statistics in the tails\": meaning that too few events can be found in the tails of the corresponding distribution, e.g. in the high-pt region. This might be important because the topology of events changes when one enters high-pt areas of the phase space (aka boosted regime). This further means that the model should be able to capture this change in the event signature. However, it might fail to do so because there is little data to learn from compared to the low-pt region.

"},{"location":"general_advice/before/domains.html#solution_1","title":"Solution","text":"

Clearly, a way out in that case would be to provide enough training data to cover those regions (also ensuring that the model has enough capacity to embrace diverse and complex topologies).

Another solution would be to communicate to the model importance of specific topologies, which can be done for example by upweighting those events' contribution to the loss function.

Lastly, it might be worth trying to train several models, each targeting its specific region, instead of a general-purpose one (e.g. a low-pt tagger and a boosted/merged topology tagger). Effectively, the factorisation of the various regions relieves a single model of the problem of separating them and delegates it to an ensemble of dedicated models.

  1. From that paper on, the HEP community started to explore a similar topic of model decorrelation, i.e. how to build a model which would be invariant to a particular variable or property of data. For a more detailed overview please refer to Section 2 of this paper.\u00a0\u21a9

"},{"location":"general_advice/before/features.html","title":"Features","text":"

In the previous section, the data was considered from a general \"domain\" perspective and in this section a more low level view will be outlined. In particular, an emphasis will be made on features (input variables) as they play a crucial role in the training of any ML model. Essentially being the handle on and the gateway into data for the model, they are expected to reflect the data from the perspective which is important to the problem at hand and therefore define the model performance on the task.

The topic of feature engineering is too extensive and complex to be fully covered in this section, so the emphasis will be placed primarily on the general aspects relevant to the HEP context. Broadly speaking, one should ask themselves the following questions during the data preparation:

  • Are features understood?
  • Are features correctly modelled?
  • Are features appropriately processed?
"},{"location":"general_advice/before/features.html#understanding","title":"Understanding","text":"

Clearly one should motivate for themselves (and then possibly for analysis reviewers) why this exact set of features and not the other one has been selected1. Aside from physical understanding and intuition it would be good if a priori expert knowledge is supplemented by running further experiments.

Here one can consider studies done either prior to the training or after it. As for the former, studying feature correlations (also with the target variable), e.g. by computing Pearson and/or Spearman correlation coefficients and plotting several histograms/scatter plots, could bring some helpful insights. As for the latter, exploring the feature importances as assigned by the trained model can improve the understanding of both the data and the model.
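A pre-training correlation study of this kind can be sketched with `pandas.DataFrame.corr`. The feature names and toy data below are hypothetical and only serve to show how the Pearson (linear) and Spearman (rank-based) coefficients differ for a monotonic but non-linear relation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)
x = np.linspace(-2.0, 2.0, 200)

df = pd.DataFrame({
    'feat_linear': x,
    'feat_cubic': x ** 3,                       # monotonic but non-linear in x
    'feat_noise': rng.normal(size=x.size),      # unrelated to x
    'target': (x > 0).astype(int),
})

pearson = df.corr(method='pearson')
spearman = df.corr(method='spearman')
print(pearson.round(2))
print(spearman.round(2))
```

Here the Spearman coefficient between `feat_linear` and `feat_cubic` is 1 because the relation is strictly monotonic, while the Pearson coefficient is lower because the relation is not linear.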

"},{"location":"general_advice/before/features.html#modelling","title":"Modelling","text":"

Although seemingly obvious, for the sake of completeness the point of achieving good data/MC agreement should be mentioned. It has always been a must in a cut-based approach and an ML-based one is no different: the principle \"garbage in, garbage out\" still holds.

Example

For example, a classical feed-forward neural network is just a continuous function mapping the input space to the output one, so any discrepancies in the input might propagate to the output. The same applies to boosted decision trees: any (domain) differences in the shape of the input (training) distribution w.r.t. the true \"data\" distribution might sizeably affect the construction of the decision boundary in the feature space.

Figure 1. Control plot for a visible mass of tau lepton pair in emu final state. [source: CMS-TAU-18-001]

Since features are the handle on the data, one option is to check for each input feature that the ratio of the data to MC histograms is close to 1 within uncertainties (i.e. by eye). For a more formal approach, one can perform goodness of fit (GoF) tests in 1D and 2D, as was done for example in the analysis of the Higgs boson decaying into tau leptons.

If the modelling is shown to be insufficient, the corresponding feature should be either removed, or mismodelling needs to be investigated and resolved.

"},{"location":"general_advice/before/features.html#processing","title":"Processing","text":"

Feature preprocessing can also be understood from the broader perspective of data preprocessing, i.e. transformations which need to be applied to the data prior to training a model. Another way to look at it is as a step where raw data is converted into prepared data. That makes it an important part of any ML pipeline since it helps the training to converge smoothly and remain stable.

Example

In fact, the training process might not even begin (presence of NaN values) or might break in the middle (an outlier causing the gradients to explode). Furthermore, the data can be completely misunderstood by the model, which can potentially cause undesirable interpretation and performance (e.g. treatment of categorical variables as numerical).

Therefore, below there is a non-exhaustive list of the most common items to be addressed during the preprocessing step to ensure good quality of the training. For a more comprehensive overview and code examples please refer to the detailed documentation of the sklearn package, which also covers possible pitfalls that can arise at this point.

  • Feature encoding
  • NaN/inf/missing values2
  • Outliers & noisy data
  • Standardisation & transformations
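To make the numerical items of this list concrete, here is a minimal NumPy sketch (not a replacement for the sklearn tools) which treats non-finite values as missing, imputes them with the per-feature median, clips outliers and standardises each feature. The function name `preprocess` and the clipping quantiles are assumptions made for this example; feature encoding of categorical variables is not covered.

```python
import numpy as np

def preprocess(x, clip_quantiles=(0.01, 0.99)):
    # x: array of shape (n_samples, n_features) with numerical features
    x = np.asarray(x, dtype=float).copy()
    x[~np.isfinite(x)] = np.nan                      # treat inf as missing
    median = np.nanmedian(x, axis=0)
    x = np.where(np.isnan(x), median, x)             # median imputation
    lo, hi = np.quantile(x, clip_quantiles, axis=0)
    x = np.clip(x, lo, hi)                           # tame outliers
    mean, std = x.mean(axis=0), x.std(axis=0)
    return (x - mean) / np.where(std > 0, std, 1.0)  # standardisation

raw = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [np.inf, 20.0],
    [1000.0, 40.0],
])
prepared = preprocess(raw)
print(prepared.round(2))
```

In a real pipeline the medians, clipping bounds, means and standard deviations would be computed on the training set only and then reused unchanged for the validation and test sets.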

Finally, these are the items which are worth considering in the preprocessing of data in general. However, one can also apply transformations at the level of batches as they are passed through the model. This will be briefly covered in the following section.

  1. Here it is already assumed that a proper data representation has been chosen, i.e. the way to vectorize the data to form a particular structure (e.g. image -> tensor, social network -> graph, text -> embeddings). Being on its own a whole big topic, it is left for a curious reader to dive into.\u00a0\u21a9

  2. Depending on the library and how particular model is implemented there, these values can be handled automatically under the hood.\u00a0\u21a9

"},{"location":"general_advice/before/inputs.html","title":"Inputs","text":"

After data is preprocessed as a whole, there is a question of how this data should be supplied to the model. On its way there it potentially needs to undergo a few splits which will be described below. Plus, a few additional comments about training weights and motivation for their choice will be outlined.

"},{"location":"general_advice/before/inputs.html#data-split","title":"Data split","text":"

The first thing one should consider is splitting the entire data set into train/validation(/test) data sets. This is an important step because it makes it possible to diagnose overfitting. The topic will be covered in more detail in the corresponding section and here a brief introduction will be given.

Figure 1. Decision boundaries for underfitted, optimal and overfitted models. [source: ibm.com/cloud/learn/overfitting]

The trained model is said to be overfitted (or overtrained) when it fails to generalise to solve a given problem.

One example would be that the model learns to predict exactly the training data, and once given new unseen data drawn from the same distribution it fails to predict the target correctly (right plot in Figure 1). Obviously, this is an undesirable behaviour since one wants their model to be \"universal\" and provide robust and correct decisions regardless of the data subset sampled from the same population.

Hence the way to check the ability to generalise and to spot overfitting: test the trained model on a separate data set which is statistically the same1 as the training one. If the model performance gets significantly worse there, it is a sign that something went wrong and the model's predictive power isn't generalising to the same population.

Figure 2. Data split workflow before the training. Cross-validation is also shown as the technique to find optimal hyperparameters. [source: scikit-learn.org/stable/modules/cross_validation.html]

Clearly, the simplest way to obtain this data set is to put aside a part of the original one and leave it untouched until the final model is trained - this is what is called the \"test\" data set in the first paragraph of this subsection. When the model has been finalised and optimised, this data set is \"unblinded\" and the model performance on it is evaluated. Practically, this split can be easily performed with the train_test_split() function of the sklearn library.
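The split described above can be sketched with `train_test_split` from `sklearn.model_selection`. The toy arrays, the 60/20/20 proportions and the use of stratification are choices made for this example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=3)
X = rng.normal(size=(100, 4))            # toy features (hypothetical)
y = np.array([0] * 80 + [1] * 20)        # toy labels, imbalanced 4:1

# First set aside the blinded test set, then split off a validation set.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, stratify=y_dev, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

Passing `stratify` keeps the class proportions the same in every subset, and fixing `random_state` makes the split reproducible.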

But it might be not that simple

Indeed, there are a few things to be aware of. Firstly, there is the question of how much data needs to be set aside for testing. It is common to take the test fraction in the range [0.1, 0.4], however it is mostly up to the analysers to decide. The important trade-off to take into account here is between the robustness of the test metric estimate (too small a test data set - poorly estimated metric) and the robustness of the trained model (too little training data - less performant model).

Secondly, note that the split should be done in a way that each subset is as close as possible to the one which the model will face at the final inference stage. But since usually it isn't feasible to bridge the gap between domains, the split at least should be uniform between training/testing to be able to judge fairly the model performance.

Thirdly, in the extreme case there might not be a sufficient amount of data to perform the training, let alone to set aside a part of it for validation. Here a way out would be to go for few-shot learning, to use cross-validation during the training, to regularise the model to avoid overfitting, or to try to find/generate more (possibly similar) data.

Finally, one can also consider putting aside yet another fraction of the original data set, which was called the \"validation\" data set above. This can be used to monitor the model during the training and more details on that will follow in the overfitting section.

"},{"location":"general_advice/before/inputs.html#batches","title":"Batches","text":"

Usually the training/validation/testing data set can't entirely fit into memory due to its large size. That is why it gets split into batches (chunks) of a given size which are then fed one by one into the model during the training/testing.

While forming the batches it is important to keep in mind that batches should be sampled uniformly (i.e. from the same underlying PDF as of the original data set).

That means that each batch is populated similarly to the others according to features which are important to the given task (e.g. particles' pt/eta, number of jets, etc.). This is needed to ensure that gradients computed for each batch aren't different from each other and therefore the gradient descent doesn't encounter any sizeable stochasticities during the optimisation step.2
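A common way to approximate such uniform sampling is to shuffle the data set before slicing it into batches. The sketch below, with a hypothetical helper `iterate_batches` and a toy pT-like feature, compares the spread of per-batch means for shuffled and for sequential batches when the data happens to be stored sorted.

```python
import numpy as np

def iterate_batches(n_samples, batch_size, rng):
    # Yield index arrays covering a randomly permuted data set once (one epoch)
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(seed=4)
pt = rng.exponential(scale=50.0, size=10000)   # toy feature, e.g. a pT spectrum
pt_sorted = np.sort(pt)                        # worst case: data stored sorted in pT

batches = list(iterate_batches(len(pt_sorted), 500, rng))
shuffled_means = [pt_sorted[idx].mean() for idx in batches]
sequential_means = [pt_sorted[i:i + 500].mean() for i in range(0, 10000, 500)]

print('spread of batch means, shuffled  :', np.std(shuffled_means))
print('spread of batch means, sequential:', np.std(sequential_means))
```

With shuffling, every batch is a random sample of the full distribution, so the batch-to-batch spread is much smaller than for sequential slices of sorted data.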

Lastly, it was already mentioned that one should perform preprocessing of the data set prior to training. However, this step can be substituted and/or complemented with an addition of a layer into the architecture, which will essentially do a specified part of preprocessing on every batch as they go through the model. One of the most prominent examples could be an addition of batch/group normalization, coupled with weight standardization layers which turned out to sizeably boost the performance on the large variety of benchmarks.

"},{"location":"general_advice/before/inputs.html#training-weights","title":"Training weights","text":"

Next, one can zoom into the batch and consider the level of single entries there (e.g. events). This is where the training weights come into play. Since the value of a loss function for a given batch is represented as a sum over all the entries in the batch, this sum can be naturally turned into a weighted sum. For example, in case of a cross-entropy loss with y_pred, y_true, w being vectors of predicted probabilities, true labels and weights respectively:

import numpy as np\n\ndef CrossEntropy(y_pred, y_true, w):  # weighted sum over the batch, assuming y_true takes values in {0, 1}\n    return -np.sum(w * (y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))\n

It is important to disentangle here two factors which define the weight to be applied on a per-event basis because of the different motivations behind them:

  • accounting for imbalance in training data
  • accounting for imbalance in nature
"},{"location":"general_advice/before/inputs.html#imbalance-in-training-data","title":"Imbalance in training data","text":"

The first point is related to the fact that in case of classification we may have significantly more (>O(1) times) training data for one class than for the other. Since the training data usually comes from MC simulation, that corresponds to the case when there are more events generated for one physical process than for another. Therefore, here we want to make sure that the model is presented equally with instances of each class - this may have a significant impact on the model performance depending on the loss/metric choice.

Example

Consider the case when there are 1M events of target = 0 and 100 events of target = 1 in the training data set and a model is fitted by minimising cross-entropy to distinguish between those classes. In that case the resulting model can easily turn out to be a constant function predicting the majority target = 0, simply because this would be close to the optimal solution in terms of the loss function minimisation. If accuracy is used as a metric for validation, this will result in a value close to 1 on the training data.
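The numbers in this example can be checked directly. The short sketch below builds the labels, applies a constant majority-class prediction and compares plain accuracy with a class-balanced accuracy (the mean of the per-class recalls), which is one possible alternative metric.

```python
import numpy as np

y_true = np.concatenate([np.zeros(1_000_000, dtype=int), np.ones(100, dtype=int)])
y_pred = np.zeros_like(y_true)                    # constant majority-class prediction

accuracy = np.mean(y_pred == y_true)
recall_0 = np.mean(y_pred[y_true == 0] == 0)
recall_1 = np.mean(y_pred[y_true == 1] == 1)
balanced_accuracy = 0.5 * (recall_0 + recall_1)

print(accuracy, balanced_accuracy)
```

The accuracy is above 0.9999 although not a single target = 1 event is identified, whereas the balanced accuracy of 0.5 exposes the problem.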

To account for this type of imbalance, the following weight simply needs to be introduced according to the target label of an object:

import numpy as np\n\ntrain_df['weight'] = 1.0  # float column, so that the in-place division below keeps the dtype\ntrain_df.loc[train_df.target == 0, 'weight'] /= np.sum(train_df.loc[train_df.target == 0, 'weight'])\ntrain_df.loc[train_df.target == 1, 'weight'] /= np.sum(train_df.loc[train_df.target == 1, 'weight'])\n

Alternatively, one can consider other ways of balancing classes apart from training weights. For a more detailed description of them and also a general problem statement see the imbalanced-learn documentation.

"},{"location":"general_advice/before/inputs.html#imbalance-in-nature","title":"Imbalance in nature","text":"

The second case corresponds to the fact that in experiment we expect some classes to be more represented than the others. For example, the signal process usually has way smaller cross-section than background ones and therefore we expect to have in the end fewer events of the signal class. So the motivation of using weights in that case would be to augment the optimisation problem with additional knowledge of expected contribution of physical processes.

Practically, the notion of expected number of events is incorporated into the weights per physical process so that the following conditions hold3:

As a part of this reweighting, one would naturally need to perform the normalisation as of the previous point, however the difference between those two is something which is worth emphasising.

  1. That is, sampled independently and identically (i.i.d) from the same distribution.\u00a0\u21a9

  2. Although this is a somewhat intuitive statement which may or may not be impactful for a given task and depends on the training procedure itself, it is advisable to keep this aspect in mind while preparing batches for training.\u00a0\u21a9

  3. See also Chapter 2 of the HiggsML overview document \u21a9

"},{"location":"general_advice/before/metrics.html","title":"Metrics & Losses","text":""},{"location":"general_advice/before/metrics.html#metric","title":"Metric","text":"

A metric is a function which evaluates the model's performance given true labels and model predictions for a particular data set.

That makes it an important ingredient in the model training as a measure of the model's quality. However, metrics as estimators can be sensitive to some effects (e.g. class imbalance) and provide biased or over/underoptimistic results. Additionally, they might not be relevant to the physical problem in mind and to the understanding of what a \"good\" model is1. This in turn can result in suboptimally tuned hyperparameters or in general in a suboptimally trained model.

Therefore, it is important to choose metrics wisely, so that they reflect the physical problem to be solved and additionally don't introduce any biases in the performance estimate. The whole topic of metrics is too broad to be covered in this section, so please refer to the corresponding documentation of sklearn, as it provides an extensive list of available metrics with additional materials and can be used as a good starting point.

Examples of HEP-specific metrics

Speaking of those metrics which were developed in the HEP field, the most prominent one is the approximate median significance (AMS), first introduced in Asymptotic formulae for likelihood-based tests of new physics and then adopted in the HiggsML challenge on Kaggle.

Essentially being an estimate of the expected signal sensitivity and hence being closely related to the final result of the analysis, it can be used not only as a metric but also as a loss function to be directly optimised in the training.
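For reference, a minimal sketch of the AMS as a function of the expected signal and background yields `s` and `b` is given below. The regularisation term `b_reg` with a default of 10 follows the convention of the HiggsML challenge and should be treated as an assumption here; setting it to zero gives the unregularised expression, which reduces to approximately s/sqrt(b) for s much smaller than b.

```python
import math

def ams(s, b, b_reg=10.0):
    # Approximate median significance for expected signal (s) and
    # background (b) yields; b_reg is a regularisation term that
    # stabilises the estimate when b is small.
    b_eff = b + b_reg
    return math.sqrt(2.0 * ((s + b_eff) * math.log(1.0 + s / b_eff) - s))

print(ams(s=10.0, b=1000.0, b_reg=0.0))
print(10.0 / math.sqrt(1000.0))
```

In practice `s` and `b` would be the weighted sums of signal and background events passing a selection on the model score.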

"},{"location":"general_advice/before/metrics.html#loss-function","title":"Loss function","text":"

In fact, metrics and loss functions are very similar to each other: they both give an estimate of how well (or badly) the model performs and both are used to monitor the quality of the model. So the same comments as in the metrics section apply to loss functions too. However, the loss function plays a crucial role because it is additionally used in the training as a functional to be optimised. That makes its choice a handle to explicitly steer the training process towards a more optimal and relevant solution.

Example of things going wrong

It is known that the L2 loss (MSE) is sensitive to outliers in data, while the L1 loss (MAE), on the other hand, is robust to them. Therefore, if outliers were overlooked in the training data set and the model was fitted with an L2 loss, it may result in a significant bias in its predictions. As an illustration, this toy example compares Huber vs Ridge regressors, where the former shows a more robust behaviour.
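The different sensitivity of the two losses can be seen already when fitting a single constant: the value minimising the L2 loss is the mean of the targets, while the value minimising the L1 loss is their median. The toy numbers below are made up for this illustration.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_outlier = np.array([1.0, 2.0, 3.0, 4.0, 500.0])   # last entry is an outlier

# Constant minimising the L2 loss: mean. Constant minimising the L1 loss: median.
for name, data in [('clean', y), ('with outlier', y_outlier)]:
    print(name, '| L2 optimum:', np.mean(data), '| L1 optimum:', np.median(data))
```

A single outlier moves the L2 optimum from 3 to 102, while the L1 optimum stays at 3.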

A simple example of that was already mentioned in domains section - namely, one can emphasise specific regions in the phase space by attributing events there a larger weight in the loss function. Intuitively, for the same fraction of mispredicted events in the training data set, the class with a larger attributed weight should bring more penalty to the loss function. This way model should be able to learn to pay more attention to those \"upweighted\" events2.

Examples in HEP beyond classical MSE/MAE/cross entropy
  • b-jet energy regression, being a part of the nonresonant HH to bb gamma gamma analysis, uses Huber and two quantile loss terms for simultaneous prediction of point and dispersion estimators of the target distribution.
  • DeepTau, a CMS deployed model for tau identification, uses several focal loss terms to give higher weight to more misclassified cases.
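To illustrate the idea behind a focal loss term, the following is a generic binary focal loss in NumPy. It is not the DeepTau implementation: the function name, the default `gamma = 2` and the toy predictions are assumptions for this sketch. With `gamma = 0` it reduces to the usual (unweighted, averaged) cross-entropy.

```python
import numpy as np

def binary_focal_loss(y_pred, y_true, gamma=2.0, eps=1e-12):
    # p_t is the probability the model assigns to the true class
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    p_t = np.where(y_true == 1, y_pred, 1.0 - y_pred)
    # (1 - p_t)^gamma down-weights well-classified entries
    return -np.mean((1.0 - p_t) ** gamma * np.log(p_t))

y_true = np.array([1, 1, 0, 0])
y_pred = np.array([0.95, 0.30, 0.05, 0.60])   # two easy and two hard cases

print('cross-entropy (gamma=0):', binary_focal_loss(y_pred, y_true, gamma=0.0))
print('focal loss    (gamma=2):', binary_focal_loss(y_pred, y_true, gamma=2.0))
```

The modulating factor suppresses the contribution of confidently correct predictions, so the loss is dominated by the harder, more misclassified entries.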

However, one can go further than that and consider the training procedure from a larger, statistical inference perspective. From there, one can try to construct a loss function which would directly optimise the end goal of the analysis. INFERNO is an example of such an approach, with a loss function being an expected uncertainty on the parameter of interest. Moreover, one can try also to make the model aware of nuisance parameters which affect the analysis by incorporating those into the training procedure, please see this review for a comprehensive overview of the corresponding methods.

  1. For example, that corresponds to asking oneself a question: \"what is more suitable for the purpose of the analysis: F1-score, accuracy, recall or ROC AUC?\"\u00a0\u21a9

  2. However, these are expectations one may have in theory. In practice, the optimisation procedure depends on many variables and can go in different ways. Therefore, the weighting scheme should be studied by running experiments on a case-by-case basis.\u00a0\u21a9

"},{"location":"general_advice/before/model.html","title":"Model","text":"

There is an enormous variety of ML models available on the market, which makes the choice of a suitable one for a given problem not entirely straightforward. Since ML is still to a large extent an experimental field, the general advice here would be to try various models and pick the one giving the best physical result.

However, there are in any case several common remarks to be pointed out, all glued together with a simple underlying idea:

Start off from a simple baseline, then gradually increase the complexity to improve upon it.

  1. In the first place, one needs to carefully consider whether there is a need for training an ML model at all. There might be problems where this approach would be a (time-consuming) overkill and simple conventional statistical methods would deliver results faster and even better.

  2. If ML methods are expected to bring improvement, then it makes sense to try out simple models first. Assuming a proper set of high-level features has been selected, ensemble of trees (random forest/boosted decision tree) or simple feedforward neural networks might be a good choice here. If time and resources permit, it might be beneficial to compare the results of these trainings to a no-ML approach (e.g. cut-based) to get the feeling of how much the gain in performance is. In most of the use cases, those models will be already sufficient to solve a given classification/regression problem in case of dealing with high-level variables.

  3. If it feels like there is still room for improvement, try hyperparameter tuning first to see if it is possible to squeeze more performance out of the current model and data. It can easily be that the model is sensitive to the hyperparameter choice and has a sizeable variance in performance across the hyperparameter space.

  4. If the hyperparameter space has been thoroughly explored and an optimal point has been found, one can additionally try to play around with the data, for example, by augmenting the current data set with more samples. Since in general the model performance profits from having more training data, augmentation might also boost the overall performance.

  5. Lastly, more advanced architectures can be probed. At this point the choice of data representation plays a crucial role, since more complex architectures are designed to capture more sophisticated patterns in data. While research in ML is still ongoing to unify all the complexity of such models (and promisingly, also using an effective field theory approach), in HEP there is an ongoing process of probing various architectures to see which type fits the field best.

Models in HEP

One of the most prominent benchmarks so far is the one done by G. Kasieczka et al. on the top tagging data set, where in particular ParticleNet turned out to be the state of the art. This was yet another solid argument in favour of using graph neural networks in HEP due to their natural suitability in terms of data representation.

Illustration from G. Kasieczka et al. showing ROC curves for all evaluated algorithms.

"},{"location":"general_advice/during/opt.html","title":"Optimisation problems","text":"Figure 1. The loss surfaces of ResNet-56 with/without skip connections. [source: \"Visualizing the Loss Landscape of Neural Nets\" paper]

However, it might be that for a given task overfitting is of no concern, but there are still instabilities in loss function convergence happening during the training1. The loss landscape is a complex object which has multiple local minima and is moreover not at all understood due to the high dimensionality of the problem. That makes the gradient descent procedure of finding a minimum not that simple. However, if instabilities are observed, there are a few common things which could explain that:

  • The main candidate for a problem might be the learning rate (LR). It is an important hyperparameter which steers the optimisation: setting it too high may cause extremely stochastic behaviour, which will likely cause the optimisation to get stuck in some random minimum far from the optimum. Conversely, setting it too low may cause the convergence to take a very long time. The optimal value in between those extremes can still be problematic due to a chance of getting stuck in a local minimum on the way towards a better one. That is why several approaches based on LR schedulers (e.g. cosine annealing) and also adaptive LR (e.g. Adam being the most prominent one) have been developed to have more flexibility during the training, as opposed to keeping the LR fixed from the very beginning of the training until its end.

  • Another possibility is that there are NaN/inf values or uniformities/outliers appearing in the input batches. These can cause the gradient updates to go beyond the normal scale and therefore dramatically affect the stability of the loss optimisation. This can be avoided by careful data preprocessing and batch formation.

  • Last but not least, there is a chance that gradients will explode or vanish during the training, which will reveal itself as a rapid increase/stagnation in the loss function values. This is largely a feature of deep architectures, where during the backpropagation gradients are accumulated from one layer to another, and therefore any minor deviations in scale can exponentially amplify/diminish as they get multiplied. Since it is the scale of the trainable weights themselves which defines the weight gradients, a proper weight initialisation can foster smooth and consistent gradient updates. Also, batch normalisation together with weight standardisation has been shown to be a powerful technique to consistently improve performance across various domains. Finally, the choice of activation function is particularly important since it directly contributes to the gradient computation. For example, a sigmoid function is known to cause gradients to vanish because its gradient approaches 0 at input values of large magnitude. Therefore, it is often suggested to stick to the classical ReLU or to try other alternatives to see if they bring an improvement in performance.
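The effect of the activation function and of an LR schedule mentioned in the list above can be illustrated with a minimal sketch. It uses only the Python standard library; the function names, the 20-layer depth and the pre-activation value of 2.0 are illustrative assumptions, and the weights are deliberately ignored to isolate the contribution of the activation derivative. The cosine annealing formula shown is the common variant without restarts.

```python
import math

def sigmoid(x):
    # Logistic function.
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25 (reached at x = 0).
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

def chained_gradient(grad_fn, pre_activations):
    # Product of the local activation derivatives along a chain of layers.
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    # Cosine annealing without restarts: lr_max at step 0, lr_min at total_steps.
    cos_term = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos_term)

pre_activations = [2.0] * 20  # hypothetical pre-activation values for 20 layers
print('sigmoid chain:', chained_gradient(sigmoid_grad, pre_activations))
print('relu chain:', chained_gradient(relu_grad, pre_activations))
print('LR at steps 0, 5, 10:', [cosine_annealing_lr(s, 10, 0.1) for s in (0, 5, 10)])
```

With these assumed numbers the sigmoid chain shrinks by many orders of magnitude while the ReLU chain stays at 1, which is the vanishing-gradient behaviour described above.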

  1. Sometimes particularly peculiar.\u00a0\u21a9

"},{"location":"general_advice/during/overfitting.html","title":"Overfitting","text":"

Given that the training experiment has been set up correctly (with some of the most common problems described in the before training section), only a few things can actually go wrong during the training process itself. Broadly speaking, they fall into two categories: overfitting related and optimisation problem related. Both of them can be easily spotted by closely monitoring the training procedure, as will be described in the following.

"},{"location":"general_advice/during/overfitting.html#overfitting","title":"Overfitting","text":"

The concept of overfitting (also called overtraining) was previously introduced in the inputs section and here we will elaborate a bit more on that. In essence, overfitting, the situation where the model fails to generalise to a given problem, can have several underlying explanations:

The first one would be the case where the model complexity is way too large for the problem and the data set being considered.

Example

A simple example would be fitting some linearly distributed data with a polynomial function of a large degree. Or in general, when the number of trainable parameters is significantly larger than the size of the training data set.

This can be solved prior to training by applying regularisation to the model, which in its essence means constraining its capacity to learn the data representation. This is somewhat related also to the concept of Ockham's razor: namely that the less complex an ML model, the more likely that a good empirical result is not just due to the peculiarities of the data sample. As for the practical side of regularisation, please have a look at this webpage for a detailed overview and implementation examples.

Furthermore, a recipe for training neural networks by A. Karpathy is a highly-recommended guideline not only on regularisation, but on training ML models in general.

The second case is a more general idea that any reasonable model at some point starts to overfit.

Example

Here one can look at overfitting as the point where the model considers noise to be of the same relevance as the signal and starts to \"focus\" on it way too much. Since data almost always contains noise, this makes it in principle highly probable to reach overfitting at some point.

Both of the cases outlined above can be spotted simply by tracking the evolution of loss/metrics on the validation data set. This means that, in addition to the train/test split done prior to training (as described in the inputs section), one also needs to set aside some fraction of the training data to perform validation throughout the training. When the values of the loss function/metric are plotted for both the train and validation sets as the training proceeds, overfitting manifests itself as an increase in the value of the metric on the validation set while it still continues to decrease on the training set:

Figure 1. Error metric as a function of number of iterations for train and validation sets. Vertical dashed line represents the separation between the region of underfitting (model hasn't captured the data complexity well enough to solve the problem) and overfitting (model no longer generalises to unseen data). The point between these two regions is the optimal moment when the training should stop. [source: ibm.com/cloud/learn/overfitting]

Essentially, it means that from that turning point onwards the model is trying to learn the noise in the training data better and better at the expense of generalisation power. Therefore, it doesn't make sense to train the model from that point on and the training should be stopped.

To automate the process of finding this \"sweet spot\", many ML libraries include early stopping as one of the parameters of their fit() function. If early stopping is set to, for example, 10 iterations, the training will automatically stop once the validation metric has not improved for the last 10 iterations.
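The early stopping logic described above can be sketched in a few lines. This is a minimal illustration and not the implementation of any particular library; the function name, the patience value and the validation losses are made up for the example, and lower loss is assumed to be better.

```python
def early_stopping(val_losses, patience):
    # Returns (stop_iteration, best_iteration) for a sequence of validation losses.
    # Training stops once the loss has not improved for `patience` iterations.
    best_loss = float('inf')
    best_iteration = -1
    for iteration, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_iteration = iteration
        elif iteration - best_iteration >= patience:
            return iteration, best_iteration
    # No stop was triggered: training ran through all iterations.
    return len(val_losses) - 1, best_iteration

# Hypothetical validation losses: improving at first, then rising (overfitting).
val_losses = [1.0, 0.8, 0.7, 0.65, 0.66, 0.67, 0.70, 0.75]
stop, best = early_stopping(val_losses, patience=3)
print('stopped at iteration', stop, 'best iteration', best)
```

In practice one would also restore the model weights saved at the best iteration, which many libraries offer as an option.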

"},{"location":"general_advice/during/xvalidation.html","title":"Cross-validation","text":"

However, in practice what one often deals with is hyperparameter optimisation: running several trainings to find the optimal hyperparameters for a given family of models (e.g. BDT or feed-forward NN).

The number of trials in the hyperparameter space can easily reach hundreds or thousands, and in that case the naive approach of training the model for each set of hyperparameters on the same train data set and evaluating its performance on the same test data set is very prone to overfitting. In that case, an experimentalist overfits to the test data set by choosing the best value of the metric and effectively adapting the model to suit the test data set best, therefore losing the model's ability to generalise.

In order to prevent that, a cross-validation (CV) technique is often used:

Figure 1. Illustration of the data set split for cross-validation. [source: scikit-learn.org/stable/modules/cross_validation.html]

The idea behind it is that instead of a single split of the data into train/validation sets, the training data set is split into N folds. Then, the model with the same fixed hyperparameter set is trained N times in a way that at the i-th iteration the i-th fold is left out of the training and used only for validation, while the other N-1 folds are used for the training.

In this fashion, after the training of N models there are N values of a metric, one computed on each fold. These values can now be averaged to give a more robust estimate of model performance for a given hyperparameter set. A variance can also be computed to estimate the range of metric values. After having completed the N-fold CV training, the same approach is to be repeated for other hyperparameter values and the best set of those is picked based on the best fold-averaged metric value.
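The N-fold procedure above can be sketched with the standard library only. The helper names and the toy scoring function are hypothetical; the split is done without shuffling for simplicity, whereas a real analysis would typically shuffle the data or rely on an existing implementation such as the scikit-learn one referenced in Figure 1.

```python
import statistics

def k_fold_indices(n_samples, n_folds):
    # Yield (train_indices, val_indices) for each of the n_folds splits.
    indices = list(range(n_samples))
    remainder = n_samples % n_folds
    fold_sizes = [n_samples // n_folds + (1 if i < remainder else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

def cross_validate(train_and_score, n_samples, n_folds):
    # train_and_score(train_idx, val_idx) must return the metric on the held-out fold.
    scores = [train_and_score(train, val)
              for train, val in k_fold_indices(n_samples, n_folds)]
    return statistics.mean(scores), statistics.stdev(scores)

def toy_score(train_idx, val_idx):
    # Placeholder for a real training: returns a made-up metric value per fold.
    return 0.9 - 0.01 * val_idx[0]

mean_score, spread = cross_validate(toy_score, n_samples=10, n_folds=5)
print('fold-averaged metric:', mean_score, 'spread:', spread)
```

For a hyperparameter scan, `cross_validate` would be called once per hyperparameter set and the set with the best fold-averaged metric would be kept.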

Further insights

Effectively, with the CV approach the whole training data set plays the role of a validation one, which makes overfitting to a single chunk of it (as in a naive train/val split) less likely to happen. In addition, more training data is used to train a single model than with a single fixed train/val split, which also makes the model less dependent on the choice of the split.

Alternatively, one can think of this procedure as building a model ensemble, which is inherently an approach more robust to overfitting and in general performs better than a single model.

"},{"location":"inference/checklist.html","title":"Integration checklist","text":"

Todo.

"},{"location":"inference/conifer.html","title":"Direct inference with conifer","text":""},{"location":"inference/conifer.html#introduction","title":"Introduction","text":"

conifer is a Python package developed by the Fast Machine Learning Lab for the deployment of Boosted Decision Trees in FPGAs for Level 1 Trigger applications. Documentation, examples, and tutorials are available from the conifer website, GitHub, and the hls4ml tutorial respectively. conifer is on the Python Package Index and can be installed with pip install conifer. Targeting FPGAs requires Xilinx's Vivado/Vitis suite of software. Here's a brief summary of features:

  • conversion from common BDT training frameworks: scikit-learn, XGBoost, Tensorflow Decision Forests (TF DF), TMVA, and ONNX
  • conversion to FPGA firmware with backends: HLS (C++ for FPGA), VHDL, C++ (for CPU)
  • utilities for bit- and cycle-accurate firmware simulation, and interface to FPGA synthesis tools for evaluation and deployment from Python
"},{"location":"inference/conifer.html#emulation-in-cmssw","title":"Emulation in CMSSW","text":"

All L1T algorithms require bit-exact emulation for performance studies and validation of the hardware system. For conifer this is provided with a single header file at L1Trigger/Phase2L1ParticleFlow/interface/conifer.h. The user must also provide the BDT JSON file exported from the conifer Python tool for their model. JSON loading in CMSSW uses the nlohmann/json external.

Both the conifer FPGA firmware and C++ emulation use Xilinx's arbitrary precision types for fixed-point arithmetic (hls external of CMSSW). This is cheaper and faster in the FPGA fabric than floating-point types. An important part of the model preparation process is choosing the proper fixed-point data types to avoid loss of performance compared to the trained model. Input preprocessing, in particular scaling, can help constrain the input variables to a smaller numerical range, but may also have a hardware cost to implement. In C++ the arbitrary precision types are specified like: ap_fixed<width, integer, rounding mode, saturation mode>.
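To get a feeling for what a given fixed-point type can represent, the quantisation can be emulated in plain Python. This is a simplified sketch and not part of conifer or of the Xilinx libraries: the function name is hypothetical, the type is assumed to be signed, extra fractional bits are truncated towards minus infinity, and overflow either saturates or wraps depending on the flag.

```python
import math

def quantize_ap_fixed(value, width, integer, saturate=True):
    # Emulates a signed fixed-point type with `width` total bits, of which
    # `integer` bits (including the sign bit) sit left of the binary point.
    frac_bits = width - integer
    scale = 2 ** frac_bits
    raw = math.floor(value * scale)      # truncation of extra fractional bits
    raw_min = -(2 ** (width - 1))
    raw_max = 2 ** (width - 1) - 1
    if saturate:
        raw = max(raw_min, min(raw_max, raw))               # clamp to the range
    else:
        raw = (raw - raw_min) % (2 ** width) + raw_min      # wrap around
    return raw / scale

# 18 bits in total, 8 integer bits: step 2**-10, range [-128, 128 - 2**-10]
for x in (0.3, 1.0, 1000.0, -1000.0):
    print(x, '->', quantize_ap_fixed(x, 18, 8))
```

Comparing model scores computed from quantised and unquantised inputs is one way to judge whether a chosen width is sufficient before moving to the firmware.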

Minimal preparation from Python:

import conifer\nmodel = conifer. ... # convert or load a conifer model\n# e.g. model = conifer.converters.convert_from_xgboost(xgboost_model)\nmodel.save('my_bdt.json')\n

CMSSW C++ user code:

// include the conifer emulation header file\n#include \"L1Trigger/Phase2L1ParticleFlow/interface/conifer.h\"\n\n... model setup\n// define the input/threshold and score types\n// important: this needs to match the firmware settings for bit-exactness!\n// note: can use native types like float/double for development/debugging\ntypedef ap_fixed<18,8> input_t;\ntypedef ap_fixed<12,3,AP_RND_CONV,AP_SAT> score_t;\n\n// create a conifer BDT instance\n// 'true' to use balanced add-tree score aggregation (needed for bit-exactness)\nbdt = conifer::BDT<input_t, score_t, true>(\"my_bdt.json\");\n\n... inference\n// prepare the inputs, vector length same as model n_features\nstd::vector<input_t> inputs = ... \n// run inference, scores vector length same as model n_classes (or 1 for binary classification/regression)\nstd::vector<score_t> scores = bdt.decision_function(inputs);\n

conifer does not compute class probabilities from the raw predictions, in order to avoid extra resource and latency cost in the L1T deployment. Cuts or working points should therefore be applied on the raw predictions.
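If a working point was originally defined on a probability, it can be translated to the raw prediction offline. The sketch below assumes a binary classifier trained with a logistic loss, so that the probability is the sigmoid of the raw score; this is an assumption about the training setup, not something conifer enforces, and the function names are illustrative.

```python
import math

def sigmoid(raw_score):
    # Probability corresponding to a raw score under a logistic loss.
    return 1.0 / (1.0 + math.exp(-raw_score))

def raw_threshold(probability_cut):
    # Inverse of the sigmoid (logit): the raw-score cut equivalent to a probability cut.
    return math.log(probability_cut / (1.0 - probability_cut))

cut = raw_threshold(0.9)
print('probability cut 0.9 corresponds to raw score cut', cut)
print('check:', sigmoid(cut))
```

Since the sigmoid is monotonic, selecting events with a raw score above this threshold is equivalent to cutting on the probability, up to the precision of the fixed-point score type.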

"},{"location":"inference/hls4ml.html","title":"Direct inference with hls4ml","text":"

hls4ml is a Python package developed by the Fast Machine Learning Lab. Its primary purpose is to create firmware implementations of machine learning (ML) models to be run on FPGAs. The package interfaces with a high-level synthesis (HLS) backend (e.g. Xilinx Vivado HLS) to transpile the ML model into hardware description language (HDL). The primary hls4ml documentation, including API reference pages, is located here.

The main hls4ml tutorial code is kept on GitHub. Users are welcome to walk through the notebooks at their own pace. There is also a set of slides linked to the README.

That said, there have been several cases where the hls4ml developers have given live demonstrations and tutorials. Below is a non-exhaustive list of tutorials given in the last few years (newest on top).

Workshop/Conference | Date | Links
23rd Virtual IEEE Real Time Conference | August 03, 2022 | Indico
2022 CMS ML Town Hall | July 22, 2022 | Contribution Link
a3d3 hls4ml @ Snowmass CSS 2022: Tutorial | July 21, 2022 | Slides, Recording, JupyterHub
Fast Machine Learning for Science Workshop | December 3, 2020 | Indico, Slides, GitHub, Interactive Notebooks
hls4ml @ UZH ML Workshop | November 17, 2020 | Indico, Slides
ICCAD 2020 | November 5, 2020 | https://events-siteplex.confcats.io/iccad2022/wp-content/uploads/sites/72/2021/12/2020_ICCAD_ConferenceProgram.pdf, GitHub
4th IML Workshop | October 19, 2020 | Indico, Slides, Instructions, Notebooks, Recording
22nd Virtual IEEE Real Time Conference | October 15, 2020 | Indico, Slides, Notebooks
30th International Conference on Field-Programmable Logic and Applications | September 4, 2020 | Program
hls4ml tutorial @ CERN | June 3, 2020 | Indico, Slides, Notebooks
Fast Machine Learning | September 12, 2019 | Indico
1st Real Time Analysis Workshop, Universit\u00e9 Paris-Saclay | July 16, 2019 | Indico, Slides, Autoencoder Tutorial"},{"location":"inference/onnx.html","title":"Direct inference with ONNX Runtime","text":"

ONNX is an open format built to represent machine learning models. It is designed to improve interoperability across a variety of frameworks and platforms in the AI tools community\u2014most deep learning frameworks (e.g. XGBoost, TensorFlow, PyTorch which are frequently used in CMS) support converting their model into the ONNX format or loading a model from an ONNX format.

The figure showing the ONNX interoperability. (Source from website.)

ONNX Runtime is a tool aiming for the acceleration of machine learning inferencing across a variety of deployment platforms. It allows one to \"run any ONNX model using a single set of inference APIs that provide access to the best hardware acceleration available\". It includes \"built-in optimization features that trim and consolidate nodes without impacting model accuracy.\"

The CMSSW interface to ONNX Runtime is available since CMSSW_11_1_X (cmssw#28112, cmsdist#5020). Its functionality is improved in CMSSW_11_2_X. The final implementation is also backported to CMSSW_10_6_X to facilitate Run 2 UL data reprocessing. The inference of a number of deep learning tagger models (e.g. DeepJet, DeepTauID, ParticleNet, DeepDoubleX, etc.) has been performed with ONNX Runtime in routine UL processing and has gained a substantial speedup.

On this page, we will use a simple example to show how to use ONNX Runtime for deep learning model inference in the CMSSW framework, both in C++ (e.g. to process the MiniAOD file) and in Python (e.g. using NanoAOD-tools to process the NanoAODs). This may help readers who will deploy an ONNX model into their analyses or in the CMSSW framework.

"},{"location":"inference/onnx.html#software-setup","title":"Software Setup","text":"

We use CMSSW_11_2_5_patch2 to show the simple example for ONNX Runtime inference. The example can also work under the newer 12 releases (note that inference with C++ can also run on CMSSW_10_6_X).

export SCRAM_ARCH=\"slc7_amd64_gcc900\"\nexport CMSSW_VERSION=\"CMSSW_11_2_5_patch2\"\n\nsource /cvmfs/cms.cern.ch/cmsset_default.sh\n\ncmsrel \"$CMSSW_VERSION\"\ncd \"$CMSSW_VERSION/src\"\n\ncmsenv\nscram b\n
"},{"location":"inference/onnx.html#converting-model-to-onnx","title":"Converting model to ONNX","text":"

The model deployed into CMSSW or our analysis needs to be converted to ONNX from the original framework format where it is trained. Please see here for a nice deck of tutorials on converting models from different mainstream frameworks into ONNX.

Here we take PyTorch as an example. A PyTorch model can be converted by torch.onnx.export(...). As a simple illustration, we convert a randomly initialized feed-forward network implemented in PyTorch, with 10 input nodes and 2 output nodes, and two hidden layers with 64 nodes each. The conversion code is presented below. The output model model.onnx will be deployed under the CMSSW framework in our following tutorial.

Click to expand
import torch\nimport torch.nn as nn\ntorch.manual_seed(42)\n\nclass SimpleMLP(nn.Module):\n\n    def __init__(self, **kwargs):\n        super(SimpleMLP, self).__init__(**kwargs)\n        self.mlp = nn.Sequential(\n            nn.Linear(10, 64), nn.BatchNorm1d(64), nn.ReLU(), \n            nn.Linear(64, 64), nn.BatchNorm1d(64), nn.ReLU(), \n            nn.Linear(64, 2), nn.ReLU(), \n            )\n    def forward(self, x):\n        # input x: (batch_size, feature_dim=10)\n        x = self.mlp(x)\n        return torch.softmax(x, dim=1)\n\nmodel = SimpleMLP()\n\n# create dummy input for the model\ndummy_input = torch.ones(1, 10, requires_grad=True) # batch size = 1\n\n# export model to ONNX\ntorch.onnx.export(model, dummy_input, \"model.onnx\", verbose=True, input_names=['my_input'], output_names=['my_output'])\n
"},{"location":"inference/onnx.html#inference-in-cmssw-c","title":"Inference in CMSSW (C++)","text":"

We will introduce how to write a module to run inference on the ONNX model under the CMSSW framework. CMSSW is known for its multi-threading capability. In a threaded framework, multiple threads serve to process events in the event loop. The logic is straightforward: a new event is assigned to an idle thread following the first-come-first-served principle.

In most cases, each thread is able to process events individually as the majority of event processing workflow can be accomplished only by seeing the information of that event. Thus, the stream modules (stream EDAnalyzer and stream EDFilter) are used frequently as each thread holds an individual copy of the module instance\u2014they do not need to communicate with each other. It is however also possible to share a global cache object between all threads in case sharing information across threads is necessary. In all, such CMSSW EDAnalyzer modules are declared by class MyPlugin : public edm::stream::EDAnalyzer<edm::GlobalCache<CacheData>> (similar for EDFilter). Details can be found in documentation on the C++ interface of stream modules.

Let's then think about what would happen when interfacing CMSSW with ONNX for model inference. When ONNX Runtime accepts a model, it converts the model into an in-memory representation and performs a variety of optimizations depending on the operators in the model. This procedure is done when an ONNX Runtime Session is created with an input model. The economical approach is then to hold only one Session for all threads, which may save memory to a large extent, as the model then has only one copy in memory. Upon request from multiple threads to do inference with their input data, the Session accepts those requests and serializes them, then produces the output data. ONNX Runtime by design allows multiple threads to invoke the Run() method on the same inference Session object. Therefore, what is left for us to do is to

  1. create a Session as a global object in our CMSSW module and share it among all threads;
  2. in each thread, we process the input data and then call the Run() method from that global Session.

That's the main logic for implementing ONNX inference in CMSSW. For details of high-level designs of ONNX Runtime, please see documentation here.

With this concept, let's build the module.

"},{"location":"inference/onnx.html#1-includes","title":"1. includes","text":"
#include \"FWCore/Framework/interface/stream/EDAnalyzer.h\"\n#include \"PhysicsTools/ONNXRuntime/interface/ONNXRuntime.h\"\n// further framework includes\n...\n

We include stream/EDAnalyzer.h to build the stream CMSSW module.

"},{"location":"inference/onnx.html#2-global-cache-object","title":"2. Global cache object","text":"

In CMSSW there exists a class ONNXRuntime which can be used directly as the global cache object. Upon initialization from a given model, it holds the ONNX Runtime Session object and provides the handle to invoke the Run() for model inference.

We put the ONNXRuntime class in the edm::GlobalCache template argument:

class MyPlugin : public edm::stream::EDAnalyzer<edm::GlobalCache<ONNXRuntime>> {\n...\n};\n
"},{"location":"inference/onnx.html#3-initiate-objects","title":"3. Initiate objects","text":"

The stream EDAnalyzer module provides a hook initializeGlobalCache() to initiate the global object. We simply do

std::unique_ptr<ONNXRuntime> MyPlugin::initializeGlobalCache(const edm::ParameterSet &iConfig) {\nreturn std::make_unique<ONNXRuntime>(iConfig.getParameter<edm::FileInPath>(\"model_path\").fullPath());\n}\n

to initiate the ONNXRuntime object upon a given model path.

"},{"location":"inference/onnx.html#4-inference","title":"4. Inference","text":"

We know the event processing step is implemented in the void EDAnalyzer::analyze method. When an event is assigned to a valid thread, the content will be processed in that thread. This can go in parallel with other threads processing other events.

We need to first construct the input data dedicated to the event. Here we create a dummy input: a sequence of consecutive integers of length 10. The input is set by replacing the values of our pre-booked vector, data_. This member variable has vector<vector<float>> format and is initialised as { {0, 0, ..., 0} } (contains only one element, which is a vector of 10 zeros). In processing of each event, the input data_ is modified:

std::vector<float> &group_data = data_[0];\nfor (size_t i = 0; i < 10; i++){\ngroup_data[i] = float(iEvent.id().event() % 100 + i);\n}\n
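For cross-checking the printed inputs outside of CMSSW, the same dummy input can be reproduced in Python. The function name is hypothetical; the formula mirrors the C++ loop above.

```python
def dummy_input(event_number, n_features=10):
    # Same rule as the C++ loop: event number modulo 100, plus the feature index.
    return [float(event_number % 100 + i) for i in range(n_features)]

# Event number taken from the example job output shown in the test section of this page.
print(dummy_input(27074045))
```

For event 27074045 this gives the values 45 to 54, in agreement with the input data printed for that event in the example output.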

Then, we send data_ to the inference engine and get the model output:

std::vector<float> outputs = globalCache()->run(input_names_, data_, input_shapes_)[0];\n

We clarify a few details here.

First, we use globalCache() which is a class method in our stream CMSSW module to access the global object shared across all threads. In our case it is the ONNXRuntime instance.

The run() method is a wrapper to call Run() on the ONNX Session. Definitions of the method arguments are (code from link):

// Run inference and get outputs\n// input_names: list of the names of the input nodes.\n// input_values: list of input arrays for each input node. The order of `input_values` must match `input_names`.\n// input_shapes: list of `int64_t` arrays specifying the shape of each input node. Can leave empty if the model does not have dynamic axes.\n// output_names: names of the output nodes to get outputs from. Empty list means all output nodes.\n// batch_size: number of samples in the batch. Each array in `input_values` must have a shape layout of (batch_size, ...).\n// Returns: a std::vector<std::vector<float>>, with the order matched to `output_names`.\n// When `output_names` is empty, will return all outputs ordered as in `getOutputNames()`.\nFloatArrays run(const std::vector<std::string>& input_names,\nFloatArrays& input_values,\nconst std::vector<std::vector<int64_t>>& input_shapes = {},\nconst std::vector<std::string>& output_names = {},\nint64_t batch_size = 1) const;\n
where we have
typedef std::vector<std::vector<float>> FloatArrays;\n

In our case, input_names is set to {\"my_input\"} which corresponds to the names given upon model creation. input_values is a length-1 vector, and input_values[0] is a vector of float of length 10, which are inputs to the 10 nodes. input_shapes can be left empty here and will be necessary for advanced usage, when our input has dynamic lengths (e.g., in boosted jet tagging, we use different numbers of particle-flow candidates and secondary vertices as input).
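When the batch size or the input length varies, each entry of input_values is still a single flat array, accompanied by its shape in input_shapes. The sketch below shows one way to picture this layout; the helper name is hypothetical, and row-major ordering (all features of the first sample, then the second, and so on) is an assumption stated here rather than something taken from the interface documentation.

```python
def flatten_batch(batch):
    # batch: list of samples, each a list with the same number of features.
    n_features = len(batch[0])
    if any(len(sample) != n_features for sample in batch):
        raise ValueError('all samples must have the same number of features')
    flat = [float(x) for sample in batch for x in sample]
    shape = [len(batch), n_features]  # (batch_size, n_features)
    return flat, shape

flat, shape = flatten_batch([[1, 2, 3], [4, 5, 6]])
print(flat, shape)
```

In this picture, `flat` corresponds to one element of input_values and `shape` to the matching element of input_shapes.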

For the usual model design, we have only one vector of output. In such a case, the output is simply a length-1 vector, and we use [0] to get the vector of two float numbers\u2014the output of the model.
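Since the example model ends with a softmax, the two output numbers can be read as class probabilities that sum to one. A minimal sketch of the softmax itself is given below; the logit values are made up, and the second check uses a pair of output values copied from the example job output on this page.

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the maximum before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([5.4, 0.0])  # hypothetical logits
print(probs, sum(probs))

# Output pair printed for the first event in the example job output
logged = [0.995657, 0.00434343]
print('sum of logged outputs:', sum(logged))
```

A sum that deviates noticeably from one would hint at reading the wrong output node or at a mismatch in the exported model.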

"},{"location":"inference/onnx.html#full-example","title":"Full example","text":"

Let's construct the full example.

Click to expand

The example assumes the following directory structure:

MySubsystem/MyModule/\n\u2502\n\u251c\u2500\u2500 plugins/\n\u2502   \u251c\u2500\u2500 MyPlugin.cpp\n\u2502   \u2514\u2500\u2500 BuildFile.xml\n\u2502\n\u251c\u2500\u2500 test/\n\u2502   \u2514\u2500\u2500 my_plugin_cfg.py\n\u2502\n\u2514\u2500\u2500 data/\n    \u2514\u2500\u2500 model.onnx\n
plugins/MyPlugin.cppplugins/BuildFile.xmltest/my_plugin_cfg.pydata/model.onnx
/*\n * Example plugin to demonstrate the direct multi-threaded inference with ONNX Runtime.\n */\n\n#include <memory>\n#include <iostream>\n\n#include \"FWCore/Framework/interface/Event.h\"\n#include \"FWCore/Framework/interface/Frameworkfwd.h\"\n#include \"FWCore/Framework/interface/MakerMacros.h\"\n#include \"FWCore/Framework/interface/stream/EDAnalyzer.h\"\n#include \"FWCore/ParameterSet/interface/ParameterSet.h\"\n\n#include \"PhysicsTools/ONNXRuntime/interface/ONNXRuntime.h\"\n\nusing namespace cms::Ort;\n\nclass MyPlugin : public edm::stream::EDAnalyzer<edm::GlobalCache<ONNXRuntime>> {\npublic:\nexplicit MyPlugin(const edm::ParameterSet &, const ONNXRuntime *);\nstatic void fillDescriptions(edm::ConfigurationDescriptions&);\n\nstatic std::unique_ptr<ONNXRuntime> initializeGlobalCache(const edm::ParameterSet &);\nstatic void globalEndJob(const ONNXRuntime *);\n\nprivate:\nvoid beginJob();\nvoid analyze(const edm::Event&, const edm::EventSetup&);\nvoid endJob();\n\nstd::vector<std::string> input_names_;\nstd::vector<std::vector<int64_t>> input_shapes_;\nFloatArrays data_; // each stream hosts its own data\n};\n\n\nvoid MyPlugin::fillDescriptions(edm::ConfigurationDescriptions& descriptions) {\n// defining this function will lead to a *_cfi file being generated when compiling\nedm::ParameterSetDescription desc;\ndesc.add<edm::FileInPath>(\"model_path\", edm::FileInPath(\"MySubsystem/MyModule/data/model.onnx\"));\ndesc.add<std::vector<std::string>>(\"input_names\", std::vector<std::string>({\"my_input\"}));\ndescriptions.addWithDefaultLabel(desc);\n}\n\n\nMyPlugin::MyPlugin(const edm::ParameterSet &iConfig, const ONNXRuntime *cache)\n: input_names_(iConfig.getParameter<std::vector<std::string>>(\"input_names\")),\ninput_shapes_() {\n// initialize the input data arrays\n// note there is only one element in the FloatArrays type (i.e. 
vector<vector<float>>) variable\ndata_.emplace_back(10, 0);\n}\n\n\nstd::unique_ptr<ONNXRuntime> MyPlugin::initializeGlobalCache(const edm::ParameterSet &iConfig) {\nreturn std::make_unique<ONNXRuntime>(iConfig.getParameter<edm::FileInPath>(\"model_path\").fullPath());\n}\n\nvoid MyPlugin::globalEndJob(const ONNXRuntime *cache) {}\n\nvoid MyPlugin::analyze(const edm::Event &iEvent, const edm::EventSetup &iSetup) {\n// prepare dummy inputs for every event\nstd::vector<float> &group_data = data_[0];\nfor (size_t i = 0; i < 10; i++){\ngroup_data[i] = float(iEvent.id().event() % 100 + i);\n}\n\n// run prediction and get outputs\nstd::vector<float> outputs = globalCache()->run(input_names_, data_, input_shapes_)[0];\n\n// print the input and output data\nstd::cout << \"input data -> \";\nfor (auto &i: group_data) { std::cout << i << \" \"; }\nstd::cout << std::endl << \"output data -> \";\nfor (auto &i: outputs) { std::cout << i << \" \"; }\nstd::cout << std::endl;\n\n}\n\nDEFINE_FWK_MODULE(MyPlugin);\n
<use name=\"FWCore/Framework\" />\n<use name=\"FWCore/PluginManager\" />\n<use name=\"FWCore/ParameterSet\" />\n<use name=\"PhysicsTools/ONNXRuntime\" />\n\n<flags EDM_PLUGIN=\"1\" />\n
# coding: utf-8\n\nimport os\n\nimport FWCore.ParameterSet.Config as cms\nfrom FWCore.ParameterSet.VarParsing import VarParsing\n\n\n# setup minimal options\noptions = VarParsing(\"python\")\noptions.setDefault(\"inputFiles\", \"/store/mc/RunIISummer20UL18MiniAODv2/DYJetsToLL_M-50_TuneCP5_13TeV-amcatnloFXFX-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v2/230000/4C8619B2-D0C0-4647-B946-B33754F4ED16.root\")  # noqa\noptions.parseArguments()\n\n# define the process to run\nprocess = cms.Process(\"TEST\")\n\n# minimal configuration\nprocess.load(\"FWCore.MessageService.MessageLogger_cfi\")\nprocess.MessageLogger.cerr.FwkReport.reportEvery = 1\nprocess.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(10))\nprocess.source = cms.Source(\"PoolSource\",\n    fileNames=cms.untracked.vstring(options.inputFiles))\n\n# process options\nprocess.options = cms.untracked.PSet(\n    allowUnscheduled=cms.untracked.bool(True),\n    wantSummary=cms.untracked.bool(True),\n)\n\n# setup options for multithreaded\nprocess.options.numberOfThreads=cms.untracked.uint32(1)\nprocess.options.numberOfStreams=cms.untracked.uint32(0)\nprocess.options.numberOfConcurrentLuminosityBlocks=cms.untracked.uint32(1)\n\n\n# setup MyPlugin by loading the auto-generated cfi (see MyPlugin.fillDescriptions)\nprocess.load(\"MySubsystem.MyModule.myPlugin_cfi\")\n# specify the path of the ONNX model\nprocess.myPlugin.model_path = \"MySubsystem/MyModule/data/model.onnx\"\n# input names as defined in the model\n# the order of name strings should also corresponds to the order of input data array feed to the model\nprocess.myPlugin.input_names = [\"my_input\"]\n\n# define what to run in the path\nprocess.p = cms.Path(process.myPlugin)\n

The model is produced by code in the section \"Converting model to ONNX\" and can be downloaded here.

"},{"location":"inference/onnx.html#test-our-module","title":"Test our module","text":"

Under MySubsystem/MyModule/test, run cmsRun my_plugin_cfg.py to launch our module. The output should resemble the following, which includes the input and output vectors of the inference process.

Click to see the output
...\n19-Jul-2022 10:50:41 CEST  Successfully opened file root://xrootd-cms.infn.it//store/mc/RunIISummer20UL18MiniAODv2/DYJetsToLL_M-50_TuneCP5_13TeV-amcatnloFXFX-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v2/230000/4C8619B2-D0C0-4647-B946-B33754F4ED16.root\nBegin processing the 1st record. Run 1, Event 27074045, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.494 CEST\ninput data -> 45 46 47 48 49 50 51 52 53 54\noutput data -> 0.995657 0.00434343\nBegin processing the 2nd record. Run 1, Event 27074048, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.495 CEST\ninput data -> 48 49 50 51 52 53 54 55 56 57\noutput data -> 0.996884 0.00311563\nBegin processing the 3rd record. Run 1, Event 27074059, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.495 CEST\ninput data -> 59 60 61 62 63 64 65 66 67 68\noutput data -> 0.999081 0.000919373\nBegin processing the 4th record. Run 1, Event 27074061, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.495 CEST\ninput data -> 61 62 63 64 65 66 67 68 69 70\noutput data -> 0.999264 0.000736247\nBegin processing the 5th record. Run 1, Event 27074046, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.496 CEST\ninput data -> 46 47 48 49 50 51 52 53 54 55\noutput data -> 0.996112 0.00388828\nBegin processing the 6th record. Run 1, Event 27074047, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.496 CEST\ninput data -> 47 48 49 50 51 52 53 54 55 56\noutput data -> 0.996519 0.00348065\nBegin processing the 7th record. Run 1, Event 27074064, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.496 CEST\ninput data -> 64 65 66 67 68 69 70 71 72 73\noutput data -> 0.999472 0.000527586\nBegin processing the 8th record. Run 1, Event 27074074, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.496 CEST\ninput data -> 74 75 76 77 78 79 80 81 82 83\noutput data -> 0.999826 0.000173664\nBegin processing the 9th record. 
Run 1, Event 27074050, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.496 CEST\ninput data -> 50 51 52 53 54 55 56 57 58 59\noutput data -> 0.997504 0.00249614\nBegin processing the 10th record. Run 1, Event 27074060, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.496 CEST\ninput data -> 60 61 62 63 64 65 66 67 68 69\noutput data -> 0.999177 0.000822734\n19-Jul-2022 10:50:43 CEST  Closed file root://xrootd-cms.infn.it//store/mc/RunIISummer20UL18MiniAODv2/DYJetsToLL_M-50_TuneCP5_13TeV-amcatnloFXFX-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v2/230000/4C8619B2-D0C0-4647-B946-B33754F4ED16.root\n
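The input values in the log follow from the loop in MyPlugin::analyze, which fills element i with the event number modulo 100 plus i. A minimal Python sketch of that arithmetic, where the helper name make_dummy_input is hypothetical and only used for illustration:

```python
def make_dummy_input(event_number, size=10):
    """Mirror the C++ loop: element i is float(event_number % 100 + i)."""
    return [float(event_number % 100 + i) for i in range(size)]


# Event 27074045 is the first record in the log above
first_record = make_dummy_input(27074045)
print(first_record)  # values run from 45.0 to 54.0
```

This can help when cross-checking a given log line against the event number it reports.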

We can also try launching the script with more threads. Change the corresponding line in my_plugin_cfg.py as follows to activate the multi-threaded mode with 4 threads.

process.options.numberOfThreads=cms.untracked.uint32(4)\n

Launch the script again and you should see the same results, but with the inference processed concurrently on 4 threads.

"},{"location":"inference/onnx.html#inference-in-cmssw-python","title":"Inference in CMSSW (Python)","text":"

ONNX Runtime inference is also possible in Python. For those releases that have the ONNX Runtime C++ package installed, the onnxruntime Python package is also installed in python3 (except for CMSSW_10_6_X). We still use CMSSW_11_2_5_patch2 to run our examples. We can quickly check whether onnxruntime is available with:

python3 -c \"import onnxruntime; print('onnxruntime available')\"\n

The python code is simple to construct: following the quick examples \"Get started with ORT for Python\", we create the file MySubsystem/MyModule/test/my_standalone_test.py as follows:

import onnxruntime as ort\nimport numpy as np\n\n# create input data in the float format (32 bit)\ndata = np.arange(45, 55).astype(np.float32)\n\n# create inference session using ort.InferenceSession from a given model\nort_sess = ort.InferenceSession('../data/model.onnx')\n\n# run inference\noutputs = ort_sess.run(None, {'my_input': np.array([data])})[0]\n\n# print input and output\nprint('input ->', data)\nprint('output ->', outputs)\n

Under the directory MySubsystem/MyModule/test, run the example with python3 my_standalone_test.py. Then we see the output:

input -> [45. 46. 47. 48. 49. 50. 51. 52. 53. 54.]\noutput -> [[0.9956566  0.00434343]]\n
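Since the standalone script uses the same input (45 to 54) as the first record of the cmsRun test, the two printouts can be compared directly. The sketch below is only an illustration of such a cross-check; the numbers are copied from the outputs shown on this page, and the tolerance is an assumption chosen to match the roughly six significant digits of the C++ printout.

```python
import math

# Values copied from the outputs shown on this page for the input 45..54
cpp_output = [0.995657, 0.00434343]      # printed by cmsRun for event 27074045
python_output = [0.9956566, 0.00434343]  # printed by my_standalone_test.py

# The C++ printout keeps about six significant digits, so use a loose tolerance
outputs_agree = all(
    math.isclose(c, p, rel_tol=1e-5) for c, p in zip(cpp_output, python_output)
)
# The two class scores should add up to 1 within the printed precision
total = sum(python_output)

print(outputs_agree, round(total, 5))
```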

Using ONNX Runtime on NanoAOD-tools follows the same logic. Here we create the ONNX Session in the beginning stage and run inference in the event loop. Note that NanoAOD-tools runs the event loop in the single-thread mode.

Please find details in the following block.

Click to see the NanoAOD-tools example

We run the NanoAOD-tools example in the CMSSW_11_2_5_patch2 environment set up above. Following the setup instructions of NanoAOD-tools, run

cd $CMSSW_BASE/src\ngit clone https://github.com/cms-nanoAOD/nanoAOD-tools.git PhysicsTools/NanoAODTools\ncd PhysicsTools/NanoAODTools\ncmsenv\nscram b\n

Now we add our custom module to run ONNX Runtime inference. Create a file PhysicsTools/NanoAODTools/python/postprocessing/examples/exampleOrtModule.py with the content:

from PhysicsTools.NanoAODTools.postprocessing.framework.datamodel import Collection\nfrom PhysicsTools.NanoAODTools.postprocessing.framework.eventloop import Module\nimport ROOT\nROOT.PyConfig.IgnoreCommandLineOptions = True\n\nimport onnxruntime as ort\nimport numpy as np\nimport os\n\nclass exampleOrtProducer(Module):\n    def __init__(self):\n        pass\n\n    def beginJob(self):\n        model_path = os.path.join(os.getenv(\"CMSSW_BASE\"), 'src', 'MySubsystem/MyModule/data/model.onnx')\n        self.ort_sess = ort.InferenceSession(model_path)\n\n    def endJob(self):\n        pass\n\n    def beginFile(self, inputFile, outputFile, inputTree, wrappedOutputTree):\n        self.out = wrappedOutputTree\n        self.out.branch(\"OrtScore\", \"F\")\n\n    def endFile(self, inputFile, outputFile, inputTree, wrappedOutputTree):\n        pass\n\n    def analyze(self, event):\n        \"\"\"process event, return True (go to next module) or False (fail, go to next event)\"\"\"\n\n        # create input data\n        data = np.arange(event.event % 100, event.event % 100 + 10).astype(np.float32)\n        # run inference\n        outputs = self.ort_sess.run(None, {'my_input': np.array([data])})[0]\n        # print input and output\n        print('input ->', data)\n        print('output ->', outputs)\n\n        self.out.fillBranch(\"OrtScore\", outputs[0][0])\n        return True\n\n\n# define modules using the syntax 'name = lambda : constructor' to avoid having them loaded when not needed\n\nexampleOrtModuleConstr = lambda: exampleOrtProducer()\n

Please note the two key lines: the creation of the ONNX Runtime session (ort.InferenceSession in beginJob) and the launch of the inference (self.ort_sess.run in analyze).

Finally, following the test command from NanoAOD-tools, we run our custom module in python3 by

python3 scripts/nano_postproc.py outDir /eos/cms/store/user/andrey/f.root -I PhysicsTools.NanoAODTools.postprocessing.examples.exampleOrtModule exampleOrtModuleConstr -N 10\n

We should see the output as follows

processing.examples.exampleOrtModule exampleOrtModuleConstr -N 10\nLoading exampleOrtModuleConstr from PhysicsTools.NanoAODTools.postprocessing.examples.exampleOrtModule\nWill write selected trees to outDir\nPre-select 10 entries out of 10 (100.00%)\ninput -> [11. 12. 13. 14. 15. 16. 17. 18. 19. 20.]\noutput -> [[0.83919346 0.16080655]]\ninput -> [ 7.  8.  9. 10. 11. 12. 13. 14. 15. 16.]\noutput -> [[0.76994413 0.2300559 ]]\ninput -> [ 4.  5.  6.  7.  8.  9. 10. 11. 12. 13.]\noutput -> [[0.7116992 0.2883008]]\ninput -> [ 2.  3.  4.  5.  6.  7.  8.  9. 10. 11.]\noutput -> [[0.66414535 0.33585465]]\ninput -> [ 9. 10. 11. 12. 13. 14. 15. 16. 17. 18.]\noutput -> [[0.80617136 0.19382869]]\ninput -> [ 6.  7.  8.  9. 10. 11. 12. 13. 14. 15.]\noutput -> [[0.75187963 0.2481204 ]]\ninput -> [16. 17. 18. 19. 20. 21. 22. 23. 24. 25.]\noutput -> [[0.9014619  0.09853811]]\ninput -> [18. 19. 20. 21. 22. 23. 24. 25. 26. 27.]\noutput -> [[0.9202239  0.07977609]]\ninput -> [ 5.  6.  7.  8.  9. 10. 11. 12. 13. 14.]\noutput -> [[0.7330253  0.26697478]]\ninput -> [10. 11. 12. 13. 14. 15. 16. 17. 18. 19.]\noutput -> [[0.82333535 0.17666471]]\nProcessed 10 preselected entries from /eos/cms/store/user/andrey/f.root (10 entries). Finally selected 10 entries\nDone outDir/f_Skim.root\nTotal time 1.1 sec. to process 10 events. Rate = 9.3 Hz.\n

"},{"location":"inference/onnx.html#links-and-further-reading","title":"Links and further reading","text":"
  • ONNX/ONNX Runtime
    • Tutorials on converting models to ONNX format
    • ONNX Runtime C++ example
    • ONNX Runtime C++ API
    • ONNX Runtime python example
    • ONNX Runtime python API
    • ONNX Runtime in CMSSW (talk)

Developers: Huilin Qu

Authors: Congqiao Li

"},{"location":"inference/particlenet.html","title":"ParticleNet","text":"

ParticleNet [arXiv:1902.08570] is an advanced neural network architecture that has many applications in CMS, including heavy flavour jet tagging, jet mass regression, etc. The network is fed by various low-level point-like objects as input, e.g., the particle-flow candidates, to predict a feature of a jet.

The full architecture of the ParticleNet model. We'll walk through the details in the following sections.

On this page, we introduce several user-specific aspects of the ParticleNet model. We cover the following items in three sections:

  1. An introduction to ParticleNet, including

    • a general description of ParticleNet
    • the conceptual advantages of the architecture
    • a sketch of ParticleNet applications in CMS and other relevant works
  2. An introduction to Weaver and model implementations, introduced in a step-by-step manner:

    • build three network models and understand them from the technical side; use the out-of-the-box commands to run these examples on a benchmark task. The three networks are (1) a simple feed-forward NN, (2) a DeepAK8 model (based on 1D CNN), and eventually (3) the ParticleNet model (based on DGCNN).
    • try to reproduce the original performance and make the ROC plots.

    This section is friendly to ML newcomers. The goal is to help readers understand the underlying structure of ParticleNet.

  3. Tuning the ParticleNet model, including

    • tips for readers who are using/modifying the ParticleNet model to achieve a better performance

    This section can be helpful in practice. It provides tips on model training, tuning, validation, etc. It targets situations in which readers apply their own ParticleNet (or ParticleNet-like) model to a custom task.

Corresponding persons:

  • Huilin Qu, Loukas Gouskos (original developers of ParticleNet)
  • Congqiao Li (author of the page)
"},{"location":"inference/particlenet.html#introduction-to-particlenet","title":"Introduction to ParticleNet","text":""},{"location":"inference/particlenet.html#1-general-description","title":"1. General description","text":"

ParticleNet is a graph neural net (GNN) model. The key ingredient of ParticleNet is the graph convolutional operation, i.e., the edge convolution (EdgeConv) and the dynamic graph CNN (DGCNN) method [arXiv:1801.07829] applied on the \"point cloud\" data structure.

We will disassemble the ParticleNet model and provide a detailed exploration in the next section, but here we briefly explain the key features of the model.

Intuitively, ParticleNet treats all candidates inside an object as a \"point cloud\", which is a permutation-invariant set of points (e.g. a set of PF candidates), each carrying a feature vector (\u03b7, \u03c6, pT, charge, etc.). The DGCNN uses the EdgeConv operation to exploit their spatial correlations (two-dimensional on the \u03b7-\u03c6 plane) by finding the k-nearest neighbours of each point and generating a new latent graph layer where points are scattered in a high-dimensional latent space. This is a graph-type analogue of the classical 2D convolution operation, which acts on a regular 2D grid (e.g., a picture) using a 3\u00d73 local patch to explore the relations of a single pixel with its 8 nearest pixels, then generates a new 2D grid.

The cartoon illustrates the convolutional operation acting on the regular grid and on the point cloud (plot from ML4Jets 2018 talk).

As a consequence, the EdgeConv operation transforms the graph to a new graph, which has a changed spatial relationship among points. It then acts on the second graph to produce the third graph, showing the stackability of the convolution operation. This illustrates the \"dynamic\" property as the graph topology changes after each EdgeConv layer.
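The two ingredients described above, the k-nearest-neighbour search and the edge-wise aggregation, can be sketched in plain Python. This is a toy illustration and not the ParticleNet implementation: toy_edge_fn is a hypothetical stand-in for the learned edge function (an MLP in the real model), and averaging over neighbours is used here as one simple choice of symmetric aggregation.

```python
import math


def delta_phi(phi1, phi2):
    """Signed azimuthal difference wrapped into [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2 * math.pi) - math.pi


def knn_indices(points, k):
    """For each (eta, phi) point, return the indices of its k nearest neighbours."""
    neighbours = []
    for i, (eta_i, phi_i) in enumerate(points):
        dist = [
            (math.hypot(eta_i - eta_j, delta_phi(phi_i, phi_j)), j)
            for j, (eta_j, phi_j) in enumerate(points)
            if j != i
        ]
        neighbours.append([j for _, j in sorted(dist)[:k]])
    return neighbours


def edge_conv(features, neighbours, edge_fn):
    """Toy EdgeConv: average edge_fn(x_i, x_j - x_i) over the neighbours of each point."""
    out = []
    for x_i, nbrs in zip(features, neighbours):
        edges = [edge_fn(x_i, [b - a for a, b in zip(x_i, features[j])]) for j in nbrs]
        out.append([sum(col) / len(edges) for col in zip(*edges)])
    return out


def toy_edge_fn(x_i, diff):
    """Hypothetical fixed edge function; the real model learns this mapping."""
    return [x_i[0] + diff[0], x_i[1] * diff[1]]


# Made-up points; the last two are close in phi only once the wrap-around is handled
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 3.1), (0.05, -3.1)]
neighbours = knn_indices(points, k=1)
print(neighbours)  # [[1], [0], [3], [2]]

features = [list(p) for p in points]
new_features = edge_conv(features, neighbours, toy_edge_fn)
print(new_features)
```

In a second layer, the neighbours would be recomputed from new_features (with a plain Euclidean distance in the latent space), which is what makes the graph "dynamic".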

"},{"location":"inference/particlenet.html#2-advantage","title":"2. Advantage","text":"

Conceptually, the advantage of the network may come from exploiting the permutation-invariant symmetry of the points, which is intrinsic to our physics objects. This symmetry holds naturally in a point cloud representation.
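A small sketch can make the symmetry argument concrete. It uses made-up constituent features and two hypothetical helpers: mean_pool (a symmetric aggregation, as used in point-cloud models) and flatten (an order-dependent representation, as fed to a plain MLP).

```python
# Toy "jet": each constituent carries a small feature vector (values are made up)
constituents = [[0.5, 1.0], [0.2, -0.3], [0.9, 0.1], [0.1, 0.4]]


def mean_pool(points):
    """Symmetric aggregation: the average over constituents ignores their order."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def flatten(points):
    """Order-dependent representation: concatenating features fixes an ordering."""
    return [value for point in points for value in point]


reordered = constituents[::-1]  # same constituents, different order

pooled_same = all(
    abs(a - b) < 1e-12 for a, b in zip(mean_pool(constituents), mean_pool(reordered))
)
flat_same = flatten(constituents) == flatten(reordered)

print(pooled_same, flat_same)  # True False
```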

In recent studies of jet physics and event-based analyses using ML techniques, there is increasing interest in exploring the point cloud data structure. We explain here conceptually why a \"point cloud\" representation outperforms the classical ones, including the variable-length 2D vector structure passed to a 1D CNN or any type of RNN, and the image-based representation passed through a 2D CNN.

With a 1D CNN, the points (PF candidates) are usually ordered by pT to place them on the 1D grid. The convolution operation then learns only the correlations between neighbouring points with similar pT. The Long Short-Term Memory (LSTM) type of recurrent neural network (RNN) provides the flexibility to feed in a variable-length sequence and has a \"memory\" mechanism to carry the information it learns from early nodes to the latest node. The concern is that such an ordering of the sequence is somewhat artificial, and not an underlying property that an NN must learn to accomplish the classification task. By comparison, in natural language processing, where the LSTM has a huge advantage, the order of words is an important characteristic of the language itself (it reflects the \"grammar\" in some circumstances) and is a feature the NN must learn to master the language.

The image-based data explored by a 2D CNN stems from the image recognition task. A jet image is usually standardized before being fed into the network. In this sense, a jet image lacks the local features that a 2D local patch is good at capturing, e.g. the ear of a cat that a local patch can find by scanning over the entire image. Instead, the jet image appears to hold its features globally (e.g. the two-prong structure for W-tagging). The sparsity of the data is another concern: presenting a jet on a regular grid introduces redundant information, making it hard for the network to capture the key properties.

"},{"location":"inference/particlenet.html#3-applications-and-other-related-work","title":"3. Applications and other related work","text":"

Here we briefly summarize the applications and ongoing works on ParticleNet. Public CMS results include

  • large-R jet with R=0.8 tagging (for W/Z/H/t) using ParticleNet [CMS-DP-2020/002]
  • regression on the large-R jet mass based on the ParticleNet model [CMS-DP-2021/017]

The ParticleNet architecture is also applied to small-radius R=0.4 jets for b/c-tagging and quark/gluon classification (see this talk (CMS internal)). Ongoing work applies the ParticleNet architecture to heavy flavour tagging at HLT (see this talk (CMS internal)). The ParticleNet model has recently been updated to ParticleNeXt, which shows further improvement (see the ML4Jets 2021 talk).

Recent works in the joint field of HEP and ML also shed light on exploiting the point cloud data structure and GNN-based architectures. We have seen very active progress in recent years. Here we list some useful materials for the reader's reference.

  • Some phenomenology-oriented works are summarized in the HEP \u00d7 ML living review, especially in the \"graph\" and \"sets\" categories.
  • For an overview of GNN applications in CMS, see the CMS ML forum (CMS internal). Also see more recent GNN application progress in ML forums: Oct 20, Nov 3.
  • At the time of writing, various novel GNN-based models are explored and introduced in the recent ML4Jets2021 meeting.
"},{"location":"inference/particlenet.html#introduction-to-weaver-and-model-implementations","title":"Introduction to Weaver and model implementations","text":"

Weaver is a machine learning R&D framework for high energy physics (HEP) applications. It trains the neural net with PyTorch and is capable of exporting the model to the ONNX format for fast inference. A detailed guide is presented on the Weaver README page.

Now we walk through three concrete examples to get you familiar with Weaver. We use the benchmark of the top tagging task [arXiv:1707.08966] in the following examples. Some useful information can be found in the \"top tagging\" section of the IML public datasets webpage (the gDoc).

Our goal is to do some warm-up with Weaver, and more importantly, to explore from a technical side the neural net architectures: a simple multi-layer perceptron (MLP) model, a more complicated \"DeepAK8 tagger\" model based on 1D CNN with ResNet, and the \"ParticleNet model,\" which is based on DGCNN. We will dig deeper into their implementations in Weaver and try to illustrate as many details as possible. Finally, we compare their performance and see if we can reproduce the benchmark record with the model. Please clone the repo weaver-benchmark and we'll get started. The Weaver repo will be cloned as a submodule.

git clone --recursive https://github.com/colizz/weaver-benchmark.git\n\n# Create a soft link inside weaver so that it can find data/model cards\nln -s ../top_tagging weaver-benchmark/weaver/top_tagging\n

"},{"location":"inference/particlenet.html#1-build-models-in-weaver","title":"1. Build models in Weaver","text":"

When implementing a new training in Weaver, two key elements are crucial: the model and the data configuration file. The model defines the network architecture we are using, and the data configuration includes which variables to use for training, which pre-selection to apply, how to assign truth labels, etc.

Technically, the model configuration file includes a get_model function that returns a torch.nn.Module type model and a dictionary of model info used to export an ONNX-format model. The data configuration is a YAML file describing how to process the input data. Please see the Weaver README for details.

Before moving on, we need to preprocess the benchmark datasets. The original sample is an H5 file including branches like the energy E_i and 3-momenta PX_i, PY_i, PZ_i for each jet constituent i (i=0, ..., 199) inside a jet. All branches are in the 1D flat structure. We restructure the data so that the jet features are 2D vectors (e.g., in the vector<float> format): Part_E, Part_PX, Part_PY, Part_PZ, with a variable length that corresponds to the number of constituents. Note that this is a commonly used data structure, similar to the NanoAOD format in CMS.
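The restructuring from flat branches to variable-length vectors can be sketched as follows. This is an illustration only, not the code of the actual conversion script: the helper to_variable_length is hypothetical, and it assumes that unused constituent slots are filled with zeros.

```python
def to_variable_length(flat_jet, max_constituents=200):
    """Collect E_i, PX_i, PY_i, PZ_i into lists, skipping zero-padded slots."""
    jet = {"Part_E": [], "Part_PX": [], "Part_PY": [], "Part_PZ": []}
    for i in range(max_constituents):
        energy = flat_jet.get(f"E_{i}", 0.0)
        if energy <= 0.0:  # assumption: unused slots are filled with zeros
            continue
        jet["Part_E"].append(energy)
        for comp in ("PX", "PY", "PZ"):
            jet[f"Part_{comp}"].append(flat_jet[f"{comp}_{i}"])
    return jet


# Made-up jet with two real constituents and the remaining slots zero-padded
flat_jet = {f"{name}_{i}": 0.0 for name in ("E", "PX", "PY", "PZ") for i in range(200)}
flat_jet.update({"E_0": 50.0, "PX_0": 30.0, "PY_0": 40.0, "PZ_0": 0.0,
                 "E_1": 13.0, "PX_1": 5.0, "PY_1": 0.0, "PZ_1": 12.0})

jet = to_variable_length(flat_jet)
print(len(jet["Part_E"]))  # 2
```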

The datasets can be found at CERN EOS space /eos/user/c/coli/public/weaver-benchmark/top_tagging/samples. The input files used on this page are in fact the ROOT files produced by the preprocessing step, stored under the prep/ subdirectory. They include three sets of data for training, validation, and test.

Note

To preprocess the input files from the original datasets manually, go to the weaver-benchmark base directory and run

python utils/convert_top_datasets.py -i <your-sample-dir>\n
This will convert the .h5 file to ROOT ntuples and create some new variables for each jet, including the relative \u03b7 and \u03c6 values of each jet constituent w.r.t. the main axis of the jet. The converted files are stored in the prep/ subfolder of the original directory.
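As an illustration of what such relative coordinates look like, the sketch below computes them from constituent 3-momenta. It is not taken from convert_top_datasets.py: the helper names are hypothetical, and the jet axis is assumed here to be the direction of the summed constituent momentum, which may differ from the convention used in the script.

```python
import math


def eta_phi(px, py, pz):
    """Pseudorapidity and azimuth of a momentum vector (assumes nonzero pT)."""
    pt = math.hypot(px, py)
    return math.asinh(pz / pt), math.atan2(py, px)


def relative_coordinates(constituents):
    """Return (eta_rel, phi_rel) of each constituent w.r.t. the summed jet momentum."""
    jet_p = [sum(c[k] for c in constituents) for k in range(3)]
    jet_eta, jet_phi = eta_phi(*jet_p)
    rel = []
    for px, py, pz in constituents:
        eta, phi = eta_phi(px, py, pz)
        dphi = (phi - jet_phi + math.pi) % (2 * math.pi) - math.pi  # wrap into [-pi, pi)
        rel.append((eta - jet_eta, dphi))
    return rel


# Two made-up constituents placed symmetrically around the x axis
rel = relative_coordinates([(10.0, 1.0, 0.0), (10.0, -1.0, 0.0)])
print(rel)
```

For this symmetric toy jet the two constituents end up at opposite relative φ and zero relative η.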

Then, we show three NN model configurations below and provide detailed explanations of the code. We illustrate the model architecture in detail, especially in the ParticleNet case.

A simple MLP | DeepAK8 (1D CNN) | ParticleNet (DGCNN)

The full architecture of the proof-of-concept multi-layer perceptron model.

A simple multi-layer perceptron model is first provided here as a proof of concept. All layers are based on the linear transformation of the 1D vectors. The model configuration card is shown in top_tagging/networks/mlp_pf.py. First, we implement an MLP network in the nn.Module class.

MLP implementation

Also, see top_tagging/networks/mlp_pf.py. We elaborate here on several aspects.

  • A sequence of linear layers and ReLU activation functions is defined in nn.Sequential(nn.Linear(channels[i], channels[i + 1]), nn.ReLU()). By combining multiple of them, we construct a simple multi-layer perceptron.

  • The input data x takes the 3D format, in the dimension (N, C, P), which is determined by our data structure and the data configuration card. Here, N is the mini-batch size, C is the number of features, and P is the number of constituents per jet. To feed it into our MLP, we flatten the last two dimensions with x = x.flatten(start_dim=1) to form a vector of dimension (N, L), where L = C * P.

import torch.nn as nn\n\n\nclass MultiLayerPerceptron(nn.Module):\n    r\"\"\"Parameters\n    ----------\n    input_dims : int\n        Input feature dimensions.\n    num_classes : int\n        Number of output classes.\n    layer_params : list\n        List of the feature size for each layer.\n    \"\"\"\n\n    def __init__(self, input_dims, num_classes,\n                layer_params=(1024, 256, 256),\n                **kwargs):\n\n        super(MultiLayerPerceptron, self).__init__(**kwargs)\n        channels = [input_dims] + list(layer_params) + [num_classes]\n        layers = []\n        for i in range(len(channels) - 1):\n            layers.append(nn.Sequential(nn.Linear(channels[i], channels[i + 1]),\n                                        nn.ReLU()))\n        self.mlp = nn.Sequential(*layers)\n\n    def forward(self, x):\n        # x: the feature vector initially read from the data structure, in dimension (N, C, P)\n        x = x.flatten(start_dim=1) # (N, L), where L = C * P\n        return self.mlp(x)\n

Then, we write the get_model and get_loss functions which will be sent into Weaver's training code.

get_model and get_loss function

Also see top_tagging/networks/mlp_pf.py. We elaborate here on several aspects.

  • Inside get_model, the model is essentially the MLP class we define, and the model_info takes the default definition, including the input/output shape and the dimensions of the dynamic axes for the input/output data shape that will guide the ONNX model export.
  • The get_loss function is not changed, since in the classification task we always use the cross-entropy loss function.
def get_model(data_config, **kwargs):\n    layer_params = (1024, 256, 256)\n    _, pf_length, pf_features_dims = data_config.input_shapes['pf_features']\n    input_dims = pf_length * pf_features_dims\n    num_classes = len(data_config.label_value)\n    model = MultiLayerPerceptron(input_dims, num_classes, layer_params=layer_params)\n\n    model_info = {\n        'input_names':list(data_config.input_names),\n        'input_shapes':{k:((1,) + s[1:]) for k, s in data_config.input_shapes.items()},\n        'output_names':['softmax'],\n        'dynamic_axes':{**{k:{0:'N', 2:'n_' + k.split('_')[0]} for k in data_config.input_names}, **{'softmax':{0:'N'}}},\n        }\n\n    print(model, model_info)\n    return model, model_info\n\n\ndef get_loss(data_config, **kwargs):\n    return torch.nn.CrossEntropyLoss()\n
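To see what the model_info dictionary evaluates to, the comprehensions above can be run on their own. In this sketch, input_names and input_shapes are hypothetical stand-ins for the corresponding data_config attributes, assuming a single input group pf_features with four variables and length 100.

```python
# Hypothetical values standing in for the Weaver data_config attributes
input_names = ["pf_features"]
input_shapes = {"pf_features": (-1, 4, 100)}  # (N, C, P); -1 marks the batch axis

model_info = {
    "input_names": list(input_names),
    # replace the batch dimension by 1 for the example input
    "input_shapes": {k: ((1,) + s[1:]) for k, s in input_shapes.items()},
    "output_names": ["softmax"],
    # axis 0 (batch) and axis 2 (constituents) are declared as variable-size axes
    "dynamic_axes": {
        **{k: {0: "N", 2: "n_" + k.split("_")[0]} for k in input_names},
        **{"softmax": {0: "N"}},
    },
}

print(model_info["input_shapes"])  # {'pf_features': (1, 4, 100)}
print(model_info["dynamic_axes"])  # {'pf_features': {0: 'N', 2: 'n_pf'}, 'softmax': {0: 'N'}}
```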

The output below shows the full structure of the MLP network printed by PyTorch. You will see it in the Weaver output during the training.

The full-scale structure of the MLP network
MultiLayerPerceptron(\n  |0.739 M, 100.000% Params, 0.001 GMac, 100.000% MACs|\n  (mlp): Sequential(\n    |0.739 M, 100.000% Params, 0.001 GMac, 100.000% MACs|\n    (0): Sequential(\n      |0.411 M, 55.540% Params, 0.0 GMac, 55.563% MACs|\n      (0): Linear(in_features=400, out_features=1024, bias=True, |0.411 M, 55.540% Params, 0.0 GMac, 55.425% MACs|)\n      (1): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.138% MACs|)\n    )\n    (1): Sequential(\n      |0.262 M, 35.492% Params, 0.0 GMac, 35.452% MACs|\n      (0): Linear(in_features=1024, out_features=256, bias=True, |0.262 M, 35.492% Params, 0.0 GMac, 35.418% MACs|)\n      (1): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.035% MACs|)\n    )\n    (2): Sequential(\n      |0.066 M, 8.899% Params, 0.0 GMac, 8.915% MACs|\n      (0): Linear(in_features=256, out_features=256, bias=True, |0.066 M, 8.899% Params, 0.0 GMac, 8.880% MACs|)\n      (1): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.035% MACs|)\n    )\n    (3): Sequential(\n      |0.001 M, 0.070% Params, 0.0 GMac, 0.070% MACs|\n      (0): Linear(in_features=256, out_features=2, bias=True, |0.001 M, 0.070% Params, 0.0 GMac, 0.069% MACs|)\n      (1): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.000% MACs|)\n    )\n  )\n)\n
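The parameter counts in the printout can be checked by hand. The sketch below assumes the layer widths visible in the printout (400 inputs, i.e. L = C * P = 4 * 100, hidden layers of 1024, 256 and 256, and two output classes) and counts weights plus biases of each linear layer.

```python
# Layer widths of the MLP: 4 * 100 inputs, three hidden layers, two classes
channels = [4 * 100, 1024, 256, 256, 2]

# A Linear(n_in, n_out) layer with bias holds n_in * n_out weights plus n_out biases
params_per_layer = [
    n_in * n_out + n_out for n_in, n_out in zip(channels[:-1], channels[1:])
]
total = sum(params_per_layer)

for n, count in enumerate(params_per_layer):
    print(f"layer {n}: {count} parameters ({100 * count / total:.3f}%)")
print(f"total: {total} parameters")  # 739330, i.e. about 0.739 M
```

The first layer alone accounts for more than half of the parameters, in line with the fractions reported in the printout.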

The data card is shown in top_tagging/data/pf_features.yaml. It defines one input group, pf_features, which takes four variables Etarel, Phirel, E_log, P_log. This is based on our data structure, where these variables are 2D vectors with variable lengths. The length is chosen as 100 in a way that the last dimension (the jet constituent dimension) is always truncated or padded to have length 100.

MLP data config top_tagging/data/pf_features.yaml

Also see top_tagging/data/pf_features.yaml. See the Weaver README for a walkthrough of the data configuration card.

selection:\n    ### use `&`, `|`, `~` for logical operations on numpy arrays\n    ### can use functions from `math`, `np` (numpy), and `awkward` in the expression\n\nnew_variables:\n    ### [format] name: formula\n    ### can use functions from `math`, `np` (numpy), and `awkward` in the expression\n    is_bkg: np.logical_not(is_signal_new)\n\npreprocess:\n    ### method: [manual, auto] - whether to use manually specified parameters for variable standardization\n    method: manual\n    ### data_fraction: fraction of events to use when calculating the mean/scale for the standardization\n    data_fraction:\n\ninputs:\n    pf_features:\n        length: 100\n        vars:\n            ### [format 1]: var_name (no transformation)\n            ### [format 2]: [var_name,\n            ###              subtract_by(optional, default=None, no transf. if preprocess.method=manual, auto transf. if preprocess.method=auto),\n            ###              multiply_by(optional, default=1),\n            ###              clip_min(optional, default=-5),\n            ###              clip_max(optional, default=5),\n            ###              pad_value(optional, default=0)]\n            - Part_Etarel\n            - Part_Phirel\n            - [Part_E_log, 2, 1]\n            - [Part_P_log, 2, 1]\n\nlabels:\n    ### type can be `simple`, `custom`\n    ### [option 1] use `simple` for binary/multi-class classification, then `value` is a list of 0-1 labels\n    type: simple\n    value: [\n        is_signal_new, is_bkg\n        ]\n    ### [option 2] otherwise use `custom` to define the label, then `value` is a map\n    # type: custom\n    # value:\n    #     target_mass: np.where(fj_isQCD, fj_genjet_sdmass, fj_gen_mass)\n\nobservers:\n    - origIdx\n    - idx\n    - Part_E_tot\n    - Part_PX_tot\n    - Part_PY_tot\n    - Part_PZ_tot\n    - Part_P_tot\n    - Part_Eta_tot\n    - Part_Phi_tot\n\n# weights:\n    ### [option 1] use precomputed weights stored in the input files\n    # use_precomputed_weights: true\n    # weight_branches: [weight, class_weight]\n    ### [option 2] compute weights on-the-fly using reweighting histograms\n
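The effect of the per-variable options in the vars list (subtract_by, multiply_by, clip_min, clip_max, pad_value) together with the fixed length can be sketched as below. This is not Weaver's code: prepare_variable is a hypothetical helper, and both the order of the operations and the choice to leave untransformed variables unclipped are assumptions made for illustration.

```python
def prepare_variable(values, length=100, subtract_by=None, multiply_by=1.0,
                     clip_min=-5.0, clip_max=5.0, pad_value=0.0):
    """Standardize and clip, then truncate or pad one per-constituent variable."""
    if subtract_by is not None:
        values = [
            min(max((v - subtract_by) * multiply_by, clip_min), clip_max)
            for v in values
        ]
    values = list(values)[:length]                        # truncate long jets
    return values + [pad_value] * (length - len(values))  # pad short jets


# Part_E_log is declared as [Part_E_log, 2, 1]: subtract 2, then multiply by 1
part_e_log = prepare_variable([3.5, 2.0, 9.0], subtract_by=2.0, multiply_by=1.0)
print(part_e_log[:4])  # [1.5, 0.0, 5.0, 0.0]
```

The third value is clipped at 5, and the fourth entry is already padding.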

In the following two models (i.e., the DeepAK8 and the ParticleNet model) you will see that the data card is very similar. The change will only be the way we present the input group(s).

The full architecture of the DeepAK8 model, which is based on 1D CNN with ResNet architecture.

Note

The DeepAK8 tagger is a widely used highly-boosted jet tagger in the CMS community. The design of the model can be found in the CMS paper [arXiv:2004.08262]. The original model is trained on MXNet and its configuration can be found here.

We now migrate the model architecture to Weaver and train it with PyTorch. Also, we narrow the multi-class output score to a binary output to adapt to our binary classification task (top vs. QCD jet).

The model card is given in top_tagging/networks/deepak8_pf.py. The DeepAK8 model is inspired by the ResNet architecture. The key ingredient is the ResNet unit constructed by multiple CNN layers with a shortcut connection. First, we define the ResNet unit in the model card.

ResNet unit implementation

See top_tagging/networks/deepak8_pf.py. We elaborate here on several aspects.

  • A ResNet unit is made of two 1D CNNs with batch normalization and the ReLU activation function.
  • The shortcut is introduced here by directly adding the input data to the processed data after passing the CNN layers. The shortcut connection helps to ease the training of the \"deeper\" model [arXiv:1512.03385]. Note that a trivial linear transformation is applied (self.conv_sc) if the feature dimensions of the input and output data do not match.
import torch.nn as nn\n\n\nclass ResNetUnit(nn.Module):\n    r\"\"\"Parameters\n    ----------\n    in_channels : int\n        Number of channels in the input vectors.\n    out_channels : int\n        Number of channels in the output vectors.\n    strides: tuple\n        Strides of the two convolutional layers, in the form of (stride0, stride1)\n    \"\"\"\n\n    def __init__(self, in_channels, out_channels, strides=(1,1), **kwargs):\n\n        super(ResNetUnit, self).__init__(**kwargs)\n        self.conv1 = nn.Conv1d(in_channels, out_channels, kernel_size=3, stride=strides[0], padding=1)\n        self.bn1 = nn.BatchNorm1d(out_channels)\n        self.conv2 = nn.Conv1d(out_channels, out_channels, kernel_size=3, stride=strides[1], padding=1)\n        self.bn2 = nn.BatchNorm1d(out_channels)\n        self.relu = nn.ReLU()\n        self.dim_match = True\n        if in_channels != out_channels or strides != (1,1): # dimensions do not match\n            self.dim_match = False\n            self.conv_sc = nn.Conv1d(in_channels, out_channels, kernel_size=1, stride=strides[0]*strides[1], bias=False)\n\n    def forward(self, x):\n        identity = x\n        x = self.conv1(x)\n        x = self.bn1(x)\n        x = self.relu(x)\n        x = self.conv2(x)\n        x = self.bn2(x)\n        x = self.relu(x)\n        # print('resnet unit', identity.shape, x.shape, self.dim_match)\n        if self.dim_match:\n            return identity + x\n        else:\n            return self.conv_sc(identity) + x\n
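For the sum of the shortcut and the main path to be valid, both branches must produce the same sequence length. The sketch below checks this with the standard output-length formula of a 1D convolution (dilation 1), using the kernel sizes, paddings and strides from the ResNetUnit above; the two helper functions are hypothetical and only do the arithmetic.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0):
    """Output length of a 1D convolution with dilation 1."""
    return (length + 2 * padding - kernel_size) // stride + 1


def resnet_unit_lengths(length, strides=(1, 1)):
    """Sequence length after the main path and after the shortcut of a ResNetUnit."""
    main = conv1d_out_len(length, kernel_size=3, stride=strides[0], padding=1)
    main = conv1d_out_len(main, kernel_size=3, stride=strides[1], padding=1)
    # self.conv_sc: kernel_size=1, no padding, stride = strides[0] * strides[1]
    shortcut = conv1d_out_len(length, kernel_size=1, stride=strides[0] * strides[1])
    return main, shortcut


print(resnet_unit_lengths(100, (1, 1)))  # (100, 100)
print(resnet_unit_lengths(100, (2, 1)))  # (50, 50)
print(resnet_unit_lengths(50, (2, 1)))   # (25, 25)
```

With kernel_size=3 and padding=1 the length is preserved at stride 1, so only the strided units shorten the sequence.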

With the ResNet unit, we construct the DeepAK8 model. The model hyperparameters are chosen as follows.

conv_params = [(32,), (64, 64), (64, 64), (128, 128)]\nfc_params = [(512, 0.2)]\n
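How conv_params translates into individual ResNet units can be worked out without PyTorch. The sketch below replays the index arithmetic of the loop in the ResNet constructor (shown in the implementation that follows) and lists the channels and strides of each unit; the unit_plan list is introduced here only for illustration.

```python
conv_params = [(32,), (64, 64), (64, 64), (128, 128)]

unit_plan = []
for i in range(len(conv_params) - 1):          # one stage per tuple after the first
    for j in range(len(conv_params[i + 1])):   # one ResNet unit per entry in the tuple
        if j == 0:
            in_ch, out_ch = conv_params[i][-1], conv_params[i + 1][0]
        else:
            in_ch, out_ch = conv_params[i + 1][j - 1], conv_params[i + 1][j]
        # every stage except the first starts with a strided unit
        strides = (2, 1) if (j == 0 and i > 0) else (1, 1)
        unit_plan.append((i, j, in_ch, out_ch, strides))

for unit in unit_plan:
    print("stage %d, unit %d: %d -> %d channels, strides %s" % unit)
```

The first tuple, (32,), only sets the output width of the initial feature convolution; the remaining three tuples give three stages of two units each.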

DeepAK8 model implementation

See top_tagging/networks/deepak8_pf.py. Note that the main architecture is a PyTorch re-implementation of the code here, which is based on MXNet.

class ResNet(nn.Module):\n    r\"\"\"Parameters\n    ----------\n    features_dims : int\n        Input feature dimensions.\n    num_classes : int\n        Number of output classes.\n    conv_params : list\n        List of the convolution layer parameters.\n        The first element is a tuple of size 1, defining the transformed feature size for the initial feature convolution layer.\n        The following are tuples of feature size for multiple stages of the ResNet units. Each number defines an individual ResNet unit.\n    fc_params: list\n        List of fully connected layer parameters after all ResNet units, each element in the format of\n        (n_feat, drop_rate)\n    \"\"\"\n\n    def __init__(self, features_dims, num_classes,\n                conv_params=[(32,), (64, 64), (64, 64), (128, 128)],\n                fc_params=[(512, 0.2)],\n                **kwargs):\n\n        super(ResNet, self).__init__(**kwargs)\n        self.conv_params = conv_params\n        self.num_stages = len(conv_params) - 1\n        self.fts_conv = nn.Sequential(nn.Conv1d(in_channels=features_dims, out_channels=conv_params[0][0], kernel_size=3, stride=1, padding=1),\n                                    nn.BatchNorm1d(conv_params[0][0]),\n                                    nn.ReLU())\n\n        # define ResNet units for each stage. 
Each unit is composed of a sequence of ResNetUnit block\n        self.resnet_units = nn.ModuleDict()\n        for i in range(self.num_stages):\n            # stack units[i] layers in this stage\n            unit_layers = []\n            for j in range(len(conv_params[i + 1])):\n                in_channels, out_channels = (conv_params[i][-1], conv_params[i + 1][0]) if j == 0 \\\n                                            else (conv_params[i + 1][j - 1], conv_params[i + 1][j])\n                strides = (2, 1) if (j == 0 and i > 0) else (1, 1)\n                unit_layers.append(ResNetUnit(in_channels, out_channels, strides))\n\n            self.resnet_units.add_module('resnet_unit_%d' % i, nn.Sequential(*unit_layers))\n\n        # define fully connected layers\n        fcs = []\n        for idx, layer_param in enumerate(fc_params):\n            channels, drop_rate = layer_param\n            in_chn = conv_params[-1][-1] if idx == 0 else fc_params[idx - 1][0]\n            fcs.append(nn.Sequential(nn.Linear(in_chn, channels), nn.ReLU(), nn.Dropout(drop_rate)))\n        fcs.append(nn.Linear(fc_params[-1][0], num_classes))\n        self.fc = nn.Sequential(*fcs)\n\n    def forward(self, x):\n        # x: the feature vector, (N, C, P)\n        x = self.fts_conv(x)\n        for i in range(self.num_stages):\n            x = self.resnet_units['resnet_unit_%d' % i](x) # (N, C', P'), P'<P due to kernal_size>1 or stride>1\n\n        # global average pooling\n        x = x.sum(dim=-1) / x.shape[-1] # (N, C')\n        # fully connected\n        x = self.fc(x) # (N, out_chn)\n        return x\n\n\ndef get_model(data_config, **kwargs):\n    conv_params = [(32,), (64, 64), (64, 64), (128, 128)]\n    fc_params = [(512, 0.2)]\n\n    pf_features_dims = len(data_config.input_dicts['pf_features'])\n    num_classes = len(data_config.label_value)\n    model = ResNet(pf_features_dims, num_classes,\n                conv_params=conv_params,\n                fc_params=fc_params)\n\n    
model_info = {\n        'input_names':list(data_config.input_names),\n        'input_shapes':{k:((1,) + s[1:]) for k, s in data_config.input_shapes.items()},\n        'output_names':['softmax'],\n        'dynamic_axes':{**{k:{0:'N', 2:'n_' + k.split('_')[0]} for k in data_config.input_names}, **{'softmax':{0:'N'}}},\n        }\n\n    print(model, model_info)\n    print(data_config.input_shapes)\n    return model, model_info\n\n\ndef get_loss(data_config, **kwargs):\n    return torch.nn.CrossEntropyLoss()\n
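To make the meaning of conv_params concrete, the sketch below replays the two loops of the ResNet constructor above in plain Python and lists, for every ResNetUnit, the channel sizes, the strides, and the sequence length after the unit. The helper name resnet_unit_plan is hypothetical, and the input length of 100 particles is an assumption taken from `length: 100` in the data card, not something the model itself fixes.

```python
def resnet_unit_plan(conv_params, seq_len):
    """Return (stage, in_channels, out_channels, strides, out_len) for every ResNetUnit.

    Mirrors the loops in ResNet.__init__; seq_len is the assumed number of
    particles P entering the first stage.
    """
    plan = []
    for i in range(len(conv_params) - 1):
        for j in range(len(conv_params[i + 1])):
            if j == 0:
                in_ch, out_ch = conv_params[i][-1], conv_params[i + 1][0]
            else:
                in_ch, out_ch = conv_params[i + 1][j - 1], conv_params[i + 1][j]
            strides = (2, 1) if (j == 0 and i > 0) else (1, 1)
            for s in strides:
                # Conv1d with kernel_size=3 and padding=1: L_out = (L - 1) // stride + 1
                seq_len = (seq_len - 1) // s + 1
            plan.append((i, in_ch, out_ch, strides, seq_len))
    return plan


conv_params = [(32,), (64, 64), (64, 64), (128, 128)]
for row in resnet_unit_plan(conv_params, seq_len=100):
    print(row)
```

Under these assumptions the first stage keeps all 100 positions, while the first unit of each later stage halves the length (100 to 50 to 25), which is why the shortcut of those units needs the strided 1x1 convolution conv_sc.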

The output below shows the full structure of the DeepAK8 model based on 1D CNN with ResNet. It is printed by PyTorch and you will see it in the Weaver output during training.

The full-scale structure of the DeepAK8 architecture
ResNet(\n  |0.349 M, 100.000% Params, 0.012 GMac, 100.000% MACs|\n  (fts_conv): Sequential(\n    |0.0 M, 0.137% Params, 0.0 GMac, 0.427% MACs|\n    (0): Conv1d(4, 32, kernel_size=(3,), stride=(1,), padding=(1,), |0.0 M, 0.119% Params, 0.0 GMac, 0.347% MACs|)\n    (1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.018% Params, 0.0 GMac, 0.053% MACs|)\n    (2): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.027% MACs|)\n  )\n  (resnet_units): ModuleDict(\n    |0.282 M, 80.652% Params, 0.012 GMac, 99.010% MACs|\n    (resnet_unit_0): Sequential(\n      |0.046 M, 13.124% Params, 0.005 GMac, 38.409% MACs|\n      (0): ResNetUnit(\n        |0.021 M, 5.976% Params, 0.002 GMac, 17.497% MACs|\n        (conv1): Conv1d(32, 64, kernel_size=(3,), stride=(1,), padding=(1,), |0.006 M, 1.778% Params, 0.001 GMac, 5.175% MACs|)\n        (bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.037% Params, 0.0 GMac, 0.107% MACs|)\n        (conv2): Conv1d(64, 64, kernel_size=(3,), stride=(1,), padding=(1,), |0.012 M, 3.538% Params, 0.001 GMac, 10.296% MACs|)\n        (bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.037% Params, 0.0 GMac, 0.107% MACs|)\n        (relu): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.107% MACs|)\n        (conv_sc): Conv1d(32, 64, kernel_size=(1,), stride=(1,), bias=False, |0.002 M, 0.587% Params, 0.0 GMac, 1.707% MACs|)\n      )\n      (1): ResNetUnit(\n        |0.025 M, 7.149% Params, 0.003 GMac, 20.912% MACs|\n        (conv1): Conv1d(64, 64, kernel_size=(3,), stride=(1,), padding=(1,), |0.012 M, 3.538% Params, 0.001 GMac, 10.296% MACs|)\n        (bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.037% Params, 0.0 GMac, 0.107% MACs|)\n        (conv2): Conv1d(64, 64, kernel_size=(3,), stride=(1,), padding=(1,), |0.012 M, 3.538% Params, 0.001 GMac, 10.296% MACs|)\n        (bn2): BatchNorm1d(64, 
eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.037% Params, 0.0 GMac, 0.107% MACs|)\n        (relu): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.107% MACs|)\n      )\n    )\n    (resnet_unit_1): Sequential(\n      |0.054 M, 15.471% Params, 0.003 GMac, 22.619% MACs|\n      (0): ResNetUnit(\n        |0.029 M, 8.322% Params, 0.001 GMac, 12.163% MACs|\n        (conv1): Conv1d(64, 64, kernel_size=(3,), stride=(2,), padding=(1,), |0.012 M, 3.538% Params, 0.001 GMac, 5.148% MACs|)\n        (bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.037% Params, 0.0 GMac, 0.053% MACs|)\n        (conv2): Conv1d(64, 64, kernel_size=(3,), stride=(1,), padding=(1,), |0.012 M, 3.538% Params, 0.001 GMac, 5.148% MACs|)\n        (bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.037% Params, 0.0 GMac, 0.053% MACs|)\n        (relu): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.053% MACs|)\n        (conv_sc): Conv1d(64, 64, kernel_size=(1,), stride=(2,), bias=False, |0.004 M, 1.173% Params, 0.0 GMac, 1.707% MACs|)\n      )\n      (1): ResNetUnit(\n        |0.025 M, 7.149% Params, 0.001 GMac, 10.456% MACs|\n        (conv1): Conv1d(64, 64, kernel_size=(3,), stride=(1,), padding=(1,), |0.012 M, 3.538% Params, 0.001 GMac, 5.148% MACs|)\n        (bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.037% Params, 0.0 GMac, 0.053% MACs|)\n        (conv2): Conv1d(64, 64, kernel_size=(3,), stride=(1,), padding=(1,), |0.012 M, 3.538% Params, 0.001 GMac, 5.148% MACs|)\n        (bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.037% Params, 0.0 GMac, 0.053% MACs|)\n        (relu): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.053% MACs|)\n      )\n    )\n    (resnet_unit_2): Sequential(\n      |0.182 M, 52.057% Params, 0.005 GMac, 37.982% MACs|\n      (0): ResNetUnit(\n        |0.083 M, 23.682% Params, 0.002 GMac, 
17.284% MACs|\n        (conv1): Conv1d(64, 128, kernel_size=(3,), stride=(2,), padding=(1,), |0.025 M, 7.075% Params, 0.001 GMac, 5.148% MACs|)\n        (bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.073% Params, 0.0 GMac, 0.053% MACs|)\n        (conv2): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=(1,), |0.049 M, 14.114% Params, 0.001 GMac, 10.269% MACs|)\n        (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.073% Params, 0.0 GMac, 0.053% MACs|)\n        (relu): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.053% MACs|)\n        (conv_sc): Conv1d(64, 128, kernel_size=(1,), stride=(2,), bias=False, |0.008 M, 2.346% Params, 0.0 GMac, 1.707% MACs|)\n      )\n      (1): ResNetUnit(\n        |0.099 M, 28.375% Params, 0.002 GMac, 20.698% MACs|\n        (conv1): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=(1,), |0.049 M, 14.114% Params, 0.001 GMac, 10.269% MACs|)\n        (bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.073% Params, 0.0 GMac, 0.053% MACs|)\n        (conv2): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=(1,), |0.049 M, 14.114% Params, 0.001 GMac, 10.269% MACs|)\n        (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.073% Params, 0.0 GMac, 0.053% MACs|)\n        (relu): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.053% MACs|)\n      )\n    )\n  )\n  (fc): Sequential(\n    |0.067 M, 19.210% Params, 0.0 GMac, 0.563% MACs|\n    (0): Sequential(\n      |0.066 M, 18.917% Params, 0.0 GMac, 0.555% MACs|\n      (0): Linear(in_features=128, out_features=512, bias=True, |0.066 M, 18.917% Params, 0.0 GMac, 0.551% MACs|)\n      (1): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.004% MACs|)\n      (2): Dropout(p=0.2, inplace=False, |0.0 M, 0.000% Params, 0.0 GMac, 0.000% MACs|)\n    )\n    (1): Linear(in_features=512, out_features=2, bias=True, |0.001 M, 
0.294% Params, 0.0 GMac, 0.009% MACs|)\n  )\n)\n
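The parameter count in the printout can be cross-checked by hand. The following sketch is not part of Weaver; it only uses the standard counting rules for Conv1d, BatchNorm1d and Linear layers, with the layer sizes read off the printout above. All helper names are hypothetical.

```python
def conv1d_params(c_in, c_out, kernel, bias=True):
    return c_in * c_out * kernel + (c_out if bias else 0)


def batchnorm_params(channels):
    return 2 * channels  # learnable scale and shift per channel


def linear_params(n_in, n_out):
    return n_in * n_out + n_out


def resnet_unit_params(c_in, c_out, needs_shortcut):
    total = conv1d_params(c_in, c_out, 3) + batchnorm_params(c_out)
    total += conv1d_params(c_out, c_out, 3) + batchnorm_params(c_out)
    if needs_shortcut:
        # conv_sc: 1x1 convolution without bias
        total += conv1d_params(c_in, c_out, 1, bias=False)
    return total


# (in_channels, out_channels, needs_shortcut), read off the printout above
units = [(32, 64, True), (64, 64, False),
         (64, 64, True), (64, 64, False),
         (64, 128, True), (128, 128, False)]

fts_conv = conv1d_params(4, 32, 3) + batchnorm_params(32)
resnet = sum(resnet_unit_params(*u) for u in units)
fc = linear_params(128, 512) + linear_params(512, 2)
total = fts_conv + resnet + fc
print(f"{total} parameters = {total / 1e6:.3f} M")
```

The sum comes out at 349,154 parameters, consistent with the 0.349 M reported in the first line of the printout, with about 80% of them in the ResNet units.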

The data card is the same as the MLP case, shown in top_tagging/data/pf_features.yaml.

The full architecture of the ParticleNet model, which is based on DGCNN and EdgeConv.

Note

The ParticleNet model applied to the CMS analysis is provided in weaver/networks/particle_net_pf_sv.py, and the data card in weaver/data/ak15_points_pf_sv.yaml. Here we use a similar configuration card to deal with the benchmark task.

We will elaborate on the ParticleNet model and focus more on the technical side in this section. The model is defined in top_tagging/networks/particlenet_pf.py, but it imports its building blocks, in particular the EdgeConv block, from weaver/utils/nn/model/ParticleNet.py. The EdgeConv block is illustrated in the cartoon.

Illustration of the EdgeConv block

From an EdgeConv block's point of view, it requires two classes of features as input: the "coordinates" and the "features". Both are per-point properties, stored as 2D arrays with dimensions (C, P), where C is the feature size (it can differ between "coordinates" and "features", marked as C_pts and C_fts in the following code) and P is the number of points. The block outputs the new features that the model learns, also as a 2D array, with dimensions (C_fts_out, P).

What happens inside the EdgeConv block? And how is the output feature vector derived from the input features using the topology of the point cloud? The answer is encoded in the edge convolution (EdgeConv).

The edge convolution is an analogue of the convolution operation, defined on a point cloud whose shape is given by the "coordinates" of the points. Specifically, the input "coordinates" provide a view of the spatial relations of the points in Euclidean space. They determine the k nearest neighbours of each point, which guide the update of that point's feature vector: for each point, the updated feature vector is based on the current state of the point and its k neighbours. Following this idea, all features of the point cloud form a 3D tensor with dimensions (C, P, K), where C is the per-point feature size (e.g., η, φ, pT, ...), P is the number of points, and K is the k-NN number. This structured tensor is linearly transformed by applying a 2D convolution with a 1×1 kernel along the feature dimension C. This helps to aggregate the feature information and exploit the correlations of each point with its adjacent points. A shortcut connection, inspired by ResNet, is also introduced.

Note

The feature dimension C after exploring the k neighbours of each point is actually twice the initial feature dimension. A new set of features is constructed by subtracting the features a point carries from the features each of its k neighbours carries (namely x_{i_j} - x_i for point i and j = 1, ..., k), and these differences are used together with the point's own features x_i, as in the EdgeConv formula quoted in the code docstring. This way, the correlations of each point with its neighbours are well captured.

The following shows how the EdgeConv structure is implemented in the code.
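As an illustration of the doubling, here is a minimal NumPy sketch of the neighbour search and of the edge-feature construction for a single jet. It is a stand-in for the PyTorch helpers used by Weaver (knn and get_graph_feature), not their actual implementation; the function names, the toy sizes, and the ordering of the two halves along the channel axis are assumptions.

```python
import numpy as np


def knn_indices(points, k):
    """points: (C_pts, P). Returns (P, K) indices of the k nearest neighbours, self excluded."""
    diff = points[:, :, None] - points[:, None, :]   # (C_pts, P, P)
    dist2 = (diff ** 2).sum(axis=0)                   # (P, P) squared Euclidean distances
    np.fill_diagonal(dist2, np.inf)                   # a point is not its own neighbour
    return np.argsort(dist2, axis=1)[:, :k]           # (P, K)


def edge_features(features, idx):
    """features: (C_fts, P), idx: (P, K). Returns (2 * C_fts, P, K)."""
    neighbours = features[:, idx]                                     # (C_fts, P, K)
    centre = np.repeat(features[:, :, None], idx.shape[1], axis=2)    # (C_fts, P, K)
    # first half: the point's own features, second half: neighbour minus point
    return np.concatenate([centre, neighbours - centre], axis=0)


rng = np.random.default_rng(0)
points = rng.normal(size=(2, 10))     # toy (eta, phi) "coordinates", P = 10
features = rng.normal(size=(4, 10))   # toy per-point "features", C_fts = 4
idx = knn_indices(points, k=3)
x = edge_features(features, idx)
print(x.shape)  # (8, 10, 3)
```

The channel axis has length 2 * C_fts = 8, which is why the first convolution of an EdgeConv block takes 2 * in_feat input channels.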

EdgeConv block implementation

See weaver/utils/nn/model/ParticleNet.py, or the following code block annotated with more comments. We elaborate here on several aspects.

  • The EdgeConvBlock takes the feature dimensions in_feat and out_feats. in_feat is the C_fts introduced above; out_feats is a list with one output size per convolution layer, the last of which is C_fts_out.
  • The inputs to forward() are the "coordinates" and "features" tensors, with dimensions (N, C_pts, P) and (N, C_fts, P) respectively, as introduced above. The first dimension is the mini-batch size.
  • self.get_graph_feature() helps to aggregate the k nearest neighbours of each point. The resulting tensor has dimensions (N, C_fts(0), P, K) as discussed above, K being the k-NN number. Note that C_fts(0) is twice the original input feature dimension C_fts, as mentioned above.
  • After convolutions, the per-point features are merged by taking the mean of all k-nearest neighbouring vectors:
    fts = x.mean(dim=-1)  # (N, C, P)\n
class EdgeConvBlock(nn.Module):\nr\"\"\"EdgeConv layer.\n    Introduced in \"`Dynamic Graph CNN for Learning on Point Clouds\n    <https://arxiv.org/pdf/1801.07829>`__\".  Can be described as follows:\n    .. math::\n    x_i^{(l+1)} = \\max_{j \\in \\mathcal{N}(i)} \\mathrm{ReLU}(\n    \\Theta \\cdot (x_j^{(l)} - x_i^{(l)}) + \\Phi \\cdot x_i^{(l)})\n    where :math:`\\mathcal{N}(i)` is the neighbor of :math:`i`.\n    Parameters\n    ----------\n    in_feat : int\n        Input feature size.\n    out_feat : int\n        Output feature size.\n    batch_norm : bool\n        Whether to include batch normalization on messages.\n    \"\"\"\n\n    def __init__(self, k, in_feat, out_feats, batch_norm=True, activation=True, cpu_mode=False):\n        super(EdgeConvBlock, self).__init__()\n        self.k = k\n        self.batch_norm = batch_norm\n        self.activation = activation\n        self.num_layers = len(out_feats)\n        self.get_graph_feature = get_graph_feature_v2 if cpu_mode else get_graph_feature_v1\n\n        self.convs = nn.ModuleList()\n        for i in range(self.num_layers):\n            self.convs.append(nn.Conv2d(2 * in_feat if i == 0 else out_feats[i - 1], out_feats[i], kernel_size=1, bias=False if self.batch_norm else True))\n\n        if batch_norm:\n            self.bns = nn.ModuleList()\n            for i in range(self.num_layers):\n                self.bns.append(nn.BatchNorm2d(out_feats[i]))\n\n        if activation:\n            self.acts = nn.ModuleList()\n            for i in range(self.num_layers):\n                self.acts.append(nn.ReLU())\n\n        if in_feat == out_feats[-1]:\n            self.sc = None\n        else:\n            self.sc = nn.Conv1d(in_feat, out_feats[-1], kernel_size=1, bias=False)\n            self.sc_bn = nn.BatchNorm1d(out_feats[-1])\n\n        if activation:\n            self.sc_act = nn.ReLU()\n\n    def forward(self, points, features):\n        # points:   (N, C_pts, P)\n        # features: (N, C_fts, P)\n      
  # N: batch size, C: feature size per point, P: number of points\n\n        topk_indices = knn(points, self.k) # (N, P, K)\n        x = self.get_graph_feature(features, self.k, topk_indices) # (N, C_fts(0), P, K)\n\n        for conv, bn, act in zip(self.convs, self.bns, self.acts):\n            x = conv(x)  # (N, C', P, K)\n            if bn:\n                x = bn(x)\n            if act:\n                x = act(x)\n\n        fts = x.mean(dim=-1)  # (N, C, P)\n\n        # shortcut\n        if self.sc:\n            sc = self.sc(features)  # (N, C_out, P)\n            sc = self.sc_bn(sc)\n        else:\n            sc = features\n\n        return self.sc_act(sc + fts)  # (N, C_out, P)\n

With the EdgeConv architecture as the building block, the ParticleNet model is constructed as follows.

The ParticleNet model stacks three EdgeConv blocks to construct higher-level features and pass them through the pipeline. The points (i.e., in our case, the particle candidates inside a jet) do not change, but the per-point "coordinates" and "features" vectors change, in both values and dimensions.

For the first EdgeConv block, the "coordinates" only include the relative η and φ values of each particle. The "features" are a vector with a standard length of 32, linearly transformed from the initial feature vector, whose components include the relative η and φ, the log of pT, etc. The first EdgeConv block outputs a per-point feature vector of length 64, which serves as both the "coordinates" and the "features" of the next EdgeConv block. That is to say, the next k-NN search is performed in the 64-dimensional learned feature space, to capture the new relations between points learned by the model. This is visualized by the input/output arrows showing the data flow of the model. This architecture illustrates the stackability of the EdgeConv block and is at the core of the Dynamic Graph CNN (DGCNN), as the model can dynamically change the correlations of each point based on learnable features.

A fusion technique is also used: the three EdgeConv output vectors are concatenated (their dimensions add up) to form the output vector, instead of using only the last EdgeConv output. This is another form of shortcut connection that helps to ease the training of a complex and deep convolutional network.

The concatenated per-point vectors are then averaged over the points to produce a single 1D vector for the whole point cloud. This vector passes through one fully connected layer, with a dropout rate of p=0.1 to prevent overfitting. Then, in our example, the full network outputs two scores after a softmax, one for each class (top vs. QCD), matching the one-hot encoding of the labels.
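The channel bookkeeping of the stacked blocks and the fusion step can be traced with a few lines of plain Python. This is only a sketch: the helper name is hypothetical, the conv_params are those of this tutorial's model card (three blocks with 64, 128 and 256 output channels), and the rounding rule for the fusion output is copied from the ParticleNet constructor shown later in this section.

```python
def particlenet_channels(input_dims, conv_params):
    """Trace channel sizes through the EdgeConv stack and the fusion block."""
    rows = []
    in_feat = input_dims
    for k, channels in conv_params:
        first_conv_in = 2 * in_feat          # own features plus neighbour differences
        rows.append((k, in_feat, first_conv_in, channels[-1]))
        in_feat = channels[-1]               # the block output feeds the next block
    fused_in = sum(channels[-1] for _, channels in conv_params)
    # same rule as np.clip((in_chn // 128) * 128, 128, 1024) in ParticleNet.__init__
    fused_out = min(max((fused_in // 128) * 128, 128), 1024)
    return rows, fused_in, fused_out


conv_params = [(16, (64, 64, 64)), (16, (128, 128, 128)), (16, (256, 256, 256))]
rows, fused_in, fused_out = particlenet_channels(32, conv_params)
for k, in_feat, first_conv_in, out_feat in rows:
    print(f"k={k}: in_feat={in_feat}, first conv input={first_conv_in}, output={out_feat}")
print(f"fusion: {fused_in} -> {fused_out}")
```

With these settings the concatenated vector has 64 + 128 + 256 = 448 channels per point, which the fusion block maps to 384.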
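Because jets are zero-padded to a fixed number of points, averaging over points is not the same as taking a plain mean over the padded axis. The toy example below, in plain Python with hypothetical helper names, contrasts the two; the first variant corresponds in spirit to the use_counts option of the implementation shown later, which divides by the number of real points.

```python
def masked_mean(features, mask):
    """features: C rows with P values each; mask: P flags (1 = real particle, 0 = padding)."""
    count = max(sum(mask), 1)   # guard against empty jets, as the model does
    return [sum(v * m for v, m in zip(row, mask)) / count for row in features]


def padded_mean(features):
    """Plain mean over all P slots, padding included."""
    return [sum(row) / len(row) for row in features]


features = [[2.0, 4.0, 0.0, 0.0],   # channel 0; the last two slots are zero padding
            [1.0, 3.0, 0.0, 0.0]]   # channel 1
mask = [1, 1, 0, 0]
print(masked_mean(features, mask))  # [3.0, 2.0]
print(padded_mean(features))        # [1.5, 1.0]
```

Dividing by the real count keeps the pooled vector independent of how much padding a jet carries.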

The ParticleNet implementation is shown below.

ParticleNet model implementation

See weaver/utils/nn/model/ParticleNet.py, or the following code block annotated with more comments. We elaborate here on several main points.

  • The stack of multiple EdgeConv blocks is implemented in
    for idx, conv in enumerate(self.edge_convs):\n    pts = (points if idx == 0 else fts) + coord_shift\n    fts = conv(pts, fts) * mask\n
  • The multiple EdgeConv layer parameters are given by conv_params, which takes a list of tuples, each tuple in the format of (K, (C1, C2, C3)). K for the k-NN number, C1,2,3 for convolution feature sizes of three layers in an EdgeConv block.
  • The fully connected layer parameters are given by fc_params, which takes a list of tuples, each tuple in the format of (n_feat, drop_rate).
class ParticleNet(nn.Module):\nr\"\"\"Parameters\n    ----------\n    input_dims : int\n        Input feature dimensions (C_fts).\n    num_classes : int\n        Number of output classes.\n    conv_params : list\n        List of convolution parameters of EdgeConv blocks, each element in the format of (K, (C1, C2, C3)).\n        K for the kNN number, C1,2,3 for convolution feature sizes of three layers in an EdgeConv block.\n    fc_params: list\n        List of fully connected layer parameters after all EdgeConv blocks, each element in the format of\n        (n_feat, drop_rate)\n    use_fusion: bool\n        If true, concatenates all output features from each EdgeConv before the fully connected layer.\n    use_fts_bn: bool\n        If true, applies a batch norm before feeding to the EdgeConv block.\n    use_counts: bool\n        If true, uses the real count of points instead of the padded size (the max point size).\n    for_inference: bool\n        Whether this is an inference routine. If true, applies a softmax to the output.\n    for_segmentation: bool\n        Whether the model is set up for the point cloud segmentation (instead of classification) task. 
If true,\n        does not merge the features after the last EdgeConv, and apply Conv1D instead of the linear layer.\n        The output is hence each output_features per point, instead of output_features.\n    \"\"\"\n\n\n    def __init__(self,\n                input_dims,\n                num_classes,\n                conv_params=[(7, (32, 32, 32)), (7, (64, 64, 64))],\n                fc_params=[(128, 0.1)],\n                use_fusion=True,\n                use_fts_bn=True,\n                use_counts=True,\n                for_inference=False,\n                for_segmentation=False,\n                **kwargs):\n        super(ParticleNet, self).__init__(**kwargs)\n\n        self.use_fts_bn = use_fts_bn\n        if self.use_fts_bn:\n            self.bn_fts = nn.BatchNorm1d(input_dims)\n\n        self.use_counts = use_counts\n\n        self.edge_convs = nn.ModuleList()\n        for idx, layer_param in enumerate(conv_params):\n            k, channels = layer_param\n            in_feat = input_dims if idx == 0 else conv_params[idx - 1][1][-1]\n            self.edge_convs.append(EdgeConvBlock(k=k, in_feat=in_feat, out_feats=channels, cpu_mode=for_inference))\n\n        self.use_fusion = use_fusion\n        if self.use_fusion:\n            in_chn = sum(x[-1] for _, x in conv_params)\n            out_chn = np.clip((in_chn // 128) * 128, 128, 1024)\n            self.fusion_block = nn.Sequential(nn.Conv1d(in_chn, out_chn, kernel_size=1, bias=False), nn.BatchNorm1d(out_chn), nn.ReLU())\n\n        self.for_segmentation = for_segmentation\n\n        fcs = []\n        for idx, layer_param in enumerate(fc_params):\n            channels, drop_rate = layer_param\n            if idx == 0:\n                in_chn = out_chn if self.use_fusion else conv_params[-1][1][-1]\n            else:\n                in_chn = fc_params[idx - 1][0]\n            if self.for_segmentation:\n                fcs.append(nn.Sequential(nn.Conv1d(in_chn, channels, kernel_size=1, bias=False),\n        
                                nn.BatchNorm1d(channels), nn.ReLU(), nn.Dropout(drop_rate)))\n            else:\n                fcs.append(nn.Sequential(nn.Linear(in_chn, channels), nn.ReLU(), nn.Dropout(drop_rate)))\n        if self.for_segmentation:\n            fcs.append(nn.Conv1d(fc_params[-1][0], num_classes, kernel_size=1))\n        else:\n            fcs.append(nn.Linear(fc_params[-1][0], num_classes))\n        self.fc = nn.Sequential(*fcs)\n\n        self.for_inference = for_inference\n\n    def forward(self, points, features, mask=None):\n#         print('points:\\n', points)\n#         print('features:\\n', features)\n        if mask is None:\n            mask = (features.abs().sum(dim=1, keepdim=True) != 0)  # (N, 1, P)\n        points *= mask\n        features *= mask\n        coord_shift = (mask == 0) * 1e9\n        if self.use_counts:\n            counts = mask.float().sum(dim=-1)\n            counts = torch.max(counts, torch.ones_like(counts))  # >=1\n\n        if self.use_fts_bn:\n            fts = self.bn_fts(features) * mask\n        else:\n            fts = features\n        outputs = []\n        for idx, conv in enumerate(self.edge_convs):\n            pts = (points if idx == 0 else fts) + coord_shift\n            fts = conv(pts, fts) * mask\n            if self.use_fusion:\n                outputs.append(fts)\n        if self.use_fusion:\n            fts = self.fusion_block(torch.cat(outputs, dim=1)) * mask\n\n#         assert(((fts.abs().sum(dim=1, keepdim=True) != 0).float() - mask.float()).abs().sum().item() == 0)\n\n        if self.for_segmentation:\n            x = fts\n        else:\n            if self.use_counts:\n                x = fts.sum(dim=-1) / counts  # divide by the real counts\n            else:\n                x = fts.mean(dim=-1)\n\n        output = self.fc(x)\n        if self.for_inference:\n            output = torch.softmax(output, dim=1)\n        # print('output:\\n', output)\n        return output\n
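One detail of forward() that is easy to miss is coord_shift, which adds 1e9 to the coordinates of padded slots before the neighbour search. The purpose, as far as can be inferred from the code, is to keep zero-padded slots from being selected as neighbours of real particles. The plain-Python toy below illustrates the effect; the helper nearest is hypothetical and uses a brute-force search rather than the knn function of Weaver.

```python
import math


def nearest(points, mask, query, k, shift=0.0):
    """Indices of the k points closest to points[query]; padded slots are moved by `shift`."""
    shifted = [tuple(c + (shift if m == 0 else 0.0) for c in p)
               for p, m in zip(points, mask)]
    order = sorted((j for j in range(len(points)) if j != query),
                   key=lambda j: math.dist(shifted[query], shifted[j]))
    return order[:k]


# two real particles followed by two zero-padded slots
points = [(0.1, 0.0), (0.8, 0.5), (0.0, 0.0), (0.0, 0.0)]
mask = [1, 1, 0, 0]
print(nearest(points, mask, query=0, k=1))             # [2]: a padded slot wins
print(nearest(points, mask, query=0, k=1, shift=1e9))  # [1]: the other real particle
```

Without the shift, a padded slot sitting at the origin is closer to the first particle than the only other real particle; with the shift, padded slots are pushed far away from every real point.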

The above encapsulates all ParticleNet building blocks. Finally, the model is defined in the model card top_tagging/networks/particlenet_pf.py, in the ParticleNetTagger1Path class, whose name indicates that we only use the ParticleNet pipeline that deals with one point cloud (i.e., the particle candidates).

Info

The CMS application uses two point clouds, namely the particle-flow candidates and the secondary vertices. This requires special handling to merge the clouds before feeding them to the first EdgeConv layer.

ParticleNet model config

Also see top_tagging/networks/particlenet_pf.py.

import torch\nimport torch.nn as nn\nfrom utils.nn.model.ParticleNet import ParticleNet, FeatureConv\n\n\nclass ParticleNetTagger1Path(nn.Module):\n\n    def __init__(self,\n                pf_features_dims,\n                num_classes,\n                conv_params=[(7, (32, 32, 32)), (7, (64, 64, 64))],\n                fc_params=[(128, 0.1)],\n                use_fusion=True,\n                use_fts_bn=True,\n                use_counts=True,\n                pf_input_dropout=None,\n                for_inference=False,\n                **kwargs):\n        super(ParticleNetTagger1Path, self).__init__(**kwargs)\n        self.pf_input_dropout = nn.Dropout(pf_input_dropout) if pf_input_dropout else None\n        self.pf_conv = FeatureConv(pf_features_dims, 32)\n        self.pn = ParticleNet(input_dims=32,\n                            num_classes=num_classes,\n                            conv_params=conv_params,\n                            fc_params=fc_params,\n                            use_fusion=use_fusion,\n                            use_fts_bn=use_fts_bn,\n                            use_counts=use_counts,\n                            for_inference=for_inference)\n\n    def forward(self, pf_points, pf_features, pf_mask):\n        if self.pf_input_dropout:\n            pf_mask = (self.pf_input_dropout(pf_mask) != 0).float()\n            pf_points *= pf_mask\n            pf_features *= pf_mask\n\n        return self.pn(pf_points, self.pf_conv(pf_features * pf_mask) * pf_mask, pf_mask)\n\n\ndef get_model(data_config, **kwargs):\n    conv_params = [\n        (16, (64, 64, 64)),\n        (16, (128, 128, 128)),\n        (16, (256, 256, 256)),\n        ]\n    fc_params = [(256, 0.1)]\n    use_fusion = True\n\n    pf_features_dims = len(data_config.input_dicts['pf_features'])\n    num_classes = len(data_config.label_value)\n    model = ParticleNetTagger1Path(pf_features_dims, num_classes,\n                            conv_params, fc_params,\n                          
  use_fusion=use_fusion,\n                            use_fts_bn=kwargs.get('use_fts_bn', False),\n                            use_counts=kwargs.get('use_counts', True),\n                            pf_input_dropout=kwargs.get('pf_input_dropout', None),\n                            for_inference=kwargs.get('for_inference', False)\n                            )\n    model_info = {\n        'input_names':list(data_config.input_names),\n        'input_shapes':{k:((1,) + s[1:]) for k, s in data_config.input_shapes.items()},\n        'output_names':['softmax'],\n        'dynamic_axes':{**{k:{0:'N', 2:'n_' + k.split('_')[0]} for k in data_config.input_names}, **{'softmax':{0:'N'}}},\n        }\n\n    print(model, model_info)\n    print(data_config.input_shapes)\n    return model, model_info\n\n\ndef get_loss(data_config, **kwargs):\n    return torch.nn.CrossEntropyLoss()\n

The most important parameters are conv_params and fc_params, which determine the configuration of the EdgeConv blocks and the fully connected layer. See the details in the "ParticleNet model implementation" box above.

conv_params = [\n    (16, (64, 64, 64)),\n    (16, (128, 128, 128)),\n    (16, (256, 256, 256)),\n    ]\nfc_params = [(256, 0.1)]\n

A full structure printed from PyTorch is shown below. It will appear in the Weaver output during training.

ParticleNet full-scale structure
ParticleNetTagger1Path(\n  |0.577 M, 100.000% Params, 0.441 GMac, 100.000% MACs|\n  (pf_conv): FeatureConv(\n    |0.0 M, 0.035% Params, 0.0 GMac, 0.005% MACs|\n    (conv): Sequential(\n      |0.0 M, 0.035% Params, 0.0 GMac, 0.005% MACs|\n      (0): BatchNorm1d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.001% Params, 0.0 GMac, 0.000% MACs|)\n      (1): Conv1d(4, 32, kernel_size=(1,), stride=(1,), bias=False, |0.0 M, 0.022% Params, 0.0 GMac, 0.003% MACs|)\n      (2): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.011% Params, 0.0 GMac, 0.001% MACs|)\n      (3): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.001% MACs|)\n    )\n  )\n  (pn): ParticleNet(\n    |0.577 M, 99.965% Params, 0.441 GMac, 99.995% MACs|\n    (edge_convs): ModuleList(\n      |0.305 M, 52.823% Params, 0.424 GMac, 96.047% MACs|\n      (0): EdgeConvBlock(\n        |0.015 M, 2.575% Params, 0.021 GMac, 4.716% MACs|\n        (convs): ModuleList(\n          |0.012 M, 2.131% Params, 0.02 GMac, 4.456% MACs|\n          (0): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False, |0.004 M, 0.710% Params, 0.007 GMac, 1.485% MACs|)\n          (1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False, |0.004 M, 0.710% Params, 0.007 GMac, 1.485% MACs|)\n          (2): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False, |0.004 M, 0.710% Params, 0.007 GMac, 1.485% MACs|)\n        )\n        (bns): ModuleList(\n          |0.0 M, 0.067% Params, 0.001 GMac, 0.139% MACs|\n          (0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.022% Params, 0.0 GMac, 0.046% MACs|)\n          (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.022% Params, 0.0 GMac, 0.046% MACs|)\n          (2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.022% Params, 0.0 GMac, 0.046% MACs|)\n        )\n        (acts): 
ModuleList(\n          |0.0 M, 0.000% Params, 0.0 GMac, 0.070% MACs|\n          (0): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.023% MACs|)\n          (1): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.023% MACs|)\n          (2): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.023% MACs|)\n        )\n        (sc): Conv1d(32, 64, kernel_size=(1,), stride=(1,), bias=False, |0.002 M, 0.355% Params, 0.0 GMac, 0.046% MACs|)\n        (sc_bn): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.022% Params, 0.0 GMac, 0.003% MACs|)\n        (sc_act): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.001% MACs|)\n      )\n      (1): EdgeConvBlock(\n        |0.058 M, 10.121% Params, 0.081 GMac, 18.437% MACs|\n        (convs): ModuleList(\n          |0.049 M, 8.523% Params, 0.079 GMac, 17.825% MACs|\n          (0): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1), bias=False, |0.016 M, 2.841% Params, 0.026 GMac, 5.942% MACs|)\n          (1): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1), bias=False, |0.016 M, 2.841% Params, 0.026 GMac, 5.942% MACs|)\n          (2): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1), bias=False, |0.016 M, 2.841% Params, 0.026 GMac, 5.942% MACs|)\n        )\n        (bns): ModuleList(\n          |0.001 M, 0.133% Params, 0.001 GMac, 0.279% MACs|\n          (0): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.044% Params, 0.0 GMac, 0.093% MACs|)\n          (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.044% Params, 0.0 GMac, 0.093% MACs|)\n          (2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.044% Params, 0.0 GMac, 0.093% MACs|)\n        )\n        (acts): ModuleList(\n          |0.0 M, 0.000% Params, 0.001 GMac, 0.139% MACs|\n          (0): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.046% MACs|)\n          (1): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.046% MACs|)\n          (2): ReLU(|0.0 M, 
0.000% Params, 0.0 GMac, 0.046% MACs|)\n        )\n        (sc): Conv1d(64, 128, kernel_size=(1,), stride=(1,), bias=False, |0.008 M, 1.420% Params, 0.001 GMac, 0.186% MACs|)\n        (sc_bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.044% Params, 0.0 GMac, 0.006% MACs|)\n        (sc_act): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.003% MACs|)\n      )\n      (2): EdgeConvBlock(\n        |0.231 M, 40.128% Params, 0.322 GMac, 72.894% MACs|\n        (convs): ModuleList(\n          |0.197 M, 34.091% Params, 0.315 GMac, 71.299% MACs|\n          (0): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False, |0.066 M, 11.364% Params, 0.105 GMac, 23.766% MACs|)\n          (1): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False, |0.066 M, 11.364% Params, 0.105 GMac, 23.766% MACs|)\n          (2): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False, |0.066 M, 11.364% Params, 0.105 GMac, 23.766% MACs|)\n        )\n        (bns): ModuleList(\n          |0.002 M, 0.266% Params, 0.002 GMac, 0.557% MACs|\n          (0): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.001 M, 0.089% Params, 0.001 GMac, 0.186% MACs|)\n          (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.001 M, 0.089% Params, 0.001 GMac, 0.186% MACs|)\n          (2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.001 M, 0.089% Params, 0.001 GMac, 0.186% MACs|)\n        )\n        (acts): ModuleList(\n          |0.0 M, 0.000% Params, 0.001 GMac, 0.279% MACs|\n          (0): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.093% MACs|)\n          (1): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.093% MACs|)\n          (2): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.093% MACs|)\n        )\n        (sc): Conv1d(128, 256, kernel_size=(1,), stride=(1,), bias=False, |0.033 M, 5.682% Params, 0.003 GMac, 0.743% MACs|)\n        (sc_bn): BatchNorm1d(256, 
eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.001 M, 0.089% Params, 0.0 GMac, 0.012% MACs|)\n        (sc_act): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.006% MACs|)\n      )\n    )\n    (fusion_block): Sequential(\n      |0.173 M, 29.963% Params, 0.017 GMac, 3.925% MACs|\n      (0): Conv1d(448, 384, kernel_size=(1,), stride=(1,), bias=False, |0.172 M, 29.830% Params, 0.017 GMac, 3.899% MACs|)\n      (1): BatchNorm1d(384, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.001 M, 0.133% Params, 0.0 GMac, 0.017% MACs|)\n      (2): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.009% MACs|)\n    )\n    (fc): Sequential(\n      |0.099 M, 17.179% Params, 0.0 GMac, 0.023% MACs|\n      (0): Sequential(\n        |0.099 M, 17.090% Params, 0.0 GMac, 0.022% MACs|\n        (0): Linear(in_features=384, out_features=256, bias=True, |0.099 M, 17.090% Params, 0.0 GMac, 0.022% MACs|)\n        (1): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.000% MACs|)\n        (2): Dropout(p=0.1, inplace=False, |0.0 M, 0.000% Params, 0.0 GMac, 0.000% MACs|)\n      )\n      (1): Linear(in_features=256, out_features=2, bias=True, |0.001 M, 0.089% Params, 0.0 GMac, 0.000% MACs|)\n    )\n  )\n)\n

The data card is shown in top_tagging/data/pf_points_features.yaml, given in a similar way as in the MLP example. Here we group the inputs into three classes: pf_points, pf_features and pf_mask. They correspond to the forward(self, pf_points, pf_features, pf_mask) prototype of our nn.Module model, and these 2D arrays are passed in, stacked to the mini-batch size, for each iteration during training/prediction.

ParticleNet data config top_tagging/data/pf_points_features.yaml

See top_tagging/data/pf_points_features.yaml.

selection:\n### use `&`, `|`, `~` for logical operations on numpy arrays\n### can use functions from `math`, `np` (numpy), and `awkward` in the expression\n\nnew_variables:\n### [format] name: formula\n### can use functions from `math`, `np` (numpy), and `awkward` in the expression\npf_mask: awkward.JaggedArray.ones_like(Part_E)\nis_bkg: np.logical_not(is_signal_new)\n\npreprocess:\n### method: [manual, auto] - whether to use manually specified parameters for variable standardization\nmethod: manual\n### data_fraction: fraction of events to use when calculating the mean/scale for the standardization\ndata_fraction:\n\ninputs:\npf_points:\nlength: 100\nvars:\n- Part_Etarel\n- Part_Phirel\npf_features:\nlength: 100\nvars:\n### [format 1]: var_name (no transformation)\n### [format 2]: [var_name,\n###              subtract_by(optional, default=None, no transf. if preprocess.method=manual, auto transf. if preprocess.method=auto),\n###              multiply_by(optional, default=1),\n###              clip_min(optional, default=-5),\n###              clip_max(optional, default=5),\n###              pad_value(optional, default=0)]\n- Part_Etarel\n- Part_Phirel\n- [Part_E_log, 2, 1]\n- [Part_P_log, 2, 1]\npf_mask:\nlength: 100\nvars:\n- pf_mask\n\nlabels:\n### type can be `simple`, `custom`\n### [option 1] use `simple` for binary/multi-class classification, then `value` is a list of 0-1 labels\ntype: simple\nvalue: [\nis_signal_new, is_bkg\n]\n### [option 2] otherwise use `custom` to define the label, then `value` is a map\n# type: custom\n# value:\n# target_mass: np.where(fj_isQCD, fj_genjet_sdmass, fj_gen_mass)\n\nobservers:\n- origIdx\n- idx\n- Part_E_tot\n- Part_PX_tot\n- Part_PY_tot\n- Part_PZ_tot\n- Part_P_tot\n- Part_Eta_tot\n- Part_Phi_tot\n\n# weights:\n### [option 1] use precomputed weights stored in the input files\n# use_precomputed_weights: true\n# weight_branches: [weight, class_weight]\n### [option 2] compute weights on-the-fly using reweighting histograms\n

Now we have walked through the detailed description of three networks in their architecture as well as their implementations in Weaver.

Before ending this section, we summarize the three networks on their (1) model and data configuration cards, (2) the number of parameters, and (3) computational complexity in the following table. Note that we'll refer to the shell variables provided here in the following training example.

| Model | ${PREFIX} | ${MODEL_CONFIG} | ${DATA_CONFIG} | Parameters | Computational complexity |
| --- | --- | --- | --- | --- | --- |
| MLP | mlp | mlp_pf.py | pf_features.yaml | 739k | 0.001 GMac |
| DeepAK8 (1D CNN) | deepak8 | deepak8_pf.py | pf_features.yaml | 349k | 0.012 GMac |
| ParticleNet (DGCNN) | particlenet | particlenet_pf.py | pf_points_features.yaml | 577k | 0.441 GMac |
"},{"location":"inference/particlenet.html#2-start-training","title":"2. Start training!","text":"

Now we train the three neural networks based on the provided model and data configurations.

Here we present three ways of training. Readers who have a local machine with CUDA GPUs should try training on the local GPUs. Readers who would like to train on CPUs can also follow the local GPU instructions. It is also possible to borrow GPU resources from lxplus HTCondor or CMS Connect. Please pick the option below that matches your situation.

Train on local GPUs | Use GPUs on lxplus HTCondor | Use GPUs on CMS Connect

The three networks can be trained with a universal script. Enter the weaver base folder and run the following command. Note that ${DATA_CONFIG}, ${MODEL_CONFIG}, and ${PREFIX} refer to the values in the above table for each example, and the placeholder path should be replaced with the correct one.

PREFIX='<prefix-from-table>'\nMODEL_CONFIG='<model-config-from-table>'\nDATA_CONFIG='<data-config-from-table>'\nPATH_TO_SAMPLES='<your-path-to-samples>'\n\npython train.py \\\n --data-train ${PATH_TO_SAMPLES}'/prep/top_train_*.root' \\\n --data-val ${PATH_TO_SAMPLES}'/prep/top_val_*.root' \\\n --fetch-by-file --fetch-step 1 --num-workers 3 \\\n --data-config top_tagging/data/${DATA_CONFIG} \\\n --network-config top_tagging/networks/${MODEL_CONFIG} \\\n --model-prefix output/${PREFIX} \\\n --gpus 0,1 --batch-size 1024 --start-lr 5e-3 --num-epochs 20 --optimizer ranger \\\n --log output/${PREFIX}.train.log\n

Here --gpus 0,1 selects the GPUs with device IDs 0 and 1. For training on CPUs, please use --gpus ''.

A detailed description of the training command can be found in Weaver README. Below we will note a few more caveats about the data loading options, though the specific settings will depend on the specifics of the input data.

Caveats on the data loading options

Our goal in data loading is to guarantee that every mini-batch contains an even mixture of the different labels, even though the labels are not necessarily stored evenly in the files. We also need to ensure that the on-the-fly loading and preprocessing of data is smooth and does not become a bottleneck of the data delivery pipeline. The total amount of loaded data also needs to be controlled so as not to exhaust the available memory. The following guidelines should be used to choose the best options for your use case:

  • In the default case, a small proportion of the data is loaded from every input file per fetch step, set by --fetch-step (default is 0.01). This suits the case where we have multiple classes of input, each class having multiple files (e.g., it suits real CMS applications, where we may have multiple nano_i.root files for different input classes). The strategy gathers all pieces per fetch step from all input files, shuffles them, and presents the data we need in each regular mini-batch. One can also append --num-workers n, with n being the number of parallel workers used to load the data.
  • --fetch-step 1 --num-workers 1. This strategy helps when we have few input files and the labels are not evenly distributed within them. In the extreme case, we only have one file, with all data at the top being one class (signal) and data at the bottom being another class (background), or we have two or more files, each containing a specific class. With this option, --fetch-step 1 guarantees that all the data in the file is loaded and participates in the shuffle. Therefore all classes are safely mixed before being sent to the mini-batch. --num-workers 1 means we only use one worker that takes care of all files, to avoid inconsistent loading speeds among multiple workers (depending on CPUs). This strategy can further be combined with --in-memory so that all data are kept permanently in memory and will not be reloaded every epoch. --fetch-by-file is the option we can use when all input files have a similar structure. See Weaver README:

An alternative approach is the \"file-based\" strategy, which can be enabled with --fetch-by-files. This approach will instead read all events from every file for each step, and it will read m input files (m is set by --fetch-step) before mixing and shuffling the loaded events. This strategy is more suitable when each input file is already a mixture of all types of events (e.g., pre-processed with NNTools), otherwise it may lead to suboptimal training performance. However, a higher data loading speed can generally be achieved with this approach.

Please note that you can test whether all data classes are well mixed by printing the truth labels in each mini-batch. Also, remember to test whether data are loaded just in time by monitoring the GPU performance: if switching the data loading strategy improves the GPU efficiency, it means the previous data loader was the bottleneck in the pipeline that delivers and uses the data.
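The label-mixing check described above can be sketched in a few lines. This is only an illustration on plain Python lists: the helper names `label_fractions` and `is_well_mixed`, the 0/1 label encoding, and the tolerance value are assumptions made here, not part of Weaver.

```python
from collections import Counter


def label_fractions(batch_labels):
    """Return the fraction of each truth label found in one mini-batch."""
    counts = Counter(batch_labels)
    total = len(batch_labels)
    return {label: n / total for label, n in counts.items()}


def is_well_mixed(batch_labels, expected, tolerance=0.1):
    """True if every expected class fraction is matched within `tolerance`."""
    fractions = label_fractions(batch_labels)
    return all(
        abs(fractions.get(label, 0.0) - frac) <= tolerance
        for label, frac in expected.items()
    )


# Toy mini-batches (hypothetical): 1 = signal, 0 = background
mixed_batch = [1, 0, 1, 0, 0, 1, 1, 0]
sorted_batch = [1, 1, 1, 1, 1, 1, 1, 1]  # what an unshuffled file could deliver

print(label_fractions(mixed_batch))
print(is_well_mixed(mixed_batch, {0: 0.5, 1: 0.5}))
print(is_well_mixed(sorted_batch, {0: 0.5, 1: 0.5}))
```

In a real training loop one would apply the same check to the label tensor of a handful of mini-batches; a batch that fails it suggests that the fetch strategy should be revisited.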

After training, we predict the score on the test datasets using the best model:

PREFIX='<prefix-from-table>'\nMODEL_CONFIG='<model-config-from-table>'\nDATA_CONFIG='<data-config-from-table>'\nPATH_TO_SAMPLES='<your-path-to-samples>'\n\npython train.py --predict \\\n --data-test ${PATH_TO_SAMPLES}'/prep/top_test_*.root' \\\n --num-workers 3 \\\n --data-config top_tagging/data/${DATA_CONFIG} \\\n --network-config top_tagging/networks/${MODEL_CONFIG} \\\n --model-prefix output/${PREFIX}_best_epoch_state.pt \\\n --gpus 0,1 --batch-size 1024 \\\n --predict-output output/${PREFIX}_predict.root\n

On lxplus HTCondor, the GPU(s) can be booked via the request_gpus argument. To get familiar with the GPU service, please refer to the documentation here.

While it is not possible to test the script locally, you can try out the condor_ssh_to_job command to connect to the remote condor machine that runs the jobs. This interesting feature will help you with debugging or monitoring the condor job.

Here we provide an example executable script and an example condor submit file for the training and prediction tasks. Create the following two files:

The executable: run.sh

As before, please remember to specify ${DATA_CONFIG}, ${MODEL_CONFIG}, and ${PREFIX} as shown in the above table, and replace the placeholder path with the correct one.

#!/bin/bash\n\nPREFIX=$1\nMODEL_CONFIG=$2\nDATA_CONFIG=$3\nPATH_TO_SAMPLES=$4\nWORKDIR=`pwd`\n\n# Download miniconda\nwget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda_install.sh\nbash miniconda_install.sh -b -p ${WORKDIR}/miniconda\nexport PATH=$WORKDIR/miniconda/bin:$PATH\npip install numpy pandas scikit-learn scipy matplotlib tqdm PyYAML\npip install uproot3 awkward0 lz4 xxhash\npip install tables\npip install onnxruntime-gpu\npip install tensorboard\npip install torch\n\n# CUDA environment setup\nexport PATH=$PATH:/usr/local/cuda-10.2/bin\nexport LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.2/lib64\nexport LIBRARY_PATH=$LIBRARY_PATH:/usr/local/cuda-10.2/lib64\n\n# Clone weaver-benchmark\ngit clone --recursive https://github.com/colizz/weaver-benchmark.git\nln -s ../top_tagging weaver-benchmark/weaver/top_tagging\ncd weaver-benchmark/weaver/\nmkdir output\n\n# Training, using 1 GPU\npython train.py \\\n--data-train ${PATH_TO_SAMPLES}'/prep/top_train_*.root' \\\n--data-val ${PATH_TO_SAMPLES}'/prep/top_val_*.root' \\\n--fetch-by-file --fetch-step 1 --num-workers 3 \\\n--data-config top_tagging/data/${DATA_CONFIG} \\\n--network-config top_tagging/networks/${MODEL_CONFIG} \\\n--model-prefix output/${PREFIX} \\\n--gpus 0 --batch-size 1024 --start-lr 5e-3 --num-epochs 20 --optimizer ranger \\\n--log output/${PREFIX}.train.log\n\n# Predicting score, using 1 GPU\npython train.py --predict \\\n--data-test ${PATH_TO_SAMPLES}'/prep/top_test_*.root' \\\n--num-workers 3 \\\n--data-config top_tagging/data/${DATA_CONFIG} \\\n--network-config top_tagging/networks/${MODEL_CONFIG} \\\n--model-prefix output/${PREFIX}_best_epoch_state.pt \\\n--gpus 0 --batch-size 1024 \\\n--predict-output output/${PREFIX}_predict.root\n\n[ -d \"runs/\" ] && tar -caf output.tar output/ runs/ || tar -caf output.tar output/\n

HTCondor submit file: submit.sub

Modify the arguments line. These are the bash variables PREFIX, MODEL_CONFIG, DATA_CONFIG, and PATH_TO_SAMPLES used in the Weaver command. Since the EOS directory is accessible across all condor nodes on lxplus, one may directly specify <your-path-to-samples> as the EOS path provided above. An example is shown in the commented line.

Universe                = vanilla\nexecutable              = run.sh\narguments               = <prefix> <model-config> <data-config> <your-path-to-samples>\n#arguments              = mlp mlp_pf.py pf_features.yaml /eos/user/c/coli/public/weaver-benchmark/top_tagging/samples\noutput                  = job.$(ClusterId).$(ProcId).out\nerror                   = job.$(ClusterId).$(ProcId).err\nlog                     = job.$(ClusterId).log\nshould_transfer_files   = YES\nwhen_to_transfer_output = ON_EXIT_OR_EVICT\ntransfer_output_files   = weaver-benchmark/weaver/output.tar\ntransfer_output_remaps  = \"output.tar = output.$(ClusterId).$(ProcId).tar\"\nrequest_GPUs = 1\nrequest_CPUs = 4\n+MaxRuntime = 604800\nqueue\n

Make the run.sh script executable, then submit the job.

chmod +x run.sh\ncondor_submit submit.sub\n
A tarball will be transferred back containing the weaver/output directory, where the trained models and the predicted ROOT file are stored.

CMS Connect provides several GPU nodes. One can request to run GPU condor jobs in a similar way as on lxplus; please refer to the link: https://ci-connect.atlassian.net/wiki/spaces/CMS/pages/80117822/Requesting+GPUs

As the EOS user space may not be accessed from the remote node launched by CMS Connect, one may consider either (1) migrating the input files by condor, or (2) using XRootD to transfer the input file from EOS space to the condor node, before running the Weaver train command.

"},{"location":"inference/particlenet.html#3-evaluation-of-models","title":"3. Evaluation of models","text":"

In the output folder, we find the trained PyTorch models saved after every epoch and the log file that records the loss and accuracy during training.

The predict step also produces a predicted ROOT file in the output folder, including the truth label, the predicted score, and several observer variables we provided in the data card. With the predicted ROOT file, we make the ROC curve comparing the performance of the three trained models.

Here is the result from my training:

| Model | AUC | Accuracy | 1/eB (@eS=0.3) |
| --- | --- | --- | --- |
| MLP | 0.961 | 0.898 | 186 |
| DeepAK8 (1D CNN) | 0.979 | 0.927 | 585 |
| ParticleNet (DGCNN) | 0.984 | 0.936 | 1030 |

We see that the ParticleNet model shows an outstanding performance in this classification task. Besides, the DeepAK8 and ParticleNet results are similar to the benchmark values found in the gDoc. We note that the performance can be further improved with the following tricks:

  • Train an ensemble of models with different initial parameters. For each event/jet, take the final predicted score as the mean/median of the scores predicted by the individual models. This is a widely used ML technique to gain an extra few percent of improvement.
  • Use more input variables for training. We note that in the above training example, only four input variables are used instead of the full suite of input features used in the ParticleNet paper [arXiv:1902.08570]. Additional variables (e.g. ΔR or log(pT / pT(jet))) can be designed based on the given 4-momenta and, although in principle they provide redundant information, they can still help the network fully exploit the point cloud structure and thus discriminate better.
  • The fine-tuning of the model will also bring some performance gain. See details in the next section.
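The three figures of merit quoted in the table above (AUC, accuracy, and the background rejection 1/eB at a signal efficiency eS = 0.3) can be computed from the predicted scores along the following lines. This is a minimal sketch on plain Python lists with made-up toy scores: reading the scores out of the predicted ROOT file is left out, the function names are hypothetical, and the 0.5 threshold used for the accuracy is an assumption for the binary case.

```python
import math


def auc(sig, bkg):
    """Probability that a random signal jet scores above a random background jet."""
    wins = sum((s > b) + 0.5 * (s == b) for s in sig for b in bkg)
    return wins / (len(sig) * len(bkg))


def accuracy(sig, bkg, threshold=0.5):
    """Fraction of jets on the correct side of the score threshold."""
    correct = sum(s >= threshold for s in sig) + sum(b < threshold for b in bkg)
    return correct / (len(sig) + len(bkg))


def bkg_rejection(sig, bkg, sig_eff=0.3):
    """1/eB at the score threshold that keeps a fraction `sig_eff` of the signal."""
    # Small offset guards against floating-point round-up in the product
    n_keep = max(1, math.ceil(sig_eff * len(sig) - 1e-9))
    threshold = sorted(sig, reverse=True)[n_keep - 1]
    n_bkg_pass = sum(b >= threshold for b in bkg)
    return math.inf if n_bkg_pass == 0 else len(bkg) / n_bkg_pass


# Toy signal-class scores (not real results)
sig_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.85, 0.75, 0.3]
bkg_scores = [0.1, 0.2, 0.05, 0.35, 0.5, 0.92, 0.15, 0.25, 0.45, 0.65]

print(auc(sig_scores, bkg_scores))
print(accuracy(sig_scores, bkg_scores))
print(bkg_rejection(sig_scores, bkg_scores))
```

The pairwise AUC above scales quadratically with the number of jets, so for full test samples a library routine based on sorted scores would be the more practical choice.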
"},{"location":"inference/particlenet.html#tuning-the-particlenet-model","title":"Tuning the ParticleNet model","text":"

When it comes to the real application of any DNN model, tuning the hyperparameters is an important path towards better performance. In this section, we provide some tips on tuning the ParticleNet model. For a more detailed discussion on this topic, see the \"validation\" chapter in the documentation.

"},{"location":"inference/particlenet.html#1-choices-on-the-optimizer-and-the-learning-rate","title":"1. Choices on the optimizer and the learning rate","text":"

The optimizer decides how our neural network updates all its parameters, and the learning rate determines how fast the parameters change in one training iteration.

The learning rate is the most important hyperparameter to choose before the actual training is done. Here we quote from a suggested strategy: if you only have the opportunity to optimize one hyperparameter, choose the learning rate. The optimizer is also important because a wiser strategy usually means avoiding a zig-zagging update route, avoiding getting trapped in local minima, and even adopting different strategies for fast-changing and slow-changing parameters. Adam (and its several variations) is a widely used optimizer. Another recently developed advanced optimizer is Ranger, which combines RAdam and LookAhead. However, one should note that the few-percent-level improvement from using a different optimizer is likely to be smeared out by an unoptimized learning rate.

The above training scheme uses a starting learning rate of 5e-3 and Ranger as the optimizer. It uses a flat+decay scheduler: the LR starts to decay after 70% of the epochs have been processed, and gradually reduces to 0.01 of its original value as the training approaches the final epoch.
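To make the flat+decay shape concrete, the sketch below evaluates such a schedule epoch by epoch. It is an illustration only: the function name `flat_decay_lr` is hypothetical, and the exponential form of the decay and its per-epoch granularity are assumptions; the text above only fixes the 70% flat fraction and the final factor of 0.01, so Weaver's actual scheduler may differ in detail.

```python
def flat_decay_lr(epoch, num_epochs=20, start_lr=5e-3,
                  flat_frac=0.7, final_factor=0.01):
    """Learning rate for a 0-indexed epoch under a flat+decay schedule.

    The LR stays at `start_lr` for the first `flat_frac` of the epochs and is
    then decayed so that it reaches `final_factor * start_lr` in the last epoch.
    The exponential interpolation is an assumption of this sketch.
    """
    decay_start = round(num_epochs * flat_frac)
    if epoch < decay_start:
        return start_lr
    n_decay = num_epochs - decay_start
    progress = (epoch - decay_start + 1) / n_decay  # runs from 1/n_decay to 1
    return start_lr * final_factor ** progress


schedule = [flat_decay_lr(epoch) for epoch in range(20)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.2e}")
```

With the default values, the first 14 epochs run at the starting LR and the remaining 6 epochs decay towards 5e-5.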

First, we note that the current case is already well optimized. Therefore, by simply reusing the current choice, the training will in general converge to a stable result. But it is always good practice to test several choices of the optimizer and reoptimize the learning rate.

Weaver integrates multiple optimizers. In the above training command, we use --optimizer ranger to adopt the Ranger optimizer. It is also possible to switch to --optimizer adam or --optimizer adamW.

Weaver also provides the interface to optimize the learning rate before real training is performed. In the ParticleNet model training, we append

--lr-finder 5e-6,5e0,200\n
in the command, then a specific learning-rate finder program will be launched. This setup scans over the LR from 5e-6 to 5e0 by applying 200 mini-batches of training. It outputs a plot showing the training loss for different starting learning rates. In general, a lower training loss means a better choice of the learning rate parameter.

The results from the LR finder, obtained by specifying --lr-finder 5e-6,5e0,200, are shown below for the --optimizer adamW (left) and the --optimizer ranger (right) case.

The training loss forms a basin shape, which indicates that the optimal learning rate falls somewhere in the middle. We extract two observations from the plots. First, the basin covers a wide range, meaning that the LR finder only provides a rough estimate. It is nevertheless worthwhile to run the LR finder first to get an overall feeling. For the Ranger case (right figure), one can choose the range 1e-3 to 1e-2 and further determine the optimal learning rate by running the full training. Second, we should be aware that different optimizers have different optimal LR values. As can be seen here, AdamW in general requires a smaller LR than Ranger.

"},{"location":"inference/particlenet.html#2-visualize-the-training-with-tensorboard","title":"2. Visualize the training with TensorBoard","text":"

To monitor the full training/evaluation accuracy and the loss for each mini-batch, we can draw support from a nicely integrated utility, TensorBoard, to employ real-time monitoring. See the introduction page from PyTorch: https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html

To activate TensorBoard, append (replacing ${PREFIX} according to the above table)

--tensorboard ${PREFIX}\n
to the training command. The runs/ subfolder containing the TensorBoard monitoring log will appear in the Weaver directory (if you are launching condor jobs, the runs/ folder will be transferred back in the tarball). Then, one can run
tensorboard --logdir=runs\n
to start the TensorBoard service and go to URL http://localhost:6006 to view the TensorBoard dashboard.

The plots below show the training and evaluation loss for our standard choice of LR, 5e-3, and for a smaller LR of 2e-3 and a larger LR of 1e-2. Note that all tested LR values are within the basin in the LR finder plots.

We see in the evaluation loss plot that the standard LR outperforms the two alternative choices. The reason may be that a larger LR has difficulty converging to the global minimum, while a smaller LR may not be sufficient to reach the minimum within 20 epochs. Overall, we see 5e-3 as a good choice for the starting LR with the Ranger optimizer.

"},{"location":"inference/particlenet.html#3-optimize-the-model","title":"3. Optimize the model","text":"

In practice, tuning the model size is also an important task. Conceptually, a smaller model tends to have unsatisfactory performance due to its limited ability to learn many local features. As the model size goes up, the performance will climb to some extent, but may then decrease due to network \"degradation\" (deeper models have difficulty learning features). Besides, a heavier model may also overfit. In practice, it also leads to a longer inference time, which is the main concern in real applications.

For the ParticleNet model case, we also test between a smaller and larger variation of the model size. Recall that the original model is defined by the following layer parameters.

conv_params = [\n    (16, (64, 64, 64)),\n    (16, (128, 128, 128)),\n    (16, (256, 256, 256)),\n    ]\nfc_params = [(256, 0.1)]\n
We can replace the code block with
ec_k = kwargs.get('ec_k', 16)\nec_c1 = kwargs.get('ec_c1', 64)\nec_c2 = kwargs.get('ec_c2', 128)\nec_c3 = kwargs.get('ec_c3', 256)\nfc_c, fc_p = kwargs.get('fc_c', 256), kwargs.get('fc_p', 0.1)\nconv_params = [\n    (ec_k, (ec_c1, ec_c1, ec_c1)),\n    (ec_k, (ec_c2, ec_c2, ec_c2)),\n    (ec_k, (ec_c3, ec_c3, ec_c3)),\n    ]\nfc_params = [(fc_c, fc_p)]\n
Then we have the ability to tune the model parameters from the command line. Append the extra arguments in the training command
--network-option ec_k 32 --network-option ec_c1 128 --network-option ec_c2 192 --network-option ec_c3 256\n
and the model parameters will take the new values as specified.

We test over two cases, one with the above setting to enlarge the model, and another by using

--network-option ec_c1 64 --network-option ec_c2 64 --network-option ec_c3 96\n
to adopt a lite version.

The TensorBoard monitoring plots of the training/evaluation loss are shown as follows.

We see that the \"heavy\" model reaches even smaller training loss, meaning that the model does not meet the degradation issue yet. However, the evaluation loss is not catching up with the training loss, showing some degree of overtraining in this scheme. From the evaluation result, we see no improvement by moving to a heavy model.

"},{"location":"inference/particlenet.html#4-apply-preselection-and-class-weights","title":"4. Apply preselection and class weights","text":"

In HEP applications, it is sometimes required to train a multi-class classifier. While it is simple to specify the input classes in the label section of the Weaver data config, setting up the preselection and assigning suitable class weights for training is sometimes overlooked. With an unoptimized configuration, the trained model will not reach the best performance, although no error message will be raised.

Since our top tagging example is a binary classification problem, there is no specific need to configure the preselection and class weights. Below we summarize some experience that may be applicable to the reader's custom multi-class training task.

The preselection should be chosen such that every event passing the selection falls into one and only one category. In other words, events with no labels attached should not be kept, since they will confuse the training process.

Class weights (the class_weights option under weights in the data config) control the relative importance of input sample categories for training. Implementation-wise, they change the probability that events in a specific category are chosen as training input. Class weights come into effect when one trains a multi-class classifier. Taking the 3-class case (denoted as [A, B, C]) as an example, class_weights: [1, 1, 1] gives equal weights to all categories. Retraining with class_weights: [10, 1, 1] may result in better discriminating power for A vs. B or A vs. C, while the power to separate B from C will be weakened. As a trade-off between separating A vs. C and B vs. C, the class weights need to be deliberately tuned to achieve reasonable performance.

After the class weights are tuned, one can use another method to further factor out the interplay across categories, i.e., to define a \"binarized\" score between two classes only. Suppose the raw scores for the three classes are P(A), P(B), and P(C) (their sum should be 1), then one can define the discriminant P(BvsC) = P(B) / (P(B)+P(C)) to separate B vs. C. In this way, the separating power of B vs. C will remain unchanged for class_weights configured as either [1, 1, 1] or [10, 1, 1]. This strategy has been widely used in CMS to define composite tagger discriminants, which are applied on an analysis-by-analysis basis.
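The binarized discriminant defined above is a one-line computation. The sketch below applies it to two hand-picked score triplets; the function name `binarized_score` and the numbers are illustrative assumptions, not the output of a trained model.

```python
import math


def binarized_score(p_b, p_c):
    """P(BvsC) = P(B) / (P(B) + P(C)), built from two raw class scores."""
    total = p_b + p_c
    # Undefined when both scores vanish; return NaN rather than divide by zero
    return math.nan if total == 0 else p_b / total


# Two hypothetical outputs (P(A), P(B), P(C)) for the same jet, differing only
# in how much probability the model assigns to class A.
scores_equal_weights = (0.2, 0.6, 0.2)
scores_a_upweighted = (0.6, 0.3, 0.1)

for p_a, p_b, p_c in (scores_equal_weights, scores_a_upweighted):
    print(p_a, binarized_score(p_b, p_c))
```

In this idealised example the score of class A changes from 0.2 to 0.6 while the B-to-C ratio is preserved, so P(BvsC) comes out the same in both cases.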

Above, we have discussed in detail the various attempts we can make to optimize the model. We hope the practical experience presented here will help readers develop and deploy complex ML models.

"},{"location":"inference/performance.html","title":"Performance of inference tools","text":""},{"location":"inference/pyg.html","title":"PyTorch Geometric","text":"

Geometric deep learning (GDL) is an emerging field focused on applying machine learning (ML) techniques to non-Euclidean domains such as graphs, point clouds, and manifolds. The PyTorch Geometric (PyG) library extends PyTorch to include GDL functionality, for example classes necessary to handle data with irregular structure. PyG is introduced at a high level in Fast Graph Representation Learning with PyTorch Geometric and in detail in the PyG docs.

"},{"location":"inference/pyg.html#gdl-with-pyg","title":"GDL with PyG","text":"

A complete review of GDL is available in the following recently-published (and freely-available) textbook: Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. The authors specify several key GDL architectures including convolutional neural networks (CNNs) operating on grids, Deep Sets architectures operating on sets, and graph neural networks (GNNs) operating on graphs, collections of nodes connected by edges. PyG is focused in particular on graph-structured data, which naturally encompasses set-structured data. In fact, many state-of-the-art GNN architectures are implemented in PyG (see the docs)! A review of the landscape of GNN architectures is available in Graph Neural Networks: A Review of Methods and Applications.

"},{"location":"inference/pyg.html#the-data-class-pyg-graphs","title":"The Data Class: PyG Graphs","text":"

Graphs are data structures designed to encode data structured as a set of objects and relations. Objects are embedded as graph nodes \\(u\\in\\mathcal{V}\\), where \\(\\mathcal{V}\\) is the node set. Relations are represented by edges \\((i,j)\\in\\mathcal{E}\\) between nodes, where \\(\\mathcal{E}\\) is the edge set. Denote the sizes of the node and edge sets as \\(|\\mathcal{V}|=n_\\mathrm{nodes}\\) and \\(|\\mathcal{E}|=n_\\mathrm{edges}\\) respectively. The choice of edge connectivity determines the local structure of a graph, which has important downstream effects on graph-based learning algorithms. Graph construction is the process of embedding input data onto a graph structure. Graph-based learning algorithms are correspondingly imbued with a relational inductive bias based on the choice of graph representation. The simplest graph construction routine is to construct no edges, yielding a permutation invariant set of objects. On the other hand, fully-connected graphs connect every node-node pair with an edge, yielding \\(n_\\mathrm{edges}=n_\\mathrm{nodes}(n_\\mathrm{nodes}-1)/2\\) edges. This representation may be feasible for small inputs like particle clouds corresponding to a jet, but is intractable for large-scale applications such as high-pileup tracking datasets. Notably, dynamic graph construction techniques operate on input point clouds, constructing edges on them dynamically during inference. For example, EdgeConv and GravNet GNN layers dynamically construct edges between nodes projected into a latent space; multiple such layers may be applied in sequence, yielding many intermediate graph representations on an input point cloud.
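The quadratic growth of the fully-connected construction is easy to see with a small sketch. The code below enumerates the undirected node pairs in plain Python; the function name `fully_connected_edges` and the node counts are chosen purely for illustration and do not come from a specific dataset.

```python
from itertools import combinations


def fully_connected_edges(n_nodes):
    """List every unordered node pair (i, j) with i < j."""
    return list(combinations(range(n_nodes), 2))


# A jet-sized particle cloud versus larger, tracking-like inputs
for n_nodes in (4, 100, 5000):
    n_edges = len(fully_connected_edges(n_nodes))
    # Cross-check against the closed form n(n-1)/2
    assert n_edges == n_nodes * (n_nodes - 1) // 2
    print(n_nodes, n_edges)
```

A cloud of 100 nodes stays in the thousands of edges, whereas a few thousand nodes already produce more than ten million pairs, which is why large inputs call for sparser or dynamically built graphs.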

In general, nodes can have positions \\(\\{p_i\\}_{i=1}^{n_\\mathrm{nodes}}\\), \\(p_i\\in\\mathbb{R}^{n_\\mathrm{space\\_dim}}\\), and features (attributes) \\(\\{x_i\\}_{i=1}^{n_\\mathrm{nodes}}\\), \\(x_i\\in\\mathbb{R}^{n_\\mathrm{node\\_dim}}\\). In some applications like GNN-based particle tracking, node positions are taken to be the features. In others, e.g. jet identification, positional information may be used to seed dynamic graph construction while kinematic features are propagated as edge features. Edges, too, can have features \\(\\{e_{ij}\\}_{(i,j)\\in\\mathcal{E}}\\), \\(e_{ij}\\in\\mathbb{R}^{n_\\mathrm{edge\\_dim}}\\), but do not have positions; instead, edges are defined by the nodes they connect, and may therefore be represented by, for example, the distance between the respective node-node pair. In PyG, graphs are stored as instances of the data class, whose fields fully specify the graph:

  • data.x: node feature matrix, \\(X\\in\\mathbb{R}^{n_\\mathrm{nodes}\\times n_\\mathrm{node\\_dim}}\\)
  • data.edge_index: node indices at each end of each edge, \\(I\\in\\mathbb{R}^{2\\times n_\\mathrm{edges}}\\)
  • data.edge_attr: edge feature matrix, \\(E\\in\\mathbb{R}^{n_\\mathrm{edges}\\times n_\\mathrm{edge\\_dim}}\\)
  • data.y: training target with arbitrary shape (\\(y\\in\\mathbb{R}^{n_\\mathrm{nodes}\\times n_\\mathrm{out}}\\) for node-level targets, \\(y\\in\\mathbb{R}^{n_\\mathrm{edges}\\times n_\\mathrm{out}}\\) for edge-level targets or \\(y\\in\\mathbb{R}^{1\\times n_\\mathrm{out}}\\) for graph-level targets).
  • data.pos: Node position matrix, \\(P\\in\\mathbb{R}^{n_\\mathrm{nodes}\\times n_\\mathrm{space\\_dim}}\\)

The PyG Introduction By Example tutorial covers the basics of graph creation, batching, transformation, and inference using this data class.

As an example, consider the ZINC chemical compounds dataset, which is available as a built-in dataset in PyG:

from torch_geometric.datasets import ZINC\ntrain_dataset = ZINC(root='/tmp/ZINC', subset=True, split='train')\ntest_dataset =  ZINC(root='/tmp/ZINC', subset=True, split='test')\nlen(train_dataset)\n>>> 10000\nlen(test_dataset)\n>>> 1000   \n
Each graph in the dataset is a chemical compound; nodes are atoms and edges are chemical bonds. The node features x are categorical atom labels and the edge features edge_attr are categorical bond labels. The edge_index matrix lists all bonds present in the compound in COO format. The truth labels y indicate a synthetic computed property called constrained solubility; given a set of molecules represented as graphs, the task is to regress the constrained solubility. Therefore, this dataset is suitable for graph-level regression. Let's take a look at one molecule:

data = train_dataset[27]\ndata.x # node features\n>>> tensor([[0], [0], [1], [2], [0], \n            [0], [2], [0], [1], [2],\n            [4], [0], [0], [0], [0],\n            [4], [0], [0], [0], [0]])\n\ndata.pos # node positions \n>>> None\n\ndata.edge_index # COO edge indices\n>>> tensor([[ 0,  1,  1,  1,  2,  3,  3,  4,  4,  \n              5,  5,  6,  6,  7,  7,  7,  8,  9, \n              9, 10, 10, 10, 11, 11, 12, 12, 13, \n              13, 14, 14, 15, 15, 15, 16, 16, 16,\n              16, 17, 18, 19], # node indices w/ outgoing edges\n            [ 1,  0,  2,  3,  1,  1,  4,  3,  5,  \n              4,  6,  5,  7,  6,  8,  9,  7,  7,\n              10,  9, 11, 15, 10, 12, 11, 13, 12, \n              14, 13, 15, 10, 14, 16, 15, 17, 18,\n              19, 16, 16, 16]]) # node indices w/ incoming edges\n\ndata.edge_attr # edge features\n>>> tensor([1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, \n            1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1,\n            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \n            1, 1, 1, 1])\n\ndata.y # truth labels\n>>> tensor([-0.0972])\n\ndata.num_nodes\n>>> 20\n\ndata.num_edges\n>>> 40\n\ndata.num_node_features\n>>> 1 \n
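
To make the COO layout concrete, the sketch below rebuilds per-atom neighbor lists from the edge_index printed above, using plain Python lists instead of tensors. The helper name neighbors_from_coo is hypothetical and not part of PyG; it is only meant to illustrate how the two rows of edge_index pair up.

```python
from collections import defaultdict

# edge_index of train_dataset[27], copied from the printout above
src = [0, 1, 1, 1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 7, 8, 9, 9, 10,
       10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 15, 16, 16, 16,
       16, 17, 18, 19]
dst = [1, 0, 2, 3, 1, 1, 4, 3, 5, 4, 6, 5, 7, 6, 8, 9, 7, 7, 10, 9,
       11, 15, 10, 12, 11, 13, 12, 14, 13, 15, 10, 14, 16, 15, 17, 18,
       19, 16, 16, 16]

def neighbors_from_coo(src, dst):
    """Hypothetical helper: map each node to the sorted nodes it points to."""
    nbrs = defaultdict(list)
    for s, d in zip(src, dst):
        nbrs[s].append(d)
    return {node: sorted(targets) for node, targets in nbrs.items()}

neighbors = neighbors_from_coo(src, dst)
edges = set(zip(src, dst))
# each chemical bond is stored once per direction
is_symmetric = all((d, s) in edges for s, d in edges)
n_bonds = len(edges) // 2
print(neighbors[16], is_symmetric, n_bonds)
```

Because every bond appears in both directions, the 40 columns of edge_index describe 20 undirected bonds.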

We can load the full set of graphs onto an available GPU and create PyG dataloaders as follows:

import torch\nfrom torch_geometric.data import DataLoader\n\ndevice = 'cuda:0' if torch.cuda.is_available() else 'cpu'\ntest_dataset = [d.to(device) for d in test_dataset]\ntrain_dataset = [d.to(device) for d in train_dataset]\ntest_loader = DataLoader(test_dataset, batch_size=1, shuffle=False)\ntrain_loader = DataLoader(train_dataset, batch_size=1, shuffle=True)\n

"},{"location":"inference/pyg.html#the-message-passing-base-class-pyg-gnns","title":"The Message Passing Base Class: PyG GNNs","text":"

The 2017 paper Neural Message Passing for Quantum Chemistry presents a unified framework for a swath of GNN architectures known as message passing neural networks (MPNNs). MPNNs are GNNs whose feature updates are given by:

\\[x_i^{(k)} = \\gamma^{(k)} \\left(x_i^{(k-1)}, \\square_{j \\in \\mathcal{N}(i)} \\, \\phi^{(k)}\\left(x_i^{(k-1)}, x_j^{(k-1)},e_{ij}\\right) \\right)\\]

Here, \\(\\gamma\\) and \\(\\phi\\) are learnable functions (which we can approximate as multilayer perceptrons), \\(\\square\\) is a permutation-invariant function (e.g. mean, max, add), and \\(\\mathcal{N}(i)\\) is the neighborhood of node \\(i\\). In PyG, you'd write your own MPNN by using the MessagePassing base class, implementing each of the above mathematical objects as an explicit function.

  • MessagePassing.message() : define an explicit NN for \\(\\phi\\), use it to calculate \"messages\" between a node \\(x_i^{(k-1)}\\) and its neighbors \\(x_j^{(k-1)}\\), \\(j\\in\\mathcal{N}(i)\\), leveraging edge features \\(e_{ij}\\) if applicable
  • MessagePassing.propagate() : in this step, messages are calculated via the message function and aggregated across each receiving node; the keyword aggr (which can be 'add', 'max', or 'mean') is used to specify the specific permutation invariant function \\(\\square_{j\\in\\mathcal{N}(i)}\\) used for message aggregation.
  • MessagePassing.update() : the results of message passing are used to update the node features \\(x_i^{(k)}\\) through the \\(\\gamma\\) MLP

The specific implementations of message(), propagate(), and update() are up to the user. A specific example is available in the PyG Creating Message Passing Networks tutorial.
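
As a rough, framework-free illustration of the three steps above, the following sketch runs one mean-aggregation message-passing step on scalar node features. It does not use PyG; the function message_passing_step and the toy phi and gamma callables are assumptions standing in for the learnable MLPs.

```python
def message_passing_step(x, edges, edge_feats, phi, gamma):
    """One mean-aggregation message-passing step on scalar node features.

    x: list of node features; edges: list of (j, i) pairs meaning j -> i;
    edge_feats: one feature per edge; phi/gamma: plain callables standing
    in for the learnable MLPs.
    """
    inbox = {i: [] for i in range(len(x))}
    for (j, i), e in zip(edges, edge_feats):
        # message(): phi(x_i, x_j, e_ij) for each incoming edge
        inbox[i].append(phi(x[i], x[j], e))
    # propagate(): mean over incoming messages (0.0 for isolated nodes)
    aggregated = [sum(m) / len(m) if m else 0.0 for m in inbox.values()]
    # update(): gamma(x_i, aggregated messages)
    return [gamma(xi, ai) for xi, ai in zip(x, aggregated)]

# toy stand-ins for the MLPs (assumptions, chosen for easy checking)
phi = lambda x_i, x_j, e: x_j * e
gamma = lambda x_i, agg: x_i + agg

x = [1.0, 2.0, 3.0]
edges = [(0, 1), (1, 0), (2, 1)]  # 0 -> 1, 1 -> 0, 2 -> 1
edge_feats = [1.0, 1.0, 2.0]
x_new = message_passing_step(x, edges, edge_feats, phi, gamma)
print(x_new)
```

Swapping the mean for a sum or a max changes only the aggregation line, which is the role the aggr keyword plays in the MessagePassing base class.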

"},{"location":"inference/pyg.html#message-passing-with-zinc-data","title":"Message-Passing with ZINC Data","text":"

Returning to the ZINC molecular compound dataset, we can design a message-passing layer to aggregate messages across molecular graphs. Here, we'll define a multi-layer perceptron (MLP) class and use it to build a message passing layer (MPL) implementing the following equation:

\\[x_i' = \\gamma \\left(x_i, \\frac{1}{|\\mathcal{N}(i)|}\\sum_{j \\in \\mathcal{N}(i)} \\, \\phi\\left(\\left[x_i, x_j, e_{j,i}\\right]\\right) \\right)\\]

Here, the MLP dimensions are constrained. Since \\(x_i, e_{j,i}\\in\\mathbb{R}\\), the \\(\\phi\\) MLP must map \\(\\mathbb{R}^3\\) to \\(\\mathbb{R}^\\mathrm{message\\_size}\\). Similarly, \\(\\gamma\\) must map \\(\\mathbb{R}^{1+\\mathrm{message\\_size}}\\) to \\(\\mathbb{R}^\\mathrm{out}\\).

import torch\nimport torch.nn as nn\nfrom torch_geometric.nn import MessagePassing\n\nclass MLP(nn.Module):\n    def __init__(self, input_size, output_size):\n        super(MLP, self).__init__()\n\n        self.layers = nn.Sequential(\n            nn.Linear(input_size, 16),\n            nn.ReLU(),\n            nn.Linear(16, 16),\n            nn.ReLU(),\n            nn.Linear(16, output_size),\n        )\n\n    def forward(self, x):\n        return self.layers(x)\n\nclass MPLayer(MessagePassing):\n    def __init__(self, n_node_feats, n_edge_feats, message_size, output_size):\n        super(MPLayer, self).__init__(aggr='mean', \n                                      flow='source_to_target')\n        # phi acts on the concatenation [x_i, x_j, e_ji]\n        self.phi = MLP(2*n_node_feats + n_edge_feats, message_size)\n        # gamma acts on the concatenation [x_i, aggregated messages]\n        self.gamma = MLP(message_size + n_node_feats, output_size)\n\n    def forward(self, x, edge_index, edge_attr):\n        return self.propagate(edge_index, x=x, edge_attr=edge_attr)\n\n    def message(self, x_i, x_j, edge_attr):       \n        return self.phi(torch.cat([x_i, x_j, edge_attr], dim=1))\n\n    def update(self, aggr_out, x):\n        return self.gamma(torch.cat([x, aggr_out], dim=1))\n

Let's apply this layer to one of the ZINC molecules:

molecule = train_dataset[0]\nmolecule.x.shape\n>>> torch.Size([29, 1]) # 29 atoms and 1 feature (atom label)\nmpl = MPLayer(1, 1, 16, 8).to(device) # message_size = 16, output_size = 8\n# cast the categorical labels to float before feeding them to the MLPs\nxprime = mpl(molecule.x.float(), molecule.edge_index, molecule.edge_attr.float().unsqueeze(1))\nxprime.shape\n>>> torch.Size([29, 8]) # 29 atoms and 8 features\n
There we have it - the message passing layer has produced 8 new features for each atom.

"},{"location":"inference/pytorch.html","title":"PyTorch Inference","text":"

PyTorch is an open source ML library developed by Facebook's AI Research lab. Initially released in late 2016, PyTorch is a relatively new tool, but has become increasingly popular among ML researchers (in fact, some analyses suggest it's becoming more popular than TensorFlow in academic communities!). PyTorch is written in idiomatic Python, so its syntax is easy to parse for experienced Python programmers. Additionally, it is highly compatible with graphics processing units (GPUs), which can substantially accelerate many deep learning workflows. To date, PyTorch has not been integrated into CMSSW. Trained PyTorch models may be evaluated in CMSSW via ONNX Runtime, but model construction and training workflows must currently exist outside of CMSSW. Given the considerable interest in PyTorch within the HEP/ML community, we have reason to believe it will soon be available, so stay tuned!

"},{"location":"inference/pytorch.html#introductory-references","title":"Introductory References","text":"
  • PyTorch Install Guide
  • PyTorch Tutorials
  • LPC HATs: PyTorch
  • Deep Learning w/ PyTorch Course Repo
  • CODAS-HEP
"},{"location":"inference/pytorch.html#the-basics","title":"The Basics","text":"

The following documentation surrounds a set of code snippets designed to highlight some important ML features made available in PyTorch. In the following sections, we'll break down snippets from this script, highlighting specifically the PyTorch objects in it.

"},{"location":"inference/pytorch.html#tensors","title":"Tensors","text":"

The fundamental PyTorch object is the tensor. At a glance, tensors behave similarly to NumPy arrays. For example, they are broadcast, concatenated, and sliced in exactly the same way. The following examples highlight some common NumPy-like tensor transformations:

a = torch.randn(size=(2,2))\n>>> tensor([[ 1.3552, -0.0204],\n            [ 1.2677, -0.8926]])\na.view(-1, 1)\n>>> tensor([[ 1.3552],\n            [-0.0204],\n            [ 1.2677],\n            [-0.8926]])\na.transpose(0, 1)\n>>> tensor([[ 1.3552,  1.2677],\n            [-0.0204, -0.8926]])\na.unsqueeze(dim=0)\n>>> tensor([[[ 1.3552, -0.0204],\n             [ 1.2677, -0.8926]]])\na.squeeze(dim=0)\n>>> tensor([[ 1.3552, -0.0204],\n            [ 1.2677, -0.8926]])\n
Additionally, torch supports familiar matrix operations with various syntax options:
m1 = torch.randn(size=(2,3))\nm2 = torch.randn(size=(3,2))\nx = torch.randn(3)\n\nm1 @ m2 == m1.mm(m2) # matrix multiplication\n>>> tensor([[True, True],\n            [True, True]])\n\nm1 @ x == m1.mv(x) # matrix-vector multiplication\n>>> tensor([True, True])\n\nm1.t() == m1.transpose(0, 1) # matrix transpose\n>>> tensor([[True, True],\n            [True, True],\n            [True, True]])\n
Note that tensor.transpose(dim0, dim1) is a more general operation than tensor.t(). It is important to note that tensors have been \"upgraded\" from NumPy arrays in two key ways: 1) Tensors have native GPU support. If a GPU is available at runtime, tensors can be transferred from CPU to GPU, where computations such as matrix operations are substantially faster. Note that tensor operations must be performed on objects on the same device. PyTorch supports CUDA tensor types for GPU computation (see the PyTorch Cuda Semantics guide). 2) Tensors support automatic gradient (autograd) calculations, such that operations on tensors flagged with requires_grad=True are automatically tracked. The flow of tracked tensor operations defines a computation graph in which nodes are tensors and edges are functions mapping input tensors to output tensors. Gradients are calculated via autograd by walking backward through this computation graph and applying the chain rule (reverse-mode automatic differentiation, rather than numerical finite differences).

"},{"location":"inference/pytorch.html#gpu-support","title":"GPU Support","text":"

Tensors are created on the host CPU by default:

b = torch.zeros([2,3], dtype=torch.int32)\nb.device\n>>> cpu\n

You can also create tensors on any available GPUs:

torch.cuda.is_available() # check that a GPU is available\n>>> True \ncuda0 = torch.device('cuda:0')\nc = torch.ones([2,3], dtype=torch.int32, device=cuda0)\nc.device\n>>> cuda:0\n

You can also move tensors between devices:

b = b.to(cuda0)\nb.device\n>>> cuda:0\n

There are trade-offs between computations on the CPU and GPU. GPUs have limited memory and there is a cost associated with transferring data from CPUs to GPUs. However, GPUs perform heavy matrix operations much faster than CPUs, and are therefore often used to speed up training routines.

from timeit import Timer\n\nfor i, N in enumerate([10, 100, 500, 1000, 5000]):\n    print(\"({},{}) Matrices:\".format(N,N))\n    M1_cpu = torch.randn(size=(N,N), device='cpu')\n    M2_cpu = torch.randn(size=(N,N), device='cpu')\n    M1_gpu = torch.randn(size=(N,N), device=cuda0)\n    M2_gpu = torch.randn(size=(N,N), device=cuda0)\n    if (i==0):\n        print('Check devices for each tensor:')\n        print('M1_cpu, M2_cpu devices:', M1_cpu.device, M2_cpu.device)\n        print('M1_gpu, M2_gpu devices:', M1_gpu.device, M2_gpu.device)\n\n    def large_matrix_multiply(M1, M2):\n        # note: `*` is an element-wise product; use M1 @ M2.transpose(0,1)\n        # for a true matrix product\n        return M1 * M2.transpose(0,1)\n\n    # note: CUDA kernels are launched asynchronously, so without a call to\n    # torch.cuda.synchronize() the GPU timings mostly reflect launch overhead\n    n_iter = 1000\n    t_cpu = Timer(lambda: large_matrix_multiply(M1_cpu, M2_cpu))\n    cpu_time = t_cpu.timeit(number=n_iter)/n_iter\n    print('cpu time per call: {:.6f} s'.format(cpu_time))\n\n    t_gpu = Timer(lambda: large_matrix_multiply(M1_gpu, M2_gpu))\n    gpu_time = t_gpu.timeit(number=n_iter)/n_iter\n    print('gpu time per call: {:.6f} s'.format(gpu_time))\n    print('gpu_time/cpu_time: {:.6f}\\n'.format(gpu_time/cpu_time))\n\n>>> (10,10) Matrices:\nCheck devices for each tensor:\nM1_cpu, M2_cpu devices: cpu cpu\nM1_gpu, M2_gpu devices: cuda:0 cuda:0\ncpu time per call: 0.000008 s\ngpu time per call: 0.000015 s\ngpu_time/cpu_time: 1.904711\n\n(100,100) Matrices:\ncpu time per call: 0.000015 s\ngpu time per call: 0.000015 s\ngpu_time/cpu_time: 0.993163\n\n(500,500) Matrices:\ncpu time per call: 0.000058 s\ngpu time per call: 0.000016 s\ngpu_time/cpu_time: 0.267371\n\n(1000,1000) Matrices:\ncpu time per call: 0.000170 s\ngpu time per call: 0.000015 s\ngpu_time/cpu_time: 0.089784\n\n(5000,5000) Matrices:\ncpu time per call: 0.025083 s\ngpu time per call: 0.000011 s\ngpu_time/cpu_time: 0.000419\n
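
One detail of the benchmark is easy to miss: for tensors, `*` is an element-wise product, whereas `@` (or `.mm()`) is the matrix product, so the timed expression `M1 * M2.transpose(0,1)` is not a matrix multiplication. The plain-Python sketch below spells out the difference on small nested lists; the helper names are hypothetical and only mimic what the two operators compute.

```python
def transpose(M):
    """Swap rows and columns of a nested-list matrix."""
    return [list(row) for row in zip(*M)]

def elementwise(A, B):
    """What `A * B` computes for same-shaped tensors."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def matmul(A, B):
    """What `A @ B` (or `A.mm(B)`) computes."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

M1 = [[1, 2], [3, 4]]
M2 = [[5, 6], [7, 8]]
hadamard = elementwise(M1, transpose(M2))  # N^2 multiplications
product = matmul(M1, transpose(M2))        # N^3 multiplications (naive)
print(hadamard)
print(product)
```

A naive matrix product needs N times more multiplications than the element-wise product, so the two operations scale very differently with matrix size.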

The complete list of Torch Tensor operations is available in the docs.

"},{"location":"inference/pytorch.html#autograd","title":"Autograd","text":"

Backpropagation occurs automatically through autograd. For example, consider the following function and its derivatives:

\\[\\begin{aligned} f(\\textbf{a}, \\textbf{b}) &= \\textbf{a}^T \\textbf{X} \\textbf{b} \\\\ \\frac{\\partial f}{\\partial \\textbf{a}} &= \\textbf{b}^T \\textbf{X}^T\\\\ \\frac{\\partial f}{\\partial \\textbf{b}} &= \\textbf{a}^T \\textbf{X} \\end{aligned}\\]

Given specific choices of \\(\\textbf{X}\\), \\(\\textbf{a}\\), and \\(\\textbf{b}\\), we can calculate the corresponding derivatives via autograd by requiring a gradient to be stored in each relevant tensor:

X = torch.ones((2,2), requires_grad=True)\na = torch.tensor([0.5, 1], requires_grad=True)\nb = torch.tensor([0.5, -2], requires_grad=True)\nf = a.T @ X @ b\nf\n>>> tensor(-2.2500, grad_fn=<DotBackward>) \nf.backward() # backprop \na.grad\n>>> tensor([-1.5000, -1.5000])\nb.T @ X.T \n>>> tensor([-1.5000, -1.5000], grad_fn=<SqueezeBackward3>)\nb.grad\n>>> tensor([1.5000, 1.5000])\na.T @ X\n>>> tensor([1.5000, 1.5000], grad_fn=<SqueezeBackward3>)\n
The tensor.backward() call initiates backpropagation, accumulating the gradient backward through a series of grad_fn labels tied to each tensor (e.g. <DotBackward>, indicating the dot product \\((\\textbf{a}^T\\textbf{X})\\textbf{b}\\)).
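
As a cross-check that does not rely on PyTorch, the sketch below evaluates the analytic gradients for the same \\(\\textbf{X}\\), \\(\\textbf{a}\\), and \\(\\textbf{b}\\) with plain Python lists and compares them against central finite differences. The helper finite_difference is an illustrative assumption; it is not how autograd works internally.

```python
def f(a, X, b):
    """f(a, b) = a^T X b for list-based vectors and a nested-list matrix."""
    return sum(a[i] * X[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))

X = [[1.0, 1.0], [1.0, 1.0]]
a = [0.5, 1.0]
b = [0.5, -2.0]

# analytic gradients: df/da = X b and df/db = X^T a
grad_a = [sum(X[i][j] * b[j] for j in range(2)) for i in range(2)]
grad_b = [sum(X[i][j] * a[i] for i in range(2)) for j in range(2)]

def finite_difference(func, v, eps=1e-6):
    """Central finite-difference estimate of d func / d v."""
    grads = []
    for k in range(len(v)):
        up = v[:k] + [v[k] + eps] + v[k + 1:]
        down = v[:k] + [v[k] - eps] + v[k + 1:]
        grads.append((func(up) - func(down)) / (2 * eps))
    return grads

num_a = finite_difference(lambda v: f(v, X, b), a)
num_b = finite_difference(lambda v: f(a, X, v), b)
print(f(a, X, b), grad_a, grad_b)
```

The analytic values agree with a.grad and b.grad from the autograd example, and the finite-difference estimates match them up to rounding.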

"},{"location":"inference/pytorch.html#data-utils","title":"Data Utils","text":"

PyTorch is equipped with many useful data-handling utilities. For example, the torch.utils.data package implements datasets (torch.utils.data.Dataset) and iterable data loaders (torch.utils.data.DataLoader). Additionally, various batching and sampling schemes are available.

You can create custom map-style datasets by subclassing torch.utils.data.Dataset and implementing __len__ and __getitem__, for example a dataset collecting the results of XOR on two binary inputs:

import torch\nfrom torch.utils.data import Dataset\n\nclass Data(Dataset):\n    def __init__(self, device):\n        self.samples = torch.tensor([[0,0], [0,1], [1,0], [1,1]]).float().to(device)\n        # torch.logical_xor treats nonzero entries as True\n        self.targets = torch.logical_xor(self.samples[:,0], \n                                         self.samples[:,1]).float().to(device)\n\n    def __len__(self):\n        return len(self.targets)\n\n    def __getitem__(self,idx):\n        return({'x': self.samples[idx],\n                'y': self.targets[idx]})\n
Dataloaders, from torch.utils.data.DataLoader, can generate shuffled batches of data via multiple workers. Here, the dataset tensors are kept on the CPU:
from torch.utils.data import DataLoader\n\ndevice = 'cpu'\ntrain_data = Data(device)\ntest_data = Data(device)\ntrain_loader = DataLoader(train_data, batch_size=1, shuffle=True, num_workers=2)\ntest_loader = DataLoader(test_data, batch_size=1, shuffle=False, num_workers=2)\nfor i, batch in enumerate(train_loader):\n    print(i, batch)\n\n>>> 0 {'x': tensor([[0., 0.]]), 'y': tensor([0.])}\n    1 {'x': tensor([[1., 0.]]), 'y': tensor([1.])}\n    2 {'x': tensor([[1., 1.]]), 'y': tensor([0.])}\n    3 {'x': tensor([[0., 1.]]), 'y': tensor([1.])}\n
The full set of data utils is available in the docs.
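
For intuition, a single-process pass of a data loader can be sketched in a few lines of plain Python: shuffle the indices, slice them into batches, and collate the samples of each batch. The function iterate_batches below is a hypothetical stand-in, not the PyTorch implementation, and it ignores worker processes entirely.

```python
import random

def iterate_batches(dataset, batch_size, shuffle, seed=None):
    """Hypothetical stand-in for one single-process DataLoader pass."""
    indices = list(range(len(dataset)))
    if shuffle:
        # a different seed per epoch yields a different sample order
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = [dataset[i] for i in indices[start:start + batch_size]]
        # collate: gather the per-sample fields into per-batch lists
        yield {'x': [s['x'] for s in chunk], 'y': [s['y'] for s in chunk]}

# the four XOR samples, as plain dictionaries
xor_data = [{'x': [a, b], 'y': float(a != b)}
            for a in (0.0, 1.0) for b in (0.0, 1.0)]

batches = list(iterate_batches(xor_data, batch_size=2, shuffle=True, seed=0))
for i, batch in enumerate(batches):
    print(i, batch)
```

Every sample appears exactly once per pass; only the order and the grouping into batches change when shuffling is enabled.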

"},{"location":"inference/pytorch.html#neural-networks","title":"Neural Networks","text":"

The PyTorch nn package specifies a set of modules that correspond to different neural network (NN) components and operations. For example, the torch.nn.Linear module defines a linear transform with learnable parameters and the torch.nn.Flatten module flattens a contiguous range of tensor dimensions. The torch.nn.Sequential module contains an ordered set of modules such as torch.nn.Linear and torch.nn.Flatten, chaining them together to form the forward pass of a feed-forward network. Furthermore, one may specify various pre-implemented loss functions, for example torch.nn.BCELoss and torch.nn.KLDivLoss. The full set of PyTorch NN building blocks is available in the docs.

As an example, we can design a simple neural network to reproduce the output of the XOR operation on binary inputs. To do so, we can construct a simple NN of the form:

\\[\\begin{aligned} x_{in}&\\in\\{0,1\\}^{2}\\\\ l_1 &= \\sigma(W_1^Tx_{in} + b_1); \\ W_1\\in\\mathbb{R}^{2\\times2},\\ b_1\\in\\mathbb{R}^{2}\\\\ l_2 &= \\sigma(W_2^Tl_1 + b_2); \\ W_2\\in\\mathbb{R}^{2},\\ b_2\\in\\mathbb{R}\\\\ \\end{aligned}\\]
import torch.nn as nn\n\nclass Network(nn.Module):\n\n    def __init__(self):\n        super().__init__()\n\n        self.l1 = nn.Linear(2, 2)\n        self.l2 = nn.Linear(2, 1)\n\n    def forward(self, x):\n        x = torch.sigmoid(self.l1(x))\n        x = torch.sigmoid(self.l2(x))\n        return x\n\nmodel = Network().to(device)\n# evaluate the untrained model on all four XOR inputs at once\nmodel(train_data.samples)\n\n>>> tensor([[0.5000],\n            [0.4814],\n            [0.5148],\n            [0.4957]], grad_fn=<SigmoidBackward>)\n
"},{"location":"inference/pytorch.html#optimizers","title":"Optimizers","text":"

Training a neural network involves minimizing a loss function. Classes in the torch.optim package implement various optimization strategies, for example stochastic gradient descent and Adam through torch.optim.SGD and torch.optim.Adam, respectively. Optimizers are configurable through parameters such as the learning rate (configuring the optimizer's step size). The full set of optimizers and accompanying tutorials are available in the docs.

To demonstrate the use of an optimizer, let's train the NN above to produce the results of the XOR operation on binary inputs. Here we'll use the Adam optimizer:

import numpy as np\nimport torch.nn.functional as F\nfrom torch import optim\n\n# helpful references:\n# Learning XOR: exploring the space of a classic problem\n# https://towardsdatascience.com/how-neural-networks-solve-the-xor-problem-59763136bdd7\n# https://courses.cs.washington.edu/courses/cse446/18wi/sections/section8/XOR-Pytorch.html\n\n# the training function initiates backprop and \n# steps the optimizer towards the weights that \n# optimize the loss function \ndef train(model, train_loader, optimizer, epoch):\n    model.train()\n    losses = []\n    for i, batch in enumerate(train_loader):\n        optimizer.zero_grad()\n        output = model(batch['x'])\n        y, output = batch['y'], output.squeeze(1)\n\n        # optimize binary cross entropy:\n        # https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html\n        loss = F.binary_cross_entropy(output, y, reduction='mean')\n        loss.backward()\n        optimizer.step()\n        losses.append(loss.item())\n\n    return np.mean(losses)\n\n# the test function does not adjust the model's weights\ndef test(model, test_loader):\n    model.eval()\n    losses, n_correct, n_incorrect = [], 0, 0\n    with torch.no_grad():\n        for i, batch in enumerate(test_loader):\n            output = model(batch['x'])\n            y, output = batch['y'], output.squeeze(1)\n            loss = F.binary_cross_entropy(output, y, \n                                          reduction='mean').item()\n            losses.append(loss)\n\n            # determine accuracy by thresholding model output at 0.5\n            batch_correct = torch.sum(((output>0.5) & (y==1)) |\n                                      ((output<0.5) & (y==0)))\n            batch_incorrect = len(y) - batch_correct\n            n_correct += batch_correct\n            n_incorrect += batch_incorrect\n\n    return np.mean(losses), n_correct/(n_correct+n_incorrect)\n\n\n# randomly initialize the model's weights\nfor module in model.modules():\n    if isinstance(module, nn.Linear):\n        module.weight.data.normal_(0, 1)\n\n# send weights to optimizer \nlr = 2.5e-2\noptimizer = optim.Adam(model.parameters(), lr=lr)\n\nepochs = 500\nfor epoch in range(1, epochs + 1):\n    train_loss = train(model, train_loader, optimizer, epoch)\n    test_loss, test_acc = test(model, test_loader)\n    if epoch%25==0:\n        print('epoch={}: train_loss={:.3f}, test_loss={:.3f}, test_acc={:.3f}'\n              .format(epoch, train_loss, test_loss, test_acc))\n\n>>> epoch=25: train_loss=0.683, test_loss=0.681, test_acc=0.500\n    epoch=50: train_loss=0.665, test_loss=0.664, test_acc=0.750\n    epoch=75: train_loss=0.640, test_loss=0.635, test_acc=0.750\n    epoch=100: train_loss=0.598, test_loss=0.595, test_acc=0.750\n    epoch=125: train_loss=0.554, test_loss=0.550, test_acc=0.750\n    epoch=150: train_loss=0.502, test_loss=0.498, test_acc=0.750\n    epoch=175: train_loss=0.435, test_loss=0.432, test_acc=0.750\n    epoch=200: train_loss=0.360, test_loss=0.358, test_acc=0.750\n    epoch=225: train_loss=0.290, test_loss=0.287, test_acc=1.000\n    epoch=250: train_loss=0.230, test_loss=0.228, test_acc=1.000\n    epoch=275: train_loss=0.184, test_loss=0.183, test_acc=1.000\n    epoch=300: train_loss=0.149, test_loss=0.148, test_acc=1.000\n    epoch=325: train_loss=0.122, test_loss=0.122, test_acc=1.000\n    epoch=350: train_loss=0.102, test_loss=0.101, test_acc=1.000\n    epoch=375: train_loss=0.086, test_loss=0.086, test_acc=1.000\n    epoch=400: train_loss=0.074, test_loss=0.073, test_acc=1.000\n    epoch=425: train_loss=0.064, test_loss=0.063, test_acc=1.000\n    epoch=450: train_loss=0.056, test_loss=0.055, test_acc=1.000\n    epoch=475: train_loss=0.049, test_loss=0.049, test_acc=1.000\n    epoch=500: train_loss=0.043, test_loss=0.043, test_acc=1.000\n
Here, the model has converged to 100% test accuracy, indicating that it has learned to reproduce the XOR outputs perfectly. Note that even after the test accuracy reaches 100%, the test loss (BCE) continues to decrease; this is because the BCE loss is nonzero whenever \\(y_{output}\\) is not exactly 0 or 1, while accuracy is determined by thresholding the model outputs such that each prediction is the boolean \\((y_{output} > 0.5)\\). This highlights that it is important to choose the correct performance metric for an ML problem. In the case of XOR, perfect test accuracy is sufficient. Let's check that we've recovered the XOR output by extracting the model's weights and using them to build a custom XOR function:

for name, param in model.named_parameters():\n    if param.requires_grad:\n        print(name, param.data)\n\n>>> l1.weight tensor([[ 7.2888, -6.4168],\n                      [ 7.2824, -8.1637]])\n    l1.bias tensor([ 2.6895, -3.9633])\n    l2.weight tensor([[-6.3500,  8.0990]])\n    l2.bias tensor([2.5058])\n

Because our model was built with nn.Linear modules, we have weight matrices and bias terms. Next, we'll hard-code the matrix operations into a custom XOR function based on the architecture of the NN:

def XOR(x):\n    w1 = torch.tensor([[ 7.2888, -6.4168],\n                       [ 7.2824, -8.1637]]).t()\n    b1 = torch.tensor([ 2.6895, -3.9633])\n    layer1_out = torch.tensor([x[0]*w1[0,0] + x[1]*w1[1,0] + b1[0],\n                               x[0]*w1[0,1] + x[1]*w1[1,1] + b1[1]])\n    layer1_out = torch.sigmoid(layer1_out)\n\n    w2 = torch.tensor([-6.3500,  8.0990])\n    b2 = 2.5058\n    layer2_out = layer1_out[0]*w2[0] + layer1_out[1]*w2[1] + b2\n    layer2_out = torch.sigmoid(layer2_out)\n    return layer2_out, (layer2_out > 0.5)\n\nXOR([0.,0.])\n>>> (tensor(0.0359), tensor(False))\nXOR([0.,1.])\n>>> (tensor(0.9135), tensor(True))\nXOR([1.,0.])\n>>> (tensor(0.9815), tensor(True))\nXOR([1.,1.])\n>>> (tensor(0.0265), tensor(False))\n

There we have it - the NN learned XOR!
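
The four outputs above also make the earlier point about loss versus accuracy concrete. The sketch below recomputes the mean binary cross entropy and the thresholded accuracy in plain Python for these outputs and for a hypothetical, less confident set of predictions (the hesitant list is an assumption introduced purely for comparison).

```python
import math

def binary_cross_entropy(outputs, targets):
    """Mean BCE over all samples, matching reduction='mean'."""
    terms = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
             for p, y in zip(outputs, targets)]
    return sum(terms) / len(terms)

def accuracy(outputs, targets, threshold=0.5):
    """Fraction of samples whose thresholded output matches the target."""
    hits = [(p > threshold) == (y == 1) for p, y in zip(outputs, targets)]
    return sum(hits) / len(hits)

targets = [0, 1, 1, 0]
outputs = [0.0359, 0.9135, 0.9815, 0.0265]  # XOR(...) outputs printed above
hesitant = [0.4, 0.6, 0.6, 0.4]             # hypothetical, less confident model

print(accuracy(outputs, targets), binary_cross_entropy(outputs, targets))
print(accuracy(hesitant, targets), binary_cross_entropy(hesitant, targets))
```

Both sets of predictions are 100% accurate, yet their BCE values differ by roughly an order of magnitude; the value for the trained outputs is about 0.043, in line with the final test loss in the training log.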

"},{"location":"inference/pytorch.html#pytorch-in-cmssw","title":"PyTorch in CMSSW","text":""},{"location":"inference/pytorch.html#via-onnx","title":"Via ONNX","text":"

One way to incorporate your PyTorch models into CMSSW is through the Open Neural Network Exchange (ONNX) Runtime tool. In brief, ONNX supports training and inference for a variety of ML frameworks, and is currently integrated into CMSSW (see the CMS ML tutorial and a relevant discussion in the CMSSW git repo). PyTorch hosts an excellent tutorial on exporting a model from PyTorch to ONNX.

"},{"location":"inference/pytorch.html#example-use-cases","title":"Example Use Cases","text":"

The \\(ZZ\\rightarrow 4b\\) analysis utilizes trained PyTorch models via ONNX in CMSSW (see the corresponding repo). Briefly, they run ONNX in CMSSW_11_X via the CMSSW package PhysicsTools/ONNXRuntime, using it to define a multiClassifierONNX class. This multiclassifier is capable of loading pre-trained PyTorch models specified by a modelFile string as follows:

#include \"PhysicsTools/ONNXRuntime/interface/ONNXRuntime.h\"\n\nstd::unique_ptr<cms::Ort::ONNXRuntime> model;\nOrt::SessionOptions* session_options = new Ort::SessionOptions();\nsession_options->SetIntraOpNumThreads(1);\nmodel = std::make_unique<cms::Ort::ONNXRuntime>(modelFile, session_options);\n
"},{"location":"inference/pytorch.html#via-triton","title":"Via Triton","text":"

Coprocessors (GPUs, FPGAs, etc.) are frequently used to accelerate ML operations such as inference and training. In the 'as-a-service' paradigm, users can access cloud-based applications through lightweight client interfaces. The Services for Optimized Network Inference on Coprocessors (SONIC) framework implements this paradigm in CMSSW, allowing the optimal integration of GPUs into event processing workflows. One powerful implementation of SONIC is the NVIDIA Triton Inference Server, which is flexible with respect to ML framework, storage source, and hardware infrastructure. For more details, see the corresponding NVIDIA developer blog entry.

A Graph Attention Network (GAT) is available via Triton in CMSSW, and can be accessed here: https://github.com/cms-sw/cmssw/tree/master/HeterogeneousCore/SonicTriton/test

"},{"location":"inference/pytorch.html#training-tips","title":"Training Tips","text":"
  • When instantiating a DataLoader, shuffle=True should be enabled for training data but not for validation and testing data. At each training epoch, this will vary the order of data objects in each batch; accordingly, it is not efficient to load the full dataset (in its original ordering) into GPU memory before training. Instead, enable num_workers>1; this allows the DataLoader to load batches to the GPU as they're prepared. Note that this launches multiple worker processes on the CPU. For more information, see a corresponding discussion in the PyTorch forum.
"},{"location":"inference/sonic_triton.html","title":"Service-based inference with Triton/Sonic","text":"

This page is still under construction. For the moment, please see the Sonic+Triton tutorial given as part of the Machine Learning HATS@LPC 2021.

  • Link to Indico agenda
  • Slides
  • Exercise twiki
"},{"location":"inference/standalone.html","title":"Standalone framework","text":"

Todo.

Idea: Working w/ TF+ROOT standalone (outside of CMSSW)

"},{"location":"inference/swan_aws.html","title":"SWAN + AWS","text":"

Todo.

Ideas: best practices, cost model, instance pricing, need to log out, monitoring mandatory

"},{"location":"inference/tensorflow1.html","title":"Direct inference with TensorFlow 1","text":"

While it is technically still possible to use TensorFlow 1, this version of TensorFlow is quite old and is no longer supported by CMSSW. We highly recommend that you update your model to TensorFlow 2 and follow the integration guide in the Inference/Direct inference/TensorFlow 2 documentation.

"},{"location":"inference/tensorflow2.html","title":"Direct inference with TensorFlow 2","text":"

TensorFlow 2 is available since CMSSW_11_1_X (cmssw#28711, cmsdist#5525). The integration into the software stack can be found in cmsdist/tensorflow.spec and the interface is located in cmssw/PhysicsTools/TensorFlow.

"},{"location":"inference/tensorflow2.html#available-versions","title":"Available versions","text":"Python 3 on el8Python 3 on slc7Python 2 on slc7 TensorFlow el8_amd64_gcc10 el8_amd64_gcc11 v2.6.0 \u2265 CMSSW_12_3_4 - v2.6.4 \u2265 CMSSW_12_5_0 \u2265 CMSSW_12_5_0 TensorFlow slc7_amd64_gcc900 slc7_amd64_gcc10 slc7_amd64_gcc11 v2.1.0 \u2265 CMSSW_11_1_0 - - v2.3.1 \u2265 CMSSW_11_2_0 - - v2.4.1 \u2265 CMSSW_11_3_0 - - v2.5.0 \u2265 CMSSW_12_0_0 \u2265 CMSSW_12_0_0 - v2.6.0 \u2265 CMSSW_12_1_0 \u2265 CMSSW_12_1_0 \u2265 CMSSW_12_3_0 v2.6.4 - \u2265 CMSSW_12_5_0 \u2265 CMSSW_13_0_0 TensorFlow slc7_amd64_gcc900 v2.1.0 \u2265 CMSSW_11_1_0 v2.3.1 \u2265 CMSSW_11_2_0

At this time, only CPU support is provided. While GPU support is generally possible, it is currently disabled due to some interference with production workflows, but it will be enabled once these issues are resolved.

"},{"location":"inference/tensorflow2.html#software-setup","title":"Software setup","text":"

To run the examples shown below, create a minimal inference setup with the following snippet. Adapt the SCRAM_ARCH according to your operating system and desired compiler.

export SCRAM_ARCH=\"el8_amd64_gcc11\"\nexport CMSSW_VERSION=\"CMSSW_12_6_0\"\n\nsource \"/cvmfs/cms.cern.ch/cmsset_default.sh\" \"\"\n\ncmsrel \"${CMSSW_VERSION}\"\ncd \"${CMSSW_VERSION}/src\"\n\ncmsenv\nscram b\n

Below, the cmsml Python package is used to convert models from TensorFlow objects (tf.function's or Keras models) to protobuf graph files (documentation). It should be available after executing the commands above. You can check its version via

python -c \"import cmsml; print(cmsml.__version__)\"\n

and compare to the released tags. If you want to install a newer version from either the master branch of the cmsml repository or the Python package index (PyPI), you can simply do that via pip.

master
# into your user directory (usually ~/.local)\npip install --upgrade --user git+https://github.com/cms-ml/cmsml\n\n# _or_\n\n# into a custom directory\npip install --upgrade --prefix \"CUSTOM_DIRECTORY\" git+https://github.com/cms-ml/cmsml\n
PyPI
# into your user directory (usually ~/.local)\npip install --upgrade --user cmsml\n\n# _or_\n\n# into a custom directory\npip install --upgrade --prefix \"CUSTOM_DIRECTORY\" cmsml\n
"},{"location":"inference/tensorflow2.html#saving-your-model","title":"Saving your model","text":"

After successfully training, you should save your model in a protobuf graph file which can be read by the interface in CMSSW. Naturally, you only want to save the part of your model that is required to run the network prediction, i.e., it should not contain operations related to model training or loss functions (unless explicitly required). Also, to reduce the memory footprint and to accelerate the inference, variables should be converted to constant tensors. Both of these model transformations are provided by the cmsml package.

Instructions on how to transform and save your model are shown below, depending on whether you use Keras or plain TensorFlow with tf.function's.

Keras

The code below saves a Keras Model instance as a protobuf graph file using cmsml.tensorflow.save_graph. In order for Keras to build the internal graph representation before saving, make sure to either compile the model, or pass an input_shape to the first layer:

# coding: utf-8\n\nimport tensorflow as tf\nfrom tensorflow.keras import layers\nimport cmsml\n\n# define your model\nmodel = tf.keras.Sequential()\nmodel.add(layers.InputLayer(input_shape=(10,), name=\"input\"))\nmodel.add(layers.Dense(100, activation=\"tanh\"))\nmodel.add(layers.Dense(3, activation=\"softmax\", name=\"output\"))\n\n# train it\n...\n\n# convert to binary (.pb extension) protobuf\n# with variables converted to constants\ncmsml.tensorflow.save_graph(\"graph.pb\", model, variables_to_constants=True)\n

Following the Keras naming conventions for certain layers, the input will be named \"input\" while the output is named \"sequential/output/Softmax\". To cross check the names, you can save the graph in text format by using the extension \".pb.txt\".

When using plain TensorFlow, suppose you write your network model in a single tf.function.

# coding: utf-8\n\nimport tensorflow as tf\nimport cmsml\n\n# define the model\n@tf.function\ndef model(x):\n    # lift variable initialization to the lowest context so they are\n    # not re-initialized on every call (eager calls or signature tracing)\n    with tf.init_scope():\n        W = tf.Variable(tf.ones([10, 1]))\n        b = tf.Variable(tf.ones([1]))\n\n    # define your \"complex\" model here\n    h = tf.add(tf.matmul(x, W), b)\n    y = tf.tanh(h, name=\"y\")\n\n    return y\n

In TensorFlow terms, the model function is polymorphic - it accepts different types of the input tensor x (tf.float32, tf.float64, ...). For each type, TensorFlow will create a concrete function with an associated tf.Graph object. This mechanism is referred to as signature tracing. For deeper insights into tf.function, the concepts of signature tracing, polymorphic and concrete functions, see the guide on Better performance with tf.function.

To save the model as a protobuf graph file, you explicitly need to create a concrete function. However, this is fairly easy once you know the exact type and shape of all input arguments.

# create a concrete function\ncmodel = model.get_concrete_function(\n    tf.TensorSpec(shape=[2, 10], dtype=tf.float32),\n)\n\n# convert to binary (.pb extension) protobuf\n# with variables converted to constants\ncmsml.tensorflow.save_graph(\"graph.pb\", cmodel, variables_to_constants=True)\n

The input will be named \"x\" while the output is named \"y\". To cross check the names, you can save the graph in text format by using the extension \".pb.txt\".

Different method: Frozen signatures

Instead of creating a polymorphic tf.function and extracting a concrete one in a second step, you can directly define an input signature upon definition.

@tf.function(input_signature=(tf.TensorSpec(shape=[2, 10], dtype=tf.float32),))\ndef model(x):\n    ...\n

This disables signature tracing since the input signature is frozen, but it allows you to pass the function directly to cmsml.tensorflow.save_graph.

"},{"location":"inference/tensorflow2.html#inference-in-cmssw","title":"Inference in CMSSW","text":"

The inference can be implemented to run in a single thread. In general, this does not mean that the module cannot be executed with multiple threads (cmsRun --numThreads <N> <CFG_FILE>), but rather that its performance in terms of evaluation time and especially memory consumption is likely to be suboptimal. Therefore, for modules to be integrated into CMSSW, the multi-threaded implementation is strongly recommended.

"},{"location":"inference/tensorflow2.html#cmssw-module-setup","title":"CMSSW module setup","text":"

If you aim to use the TensorFlow interface in a CMSSW plugin, make sure to include

<use name=\"PhysicsTools/TensorFlow\" />\n\n<flags EDM_PLUGIN=\"1\" />\n

in your plugins/BuildFile.xml file. If you are using the interface inside the src/ or interface/ directory of your module, make sure to create a global BuildFile.xml file next to these directories, containing (at least):

<use name=\"PhysicsTools/TensorFlow\" />\n\n<export>\n<lib name=\"1\" />\n</export>\n
"},{"location":"inference/tensorflow2.html#single-threaded-inference","title":"Single-threaded inference","text":"

Despite tf.Session being removed in the Python interface as of TensorFlow 2, the concepts of

  • Graph's, containing the constant computational structure and trained variables of your model,
  • Session's, handling execution and data exchange, and
  • the separation between them

live on in the C++ interface. Thus, the overall inference approach is 1) include the interface, 2) initialize Graph and session, 3) per event create input tensors and run the inference, and 4) cleanup.

"},{"location":"inference/tensorflow2.html#1-includes","title":"1. Includes","text":"
#include \"PhysicsTools/TensorFlow/interface/TensorFlow.h\"\n#include \"FWCore/Framework/interface/one/EDAnalyzer.h\"\n// further framework includes\n...\n
"},{"location":"inference/tensorflow2.html#2-initialize-objects","title":"2. Initialize objects","text":"
// configure logging to show warnings (see table below)\ntensorflow::setLogging(\"2\");\n\n// load the graph definition\ntensorflow::GraphDef* graphDef = tensorflow::loadGraphDef(\"/path/to/constantgraph.pb\");\n\n// create a session\ntensorflow::Session* session = tensorflow::createSession(graphDef);\n
"},{"location":"inference/tensorflow2.html#3-inference","title":"3. Inference","text":"
// create an input tensor\n// (example: single batch of 10 values)\ntensorflow::Tensor input(tensorflow::DT_FLOAT, { 1, 10 });\n\n\n// fill the tensor with your input data\n// (example: just fill consecutive values)\nfor (size_t i = 0; i < 10; i++) {\ninput.matrix<float>()(0, i) = float(i);\n}\n\n// run the evaluation\nstd::vector<tensorflow::Tensor> outputs;\ntensorflow::run(session, { { \"input\", input } }, { \"output\" }, &outputs);\n\n// process the output tensor\n// (example: print the 5th value of the 0th (the only) example)\nstd::cout << outputs[0].matrix<float>()(0, 5) << std::endl;\n// -> float\n
"},{"location":"inference/tensorflow2.html#4-cleanup","title":"4. Cleanup","text":"
tensorflow::closeSession(session);\ndelete graphDef;\n
"},{"location":"inference/tensorflow2.html#full-example","title":"Full example","text":"Click to expand

The example assumes the following directory structure:

MySubsystem/MyModule/\n\u2502\n\u251c\u2500\u2500 plugins/\n\u2502   \u251c\u2500\u2500 MyPlugin.cpp\n\u2502   \u2514\u2500\u2500 BuildFile.xml\n\u2502\n\u251c\u2500\u2500 test/\n\u2502   \u2514\u2500\u2500 my_plugin_cfg.py\n\u2502\n\u2514\u2500\u2500 data/\n    \u2514\u2500\u2500 graph.pb\n
Files shown below, in order: plugins/MyPlugin.cpp, plugins/BuildFile.xml, test/my_plugin_cfg.py
/*\n * Example plugin to demonstrate the direct single-threaded inference with TensorFlow 2.\n */\n\n#include <memory>\n\n#include \"FWCore/Framework/interface/Event.h\"\n#include \"FWCore/Framework/interface/Frameworkfwd.h\"\n#include \"FWCore/Framework/interface/MakerMacros.h\"\n#include \"FWCore/Framework/interface/one/EDAnalyzer.h\"\n#include \"FWCore/ParameterSet/interface/ParameterSet.h\"\n#include \"PhysicsTools/TensorFlow/interface/TensorFlow.h\"\n\nclass MyPlugin : public edm::one::EDAnalyzer<> {\npublic:\nexplicit MyPlugin(const edm::ParameterSet&);\n~MyPlugin(){};\n\nstatic void fillDescriptions(edm::ConfigurationDescriptions&);\n\nprivate:\nvoid beginJob();\nvoid analyze(const edm::Event&, const edm::EventSetup&);\nvoid endJob();\n\nstd::string graphPath_;\nstd::string inputTensorName_;\nstd::string outputTensorName_;\n\ntensorflow::GraphDef* graphDef_;\ntensorflow::Session* session_;\n};\n\nvoid MyPlugin::fillDescriptions(edm::ConfigurationDescriptions& descriptions) {\n// defining this function will lead to a *_cfi file being generated when compiling\nedm::ParameterSetDescription desc;\ndesc.add<std::string>(\"graphPath\");\ndesc.add<std::string>(\"inputTensorName\");\ndesc.add<std::string>(\"outputTensorName\");\ndescriptions.addWithDefaultLabel(desc);\n}\n\nMyPlugin::MyPlugin(const edm::ParameterSet& config)\n: graphPath_(config.getParameter<std::string>(\"graphPath\")),\ninputTensorName_(config.getParameter<std::string>(\"inputTensorName\")),\noutputTensorName_(config.getParameter<std::string>(\"outputTensorName\")),\ngraphDef_(nullptr),\nsession_(nullptr) {\n// set tensorflow log level to warning\ntensorflow::setLogging(\"2\");\n}\n\nvoid MyPlugin::beginJob() {\n// load the graph\ngraphDef_ = tensorflow::loadGraphDef(graphPath_);\n\n// create a new session and add the graphDef\nsession_ = tensorflow::createSession(graphDef_);\n}\n\nvoid MyPlugin::endJob() {\n// close the session\ntensorflow::closeSession(session_);\n\n// delete the graph\ndelete 
graphDef_;\ngraphDef_ = nullptr;\n}\n\nvoid MyPlugin::analyze(const edm::Event& event, const edm::EventSetup& setup) {\n// define a tensor and fill it with range(10)\ntensorflow::Tensor input(tensorflow::DT_FLOAT, {1, 10});\nfor (size_t i = 0; i < 10; i++) {\ninput.matrix<float>()(0, i) = float(i);\n}\n\n// define the output and run\nstd::vector<tensorflow::Tensor> outputs;\ntensorflow::run(session_, {{inputTensorName_, input}}, {outputTensorName_}, &outputs);\n\n// print the output\nstd::cout << \" -> \" << outputs[0].matrix<float>()(0, 0) << std::endl << std::endl;\n}\n\nDEFINE_FWK_MODULE(MyPlugin);\n
<use name=\"FWCore/Framework\" />\n<use name=\"FWCore/PluginManager\" />\n<use name=\"FWCore/ParameterSet\" />\n<use name=\"PhysicsTools/TensorFlow\" />\n\n<flags EDM_PLUGIN=\"1\" />\n
# coding: utf-8\n\nimport os\n\nimport FWCore.ParameterSet.Config as cms\nfrom FWCore.ParameterSet.VarParsing import VarParsing\n\n\n# get the data/ directory\nthisdir = os.path.dirname(os.path.abspath(__file__))\ndatadir = os.path.join(os.path.dirname(thisdir), \"data\")\n\n# setup minimal options\noptions = VarParsing(\"python\")\noptions.setDefault(\"inputFiles\", \"root://xrootd-cms.infn.it//store/mc/RunIISummer20UL17MiniAODv2/TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/MINIAODSIM/106X_mc2017_realistic_v9-v1/00000/005708B7-331C-904E-88B9-189011E6C9DD.root\")  # noqa\noptions.parseArguments()\n\n# define the process to run\nprocess = cms.Process(\"TEST\")\n\n# minimal configuration\nprocess.load(\"FWCore.MessageService.MessageLogger_cfi\")\nprocess.MessageLogger.cerr.FwkReport.reportEvery = 1\nprocess.maxEvents = cms.untracked.PSet(\n    input=cms.untracked.int32(10),\n)\nprocess.source = cms.Source(\n    \"PoolSource\",\n    fileNames=cms.untracked.vstring(options.inputFiles),\n)\n\n# process options\nprocess.options = cms.untracked.PSet(\n    allowUnscheduled=cms.untracked.bool(True),\n    wantSummary=cms.untracked.bool(True),\n)\n\n# setup MyPlugin by loading the auto-generated cfi (see MyPlugin.fillDescriptions)\nprocess.load(\"MySubsystem.MyModule.myPlugin_cfi\")\nprocess.myPlugin.graphPath = cms.string(os.path.join(datadir, \"graph.pb\"))\nprocess.myPlugin.inputTensorName = cms.string(\"input\")\nprocess.myPlugin.outputTensorName = cms.string(\"output\")\n\n# define what to run in the path\nprocess.p = cms.Path(process.myPlugin)\n
"},{"location":"inference/tensorflow2.html#multi-threaded-inference","title":"Multi-threaded inference","text":"

Compared to the single-threaded implementation above, the multi-threaded version has one major difference: both the Graph and the Session are no longer members of a particular module instance, but rather shared between all instances in all threads. See the documentation on the C++ interface of stream modules for details.

Recommendation updated

The previous recommendation stated that the Session is not constant and thus should not be placed in the global cache, but rather created once per stream module instance. However, it was discovered that, although not explicitly declared as constant in the tensorflow::run() / Session::run() interface, the session is actually not changed during evaluation and can be treated as being effectively constant.

As a result, it is safe to move it to the global cache, next to the Graph object. The TensorFlow interface in CMSSW was adjusted in order to accept const objects in cmssw#40161.

Thus, the overall inference approach is 1) include the interface, 2) let your plugin inherit from edm::stream::EDAnalyzer and declare the GlobalCache, 3) store a const Session*, pointing to the cached session, and 4) per event create input tensors and run the inference.

"},{"location":"inference/tensorflow2.html#1-includes_1","title":"1. Includes","text":"
#include \"PhysicsTools/TensorFlow/interface/TensorFlow.h\"\n#include \"FWCore/Framework/interface/stream/EDAnalyzer.h\"\n// further framework includes\n...\n

Note that stream/EDAnalyzer.h is included rather than one/EDAnalyzer.h.

"},{"location":"inference/tensorflow2.html#2-define-and-use-the-global-cache","title":"2. Define and use the global cache","text":"

The cache definition is done by declaring a simple struct. However, for the purpose of just storing a graph and a session object, a so-called tensorflow::SessionCache struct is already provided centrally. It was added in cmssw#40284 and its usage is shown in the following. In case the tensorflow::SessionCache is not (yet) available in your version of CMSSW, expand the \"Custom cache struct\" section below.

Use it in the edm::GlobalCache template argument and adjust the plugin accordingly.

class MyPlugin : public edm::stream::EDAnalyzer<edm::GlobalCache<tensorflow::SessionCache>> {\npublic:\nexplicit MyPlugin(const edm::ParameterSet&, const tensorflow::SessionCache*);\n~MyPlugin();\n\n// an additional static method for initializing the global cache\nstatic std::unique_ptr<tensorflow::SessionCache> initializeGlobalCache(const edm::ParameterSet&);\nstatic void globalEndJob(const tensorflow::SessionCache*);\n...\n

Implement initializeGlobalCache to control the behavior of how the cache object is created. The destructor of tensorflow::SessionCache already handles the closing of the session itself and the deletion of all objects.

std::unique_ptr<tensorflow::SessionCache> MyPlugin::initializeGlobalCache(const edm::ParameterSet& params) {\nstd::string graphPath = edm::FileInPath(params.getParameter<std::string>(\"graphPath\")).fullPath();\nreturn std::make_unique<tensorflow::SessionCache>(graphPath);\n}\n
Custom cache struct
struct CacheData {\nCacheData() : graph(nullptr), session(nullptr) {}\n\nstd::atomic<tensorflow::GraphDef*> graph;\nstd::atomic<tensorflow::Session*> session;\n};\n

Use it in the edm::GlobalCache template argument and adjust the plugin accordingly.

class MyPlugin : public edm::stream::EDAnalyzer<edm::GlobalCache<CacheData>> {\npublic:\nexplicit MyPlugin(const edm::ParameterSet&, const CacheData*);\n~MyPlugin();\n\n// two additional static methods for handling the global cache\nstatic std::unique_ptr<CacheData> initializeGlobalCache(const edm::ParameterSet&);\nstatic void globalEndJob(const CacheData*);\n...\n

Implement initializeGlobalCache and globalEndJob to control the behavior of how the cache object is created and destroyed.

See the full example below for more details.

"},{"location":"inference/tensorflow2.html#3-initialize-objects","title":"3. Initialize objects","text":"

In your module constructor, you can get a pointer to the constant session to perform model evaluation during the event loop.

// declaration in header\nconst tensorflow::Session* session_;\n\n// get a pointer to the const session stored in the cache in the constructor init\nMyPlugin::MyPlugin(const edm::ParameterSet& config,  const tensorflow::SessionCache* cache)\n: session_(cache->getSession()) {\n...\n}\n
"},{"location":"inference/tensorflow2.html#4-inference","title":"4. Inference","text":"
// create an input tensor\n// (example: single batch of 10 values)\ntensorflow::Tensor input(tensorflow::DT_FLOAT, { 1, 10 });\n\n\n// fill the tensor with your input data\n// (example: just fill consecutive values)\nfor (size_t i = 0; i < 10; i++) {\ninput.matrix<float>()(0, i) = float(i);\n}\n\n// define the output\nstd::vector<tensorflow::Tensor> outputs;\n\n// evaluate\n// note: in case this line causes the compiler to complain about the const'ness of the session_ in\n//       this call, your CMSSW version might not yet support passing a const session, so in this\n//       case, pass \"const_cast<tensorflow::Session*>(session_)\"\ntensorflow::run(session_, { { inputTensorName, input } }, { outputTensorName }, &outputs);\n\n// process the output tensor\n// (example: print the 5th value of the 0th (the only) example)\nstd::cout << outputs[0].matrix<float>()(0, 5) << std::endl;\n// -> float\n

Note

If the TensorFlow interface in your CMSSW release does not yet accept const sessions, line 19 in the example above will cause an error during compilation. In this case, replace session_ in that line with

const_cast<tensorflow::Session*>(session_)\n
"},{"location":"inference/tensorflow2.html#full-example_1","title":"Full example","text":"Click to expand

The example assumes the following directory structure:

MySubsystem/MyModule/\n\u2502\n\u251c\u2500\u2500 plugins/\n\u2502   \u251c\u2500\u2500 MyPlugin.cpp\n\u2502   \u2514\u2500\u2500 BuildFile.xml\n\u2502\n\u251c\u2500\u2500 test/\n\u2502   \u2514\u2500\u2500 my_plugin_cfg.py\n\u2502\n\u2514\u2500\u2500 data/\n    \u2514\u2500\u2500 graph.pb\n
Files shown below, in order: plugins/MyPlugin.cpp, plugins/BuildFile.xml, test/my_plugin_cfg.py
/*\n * Example plugin to demonstrate the direct multi-threaded inference with TensorFlow 2.\n */\n\n#include <memory>\n\n#include \"FWCore/Framework/interface/Event.h\"\n#include \"FWCore/Framework/interface/Frameworkfwd.h\"\n#include \"FWCore/Framework/interface/MakerMacros.h\"\n#include \"FWCore/Framework/interface/stream/EDAnalyzer.h\"\n#include \"FWCore/ParameterSet/interface/ParameterSet.h\"\n#include \"PhysicsTools/TensorFlow/interface/TensorFlow.h\"\n\n// put a tensorflow::SessionCache into the global cache structure\n// the session cache wraps both a tf graph and a tf session instance and also handles their deletion\nclass MyPlugin : public edm::stream::EDAnalyzer<edm::GlobalCache<tensorflow::SessionCache>> {\npublic:\nexplicit MyPlugin(const edm::ParameterSet&, const tensorflow::SessionCache*);\n~MyPlugin(){};\n\nstatic void fillDescriptions(edm::ConfigurationDescriptions&);\n\n// an additional static method for initializing the global cache\nstatic std::unique_ptr<tensorflow::SessionCache> initializeGlobalCache(const edm::ParameterSet&);\n\nprivate:\nvoid beginJob();\nvoid analyze(const edm::Event&, const edm::EventSetup&);\nvoid endJob();\n\nstd::string inputTensorName_;\nstd::string outputTensorName_;\n\n// a pointer to the session created by the global session cache\nconst tensorflow::Session* session_;\n};\n\nstd::unique_ptr<tensorflow::SessionCache> MyPlugin::initializeGlobalCache(const edm::ParameterSet& params) {\n// this method is supposed to create, initialize and return a SessionCache instance\nstd::string graphPath = edm::FileInPath(params.getParameter<std::string>(\"graphPath\")).fullPath();\n// set up the TF backend by configuration\n// (the backend is selected first so that options stays in scope for the return statement)\ntensorflow::Backend backend = tensorflow::Backend::cpu;\nif (params.getParameter<std::string>(\"tf_backend\") == \"cuda\") {\nbackend = tensorflow::Backend::cuda;\n}\ntensorflow::Options options{backend};\nreturn std::make_unique<tensorflow::SessionCache>(graphPath, options);\n}\n\nvoid 
MyPlugin::fillDescriptions(edm::ConfigurationDescriptions& descriptions) {\n// defining this function will lead to a *_cfi file being generated when compiling\nedm::ParameterSetDescription desc;\ndesc.add<std::string>(\"graphPath\");\ndesc.add<std::string>(\"inputTensorName\");\ndesc.add<std::string>(\"outputTensorName\");\n// backend read in initializeGlobalCache, defaulting to the CPU\ndesc.add<std::string>(\"tf_backend\", \"cpu\");\ndescriptions.addWithDefaultLabel(desc);\n}\n\nMyPlugin::MyPlugin(const edm::ParameterSet& config,  const tensorflow::SessionCache* cache)\n: inputTensorName_(config.getParameter<std::string>(\"inputTensorName\")),\noutputTensorName_(config.getParameter<std::string>(\"outputTensorName\")),\nsession_(cache->getSession()) {}\n\nvoid MyPlugin::beginJob() {}\n\nvoid MyPlugin::endJob() {\n// nothing to close here: the destructor of tensorflow::SessionCache closes the shared session\n}\n\nvoid MyPlugin::analyze(const edm::Event& event, const edm::EventSetup& setup) {\n// define a tensor and fill it with range(10)\ntensorflow::Tensor input(tensorflow::DT_FLOAT, {1, 10});\nfor (size_t i = 0; i < 10; i++) {\ninput.matrix<float>()(0, i) = float(i);\n}\n\n// define the output\nstd::vector<tensorflow::Tensor> outputs;\n\n// evaluate\n// note: in case this line causes the compiler to complain about the const'ness of the session_ in\n//       this call, your CMSSW version might not yet support passing a const session, so in this\n//       case, pass \"const_cast<tensorflow::Session*>(session_)\"\ntensorflow::run(session_, {{inputTensorName_, input}}, {outputTensorName_}, &outputs);\n\n// print the output\nstd::cout << \" -> \" << outputs[0].matrix<float>()(0, 0) << std::endl << std::endl;\n}\n\nDEFINE_FWK_MODULE(MyPlugin);\n
<use name=\"FWCore/Framework\" />\n<use name=\"FWCore/PluginManager\" />\n<use name=\"FWCore/ParameterSet\" />\n<use name=\"PhysicsTools/TensorFlow\" />\n\n<flags EDM_PLUGIN=\"1\" />\n
# coding: utf-8\n\nimport os\n\nimport FWCore.ParameterSet.Config as cms\nfrom FWCore.ParameterSet.VarParsing import VarParsing\n\n\n# get the data/ directory\nthisdir = os.path.dirname(os.path.abspath(__file__))\ndatadir = os.path.join(os.path.dirname(thisdir), \"data\")\n\n# setup minimal options\noptions = VarParsing(\"python\")\noptions.setDefault(\"inputFiles\", \"root://xrootd-cms.infn.it//store/mc/RunIISummer20UL17MiniAODv2/TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/MINIAODSIM/106X_mc2017_realistic_v9-v1/00000/005708B7-331C-904E-88B9-189011E6C9DD.root\")  # noqa\noptions.parseArguments()\n\n# define the process to run\nprocess = cms.Process(\"TEST\")\n\n# minimal configuration\nprocess.load(\"FWCore.MessageService.MessageLogger_cfi\")\nprocess.MessageLogger.cerr.FwkReport.reportEvery = 1\nprocess.maxEvents = cms.untracked.PSet(\n    input=cms.untracked.int32(10),\n)\nprocess.source = cms.Source(\n    \"PoolSource\",\n    fileNames=cms.untracked.vstring(options.inputFiles),\n)\n\n# process options\nprocess.options = cms.untracked.PSet(\n    allowUnscheduled=cms.untracked.bool(True),\n    wantSummary=cms.untracked.bool(True),\n)\n\n# setup MyPlugin by loading the auto-generated cfi (see MyPlugin.fillDescriptions)\nprocess.load(\"MySubsystem.MyModule.myPlugin_cfi\")\nprocess.myPlugin.graphPath = cms.string(os.path.join(datadir, \"graph.pb\"))\nprocess.myPlugin.inputTensorName = cms.string(\"input\")\nprocess.myPlugin.outputTensorName = cms.string(\"output\")\n\n# define what to run in the path\nprocess.p = cms.Path(process.myPlugin)\n
"},{"location":"inference/tensorflow2.html#gpu-backend","title":"GPU backend","text":"

By default, TensorFlow sessions are created to run on the CPU. Since CMSSW_13_1_X, the GPU backend for TensorFlow is available in the CMSSW release.

Only minimal changes to the inference code are needed to move the model to the GPU. A tensorflow::Options struct is available to set up the backend.

tensorflow::Options options { tensorflow::Backend::cuda};\n\n// initialize the cache\ntensorflow::SessionCache cache(pbFile, options);\n// or a single session\nconst tensorflow::Session* session = tensorflow::createSession(graphDef, options);\n

CMSSW modules should add an option to the PSets of their producers and analyzers so that the TensorFlow backend of the sessions created by the plugins can be configured on the fly.

"},{"location":"inference/tensorflow2.html#optimization","title":"Optimization","text":"

Depending on the use case, the following approaches can optimize the inference performance. It could be worth checking them out in your algorithm.

Further optimization approaches can be found in the integration checklist.

"},{"location":"inference/tensorflow2.html#reusing-tensors","title":"Reusing tensors","text":"

In some cases, instead of creating new input tensors for each inference call, you might want to store input tensors as members of your plugin. This is of course only possible if you know their exact shape a priori, and it comes at the cost of keeping the tensors in memory for the lifetime of your module instance.

You can use

tensor.flat<float>().setZero();\n

to reset the values of your tensor prior to each call.

"},{"location":"inference/tensorflow2.html#tensor-data-access-via-pointers","title":"Tensor data access via pointers","text":"

As shown in the examples above, tensor data can be accessed through methods such as flat<type>() or matrix<type>() which return objects that represent the underlying data in the requested structure (tensorflow::Tensor C++ API). To read and manipulate particular elements, you can directly call this object with the coordinates of an element.

// matrix returns a 2D representation\n// set element (b,i) to f\ntensor.matrix<float>()(b, i) = float(f);\n

However, doing this for a large input tensor might entail some overhead. Since the data is actually contiguous in memory (C-style \"row-major\" memory ordering), a faster (though less explicit) way of interacting with tensor data is using a pointer.

// get the pointer to the first tensor element\nfloat* d = tensor.flat<float>().data();\n

Now, the tensor data can be filled using simple and fast pointer arithmetic.

// fill tensor data using pointer arithmetic\n// memory ordering is row-major, so the outermost loop corresponds to dimension 0\nfor (size_t b = 0; b < batchSize; b++) {\nfor (size_t i = 0; i < nFeatures; i++, d++) {  // note the d++\n*d = float(i);\n}\n}\n
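The pointer loop relies on row-major ordering: element (b, i) of a tensor with shape (batchSize, nFeatures) sits at flat offset b * nFeatures + i. The following Python sketch is not CMSSW code; it mimics the nested loop on a plain list to check that correspondence. The helper name flat_offset and the small shape are made up for illustration.

```python
# Hypothetical helper, for illustration only: flat offset of element (b, i)
# in a row-major buffer of shape (batch_size, n_features).
def flat_offset(b, i, n_features):
    return b * n_features + i


batch_size, n_features = 2, 3

# stand-in for the contiguous tensor memory
buffer = [0.0] * (batch_size * n_features)

# same loop structure as the C++ snippet: the running index d advances
# once per inner iteration
d = 0
for b in range(batch_size):
    for i in range(n_features):
        assert d == flat_offset(b, i, n_features)
        buffer[d] = float(i)
        d += 1

print(buffer)  # [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
```

Because the running index always equals the row-major offset, no per-element index computation is needed, which is where the speed-up over repeated matrix access comes from.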
"},{"location":"inference/tensorflow2.html#inter-and-intra-operation-parallelism","title":"Inter- and intra-operation parallelism","text":"

Debugging and local processing only

Parallelism between (inter) and within (intra) operations can greatly improve the inference performance. However, this allows TensorFlow to manage and schedule threads on its own, possibly interfering with the thread model inherent to CMSSW. For inference code that is to be officially integrated, you should avoid inter- and intra-op parallelism and rather adhere to the examples shown above.

You can configure the number of inter- and intra-op threads via the second argument of the tensorflow::createSession method.

Simple (first snippet) and verbose (second snippet) variants:
tensorflow::Session* session = tensorflow::createSession(graphDef, nThreads);\n
tensorflow::SessionOptions sessionOptions;\nsessionOptions.config.set_intra_op_parallelism_threads(nThreads);\nsessionOptions.config.set_inter_op_parallelism_threads(nThreads);\n\ntensorflow::Session* session = tensorflow::createSession(graphDef, sessionOptions);\n

Then, when calling tensorflow::run, pass the internal name of the TensorFlow threadpool, i.e. \"tensorflow\", as the last argument.

std::vector<tensorflow::Tensor> outputs;\ntensorflow::run(\nsession,\n{ { inputTensorName, input } },\n{ outputTensorName },\n&outputs,\n\"tensorflow\"\n);\n
"},{"location":"inference/tensorflow2.html#miscellaneous","title":"Miscellaneous","text":""},{"location":"inference/tensorflow2.html#logging","title":"Logging","text":"

By default, TensorFlow logging is quite verbose. This can be changed by either setting the TF_CPP_MIN_LOG_LEVEL environment variable before calling cmsRun, or within your code through tensorflow::setLogging(level).

Verbosity level    TF_CPP_MIN_LOG_LEVEL
debug              \"0\"
info               \"1\" (default)
warning            \"2\"
error              \"3\"
none               \"4\"
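As a minimal sketch of the environment-variable route, the mapping from the table can be kept in a small dictionary and applied to the environment handed to cmsRun. The helper name env_with_log_level is hypothetical and only serves this illustration.

```python
import os

# verbosity name -> value of TF_CPP_MIN_LOG_LEVEL, copied from the table
LOG_LEVELS = {'debug': '0', 'info': '1', 'warning': '2', 'error': '3', 'none': '4'}


def env_with_log_level(verbosity, base_env=None):
    # hypothetical helper: return a copy of the environment with the level set
    env = dict(os.environ if base_env is None else base_env)
    env['TF_CPP_MIN_LOG_LEVEL'] = LOG_LEVELS[verbosity]
    return env


env = env_with_log_level('warning', base_env={})
print(env)  # {'TF_CPP_MIN_LOG_LEVEL': '2'}

# the resulting dict could then be passed on, for example via
# subprocess.run(['cmsRun', '<CFG_FILE>'], env=env_with_log_level('warning'))
```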

Forwarding logs to the MessageLogger service is not possible yet.

"},{"location":"inference/tensorflow2.html#links-and-further-reading","title":"Links and further reading","text":"
  • cmsml package
  • CMSSW
    • TensorFlow interface documentation
    • TensorFlow interface header
    • CMSSW process options
    • C++ interface of stream modules
  • TensorFlow
    • TensorFlow 2 tutorial
    • tf.function
    • C++ API
    • tensorflow::Tensor
    • tensorflow::Operation
    • tensorflow::ClientSession
  • Keras
    • API

Authors: Marcel Rieger

"},{"location":"inference/tfaas.html","title":"TFaaS","text":""},{"location":"inference/tfaas.html#tensorflow-as-a-service","title":"TensorFlow as a Service","text":"

TensorFlow as a Service (TFaaS) was developed as a general purpose service which can be deployed on any infrastructure, from a personal laptop or VM to cloud infrastructure, including kubernetes/docker based ones. The main repository contains all details about the service, including installation instructions, an end-to-end example, and a demo.

For CERN users, TFaaS is already deployed at the following URL: https://cms-tfaas.cern.ch

It can be used by CMS members with any HTTP based client. For example, here is a basic request using the curl client:

curl -k https://cms-tfaas.cern.ch/models\n[\n  {\n    \"name\": \"luca\",\n    \"model\": \"prova.pb\",\n    \"labels\": \"labels.csv\",\n    \"options\": null,\n    \"inputNode\": \"dense_1_input\",\n    \"outputNode\": \"output_node0\",\n    \"description\": \"\",\n    \"timestamp\": \"2021-10-22 14:04:52.890554036 +0000 UTC m=+600537.976386186\"\n  },\n  {\n    \"name\": \"test_luca_1024\",\n    \"model\": \"saved_model.pb\",\n    \"labels\": \"labels.txt\",\n    \"options\": null,\n    \"inputNode\": \"dense_input_1:0\",\n    \"outputNode\": \"dense_3/Sigmoid:0\",\n    \"description\": \"\",\n    \"timestamp\": \"2021-10-22 14:04:52.890776518 +0000 UTC m=+600537.976608672\"\n  },\n  {\n    \"name\": \"vk\",\n    \"model\": \"model.pb\",\n    \"labels\": \"labels.txt\",\n    \"options\": null,\n    \"inputNode\": \"dense_1_input\",\n    \"outputNode\": \"output_node0\",\n    \"description\": \"\",\n    \"timestamp\": \"2021-10-22 14:04:52.890903234 +0000 UTC m=+600537.976735378\"\n  }\n]\n

The following APIs are available:

  • /upload to push your favorite TF model to the TFaaS server, either as a form or as a tarball bundle (see examples below)
  • /delete to delete your TF model from the TFaaS server
  • /models to view existing TF models on the TFaaS server
  • /predict/json to serve TF model predictions in JSON data-format
  • /predict/proto to serve TF model predictions in ProtoBuffer data-format
  • /predict/image to serve TF model predictions for images in JPG/PNG formats

"},{"location":"inference/tfaas.html#look-up-your-favorite-model","title":"\u2780 look-up your favorite model","text":"

You may easily look up your ML model on the TFaaS server, e.g.

curl https://cms-tfaas.cern.ch/models\n# the output may look like this\n[\n  {\n    \"name\": \"luca\",\n    \"model\": \"prova.pb\",\n    \"labels\": \"labels.csv\",\n    \"options\": null,\n    \"inputNode\": \"dense_1_input\",\n    \"outputNode\": \"output_node0\",\n    \"description\": \"\",\n    \"timestamp\": \"2021-11-08 20:07:18.397487027 +0000 UTC m=+2091094.457327022\"\n  }\n  ...\n]\n
The provided /models API lists the name of the model, its file name, labels file, possible options, input and output nodes, description, and the timestamp at which it was added to the TFaaS repository.

"},{"location":"inference/tfaas.html#upload-your-tf-model-to-tfaas-server","title":"\u2781 upload your TF model to TFaaS server","text":"

If your model is not on the TFaaS server, you may easily add it as follows:

# example of image based model upload\ncurl -X POST https://cms-tfaas.cern.ch/upload \\\n    -F 'name=ImageModel' -F 'params=@/path/params.json' \\\n    -F 'model=@/path/tf_model.pb' -F 'labels=@/path/labels.txt'\n\n# example of TF pb file upload\ncurl -s -X POST https://cms-tfaas.cern.ch/upload \\\n    -F 'name=vk' -F 'params=@/path/params.json' \\\n    -F 'model=@/path/model.pb' -F 'labels=@/path/labels.txt'\n\n# example of bundle upload produced with Keras TF\n# here is our saved model area\nls model\nassets         saved_model.pb variables\n# we can create tarball and upload it to TFaaS via bundle end-point\ntar cfz model.tar.gz model\ncurl -X POST -H \"Content-Encoding: gzip\" \\\n             -H \"content-type: application/octet-stream\" \\\n             --data-binary @/path/model.tar.gz https://cms-tfaas.cern.ch/upload\n

"},{"location":"inference/tfaas.html#get-your-predictions","title":"\u2782 get your predictions","text":"

Finally, you may obtain predictions from your favorite model by using the appropriate API, e.g.

# obtain predictions from your ImageModel\ncurl https://cms-tfaas.cern.ch/image -F 'image=@/path/file.png' -F 'model=ImageModel'\n\n# obtain predictions from your TF based model\ncat input.json\n{\"keys\": [...], \"values\": [...], \"model\":\"model\"}\n\n# call to get predictions from /json end-point using input.json\ncurl -s -X POST -H \"Content-type: application/json\" \\\n    -d@/path/input.json https://cms-tfaas.cern.ch/json\n
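The JSON request from the curl example can also be prepared from Python. The sketch below uses only the standard library and the endpoint shown in the curl call; the feature names, values and model name in the payload are placeholders, and the request is built but not sent so that it runs without network access.

```python
import json
import urllib.request

# endpoint taken from the curl example
URL = 'https://cms-tfaas.cern.ch/json'

# placeholder payload: keys, values and model name are made up
payload = {'keys': ['feature_a', 'feature_b'], 'values': [1.0, 2.0], 'model': 'model'}

request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode('utf-8'),
    headers={'Content-type': 'application/json'},
    method='POST',
)

print(request.get_method(), request.full_url)

# sending is left commented out on purpose:
# with urllib.request.urlopen(request) as response:
#     body = response.read()
```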

For more information please visit the curl client page.

"},{"location":"inference/tfaas.html#tfaas-interface","title":"TFaaS interface","text":"

Clients communicate with TFaaS via the HTTP protocol. See examples for Curl, Python and C++ clients.

"},{"location":"inference/tfaas.html#tfaas-benchmarks","title":"TFaaS benchmarks","text":"

Benchmark results on CentOS, 24 cores, 32GB of RAM serving DL NN with 42x128x128x128x64x64x1x1 architecture (JSON and ProtoBuffer formats show similar performance):

  • 400 req/sec for 100 concurrent clients, 1000 requests in total
  • 480 req/sec for 200 concurrent clients, 5000 requests in total

For more information please visit the benchmarks page.

"},{"location":"inference/xgboost.html","title":"Direct inference with XGBoost","text":""},{"location":"inference/xgboost.html#general","title":"General","text":"

XGBoost is available (at least) since CMSSW_9_2_4 cmssw#19377.

In the CMSSW environment, XGBoost can be used via its Python API.

For the UL era, different versions are available for different SCRAM_ARCH:

  1. For slc7_amd64_gcc700 and above, ver.0.80 is available.

  2. For slc7_amd64_gcc900 and above, ver.1.3.3 is available.

  3. Please note that different major versions behave differently (see the Caveat section).

"},{"location":"inference/xgboost.html#existing-examples","title":"Existing Examples","text":"

There are some good existing examples of using XGBoost under CMSSW, as listed below:

  1. Official sample for testing the integration of the XGBoost library with CMSSW.

  2. Useful code created by Dr. Huilin Qu for inference with an existing trained model.

  3. C/C++ interface for inference with an existing trained model.

We will provide examples for both the C/C++ and Python interfaces of XGBoost under the CMSSW environment.

"},{"location":"inference/xgboost.html#example-classification-of-points-from-joint-gaussian-distribution","title":"Example: Classification of points from joint-Gaussian distribution.","text":"

In this specific example, you will use XGBoost to classify data points generated from two 8-dimensional joint-Gaussian distributions.

Feature Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
\u03bc1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
\u03bc2 | 0 | 1.9 | 3.2 | 4.5 | 4.8 | 6.1 | 8.1 | 11
\u03c3\u00bd = \u03c3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
|\u03bc1 - \u03bc2| / \u03c3 | 1 | 0.1 | 0.2 | 0.5 | 0.2 | 0.1 | 1.1 | 3

All generated data points for training (10000 points from each distribution) and testing (1000 points from each distribution) are stored as Train_data.csv and Test_data.csv, respectively.
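A toy dataset of this kind can be generated along the following lines. This is only a sketch under stated assumptions: the features are drawn independently (no correlations), the first distribution is labelled 0 and the second 1, and the column names '0' to '7' plus 'Type' follow the training script; the helper names are hypothetical and the actual files may have been produced differently.

```python
import csv
import random

MU_1 = [1, 2, 3, 4, 5, 6, 7, 8]
MU_2 = [0, 1.9, 3.2, 4.5, 4.8, 6.1, 8.1, 11]
SIGMA = 1.0


def separation(mu_a, mu_b, sigma):
    # |mu1 - mu2| / sigma per feature, rounded to suppress floating-point noise.
    return [round(abs(a - b) / sigma, 6) for a, b in zip(mu_a, mu_b)]


def sample(mu, sigma, n, label, rng):
    # Assumption: independent Gaussian features with a common sigma.
    # The class label is appended as the last column.
    return [[rng.gauss(m, sigma) for m in mu] + [label] for _ in range(n)]


def write_csv(path, rows):
    # Header: feature columns '0'..'7' followed by the label column 'Type'.
    with open(path, 'w', newline='') as handle:
        writer = csv.writer(handle)
        writer.writerow([str(i) for i in range(len(MU_1))] + ['Type'])
        writer.writerows(rows)


rng = random.Random(42)
train_rows = sample(MU_1, SIGMA, 10000, 0, rng) + sample(MU_2, SIGMA, 10000, 1, rng)
test_rows = sample(MU_1, SIGMA, 1000, 0, rng) + sample(MU_2, SIGMA, 1000, 1, rng)
print(separation(MU_1, MU_2, SIGMA))
```

Calling `write_csv('Train_data.csv', train_rows)` and `write_csv('Test_data.csv', test_rows)` would then store the samples; the separation values reproduce the last row of the table, with feature 7 being by far the most discriminating.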

"},{"location":"inference/xgboost.html#preparing-model","title":"Preparing Model","text":"

The training of an XGBoost model can be done outside of CMSSW. We provide a Python script for illustration.

# importing necessary modules\nimport numpy as np\nimport pandas as pd\nfrom xgboost import XGBClassifier # Or XGBRegressor for regression tasks\nimport matplotlib.pyplot as plt\n\n# specify parameters via map\nparam = {'n_estimators':50}\nxgb = XGBClassifier(**param) # unpack the map into keyword arguments\n\n# using Pandas.DataFrame data-format, other available formats are XGBoost's DMatrix and numpy.ndarray\n\nfeatures = ['0', '1', '2', '3', '4', '5', '6', '7'] # names of the feature columns\n\ntrain_data = pd.read_csv(\"path/to/the/data\") # The training dataset is code/XGBoost/Train_data.csv\n\ntrain_Variable = train_data[features] # select several columns with a list of names\ntrain_Score = train_data['Type'] # Score should be integer, 0, 1, (2 and larger for multiclass)\n\ntest_data = pd.read_csv(\"path/to/the/data\") # The testing dataset is code/XGBoost/Test_data.csv\n\ntest_Variable = test_data[features]\ntest_Score = test_data['Type']\n\n# Now the data are well prepared and named as train_Variable, train_Score and test_Variable, test_Score.\n\nxgb.fit(train_Variable, train_Score) # Training\n\nxgb.predict(test_Variable) # Outputs are integers\n\nxgb.predict_proba(test_Variable) # Output scores, output structure: [prob for 0, prob for 1,...]\n\nxgb.save_model(\"/path/to/where/you/want/ModelName.model\") # Saving model\n
The saved model ModelName.model can then be loaded by both the Python and the C/C++ APIs. Please use the same XGBoost major version for training and inference (see Caveat).

When training with data from different datasets, proper treatment of weights is necessary for better model performance. Please refer to the Official Recommendation for more details.

"},{"location":"inference/xgboost.html#cc-usage-with-cmssw","title":"C/C++ Usage with CMSSW","text":"

To use a saved XGBoost model with C/C++ code, it is convenient to use XGBoost's official C API. Here we provide a simple example.

"},{"location":"inference/xgboost.html#module-setup","title":"Module setup","text":"

There is no official CMSSW interface for XGBoost, although its library is available in the cvmfs area of CMSSW. Thus we have to use the raw c_api and set up the library manually.

  1. To run XGBoost's c_api within the CMSSW framework, start from the following standard setup.
    export SCRAM_ARCH=\"slc7_amd64_gcc700\" # To use higher version, please switch to slc7_amd64_gcc900\nexport CMSSW_VERSION=\"CMSSW_X_Y_Z\"\n\nsource /cvmfs/cms.cern.ch/cmsset_default.sh\n\ncmsrel \"$CMSSW_VERSION\"\ncd \"$CMSSW_VERSION/src\"\n\ncmsenv\nscram b\n
    In addition, add the corresponding xml file(s) to $CMSSW_BASE/config/toolbox/$SCRAM_ARCH/tools/selected/ to set up XGBoost.
  1. For lower version (<1), add two xml files as below.

    xgboost.xml

     <tool name=\"xgboost\" version=\"0.80\">\n<lib name=\"xgboost\"/>\n<client>\n<environment name=\"LIBDIR\" default=\"/cvmfs/cms.cern.ch/$SCRAM_ARCH/external/py2-xgboost/0.80-ikaegh/lib/python2.7/site-packages/xgboost/lib\"/>\n<environment name=\"INCLUDE\" default=\"/cvmfs/cms.cern.ch/$SCRAM_ARCH/external/py2-xgboost/0.80-ikaegh/lib/python2.7/site-packages/xgboost/include/\"/>\n</client>\n<runtime name=\"ROOT_INCLUDE_PATH\" value=\"$INCLUDE\" type=\"path\"/>\n<runtime name=\"PATH\" value=\"$INCLUDE\" type=\"path\"/>\n<use name=\"rabit\"/>\n</tool>\n
    rabit.xml
     <tool name=\"rabit\" version=\"0.80\">\n<client>\n<environment name=\"INCLUDE\" default=\"/cvmfs/cms.cern.ch/$SCRAM_ARCH/external/py2-xgboost/0.80-ikaegh/lib/python2.7/site-packages/xgboost/rabit/include/\"/>\n</client>\n<runtime name=\"ROOT_INCLUDE_PATH\" value=\"$INCLUDE\" type=\"path\"/>\n<runtime name=\"PATH\" value=\"$INCLUDE\" type=\"path\"/>  </tool>\n
    Please note that the path in cvmfs is not fixed; one can list all available versions in the py2-xgboost directory and choose one to use.

  2. For higher version (>=1), add one xml file

    xgboost.xml

    <tool name=\"xgboost\" version=\"1.3.3\">\n<lib name=\"xgboost\"/>\n<client>\n<environment name=\"LIBDIR\" default=\"/cvmfs/cms.cern.ch/$SCRAM_ARCH/external/xgboost/1.3.3/lib64\"/>\n<environment name=\"INCLUDE\" default=\"/cvmfs/cms.cern.ch/$SCRAM_ARCH/external/xgboost/1.3.3/include/\"/>\n</client>\n<runtime name=\"ROOT_INCLUDE_PATH\" value=\"$INCLUDE\" type=\"path\"/>\n<runtime name=\"PATH\" value=\"$INCLUDE\" type=\"path\"/>  </tool>\n
    Here, too, one is free to choose any of the versions available inside the xgboost directory.

  1. After adding xml file(s), the following commands should be executed for setting up.

    1. For lower version (<1), use
      scram setup rabit\nscram setup xgboost\n
    2. For higher version (>=1), use
      scram setup xgboost\n
  2. For using XGBoost as a plugin of CMSSW, it is necessary to add

    <use name=\"xgboost\"/>\n<flags EDM_PLUGIN=\"1\"/>\n
    in your plugins/BuildFile.xml. If you are using the interface inside the src/ or interface/ directory of your module, make sure to create a global BuildFile.xml file next to these directories, containing (at least):
    <use name=\"xgboost\"/>\n<export>\n<lib   name=\"1\"/>\n</export>\n

  3. The libxgboost.so library can be too large for a cmsRun job to load; please pre-load it with the following command:

    export LD_PRELOAD=$CMSSW_BASE/external/$SCRAM_ARCH/lib/libxgboost.so\n

"},{"location":"inference/xgboost.html#basic-usage-of-c-api","title":"Basic Usage of C API","text":"

To load a model and run inference with the c_api of XGBoost, one should construct the necessary objects:

  1. Files to include

    #include <xgboost/c_api.h> 

  2. BoosterHandle: worker of XGBoost

    // Declare Object\nBoosterHandle booster_;\n// Allocate memory in C style\nXGBoosterCreate(NULL,0,&booster_);\n// Load Model\nXGBoosterLoadModel(booster_,model_path.c_str()); // second argument should be a const char *.\n

  3. DMatrixHandle: handle to dmatrix, the data format of XGBoost

    float TestData[2000][8]; // Suppose 2000 data points, each data point has 8 dimensions\n// Assign data to the \"TestData\" 2d array ... \n// Declare object\nDMatrixHandle data_;\n// Allocate memory and use external float array to initialize\nXGDMatrixCreateFromMat((float *)TestData,2000,8,-1,&data_); // The first argument takes in float * namely 1d float array only, 2nd & 3rd: shape of input, 4th: value to replace missing ones\n

  4. XGBoosterPredict: function for inference

    bst_ulong out_len; // bst_ulong is a typedef of unsigned long\nconst float *f; // array to store predictions\nXGBoosterPredict(booster_,data_,0,0,&out_len,&f);// lower version API\n// XGBoosterPredict(booster_,data_,0,0,0,&out_len,&f);// higher version API\n/*\nlower version (ver.<1) API\nXGB_DLL int XGBoosterPredict(   \nBoosterHandle   handle,\nDMatrixHandle   dmat,\nint     option_mask, // 0 for normal output, namely reporting scores\nunsigned ntree_limit, // how many trees for prediction, set to 0 means no limit\nbst_ulong *     out_len,\nconst float **  out_result \n)\n\nhigher version (ver.>=1) API\nXGB_DLL int XGBoosterPredict(   \nBoosterHandle   handle,\nDMatrixHandle   dmat,\nint     option_mask, // 0 for normal output, namely reporting scores\nunsigned ntree_limit, // how many trees for prediction, set to 0 means no limit\nint     training, // 0 for prediction\nbst_ulong *     out_len,\nconst float **  out_result \n)\n*/\n
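Since XGDMatrixCreateFromMat takes a flat 1d float array together with the number of rows and columns, it can help to see how a 2d table maps onto such a buffer. The sketch below is in Python for brevity and is not part of the C API; the helper names `flatten_row_major` and `element` are hypothetical.

```python
from array import array


def flatten_row_major(rows):
    # Pack equal-length rows into one contiguous single-precision buffer,
    # the same layout a C `float[nrow][ncol]` array has in memory.
    ncol = len(rows[0])
    flat = array('f')
    for row in rows:
        if len(row) != ncol:
            raise ValueError('all rows must have the same number of features')
        flat.extend(row)
    return flat, len(rows), ncol


def element(flat, ncol, i, j):
    # Row i, column j sits at offset i * ncol + j in the flat buffer.
    return flat[i * ncol + j]


rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
flat, nrow, ncol = flatten_row_major(rows)
print(nrow, ncol, list(flat))
```

This is why the cast `(float *)TestData` works in the snippet above: a C 2d array is already stored row by row, so the number of rows and columns is all the extra information needed.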

"},{"location":"inference/xgboost.html#full-example","title":"Full Example","text":"Click to expand full example

The example assumes the following directory structure:

MySubsystem/MyModule/\n\u2502\n\u251c\u2500\u2500 plugins/\n\u2502   \u251c\u2500\u2500 XGBoostExample.cc\n\u2502   \u2514\u2500\u2500 BuildFile.xml\n\u2502\n\u251c\u2500\u2500 python/\n\u2502   \u2514\u2500\u2500 xgboost_cfg.py\n\u2502\n\u251c\u2500\u2500 toolbox/ (storing necessary xml(s) to be copied to toolbox/ of $CMSSW_BASE)\n\u2502   \u2514\u2500\u2500 xgboost.xml\n\u2502   \u2514\u2500\u2500 rabit.xml (lower version only)\n\u2502\n\u2514\u2500\u2500 data/\n    \u2514\u2500\u2500 Test_data.csv\n    \u2514\u2500\u2500 lowVer.model / highVer.model \n
Please also note that, in order to run inference event by event, XGBoosterPredict should be called in analyze rather than in beginJob.

The listings below appear in the following order:

  1. plugins/XGBoostExample.cc for lower version XGBoost
  2. plugins/BuildFile.xml for lower version XGBoost
  3. python/xgboost_cfg.py for lower version XGBoost
  4. plugins/XGBoostExample.cc for higher version XGBoost
  5. plugins/BuildFile.xml for higher version XGBoost
  6. python/xgboost_cfg.py for higher version XGBoost

// -*- C++ -*-\n//\n// Package:    XGB_Example/XGBoostExample\n// Class:      XGBoostExample\n//\n/**\\class XGBoostExample XGBoostExample.cc XGB_Example/XGBoostExample/plugins/XGBoostExample.cc\n\n Description: [one line class summary]\n\n Implementation:\n     [Notes on implementation]\n*/\n//\n// Original Author:  Qian Sitian\n//         Created:  Sat, 19 Jun 2021 08:38:51 GMT\n//\n//\n\n\n// system include files\n#include <memory>\n\n// user include files\n#include \"FWCore/Framework/interface/Frameworkfwd.h\"\n#include \"FWCore/Framework/interface/one/EDAnalyzer.h\"\n\n#include \"FWCore/Framework/interface/Event.h\"\n#include \"FWCore/Framework/interface/MakerMacros.h\"\n\n#include \"FWCore/ParameterSet/interface/ParameterSet.h\"\n#include \"FWCore/Utilities/interface/InputTag.h\"\n#include \"DataFormats/TrackReco/interface/Track.h\"\n#include \"DataFormats/TrackReco/interface/TrackFwd.h\"\n\n#include <xgboost/c_api.h>\n#include <vector>\n#include <tuple>\n#include <string>\n#include <iostream>\n#include <fstream>\n#include <sstream>\n\nusing namespace std;\n\nvector<vector<double>> readinCSV(const char* name){\nauto fin = ifstream(name);\nvector<vector<double>> floatVec;\nstring strFloat;\nfloat fNum;\nint counter = 0;\ngetline(fin,strFloat);\nwhile(getline(fin,strFloat))\n{\nstd::stringstream  linestream(strFloat);\nfloatVec.push_back(std::vector<double>());\nwhile(linestream>>fNum)\n{\nfloatVec[counter].push_back(fNum);\nif (linestream.peek() == ',')\nlinestream.ignore();\n}\n++counter;\n}\nreturn floatVec;\n}\n\n//\n// class declaration\n//\n\n// If the analyzer does not use TFileService, please remove\n// the template argument to the base class so the class inherits\n// from  edm::one::EDAnalyzer<>\n// This will improve performance in multithreaded jobs.\n\n\n\nclass XGBoostExample : public edm::one::EDAnalyzer<>  {\npublic:\nexplicit XGBoostExample(const edm::ParameterSet&);\n~XGBoostExample();\n\nstatic void 
fillDescriptions(edm::ConfigurationDescriptions& descriptions);\n\n\nprivate:\nvirtual void beginJob() ;\nvirtual void analyze(const edm::Event&, const edm::EventSetup&) ;\nvirtual void endJob() ;\n\n// ----------member data ---------------------------\n\nstd::string test_data_path;\nstd::string model_path;\n\n\n\n\n};\n\n//\n// constants, enums and typedefs\n//\n\n//\n// static data member definitions\n//\n\n//\n// constructors and destructor\n//\nXGBoostExample::XGBoostExample(const edm::ParameterSet& config):\ntest_data_path(config.getParameter<std::string>(\"test_data_path\")),\nmodel_path(config.getParameter<std::string>(\"model_path\"))\n{\n\n}\n\n\nXGBoostExample::~XGBoostExample()\n{\n\n// do anything here that needs to be done at desctruction time\n// (e.g. close files, deallocate resources etc.)\n\n}\n\n\n//\n// member functions\n//\n\nvoid\nXGBoostExample::analyze(const edm::Event& iEvent, const edm::EventSetup& iSetup)\n{\n}\n\n\nvoid\nXGBoostExample::beginJob()\n{\nBoosterHandle booster_;\nXGBoosterCreate(NULL,0,&booster_);\ncout<<\"Hello World No.2\"<<endl;\nXGBoosterLoadModel(booster_,model_path.c_str());\nunsigned long numFeature = 0;\ncout<<\"Hello World No.3\"<<endl;\nvector<vector<double>> TestDataVector = readinCSV(test_data_path.c_str());\ncout<<\"Hello World No.4\"<<endl;\nfloat TestData[2000][8];\ncout<<\"Hello World No.5\"<<endl;\nfor(unsigned i=0; (i < 2000); i++)\n{ for(unsigned j=0; (j < 8); j++)\n{\nTestData[i][j] = TestDataVector[i][j];\n//  cout<<TestData[i][j]<<\"\\t\";\n} //cout<<endl;\n}\ncout<<\"Hello World No.6\"<<endl;\nDMatrixHandle data_;\nXGDMatrixCreateFromMat((float *)TestData,2000,8,-1,&data_);\ncout<<\"Hello World No.7\"<<endl;\nbst_ulong out_len=0;\nconst float *f;\ncout<<out_len<<endl;\nauto ret=XGBoosterPredict(booster_, data_, 0,0,&out_len,&f);\ncout<<ret<<endl;\nfor (unsigned int i=0;i<2;i++)\nstd::cout <<  i << \"\\t\"<< f[i] << std::endl;\ncout<<\"Hello World 
No.8\"<<endl;\n}\n\nvoid\nXGBoostExample::endJob()\n{\n}\n\nvoid\nXGBoostExample::fillDescriptions(edm::ConfigurationDescriptions& descriptions) {\n//The following says we do not know what parameters are allowed so do no validation\n// Please change this to state exactly what you do use, even if it is no parameters\nedm::ParameterSetDescription desc;\ndesc.add<std::string>(\"test_data_path\");\ndesc.add<std::string>(\"model_path\");\ndescriptions.addWithDefaultLabel(desc);\n\n//Specify that only 'tracks' is allowed\n//To use, remove the default given above and uncomment below\n//ParameterSetDescription desc;\n//desc.addUntracked<edm::InputTag>(\"tracks\",\"ctfWithMaterialTracks\");\n//descriptions.addDefault(desc);\n}\n\n//define this as a plug-in\nDEFINE_FWK_MODULE(XGBoostExample);\n
<use name=\"FWCore/Framework\"/>\n<use name=\"FWCore/PluginManager\"/>\n<use name=\"FWCore/ParameterSet\"/>\n<use name=\"DataFormats/TrackReco\"/>\n<use name=\"xgboost\"/>\n<flags EDM_PLUGIN=\"1\"/>\n
# coding: utf-8\n\nimport os\n\nimport FWCore.ParameterSet.Config as cms\nfrom FWCore.ParameterSet.VarParsing import VarParsing\n\n# setup minimal options\n#options = VarParsing(\"python\")\n#options.setDefault(\"inputFiles\", \"root://xrootd-cms.infn.it//store/mc/RunIIFall17MiniAOD/DYJetsToLL_M-10to50_TuneCP5_13TeV-madgraphMLM-pythia8/MINIAODSIM/94X_mc2017_realistic_v10-v2/00000/9A439935-1FFF-E711-AE07-D4AE5269F5FF.root\")  # noqa\n#options.parseArguments()\n\n# define the process to run\nprocess = cms.Process(\"TEST\")\n\n# minimal configuration\nprocess.load(\"FWCore.MessageService.MessageLogger_cfi\")\nprocess.MessageLogger.cerr.FwkReport.reportEvery = 1\nprocess.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(1))\n#process.source = cms.Source(\"PoolSource\",\n#    fileNames=cms.untracked.vstring('file:/afs/cern.ch/cms/Tutorials/TWIKI_DATA/TTJets_8TeV_53X.root'))\nprocess.source = cms.Source(\"EmptySource\")\n# process options\nprocess.options = cms.untracked.PSet(\n    allowUnscheduled=cms.untracked.bool(True),\n    wantSummary=cms.untracked.bool(True),\n)\n\nprocess.XGBoostExample = cms.EDAnalyzer(\"XGBoostExample\")\n\n# setup MyPlugin by loading the auto-generated cfi (see MyPlugin.fillDescriptions)\n#process.load(\"XGB_Example.XGBoostExample.XGBoostExample_cfi\")\nprocess.XGBoostExample.model_path = cms.string(\"/Your/Path/data/lowVer.model\")\nprocess.XGBoostExample.test_data_path = cms.string(\"/Your/Path/data/Test_data.csv\")\n\n# define what to run in the path\nprocess.p = cms.Path(process.XGBoostExample)\n
// -*- C++ -*-\n//\n// Package:    XGB_Example/XGBoostExample\n// Class:      XGBoostExample\n//\n/**\\class XGBoostExample XGBoostExample.cc XGB_Example/XGBoostExample/plugins/XGBoostExample.cc\n\n Description: [one line class summary]\n\n Implementation:\n     [Notes on implementation]\n*/\n//\n// Original Author:  Qian Sitian\n//         Created:  Sat, 19 Jun 2021 08:38:51 GMT\n//\n//\n\n\n// system include files\n#include <memory>\n\n// user include files\n#include \"FWCore/Framework/interface/Frameworkfwd.h\"\n#include \"FWCore/Framework/interface/one/EDAnalyzer.h\"\n\n#include \"FWCore/Framework/interface/Event.h\"\n#include \"FWCore/Framework/interface/MakerMacros.h\"\n\n#include \"FWCore/ParameterSet/interface/ParameterSet.h\"\n#include \"FWCore/Utilities/interface/InputTag.h\"\n#include \"DataFormats/TrackReco/interface/Track.h\"\n#include \"DataFormats/TrackReco/interface/TrackFwd.h\"\n\n#include <xgboost/c_api.h>\n#include <vector>\n#include <tuple>\n#include <string>\n#include <iostream>\n#include <fstream>\n#include <sstream>\n\nusing namespace std;\n\nvector<vector<double>> readinCSV(const char* name){\nauto fin = ifstream(name);\nvector<vector<double>> floatVec;\nstring strFloat;\nfloat fNum;\nint counter = 0;\ngetline(fin,strFloat);\nwhile(getline(fin,strFloat))\n{\nstd::stringstream  linestream(strFloat);\nfloatVec.push_back(std::vector<double>());\nwhile(linestream>>fNum)\n{\nfloatVec[counter].push_back(fNum);\nif (linestream.peek() == ',')\nlinestream.ignore();\n}\n++counter;\n}\nreturn floatVec;\n}\n\n//\n// class declaration\n//\n\n// If the analyzer does not use TFileService, please remove\n// the template argument to the base class so the class inherits\n// from  edm::one::EDAnalyzer<>\n// This will improve performance in multithreaded jobs.\n\n\n\nclass XGBoostExample : public edm::one::EDAnalyzer<>  {\npublic:\nexplicit XGBoostExample(const edm::ParameterSet&);\n~XGBoostExample();\n\nstatic void 
fillDescriptions(edm::ConfigurationDescriptions& descriptions);\n\n\nprivate:\nvirtual void beginJob() ;\nvirtual void analyze(const edm::Event&, const edm::EventSetup&) ;\nvirtual void endJob() ;\n\n// ----------member data ---------------------------\n\nstd::string test_data_path;\nstd::string model_path;\n\n\n\n\n};\n\n//\n// constants, enums and typedefs\n//\n\n//\n// static data member definitions\n//\n\n//\n// constructors and destructor\n//\nXGBoostExample::XGBoostExample(const edm::ParameterSet& config):\ntest_data_path(config.getParameter<std::string>(\"test_data_path\")),\nmodel_path(config.getParameter<std::string>(\"model_path\"))\n{\n\n}\n\n\nXGBoostExample::~XGBoostExample()\n{\n\n// do anything here that needs to be done at desctruction time\n// (e.g. close files, deallocate resources etc.)\n\n}\n\n\n//\n// member functions\n//\n\nvoid\nXGBoostExample::analyze(const edm::Event& iEvent, const edm::EventSetup& iSetup)\n{\n}\n\n\nvoid\nXGBoostExample::beginJob()\n{\nBoosterHandle booster_;\nXGBoosterCreate(NULL,0,&booster_);\nXGBoosterLoadModel(booster_,model_path.c_str());\nunsigned long numFeature = 0;\nvector<vector<double>> TestDataVector = readinCSV(test_data_path.c_str());\nfloat TestData[2000][8];\nfor(unsigned i=0; (i < 2000); i++)\n{ for(unsigned j=0; (j < 8); j++)\n{\nTestData[i][j] = TestDataVector[i][j];\n//  cout<<TestData[i][j]<<\"\\t\";\n} //cout<<endl;\n}\nDMatrixHandle data_;\nXGDMatrixCreateFromMat((float *)TestData,2000,8,-1,&data_);\nbst_ulong out_len=0;\nconst float *f;\nauto ret=XGBoosterPredict(booster_, data_,0, 0,0,&out_len,&f);\nfor (unsigned int i=0;i<out_len;i++)\nstd::cout <<  i << \"\\t\"<< f[i] << std::endl;\n}\n\nvoid\nXGBoostExample::endJob()\n{\n}\n\nvoid\nXGBoostExample::fillDescriptions(edm::ConfigurationDescriptions& descriptions) {\n//The following says we do not know what parameters are allowed so do no validation\n// Please change this to state exactly what you do use, even if it is no 
parameters\nedm::ParameterSetDescription desc;\ndesc.add<std::string>(\"test_data_path\");\ndesc.add<std::string>(\"model_path\");\ndescriptions.addWithDefaultLabel(desc);\n\n//Specify that only 'tracks' is allowed\n//To use, remove the default given above and uncomment below\n//ParameterSetDescription desc;\n//desc.addUntracked<edm::InputTag>(\"tracks\",\"ctfWithMaterialTracks\");\n//descriptions.addDefault(desc);\n}\n\n//define this as a plug-in\nDEFINE_FWK_MODULE(XGBoostExample);\n
<use name=\"FWCore/Framework\"/>\n<use name=\"FWCore/PluginManager\"/>\n<use name=\"FWCore/ParameterSet\"/>\n<use name=\"DataFormats/TrackReco\"/>\n<use name=\"xgboost\"/>\n<flags EDM_PLUGIN=\"1\"/>\n
# coding: utf-8\n\nimport os\n\nimport FWCore.ParameterSet.Config as cms\nfrom FWCore.ParameterSet.VarParsing import VarParsing\n\n# setup minimal options\n#options = VarParsing(\"python\")\n#options.setDefault(\"inputFiles\", \"root://xrootd-cms.infn.it//store/mc/RunIIFall17MiniAOD/DYJetsToLL_M-10to50_TuneCP5_13TeV-madgraphMLM-pythia8/MINIAODSIM/94X_mc2017_realistic_v10-v2/00000/9A439935-1FFF-E711-AE07-D4AE5269F5FF.root\")  # noqa\n#options.parseArguments()\n\n# define the process to run\nprocess = cms.Process(\"TEST\")\n\n# minimal configuration\nprocess.load(\"FWCore.MessageService.MessageLogger_cfi\")\nprocess.MessageLogger.cerr.FwkReport.reportEvery = 1\nprocess.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(10))\n#process.source = cms.Source(\"PoolSource\",\n#    fileNames=cms.untracked.vstring('file:/afs/cern.ch/cms/Tutorials/TWIKI_DATA/TTJets_8TeV_53X.root'))\nprocess.source = cms.Source(\"EmptySource\")\n#process.source = cms.Source(\"PoolSource\",\n#    fileNames=cms.untracked.vstring(options.inputFiles))\n# process options\nprocess.options = cms.untracked.PSet(\n    allowUnscheduled=cms.untracked.bool(True),\n    wantSummary=cms.untracked.bool(True),\n)\n\nprocess.XGBoostExample = cms.EDAnalyzer(\"XGBoostExample\")\n\n# setup MyPlugin by loading the auto-generated cfi (see MyPlugin.fillDescriptions)\n#process.load(\"XGB_Example.XGBoostExample.XGBoostExample_cfi\")\nprocess.XGBoostExample.model_path = cms.string(\"/Your/Path/data/highVer.model\")  \nprocess.XGBoostExample.test_data_path = cms.string(\"/Your/Path/data/Test_data.csv\")\n\n# define what to run in the path\nprocess.p = cms.Path(process.XGBoostExample)\n
"},{"location":"inference/xgboost.html#python-usage","title":"Python Usage","text":"

To use XGBoost's Python interface, use the snippet below under the CMSSW environment:

# importing necessary modules\nimport numpy as np\nimport pandas as pd\nfrom xgboost import XGBClassifier\nimport matplotlib.pyplot as plt\n\n\nxgb = XGBClassifier()\nxgb.load_model('ModelName.model')\n\n# After loading model, usage is the same as discussed in the model preparation section.\n

"},{"location":"inference/xgboost.html#caveat","title":"Caveat","text":"

It is worth mentioning that both the behavior and the APIs can differ between XGBoost versions.

  1. When using c_api for C/C++ inference, for ver.<1, the API is XGB_DLL int XGBoosterPredict(BoosterHandle handle, DMatrixHandle dmat, int option_mask, unsigned ntree_limit, bst_ulong * out_len, const float ** out_result), while for ver.>=1 an additional training argument is present and the API changes to XGB_DLL int XGBoosterPredict(BoosterHandle handle, DMatrixHandle dmat, int option_mask, unsigned int ntree_limit, int training, bst_ulong * out_len, const float ** out_result).

  2. Models from ver.>=1 cannot be used with ver.<1.

Another important issue for C/C++ users is that DMatrix only takes in single precision floats (float), not double precision floats (double).
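The effect of the single-precision requirement can be illustrated with a short sketch. It is written in Python for brevity and assumes the common case where a C float is a 4-byte IEEE-754 number; the helper `to_float32` is hypothetical.

```python
import struct


def to_float32(x):
    # Round-trip a Python float (double precision) through a 4-byte single,
    # which is what a C `float` holds on common platforms.
    return struct.unpack('f', struct.pack('f', x))[0]


value = 0.1
as_single = to_float32(value)
# The difference is the rounding introduced by the narrower type.
print(value, as_single, abs(value - as_single))
```

Values that are exactly representable in single precision (such as 0.5) survive unchanged, while most others are rounded slightly, so features read as double should be converted explicitly before filling the array handed to the DMatrix.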

"},{"location":"inference/xgboost.html#appendix-tips-for-xgboost-users","title":"Appendix: Tips for XGBoost users","text":""},{"location":"inference/xgboost.html#importance-plot","title":"Importance Plot","text":"

XGBoost uses the F-score to describe feature importance quantitatively. XGBoost's Python API provides a convenient tool, plot_importance, to plot the feature importance once training has finished.

# Once the training is done, the plot_importance function can thus be used to plot the feature importance.\nfrom xgboost import plot_importance # Import the function\n\nplot_importance(xgb) # suppose the xgboost object is named \"xgb\"\nplt.savefig(\"importance_plot.pdf\") # plot_importance is based on matplotlib, so the plot can be saved using plt.savefig()\n
The importance plot is consistent with our expectation: in our toy model, the two distributions differ most in feature \"7\" (see the toy model setup).
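To make the notion of a split-count importance concrete, here is a minimal sketch that counts how often each feature is used for a split. It rests on the assumption that the F-score shown by plot_importance with default settings is this kind of count; the tree structure and helper names are hypothetical and hand-written, not produced by XGBoost.

```python
from collections import Counter


def count_splits(node, counts):
    # Internal nodes carry a 'feature' key; leaves do not.
    if 'feature' not in node:
        return
    counts[node['feature']] += 1
    count_splits(node['left'], counts)
    count_splits(node['right'], counts)


def split_count_importance(trees):
    # Sum the split counts over all trees of the ensemble.
    counts = Counter()
    for tree in trees:
        count_splits(tree, counts)
    return dict(counts)


# Two hand-written toy trees (hypothetical, for illustration only).
leaf = {'value': 0.0}
trees = [
    {'feature': '7', 'left': {'feature': '6', 'left': leaf, 'right': leaf}, 'right': leaf},
    {'feature': '7', 'left': leaf, 'right': {'feature': '0', 'left': leaf, 'right': leaf}},
]
print(split_count_importance(trees))
```

A feature with a large separation, such as feature "7" in the toy model, tends to be chosen for many splits and therefore ranks high in such a count.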

"},{"location":"inference/xgboost.html#roc-curve-and-auc","title":"ROC Curve and AUC","text":"

The receiver operating characteristic (ROC) curve and the area under the curve (AUC) are key quantities to describe the model performance. For XGBoost, the ROC curve and the AUC score can be easily obtained with the help of scikit-learn (sklearn) functions, which are also available in the CMSSW software.

from sklearn.metrics import roc_curve, auc\nimport matplotlib.pyplot as plt\n# ROC and AUC should be obtained on test set\n# Suppose the ground truth is 'y_test', and the output score is named as 'y_score'\n\nfpr, tpr, _ = roc_curve(y_test, y_score)\nroc_auc = auc(fpr, tpr)\n\nplt.figure()\nlw = 2\nplt.plot(fpr, tpr, color='darkorange',\n         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)\nplt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')\nplt.xlim([0.0, 1.0])\nplt.ylim([0.0, 1.05])\nplt.xlabel('False Positive Rate')\nplt.ylabel('True Positive Rate')\nplt.title('Receiver operating characteristic example')\nplt.legend(loc=\"lower right\")\n# plt.show() # display the figure when not using jupyter display\nplt.savefig(\"roc.png\") # save the resulting plot\n
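For readers who want to see what the ROC curve and the AUC amount to, the following dependency-free sketch sweeps a threshold over the scores and integrates with the trapezoidal rule. It is an illustration rather than a replacement for the sklearn functions; the helper names are hypothetical, and it assumes binary labels 0/1 with both classes present.

```python
def roc_points(y_true, y_score):
    # Sort by decreasing score and lower the threshold step by step,
    # treating tied scores as a single step.
    pairs = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    # Area under the piecewise-linear ROC curve.
    area = 0.0
    for k in range(1, len(fpr)):
        area += (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
    return area


# Small hand-made example (not taken from the toy dataset).
y_test = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_test, y_score)
print(trapezoid_auc(fpr, tpr))
```

An AUC of 0.5 corresponds to the diagonal (random guessing) and 1.0 to perfect separation, which is why the diagonal is drawn as a reference line in the plot above.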

"},{"location":"inference/xgboost.html#reference-of-xgboost","title":"Reference of XGBoost","text":"
  1. XGBoost Wiki: https://en.wikipedia.org/wiki/XGBoost
  2. XGBoost Github Repo.: https://github.com/dmlc/xgboost
  3. XGBoost official API tutorials:
    1. Latest, Python: https://xgboost.readthedocs.io/en/latest/python/index.html
    2. Latest, C/C++: https://xgboost.readthedocs.io/en/latest/tutorials/c_api_tutorial.html
    3. Older (0.80), Python: https://xgboost.readthedocs.io/en/release_0.80/python/index.html
    4. No tutorial for the older version of the C/C++ API; source code: https://github.com/dmlc/xgboost/blob/release_0.80/src/c_api/c_api.cc
"},{"location":"innovation/hackathons.html","title":"CMS Machine Learning Hackathons","text":"

Welcome to the CMS ML Hackathons! Here we encourage the application of cutting-edge ML methods to particle physics problems through multi-day focused work. Form hackathon teams and work together with the ML Innovation group to get support with organization and announcements, hardware/software infrastructure, follow-up meetings and ML-related technical advice.

If you are interested in proposing a hackathon, please send an e-mail to the CMS ML Innovation conveners with a potential topic and we will get in touch!

Below follows a list of previous successful hackathons.

"},{"location":"innovation/hackathons.html#hgcal-ticl-reconstruction","title":"HGCAL TICL reconstruction","text":"

20 Jun 2022 - 24 Jun 2022 https://indico.cern.ch/e/ticlhack

Abstract: The HGCAL reconstruction relies on \u201cThe Iterative CLustering\u201d (TICL) framework. It follows an iterative approach: it first clusters energy deposits in the same layer (layer clusters) and then connects these layer clusters to reconstruct the particle shower by forming 3-D objects, the \u201ctracksters\u201d. There are multiple areas that could benefit from advanced ML techniques to further improve the reconstruction performance.

In this project we plan to tackle the following topics using ML:

  • trackster identification (i.e., identification of the type of particle initiating the shower) and energy regression
  • linking of tracksters stemming from the same particle to reconstruct the full shower and/or use a high-purity trackster as a seed and collect 2D (i.e., layer clusters) and/or 3D (i.e., tracksters) energy deposits in the vicinity of the seed trackster to fully reconstruct the particle shower
  • tuning of the existing pattern recognition algorithms
  • reconstruction under HL-LHC pile-up scenarios (e.g., PU=150-200)
  • trackster characterization, i.e., predict if a trackster is a sound object in itself or determine if it is more likely to be a composite one.
"},{"location":"innovation/hackathons.html#material","title":"Material:","text":"

A CodiMD document has been created with an overview of the topics and to keep track of the activities during the hackathon:

https://codimd.web.cern.ch/s/hMd74Yi7J

"},{"location":"innovation/hackathons.html#jet-tagging","title":"Jet tagging","text":"

8 Nov 2021 - 11 Nov 2021 https://indico.cern.ch/e/jethack

Abstract: The identification of the initial particle (quark, gluon, W/Z boson, etc.) responsible for the formation of the jet, also known as jet tagging, provides a powerful handle in both standard model (SM) measurements and searches for physics beyond the SM (BSM). In this project we propose the development of jet tagging algorithms for both small-radius (i.e., AK4) and large-radius (i.e., AK8) jets, using the PF candidates as inputs.

Two main projects are covered:

  • Jet tagging for scouting
  • Jet tagging for Level-1
"},{"location":"innovation/hackathons.html#jet-tagging-for-scouting","title":"Jet tagging for scouting","text":"

Using as inputs the PF candidates and local pixel tracks reconstructed in the scouting streams, the main goals of this project are the following:

  • Develop a jet-tagging baseline for scouting and compare the performance with the offline reconstruction
  • Understand the importance of the different input variables and the impact of various configurations (e.g., on pixel track reconstruction) on the performance
  • Compare different jet tagging approaches with performance as well as inference time in mind
  • Proof of concept: ggF H->bb, ggF HH->4b, VBF HH->4b

"},{"location":"innovation/hackathons.html#jet-tagging-for-level-1","title":"Jet tagging for Level-1","text":"

Using as input the newly developed particle flow candidates of Seeded Cone jets in the Level1 Correlator trigger, the following tasks will be worked on:

  • Develop a quark, gluon, b, pileup jet classifier for Seeded Cone R=0.4 jets using a combination of tt, VBF(H) and Drell-Yan Level1 samples
  • Develop tools to demonstrate the gain of such a jet tagging algorithm on a signal sample (like q vs g on VBF jets)
  • Study tagging performance as a function of the number of jet constituents
  • Study tagging performance for a \"real\" input vector (zero-padded, perhaps unsorted)
  • Optimise jet constituent list of SeededCone Jets (N constituents, zero-removal, sorting etc)
  • Develop q/g/W/Z/t/H classifier for Seeded Cone R=0.8 jets
"},{"location":"innovation/hackathons.html#gnn-4-tracking","title":"GNN-4-tracking","text":"

27 Sept 2021 - 1 Oct 2021

https://indico.cern.ch/e/gnn4tracks

Abstract: The aim of this hackathon is to integrate graph neural nets (GNNs) for particle tracking into CMSSW.

The hackathon will make use of a GNN model reported in the paper Charged particle tracking via edge-classifying interaction networks by Gage DeZoort, Savannah Thais, et al. They used a GNN to predict connections between detector pixel hits, and achieved accurate track building. They did this with the TrackML dataset, which uses a generic detector designed to be similar to CMS or ATLAS. Work is ongoing to apply this GNN approach to CMS data.

Tasks: The hackathon aims to create a workflow that allows graph building and GNN inference within the framework of CMSSW. This would enable accurate testing of future GNN models and comparison to existing CMSSW track building methods. The hackathon will be divided into the following subtasks:

  • Task 1: Create a package for extracting graph features and building graphs in CMSSW.
  • Task 2: GNN inference on Sonic servers
  • Task 3: Track fitting after GNN track building
  • Task 4: Performance evaluation for the new track collection
"},{"location":"innovation/hackathons.html#material_1","title":"Material:","text":"

Code is provided at this GitHub organisation. Projects are listed here.

"},{"location":"innovation/hackathons.html#anomaly-detection","title":"Anomaly detection","text":"

In this four-day Machine Learning Hackathon, we will develop new anomaly detection algorithms for New Physics detection, intended for deployment in the two main stages of the CMS data acquisition system: the Level-1 trigger and the High Level Trigger.

There are two main projects:

"},{"location":"innovation/hackathons.html#event-based-anomaly-detection-algorithms-for-the-level-1-trigger","title":"Event-based anomaly detection algorithms for the Level-1 Trigger","text":""},{"location":"innovation/hackathons.html#jet-based-anomaly-detection-algorithms-for-the-high-level-trigger-specifically-targeting-run-3-scouting","title":"Jet-based anomaly detection algorithms for the High Level Trigger, specifically targeting Run 3 scouting","text":""},{"location":"innovation/hackathons.html#material_2","title":"Material:","text":"

A list of projects can be found in this document. Instructions for fetching the data and example code for the two projects can be found at Level-1 Anomaly Detection.

"},{"location":"innovation/journal_club.html","title":"CMS Machine Learning Journal Club","text":"

Welcome to the CMS Machine Learning Journal Club (JC)! Here we read and discuss new cutting-edge ML papers, with an emphasis on how these can be used within the collaboration. Below you can find a summary of each JC as well as some code examples demonstrating how to use the tools or methods introduced.

To vote for or to propose new papers for discussion, go to https://cms-ml-journalclub.web.cern.ch/.

Below follows a complete list of all the previous CMS ML Journal Clubs, together with relevant documentation and code examples.

"},{"location":"innovation/journal_club.html#dealing-with-nuisance-parameters-using-machine-learning-in-high-energy-physics-a-review","title":"Dealing with Nuisance Parameters using Machine Learning in High Energy Physics: a Review","text":"

Tommaso Dorigo, Pablo de Castro

Abstract: In this work we discuss the impact of nuisance parameters on the effectiveness of machine learning in high-energy physics problems, and provide a review of techniques that allow to include their effect and reduce their impact in the search for optimal selection criteria and variable transformations. The introduction of nuisance parameters complicates the supervised learning task and its correspondence with the data analysis goal, due to their contribution degrading the model performances in real data, and the necessary addition of uncertainties in the resulting statistical inference. The approaches discussed include nuisance-parameterized models, modified or adversary losses, semi-supervised learning approaches, and inference-aware techniques.

  • Indico
  • Paper
"},{"location":"innovation/journal_club.html#mapping-machine-learned-physics-into-a-human-readable-space","title":"Mapping Machine-Learned Physics into a Human-Readable Space","text":"

Taylor Faucett, Jesse Thaler, Daniel Whiteson

Abstract: We present a technique for translating a black-box machine-learned classifier operating on a high-dimensional input space into a small set of human-interpretable observables that can be combined to make the same classification decisions. We iteratively select these observables from a large space of high-level discriminants by finding those with the highest decision similarity relative to the black box, quantified via a metric we introduce that evaluates the relative ordering of pairs of inputs. Successive iterations focus only on the subset of input pairs that are misordered by the current set of observables. This method enables simplification of the machine-learning strategy, interpretation of the results in terms of well-understood physical concepts, validation of the physical model, and the potential for new insights into the nature of the problem itself. As a demonstration, we apply our approach to the benchmark task of jet classification in collider physics, where a convolutional neural network acting on calorimeter jet images outperforms a set of six well-known jet substructure observables. Our method maps the convolutional neural network into a set of observables called energy flow polynomials, and it closes the performance gap by identifying a class of observables with an interesting physical interpretation that has been previously overlooked in the jet substructure literature.

  • Indico
  • Paper

"},{"location":"innovation/journal_club.html#model-interpretability-2-papers","title":"Model Interpretability (2 papers):","text":"
  • Indico
"},{"location":"innovation/journal_club.html#identifying-the-relevant-dependencies-of-the-neural-network-response-on-characteristics-of-the-input-space","title":"Identifying the relevant dependencies of the neural network response on characteristics of the input space","text":"

Stefan Wunsch, Raphael Friese, Roger Wolf, G\u00fcnter Quast

Abstract: The relation between the input and output spaces of neural networks (NNs) is investigated to identify those characteristics of the input space that have a large influence on the output for a given task. For this purpose, the NN function is decomposed into a Taylor expansion in each element of the input space. The Taylor coefficients contain information about the sensitivity of the NN response to the inputs. A metric is introduced that allows for the identification of the characteristics that mostly determine the performance of the NN in solving a given task. Finally, the capability of this metric to analyze the performance of the NN is evaluated based on a task common to data analyses in high-energy particle physics experiments.

  • Paper
"},{"location":"innovation/journal_club.html#innvestigate-neural-networks","title":"iNNvestigate neural networks!","text":"

Maximilian Alber, Sebastian Lapuschkin, Philipp Seegerer, Miriam H\u00e4gele, Kristof T. Sch\u00fctt, Gr\u00e9goire Montavon, Wojciech Samek, Klaus-Robert M\u00fcller, Sven D\u00e4hne, Pieter-Jan Kindermans

In recent years, deep neural networks have revolutionized many application domains of machine learning and are key components of many critical decision or predictive processes. Therefore, it is crucial that domain specialists can understand and analyze actions and predictions, even of the most complex neural network architectures. Despite these arguments neural networks are often treated as black boxes. In the attempt to alleviate this shortcoming many analysis methods were proposed, yet the lack of reference implementations often makes a systematic comparison between the methods a major effort. The presented library iNNvestigate addresses this by providing a common interface and out-of-the-box implementation for many analysis methods, including the reference implementation for PatternNet and PatternAttribution as well as for LRP-methods. To demonstrate the versatility of iNNvestigate, we provide an analysis of image classifications for variety of state-of-the-art neural network architectures.

  • Paper
  • Code
"},{"location":"innovation/journal_club.html#simulation-based-inference-in-particle-physics-and-beyond-and-beyond","title":"Simulation-based inference in particle physics and beyond (and beyond)","text":"

Johann Brehmer, Kyle Cranmer

Abstract: Our predictions for particle physics processes are realized in a chain of complex simulators. They allow us to generate high-fidelity simulated data, but they are not well-suited for inference on the theory parameters with observed data. We explain why the likelihood function of high-dimensional LHC data cannot be explicitly evaluated, why this matters for data analysis, and reframe what the field has traditionally done to circumvent this problem. We then review new simulation-based inference methods that let us directly analyze high-dimensional data by combining machine learning techniques and information from the simulator. Initial studies indicate that these techniques have the potential to substantially improve the precision of LHC measurements. Finally, we discuss probabilistic programming, an emerging paradigm that lets us extend inference to the latent process of the simulator.

  • Indico
  • Paper
  • Code
"},{"location":"innovation/journal_club.html#efficiency-parameterization-with-neural-networks","title":"Efficiency Parameterization with Neural Networks","text":"

C. Badiali, F.A. Di Bello, G. Frattari, E. Gross, V. Ippolito, M. Kado, J. Shlomi

Abstract: Multidimensional efficiency maps are commonly used in high energy physics experiments to mitigate the limitations in the generation of large samples of simulated events. Binned multidimensional efficiency maps are however strongly limited by statistics. We propose a neural network approach to learn ratios of local densities to estimate in an optimal fashion efficiencies as a function of a set of parameters. Graph neural network techniques are used to account for the high dimensional correlations between different physics objects in the event. We show in a specific toy model how this method is applicable to produce accurate multidimensional efficiency maps for heavy flavor tagging classifiers in HEP experiments, including for processes on which it was not trained.

  • Indico
  • Paper
  • Code

"},{"location":"innovation/journal_club.html#a-general-framework-for-uncertainty-estimation-in-deep-learning","title":"A General Framework for Uncertainty Estimation in Deep Learning","text":"

Antonio Loquercio, Mattia Seg\u00f9, Davide Scaramuzza

Neural networks predictions are unreliable when the input sample is out of the training distribution or corrupted by noise. Being able to detect such failures automatically is fundamental to integrate deep learning algorithms into robotics. Current approaches for uncertainty estimation of neural networks require changes to the network and optimization process, typically ignore prior knowledge about the data, and tend to make over-simplifying assumptions which underestimate uncertainty. To address these limitations, we propose a novel framework for uncertainty estimation. Based on Bayesian belief networks and Monte-Carlo sampling, our framework not only fully models the different sources of prediction uncertainty, but also incorporates prior data information, e.g. sensor noise. We show theoretically that this gives us the ability to capture uncertainty better than existing methods. In addition, our framework has several desirable properties: (i) it is agnostic to the network architecture and task; (ii) it does not require changes in the optimization process; (iii) it can be applied to already trained architectures. We thoroughly validate the proposed framework through extensive experiments on both computer vision and control tasks, where we outperform previous methods by up to 23% in accuracy.

  • Indico
  • Paper
  • Code
"},{"location":"optimization/data_augmentation.html","title":"Data augmentation","text":""},{"location":"optimization/data_augmentation.html#introduction","title":"Introduction","text":"

This introduction is based on papers by Shorten & Khoshgoftaar, 2019 and Rebuffi et al., 2021 among others

With the increasing complexity and size of neural networks, one needs huge amounts of data in order to train a state-of-the-art model. However, generating this data is often very resource- and time-intensive. Thus, one might either augment the existing data with more descriptive variables or combat the data scarcity problem by artificially increasing the size of the dataset by adding new instances without the resource-heavy generation process. Both processes are known in machine learning (ML) applications as data augmentation (DA) methods.

The first type of these methods is more widely known as feature generation or feature engineering and is done on instance level. Feature engineering focuses on crafting informative input features for the algorithm, often inspired or derived from first principles specific to the algorithm's application domain.

The second type of method is done on the dataset level. These types of techniques can generally be divided into two main categories: real data augmentation (RDA) and synthetic data augmentation (SDA). As the name suggests, RDA makes minor changes to the already existing data in order to generate new samples, whereas SDA generates new data from scratch. Examples of RDA include rotating (especially useful if we expect the event to be rotationally symmetric) and zooming, among a plethora of other methods detailed in this overview article. Examples of SDA include traditional sampling methods and more complex generative models like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). Going further, the generative methods used for synthetic data augmentation could also be used for fast simulation, since simulation is a notable bottleneck in the overall physics analysis workflow.

Dataset augmentation may lead to more successful algorithm outcomes. For example, introducing noise into data to form additional data points improves the learning ability of several models which otherwise performed relatively poorly, as shown by Freer & Yang, 2020. This finding implies that this form of DA creates variations that the model may see in the real world. If done right, preprocessing the data with DA will result in superior training outcomes. This improvement in performance is due to the fact that DA methods act as a regularizer, reducing overfitting during training. In addition to simulating real-world variations, DA methods can also even out categorical data with imbalanced classes.

Fig. 1: Generic pipeline of a heuristic DA (figure taken from Li, 2020)

Before diving more in depth into the various DA methods and applications in HEP, here is a list of the most notable benefits of using DA methods in your ML workflow:

  • Improvement of model prediction precision
  • More training data for the model
  • Preventing data scarcity for state-of-the-art models
  • Reduction of overfitting and creation of data variability
  • Increased model generalization properties
  • Help in resolving class imbalance problems in datasets
  • Reduced cost of data collection and labeling
  • Enabling rare event prediction

And some words of caution:

  • There is no 'one size fits all' in DA. Each dataset and use case should be considered separately.
  • Don't trust the augmented data blindly.
  • Make sure that the augmented data is representative of the problem at hand, otherwise it will negatively affect the model performance.
  • There must be no unnecessary duplication of existing data: we gain more insight only by adding unique information.
  • Ensure the validity of the augmented data before using it in ML models.
  • If a real dataset contains biases, data augmented from it will contain biases, too. Identifying a suitable data augmentation strategy is therefore important, so double-check your DA strategy.
"},{"location":"optimization/data_augmentation.html#feature-engineering","title":"Feature Engineering","text":"

This part is based mostly on Erdmann et al., 2018

Feature engineering (FE) is one of the key components of a machine learning workflow. This process transforms and augments training data with additional features in order to make the training more effective.

With multivariate analyses (MVAs), such as boosted decision trees (BDTs) and neural networks, one could start with raw, \"low-level\" features, like four-momenta, and the algorithm can learn higher-level patterns, correlations, metrics, etc. However, using \"high-level\" variables, in many cases, leads to outcomes superior to the use of low-level variables. As such, features used in MVAs are often handcrafted from physics first principles.

Still, it is shown that a deep neural network (DNN) can perform better if it is trained with both specifically constructed variables and low-level variables. This observation suggests that the network extracts additional information from the training data.

"},{"location":"optimization/data_augmentation.html#hep-application-lorentz-boosted-network","title":"HEP Application - Lorentz Boosted Network","text":"

For the purposes of FE in HEP, a novel ML architecture called a Lorentz Boost Network (LBN) (see Fig. 2) was proposed and implemented by Erdmann et al., 2018. It is a multipurpose method that uses Lorentz transformations to exploit and uncover structures in particle collision events. LBN is the first stage of a two-stage neural network (NN) model that enables a fully autonomous and comprehensive characterization of collision events by exploiting exclusively the four-momenta of the final-state particles.

Within LBN, particles are combined to create rest-frame representations, which enables the formation of further composite particles. These combinations are realized via linear combinations of N input four-vectors to a number of M particles and rest frames. Subsequently, these composite particles are transformed into said rest frames by Lorentz transformations in an efficient and fully vectorized implementation.
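The two operations described above (linear combination of four-vectors, followed by a boost into a rest frame) can be sketched with NumPy. This is only an illustration and not the LBN package code: the function names combine and boost_to_rest_frame are hypothetical, the combination weights are fixed numbers rather than trained parameters, and four-vectors are assumed to be ordered as (E, px, py, pz) with a positive invariant mass for the rest frame.

```python
import numpy as np

def combine(four_vectors, weights):
    '''Build M composite four-vectors from N inputs.

    four_vectors has shape (N, 4); weights has shape (N, M).
    '''
    return np.asarray(weights, dtype=float).T @ np.asarray(four_vectors, dtype=float)

def boost_to_rest_frame(particles, frame):
    '''Boost four-vectors (E, px, py, pz) into the rest frame of frame.'''
    particles = np.atleast_2d(np.asarray(particles, dtype=float))
    frame = np.asarray(frame, dtype=float)
    energy, momentum = frame[0], frame[1:]
    mass = np.sqrt(energy**2 - momentum @ momentum)  # assumed to be > 0
    beta = momentum / energy                          # velocity of the frame
    gamma = energy / mass
    beta_sq = beta @ beta
    if beta_sq == 0.0:
        return particles.copy()                       # frame is already at rest
    e, p = particles[:, 0], particles[:, 1:]
    beta_dot_p = p @ beta
    e_new = gamma * (e - beta_dot_p)
    coeff = (gamma - 1.0) * beta_dot_p / beta_sq - gamma * e
    p_new = p + coeff[:, None] * beta
    return np.column_stack([e_new, p_new])
```

Boosting the rest-frame four-vector itself returns (m, 0, 0, 0), which is a quick sanity check for the sign conventions used here.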

The properties of the composite, transformed particles are compiled in the form of characteristic variables like masses, angles, etc. that serve as input for a subsequent network - the second stage, which has to be configured for a specific analysis task, like classification.

The authors observed leading performance with the LBN and demonstrated that LBN forms physically meaningful particle combinations and generates suitable characteristic variables.

The usual ML workflow, employing LBN, is as follows:

Step-1: LBN(M, F)\n\n    1.0: Input hyperparameters: number of combinations M; number of features F\n    1.0: Choose: number of incoming particles, N, according to the research\n         question\n\n    1.1: Combination of input four-vectors to particles and rest frames\n\n    1.2: Lorentz transformations\n\n    1.3: Extraction of suitable high-level objects\n\n\nStep-2: NN\n\n    2.X: Train some form of a NN using an objective function that depends on\n         the analysis / research question.\n
Fig. 2: The Lorentz Boost Network architecture (figure taken from Erdmann et al., 2018)

The LBN package is also pip-installable:

pip install lbn\n
"},{"location":"optimization/data_augmentation.html#rda-techniques","title":"RDA Techniques","text":"

This section and the following subsection are based on the papers by Freer & Yang, 2020, Dolan & Ore, 2021, Barnard et al., 2016, and Bradshaw et al., 2019

RDA methods augment the existing dataset by performing some transformation on the existing data points. These transformations could include rotation, flipping, color shift (for an image), Fourier transforming (for signal processing) or some other transformation that preserves the validity of the data point and its corresponding label. As mentioned in Freer & Yang, 2020, these types of transformations augment the dataset to capture potential variations that the population of data may exhibit, allowing the network to capture a more generalized view of the sampled data.
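As a minimal sketch of such label-preserving transformations, the NumPy snippet below extends a batch of images with horizontal flips, vertical flips and 180-degree rotations. The function name augment_with_flips is hypothetical, and the sketch assumes that the labels really are invariant under these transformations, which has to be checked for each use case.

```python
import numpy as np

def augment_with_flips(images, labels):
    '''Return the originals plus three transformed copies of every image.

    images has shape (n_images, height, width); labels has shape (n_images,).
    '''
    images = np.asarray(images)
    labels = np.asarray(labels)
    variants = [
        images,                  # originals
        images[:, :, ::-1],      # horizontal flip
        images[:, ::-1, :],      # vertical flip
        images[:, ::-1, ::-1],   # rotation by 180 degrees
    ]
    augmented_images = np.concatenate(variants, axis=0)
    # Each transformed copy keeps the label of the image it came from.
    augmented_labels = np.tile(labels, len(variants))
    return augmented_images, augmented_labels
```

Only shape-preserving transformations are used, so the result can be fed to the same network input layer as the original images.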

"},{"location":"optimization/data_augmentation.html#hep-application-zooming","title":"HEP Application - Zooming","text":"

In Barnard et al., 2016, the authors investigate the effect of parton shower modelling in DNN jet taggers using images of hadronically decaying W bosons. They introduce a method known as zooming to study the scale invariance of these networks. This is the RDA strategy used by Dolan & Ore, 2021. Zooming is similar to a normalization procedure in that it standardizes features in signal data, but it aims not to create similar features in background.

After some standard data processing steps, including jet trimming and clustering via the \\(k_t\\) algorithm, and some further processing to remove spatial symmetries, the resulting jet image depicts the leading subjet with the subleading subjet directly below it. Barnard et al., 2016 note that the separation between the leading and subleading subjets varies linearly as \\(2m/p_T\\), where \\(m\\) and \\(p_T\\) are the mass and transverse momentum of the jet. Standardizing this separation, or removing the linear dependence, would allow the DNN tagger to generalize to a wide range of jet \\(p_T\\). To this end, the authors construct a factor, \\(R/\\Delta R_{act}\\), where \\(R\\) is some fixed value and \\(\\Delta R_{act}\\) is the separation between the leading and subleading subjets. To discriminate between signal and background images with this factor, the authors enlarge the jet images by a scaling factor of \\(\\text{max}(R/s,1)\\) where \\(s = 2m_W/p_T\\) and \\(R\\) is the original jet clustering size. This process of jet image enlargement by a linear mass and \\(p_T\\) dependent factor to account for the distance between the leading and subleading jet is known as zooming. This process can be thought of as an RDA technique to augment the data in a domain-specific way.
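The scaling factor max(R/s, 1) with s = 2 m_W / p_T is simple enough to compute directly. The sketch below only evaluates the factor and does not resample the image; the function name zoom_factor and the default W mass of 80.4 GeV are assumptions made for illustration.

```python
def zoom_factor(jet_pt, cluster_radius, m_w=80.4):
    '''Scaling factor max(R / s, 1) with s = 2 * m_W / p_T.

    jet_pt and m_w must share the same units (GeV is assumed here).
    '''
    s = 2.0 * m_w / jet_pt
    return max(cluster_radius / s, 1.0)
```

For low-momentum jets, where s exceeds the clustering radius, the factor is clipped to 1 and the image is left unchanged; for highly boosted jets the image is enlarged so that the subjet separation is brought back to a common scale.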

An advantage of using the zooming technique is that it makes the construction of scale-invariant taggers easier. Scale-invariant searches which are able to interpolate between the boosted and resolved parts of phase space have the advantage of being applicable over a broad range of masses and kinematics, allowing a single search or analysis to be effective where previously more than one may have been necessary.

As predicted, the zoomed network outperforms the unzoomed one, particularly at low signal efficiency, where the background rejection rises by around 20%. Zooming has the greatest effect at high \\(p_T\\).

"},{"location":"optimization/data_augmentation.html#traditional-sda-techniques","title":"Traditional SDA Techniques","text":"

Text in part based on He et al., 2010

Generally speaking, imbalanced learning occurs whenever some type of data distribution dominates the instance space compared to other data distributions. Methods for handling imbalanced learning problems can be divided into the following five major categories:

  • Sampling strategies
  • Synthetic data generation (SMOTE & ADASYN & DataBoost-IM) - aims to overcome the imbalance by artificially generating data samples.
  • Cost-sensitive learning - uses a cost matrix for different types of errors or instances to facilitate learning from imbalanced data sets. This means that cost-sensitive learning does not modify the imbalanced data distribution directly, but targets this problem by using different cost matrices that describe the cost for misclassifying any particular data sample.
  • Active learning - conventionally used to solve problems related to unlabeled data, though recently it has been used in learning imbalanced data sets. Instead of searching the entire training space, this method effectively selects informative instances from a random set of training populations, therefore significantly reducing the computational cost when dealing with large imbalanced data sets.
  • Kernel-based methods - by integrating the regularized orthogonal weighted least squares (ROWLS) estimator, a kernel classifier construction algorithm is based on orthogonal forward selection (OFS) to optimize the model generalization for learning from two-class imbalanced data sets.
"},{"location":"optimization/data_augmentation.html#sampling","title":"Sampling","text":"

When the percentage of the minority class is less than 5%, it can be considered a rare event. When a dataset is imbalanced or when a rare event occurs, it will be difficult to get a meaningful and good predictive model due to lack of information about the rare event (Au et al., 2010). In these cases, re-sampling techniques can be helpful. The re-sampling techniques are implemented in four different categories: undersampling the majority class, oversampling the minority class, combining over- and undersampling, and ensemble sampling. Oversampling and undersampling are found to work well in improving the classification for the imbalanced dataset (Yap et al., 2013).

Stratified sampling (STS)

This technique is used in cases where the data can be partitioned into strata (subpopulations), where the strata should be collectively exhaustive and mutually exclusive. The process of dividing the data into homogeneous subgroups before sampling is referred to as stratification. The two common strategies of STS are proportionate allocation (PA) and optimum (disproportionate) allocation (OA). The former uses a sampling fraction in each of the strata that is proportional to that of the total population. The latter uses the standard deviation of the distribution of the variable as well, so that larger samples are taken from the strata that have the greatest variability, in order to generate the least possible sampling variance. The advantages of using STS include smaller error in estimation (if measurements within strata have lower standard deviation) and similarity in uncertainties across all strata in case there is high variability in a given stratum.

NOTE: STS is only useful if the population can be exhaustively partitioned into subgroups. Also, unknown class priors (the ratio of strata to the whole population) might have deleterious effects on the classification performance.
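A minimal sketch of proportionate allocation with NumPy is given below. The function name proportionate_stratified_sample is hypothetical, and the sketch assumes that the stratum of every entry is known. Because the per-stratum counts are rounded, the total number of drawn entries can differ slightly from the requested sample size.

```python
import numpy as np

def proportionate_stratified_sample(strata_labels, sample_size, seed=0):
    '''Draw indices so that each stratum keeps its share of the population.'''
    rng = np.random.default_rng(seed)
    strata_labels = np.asarray(strata_labels)
    chosen = []
    for stratum in np.unique(strata_labels):
        members = np.flatnonzero(strata_labels == stratum)
        # Sample count proportional to the stratum size.
        n_take = int(round(sample_size * len(members) / len(strata_labels)))
        n_take = min(n_take, len(members))
        chosen.append(rng.choice(members, size=n_take, replace=False))
    return np.sort(np.concatenate(chosen))
```

Optimum allocation would additionally weight n_take by the standard deviation of the variable of interest within each stratum, which is not shown here.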

Over- and undersampling

Oversampling randomly duplicates minority class samples, while undersampling discards majority class samples in order to modify the class distribution. While oversampling might lead to overfitting, since it makes exact copies of the minority samples, undersampling may discard potentially useful majority samples.

Oversampling and undersampling are essentially opposite and roughly equivalent techniques. There are also more complex oversampling techniques, including the creation of artificial data points with algorithms like Synthetic Minority Over-sampling TEchnique (SMOTE).

It has been shown that the combination of SMOTE and undersampling performs better than only undersampling the majority class. However, over- and undersampling remain popular, as each is much easier to implement alone than as part of some complex hybrid approach.
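Random over- and undersampling as described above can be sketched in a few lines of NumPy. The function name random_resample is hypothetical, and the sketch assumes a two-class problem in which the minority class really is the smaller one; both modes balance the classes exactly.

```python
import numpy as np

def random_resample(features, labels, minority_label, mode, seed=0):
    '''Balance a two-class dataset by random over- or undersampling.'''
    rng = np.random.default_rng(seed)
    features = np.asarray(features)
    labels = np.asarray(labels)
    minority = np.flatnonzero(labels == minority_label)
    majority = np.flatnonzero(labels != minority_label)
    if mode == 'over':
        # Duplicate randomly chosen minority entries (exact copies).
        extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
        keep = np.concatenate([majority, minority, extra])
    elif mode == 'under':
        # Discard randomly chosen majority entries.
        kept_majority = rng.choice(majority, size=len(minority), replace=False)
        keep = np.concatenate([kept_majority, minority])
    else:
        raise ValueError('mode must be over or under')
    return features[keep], labels[keep]
```

The exact copies produced in the oversampling mode are the reason for the overfitting risk mentioned above, while the undersampling mode throws away part of the majority class.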

Synthetic Minority Over-sampling Technique (SMOTE)

Text mostly based on Chawla et al., 2002 and in part on He et al., 2010

In case of Synthetic Minority Over-sampling Technique (SMOTE), the minority class is oversampled by creating synthetic examples along the line segments joining any or all of the \\(k\\)-nearest neighbours in the minority class. The synthetic examples cause the classifier to create larger and less specific decision regions, rather than smaller and more specific regions. More general regions are now learned for the minority class samples rather than those being subsumed by the majority class samples around them. In this way SMOTE shifts the classifier learning bias toward the minority class and thus has the effect of allowing the model to generalize better.

There also exist extensions of this work like SMOTE-Boost, in which the synthetic procedure was integrated with adaptive boosting techniques to change the method of updating weights to better compensate for skewed distributions.

In general, SMOTE proceeds as follows:

SMOTE(N, X, k)\nInput: N - Number of synthetic samples to be generated\n       X - Underrepresented data\n       k - Hyperparameter of number of nearest neighbours to be chosen\n\nCreate an empty list SYNTHETIC_SAMPLES\nWhile N_SYNTHETIC_SAMPLES < N\n    1. Randomly choose an entry xRand from X\n    2. Find the k nearest neighbours of xRand in X\n    3. Randomly choose an entry xNeighbour from the k nearest neighbours\n    4. Take the difference dx = xNeighbour - xRand\n    5. Multiply dx by a random number between 0 and 1\n    6. Append xRand + dx to SYNTHETIC_SAMPLES\nExtend X by SYNTHETIC_SAMPLES\n
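The pseudocode above translates almost line by line into NumPy. The following is a minimal sketch rather than a reference implementation: the function name smote is hypothetical, Euclidean distance is assumed, and the neighbour search is a brute-force scan that is only suitable for small minority samples with at least two entries.

```python
import numpy as np

def smote(minority, n_synthetic, k=5, seed=0):
    '''Generate n_synthetic points on segments between minority neighbours.'''
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    k = min(k, len(minority) - 1)          # cannot have more neighbours than points
    synthetic = np.empty((n_synthetic, minority.shape[1]))
    for i in range(n_synthetic):
        idx = rng.integers(len(minority))  # step 1: random minority entry
        x_rand = minority[idx]
        distances = np.linalg.norm(minority - x_rand, axis=1)
        distances[idx] = np.inf            # exclude the point itself
        neighbours = np.argsort(distances)[:k]           # step 2
        x_neighbour = minority[rng.choice(neighbours)]   # step 3
        dx = x_neighbour - x_rand                        # step 4
        synthetic[i] = x_rand + rng.random() * dx        # steps 5 and 6
    return synthetic
```

Every synthetic point lies on the segment between a minority entry and one of its neighbours, so the new points never leave the convex hull of the minority class.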

Adaptive synthetic sampling approach (ADASYN)

Text mostly based on He et al., 2010

Adaptive synthetic sampling approach (ADASYN) is a sampling approach for learning from imbalanced datasets. The main idea is to use a weighted distribution for different minority class examples according to their level of difficulty in learning, where more synthetic data is generated for minority class examples that are harder to learn compared to those minority examples that are easier to learn. Thus, ADASYN improves learning with respect to the data distributions by reducing the bias introduced by the class imbalance and by adaptively shifting the classification boundary toward the difficult examples.

The objectives of ADASYN are reducing bias and learning adaptively. The key idea of this algorithm is to use a density distribution as a criterion to decide the number of synthetic samples that need to be generated for each minority data example. Physically, this density distribution is a distribution of weights for different minority class examples according to their level of difficulty in learning. The resulting dataset after using ADASYN will not only provide a balanced representation of the data distribution (according to the desired balance level defined in the configuration), but it also forces the learning algorithm to focus on those difficult-to-learn examples. It has been shown (He et al., 2010) that this algorithm improves accuracy for both minority and majority classes and does not sacrifice one class in preference for another.
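One way to sketch the density distribution is shown below, under the assumption that the difficulty of a minority example is measured by the fraction of majority-class entries among its k nearest neighbours in the full dataset. The function name adasyn_allocation and the parameter name beta (the desired balance level, with 1 meaning fully balanced) are choices made for this illustration, and only the allocation step is shown; the synthetic points themselves would then be generated by SMOTE-like interpolation.

```python
import numpy as np

def adasyn_allocation(features, labels, minority_label, k=5, beta=1.0):
    '''Number of synthetic samples to generate for each minority example.'''
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    is_minority = labels == minority_label
    n_min = int(is_minority.sum())
    n_maj = len(labels) - n_min
    total = (n_maj - n_min) * beta         # synthetic samples needed overall
    ratios = np.empty(n_min)
    for j, i in enumerate(np.flatnonzero(is_minority)):
        distances = np.linalg.norm(features - features[i], axis=1)
        distances[i] = np.inf              # exclude the point itself
        neighbours = np.argsort(distances)[:k]
        # Fraction of majority-class entries among the k nearest neighbours.
        ratios[j] = np.mean(~is_minority[neighbours])
    if ratios.sum() == 0.0:
        return np.zeros(n_min, dtype=int)  # no minority example is hard to learn
    weights = ratios / ratios.sum()        # normalised density distribution
    return np.rint(weights * total).astype(int)
```

Minority examples surrounded by majority entries receive most of the synthetic samples, whereas examples deep inside a minority cluster receive none.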

ADASYN is not limited to only two-class learning, but can also be generalized to multiple-class imbalanced learning problems as well as incremental learning applications.

For more details and comparisons of ADASYN to other algorithms, please see He et al., 2010.

"},{"location":"optimization/data_augmentation.html#existing-implementations","title":"Existing implementations","text":"

Imbalanced-learn is an open-source Python library which provides a suite of algorithms for treating the class imbalance problem.

For augmenting image data, one can use one of the following:

  • Albumentations
  • ImgAug
  • Autoaugment
  • Augmentor
  • DeepAugment

It is also possible to use tools implemented directly in TensorFlow, Keras, etc. For example:

import tensorflow as tf\n\n# image: an image tensor of shape (height, width, channels)\nflipped_image = tf.image.flip_left_right(image)\n
"},{"location":"optimization/data_augmentation.html#deep-learning-based-sda-techniques","title":"Deep Learning-based SDA Techniques","text":"

In data science, data augmentation techniques are used to increase the amount of data by either synthetically creating data from already existing samples via a GAN or modifying the data at hand with small noise or rotation. (Rebuffi et al., 2021)

More recently, data augmentation studies have begun to focus on the field of deep learning (DL), more specifically on the ability of generative models, like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), to create artificial data. This synthetic data is then introduced during the classification model training process to improve performance and results.

Generative Adversarial Networks (GANs) The following text is based on the works by Musella & Pandolfi, 2018; Hashemi et al., 2019; Kansal et al., 2022; Rehm et al., 2021; Choi & Lim, 2021; and Kansal et al., 2020.

GANs have been proposed as a fast and accurate way of modeling high-energy jet formation (Paganini et al., 2017a) and modeling showers through the calorimeters of high-energy physics experiments (Paganini et al., 2017; Paganini et al., 2012; Erdman et al., 2020; Musella & Pandolfi, 2018). GANs have also been trained to accurately approximate bottlenecks in computationally expensive simulations of particle physics experiments. Applications in the context of present and proposed CERN experiments have demonstrated the potential of these methods for accelerating simulation and/or improving simulation fidelity (ATLAS Collaboration, 2018; SHiP Collaboration, 2019).

The generative model approximates the combined response of a particle detector simulation and reconstruction algorithms to hadronic jets, given the latent space of uniformly distributed noise, auxiliary features and the jet image at particle level (jets clustered from the list of stable particles produced by PYTHIA).

In the paper by Musella & Pandolfi, 2018, the authors apply generative models parametrized by neural networks (GANs in particular) to the simulation of the particle-detector response to hadronic jets. They show that this parametrization achieves high fidelity while increasing the processing speed by several orders of magnitude.

Their model is trained to be capable of predicting the combined effect of particle-detector simulation models and reconstruction algorithms to hadronic jets.

Generative adversarial networks (GANs) are pairs of neural networks, a generative and a discriminative one, that are trained concurrently as players of a minimax game (Musella & Pandolfi, 2018). The task of the generative network is to produce, starting from a latent space with a fixed distribution, samples that the discriminative model tries to distinguish from samples drawn from a target dataset. This kind of setup allows the distribution of the target dataset to be learned, provided that both of the networks have high enough capacity.
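
As an illustration, the minimax game can be written in the form used in the original GAN formulation. This is the generic objective and an assumption here, since the loss actually used in the works cited above may differ. With \\(G\\) the generator, \\(D\\) the discriminator, \\(p_\\mathrm{data}\\) the target distribution and \\(p_z\\) the fixed latent distribution:

\\[\\min_G \\max_D\\ \\mathbb{E}_{x\\sim p_\\mathrm{data}}\\big[\\log D(x)\\big] + \\mathbb{E}_{z\\sim p_z}\\big[\\log\\big(1 - D(G(z))\\big)\\big]\\]

The discriminator maximizes this quantity by assigning high scores to target samples and low scores to generated ones, while the generator minimizes it by producing samples that the discriminator scores as target-like.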

The inputs to these networks are hadronic jets, represented as \"gray-scale\" images of fixed size centered around the jet axis, with the pixel intensity corresponding to the energy fraction in a given cell. The architectures of the networks are based on image-to-image translation. There are a few differences between this approach and image-to-image translation. Firstly, non-empty pixels are explicitly modelled in the generated images since these are much sparser than the natural ones. Secondly, feature matching and a dedicated adversarial classifier enforce good modelling of the total pixel intensity (energy). Lastly, the generator is conditioned on some auxiliary inputs.

By predicting directly the objects used at analysis level and thus reproducing the output of both detector simulation and reconstruction algorithms, computation time is reduced. This kind of philosophy is very similar to parametrized detector simulations, which are used in HEP for phenomenological studies. The attained accuracies are comparable to the full simulation and reconstruction chain.

"},{"location":"optimization/data_augmentation.html#variational-autoencoders-vaes","title":"Variational autoencoders (VAEs)","text":"

The following section is partly based on Otten et al., 2021

In contrast to the traditional autoencoder (AE) that outputs a single value for each encoding dimension, variational autoencoders (VAEs) provide a probabilistic interpretation for describing an observation in latent space.

In case of VAEs, the encoder model is sometimes referred to as the recognition model and the decoder model as generative model.

By constructing the encoder model to output a distribution of the values from which we randomly sample to feed into our decoder model, we are enforcing a continuous, smooth latent space representation. Thus we expect our decoder model to be able to accurately reconstruct the input for any sampling of the latent distributions, which then means that values residing close to each other in latent space should have very similar reconstructions.
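
The sampling step can be sketched in a few lines of NumPy. The sketch assumes a Gaussian encoder that outputs a mean and a log-variance per latent dimension (the common choice, not stated explicitly above); the encoder outputs below are hypothetical numbers, and the KL term is the standard regularizer that pulls the latent distributions towards a unit Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, 1)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, sigma^2) and N(0, 1), summed over latent dims."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

# Hypothetical encoder outputs: batch of 4 observations, 2 latent dimensions
mu = np.array([[0.0, 0.0], [1.0, -1.0], [0.5, 0.2], [-2.0, 0.3]])
log_var = np.zeros_like(mu)  # unit variance in every latent dimension

z = sample_latent(mu, log_var, rng)      # what would be fed to the decoder
kl = kl_to_standard_normal(mu, log_var)  # zero only when mu = 0 and sigma = 1
```

Because the decoder sees a different sample `z` each time for the same input, it is pushed to reconstruct well over a neighbourhood of the latent space rather than at a single point.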

"},{"location":"optimization/data_augmentation.html#ml-powered-data-generation-for-fast-simulation","title":"ML-powered Data Generation for Fast Simulation","text":"

The following text is based on Chen et al., 2020

We rely on accurate simulation of physics processes. However, it is currently very common for LHC physics to be affected by large systematic uncertainties due to the limited amount of simulated data, especially for precise measurements of SM processes for which large datasets are already available. So far the most widely used simulator is GEANT4, which provides state-of-the-art accuracy. But running it is demanding, both in terms of time and resources. Consequently, delivering synthetic data at the pace at which the LHC delivers real data is one of the most challenging tasks for the computing infrastructures of the LHC experiments. The typical time it takes to simulate one single event is in the ballpark of 100 seconds.

Recently, generative algorithms based on deep learning have been proposed as a possible solution to speed up GEANT4. However, one needs to work beyond the collision-as-image paradigm so that the DL-based simulation accounts for the irregular geometry of a typical detector while delivering a dataset in a format compatible with downstream reconstruction software.

One method to solve this bottleneck was proposed by Chen et al., 2020. They adopt a generative DL model to convert an analysis specific representation of collision events at generator level to the corresponding representation at reconstruction level. Thus, this novel, fast-simulation workflow starts from a large amount of generator-level events to deliver large analysis-specific samples.

They trained a neural network to model detector resolution effects as a transfer function acting on an analysis-specific set of relevant features, computed at generator level. However, their model does not sample events from a latent space (like a GAN or a plain VAE). Instead, it works as a fast simulator of a given generator-level event, preserving the correspondence between the reconstructed and the generated event, which allows us to compare event-by-event residual distributions. Furthermore, this model is much simpler than a generative model.

Step one in this workflow is generating events in their full format, which is the most resource-heavy task: as noted before, generating one event takes roughly 100 seconds. With the proposed method, however, O(1000) events are generated per second. The method also saves on storage: the full format requires O(1) MB/event, whereas the DL model used only 8 MB to store 100000 events. The model was trained on an NVIDIA RTX2080 for 30 minutes, which is negligible in terms of overall production time. For generating N=1M events and n=10%N, one would save 90% of the CPU resources and 79% of the disk storage. Thus augmenting the centrally produced data is a viable method and could help the HEP community to face the computing challenges of the High-Luminosity LHC.

Another, more extreme approach investigated the use of GANs and VAEs for generating physics quantities which are relevant to a specific analysis. In this case, one learns the N-dimensional density function of the event, in a space defined by the quantities of interest for a given analysis. By sampling from this function, one can generate new data. There is a trade-off between the statistical uncertainty (which decreases with the increasing number of generated events) and the systematic uncertainty that could be induced by an inaccurate description of the N-dimensional pdf.

Qualitatively, no accuracy deterioration was observed due to scaling the dataset size for DL. This supports the robustness of the proposed methodology and its effectiveness for data augmentation.

"},{"location":"optimization/data_augmentation.html#open-challenges-in-data-augmentation","title":"Open challenges in Data Augmentation","text":"

Excerpts are taken from Li, 2020

The limitations of conventional data augmentation approaches reveal huge opportunities for research advances. Below we summarize a few challenges that motivate some of the works in the area of data augmentation.

  • From manual to automated search algorithms: As opposed to performing suboptimal manual search, how can we design learnable algorithms to find augmentation strategies that can outperform human-designed heuristics?
  • From practical to theoretical understanding: Despite the rapid progress of creating various augmentation approaches pragmatically, understanding their benefits remains a mystery because of a lack of analytic tools. How can we theoretically understand various data augmentations used in practice?
  • From coarse-grained to fine-grained model quality assurance: While most existing data augmentation approaches focus on improving the overall performance of a model, it is often imperative to have a finer-grained perspective on critical subpopulations of data. When a model exhibits inconsistent predictions on important subgroups of data, how can we exploit data augmentations to mitigate the performance gap in a prescribed way?
"},{"location":"optimization/data_augmentation.html#references","title":"References","text":"
  • Shorten & Khoshgoftaar, 2019, \"A survey on Image Data Augmentation for Deep Learning\"
  • Freer & Yang, 2020, \"Data augmentation for self-paced motor imagery classification with C-LSTM\"
  • Li, 2020, \"Automating Data Augmentation: Practice, Theory and New Direction\"
  • Rebuffi et al., 2021, \"Data Augmentation Can Improve Robustness\"
  • Erdmann et al., 2018, \"Lorentz Boost Networks: Autonomous Physics-Inspired Feature Engineering\"
  • Dolan & Ore, 2021, \"Meta-learning and data augmentation for mass-generalised jet taggers\"
  • Bradshaw et al., 2019, \"Mass agnostic jet taggers\"
  • Chang et al., 2018, \"What is the Machine Learning?\"
  • Oliveira et al. 2017, \"Jet-Images \u2013 Deep Learning Edition\"
  • Barnard et al., 2016, \"Parton Shower Uncertainties in Jet Substructure Analyses with Deep Neural Networks\"
  • Chen et al., 2020, \"Data augmentation at the LHC through analysis-specific fast simulation with deep learning\"
  • Musella & Pandolfi, 2018, \"Fast and accurate simulation of particle detectors using generative adversarial networks\"
  • Hashemi et al., 2019, \"LHC analysis-specific datasets with Generative Adversarial Networks\"
  • Kansal et al., 2022, \"Particle Cloud Generation with Message Passing Generative Adversarial Networks\"
  • Rehm et al., 2021, \"Reduced Precision Strategies for Deep Learning: A High Energy Physics Generative Adversarial Network Use Case\"
  • Choi & Lim, 2021, \"A Data-driven Event Generator for Hadron Colliders using Wasserstein Generative Adversarial Network\"
  • Kansal et al., 2020, \"Graph Generative Adversarial Networks for Sparse Data Generation in High Energy Physics\"
  • Otten et al., 2021, \"Event Generation and Statistical Sampling for Physics with Deep Generative Models and a Density Information Buffer\"
  • Yap et al., 2013, \"An Application of Oversampling, Undersampling, Bagging and Boosting in Handling Imbalanced Datasets\"
  • Au et al., 2010, \"Mining Rare Events Data by Sampling and Boosting: A Case Study\"
  • Chawla et al., 2002, \"SMOTE: Synthetic Minority Over-sampling Technique\"
  • He et al., 2010, \"ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning\"
  • Erdman et al., 2020, \"Precise simulation of electromagnetic calorimeter showers using a Wasserstein Generative Adversarial Network\"
  • Paganini et al., 2012, \"CaloGAN: Simulating 3D High Energy Particle Showers in Multi-Layer Electromagnetic Calorimeters with Generative Adversarial Networks\"
  • Paganini et al., 2017, \"Accelerating Science with Generative Adversarial Networks: An Application to 3D Particle Showers in Multi-Layer Calorimeters\"
  • Paganini et al., 2017, \"Learning Particle Physics by Example: Location-Aware Generative Adversarial Networks for Physics Synthesis\"
  • ATLAS Collaboration, 2018, \"Deep generative models for fast shower simulation in ATLAS\"
  • SHiP Collaboration, 2019, \"Fast simulation of muons produced at the SHiP experiment using Generative Adversarial Networks\"

Content may be edited and published elsewhere by the author.

Page author: Laurits Tani, 2022

"},{"location":"optimization/importance.html","title":"Feature Importance","text":"

Feature importance is the impact a specific input field has on a prediction model's output. In general, these impacts can range from no impact (i.e. a feature with no variance) to perfect correlation with the output. There are several reasons to consider feature importance:

  • Important features can be used to create simplified models, e.g. to mitigate overfitting.
  • Using only important features can reduce the latency and memory requirements of the model.
  • The relative importance of a set of features can yield insight into the nature of an otherwise opaque model (improved interpretability).
  • If a model is sensitive to noise, rejecting irrelevant inputs may improve its performance.

In the following subsections, we detail several strategies for evaluating feature importance. We begin with a general discussion of feature importance at a high level before offering a code-based tutorial on some common techniques. We conclude with additional notes and comments in the last section.

"},{"location":"optimization/importance.html#general-discussion","title":"General Discussion","text":"

Most feature importance methods fall into one of three broad categories: filter methods, embedded methods, and wrapper methods. Here we give a brief overview of each category with relevant examples:

"},{"location":"optimization/importance.html#filter-methods","title":"Filter Methods","text":"

Filter methods do not rely on a specific model, instead considering features in the context of a given dataset. In this way, they may be considered to be pre-processing steps. In many cases, the goal of feature filtering is to reduce high dimensional data. However, these methods are also applicable to data exploration, wherein an analyst simply seeks to learn about a dataset without actually removing any features. This knowledge may help interpret the performance of a downstream predictive model. Relevant examples include,

  • Domain Knowledge: Perhaps the most obvious strategy is to select features relevant to the domain of interest.

  • Variance Thresholding: One basic filtering strategy is to simply remove features with low variance. In the extreme case, features with zero variance do not vary from example to example, and will therefore have no impact on the model's final prediction. Likewise, features with variance below a given threshold may not affect a model's downstream performance.

  • Fisher Scoring: Fisher scoring can be used to rank features; the analyst would then select the highest scoring features as inputs to a subsequent model.

  • Correlations: Correlated features introduce a certain degree of redundancy to a dataset, so reducing the number of strongly correlated variables may not impact a model's downstream performance.
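
Two of the filters listed above, variance thresholding and correlation pruning, can be sketched with NumPy alone. The helper names, the greedy pruning rule and the 0.95 correlation cut are assumptions chosen for illustration, not a prescribed recipe.

```python
import numpy as np

def variance_filter(X, threshold=0.0):
    """Indices of features whose variance exceeds the threshold."""
    return np.flatnonzero(X.var(axis=0) > threshold)

def correlation_filter(X, max_abs_corr=0.95):
    """Greedily keep features not strongly correlated with an already kept one."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, i] <= max_abs_corr for i in kept):
            kept.append(j)
    return np.array(kept)

# Toy dataset (assumed): 4 features, one constant and one near-duplicate
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X = np.column_stack([
    a,                                      # feature 0
    np.full(500, 3.0),                      # feature 1: zero variance
    b,                                      # feature 2: independent of feature 0
    2.0 * a + 0.01 * rng.normal(size=500),  # feature 3: nearly a copy of feature 0
])

high_var = variance_filter(X)                            # drops the constant column
selected = high_var[correlation_filter(X[:, high_var])]  # drops the near-duplicate
```

Applying the variance filter first also avoids undefined correlations for constant columns.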

"},{"location":"optimization/importance.html#embedded-methods","title":"Embedded Methods","text":"

Embedded methods are specific to a prediction model and independent of the dataset. Examples:

  • L1 Regularization (LASSO): L1 regularization directly penalizes large model weights. In the context of linear regression, for example, this amounts to enforcing sparsity in the output prediction; weights corresponding to less relevant features will be driven to 0, nullifying the feature's effect on the output.
"},{"location":"optimization/importance.html#wrapper-methods","title":"Wrapper Methods","text":"

Wrapper methods iterate on prediction models in the context of a given dataset. In general they may be computationally expensive when compared to filter methods. Examples:

  • Permutation Importance: Direct interpretation isn't always feasible, so other methods have been developed to inspect a feature's importance. One common and broadly-applicable method is to randomly shuffle a given feature's input values and test the degradation of model performance. This process allows us to measure permutation importance as follows. First, fit a model (\\(f\\)) to training data, yielding \\(f(X_\\mathrm{train})\\), where \\(X_\\mathrm{train}\\in\\mathbb{R}^{n\\times d}\\) for \\(n\\) input examples with \\(d\\) features. Next, measure the model's performance on testing data for some loss \\(\\mathcal{L}\\), i.e. \\(s=\\mathcal{L}\\big(f(X_\\mathrm{test}), y_\\mathrm{test}\\big)\\). For each feature \\(j\\in[1\\ ..\\ d]\\), randomly shuffle the corresponding column in \\(X_\\mathrm{test}\\) to form \\(X_\\mathrm{test}^{(j)}\\). Repeat this process \\(K\\) times, so that for \\(k\\in [1\\ ..\\ K]\\) each random shuffling of feature column \\(j\\) gives a corrupted input dataset \\(X_\\mathrm{test}^{(j,k)}\\). Finally, define the permutation importance of feature \\(j\\) as the difference between the un-corrupted validation score and average validation score over the corrupted \\(X_\\mathrm{test}^{(j,k)}\\) datasets:
\\[\\texttt{PI}_j = s - \\frac{1}{K}\\sum_{k=1}^{K} \\mathcal{L}[f(X_\\mathrm{test}^{(j,k)}), y_\\mathrm{test}]\\]
  • Recursive Feature Elimination (RFE): Given a prediction model and test/train dataset splits with \\(D\\) initial features, RFE returns the set of \\(d < D\\) features that maximize model performance. First, the model is trained on the full set of features. The importance of each feature is ranked depending on the model type (e.g. for regression, the slopes are a sufficient ranking measure; permutation importance may also be used). The least important feature is rejected and the model is retrained. This process is repeated until the most significant \\(d\\) features remain.
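
The permutation importance recipe above can be written from scratch in a few lines. This is a minimal sketch under stated assumptions: the performance measure is taken to be the \\(R^2\\) score (higher is better, so important features give positive values), the model is an ordinary least-squares fit, and the toy data and the name `permutation_importance_sketch` are hypothetical.

```python
import numpy as np

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def permutation_importance_sketch(predict, X_test, y_test, n_repeats=10, seed=0):
    """PI_j = s - mean_k score(f(X_test^(j,k)), y_test) for every feature j."""
    rng = np.random.default_rng(seed)
    s = r2_score(y_test, predict(X_test))  # un-corrupted score
    importances = np.zeros(X_test.shape[1])
    for j in range(X_test.shape[1]):
        scores = []
        for _ in range(n_repeats):
            X_perm = X_test.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # shuffle column j only
            scores.append(r2_score(y_test, predict(X_perm)))
        importances[j] = s - np.mean(scores)
    return importances

# Toy regression problem: feature 0 matters most, feature 2 not at all
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=400)
X_train, X_test, y_train, y_test = X[:300], X[300:], y[:300], y[300:]

# Ordinary least squares with an intercept as the fitted model f
A = np.column_stack([X_train, np.ones(len(X_train))])
w, *_ = np.linalg.lstsq(A, y_train, rcond=None)

def predict(X_new):
    return X_new @ w[:-1] + w[-1]

pi = permutation_importance_sketch(predict, X_test, y_test)
```

The model is fitted once; only the test inputs are corrupted, which is what makes this method cheap compared to retraining-based approaches such as RFE.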
"},{"location":"optimization/importance.html#introduction-by-example","title":"Introduction by Example","text":""},{"location":"optimization/importance.html#direct-interpretation","title":"Direct Interpretation","text":"

Linear regression is particularly interpretable because the prediction coefficients themselves can be interpreted as a measure of feature importance. Here we will compare this direct interpretation to several model inspection techniques. In the following examples we use the Diabetes Dataset available as a Scikit-learn toy dataset. This dataset maps 10 biological markers to a 1-dimensional quantitative measure of diabetes progression:

from sklearn.datasets import load_diabetes\nfrom sklearn.model_selection import train_test_split\n\ndiabetes = load_diabetes()\nX_train, X_val, y_train, y_val = train_test_split(diabetes.data, diabetes.target, random_state=0)\nprint(X_train.shape)\n>>> (331, 10)\nprint(y_train.shape)\n>>> (331,)\nprint(X_val.shape)\n>>> (111, 10)\nprint(y_val.shape)\n>>> (111,)\nprint(diabetes.feature_names)\n>>> ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']\n
To begin, let's use Ridge Regression (L2-regularized linear regression) to model diabetes progression as a function of the input markers. The absolute value of a regression coefficient (slope) corresponding to a feature can be interpreted as the impact of that feature on the final fit:

import numpy as np\nfrom sklearn.linear_model import Ridge\n\n# Fit an L2-regularized linear model and rank features by |coefficient|\nmodel = Ridge(alpha=1e-2).fit(X_train, y_train)\nprint(f'Initial model score: {model.score(X_val, y_val):.3f}')\n\nfor i in np.argsort(-abs(model.coef_)):\n    print(f'{diabetes.feature_names[i]}: {abs(model.coef_[i]):.3f}')\n\n>>> Initial model score: 0.357\n>>> bmi: 592.253\n>>> s5: 580.078\n>>> bp: 297.258\n>>> s1: 252.425\n>>> sex: 203.436\n>>> s3: 145.196\n>>> s4: 97.033\n>>> age: 39.103\n>>> s6: 32.945\n>>> s2: 20.906\n
These results indicate that the bmi and s5 fields have the largest impact on the output of this regression model, while age, s6, and s2 have the smallest. Further interpretation is subject to the nature of the input data (see Common Pitfalls in the Interpretation of Coefficients of Linear Models). Note that scikit-learn has tools available to facilitate feature selection.

"},{"location":"optimization/importance.html#permutation-importance","title":"Permutation Importance","text":"

In the context of our ridge regression example, we can calculate the permutation importance of each feature as follows (based on scikit-learn docs):

from sklearn.inspection import permutation_importance\n\nmodel = Ridge(alpha=1e-2).fit(X_train, y_train)\nprint(f'Initial model score: {model.score(X_val, y_val):.3f}')\n\nr = permutation_importance(model, X_val, y_val, n_repeats=30, random_state=0)\nfor i in r.importances_mean.argsort()[::-1]:\n    print(f\"{diabetes.feature_names[i]:<8}\"\n          f\"{r.importances_mean[i]:.3f}\"\n          f\" +/- {r.importances_std[i]:.3f}\")\n\n>>> Initial model score: 0.357\n>>> s5      0.204 +/- 0.050\n>>> bmi     0.176 +/- 0.048\n>>> bp      0.088 +/- 0.033\n>>> sex     0.056 +/- 0.023\n>>> s1      0.042 +/- 0.031\n>>> s4      0.003 +/- 0.008\n>>> s6      0.003 +/- 0.003\n>>> s3      0.002 +/- 0.013\n>>> s2      0.002 +/- 0.003\n>>> age     -0.002 +/- 0.004\n
These results are roughly consistent with the direct interpretation of the linear regression parameters; s5 and bmi are the most permutation-important features. This is because both have significant permutation importance scores (0.204, 0.176) when compared to the initial model score (0.357), meaning their random permutations significantly degraded the model performance. On the other hand, s2 and age have approximately no permutation importance, meaning that the model's performance was robust to random permutations of these features.

"},{"location":"optimization/importance.html#l1-enforced-sparsity","title":"L1-Enforced Sparsity","text":"

In some applications it may be useful to reject features with low importance. Models biased towards sparsity are one way to achieve this goal, as they are designed to ignore a subset of features with the least impact on the model's output. In the context of linear regression, sparsity can be enforced by imposing L1 regularization on the regression coefficients (LASSO regression):

\\[\\mathcal{L}_\\mathrm{LASSO} = \\frac{1}{2n}||y - Xw||^2_2 + \\alpha||w||_1\\]

Depending on the strength of the regularization \\((\\alpha)\\), this loss function is biased to zero-out features of low importance. In our diabetes regression example,

import numpy as np\nfrom sklearn.linear_model import Lasso\n\n# Fit an L1-regularized linear model; unimportant coefficients are driven to exactly 0\nmodel = Lasso(alpha=1e-1).fit(X_train, y_train)\nprint(f'Model score: {model.score(X_val, y_val):.3f}')\n\nfor i in np.argsort(-abs(model.coef_)):\n    print(f'{diabetes.feature_names[i]}: {abs(model.coef_[i]):.3f}')\n\n>>> Model score: 0.355\n>>> bmi: 592.203\n>>> s5: 507.363\n>>> bp: 240.124\n>>> s3: 219.104\n>>> sex: 129.784\n>>> s2: 47.628\n>>> s1: 41.641\n>>> age: 0.000\n>>> s4: 0.000\n>>> s6: 0.000\n
For this value of \\(\\alpha\\), we see that the model has rejected the age, s4, and s6 features as unimportant (consistent with the permutation importance measures above) while achieving a similar model score as the previous ridge regression strategy.
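
To see where the exact zeros come from, the LASSO loss above can be minimized with a simple coordinate-descent loop built on the soft-thresholding operator. This is a from-scratch sketch under stated assumptions (no intercept, a fixed number of sweeps, toy data); it is not the scikit-learn implementation, and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, alpha):
    """Shrink z towards zero by alpha; values with |z| <= alpha become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize 1/(2n) * ||y - Xw||_2^2 + alpha * ||w||_1 one coordinate at a time."""
    n, d = X.shape
    w = np.zeros(d)
    col_norm = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            # residual with the contribution of feature j removed
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n
            w[j] = soft_threshold(rho, alpha) / col_norm[j]
    return w

# Toy data (assumed): only the first two of four features influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

w = lasso_coordinate_descent(X, y, alpha=0.1)
```

Features whose correlation with the residual stays below \\(\\alpha\\) are set to exactly zero by the soft-thresholding step, while the surviving coefficients are shrunk slightly towards zero.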

"},{"location":"optimization/importance.html#recursive-feature-elimination","title":"Recursive Feature Elimination","text":"

Another common strategy is recursive feature elimination (RFE). Though RFE can be used for regression applications as well, we turn our attention to a classification task for the sake of variety. The following discussions are based on the Breast Cancer Wisconsin Diagnostic Dataset, which maps 30 numeric features corresponding to digitized breast mass images to a binary classification of benign or malignant.

from sklearn.datasets import load_breast_cancer\nfrom sklearn.svm import SVC\nfrom sklearn.model_selection import StratifiedKFold\n\ndata = load_breast_cancer()\nX_train, X_val, y_train, y_val = train_test_split(data.data, data.target, random_state=0)\nprint(X_train.shape)\n>>> (426, 30)\nprint(y_train.shape)\n>>> (426,)\nprint(X_val.shape)\n>>> (143, 30)\nprint(y_val.shape)\n>>> (143,)\nprint(data.feature_names)\n>>> ['mean radius' 'mean texture' 'mean perimeter' 'mean area' 'mean smoothness' 'mean compactness' 'mean concavity' 'mean concave points' 'mean symmetry' 'mean fractal dimension' 'radius error' 'texture error' 'perimeter error' 'area error' 'smoothness error' 'compactness error' 'concavity error' 'concave points error' 'symmetry error' 'fractal dimension error' 'worst radius' 'worst texture' 'worst perimeter' 'worst area' 'worst smoothness' 'worst compactness' 'worst concavity' 'worst concave points' 'worst symmetry' 'worst fractal dimension']\n

Given a classifier and a classification task, recursive feature elimination (RFE, see original paper) is the process of identifying the subset of input features leading to the best-performing model. Here we employ a support vector machine classifier (SVM) with a linear kernel to perform binary classification on the input data. We ask for the top \\(j\\in[1\\ .. \\ d]\\) most important features in a for loop, computing the classification accuracy when only these features are leveraged.

import numpy as np\nfrom sklearn.feature_selection import RFE\n\nfeatures = np.array(data.feature_names)\nsvc = SVC(kernel='linear')\nfor n_features in np.arange(1, 30, 1):\n    rfe = RFE(estimator=svc, step=1, n_features_to_select=n_features)\n    rfe.fit(X_train, y_train)\n    print(f'n_features={n_features}, accuracy={rfe.score(X_val, y_val):.3f}')\n    print(f' - selected: {features[rfe.support_]}')\n\n>>> n_features=1, accuracy=0.881\n>>>  - selected: ['worst concave points']\n>>> n_features=2, accuracy=0.874\n>>>  - selected: ['worst concavity' 'worst concave points']\n>>> n_features=3, accuracy=0.867\n>>>  - selected: ['mean concave points' 'worst concavity' 'worst concave points']\n ...\n>>> n_features=16, accuracy=0.930\n>>> n_features=17, accuracy=0.965\n>>> n_features=18, accuracy=0.951\n...\n>>> n_features=27, accuracy=0.958\n>>> n_features=28, accuracy=0.958\n>>> n_features=29, accuracy=0.958\n
Here we've shown a subset of the output. In the first output lines, we see that the 'worst concave points' feature alone leads to 88.1% accuracy. Including the next two most important features actually degrades the classification accuracy. We then skip to the top 17 features, which in this case we observe to yield the best performance for the linear SVM classifier. The addition of more features does not lead to additional performance boosts. In this way, RFE can be treated as a model wrapper introducing an additional hyperparameter, n_features_to_select, which can be used to optimize model performance. A more principled optimization using k-fold cross validation with RFE is available in the scikit-learn docs.

"},{"location":"optimization/importance.html#feature-correlations","title":"Feature Correlations","text":"

In the above, we have focused specifically on interpreting the importance of single features. However, it may be that several features are correlated, sharing the responsibility for the overall prediction of the model. In this case, some measures of feature importance may inappropriately downweight correlated features in a so-called correlation bias (see Classification with Correlated Features: Unreliability of Feature Ranking and Solutions). For example, the permutation importance of \\(d\\) correlated features is shown to decrease (as a function of correlation strength) faster for higher \\(d\\) (see Correlation and Variable importance in Random Forests).

We can see these effects in action using the breast cancer dataset, following the corresponding scikit-learn example:

from sklearn.ensemble import RandomForestClassifier\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.datasets import load_breast_cancer\n\ndata = load_breast_cancer()\nX, y = data.data, data.target\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n\nclf = RandomForestClassifier(n_estimators=100, random_state=42)\nclf.fit(X_train, y_train)\nprint(\"Accuracy on test data: {:.2f}\".format(clf.score(X_test, y_test)))\n\n>>> Accuracy on test data: 0.97\n
Here we've implemented a random forest classifier and achieved a high accuracy (97%) on the benign vs. malignant predictions. The permutation importances for the 10 most important training features are:

from sklearn.inspection import permutation_importance\n\nr = permutation_importance(clf, X_train, y_train, n_repeats=10, random_state=42)\nfor i in r.importances_mean.argsort()[::-1][:10]:\n    print(f\"{data.feature_names[i]:<8}\"\n          f\"  {r.importances_mean[i]:.5f}\"\n          f\" +/- {r.importances_std[i]:.5f}\")\n\n>>> worst concave points  0.00681 +/- 0.00305\n>>> mean concave points  0.00329 +/- 0.00188\n>>> worst texture  0.00258 +/- 0.00070\n>>> radius error  0.00235 +/- 0.00000\n>>> mean texture  0.00188 +/- 0.00094\n>>> mean compactness  0.00188 +/- 0.00094\n>>> area error  0.00188 +/- 0.00094\n>>> worst concavity  0.00164 +/- 0.00108\n>>> mean radius  0.00141 +/- 0.00115\n>>> compactness error  0.00141 +/- 0.00115\n

In this case, even the most permutation important features have mean importance scores \\(<0.007\\), which doesn't indicate much importance. This is surprising, because we saw via RFE that a linear SVM can achieve \\(\\approx 88\\%\\) classification accuracy with the worst concave points feature alone. This indicates that worst concave points, in addition to other meaningful features, may belong to subclusters of correlated features. In the corresponding scikit-learn example, the authors show that subsets of correlated features can be extracted by calculating a dendrogram and selecting representative features from each correlated subset. They achieve \\(97\\%\\) accuracy (the same as with the full dataset) by selecting only five such representative variables.
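
The dilution of permutation importance among correlated features can be reproduced in a deliberately extreme toy setup: a noise-free linear target and an exact duplicate of the informative feature. Everything in this sketch (data, helper names, evaluating on the training data) is an assumption made for illustration. The minimum-norm least-squares fit splits the weight equally between the two copies.

```python
import numpy as np

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_linear(X, y):
    # lstsq returns the minimum-norm solution when columns are degenerate
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda X_new: X_new @ w

def perm_importance(predict, X, y, j, n_repeats=20, seed=0):
    """Drop in R^2 when column j is shuffled, averaged over n_repeats shuffles."""
    rng = np.random.default_rng(seed)
    base = r2_score(y, predict(X))
    scores = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        scores.append(r2_score(y, predict(X_perm)))
    return base - np.mean(scores)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
noise = rng.normal(size=1000)
y = 2.0 * x  # target depends on x only

X_single = np.column_stack([x, noise])  # x appears once
X_dup = np.column_stack([x, x, noise])  # x appears twice (perfectly correlated)

pi_single = perm_importance(fit_linear(X_single, y), X_single, y, j=0)
pi_dup = perm_importance(fit_linear(X_dup, y), X_dup, y, j=0)
```

In this setup the importance of `x` drops from roughly 2 to roughly 0.5 once a copy of it is available to the model, even though the information it carries is unchanged.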

"},{"location":"optimization/importance.html#feature-importance-in-decision-trees","title":"Feature Importance in Decision Trees","text":"

Here we focus on decision trees, which are particularly interpretable classifiers that often appear as ensembles (or boosted decision tree (BDT) algorithms) in HEP. Consider a classification dataset \\(X=\\{x_n\\}_{n=1}^{N}\\), \\(x_n\\in\\mathbb{R}^{D}\\), with truth labels \\(Y=\\{y_n\\}_{n=1}^N\\), \\(y_n\\in\\{1,...,C\\}\\) corresponding to \\(C\\) classes. These truth labels naturally partition \\(X\\) into subsets \\(X_c\\) with class probabilities \\(p(c)=|X_c|/|X|\\). Decision trees begin with a root node \\(t_0\\) containing all of \\(X\\). The tree is grown from the root by recursively splitting the input set \\(X\\) in a principled way; internal nodes (or branch nodes) correspond to a decision of the form

\\[\\begin{aligned} &(x_n)_d\\leq\\delta \\implies\\ \\text{sample}\\ n\\ \\text{goes to left child node}\\\\ &(x_n)_d>\\delta \\implies\\ \\text{sample}\\ n\\ \\text{goes to right child node} \\end{aligned}\\]

We emphasize that the decision boundary is drawn by considering a single feature field \\(d\\) and partitioning the \\(n^\\mathrm{th}\\) sample by the value at that feature field. Decision boundaries at each internal parent node \\(t_P\\) are formed by choosing a \"split criterion,\" which describes how to partition the set of elements at this node into left and right child nodes \\(t_L\\), \\(t_R\\) with \\(X_{t_L}\\subset X_{t_P}\\) and \\(X_{t_R}\\subset X_{t_P}\\), \\(X_{t_L}\\cup X_{t_R}=X_{t_P}\\). This partitioning is optimal if \\(X_{t_L}\\) and \\(X_{t_R}\\) are pure, each containing only members of the same class. Impurity measures are used to evaluate the degree to which the set of data points at a given tree node \\(t\\) are not pure. One common impurity measure is Gini Impurity,

\\[\\begin{aligned} I(t) = \\sum_{c=1}^C p(c|t)(1-p(c|t)) \\end{aligned}\\]

Here, \\(p(c|t)\\) is the probability of drawing a member of class \\(c\\) from the set of elements at node \\(t\\). For example, the Gini impurity at the root node (corresponding to the whole dataset) is

\\[\\begin{aligned} I(t_0) = \\sum_{c=1}^C \\frac{|X_c|}{|X|}(1-\\frac{|X_c|}{|X|}) \\end{aligned}\\]

In a balanced binary dataset, this would give \\(I(t_0)=1/2\\). If the set at node \\(t\\) is pure, i.e. class labels corresponding to \\(X_t\\) are identical, then \\(I(t)=0\\). We can use \\(I(t)\\) to produce an optimal splitting from parent \\(t_P\\) to children \\(t_L\\) and \\(t_R\\) by defining an impurity gain,

\\[\\begin{aligned} \\Delta I = I(t_P) - I(t_L) - I(t_R) \\end{aligned}\\]

This quantity describes the relative impurity between a parent node and its children. If \\(X_{t_P}\\) contains only two classes, an optimal splitting would separate them into \\(X_{t_L}\\) and \\(X_{t_R}\\), producing pure child nodes with \\(I(t_L)=I(t_R)=0\\) and, correspondingly, \\(\\Delta I(t_P) = I(t_P)\\). Accordingly, good splitting decisions should maximize impurity gain. Note that the impurity gain is often weighted; for example, Scikit-Learn defines:

\\[\\begin{aligned} \\Delta I(t_P) = \\frac{|X_{t_P}|}{|X|}\\bigg(I(t_P) - \\frac{|X_{t_L}|}{|X_{t_P}|} I(t_L) - \\frac{|X_{t_R}|}{|X_{t_P}|} I(t_R) \\bigg) \\end{aligned}\\]

In general, a pure node cannot be split further and must therefore be a leaf. Likewise, a node for which there is no splitting yielding \\(\\Delta I > 0\\) must be labeled a leaf. These splitting decisions are made recursively at each node in a tree until some stopping condition is met. Stopping conditions may include a maximum tree depth, a maximum number of leaf nodes, or a threshold on the impurity gain.
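
The definitions above can be made concrete with a minimal sketch in plain Python. It assumes a single feature and a toy four-sample node; the helper names (gini, weighted_gain, best_split) are hypothetical and are not part of scikit-learn. Candidate thresholds are taken as midpoints between sorted unique feature values, which is one common choice rather than the only one.

```python
from collections import Counter

def gini(labels):
    # I(t) = sum_c p(c|t) * (1 - p(c|t)); zero for a pure (or empty) node
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((k / n) * (1.0 - k / n) for k in Counter(labels).values())

def weighted_gain(parent, left, right, n_total):
    # Weighted impurity gain, in the form quoted above for Scikit-Learn
    n_p = len(parent)
    return (n_p / n_total) * (
        gini(parent)
        - len(left) / n_p * gini(left)
        - len(right) / n_p * gini(right)
    )

def best_split(values, labels):
    # Scan midpoints between sorted unique values of one feature and
    # return the (threshold, gain) pair with the largest weighted gain.
    pairs = list(zip(values, labels))
    uniq = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(uniq, uniq[1:]):
        delta = 0.5 * (lo + hi)
        left = [y for x, y in pairs if x <= delta]
        right = [y for x, y in pairs if x > delta]
        gain = weighted_gain(labels, left, right, len(labels))
        if gain > best[1]:
            best = (delta, gain)
    return best

# Toy root node: two balanced classes that separate cleanly at x = 2.5
x = [1.0, 2.0, 3.0, 4.0]
y = [0, 0, 1, 1]
root_impurity = gini(y)
threshold, gain = best_split(x, y)
print(root_impurity, threshold, gain)
```

For this balanced binary node the root impurity is 1/2, and the split at 2.5 produces two pure children, so the gain equals the parent impurity.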

Impurity gain gives us insight into the importance of a decision. In particular, larger \\(\\Delta I\\) indicates a more important decision. If some feature \\((x_n)_d\\) is the basis for several decision splits in a decision tree, the sum of impurity gains at these splits gives insight into the importance of this feature. Accordingly, one measure of the feature importance of \\(d\\) is the average (with respect to the total number of internal nodes) impurity gain imparted by decision split on \\(d\\). This method generalizes to the case of BDTs, in which case one would average this quantity across all weak learner trees in the ensemble.
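
As a rough illustration of this bookkeeping, the sketch below sums the impurity gains per feature over a hypothetical list of internal nodes and averages over the number of internal nodes. The split list and the function name are made up for illustration; the final normalization to unit sum is an added convenience and an assumption of this sketch.

```python
from collections import defaultdict

# Hypothetical internal nodes of a small tree:
# (index of the feature used for the split, weighted impurity gain)
splits = [(9, 0.30), (11, 0.20), (9, 0.05), (12, 0.10)]

def impurity_importance(splits, n_features):
    # Sum the gains of all splits made on each feature
    totals = defaultdict(float)
    for feature, gain in splits:
        totals[feature] += gain
    # Average with respect to the total number of internal nodes
    n_internal = len(splits)
    averaged = [totals[d] / n_internal for d in range(n_features)]
    # Normalize so that the importances sum to one (assumption of this sketch)
    norm = sum(averaged)
    normalized = [a / norm for a in averaged] if norm > 0 else averaged
    return averaged, normalized

averaged, normalized = impurity_importance(splits, n_features=13)
ranking = sorted(range(13), key=lambda d: normalized[d], reverse=True)
print(ranking[:3])
```

Features that never appear in a split receive zero importance, which is the pattern seen for most features of a shallow tree.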

Note that though decision trees are based on the feature \\(d\\) producing the best (maximum impurity gain) split at a given branch node, surrogate splits are often used to retain additional splits corresponding to features other than \\(d\\). Denote the feature maximizing the impurity gain \\(d_1\\) and producing a split boundary \\(\\delta_1\\). Surrogate splitting involves tracking secondary splits with boundaries \\(\\delta_2, \\delta_3,...\\) corresponding to \\(d_2,d_3,...\\) that have the highest correlation with the maximum impurity gain split. The upshot is that in the event that input data is missing a value at field \\(d_1\\), there are backup decision boundaries to use, mitigating the need to define multiple trees for similar data. Using this generalized notion of a decision tree, wherein each branch node contains a primary decision boundary maximizing impurity gain and several additional surrogate split boundaries, we can average the impurity gain produced at feature field \\(d\\) over all its occurrences as a decision split or a surrogate split. This definition of feature importance generalizes the previous one to include additional correlations.

"},{"location":"optimization/importance.html#example","title":"Example","text":"

Let us now turn to an example:

import numpy as np\nimport matplotlib.pyplot as plt\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.datasets import load_wine\nfrom sklearn.inspection import DecisionBoundaryDisplay\nfrom sklearn.metrics import log_loss\nfrom sklearn.model_selection import train_test_split\n\nwine_data = load_wine() \nprint(wine_data.data.shape)\nprint(wine_data.feature_names)\nprint(np.unique(wine_data.target))\n>>> (178, 13)\n>>> ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']\n>>> [0 1 2]\n

This sklearn wine dataset has 178 entries with 13 features and truth labels corresponding to membership in one of \\(C=3\\) classes. We can train a decision tree classifier as follows:

X, y = wine_data.data, wine_data.target\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)\nclassifier = DecisionTreeClassifier(criterion='gini', splitter='best', random_state=27)\nclassifier.fit(X_train, y_train)\nX_test_pred = classifier.predict(X_test)\nprint('Test Set Performance')\nprint('Number misclassified:', sum(X_test_pred!=y_test))\nprint(f'Accuracy: {classifier.score(X_test, y_test):.3f}')\n>>> Test Set Performance\n>>> Number misclassified: 0\n>>> Accuracy: 1.000\n

In this run, the classifier classifies every sample of the test set correctly (\\(100\\%\\) accuracy); since no random seed is passed to the train/test split, the exact numbers depend on the split. Let's take a look into how it makes predictions:

tree = classifier.tree_\nn_nodes = tree.node_count\nnode_features = tree.feature\nthresholds = tree.threshold\nchildren_L = tree.children_left\nchildren_R = tree.children_right\nfeature_names = np.array(wine_data.feature_names)\n\nprint(f'The tree has {n_nodes} nodes')\nfor n in range(n_nodes):\n    if children_L[n]==children_R[n]: continue # leaf node\n    print(f'Decision split at node {n}:',\n          f'{feature_names[node_features[n]]}({node_features[n]}) <=',\n          f'{thresholds[n]:.2f}')\n\n>>> The tree has 13 nodes\n>>> Decision split at node 0: color_intensity(9) <= 3.46\n>>> Decision split at node 2: od280/od315_of_diluted_wines(11) <= 2.48\n>>> Decision split at node 3: flavanoids(6) <= 1.40\n>>> Decision split at node 5: color_intensity(9) <= 7.18\n>>> Decision split at node 8: proline(12) <= 724.50\n>>> Decision split at node 9: malic_acid(1) <= 3.33\n

Here we see that several features are used to generate decision boundaries. For example, the dataset is split at the root node by a cut on the \\(\\texttt{color_intensity}\\) feature. The importance of each feature can be taken to be the average impurity gain it generates across all nodes, so we expect that one (or several) of the five unique features used at the decision splits will be the most important features by this definition. Indeed, we see,

feature_names = np.array(wine_data.feature_names)\nimportances = classifier.feature_importances_\nfor i in range(len(importances)):\n    print(f'{feature_names[i]}: {importances[i]:.3f}')\nprint('\\nMost important features', \n      feature_names[np.argsort(importances)[-3:]])\n\n>>> alcohol: 0.000\n>>> malic_acid: 0.021\n>>> ash: 0.000\n>>> alcalinity_of_ash: 0.000\n>>> magnesium: 0.000\n>>> total_phenols: 0.000\n>>> flavanoids: 0.028\n>>> nonflavanoid_phenols: 0.000\n>>> proanthocyanins: 0.000\n>>> color_intensity: 0.363\n>>> hue: 0.000\n>>> od280/od315_of_diluted_wines: 0.424\n>>> proline: 0.165\n\n>>> Most important features ['proline' 'color_intensity' 'od280/od315_of_diluted_wines']\n

This is an embedded method for generating feature importance - it's cooked right into the decision tree model. Let's verify these results using a wrapper method, permutation importance:

from sklearn.inspection import permutation_importance\n\nprint(f'Initial classifier score: {classifier.score(X_test, y_test):.3f}')\n\nr = permutation_importance(classifier, X_test, y_test, n_repeats=30, random_state=0)\nfor i in r.importances_mean.argsort()[::-1]:\n    print(f\"{feature_names[i]:<8}\"\n          f\" {r.importances_mean[i]:.3f}\"\n          f\" +/- {r.importances_std[i]:.3f}\")\n\n>>> Initial classifier score: 1.000\n\n>>> color_intensity 0.266 +/- 0.040\n>>> od280/od315_of_diluted_wines 0.237 +/- 0.049\n>>> proline  0.210 +/- 0.041\n>>> flavanoids 0.127 +/- 0.025\n>>> malic_acid 0.004 +/- 0.008\n>>> hue      0.000 +/- 0.000\n>>> proanthocyanins 0.000 +/- 0.000\n>>> nonflavanoid_phenols 0.000 +/- 0.000\n>>> total_phenols 0.000 +/- 0.000\n>>> magnesium 0.000 +/- 0.000\n>>> alcalinity_of_ash 0.000 +/- 0.000\n>>> ash      0.000 +/- 0.000\n>>> alcohol  0.000 +/- 0.000\n

The tree's performance is hurt the most if the \\(\\texttt{color_intensity}\\), \\(\\texttt{od280/od315_of_diluted_wines}\\), or \\(\\texttt{proline}\\) features are permuted, consistent with the impurity gain measure of feature importance.

"},{"location":"optimization/model_optimization.html","title":"Model optimization","text":"

This page summarizes the concepts shown in a contribution on Bayesian Optimization to the ML Forum and may be edited and published elsewhere by the author.

"},{"location":"optimization/model_optimization.html#what-we-talk-about-when-we-talk-about-model-optimization","title":"What we talk about when we talk about model optimization","text":"

Given some data \\(x\\) and a family of functionals parameterized by (a vector of) parameters \\(\\theta\\) (e.g. for DNN training weights), the problem of learning consists in finding \\(argmin_\\theta Loss(f_\\theta(x) - y_{true})\\). The treatment below focusses on gradient descent, but the formalization is completely general, i.e. it can be applied also to methods that are not explicitly formulated in terms of gradient descent (e.g. BDTs). The mathematical formalism for the problem of learning is briefly explained in a contribution on statistical learning to the ML forum: for the purposes of this documentation we will proceed through two illustrations.

The first illustration, elaborated from an image by the huawei forums, shows the general idea behind learning through gradient descent in a multidimensional parameter space, where the minimum of a loss function is found by following the negative gradient of the function until the minimum is reached.

The cartoon illustrates the general idea behind gradient descent to find the minimum of a function in a multidimensional parameter space (figure elaborated from an image by the huawei forums).
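
The idea in the cartoon can be sketched in a few lines of plain Python. The quadratic loss, the learning rate, and the number of steps are arbitrary choices made for illustration; in a real training the gradient would come from the model and the data.

```python
def gradient_descent(grad, theta0, learning_rate=0.1, n_steps=200):
    # Repeatedly step against the gradient of the loss
    theta = list(theta0)
    for _ in range(n_steps):
        g = grad(theta)
        theta = [t - learning_rate * gi for t, gi in zip(theta, g)]
    return theta

# Toy loss with its minimum at (3, -1):
# L(theta) = (theta_0 - 3)^2 + 2 * (theta_1 + 1)^2
def loss(theta):
    return (theta[0] - 3.0) ** 2 + 2.0 * (theta[1] + 1.0) ** 2

def grad_loss(theta):
    return [2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)]

theta_min = gradient_descent(grad_loss, theta0=[0.0, 0.0])
print(theta_min, loss(theta_min))
```

With a learning rate that is too large the same loop diverges, which is one reason the learning rate is itself usually treated as a hyperparameter.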

The model to be optimized via a loss function typically is a parametric function, where the set of parameters (e.g. the network weights in neural networks) corresponds to a certain fixed structure of the network. For example, a fully connected network with two inputs, two inner layers of two neurons, and one output neuron will have ten weights (fifteen parameters once the five biases are included) whose values will be changed until the loss function reaches its minimum.

When we talk about model optimization we refer to the fact that often we are interested in finding which model structure is the best to describe our data. The main concern is to design a model that has sufficient complexity to store all the information contained in the training data. We can therefore think of parameterizing the network structure itself, e.g. in terms of the number of inner layers and number of neurons per layer: these hyperparameters define a space where we want to again minimize a loss function. Formally, the parametric function \\(f_\\theta\\) is also a function of these hyperparameters \\(\\lambda\\): \\(f_{(\\theta, \\lambda)}\\), and the \\(\\lambda\\) can be optimized.

The second illustration, also elaborated from an image by the huawei forums, broadly illustrates this concept: for each point in the hyperparameters space (that is, for each configuration of the model), the individual model is optimized as usual. The global minimum over the hyperparameters space is then sought.

The cartoon illustrates the general idea behind gradient descent to optimize the model complexity (in terms of the choice of hyperparameters) in a multidimensional parameter and hyperparameter space (figure elaborated from an image by the huawei forums)."},{"location":"optimization/model_optimization.html#caveat-which-data-should-you-use-to-optimize-your-model","title":"Caveat: which data should you use to optimize your model","text":"

In typical machine learning studies, you should divide your dataset into three parts. One is used for training the model (training sample), one is used for testing the performance of the model (test sample), and the third one is the one where you actually use your trained model, e.g. for inference (application sample). Sometimes you may get away with using test data as application data: Helge Voss (Chap 5 of Behnke et al.) states that this is acceptable under three conditions that must be simultaneously valid:

  • no hyperparameter optimization is performed;
  • no overtraining is found;
  • the number of training data is high enough to make statistical fluctuations negligible.

If you are doing any kind of hyperparameter optimization, thou shalt NOT use the test sample as application sample. You should have at least three distinct sets, and ideally you should use four (training, testing, hyperparameter optimization, application).
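
A minimal sketch of such a four-way split, using only the standard library, could look as follows. The fractions and the part names are illustrative assumptions, not a recommendation; the function returns index lists so that it can be applied to any array-like dataset.

```python
import random

def four_way_split(n_samples, fractions=(0.5, 0.2, 0.15, 0.15), seed=0):
    # Illustrative fractions for: training, testing,
    # hyperparameter optimization, application
    assert abs(sum(fractions) - 1.0) < 1e-9
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    names = ('train', 'test', 'hyperopt', 'application')
    parts, start = {}, 0
    for i, (name, frac) in enumerate(zip(names, fractions)):
        # The last part takes the remainder so that no sample is dropped
        if i == len(names) - 1:
            stop = n_samples
        else:
            stop = start + int(round(frac * n_samples))
        parts[name] = indices[start:stop]
        start = stop
    return parts

parts = four_way_split(1000)
print({name: len(idx) for name, idx in parts.items()})
```

Because the indices are shuffled once and then sliced, the four parts are disjoint by construction.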

"},{"location":"optimization/model_optimization.html#grid-search","title":"Grid Search","text":"

The simplest hyperparameter optimization algorithm is the grid search, where you train all the models in the hyperparameters space to build the full landscape of the global loss function, as illustrated in Goodfellow, Bengio, Courville: \"Deep Learning\".

The cartoon illustrates the general idea behind grid search (image taken from Goodfellow, Bengio, Courville: \"Deep Learning\").

To perform a meaningful grid search, you have to provide a set of values within the acceptable range of each hyperparameter, then for each point in the cross-product space you have to train the corresponding model.

The main issue with grid search is that when there are nonimportant hyperparameters (i.e. hyperparameters whose value doesn't influence much the model performance) the algorithm spends an exponentially large time (in the number of nonimportant hyperparameters) in the noninteresting configurations: having \\(m\\) parameters and testing \\(n\\) values for each of them leads to \\(\\mathcal{O}(n^m)\\) tested configurations. While the issue may be mitigated by parallelization, when the number of hyperparameters (the dimension of hyperparameters space) surpasses a handful, even parallelization can't help.

Another issue is that the search is binned: depending on the granularity in the scan, the global minimum may be invisible.

Despite these issues, grid search is sometimes still a feasible choice, and gives its best when done iteratively. For example, if you start from the set of values \\(\\{-1, 0, 1\\}\\):

  • if the best parameter is found to be at the boundary (1), then extend the range (\\(\\{1, 2, 3\\}\\)) and do the search in the new range;
  • if the best parameter is e.g. at 0, then maybe zoom in and do a search in the range \\(\\{-0.1, 0, 0.1\\}\\).
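
The cross-product scan and the iterative refinement described above can be sketched as follows. The function toy_loss is a hypothetical stand-in for training a model and returning its validation loss, and the hyperparameter names a and b are placeholders.

```python
import itertools

def grid_search(objective, grids):
    # grids maps each hyperparameter name to a list of values;
    # every point of the cross product is evaluated
    names = list(grids)
    best_point, best_value, n_evaluated = None, float('inf'), 0
    for combo in itertools.product(*(grids[n] for n in names)):
        point = dict(zip(names, combo))
        value = objective(point)
        n_evaluated += 1
        if value < best_value:
            best_point, best_value = point, value
    return best_point, best_value, n_evaluated

# Hypothetical stand-in for: train the model, return the validation loss
def toy_loss(p):
    return (p['a'] - 1.7) ** 2 + (p['b'] + 0.2) ** 2

def refine(values, best):
    # Extend the range if the best value sits on a boundary, else zoom in
    step = values[1] - values[0]
    if best == values[-1]:
        return [best, best + step, best + 2 * step]
    if best == values[0]:
        return [best - 2 * step, best - step, best]
    return [best - step / 10, best, best + step / 10]

grids = {'a': [-1.0, 0.0, 1.0], 'b': [-1.0, 0.0, 1.0]}
first_point, first_value, n_evaluated = grid_search(toy_loss, grids)
grids = {name: refine(grids[name], first_point[name]) for name in grids}
second_point, second_value, _ = grid_search(toy_loss, grids)
print(n_evaluated, first_point, second_point)
```

In this toy case the first pass finds a on the boundary and b in the interior, so the second pass extends the range for a and zooms in for b. The number of evaluations per pass is the product of the grid sizes, which is where the exponential cost comes from.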
"},{"location":"optimization/model_optimization.html#random-search","title":"Random search","text":"

An improvement of the grid search is the random search, which proceeds like this:

  • you provide a marginal p.d.f. for each hyperparameter;
  • you sample from the joint p.d.f. a certain number of training configurations;
  • you train for each of these configurations to build the loss function landscape.

This procedure has significant advantages over a simple grid search: random search is not binned, because you are sampling from a continuous p.d.f., so the pool of explorable hyperparameter values is larger; random search is exponentially more efficient, because it tests a unique value for each influential hyperparameter on nearly every trial.

Random search also works best when done iteratively. The differences between grid and random search are again illustrated in Goodfellow, Bengio, Courville: \"Deep Learning\".
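
A minimal sketch of random search is given below, under the assumption that the joint p.d.f. factorizes into independent marginals. The hyperparameter names, their ranges, and toy_loss (a stand-in for a real training) are all hypothetical.

```python
import math
import random

def random_search(objective, samplers, n_trials, seed=0):
    # samplers maps each hyperparameter name to a function drawing
    # one value from its marginal p.d.f.
    rng = random.Random(seed)
    best_point, best_value = None, float('inf')
    for _ in range(n_trials):
        point = {name: draw(rng) for name, draw in samplers.items()}
        value = objective(point)
        if value < best_value:
            best_point, best_value = point, value
    return best_point, best_value

samplers = {
    # log-uniform between 1e-4 and 1e-1
    'learning_rate': lambda rng: 10 ** rng.uniform(-4, -1),
    # uniform over the integers from 8 to 128
    'n_neurons': lambda rng: rng.randint(8, 128),
}

# Hypothetical stand-in for: train the model, return the validation loss
def toy_loss(p):
    return ((math.log10(p['learning_rate']) + 2.5) ** 2
            + ((p['n_neurons'] - 64) / 64) ** 2)

best_point, best_value = random_search(toy_loss, samplers, n_trials=50)
print(best_point, best_value)
```

Every trial draws a fresh value for every hyperparameter, so 50 trials probe 50 distinct learning rates, whereas a grid with the same budget would probe only a handful.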

The cartoon illustrates the general idea behind random search, as opposed to grid search (image taken from Goodfellow, Bengio, Courville: \"Deep Learning\")."},{"location":"optimization/model_optimization.html#model-based-optimization-by-gradient-descent","title":"Model-based optimization by gradient descent","text":"

Now that we have looked at the most basic model optimization techniques, we are ready to look into using gradient descent to solve a model optimization problem. We will proceed by recasting the problem as one of model selection, where the hyperparameters are the input (decision) variables, and the model selection criterion is a differentiable validation set error. The validation set error attempts to describe the complexity of the network by a single hyperparameter (details in [a contribution on statistical learning to the ML forum]). The problem may be solved with standard gradient descent, as illustrated above, if we assume that the training criterion \\(C\\) is continuous and differentiable with respect to both the parameters \\(\\theta\\) (e.g. weights) and hyperparameters \\(\\lambda\\). Unfortunately, the gradient is seldom available (either because it has a prohibitive computational cost, or because the criterion is non-differentiable, as is the case when there are discrete variables).

A diagram illustrating the way gradient-based model optimization works has been prepared by Bengio, doi:10.1162/089976600300015187.

The diagram illustrates the way model optimization can be recast as a model selection problem, where a model selection criterion involves a differentiable validation set error (image taken from Bengio, doi:10.1162/089976600300015187)."},{"location":"optimization/model_optimization.html#model-based-optimization-by-surrogates","title":"Model-based optimization by surrogates","text":"

Sequential Model-based Global Optimization (SMBO) consists in replacing the loss function with a surrogate model of it, when the loss function (i.e. the validation set error) is not available. The surrogate is typically built as a Bayesian regression model, where one estimates the expected value of the validation set error for each hyperparameter together with the uncertainty in this expectation. The pseudocode for the SMBO algorithm is illustrated by Bergstra et al.

The diagram illustrates the pseudocode for the Sequential Model-based Global Optimization (image taken from Bergstra et al).

This procedure results in a tradeoff between exploration, i.e. proposing hyperparameters with high uncertainty, which may or may not result in substantial improvement, and exploitation, i.e. proposing hyperparameters that will likely perform as well as the current proposal (usually this means close to the current ones). The disadvantage is that the whole procedure must run until completion before giving any usable information as output. By comparison, manual or random searches tend to give hints on the location of the minimum faster.

"},{"location":"optimization/model_optimization.html#bayesian-optimization","title":"Bayesian Optimization","text":"

We are now ready to tackle in full what is referred to as Bayesian optimization.

Bayesian optimization assumes that the unknown function \\(f(\\theta, \\lambda)\\) was sampled from a Gaussian process (GP), and that after the observations it maintains the corresponding posterior. In this context, observations are the various validation set errors for different values of the hyperparameters \\(\\lambda\\). In order to pick the next value to probe, one maximizes some estimate of the expected improvement (see below). To understand the meaning of \"sampled from a Gaussian process\", we need to define what a Gaussian process is.

"},{"location":"optimization/model_optimization.html#gaussian-processes","title":"Gaussian processes","text":"

Gaussian processes (GPs) generalize the concept of Gaussian distribution over discrete random variables to the concept of Gaussian distribution over continuous functions. Given some data and an estimate of the Gaussian noise, by fitting a function one can estimate also the noise at the interpolated points. This estimate is made by similarity with contiguous points, adjusted by the distance between points. A GP is therefore fully described by its mean and its covariance function. An illustration of Gaussian processes is given in Kevin Jamieson's CSE599 lecture notes.

The diagram illustrates the evolution of a Gaussian process, when adding interpolating points (image taken from Kevin Jamieson's CSE599 lecture notes).

GPs are great for Bayesian optimization because they out-of-the-box provide the expected value (i.e. the mean of the process) and its uncertainty (covariance function).
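
To make the mean-plus-uncertainty statement concrete, here is a minimal numpy sketch of GP regression in one dimension. It assumes a zero-mean prior, a squared exponential kernel with unit length scale and variance, and a small noise term; the function names are hypothetical and this is not the scikit-learn implementation.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    # Squared exponential covariance between two sets of 1D points
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    # Zero-mean GP prior; returns posterior mean and standard deviation
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query)
    k_qq = rbf_kernel(x_query, x_query)
    solved = np.linalg.solve(k_tt, k_tq)  # inverse of k_tt applied to k_tq
    mean = solved.T @ y_train
    cov = k_qq - k_tq.T @ solved
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, std

x_train = np.array([-2.0, 0.0, 1.5])
y_train = np.sin(x_train)
# One observed point, one point between observations, one far away
x_query = np.array([0.0, 0.75, 5.0])
mean, std = gp_posterior(x_train, y_train, x_query)
print(mean, std)
```

The uncertainty is close to zero at an observed point, grows between observations, and approaches the prior standard deviation far from any data, which is the behaviour exploited when deciding where to sample next.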

"},{"location":"optimization/model_optimization.html#the-basic-idea-behind-bayesian-optimization","title":"The basic idea behind Bayesian optimization","text":"

Gradient descent methods are intrinsically local: the decision on the next step is taken based on the local gradient and Hessian approximations. Bayesian optimization (BO) with GP priors instead uses all the information from the previous steps, encoding it in a model that gives the expectation and its uncertainty. The consequence is that GP-based BO can find the minimum of difficult nonconvex functions in relatively few evaluations, at the cost of performing more computations to find the next point to try in the hyperparameters space.

The BO prior is a prior over the space of the functions. GPs are especially suited to play the role of BO prior, because marginals and conditionals can be computed in closed form (thanks to the properties of the Gaussian distribution).

There are several methods to choose the acquisition function (the function that selects the next step for the algorithm), but there is no omnipurpose recipe: the best approach is problem-dependent. The acquisition function involves an accessory optimization to maximize a certain quantity; typical choices are:

  • maximize the probability of improvement over the current best value: can be calculated analytically for a GP;
  • maximize the expected improvement over the current best value: can also be calculated analytically for a GP;
  • maximize the GP Upper confidence bound: minimize \"regret\" over the course of the optimization.
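
The three choices listed above have simple closed forms when the surrogate prediction at a point is Gaussian with mean mu and standard deviation sigma. The sketch below uses the minimization convention (an improvement is a value below the current best), so the confidence bound appears as a lower bound; the candidate values and the parameter kappa are illustrative assumptions.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_of_improvement(mu, sigma, best):
    # Probability that the value at this point is below the current best
    if sigma == 0.0:
        return 1.0 if mu < best else 0.0
    return normal_cdf((best - mu) / sigma)

def expected_improvement(mu, sigma, best):
    # Expectation of max(best - f, 0) for f ~ Normal(mu, sigma)
    if sigma == 0.0:
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    return (best - mu) * normal_cdf(z) + sigma * normal_pdf(z)

def lower_confidence_bound(mu, sigma, kappa=2.0):
    # Minimization counterpart of the upper confidence bound
    return mu - kappa * sigma

# Hypothetical candidates: name -> (surrogate mean, surrogate std)
candidates = {'A': (0.50, 0.01), 'B': (0.52, 0.20), 'C': (0.70, 0.05)}
best_so_far = 0.50
ei = {name: expected_improvement(mu, s, best_so_far)
      for name, (mu, s) in candidates.items()}
next_point = max(ei, key=ei.get)
print(ei, next_point)
```

Candidate B has a slightly worse mean than A but a much larger uncertainty, so its expected improvement is larger: this is the exploration side of the tradeoff.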
"},{"location":"optimization/model_optimization.html#historical-note","title":"Historical note","text":"

Gaussian process regression is also called kriging in geostatistics, after Daniel G. Krige (1951), who pioneered the concept later formalized by Matheron (1962).

"},{"location":"optimization/model_optimization.html#bayesian-optimization-in-practice","title":"Bayesian optimization in practice","text":"

The figure below, taken from a tutorial on BO by Martin Krasser, clarifies the procedure rather well. The task is to approximate the target function (labelled noise free objective in the figure), given some noisy samples of it (the black crosses). At the first iteration, one starts from a flat surrogate function, with a given uncertainty, and fits it to the noisy samples. To choose the next sampling location, a certain acquisition function is computed, and the value that maximizes it is chosen as the next sampling location. At each iteration, more noisy samples are added, until the distance between consecutive sampling locations is minimized (or, equivalently, a measure of the value of the best selected sample is maximized).

Practical illustration of Bayesian Optimization (images taken from a tutorial on BO by Martin Krasser)."},{"location":"optimization/model_optimization.html#limitations-and-some-workaround-of-bayesian-optimization","title":"Limitations (and some workaround) of Bayesian Optimization","text":"

There are three main limitations to the BO approach. A good overview of these limitations and of possible solutions can be found in arXiv:1206.2944.

First of all, it is unclear what is an appropriate choice for the covariance function and its associated hyperparameters. In particular, the standard squared exponential kernel is often too smooth. As a workaround, alternative kernels may be used: a common choice is the Mat\u00e9rn 5/2 kernel, which is similar to the squared exponential one but allows for non-smoothness.

Another issue is that, for certain problems, the function evaluation may take very long to compute. To overcome this, often one can replace the function evaluation with the Monte Carlo integration of the expected improvement over the GP hyperparameters, which is faster.

The third main issue is that for complex problems one would ideally like to take advantage of parallel computation. The procedure is iterative, however, and it is not easy to come up with a scheme to make it parallelizable. The referenced paper proposed sampling over the expected acquisition, conditioned on all the pending evaluations: this is computationally cheap and is intrinsically parallelizable.

"},{"location":"optimization/model_optimization.html#alternatives-to-gaussian-processes-tree-based-models","title":"Alternatives to Gaussian processes: Tree-based models","text":"

Gaussian Processes model directly \\(P(hyperpar | data)\\) but are not the only suitable surrogate models for Bayesian optimization.

The so-called Tree-structured Parzen Estimator (TPE), described in Bergstra et al, models separately \\(P(data | hyperpar)\\) and \\(P(hyperpar)\\), to then obtain the posterior by explicit application of the Bayes theorem. TPEs exploit the fact that the choice of hyperparameters is intrinsically graph-structured, in the sense that e.g. you first choose the number of layers, then choose neurons per layer, etc. TPEs run over this generative process by replacing the hyperparameters priors with nonparametric densities. These generative nonparametric densities are built by classifying the hyperparameter configurations into those that result in worse/better loss than the current proposal.

TPEs have been used in CMS already around 2017 in a VHbb analysis (see repository by Sean-Jiun Wang) and in a charged Higgs to tb search (HIG-18-004, doi:10.1007/JHEP01(2020)096).

"},{"location":"optimization/model_optimization.html#implementations-of-bayesian-optimization","title":"Implementations of Bayesian Optimization","text":"
  • Implementations in R are readily available as the R-studio tuning package;
  • Scikit-learn provides a handy implementation of Gaussian processes;
  • scipy provides a handy implementation of the optimization routines;
  • hyperopt provides a handy implementation of distributed hyperparameter optimization routines;
    • GPs not coded by default, hence must rely on scikit-learn;
    • Parzen tree estimators are implemented by default (together with random search);
  • Several handy tutorials online focussed on hyperparameters optimization
    • Tutorial by Martin Krasser;
    • Tutorial by Jason Brownlee;
  • Early example of hyperopt in CMS
    • VHbb analysis (repository by Sean-Jiun Wang), for optimization of a BDT;
    • Charged Higgs (HIG-18-004, doi:10.1007/JHEP01(2020)096), for optimization of a DNN (no public link for the code, contact me if needed)
  • Several expansions and improvements (particularly targeted at HPC clusters) are available, see e.g. this talk by Eric Wulff.
"},{"location":"optimization/model_optimization.html#caveats-dont-get-too-obsessed-with-model-optimization","title":"Caveats: don't get too obsessed with model optimization","text":"

In general, optimizing model structure is a good thing. F. Chollet e.g. says \"If you want to get to the very limit of what can be achieved on a given task, you can't be content with arbitrary choices made by a fallible human\". On the other hand, for many problems hyperparameter optimization results in only small improvements, and there is a tradeoff between improvement and time spent on the task: sometimes the time spent on optimization may not be worth it, e.g. when the landscape of the loss in hyperparameters space is very flat (i.e. different hyperparameter sets give more or less the same results), particularly if you already know that small improvements will be eaten up by e.g. systematic uncertainties. Then again, before you perform the optimization you don't know if the landscape is flat or if you can expect substantial improvements. Sometimes broad grid or random searches may give you a hint on whether the landscape of hyperparameters space is flat or not.

Sometimes you may get good (and faster) improvements by model ensembling rather than by model optimization. To do model ensembling, you first train a handful of models (either different methods, such as BDT, SVM, NN, etc., or different hyperparameters sets): \\(pred\\_a = model\\_a.predict(x)\\), ..., \\(pred\\_d = model\\_d.predict(x)\\). You then pool the predictions: \\(pooled\\_pred = (pred\\_a + pred\\_b + pred\\_c + pred\\_d)/4\\). This works if all models are kind of good: if one is significantly worse than the others, then \\(pooled\\_pred\\) may not be as good as the best model of the pool.

You can also find ways of ensembling in a smarter way, e.g. by doing weighted rather than simple averages: \\(pooled\\_pred = 0.5\\cdot pred\\_a + 0.25\\cdot pred\\_b + 0.1\\cdot pred\\_c + 0.15\\cdot pred\\_d\\) (the weights sum to one, so no further normalization is needed). Here the idea is to give more weight to better classifiers. However, you transfer the problem to having to choose the weights. These can be found empirically by using random search or other algorithms like Nelder-Mead (result = scipy.optimize.minimize(objective, pt, method='nelder-mead')), where you build simplexes (polytopes with N+1 vertices in N dimensions, a generalization of the triangle) and move them towards better values of the objective. Nelder-Mead can converge to nonstationary points, but there are extensions of the algorithm that may help.
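
Simple and weighted pooling can be sketched in plain Python as below. The prediction lists are made-up toy scores, and the weights are found by a basic random search over normalized weight vectors rather than by Nelder-Mead; in a real study the weights should be tuned on a sample that is distinct from the one used for the final evaluation.

```python
import random

def pool(predictions, weights=None):
    # predictions: one list of scores per model;
    # without weights this is the simple average
    n_models = len(predictions)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(len(predictions[0]))
    ]

def mse(pred, truth):
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

def random_weight_search(predictions, truth, n_trials=200, seed=0):
    # Start from the simple average and keep any weight vector that does better
    rng = random.Random(seed)
    best_w = [1.0 / len(predictions)] * len(predictions)
    best_loss = mse(pool(predictions, best_w), truth)
    for _ in range(n_trials):
        raw = [rng.random() for _ in predictions]
        w = [r / sum(raw) for r in raw]
        loss = mse(pool(predictions, w), truth)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss

# Toy scores (made up): the third model is clearly weaker than the others
truth = [1, 0, 1, 1, 0]
pred_a = [0.9, 0.2, 0.8, 0.7, 0.1]
pred_b = [0.7, 0.4, 0.6, 0.9, 0.3]
pred_c = [0.6, 0.5, 0.4, 0.5, 0.6]
preds = [pred_a, pred_b, pred_c]

simple = pool(preds)
weights, weighted_loss = random_weight_search(preds, truth)
print(mse(simple, truth), weights, weighted_loss)
```

Because the search starts from the simple average, the weighted pool can never do worse than it on the sample used for tuning; whether the gain survives on independent data is a separate question.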

This page summarizes the concepts shown in a contribution on Bayesian Optimization to the ML Forum. Content may be edited and published elsewhere by the author. Page author: Pietro Vischia, 2022

"},{"location":"resources/cloud_resources/index.html","title":"Cloud Resources","text":"

Work in progress.

"},{"location":"resources/dataset_resources/index.html","title":"CMS-ML Dataset Tab","text":""},{"location":"resources/dataset_resources/index.html#introduction","title":"Introduction","text":"

Welcome to the CMS-ML Dataset tab! This tab is designed to provide accurate, up-to-date, and relevant data for various purposes. We strive to make it a useful resource for your analysis and decision-making needs. We are working on benchmarking more datasets and presenting them in a user-friendly format. This tab will be continuously updated to reflect the latest developments. Explore, analyze, and derive insights with ease!

"},{"location":"resources/dataset_resources/index.html#1-jetnet","title":"1. JetNet","text":""},{"location":"resources/dataset_resources/index.html#links","title":"Links","text":"

Github Repository

Zenodo

"},{"location":"resources/dataset_resources/index.html#description","title":"Description","text":"

JetNet is a project aimed at enhancing accessibility and reproducibility in jet-based machine learning. It offers easy-to-access and standardized interfaces for several datasets, including JetNet, TopTagging, and QuarkGluon. Additionally, JetNet provides standard implementations of various generative evaluation metrics such as Fr\u00e9chet Physics Distance (FPD), Kernel Physics Distance (KPD), Wasserstein-1 (W1), Fr\u00e9chet ParticleNet Distance (FPND), coverage, and Minimum Matching Distance (MMD). Beyond these, it includes a differentiable implementation of the energy mover's distance and other general jet utilities, making it a comprehensive resource for researchers and practitioners in the field.

"},{"location":"resources/dataset_resources/index.html#nature-of-objects","title":"Nature of Objects","text":"
  • Objects: Gluon (g), Top Quark (t), Light Quark (q), W boson (w), and Z boson (z) jets of ~1 TeV transverse momentum (\\(p_T\\))
  • Number of Objects: N = 177252, 177945, 170679, 177172, 176952 for g, t, q, w, z jets respectively.
"},{"location":"resources/dataset_resources/index.html#format-of-dataset","title":"Format of Dataset","text":"
  • File Type: HDF5
  • Structure: Each file has particle_features and jet_features arrays, containing the list of particles' features per jet and the corresponding jet's features, respectively. particle_features is of shape [N, 30, 4], where N is the total number of jets, 30 is the maximum number of particles per jet, and 4 is the number of particle features, in order: [\\(\\eta\\), \\(\\varphi\\), \\(p_T\\), mask]. See Zenodo for definitions of these. jet_features is of shape [N, 4], where 4 is the number of jet features, in order: [\\(p_T\\), \\(\\eta\\), mass, # of particles].
"},{"location":"resources/dataset_resources/index.html#related-projects","title":"Related Projects","text":"
  • Top tagging benchmark
  • Particle Cloud Generation with Message Passing Generative Adversarial Networks
"},{"location":"resources/dataset_resources/index.html#2-top-tagging-benchmark-dataset","title":"2. Top Tagging Benchmark Dataset","text":""},{"location":"resources/dataset_resources/index.html#links_1","title":"Links","text":"

Zenodo

"},{"location":"resources/dataset_resources/index.html#description_1","title":"Description","text":"

A set of MC simulated training/testing events for the evaluation of top quark tagging architectures.
  • 14 TeV, hadronic tops for signal, QCD dijets background, Delphes ATLAS detector card with Pythia8
  • No MPI/pile-up included
  • Clustering of particle-flow entries (produced by Delphes E-flow) into anti-kT 0.8 jets in the pT range [550,650] GeV
  • All top jets are matched to a parton-level top within \u2206R = 0.8, and to all top decay partons within 0.8
  • Jets are required to have |eta| < 2
  • The leading 200 jet constituent four-momenta are stored, with zero-padding for jets with fewer than 200
  • Constituents are sorted by pT, with the highest pT one first
  • The truth top four-momentum is stored as truth_px etc.
  • A flag (1 for top, 0 for QCD) is kept for each jet. It is called is_signal_new
  • The variable \"ttv\" (= test/train/validation) is kept for each jet. It indicates to which dataset the jet belongs. It is redundant as the different sets are already distributed as different files.
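
The constituent ordering and zero-padding described above can be illustrated with a short sketch. The helper name pad_constituents and the (E, px, py, pz) tuple layout are assumptions made for illustration only; they are not part of the dataset or its tooling.

```python
import math

def pad_constituents(constituents, max_n=200):
    '''Sort four-momenta (E, px, py, pz) by descending pT, keep the leading
    max_n, and zero-pad shorter jets. Hypothetical helper, for illustration.'''
    def pt(p):
        # Transverse momentum from the px and py components
        return math.hypot(p[1], p[2])
    leading = sorted(constituents, key=pt, reverse=True)[:max_n]
    padding = [(0.0, 0.0, 0.0, 0.0)] * (max_n - len(leading))
    return leading + padding

# Toy jet with three constituents, padded to five entries
jet = [(5.0, 1.0, 0.0, 4.0), (50.0, 30.0, 40.0, 0.0), (12.0, 3.0, 4.0, 10.0)]
padded = pad_constituents(jet, max_n=5)
print(padded)
```

With max_n=200 the same function would reproduce the fixed-length layout described in the list, under the stated assumptions about the tuple order.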

"},{"location":"resources/dataset_resources/index.html#nature-of-objects_1","title":"Nature of Objects","text":"
  • Objects: 14 TeV, hadronic tops for signal, QCD dijets background, Delphes ATLAS detector card with Pythia8
  • Number of Objects: In total 1.2M training events, 400k validation events and 400k test events.
"},{"location":"resources/dataset_resources/index.html#format-of-dataset_1","title":"Format of Dataset","text":"
  • File Type: HDF5
  • Structure: Use \u201ctrain\u201d for training, \u201cval\u201d for validation during the training and \u201ctest\u201d for final testing and reporting results. For details, see the Zenodo link
"},{"location":"resources/dataset_resources/index.html#related-projects_1","title":"Related Projects","text":"
  • Butter, Anja; Kasieczka, Gregor; Plehn, Tilman and Russell, Michael (2017). Based on data from 10.21468/SciPostPhys.5.3.028 (1707.08966)
  • Kasieczka, Gregor et al (2019). Dataset used for arXiv:1902.09914 (The Machine Learning Landscape of Top Taggers)
"},{"location":"resources/dataset_resources/index.html#more-dataset-coming-in","title":"More dataset coming in!","text":"

Have any questions? Want your dataset shown on this page? Contact the ML Knowledge Subgroup!

"},{"location":"resources/fpga_resources/index.html","title":"FPGA Resource","text":"

Work in progress.

"},{"location":"resources/gpu_resources/cms_resources/lxplus_gpu.html","title":"lxplus-gpu.cern.ch","text":""},{"location":"resources/gpu_resources/cms_resources/lxplus_gpu.html#how-to-use-it","title":"How to use it?","text":"

lxplus-gpu are special lxplus nodes with GPU support. You can access these nodes by executing

ssh <your_user_name>@lxplus-gpu.cern.ch\n

The configuration of the software environment for lxplus-gpu is described in the Software Environments page.

"},{"location":"resources/gpu_resources/cms_resources/lxplus_htcondor.html","title":"HTCondor With GPU resources","text":"

In general, HTCondor supports GPU jobs if there are some worker nodes which are configured with GPU devices. CMS Connect and lxplus both have access to worker nodes equipped with GPUs.

"},{"location":"resources/gpu_resources/cms_resources/lxplus_htcondor.html#how-to-require-gpus-in-htcondor","title":"How to require GPUs in HTCondor","text":"

Users can request GPU support for their jobs by adding the following line to the HTCondor submission file.

request_gpus = n # n equal to the number of GPUs required\n
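
As a rough sketch of how this line fits into a complete submit description, the snippet below assembles a minimal one from Python. The helper name gpu_submit_description and the chosen defaults are hypothetical; only request_gpus comes from the text above, and the remaining keys are standard HTCondor submit commands.

```python
def gpu_submit_description(executable, n_gpus=1, n_cpus=1):
    '''Return the lines of a minimal HTCondor submit description that
    requests GPUs. Hypothetical helper; adapt the values to your job.'''
    if n_gpus < 1:
        raise ValueError('n_gpus must be at least 1')
    return [
        'executable   = ' + executable,
        'output       = $(ClusterId).$(ProcId).out',
        'error        = $(ClusterId).$(ProcId).err',
        'log          = $(ClusterId).$(ProcId).log',
        'request_gpus = ' + str(n_gpus),
        'request_cpus = ' + str(n_cpus),
        'queue 1',
    ]

lines = gpu_submit_description('job.sh', n_gpus=1, n_cpus=2)
for line in lines:
    print(line)
```

The printed lines can be saved to a file and passed to condor_submit.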
"},{"location":"resources/gpu_resources/cms_resources/lxplus_htcondor.html#further-documentation","title":"Further documentation","text":"

Detailed documentation on how to run HTCondor jobs with GPU support is available for both systems.

The configuration of the software environment for lxplus-gpu and HTCondor is described in the Software Environments page. Moreover, the page Using containers explains step by step how to build a Docker image to be run in HTCondor jobs.

"},{"location":"resources/gpu_resources/cms_resources/lxplus_htcondor.html#more-available-resources","title":"More available resources","text":"
  1. Complete documentation can be found in the GPUs section of the CERN Batch Docs, where a TensorFlow example is supplied. This documentation also contains instructions on advanced HTCondor configuration, for instance constraining the GPU device or CUDA version.
  2. A good example of submitting a GPU HTCondor job @ Lxplus is the weaver-benchmark project. It provides a concrete example of how to set up the environment for the weaver framework and run the training and testing process within a single job. A detailed description can be found in the ParticleNet section of this documentation.

    In principle, this example can be run elsewhere as HTCondor jobs. However, paths to the datasets should be modified to meet the requirements.

  3. CMS Connect also provides documentation on GPU job submission, which also includes a TensorFlow example.

    When submitting GPU jobs @ CMS Connect, especially for machine learning purposes, EOS space @ CERN is not accessible as a directory; one should therefore consider using xrootd utilities as documented in this page.

"},{"location":"resources/gpu_resources/cms_resources/ml_cern_ch.html","title":"ml.cern.ch","text":"

ml.cern.ch is a Kubeflow based ML solution provided by CERN.

"},{"location":"resources/gpu_resources/cms_resources/ml_cern_ch.html#kubeflow","title":"Kubeflow","text":"

Kubeflow is a Kubernetes-based ML toolkit that aims to make deployments of ML workflows simple, portable and scalable. In Kubeflow, the pipeline is an important concept: machine learning workflows are described as Kubeflow pipelines for execution.

"},{"location":"resources/gpu_resources/cms_resources/ml_cern_ch.html#how-to-access","title":"How to access","text":"

ml.cern.ch only accepts connections from within the CERN network. Therefore, if you are outside of CERN, you will need to use a network tunnel (e.g. via ssh -D dynamic port forwarding as a SOCKS5 proxy). The main website is shown below.

"},{"location":"resources/gpu_resources/cms_resources/ml_cern_ch.html#examples","title":"Examples","text":"

After logging into the main website, you can click on the Examples entry to browse a GitLab repository containing many examples. For instance, below are two examples from that repository, each with a well-documented README file.

  1. mnist-kfp is an example of how to use Jupyter notebooks to create a Kubeflow pipeline (kfp) and how to access CERN EOS files.
  2. katib gives an example of how to use Katib to perform hyperparameter tuning for jet tagging with ParticleNet.
"},{"location":"resources/gpu_resources/cms_resources/swan.html","title":"SWAN","text":""},{"location":"resources/gpu_resources/cms_resources/swan.html#preparation","title":"Preparation","text":"
  1. Registration:

    To request GPU resources for SWAN: according to this thread, one can create a ticket through this link to ask for GPU support at SWAN. It is currently in beta and limited to a small scale.

  2. Set up SWAN with GPU resources:

    1. Once the registration is done, one can log in to SWAN with Kerberos support and then create a SWAN environment.

      \ud83d\udca1 Note: When configuring the SWAN environment you will be given your choice of software stack. Be careful to use a software release with GPU support as well as an appropriate CUDA version. If you need to install additional software, it must be compatible with your chosen CUDA version.

Another important option is the environment script, which will be discussed later in this document.

"},{"location":"resources/gpu_resources/cms_resources/swan.html#working-with-swan","title":"Working with SWAN","text":"
  1. After creation, one lands in the SWAN main directory My Project, where all existing projects are displayed. A new project can be created by clicking the upper right \"+\" button. After creation one will be redirected to the newly created project, at which point the \"+\" button on the upper right panel can be used to create a new notebook.

  2. It is possible to use the terminal for installing new packages or monitoring computational resources.

    1. For package installation, one can install packages with package management tools, e.g. pip for Python. To use the installed packages, you will need to wrap the environment configuration in a script, which will be executed by SWAN. Detailed documentation can be found by clicking the upper right \"?\" button.

    2. In addition to using top and htop to monitor ordinary resources, you can use nvidia-smi to monitor GPU usage.
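
For monitoring from inside a notebook rather than the terminal, a minimal sketch along the following lines may help. It assumes nvidia-smi is on the PATH and reports plain integers for both fields; the function names and dictionary keys are hypothetical, and the helper returns an empty list when the tool is missing or fails.

```python
import shutil
import subprocess

def parse_gpu_stats(csv_text):
    '''Parse lines of the form: utilisation [%], used memory [MiB].'''
    stats = []
    for line in csv_text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(','))
        stats.append({'util_percent': int(util), 'mem_used_mib': int(mem)})
    return stats

def query_gpus():
    '''Return per-GPU stats, or an empty list if nvidia-smi is unavailable.'''
    if shutil.which('nvidia-smi') is None:
        return []
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=utilization.gpu,memory.used',
         '--format=csv,noheader,nounits'],
        capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_stats(result.stdout)

print(query_gpus())
```

Calling query_gpus() periodically during training gives a coarse view of GPU utilisation and memory use.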

"},{"location":"resources/gpu_resources/cms_resources/swan.html#examples","title":"Examples","text":"

After installing the packages, you can use GPU-based machine learning algorithms. Two examples are supplied.

  1. The first example uses a CNN to classify handwritten digits from the MNIST dataset. The whole notebook can be found at pytorch_mnist. This example is modified from an official PyTorch example.

  2. The second example is modified from the simple MLP example from weaver-benchmark. The whole notebook can be found at toptagging_mlp.

"},{"location":"resources/gpu_resources/cms_resources/notebooks/pytorch_mnist.html","title":"Pytorch mnist","text":"
from __future__ import print_function\nimport argparse\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport torch.optim as optim\nfrom torchvision import datasets, transforms\nfrom torch.optim.lr_scheduler import StepLR\n
class Net(nn.Module):\n    def __init__(self):\n        super(Net, self).__init__()\n        self.conv1 = nn.Conv2d(1, 32, 3, 1)\n        self.conv2 = nn.Conv2d(32, 64, 3, 1)\n        self.dropout1 = nn.Dropout(0.25)\n        self.dropout2 = nn.Dropout(0.5)\n        self.fc1 = nn.Linear(9216, 128)\n        self.fc2 = nn.Linear(128, 10)\n\n    def forward(self, x):\n        x = self.conv1(x)\n        x = F.relu(x)\n        x = self.conv2(x)\n        x = F.relu(x)\n        x = F.max_pool2d(x, 2)\n        x = self.dropout1(x)\n        x = torch.flatten(x, 1)\n        x = self.fc1(x)\n        x = F.relu(x)\n        x = self.dropout2(x)\n        x = self.fc2(x)\n        output = F.log_softmax(x, dim=1)\n        return output\n
def train(args, model, device, train_loader, optimizer, epoch):\n    model.train()\n    for batch_idx, (data, target) in enumerate(train_loader):\n        data, target = data.to(device), target.to(device)\n\n        optimizer.zero_grad()\n        output = model(data)\n        loss = F.nll_loss(output, target)\n        loss.backward()\n        optimizer.step()\n        if batch_idx % args[\"log_interval\"] == 0:\n            print('Train Epoch: {} [{}/{} ({:.0f}%)]\\tLoss: {:.6f}'.format(\n                epoch, batch_idx * len(data), len(train_loader.dataset),\n                100. * batch_idx / len(train_loader), loss.item()))\n            if args[\"dry_run\"]:\n                break\n
def test(model, device, test_loader):\n    model.eval()\n    test_loss = 0\n    correct = 0\n    with torch.no_grad():\n        for data, target in test_loader:\n            data, target = data.to(device), target.to(device)\n            output = model(data)\n            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss\n            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability\n            correct += pred.eq(target.view_as(pred)).sum().item()\n\n    test_loss /= len(test_loader.dataset)\n\n    print('\\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\\n'.format(\n        test_loss, correct, len(test_loader.dataset),\n        100. * correct / len(test_loader.dataset)))\n
torch.cuda.is_available() # Check if cuda is available\n
train_kwargs = {\"batch_size\":64}\ntest_kwargs = {\"batch_size\":1000}\n
cuda_kwargs = {'num_workers': 1,\n               'pin_memory': True,\n               'shuffle': True}\ntrain_kwargs.update(cuda_kwargs)\ntest_kwargs.update(cuda_kwargs)\n
transform=transforms.Compose([\n    transforms.ToTensor(),\n    transforms.Normalize((0.1307,), (0.3081,))\n    ])\n
dataset1 = datasets.MNIST('./data', train=True, download=True,\n                   transform=transform)\ndataset2 = datasets.MNIST('./data', train=False,\n                   transform=transform)\ntrain_loader = torch.utils.data.DataLoader(dataset1,**train_kwargs)\ntest_loader = torch.utils.data.DataLoader(dataset2, **test_kwargs)\n
device = torch.device(\"cuda\")\nmodel = Net().to(device)\noptimizer = optim.Adadelta(model.parameters(), lr=1.0)\nscheduler = StepLR(optimizer, step_size=1, gamma=0.7)\n
args = {\"dry_run\":False, \"log_interval\":100}\nfor epoch in range(1, 14 + 1):\n    train(args, model, device, train_loader, optimizer, epoch)\n    test(model, device, test_loader)\n    scheduler.step()\n
"},{"location":"resources/gpu_resources/cms_resources/notebooks/toptagging_mlp.html","title":"Toptagging mlp","text":"

import torch\nimport torch.nn as nn\nfrom torch.utils.data.dataset import Dataset\nimport pandas as pd\nimport numpy as np\nimport uproot3\nimport torch.optim as optim\nfrom torch.optim.lr_scheduler import StepLR\nimport torch.nn.functional as F\nimport awkward0\n
class MultiLayerPerceptron(nn.Module):\n    r\"\"\"Parameters\n    ----------\n    input_dims : int\n        Input feature dimensions.\n    num_classes : int\n        Number of output classes.\n    layer_params : list\n        List of the feature size for each layer.\n    \"\"\"\n\n    def __init__(self, input_dims, num_classes,\n                 layer_params=(256,64,16),\n                 **kwargs):\n\n        super(MultiLayerPerceptron, self).__init__(**kwargs)\n        channels = [input_dims] + list(layer_params) + [num_classes]\n        layers = []\n        for i in range(len(channels) - 1):\n            layers.append(nn.Sequential(nn.Linear(channels[i], channels[i + 1]),\n                                        nn.ReLU()))\n        self.mlp = nn.Sequential(*layers)\n\n    def forward(self, x):\n        # x: the feature vector initially read from the data structure, in dimension (N, C, P)\n        x = x.flatten(start_dim=1) # (N, L), where L = C * P\n        return self.mlp(x)\n\n    def predict(self, x):\n        # Softmax over the class axis; index 1 is the signal class (is_signal_new == 1)\n        pred = F.softmax(self.forward(x), dim=1)\n        # Predicted label = index of the most probable class\n        return pred.argmax(dim=1)\n

def train(args, model, device, train_loader, optimizer, epoch):\n    model.train()\n    for batch_idx, (data, target) in enumerate(train_loader):\n        data, target = data.to(device), target.to(device)\n        optimizer.zero_grad()\n        output = model(data)\n        loss = F.nll_loss(output, target)\n        loss.backward()\n        optimizer.step()\n        if batch_idx % args[\"log_interval\"] == 0:\n            print('Train Epoch: {} [{}/{} ({:.0f}%)]\\tLoss: {:.6f}'.format(\n                epoch, batch_idx * len(data), len(train_loader.dataset),\n                100. * batch_idx / len(train_loader), loss.item()))\n            if args[\"dry_run\"]:\n                break\n
input_branches = [\n                  'Part_Etarel',\n                  'Part_Phirel',\n                  'Part_E_log',\n                  'Part_P_log'\n                 ]\n\noutput_branches = ['is_signal_new']\n
train_dataset = uproot3.open(\"TopTaggingMLP/train.root\")[\"Events\"].arrays(input_branches+output_branches,namedecode='utf-8')\ntrain_dataset = {name:train_dataset[name].astype(\"float32\") for name in input_branches+output_branches}\ntest_dataset = uproot3.open(\"/eos/user/c/coli/public/weaver-benchmark/top_tagging/samples/prep/top_test_0.root\")[\"Events\"].arrays(input_branches+output_branches,namedecode='utf-8')\ntest_dataset = {name:test_dataset[name].astype(\"float32\") for name in input_branches+output_branches}\n
for ds in [train_dataset,test_dataset]:\n    for name in ds.keys():\n        if isinstance(ds[name],awkward0.JaggedArray):\n            ds[name] = ds[name].pad(30,clip=True).fillna(0).regular().astype(\"float32\")\n
class PF_Features(Dataset):\n    def __init__(self,mode = \"train\"):\n        if mode == \"train\":\n            self.x = {key:train_dataset[key] for key in input_branches}\n            self.y = {'is_signal_new':train_dataset['is_signal_new']}\n        elif mode == \"test\":\n            self.x = {key:test_dataset[key] for key in input_branches}\n            self.y = {'is_signal_new':test_dataset['is_signal_new']}\n        elif mode == \"val\":\n            # No separate validation sample is loaded here, so the test sample is reused\n            self.x = {key:test_dataset[key] for key in input_branches}\n            self.y = {'is_signal_new':test_dataset['is_signal_new']}\n\n    def __len__(self):\n        return len(self.y['is_signal_new'])\n\n    def __getitem__(self,idx):\n        X = [self.x[key][idx].copy() for key in input_branches]\n        X = np.vstack(X)\n        y = self.y['is_signal_new'][idx].copy()\n        return X,y\n
torch.cuda.is_available() # Check if cuda is available\n
True\n
device = torch.device(\"cuda\")\n
train_kwargs = {\"batch_size\":1000}\ntest_kwargs = {\"batch_size\":10}\ncuda_kwargs = {'num_workers': 1,\n               'pin_memory': True,\n               'shuffle': True}\ntrain_kwargs.update(cuda_kwargs)\ntest_kwargs.update(cuda_kwargs)\n
model = MultiLayerPerceptron(input_dims = 4 * 30, num_classes=2).to(device)\n
optimizer = optim.Adam(model.parameters(), lr=0.01)\n
train_loader = torch.utils.data.DataLoader(PF_Features(mode=\"train\"),**train_kwargs)\ntest_loader = torch.utils.data.DataLoader(PF_Features(mode=\"test\"),**test_kwargs)\n
loss_func = torch.nn.CrossEntropyLoss()\n
args = {\"dry_run\":False, \"log_interval\":500}\nfor epoch in range(1,10+1):\n    for batch_idx, (data, target) in enumerate(train_loader):\n        inputs = data.to(device)#.flatten(start_dim=1)\n        target = target.long().to(device)\n        optimizer.zero_grad()\n        output = model.forward(inputs)\n        loss = loss_func(output,target)\n        loss.backward()\n        optimizer.step()\n        if batch_idx % args[\"log_interval\"] == 0:\n            print('Train Epoch: {} [{}/{} ({:.0f}%)]\\tLoss: {:.6f}'.format(\n                epoch, batch_idx * len(data), len(train_loader.dataset),\n                100. * batch_idx / len(train_loader), loss.item()))\n
"},{"location":"software_envs/containers.html","title":"Using containers","text":"

Containers are a great solution to isolate a software environment, especially in batch systems like lxplus. At the moment two container solutions are supported: Apptainer (previously called Singularity) and Docker.

"},{"location":"software_envs/containers.html#using-singularity","title":"Using Singularity","text":"

The unpacked.cern.ch service mounted on CVMFS contains many Singularity images, some of which are suitable for machine learning applications. A description of each of the images is beyond the scope of this document. However, if you find an image which is useful for your application, you can use it by running a Singularity container with the appropriate options. For example:

singularity run --nv --bind <bind_mount_path> /cvmfs/unpacked.cern.ch/<path_to_image>\n

"},{"location":"software_envs/containers.html#examples","title":"Examples","text":"

After installing the packages, you can use GPU-based machine learning algorithms. Two examples are supplied.

  1. The first example uses a CNN to classify handwritten digits from the MNIST dataset. The whole notebook can be found at pytorch_mnist. This example is modified from an official PyTorch example.

  2. The second example is modified from the simple MLP example from weaver-benchmark. The whole notebook can be found at toptagging_mlp.

"},{"location":"software_envs/containers.html#using-docker","title":"Using Docker","text":"

Docker is not supported at the moment on the interactive nodes of lxplus (like lxplus-gpu). However, Docker is supported on HTCondor for job submission.

This option can be very handy for users, as HTCondor can pull images from any public registry, like DockerHub or the GitLab registry. The user can follow this workflow:
  1. Define a custom image on top of a commonly available PyTorch or TensorFlow image
  2. Add the desired packages and configuration
  3. Push the Docker image to a registry
  4. Use the image in an HTCondor job

The rest of the page is a step by step tutorial for this workflow.

"},{"location":"software_envs/containers.html#define-the-image","title":"Define the image","text":"
  1. Define a file Dockerfile

    FROM pytorch/pytorch:latest\n\nADD localfolder_with_code /opt/mycode\n\n\nRUN  cd /opt/mycode && pip install -e . # or pip install requirements\n\n# Install the required Python packages\nRUN pip install \\\n    numpy \\\n    sympy \\\n    scikit-learn \\\n    numba \\\n    opt_einsum \\\n    h5py \\\n    cytoolz \\\n    tensorboardx \\\n    seaborn \\\n    rich \\\n    pytorch-lightning==1.7\n\n# Alternatively, copy a requirements file into the image and install from it:\n# ADD requirements.txt /opt/requirements.txt\n# RUN pip install -r /opt/requirements.txt\n
  2. Build the image

    docker build -t username/pytorch-condor-gpu:tag .\n

    and push it (after setting up the credentials with docker login hub.docker.com)

    docker push username/pytorch-condor-gpu:tag\n
  3. Set up the condor job with a submission file submitfile as:

    universe                = docker\ndocker_image            = username/pytorch-condor-gpu:tag\nexecutable              = job.sh\nwhen_to_transfer_output = ON_EXIT\noutput                  = $(ClusterId).$(ProcId).out\nerror                   = $(ClusterId).$(ProcId).err\nlog                     = $(ClusterId).$(ProcId).log\nrequest_gpus            = 1\nrequest_cpus            = 2\n+Requirements           = OpSysAndVer =?= \"CentOS7\"\n+JobFlavour = espresso\nqueue 1\n
  4. For testing purposes, one can start a job interactively and debug it

    condor_submit -interactive submitfile\n
"},{"location":"software_envs/lcg_environments.html","title":"LCG environments","text":""},{"location":"software_envs/lcg_environments.html#software-environment","title":"Software Environment","text":"

The software environment for training ML applications can be set up in different ways. On this page we focus on the CERN lxplus environment.

"},{"location":"software_envs/lcg_environments.html#lcg-release-software","title":"LCG release software","text":"

After choosing a suitable software bundle with CUDA support at http://lcginfo.cern.ch/, one can set up an LCG environment by executing

source /cvmfs/sft.cern.ch/lcg/views/<name of bundle>/x86_64-centos*-gcc11-opt/setup.sh\n

On lxplus-gpu nodes, usually equipped with AlmaLinux 9.1 (also called Centos9), one should use the matching LCG release. At the time of writing (May 2023) the recommended environment to use GPUs is:

source /cvmfs/sft.cern.ch/lcg/views/LCG_103cuda/x86_64-centos9-gcc11-opt/setup.sh\n
"},{"location":"software_envs/lcg_environments.html#customized-environments","title":"Customized environments","text":"

One can create a custom Python environment using the virtualenv or venv tools, in order to avoid interfering with the global Python environment.

The user can either build a virtual environment from scratch or base it on top of an LCG release.

"},{"location":"software_envs/lcg_environments.html#virtual-environment-from-scratch","title":"Virtual environment from scratch","text":"

The first approach is cleaner but requires downloading the full set of libraries needed for pytorch or TensorFlow (very heavy). Moreover the compatibility with the computing environment (usually lxplus-gpu) is not guaranteed.

  1. Create the environment in a folder of choice, usually called myenv

    python3 -m venv --system-site-packages myenv\nsource myenv/bin/activate   # activate the environment\n# Add following line to .bashrc if you want to activate this environment by default (not recommended)\n# source \"/afs/cern.ch/user/<first letter of your username>/<username>/<path-to-myenv-folder>/myenv/bin/activate\"\n
  2. To install packages properly, one should carefully check the CUDA version with nvidia-smi (as shown in the figure before), and then pick a matching package version. PyTorch is used as an example here.

    # Execute the command shown in your terminal\npip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html\npip install jupyterlab matplotlib scikit-hep # install other packages if they are needed\n
"},{"location":"software_envs/lcg_environments.html#virtual-environment-on-top-of-lcg","title":"Virtual environment on top of LCG","text":"

Creating a virtual environment only to add packages on top of a specific LCG release can be a very effective and inexpensive way to manage the Python environment on lxplus.

N.B. A caveat is that the user needs to remember to activate the LCG environment before activating the virtual environment.

  1. Activate the lcg environment of choice

    source /cvmfs/sft.cern.ch/lcg/views/LCG_103cuda/x86_64-centos9-gcc11-opt/setup.sh\n
  2. Create the environment as above

    python3 -m venv --system-site-packages myenv\nsource myenv/bin/activate   # activate the environment\n
  3. Now the user can work in the environment as before, but the PyTorch and TensorFlow libraries will be available. If a single package needs to be updated one can do

pip install --upgrade tensorflow==newer.version\n

This will install the package in the local environment.

At the next login, the user will need to perform these steps to get back the environment:

source /cvmfs/sft.cern.ch/lcg/views/LCG_103cuda/x86_64-centos9-gcc11-opt/setup.sh\nsource myenv/bin/activate\n
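
To confirm that both steps took effect, a quick check from Python can be useful. The sketch below relies only on the standard library: inside a venv, sys.prefix differs from sys.base_prefix. The function name in_virtualenv is chosen here for illustration.

```python
import sys

def in_virtualenv():
    '''True when the interpreter runs inside a venv-style environment.'''
    return sys.prefix != getattr(sys, 'base_prefix', sys.prefix)

print('virtual environment active:', in_virtualenv())
print('interpreter prefix:', sys.prefix)
```

If the first line prints False after sourcing myenv/bin/activate, the activation most likely did not apply to the current shell.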
"},{"location":"software_envs/lcg_environments.html#conda-environments","title":"Conda environments","text":"

Using the conda package manager: the conda package manager is more convenient to install and use. To begin with, obtain an Anaconda or Miniconda installer for the Linux x86_64 platform, then execute it on lxplus.

1. Please note that if you update your shell configuration (e.g. `.bashrc` file) by `conda init`, you may encounter failure due to inconsistent environment configuration.\n2. Installing packages via `conda` also needs special consideration on selecting proper CUDA version as discussed in `pip` part.\n
"},{"location":"training/MLaaS4HEP.html","title":"MLaaS4HEP","text":""},{"location":"training/MLaaS4HEP.html#machine-learning-as-a-service-for-hep","title":"Machine Learning as a Service for HEP","text":"

MLaaS for HEP is a set of Python-based modules that read HEP data and stream them to the ML tool of the user's choice. It consists of three independent layers:
  • Data Streaming layer to handle remote data, see reader.py
  • Data Training layer to train ML models for given HEP data, see workflow.py
  • Data Inference layer, see tfaas_client.py

The MLaaS4HEP repository can be found here.

The general architecture of MLaaS4HEP looks like this:

Even though this architecture was originally developed for dealing with HEP ROOT files, we have extended it to other data formats. Currently the following data formats are supported: JSON, CSV, Parquet, and ROOT. All of the formats support reading files from the local file system or HDFS, while the ROOT format also supports reading files via the XRootD protocol.

The pre-trained models can be easily uploaded to the TFaaS inference server, which serves them to clients. The TFaaS documentation can be found here.

"},{"location":"training/MLaaS4HEP.html#dependencies","title":"Dependencies","text":"

Here is a list of the dependencies:
  • pyarrow for reading data from the HDFS file system
  • uproot for reading ROOT files
  • numpy, pandas for data representation
  • modin for fast pandas support
  • numba for speeding up individual functions

"},{"location":"training/MLaaS4HEP.html#installation","title":"Installation","text":"

The easiest way to install and run MLaaS4HEP and TFaaS is to use the pre-built Docker images

# run MLaaS4HEP docker container\ndocker run veknet/mlaas4hep\n# run TFaaS docker container\ndocker run veknet/tfaas\n

"},{"location":"training/MLaaS4HEP.html#reading-root-files","title":"Reading ROOT files","text":"

The MLaaS4HEP Python repository provides the reader.py module, which defines a DataReader class able to read either local or remote ROOT files (via xrootd) in chunks. It is based on the uproot framework.

Basic usage

# setup the proper environment, e.g.\n# export PYTHONPATH=/path/src/python # path to MLaaS4HEP python framework\n# export PATH=/path/bin:$PATH # path to MLaaS4HEP binaries\n\n# get help and option description\nreader --help\n\n# here is a concrete example of reading local ROOT file:\nreader --fin=/opt/cms/data/Tau_Run2017F-31Mar2018-v1_NANOAOD.root --info --verbose=1 --nevts=2000\n\n# here is an example of reading remote ROOT file:\nreader --fin=root://cms-xrd-global.cern.ch//store/data/Run2017F/Tau/NANOAOD/31Mar2018-v1/20000/6C6F7EAE-7880-E811-82C1-008CFA165F28.root --verbose=1 --nevts=2000 --info\n\n# both of aforementioned commands produce the following output\nReading root://cms-xrd-global.cern.ch//store/data/Run2017F/Tau/NANOAOD/31Mar2018-v1/20000/6C6F7EAE-7880-E811-82C1-008CFA165F28.root\n# 1000 entries, 883 branches, 4.113945007324219 MB, 0.6002757549285889 sec, 6.853425235896175 MB/sec, 1.6659010326328503 kHz\n# 1000 entries, 883 branches, 4.067909240722656 MB, 1.3497390747070312 sec, 3.0138486148558896 MB/sec, 0.740883937302516 kHz\n###total time elapsed for reading + specs computing: 2.2570559978485107 sec; number of chunks 2\n###total time elapsed for reading: 1.9500117301940918 sec; number of chunks 2\n\n--- first pass: 1131872 events, (648-flat, 232-jagged) branches, 2463 attrs\nVMEM used: 29.896704 (MB) SWAP used: 0.0 (MB)\n<__main__.RootDataReader object at 0x7fb0cdfe4a00> init is complete in 2.265552043914795 sec\nNumber of events  : 1131872\n# flat branches   : 648\nCaloMET_phi values in [-3.140625, 3.13671875] range, dim=N/A\nCaloMET_pt values in [0.783203125, 257.75] range, dim=N/A\nCaloMET_sumEt values in [820.0, 3790.0] range, dim=N/A\n

More examples about using uproot may be found here and here.

"},{"location":"training/MLaaS4HEP.html#how-to-train-ml-models-on-hep-root-data","title":"How to train ML models on HEP ROOT data","text":"

The MLaaS4HEP framework allows ML models to be trained in different ways:
  • using the full dataset (i.e. all the events stored in the input ROOT files)
  • using chunks, i.e. subsets of a dataset, whose size can be chosen directly by the user and can vary between 1 and the total number of events
  • using local or remote ROOT files.

The training phase is managed by the workflow.py module, which performs the following actions:
  • read all input ROOT files in chunks to compute a specs file (where the main information about the ROOT files is stored: the dimension of branches, the minimum and the maximum for each branch, and the number of events for each ROOT file)
  • perform the training cycle (each time using a new chunk of events):
    • create a new chunk of events taken proportionally from the input ROOT files
    • extract and convert each event into a list of NumPy arrays
    • normalize the events
    • fix the Jagged Arrays dimension
    • create the masking vector
    • use the chunk to train the ML model provided by the user
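
The normalisation, dimension-fixing and masking steps listed above can be sketched in a few lines. This is not the MLaaS4HEP implementation: min-max scaling is assumed as the normalisation, and the function names normalise and pad_with_mask are hypothetical.

```python
def normalise(values, vmin, vmax):
    '''Min-max scale values to [0, 1] using a per-branch range (assumed to
    come from the specs file).'''
    span = vmax - vmin
    return [(v - vmin) / span if span else 0.0 for v in values]

def pad_with_mask(values, dim, fill=0.0):
    '''Pad or clip a jagged branch to a fixed dim; the mask marks real
    entries with 1 and padded entries with 0.'''
    clipped = list(values[:dim])
    n_pad = dim - len(clipped)
    return clipped + [fill] * n_pad, [1] * len(clipped) + [0] * n_pad

# One event with three jets, padded to a fixed dimension of five
jet_pt = [20.0, 60.0, 100.0]
scaled = normalise(jet_pt, vmin=20.0, vmax=100.0)
fixed, mask = pad_with_mask(scaled, dim=5)
print(fixed, mask)
```

The mask lets a model distinguish genuine zero values from padding introduced when fixing the jagged dimension.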

A schematic representation of the steps performed in the MLaaS4HEP pipeline, in particular those inside the Data Streaming and Data Training layers, is:

If the dataset is large and exceeds the amount of RAM on the training node, the user should consider the chunk approach. This allows the ML model to be trained on a different chunk each time, until the entire dataset has been read. In this case the user should pay close attention to the ML model convergence, and validate it after each chunk. For more information look at this, this and this. Each training approach has pros and cons. For instance, training on the entire dataset helps the ML model convergence, but the dataset must fit into the RAM of the training node. The chunk approach, in contrast, allows the dataset to be split to fit the hardware resources, but it requires a proper model evaluation after each chunk. In terms of training speed, the chunk approach should be faster than training on the entire dataset, since a chunk that has been used for training is not read or used again (this effect is prominent when remote ROOT files are used). Finally, the user should be aware of a potential divergence of the ML model when training on the last chunk of the dataset, and should check for a bias towards that chunk. For instance, the user may implement a K-fold cross-validation approach, training on N-1 chunks (i.e. folds in this case) and using the remaining chunk for validation.
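
The chunk approach can be illustrated with a minimal sketch. It does not use the MLaaS4HEP API: the toy data, the logistic-regression model and all names below (chunk_size, val_history, etc.) are assumptions made purely for illustration. The only point is to show a model being updated chunk by chunk and validated after each chunk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for events read from ROOT files (hypothetical).
n_events, n_features = 6000, 4
X = rng.normal(size=(n_events, n_features))
true_w = np.array([1.5, -2.0, 0.5, 0.0])
y = (X @ true_w + 0.3 * rng.normal(size=n_events) > 0).astype(float)

# Hold out a validation set that is checked after every chunk.
X_val, y_val = X[:1000], y[:1000]
X_train, y_train = X[1000:], y[1000:]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def accuracy(w, b, X, y):
    return float(np.mean((sigmoid(X @ w + b) > 0.5) == (y > 0.5)))

w, b = np.zeros(n_features), 0.0
chunk_size, lr, epochs_per_chunk = 1000, 0.5, 20
val_history = []

for start in range(0, len(X_train), chunk_size):
    # Each chunk is used once and never read again.
    Xc = X_train[start:start + chunk_size]
    yc = y_train[start:start + chunk_size]
    for _ in range(epochs_per_chunk):
        p = sigmoid(Xc @ w + b)
        w -= lr * Xc.T @ (p - yc) / len(yc)   # gradient of the cross-entropy loss
        b -= lr * float(np.mean(p - yc))
    # Validate after every chunk to monitor convergence and possible drifts.
    val_history.append(accuracy(w, b, X_val, y_val))
    print(f'chunk starting at {start}: validation accuracy = {val_history[-1]:.3f}')
```

A validation metric that degrades after the last chunks would hint at the bias towards late chunks mentioned above.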

A detailed description of how to use the workflow.py module to train an ML model reading ROOT files from the opendata portal can be found here. Note that the user has to provide several pieces of information when running the workflow.py module, e.g. the definition of the ML model; it is then the task of the MLaaS4HEP framework to perform the whole training procedure using the ML model provided by the user.

For a complete description of MLaaS4HEP see this paper.

"},{"location":"training/autoencoders.html","title":"Autoencoders","text":""},{"location":"training/autoencoders.html#introduction","title":"Introduction","text":"

Autoencoders are a powerful tool that has recently gained popularity in HEP and beyond. These algorithms are neural networks that learn to compress and then decompress data with minimal reconstruction error (Goodfellow, et al.).

The idea of using neural networks for dimensionality reduction or feature learning dates back to the early 1990s. Autoencoders, or \"autoassociative neural networks,\" were originally proposed as a nonlinear generalization of principal component analysis (PCA) (Kramer). More recently, connections between autoencoders and latent variable models have brought these types of algorithms into the generative modeling space.

The two main parts of an autoencoder algorithm are the encoder function \\(f(x)\\) and the decoder function \\(g(h)\\), which acts on the encoded representation \\(h = f(x)\\). The learning process of an autoencoder is a minimization of a loss function, \\(L(x,g(f(x)))\\), that compares the original data to the output of the decoder, as for any other neural network. As such, these algorithms can be trained using the same techniques, like minibatch gradient descent with backpropagation. Below is a representation of an autoencoder from Mathworks.
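
As an illustration of the encoder/decoder structure and of the loss minimization, here is a minimal sketch of a linear autoencoder trained with minibatch gradient descent. The toy data, the layer sizes, the learning rate and the variable names are all assumptions chosen for brevity; a realistic autoencoder would use non-linear layers and a deep learning library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 500 samples in 5 dimensions that actually live on a 2D subspace.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
x = latent @ mixing

# Linear encoder f(x) = x W_enc and decoder g(h) = h W_dec (biases omitted).
W_enc = 0.1 * rng.normal(size=(5, 2))
W_dec = 0.1 * rng.normal(size=(2, 5))

def loss(x, W_enc, W_dec):
    # Mean squared reconstruction error L(x, g(f(x))).
    return float(np.mean((x @ W_enc @ W_dec - x) ** 2))

initial_loss = loss(x, W_enc, W_dec)
lr, batch_size = 0.05, 50
for epoch in range(200):
    order = rng.permutation(len(x))
    for i in range(0, len(x), batch_size):
        xb = x[order[i:i + batch_size]]
        h = xb @ W_enc                      # encode
        err = h @ W_dec - xb                # g(f(x)) - x
        grad_out = 2.0 * err / err.size     # dL/d(reconstruction)
        grad_dec = h.T @ grad_out           # backpropagate to the decoder
        grad_enc = xb.T @ (grad_out @ W_dec.T)  # backpropagate to the encoder
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc
final_loss = loss(x, W_enc, W_dec)
print(f'reconstruction loss: {initial_loss:.4f} -> {final_loss:.6f}')
```

Since the data lie on a 2D subspace, a 2D encoding is sufficient here and the reconstruction error can become very small.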

"},{"location":"training/autoencoders.html#constrained-autoencoders-undercomplete-and-regularized","title":"Constrained Autoencoders (Undercomplete and Regularized)","text":"

Information in this section can be found in Goodfellow, et al.

An autoencoder that is able to perfectly reconstruct the original data one-to-one, such that \\(g(f(x)) = x\\), is not very useful for extracting salient information from the data. There are several methods imposed on simple autoencoders to encourage them to extract useful aspects of the data.

One way of avoiding perfect data reconstruction is to constrain the dimension of the encoding \\(f(x)\\) to be smaller than the dimension of the data \\(x\\). These types of autoencoders are called undercomplete autoencoders. They force an imperfect copy of the data, such that the encoding and decoding networks have to prioritize the most useful aspects of the data.

However, if undercomplete encoders are given too much capacity, they will struggle to learn anything of importance from the data. Similarly, this problem occurs in autoencoders with encoder dimensionality greater than or equal to the data (the overcomplete case). In order to train any architecture of AE successfully, constraints based on the complexity of the target distribution must be imposed, apart from small dimensionality. These regularized autoencoders can have constraints on sparsity, robustness to noise, and robustness to changes in data (the derivative).

"},{"location":"training/autoencoders.html#sparse-autoencoders","title":"Sparse Autoencoders","text":"

Sparse autoencoders add a penalty that enforces sparsity in the encoding layer \\(\\mathbf{h} = f(\\mathbf{x})\\), such that the training objective becomes \\(L(\\mathbf{x}, g(f(\\mathbf{x}))) + \\Omega(\\mathbf{h})\\). This penalty prevents the autoencoder from learning the identity transformation and encourages it to extract useful features of the data to be used in later tasks, such as classification. While the penalty term can be thought of as a regularizing term for a feedforward network, we can expand this view to think of the entire sparse autoencoder framework as approximating the maximum likelihood estimation of a generative model with latent variables \\(\\mathbf{h}\\). When approximating the maximum likelihood, the joint distribution \\(p_{\\text{model}}(\\mathbf{x}, \\mathbf{h})\\) can be decomposed as

\\[ \\text{log} [ p_{\\text{model}}(\\mathbf{x}, \\mathbf{h})] = \\text{log} [p_{\\text{model}}(\\mathbf{h})] + \\text{log} [p_{\\text{model}}(\\mathbf{x} | \\mathbf{h})] \\]

where \\(p_{\\text{model}}(\\mathbf{h})\\) is the prior distribution over the latent variables, instead of the model's parameters. Here, we approximate the sum over all possible values of \\(\\mathbf{h}\\) by a point estimate at one highly likely value of \\(\\mathbf{h}\\). The prior term is what introduces the sparsity requirement, for example with the Laplace prior, $$ p_{\\text{model}}(h_i) = \\frac{\\lambda}{2}e^{-\\lambda|h_i|}. $$

The negative log-prior is then

$$ -\\text{log} [p_{\\text{model}}(\\mathbf{h})] = \\sum_i \\left(\\lambda|h_i| - \\text{log}\\frac{\\lambda}{2}\\right) = \\Omega(\\mathbf{h}) + \\text{const}. $$ This example demonstrates how the model's distribution over latent variables (prior) gives rise to a sparsity penalty.
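
The relation between the Laplace prior and the sparsity penalty can be checked numerically. The short sketch below uses arbitrary illustrative values for lambda and for the latent vector h, and verifies that the negative log of the Laplace prior equals an L1 penalty plus a constant that does not depend on h.

```python
import numpy as np

lam = 0.7                                  # arbitrary illustrative value
h = np.array([0.0, -1.2, 0.3, 2.5])        # arbitrary latent activations

# Log of the factorised Laplace prior, summed over the latent units.
log_prior = np.sum(np.log(lam / 2.0) - lam * np.abs(h))

# Sparsity penalty Omega(h) = lambda * sum_i |h_i| (an L1 norm).
omega = lam * np.sum(np.abs(h))
const = -h.size * np.log(lam / 2.0)        # independent of h

# The negative log-prior is the L1 penalty up to this constant.
assert np.isclose(-log_prior, omega + const)
```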

"},{"location":"training/autoencoders.html#penalized-autoencoders","title":"Penalized Autoencoders","text":"

Similar to sparse autoencoders, a traditional penalty term can be introduced to the cost function to regularize the autoencoder, such that the function to minimize becomes $$ L(\\mathbf{x},g(f(\\mathbf{x}))) + \\Omega(\\mathbf{h},\\mathbf{x}), $$ where $$ \\Omega(\\mathbf{h},\\mathbf{x}) = \\lambda\\sum_i ||\\nabla_{\\mathbf{x}}h_i||^2. $$ Because the penalty depends on the gradient of the latent variables with respect to the input variables, the model is penalized if its encoding responds strongly to slight changes in \\(\\mathbf{x}\\). This type of regularization leads to a contractive autoencoder (CAE).

"},{"location":"training/autoencoders.html#denoising-autoencoders","title":"Denoising Autoencoders","text":"

Another way to encourage autoencoders to learn useful features of the data is to train the algorithm to minimize a cost function that compares the original data (\\(\\mathbf{x}\\)) to the encoded and decoded version of data that has been injected with noise (\\(g(f(\\mathbf{\\tilde{x}}))\\)), $$ L(\\mathbf{x},g(f(\\mathbf{\\tilde{x}}))). $$ Denoising autoencoders must then learn to undo the effect of the noise in the encoded/decoded data. Through this process, the autoencoder is able to learn the structure of the probability density function of the data (\\(p_{\\text{data}}\\)) as a function of the input variables (\\(x\\)) (Alain, Bengio; Bengio, et al.). With this type of cost function, even overcomplete, high-capacity autoencoders can avoid learning the identity transformation.

"},{"location":"training/autoencoders.html#variational-autoencoders","title":"Variational Autoencoders","text":"

Variational autoencoders (VAEs), introduced by Kingma and Welling, are similar to normal AEs. They consist of neural nets which map the input to a latent space (encoder) and back (decoder), where the latent space is a low-dimensional, variational distribution. VAEs are bidirectional, generating data or estimating distributions, and were initially designed for unsupervised learning but can also be very useful in semi-supervised and fully supervised scenarios (Goodfellow, et al.).

VAEs are trained by maximizing the variational lower bound associated with data point \\(\\mathbf{x}\\), which is a function of the approximate posterior (inference network, or encoder), \\(q(\\mathbf{z} | \\mathbf{x})\\). The latent variable \\(\\mathbf{z}\\) is drawn from this encoder distribution, with \\(p_\\text{model}(\\mathbf{x} | \\mathbf{z})\\) viewed as the decoder network. The variational lower bound (also called the evidence lower bound or ELBO) is a trade-off between the expected log-likelihood of the visible variables given the latent variables (the reconstruction term) and the KL divergence between the approximate posterior and the model prior, shown below (Goodfellow, et al.).

$$ \\mathcal{L}(q) = E_{\\mathbf{z} \\sim q(\\mathbf{z} | \\mathbf{x})} \\text{log}p_\\text{model}(\\mathbf{x} | \\mathbf{z}) - D_\\text{KL}(q(\\mathbf{z} | \\mathbf{x}) || p_\\text{model}(\\mathbf{z})). $$
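
The two terms of the lower bound can be made concrete with a small sketch. It assumes a diagonal Gaussian encoder, a standard normal prior and a Gaussian decoder with fixed width, which are common choices but are not prescribed by the text above. The function names and the toy linear decoder are hypothetical, and the expectation is estimated with a single sample of z.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def elbo_estimate(x, mu, log_var, decode, rng, sigma_x=1.0):
    # Draw z ~ q(z|x) via the reparameterisation z = mu + sigma * eps.
    eps = rng.normal(size=mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps
    x_hat = decode(z)
    # log p(x|z) for a Gaussian decoder with fixed standard deviation sigma_x.
    log_lik = -0.5 * np.sum(
        ((x - x_hat) / sigma_x) ** 2 + np.log(2.0 * np.pi * sigma_x ** 2), axis=-1
    )
    # Reconstruction term minus KL term, one value per data point.
    return log_lik - kl_to_standard_normal(mu, log_var)

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 5))                # toy linear decoder weights

def toy_decoder(z):
    return z @ W

x = rng.normal(size=(4, 5))                # 4 data points with 5 features
mu = 0.1 * rng.normal(size=(4, 2))         # encoder means (2 latent dimensions)
log_var = np.zeros((4, 2))                 # encoder log-variances
elbo_values = elbo_estimate(x, mu, log_var, toy_decoder, rng)
```

In a real VAE, mu and log_var would be outputs of the encoder network, and the negative of this quantity, averaged over a minibatch, would serve as the training loss.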

Methods for optimizing the VAE by learning the variational lower bound include EM meta-algorithms like probabilistic PCA (Goodfellow, et al.).

"},{"location":"training/autoencoders.html#applications-in-hep","title":"Applications in HEP","text":"

One of the more popular applications of AEs in HEP is anomaly detection. Because autoencoders are trained to learn latent features of a dataset, any new data that does not match those features could be classified as an anomaly and picked out by the AE. Examples of AEs for anomaly detection in HEP are listed below:

  • Anomaly detection in high-energy physics using a quantum autoencoder
  • Particle Graph Autoencoders and Differentiable, Learned Energy Mover's Distance
  • Bump Hunting in Latent Space
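
The principle behind AE-based anomaly detection can be sketched with the reconstruction error as an anomaly score. To keep the example short, a PCA projection is used as a linear stand-in for a trained autoencoder. This substitution, the toy data and the 99% threshold are assumptions for illustration and are not taken from the papers listed above.

```python
import numpy as np

rng = np.random.default_rng(1)

# 'Normal' events lie close to a 2D plane in a 6D feature space (toy assumption).
basis = rng.normal(size=(2, 6))
normal_train = rng.normal(size=(2000, 2)) @ basis + 0.05 * rng.normal(size=(2000, 6))
normal_test = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 6))
anomalies = 2.0 * rng.normal(size=(50, 6))  # do not follow the learned structure

# Linear stand-in for a trained AE: keep the top-2 principal directions.
mean = normal_train.mean(axis=0)
_, _, vt = np.linalg.svd(normal_train - mean, full_matrices=False)
components = vt[:2]

def reconstruction_error(x):
    h = (x - mean) @ components.T           # 'encoder'
    x_hat = h @ components + mean           # 'decoder'
    return np.sum((x - x_hat) ** 2, axis=1)

# Flag events whose error exceeds the 99% quantile of the training errors.
threshold = np.quantile(reconstruction_error(normal_train), 0.99)
flagged_normal = float(np.mean(reconstruction_error(normal_test) > threshold))
flagged_anomalies = float(np.mean(reconstruction_error(anomalies) > threshold))
print(f'flagged: {flagged_normal:.1%} of normal events, {flagged_anomalies:.1%} of anomalies')
```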

Another application of (V)AEs in HEP is data generation, as once the likelihood of the latent variables is approximated it can be used to generate new data. Examples of this application in HEP for simulation of various physics processes are listed below:

  • Deep generative models for fast shower simulation in ATLAS
  • Sparse Data Generation for Particle-Based Simulation of Hadronic Jets in the LHC
  • Variational Autoencoders for Jet Simulation

Finally, the latent space learned by (V)AEs gives a parsimonious and information-rich phase space from which one can make inferences. Examples of using (V)AEs to learn approximate and/or compressed representations of data are given below:

  • An Exploration of Learnt Representations of W Jets
  • Machine-Learning Compression for Particle Physics Discoveries
  • Decoding Photons: Physics in the Latent Space of a BIB-AE Generative Network

More examples of (V)AEs in HEP can be found at the HEP ML Living Review.

"},{"location":"training/autoencoders.html#references","title":"References","text":"
  • Goodfellow, et al., 2016, Deep Learning
  • Alain, Bengio, 2013, \"What Regularized Auto-Encoders Learn from the Data Generating Distribution\"
  • Bengio, et al., 2013, \"Generalized Denoising Auto-Encoders as Generative Models\"
  • Kramer, 1991, \"Nonlinear principal component analysis using autoassociative neural networks\"
  • Kingma, Welling, 2013, \"Auto-Encoding Variational Bayes\"
"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"index.html","title":"Home","text":"

Welcome to the documentation hub for the CMS Machine Learning Group! The goal of this page is to provide CMS analyzers with a centralized place to gather machine learning information relevant to their work. However, we are not seeking to rewrite external documentation. Whenever applicable, we will link to external documentation, such as the iML group's HEP Living Review or their ML Resources repository. What you will find here are pages covering:

  • ML best practices
  • How to optimize a NN
  • Common pitfalls for CMS analyzers
  • Direct and indirect inferencing using a variety of ML packages
  • How to get a model integrated into CMSSW

And much more!

If you think we are missing some important information, please contact the ML Knowledge Subgroup!

"},{"location":"general_advice/intro.html","title":"Introduction","text":"

In general, ML models don't really work out of the box. For example, most often it is not sufficient to simply instantiate the model class, call its fit() method followed by predict(), and then proceed straight to the inference step of the analysis.

from sklearn.datasets import make_circles\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.svm import SVC\n\nX, y = make_circles(noise=0.2, factor=0.5, random_state=1)\nX_train, X_test, y_train, y_test = \\\n        train_test_split(X, y, test_size=.4, random_state=42)\n\nclf = SVC(kernel=\"linear\", C=0.025)\nclf.fit(X_train, y_train)\nprint(f'Accuracy: {clf.score(X_test, y_test)}')\n# Accuracy: 0.4\n

This is an extremely simplified and naive example, and one would be lucky to have the code above produce a valid and optimal model. This is because it explicitly doesn't check for those things which could've gone wrong and is therefore prone to producing undesirable results. Indeed, there are several pitfalls which one may encounter on the way towards implementing ML in an analysis pipeline. These can be easily avoided by being aware of them and performing a few simple checks here and there.

Therefore, this section is intended to review potential issues on the ML side and how they can be approached in order to train a robust and optimal model. The section is designed to be, to a large extent, analysis-agnostic. It will focus on common, generalized validation steps from the ML perspective, without placing particular emphasis on the physical context. However, for illustrative purposes, it will be supplemented with some examples from HEP and additional links for further reading. As a last remark, in the following the emphasis will mostly be on the validation items specific to supervised learning. This includes classification and regression problems, which are so far the most common use cases amongst HEP analysts.

The General Advice chapter is divided into 3 sections. Things become logically aligned if presented from the perspective of the training procedure (fitting/loss minimisation part). That is, the sections group validation items according to when they need to be investigated:

  • Before training
  • During training
  • After training

Authors: Oleg Filatov

"},{"location":"general_advice/after/after.html","title":"After training","text":"

After the necessary steps to design the ML experiment have been taken, and the training has been performed and verified to be stable and consistent, there are still a few things to be checked to further solidify the confidence in the model performance.

"},{"location":"general_advice/after/after.html#final-evaluation","title":"Final evaluation","text":"

Before the training, the initial data set is to be split into train and test parts, where the former is used to train the model (possibly with cross-validation), while the latter remains blinded. Once all the optimisations to the model architecture have been made and the model is \"frozen\", one proceeds to the evaluation of the metrics' values on the test set. This is the very last check of the model for overfitting. If there is none, one expects to see little or no difference compared to the values on the (cross-)validation set used throughout the training. In turn, any discrepancies could point to overfitting in the training stage (or possibly to data leakage), which requires further investigation.

The next step is to check the output score of the model (probability1) for each class. This can be done, for example, in the form of a TMVA-like overtraining check (see Figure 1), which compares the score distributions obtained on the train and test data sets:

Figure 1. Comparison of model output for signal and background classes overlaid for train and test data sets. [source: root-forum.cern.ch]
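
The visual comparison of train and test score distributions can be complemented by a number. The sketch below computes a two-sample Kolmogorov-Smirnov statistic, i.e. the largest distance between the two empirical cumulative distributions. The beta-distributed toy scores and the helper name ks_statistic are assumptions for illustration only.

```python
import numpy as np

def ks_statistic(a, b):
    # Largest gap between the empirical CDFs of samples a and b.
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side='right') / len(a)
    cdf_b = np.searchsorted(b, grid, side='right') / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
# Hypothetical signal-class scores in [0, 1].
test_scores = rng.beta(5, 2, size=5000)
train_scores = rng.beta(5, 2, size=5000)              # same shape as test: no overtraining
overtrained_train_scores = rng.beta(8, 2, size=5000)  # train scores pushed towards 1

ks_ok = ks_statistic(train_scores, test_scores)
ks_bad = ks_statistic(overtrained_train_scores, test_scores)
print(f'KS statistic, consistent case: {ks_ok:.3f}; overtrained case: {ks_bad:.3f}')
```

A small statistic indicates compatible train and test distributions, while a large one is a hint of overtraining. The comparison should be made separately for each class.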

In general, what is important to check is that in the category for class C (defined as argmax(score_i)), the score for class C peaks at values close to 1, whereas the scores of the other classes do not have this property: they peak to the left of 1 and fall smoothly to zero as the model score in the category approaches 1. In other words, the distributions of the model score for the various classes should overlap as little as possible and be as far apart as possible. This would be an indication that the model indeed distinguishes between the classes.

Another thing to look at is the data/simulation agreement for the class categories. Since it is the output of the model for each category which is used in the subsequent statistical inference step, it is important to verify that the data/simulation agreement of the input features is properly propagated through the model into the categories' distributions. This can be achieved by producing a plot similar to the one shown in Figure 2: the stacked templates for background processes are fitted and compared with the actual predictions for the data for the set of events classified to be in the given category (jet-fakes in the example). If the output data/simulation agreement is worse than the input one, it might point to an existing bias of the model in the way it treats data and simulation events.

Figure 2. Postfit jet-fake NN score for the mutau channel. Note that the distribution for jet-fakes class is dominant in this category and also peaks at value 1 (mind the log scale), which is an indication of good identification of this background process by the model. Furthermore, ratio of data and MC templates is equal to 1 within uncertainties. [source: CMS-PAS-HIG-20-006]"},{"location":"general_advice/after/after.html#robustness","title":"Robustness","text":"

Once there is high confidence that the model isn't overtrained and no distortion in the input feature data/MC agreement is introduced, one can consider studying the robustness of the model to the parameter/input variations. Effectively, the model can be considered as a \"point estimate\", and any variations are helpful to understand the variance of the model outputs - hence, the model's robustness to changes.

A simple example would be a hyperparameter optimisation, where various model parameters are varied to find the best ones in terms of performance. Moreover, in HEP there is a helpful (for this particular case) notion of systematic uncertainties, which is a perfect tool to study model robustness to input data variations.

Since in any case they need to be incorporated into the final statistical fit (to be performed on some interpretation of the model score), these uncertainties need to be \"propagated\" through the model. A sizeable fraction of those uncertainties are so-called \"up/down\" (or shape) variations, and therefore it is a good opportunity to study how the model output responds to those up/down input feature changes. If a high sensitivity is observed, one needs to consider removing the most influential feature from the training, or trying decorrelation techniques to decrease the impact of the systematic-affected feature on the model output.

"},{"location":"general_advice/after/after.html#systematic-biases","title":"Systematic biases","text":"

Lastly, possible systematic biases arising from the ML approach should be estimated. Since this is a broad and not fully formalised topic, a few examples will be given below to outline their possible sources.

  • The first one could be a domain shift, that is the situation where the model is trained on one data domain, but is applied to a different one (e.g. trained on simulated data, applied to real data). In order to account for that, corresponding scale factor corrections are traditionally derived, and those will come with some uncertainty as well.
  • Another example would be the case of undertraining. Consider the case of fitting data which follow a complex polynomial with a simple linear function. In that case, the model has high bias (and low variance), which results in a systematic shift of its prediction that needs to be taken into account.
  • Care needs to be taken in cases where a cut is applied on the model output. Cuts might potentially introduce shifts, and in the case of the model score, which is a variable with a complex and non-linear relationship with the input features, this might create undesirable biases. For example, when cutting on the output score and looking at the invariant mass distribution (e.g. of two jets), one can observe an effect which is known as mass sculpting (see Figure 3). In that case, the background distribution peaks at the mass of the resonance used as a signal in the classification task. After applying such a cut, the signal and background shapes overlap and become very similar, which dilutes the discrimination power between the two hypotheses if the invariant mass were to be used as the observable to be fitted.
Figure 3. Left: Distributions of signal and background events without selection. Right: Background distributions at 50% signal efficiency (true positive rate) for different classifiers. The unconstrained classifier sculpts a peak at the W-boson mass, while other classifiers do not. [source: arXiv:2010.09745]
  1. Here it is assumed that it can be treated as probability to be assigned to a given class. This is mostly the case if there is a sigmoid/softmax used on the output layer of the neural network and the model is trained with a cross-entropy loss function.\u00a0\u21a9

"},{"location":"general_advice/before/domains.html","title":"Domains","text":"

Data plays a crucial role in the process of training any ML model. It is something from which the model learns to solve a given task and therefore care needs to be taken with its handling. There are two main considerations when collecting and preparing data for an ML task:

  1. The data set should be relevant to the problem and should represent the underlying structure of the problem without containing potential biases and irrelevant deviations (e.g. MC simulation artefacts).
  2. A proper preprocessing of the data set should be performed so that the training step goes smoothly.

In this section a general domain perspective on data will be covered. In the following sections a more granular look will be taken from the side of features and construction of inputs to the model.

"},{"location":"general_advice/before/domains.html#coverage","title":"Coverage","text":"

To begin with, one needs to bear in mind that the training data should be as close as possible to the data one expects to have in the context of the analysis. Speaking in more formal terms,

Domains of training (used to train the model) and inference (used to make final predictions) data sets should not sizeably diverge.

Examples
  • In most cases the model is trained on MC simulated data and later applied to data to produce predictions, which are then passed on to the statistical inference step. MC simulation isn't perfect and therefore there are always differences between the simulation and data domains. This can lead to cases where the model learns simulation artefacts which come e.g. from detector response mismodelling. Thus, its performance on data may be at best suboptimal and at worst meaningless.
  • Consider a model which is trained to predict the energy of a hadron given its energy deposits in the calorimeter (represented e.g. in the form of an image or a graph). The training data consist of showers initiated by particles generated by a particle gun with discrete energy values (e.g. 1 GeV, 10 GeV, 20 GeV, etc.). However, in a real-world setting, the model will be applied to showers produced by particles with an underlying continuous energy spectrum. Although ML models are known for their ability to interpolate between training points, without appropriate tests the model performance in the parts of the energy spectrum not seen during training is not a priori clear.
"},{"location":"general_advice/before/domains.html#solution","title":"Solution","text":"

It is not easy to build a model entirely robust to domain shift, and there is no general framework yet to approach and correct for discrepancies between training and inference domains altogether. However, there is research ongoing in this direction and several methods to correct for specific deviations have already been proposed.

It is a widely known practice to introduce scale factor (SF) corrections to account for possible discrepancies between data and MC simulation. Effectively, that means that the model is probed on some part of the domain on which it wasn't trained (data) and then corrected for any differences by using a meaningful set of observables to derive SFs. One particularly promising approach to remedy the data/MC domain difference is to use adversarial methods to fully leverage the multidimensionality of the problem, as described in a DeepSF note.

Another solution would be to incorporate methods of domain adaptation into an ML pipeline, which essentially guide the model to be invariant and robust towards domain shift. Particularly in HEP, the Learning to Pivot with Adversarial Networks paper was one of the first to investigate how a pile-up dependency can be mitigated, an approach which can also be easily extended to building a model robust to domain shift1.

Last but not least, the use of Bayesian neural networks has the great advantage of providing an uncertainty estimate along with each prediction. If these uncertainties are significantly larger for some samples, this could indicate that they come from a domain beyond the training one (so-called out-of-distribution samples). This post hoc analysis of prediction uncertainties can, for example, point to inconsistencies in or incompleteness of the MC simulation or of the data-driven methods of background estimation.

"},{"location":"general_advice/before/domains.html#population","title":"Population","text":"

Furthermore, analyses nowadays search for very rare processes and are therefore interested in sparsely populated regions of the phase space. Even though the domain of interest may be covered in the training data set, it may not be sufficiently covered in terms of the number of samples which populate those regions. That makes the model behaviour on an event which falls into those regions unpredictable, because it couldn't learn how to generalise in those areas due to a lack of data to learn from. Therefore,

It is important to make sure that the phase space of interest is well-represented in the training data set.

Example

This is what is often called in HEP jargon \"little statistics in the tails\", meaning that too few events can be found in the tails of the corresponding distribution, e.g. in the high-pt region. This might be important because the topology of events changes when one enters high-pt areas of the phase space (aka the boosted regime). This further means that the model should be able to capture this change in the event signature. However, it might fail to do so because there is little data available to learn from compared to the low-pt region.

"},{"location":"general_advice/before/domains.html#solution_1","title":"Solution","text":"

Clearly, a way out in that case would be to provide enough training data to cover those regions (also ensuring that the model has enough capacity to embrace diverse and complex topologies).

Another solution would be to communicate to the model the importance of specific topologies, which can be done for example by upweighting those events' contribution to the loss function.
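
As a minimal sketch of such an upweighting, the example below computes a weighted binary cross-entropy in which events above a pt threshold receive a larger weight. The threshold of 200, the weight factor of 5 and the toy predictions are arbitrary assumptions for illustration.

```python
import numpy as np

def weighted_bce(y_true, y_pred, weights, eps=1e-12):
    # Per-event binary cross-entropy, averaged with per-event weights.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    per_event = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.sum(weights * per_event) / np.sum(weights))

pt = np.array([30.0, 45.0, 60.0, 400.0])       # one event in the high-pt tail
y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.1, 0.8, 0.3])        # the high-pt event is poorly predicted

uniform_weights = np.ones_like(pt)
tail_weights = np.where(pt > 200.0, 5.0, 1.0)  # hypothetical threshold and factor

loss_uniform = weighted_bce(y_true, y_pred, uniform_weights)
loss_upweighted = weighted_bce(y_true, y_pred, tail_weights)
print(f'uniform: {loss_uniform:.3f}, tail upweighted: {loss_upweighted:.3f}')
```

With the upweighting, the poorly predicted high-pt event dominates the loss, so the optimiser is pushed to improve the model in that region. Most ML libraries accept such per-event weights directly in their training interface.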

Lastly, it might be worth trying to train several models, each targeting its specific region, instead of a general-purpose one (e.g. low-pt & boosted/merged topology tagger). Effectively, the factorisation of various regions removes the problem of their separation from a single model and delegates it to an ensemble of dedicated models, each targeting its specific region.

  1. From that paper on, the HEP community started to explore a similar topic of model decorrelation, i.e. how to build a model which would be invariant to a particular variable or property of data. For a more detailed overview please refer to Section 2 of this paper.\u00a0\u21a9

"},{"location":"general_advice/before/features.html","title":"Features","text":"

In the previous section, the data was considered from a general \"domain\" perspective and in this section a more low level view will be outlined. In particular, an emphasis will be made on features (input variables) as they play a crucial role in the training of any ML model. Essentially being the handle on and the gateway into data for the model, they are expected to reflect the data from the perspective which is important to the problem at hand and therefore define the model performance on the task.

The topic of feature engineering is too extensive and complex to be covered in this section, so the emphasis will be primarily on the general aspects relevant to the HEP context. Broadly speaking, one should ask oneself the following questions during the data preparation:

  • Are features understood?
  • Are features correctly modelled?
  • Are features appropriately processed?
"},{"location":"general_advice/before/features.html#understanding","title":"Understanding","text":"

Clearly one should motivate for themselves (and then possibly for analysis reviewers) why this exact set of features and not the other one has been selected1. Aside from physical understanding and intuition it would be good if a priori expert knowledge is supplemented by running further experiments.

Here one can consider studies done either prior to the training or after it. As for the former, studying feature correlations (also with the target variable), e.g. by computing Pearson and/or Spearman correlation coefficients and plotting several histograms/scatter plots, could bring some helpful insights. As for the latter, exploring feature importances as assessed by the trained model can boost the understanding of both the data and the model.
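
The difference between the two correlation coefficients can be seen in a short sketch. The toy feature pt and its monotonic but non-linear relation to the target are assumptions for illustration. The Spearman coefficient is computed here as the Pearson coefficient of the ranks, which is valid when there are no ties.

```python
import numpy as np

def pearson(a, b):
    # Linear correlation coefficient.
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    # Rank correlation: Pearson coefficient of the ranks (assumes no ties).
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return pearson(rank_a, rank_b)

rng = np.random.default_rng(0)
pt = rng.exponential(scale=50.0, size=2000)    # hypothetical input feature
# Target depends on pt monotonically, but far from linearly.
target = np.exp(pt / 50.0) + rng.normal(scale=0.01, size=2000)

r_pearson = pearson(pt, target)
r_spearman = spearman(pt, target)
print(f'Pearson: {r_pearson:.2f}, Spearman: {r_spearman:.2f}')
```

A Spearman coefficient much larger than the Pearson one, as in this toy case, indicates a strong but non-linear monotonic dependence, which a scatter plot would then make visible.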

"},{"location":"general_advice/before/features.html#modelling","title":"Modelling","text":"

Although seemingly obvious, for the sake of completeness the point of achieving good data/MC agreement should be mentioned. It has always been a must to check this in a cut-based approach, and an ML-based one is no different: the principle \"garbage in, garbage out\" still holds.

Example

For example, a classical feed-forward neural network is just a continuous function mapping the input space to the output one, so any discrepancies in the input might propagate to the output. The same applies to boosted decision trees: any (domain) differences in the shape of the input (training) distribution w.r.t. the true \"data\" distribution might sizeably affect the construction of the decision boundary in the feature space.

Figure 1. Control plot for a visible mass of tau lepton pair in emu final state. [source: CMS-TAU-18-001]

Since features are the handle on the data, one option is to check for each input feature that the ratio of the data and MC histograms is close to 1 within uncertainties (aka by eye). For a more formal approach, one can perform goodness of fit (GoF) tests in 1D and 2D, as was done for example in the analysis of Higgs boson decaying into tau leptons.

If the modelling is shown to be insufficient, the corresponding feature should be either removed, or mismodelling needs to be investigated and resolved.

"},{"location":"general_advice/before/features.html#processing","title":"Processing","text":"

Feature preprocessing can also be understood from the broader perspective of data preprocessing, i.e. transformations which need to be performed on the data prior to training a model. Another way to look at it is as the step where raw data is converted into prepared data. That makes it an important part of any ML pipeline, since it helps the training to converge smoothly and stably.

Example

In fact, the training process might not even begin (presence of NaN values) or might break in the middle (an outlier causing the gradients to explode). Furthermore, data can be completely misunderstood by the model, which can potentially cause undesirable interpretation and performance (treatment of categorical variables as numerical).

Therefore, below is a non-exhaustive list of the most common items to be addressed during the preprocessing step to ensure good quality of the training. For a more comprehensive overview and code examples, please refer to the detailed documentation of the sklearn package, and also to its description of possible pitfalls which can arise at this point.

  • Feature encoding
  • NaN/inf/missing values2
  • Outliers & noisy data
  • Standardisation & transformations

Finally, these are the items which are worth considering in the preprocessing of data in general. However, one can also apply transformations at the level of batches as they are passed through the model. This will be briefly covered in the following section.
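Several of the listed items can be combined in one sklearn preprocessing object. The following is a minimal sketch on a toy table; the column names, the median imputation and the one-hot encoding are choices made for illustration only, and outlier treatment is not covered:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numerical features (one with a missing value)
# and one categorical feature.
df = pd.DataFrame({
    'pt': [25.0, 40.0, np.nan, 60.0, 35.0, 80.0],
    'n_jets': [2, 3, 1, 4, 2, 3],
    'channel': ['emu', 'mutau', 'emu', 'etau', 'mutau', 'emu'],
})

numerical = Pipeline([
    ('impute', SimpleImputer(strategy='median')),  # fill NaN with the median
    ('scale', StandardScaler()),                   # zero mean, unit variance
])

preprocess = ColumnTransformer(
    [
        ('num', numerical, ['pt', 'n_jets']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['channel']),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X = preprocess.fit_transform(df)
print(X.shape)  # two scaled columns plus one column per channel category
```

The transformation should be fitted on the training part only and then applied unchanged to the validation/test parts, otherwise information leaks between them.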

  1. Here it is already assumed that a proper data representation has been chosen, i.e. the way to vectorize the data to form a particular structure (e.g. image -> tensor, social network -> graph, text -> embeddings). Being on its own a whole big topic, it is left for a curious reader to dive into.\u00a0\u21a9

  2. Depending on the library and how particular model is implemented there, these values can be handled automatically under the hood.\u00a0\u21a9

"},{"location":"general_advice/before/inputs.html","title":"Inputs","text":"

After data is preprocessed as a whole, there is a question of how this data should be supplied to the model. On its way there it potentially needs to undergo a few splits which will be described below. Plus, a few additional comments about training weights and motivation for their choice will be outlined.

"},{"location":"general_advice/before/inputs.html#data-split","title":"Data split","text":"

The first thing one should consider doing is to split the entire data set into train/validation(/test) data sets. This is an important step because it allows overfitting to be diagnosed. The topic will be covered in more detail in the corresponding section; here only a brief introduction is given.

Figure 1. Decision boundaries for underfitted, optimal and overfitted models. [source: ibm.com/cloud/learn/overfitting]

The trained model is said to be overfitted (or overtrained) when it fails to generalise to solve a given problem.

One example would be a model which learns to predict the training data exactly, but once given new unseen data drawn from the same distribution fails to predict the target correctly (right plot in Figure 1). Obviously, this is undesirable behaviour since one wants the model to be \"universal\" and to provide robust and correct decisions regardless of the data subset sampled from the same population.

Hence the way to check the ability to generalise and to spot overfitting: test the trained model on a separate data set, which is the same1 as the training one. If the model performance gets significantly worse there, it is a sign that something went wrong and the model's predictive power doesn't generalise to the same population.

Figure 2. Data split workflow before the training. Cross-validation is also shown as the technique to find optimal hyperparameters. [source: scikit-learn.org/stable/modules/cross_validation.html]

Clearly, the simplest way to obtain this data set is to put aside a part of the original one and leave it untouched until the final model is trained - this is what is called the \"test\" data set in the first paragraph of this subsection. When the model has been finalised and optimised, this data set is \"unblinded\" and the model performance on it is evaluated. Practically, this split can be easily performed with the train_test_split() function of the sklearn library.
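A minimal sketch of such a split with train_test_split(), using toy arrays in place of a real data set; the fractions and the optional second split into train/validation parts are illustrative choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=2)
X = rng.normal(size=(1000, 4))        # hypothetical feature matrix
y = rng.integers(0, 2, size=1000)     # hypothetical binary target

# Set aside 20% as the blinded test set; 'stratify' keeps the class
# fractions the same in both parts.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Optionally split the remainder once more to obtain a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Fixing random_state makes the split reproducible, which matters when results of different trainings are to be compared on the same test set.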

But it might be not that simple

Indeed, there are a few things to be aware of. Firstly, there is the question of how much data needs to be set aside for testing. It is common to take the test fraction in the range [0.1, 0.4], however it is mostly up to the analyzers to decide. The important trade-off to be taken into account here is the one between the robustness of the test metric estimate (too small a test data set - poorly estimated metric) and the robustness of the trained model (too little training data - less performant model).

Secondly, note that the split should be done in a way that each subset is as close as possible to the one which the model will face at the final inference stage. But since usually it isn't feasible to bridge the gap between domains, the split at least should be uniform between training/testing to be able to judge fairly the model performance.

Thirdly, in an extreme case there might not be a sufficient amount of data to perform the training, let alone to set aside a part of it for validation. Here a way out would be to go for few-shot learning, to use cross-validation during the training, to regularise the model to avoid overfitting, or to try to find/generate more (possibly similar) data.

Lastly, one can also consider putting aside yet another fraction of the original data set, which was called the \"validation\" data set above. This can be used to monitor the model during the training and more details on that will follow in the overfitting section.

"},{"location":"general_advice/before/inputs.html#batches","title":"Batches","text":"

Usually the training/validation/testing data set can't entirely fit into memory due to its large size. That is why it gets split into batches (chunks) of a given size which are then fed one by one into the model during the training/testing.

While forming the batches it is important to keep in mind that batches should be sampled uniformly (i.e. from the same underlying PDF as of the original data set).

That means that each batch is populated similarly to the others according to features which are important to the given task (e.g. particles' pt/eta, number of jets, etc.). This is needed to ensure that gradients computed for each batch aren't different from each other and therefore the gradient descent doesn't encounter any sizeable stochasticities during the optimisation step.2
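One simple way to obtain uniformly populated batches is to shuffle the entries before slicing them into batches. The sketch below assumes, for brevity, toy arrays which do fit into memory; for larger data sets the same idea can be applied to indices or file chunks. The helper name iterate_batches is hypothetical:

```python
import numpy as np

def iterate_batches(X, y, batch_size, rng):
    # Yield mini-batches after a random permutation of the data set.
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(seed=3)
# Hypothetical data set stored sorted by class, as can happen when
# samples of different processes are simply concatenated.
X = np.concatenate([rng.normal(0.0, 1.0, size=(5000, 3)),
                    rng.normal(1.0, 1.0, size=(5000, 3))])
y = np.concatenate([np.zeros(5000), np.ones(5000)])

fractions = [yb.mean() for _, yb in iterate_batches(X, y, batch_size=500, rng=rng)]
print(np.round(fractions, 2))  # signal fraction per batch, each close to 0.5
```

Without the permutation the first ten batches would contain only one class and the last ten only the other, which is exactly the non-uniformity to be avoided.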

Lastly, it was already mentioned that one should preprocess the data set prior to training. However, this step can be substituted and/or complemented by adding a layer to the architecture which performs a specified part of the preprocessing on every batch as it goes through the model. One of the most prominent examples is the addition of batch/group normalization coupled with weight standardization layers, which turned out to sizeably boost the performance on a large variety of benchmarks.

"},{"location":"general_advice/before/inputs.html#training-weights","title":"Training weights","text":"

Next, one can zoom into the batch and consider the level of single entries there (e.g. events). This is where the training weights come into play. Since the value of a loss function for a given batch is represented as a sum over all the entries in the batch, this sum can be naturally turned into a weighted sum. For example, in case of a cross-entropy loss with y_pred, y_true, w being vectors of predicted labels, true labels and weights respectively:

import numpy as np\n\ndef CrossEntropy(y_pred, y_true, w):  # assuming y_true takes values in {0, 1}; returns the per-entry weighted terms\n    return -w * (y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))\n

It is important to disentangle here two factors which define the weight to be applied on a per-event basis because of the different motivations behind them:

  • accounting for imbalance in training data
  • accounting for imbalance in nature
"},{"location":"general_advice/before/inputs.html#imbalance-in-training-data","title":"Imbalance in training data","text":"

The first point is related to the fact that in the case of classification we may have significantly more (>O(1) times) training data for one class than for the other. Since the training data usually comes from MC simulation, this corresponds to the case where more events are generated for one physical process than for another. Therefore, here we want to make sure that the model is presented equally with instances of each class - this may have a significant impact on the model performance depending on the loss/metric choice.

Example

Consider the case where there are 1M events of target = 0 and 100 events of target = 1 in the training data set and a model is fitted by minimising cross-entropy to distinguish between those classes. In that case the resulting model can easily turn out to be a constant function predicting the majority target = 0, simply because this would be the optimal solution in terms of the loss function minimisation. If accuracy is used as the validation metric, this will result in a value close to 1 on the training data, even though the model has learnt nothing about the minority class.

To account for this type of imbalance, the following weight simply needs to be introduced according to the target label of an object:

train_df['weight'] = 1\ntrain_df.loc[train_df.target == 0, 'weight'] /= np.sum(train_df.loc[train_df.target == 0, 'weight'])\ntrain_df.loc[train_df.target == 1, 'weight'] /= np.sum(train_df.loc[train_df.target == 1, 'weight'])\n

Alternatively, one can consider ways of balancing classes other than training weights. For a more detailed description of them, and also for a general problem statement, see the imbalanced-learn documentation.
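The effect of the per-class normalisation can be illustrated on a toy case. The sketch below uses a smaller, hypothetical data set than the example in the text and a constant majority-class model; it shows that the plain accuracy looks excellent while the weighted one reveals the problem:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical, strongly imbalanced training set.
train_df = pd.DataFrame({'target': np.concatenate([np.zeros(10000), np.ones(100)]).astype(int)})

# Per-class normalisation: the weights of each class sum up to 1.
train_df['weight'] = 1.0
for label in (0, 1):
    mask = train_df.target == label
    train_df.loc[mask, 'weight'] /= train_df.loc[mask, 'weight'].sum()

# A constant model which always predicts the majority class.
y_pred = np.zeros(len(train_df), dtype=int)

acc = accuracy_score(train_df.target, y_pred)
acc_weighted = accuracy_score(train_df.target, y_pred, sample_weight=train_df.weight)
print(acc, acc_weighted)
```

With balanced weights the constant model scores only 0.5, i.e. no better than guessing, whereas the unweighted accuracy is about 0.99.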

"},{"location":"general_advice/before/inputs.html#imbalance-in-nature","title":"Imbalance in nature","text":"

The second case corresponds to the fact that in the experiment we expect some classes to be more represented than others. For example, the signal process usually has a much smaller cross-section than the background ones and therefore we expect to end up with fewer events of the signal class. So the motivation for using weights in that case would be to augment the optimisation problem with additional knowledge of the expected contribution of the physical processes.

Practically, the notion of expected number of events is incorporated into the weights per physical process so that the following conditions hold3:

As a part of this reweighting, one would naturally need to perform the normalisation as of the previous point, however the difference between those two is something which is worth emphasising.
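One common way to encode expected yields in per-event weights is to scale each simulated process by its cross-section times the integrated luminosity divided by the number of generated events. The sketch below is only an illustration under that assumption: all numbers are hypothetical, selection efficiencies and generator weights are ignored, and it is not meant to reproduce the exact conditions used in any particular analysis:

```python
import numpy as np

luminosity = 100.0  # integrated luminosity in fb^-1 (hypothetical value)

# Hypothetical processes: cross-section in fb and number of generated MC events.
processes = {
    'signal':     {'xsec': 50.0,    'n_generated': 200000},
    'background': {'xsec': 50000.0, 'n_generated': 1000000},
}

weights = {}
for name, proc in processes.items():
    n_expected = proc['xsec'] * luminosity
    # Every generated event of the process carries the same weight, so that
    # the weights of the process sum up to its expected number of events.
    weights[name] = np.full(proc['n_generated'], n_expected / proc['n_generated'])

for name, w in weights.items():
    print(name, w.sum())
```

Note the contrast with the class-balancing weights: here the summed weights of the two classes differ by the ratio of expected yields instead of being equal.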

  1. That is, sampled independently and identically (i.i.d) from the same distribution.\u00a0\u21a9

  2. Although this is a somewhat intuitive statement which may or may not be impactful for a given task and depends on the training procedure itself, it is advisable to keep this aspect in mind while preparing batches for training.\u00a0\u21a9

  3. See also Chapter 2 of the HiggsML overview document \u21a9

"},{"location":"general_advice/before/metrics.html","title":"Metrics & Losses","text":""},{"location":"general_advice/before/metrics.html#metric","title":"Metric","text":"

A metric is a function which evaluates the model's performance given true labels and model predictions for a particular data set.

That makes it an important ingredient in the model training, being a measure of the model's quality. However, metrics as estimators can be sensitive to some effects (e.g. class imbalance) and provide biased or over/underoptimistic results. Additionally, they might not be relevant to the physical problem in mind and to the understanding of what a \"good\" model is1. This in turn can result in suboptimally tuned hyperparameters or, in general, in a suboptimally trained model.

Therefore, it is important to choose metrics wisely, so that they reflect the physical problem to be solved and additionally don't introduce any biases in the performance estimate. The whole topic of metrics is too broad to be covered in this section, so please refer to the corresponding documentation of sklearn, which provides an exhaustive list of available metrics with additional materials and can be used as a good starting point.

Examples of HEP-specific metrics

Speaking of metrics which were developed in the HEP field, the most prominent one is the approximate median significance (AMS), first introduced in Asymptotic formulae for likelihood-based tests of new physics and then adopted in the HiggsML challenge on Kaggle.

Essentially being an estimate of the expected signal sensitivity and hence being closely related to the final result of the analysis, it can be used not only as a metric but also as a loss function to be directly optimised in the training.
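For reference, the AMS can be evaluated with a few lines of code. The sketch below follows the form used in the HiggsML challenge, where s and b are the (weighted) numbers of selected signal and background events and b_reg is a regularisation term; the function name and the example numbers are illustrative:

```python
import numpy as np

def ams(s, b, b_reg=10.0):
    # Approximate median significance for s signal and b background events.
    # b_reg = 10 is the value used in the HiggsML challenge,
    # b_reg = 0 gives the unregularised formula.
    return np.sqrt(2.0 * ((s + b + b_reg) * np.log(1.0 + s / (b + b_reg)) - s))

print(ams(10.0, 100.0, b_reg=0.0))   # close to the s / sqrt(b) = 1 approximation
print(ams(10.0, 100.0))
```

For s much smaller than b the unregularised expression approaches s / sqrt(b), which is why the first printed value is close to 1.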

"},{"location":"general_advice/before/metrics.html#loss-function","title":"Loss function","text":"

In fact, metrics and loss functions are very similar to each other: both give an estimate of how well (or badly) the model performs and both are used to monitor the quality of the model. So the same comments as in the metrics section apply to loss functions too. However, the loss function plays a crucial role because it is additionally used in the training as the functional to be optimised. That makes its choice a handle to explicitly steer the training process towards a more optimal and relevant solution.

Example of things going wrong

It is known that the L2 loss (MSE) is sensitive to outliers in data, whereas the L1 loss (MAE) is robust to them. Therefore, if outliers were overlooked in the training data set and the model was fitted with an L2-type loss, this may result in a significant bias in its predictions. As an illustration, this toy example compares Huber vs Ridge regressors, where the former shows a more robust behaviour.
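The different sensitivity to outliers can be seen already for the simplest possible model, a constant prediction: the constant minimising the MSE is the mean of the targets, while the one minimising the MAE is their median. A small numerical sketch with made-up values:

```python
import numpy as np

# Hypothetical regression targets with a single outlier.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Scan constant predictions c and evaluate both losses.
candidates = np.linspace(0.0, 100.0, 10001)
mse = ((y[None, :] - candidates[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - candidates[:, None]).mean(axis=1)

best_mse = candidates[np.argmin(mse)]  # coincides with the mean of y
best_mae = candidates[np.argmin(mae)]  # coincides with the median of y
print(best_mse, np.mean(y))
print(best_mae, np.median(y))
```

The single outlier drags the MSE-optimal prediction far away from the bulk of the values, whereas the MAE-optimal one stays with the bulk.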

A simple example of steering the training through the loss was already mentioned in the domains section - namely, one can emphasise specific regions of the phase space by attributing a larger weight in the loss function to the events there. Intuitively, for the same fraction of mispredicted events in the training data set, the class with a larger attributed weight should bring more penalty to the loss function. This way the model should be able to learn to pay more attention to those \"upweighted\" events2.

Examples in HEP beyond classical MSE/MAE/cross entropy
  • b-jet energy regression, being a part of the nonresonant HH to bb gamma gamma analysis, uses Huber and two quantile loss terms for simultaneous prediction of point and dispersion estimators of the target distribution.
  • DeepTau, a CMS deployed model for tau identification, uses several focal loss terms to give higher weight to more misclassified cases.
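To illustrate how a focal loss term gives higher weight to misclassified cases, here is a sketch of its generic binary form, in which the cross-entropy term is multiplied by (1 - p)^gamma with p being the probability assigned to the true class. This is not the DeepTau implementation, whose full loss contains several terms; the function name and the default gamma are illustrative:

```python
import numpy as np

def focal_loss(y_pred, y_true, gamma=2.0):
    # Per-entry binary focal loss; gamma = 0 recovers the cross-entropy.
    # p_true is the probability which the model assigns to the true class.
    p_true = np.where(y_true == 1, y_pred, 1.0 - y_pred)
    return -((1.0 - p_true) ** gamma) * np.log(p_true)

y_true = np.array([1, 1])
y_pred = np.array([0.9, 0.1])  # first entry well classified, second misclassified

ce = focal_loss(y_pred, y_true, gamma=0.0)  # plain cross-entropy terms
fl = focal_loss(y_pred, y_true, gamma=2.0)  # the well-classified entry is suppressed
print(ce)
print(fl)
```

In this example the well-classified entry is suppressed by a factor of 100 relative to its cross-entropy term, while the misclassified one keeps most of its contribution.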

However, one can go further than that and consider the training procedure from a larger, statistical inference perspective. From there, one can try to construct a loss function which would directly optimise the end goal of the analysis. INFERNO is an example of such an approach, with the loss function being the expected uncertainty on the parameter of interest. Moreover, one can also try to make the model aware of the nuisance parameters which affect the analysis by incorporating them into the training procedure; please see this review for a comprehensive overview of the corresponding methods.

  1. For example, that corresponds to asking oneself a question: \"what is more suitable for the purpose of the analysis: F1-score, accuracy, recall or ROC AUC?\"\u00a0\u21a9

  2. However, these are expectations one may have in theory. In practice, the optimisation procedure depends on many variables and can go in different ways. Therefore, the weighting scheme should be studied by running experiments on a case-by-case basis.\u00a0\u21a9

"},{"location":"general_advice/before/model.html","title":"Model","text":"

There is an enormous variety of ML models available on the market, which makes the choice of a suitable one for the problem at hand not entirely straightforward. Since ML is still to a large extent an experimental field, the general advice here would be to try various models and pick the one giving the best physical result.

However, there are in any case several common remarks to be pointed out, all glued together with a simple underlying idea:

Start off from a simple baseline, then gradually increase the complexity to improve upon it.

  1. In the first place, one needs to carefully consider whether there is a need for training an ML model at all. There might be problems where this approach would be a (time-consuming) overkill and simple conventional statistical methods would deliver results faster and even better.

  2. If ML methods are expected to bring improvement, then it makes sense to try out simple models first. Assuming a proper set of high-level features has been selected, ensemble of trees (random forest/boosted decision tree) or simple feedforward neural networks might be a good choice here. If time and resources permit, it might be beneficial to compare the results of these trainings to a no-ML approach (e.g. cut-based) to get the feeling of how much the gain in performance is. In most of the use cases, those models will be already sufficient to solve a given classification/regression problem in case of dealing with high-level variables.

  3. If it feels like there is still room for improvement, try hyperparameter tuning first to see if it is possible to squeeze more performance out of the current model and data. It can easily be that the model is sensitive to the hyperparameter choice and has a sizeable variance in performance across the hyperparameter space.

  4. If the hyperparameter space has been thoroughly explored and optimal point has been found, one can additionally try to play around with the data, for example, by augmenting the current data set with more samples. Since in general the model performance profits from having more training data, augmentation might also boost the overall performance.

  5. Lastly, more advanced architectures can be probed. At this point the choice of data representation plays a crucial role since more complex architectures are designed to capture more sophisticated patterns in data. While in ML research is still ongoing to unify together all the complexity of such models (and promisingly, also using an effective field theory approach), in HEP there's an ongoing process of probing various architectures to see which type fits the HEP field best.

Models in HEP

One of the most prominent benchmarks so far is the one done by G. Kasieczka et al. on the top tagging data set, where in particular ParticleNet turned out to be the state of the art. This has been yet another solid argument in favour of using graph neural networks in HEP due to their natural suitability in terms of data representation.

Illustration from G. Kasieczka et al. showing ROC curves for all evaluated algorithms.

"},{"location":"general_advice/during/opt.html","title":"Optimisation problems","text":"Figure 1. The loss surfaces of ResNet-56 with/without skip connections. [source: \"Visualizing the Loss Landscape of Neural Nets\" paper]

However, it might be that for a given task overfitting is of no concern, but there are still instabilities in loss function convergence happening during the training1. The loss landscape is a complex object having multiple local minima and which is moreover not at all understood due to the high dimensionality of the problem. That makes the gradient descent procedure of finding a minimum not that simple. However, if instabilities are observed, there are a few common things which could explain that:

  • The main candidate for a problem might be the learning rate (LR). It is an important hyperparameter which steers the optimisation: setting it too high may cause extremely stochastic behaviour, which will likely cause the optimisation to get stuck in some random minimum far from the optimum. Conversely, setting it too low may cause the convergence to take a very long time. Even a value between those extremes can still be problematic due to the chance of getting stuck in a local minimum on the way towards a better one. That is why several approaches to LR scheduling (e.g. cosine annealing) and also adaptive LR (e.g. Adam being the most prominent one) have been developed to allow more flexibility during the training, as opposed to keeping the LR fixed from the very beginning of the training until its end.

  • Another possibility is that there are NaN/inf values or non-uniformities/outliers appearing in the input batches. These can cause the gradient updates to go beyond the normal scale and therefore dramatically affect the stability of the loss optimisation. This can be avoided by careful data preprocessing and batch formation.

  • Last but not least, there is a chance that gradients will explode or vanish during the training, which will reveal itself as a rapid increase/stagnation in the loss function values. This is largely a feature of deep architectures, where during backpropagation gradients are accumulated from one layer to another, and therefore any minor deviations in scale can be exponentially amplified/diminished as they get multiplied. Since it is the scale of the trainable weights themselves which defines the weight gradients, a proper weight initialisation can foster smooth and consistent gradient updates. Also, batch normalisation together with weight standardisation has been shown to be a powerful technique to consistently improve performance across various domains. Finally, the choice of activation function is particularly important since it directly contributes to the gradient computation. For example, a sigmoid function is known to cause gradients to vanish because its gradient approaches 0 at input values of large magnitude. Therefore, it is often suggested to stick to the classical ReLU or to try other alternatives to see if they bring an improvement in performance.
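The statement about the sigmoid can be checked numerically. The sketch below evaluates the derivative of the sigmoid at a few inputs and compares it with the ReLU derivative; the estimate for a chain of layers ignores the weight matrices and is meant only to convey the order of magnitude:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([0.0, 2.0, 5.0, 10.0])
print(sigmoid_grad(x))            # largest value is 0.25 at x = 0, then it decays quickly
print(np.where(x > 0, 1.0, 0.0))  # ReLU gradient: 1 for every positive input

# Backpropagation multiplies such factors layer by layer; even in the best
# case a chain of 10 sigmoids scales the gradient by at most 0.25**10.
print(0.25 ** 10)
```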

  1. Sometimes particularly peculiar.\u00a0\u21a9

"},{"location":"general_advice/during/overfitting.html","title":"Overfitting","text":"

Given that the training experiment has been set up correctly (with some of the most common problems described in the before training section), a few things can still go wrong during the training process itself. Broadly speaking, they fall into two categories: overfitting-related and optimisation-related. Both can be spotted by closely monitoring the training procedure, as described in the following.

"},{"location":"general_advice/during/overfitting.html#overfitting","title":"Overfitting","text":"

The concept of overfitting (also called overtraining) was previously introduced in the inputs section and here we will elaborate a bit more on it. In essence, overfitting, being the situation where the model fails to generalise to a given problem, can have several underlying explanations:

The first one would be the case where the model complexity is way too large for a problem and a data set being considered.

Example

A simple example would be fitting some linearly distributed data with a polynomial function of a large degree. Or, in general, any case where the number of trainable parameters is significantly larger than the size of the training data set.

This can be solved prior to training by applying regularisation to the model, which in its essence means constraining its capacity to learn the data representation. This is somewhat related to the concept of Ockham's razor: namely, the less complex an ML model is, the more likely it is that a good empirical result is not just due to the peculiarities of the data sample. As for the practical side of regularisation, please have a look at this webpage for a detailed overview and implementation examples.

Furthermore, a recipe for training neural networks by A. Karpathy is a highly-recommended guideline not only on regularisation, but on training ML models in general.

The second case is a more general idea that any reasonable model at some point starts to overfit.

Example

Here one can look at overfitting as the point where the model considers noise to be as relevant as the underlying pattern and starts to \"focus\" on it way too much. Since data almost always contains noise, this makes it in principle highly probable to reach overfitting at some point.

Both of the cases outlined above can be spotted simply by tracking the evolution of the loss/metrics on the validation data set. This means that, in addition to the train/test split done prior to training (as described in the inputs section), one also needs to set aside some fraction of the training data to perform validation throughout the training. When plotting the values of the loss function/metric on both the train and validation sets as the training proceeds, overfitting manifests itself as an increase in the value of the error metric on the validation set while it still continues to decrease on the training set:

Figure 1. Error metric as a function of the number of iterations for train and validation sets. The vertical dashed line represents the separation between the region of underfitting (the model hasn't captured the data complexity well enough to solve the problem) and overfitting (the model no longer generalises to unseen data). The point between these two regions is the optimal moment when the training should stop. [source: ibm.com/cloud/learn/overfitting]

Essentially, it means that from that turning point onwards the model is trying to learn better and better the noise in training data at the expense of generalisation power. Therefore, it doesn't make sense to train the model from that point on and the training should be stopped.

To automate the process of finding this \"sweet spot\", many ML libraries include early stopping as one of the parameters of their fit() function. If early stopping is set to, for example, 10 iterations, the training will automatically stop once the validation metric has not improved for the last 10 iterations.
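The logic behind early stopping is independent of any particular library and can be sketched in a few lines. The function name, the patience parameter and the loss values below are illustrative:

```python
def early_stopping_epoch(val_losses, patience):
    # Return (stop_epoch, best_epoch) for a sequence of validation losses.
    # Training stops once the loss has not improved for `patience` epochs in a row.
    best_loss = float('inf')
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation loss curve: improves up to epoch 4, then rises.
val_losses = [1.00, 0.80, 0.65, 0.60, 0.58, 0.59, 0.61, 0.64, 0.70, 0.75]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

In practice one usually also keeps a copy of the model parameters from the best epoch, so that the model from the turning point rather than from the stopping point is used afterwards.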

"},{"location":"general_advice/during/xvalidation.html","title":"Cross-validation","text":"

However, in practice what one often deals with is a hyperparameter optimisation - running of several trainings to find the optimal hyperparameter for a given family of models (e.g. BDT or feed-forward NN).

The number of trials in the hyperparameter space can easily reach hundreds or thousands, and in that case the naive approach of training the model for each hyperparameter set on the same train data set and evaluating its performance on the same test data set is very likely to lead to overfitting. In that case, the experimentalist overfits to the test data set by choosing the best value of the metric and effectively adapting the model to suit the test data set best, therefore losing the model's ability to generalise.

In order to prevent that, a cross-validation (CV) technique is often used:

Figure 1. Illustration of the data set split for cross-validation. [source: scikit-learn.org/stable/modules/cross_validation.html]

The idea behind it is that instead of a single split of the data into train/validation sets, the training data set is split into N folds. Then, the model with the same fixed hyperparameter set is trained N times in a way that at the i-th iteration the i-th fold is left out of the training and used only for validation, while the other N-1 folds are used for the training.

In this fashion, after the training of N models there are N values of the metric, one computed on each fold. These values can now be averaged to give a more robust estimate of the model performance for a given hyperparameter set. A variance can also be computed to estimate the range of metric values. After having completed the N-fold CV training, the same approach is repeated for other hyperparameter values and the best set is picked based on the best fold-averaged metric value.
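The procedure can be sketched with sklearn as follows. The toy data set, the choice of a gradient boosting classifier and the two scanned values of max_depth are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical toy classification problem.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for max_depth in (1, 3):  # two hyperparameter values to compare
    model = GradientBoostingClassifier(max_depth=max_depth, n_estimators=50, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
    results[max_depth] = (scores.mean(), scores.std())
    print(max_depth, scores.mean(), scores.std())

# Pick the hyperparameter value with the best fold-averaged metric.
best_depth = max(results, key=lambda depth: results[depth][0])
```

Using the same KFold object for every hyperparameter value ensures that all candidates are compared on identical folds.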

Further insights

Effectively, with the CV approach the whole training data set plays the role of a validation one, which makes overfitting to a single chunk of it (as in the naive train/val split) less likely to happen. In addition, more training data is used to train a single model compared to a single fixed train/val split, which moreover makes the model less dependent on the choice of the split.

Alternatively, one can think of this procedure as building a model ensemble, which is inherently an approach more robust to overfitting and in general performs better than a single model.

"},{"location":"inference/checklist.html","title":"Integration checklist","text":"

Todo.

"},{"location":"inference/conifer.html","title":"Direct inference with conifer","text":""},{"location":"inference/conifer.html#introduction","title":"Introduction","text":"

conifer is a Python package developed by the Fast Machine Learning Lab for the deployment of Boosted Decision Trees in FPGAs for Level 1 Trigger applications. Documentation, examples, and tutorials are available from the conifer website, GitHub, and the hls4ml tutorial respectively. conifer is on the Python Package Index and can be installed with pip install conifer. Targeting FPGAs requires Xilinx's Vivado/Vitis suite of software. Here's a brief summary of features:

  • conversion from common BDT training frameworks: scikit-learn, XGBoost, Tensorflow Decision Forests (TF DF), TMVA, and ONNX
  • conversion to FPGA firmware with backends: HLS (C++ for FPGA), VHDL, C++ (for CPU)
  • utilities for bit- and cycle-accurate firmware simulation, and interface to FPGA synthesis tools for evaluation and deployment from Python
"},{"location":"inference/conifer.html#emulation-in-cmssw","title":"Emulation in CMSSW","text":"

All L1T algorithms require bit-exact emulation for performance studies and validation of the hardware system. For conifer this is provided with a single header file at L1Trigger/Phase2L1ParticleFlow/interface/conifer.h. The user must also provide the BDT JSON file exported from the conifer Python tool for their model. JSON loading in CMSSW uses the nlohmann/json external.

Both the conifer FPGA firmware and C++ emulation use Xilinx's arbitrary precision types for fixed-point arithmetic (hls external of CMSSW). This is cheaper and faster in the FPGA fabric than floating-point types. An important part of the model preparation process is choosing the proper fixed-point data types to avoid loss of performance compared to the trained model. Input preprocessing, in particular scaling, can help constrain the input variables to a smaller numerical range, but may also have a hardware cost to implement. In C++ the arbitrary precision types are specified like: ap_fixed<width, integer, rounding mode, saturation mode>.
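To get a feeling for what a given choice of width and integer bits implies, the representable range and resolution of a signed fixed-point type can be worked out by hand. The Python sketch below does this and emulates the quantisation in a simplified way (truncation plus saturation); it is not a bit-exact emulation of the Xilinx types, and the helper names are hypothetical:

```python
import math

def fixed_point_properties(width, integer):
    # Range and resolution of a signed fixed-point type with `width` total bits,
    # of which `integer` bits (including the sign) are before the binary point.
    step = 2.0 ** -(width - integer)
    lowest = -(2.0 ** (integer - 1))
    highest = 2.0 ** (integer - 1) - step
    return lowest, highest, step

def quantise(x, width, integer):
    # Simplified emulation: truncate towards minus infinity and saturate at the
    # range limits (a real ap_fixed wraps around by default unless AP_SAT is set).
    lowest, highest, step = fixed_point_properties(width, integer)
    value = math.floor(x / step) * step
    return min(max(value, lowest), highest)

print(fixed_point_properties(18, 8))  # e.g. an 18-bit type with 8 integer bits
print(quantise(3.14159, 18, 8))
print(quantise(1000.0, 18, 8))        # saturates at the upper limit
```

Comparing such ranges with the distributions of the input features and of the raw BDT scores gives a first indication of whether a candidate type is wide enough.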

Minimal preparation from Python:

import conifer\nmodel = conifer. ... # convert or load a conifer model\n# e.g. model = conifer.converters.convert_from_xgboost(xgboost_model)\nmodel.save('my_bdt.json')\n

CMSSW C++ user code:

// include the conifer emulation header file\n#include \"L1Trigger/Phase2L1ParticleFlow/interface/conifer.h\"\n\n... model setup\n// define the input/threshold and score types\n// important: this needs to match the firmware settings for bit-exactness!\n// note: can use native types like float/double for development/debugging\ntypedef ap_fixed<18,8> input_t;\ntypedef ap_fixed<12,3,AP_RND_CONV,AP_SAT> score_t;\n\n// create a conifer BDT instance\n// 'true' to use balanced add-tree score aggregation (needed for bit-exactness)\nbdt = conifer::BDT<input_t, score_t, true>(\"my_bdt.json\");\n\n... inference\n// prepare the inputs, vector length same as model n_features\nstd::vector<input_t> inputs = ... \n// run inference, scores vector length same as model n_classes (or 1 for binary classification/regression)\nstd::vector<score_t> scores = bdt.decision_function(inputs);\n

conifer does not compute class probabilities from the raw predictions, which avoids extra resource and latency cost in the L1T deployment. Cuts or working points should therefore be applied to the raw predictions.
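If a working point was originally defined on a probability, it can be translated into a cut on the raw prediction offline. The sketch below assumes a binary classifier trained with a logistic objective, so that the probability is the sigmoid of the raw prediction; this is an assumption about the training setup, not something conifer enforces, and the function names are illustrative.

```python
import math

def sigmoid(raw):
    # probability corresponding to a raw (log-odds) prediction
    return 1.0 / (1.0 + math.exp(-raw))

def raw_threshold(probability_cut):
    # inverse of the sigmoid: the raw prediction at which the probability equals the cut
    return math.log(probability_cut / (1.0 - probability_cut))

cut = raw_threshold(0.9)
print(cut)             # threshold to apply on the raw score
print(sigmoid(cut))    # round trip back to the probability
```

Because the scores are stored in a fixed-point type, the threshold obtained this way should still be checked against the emulator output near the working point.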

"},{"location":"inference/hls4ml.html","title":"Direct inference with hls4ml","text":"

hls4ml is a Python package developed by the Fast Machine Learning Lab. Its primary purpose is to create firmware implementations of machine learning (ML) models to be run on FPGAs. The package interfaces with a high-level synthesis (HLS) backend (e.g. Xilinx Vivado HLS) to transpile the ML model into hardware description language (HDL). The primary hls4ml documentation, including API reference pages, is located here.

The main hls4ml tutorial code is kept on GitHub. Users are welcome to walk through the notebooks at their own pace. There is also a set of slides linked to the README.

That said, there have been several cases where the hls4ml developers have given live demonstrations and tutorials. Below is a non-exhaustive list of tutorials given in the last few years (newest on top).

| Workshop/Conference | Date | Links |
| --- | --- | --- |
| 23rd Virtual IEEE Real Time Conference | August 03, 2022 | Indico |
| 2022 CMS ML Town Hall | July 22, 2022 | Contribution Link |
| a3d3 hls4ml @ Snowmass CSS 2022: Tutorial | July 21, 2022 | Slides, Recording, JupyterHub |
| Fast Machine Learning for Science Workshop | December 3, 2020 | Indico, Slides, GitHub, Interactive Notebooks |
| hls4ml @ UZH ML Workshop | November 17, 2020 | Indico, Slides |
| ICCAD 2020 | November 5, 2020 | https://events-siteplex.confcats.io/iccad2022/wp-content/uploads/sites/72/2021/12/2020_ICCAD_ConferenceProgram.pdf, GitHub |
| 4th IML Workshop | October 19, 2020 | Indico, Slides, Instructions, Notebooks, Recording |
| 22nd Virtual IEEE Real Time Conference | October 15, 2020 | Indico, Slides, Notebooks |
| 30th International Conference on Field-Programmable Logic and Applications | September 4, 2020 | Program |
| hls4ml tutorial @ CERN | June 3, 2020 | Indico, Slides, Notebooks |
| Fast Machine Learning | September 12, 2019 | Indico |
| 1st Real Time Analysis Workshop, Universit\u00e9 Paris-Saclay | July 16, 2019 | Indico, Slides, Autoencoder Tutorial |"},{"location":"inference/onnx.html","title":"Direct inference with ONNX Runtime","text":"

ONNX is an open format built to represent machine learning models. It is designed to improve interoperability across a variety of frameworks and platforms in the AI tools community. Most machine learning frameworks (e.g. XGBoost, TensorFlow and PyTorch, which are frequently used in CMS) support converting their models into the ONNX format or loading a model from it.

The figure showing the ONNX interoperability. (Source from website.)

ONNX Runtime is a tool that aims to accelerate machine learning inference across a variety of deployment platforms. It allows users to \"run any ONNX model using a single set of inference APIs that provide access to the best hardware acceleration available\". It includes \"built-in optimization features that trim and consolidate nodes without impacting model accuracy.\"

The CMSSW interface to ONNX Runtime is available since CMSSW_11_1_X (cmssw#28112, cmsdist#5020). Its functionality is improved in CMSSW_11_2_X. The final implementation is also backported to CMSSW_10_6_X to facilitate Run 2 UL data reprocessing. A number of deep learning tagger models (e.g. DeepJet, DeepTauID, ParticleNet, DeepDoubleX, etc.) have been run with ONNX Runtime in routine UL processing, with a substantial speedup.

On this page, we will use a simple example to show how to use ONNX Runtime for deep learning model inference in the CMSSW framework, both in C++ (e.g. to process the MiniAOD file) and in Python (e.g. using NanoAOD-tools to process the NanoAODs). This may help readers who will deploy an ONNX model into their analyses or in the CMSSW framework.

"},{"location":"inference/onnx.html#software-setup","title":"Software Setup","text":"

We use CMSSW_11_2_5_patch2 to show the simple example for ONNX Runtime inference. The example also works under the newer 12 releases (note that inference with C++ can also run on CMSSW_10_6_X).

export SCRAM_ARCH=\"slc7_amd64_gcc900\"\nexport CMSSW_VERSION=\"CMSSW_11_2_5_patch2\"\n\nsource /cvmfs/cms.cern.ch/cmsset_default.sh\n\ncmsrel \"$CMSSW_VERSION\"\ncd \"$CMSSW_VERSION/src\"\n\ncmsenv\nscram b\n
"},{"location":"inference/onnx.html#converting-model-to-onnx","title":"Converting model to ONNX","text":"

The model deployed into CMSSW or our analysis needs to be converted to ONNX from the original framework format where it is trained. Please see here for a nice deck of tutorials on converting models from different mainstream frameworks into ONNX.

Here we take PyTorch as an example. A PyTorch model can be converted by torch.onnx.export(...). As a simple illustration, we convert a randomly initialized feed-forward network implemented in PyTorch, with 10 input nodes and 2 output nodes, and two hidden layers with 64 nodes each. The conversion code is presented below. The output model model.onnx will be deployed under the CMSSW framework in our following tutorial.

Click to expand
import torch\nimport torch.nn as nn\ntorch.manual_seed(42)\n\nclass SimpleMLP(nn.Module):\n\n    def __init__(self, **kwargs):\n        super(SimpleMLP, self).__init__(**kwargs)\n        self.mlp = nn.Sequential(\n            nn.Linear(10, 64), nn.BatchNorm1d(64), nn.ReLU(), \n            nn.Linear(64, 64), nn.BatchNorm1d(64), nn.ReLU(), \n            nn.Linear(64, 2), nn.ReLU(), \n            )\n    def forward(self, x):\n        # input x: (batch_size, feature_dim=10)\n        x = self.mlp(x)\n        return torch.softmax(x, dim=1)\n\nmodel = SimpleMLP()\n\n# create dummy input for the model\ndummy_input = torch.ones(1, 10, requires_grad=True) # batch size = 1\n\n# export model to ONNX\ntorch.onnx.export(model, dummy_input, \"model.onnx\", verbose=True, input_names=['my_input'], output_names=['my_output'])\n
"},{"location":"inference/onnx.html#inference-in-cmssw-c","title":"Inference in CMSSW (C++)","text":"

We will introduce how to write a module to run inference on the ONNX model under the CMSSW framework. CMSSW is known for its multi-threaded ability. In a threaded framework, multiple threads process events in the event loop. The logic is straightforward: a new event is assigned to an idle thread following the first-come-first-served principle.

In most cases, each thread is able to process events individually as the majority of event processing workflow can be accomplished only by seeing the information of that event. Thus, the stream modules (stream EDAnalyzer and stream EDFilter) are used frequently as each thread holds an individual copy of the module instance\u2014they do not need to communicate with each other. It is however also possible to share a global cache object between all threads in case sharing information across threads is necessary. In all, such CMSSW EDAnalyzer modules are declared by class MyPlugin : public edm::stream::EDAnalyzer<edm::GlobalCache<CacheData>> (similar for EDFilter). Details can be found in documentation on the C++ interface of stream modules.

Let's then think about what would happen when interfacing CMSSW with ONNX for model inference. When ONNX Runtime accepts a model, it converts the model into an in-memory representation, and performs a variety of optimizations depending on the operators in the model. This is done when an ONNX Runtime Session is created from an input model. The economical approach is then to hold only one Session for all threads, which may save a large amount of memory, as only one copy of the model is kept in memory. Upon request from multiple threads to do inference with their input data, the Session accepts those requests and serializes them, then produces the output data. ONNX Runtime by design allows multiple threads to invoke the Run() method on the same inference Session object. Therefore, what is left for us to do is to

  1. create a Session as a global object in our CMSSW module and share it among all threads;
  2. in each thread, we process the input data and then call the Run() method from that global Session.

That's the main logic for implementing ONNX inference in CMSSW. For details of high-level designs of ONNX Runtime, please see documentation here.
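The two steps above can be sketched in plain Python with a thread pool. This is an illustration of the sharing pattern only: StandInSession is a hypothetical stand-in whose run() is a simple weighted sum, not the ONNX Runtime or CMSSW API, and the event numbers are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

class StandInSession:
    # Hypothetical stand-in for an inference session; its state is read-only after construction.
    def __init__(self, weights):
        self._weights = tuple(weights)   # immutable, so concurrent reads are safe

    def run(self, inputs):
        return sum(w * x for w, x in zip(self._weights, inputs))

# step 1: a single session for the whole job, created once and shared by all threads
session = StandInSession([0.1] * 10)

def process_event(event_number):
    # step 2: each thread prepares its own input, then calls run() on the shared session
    inputs = [float(event_number + i) for i in range(10)]
    return event_number, session.run(inputs)

events = [1, 2, 3, 4, 5, 6, 7, 8]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(process_event, events))

# the concurrent result matches a plain serial loop
serial = dict(process_event(e) for e in events)
print(results == serial)
```

The point of the design is that only the per-event input lives in the thread, while the model itself exists once in memory.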

With this concept, let's build the module.

"},{"location":"inference/onnx.html#1-includes","title":"1. includes","text":"
#include \"FWCore/Framework/interface/stream/EDAnalyzer.h\"\n#include \"PhysicsTools/ONNXRuntime/interface/ONNXRuntime.h\"\n// further framework includes\n...\n

We include stream/EDAnalyzer.h to build the stream CMSSW module.

"},{"location":"inference/onnx.html#2-global-cache-object","title":"2. Global cache object","text":"

In CMSSW there exists a class ONNXRuntime which can be used directly as the global cache object. Upon initialization from a given model, it holds the ONNX Runtime Session object and provides the handle to invoke the Run() for model inference.

We put the ONNXRuntime class in the edm::GlobalCache template argument:

class MyPlugin : public edm::stream::EDAnalyzer<edm::GlobalCache<ONNXRuntime>> {\n...\n};\n
"},{"location":"inference/onnx.html#3-initiate-objects","title":"3. Initiate objects","text":"

The stream EDAnalyzer module provides a hook, initializeGlobalCache(), to initiate the global object. We simply do

std::unique_ptr<ONNXRuntime> MyPlugin::initializeGlobalCache(const edm::ParameterSet &iConfig) {\nreturn std::make_unique<ONNXRuntime>(iConfig.getParameter<edm::FileInPath>(\"model_path\").fullPath());\n}\n

to initiate the ONNXRuntime object upon a given model path.

"},{"location":"inference/onnx.html#4-inference","title":"4. Inference","text":"

We know the event processing step is implemented in the void EDAnalyzer::analyze method. When an event is assigned to a valid thread, the content will be processed in that thread. This can go in parallel with other threads processing other events.

We need to first construct the input data dedicated to the event. Here we create a dummy input: a sequence of consecutive integers of length 10. The input is set by replacing the values of our pre-booked vector, data_. This member variable has vector<vector<float>> format and is initialised as { {0, 0, ..., 0} } (contains only one element, which is a vector of 10 zeros). In processing of each event, the input data_ is modified:

std::vector<float> &group_data = data_[0];\nfor (size_t i = 0; i < 10; i++){\ngroup_data[i] = float(iEvent.id().event() % 100 + i);\n}\n
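For a quick cross-check of the C++ loop, the same dummy input can be written in Python. This is a sketch: the function name dummy_input is illustrative, and the event number is taken from the test output shown later on this page.

```python
def dummy_input(event_number, n_features=10):
    # mirrors group_data[i] = float(iEvent.id().event() % 100 + i)
    return [float(event_number % 100 + i) for i in range(n_features)]

# the outer list plays the role of FloatArrays: one entry per input node
data = [dummy_input(27074045)]
print(data[0])
```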

Then, we send data_ to the inference engine and get the model output:

std::vector<float> outputs = globalCache()->run(input_names_, data_, input_shapes_)[0];\n

We clarify a few details here.

First, we use globalCache() which is a class method in our stream CMSSW module to access the global object shared across all threads. In our case it is the ONNXRuntime instance.

The run() method is a wrapper to call Run() on the ONNX Session. Definitions of the method arguments are (code from link):

// Run inference and get outputs\n// input_names: list of the names of the input nodes.\n// input_values: list of input arrays for each input node. The order of `input_values` must match `input_names`.\n// input_shapes: list of `int64_t` arrays specifying the shape of each input node. Can leave empty if the model does not have dynamic axes.\n// output_names: names of the output nodes to get outputs from. Empty list means all output nodes.\n// batch_size: number of samples in the batch. Each array in `input_values` must have a shape layout of (batch_size, ...).\n// Returns: a std::vector<std::vector<float>>, with the order matched to `output_names`.\n// When `output_names` is empty, will return all outputs ordered as in `getOutputNames()`.\nFloatArrays run(const std::vector<std::string>& input_names,\nFloatArrays& input_values,\nconst std::vector<std::vector<int64_t>>& input_shapes = {},\nconst std::vector<std::string>& output_names = {},\nint64_t batch_size = 1) const;\n
where we have
typedef std::vector<std::vector<float>> FloatArrays;\n

In our case, input_names is set to {\"my_input\"}, which corresponds to the names given upon model creation. input_values is a length-1 vector, and input_values[0] is a vector of float of length 10, which are the inputs to the 10 nodes. input_shapes can be left empty here; it becomes necessary for advanced usage, when our input has dynamic lengths (e.g., in boosted jet tagging, we use different numbers of particle-flow candidates and secondary vertices as input).

For the usual model design, we have only one vector of output. In such a case, the output is simply a length-1 vector, and we use [0] to get the vector of two float numbers\u2014the output of the model.

"},{"location":"inference/onnx.html#full-example","title":"Full example","text":"

Let's construct the full example.

Click to expand

The example assumes the following directory structure:

MySubsystem/MyModule/\n\u2502\n\u251c\u2500\u2500 plugins/\n\u2502   \u251c\u2500\u2500 MyPlugin.cpp\n\u2502   \u2514\u2500\u2500 BuildFile.xml\n\u2502\n\u251c\u2500\u2500 test/\n\u2502   \u2514\u2500\u2500 my_plugin_cfg.py\n\u2502\n\u2514\u2500\u2500 data/\n    \u2514\u2500\u2500 model.onnx\n
Files (in the order shown below): plugins/MyPlugin.cpp, plugins/BuildFile.xml, test/my_plugin_cfg.py, data/model.onnx
/*\n * Example plugin to demonstrate the direct multi-threaded inference with ONNX Runtime.\n */\n\n#include <memory>\n#include <iostream>\n\n#include \"FWCore/Framework/interface/Event.h\"\n#include \"FWCore/Framework/interface/Frameworkfwd.h\"\n#include \"FWCore/Framework/interface/MakerMacros.h\"\n#include \"FWCore/Framework/interface/stream/EDAnalyzer.h\"\n#include \"FWCore/ParameterSet/interface/ParameterSet.h\"\n\n#include \"PhysicsTools/ONNXRuntime/interface/ONNXRuntime.h\"\n\nusing namespace cms::Ort;\n\nclass MyPlugin : public edm::stream::EDAnalyzer<edm::GlobalCache<ONNXRuntime>> {\npublic:\nexplicit MyPlugin(const edm::ParameterSet &, const ONNXRuntime *);\nstatic void fillDescriptions(edm::ConfigurationDescriptions&);\n\nstatic std::unique_ptr<ONNXRuntime> initializeGlobalCache(const edm::ParameterSet &);\nstatic void globalEndJob(const ONNXRuntime *);\n\nprivate:\nvoid beginJob();\nvoid analyze(const edm::Event&, const edm::EventSetup&);\nvoid endJob();\n\nstd::vector<std::string> input_names_;\nstd::vector<std::vector<int64_t>> input_shapes_;\nFloatArrays data_; // each stream hosts its own data\n};\n\n\nvoid MyPlugin::fillDescriptions(edm::ConfigurationDescriptions& descriptions) {\n// defining this function will lead to a *_cfi file being generated when compiling\nedm::ParameterSetDescription desc;\ndesc.add<edm::FileInPath>(\"model_path\", edm::FileInPath(\"MySubsystem/MyModule/data/model.onnx\"));\ndesc.add<std::vector<std::string>>(\"input_names\", std::vector<std::string>({\"my_input\"}));\ndescriptions.addWithDefaultLabel(desc);\n}\n\n\nMyPlugin::MyPlugin(const edm::ParameterSet &iConfig, const ONNXRuntime *cache)\n: input_names_(iConfig.getParameter<std::vector<std::string>>(\"input_names\")),\ninput_shapes_() {\n// initialize the input data arrays\n// note there is only one element in the FloatArrays type (i.e. 
vector<vector<float>>) variable\ndata_.emplace_back(10, 0);\n}\n\n\nstd::unique_ptr<ONNXRuntime> MyPlugin::initializeGlobalCache(const edm::ParameterSet &iConfig) {\nreturn std::make_unique<ONNXRuntime>(iConfig.getParameter<edm::FileInPath>(\"model_path\").fullPath());\n}\n\nvoid MyPlugin::globalEndJob(const ONNXRuntime *cache) {}\n\nvoid MyPlugin::analyze(const edm::Event &iEvent, const edm::EventSetup &iSetup) {\n// prepare dummy inputs for every event\nstd::vector<float> &group_data = data_[0];\nfor (size_t i = 0; i < 10; i++){\ngroup_data[i] = float(iEvent.id().event() % 100 + i);\n}\n\n// run prediction and get outputs\nstd::vector<float> outputs = globalCache()->run(input_names_, data_, input_shapes_)[0];\n\n// print the input and output data\nstd::cout << \"input data -> \";\nfor (auto &i: group_data) { std::cout << i << \" \"; }\nstd::cout << std::endl << \"output data -> \";\nfor (auto &i: outputs) { std::cout << i << \" \"; }\nstd::cout << std::endl;\n\n}\n\nDEFINE_FWK_MODULE(MyPlugin);\n
<use name=\"FWCore/Framework\" />\n<use name=\"FWCore/PluginManager\" />\n<use name=\"FWCore/ParameterSet\" />\n<use name=\"PhysicsTools/ONNXRuntime\" />\n\n<flags EDM_PLUGIN=\"1\" />\n
# coding: utf-8\n\nimport os\n\nimport FWCore.ParameterSet.Config as cms\nfrom FWCore.ParameterSet.VarParsing import VarParsing\n\n\n# setup minimal options\noptions = VarParsing(\"python\")\noptions.setDefault(\"inputFiles\", \"/store/mc/RunIISummer20UL18MiniAODv2/DYJetsToLL_M-50_TuneCP5_13TeV-amcatnloFXFX-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v2/230000/4C8619B2-D0C0-4647-B946-B33754F4ED16.root\")  # noqa\noptions.parseArguments()\n\n# define the process to run\nprocess = cms.Process(\"TEST\")\n\n# minimal configuration\nprocess.load(\"FWCore.MessageService.MessageLogger_cfi\")\nprocess.MessageLogger.cerr.FwkReport.reportEvery = 1\nprocess.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(10))\nprocess.source = cms.Source(\"PoolSource\",\n    fileNames=cms.untracked.vstring(options.inputFiles))\n\n# process options\nprocess.options = cms.untracked.PSet(\n    allowUnscheduled=cms.untracked.bool(True),\n    wantSummary=cms.untracked.bool(True),\n)\n\n# setup options for multithreaded\nprocess.options.numberOfThreads=cms.untracked.uint32(1)\nprocess.options.numberOfStreams=cms.untracked.uint32(0)\nprocess.options.numberOfConcurrentLuminosityBlocks=cms.untracked.uint32(1)\n\n\n# setup MyPlugin by loading the auto-generated cfi (see MyPlugin.fillDescriptions)\nprocess.load(\"MySubsystem.MyModule.myPlugin_cfi\")\n# specify the path of the ONNX model\nprocess.myPlugin.model_path = \"MySubsystem/MyModule/data/model.onnx\"\n# input names as defined in the model\n# the order of name strings should also corresponds to the order of input data array feed to the model\nprocess.myPlugin.input_names = [\"my_input\"]\n\n# define what to run in the path\nprocess.p = cms.Path(process.myPlugin)\n

The model is produced by code in the section \"Converting model to ONNX\" and can be downloaded here.

"},{"location":"inference/onnx.html#test-our-module","title":"Test our module","text":"

Under MySubsystem/MyModule/test, run cmsRun my_plugin_cfg.py to launch our module. You should see output like the following, which includes the input and output vectors of the inference process.

Click to see the output
...\n19-Jul-2022 10:50:41 CEST  Successfully opened file root://xrootd-cms.infn.it//store/mc/RunIISummer20UL18MiniAODv2/DYJetsToLL_M-50_TuneCP5_13TeV-amcatnloFXFX-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v2/230000/4C8619B2-D0C0-4647-B946-B33754F4ED16.root\nBegin processing the 1st record. Run 1, Event 27074045, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.494 CEST\ninput data -> 45 46 47 48 49 50 51 52 53 54\noutput data -> 0.995657 0.00434343\nBegin processing the 2nd record. Run 1, Event 27074048, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.495 CEST\ninput data -> 48 49 50 51 52 53 54 55 56 57\noutput data -> 0.996884 0.00311563\nBegin processing the 3rd record. Run 1, Event 27074059, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.495 CEST\ninput data -> 59 60 61 62 63 64 65 66 67 68\noutput data -> 0.999081 0.000919373\nBegin processing the 4th record. Run 1, Event 27074061, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.495 CEST\ninput data -> 61 62 63 64 65 66 67 68 69 70\noutput data -> 0.999264 0.000736247\nBegin processing the 5th record. Run 1, Event 27074046, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.496 CEST\ninput data -> 46 47 48 49 50 51 52 53 54 55\noutput data -> 0.996112 0.00388828\nBegin processing the 6th record. Run 1, Event 27074047, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.496 CEST\ninput data -> 47 48 49 50 51 52 53 54 55 56\noutput data -> 0.996519 0.00348065\nBegin processing the 7th record. Run 1, Event 27074064, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.496 CEST\ninput data -> 64 65 66 67 68 69 70 71 72 73\noutput data -> 0.999472 0.000527586\nBegin processing the 8th record. Run 1, Event 27074074, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.496 CEST\ninput data -> 74 75 76 77 78 79 80 81 82 83\noutput data -> 0.999826 0.000173664\nBegin processing the 9th record. 
Run 1, Event 27074050, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.496 CEST\ninput data -> 50 51 52 53 54 55 56 57 58 59\noutput data -> 0.997504 0.00249614\nBegin processing the 10th record. Run 1, Event 27074060, LumiSection 10021 on stream 0 at 19-Jul-2022 10:50:43.496 CEST\ninput data -> 60 61 62 63 64 65 66 67 68 69\noutput data -> 0.999177 0.000822734\n19-Jul-2022 10:50:43 CEST  Closed file root://xrootd-cms.infn.it//store/mc/RunIISummer20UL18MiniAODv2/DYJetsToLL_M-50_TuneCP5_13TeV-amcatnloFXFX-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v16_L1v1-v2/230000/4C8619B2-D0C0-4647-B946-B33754F4ED16.root\n

Also we could try launching the script with more threads. Change the corresponding line in my_plugin_cfg.py as follows to activate the multi-threaded mode with 4 threads.

process.options.numberOfThreads=cms.untracked.uint32(4)\n

Launch the script again, and one could see the same results, but with the inference processed concurrently on 4 threads.

"},{"location":"inference/onnx.html#inference-in-cmssw-python","title":"Inference in CMSSW (Python)","text":"

Doing ONNX Runtime inference with python is possible as well. For those releases that have the ONNX Runtime C++ package installed, the onnxruntime python package is also installed in python3 (except for CMSSW_10_6_X). We still use CMSSW_11_2_5_patch2 to run our examples. We could quickly check if onnxruntime is available by:

python3 -c \"import onnxruntime; print('onnxruntime available')\"\n

The python code is simple to construct: following the quick examples \"Get started with ORT for Python\", we create the file MySubsystem/MyModule/test/my_standalone_test.py as follows:

import onnxruntime as ort\nimport numpy as np\n\n# create input data in the float format (32 bit)\ndata = np.arange(45, 55).astype(np.float32)\n\n# create inference session using ort.InferenceSession from a given model\nort_sess = ort.InferenceSession('../data/model.onnx')\n\n# run inference\noutputs = ort_sess.run(None, {'my_input': np.array([data])})[0]\n\n# print input and output\nprint('input ->', data)\nprint('output ->', outputs)\n

Under the directory MySubsystem/MyModule/test, run the example with python3 my_standalone_test.py. Then we see the output:

input -> [45. 46. 47. 48. 49. 50. 51. 52. 53. 54.]\noutput -> [[0.9956566  0.00434343]]\n
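Since the standalone input matches the input of the first event in the C++ test, the two printed outputs can be compared directly. The sketch below copies the numbers from the two outputs on this page and compares them with a tolerance, because the two programs print a different number of digits.

```python
import math

cpp_output = [0.995657, 0.00434343]       # printed by MyPlugin for event 27074045
python_output = [0.9956566, 0.00434343]   # printed by my_standalone_test.py

# element-wise agreement within a relative tolerance
agree = all(math.isclose(a, b, rel_tol=1e-5) for a, b in zip(cpp_output, python_output))
print(agree)

# the model ends with a softmax, so the two scores should sum to one
print(math.isclose(sum(python_output), 1.0, abs_tol=1e-6))
```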

Using ONNX Runtime on NanoAOD-tools follows the same logic. Here we create the ONNX Session in the beginning stage and run inference in the event loop. Note that NanoAOD-tools runs the event loop in the single-thread mode.

Please find details in the following block.

Click to see the NanoAOD-tools example

We run the NanoAOD-tools example following the above CMSSW_11_2_5_patch2 environment. According to the setup instruction in NanoAOD-tools, do

cd $CMSSW_BASE/src\ngit clone https://github.com/cms-nanoAOD/nanoAOD-tools.git PhysicsTools/NanoAODTools\ncd PhysicsTools/NanoAODTools\ncmsenv\nscram b\n

Now we add our custom module to run ONNX Runtime inference. Create a file PhysicsTools/NanoAODTools/python/postprocessing/examples/exampleOrtModule.py with the content:

from PhysicsTools.NanoAODTools.postprocessing.framework.datamodel import Collection\nfrom PhysicsTools.NanoAODTools.postprocessing.framework.eventloop import Module\nimport ROOT\nROOT.PyConfig.IgnoreCommandLineOptions = True\n\nimport onnxruntime as ort\nimport numpy as np\nimport os \n\nclass exampleOrtProducer(Module):\n    def __init__(self):\n        pass\n\n    def beginJob(self):\n        model_path = os.path.join(os.getenv(\"CMSSW_BASE\"), 'src', 'MySubsystem/MyModule/data/model.onnx')\n        self.ort_sess = ort.InferenceSession(model_path)\n\n    def endJob(self):\n        pass\n\n    def beginFile(self, inputFile, outputFile, inputTree, wrappedOutputTree):\n        self.out = wrappedOutputTree\n        self.out.branch(\"OrtScore\", \"F\")\n\n    def endFile(self, inputFile, outputFile, inputTree, wrappedOutputTree):\n        pass\n\n    def analyze(self, event):\n        \"\"\"process event, return True (go to next module) or False (fail, go to next event)\"\"\"\n\n        # create input data\n        data = np.arange(event.event % 100, event.event % 100 + 10).astype(np.float32)\n        # run inference\n        outputs = self.ort_sess.run(None, {'my_input': np.array([data])})[0]\n        # print input and output\n        print('input ->', data)\n        print('output ->', outputs)\n\n        self.out.fillBranch(\"OrtScore\", outputs[0][0])\n        return True\n\n\n# define modules using the syntax 'name = lambda : constructor' to avoid having them loaded when not needed\n\nexampleOrtModuleConstr = lambda: exampleOrtProducer()\n

Note the line in beginJob that creates the ONNX Runtime Session and the line in analyze that launches the inference.

Finally, following the test command from NanoAOD-tools, we run our custom module in python3 by

python3 scripts/nano_postproc.py outDir /eos/cms/store/user/andrey/f.root -I PhysicsTools.NanoAODTools.postprocessing.examples.exampleOrtModule exampleOrtModuleConstr -N 10\n

We should see the output as follows

processing.examples.exampleOrtModule exampleOrtModuleConstr -N 10\nLoading exampleOrtModuleConstr from PhysicsTools.NanoAODTools.postprocessing.examples.exampleOrtModule\nWill write selected trees to outDir\nPre-select 10 entries out of 10 (100.00%)\ninput -> [11. 12. 13. 14. 15. 16. 17. 18. 19. 20.]\noutput -> [[0.83919346 0.16080655]]\ninput -> [ 7.  8.  9. 10. 11. 12. 13. 14. 15. 16.]\noutput -> [[0.76994413 0.2300559 ]]\ninput -> [ 4.  5.  6.  7.  8.  9. 10. 11. 12. 13.]\noutput -> [[0.7116992 0.2883008]]\ninput -> [ 2.  3.  4.  5.  6.  7.  8.  9. 10. 11.]\noutput -> [[0.66414535 0.33585465]]\ninput -> [ 9. 10. 11. 12. 13. 14. 15. 16. 17. 18.]\noutput -> [[0.80617136 0.19382869]]\ninput -> [ 6.  7.  8.  9. 10. 11. 12. 13. 14. 15.]\noutput -> [[0.75187963 0.2481204 ]]\ninput -> [16. 17. 18. 19. 20. 21. 22. 23. 24. 25.]\noutput -> [[0.9014619  0.09853811]]\ninput -> [18. 19. 20. 21. 22. 23. 24. 25. 26. 27.]\noutput -> [[0.9202239  0.07977609]]\ninput -> [ 5.  6.  7.  8.  9. 10. 11. 12. 13. 14.]\noutput -> [[0.7330253  0.26697478]]\ninput -> [10. 11. 12. 13. 14. 15. 16. 17. 18. 19.]\noutput -> [[0.82333535 0.17666471]]\nProcessed 10 preselected entries from /eos/cms/store/user/andrey/f.root (10 entries). Finally selected 10 entries\nDone outDir/f_Skim.root\nTotal time 1.1 sec. to process 10 events. Rate = 9.3 Hz.\n

"},{"location":"inference/onnx.html#links-and-further-reading","title":"Links and further reading","text":"
  • ONNX/ONNX Runtime
    • Tutorials on converting models to ONNX format
    • ONNX Runtime C++ example
    • ONNX Runtime C++ API
    • ONNX Runtime python example
    • ONNX Runtime python API
    • ONNX Runtime in CMSSW (talk)

Developers: Huilin Qu

Authors: Congqiao Li

"},{"location":"inference/particlenet.html","title":"ParticleNet","text":"

ParticleNet [arXiv:1902.08570] is an advanced neural network architecture that has many applications in CMS, including heavy flavour jet tagging, jet mass regression, etc. The network is fed by various low-level point-like objects as input, e.g., the particle-flow candidates, to predict a feature of a jet.

The full architecture of the ParticleNet model. We'll walk through the details in the following sections.

On this page, we introduce several user-specific aspects of the ParticleNet model. We cover the following items in three sections:

  1. An introduction to ParticleNet, including

    • a general description of ParticleNet
    • the advantages brought from the architecture by concept
    • a sketch of ParticleNet applications in CMS and other relevant works
  2. An introduction to Weaver and model implementations, introduced in a step-by-step manner:

    • build three network models and understand them from the technical side; use the out-of-the-box commands to run these examples on a benchmark task. The three networks are (1) a simple feed-forward NN, (2) a DeepAK8 model (based on 1D CNN), and eventually (3) the ParticleNet model (based on DGCNN).
    • try to reproduce the original performance and make the ROC plots.

    This section is friendly to the ML newcomers. The goal is to help readers understand the underlying structure of the \"ParticleNet\".

  3. Tuning the ParticleNet model, including

    • tips for readers who are using/modifying the ParticleNet model to achieve a better performance

    This section can be helpful in practice. It provides tips on model training, tuning, validation, etc. It targets situations in which readers apply their own ParticleNet (or ParticleNet-like) model to a custom task.

Corresponding persons:

  • Huilin Qu, Loukas Gouskos (original developers of ParticleNet)
  • Congqiao Li (author of the page)
"},{"location":"inference/particlenet.html#introduction-to-particlenet","title":"Introduction to ParticleNet","text":""},{"location":"inference/particlenet.html#1-general-description","title":"1. General description","text":"

ParticleNet is a graph neural net (GNN) model. The key ingredient of ParticleNet is the graph convolutional operation, i.e., the edge convolution (EdgeConv) and the dynamic graph CNN (DGCNN) method [arXiv:1801.07829] applied on the \"point cloud\" data structure.

We will disassemble the ParticleNet model and provide a detailed exploration in the next section, but here we briefly explain the key features of the model.

Intuitively, ParticleNet treats all candidates inside an object as a \"point cloud\", which is a permutation-invariant set of points (e.g. a set of PF candidates), each carrying a feature vector (\u03b7, \u03c6, pT, charge, etc.). The DGCNN uses the EdgeConv operation to exploit their spatial correlations (two-dimensional on the \u03b7-\u03c6 plane) by finding the k-nearest neighbours of each point and generating a new latent graph layer where points are scattered in a high-dimensional latent space. This is a graph-type analogue of the classical 2D convolution operation, which acts on a regular 2D grid (e.g., a picture) using a 3\u00d73 local patch to explore the relations of a single pixel with its 8 nearest pixels, then generates a new 2D grid.
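The neighbour search and the aggregation over neighbours can be sketched in plain Python. This is a toy illustration under stated assumptions, not the ParticleNet implementation: the coordinates are taken to be relative to the jet axis so that a plain Euclidean distance is used, the per-edge function toy_h is a hypothetical stand-in for the learned network, and a mean over neighbours is used for the aggregation.

```python
import math

def k_nearest(points, k):
    # points: list of (eta, phi); returns for each point the indices of its k nearest neighbours
    neighbours = []
    for i, (eta_i, phi_i) in enumerate(points):
        dist = [(math.hypot(eta_i - eta_j, phi_i - phi_j), j)
                for j, (eta_j, phi_j) in enumerate(points) if j != i]
        neighbours.append([j for _, j in sorted(dist)[:k]])
    return neighbours

def edge_conv(features, neighbours, h):
    # new feature of point i = mean over its neighbours j of h(x_i, x_j - x_i)
    out = []
    for i, x_i in enumerate(features):
        msgs = [h(x_i, [b - a for a, b in zip(x_i, features[j])]) for j in neighbours[i]]
        out.append([sum(col) / len(msgs) for col in zip(*msgs)])
    return out

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (1.0, 1.0)]
features = [[10.0], [20.0], [30.0], [40.0]]

nn = k_nearest(points, k=2)
# toy stand-in for the learned edge function: concatenate x_i and (x_j - x_i)
toy_h = lambda x_i, diff: x_i + diff
new_features = edge_conv(features, nn, toy_h)
print(nn)
print(new_features)
```

In a stacked setup the neighbour search of the next layer would run on new_features instead of on the original coordinates.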

The cartoon illustrates the convolutional operation acted on the regular grid and on the point cloud (plot from ML4Jets 2018 talk).

As a consequence, the EdgeConv operation transforms the graph to a new graph, which has a changed spatial relationship among points. It then acts on the second graph to produce the third graph, showing the stackability of the convolution operation. This illustrates the \"dynamic\" property as the graph topology changes after each EdgeConv layer.

"},{"location":"inference/particlenet.html#2-advantage","title":"2. Advantage","text":"

By concept, the advantage of the network may come from exploiting the permutation-invariant symmetry of the points, which is intrinsic to our physics objects. This symmetry is held naturally in a point cloud representation.
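The symmetry argument can be made concrete with a small sketch. It compares a symmetric aggregation over the set of particles with an order-dependent flattening; the feature values are made up for illustration, and the function names are not taken from any ParticleNet code.

```python
import math

def pooled(particles):
    # symmetric aggregations over the set: per-feature mean and max
    columns = list(zip(*particles))
    means = [math.fsum(c) / len(c) for c in columns]   # fsum is correctly rounded, so order does not matter
    maxima = [max(c) for c in columns]
    return means + maxima

def ordered(particles):
    # order-dependent representation: flatten the particles in the given order
    return [v for p in particles for v in p]

# toy jet: three particles with made-up (eta, phi, pT) values
jet = [(0.1, -0.2, 35.0), (0.0, 0.3, 12.5), (-0.4, 0.1, 3.2)]
reordered = jet[::-1]

print(pooled(jet) == pooled(reordered))     # same set, same pooled features
print(ordered(jet) == ordered(reordered))   # the flattened sequence depends on the ordering
```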

In recent studies of jet physics and event-based analyses using ML techniques, there is increasing interest in exploring the point cloud data structure. We explain here conceptually why a \"point cloud\" representation outperforms the classical ones, including the variable-length 2D vector structure passed to a 1D CNN or any type of RNN, and the image-based representation passed through a 2D CNN. With a 1D CNN, the points (PF candidates) are most often ordered by pT to fix them on the 1D grid. The convolution operation then learns only correlations between neighbouring points with similar pT. The Long Short-Term Memory (LSTM) type recurrent neural network (RNN) provides the flexibility to feed in a variable-length sequence and has a \"memory\" mechanism to carry the information it learns from early nodes over to the latest node. The concern is that such an ordering of the sequence is somewhat artificial, and not an underlying property that an NN must learn to accomplish the classification task. By comparison, in natural language processing, where LSTM has a huge advantage, the order of words is an important characteristic of the language itself (it reflects the \"grammar\" in some circumstances) and is a feature the NN must learn to master the language. The image-based data explored by a 2D CNN stems from the image recognition task. Proper standardization is usually applied to a jet image before it is fed into the network. Even so, a jet image lacks the local features that a 2D local patch is good at capturing, e.g. the ear of a cat that a local patch can capture by scanning over the entire image. Instead, the jet image appears to hold its features globally (e.g. the two-prong structure for W-tagging). The sparsity of the data is another concern, in that presenting a jet on a regular grid introduces redundant information, making it hard for the network to capture the key properties.

"},{"location":"inference/particlenet.html#3-applications-and-other-related-work","title":"3. Applications and other related work","text":"

Here we briefly summarize the applications and ongoing works on ParticleNet. Public CMS results include

  • large-R jet with R=0.8 tagging (for W/Z/H/t) using ParticleNet [CMS-DP-2020/002]
  • regression on the large-R jet mass based on the ParticleNet model [CMS-DP-2021/017]

The ParticleNet architecture is also applied to small-radius R=0.4 jets for b/c-tagging and quark/gluon classification (see this talk (CMS internal)). Ongoing work applies the ParticleNet architecture to heavy-flavour tagging at the HLT (see this talk (CMS internal)). The ParticleNet model was recently updated to ParticleNeXt, which shows further improvement (see the ML4Jets 2021 talk).

Recent works in the joint field of HEP and ML also shed light on exploiting the point cloud data structure and GNN-based architectures, and progress has been very active in recent years. Some useful materials are listed here for the reader's reference.

  • Some phenomenology-based works are summarized in the HEP \u00d7 ML living review, especially in the \"graph\" and \"sets\" categories.
  • For an overview of GNN applications in CMS, see the CMS ML forum (CMS internal). See also the more recent progress on GNN applications in the ML forums: Oct 20, Nov 3.
  • At the time of writing, various novel GNN-based models were explored and introduced at the recent ML4Jets2021 meeting.
"},{"location":"inference/particlenet.html#introduction-to-weaver-and-model-implementations","title":"Introduction to Weaver and model implementations","text":"

Weaver is a machine learning R&D framework for high energy physics (HEP) applications. It trains the neural net with PyTorch and is capable of exporting the model to the ONNX format for fast inference. A detailed guide is presented on the Weaver README page.

We now walk through three concrete examples to get you familiar with Weaver. We use the benchmark of the top tagging task [arXiv:1707.08966] in the following examples. Some useful information can be found in the \"top tagging\" section of the IML public datasets webpage (the gDoc).

Our goal is to do some warm-up with Weaver, and more importantly, to explore from a technical side the neural net architectures: a simple multi-layer perceptron (MLP) model, a more complicated \"DeepAK8 tagger\" model based on 1D CNN with ResNet, and the \"ParticleNet model,\" which is based on DGCNN. We will dig deeper into their implementations in Weaver and try to illustrate as many details as possible. Finally, we compare their performance and see if we can reproduce the benchmark record with the model. Please clone the repo weaver-benchmark and we'll get started. The Weaver repo will be cloned as a submodule.

git clone --recursive https://github.com/colizz/weaver-benchmark.git\n\n# Create a soft link inside weaver so that it can find data/model cards\nln -s ../top_tagging weaver-benchmark/weaver/top_tagging\n

"},{"location":"inference/particlenet.html#1-build-models-in-weaver","title":"1. Build models in Weaver","text":"

When implementing a new training in Weaver, two key elements are crucial: the model and the data configuration file. The model defines the network architecture we are using, and the data configuration includes which variables to use for training, which pre-selection to apply, how to assign truth labels, etc.

Technically, the model configuration file includes a get_model function that returns a torch.nn.Module type model and a dictionary of model info used to export an ONNX-format model. The data configuration is a YAML file describing how to process the input data. Please see the Weaver README for details.

Before moving on, we need to preprocess the benchmark datasets. The original sample is an H5 file including branches like the energy E_i and 3-momenta PX_i, PY_i, PZ_i for each jet constituent i (i=0, ..., 199) inside a jet. All branches are in a 1D flat structure. We restructure the data so that the jet features are 2D vectors (e.g., in the vector<float> format): Part_E, Part_PX, Part_PY, Part_PZ, with a variable length that corresponds to the number of constituents. Note that this is a commonly used data structure, similar to the NanoAOD format in CMS.

The datasets can be found in the CERN EOS space /eos/user/c/coli/public/weaver-benchmark/top_tagging/samples. The input files used in this page are in fact the ROOT files produced by the preprocessing step, stored under the prep/ subdirectory. They include three sets of data for training, validation, and testing.
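To make the change of data structure concrete, here is a minimal sketch of turning flat, fixed-width branches into variable-length per-jet lists. It assumes that unused constituent slots are filled with trailing zeros; the array `flat_energy` and the helper `to_variable_length` are hypothetical and are not taken from the actual conversion script.

```python
import numpy as np

# Hypothetical flat table: one row per jet, columns E_0..E_2, zero-padded.
# The benchmark files have 200 slots per jet; 3 are used here for readability.
flat_energy = np.array([
    [50.0, 20.0, 0.0],   # jet with 2 constituents
    [80.0, 0.0, 0.0],    # jet with 1 constituent
])

def to_variable_length(flat):
    # Keep only the leading non-zero entries of each row
    # (assumes that the padding consists of trailing zeros only).
    return [row[:np.count_nonzero(row)].tolist() for row in flat]

part_e = to_variable_length(flat_energy)
print(part_e)  # [[50.0, 20.0], [80.0]]
```

The same operation would be applied to the momentum branches, giving one variable-length list per jet and per variable.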

Note

To preprocess the input files from the original datasets manually, go to the weaver-benchmark base directory and run

python utils/convert_top_datasets.py -i <your-sample-dir>\n
This will convert the .h5 file to ROOT ntuples and create some new variables for each jet, including the relative \u03b7 and \u03c6 values of each jet constituent w.r.t. the main axis of the jet. The converted files are stored in the prep/ subfolder of the original directory.
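As an illustration of what the relative variables mean, the sketch below computes the η and φ differences of one constituent with respect to a given jet axis from plain momentum components. The helper names are hypothetical, the jet axis is simply passed in as a three-vector, and the actual conversion script may compute these quantities differently.

```python
import math

def eta_phi(px, py, pz):
    # Pseudorapidity and azimuthal angle of a momentum three-vector
    # (not defined for a vector exactly along the beam axis).
    p = math.sqrt(px * px + py * py + pz * pz)
    return math.atanh(pz / p), math.atan2(py, px)

def relative_eta_phi(constituent, jet_axis):
    # Differences with respect to the jet axis; the phi difference is wrapped into [-pi, pi).
    eta, phi = eta_phi(*constituent)
    jet_eta, jet_phi = eta_phi(*jet_axis)
    dphi = (phi - jet_phi + math.pi) % (2 * math.pi) - math.pi
    return eta - jet_eta, dphi

# A constituent sitting just across the phi = pi boundary from the jet axis.
deta, dphi = relative_eta_phi((-1.0, -0.05, 0.0), (-1.0, 0.05, 0.0))
print(round(deta, 3), round(dphi, 3))
```

The wrapping step matters: without it, the two nearby directions in the example would appear to be almost 2π apart in φ.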

Next, we show three NN model configurations and provide detailed explanations of the code. We illustrate the model architectures in detail, especially in the ParticleNet case.

A simple MLP | DeepAK8 (1D CNN) | ParticleNet (DGCNN)

The full architecture of the proof-of-concept multi-layer perceptron model.

A simple multi-layer perceptron model is first provided here as a proof of concept. All layers are based on linear transformations of 1D vectors. The model configuration card is shown in top_tagging/networks/mlp_pf.py. First, we implement an MLP network as an nn.Module class.

MLP implementation

Also, see top_tagging/networks/mlp_pf.py. We elaborate here on several aspects.

  • A sequence of linear layers and ReLU activation functions is defined in nn.Sequential(nn.Linear(channels[i], channels[i + 1]), nn.ReLU()). By combining multiple of them, we construct a simple multi-layer perceptron.

  • The input data x takes the 3D format, with dimensions (N, C, P), which is decided by our data structure and the data configuration card. Here, N is the mini-batch size, C is the feature size, and P is the number of constituents per jet. To feed it into our MLP, we flatten the last two dimensions by x = x.flatten(start_dim=1) to form a vector of dimension (N, L), where L = C * P.

import torch.nn as nn\n\n\nclass MultiLayerPerceptron(nn.Module):\n    r\"\"\"Parameters\n    ----------\n    input_dims : int\n        Input feature dimensions.\n    num_classes : int\n        Number of output classes.\n    layer_params : list\n        List of the feature size for each layer.\n    \"\"\"\n\n    def __init__(self, input_dims, num_classes,\n                layer_params=(1024, 256, 256),\n                **kwargs):\n\n        super(MultiLayerPerceptron, self).__init__(**kwargs)\n        channels = [input_dims] + list(layer_params) + [num_classes]\n        layers = []\n        for i in range(len(channels) - 1):\n            layers.append(nn.Sequential(nn.Linear(channels[i], channels[i + 1]),\n                                        nn.ReLU()))\n        self.mlp = nn.Sequential(*layers)\n\n    def forward(self, x):\n        # x: the feature vector initially read from the data structure, in dimension (N, C, P)\n        x = x.flatten(start_dim=1) # (N, L), where L = C * P\n        return self.mlp(x)\n

Then, we write the get_model and get_loss functions, which will be passed to Weaver's training code.

get_model and get_loss function

Also see top_tagging/networks/mlp_pf.py. We elaborate here on several aspects.

  • Inside get_model, the model is essentially the MLP class we defined, and the model_info takes the default definition, including the input/output shapes and the dynamic axes of the input/output data, which guide the ONNX model export.
  • The get_loss function is unchanged, since we always use the cross-entropy loss function in the classification task.
def get_model(data_config, **kwargs):\n    layer_params = (1024, 256, 256)\n    _, pf_length, pf_features_dims = data_config.input_shapes['pf_features']\n    input_dims = pf_length * pf_features_dims\n    num_classes = len(data_config.label_value)\n    model = MultiLayerPerceptron(input_dims, num_classes, layer_params=layer_params)\n\n    model_info = {\n        'input_names':list(data_config.input_names),\n        'input_shapes':{k:((1,) + s[1:]) for k, s in data_config.input_shapes.items()},\n        'output_names':['softmax'],\n        'dynamic_axes':{**{k:{0:'N', 2:'n_' + k.split('_')[0]} for k in data_config.input_names}, **{'softmax':{0:'N'}}},\n        }\n\n    print(model, model_info)\n    return model, model_info\n\n\ndef get_loss(data_config, **kwargs):\n    return torch.nn.CrossEntropyLoss()\n
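The dictionary comprehensions in model_info are compact, so it may help to see what they evaluate to. The sketch below reuses the same expressions with hypothetical stand-ins for the data_config attributes: a single input group named pf_features with shape (N, C, P) = (-1, 4, 100), where -1 is only a placeholder for the open batch size.

```python
# Hypothetical stand-ins for the data_config attributes used in get_model.
input_names = ['pf_features']
input_shapes = {'pf_features': (-1, 4, 100)}  # (N, C, P)

model_info = {
    'input_names': list(input_names),
    # Replace the batch dimension by 1 to get a concrete example shape for the export.
    'input_shapes': {k: ((1,) + s[1:]) for k, s in input_shapes.items()},
    'output_names': ['softmax'],
    # Axis 0 (batch) and axis 2 (constituents) are declared as variable-sized.
    'dynamic_axes': {**{k: {0: 'N', 2: 'n_' + k.split('_')[0]} for k in input_names},
                     **{'softmax': {0: 'N'}}},
}

print(model_info['input_shapes'])  # {'pf_features': (1, 4, 100)}
print(model_info['dynamic_axes'])  # {'pf_features': {0: 'N', 2: 'n_pf'}, 'softmax': {0: 'N'}}
```

The axis name n_pf is built from the prefix of the input group name, so a second group such as sv_features would get n_sv in the same way.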

The output below shows the full structure of the MLP network printed by PyTorch. You will see it in the Weaver output during the training.

The full-scale structure of the MLP network
MultiLayerPerceptron(\n  |0.739 M, 100.000% Params, 0.001 GMac, 100.000% MACs|\n  (mlp): Sequential(\n    |0.739 M, 100.000% Params, 0.001 GMac, 100.000% MACs|\n    (0): Sequential(\n      |0.411 M, 55.540% Params, 0.0 GMac, 55.563% MACs|\n      (0): Linear(in_features=400, out_features=1024, bias=True, |0.411 M, 55.540% Params, 0.0 GMac, 55.425% MACs|)\n      (1): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.138% MACs|)\n    )\n    (1): Sequential(\n      |0.262 M, 35.492% Params, 0.0 GMac, 35.452% MACs|\n      (0): Linear(in_features=1024, out_features=256, bias=True, |0.262 M, 35.492% Params, 0.0 GMac, 35.418% MACs|)\n      (1): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.035% MACs|)\n    )\n    (2): Sequential(\n      |0.066 M, 8.899% Params, 0.0 GMac, 8.915% MACs|\n      (0): Linear(in_features=256, out_features=256, bias=True, |0.066 M, 8.899% Params, 0.0 GMac, 8.880% MACs|)\n      (1): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.035% MACs|)\n    )\n    (3): Sequential(\n      |0.001 M, 0.070% Params, 0.0 GMac, 0.070% MACs|\n      (0): Linear(in_features=256, out_features=2, bias=True, |0.001 M, 0.070% Params, 0.0 GMac, 0.069% MACs|)\n      (1): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.000% MACs|)\n    )\n  )\n)\n
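The parameter counts in this printout can be cross-checked by hand. The short sketch below counts the weights and biases of each linear layer for the channel sizes shown above; the helper `linear_params` is introduced here only for illustration.

```python
def linear_params(n_in, n_out):
    # Weights plus biases of a fully connected layer.
    return n_in * n_out + n_out

channels = [400, 1024, 256, 256, 2]  # input_dims, layer_params, num_classes
per_layer = [linear_params(a, b) for a, b in zip(channels[:-1], channels[1:])]

print(per_layer)       # [410624, 262400, 65792, 514]
print(sum(per_layer))  # 739330
```

The total of about 0.739 M parameters agrees with the first line of the printout, and the first layer alone accounts for roughly 55.5% of them because it acts on the flattened input of size 4 * 100 = 400.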

The data card is shown in top_tagging/data/pf_features.yaml. It defines one input group, pf_features, which takes four variables: Etarel, Phirel, E_log, P_log. This is based on our data structure, where these variables are 2D vectors with variable lengths. The length is set to 100, so that the last dimension (the jet constituent dimension) is always truncated or padded to length 100.

MLP data config top_tagging/data/pf_features.yaml

Also see top_tagging/data/pf_features.yaml. A guide to the data configuration card can be found in the Weaver README.

selection:\n   ### use `&`, `|`, `~` for logical operations on numpy arrays\n   ### can use functions from `math`, `np` (numpy), and `awkward` in the expression\n\nnew_variables:\n   ### [format] name: formula\n   ### can use functions from `math`, `np` (numpy), and `awkward` in the expression\n   is_bkg: np.logical_not(is_signal_new)\n\npreprocess:\n   ### method: [manual, auto] - whether to use manually specified parameters for variable standardization\n   method: manual\n   ### data_fraction: fraction of events to use when calculating the mean/scale for the standardization\n   data_fraction:\n\ninputs:\n   pf_features:\n      length: 100\n      vars:\n      ### [format 1]: var_name (no transformation)\n      ### [format 2]: [var_name,\n      ###              subtract_by(optional, default=None, no transf. if preprocess.method=manual, auto transf. if preprocess.method=auto),\n      ###              multiply_by(optional, default=1),\n      ###              clip_min(optional, default=-5),\n      ###              clip_max(optional, default=5),\n      ###              pad_value(optional, default=0)]\n         - Part_Etarel\n         - Part_Phirel\n         - [Part_E_log, 2, 1]\n         - [Part_P_log, 2, 1]\n\nlabels:\n   ### type can be `simple`, `custom`\n   ### [option 1] use `simple` for binary/multi-class classification, then `value` is a list of 0-1 labels\n   type: simple\n   value: [\n      is_signal_new, is_bkg\n      ]\n   ### [option 2] otherwise use `custom` to define the label, then `value` is a map\n   # type: custom\n   # value:\n      # target_mass: np.where(fj_isQCD, fj_genjet_sdmass, fj_gen_mass)\n\nobservers:\n   - origIdx\n   - idx\n   - Part_E_tot\n   - Part_PX_tot\n   - Part_PY_tot\n   - Part_PZ_tot\n   - Part_P_tot\n   - Part_Eta_tot\n   - Part_Phi_tot\n\n# weights:\n   ### [option 1] use precomputed weights stored in the input files\n   # use_precomputed_weights: true\n   # weight_branches: [weight, class_weight]\n   ### [option 2] compute weights on-the-fly using reweighting histograms\n
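To illustrate how an entry such as [Part_E_log, 2, 1] together with length: 100 acts on one jet, here is a minimal sketch in plain Python. The function `prepare` is hypothetical; it follows the option names and defaults listed in the comments of the card, but the exact order of operations inside Weaver is an assumption.

```python
def prepare(values, length, subtract_by=0.0, multiply_by=1.0,
            clip_min=-5.0, clip_max=5.0, pad_value=0.0):
    # Shift, scale and clip each value, then truncate or pad to a fixed length.
    out = [min(max((v - subtract_by) * multiply_by, clip_min), clip_max) for v in values]
    out = out[:length]
    return out + [pad_value] * (length - len(out))

# Three constituents padded to length 5: the last real value is clipped at 5.
print(prepare([3.5, 2.0, 9.0], length=5, subtract_by=2, multiply_by=1))
# [1.5, 0.0, 5.0, 0.0, 0.0]

# The same jet truncated to length 2.
print(prepare([3.5, 2.0, 9.0], length=2, subtract_by=2, multiply_by=1))
# [1.5, 0.0]
```

In the benchmark card the length is 100, so every jet contributes exactly 100 slots per variable regardless of its true number of constituents.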

In the following two models (i.e., the DeepAK8 and the ParticleNet model) you will see that the data card is very similar. The only change is the way we present the input group(s).

The full architecture of the DeepAK8 model, which is based on 1D CNN with ResNet architecture.

Note

The DeepAK8 tagger is a widely used highly-boosted jet tagger in the CMS community. The design of the model can be found in the CMS paper [arXiv:2004.08262]. The original model was trained with MXNet and its configuration can be found here.

We now migrate the model architecture to Weaver and train it with PyTorch. We also reduce the multi-class output to a binary output to suit our binary classification task (top vs. QCD jet).

The model card is given in top_tagging/networks/deepak8_pf.py. The DeepAK8 model is inspired by the ResNet architecture. The key ingredient is the ResNet unit, built from multiple CNN layers with a shortcut connection. First, we define the ResNet unit in the model card.

ResNet unit implementation

See top_tagging/networks/deepak8_pf.py. We elaborate here on several aspects.

  • A ResNet unit is made of two 1D convolutional layers, each followed by batch normalization and a ReLU activation function.
  • The shortcut is introduced here by directly adding the input data to the processed data after it has passed through the CNN layers. The shortcut connection helps ease the training of \"deeper\" models [arXiv:1512.03385]. Note that a simple linear transformation (self.conv_sc, a convolution with kernel size 1) is applied to the input if the dimensions of the input and output data do not match.
import torch.nn as nn\n\n\nclass ResNetUnit(nn.Module):\n    r\"\"\"Parameters\n    ----------\n    in_channels : int\n        Number of channels in the input vectors.\n    out_channels : int\n        Number of channels in the output vectors.\n    strides: tuple\n        Strides of the two convolutional layers, in the form of (stride0, stride1)\n    \"\"\"\n\n    def __init__(self, in_channels, out_channels, strides=(1,1), **kwargs):\n\n        super(ResNetUnit, self).__init__(**kwargs)\n        self.conv1 = nn.Conv1d(in_channels, out_channels, kernel_size=3, stride=strides[0], padding=1)\n        self.bn1 = nn.BatchNorm1d(out_channels)\n        self.conv2 = nn.Conv1d(out_channels, out_channels, kernel_size=3, stride=strides[1], padding=1)\n        self.bn2 = nn.BatchNorm1d(out_channels)\n        self.relu = nn.ReLU()\n        self.dim_match = True\n        if in_channels != out_channels or tuple(strides) != (1, 1): # dimensions do not match\n            self.dim_match = False\n            # kernel-size-1 convolution that maps the input onto the output shape\n            self.conv_sc = nn.Conv1d(in_channels, out_channels, kernel_size=1, stride=strides[0]*strides[1], bias=False)\n\n    def forward(self, x):\n        identity = x\n        x = self.conv1(x)\n        x = self.bn1(x)\n        x = self.relu(x)\n        x = self.conv2(x)\n        x = self.bn2(x)\n        x = self.relu(x)\n        # print('resnet unit', identity.shape, x.shape, self.dim_match)\n        if self.dim_match:\n            return identity + x\n        else:\n            return self.conv_sc(identity) + x\n

With the ResNet unit, we construct the DeepAK8 model. The model hyperparameters are chosen as follows.

conv_params = [(32,), (64, 64), (64, 64), (128, 128)]\nfc_params = [(512, 0.2)]\n

DeepAK8 model implementation

See top_tagging/networks/deepak8_pf.py. Note that the main architecture is a PyTorch re-implementation of the code here, which is based on MXNet.

class ResNet(nn.Module):\n    r\"\"\"Parameters\n    ----------\n    features_dims : int\n        Input feature dimensions.\n    num_classes : int\n        Number of output classes.\n    conv_params : list\n        List of the convolution layer parameters.\n        The first element is a tuple of size 1, defining the transformed feature size for the initial feature convolution layer.\n        The following are tuples of feature size for multiple stages of the ResNet units. Each number defines an individual ResNet unit.\n    fc_params: list\n        List of fully connected layer parameters after all ResNet units, each element in the format of\n        (n_feat, drop_rate)\n    \"\"\"\n\n    def __init__(self, features_dims, num_classes,\n                conv_params=[(32,), (64, 64), (64, 64), (128, 128)],\n                fc_params=[(512, 0.2)],\n                **kwargs):\n\n        super(ResNet, self).__init__(**kwargs)\n        self.conv_params = conv_params\n        self.num_stages = len(conv_params) - 1\n        self.fts_conv = nn.Sequential(nn.Conv1d(in_channels=features_dims, out_channels=conv_params[0][0], kernel_size=3, stride=1, padding=1),\n                                    nn.BatchNorm1d(conv_params[0][0]),\n                                    nn.ReLU())\n\n        # define ResNet units for each stage. Each stage is composed of a sequence of ResNetUnit blocks\n        self.resnet_units = nn.ModuleDict()\n        for i in range(self.num_stages):\n            # stack units[i] layers in this stage\n            unit_layers = []\n            for j in range(len(conv_params[i + 1])):\n                in_channels, out_channels = (conv_params[i][-1], conv_params[i + 1][0]) if j == 0 \\\n                                            else (conv_params[i + 1][j - 1], conv_params[i + 1][j])\n                strides = (2, 1) if (j == 0 and i > 0) else (1, 1)\n                unit_layers.append(ResNetUnit(in_channels, out_channels, strides))\n\n            self.resnet_units.add_module('resnet_unit_%d' % i, nn.Sequential(*unit_layers))\n\n        # define fully connected layers\n        fcs = []\n        for idx, layer_param in enumerate(fc_params):\n            channels, drop_rate = layer_param\n            in_chn = conv_params[-1][-1] if idx == 0 else fc_params[idx - 1][0]\n            fcs.append(nn.Sequential(nn.Linear(in_chn, channels), nn.ReLU(), nn.Dropout(drop_rate)))\n        fcs.append(nn.Linear(fc_params[-1][0], num_classes))\n        self.fc = nn.Sequential(*fcs)\n\n    def forward(self, x):\n        # x: the feature vector, (N, C, P)\n        x = self.fts_conv(x)\n        for i in range(self.num_stages):\n            x = self.resnet_units['resnet_unit_%d' % i](x) # (N, C', P'), P' < P when stride > 1\n\n        # global average pooling\n        x = x.sum(dim=-1) / x.shape[-1] # (N, C')\n        # fully connected\n        x = self.fc(x) # (N, out_chn)\n        return x\n\n\ndef get_model(data_config, **kwargs):\n    conv_params = [(32,), (64, 64), (64, 64), (128, 128)]\n    fc_params = [(512, 0.2)]\n\n    pf_features_dims = len(data_config.input_dicts['pf_features'])\n    num_classes = len(data_config.label_value)\n    model = ResNet(pf_features_dims, num_classes,\n                conv_params=conv_params,\n                fc_params=fc_params)\n\n    model_info = {\n        'input_names':list(data_config.input_names),\n        'input_shapes':{k:((1,) + s[1:]) for k, s in data_config.input_shapes.items()},\n        'output_names':['softmax'],\n        'dynamic_axes':{**{k:{0:'N', 2:'n_' + k.split('_')[0]} for k in data_config.input_names}, **{'softmax':{0:'N'}}},\n        }\n\n    print(model, model_info)\n    print(data_config.input_shapes)\n    return model, model_info\n\n\ndef get_loss(data_config, **kwargs):\n    return torch.nn.CrossEntropyLoss()\n
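To see how the constituent dimension P shrinks through the stages, one can apply the standard output-length formula of a 1D convolution (with dilation 1) to the strides used above. This is a small sketch in plain Python; the helper `conv1d_out_length` is introduced here for illustration, and the starting length of 100 is the value set in the data card.

```python
def conv1d_out_length(length, kernel_size, stride, padding):
    # Output length of a 1D convolution with dilation 1.
    return (length + 2 * padding - kernel_size) // stride + 1

length = 100  # constituents per jet after padding/truncation
length = conv1d_out_length(length, kernel_size=3, stride=1, padding=1)  # fts_conv

stage_lengths = []
for first_stride in (1, 2, 2):  # stride of the first convolution in each stage
    length = conv1d_out_length(length, 3, first_stride, 1)
    # All other convolutions in the stage use kernel 3, stride 1, padding 1,
    # which leaves the length unchanged.
    stage_lengths.append(length)

print(stage_lengths)  # [100, 50, 25]

# The kernel-size-1 shortcut with stride 2 produces the same length as the main path.
print(conv1d_out_length(100, kernel_size=1, stride=2, padding=0))  # 50
```

Only the strided convolutions reduce the length; with kernel size 3 and padding 1, a stride-1 convolution keeps it fixed, which is why the shortcut and the main path can be added element-wise.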

The output below shows the full structure of the DeepAK8 model based on 1D CNN with ResNet. It is printed by PyTorch and you will see it in the Weaver output during training.

The full-scale structure of the DeepAK8 architecture
ResNet(\n  |0.349 M, 100.000% Params, 0.012 GMac, 100.000% MACs|\n  (fts_conv): Sequential(\n    |0.0 M, 0.137% Params, 0.0 GMac, 0.427% MACs|\n    (0): Conv1d(4, 32, kernel_size=(3,), stride=(1,), padding=(1,), |0.0 M, 0.119% Params, 0.0 GMac, 0.347% MACs|)\n    (1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.018% Params, 0.0 GMac, 0.053% MACs|)\n    (2): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.027% MACs|)\n  )\n  (resnet_units): ModuleDict(\n    |0.282 M, 80.652% Params, 0.012 GMac, 99.010% MACs|\n    (resnet_unit_0): Sequential(\n      |0.046 M, 13.124% Params, 0.005 GMac, 38.409% MACs|\n      (0): ResNetUnit(\n        |0.021 M, 5.976% Params, 0.002 GMac, 17.497% MACs|\n        (conv1): Conv1d(32, 64, kernel_size=(3,), stride=(1,), padding=(1,), |0.006 M, 1.778% Params, 0.001 GMac, 5.175% MACs|)\n        (bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.037% Params, 0.0 GMac, 0.107% MACs|)\n        (conv2): Conv1d(64, 64, kernel_size=(3,), stride=(1,), padding=(1,), |0.012 M, 3.538% Params, 0.001 GMac, 10.296% MACs|)\n        (bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.037% Params, 0.0 GMac, 0.107% MACs|)\n        (relu): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.107% MACs|)\n        (conv_sc): Conv1d(32, 64, kernel_size=(1,), stride=(1,), bias=False, |0.002 M, 0.587% Params, 0.0 GMac, 1.707% MACs|)\n      )\n      (1): ResNetUnit(\n        |0.025 M, 7.149% Params, 0.003 GMac, 20.912% MACs|\n        (conv1): Conv1d(64, 64, kernel_size=(3,), stride=(1,), padding=(1,), |0.012 M, 3.538% Params, 0.001 GMac, 10.296% MACs|)\n        (bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.037% Params, 0.0 GMac, 0.107% MACs|)\n        (conv2): Conv1d(64, 64, kernel_size=(3,), stride=(1,), padding=(1,), |0.012 M, 3.538% Params, 0.001 GMac, 10.296% MACs|)\n        (bn2): BatchNorm1d(64, 
eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.037% Params, 0.0 GMac, 0.107% MACs|)\n        (relu): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.107% MACs|)\n      )\n    )\n    (resnet_unit_1): Sequential(\n      |0.054 M, 15.471% Params, 0.003 GMac, 22.619% MACs|\n      (0): ResNetUnit(\n        |0.029 M, 8.322% Params, 0.001 GMac, 12.163% MACs|\n        (conv1): Conv1d(64, 64, kernel_size=(3,), stride=(2,), padding=(1,), |0.012 M, 3.538% Params, 0.001 GMac, 5.148% MACs|)\n        (bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.037% Params, 0.0 GMac, 0.053% MACs|)\n        (conv2): Conv1d(64, 64, kernel_size=(3,), stride=(1,), padding=(1,), |0.012 M, 3.538% Params, 0.001 GMac, 5.148% MACs|)\n        (bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.037% Params, 0.0 GMac, 0.053% MACs|)\n        (relu): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.053% MACs|)\n        (conv_sc): Conv1d(64, 64, kernel_size=(1,), stride=(2,), bias=False, |0.004 M, 1.173% Params, 0.0 GMac, 1.707% MACs|)\n      )\n      (1): ResNetUnit(\n        |0.025 M, 7.149% Params, 0.001 GMac, 10.456% MACs|\n        (conv1): Conv1d(64, 64, kernel_size=(3,), stride=(1,), padding=(1,), |0.012 M, 3.538% Params, 0.001 GMac, 5.148% MACs|)\n        (bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.037% Params, 0.0 GMac, 0.053% MACs|)\n        (conv2): Conv1d(64, 64, kernel_size=(3,), stride=(1,), padding=(1,), |0.012 M, 3.538% Params, 0.001 GMac, 5.148% MACs|)\n        (bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.037% Params, 0.0 GMac, 0.053% MACs|)\n        (relu): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.053% MACs|)\n      )\n    )\n    (resnet_unit_2): Sequential(\n      |0.182 M, 52.057% Params, 0.005 GMac, 37.982% MACs|\n      (0): ResNetUnit(\n        |0.083 M, 23.682% Params, 0.002 GMac, 
17.284% MACs|\n        (conv1): Conv1d(64, 128, kernel_size=(3,), stride=(2,), padding=(1,), |0.025 M, 7.075% Params, 0.001 GMac, 5.148% MACs|)\n        (bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.073% Params, 0.0 GMac, 0.053% MACs|)\n        (conv2): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=(1,), |0.049 M, 14.114% Params, 0.001 GMac, 10.269% MACs|)\n        (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.073% Params, 0.0 GMac, 0.053% MACs|)\n        (relu): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.053% MACs|)\n        (conv_sc): Conv1d(64, 128, kernel_size=(1,), stride=(2,), bias=False, |0.008 M, 2.346% Params, 0.0 GMac, 1.707% MACs|)\n      )\n      (1): ResNetUnit(\n        |0.099 M, 28.375% Params, 0.002 GMac, 20.698% MACs|\n        (conv1): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=(1,), |0.049 M, 14.114% Params, 0.001 GMac, 10.269% MACs|)\n        (bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.073% Params, 0.0 GMac, 0.053% MACs|)\n        (conv2): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=(1,), |0.049 M, 14.114% Params, 0.001 GMac, 10.269% MACs|)\n        (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.073% Params, 0.0 GMac, 0.053% MACs|)\n        (relu): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.053% MACs|)\n      )\n    )\n  )\n  (fc): Sequential(\n    |0.067 M, 19.210% Params, 0.0 GMac, 0.563% MACs|\n    (0): Sequential(\n      |0.066 M, 18.917% Params, 0.0 GMac, 0.555% MACs|\n      (0): Linear(in_features=128, out_features=512, bias=True, |0.066 M, 18.917% Params, 0.0 GMac, 0.551% MACs|)\n      (1): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.004% MACs|)\n      (2): Dropout(p=0.2, inplace=False, |0.0 M, 0.000% Params, 0.0 GMac, 0.000% MACs|)\n    )\n    (1): Linear(in_features=512, out_features=2, bias=True, |0.001 M, 
0.294% Params, 0.0 GMac, 0.009% MACs|)\n  )\n)\n

The data card is the same as the MLP case, shown in top_tagging/data/pf_features.yaml.

The full architecture of the ParticleNet model, which is based on DGCNN and EdgeConv.

Note

The ParticleNet model applied to the CMS analysis is provided in weaver/networks/particle_net_pf_sv.py, and the data card in weaver/data/ak15_points_pf_sv.yaml. Here we use a similar configuration card to deal with the benchmark task.

We will elaborate on the ParticleNet model and focus more on the technical side in this section. The model is defined in top_tagging/networks/particlenet_pf.py, but it imports a building block, the EdgeConv block, from weaver/utils/nn/model/ParticleNet.py. The EdgeConv block is illustrated in the cartoon.

Illustration of the EdgeConv block

From an EdgeConv block's point of view, it requires two classes of features as input: the \"coordinates\" and the \"features\". These are per-point properties, in a 2D shape with dimensions (C, P), where C is the size of the features (the feature sizes of the \"coordinates\" and the \"features\" can be different, marked as C_pts and C_fts in the following code), and P is the number of points. The block outputs the new features that the model learns, also in a 2D shape with dimensions (C_fts_out, P).

What happens inside the EdgeConv block? And how is the output feature vector derived from the input features using the topology of the point cloud? The answer is encoded in the edge convolution (EdgeConv).

The edge convolution is an analogue of the convolution defined on a point cloud, whose shape is given by the \"coordinates\" of the points. Specifically, the input \"coordinates\" provide a view of the spatial relations of the points in Euclidean space. They determine the k nearest neighbouring points of each point, which guide the update of the feature vector of that point. For each point, the updated feature vector is based on the current state of the point and its k neighbours. Guided by this spirit, all features of the point cloud form a 3D vector with dimensions (C, P, K), where C is the per-point feature size (e.g., \u03b7, \u03c6, pT, ...), P is the number of points, and K is the k-NN number. The structured vector is linearly transformed by applying a 2D CNN on the feature dimension C. This helps to aggregate the feature information and exploit the correlations of each point with its adjacent points. A shortcut connection is also introduced, inspired by the ResNet.

Note

The feature dimension C after exploring the k neighbours of each point is actually twice the initial feature dimension. Here, a new set of features is constructed by subtracting the features a point carries from the features its k neighbours carry (namely xi_j \u2013 xi for point i, and j=1,...,k), and these differences are combined with the features of the point itself. This way, the correlations of each point with its neighbours are well captured.
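The construction of the doubled feature dimension can be sketched with NumPy for a single jet. This is only an illustration under stated assumptions: the function `edge_features` is hypothetical, it works on one point cloud instead of a mini-batch, and it uses the neighbour-minus-centre sign convention; Weaver's own helpers operate on PyTorch tensors.

```python
import numpy as np

def edge_features(points, features, k):
    # points: (P, C_pts), features: (P, C_fts); returns an array of shape (2 * C_fts, P, k).
    dist = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)  # (P, P)
    # Sort by distance and drop column 0, which is the point itself.
    idx = np.argsort(dist, axis=1)[:, 1:k + 1]                            # (P, k)
    centre = np.repeat(features[:, None, :], k, axis=1)                   # (P, k, C_fts)
    neighbours = features[idx]                                            # (P, k, C_fts)
    # Concatenate the point's own features with the differences to its neighbours.
    edge = np.concatenate([centre, neighbours - centre], axis=-1)         # (P, k, 2 * C_fts)
    return edge.transpose(2, 0, 1)

rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 2))   # e.g. relative eta and phi of 6 points
fts = rng.normal(size=(6, 3))   # 3 features per point
out = edge_features(pts, fts, k=4)
print(out.shape)  # (6, 6, 4)
```

The first C_fts channels repeat the features of the central point for each of its k neighbours, and the remaining C_fts channels hold the differences, which gives the (2 * C_fts, P, K) layout that the convolutions act on.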

The following shows how the EdgeConv structure is implemented in the code.

EdgeConv block implementation

See weaver/utils/nn/model/ParticleNet.py, or the following code block annotated with more comments. We elaborate here on several aspects.

  • The EdgeConvBlock takes the feature dimensions in_feat and out_feats, which correspond to the C_fts and C_fts_out we introduced above.
  • The input data vectors to forward() are the \"coordinates\" and \"features\" vectors, with dimensions (N, C_pts, P) and (N, C_fts, P) respectively, as introduced above. The first dimension is the mini-batch size.
  • self.get_graph_feature() helps to aggregate the k nearest neighbours of each point. The resulting vector has dimensions (N, C_fts(0), P, K) as we discussed above, K being the k-NN number. Note that C_fts(0) is twice the original input feature dimension C_fts, as mentioned above.
  • After the convolutions, the per-point features are merged by taking the mean over all k nearest neighbouring vectors:
    fts = x.mean(dim=-1)  # (N, C, P)\n
class EdgeConvBlock(nn.Module):\n    r\"\"\"EdgeConv layer.\n    Introduced in \"`Dynamic Graph CNN for Learning on Point Clouds\n    <https://arxiv.org/pdf/1801.07829>`__\".  Can be described as follows:\n    .. math::\n    x_i^{(l+1)} = \\max_{j \\in \\mathcal{N}(i)} \\mathrm{ReLU}(\n    \\Theta \\cdot (x_j^{(l)} - x_i^{(l)}) + \\Phi \\cdot x_i^{(l)})\n    where :math:`\\mathcal{N}(i)` is the neighbor of :math:`i`.\n    Note that this implementation aggregates over the neighbours with a mean instead of a max.\n    Parameters\n    ----------\n    k : int\n        Number of nearest neighbours.\n    in_feat : int\n        Input feature size.\n    out_feats : list of int\n        Output feature size of each convolution layer.\n    batch_norm : bool\n        Whether to include batch normalization on messages.\n    \"\"\"\n\n    def __init__(self, k, in_feat, out_feats, batch_norm=True, activation=True, cpu_mode=False):\n        super(EdgeConvBlock, self).__init__()\n        self.k = k\n        self.batch_norm = batch_norm\n        self.activation = activation\n        self.num_layers = len(out_feats)\n        # knn and get_graph_feature_v1/v2 are helper functions defined in the same file\n        self.get_graph_feature = get_graph_feature_v2 if cpu_mode else get_graph_feature_v1\n\n        self.convs = nn.ModuleList()\n        for i in range(self.num_layers):\n            self.convs.append(nn.Conv2d(2 * in_feat if i == 0 else out_feats[i - 1], out_feats[i], kernel_size=1, bias=False if self.batch_norm else True))\n\n        if batch_norm:\n            self.bns = nn.ModuleList()\n            for i in range(self.num_layers):\n                self.bns.append(nn.BatchNorm2d(out_feats[i]))\n\n        if activation:\n            self.acts = nn.ModuleList()\n            for i in range(self.num_layers):\n                self.acts.append(nn.ReLU())\n\n        if in_feat == out_feats[-1]:\n            self.sc = None\n        else:\n            self.sc = nn.Conv1d(in_feat, out_feats[-1], kernel_size=1, bias=False)\n            self.sc_bn = nn.BatchNorm1d(out_feats[-1])\n\n        if activation:\n            self.sc_act = nn.ReLU()\n\n    def forward(self, points, features):\n        # points:   (N, C_pts, P)\n        # features: (N, C_fts, P)\n        # N: batch size, C: feature size per point, P: number of points\n\n        topk_indices = knn(points, self.k) # (N, P, K)\n        x = self.get_graph_feature(features, self.k, topk_indices) # (N, C_fts(0), P, K)\n\n        for i, conv in enumerate(self.convs):\n            x = conv(x)  # (N, C', P, K)\n            if self.batch_norm:\n                x = self.bns[i](x)\n            if self.activation:\n                x = self.acts[i](x)\n\n        fts = x.mean(dim=-1)  # (N, C, P)\n\n        # shortcut\n        if self.sc is not None:\n            sc = self.sc(features)  # (N, C_out, P)\n            sc = self.sc_bn(sc)\n        else:\n            sc = features\n\n        out = sc + fts  # (N, C_out, P)\n        return self.sc_act(out) if self.activation else out\n

With the EdgeConv architecture as the building block, the ParticleNet model is constructed as follows.

The ParticleNet model stacks three EdgeConv blocks to construct higher-level features and pass them through the pipeline. The points (i.e., in our case, the particle candidates inside a jet) do not change, but the per-point \"coordinates\" and \"features\" vectors change, in both values and dimensions.

For the first EdgeConv block, the \"coordinates\" only include the relative \u03b7 and \u03c6 values of each particle. The \"features\" are a vector with a standard length of 32, which is linearly transformed from the initial feature vectors including the components of relative \u03b7, \u03c6, the log of pT, etc. The first EdgeConv block outputs a per-point feature vector of length 64, which is taken as both the \"coordinates\" and the \"features\" of the next EdgeConv block. That is to say, the next k-NN is applied in the 64-dimensional feature space to capture the new relations between points learned by the model. This is visualized by the input/output arrows showing the data flow of the model. This architecture illustrates the stackability of the EdgeConv block, which is the core of the Dynamic Graph CNN (DGCNN): the model can dynamically change the neighborhood of each point based on learnable features.
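The idea that the neighborhood graph is rebuilt in a learned space can be illustrated with a small sketch. It is not the Weaver implementation: the helper `knn_indices` and the toy numbers are hypothetical, and the learned features are reduced to 2D for readability.

```python
import math

def knn_indices(coords, k):
    # For each point, return the indices of its k nearest neighbours
    # (excluding itself), using Euclidean distance in the space of `coords`.
    result = []
    for i, ci in enumerate(coords):
        dists = sorted((math.dist(ci, cj), j) for j, cj in enumerate(coords) if j != i)
        result.append([j for _, j in dists[:k]])
    return result

# Four toy particles: (eta_rel, phi_rel) coordinates, as used by the first block.
eta_phi = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0)]
# Hypothetical learned features standing in for the first block's output.
learned = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.1), (5.1, 5.0)]

first_graph = knn_indices(eta_phi, k=1)    # neighbours in the eta-phi plane
second_graph = knn_indices(learned, k=1)   # neighbours in the learned space

print(first_graph)
print(second_graph)
```

In this toy case, particles that are close in the eta-phi plane are no longer each other's nearest neighbours once the distance is measured in the learned feature space, which is what allows later blocks to connect particles that are far apart geometrically.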

A fusion technique is also used: the three EdgeConv output vectors are concatenated (so their dimensions add up), instead of using only the last EdgeConv output, to form the output vector. This is also a form of shortcut connection that helps to ease the training of a complex and deep convolutional network.

The concatenated vectors per point are then averaged over the points to produce a single 1D vector for the whole point cloud. The vector passes through one fully connected layer, with a dropout rate of p=0.1 to prevent overfitting. Then, in our example, the full network outputs two scores after a softmax, corresponding to the one-hot encoding of the top vs. QCD classes.
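Because jets are zero-padded to a fixed number of points, the average over points is best taken over the real particles only, which is what the `use_counts` option in the implementation does. The following is a minimal sketch of that idea in plain Python; the function name `masked_mean` and the toy values are illustrative assumptions, not part of Weaver.

```python
def masked_mean(features, mask):
    # features: per-point feature vectors, padded points included
    # mask: 1 for a real particle, 0 for a padded slot
    count = max(sum(mask), 1)  # guard against a jet with no real particles
    n_feat = len(features[0])
    return [sum(f[c] * m for f, m in zip(features, mask)) / count
            for c in range(n_feat)]

# Two real particles followed by two padded slots.
features = [[2.0, 4.0], [4.0, 8.0], [0.0, 0.0], [0.0, 0.0]]
mask = [1, 1, 0, 0]

pooled = masked_mean(features, mask)
# A plain mean over all slots is diluted by the padding.
naive = [sum(f[c] for f in features) / len(features) for c in range(2)]

print(pooled, naive)
```

Dividing by the real particle count keeps the pooled vector independent of how much padding a jet carries.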

The ParticleNet implementation is shown below.

ParticleNet model implementation

See weaver/utils/nn/model/ParticleNet.py, or the following code block annotated with more comments. We elaborate here on several main points.

  • The stack of multiple EdgeConv blocks is implemented in
    for idx, conv in enumerate(self.edge_convs):\n    pts = (points if idx == 0 else fts) + coord_shift\n    fts = conv(pts, fts) * mask\n
  • The parameters of the EdgeConv blocks are given by conv_params, which takes a list of tuples, each in the format (K, (C1, C2, C3)): K is the number of nearest neighbors used in the k-NN, and C1, C2, C3 are the convolution feature sizes of the three layers in an EdgeConv block.
  • The fully connected layer parameters are given by fc_params, which takes a list of tuples, each tuple in the format of (n_feat, drop_rate).
class ParticleNet(nn.Module):\nr\"\"\"Parameters\n    ----------\n    input_dims : int\n        Input feature dimensions (C_fts).\n    num_classes : int\n        Number of output classes.\n    conv_params : list\n        List of convolution parameters of EdgeConv blocks, each element in the format of (K, (C1, C2, C3)).\n        K for the kNN number, C1,2,3 for convolution feature sizes of three layers in an EdgeConv block.\n    fc_params: list\n        List of fully connected layer parameters after all EdgeConv blocks, each element in the format of\n        (n_feat, drop_rate)\n    use_fusion: bool\n        If true, concatenates all output features from each EdgeConv before the fully connected layer.\n    use_fts_bn: bool\n        If true, applies a batch norm before feeding to the EdgeConv block.\n    use_counts: bool\n        If true, uses the real count of points instead of the padded size (the max point size).\n    for_inference: bool\n        Whether this is an inference routine. If true, applies a softmax to the output.\n    for_segmentation: bool\n        Whether the model is set up for the point cloud segmentation (instead of classification) task. 
If true,\n        does not merge the features after the last EdgeConv, and apply Conv1D instead of the linear layer.\n        The output is hence each output_features per point, instead of output_features.\n    \"\"\"\n\n\n    def __init__(self,\n                input_dims,\n                num_classes,\n                conv_params=[(7, (32, 32, 32)), (7, (64, 64, 64))],\n                fc_params=[(128, 0.1)],\n                use_fusion=True,\n                use_fts_bn=True,\n                use_counts=True,\n                for_inference=False,\n                for_segmentation=False,\n                **kwargs):\n        super(ParticleNet, self).__init__(**kwargs)\n\n        self.use_fts_bn = use_fts_bn\n        if self.use_fts_bn:\n            self.bn_fts = nn.BatchNorm1d(input_dims)\n\n        self.use_counts = use_counts\n\n        self.edge_convs = nn.ModuleList()\n        for idx, layer_param in enumerate(conv_params):\n            k, channels = layer_param\n            in_feat = input_dims if idx == 0 else conv_params[idx - 1][1][-1]\n            self.edge_convs.append(EdgeConvBlock(k=k, in_feat=in_feat, out_feats=channels, cpu_mode=for_inference))\n\n        self.use_fusion = use_fusion\n        if self.use_fusion:\n            in_chn = sum(x[-1] for _, x in conv_params)\n            out_chn = np.clip((in_chn // 128) * 128, 128, 1024)\n            self.fusion_block = nn.Sequential(nn.Conv1d(in_chn, out_chn, kernel_size=1, bias=False), nn.BatchNorm1d(out_chn), nn.ReLU())\n\n        self.for_segmentation = for_segmentation\n\n        fcs = []\n        for idx, layer_param in enumerate(fc_params):\n            channels, drop_rate = layer_param\n            if idx == 0:\n                in_chn = out_chn if self.use_fusion else conv_params[-1][1][-1]\n            else:\n                in_chn = fc_params[idx - 1][0]\n            if self.for_segmentation:\n                fcs.append(nn.Sequential(nn.Conv1d(in_chn, channels, kernel_size=1, bias=False),\n        
                                nn.BatchNorm1d(channels), nn.ReLU(), nn.Dropout(drop_rate)))\n            else:\n                fcs.append(nn.Sequential(nn.Linear(in_chn, channels), nn.ReLU(), nn.Dropout(drop_rate)))\n        if self.for_segmentation:\n            fcs.append(nn.Conv1d(fc_params[-1][0], num_classes, kernel_size=1))\n        else:\n            fcs.append(nn.Linear(fc_params[-1][0], num_classes))\n        self.fc = nn.Sequential(*fcs)\n\n        self.for_inference = for_inference\n\n    def forward(self, points, features, mask=None):\n#         print('points:\\n', points)\n#         print('features:\\n', features)\n        if mask is None:\n            mask = (features.abs().sum(dim=1, keepdim=True) != 0)  # (N, 1, P)\n        points *= mask\n        features *= mask\n        coord_shift = (mask == 0) * 1e9\n        if self.use_counts:\n            counts = mask.float().sum(dim=-1)\n            counts = torch.max(counts, torch.ones_like(counts))  # >=1\n\n        if self.use_fts_bn:\n            fts = self.bn_fts(features) * mask\n        else:\n            fts = features\n        outputs = []\n        for idx, conv in enumerate(self.edge_convs):\n            pts = (points if idx == 0 else fts) + coord_shift\n            fts = conv(pts, fts) * mask\n            if self.use_fusion:\n                outputs.append(fts)\n        if self.use_fusion:\n            fts = self.fusion_block(torch.cat(outputs, dim=1)) * mask\n\n#         assert(((fts.abs().sum(dim=1, keepdim=True) != 0).float() - mask.float()).abs().sum().item() == 0)\n\n        if self.for_segmentation:\n            x = fts\n        else:\n            if self.use_counts:\n                x = fts.sum(dim=-1) / counts  # divide by the real counts\n            else:\n                x = fts.mean(dim=-1)\n\n        output = self.fc(x)\n        if self.for_inference:\n            output = torch.softmax(output, dim=1)\n        # print('output:\\n', output)\n        return output\n

The above encapsulates all ParticleNet building blocks. Finally, the model is defined in the model card top_tagging/networks/particlenet_pf.py, in the ParticleNetTagger1Path class, meaning we only use the ParticleNet pipeline that deals with a single point cloud (i.e., the particle candidates).

Info

Two sets of point clouds in the CMS application, namely the particle-flow candidates and secondary vertices, are used. This requires special handling to merge the clouds before feeding them to the first layer of EdgeConv.

ParticleNet model config

Also see top_tagging/networks/particlenet_pf.py.

import torch\nimport torch.nn as nn\nfrom utils.nn.model.ParticleNet import ParticleNet, FeatureConv\n\n\nclass ParticleNetTagger1Path(nn.Module):\n\n    def __init__(self,\n                pf_features_dims,\n                num_classes,\n                conv_params=[(7, (32, 32, 32)), (7, (64, 64, 64))],\n                fc_params=[(128, 0.1)],\n                use_fusion=True,\n                use_fts_bn=True,\n                use_counts=True,\n                pf_input_dropout=None,\n                for_inference=False,\n                **kwargs):\n        super(ParticleNetTagger1Path, self).__init__(**kwargs)\n        self.pf_input_dropout = nn.Dropout(pf_input_dropout) if pf_input_dropout else None\n        self.pf_conv = FeatureConv(pf_features_dims, 32)\n        self.pn = ParticleNet(input_dims=32,\n                            num_classes=num_classes,\n                            conv_params=conv_params,\n                            fc_params=fc_params,\n                            use_fusion=use_fusion,\n                            use_fts_bn=use_fts_bn,\n                            use_counts=use_counts,\n                            for_inference=for_inference)\n\n    def forward(self, pf_points, pf_features, pf_mask):\n        if self.pf_input_dropout:\n            pf_mask = (self.pf_input_dropout(pf_mask) != 0).float()\n            pf_points *= pf_mask\n            pf_features *= pf_mask\n\n        return self.pn(pf_points, self.pf_conv(pf_features * pf_mask) * pf_mask, pf_mask)\n\n\ndef get_model(data_config, **kwargs):\n    conv_params = [\n        (16, (64, 64, 64)),\n        (16, (128, 128, 128)),\n        (16, (256, 256, 256)),\n        ]\n    fc_params = [(256, 0.1)]\n    use_fusion = True\n\n    pf_features_dims = len(data_config.input_dicts['pf_features'])\n    num_classes = len(data_config.label_value)\n    model = ParticleNetTagger1Path(pf_features_dims, num_classes,\n                            conv_params, fc_params,\n                          
  use_fusion=use_fusion,\n                            use_fts_bn=kwargs.get('use_fts_bn', False),\n                            use_counts=kwargs.get('use_counts', True),\n                            pf_input_dropout=kwargs.get('pf_input_dropout', None),\n                            for_inference=kwargs.get('for_inference', False)\n                            )\n    model_info = {\n        'input_names':list(data_config.input_names),\n        'input_shapes':{k:((1,) + s[1:]) for k, s in data_config.input_shapes.items()},\n        'output_names':['softmax'],\n        'dynamic_axes':{**{k:{0:'N', 2:'n_' + k.split('_')[0]} for k in data_config.input_names}, **{'softmax':{0:'N'}}},\n        }\n\n    print(model, model_info)\n    print(data_config.input_shapes)\n    return model, model_info\n\n\ndef get_loss(data_config, **kwargs):\n    return torch.nn.CrossEntropyLoss()\n

The most important parameters are conv_params and fc_params, which determine the model parameters of the EdgeConv blocks and the fully connected layer. See details in the above \"ParticleNet model implementation\" box.

conv_params = [\n    (16, (64, 64, 64)),\n    (16, (128, 128, 128)),\n    (16, (256, 256, 256)),\n    ]\nfc_params = [(256, 0.1)]\n
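To see how these settings translate into layer sizes, the channel bookkeeping from the `__init__` methods shown earlier can be repeated by hand. The sketch below only mirrors that arithmetic; the helper `edge_conv_shapes` is a hypothetical name, and `input_dims = 32` is taken from the `FeatureConv(pf_features_dims, 32)` call in the model card.

```python
conv_params = [(16, (64, 64, 64)), (16, (128, 128, 128)), (16, (256, 256, 256))]
fc_params = [(256, 0.1)]
input_dims = 32  # output size of the initial FeatureConv

def edge_conv_shapes(input_dims, conv_params):
    # Return, for every EdgeConv block, the (in, out) channels of its conv layers.
    shapes = []
    in_feat = input_dims
    for _k, channels in conv_params:
        layer_in = 2 * in_feat  # edge features stack two in_feat-sized parts
        layers = []
        for out in channels:
            layers.append((layer_in, out))
            layer_in = out
        shapes.append(layers)
        in_feat = channels[-1]  # the next block starts from this block's output
    return shapes

shapes = edge_conv_shapes(input_dims, conv_params)

# Fusion block: concatenate the last output of every EdgeConv block,
# then round down to a multiple of 128, clipped to [128, 1024].
fusion_in = sum(channels[-1] for _, channels in conv_params)
fusion_out = min(max((fusion_in // 128) * 128, 128), 1024)
fc_in, fc_out = fusion_out, fc_params[0][0]

for block in shapes:
    print(block)
print(fusion_in, fusion_out, fc_in, fc_out)
```

The resulting numbers (64 to 64, 128 to 128, 256 to 256 in the EdgeConv blocks, 448 to 384 in the fusion block, 384 to 256 in the fully connected layer) can be compared with the Conv2d, Conv1d and Linear entries of the printed model structure.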

A full structure printed from PyTorch is shown below. It will appear in the Weaver output during training.

ParticleNet full-scale structure
ParticleNetTagger1Path(\n  |0.577 M, 100.000% Params, 0.441 GMac, 100.000% MACs|\n  (pf_conv): FeatureConv(\n    |0.0 M, 0.035% Params, 0.0 GMac, 0.005% MACs|\n    (conv): Sequential(\n      |0.0 M, 0.035% Params, 0.0 GMac, 0.005% MACs|\n      (0): BatchNorm1d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.001% Params, 0.0 GMac, 0.000% MACs|)\n      (1): Conv1d(4, 32, kernel_size=(1,), stride=(1,), bias=False, |0.0 M, 0.022% Params, 0.0 GMac, 0.003% MACs|)\n      (2): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.011% Params, 0.0 GMac, 0.001% MACs|)\n      (3): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.001% MACs|)\n    )\n  )\n  (pn): ParticleNet(\n    |0.577 M, 99.965% Params, 0.441 GMac, 99.995% MACs|\n    (edge_convs): ModuleList(\n      |0.305 M, 52.823% Params, 0.424 GMac, 96.047% MACs|\n      (0): EdgeConvBlock(\n        |0.015 M, 2.575% Params, 0.021 GMac, 4.716% MACs|\n        (convs): ModuleList(\n          |0.012 M, 2.131% Params, 0.02 GMac, 4.456% MACs|\n          (0): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False, |0.004 M, 0.710% Params, 0.007 GMac, 1.485% MACs|)\n          (1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False, |0.004 M, 0.710% Params, 0.007 GMac, 1.485% MACs|)\n          (2): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False, |0.004 M, 0.710% Params, 0.007 GMac, 1.485% MACs|)\n        )\n        (bns): ModuleList(\n          |0.0 M, 0.067% Params, 0.001 GMac, 0.139% MACs|\n          (0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.022% Params, 0.0 GMac, 0.046% MACs|)\n          (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.022% Params, 0.0 GMac, 0.046% MACs|)\n          (2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.022% Params, 0.0 GMac, 0.046% MACs|)\n        )\n        (acts): 
ModuleList(\n          |0.0 M, 0.000% Params, 0.0 GMac, 0.070% MACs|\n          (0): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.023% MACs|)\n          (1): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.023% MACs|)\n          (2): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.023% MACs|)\n        )\n        (sc): Conv1d(32, 64, kernel_size=(1,), stride=(1,), bias=False, |0.002 M, 0.355% Params, 0.0 GMac, 0.046% MACs|)\n        (sc_bn): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.022% Params, 0.0 GMac, 0.003% MACs|)\n        (sc_act): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.001% MACs|)\n      )\n      (1): EdgeConvBlock(\n        |0.058 M, 10.121% Params, 0.081 GMac, 18.437% MACs|\n        (convs): ModuleList(\n          |0.049 M, 8.523% Params, 0.079 GMac, 17.825% MACs|\n          (0): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1), bias=False, |0.016 M, 2.841% Params, 0.026 GMac, 5.942% MACs|)\n          (1): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1), bias=False, |0.016 M, 2.841% Params, 0.026 GMac, 5.942% MACs|)\n          (2): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1), bias=False, |0.016 M, 2.841% Params, 0.026 GMac, 5.942% MACs|)\n        )\n        (bns): ModuleList(\n          |0.001 M, 0.133% Params, 0.001 GMac, 0.279% MACs|\n          (0): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.044% Params, 0.0 GMac, 0.093% MACs|)\n          (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.044% Params, 0.0 GMac, 0.093% MACs|)\n          (2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.044% Params, 0.0 GMac, 0.093% MACs|)\n        )\n        (acts): ModuleList(\n          |0.0 M, 0.000% Params, 0.001 GMac, 0.139% MACs|\n          (0): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.046% MACs|)\n          (1): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.046% MACs|)\n          (2): ReLU(|0.0 M, 
0.000% Params, 0.0 GMac, 0.046% MACs|)\n        )\n        (sc): Conv1d(64, 128, kernel_size=(1,), stride=(1,), bias=False, |0.008 M, 1.420% Params, 0.001 GMac, 0.186% MACs|)\n        (sc_bn): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.0 M, 0.044% Params, 0.0 GMac, 0.006% MACs|)\n        (sc_act): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.003% MACs|)\n      )\n      (2): EdgeConvBlock(\n        |0.231 M, 40.128% Params, 0.322 GMac, 72.894% MACs|\n        (convs): ModuleList(\n          |0.197 M, 34.091% Params, 0.315 GMac, 71.299% MACs|\n          (0): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False, |0.066 M, 11.364% Params, 0.105 GMac, 23.766% MACs|)\n          (1): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False, |0.066 M, 11.364% Params, 0.105 GMac, 23.766% MACs|)\n          (2): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False, |0.066 M, 11.364% Params, 0.105 GMac, 23.766% MACs|)\n        )\n        (bns): ModuleList(\n          |0.002 M, 0.266% Params, 0.002 GMac, 0.557% MACs|\n          (0): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.001 M, 0.089% Params, 0.001 GMac, 0.186% MACs|)\n          (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.001 M, 0.089% Params, 0.001 GMac, 0.186% MACs|)\n          (2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.001 M, 0.089% Params, 0.001 GMac, 0.186% MACs|)\n        )\n        (acts): ModuleList(\n          |0.0 M, 0.000% Params, 0.001 GMac, 0.279% MACs|\n          (0): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.093% MACs|)\n          (1): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.093% MACs|)\n          (2): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.093% MACs|)\n        )\n        (sc): Conv1d(128, 256, kernel_size=(1,), stride=(1,), bias=False, |0.033 M, 5.682% Params, 0.003 GMac, 0.743% MACs|)\n        (sc_bn): BatchNorm1d(256, 
eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.001 M, 0.089% Params, 0.0 GMac, 0.012% MACs|)\n        (sc_act): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.006% MACs|)\n      )\n    )\n    (fusion_block): Sequential(\n      |0.173 M, 29.963% Params, 0.017 GMac, 3.925% MACs|\n      (0): Conv1d(448, 384, kernel_size=(1,), stride=(1,), bias=False, |0.172 M, 29.830% Params, 0.017 GMac, 3.899% MACs|)\n      (1): BatchNorm1d(384, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, |0.001 M, 0.133% Params, 0.0 GMac, 0.017% MACs|)\n      (2): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.009% MACs|)\n    )\n    (fc): Sequential(\n      |0.099 M, 17.179% Params, 0.0 GMac, 0.023% MACs|\n      (0): Sequential(\n        |0.099 M, 17.090% Params, 0.0 GMac, 0.022% MACs|\n        (0): Linear(in_features=384, out_features=256, bias=True, |0.099 M, 17.090% Params, 0.0 GMac, 0.022% MACs|)\n        (1): ReLU(|0.0 M, 0.000% Params, 0.0 GMac, 0.000% MACs|)\n        (2): Dropout(p=0.1, inplace=False, |0.0 M, 0.000% Params, 0.0 GMac, 0.000% MACs|)\n      )\n      (1): Linear(in_features=256, out_features=2, bias=True, |0.001 M, 0.089% Params, 0.0 GMac, 0.000% MACs|)\n    )\n  )\n)\n

The data card is shown in top_tagging/data/pf_points_features.yaml, given in a similar way as in the MLP example. Here we group the inputs into three groups: pf_points, pf_features and pf_mask. They correspond to the forward(self, pf_points, pf_features, pf_mask) prototype of our nn.Module model, which receives these 2D per-jet arrays, stacked into mini-batches, at each iteration during training/prediction.

ParticleNet data config top_tagging/data/pf_points_features.yaml

See top_tagging/data/pf_points_features.yaml.

selection:\n### use `&`, `|`, `~` for logical operations on numpy arrays\n### can use functions from `math`, `np` (numpy), and `awkward` in the expression\n\nnew_variables:\n### [format] name: formula\n### can use functions from `math`, `np` (numpy), and `awkward` in the expression\npf_mask: awkward.JaggedArray.ones_like(Part_E)\nis_bkg: np.logical_not(is_signal_new)\n\npreprocess:\n### method: [manual, auto] - whether to use manually specified parameters for variable standardization\nmethod: manual\n### data_fraction: fraction of events to use when calculating the mean/scale for the standardization\ndata_fraction:\n\ninputs:\npf_points:\nlength: 100\nvars:\n- Part_Etarel\n- Part_Phirel\npf_features:\nlength: 100\nvars:\n### [format 1]: var_name (no transformation)\n### [format 2]: [var_name,\n###              subtract_by(optional, default=None, no transf. if preprocess.method=manual, auto transf. if preprocess.method=auto),\n###              multiply_by(optional, default=1),\n###              clip_min(optional, default=-5),\n###              clip_max(optional, default=5),\n###              pad_value(optional, default=0)]\n- Part_Etarel\n- Part_Phirel\n- [Part_E_log, 2, 1]\n- [Part_P_log, 2, 1]\npf_mask:\nlength: 100\nvars:\n- pf_mask\n\nlabels:\n### type can be `simple`, `custom`\n### [option 1] use `simple` for binary/multi-class classification, then `value` is a list of 0-1 labels\ntype: simple\nvalue: [\nis_signal_new, is_bkg\n]\n### [option 2] otherwise use `custom` to define the label, then `value` is a map\n# type: custom\n# value:\n# target_mass: np.where(fj_isQCD, fj_genjet_sdmass, fj_gen_mass)\n\nobservers:\n- origIdx\n- idx\n- Part_E_tot\n- Part_PX_tot\n- Part_PY_tot\n- Part_PZ_tot\n- Part_P_tot\n- Part_Eta_tot\n- Part_Phi_tot\n\n# weights:\n### [option 1] use precomputed weights stored in the input files\n# use_precomputed_weights: true\n# weight_branches: [weight, class_weight]\n### [option 2] compute weights on-the-fly using reweighting histograms\n

Now we have walked through the detailed description of three networks in their architecture as well as their implementations in Weaver.

Before ending this section, we summarize the three networks on their (1) model and data configuration cards, (2) the number of parameters, and (3) computational complexity in the following table. Note that we'll refer to the shell variables provided here in the following training example.

Model | ${PREFIX} | ${MODEL_CONFIG} | ${DATA_CONFIG} | Parameters | Computational complexity
MLP | mlp | mlp_pf.py | pf_features.yaml | 739k | 0.001 GMac
DeepAK8 (1D CNN) | deepak8 | deepak8_pf.py | pf_features.yaml | 349k | 0.012 GMac
ParticleNet (DGCNN) | particlenet | particlenet_pf.py | pf_points_features.yaml | 577k | 0.441 GMac"},{"location":"inference/particlenet.html#2-start-training","title":"2. Start training!","text":"

Now we train the three neural networks based on the provided model and data configurations.

Here we present three ways of training. Readers who have a local machine with CUDA GPUs can train on the local GPUs; readers who would like to train on CPUs can also follow the local GPU instructions. It is also possible to use GPU resources from lxplus HTCondor or CMS Connect. Please follow the option below that matches your situation.

Train on local GPUs | Use GPUs on lxplus HTCondor | Use GPUs on CMS Connect

The three networks can be trained with a universal script. Enter the weaver base folder and run the following command. Note that ${DATA_CONFIG}, ${MODEL_CONFIG}, and ${PREFIX} refer to the values in the above table for each example, and the placeholder path should be replaced with the correct one.

PREFIX='<prefix-from-table>'\nMODEL_CONFIG='<model-config-from-table>'\nDATA_CONFIG='<data-config-from-table>'\nPATH_TO_SAMPLES='<your-path-to-samples>'\n\npython train.py \\\n --data-train ${PATH_TO_SAMPLES}'/prep/top_train_*.root' \\\n --data-val ${PATH_TO_SAMPLES}'/prep/top_val_*.root' \\\n --fetch-by-file --fetch-step 1 --num-workers 3 \\\n --data-config top_tagging/data/${DATA_CONFIG} \\\n --network-config top_tagging/networks/${MODEL_CONFIG} \\\n --model-prefix output/${PREFIX} \\\n --gpus 0,1 --batch-size 1024 --start-lr 5e-3 --num-epochs 20 --optimizer ranger \\\n --log output/${PREFIX}.train.log\n

Here --gpus 0,1 specifies that the GPUs with device IDs 0 and 1 are used. For training on CPUs, please use --gpus ''.

A detailed description of the training command can be found in Weaver README. Below we will note a few more caveats about the data loading options, though the specific settings will depend on the specifics of the input data.

Caveats on the data loading options

Our goal in data loading is to guarantee that every mini-batch contains an even mixture of the different labels, even though they are not necessarily stored evenly in the files. Besides, we also need to ensure that the on-the-fly loading and preprocessing of data is smooth and does not become a bottleneck of the data delivery pipeline. The total amount of loaded data also needs to be controlled so as not to exhaust the memory. The following guidelines should be used to choose the best options for your use case:

  • In the default case, a small portion of data is loaded from every input file per fetch step, with the proportion set by --fetch-step (default is 0.01). This suits the case where we have multiple classes of input, each class having multiple files (e.g., the real CMS application, where we may have multiple nano_i.root files for different input classes). The strategy gathers the pieces from all input files at each fetch step, shuffles them, and presents the data we need in each regular mini-batch. One can also append --num-workers n, with n being the number of parallel workers that load the data.
  • --fetch-step 1 --num-workers 1. This strategy helps in the case where we have few input files and the different labels are not evenly distributed within them. In the extreme case, we only have 1 file, with all data at the top being one class (signal) and data at the bottom being another class (background), or we have 2 or more files, each containing a specific class. With this option, --fetch-step 1 guarantees that the entire data in the file is loaded and participates in the shuffle. Therefore all classes are safely mixed before being sent to the mini-batch. --num-workers 1 means we only use one worker that takes care of all files, to avoid inconsistent loading speeds of multiple workers (depending on CPUs). This strategy can further be combined with --in-memory so that all data are kept permanently in memory and will not be reloaded every epoch. --fetch-by-file is the option we can use when all input files have a similar structure. See Weaver README:

An alternative approach is the \"file-based\" strategy, which can be enabled with --fetch-by-files. This approach will instead read all events from every file for each step, and it will read m input files (m is set by --fetch-step) before mixing and shuffling the loaded events. This strategy is more suitable when each input file is already a mixture of all types of events (e.g., pre-processed with NNTools), otherwise it may lead to suboptimal training performance. However, a higher data loading speed can generally be achieved with this approach.

Please note that you can test if all data classes are well mixed by printing the truth label in each mini-batch. Also, remember to test if data are loaded just-in-time by monitoring the GPU performance \u2014 if switching the data loading strategy helps improve the GPU efficiency, it means the previous data loader is the bottleneck in the pipeline to deliver and use the data.
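As a rough illustration of the mixing check, the sketch below compares the signal fraction per mini-batch for a file read in storage order and for the same data after a full shuffle. It is a standalone toy in plain Python: the helper `batch_label_fractions` and the label layout are assumptions for illustration, not Weaver code.

```python
import random

def batch_label_fractions(labels, batch_size):
    # Fraction of signal labels (value 1) in each consecutive mini-batch.
    fractions = []
    for start in range(0, len(labels), batch_size):
        batch = labels[start:start + batch_size]
        fractions.append(sum(batch) / len(batch))
    return fractions

# Extreme case from the text: one file with all signal first, then all background.
labels = [1] * 500 + [0] * 500

# Reading in storage order: every batch contains a single class.
unmixed = batch_label_fractions(labels, batch_size=100)

# Loading the whole file and shuffling before batching.
rng = random.Random(0)
shuffled = labels[:]
rng.shuffle(shuffled)
mixed = batch_label_fractions(shuffled, batch_size=100)

print(unmixed)
print(mixed)
```

In a real training, the same kind of per-batch fraction can be computed from the truth labels of each mini-batch; values stuck near 0 or 1 indicate that the chosen loading options do not mix the classes.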

After training, we predict the score on the test datasets using the best model:

PREFIX='<prefix-from-table>'\nMODEL_CONFIG='<model-config-from-table>'\nDATA_CONFIG='<data-config-from-table>'\nPATH_TO_SAMPLES='<your-path-to-samples>'\n\npython train.py --predict \\\n --data-test ${PATH_TO_SAMPLES}'/prep/top_test_*.root' \\\n --num-workers 3 \\\n --data-config top_tagging/data/${DATA_CONFIG} \\\n --network-config top_tagging/networks/${MODEL_CONFIG} \\\n --model-prefix output/${PREFIX}_best_epoch_state.pt \\\n --gpus 0,1 --batch-size 1024 \\\n --predict-output output/${PREFIX}_predict.root\n

On lxplus HTCondor, the GPU(s) can be booked via the request_gpus argument. To get familiar with the GPU service, please refer to the documentation here.

While it is not possible to test the script locally, you can try out the condor_ssh_to_job command to connect to the remote condor machine that runs the jobs. This interesting feature will help you with debugging or monitoring the condor job.

Here we provide an example executable script and the condor submit file for the training and prediction tasks. Create the following two files:

The executable: run.sh

As before, please remember to specify ${DATA_CONFIG}, ${MODEL_CONFIG}, and ${PREFIX} as shown in the above table, and replace the placeholder path with the correct one.

#!/bin/bash\n\nPREFIX=$1\nMODEL_CONFIG=$2\nDATA_CONFIG=$3\nPATH_TO_SAMPLES=$4\nWORKDIR=`pwd`\n\n# Download miniconda\nwget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda_install.sh\nbash miniconda_install.sh -b -p ${WORKDIR}/miniconda\nexport PATH=$WORKDIR/miniconda/bin:$PATH\npip install numpy pandas scikit-learn scipy matplotlib tqdm PyYAML\npip install uproot3 awkward0 lz4 xxhash\npip install tables\npip install onnxruntime-gpu\npip install tensorboard\npip install torch\n\n# CUDA environment setup\nexport PATH=$PATH:/usr/local/cuda-10.2/bin\nexport LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.2/lib64\nexport LIBRARY_PATH=$LIBRARY_PATH:/usr/local/cuda-10.2/lib64\n\n# Clone weaver-benchmark\ngit clone --recursive https://github.com/colizz/weaver-benchmark.git\nln -s ../top_tagging weaver-benchmark/weaver/top_tagging\ncd weaver-benchmark/weaver/\nmkdir output\n\n# Training, using 1 GPU\npython train.py \\\n--data-train ${PATH_TO_SAMPLES}'/prep/top_train_*.root' \\\n--data-val ${PATH_TO_SAMPLES}'/prep/top_val_*.root' \\\n--fetch-by-file --fetch-step 1 --num-workers 3 \\\n--data-config top_tagging/data/${DATA_CONFIG} \\\n--network-config top_tagging/networks/${MODEL_CONFIG} \\\n--model-prefix output/${PREFIX} \\\n--gpus 0 --batch-size 1024 --start-lr 5e-3 --num-epochs 20 --optimizer ranger \\\n--log output/${PREFIX}.train.log\n\n# Predicting score, using 1 GPU\npython train.py --predict \\\n--data-test ${PATH_TO_SAMPLES}'/prep/top_test_*.root' \\\n--num-workers 3 \\\n--data-config top_tagging/data/${DATA_CONFIG} \\\n--network-config top_tagging/networks/${MODEL_CONFIG} \\\n--model-prefix output/${PREFIX}_best_epoch_state.pt \\\n--gpus 0 --batch-size 1024 \\\n--predict-output output/${PREFIX}_predict.root\n\n[ -d \"runs/\" ] && tar -caf output.tar output/ runs/ || tar -caf output.tar output/\n

HTCondor submit file: submit.sub

Modify the arguments line. These are the bash variables PREFIX, MODEL_CONFIG, DATA_CONFIG, PATH_TO_SAMPLES used in the Weaver command. Since the EOS directory is accessible across all condor nodes on lxplus, one may directly specify <your-path-to-samples> as the EOS path provided above. An example is shown in the commented line.

Universe                = vanilla\nexecutable              = run.sh\narguments               = <prefix> <model-config> <data-config> <your-path-to-samples>\n#arguments              = mlp mlp_pf.py pf_features.yaml /eos/user/c/coli/public/weaver-benchmark/top_tagging/samples\noutput                  = job.$(ClusterId).$(ProcId).out\nerror                   = job.$(ClusterId).$(ProcId).err\nlog                     = job.$(ClusterId).log\nshould_transfer_files   = YES\nwhen_to_transfer_output = ON_EXIT_OR_EVICT\ntransfer_output_files   = weaver-benchmark/weaver/output.tar\ntransfer_output_remaps  = \"output.tar = output.$(ClusterId).$(ProcId).tar\"\nrequest_GPUs = 1\nrequest_CPUs = 4\n+MaxRuntime = 604800\nqueue\n

Make the run.sh script executable, then submit the job.

chmod +x run.sh\ncondor_submit submit.sub\n
A tarball will be transferred back with the weaver/output directory where the trained models and the predicted ROOT file are stored.

CMS Connect provides several GPU nodes. One can request to run GPU condor jobs in a similar way as on lxplus. Please refer to the link: https://ci-connect.atlassian.net/wiki/spaces/CMS/pages/80117822/Requesting+GPUs

As the EOS user space may not be accessible from the remote node launched by CMS Connect, one may consider either (1) transferring the input files with condor, or (2) using XRootD to transfer the input files from the EOS space to the condor node, before running the Weaver train command.

"},{"location":"inference/particlenet.html#3-evaluation-of-models","title":"3. Evaluation of models","text":"

In the output folder, we find the trained PyTorch models saved after every epoch and the log file that records the loss and accuracy during the run.

The predict step also produces a predicted ROOT file in the output folder, including the truth label, the predicted score, and several observer variables we provided in the data card. With the predicted ROOT file, we make the ROC curve comparing the performance of the three trained models.

Here is the result from my training:

Model                 AUC     Accuracy   1/eB (@eS=0.3)
MLP                   0.961   0.898      186
DeepAK8 (1D CNN)      0.979   0.927      585
ParticleNet (DGCNN)   0.984   0.936      1030
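The last column of the table, 1/eB at eS=0.3, is the background rejection (the inverse of the background efficiency) at the score threshold that keeps 30% of the signal. The following is a minimal sketch of how such a number could be computed from predicted scores. The function name and the toy scores are hypothetical and are not taken from the training above.

```python
def background_rejection(sig_scores, bkg_scores, sig_eff=0.3):
    '''Return 1/eB at the threshold that keeps a fraction sig_eff of the signal.'''
    # Sort signal scores from most to least signal-like
    ranked = sorted(sig_scores, reverse=True)
    # Number of signal events to keep (at least one)
    n_keep = max(1, int(round(sig_eff * len(ranked))))
    threshold = ranked[n_keep - 1]
    # Background efficiency: fraction of background at or above the threshold
    n_pass = sum(1 for s in bkg_scores if s >= threshold)
    if n_pass == 0:
        return float('inf')
    return len(bkg_scores) / n_pass

# Toy scores, purely illustrative
sig = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]
bkg = [0.88, 0.50, 0.40, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.01]
rejection = background_rejection(sig, bkg, sig_eff=0.3)
print(rejection)
```

With realistic sample sizes one would typically interpolate along the ROC curve rather than pick a single event as the threshold.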

We see that the ParticleNet model shows outstanding performance in this classification task. In addition, the DeepAK8 and ParticleNet results are similar to the benchmark values found in the gDoc. We note that the performance can be further improved with the following tricks:

  • Train an ensemble of models with different initial parametrization. For each event/jet, take the final predicted score as the mean/median of the score ensembles predicted by each model. This is a widely used ML technique to pursue an extra few percent of improvements.
  • Use more input variables for training. We note that in the above training example, only four input variables are used instead of a full suite of input features as done in the ParticleNet paper [arXiv:1902.08570]. Additional variables (e.g. ΔR or log(pT / pT(jet))) can be designed based on the given 4-momenta, and, although providing redundant information in principle, can still help the network fully exploit the point cloud structure and thus do a better discrimination job.
  • The fine-tuning of the model will also bring some performance gain. See details in the next section.
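The ensemble idea from the first bullet can be sketched in a few lines. This is only an illustration: the three score lists are made-up numbers standing in for the per-jet outputs of three independently trained models.

```python
from statistics import mean, median

# Hypothetical scores for three jets from three independently trained models
scores_per_model = [
    [0.91, 0.12, 0.55],  # model 1
    [0.87, 0.20, 0.65],  # model 2
    [0.95, 0.10, 0.30],  # model 3
]

# Transpose so that each entry collects the scores of one jet
scores_per_jet = list(zip(*scores_per_model))

# Final ensemble score per jet, as the mean or the median over models
ensemble_mean = [mean(s) for s in scores_per_jet]
ensemble_median = [median(s) for s in scores_per_jet]
print(ensemble_mean, ensemble_median)
```

The median is less sensitive to a single outlier model, while the mean uses the information from all models.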
"},{"location":"inference/particlenet.html#tuning-the-particlenet-model","title":"Tuning the ParticleNet model","text":"

When it comes to the real application of any DNN model, tuning the hyperparameters is an important path towards better performance. In this section, we provide some tips on tuning the ParticleNet model. For a more detailed discussion of this topic, see the \"validation\" chapter in the documentation.

"},{"location":"inference/particlenet.html#1-choices-on-the-optimizer-and-the-learning-rate","title":"1. Choices on the optimizer and the learning rate","text":"

The optimizer decides how the neural network updates its parameters, and the learning rate controls how much the parameters change in one training iteration.

The learning rate is the most important hyperparameter to choose before the actual training is done. Here we quote a suggested strategy: if you only have the opportunity to optimize one hyperparameter, choose the learning rate. The optimizer is also important, because a wiser strategy usually means avoiding a zig-zagging update route, avoiding falling into local minima, and even adopting different strategies for the fast-changing parameters and the slow ones. Adam (and its several variations) is a widely used optimizer. Another recently developed advanced optimizer is Ranger, which combines RAdam and LookAhead. However, one should note that the few-percent-level improvement from using a different optimizer is likely to be smeared out by an unoptimized learning rate.

The above training scheme uses a starting learning rate of 5e-3 and Ranger as the optimizer. It uses a flat+decay scheduler: the LR starts to decay after 70% of the epochs have been processed, and gradually reduces to 0.01 of its original value as the training nears completion.

First, we note that the current setup is already well optimized. Therefore, by simply reusing the current choice, the training will in general converge to a stable result. But it is always good practice to test several choices of the optimizer and reoptimize the learning rate.
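To make the flat+decay schedule concrete, here is a minimal sketch of a learning rate that stays flat for the first 70% of the epochs and then drops to 0.01 of its starting value by the last epoch. The exponential shape of the decay and the function name are assumptions made for illustration; the exact form used inside Weaver may differ.

```python
def flat_decay_lr(epoch, num_epochs=20, start_lr=5e-3, flat_frac=0.7, final_factor=0.01):
    '''Learning rate for a given 0-based epoch under a flat+decay schedule.'''
    n_flat = int(round(num_epochs * flat_frac))
    if epoch < n_flat:
        return start_lr
    n_decay = num_epochs - n_flat
    # Fraction of the decay phase completed after this epoch (1.0 at the last epoch)
    progress = (epoch - n_flat + 1) / n_decay
    # Assumed exponential decay towards final_factor * start_lr
    return start_lr * final_factor ** progress

schedule = [flat_decay_lr(e) for e in range(20)]
for epoch, lr in enumerate(schedule):
    print(epoch, lr)
```

With the values used above (20 epochs, starting LR 5e-3), the first 14 epochs run at 5e-3 and the last six decay towards 5e-5.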

Weaver integrates multiple optimizers. In the above training command, we use --optimizer ranger to adopt the Ranger optimizer. It is also possible to switch to --optimizer adam or --optimizer adamW.

Weaver also provides the interface to optimize the learning rate before real training is performed. In the ParticleNet model training, we append

--lr-finder 5e-6,5e0,200\n
in the command, then a specific learning-rate finder program will be launched. This setup scans over the LR from 5e-6 to 5e0 by applying 200 mini-batches of training. It outputs a plot showing the training loss for different starting learning rates. In general, a lower training loss means a better choice of the learning rate parameter.
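The scan described above can be pictured with a short sketch. It assumes the 200 learning rates are spaced log-uniformly between the two bounds, which is the usual choice for a learning-rate range test but is not confirmed here for Weaver. Both helper names are hypothetical.

```python
def lr_scan_points(lr_min=5e-6, lr_max=5e0, n_steps=200):
    '''Log-uniformly spaced learning rates, one per mini-batch of the scan.'''
    ratio = (lr_max / lr_min) ** (1.0 / (n_steps - 1))
    return [lr_min * ratio ** i for i in range(n_steps)]

def best_lr(lrs, losses):
    '''Learning rate at which the recorded training loss is lowest.'''
    i_min = min(range(len(losses)), key=losses.__getitem__)
    return lrs[i_min]

lrs = lr_scan_points()
print(lrs[0], lrs[-1], len(lrs))

# Toy example: three scanned learning rates and their (made-up) training losses
print(best_lr([1e-3, 1e-2, 1e-1], [0.9, 0.4, 0.7]))
```

In practice the loss curve is noisy, so the minimum found this way is best read as a rough starting point for a few full trainings.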

The plots below show the results from the LR finder when specifying --lr-finder 5e-6,5e0,200, for the --optimizer adamW (left) and the --optimizer ranger (right) case.

The training loss forms a basin shape, which indicates that the optimal learning rate falls somewhere in the middle. We extract two observations from the plots. First, the basin covers a wide range, meaning that the LR finder only provides a rough estimate. It is nevertheless a good idea to run the LR finder first to get an overall feeling. For the Ranger case (right figure), one can choose the range 1e-3 to 1e-2 and further determine the optimal learning rate by running the full training. Second, we should be aware that different optimizers have different optimal LR values. As can be seen here, AdamW in general requires a smaller LR than Ranger.

"},{"location":"inference/particlenet.html#2-visualize-the-training-with-tensorboard","title":"2. Visualize the training with TensorBoard","text":"

To monitor the full training/evaluation accuracy and the loss for each mini-batch, we can draw support from a nicely integrated utility, TensorBoard, to employ real-time monitoring. See the introduction page from PyTorch: https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html

To activate TensorBoard, append (replacing ${PREFIX} according to the above table)

--tensorboard ${PREFIX}\n
to the training command. The runs/ subfolder containing the TensorBoard monitoring log will appear in the Weaver directory (if you are launching condor jobs, the runs/ folder will be transferred back in the tarball). Then, one can run
tensorboard --logdir=runs\n
to start the TensorBoard service and go to the URL http://localhost:6006 to view the TensorBoard dashboard.

The below plots show the training and evaluation loss, in our standard choice with LR being 5e-3, and in the case of a small LR 2e-3 and a large LR 1e-2. Note that all tested LR values are within the basin in the LR finder plots.

We see that in the evaluation loss plot, the standard LR outperforms the two variations. The reason may be that a larger LR has difficulty converging to the global minimum, while a smaller LR may not be adequate to reach the minimum within 20 epochs. Overall, we see 5e-3 as a good choice of starting LR for the Ranger optimizer.

"},{"location":"inference/particlenet.html#3-optimize-the-model","title":"3. Optimize the model","text":"

In practice, tuning the model size is also an important task. Conceptually, a smaller model tends to have unsatisfactory performance due to its limited ability to learn many local features. As the model size goes up, the performance will climb to some extent, but may then decrease due to network \"degradation\" (deeper models have difficulty learning features). In addition, a heavier model may overfit. It also leads to a longer inference time, which is the main concern in real applications.

For the ParticleNet model case, we also test between a smaller and larger variation of the model size. Recall that the original model is defined by the following layer parameters.

conv_params = [\n    (16, (64, 64, 64)),\n    (16, (128, 128, 128)),\n    (16, (256, 256, 256)),\n    ]\nfc_params = [(256, 0.1)]\n
We can replace the code block with
ec_k = kwargs.get('ec_k', 16)\nec_c1 = kwargs.get('ec_c1', 64)\nec_c2 = kwargs.get('ec_c2', 128)\nec_c3 = kwargs.get('ec_c3', 256)\nfc_c, fc_p = kwargs.get('fc_c', 256), kwargs.get('fc_p', 0.1)\nconv_params = [\n    (ec_k, (ec_c1, ec_c1, ec_c1)),\n    (ec_k, (ec_c2, ec_c2, ec_c2)),\n    (ec_k, (ec_c3, ec_c3, ec_c3)),\n    ]\nfc_params = [(fc_c, fc_p)]\n
Then we have the ability to tune the model parameters from the command line. Append the extra arguments in the training command
--network-option ec_k 32 --network-option ec_c1 128 --network-option ec_c2 192 --network-option ec_c3 256\n
and the model parameters will take the new values as specified.
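The mechanism relies on kwargs.get returning the default when an option is not given. A self-contained sketch of that pattern is shown below; the wrapper function name is hypothetical, and the explicit int/float casts are an assumption added here in case option values arrive as strings.

```python
def get_layer_params(**kwargs):
    '''Build ParticleNet layer parameters, falling back to the default values.'''
    # Cast explicitly in case the values are passed as strings (assumption)
    ec_k = int(kwargs.get('ec_k', 16))
    ec_c1 = int(kwargs.get('ec_c1', 64))
    ec_c2 = int(kwargs.get('ec_c2', 128))
    ec_c3 = int(kwargs.get('ec_c3', 256))
    fc_c = int(kwargs.get('fc_c', 256))
    fc_p = float(kwargs.get('fc_p', 0.1))
    conv_params = [
        (ec_k, (ec_c1, ec_c1, ec_c1)),
        (ec_k, (ec_c2, ec_c2, ec_c2)),
        (ec_k, (ec_c3, ec_c3, ec_c3)),
    ]
    fc_params = [(fc_c, fc_p)]
    return conv_params, fc_params

# No options given: the original model is recovered
default_conv, default_fc = get_layer_params()
# Options of the enlarged model from the command line above
heavy_conv, heavy_fc = get_layer_params(ec_k='32', ec_c1='128', ec_c2='192', ec_c3='256')
print(default_conv, heavy_conv)
```

Any option that is not specified keeps its default, so the same network config file serves both the original and the modified model.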

We test over two cases, one with the above setting to enlarge the model, and another by using

--network-option ec_c1 64 --network-option ec_c2 64 --network-option ec_c3 96\n
to adopt a lite version.

The TensorBoard monitoring plots of the training/evaluation loss are shown below.

We see that the \"heavy\" model reaches an even smaller training loss, meaning that the model has not yet run into the degradation issue. However, the evaluation loss is not catching up with the training loss, showing some degree of overtraining in this scheme. From the evaluation result, we see no improvement from moving to a heavy model.

"},{"location":"inference/particlenet.html#4-apply-preselection-and-class-weights","title":"4. Apply preselection and class weights","text":"

In HEP applications, it is sometimes required to train a multi-class classifier. While it is simple to specify the input classes in the label section of the Weaver data config, setting up the preselection and assigning suitable class weights for training is sometimes overlooked. With an unoptimized configuration, the trained model will not reach its best performance, even though no error message is produced.

Since our top tagging example is a binary classification problem, there is no specific need to configure the preselection and class weights. Below we summarize some experience that may be applicable to the reader's custom multi-class training task.

The preselection should be chosen such that every event passing the selection falls into one and only one category. In other words, events with no label attached should not be kept, since they will confuse the training process.

Class weights (the class_weights option under weights in the data config) control the relative importance of the input sample categories in training. Implementation-wise, they change the probability that an event of a given category is chosen as a training input. The class weights come into effect when one trains a multi-class classifier. Taking a 3-class case (denoted as [A, B, C]) as an example, class_weights: [1, 1, 1] gives equal weights to all categories. Retraining with class_weights: [10, 1, 1] may result in a better discriminating power for class A vs. B or A vs. C, while the power to separate B from C will be weakened. As a trade-off between separating A vs. C and B vs. C, the class weights need to be tuned deliberately to achieve reasonable performance.
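As a rough picture of how class weights act on the sampling, the sketch below turns a set of weights into per-class probabilities and draws a toy sample. It assumes equally sized input categories and a simple proportional rule; the actual Weaver implementation may differ in detail, and the helper name is hypothetical.

```python
import random

def sampling_probabilities(class_weights):
    '''Normalize class weights into per-class sampling probabilities.'''
    total = float(sum(class_weights))
    return [w / total for w in class_weights]

probs_equal = sampling_probabilities([1, 1, 1])
probs_boost_a = sampling_probabilities([10, 1, 1])
print(probs_equal, probs_boost_a)

# Draw a toy training sample of class labels with the boosted weights
rng = random.Random(0)
sample = rng.choices(['A', 'B', 'C'], weights=[10, 1, 1], k=1200)
frac_a = sample.count('A') / len(sample)
print(frac_a)
```

With [10, 1, 1], class A makes up about 10/12 of the training inputs, which is why B vs. C separation receives less attention.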

After the class weights are tuned, one can use another method to further factor out the interplay across categories, i.e., define a \"binarized\" score between two classes only. Suppose the raw scores for the three classes are P(A), P(B), and P(C) (their sum should be 1); then one can define the discriminant P(BvsC) = P(B) / (P(B)+P(C)) to separate B vs. C. In this way, the separating power of B vs. C will remain unchanged for class_weights configured as either [1, 1, 1] or [10, 1, 1]. This strategy has been widely used in CMS to define composite tagger discriminants, which are applied analysis-wise.

Above, we have discussed in detail various attempts one can make to optimize the model. We hope the practical experience presented here will help readers develop and deploy complex ML models.
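The binarized discriminant is a one-line formula. The sketch below evaluates it on made-up scores and shows that it does not change when the share of probability assigned to class A is rescaled. The function name and the numbers are hypothetical.

```python
def binarized_score(p_b, p_c):
    '''Discriminant P(BvsC) = P(B) / (P(B) + P(C)).'''
    denom = p_b + p_c
    return p_b / denom if denom > 0 else 0.0

# Toy raw scores (P(A), P(B), P(C)) for one event; they sum to 1
p_a, p_b, p_c = 0.7, 0.2, 0.1
score_b_vs_c = binarized_score(p_b, p_c)

# Scale up P(A) and renormalize: the B-vs-C discriminant is unchanged
boost = 5.0
norm = boost * p_a + p_b + p_c
score_after = binarized_score(p_b / norm, p_c / norm)
print(score_b_vs_c, score_after)
```

Because only the ratio of P(B) to P(C) enters, the discriminant is insensitive to how strongly the network favours class A.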

"},{"location":"inference/performance.html","title":"Performance of inference tools","text":""},{"location":"inference/pyg.html","title":"PyTorch Geometric","text":"

Geometric deep learning (GDL) is an emerging field focused on applying machine learning (ML) techniques to non-Euclidean domains such as graphs, point clouds, and manifolds. The PyTorch Geometric (PyG) library extends PyTorch to include GDL functionality, for example classes necessary to handle data with irregular structure. PyG is introduced at a high level in Fast Graph Representation Learning with PyTorch Geometric and in detail in the PyG docs.

"},{"location":"inference/pyg.html#gdl-with-pyg","title":"GDL with PyG","text":"

A complete review of GDL is available in the following recently-published (and freely-available) textbook: Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. The authors specify several key GDL architectures including convolutional neural networks (CNNs) operating on grids, Deep Sets architectures operating on sets, and graph neural networks (GNNs) operating on graphs, collections of nodes connected by edges. PyG is focused in particular on graph-structured data, which naturally encompasses set-structured data. In fact, many state-of-the-art GNN architectures are implemented in PyG (see the docs)! A review of the landscape of GNN architectures is available in Graph Neural Networks: A Review of Methods and Applications.

"},{"location":"inference/pyg.html#the-data-class-pyg-graphs","title":"The Data Class: PyG Graphs","text":"

Graphs are data structures designed to encode data structured as a set of objects and relations. Objects are embedded as graph nodes \\(u\\in\\mathcal{V}\\), where \\(\\mathcal{V}\\) is the node set. Relations are represented by edges \\((i,j)\\in\\mathcal{E}\\) between nodes, where \\(\\mathcal{E}\\) is the edge set. Denote the sizes of the node and edge sets as \\(|\\mathcal{V}|=n_\\mathrm{nodes}\\) and \\(|\\mathcal{E}|=n_\\mathrm{edges}\\) respectively. The choice of edge connectivity determines the local structure of a graph, which has important downstream effects on graph-based learning algorithms. Graph construction is the process of embedding input data onto a graph structure. Graph-based learning algorithms are correspondingly imbued with a relational inductive bias based on the choice of graph representation. The simplest graph construction routine is to construct no edges, yielding a permutation invariant set of objects. On the other hand, fully-connected graphs connect every node-node pair with an edge, yielding \\(n_\\mathrm{edges}=n_\\mathrm{nodes}(n_\\mathrm{nodes}-1)/2\\) undirected edges. This representation may be feasible for small inputs like particle clouds corresponding to a jet, but is intractable for large-scale applications such as high-pileup tracking datasets. Notably, dynamic graph construction techniques operate on input point clouds, constructing edges on them dynamically during inference. For example, EdgeConv and GravNet GNN layers dynamically construct edges between nodes projected into a latent space; multiple such layers may be applied in sequence, yielding many intermediate graph representations on an input point cloud.

In general, nodes can have positions \\(\\{p_i\\}_{i=1}^{n_\\mathrm{nodes}}\\), \\(p_i\\in\\mathbb{R}^{n_\\mathrm{space\\_dim}}\\), and features (attributes) \\(\\{x_i\\}_{i=1}^{n_\\mathrm{nodes}}\\), \\(x_i\\in\\mathbb{R}^{n_\\mathrm{node\\_dim}}\\). In some applications like GNN-based particle tracking, node positions are taken to be the features. In others, e.g. jet identification, positional information may be used to seed dynamic graph construction while kinematic features are propagated as edge features. Edges, too, can have features \\(\\{e_{ij}\\}_{(i,j)\\in\\mathcal{E}}\\), \\(e_{ij}\\in\\mathbb{R}^{n_\\mathrm{edge\\_dim}}\\), but do not have positions; instead, edges are defined by the nodes they connect, and may therefore be represented by, for example, the distance between the respective node-node pair. In PyG, graphs are stored as instances of the data class, whose fields fully specify the graph:
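Two of the graph construction routines mentioned above can be sketched in plain Python: the fully-connected graph, and a k-nearest-neighbour rule of the kind that dynamic graph layers apply to point clouds. The function names and the toy points are hypothetical, and real implementations would use vectorized tensor operations instead of Python loops.

```python
from itertools import combinations
from math import dist

def fully_connected_edges(n_nodes):
    '''All undirected node pairs (i, j) with i < j.'''
    return list(combinations(range(n_nodes), 2))

def knn_edges(points, k):
    '''Directed edges (j, i) from each of the k nearest neighbours j to node i.'''
    edges = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Rank the other nodes by Euclidean distance to node i
        nearest = sorted(others, key=lambda j: dist(p, points[j]))[:k]
        edges.extend((j, i) for j in nearest)
    return edges

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
fc = fully_connected_edges(len(points))
knn = knn_edges(points, k=1)
print(len(fc), knn)
```

The fully-connected edge count grows quadratically with the number of nodes, while the k-nearest-neighbour graph has exactly k incoming edges per node.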

  • data.x: node feature matrix, \\(X\\in\\mathbb{R}^{n_\\mathrm{nodes}\\times n_\\mathrm{node\\_dim}}\\)
  • data.edge_index: node indices at each end of each edge, \\(I\\in\\mathbb{N}^{2\\times n_\\mathrm{edges}}\\)
  • data.edge_attr: edge feature matrix, \\(E\\in\\mathbb{R}^{n_\\mathrm{edges}\\times n_\\mathrm{edge\\_dim}}\\)
  • data.y: training target with arbitrary shape (\\(y\\in\\mathbb{R}^{n_\\mathrm{nodes}\\times n_\\mathrm{out}}\\) for node-level targets, \\(y\\in\\mathbb{R}^{n_\\mathrm{edges}\\times n_\\mathrm{out}}\\) for edge-level targets or \\(y\\in\\mathbb{R}^{1\\times n_\\mathrm{out}}\\) for graph-level targets).
  • data.pos: Node position matrix, \\(P\\in\\mathbb{R}^{n_\\mathrm{nodes}\\times n_\\mathrm{space\\_dim}}\\)

The PyG Introduction By Example tutorial covers the basics of graph creation, batching, transformation, and inference using this data class.

As an example, consider the ZINC chemical compounds dataset, which is available as a built-in dataset in PyG:

from torch_geometric.datasets import ZINC\ntrain_dataset = ZINC(root='/tmp/ZINC', subset=True, split='train')\ntest_dataset =  ZINC(root='/tmp/ZINC', subset=True, split='test')\nlen(train_dataset)\n>>> 10000\nlen(test_dataset)\n>>> 1000   \n
Each graph in the dataset is a chemical compound; nodes are atoms and edges are chemical bonds. The node features x are categorical atom labels and the edge features edge_attr are categorical bond labels. The edge_index matrix lists all bonds present in the compound in COO format. The truth labels y indicate a synthetic computed property called constrained solubility; given a set of molecules represented as graphs, the task is to regress the constrained solubility. Therefore, this dataset is suitable for graph-level regression. Let's take a look at one molecule:

data = train_dataset[27]\ndata.x # node features\n>>> tensor([[0], [0], [1], [2], [0], \n            [0], [2], [0], [1], [2],\n            [4], [0], [0], [0], [0],\n            [4], [0], [0], [0], [0]])\n\ndata.pos # node positions \n>>> None\n\ndata.edge_index # COO edge indices\n>>> tensor([[ 0,  1,  1,  1,  2,  3,  3,  4,  4,  \n              5,  5,  6,  6,  7,  7,  7,  8,  9, \n              9, 10, 10, 10, 11, 11, 12, 12, 13, \n              13, 14, 14, 15, 15, 15, 16, 16, 16,\n              16, 17, 18, 19], # node indices w/ outgoing edges\n            [ 1,  0,  2,  3,  1,  1,  4,  3,  5,  \n              4,  6,  5,  7,  6,  8,  9,  7,  7,\n              10,  9, 11, 15, 10, 12, 11, 13, 12, \n              14, 13, 15, 10, 14, 16, 15, 17, 18,\n              19, 16, 16, 16]]) # node indices w/ incoming edges\n\ndata.edge_attr # edge features\n>>> tensor([1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, \n            1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1,\n            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \n            1, 1, 1, 1])\n\ndata.y # truth labels\n>>> tensor([-0.0972])\n\ndata.num_nodes\n>>> 20\n\ndata.num_edges\n>>> 40\n\ndata.num_node_features\n>>> 1 \n

We can load the full set of graphs onto an available GPU and create PyG dataloaders as follows:

import torch\nfrom torch_geometric.data import DataLoader\n\ndevice = 'cuda:0' if torch.cuda.is_available() else 'cpu'\ntest_dataset = [d.to(device) for d in test_dataset]\ntrain_dataset = [d.to(device) for d in train_dataset]\ntest_loader = DataLoader(test_dataset, batch_size=1, shuffle=False)\ntrain_loader = DataLoader(train_dataset, batch_size=1, shuffle=True)\n

"},{"location":"inference/pyg.html#the-message-passing-base-class-pyg-gnns","title":"The Message Passing Base Class: PyG GNNs","text":"

The 2017 paper Neural Message Passing for Quantum Chemistry presents a unified framework for a swath of GNN architectures known as message passing neural networks (MPNNs). MPNNs are GNNs whose feature updates are given by:

\\[x_i^{(k)} = \\gamma^{(k)} \\left(x_i^{(k-1)}, \\square_{j \\in \\mathcal{N}(i)} \\, \\phi^{(k)}\\left(x_i^{(k-1)}, x_j^{(k-1)},e_{ij}\\right) \\right)\\]

Here, \\(\\gamma\\) and \\(\\phi\\) are learnable functions (which we can approximate as multilayer perceptrons), \\(\\square\\) is a permutation-invariant function (e.g. mean, max, add), and \\(\\mathcal{N}(i)\\) is the neighborhood of node \\(i\\). In PyG, you'd write your own MPNN by using the MessagePassing base class, implementing each of the above mathematical objects as an explicit function.

  • MessagePassing.message() : define an explicit NN for \\(\\phi\\), use it to calculate \"messages\" between a node \\(x_i^{(k-1)}\\) and its neighbors \\(x_j^{(k-1)}\\), \\(j\\in\\mathcal{N}(i)\\), leveraging edge features \\(e_{ij}\\) if applicable
  • MessagePassing.propagate() : in this step, messages are calculated via the message function and aggregated across each receiving node; the keyword aggr (which can be 'add', 'max', or 'mean') is used to specify the specific permutation invariant function \\(\\square_{j\\in\\mathcal{N}(i)}\\) used for message aggregation.
  • MessagePassing.update() : the results of message passing are used to update the node features \\(x_i^{(k)}\\) through the \\(\\gamma\\) MLP

The specific implementations of message(), propagate(), and update() are up to the user. A specific example is available in the PyG Creating Message Passing Networks tutorial.
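To see how the three pieces fit together without any library machinery, here is a plain-Python sketch of one message-passing update with mean aggregation on scalar node features. The function name is hypothetical, and the fixed lambdas only stand in for the learnable phi and gamma networks.

```python
def message_passing_step(x, edges, edge_attr, phi, gamma):
    '''One MPNN update with mean aggregation; edges are (source j, target i) pairs.'''
    inbox = {i: [] for i in range(len(x))}
    for (j, i), e_ji in zip(edges, edge_attr):
        # message(): phi sees the receiving node, the sending node and the edge
        inbox[i].append(phi(x[i], x[j], e_ji))
    x_new = []
    for i, x_i in enumerate(x):
        msgs = inbox[i]
        # propagate(): permutation-invariant aggregation (here: mean)
        agg = sum(msgs) / len(msgs) if msgs else 0.0
        # update(): gamma combines the old feature with the aggregated message
        x_new.append(gamma(x_i, agg))
    return x_new

# Stand-ins for the learnable MLPs (hypothetical fixed functions)
phi = lambda x_i, x_j, e: x_j * e
gamma = lambda x_i, agg: x_i + agg

x = [1.0, 2.0, 3.0]
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]  # a 3-node chain, both directions
edge_attr = [1.0, 1.0, 2.0, 2.0]
x_out = message_passing_step(x, edges, edge_attr, phi, gamma)
print(x_out)
```

Reordering the edge list leaves the result unchanged, which is the permutation invariance required of the aggregation function.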

"},{"location":"inference/pyg.html#message-passing-with-zinc-data","title":"Message-Passing with ZINC Data","text":"

Returning to the ZINC molecular compound dataset, we can design a message-passing layer to aggregate messages across molecular graphs. Here, we'll define a multi-layer perceptron (MLP) class and use it to build a message passing layer (MPL) implementing the following equation:

\\[x_i' = \\gamma \\left(x_i, \\frac{1}{|\\mathcal{N}(i)|}\\sum_{j \\in \\mathcal{N}(i)} \\, \\phi\\left([x_i, x_j, e_{j,i}]\\right) \\right)\\]

Here, the MLP dimensions are constrained. Since \\(x_i, e_{j,i}\\in\\mathbb{R}\\), the concatenated input \\([x_i, x_j, e_{j,i}]\\) has three components, so the \\(\\phi\\) MLP must map \\(\\mathbb{R}^3\\) to \\(\\mathbb{R}^\\mathrm{message\\_size}\\). Similarly, \\(\\gamma\\) must map \\(\\mathbb{R}^{1+\\mathrm{message\\_size}}\\) to \\(\\mathbb{R}^\\mathrm{out}\\).

import torch\nimport torch.nn as nn\nfrom torch_geometric.nn import MessagePassing\n\nclass MLP(nn.Module):\n    def __init__(self, input_size, output_size):\n        super(MLP, self).__init__()\n\n        self.layers = nn.Sequential(\n            nn.Linear(input_size, 16),\n            nn.ReLU(),\n            nn.Linear(16, 16),\n            nn.ReLU(),\n            nn.Linear(16, output_size),\n        )\n\n    def forward(self, x):\n        return self.layers(x)\n\nclass MPLayer(MessagePassing):\n    def __init__(self, n_node_feats, n_edge_feats, message_size, output_size):\n        super(MPLayer, self).__init__(aggr='mean', \n                                      flow='source_to_target')\n        # phi acts on [x_i, x_j, e_ji]; gamma acts on [x_i, aggregated message]\n        self.phi = MLP(2*n_node_feats + n_edge_feats, message_size)\n        self.gamma = MLP(message_size + n_node_feats, output_size)\n\n    def forward(self, x, edge_index, edge_attr):\n        return self.propagate(edge_index, x=x, edge_attr=edge_attr)\n\n    def message(self, x_i, x_j, edge_attr):       \n        return self.phi(torch.cat([x_i, x_j, edge_attr], dim=1))\n\n    def update(self, aggr_out, x):\n        return self.gamma(torch.cat([x, aggr_out], dim=1))\n

Let's apply this layer to one of the ZINC molecules:

molecule = train_dataset[0]\nmolecule.x.shape\n>>> torch.Size([29, 1]) # 29 atoms and 1 feature (atom label)\nmpl = MPLayer(1, 1, 16, 8).to(device) # message_size = 16, output_size = 8\nxprime = mpl(molecule.x.float(), molecule.edge_index, molecule.edge_attr.unsqueeze(1).float())\nxprime.shape\n>>> torch.Size([29, 8]) # 29 atoms and 8 features\n
There we have it - the message passing layer has produced 8 new features for each atom.

"},{"location":"inference/pytorch.html","title":"PyTorch Inference","text":"

PyTorch is an open source ML library developed by Facebook's AI Research lab. Initially released in late-2016, PyTorch is a relatively new tool, but has become increasingly popular among ML researchers (in fact, some analyses suggest it's becoming more popular than TensorFlow in academic communities!). PyTorch is written in idiomatic Python, so its syntax is easy to parse for experienced Python programmers. Additionally, it is highly compatible with graphics processing units (GPUs), which can substantially accelerate many deep learning workflows. To date PyTorch has not been integrated into CMSSW. Trained PyTorch models may be evaluated in CMSSW via ONNX Runtime, but model construction and training workflows must currently exist outside of CMSSW. Given the considerable interest in PyTorch within the HEP/ML community, we have reason to believe it will soon be available, so stay tuned!

"},{"location":"inference/pytorch.html#introductory-references","title":"Introductory References","text":"
  • PyTorch Install Guide
  • PyTorch Tutorials
  • LPC HATs: PyTorch
  • Deep Learning w/ PyTorch Course Repo
  • CODAS-HEP
"},{"location":"inference/pytorch.html#the-basics","title":"The Basics","text":"

The following documentation surrounds a set of code snippets designed to highlight some important ML features made available in PyTorch. In the following sections, we'll break down snippets from this script, highlighting specifically the PyTorch objects in it.

"},{"location":"inference/pytorch.html#tensors","title":"Tensors","text":"

The fundamental PyTorch object is the tensor. At a glance, tensors behave similarly to NumPy arrays. For example, they are broadcast, concatenated, and sliced in the same way. The following examples highlight some common NumPy-like tensor transformations:

a = torch.randn(size=(2,2))\n>>> tensor([[ 1.3552, -0.0204],\n            [ 1.2677, -0.8926]])\na.view(-1, 1)\n>>> tensor([[ 1.3552],\n            [-0.0204],\n            [ 1.2677],\n            [-0.8926]])\na.transpose(0, 1)\n>>> tensor([[ 1.3552,  1.2677],\n            [-0.0204, -0.8926]])\na.unsqueeze(dim=0)\n>>> tensor([[[ 1.3552, -0.0204],\n             [ 1.2677, -0.8926]]])\na.squeeze(dim=0)\n>>> tensor([[ 1.3552, -0.0204],\n            [ 1.2677, -0.8926]])\n
Additionally, torch supports familiar matrix operations with various syntax options:
m1 = torch.randn(size=(2,3))\nm2 = torch.randn(size=(3,2))\nx = torch.randn(3)\n\nm1 @ m2 == m1.mm(m2) # matrix multiplication\n>>> tensor([[True, True],\n            [True, True]])\n\nm1 @ x == m1.mv(x) # matrix-vector multiplication\n>>> tensor([True, True])\n\nm1.t() == m1.transpose(0, 1) # matrix transpose\n>>> tensor([[True, True],\n            [True, True],\n            [True, True]])\n
Note that tensor.transpose(dim0, dim1) is a more general operation than tensor.t(). It is important to note that tensors have been \"upgraded\" from NumPy arrays in two key ways: 1) Tensors have native GPU support. If a GPU is available at runtime, tensors can be transferred from CPU to GPU, where computations such as matrix operations are substantially faster. Note that tensor operations must be performed on objects on the same device. PyTorch supports CUDA tensor types for GPU computation (see the PyTorch CUDA Semantics guide). 2) Tensors support automatic gradient (autograd) calculations, such that operations on tensors flagged with requires_grad=True are automatically tracked. The flow of tracked tensor operations defines a computation graph in which nodes are tensors and edges are functions mapping input tensors to output tensors. Gradients are calculated via autograd by walking backward through this computation graph and applying the chain rule.

"},{"location":"inference/pytorch.html#gpu-support","title":"GPU Support","text":"

Tensors are created on the host CPU by default:

b = torch.zeros([2,3], dtype=torch.int32)\nb.device\n>>> cpu\n

You can also create tensors on any available GPUs:

torch.cuda.is_available() # check that a GPU is available\n>>> True \ncuda0 = torch.device('cuda:0')\nc = torch.ones([2,3], dtype=torch.int32, device=cuda0)\nc.device\n>>> cuda:0\n

You can also move tensors between devices:

b = b.to(cuda0)\nb.device\n>>> cuda:0\n

There are trade-offs between computations on the CPU and GPU. GPUs have limited memory and there is a cost associated with transferring data from CPUs to GPUs. However, GPUs perform heavy matrix operations much faster than CPUs, and are therefore often used to speed up training routines.

from timeit import Timer\n\n# Note: CUDA kernels are launched asynchronously, so without a call to\n# torch.cuda.synchronize() the GPU timings below largely reflect launch overhead.\nfor i, N in enumerate([10, 100, 500, 1000, 5000]):\n    print(\"({},{}) Matrices:\".format(N,N))\n    M1_cpu = torch.randn(size=(N,N), device='cpu')\n    M2_cpu = torch.randn(size=(N,N), device='cpu')\n    M1_gpu = torch.randn(size=(N,N), device=cuda0)\n    M2_gpu = torch.randn(size=(N,N), device=cuda0)\n    if (i==0):\n        print('Check devices for each tensor:')\n        print('M1_cpu, M2_cpu devices:', M1_cpu.device, M2_cpu.device)\n        print('M1_gpu, M2_gpu devices:', M1_gpu.device, M2_gpu.device)\n\n    def large_matrix_multiply(M1, M2):\n        # Note: * is the elementwise product; use M1 @ M2.transpose(0,1) for a true matrix product\n        return M1 * M2.transpose(0,1)\n\n    n_iter = 1000\n    t_cpu = Timer(lambda: large_matrix_multiply(M1_cpu, M2_cpu))\n    cpu_time = t_cpu.timeit(number=n_iter)/n_iter\n    print('cpu time per call: {:.6f} s'.format(cpu_time))\n\n    t_gpu = Timer(lambda: large_matrix_multiply(M1_gpu, M2_gpu))\n    gpu_time = t_gpu.timeit(number=n_iter)/n_iter\n    print('gpu time per call: {:.6f} s'.format(gpu_time))\n    print('gpu_time/cpu_time: {:.6f}\\n'.format(gpu_time/cpu_time))\n\n>>> (10,10) Matrices:\nCheck devices for each tensor:\nM1_cpu, M2_cpu devices: cpu cpu\nM1_gpu, M2_gpu devices: cuda:0 cuda:0\ncpu time per call: 0.000008 s\ngpu time per call: 0.000015 s\ngpu_time/cpu_time: 1.904711\n\n(100,100) Matrices:\ncpu time per call: 0.000015 s\ngpu time per call: 0.000015 s\ngpu_time/cpu_time: 0.993163\n\n(500,500) Matrices:\ncpu time per call: 0.000058 s\ngpu time per call: 0.000016 s\ngpu_time/cpu_time: 0.267371\n\n(1000,1000) Matrices:\ncpu time per call: 0.000170 s\ngpu time per call: 0.000015 s\ngpu_time/cpu_time: 0.089784\n\n(5000,5000) Matrices:\ncpu time per call: 0.025083 s\ngpu time per call: 0.000011 s\ngpu_time/cpu_time: 0.000419\n

The complete list of Torch Tensor operations is available in the docs.

"},{"location":"inference/pytorch.html#autograd","title":"Autograd","text":"

Backpropagation occurs automatically through autograd. For example, consider the following function and its derivatives:

\\[\\begin{aligned} f(\\textbf{a}, \\textbf{b}) &= \\textbf{a}^T \\textbf{X} \\textbf{b} \\\\ \\frac{\\partial f}{\\partial \\textbf{a}} &= \\textbf{b}^T \\textbf{X}^T\\\\ \\frac{\\partial f}{\\partial \\textbf{b}} &= \\textbf{a}^T \\textbf{X} \\end{aligned}\\]

Given specific choices of \\(\\textbf{X}\\), \\(\\textbf{a}\\), and \\(\\textbf{b}\\), we can calculate the corresponding derivatives via autograd by requiring a gradient to be stored in each relevant tensor:

X = torch.ones((2,2), requires_grad=True)\na = torch.tensor([0.5, 1], requires_grad=True)\nb = torch.tensor([0.5, -2], requires_grad=True)\nf = a.T @ X @ b\nf\n>>> tensor(-2.2500, grad_fn=<DotBackward>) \nf.backward() # backprop \na.grad\n>>> tensor([-1.5000, -1.5000])\nb.T @ X.T \n>>> tensor([-1.5000, -1.5000], grad_fn=<SqueezeBackward3>)\nb.grad\n>>> tensor([1.5000, 1.5000])\na.T @ X\n>>> tensor([1.5000, 1.5000], grad_fn=<SqueezeBackward3>)\n
The tensor.backward() call initiates backpropagation, accumulating the gradient backward through a series of grad_fn labels tied to each tensor (e.g. <DotBackward>, indicating the dot product \\((\\textbf{a}^T\\textbf{X})\\textbf{b}\\)).
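The autograd result can be cross-checked independently of PyTorch. The sketch below evaluates the same bilinear form with plain Python lists and estimates the gradients with central finite differences. The helper names are hypothetical, and this is a check on the numbers above rather than a description of how autograd works internally.

```python
def f(a, X, b):
    '''Bilinear form f(a, b) = a^T X b for plain Python lists.'''
    return sum(a[i] * X[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))

def numerical_grad(func, v, eps=1e-6):
    '''Central finite-difference gradient of func with respect to the list v.'''
    grad = []
    for k in range(len(v)):
        up = v[:k] + [v[k] + eps] + v[k + 1:]
        down = v[:k] + [v[k] - eps] + v[k + 1:]
        grad.append((func(up) - func(down)) / (2 * eps))
    return grad

X = [[1.0, 1.0], [1.0, 1.0]]
a = [0.5, 1.0]
b = [0.5, -2.0]

value = f(a, X, b)
grad_a = numerical_grad(lambda v: f(v, X, b), a)  # should approximate X b
grad_b = numerical_grad(lambda v: f(a, X, v), b)  # should approximate X^T a
print(value, grad_a, grad_b)
```

Up to finite-difference precision, the estimates agree with the a.grad and b.grad values obtained from backpropagation.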

"},{"location":"inference/pytorch.html#data-utils","title":"Data Utils","text":"

PyTorch is equipped with many useful data-handling utilities. For example, the torch.utils.data package implements datasets (torch.utils.data.Dataset) and iterable data loaders (torch.utils.data.DataLoader). Additionally, various batching and sampling schemes are available.

You can create custom datasets by subclassing torch.utils.data.Dataset and implementing __len__ and __getitem__, for example a dataset collecting the results of XOR on two binary inputs:

import torch\nfrom torch.utils.data import Dataset\n\nclass Data(Dataset):\n    def __init__(self, device):\n        self.samples = torch.tensor([[0,0], [0,1], [1,0], [1,1]]).float().to(device)\n        # XOR of the two input columns gives the target for each sample\n        self.targets = torch.logical_xor(self.samples[:,0], \n                                         self.samples[:,1]).float().to(device)\n\n    def __len__(self):\n        return len(self.targets)\n\n    def __getitem__(self,idx):\n        return({'x': self.samples[idx],\n                'y': self.targets[idx]})\n
Dataloaders, from torch.utils.data.DataLoader, can generate shuffled batches of data via multiple workers. Here, we keep our datasets on the CPU (device = 'cpu'):
from torch.utils.data import DataLoader\n\ndevice = 'cpu'\ntrain_data = Data(device)\ntest_data = Data(device)\ntrain_loader = DataLoader(train_data, batch_size=1, shuffle=True, num_workers=2)\ntest_loader = DataLoader(test_data, batch_size=1, shuffle=False, num_workers=2)\nfor i, batch in enumerate(train_loader):\n    print(i, batch)\n\n>>> 0 {'x': tensor([[0., 0.]]), 'y': tensor([0.])}\n    1 {'x': tensor([[1., 0.]]), 'y': tensor([1.])}\n    2 {'x': tensor([[1., 1.]]), 'y': tensor([0.])}\n    3 {'x': tensor([[0., 1.]]), 'y': tensor([1.])}\n
The full set of data utils is available in the docs.

"},{"location":"inference/pytorch.html#neural-networks","title":"Neural Networks","text":"

The PyTorch nn package specifies a set of modules that correspond to different neural network (NN) components and operations. For example, the torch.nn.Linear module defines a linear transform with learnable parameters and the torch.nn.Flatten module flattens a contiguous range of tensor dimensions into one. The torch.nn.Sequential module contains a set of modules such as torch.nn.Linear and torch.nn.Flatten, chaining them together to form the forward pass of a feed-forward network. Furthermore, one may specify various pre-implemented loss functions, for example torch.nn.BCELoss and torch.nn.KLDivLoss. The full set of PyTorch NN building blocks is available in the docs.

As an example, we can build a simple neural network that reproduces the output of the XOR operation on binary inputs. To do so, we can define a simple NN of the form:

\\[\\begin{aligned} x_{in}&\\in\\{0,1\\}^{2}\\\\ l_1 &= \\sigma(W_1^Tx_{in} + b_1); \\ W_1\\in\\mathbb{R}^{2\\times2},\\ b_1\\in\\mathbb{R}^{2}\\\\ l_2 &= \\sigma(W_2^Tl_1 + b_2); \\ W_2\\in\\mathbb{R}^{2},\\ b_2\\in\\mathbb{R}\\\\ \\end{aligned}\\]
import torch.nn as nn\n\nclass Network(nn.Module):\n\n    def __init__(self):\n        super().__init__()\n\n        self.l1 = nn.Linear(2, 2)\n        self.l2 = nn.Linear(2, 1)\n\n    def forward(self, x):\n        x = torch.sigmoid(self.l1(x))\n        x = torch.sigmoid(self.l2(x))\n        return x\n\nmodel = Network().to(device)\n# evaluate the untrained model on all four XOR inputs at once\nmodel(train_data.samples)\n\n>>> tensor([[0.5000],\n            [0.4814],\n            [0.5148],\n            [0.4957]], grad_fn=<SigmoidBackward>)\n
"},{"location":"inference/pytorch.html#optimizers","title":"Optimizers","text":"

Training a neural network involves minimizing a loss function. Classes in the torch.optim package implement various optimization strategies, for example stochastic gradient descent and Adam through torch.optim.SGD and torch.optim.Adam, respectively. Optimizers are configurable through parameters such as the learning rate (which sets the optimizer's step size). The full set of optimizers and accompanying tutorials are available in the docs.
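
To make the role of the learning rate concrete, here is a minimal sketch of plain gradient descent on the one-dimensional function f(w) = (w - 3)^2. It is written without PyTorch and is not the Adam algorithm; the function `gradient_descent` and the chosen learning rates are assumptions made only for illustration:

```python
# Plain gradient descent on f(w) = (w - 3)^2, written without PyTorch
# to show how the learning rate scales each update step.

def gradient_descent(lr, steps=100, w=0.0):
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # df/dw
        w = w - lr * grad       # the update an SGD-style optimizer performs
    return w


w_small = gradient_descent(lr=0.001)  # steps too small: still far from 3
w_good = gradient_descent(lr=0.1)     # converges close to the minimum at w = 3
w_large = gradient_descent(lr=1.1)    # steps too large: the iterates diverge
```

Each step multiplies the distance to the minimum by (1 - 2*lr), which is why a very small learning rate converges slowly and a learning rate above 1 diverges for this particular function.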

To demonstrate the use of an optimizer, let's train the NN above to produce the results of the XOR operation on binary inputs. Here we'll use the Adam optimizer:

from torch import optim\nfrom torch.optim.lr_scheduler import StepLR\nfrom matplotlib import pyplot as plt\n\n# helpful references:\n# Learning XOR: exploring the space of a classic problem\n# https://towardsdatascience.com/how-neural-networks-solve-the-xor-problem-59763136bdd7\n# https://courses.cs.washington.edu/courses/cse446/18wi/sections/section8/XOR-Pytorch.html\n\n# the training function initiates backprop and \n# steps the optimizer towards the weights that \n# optimize the loss function \ndef train(model, train_loader, optimizer, epoch):\n    model.train()\n    losses = []\n    for i, batch in enumerate(train_loader):\n        optimizer.zero_grad()\n        output = model(batch['x'])\n        y, output = batch['y'], output.squeeze(1)\n\n        # optimize binary cross entropy:\n        # https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html\n        loss = F.binary_cross_entropy(output, y, reduction='mean')\n        loss.backward()\n        optimizer.step()\n        losses.append(loss.item())\n\n    return np.mean(losses)\n\n# the test function does not adjust the model's weights\ndef test(model, test_loader):\n    model.eval()\n    losses, n_correct, n_incorrect = [], 0, 0\n    with torch.no_grad():\n        for i, batch in enumerate(test_loader):\n            output = model(batch['x'])\n            y, output = batch['y'], output.squeeze(1)\n            loss = F.binary_cross_entropy(output, y, \n                                          reduction='mean').item()\n            losses.append(loss)\n\n            # determine accuracy by thresholding model output at 0.5\n            batch_correct = torch.sum(((output>0.5) & (y==1)) |\n                                      ((output<0.5) & (y==0)))\n            batch_incorrect = len(y) - batch_correct\n            n_correct += batch_correct\n            n_incorrect += batch_incorrect\n\n    return np.mean(losses), n_correct/(n_correct+n_incorrect)\n\n\n# randomly initialize the model's weights\nfor 
module in model.modules():\n    if isinstance(module, nn.Linear):\n        module.weight.data.normal_(0, 1)\n\n# send weights to optimizer \nlr = 2.5e-2\noptimizer = optim.Adam(model.parameters(), lr=lr)\n\nepochs = 500\nfor epoch in range(1, epochs + 1):\n    train_loss = train(model, train_loader, optimizer, epoch)\n    test_loss, test_acc = test(model, test_loader)\n    if epoch%25==0:\n        print('epoch={}: train_loss={:.3f}, test_loss={:.3f}, test_acc={:.3f}'\n              .format(epoch, train_loss, test_loss, test_acc))\n\n>>> epoch=25: train_loss=0.683, test_loss=0.681, test_acc=0.500\n    epoch=50: train_loss=0.665, test_loss=0.664, test_acc=0.750\n    epoch=75: train_loss=0.640, test_loss=0.635, test_acc=0.750\n    epoch=100: train_loss=0.598, test_loss=0.595, test_acc=0.750\n    epoch=125: train_loss=0.554, test_loss=0.550, test_acc=0.750\n    epoch=150: train_loss=0.502, test_loss=0.498, test_acc=0.750\n    epoch=175: train_loss=0.435, test_loss=0.432, test_acc=0.750\n    epoch=200: train_loss=0.360, test_loss=0.358, test_acc=0.750\n    epoch=225: train_loss=0.290, test_loss=0.287, test_acc=1.000\n    epoch=250: train_loss=0.230, test_loss=0.228, test_acc=1.000\n    epoch=275: train_loss=0.184, test_loss=0.183, test_acc=1.000\n    epoch=300: train_loss=0.149, test_loss=0.148, test_acc=1.000\n    epoch=325: train_loss=0.122, test_loss=0.122, test_acc=1.000\n    epoch=350: train_loss=0.102, test_loss=0.101, test_acc=1.000\n    epoch=375: train_loss=0.086, test_loss=0.086, test_acc=1.000\n    epoch=400: train_loss=0.074, test_loss=0.073, test_acc=1.000\n    epoch=425: train_loss=0.064, test_loss=0.063, test_acc=1.000\n    epoch=450: train_loss=0.056, test_loss=0.055, test_acc=1.000\n    epoch=475: train_loss=0.049, test_loss=0.049, test_acc=1.000\n    epoch=500: train_loss=0.043, test_loss=0.043, test_acc=1.000\n
Here, the model has converged to 100% test accuracy, indicating that it has learned to reproduce the XOR outputs perfectly. Note that the test loss (BCE) continues to decrease even after the test accuracy has reached 100%; this is because the BCE loss is nonzero whenever \\(y_{output}\\) is not exactly 0 or 1, while accuracy is determined by thresholding the model outputs such that each prediction is the boolean \\((y_{output} > 0.5)\\). This highlights that it is important to choose the correct performance metric for an ML problem. In the case of XOR, perfect test accuracy is sufficient. Let's check that we've recovered the XOR output by extracting the model's weights and using them to build a custom XOR function:

for name, param in model.named_parameters():\n    if param.requires_grad:\n        print(name, param.data)\n\n>>> l1.weight tensor([[ 7.2888, -6.4168],\n                      [ 7.2824, -8.1637]])\n    l1.bias tensor([ 2.6895, -3.9633])\n    l2.weight tensor([[-6.3500,  8.0990]])\n    l2.bias tensor([2.5058])\n

Because our model was built with nn.Linear modules, we have weight matrices and bias terms. Next, we'll hard-code the matrix operations into a custom XOR function based on the architecture of the NN:

def XOR(x):\n    w1 = torch.tensor([[ 7.2888, -6.4168],\n                       [ 7.2824, -8.1637]]).t()\n    b1 = torch.tensor([ 2.6895, -3.9633])\n    layer1_out = torch.tensor([x[0]*w1[0,0] + x[1]*w1[1,0] + b1[0],\n                               x[0]*w1[0,1] + x[1]*w1[1,1] + b1[1]])\n    layer1_out = torch.sigmoid(layer1_out)\n\n    w2 = torch.tensor([-6.3500,  8.0990])\n    b2 = 2.5058\n    layer2_out = layer1_out[0]*w2[0] + layer1_out[1]*w2[1] + b2\n    layer2_out = torch.sigmoid(layer2_out)\n    return layer2_out, (layer2_out > 0.5)\n\nXOR([0.,0.])\n>>> (tensor(0.0359), tensor(False))\nXOR([0.,1.])\n>>> (tensor(0.9135), tensor(True))\nXOR([1.,0.])\n>>> (tensor(0.9815), tensor(True))\nXOR([1.,1.])\n>>> (tensor(0.0265), tensor(False))\n

There we have it - the NN learned XOR!
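
The difference between loss and accuracy noted earlier can be checked directly on these four outputs. The following sketch uses only the Python standard library; the helper functions are hypothetical re-implementations written for illustration and are not the PyTorch functions used during training:

```python
import math

# Model outputs copied from the XOR(...) calls above, with the true XOR targets.
outputs = [0.0359, 0.9135, 0.9815, 0.0265]
targets = [0.0, 1.0, 1.0, 0.0]


def binary_cross_entropy(outputs, targets):
    # mean of -[y*log(p) + (1-y)*log(1-p)] over all examples
    terms = [-(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
             for p, y in zip(outputs, targets)]
    return sum(terms) / len(terms)


def accuracy(outputs, targets, threshold=0.5):
    # a prediction counts as correct if thresholding reproduces the target
    hits = [(p > threshold) == (y == 1.0) for p, y in zip(outputs, targets)]
    return sum(hits) / len(hits)


bce = binary_cross_entropy(outputs, targets)
acc = accuracy(outputs, targets)
```

The accuracy is 1.0 while the BCE remains small but nonzero (about 0.04), consistent with the loss values printed at the end of the training log.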

"},{"location":"inference/pytorch.html#pytorch-in-cmssw","title":"PyTorch in CMSSW","text":""},{"location":"inference/pytorch.html#via-onnx","title":"Via ONNX","text":"

One way to incorporate your PyTorch models into CMSSW is through the Open Neural Network Exchange (ONNX) Runtime tool. In brief, ONNX supports training and inference for a variety of ML frameworks, and is currently integrated into CMSSW (see the CMS ML tutorial and a relevant discussion in the CMSSW git repo). PyTorch hosts an excellent tutorial on exporting a model from PyTorch to ONNX.

"},{"location":"inference/pytorch.html#example-use-cases","title":"Example Use Cases","text":"

The \\(ZZ\\rightarrow 4b\\) analysis utilizes trained PyTorch models via ONNX in CMSSW (see the corresponding repo). Briefly, they run ONNX in CMSSW_11_X via the CMSSW package PhysicsTools/ONNXRuntime, using it to define a multiClassifierONNX class. This multiclassifier is capable of loading pre-trained PyTorch models specified by a modelFile string as follows:

#include \"PhysicsTools/ONNXRuntime/interface/ONNXRuntime.h\"\n\nstd::unique_ptr<cms::Ort::ONNXRuntime> model;\nOrt::SessionOptions* session_options = new Ort::SessionOptions();\nsession_options->SetIntraOpNumThreads(1);\nmodel = std::make_unique<cms::Ort::ONNXRuntime>(modelFile, session_options);\n
"},{"location":"inference/pytorch.html#via-triton","title":"Via Triton","text":"

Coprocessors (GPUs, FPGAs, etc.) are frequently used to accelerate ML operations such as inference and training. In the 'as-a-service' paradigm, users can access cloud-based applications through lightweight client interfaces. The Services for Optimized Network Inference on Coprocessors (SONIC) framework implements this paradigm in CMSSW, allowing the optimal integration of GPUs into event processing workflows. One powerful implementation of SONIC is the NVIDIA Triton Inference Server, which is flexible with respect to ML framework, storage source, and hardware infrastructure. For more details, see the corresponding NVIDIA developer blog entry.

A Graph Attention Network (GAT) is available via Triton in CMSSW, and can be accessed here: https://github.com/cms-sw/cmssw/tree/master/HeterogeneousCore/SonicTriton/test

"},{"location":"inference/pytorch.html#training-tips","title":"Training Tips","text":"
  • When instantiating a DataLoader, shuffle=True should be enabled for training data but not for validation and testing data. At each training epoch, this will vary the order of data objects in each batch; accordingly, it is not efficient to load the full dataset (in its original ordering) into GPU memory before training. Instead, enable num_workers>1; this allows the DataLoader to load batches to the GPU as they're prepared. Note that this launches multiple worker processes on the CPU. For more information, see a corresponding discussion in the PyTorch forum.
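
To make the effect of shuffle=True concrete, the following sketch mimics per-epoch shuffling in plain Python. `shuffled_batches` is a hypothetical helper, not part of PyTorch; the real DataLoader additionally handles worker processes and collation of samples into tensors:

```python
import random

# Hypothetical stand-in for DataLoader(..., shuffle=True): reshuffle the
# sample indices at the start of every epoch and cut them into batches.
def shuffled_batches(n_samples, batch_size, rng):
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, n_samples, batch_size)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
epochs = [shuffled_batches(n_samples=8, batch_size=4, rng=rng) for _ in range(3)]

for epoch, batches in enumerate(epochs):
    print(epoch, batches)
```

Every epoch visits each sample exactly once, but the composition of the batches generally changes from epoch to epoch, which is why a fixed pre-batched copy of the dataset cannot be reused.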
"},{"location":"inference/sonic_triton.html","title":"Service-based inference with Triton/Sonic","text":"

This page is still under construction. For the moment, please see the Sonic+Triton tutorial given as part of the Machine Learning HATS@LPC 2021.

  • Link to Indico agenda
  • Slides
  • Exercise twiki
"},{"location":"inference/standalone.html","title":"Standalone framework","text":"

Todo.

Idea: Working w/ TF+ROOT standalone (outside of CMSSW)

"},{"location":"inference/swan_aws.html","title":"SWAN + AWS","text":"

Todo.

Ideas: best practices, cost model, instance pricing, need to log out, monitoring mandatory

"},{"location":"inference/tensorflow1.html","title":"Direct inference with TensorFlow 1","text":"

While it is technically still possible to use TensorFlow 1, this version of TensorFlow is quite old and is no longer supported by CMSSW. We highly recommend that you update your model to TensorFlow 2 and follow the integration guide in the Inference/Direct inference/TensorFlow 2 documentation.

"},{"location":"inference/tensorflow2.html","title":"Direct inference with TensorFlow 2","text":"

TensorFlow 2 is available since CMSSW_11_1_X (cmssw#28711, cmsdist#5525). The integration into the software stack can be found in cmsdist/tensorflow.spec and the interface is located in cmssw/PhysicsTools/TensorFlow.

"},{"location":"inference/tensorflow2.html#available-versions","title":"Available versions","text":"Python 3 on el8Python 3 on slc7Python 2 on slc7 TensorFlow el8_amd64_gcc10 el8_amd64_gcc11 v2.6.0 \u2265 CMSSW_12_3_4 - v2.6.4 \u2265 CMSSW_12_5_0 \u2265 CMSSW_12_5_0 TensorFlow slc7_amd64_gcc900 slc7_amd64_gcc10 slc7_amd64_gcc11 v2.1.0 \u2265 CMSSW_11_1_0 - - v2.3.1 \u2265 CMSSW_11_2_0 - - v2.4.1 \u2265 CMSSW_11_3_0 - - v2.5.0 \u2265 CMSSW_12_0_0 \u2265 CMSSW_12_0_0 - v2.6.0 \u2265 CMSSW_12_1_0 \u2265 CMSSW_12_1_0 \u2265 CMSSW_12_3_0 v2.6.4 - \u2265 CMSSW_12_5_0 \u2265 CMSSW_13_0_0 TensorFlow slc7_amd64_gcc900 v2.1.0 \u2265 CMSSW_11_1_0 v2.3.1 \u2265 CMSSW_11_2_0

At this time, only CPU support is provided. While GPU support is generally possible, it is currently disabled due to interference with production workflows, but will be enabled once these issues are resolved.

"},{"location":"inference/tensorflow2.html#software-setup","title":"Software setup","text":"

To run the examples shown below, create a minimal inference setup with the following snippet. Adapt the SCRAM_ARCH according to your operating system and desired compiler.

export SCRAM_ARCH=\"el8_amd64_gcc11\"\nexport CMSSW_VERSION=\"CMSSW_12_6_0\"\n\nsource \"/cvmfs/cms.cern.ch/cmsset_default.sh\" \"\"\n\ncmsrel \"${CMSSW_VERSION}\"\ncd \"${CMSSW_VERSION}/src\"\n\ncmsenv\nscram b\n

Below, the cmsml Python package is used to convert models from TensorFlow objects (tf.function's or Keras models) to protobuf graph files (documentation). It should be available after executing the commands above. You can check its version via

python -c \"import cmsml; print(cmsml.__version__)\"\n

and compare to the released tags. If you want to install a newer version from either the master branch of the cmsml repository or the Python package index (PyPI), you can simply do that via pip.

master | PyPI
# into your user directory (usually ~/.local)\npip install --upgrade --user git+https://github.com/cms-ml/cmsml\n\n# _or_\n\n# into a custom directory\npip install --upgrade --prefix \"CUSTOM_DIRECTORY\" git+https://github.com/cms-ml/cmsml\n
# into your user directory (usually ~/.local)\npip install --upgrade --user cmsml\n\n# _or_\n\n# into a custom directory\npip install --upgrade --prefix \"CUSTOM_DIRECTORY\" cmsml\n
"},{"location":"inference/tensorflow2.html#saving-your-model","title":"Saving your model","text":"

After successfully training, you should save your model in a protobuf graph file which can be read by the interface in CMSSW. Naturally, you only want to save the part of your model that is required to run the network prediction, i.e., it should not contain operations related to model training or loss functions (unless explicitly required). Also, to reduce the memory footprint and to accelerate the inference, variables should be converted to constant tensors. Both of these model transformations are provided by the cmsml package.

Instructions on how to transform and save your model are shown below, depending on whether you use Keras or plain TensorFlow with tf.function's.

Keras | tf.function

The code below saves a Keras Model instance as a protobuf graph file using cmsml.tensorflow.save_graph. In order for Keras to build the internal graph representation before saving, make sure to either compile the model, or pass an input_shape to the first layer:

# coding: utf-8\n\nimport tensorflow as tf\nfrom tensorflow.keras import layers\nimport cmsml\n\n# define your model\nmodel = tf.keras.Sequential()\nmodel.add(layers.InputLayer(input_shape=(10,), name=\"input\"))\nmodel.add(layers.Dense(100, activation=\"tanh\"))\nmodel.add(layers.Dense(3, activation=\"softmax\", name=\"output\"))\n\n# train it\n...\n\n# convert to binary (.pb extension) protobuf\n# with variables converted to constants\ncmsml.tensorflow.save_graph(\"graph.pb\", model, variables_to_constants=True)\n

Following the Keras naming conventions for certain layers, the input will be named \"input\" while the output is named \"sequential/output/Softmax\". To cross check the names, you can save the graph in text format by using the extension \".pb.txt\".

Suppose you write your network model in a single tf.function.

# coding: utf-8\n\nimport tensorflow as tf\nimport cmsml\n\n# define the model\n@tf.function\ndef model(x):\n    # lift variable initialization to the lowest context so they are\n    # not re-initialized on every call (eager calls or signature tracing)\n    with tf.init_scope():\n        W = tf.Variable(tf.ones([10, 1]))\n        b = tf.Variable(tf.ones([1]))\n\n    # define your \"complex\" model here\n    h = tf.add(tf.matmul(x, W), b)\n    y = tf.tanh(h, name=\"y\")\n\n    return y\n

In TensorFlow terms, the model function is polymorphic - it accepts different types of the input tensor x (tf.float32, tf.float64, ...). For each type, TensorFlow will create a concrete function with an associated tf.Graph object. This mechanism is referred to as signature tracing. For deeper insights into tf.function, the concepts of signature tracing, polymorphic and concrete functions, see the guide on Better performance with tf.function.

To save the model as a protobuf graph file, you explicitly need to create a concrete function. However, this is fairly easy once you know the exact type and shape of all input arguments.

# create a concrete function\ncmodel = model.get_concrete_function(\n    tf.TensorSpec(shape=[2, 10], dtype=tf.float32),\n)\n\n# convert to binary (.pb extension) protobuf\n# with variables converted to constants\ncmsml.tensorflow.save_graph(\"graph.pb\", cmodel, variables_to_constants=True)\n

The input will be named \"x\" while the output is named \"y\". To cross check the names, you can save the graph in text format by using the extension \".pb.txt\".

Different method: Frozen signatures

Instead of creating a polymorphic tf.function and extracting a concrete one in a second step, you can directly define an input signature upon definition.

@tf.function(input_signature=(tf.TensorSpec(shape=[2, 10], dtype=tf.float32),))\ndef model(x):\n    ...\n

This disables signature tracing since the input signature is frozen. However, you can directly pass it to cmsml.tensorflow.save_graph.

"},{"location":"inference/tensorflow2.html#inference-in-cmssw","title":"Inference in CMSSW","text":"

The inference can be implemented to run in a single thread. In general, this does not mean that the module cannot be executed with multiple threads (cmsRun --numThreads <N> <CFG_FILE>), but rather that its performance in terms of evaluation time and especially memory consumption is likely to be suboptimal. Therefore, for modules to be integrated into CMSSW, the multi-threaded implementation is strongly recommended.

"},{"location":"inference/tensorflow2.html#cmssw-module-setup","title":"CMSSW module setup","text":"

If you aim to use the TensorFlow interface in a CMSSW plugin, make sure to include

<use name=\"PhysicsTools/TensorFlow\" />\n\n<flags EDM_PLUGIN=\"1\" />\n

in your plugins/BuildFile.xml file. If you are using the interface inside the src/ or interface/ directory of your module, make sure to create a global BuildFile.xml file next to these directories, containing (at least):

<use name=\"PhysicsTools/TensorFlow\" />\n\n<export>\n<lib name=\"1\" />\n</export>\n
"},{"location":"inference/tensorflow2.html#single-threaded-inference","title":"Single-threaded inference","text":"

Despite tf.Session being removed in the Python interface as of TensorFlow 2, the concepts of

  • Graph's, containing the constant computational structure and trained variables of your model,
  • Session's, handling execution and data exchange, and
  • the separation between them

live on in the C++ interface. Thus, the overall inference approach is 1) include the interface, 2) initialize Graph and session, 3) per event create input tensors and run the inference, and 4) cleanup.

"},{"location":"inference/tensorflow2.html#1-includes","title":"1. Includes","text":"
#include \"PhysicsTools/TensorFlow/interface/TensorFlow.h\"\n#include \"FWCore/Framework/interface/one/EDAnalyzer.h\"\n// further framework includes\n...\n
"},{"location":"inference/tensorflow2.html#2-initialize-objects","title":"2. Initialize objects","text":"
// configure logging to show warnings (see table below)\ntensorflow::setLogging(\"2\");\n\n// load the graph definition\ntensorflow::GraphDef* graphDef = tensorflow::loadGraphDef(\"/path/to/constantgraph.pb\");\n\n// create a session\ntensorflow::Session* session = tensorflow::createSession(graphDef);\n
"},{"location":"inference/tensorflow2.html#3-inference","title":"3. Inference","text":"
// create an input tensor\n// (example: single batch of 10 values)\ntensorflow::Tensor input(tensorflow::DT_FLOAT, { 1, 10 });\n\n\n// fill the tensor with your input data\n// (example: just fill consecutive values)\nfor (size_t i = 0; i < 10; i++) {\ninput.matrix<float>()(0, i) = float(i);\n}\n\n// run the evaluation\nstd::vector<tensorflow::Tensor> outputs;\ntensorflow::run(session, { { \"input\", input } }, { \"output\" }, &outputs);\n\n// process the output tensor\n// (example: print the 5th value of the 0th (the only) example)\nstd::cout << outputs[0].matrix<float>()(0, 5) << std::endl;\n// -> float\n
"},{"location":"inference/tensorflow2.html#4-cleanup","title":"4. Cleanup","text":"
tensorflow::closeSession(session);\ndelete graphDef;\n
"},{"location":"inference/tensorflow2.html#full-example","title":"Full example","text":"Click to expand

The example assumes the following directory structure:

MySubsystem/MyModule/\n\u2502\n\u251c\u2500\u2500 plugins/\n\u2502   \u251c\u2500\u2500 MyPlugin.cpp\n\u2502   \u2514\u2500\u2500 BuildFile.xml\n\u2502\n\u251c\u2500\u2500 test/\n\u2502   \u2514\u2500\u2500 my_plugin_cfg.py\n\u2502\n\u2514\u2500\u2500 data/\n    \u2514\u2500\u2500 graph.pb\n
plugins/MyPlugin.cpp | plugins/BuildFile.xml | test/my_plugin_cfg.py
/*\n * Example plugin to demonstrate the direct single-threaded inference with TensorFlow 2.\n */\n\n#include <memory>\n\n#include \"FWCore/Framework/interface/Event.h\"\n#include \"FWCore/Framework/interface/Frameworkfwd.h\"\n#include \"FWCore/Framework/interface/MakerMacros.h\"\n#include \"FWCore/Framework/interface/one/EDAnalyzer.h\"\n#include \"FWCore/ParameterSet/interface/ParameterSet.h\"\n#include \"PhysicsTools/TensorFlow/interface/TensorFlow.h\"\n\nclass MyPlugin : public edm::one::EDAnalyzer<> {\npublic:\nexplicit MyPlugin(const edm::ParameterSet&);\n~MyPlugin(){};\n\nstatic void fillDescriptions(edm::ConfigurationDescriptions&);\n\nprivate:\nvoid beginJob();\nvoid analyze(const edm::Event&, const edm::EventSetup&);\nvoid endJob();\n\nstd::string graphPath_;\nstd::string inputTensorName_;\nstd::string outputTensorName_;\n\ntensorflow::GraphDef* graphDef_;\ntensorflow::Session* session_;\n};\n\nvoid MyPlugin::fillDescriptions(edm::ConfigurationDescriptions& descriptions) {\n// defining this function will lead to a *_cfi file being generated when compiling\nedm::ParameterSetDescription desc;\ndesc.add<std::string>(\"graphPath\");\ndesc.add<std::string>(\"inputTensorName\");\ndesc.add<std::string>(\"outputTensorName\");\ndescriptions.addWithDefaultLabel(desc);\n}\n\nMyPlugin::MyPlugin(const edm::ParameterSet& config)\n: graphPath_(config.getParameter<std::string>(\"graphPath\")),\ninputTensorName_(config.getParameter<std::string>(\"inputTensorName\")),\noutputTensorName_(config.getParameter<std::string>(\"outputTensorName\")),\ngraphDef_(nullptr),\nsession_(nullptr) {\n// set tensorflow log level to warning\ntensorflow::setLogging(\"2\");\n}\n\nvoid MyPlugin::beginJob() {\n// load the graph\ngraphDef_ = tensorflow::loadGraphDef(graphPath_);\n\n// create a new session and add the graphDef\nsession_ = tensorflow::createSession(graphDef_);\n}\n\nvoid MyPlugin::endJob() {\n// close the session\ntensorflow::closeSession(session_);\n\n// delete the graph\ndelete 
graphDef_;\ngraphDef_ = nullptr;\n}\n\nvoid MyPlugin::analyze(const edm::Event& event, const edm::EventSetup& setup) {\n// define a tensor and fill it with range(10)\ntensorflow::Tensor input(tensorflow::DT_FLOAT, {1, 10});\nfor (size_t i = 0; i < 10; i++) {\ninput.matrix<float>()(0, i) = float(i);\n}\n\n// define the output and run\nstd::vector<tensorflow::Tensor> outputs;\ntensorflow::run(session_, {{inputTensorName_, input}}, {outputTensorName_}, &outputs);\n\n// print the output\nstd::cout << \" -> \" << outputs[0].matrix<float>()(0, 0) << std::endl << std::endl;\n}\n\nDEFINE_FWK_MODULE(MyPlugin);\n
<use name=\"FWCore/Framework\" />\n<use name=\"FWCore/PluginManager\" />\n<use name=\"FWCore/ParameterSet\" />\n<use name=\"PhysicsTools/TensorFlow\" />\n\n<flags EDM_PLUGIN=\"1\" />\n
# coding: utf-8\n\nimport os\n\nimport FWCore.ParameterSet.Config as cms\nfrom FWCore.ParameterSet.VarParsing import VarParsing\n\n\n# get the data/ directory\nthisdir = os.path.dirname(os.path.abspath(__file__))\ndatadir = os.path.join(os.path.dirname(thisdir), \"data\")\n\n# setup minimal options\noptions = VarParsing(\"python\")\noptions.setDefault(\"inputFiles\", \"root://xrootd-cms.infn.it//store/mc/RunIISummer20UL17MiniAODv2/TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/MINIAODSIM/106X_mc2017_realistic_v9-v1/00000/005708B7-331C-904E-88B9-189011E6C9DD.root\")  # noqa\noptions.parseArguments()\n\n# define the process to run\nprocess = cms.Process(\"TEST\")\n\n# minimal configuration\nprocess.load(\"FWCore.MessageService.MessageLogger_cfi\")\nprocess.MessageLogger.cerr.FwkReport.reportEvery = 1\nprocess.maxEvents = cms.untracked.PSet(\n    input=cms.untracked.int32(10),\n)\nprocess.source = cms.Source(\n    \"PoolSource\",\n    fileNames=cms.untracked.vstring(options.inputFiles),\n)\n\n# process options\nprocess.options = cms.untracked.PSet(\n    allowUnscheduled=cms.untracked.bool(True),\n    wantSummary=cms.untracked.bool(True),\n)\n\n# setup MyPlugin by loading the auto-generated cfi (see MyPlugin.fillDescriptions)\nprocess.load(\"MySubsystem.MyModule.myPlugin_cfi\")\nprocess.myPlugin.graphPath = cms.string(os.path.join(datadir, \"graph.pb\"))\nprocess.myPlugin.inputTensorName = cms.string(\"input\")\nprocess.myPlugin.outputTensorName = cms.string(\"output\")\n\n# define what to run in the path\nprocess.p = cms.Path(process.myPlugin)\n
"},{"location":"inference/tensorflow2.html#multi-threaded-inference","title":"Multi-threaded inference","text":"

Compared to the single-threaded implementation above, the multi-threaded version has one major difference: both the Graph and the Session are no longer members of a particular module instance, but rather shared between all instances in all threads. See the documentation on the C++ interface of stream modules for details.

Recommendation updated

The previous recommendation stated that the Session is not constant and thus should not be placed in the global cache, but rather created once per stream module instance. However, it was discovered that, although not explicitly declared as constant in the tensorflow::run() / Session::run() interface, the session is actually not changed during evaluation and can be treated as being effectively constant.

As a result, it is safe to move it to the global cache, next to the Graph object. The TensorFlow interface in CMSSW was adjusted in order to accept const objects in cmssw#40161.

Thus, the overall inference approach is 1) include the interface, 2) let your plugin inherit from edm::stream::EDAnalyzer and declare the GlobalCache, 3) store a const Session* pointing to the cached session, and 4) per event create input tensors and run the inference.

"},{"location":"inference/tensorflow2.html#1-includes_1","title":"1. Includes","text":"
#include \"PhysicsTools/TensorFlow/interface/TensorFlow.h\"\n#include \"FWCore/Framework/interface/stream/EDAnalyzer.h\"\n// further framework includes\n...\n

Note that stream/EDAnalyzer.h is included rather than one/EDAnalyzer.h.

"},{"location":"inference/tensorflow2.html#2-define-and-use-the-global-cache","title":"2. Define and use the global cache","text":"

The cache definition is done by declaring a simple struct. However, for the purpose of just storing a graph and a session object, a so-called tensorflow::SessionCache struct is already provided centrally. It was added in cmssw#40284 and its usage is shown in the following. In case the tensorflow::SessionCache is not (yet) available in your version of CMSSW, expand the \"Custom cache struct\" section below.

Use it in the edm::GlobalCache template argument and adjust the plugin accordingly.

class MyPlugin : public edm::stream::EDAnalyzer<edm::GlobalCache<tensorflow::SessionCache>> {\npublic:\nexplicit MyPlugin(const edm::ParameterSet&, const tensorflow::SessionCache*);\n~MyPlugin();\n\n// additional static methods for handling the global cache\nstatic std::unique_ptr<tensorflow::SessionCache> initializeGlobalCache(const edm::ParameterSet&);\nstatic void globalEndJob(const tensorflow::SessionCache*);\n...\n

Implement initializeGlobalCache to control the behavior of how the cache object is created. The destructor of tensorflow::SessionCache already handles the closing of the session itself and the deletion of all objects.

std::unique_ptr<tensorflow::SessionCache> MyPlugin::initializeGlobalCache(const edm::ParameterSet& config) {\n// resolve the configured graph path and let the cache load graph and session\nstd::string graphPath = edm::FileInPath(config.getParameter<std::string>(\"graphPath\")).fullPath();\nreturn std::make_unique<tensorflow::SessionCache>(graphPath);\n}\n
Custom cache struct
struct CacheData {\nCacheData() : graph(nullptr), session(nullptr) {\n}\n\nstd::atomic<tensorflow::GraphDef*> graph;\nstd::atomic<tensorflow::Session*> session;\n};\n

Use it in the edm::GlobalCache template argument and adjust the plugin accordingly.

class MyPlugin : public edm::stream::EDAnalyzer<edm::GlobalCache<CacheData>> {\npublic:\nexplicit MyPlugin(const edm::ParameterSet&, const CacheData*);\n~MyPlugin();\n\n// two additional static methods for handling the global cache\nstatic std::unique_ptr<CacheData> initializeGlobalCache(const edm::ParameterSet&);\nstatic void globalEndJob(const CacheData*);\n...\n

Implement initializeGlobalCache and globalEndJob to control the behavior of how the cache object is created and destroyed.

See the full example below for more details.

"},{"location":"inference/tensorflow2.html#3-initialize-objects","title":"3. Initialize objects","text":"

In your module constructor, you can get a pointer to the constant session to perform model evaluation during the event loop.

// declaration in header\nconst tensorflow::Session* session_;\n\n// get a pointer to the const session stored in the cache in the constructor init\nMyPlugin::MyPlugin(const edm::ParameterSet& config, const tensorflow::SessionCache* cache)\n: session_(cache->getSession()) {\n...\n}\n
"},{"location":"inference/tensorflow2.html#4-inference","title":"4. Inference","text":"
// create an input tensor\n// (example: single batch of 10 values)\ntensorflow::Tensor input(tensorflow::DT_FLOAT, { 1, 10 });\n\n\n// fill the tensor with your input data\n// (example: just fill consecutive values)\nfor (size_t i = 0; i < 10; i++) {\ninput.matrix<float>()(0, i) = float(i);\n}\n\n// define the output\nstd::vector<tensorflow::Tensor> outputs;\n\n// evaluate\n// note: in case this line causes the compiler to complain about the const'ness of the session_ in\n//       this call, your CMSSW version might not yet support passing a const session, so in this\n//       case, pass \"const_cast<tensorflow::Session*>(session_)\"\ntensorflow::run(session_, { { inputTensorName, input } }, { outputTensorName }, &outputs);\n\n// process the output tensor\n// (example: print the 5th value of the 0th (the only) example)\nstd::cout << outputs[0].matrix<float>()(0, 5) << std::endl;\n// -> float\n

Note

If the TensorFlow interface in your CMSSW release does not yet accept const sessions, line 19 in the example above will cause an error during compilation. In this case, replace session_ in that line with

const_cast<tensorflow::Session*>(session_)\n
"},{"location":"inference/tensorflow2.html#full-example_1","title":"Full example","text":"Click to expand

The example assumes the following directory structure:

MySubsystem/MyModule/\n\u2502\n\u251c\u2500\u2500 plugins/\n\u2502   \u251c\u2500\u2500 MyPlugin.cpp\n\u2502   \u2514\u2500\u2500 BuildFile.xml\n\u2502\n\u251c\u2500\u2500 test/\n\u2502   \u2514\u2500\u2500 my_plugin_cfg.py\n\u2502\n\u2514\u2500\u2500 data/\n    \u2514\u2500\u2500 graph.pb\n
plugins/MyPlugin.cpp | plugins/BuildFile.xml | test/my_plugin_cfg.py
/*\n * Example plugin to demonstrate the direct multi-threaded inference with TensorFlow 2.\n */\n\n#include <memory>\n\n#include \"FWCore/Framework/interface/Event.h\"\n#include \"FWCore/Framework/interface/Frameworkfwd.h\"\n#include \"FWCore/Framework/interface/MakerMacros.h\"\n#include \"FWCore/Framework/interface/stream/EDAnalyzer.h\"\n#include \"FWCore/ParameterSet/interface/ParameterSet.h\"\n#include \"PhysicsTools/TensorFlow/interface/TensorFlow.h\"\n\n// put a tensorflow::SessionCache into the global cache structure\n// the session cache wraps both a tf graph and a tf session instance and also handles their deletion\nclass MyPlugin : public edm::stream::EDAnalyzer<edm::GlobalCache<tensorflow::SessionCache>> {\npublic:\nexplicit MyPlugin(const edm::ParameterSet&, const tensorflow::SessionCache*);\n~MyPlugin(){};\n\nstatic void fillDescriptions(edm::ConfigurationDescriptions&);\n\n// an additional static method for initializing the global cache\nstatic std::unique_ptr<tensorflow::SessionCache> initializeGlobalCache(const edm::ParameterSet&);\n\nprivate:\nvoid beginJob();\nvoid analyze(const edm::Event&, const edm::EventSetup&);\nvoid endJob();\n\nstd::string inputTensorName_;\nstd::string outputTensorName_;\n\n// a pointer to the session created by the global session cache\nconst tensorflow::Session* session_;\n};\n\nstd::unique_ptr<tensorflow::SessionCache> MyPlugin::initializeGlobalCache(const edm::ParameterSet& params) {\n// this method is supposed to create, initialize and return a SessionCache instance\nstd::string graphPath = edm::FileInPath(params.getParameter<std::string>(\"graphPath\")).fullPath();\n// set up the TF backend by configuration\n// (the options object must be declared outside the if/else so that it is still in scope below)\ntensorflow::Backend backend = tensorflow::Backend::cpu;\nif (params.getParameter<std::string>(\"tf_backend\") == \"cuda\") {\nbackend = tensorflow::Backend::cuda;\n}\ntensorflow::Options options{backend};\nreturn std::make_unique<tensorflow::SessionCache>(graphPath, options);\n}\n\nvoid MyPlugin::fillDescriptions(edm::ConfigurationDescriptions& descriptions) {\n// defining this function will lead to a *_cfi file being generated when compiling\nedm::ParameterSetDescription desc;\ndesc.add<std::string>(\"graphPath\");\ndesc.add<std::string>(\"inputTensorName\");\ndesc.add<std::string>(\"outputTensorName\");\n// backend read in initializeGlobalCache, defaults to cpu\ndesc.add<std::string>(\"tf_backend\", \"cpu\");\ndescriptions.addWithDefaultLabel(desc);\n}\n\nMyPlugin::MyPlugin(const edm::ParameterSet& config,  const tensorflow::SessionCache* cache)\n: inputTensorName_(config.getParameter<std::string>(\"inputTensorName\")),\noutputTensorName_(config.getParameter<std::string>(\"outputTensorName\")),\nsession_(cache->getSession()) {}\n\nvoid MyPlugin::beginJob() {}\n\nvoid MyPlugin::endJob() {\n// nothing to close here: the session is owned by the SessionCache, whose destructor closes it\n}\n\nvoid MyPlugin::analyze(const edm::Event& event, const edm::EventSetup& setup) {\n// define a tensor and fill it with range(10)\ntensorflow::Tensor input(tensorflow::DT_FLOAT, {1, 10});\nfor (size_t i = 0; i < 10; i++) {\ninput.matrix<float>()(0, i) = float(i);\n}\n\n// define the output\nstd::vector<tensorflow::Tensor> outputs;\n\n// evaluate\n// note: in case this line causes the compiler to complain about the const'ness of the session_ in\n//       this call, your CMSSW version might not yet support passing a const session, so in this\n//       case, pass \"const_cast<tensorflow::Session*>(session_)\"\ntensorflow::run(session_, {{inputTensorName_, input}}, {outputTensorName_}, &outputs);\n\n// print the output\nstd::cout << \" -> \" << outputs[0].matrix<float>()(0, 0) << std::endl << std::endl;\n}\n\nDEFINE_FWK_MODULE(MyPlugin);\n
<use name=\"FWCore/Framework\" />\n<use name=\"FWCore/PluginManager\" />\n<use name=\"FWCore/ParameterSet\" />\n<use name=\"PhysicsTools/TensorFlow\" />\n\n<flags EDM_PLUGIN=\"1\" />\n
# coding: utf-8\n\nimport os\n\nimport FWCore.ParameterSet.Config as cms\nfrom FWCore.ParameterSet.VarParsing import VarParsing\n\n\n# get the data/ directory\nthisdir = os.path.dirname(os.path.abspath(__file__))\ndatadir = os.path.join(os.path.dirname(thisdir), \"data\")\n\n# setup minimal options\noptions = VarParsing(\"python\")\noptions.setDefault(\"inputFiles\", \"root://xrootd-cms.infn.it//store/mc/RunIISummer20UL17MiniAODv2/TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/MINIAODSIM/106X_mc2017_realistic_v9-v1/00000/005708B7-331C-904E-88B9-189011E6C9DD.root\")  # noqa\noptions.parseArguments()\n\n# define the process to run\nprocess = cms.Process(\"TEST\")\n\n# minimal configuration\nprocess.load(\"FWCore.MessageService.MessageLogger_cfi\")\nprocess.MessageLogger.cerr.FwkReport.reportEvery = 1\nprocess.maxEvents = cms.untracked.PSet(\n    input=cms.untracked.int32(10),\n)\nprocess.source = cms.Source(\n    \"PoolSource\",\n    fileNames=cms.untracked.vstring(options.inputFiles),\n)\n\n# process options\nprocess.options = cms.untracked.PSet(\n    allowUnscheduled=cms.untracked.bool(True),\n    wantSummary=cms.untracked.bool(True),\n)\n\n# setup MyPlugin by loading the auto-generated cfi (see MyPlugin.fillDescriptions)\nprocess.load(\"MySubsystem.MyModule.myPlugin_cfi\")\nprocess.myPlugin.graphPath = cms.string(os.path.join(datadir, \"graph.pb\"))\nprocess.myPlugin.inputTensorName = cms.string(\"input\")\nprocess.myPlugin.outputTensorName = cms.string(\"output\")\n\n# define what to run in the path\nprocess.p = cms.Path(process.myPlugin)\n
"},{"location":"inference/tensorflow2.html#gpu-backend","title":"GPU backend","text":"

By default, TensorFlow sessions are created for running on the CPU. Since CMSSW_13_1_X, the GPU backend for TensorFlow is available in the CMSSW release.

Only minimal changes to the inference code are needed to move the model to the GPU. A tensorflow::Options struct is available to set up the backend.

tensorflow::Options options { tensorflow::Backend::cuda};\n\n// initialize the cache\ntensorflow::SessionCache cache(pbFile, options);\n// or a single session\nconst tensorflow::Session* session = tensorflow::createSession(graphDef, options);\n

CMSSW modules should add an option to the PSets of their producers and analyzers so that the TensorFlow backend of the sessions created by the plugins can be configured on the fly.

"},{"location":"inference/tensorflow2.html#optimization","title":"Optimization","text":"

Depending on the use case, the following approaches can optimize the inference performance. It could be worth checking them out in your algorithm.

Further optimization approaches can be found in the integration checklist.

"},{"location":"inference/tensorflow2.html#reusing-tensors","title":"Reusing tensors","text":"

In some cases, instead of creating new input tensors for each inference call, you might want to store input tensors as members of your plugin. This is only possible if you know the exact shape of the tensor a priori, and it comes at the cost of keeping the tensor in memory for the lifetime of your module instance.

You can use

tensor.flat<float>().setZero();\n

to reset the values of your tensor prior to each call.

"},{"location":"inference/tensorflow2.html#tensor-data-access-via-pointers","title":"Tensor data access via pointers","text":"

As shown in the examples above, tensor data can be accessed through methods such as flat<type>() or matrix<type>() which return objects that represent the underlying data in the requested structure (tensorflow::Tensor C++ API). To read and manipulate particular elements, you can directly call this object with the coordinates of an element.

// matrix returns a 2D representation\n// set element (b,i) to f\ntensor.matrix<float>()(b, i) = float(f);\n

However, doing this for a large input tensor might entail some overhead. Since the data is actually contiguous in memory (C-style \"row-major\" memory ordering), a faster (though less explicit) way of interacting with tensor data is using a pointer.

// get the pointer to the first tensor element\nfloat* d = tensor.flat<float>().data();\n

Now, the tensor data can be filled using simple and fast pointer arithmetic.

// fill tensor data using pointer arithmetic\n// memory ordering is row-major, so the outermost loop corresponds to dimension 0\nfor (size_t b = 0; b < batchSize; b++) {\nfor (size_t i = 0; i < nFeatures; i++, d++) {  // note the d++\n*d = float(i);\n}\n}\n
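
The loop above relies on the row-major layout: element (b, i) of a tensor with shape (batchSize, nFeatures) sits at the flat offset b * nFeatures + i. The following Python sketch only illustrates that index arithmetic, with a plain list standing in for the tensor buffer. The helper names flat_offset and fill_row_major are made up for this illustration and are not part of the TensorFlow interface.

```python
def flat_offset(b, i, n_features):
    # row-major: all features of example b are stored next to each other
    return b * n_features + i


def fill_row_major(batch_size, n_features):
    # mimic the C++ loop: one running position, advanced once per element
    buffer = [0.0] * (batch_size * n_features)
    d = 0
    for b in range(batch_size):
        for i in range(n_features):
            buffer[d] = float(i)
            d += 1
    return buffer


buffer = fill_row_major(batch_size=2, n_features=3)
print(buffer)
print(buffer[flat_offset(1, 2, 3)])
```

Because the running position advances by exactly one per inner iteration, it always equals flat_offset(b, i, nFeatures), which is why the C++ version can increment d instead of recomputing the offset.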
"},{"location":"inference/tensorflow2.html#inter-and-intra-operation-parallelism","title":"Inter- and intra-operation parallelism","text":"

Debugging and local processing only

Parallelism between (inter) and within (intra) operations can greatly improve the inference performance. However, this allows TensorFlow to manage and schedule threads on its own, possibly interfering with the thread model inherent to CMSSW. For inference code that is to be officially integrated, you should avoid inter- and intra-op parallelism and rather adhere to the examples shown above.

You can configure the number of inter- and intra-op threads via the second argument of the tensorflow::createSession method.

Simple:
tensorflow::Session* session = tensorflow::createSession(graphDef, nThreads);\n
Verbose:
tensorflow::SessionOptions sessionOptions;\nsessionOptions.config.set_intra_op_parallelism_threads(nThreads);\nsessionOptions.config.set_inter_op_parallelism_threads(nThreads);\n\ntensorflow::Session* session = tensorflow::createSession(graphDef, sessionOptions);\n

Then, when calling tensorflow::run, pass the internal name of the TensorFlow threadpool, i.e. \"tensorflow\", as the last argument.

std::vector<tensorflow::Tensor> outputs;\ntensorflow::run(\nsession,\n{ { inputTensorName, input } },\n{ outputTensorName },\n&outputs,\n\"tensorflow\"\n);\n
"},{"location":"inference/tensorflow2.html#miscellaneous","title":"Miscellaneous","text":""},{"location":"inference/tensorflow2.html#logging","title":"Logging","text":"

By default, TensorFlow logging is quite verbose. This can be changed by either setting the TF_CPP_MIN_LOG_LEVEL environment variable before calling cmsRun, or within your code through tensorflow::setLogging(level).

Verbosity level    TF_CPP_MIN_LOG_LEVEL
debug              \"0\"
info               \"1\" (default)
warning            \"2\"
error              \"3\"
none               \"4\"
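
As a minimal sketch of the environment-variable route, the snippet below maps the verbosity levels listed above to their TF_CPP_MIN_LOG_LEVEL values and prepares an environment for a child process. The helper name env_with_tf_log_level and the config file name in the comment are assumptions made for illustration. Only the variable name and its values come from the text above.

```python
import os

# verbosity levels and values as listed above
TF_LOG_LEVELS = {'debug': '0', 'info': '1', 'warning': '2', 'error': '3', 'none': '4'}


def env_with_tf_log_level(verbosity, base_env=None):
    # return a copy of the environment with TF_CPP_MIN_LOG_LEVEL set
    env = dict(os.environ if base_env is None else base_env)
    env['TF_CPP_MIN_LOG_LEVEL'] = TF_LOG_LEVELS[verbosity]
    return env


env = env_with_tf_log_level('warning')
print(env['TF_CPP_MIN_LOG_LEVEL'])
# the environment could then be handed to a child process, e.g.
# subprocess.run(['cmsRun', 'my_plugin_cfg.py'], env=env)
```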

Forwarding logs to the MessageLogger service is not possible yet.

"},{"location":"inference/tensorflow2.html#links-and-further-reading","title":"Links and further reading","text":"
  • cmsml package
  • CMSSW
    • TensorFlow interface documentation
    • TensorFlow interface header
    • CMSSW process options
    • C++ interface of stream modules
  • TensorFlow
    • TensorFlow 2 tutorial
    • tf.function
    • C++ API
    • tensorflow::Tensor
    • tensorflow::Operation
    • tensorflow::ClientSession
  • Keras
    • API

Authors: Marcel Rieger

"},{"location":"inference/tfaas.html","title":"TFaaS","text":""},{"location":"inference/tfaas.html#tensorflow-as-a-service","title":"TensorFlow as a Service","text":"

TensorFlow as a Service (TFaaS) was developed as a general-purpose service which can be deployed on any infrastructure, from a personal laptop or VM to cloud infrastructure, including kubernetes/docker based ones. The main repository contains all details about the service, including installation, an end-to-end example, and a demo.

For CERN users, TFaaS is already deployed at the following URL: https://cms-tfaas.cern.ch

CMS members can access it with any HTTP-based client. For example, here is a basic request with the curl client:

curl -k https://cms-tfaas.cern.ch/models\n[\n  {\n    \"name\": \"luca\",\n    \"model\": \"prova.pb\",\n    \"labels\": \"labels.csv\",\n    \"options\": null,\n    \"inputNode\": \"dense_1_input\",\n    \"outputNode\": \"output_node0\",\n    \"description\": \"\",\n    \"timestamp\": \"2021-10-22 14:04:52.890554036 +0000 UTC m=+600537.976386186\"\n  },\n  {\n    \"name\": \"test_luca_1024\",\n    \"model\": \"saved_model.pb\",\n    \"labels\": \"labels.txt\",\n    \"options\": null,\n    \"inputNode\": \"dense_input_1:0\",\n    \"outputNode\": \"dense_3/Sigmoid:0\",\n    \"description\": \"\",\n    \"timestamp\": \"2021-10-22 14:04:52.890776518 +0000 UTC m=+600537.976608672\"\n  },\n  {\n    \"name\": \"vk\",\n    \"model\": \"model.pb\",\n    \"labels\": \"labels.txt\",\n    \"options\": null,\n    \"inputNode\": \"dense_1_input\",\n    \"outputNode\": \"output_node0\",\n    \"description\": \"\",\n    \"timestamp\": \"2021-10-22 14:04:52.890903234 +0000 UTC m=+600537.976735378\"\n  }\n]\n

The following APIs are available:
  • /upload to push your favorite TF model to the TFaaS server, either as a form or as a tar-ball bundle, see examples below
  • /delete to delete your TF model from the TFaaS server
  • /models to view existing TF models on the TFaaS server
  • /predict/json to serve TF model predictions in JSON data-format
  • /predict/proto to serve TF model predictions in ProtoBuffer data-format
  • /predict/image to serve TF model predictions for images in JPG/PNG formats

"},{"location":"inference/tfaas.html#look-up-your-favorite-model","title":"\u2780 look-up your favorite model","text":"

You can easily look up your ML model on the TFaaS server, e.g.

curl https://cms-tfaas.cern.ch/models\n# possible output may look like this\n[\n  {\n    \"name\": \"luca\",\n    \"model\": \"prova.pb\",\n    \"labels\": \"labels.csv\",\n    \"options\": null,\n    \"inputNode\": \"dense_1_input\",\n    \"outputNode\": \"output_node0\",\n    \"description\": \"\",\n    \"timestamp\": \"2021-11-08 20:07:18.397487027 +0000 UTC m=+2091094.457327022\"\n  }\n  ...\n]\n
The /models API lists, for each model, its name, file name, labels file, possible options, input and output nodes, description, and the timestamp at which it was added to the TFaaS repository.

"},{"location":"inference/tfaas.html#upload-your-tf-model-to-tfaas-server","title":"\u2781 upload your TF model to TFaaS server","text":"

If your model is not yet on the TFaaS server, you can add it as follows:

# example of image based model upload\ncurl -X POST https://cms-tfaas.cern.ch/upload \\\n    -F 'name=ImageModel' -F 'params=@/path/params.json' \\\n    -F 'model=@/path/tf_model.pb' -F 'labels=@/path/labels.txt'\n\n# example of TF pb file upload\ncurl -s -X POST https://cms-tfaas.cern.ch/upload \\\n    -F 'name=vk' -F 'params=@/path/params.json' \\\n    -F 'model=@/path/model.pb' -F 'labels=@/path/labels.txt'\n\n# example of bundle upload produced with Keras TF\n# here is our saved model area\nls model\nassets         saved_model.pb variables\n# we can create tarball and upload it to TFaaS via bundle end-point\ntar cfz model.tar.gz model\ncurl -X POST -H \"Content-Encoding: gzip\" \\\n             -H \"content-type: application/octet-stream\" \\\n             --data-binary @/path/model.tar.gz https://cms-tfaas.cern.ch/upload\n

"},{"location":"inference/tfaas.html#get-your-predictions","title":"\u2782 get your predictions","text":"

Finally, you can obtain predictions from your favorite model by using the proper API, e.g.

# obtain predictions from your ImageModel\ncurl https://cms-tfaas.cern.ch/image -F 'image=@/path/file.png' -F 'model=ImageModel'\n\n# obtain predictions from your TF based model\ncat input.json\n{\"keys\": [...], \"values\": [...], \"model\":\"model\"}\n\n# call to get predictions from /json end-point using input.json\ncurl -s -X POST -H \"Content-type: application/json\" \\\n    -d@/path/input.json https://cms-tfaas.cern.ch/json\n
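
The JSON request above can also be prepared from Python. The sketch below uses only the standard library and builds the request without sending it. The URL and the keys/values/model structure are copied from the curl example. The helper name build_predict_request and the feature names and values are hypothetical and serve only as an illustration.

```python
import json
import urllib.request

# endpoint copied from the curl example above
TFAAS_JSON_URL = 'https://cms-tfaas.cern.ch/json'


def build_predict_request(keys, values, model, url=TFAAS_JSON_URL):
    # same structure as input.json: keys, values and the model name
    payload = {'keys': keys, 'values': values, 'model': model}
    data = json.dumps(payload).encode('utf-8')
    headers = {'Content-type': 'application/json'}
    return urllib.request.Request(url, data=data, headers=headers, method='POST')


# hypothetical feature names and values, for illustration only
request = build_predict_request(['pt', 'eta'], [42.0, 1.2], 'model')
print(request.get_method(), request.full_url)
print(request.data.decode('utf-8'))
# sending it would be: urllib.request.urlopen(request).read()
```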

For more information please visit the curl client page.

"},{"location":"inference/tfaas.html#tfaas-interface","title":"TFaaS interface","text":"

Clients communicate with TFaaS via HTTP protocol. See examples for Curl, Python and C++ clients.

"},{"location":"inference/tfaas.html#tfaas-benchmarks","title":"TFaaS benchmarks","text":"

Benchmark results on CentOS, 24 cores, 32GB of RAM, serving a DL NN with 42x128x128x128x64x64x1x1 architecture (JSON and ProtoBuffer formats show similar performance):
  • 400 req/sec for 100 concurrent clients, 1000 requests in total
  • 480 req/sec for 200 concurrent clients, 5000 requests in total

For more information please visit the benchmarks page.

"},{"location":"inference/xgboost.html","title":"Direct inference with XGBoost","text":""},{"location":"inference/xgboost.html#general","title":"General","text":"

XGBoost has been available (at least) since CMSSW_9_2_4, see cmssw#19377.

In the CMSSW environment, XGBoost can be used via its Python API.

For the UL era, different versions are available for different SCRAM_ARCH values:

  1. For slc7_amd64_gcc700 and above, ver.0.80 is available.

  2. For slc7_amd64_gcc900 and above, ver.1.3.3 is available.

  3. Please note that different major versions behave differently (see the Caveat section).

"},{"location":"inference/xgboost.html#existing-examples","title":"Existing Examples","text":"

There are some existing good examples of using XGBoost under CMSSW, as listed below:

  1. Official sample for testing the integration of the XGBoost library with CMSSW.

  2. Useful code created by Dr. Huilin Qu for inference with an existing trained model.

  3. C/C++ Interface for inference with existing trained model.

We will provide examples for both C/C++ interface and python interface of XGBoost under CMSSW environment.

"},{"location":"inference/xgboost.html#example-classification-of-points-from-joint-gaussian-distribution","title":"Example: Classification of points from joint-Gaussian distribution.","text":"

In this specific example, you will use XGBoost to classify data points generated from two 8-dimensional joint Gaussian distributions.

Feature index  0  1  2  3  4  5  6  7
\u03bc1  1  2  3  4  5  6  7  8
\u03bc2  0  1.9  3.2  4.5  4.8  6.1  8.1  11
\u03c31 = \u03c32 = \u03c3  1  1  1  1  1  1  1  1
|\u03bc1 - \u03bc2| / \u03c3  1  0.1  0.2  0.5  0.2  0.1  1.1  3

All generated data points for training (10000 points for each of the two classes) and testing (1000 points for each class) are stored as Train_data.csv and Test_data.csv.
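
A data set of this kind could be generated along the following lines. This is a sketch under stated assumptions, not the script that produced the provided CSV files: the features are drawn independently (no correlations), the class labels are 0 and 1, and the function names are made up for illustration. The means and widths are taken from the feature table above.

```python
import random

MU1 = [1, 2, 3, 4, 5, 6, 7, 8]
MU2 = [0, 1.9, 3.2, 4.5, 4.8, 6.1, 8.1, 11]
SIGMA = 1.0


def separation(mu1, mu2, sigma):
    # per-feature |mu1 - mu2| / sigma, rounded to one decimal place
    return [round(abs(a - b) / sigma, 1) for a, b in zip(mu1, mu2)]


def generate(n_per_class, seed=0):
    # draw n_per_class points per class; the last column is the class label
    # (independent features and labels 0/1 are assumptions of this sketch)
    rng = random.Random(seed)
    rows = []
    for label, mu in enumerate((MU1, MU2)):
        for _ in range(n_per_class):
            rows.append([rng.gauss(m, SIGMA) for m in mu] + [label])
    return rows


print(separation(MU1, MU2, SIGMA))
train = generate(10000)
test = generate(1000, seed=1)
print(len(train), len(test))
```

The separation values show which features carry most of the discriminating power: feature 7 separates the two classes by about three standard deviations, while features 1 and 5 barely differ.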

"},{"location":"inference/xgboost.html#preparing-model","title":"Preparing Model","text":"

The training of an XGBoost model can be done outside of CMSSW. We provide a python script for illustration.

# importing necessary modules\nimport numpy as np\nimport pandas as pd\nfrom xgboost import XGBClassifier # or XGBRegressor for regression tasks\nimport matplotlib.pyplot as plt\n\n# specify parameters via map\nparam = {'n_estimators':50}\nxgb = XGBClassifier(**param)\n\n# using Pandas.DataFrame data-format, other available format are XGBoost's DMatrix and numpy.ndarray\n\ntrain_data = pd.read_csv(\"path/to/the/data\") # The training dataset is code/XGBoost/Train_data.csv\n\ntrain_Variable = train_data[['0', '1', '2', '3', '4', '5', '6', '7']]\ntrain_Score = train_data['Type'] # Score should be integer, 0, 1, (2 and larger for multiclass)\n\ntest_data = pd.read_csv(\"path/to/the/data\") # The testing dataset is code/XGBoost/Test_data.csv\n\ntest_Variable = test_data[['0', '1', '2', '3', '4', '5', '6', '7']]\ntest_Score = test_data['Type']\n\n# Now the data are well prepared and named as train_Variable, train_Score and test_Variable, test_Score.\n\nxgb.fit(train_Variable, train_Score) # Training\n\nxgb.predict(test_Variable) # Outputs are integers\n\nxgb.predict_proba(test_Variable) # Output scores, output structure: [prob for 0, prob for 1,...]\n\nxgb.save_model(\"/path/to/where/you/want/ModelName.model\") # Saving model\n
The saved model ModelName.model can then be loaded through both the Python and the C/C++ API. Please use the same XGBoost major version consistently (see Caveat).

When training with data from different datasets, proper treatment of weights is necessary for better model performance. Please refer to the Official Recommendation for more details.

"},{"location":"inference/xgboost.html#cc-usage-with-cmssw","title":"C/C++ Usage with CMSSW","text":"

To use a saved XGBoost model with C/C++ code, it is convenient to use XGBoost's official C API. Here we provide a simple example.

"},{"location":"inference/xgboost.html#module-setup","title":"Module setup","text":"

There is no official CMSSW interface for XGBoost, although its library is available in the CMSSW area on cvmfs. Thus we have to use the raw c_api and set up the library manually.

  1. To run XGBoost's c_api within the CMSSW framework, start from the following standard setup.
    export SCRAM_ARCH=\"slc7_amd64_gcc700\" # To use a higher version, please switch to slc7_amd64_gcc900\nexport CMSSW_VERSION=\"CMSSW_X_Y_Z\"\n\nsource /cvmfs/cms.cern.ch/cmsset_default.sh\n\ncmsrel \"$CMSSW_VERSION\"\ncd \"$CMSSW_VERSION/src\"\n\ncmsenv\nscram b\n
    In addition, add the corresponding xml file(s) to $CMSSW_BASE/config/toolbox/$SCRAM_ARCH/tools/selected/ for setting up XGBoost.
  1. For lower version (<1), add two xml files as below.

    xgboost.xml

     <tool name=\"xgboost\" version=\"0.80\">\n<lib name=\"xgboost\"/>\n<client>\n<environment name=\"LIBDIR\" default=\"/cvmfs/cms.cern.ch/$SCRAM_ARCH/external/py2-xgboost/0.80-ikaegh/lib/python2.7/site-packages/xgboost/lib\"/>\n<environment name=\"INCLUDE\" default=\"/cvmfs/cms.cern.ch/$SCRAM_ARCH/external/py2-xgboost/0.80-ikaegh/lib/python2.7/site-packages/xgboost/include/\"/>\n</client>\n<runtime name=\"ROOT_INCLUDE_PATH\" value=\"$INCLUDE\" type=\"path\"/>\n<runtime name=\"PATH\" value=\"$INCLUDE\" type=\"path\"/>\n<use name=\"rabit\"/>\n</tool>\n
    rabit.xml
     <tool name=\"rabit\" version=\"0.80\">\n<client>\n<environment name=\"INCLUDE\" default=\"/cvmfs/cms.cern.ch/$SCRAM_ARCH/external/py2-xgboost/0.80-ikaegh/lib/python2.7/site-packages/xgboost/rabit/include/\"/>\n</client>\n<runtime name=\"ROOT_INCLUDE_PATH\" value=\"$INCLUDE\" type=\"path\"/>\n<runtime name=\"PATH\" value=\"$INCLUDE\" type=\"path\"/>  </tool>\n
    Please note that the path in cvmfs is not fixed; one can list all available versions in the py2-xgboost directory and choose one to use.

  2. For higher version (>=1), add one xml file

    xgboost.xml

    <tool name=\"xgboost\" version=\"1.3.3\">\n<lib name=\"xgboost\"/>\n<client>\n<environment name=\"LIBDIR\" default=\"/cvmfs/cms.cern.ch/$SCRAM_ARCH/external/xgboost/1.3.3/lib64\"/>\n<environment name=\"INCLUDE\" default=\"/cvmfs/cms.cern.ch/$SCRAM_ARCH/external/xgboost/1.3.3/include/\"/>\n</client>\n<runtime name=\"ROOT_INCLUDE_PATH\" value=\"$INCLUDE\" type=\"path\"/>\n<runtime name=\"PATH\" value=\"$INCLUDE\" type=\"path\"/>  </tool>\n
    Here, too, one is free to choose any of the xgboost versions available inside the xgboost directory.

  1. After adding xml file(s), the following commands should be executed for setting up.

    1. For lower version (<1), use
      scram setup rabit\nscram setup xgboost\n
    2. For higher version (>=1), use
      scram setup xgboost\n
  2. For using XGBoost as a plugin of CMSSW, it is necessary to add

    <use name=\"xgboost\"/>\n<flags EDM_PLUGIN=\"1\"/>\n
    in your plugins/BuildFile.xml. If you are using the interface inside the src/ or interface/ directory of your module, make sure to create a global BuildFile.xml file next to these directories, containing (at least):
    <use name=\"xgboost\"/>\n<export>\n<lib   name=\"1\"/>\n</export>\n

  3. The libxgboost.so library may be too large for a cmsRun job to load; in that case, pre-load it with the following command:

    export LD_PRELOAD=$CMSSW_BASE/external/$SCRAM_ARCH/lib/libxgboost.so\n

"},{"location":"inference/xgboost.html#basic-usage-of-c-api","title":"Basic Usage of C API","text":"

In order to use the c_api of XGBoost to load a model and run inference, one should construct the necessary objects:

  1. Files to include

    #include <xgboost/c_api.h> 

  2. BoosterHandle: worker of XGBoost

    // Declare Object\nBoosterHandle booster_;\n// Allocate memory in C style\nXGBoosterCreate(NULL,0,&booster_);\n// Load Model\nXGBoosterLoadModel(booster_,model_path.c_str()); // second argument should be a const char *.\n

  3. DMatrixHandle: handle to dmatrix, the data format of XGBoost

    float TestData[2000][8]; // Suppose 2000 data points, each data point has 8 dimensions\n// Assign data to the \"TestData\" 2d array ... \n// Declare object\nDMatrixHandle data_;\n// Allocate memory and use external float array to initialize\nXGDMatrixCreateFromMat((float *)TestData,2000,8,-1,&data_); // The first argument takes in float * namely 1d float array only, 2nd & 3rd: shape of input, 4th: value to replace missing ones\n

  4. XGBoosterPredict: function for inference

    bst_ulong out_len; // bst_ulong is a typedef of unsigned long\nconst float *f; // array to store predictions\nXGBoosterPredict(booster_,data_,0,0,&out_len,&f);// lower version API\n// XGBoosterPredict(booster_,data_,0,0,0,&out_len,&f);// higher version API\n/*\nlower version (ver.<1) API\nXGB_DLL int XGBoosterPredict(   \nBoosterHandle   handle,\nDMatrixHandle   dmat,\nint     option_mask, // 0 for normal output, namely reporting scores\nunsigned ntree_limit, // how many trees for prediction, set to 0 means no limit\nbst_ulong *     out_len,\nconst float **  out_result \n)\n\nhigher version (ver.>=1) API\nXGB_DLL int XGBoosterPredict(   \nBoosterHandle   handle,\nDMatrixHandle   dmat,\nint     option_mask, // 0 for normal output, namely reporting scores\nunsigned ntree_limit, // how many trees for prediction, set to 0 means no limit\nint     training, // 0 for prediction\nbst_ulong *     out_len,\nconst float **  out_result \n)\n*/\n

"},{"location":"inference/xgboost.html#full-example","title":"Full Example","text":"Click to expand full example

The example assumes the following directory structure:

MySubsystem/MyModule/\n\u2502\n\u251c\u2500\u2500 plugins/\n\u2502   \u251c\u2500\u2500 XGBoostExample.cc\n\u2502   \u2514\u2500\u2500 BuildFile.xml\n\u2502\n\u251c\u2500\u2500 python/\n\u2502   \u2514\u2500\u2500 xgboost_cfg.py\n\u2502\n\u251c\u2500\u2500 toolbox/ (storing necessary xml(s) to be copied to toolbox/ of $CMSSW_BASE)\n\u2502   \u2514\u2500\u2500 xgboost.xml\n\u2502   \u2514\u2500\u2500 rabit.xml (lower version only)\n\u2502\n\u2514\u2500\u2500 data/\n    \u2514\u2500\u2500 Test_data.csv\n    \u2514\u2500\u2500 lowVer.model / highVer.model \n
Please also note that, in order to run inference event by event, XGBoosterPredict should be called in analyze rather than in beginJob.

plugins/XGBoostExample.cc for lower version XGBoost | plugins/BuildFile.xml for lower version XGBoost | python/xgboost_cfg.py for lower version XGBoost | plugins/XGBoostExample.cc for higher version XGBoost | plugins/BuildFile.xml for higher version XGBoost | python/xgboost_cfg.py for higher version XGBoost
// -*- C++ -*-\n//\n// Package:    XGB_Example/XGBoostExample\n// Class:      XGBoostExample\n//\n/**\\class XGBoostExample XGBoostExample.cc XGB_Example/XGBoostExample/plugins/XGBoostExample.cc\n\n Description: [one line class summary]\n\n Implementation:\n     [Notes on implementation]\n*/\n//\n// Original Author:  Qian Sitian\n//         Created:  Sat, 19 Jun 2021 08:38:51 GMT\n//\n//\n\n\n// system include files\n#include <memory>\n\n// user include files\n#include \"FWCore/Framework/interface/Frameworkfwd.h\"\n#include \"FWCore/Framework/interface/one/EDAnalyzer.h\"\n\n#include \"FWCore/Framework/interface/Event.h\"\n#include \"FWCore/Framework/interface/MakerMacros.h\"\n\n#include \"FWCore/ParameterSet/interface/ParameterSet.h\"\n#include \"FWCore/Utilities/interface/InputTag.h\"\n#include \"DataFormats/TrackReco/interface/Track.h\"\n#include \"DataFormats/TrackReco/interface/TrackFwd.h\"\n\n#include <xgboost/c_api.h>\n#include <vector>\n#include <tuple>\n#include <string>\n#include <iostream>\n#include <fstream>\n#include <sstream>\n\nusing namespace std;\n\nvector<vector<double>> readinCSV(const char* name){\nauto fin = ifstream(name);\nvector<vector<double>> floatVec;\nstring strFloat;\nfloat fNum;\nint counter = 0;\ngetline(fin,strFloat);\nwhile(getline(fin,strFloat))\n{\nstd::stringstream  linestream(strFloat);\nfloatVec.push_back(std::vector<double>());\nwhile(linestream>>fNum)\n{\nfloatVec[counter].push_back(fNum);\nif (linestream.peek() == ',')\nlinestream.ignore();\n}\n++counter;\n}\nreturn floatVec;\n}\n\n//\n// class declaration\n//\n\n// If the analyzer does not use TFileService, please remove\n// the template argument to the base class so the class inherits\n// from  edm::one::EDAnalyzer<>\n// This will improve performance in multithreaded jobs.\n\n\n\nclass XGBoostExample : public edm::one::EDAnalyzer<>  {\npublic:\nexplicit XGBoostExample(const edm::ParameterSet&);\n~XGBoostExample();\n\nstatic void 
fillDescriptions(edm::ConfigurationDescriptions& descriptions);\n\n\nprivate:\nvirtual void beginJob() ;\nvirtual void analyze(const edm::Event&, const edm::EventSetup&) ;\nvirtual void endJob() ;\n\n// ----------member data ---------------------------\n\nstd::string test_data_path;\nstd::string model_path;\n\n\n\n\n};\n\n//\n// constants, enums and typedefs\n//\n\n//\n// static data member definitions\n//\n\n//\n// constructors and destructor\n//\nXGBoostExample::XGBoostExample(const edm::ParameterSet& config):\ntest_data_path(config.getParameter<std::string>(\"test_data_path\")),\nmodel_path(config.getParameter<std::string>(\"model_path\"))\n{\n\n}\n\n\nXGBoostExample::~XGBoostExample()\n{\n\n// do anything here that needs to be done at desctruction time\n// (e.g. close files, deallocate resources etc.)\n\n}\n\n\n//\n// member functions\n//\n\nvoid\nXGBoostExample::analyze(const edm::Event& iEvent, const edm::EventSetup& iSetup)\n{\n}\n\n\nvoid\nXGBoostExample::beginJob()\n{\nBoosterHandle booster_;\nXGBoosterCreate(NULL,0,&booster_);\ncout<<\"Hello World No.2\"<<endl;\nXGBoosterLoadModel(booster_,model_path.c_str());\nunsigned long numFeature = 0;\ncout<<\"Hello World No.3\"<<endl;\nvector<vector<double>> TestDataVector = readinCSV(test_data_path.c_str());\ncout<<\"Hello World No.4\"<<endl;\nfloat TestData[2000][8];\ncout<<\"Hello World No.5\"<<endl;\nfor(unsigned i=0; (i < 2000); i++)\n{ for(unsigned j=0; (j < 8); j++)\n{\nTestData[i][j] = TestDataVector[i][j];\n//  cout<<TestData[i][j]<<\"\\t\";\n} //cout<<endl;\n}\ncout<<\"Hello World No.6\"<<endl;\nDMatrixHandle data_;\nXGDMatrixCreateFromMat((float *)TestData,2000,8,-1,&data_);\ncout<<\"Hello World No.7\"<<endl;\nbst_ulong out_len=0;\nconst float *f;\ncout<<out_len<<endl;\nauto ret=XGBoosterPredict(booster_, data_, 0,0,&out_len,&f);\ncout<<ret<<endl;\nfor (unsigned int i=0;i<2;i++)\nstd::cout <<  i << \"\\t\"<< f[i] << std::endl;\ncout<<\"Hello World 
No.8\"<<endl;\n}\n\nvoid\nXGBoostExample::endJob()\n{\n}\n\nvoid\nXGBoostExample::fillDescriptions(edm::ConfigurationDescriptions& descriptions) {\n//The following says we do not know what parameters are allowed so do no validation\n// Please change this to state exactly what you do use, even if it is no parameters\nedm::ParameterSetDescription desc;\ndesc.add<std::string>(\"test_data_path\");\ndesc.add<std::string>(\"model_path\");\ndescriptions.addWithDefaultLabel(desc);\n\n//Specify that only 'tracks' is allowed\n//To use, remove the default given above and uncomment below\n//ParameterSetDescription desc;\n//desc.addUntracked<edm::InputTag>(\"tracks\",\"ctfWithMaterialTracks\");\n//descriptions.addDefault(desc);\n}\n\n//define this as a plug-in\nDEFINE_FWK_MODULE(XGBoostExample);\n
<use name=\"FWCore/Framework\"/>\n<use name=\"FWCore/PluginManager\"/>\n<use name=\"FWCore/ParameterSet\"/>\n<use name=\"DataFormats/TrackReco\"/>\n<use name=\"xgboost\"/>\n<flags EDM_PLUGIN=\"1\"/>\n
# coding: utf-8\n\nimport os\n\nimport FWCore.ParameterSet.Config as cms\nfrom FWCore.ParameterSet.VarParsing import VarParsing\n\n# setup minimal options\n#options = VarParsing(\"python\")\n#options.setDefault(\"inputFiles\", \"root://xrootd-cms.infn.it//store/mc/RunIIFall17MiniAOD/DYJetsToLL_M-10to50_TuneCP5_13TeV-madgraphMLM-pythia8/MINIAODSIM/94X_mc2017_realistic_v10-v2/00000/9A439935-1FFF-E711-AE07-D4AE5269F5FF.root\")  # noqa\n#options.parseArguments()\n\n# define the process to run\nprocess = cms.Process(\"TEST\")\n\n# minimal configuration\nprocess.load(\"FWCore.MessageService.MessageLogger_cfi\")\nprocess.MessageLogger.cerr.FwkReport.reportEvery = 1\nprocess.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(1))\n#process.source = cms.Source(\"PoolSource\",\n#    fileNames=cms.untracked.vstring('file:/afs/cern.ch/cms/Tutorials/TWIKI_DATA/TTJets_8TeV_53X.root'))\nprocess.source = cms.Source(\"EmptySource\")\n# process options\nprocess.options = cms.untracked.PSet(\n    allowUnscheduled=cms.untracked.bool(True),\n    wantSummary=cms.untracked.bool(True),\n)\n\nprocess.XGBoostExample = cms.EDAnalyzer(\"XGBoostExample\")\n\n# setup MyPlugin by loading the auto-generated cfi (see MyPlugin.fillDescriptions)\n#process.load(\"XGB_Example.XGBoostExample.XGBoostExample_cfi\")\nprocess.XGBoostExample.model_path = cms.string(\"/Your/Path/data/lowVer.model\")\nprocess.XGBoostExample.test_data_path = cms.string(\"/Your/Path/data/Test_data.csv\")\n\n# define what to run in the path\nprocess.p = cms.Path(process.XGBoostExample)\n
// -*- C++ -*-\n//\n// Package:    XGB_Example/XGBoostExample\n// Class:      XGBoostExample\n//\n/**\\class XGBoostExample XGBoostExample.cc XGB_Example/XGBoostExample/plugins/XGBoostExample.cc\n\n Description: [one line class summary]\n\n Implementation:\n     [Notes on implementation]\n*/\n//\n// Original Author:  Qian Sitian\n//         Created:  Sat, 19 Jun 2021 08:38:51 GMT\n//\n//\n\n\n// system include files\n#include <memory>\n\n// user include files\n#include \"FWCore/Framework/interface/Frameworkfwd.h\"\n#include \"FWCore/Framework/interface/one/EDAnalyzer.h\"\n\n#include \"FWCore/Framework/interface/Event.h\"\n#include \"FWCore/Framework/interface/MakerMacros.h\"\n\n#include \"FWCore/ParameterSet/interface/ParameterSet.h\"\n#include \"FWCore/Utilities/interface/InputTag.h\"\n#include \"DataFormats/TrackReco/interface/Track.h\"\n#include \"DataFormats/TrackReco/interface/TrackFwd.h\"\n\n#include <xgboost/c_api.h>\n#include <vector>\n#include <tuple>\n#include <string>\n#include <iostream>\n#include <fstream>\n#include <sstream>\n\nusing namespace std;\n\nvector<vector<double>> readinCSV(const char* name){\nauto fin = ifstream(name);\nvector<vector<double>> floatVec;\nstring strFloat;\nfloat fNum;\nint counter = 0;\ngetline(fin,strFloat);\nwhile(getline(fin,strFloat))\n{\nstd::stringstream  linestream(strFloat);\nfloatVec.push_back(std::vector<double>());\nwhile(linestream>>fNum)\n{\nfloatVec[counter].push_back(fNum);\nif (linestream.peek() == ',')\nlinestream.ignore();\n}\n++counter;\n}\nreturn floatVec;\n}\n\n//\n// class declaration\n//\n\n// If the analyzer does not use TFileService, please remove\n// the template argument to the base class so the class inherits\n// from  edm::one::EDAnalyzer<>\n// This will improve performance in multithreaded jobs.\n\n\n\nclass XGBoostExample : public edm::one::EDAnalyzer<>  {\npublic:\nexplicit XGBoostExample(const edm::ParameterSet&);\n~XGBoostExample();\n\nstatic void 
fillDescriptions(edm::ConfigurationDescriptions& descriptions);\n\n\nprivate:\nvirtual void beginJob() ;\nvirtual void analyze(const edm::Event&, const edm::EventSetup&) ;\nvirtual void endJob() ;\n\n// ----------member data ---------------------------\n\nstd::string test_data_path;\nstd::string model_path;\n\n\n\n\n};\n\n//\n// constants, enums and typedefs\n//\n\n//\n// static data member definitions\n//\n\n//\n// constructors and destructor\n//\nXGBoostExample::XGBoostExample(const edm::ParameterSet& config):\ntest_data_path(config.getParameter<std::string>(\"test_data_path\")),\nmodel_path(config.getParameter<std::string>(\"model_path\"))\n{\n\n}\n\n\nXGBoostExample::~XGBoostExample()\n{\n\n// do anything here that needs to be done at desctruction time\n// (e.g. close files, deallocate resources etc.)\n\n}\n\n\n//\n// member functions\n//\n\nvoid\nXGBoostExample::analyze(const edm::Event& iEvent, const edm::EventSetup& iSetup)\n{\n}\n\n\nvoid\nXGBoostExample::beginJob()\n{\nBoosterHandle booster_;\nXGBoosterCreate(NULL,0,&booster_);\nXGBoosterLoadModel(booster_,model_path.c_str());\nunsigned long numFeature = 0;\nvector<vector<double>> TestDataVector = readinCSV(test_data_path.c_str());\nfloat TestData[2000][8];\nfor(unsigned i=0; (i < 2000); i++)\n{ for(unsigned j=0; (j < 8); j++)\n{\nTestData[i][j] = TestDataVector[i][j];\n//  cout<<TestData[i][j]<<\"\\t\";\n} //cout<<endl;\n}\nDMatrixHandle data_;\nXGDMatrixCreateFromMat((float *)TestData,2000,8,-1,&data_);\nbst_ulong out_len=0;\nconst float *f;\nauto ret=XGBoosterPredict(booster_, data_,0, 0,0,&out_len,&f);\nfor (unsigned int i=0;i<out_len;i++)\nstd::cout <<  i << \"\\t\"<< f[i] << std::endl;\n}\n\nvoid\nXGBoostExample::endJob()\n{\n}\n\nvoid\nXGBoostExample::fillDescriptions(edm::ConfigurationDescriptions& descriptions) {\n//The following says we do not know what parameters are allowed so do no validation\n// Please change this to state exactly what you do use, even if it is no 
parameters\nedm::ParameterSetDescription desc;\ndesc.add<std::string>(\"test_data_path\");\ndesc.add<std::string>(\"model_path\");\ndescriptions.addWithDefaultLabel(desc);\n\n//Specify that only 'tracks' is allowed\n//To use, remove the default given above and uncomment below\n//ParameterSetDescription desc;\n//desc.addUntracked<edm::InputTag>(\"tracks\",\"ctfWithMaterialTracks\");\n//descriptions.addDefault(desc);\n}\n\n//define this as a plug-in\nDEFINE_FWK_MODULE(XGBoostExample);\n
<use name=\"FWCore/Framework\"/>\n<use name=\"FWCore/PluginManager\"/>\n<use name=\"FWCore/ParameterSet\"/>\n<use name=\"DataFormats/TrackReco\"/>\n<use name=\"xgboost\"/>\n<flags EDM_PLUGIN=\"1\"/>\n
# coding: utf-8\n\nimport os\n\nimport FWCore.ParameterSet.Config as cms\nfrom FWCore.ParameterSet.VarParsing import VarParsing\n\n# setup minimal options\n#options = VarParsing(\"python\")\n#options.setDefault(\"inputFiles\", \"root://xrootd-cms.infn.it//store/mc/RunIIFall17MiniAOD/DYJetsToLL_M-10to50_TuneCP5_13TeV-madgraphMLM-pythia8/MINIAODSIM/94X_mc2017_realistic_v10-v2/00000/9A439935-1FFF-E711-AE07-D4AE5269F5FF.root\")  # noqa\n#options.parseArguments()\n\n# define the process to run\nprocess = cms.Process(\"TEST\")\n\n# minimal configuration\nprocess.load(\"FWCore.MessageService.MessageLogger_cfi\")\nprocess.MessageLogger.cerr.FwkReport.reportEvery = 1\nprocess.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(10))\n#process.source = cms.Source(\"PoolSource\",\n#    fileNames=cms.untracked.vstring('file:/afs/cern.ch/cms/Tutorials/TWIKI_DATA/TTJets_8TeV_53X.root'))\nprocess.source = cms.Source(\"EmptySource\")\n#process.source = cms.Source(\"PoolSource\",\n#    fileNames=cms.untracked.vstring(options.inputFiles))\n# process options\nprocess.options = cms.untracked.PSet(\n    allowUnscheduled=cms.untracked.bool(True),\n    wantSummary=cms.untracked.bool(True),\n)\n\nprocess.XGBoostExample = cms.EDAnalyzer(\"XGBoostExample\")\n\n# setup MyPlugin by loading the auto-generated cfi (see MyPlugin.fillDescriptions)\n#process.load(\"XGB_Example.XGBoostExample.XGBoostExample_cfi\")\nprocess.XGBoostExample.model_path = cms.string(\"/Your/Path/data/highVer.model\")  \nprocess.XGBoostExample.test_data_path = cms.string(\"/Your/Path/data/Test_data.csv\")\n\n# define what to run in the path\nprocess.p = cms.Path(process.XGBoostExample)\n
"},{"location":"inference/xgboost.html#python-usage","title":"Python Usage","text":"

To use XGBoost's Python interface, use the snippet below in a CMSSW environment.

# importing necessary modules\nimport numpy as np\nimport pandas as pd\nfrom xgboost import XGBClassifier\nimport matplotlib.pyplot as plt\n\n\nxgb = XGBClassifier()\nxgb.load_model('ModelName.model')\n\n# After loading the model, usage is the same as discussed in the model preparation section.\n

"},{"location":"inference/xgboost.html#caveat","title":"Caveat","text":"

It is worth mentioning that both the behavior and the APIs can differ between XGBoost versions.

  1. When using c_api for C/C++ inference, for ver.<1, the API is XGB_DLL int XGBoosterPredict(BoosterHandle handle, DMatrixHandle dmat,int option_mask, unsigned int ntree_limit, bst_ulong * out_len,const float ** out_result), while for ver.>=1 the API gains an extra training argument and becomes XGB_DLL int XGBoosterPredict(BoosterHandle handle, DMatrixHandle dmat,int option_mask, unsigned int ntree_limit, int training, bst_ulong * out_len,const float ** out_result).

  2. Models produced with ver.>=1 cannot be used with ver.<1.

Another important issue for C/C++ users is that DMatrix only takes single-precision floats (float), not double-precision floats (double).
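
As a minimal sketch of the single-precision point, written in Python with only the standard library for brevity, the snippet below flattens a table of double-precision rows into the contiguous, row-major float layout that a call such as XGDMatrixCreateFromMat expects. The helper name to_float32_row_major is hypothetical and not part of XGBoost.

```python
from array import array


def to_float32_row_major(rows):
    """Flatten equal-length rows into a row-major single-precision buffer.

    Hypothetical helper for illustration only; it is not part of XGBoost.
    """
    n_cols = len(rows[0])
    if any(len(row) != n_cols for row in rows):
        raise ValueError("all rows must have the same number of features")
    # Typecode 'f' stores C floats (single precision); Python floats are doubles,
    # so each value is rounded to single precision here.
    flat = array("f", (value for row in rows for value in row))
    return flat, len(rows), n_cols


rows = [[0.1, 2.0, 3.5], [4.0, 5.25, 6.0]]
buffer, n_rows, n_cols = to_float32_row_major(rows)
print(buffer.itemsize, n_rows, n_cols)  # bytes per element, rows, columns
```

Element (i, j) of the original table ends up at index i * n_cols + j of the buffer, which is the same layout as the float TestData[2000][8] array in the C++ example.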

"},{"location":"inference/xgboost.html#appendix-tips-for-xgboost-users","title":"Appendix: Tips for XGBoost users","text":""},{"location":"inference/xgboost.html#importance-plot","title":"Importance Plot","text":"

XGBoost uses the F-score to describe feature importance quantitatively. XGBoost's Python API provides a convenient tool, plot_importance, to plot the feature importance once training has finished.

# Once the training is done, the plot_importance function can be used to plot the feature importance.\nfrom xgboost import plot_importance # Import the function\n\nplot_importance(xgb) # suppose the xgboost object is named \"xgb\"\nplt.savefig(\"importance_plot.pdf\") # plot_importance is based on matplotlib, so the plot can be saved using plt.savefig()\n
The importance plot is consistent with our expectation: in our toy model, the data points differ most in feature \"7\" (see toy model setup).
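
To make the F-score more concrete: with the default "weight" importance type, it is the number of times a feature is used to split the data across all trees. The sketch below counts splits in two toy trees. The nested-dict tree format and the feature names are made up for illustration and are not XGBoost's model format.

```python
from collections import Counter


def count_splits(node, counts):
    """Recursively count how often each feature is used in a split."""
    if "feature" not in node:  # leaf node: nothing to count
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)


# Two toy trees in a made-up nested-dict format (assumption, not XGBoost's own)
trees = [
    {"feature": "f7", "left": {"leaf": -0.4},
     "right": {"feature": "f2", "left": {"leaf": 0.1}, "right": {"leaf": 0.5}}},
    {"feature": "f7", "left": {"leaf": -0.2}, "right": {"leaf": 0.3}},
]

importance = Counter()
for tree in trees:
    count_splits(tree, importance)

for feature, n_splits in importance.most_common():
    print(feature, n_splits)
```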

"},{"location":"inference/xgboost.html#roc-curve-and-auc","title":"ROC Curve and AUC","text":"

The receiver operating characteristic (ROC) curve and the area under the curve (AUC) are key quantities to describe the model performance. For XGBoost, the ROC curve and AUC score can be easily obtained with the help of scikit-learn (sklearn) functions, which are also available in CMSSW.

from sklearn.metrics import roc_auc_score,roc_curve,auc\n# ROC and AUC should be obtained on test set\n# Suppose the ground truth is 'y_test', and the output score is named as 'y_score'\n\nfpr, tpr, _ = roc_curve(y_test, y_score)\nroc_auc = auc(fpr, tpr)\n\nplt.figure()\nlw = 2\nplt.plot(fpr, tpr, color='darkorange',\n         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)\nplt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')\nplt.xlim([0.0, 1.0])\nplt.ylim([0.0, 1.05])\nplt.xlabel('False Positive Rate')\nplt.ylabel('True Positive Rate')\nplt.title('Receiver operating characteristic example')\nplt.legend(loc=\"lower right\")\n# plt.show() # display the figure when not using jupyter display\nplt.savefig(\"roc.png\") # resulting plot is shown below\n
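
For readers who want to see what the ROC curve and the AUC represent, here is a minimal pure-Python sketch, assuming binary labels 0/1. The function names roc_points and trapezoid_auc are hypothetical. It illustrates the idea only; it is not scikit-learn's implementation, and sklearn should be preferred in practice.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over the scores."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    # Sort by decreasing score: lowering the threshold accepts one event at a time
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all events sharing this score are counted
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
               for i in range(1, len(fpr)))


y_test = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_test, y_score)
roc_auc = trapezoid_auc(fpr, tpr)
print(fpr, tpr, roc_auc)
```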

"},{"location":"inference/xgboost.html#reference-of-xgboost","title":"Reference of XGBoost","text":"
  1. XGBoost Wiki: https://en.wikipedia.org/wiki/XGBoost
  2. XGBoost Github Repo.: https://github.com/dmlc/xgboost
  3. XGBoost official API tutorial
  4. Latest, Python: https://xgboost.readthedocs.io/en/latest/python/index.html
  5. Latest, C/C++: https://xgboost.readthedocs.io/en/latest/tutorials/c_api_tutorial.html
  6. Older (0.80), Python: https://xgboost.readthedocs.io/en/release_0.80/python/index.html
  7. No tutorial exists for the older C/C++ API; source code: https://github.com/dmlc/xgboost/blob/release_0.80/src/c_api/c_api.cc
"},{"location":"innovation/hackathons.html","title":"CMS Machine Learning Hackathons","text":"

Welcome to the CMS ML Hackathons! Here we encourage the application of cutting-edge ML methods to particle physics problems through multi-day focused work. Form hackathon teams and work together with the ML Innovation group to get support with organization and announcements, hardware/software infrastructure, follow-up meetings and ML-related technical advice.

If you are interested in proposing a hackathon, please send an e-mail to the CMS ML Innovation conveners with a potential topic and we will get in touch!

Below follows a list of previous successful hackathons.

"},{"location":"innovation/hackathons.html#hgcal-ticl-reconstruction","title":"HGCAL TICL reconstruction","text":"

20 Jun 2022 - 24 Jun 2022 https://indico.cern.ch/e/ticlhack

Abstract: The HGCAL reconstruction relies on \u201cThe Iterative CLustering\u201d (TICL) framework. It follows an iterative approach, first clustering energy deposits in the same layer (layer clusters) and then connecting these layer clusters to reconstruct the particle shower by forming 3-D objects, the \u201ctracksters\u201d. There are multiple areas that could benefit from advanced ML techniques to further improve the reconstruction performance.

In this project we plan to tackle the following topics using ML:

  • trackster identification (i.e., identification of the type of particle initiating the shower) and energy regression
  • linking of tracksters stemming from the same particle to reconstruct the full shower and/or use a high-purity trackster as a seed and collect 2D (i.e., layer clusters) and/or 3D (i.e., tracksters) energy deposits in the vicinity of the seed trackster to fully reconstruct the particle shower
  • tuning of the existing pattern recognition algorithms
  • reconstruction under HL-LHC pile-up scenarios (eg., PU=150-200)
  • trackster characterization, ie. predict if a trackster is a sound object in itself or determine if it is more likely to be a composite one.
"},{"location":"innovation/hackathons.html#material","title":"Material:","text":"

A CodiMD document has been created with an overview of the topics and to keep track of the activities during the hackathon:

https://codimd.web.cern.ch/s/hMd74Yi7J

"},{"location":"innovation/hackathons.html#jet-tagging","title":"Jet tagging","text":"

8 Nov 2021 - 11 Nov 2021 https://indico.cern.ch/e/jethack

Abstract: The identification of the initial particle (quark, gluon, W/Z boson, etc.) responsible for the formation of the jet, also known as jet tagging, provides a powerful handle in both standard model (SM) measurements and searches for physics beyond the SM (BSM). In this project we propose the development of jet tagging algorithms both for small-radius (i.e., AK4) and large-radius (i.e., AK8) jets using as inputs the PF candidates.

Two main projects are covered:

  • Jet tagging for scouting
  • Jet tagging for Level-1
"},{"location":"innovation/hackathons.html#jet-tagging-for-scouting","title":"Jet tagging for scouting","text":"

Using as inputs the PF candidates and local pixel tracks reconstructed in the scouting streams, the main goals of this project are the following:

  • Develop a jet-tagging baseline for scouting and compare the performance with the offline reconstruction
  • Understand the importance of the different input variables and the impact of various configurations (e.g., on pixel track reconstruction) on the performance
  • Compare different jet tagging approaches with performance as well as inference time in mind
  • Proof of concept: ggF H->bb, ggF HH->4b, VBF HH->4b

"},{"location":"innovation/hackathons.html#jet-tagging-for-level-1","title":"Jet tagging for Level-1","text":"

Using as input the newly developed particle flow candidates of Seeded Cone jets in the Level1 Correlator trigger, the following tasks will be worked on:

  • Develop a quark, gluon, b, pileup jet classifier for Seeded Cone R=0.4 jets using a combination of tt, VBF(H) and Drell-Yan Level-1 samples
  • Develop tools to demonstrate the gain of such a jet tagging algorithm on a signal sample (like q vs g on VBF jets)
  • Study tagging performance as a function of the number of jet constituents
  • Study tagging performance for a \"real\" input vector (zero-padded, perhaps unsorted)
  • Optimise jet constituent list of SeededCone Jets (N constituents, zero-removal, sorting etc)
  • Develop q/g/W/Z/t/H classifier for Seeded Cone R=0.8 jets
"},{"location":"innovation/hackathons.html#gnn-4-tracking","title":"GNN-4-tracking","text":"

27 Sept 2021 - 1 Oct 2021

https://indico.cern.ch/e/gnn4tracks

Abstract: The aim of this hackathon is to integrate graph neural nets (GNNs) for particle tracking into CMSSW.

The hackathon will make use of a GNN model reported in the paper Charged particle tracking via edge-classifying interaction networks by Gage DeZoort, Savannah Thais, et al. They used a GNN to predict connections between detector pixel hits, and achieved accurate track building. They did this with the TrackML dataset, which uses a generic detector designed to be similar to CMS or ATLAS. Work is ongoing to apply this GNN approach to CMS data.

Tasks: The hackathon aims to create a workflow that allows graph building and GNN inference within the framework of CMSSW. This would enable accurate testing of future GNN models and comparison to existing CMSSW track building methods. The hackathon will be divided into the following subtasks:

  • Task 1: Create a package for extracting graph features and building graphs in CMSSW.
  • Task 2: GNN inference on Sonic servers
  • Task 3: Track fitting after GNN track building
  • Task 4: Performance evaluation for the new track collection
"},{"location":"innovation/hackathons.html#material_1","title":"Material:","text":"

Code is provided at this GitHub organisation. Projects are listed here.

"},{"location":"innovation/hackathons.html#anomaly-detection","title":"Anomaly detection","text":"

In this four-day Machine Learning Hackathon, we will develop new anomaly detection algorithms for New Physics detection, intended for deployment in the two main stages of the CMS data acquisition system: the Level-1 Trigger and the High Level Trigger.

There are two main projects:

"},{"location":"innovation/hackathons.html#event-based-anomaly-detection-algorithms-for-the-level-1-trigger","title":"Event-based anomaly detection algorithms for the Level-1 Trigger","text":""},{"location":"innovation/hackathons.html#jet-based-anomaly-detection-algorithms-for-the-high-level-trigger-specifically-targeting-run-3-scouting","title":"Jet-based anomaly detection algorithms for the High Level Trigger, specifically targeting Run 3 scouting","text":""},{"location":"innovation/hackathons.html#material_2","title":"Material:","text":"

A list of projects can be found in this document. Instructions for fetching the data and example code for the two projects can be found at Level-1 Anomaly Detection.

"},{"location":"innovation/journal_club.html","title":"CMS Machine Learning Journal Club","text":"

Welcome to the CMS Machine Learning Journal Club (JC)! Here we read and discuss new cutting-edge ML papers, with an emphasis on how these can be used within the collaboration. Below you can find a summary of each JC as well as some code examples demonstrating how to use the tools or methods introduced.

To vote for or to propose new papers for discussion, go to https://cms-ml-journalclub.web.cern.ch/.

Below follows a complete list of all the previous CMS ML Journal Clubs, together with relevant documentation and code examples.

"},{"location":"innovation/journal_club.html#dealing-with-nuisance-parameters-using-machine-learning-in-high-energy-physics-a-review","title":"Dealing with Nuisance Parameters using Machine Learning in High Energy Physics: a Review","text":"

Tommaso Dorigo, Pablo de Castro

Abstract: In this work we discuss the impact of nuisance parameters on the effectiveness of machine learning in high-energy physics problems, and provide a review of techniques that allow to include their effect and reduce their impact in the search for optimal selection criteria and variable transformations. The introduction of nuisance parameters complicates the supervised learning task and its correspondence with the data analysis goal, due to their contribution degrading the model performances in real data, and the necessary addition of uncertainties in the resulting statistical inference. The approaches discussed include nuisance-parameterized models, modified or adversary losses, semi-supervised learning approaches, and inference-aware techniques.

  • Indico
  • Paper
"},{"location":"innovation/journal_club.html#mapping-machine-learned-physics-into-a-human-readable-space","title":"Mapping Machine-Learned Physics into a Human-Readable Space","text":"

Taylor Faucett, Jesse Thaler, Daniel Whiteson

Abstract: We present a technique for translating a black-box machine-learned classifier operating on a high-dimensional input space into a small set of human-interpretable observables that can be combined to make the same classification decisions. We iteratively select these observables from a large space of high-level discriminants by finding those with the highest decision similarity relative to the black box, quantified via a metric we introduce that evaluates the relative ordering of pairs of inputs. Successive iterations focus only on the subset of input pairs that are misordered by the current set of observables. This method enables simplification of the machine-learning strategy, interpretation of the results in terms of well-understood physical concepts, validation of the physical model, and the potential for new insights into the nature of the problem itself. As a demonstration, we apply our approach to the benchmark task of jet classification in collider physics, where a convolutional neural network acting on calorimeter jet images outperforms a set of six well-known jet substructure observables. Our method maps the convolutional neural network into a set of observables called energy flow polynomials, and it closes the performance gap by identifying a class of observables with an interesting physical interpretation that has been previously overlooked in the jet substructure literature.

  • Indico
  • Paper

"},{"location":"innovation/journal_club.html#model-interpretability-2-papers","title":"Model Interpretability (2 papers):","text":"
  • Indico
"},{"location":"innovation/journal_club.html#identifying-the-relevant-dependencies-of-the-neural-network-response-on-characteristics-of-the-input-space","title":"Identifying the relevant dependencies of the neural network response on characteristics of the input space","text":"

Stefan Wunsch, Raphael Friese, Roger Wolf, G\u00fcnter Quast

Abstract: The relation between the input and output spaces of neural networks (NNs) is investigated to identify those characteristics of the input space that have a large influence on the output for a given task. For this purpose, the NN function is decomposed into a Taylor expansion in each element of the input space. The Taylor coefficients contain information about the sensitivity of the NN response to the inputs. A metric is introduced that allows for the identification of the characteristics that mostly determine the performance of the NN in solving a given task. Finally, the capability of this metric to analyze the performance of the NN is evaluated based on a task common to data analyses in high-energy particle physics experiments.

  • Paper
"},{"location":"innovation/journal_club.html#innvestigate-neural-networks","title":"iNNvestigate neural networks!","text":"

Maximilian Alber, Sebastian Lapuschkin, Philipp Seegerer, Miriam H\u00e4gele, Kristof T. Sch\u00fctt, Gr\u00e9goire Montavon, Wojciech Samek, Klaus-Robert M\u00fcller, Sven D\u00e4hne, Pieter-Jan Kindermans

In recent years, deep neural networks have revolutionized many application domains of machine learning and are key components of many critical decision or predictive processes. Therefore, it is crucial that domain specialists can understand and analyze actions and predictions, even of the most complex neural network architectures. Despite these arguments neural networks are often treated as black boxes. In the attempt to alleviate this shortcoming many analysis methods were proposed, yet the lack of reference implementations often makes a systematic comparison between the methods a major effort. The presented library iNNvestigate addresses this by providing a common interface and out-of-the-box implementation for many analysis methods, including the reference implementation for PatternNet and PatternAttribution as well as for LRP-methods. To demonstrate the versatility of iNNvestigate, we provide an analysis of image classifications for variety of state-of-the-art neural network architectures.

  • Paper
  • Code
"},{"location":"innovation/journal_club.html#simulation-based-inference-in-particle-physics-and-beyond-and-beyond","title":"Simulation-based inference in particle physics and beyond (and beyond)","text":"

Johann Brehmer, Kyle Cranmer

Abstract: Our predictions for particle physics processes are realized in a chain of complex simulators. They allow us to generate high-fidelity simulated data, but they are not well-suited for inference on the theory parameters with observed data. We explain why the likelihood function of high-dimensional LHC data cannot be explicitly evaluated, why this matters for data analysis, and reframe what the field has traditionally done to circumvent this problem. We then review new simulation-based inference methods that let us directly analyze high-dimensional data by combining machine learning techniques and information from the simulator. Initial studies indicate that these techniques have the potential to substantially improve the precision of LHC measurements. Finally, we discuss probabilistic programming, an emerging paradigm that lets us extend inference to the latent process of the simulator.

  • Indico
  • Paper
  • Code
"},{"location":"innovation/journal_club.html#efficiency-parameterization-with-neural-networks","title":"Efficiency Parameterization with Neural Networks","text":"

C. Badiali, F.A. Di Bello, G. Frattari, E. Gross, V. Ippolito, M. Kado, J. Shlomi

Abstract: Multidimensional efficiency maps are commonly used in high energy physics experiments to mitigate the limitations in the generation of large samples of simulated events. Binned multidimensional efficiency maps are however strongly limited by statistics. We propose a neural network approach to learn ratios of local densities to estimate in an optimal fashion efficiencies as a function of a set of parameters. Graph neural network techniques are used to account for the high dimensional correlations between different physics objects in the event. We show in a specific toy model how this method is applicable to produce accurate multidimensional efficiency maps for heavy flavor tagging classifiers in HEP experiments, including for processes on which it was not trained.

  • Indico
  • Paper
  • Code

"},{"location":"innovation/journal_club.html#a-general-framework-for-uncertainty-estimation-in-deep-learning","title":"A General Framework for Uncertainty Estimation in Deep Learning","text":"

Antonio Loquercio, Mattia Seg\u00f9, Davide Scaramuzza

Neural networks predictions are unreliable when the input sample is out of the training distribution or corrupted by noise. Being able to detect such failures automatically is fundamental to integrate deep learning algorithms into robotics. Current approaches for uncertainty estimation of neural networks require changes to the network and optimization process, typically ignore prior knowledge about the data, and tend to make over-simplifying assumptions which underestimate uncertainty. To address these limitations, we propose a novel framework for uncertainty estimation. Based on Bayesian belief networks and Monte-Carlo sampling, our framework not only fully models the different sources of prediction uncertainty, but also incorporates prior data information, e.g. sensor noise. We show theoretically that this gives us the ability to capture uncertainty better than existing methods. In addition, our framework has several desirable properties: (i) it is agnostic to the network architecture and task; (ii) it does not require changes in the optimization process; (iii) it can be applied to already trained architectures. We thoroughly validate the proposed framework through extensive experiments on both computer vision and control tasks, where we outperform previous methods by up to 23% in accuracy.

  • Indico
  • Paper
  • Code
"},{"location":"optimization/data_augmentation.html","title":"Data augmentation","text":""},{"location":"optimization/data_augmentation.html#introduction","title":"Introduction","text":"

This introduction is based on papers by Shorten & Khoshgoftaar, 2019 and Rebuffi et al., 2021, among others.

With the increasing complexity and sizes of neural networks one needs huge amounts of data in order to train a state-of-the-art model. However, generating this data is often very resource and time intensive. Thus, one might either augment the existing data with more descriptive variables or combat the data scarcity problem by artificially increasing the size of the dataset by adding new instances without the resource-heavy generation process. Both processes are known in machine learning (ML) applications as data augmentation (DA) methods.

The first type of these methods is more widely known as feature generation or feature engineering and is done on instance level. Feature engineering focuses on crafting informative input features for the algorithm, often inspired or derived from first principles specific to the algorithm's application domain.

The second type of method is done on the dataset level. These types of techniques can generally be divided into two main categories: real data augmentation (RDA) and synthetic data augmentation (SDA). As the name suggests, RDA makes minor changes to the already existing data in order to generate new samples, whereas SDA generates new data from scratch. Examples of RDA include rotating (especially useful if we expect the event to be rotationally symmetric) and zooming, among a plethora of other methods detailed in this overview article. Examples of SDA include traditional sampling methods and more complex generative models like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). Going further, the generative methods used for synthetic data augmentation could also be used in fast simulation, which is a notable bottleneck in the overall physics analysis workflow.
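
The difference between the two categories can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: events are lists of (px, py, pz) tuples, the RDA step is a rotation around the beam (z) axis, and the SDA step is a simple Gaussian sampler standing in for a real generative model. The function names rotate_event and sample_synthetic are hypothetical.

```python
import math
import random


def rotate_event(particles, angle):
    """RDA: rotate every (px, py, pz) around the z-axis by the same angle."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * px - s * py, s * px + c * py, pz) for px, py, pz in particles]


def sample_synthetic(values, n_new, rng):
    """SDA: draw new values from a Gaussian fitted to the existing ones."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return [rng.gauss(mean, math.sqrt(var)) for _ in range(n_new)]


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# RDA: a new event obtained by modifying an existing one
event = [(30.0, 0.0, 10.0), (-12.0, 5.0, -3.0)]
rotated = rotate_event(event, rng.uniform(0.0, 2.0 * math.pi))

# SDA: new values generated from a model of the data, not from a single event
synthetic = sample_synthetic([1.2, 0.8, 1.1, 0.9, 1.0], n_new=3, rng=rng)
print(rotated, synthetic)
```

The rotation leaves the transverse momentum and pz of each particle unchanged, which is why it is only a valid augmentation when the underlying physics is symmetric under that rotation.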

Dataset augmentation may lead to more successful algorithm outcomes. For example, introducing noise into data to form additional data points improves the learning ability of several models which otherwise performed relatively poorly, as shown by Freer & Yang, 2020. This finding implies that this form of DA creates variations that the model may see in the real world. If done right, preprocessing the data with DA will result in superior training outcomes. This improvement in performance is due to the fact that DA methods act as a regularizer, reducing overfitting during training. In addition to simulating real-world variations, DA methods can also even out categorical data with imbalanced classes.
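
As a minimal sketch of the noise-injection idea mentioned above, the snippet below appends noisy copies of each sample to a dataset while keeping the labels unchanged. The function name augment_with_noise and the values of n_copies and sigma are assumptions for illustration; in practice the noise scale has to be chosen per feature so that the augmented samples remain physically plausible.

```python
import random


def augment_with_noise(features, labels, n_copies, sigma, rng):
    """Append n_copies Gaussian-smeared versions of each sample; labels are kept."""
    aug_x, aug_y = list(features), list(labels)
    for _ in range(n_copies):
        for x, y in zip(features, labels):
            aug_x.append([v + rng.gauss(0.0, sigma) for v in x])
            aug_y.append(y)
    return aug_x, aug_y


rng = random.Random(0)  # fixed seed so the sketch is reproducible
features = [[1.0, 2.0], [3.0, 4.0]]
labels = [0, 1]
aug_features, aug_labels = augment_with_noise(
    features, labels, n_copies=2, sigma=0.05, rng=rng
)
print(len(aug_features), aug_labels)
```

Applying the same function only to the minority class is one simple way to even out imbalanced classes.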

Fig. 1: Generic pipeline of a heuristic DA (figure taken from Li, 2020)

Before diving more in depth into the various DA methods and applications in HEP, here is a list of the most notable benefits of using DA methods in your ML workflow:

  • Improvement of model prediction precision
  • More training data for the model
  • Preventing data scarcity for state-of-the-art models
  • Reduction of overfitting and creation of data variability
  • Increased model generalization properties
  • Help in resolving class imbalance problems in datasets
  • Reduced cost of data collection and labeling
  • Enabling rare event prediction

And some words of caution:

  • There is no 'one size fits all' in DA. Each dataset and use case should be considered separately.
  • Don't trust the augmented data blindly.
  • Make sure that the augmented data is representative of the problem at hand, otherwise it will negatively affect the model performance.
  • There must be no unnecessary duplication of existing data; only by adding unique information do we gain more insights.
  • Ensure the validity of the augmented data before using it in ML models.
  • If a real dataset contains biases, data augmented from it will contain biases, too. Identifying an optimal data augmentation strategy is therefore important, so double check your DA strategy.
"},{"location":"optimization/data_augmentation.html#feature-engineering","title":"Feature Engineering","text":"

This part is based mostly on Erdmann et al., 2018

Feature engineering (FE) is one of the key components of a machine learning workflow. This process transforms and augments training data with additional features in order to make the training more effective.

With multivariate analyses (MVAs), such as boosted decision trees (BDTs) and neural networks, one could start with raw, \"low-level\" features, like four-momenta, and the algorithm can learn higher level patterns, correlations, metrics, etc. However, using \"high-level\" variables, in many cases, leads to outcomes superior to the use of low-level variables. As such, features used in MVAs are handcrafted from physics first principles.

Still, it is shown that a deep neural network (DNN) can perform better if it is trained with both specifically constructed variables and low-level variables. This observation suggests that the network extracts additional information from the training data.

"},{"location":"optimization/data_augmentation.html#hep-application-lorentz-boosted-network","title":"HEP Application - Lorentz Boosted Network","text":"

For the purposes of FE in HEP, a novel ML architecture called a Lorentz Boost Network (LBN) (see Fig. 2) was proposed and implemented by Erdmann et al., 2018. It is a multipurpose method that uses Lorentz transformations to exploit and uncover structures in particle collision events. LBN is the first stage of a two-stage neural network (NN) model that enables a fully autonomous and comprehensive characterization of collision events by exploiting exclusively the four-momenta of the final-state particles.

Within LBN, particles are combined to create rest-frame representations, which enables the formation of further composite particles. These combinations are realized via linear combinations of N input four-vectors to a number of M particles and rest frames. These composite particles are then transformed into said rest frames by Lorentz transformations in an efficient and fully vectorized implementation.

The properties of the composite, transformed particles are compiled in the form of characteristic variables like masses, angles, etc. that serve as input for a subsequent network - the second stage, which has to be configured for a specific analysis task, like classification.
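To make the combine-then-boost idea concrete, here is a minimal NumPy sketch of a single rest-frame transformation. It does not use the lbn package and is not the authors' implementation: the function names are hypothetical, the combination weights are fixed by hand (in the LBN they are trainable), and four-vectors are assumed to be stored as (E, px, py, pz).

```python
import numpy as np

def mass(p):
    # Invariant mass of a four-vector stored as (E, px, py, pz).
    m2 = p[0] ** 2 - np.sum(p[1:] ** 2)
    return np.sqrt(max(m2, 0.0))

def boost_to_rest_frame(p, frame):
    # Lorentz-boost four-vector p into the rest frame of four-vector frame.
    beta = frame[1:] / frame[0]
    beta2 = beta @ beta
    if beta2 == 0.0:
        return p.copy()
    gamma = 1.0 / np.sqrt(1.0 - beta2)
    bp = beta @ p[1:]
    energy = gamma * (p[0] - bp)
    momentum = p[1:] + ((gamma - 1.0) * bp / beta2 - gamma * p[0]) * beta
    return np.concatenate(([energy], momentum))

# Two input particles (N = 2) and one hand-picked combination (M = 1).
particles = np.array([[50.0, 10.0, 20.0, 40.0],
                      [60.0, -15.0, 5.0, 55.0]])
weights = np.array([1.0, 1.0])  # assumed fixed weights, for illustration only
composite = weights @ particles

boosted = [boost_to_rest_frame(p, composite) for p in particles]
print(mass(composite), boosted[0], boosted[1])
```

In the rest frame of the composite, the two constituents are back to back and their energies sum to the composite mass, which is the kind of characteristic variable passed on to the second-stage network.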

The authors observed leading performance with the LBN and demonstrated that LBN forms physically meaningful particle combinations and generates suitable characteristic variables.

The usual ML workflow, employing LBN, is as follows:

Step-1: LBN(M, F)\n\n    1.0: Input hyperparameters: number of combinations M; number of features F\n    1.0: Choose: number of incoming particles, N, according to the research\n         question\n\n    1.1: Combination of input four-vectors to particles and rest frames\n\n    1.2: Lorentz transformations\n\n    1.3 Extraction of suitable high-level objects\n\n\nStep-2: NN\n\n    2.X: Train some form of a NN using an objective function that depends on\n         the analysis / research question.\n
Fig. 2: The Lorentz Boost Network architecture (figure taken from Erdmann et al., 2018)

The LBN package is also pip-installable:

pip install lbn\n
"},{"location":"optimization/data_augmentation.html#rda-techniques","title":"RDA Techniques","text":"

This section and the following subsection are based on the papers by Freer & Yang, 2020, Dolan & Ore, 2021, Barnard et al., 2016, and Bradshaw et al., 2019

RDA methods augment the existing dataset by performing some transformation on the existing data points. These transformations could include rotation, flipping, color shift (for an image), Fourier transforming (for signal processing) or some other transformation that preserves the validity of the data point and its corresponding label. As mentioned in Freer & Yang, 2020, these types of transformations augment the dataset to capture potential variations that the population of data may exhibit, allowing the network to capture a more generalized view of the sampled data.
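As a small illustration of label-preserving transformations, the sketch below generates flipped and rotated copies of a 2D image with NumPy. The helper name `flip_and_rotate` is hypothetical, and whether these symmetries actually preserve the label depends on the problem at hand.

```python
import numpy as np

def flip_and_rotate(image):
    # Return the original image plus simple symmetry-transformed copies.
    return [
        image,
        np.fliplr(image),      # mirror left-right
        np.flipud(image),      # mirror up-down
        np.rot90(image, k=1),  # rotate by 90 degrees
        np.rot90(image, k=2),  # rotate by 180 degrees
        np.rot90(image, k=3),  # rotate by 270 degrees
    ]

image = np.arange(16.0).reshape(4, 4)
augmented = flip_and_rotate(image)
print(len(augmented))
```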

"},{"location":"optimization/data_augmentation.html#hep-application-zooming","title":"HEP Application - Zooming","text":"

In Barnard et al., 2016, the authors investigate the effect of parton shower modelling in DNN jet taggers using images of hadronically decaying W bosons. They introduce a method known as zooming to study the scale invariance of these networks. This is the RDA strategy used by Dolan & Ore, 2021. Zooming is similar to a normalization procedure in that it standardizes features in signal data, but it aims not to create similar features in background.

After some standard data processing steps, including jet trimming and clustering via the \\(k_t\\) algorithm, and some further processing to remove spatial symmetries, the resulting jet image depicts the leading subjet and subleading subjet directly below. Barnard et al., 2016 note that the separation between the leading and subleading subjets varies linearly as \\(2m/p_T\\), where \\(m\\) and \\(p_T\\) are the mass and transverse momentum of the jet. Standardizing this separation, or removing the linear dependence, would allow the DNN tagger to generalize to a wide range of jet \\(p_T\\). To this end, the authors construct a factor, \\(R/\\Delta R_{act}\\), where \\(R\\) is some fixed value and \\(\\Delta R_{act}\\) is the separation between the leading and subleading subjets. To discriminate between signal and background images with this factor, the authors enlarge the jet images by a scaling factor of \\(\\text{max}(R/s,1)\\) where \\(s = 2m_W/p_T\\) and \\(R\\) is the original jet clustering size. This process of jet image enlargement by a linear mass and \\(p_T\\) dependent factor to account for the distance between the leading and subleading jet is known as zooming. This process can be thought of as an RDA technique to augment the data in a domain-specific way.
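The scaling factor itself is simple to compute. The sketch below evaluates max(R/s, 1) with s = 2 m_W / p_T for a few jet transverse momenta; the W mass of 80.4 GeV and the clustering size R = 1.0 are assumed illustrative values, and the function name is hypothetical.

```python
import numpy as np

def zoom_factor(pt, R, m_w=80.4):
    # pt and m_w in GeV; s is the expected subjet separation 2 * m_W / pT.
    s = 2.0 * m_w / np.asarray(pt, dtype=float)
    # Images are only ever enlarged, never shrunk.
    return np.maximum(R / s, 1.0)

pts = np.array([100.0, 200.0, 400.0, 800.0])
print(zoom_factor(pts, R=1.0))
```

Jets with low transverse momentum are left unchanged (factor 1), while highly boosted jets, whose subjets are close together, are enlarged the most.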

An advantage of using the zooming technique is that it makes the construction of scale-invariant taggers easier. Scale-invariant searches which are able to interpolate between the boosted and resolved parts of phase space have the advantage of being applicable over a broad range of masses and kinematics, allowing a single search or analysis to be effective where previously more than one may have been necessary.

As predicted, the zoomed network outperforms the unzoomed one, particularly at low signal efficiency, where the background rejection rises by around 20%. Zooming has the greatest effect at high \\(p_T\\).

"},{"location":"optimization/data_augmentation.html#traditional-sda-techniques","title":"Traditional SDA Techniques","text":"

Text in part based on He et al., 2010

Generally speaking, imbalanced learning occurs whenever some type of data distribution dominates the instance space compared to other data distributions. Methods for handling imbalanced learning problems can be divided into the following five major categories:

  • Sampling strategies
  • Synthetic data generation (SMOTE & ADASYN & DataBoost-IM) - aims to overcome the imbalance by artificially generating data samples.
  • Cost-sensitive learning - uses a cost matrix for different types of errors or instances to facilitate learning from imbalanced data sets. This means that cost-sensitive learning does not modify the imbalanced data distribution directly, but targets this problem by using different cost matrices that describe the cost of misclassifying any particular data sample.
  • Active learning - conventionally used to solve problems related to unlabeled data, though recently it has been used in learning imbalanced data sets. Instead of searching the entire training space, this method effectively selects informative instances from a random set of training populations, therefore significantly reducing the computational cost when dealing with large imbalanced data sets.
  • Kernel-based methods - by integrating the regularized orthogonal weighted least squares (ROWLS) estimator, a kernel classifier construction algorithm is based on orthogonal forward selection (OFS) to optimize the model generalization for learning from two-class imbalanced data sets.
"},{"location":"optimization/data_augmentation.html#sampling","title":"Sampling","text":"

When the percentage of the minority class is less than 5%, it can be considered a rare event. When a dataset is imbalanced or when a rare event occurs, it will be difficult to get a meaningful and good predictive model due to lack of information about the rare event (Au et al., 2010). In these cases, re-sampling techniques can be helpful. The re-sampling techniques are implemented in four different categories: undersampling the majority class, oversampling the minority class, combining over- and undersampling, and ensembling sampling. Oversampling and undersampling are found to work well in improving the classification for the imbalanced dataset (Yap et al., 2013).

Stratified sampling (STS) This technique is used in cases where the data can be partitioned into strata (subpopulations), where the strata should be collectively exhaustive and mutually exclusive. The process of dividing the data into homogeneous subgroups before sampling is referred to as stratification. The two common strategies of STS are proportionate allocation (PA) and optimum (disproportionate) allocation (OA). The former uses a sampling fraction in each of the strata that is proportional to that of the total population. The latter also uses the standard deviation of the distribution of the variable, so that larger samples are taken from the strata with the greatest variability in order to obtain the least possible sampling variance. The advantages of using STS include a smaller error in estimation (if measurements within strata have a lower standard deviation) and similar uncertainties across all strata in case there is high variability in a given stratum.

NOTE: STS is only useful if the population can be exhaustively partitioned into subgroups. Also, unknown class priors (the ratio of strata to the whole population) might have deleterious effects on the classification performance.
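The two allocation strategies can be compared with a short NumPy sketch. The optimum allocation below uses the standard Neyman form, in which the sample size of a stratum is proportional to its size times its standard deviation; the function names and the example numbers are hypothetical.

```python
import numpy as np

def proportionate_allocation(stratum_sizes, n_total):
    # Sample each stratum in proportion to its share of the population.
    sizes = np.asarray(stratum_sizes, dtype=float)
    return n_total * sizes / sizes.sum()

def optimum_allocation(stratum_sizes, stratum_stds, n_total):
    # Neyman allocation: weight each stratum by size times standard deviation.
    sizes = np.asarray(stratum_sizes, dtype=float)
    stds = np.asarray(stratum_stds, dtype=float)
    weights = sizes * stds
    return n_total * weights / weights.sum()

sizes = [5000, 3000, 2000]
stds = [1.0, 2.0, 4.0]
pa = proportionate_allocation(sizes, n_total=100)
oa = optimum_allocation(sizes, stds, n_total=100)
print(pa, oa)
```

The returned values are fractional sample sizes; rounding them to integers is left to the user. With these numbers, the smallest but most variable stratum receives the largest sample under optimum allocation.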

Over- and undersampling Oversampling randomly duplicates minority class samples, while undersampling discards majority class samples in order to modify the class distribution. While oversampling might lead to overfitting, since it makes exact copies of the minority samples, undersampling may discard potentially useful majority samples.
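A minimal NumPy sketch of both strategies for a binary problem is shown below. The function names are hypothetical, and the target is assumed to be a fully balanced dataset, which is only one of several possible choices.

```python
import numpy as np

def random_oversample(X, y, minority_label, seed=0):
    # Duplicate randomly chosen minority samples until the classes are balanced.
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

def random_undersample(X, y, minority_label, seed=0):
    # Keep a random subset of the majority class of the same size as the minority.
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    kept = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([kept, minority])
    return X[idx], y[idx]

X = np.arange(200.0).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)
X_over, y_over = random_oversample(X, y, minority_label=1)
X_under, y_under = random_undersample(X, y, minority_label=1)
print(np.bincount(y_over), np.bincount(y_under))
```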

Oversampling and undersampling are essentially opposite and roughly equivalent techniques. There are also more complex oversampling techniques, including the creation of artificial data points with algorithms like Synthetic Minority Over-sampling TEchnique (SMOTE).

It has been shown that the combination of SMOTE and undersampling performs better than only undersampling the majority class. However, over- and undersampling remain popular, as each is much easier to implement alone than as part of a more complex hybrid approach.

Synthetic Minority Over-sampling Technique (SMOTE) Text mostly based on Chawla et al., 2002 and in part on He et al., 2010

In case of Synthetic Minority Over-sampling Technique (SMOTE), the minority class is oversampled by creating synthetic examples along the line segments joining any or all of the \\(k\\)-nearest neighbours in the minority class. The synthetic examples cause the classifier to create larger and less specific decision regions, rather than smaller and more specific regions. More general regions are now learned for the minority class samples rather than those being subsumed by the majority class samples around them. In this way SMOTE shifts the classifier learning bias toward the minority class and thus has the effect of allowing the model to generalize better.

There also exist extensions of this work like SMOTE-Boost, in which the synthetic procedure was integrated with adaptive boosting techniques to change the method of updating weights to better compensate for skewed distributions.

In general, SMOTE proceeds as follows:

SMOTE(N, X, k)\nInput: N - Number of synthetic samples to be generated\n       X - Underrepresented data\n       k - Hyperparameter of number of nearest neighbours to be chosen\n\nCreate an empty list SYNTHETIC_SAMPLES\nWhile N_SYNTHETIC_SAMPLES < N\n    1. Randomly choose an entry xRand from X\n    2. Find the k nearest neighbours of xRand in X\n    3. Randomly choose an entry xNeighbour from the k nearest neighbours\n    4. Take the difference dx = xNeighbour - xRand\n    5. Multiply dx by a random number between 0 and 1\n    6. Append xRand + (scaled dx) to SYNTHETIC_SAMPLES\nExtend X by SYNTHETIC_SAMPLES\n
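A minimal NumPy sketch of the interpolation at the heart of SMOTE is given below: each synthetic point is placed at a random position on the line segment between a minority sample and one of its k nearest minority neighbours. This is an illustration rather than a reference implementation; the function name is hypothetical, a brute-force distance matrix is used, and k is assumed to be smaller than the number of minority samples.

```python
import numpy as np

def smote(X_min, n_samples, k=5, seed=0):
    # X_min: minority class samples with shape (n_minority, n_features).
    rng = np.random.default_rng(seed)
    # Brute-force pairwise distances; a sample is not its own neighbour.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Pick a random base sample and one of its k neighbours for each new point.
    base = rng.integers(0, len(X_min), size=n_samples)
    chosen = neighbours[base, rng.integers(0, k, size=n_samples)]
    # Interpolate: base + gap * (neighbour - base), with gap in [0, 1).
    gap = rng.random((n_samples, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

rng = np.random.default_rng(1)
X_min = rng.normal(size=(20, 2))
synthetic = smote(X_min, n_samples=50, k=5)
print(synthetic.shape)
```

Because every synthetic point is a convex combination of two existing minority samples, no point can fall outside the bounding box of the original minority data.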

Adaptive synthetic sampling approach (ADASYN) Text mostly based on He et al., 2010

Adaptive synthetic sampling approach (ADASYN) is a sampling approach for learning from imbalanced datasets. The main idea is to use a weighted distribution for different minority class examples according to their level of difficulty in learning, where more synthetic data is generated for minority class examples that are harder to learn compared to those minority examples that are easier to learn. Thus, ADASYN improves learning with respect to the data distributions by reducing the bias introduced by the class imbalance and by adaptively shifting the classification boundary toward the difficult examples.

The objectives of ADASYN are reducing bias and learning adaptively. The key idea of this algorithm is to use a density distribution as a criterion to decide the number of synthetic samples that need to be generated for each minority data example. Physically, this density distribution is a distribution of weights for different minority class examples according to their level of difficulty in learning. The resulting dataset after using ADASYN will not only provide a balanced representation of the data distribution (according to the desired balance level defined in the configuration), but it also forces the learning algorithm to focus on those difficult-to-learn examples. It has been shown (He et al., 2010) that this algorithm improves accuracy for both minority and majority classes and does not sacrifice one class in preference for another.

ADASYN is not limited to only two-class learning, but can also be generalized to multiple-class imbalanced learning problems as well as incremental learning applications.

For more details and comparisons of ADASYN to other algorithms, please see He et al., 2010.
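The density distribution can be sketched as follows: for each minority example, the fraction of majority-class points among its k nearest neighbours is computed, normalised over all minority examples, and multiplied by the total number of synthetic samples to generate. This is a simplified NumPy illustration of that weighting step only; the function name is hypothetical, the problem is assumed to be binary, and each minority sample is assumed to be its own nearest point (no exact duplicates).

```python
import numpy as np

def adasyn_counts(X, y, minority_label, k=5, beta=1.0):
    # Number of synthetic samples to generate per minority example.
    X_min = X[y == minority_label]
    n_min = len(X_min)
    n_maj = len(X) - n_min
    n_total = (n_maj - n_min) * beta  # beta = 1 requests a fully balanced set
    # Distances from every minority sample to every sample in the dataset.
    dist = np.linalg.norm(X_min[:, None, :] - X[None, :, :], axis=-1)
    # Skip column 0 of the sorted neighbours: it is the sample itself.
    nn = np.argsort(dist, axis=1)[:, 1:k + 1]
    # Fraction of majority-class neighbours = difficulty of the example.
    r = (y[nn] != minority_label).mean(axis=1)
    if r.sum() == 0.0:
        return np.zeros(n_min, dtype=int)
    return np.rint(r / r.sum() * n_total).astype(int)

rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(100, 2))
X_hard = rng.normal(0.0, 1.0, size=(5, 2))   # minority points inside the majority
X_easy = rng.normal(10.0, 0.1, size=(5, 2))  # isolated minority cluster
X = np.vstack([X_maj, X_hard, X_easy])
y = np.array([0] * 100 + [1] * 10)
counts = adasyn_counts(X, y, minority_label=1, k=3)
print(counts)
```

The synthetic points themselves would then be generated by interpolation, as in SMOTE, with each minority example used as often as its count indicates.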

"},{"location":"optimization/data_augmentation.html#existing-implementations","title":"Existing implementations","text":"

Imbalanced-learn is an open-source Python library which provides a suite of algorithms for treating the class imbalance problem.

For augmenting image data, one can use one of the following:

  • Albumentations
  • ImgAug
  • Autoaugment
  • Augmentor
  • DeepAugment

But it is also possible to use tools directly implemented by tensorflow, keras etc. For example:

flipped_image = tf.image.flip_left_right(image)\n
"},{"location":"optimization/data_augmentation.html#deep-learning-based-sda-techniques","title":"Deep Learning-based SDA Techniques","text":"

In data science, data augmentation techniques are used to increase the amount of data by either synthetically creating data from already existing samples via a GAN or modifying the data at hand with small noise or rotation. (Rebuffi et al., 2021)

More recently, data augmentation studies have begun to focus on the field of deep learning (DL), more specifically on the ability of generative models, like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), to create artificial data. This synthetic data is then introduced during the classification model training process to improve performance and results.

Generative Adversarial Networks (GANs) The following text is based on the works by Musella & Pandolfi, 2018, Hashemi et al., 2019, Kansal et al., 2022, Rehm et al., 2021, Choi & Lim, 2021 and Kansal et al., 2020

GANs have been proposed as a fast and accurate way of modeling high energy jet formation (Paganini et al., 2017a) and modeling showers through calorimeters of high-energy physics experiments (Paganini et al., 2017; Paganini et al., 2012; Erdman et al., 2020; Musella & Pandolfi, 2018). GANs have also been trained to accurately approximate bottlenecks in computationally expensive simulations of particle physics experiments. Applications in the context of present and proposed CERN experiments have demonstrated the potential of these methods for accelerating simulation and/or improving simulation fidelity (ATLAS Collaboration, 2018; SHiP Collaboration, 2019).

The generative model approximates the combined response of a particle detector simulation and reconstruction algorithms to hadronic jets given the latent space of uniformly distributed noise, auxiliary features and jet image at particle level (jets clustered from the list of stable particles produced by PYTHIA).

In the paper by Musella & Pandolfi, 2018, the authors apply generative models parametrized by neural networks (GANs in particular) to the simulation of particle-detector response to hadronic jets. They show that this parametrization achieves high fidelity while increasing the processing speed by several orders of magnitude.

Their model is trained to be capable of predicting the combined effect of particle-detector simulation models and reconstruction algorithms to hadronic jets.

Generative adversarial networks (GANs) are pairs of neural networks, a generative and a discriminative one, that are trained concurrently as players of a minimax game (Musella & Pandolfi, 2018). The task of the generative network is to produce, starting from a latent space with a fixed distribution, samples that the discriminative model tries to distinguish from samples drawn from a target dataset. This kind of setup allows the distribution of the target dataset to be learned, provided that both of the networks have high enough capacity.

The inputs to these networks are hadronic jets, represented as \"gray-scale\" images of fixed size centered around the jet axis, with the pixel intensity corresponding to the energy fraction in a given cell. The architectures of the networks are based on image-to-image translation. There are a few differences between this approach and image-to-image translation. Firstly, non-empty pixels are explicitly modelled in the generated images since these are much sparser than the natural ones. Secondly, feature matching and a dedicated adversarial classifier enforce good modelling of the total pixel intensity (energy). Lastly, the generator is conditioned on some auxiliary inputs.

By predicting directly the objects used at analysis level and thus reproducing the output of both detector simulation and reconstruction algorithms, computation time is reduced. This kind of philosophy is very similar to parametrized detector simulations, which are used in HEP for phenomenological studies. The attained accuracies are comparable to the full simulation and reconstruction chain.

"},{"location":"optimization/data_augmentation.html#variational-autoencoders-vaes","title":"Variational autoencoders (VAEs)","text":"

The following section is partly based on Otten et al., 2021

In contrast to the traditional autoencoder (AE) that outputs a single value for each encoding dimension, variational autoencoders (VAEs) provide a probabilistic interpretation for describing an observation in latent space.

In the case of VAEs, the encoder model is sometimes referred to as the recognition model and the decoder model as the generative model.

By constructing the encoder model to output a distribution of the values from which we randomly sample to feed into our decoder model, we are enforcing a continuous, smooth latent space representation. Thus we expect our decoder model to be able to accurately reconstruct the input for any sampling of the latent distributions, which then means that values residing close to each other in latent space should have very similar reconstructions.
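The sampling step can be sketched with NumPy. The example assumes the common choice of a diagonal Gaussian latent distribution, for which the encoder outputs a mean and a log-variance per latent dimension; the function names are hypothetical, and the encoder and decoder networks themselves are omitted.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    # Reparameterisation: z = mu + sigma * eps, with eps drawn from N(0, 1).
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # KL divergence between N(mu, sigma^2) and N(0, 1), summed over latent dims.
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

rng = np.random.default_rng(0)
mu = np.array([[0.5, -1.0], [0.0, 0.0]])       # stand-in for encoder output
log_var = np.array([[-2.0, -2.0], [0.0, 0.0]])  # stand-in for encoder output
z = sample_latent(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
print(z.shape, kl)
```

The KL term is what is usually added to the reconstruction loss to keep the latent distributions close to a standard normal, which encourages the continuous, smooth latent space described above.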

"},{"location":"optimization/data_augmentation.html#ml-powered-data-generation-for-fast-simulation","title":"ML-powered Data Generation for Fast Simulation","text":"

The following text is based on Chen et al., 2020

We rely on accurate simulation of physics processes. However, it is currently very common for LHC physics to be affected by large systematic uncertainties due to the limited amount of simulated data, especially for precise measurements of SM processes for which large datasets are already available. So far, the most widely used simulator is GEANT4, which provides state-of-the-art accuracy. But running it is demanding, both in terms of time and resources. Consequently, delivering synthetic data at the pace at which the LHC delivers real data is one of the most challenging tasks for the computing infrastructures of the LHC experiments. The typical time it takes to simulate one single event is in the ballpark of 100 seconds.

Recently, generative algorithms based on deep learning have been proposed as a possible solution to speed up GEANT4. However, one needs to work beyond the collision-as-image paradigm so that the DL-based simulation accounts for the irregular geometry of a typical detector while delivering a dataset in a format compatible with downstream reconstruction software.

One method to solve this bottleneck was proposed by Chen et al., 2020. They adopt a generative DL model to convert an analysis specific representation of collision events at generator level to the corresponding representation at reconstruction level. Thus, this novel, fast-simulation workflow starts from a large amount of generator-level events to deliver large analysis-specific samples.

They trained a neural network to model detector resolution effects as a transfer function acting on an analysis-specific set of relevant features, computed at generator level. However, their model does not sample events from a latent space (like a GAN or a plain VAE). Instead, it works as a fast simulator of a given generator-level event, preserving the correspondence between the reconstructed and the generated event, which allows us to compare event-by-event residual distributions. Furthermore, this model is much simpler than a generative model.

Step one in this workflow is generating events in their full format, which is the most resource-heavy task: as noted before, generating one event takes roughly 100 seconds. With the newly proposed method, however, O(1000) events are generated per second. It would also save on storage: for the full format O(1) MB/event is needed, whereas for the DL model only 8 MB was used to store 100000 events. To train the model, they used an NVIDIA RTX2080 and it trained for 30 minutes, which in terms of overall production time is negligible. For generating N=1M events and n=10%N, one would save 90% of the CPU resources and 79% of the disk storage. Thus augmenting the centrally produced data is a viable method and could help the HEP community to face the computing challenges of the High-Luminosity LHC.

Another, more extreme approach investigated the use of GANs and VAEs for generating physics quantities which are relevant to a specific analysis. In this case, one learns the N-dimensional density function of the event, in a space defined by the quantities of interest for a given analysis. By sampling from this function, one can generate new data. There is a trade-off between the statistical uncertainty (which decreases with an increasing number of generated events) and the systematic uncertainty that could be induced by an inaccurate description of the N-dimensional pdf.

Qualitatively, no accuracy deterioration was observed due to scaling the dataset size for DL. This observation supports the robustness of the proposed methodology and its effectiveness for data augmentation.

"},{"location":"optimization/data_augmentation.html#open-challenges-in-data-augmentation","title":"Open challenges in Data Augmentation","text":"

Excerpts are taken from Li, 2020

The limitations of conventional data augmentation approaches reveal huge opportunities for research advances. Below we summarize a few challenges that motivate some of the works in the area of data augmentation.

  • From manual to automated search algorithms: As opposed to performing suboptimal manual search, how can we design learnable algorithms to find augmentation strategies that can outperform human-designed heuristics?
  • From practical to theoretical understanding: Despite the rapid progress of creating various augmentation approaches pragmatically, understanding their benefits remains a mystery because of a lack of analytic tools. How can we theoretically understand various data augmentations used in practice?
  • From coarse-grained to fine-grained model quality assurance: While most existing data augmentation approaches focus on improving the overall performance of a model, it is often imperative to have a finer-grained perspective on critical subpopulations of data. When a model exhibits inconsistent predictions on important subgroups of data, how can we exploit data augmentations to mitigate the performance gap in a prescribed way?
"},{"location":"optimization/data_augmentation.html#references","title":"References","text":"
  • Shorten & Khoshgoftaar, 2019, \"A survey on Image Data Augmentation for Deep Learning\"
  • Freer & Yang, 2020, \"Data augmentation for self-paced motor imagery classification with C-LSTM\"
  • Li, 2020, \"Automating Data Augmentation: Practice, Theory and New Direction\"
  • Rebuffi et al., 2021, \"Data Augmentation Can Improve Robustness\"
  • Erdmann et al., 2018, \"Lorentz Boost Networks: Autonomous Physics-Inspired Feature Engineering\"
  • Dolan & Ore, 2021, \"Meta-learning and data augmentation for mass-generalised jet taggers\"
  • Bradshaw et al., 2019, \"Mass agnostic jet taggers\"
  • Chang et al., 2018, \"What is the Machine Learning?\"
  • Oliveira et al. 2017, \"Jet-Images \u2013 Deep Learning Edition\"
  • Barnard et al., 2016, \"Parton Shower Uncertainties in Jet Substructure Analyses with Deep Neural Networks\"
  • Chen et al., 2020, \"Data augmentation at the LHC through analysis-specific fast simulation with deep learning\"
  • Musella & Pandolfi, 2018, \"Fast and accurate simulation of particle detectors using generative adversarial networks\"
  • Hashemi et al., 2019, \"LHC analysis-specific datasets with Generative Adversarial Networks\"
  • Kansal et al., 2022, \"Particle Cloud Generation with Message Passing Generative Adversarial Networks\"
  • Rehm et al., 2021, \"Reduced Precision Strategies for Deep Learning: A High Energy Physics Generative Adversarial Network Use Case\"
  • Choi & Lim, 2021, \"A Data-driven Event Generator for Hadron Colliders using Wasserstein Generative Adversarial Network\"
  • Kansal et al., 2020, \"Graph Generative Adversarial Networks for Sparse Data Generation in High Energy Physics\"
  • Otten et al., 2021, \"Event Generation and Statistical Sampling for Physics with Deep Generative Models and a Density Information Buffer\"
  • Yap et al., 2013, \"An Application of Oversampling, Undersampling, Bagging and Boosting in Handling Imbalanced Datasets\"
  • Au et al., 2010, \"Mining Rare Events Data by Sampling and Boosting: A Case Study\"
  • Chawla et al., 2002, \"SMOTE: Synthetic Minority Over-sampling Technique\"
  • He et al., 2010, \"ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning\"
  • Erdman et al., 2020, \"Precise simulation of electromagnetic calorimeter showers using a Wasserstein Generative Adversarial Network\"
  • Paganini et al., 2012, \"CaloGAN: Simulating 3D High Energy Particle Showers in Multi-Layer Electromagnetic Calorimeters with Generative Adversarial Networks\"
  • Paganini et al., 2017, \"Accelerating Science with Generative Adversarial Networks: An Application to 3D Particle Showers in Multi-Layer Calorimeters\"
  • Paganini et al., 2017, \"Learning Particle Physics by Example: Location-Aware Generative Adversarial Networks for Physics Synthesis\"
  • ATLAS Collaboration, 2018, \"Deep generative models for fast shower simulation in ATLAS\"
  • SHiP Collaboration, 2019, \"Fast simulation of muons produced at the SHiP experiment using Generative Adversarial Networks\"

Content may be edited and published elsewhere by the author.

Page author: Laurits Tani, 2022

"},{"location":"optimization/importance.html","title":"Feature Importance","text":"

Feature importance is the impact a specific input field has on a prediction model's output. In general, these impacts can range from no impact (i.e. a feature with no variance) to perfect correlation with the output. There are several reasons to consider feature importance:

  • Important features can be used to create simplified models, e.g. to mitigate overfitting.
  • Using only important features can reduce the latency and memory requirements of the model.
  • The relative importance of a set of features can yield insight into the nature of an otherwise opaque model (improved interpretability).
  • If a model is sensitive to noise, rejecting irrelevant inputs may improve its performance.

In the following subsections, we detail several strategies for evaluating feature importance. We begin with a general discussion of feature importance at a high level before offering a code-based tutorial on some common techniques. We conclude with additional notes and comments in the last section.

"},{"location":"optimization/importance.html#general-discussion","title":"General Discussion","text":"

Most feature importance methods fall into one of three broad categories: filter methods, embedded methods, and wrapper methods. Here we give a brief overview of each category with relevant examples:

"},{"location":"optimization/importance.html#filter-methods","title":"Filter Methods","text":"

Filter methods do not rely on a specific model, instead considering features in the context of a given dataset. In this way, they may be considered to be pre-processing steps. In many cases, the goal of feature filtering is to reduce high dimensional data. However, these methods are also applicable to data exploration, wherein an analyst simply seeks to learn about a dataset without actually removing any features. This knowledge may help interpret the performance of a downstream predictive model. Relevant examples include,

  • Domain Knowledge: Perhaps the most obvious strategy is to select features relevant to the domain of interest.

  • Variance Thresholding: One basic filtering strategy is to simply remove features with low variance. In the extreme case, features with zero variance do not vary from example to example, and will therefore have no impact on the model's final prediction. Likewise, features with variance below a given threshold may not affect a model's downstream performance.

  • Fisher Scoring: Fisher scoring can be used to rank features; the analyst would then select the highest scoring features as inputs to a subsequent model.

  • Correlations: Correlated features introduce a certain degree of redundancy to a dataset, so reducing the number of strongly correlated variables may not impact a model's downstream performance.
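
As a rough illustration of two of the filters above, the following sketch applies variance thresholding and a pairwise correlation check to the scikit-learn diabetes toy dataset. The appended constant column and the correlation cut of 0.8 are assumptions made purely for illustration, not recommended settings.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import VarianceThreshold

X = load_diabetes().data  # 442 examples, 10 features

# Variance thresholding: append a constant column, which carries no information
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
selector = VarianceThreshold(threshold=0.0)  # removes zero-variance features only
X_reduced = selector.fit_transform(X_aug)
print(X_aug.shape, X_reduced.shape)

# Correlation filter: list feature pairs whose |Pearson r| exceeds a cut
corr = np.corrcoef(X, rowvar=False)
cut = 0.8  # illustrative threshold (assumption)
pairs = [(i, j) for i in range(corr.shape[0])
         for j in range(i + 1, corr.shape[1]) if abs(corr[i, j]) > cut]
print(pairs)
```

Each flagged pair is a candidate for keeping only one of its two members; which one to keep is a judgment call that domain knowledge can inform.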

"},{"location":"optimization/importance.html#embedded-methods","title":"Embedded Methods","text":"

Embedded methods are specific to a prediction model: feature selection is built into the model's training procedure rather than applied as a separate step. Examples:

  • L1 Regularization (LASSO): L1 regularization directly penalizes large model weights. In the context of linear regression, for example, this amounts to enforcing sparsity in the regression coefficients; weights corresponding to less relevant features will be driven to 0, nullifying the feature's effect on the output.
"},{"location":"optimization/importance.html#wrapper-methods","title":"Wrapper Methods","text":"

Wrapper methods iterate on prediction models in the context of a given dataset. In general they may be computationally expensive when compared to filter methods. Examples:

  • Permutation Importance: Direct interpretation isn't always feasible, so other methods have been developed to inspect a feature's importance. One common and broadly-applicable method is to randomly shuffle a given feature's input values and test the degradation of model performance. This process allows us to measure permutation importance as follows. First, fit a model (\\(f\\)) to training data, yielding \\(f(X_\\mathrm{train})\\), where \\(X_\\mathrm{train}\\in\\mathbb{R}^{n\\times d}\\) for \\(n\\) input examples with \\(d\\) features. Next, measure the model's performance on testing data for some score \\(\\mathcal{L}\\) (higher is better, e.g. accuracy or \\(R^2\\)), i.e. \\(s=\\mathcal{L}\\big(f(X_\\mathrm{test}), y_\\mathrm{test}\\big)\\). For each feature \\(j\\in[1\\ ..\\ d]\\), randomly shuffle the corresponding column in \\(X_\\mathrm{test}\\) to form \\(X_\\mathrm{test}^{(j)}\\). Repeat this process \\(K\\) times, so that for \\(k\\in [1\\ ..\\ K]\\) each random shuffling of feature column \\(j\\) gives a corrupted input dataset \\(X_\\mathrm{test}^{(j,k)}\\). Finally, define the permutation importance of feature \\(j\\) as the difference between the un-corrupted test score and the average test score over the corrupted \\(X_\\mathrm{test}^{(j,k)}\\) datasets:
\\[\\texttt{PI}_j = s - \\frac{1}{K}\\sum_{k=1}^{K} \\mathcal{L}[f(X_\\mathrm{test}^{(j,k)}), y_\\mathrm{test}]\\]
  • Recursive Feature Elimination (RFE): Given a prediction model and test/train dataset splits with \\(D\\) initial features, RFE returns the set of \\(d < D\\) features that maximize model performance. First, the model is trained on the full set of features. The importance of each feature is ranked depending on the model type (e.g. for regression, the slopes are a sufficient ranking measure; permutation importance may also be used). The least important feature is rejected and the model is retrained. This process is repeated until the most significant \\(d\\) features remain.
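
The permutation importance procedure described above can be sketched by hand in a few lines. This is a minimal illustration, assuming the model's own score method (here R^2, where higher is better) as the performance measure; the helper name permutation_importance_manual is hypothetical and is not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def permutation_importance_manual(model, X, y, n_repeats=10, seed=0):
    # hypothetical helper: baseline score minus mean score on shuffled copies
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)  # s: score on the un-corrupted data
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        corrupted_scores = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # shuffle column j only
            corrupted_scores.append(model.score(X_perm, y))
        importances[j] = baseline - np.mean(corrupted_scores)
    return importances

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)
model = Ridge(alpha=1e-2).fit(X_train, y_train)

pi = permutation_importance_manual(model, X_test, y_test)
for j in np.argsort(-pi):
    print(f'{data.feature_names[j]:<4} {pi[j]:.3f}')
```

Because the shuffles are random, the exact numbers depend on the seed and on the number of repeats; only the rough ordering of the features should be read from the output.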
"},{"location":"optimization/importance.html#introduction-by-example","title":"Introduction by Example","text":""},{"location":"optimization/importance.html#direct-interpretation","title":"Direct Interpretation","text":"

Linear regression is particularly interpretable because the prediction coefficients themselves can be interpreted as a measure of feature importance. Here we will compare this direct interpretation to several model inspection techniques. In the following examples we use the Diabetes Dataset available as a Scikit-learn toy dataset. This dataset maps 10 biological markers to a 1-dimensional quantitative measure of diabetes progression:

from sklearn.datasets import load_diabetes\nfrom sklearn.model_selection import train_test_split\n\ndiabetes = load_diabetes()\nX_train, X_val, y_train, y_val = train_test_split(diabetes.data, diabetes.target, random_state=0)\nprint(X_train.shape)\n>>> (331, 10)\nprint(y_train.shape)\n>>> (331,)\nprint(X_val.shape)\n>>> (111, 10)\nprint(y_val.shape)\n>>> (111,)\nprint(diabetes.feature_names)\n>>> ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']\n
To begin, let's use Ridge Regression (L2-regularized linear regression) to model diabetes progression as a function of the input markers. The absolute value of the regression coefficient (slope) corresponding to a feature can be interpreted as the impact of that feature on the final fit:

import numpy as np\nfrom sklearn.linear_model import Ridge\n\nmodel = Ridge(alpha=1e-2).fit(X_train, y_train)\nprint(f'Initial model score: {model.score(X_val, y_val):.3f}')\n\n# rank features by the magnitude of their regression coefficients\nfor i in np.argsort(-abs(model.coef_)):\n    print(f'{diabetes.feature_names[i]}: {abs(model.coef_[i]):.3f}')\n\n>>> Initial model score: 0.357\n>>> bmi: 592.253\n>>> s5: 580.078\n>>> bp: 297.258\n>>> s1: 252.425\n>>> sex: 203.436\n>>> s3: 145.196\n>>> s4: 97.033\n>>> age: 39.103\n>>> s6: 32.945\n>>> s2: 20.906\n
These results indicate that the bmi and s5 fields have the largest impact on the output of this regression model, while age, s6, and s2 have the smallest. Further interpretation is subject to the nature of the input data (see Common Pitfalls in the Interpretation of Coefficients of Linear Models). Note that scikit-learn has tools available to facilitate feature selection.

"},{"location":"optimization/importance.html#permutation-importance","title":"Permutation Importance","text":"

In the context of our ridge regression example, we can calculate the permutation importance of each feature as follows (based on scikit-learn docs):

from sklearn.inspection import permutation_importance\n\nmodel = Ridge(alpha=1e-2).fit(X_train, y_train)\nprint(f'Initial model score: {model.score(X_val, y_val):.3f}')\n\nr = permutation_importance(model, X_val, y_val, n_repeats=30, random_state=0)\nfor i in r.importances_mean.argsort()[::-1]:\n    print(f\"{diabetes.feature_names[i]:<8}\"\n          f\"{r.importances_mean[i]:.3f}\"\n          f\" +/- {r.importances_std[i]:.3f}\")\n\n>>> Initial model score: 0.357\n>>> s5      0.204 +/- 0.050\n>>> bmi     0.176 +/- 0.048\n>>> bp      0.088 +/- 0.033\n>>> sex     0.056 +/- 0.023\n>>> s1      0.042 +/- 0.031\n>>> s4      0.003 +/- 0.008\n>>> s6      0.003 +/- 0.003\n>>> s3      0.002 +/- 0.013\n>>> s2      0.002 +/- 0.003\n>>> age     -0.002 +/- 0.004\n
These results are roughly consistent with the direct interpretation of the linear regression parameters; s5 and bmi are the most permutation-important features. This is because both have significant permutation importance scores (0.204, 0.176) when compared to the initial model score (0.357), meaning their random permutations significantly degraded the model performance. On the other hand, s2 and age have approximately no permutation importance, meaning that the model's performance was robust to random permutations of these features.

"},{"location":"optimization/importance.html#l1-enforced-sparsity","title":"L1-Enforced Sparsity","text":"

In some applications it may be useful to reject features with low importance. Models biased towards sparsity are one way to achieve this goal, as they are designed to ignore a subset of features with the least impact on the model's output. In the context of linear regression, sparsity can be enforced by imposing L1 regularization on the regression coefficients (LASSO regression):

\\[\\mathcal{L}_\\mathrm{LASSO} = \\frac{1}{2n}||y - Xw||^2_2 + \\alpha||w||_1\\]

Depending on the strength of the regularization \\((\\alpha)\\), this loss function is biased to zero-out features of low importance. In our diabetes regression example,

from sklearn.linear_model import Lasso\n\nmodel = Lasso(alpha=1e-1).fit(X_train, y_train)\nprint(f'Model score: {model.score(X_val, y_val):.3f}')\n\nfor i in np.argsort(-abs(model.coef_)):\n    print(f'{diabetes.feature_names[i]}: {abs(model.coef_[i]):.3f}')\n\n>>> Model score: 0.355\n>>> bmi: 592.203\n>>> s5: 507.363\n>>> bp: 240.124\n>>> s3: 219.104\n>>> sex: 129.784\n>>> s2: 47.628\n>>> s1: 41.641\n>>> age: 0.000\n>>> s4: 0.000\n>>> s6: 0.000\n
For this value of \\(\\alpha\\), we see that the model has rejected the age, s4, and s6 features as unimportant (consistent with the permutation importance measures above) while achieving a similar model score as the previous ridge regression strategy.
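
To see how the regularization strength controls the sparsity, one can scan a few values of alpha and count the surviving features. The sketch below is only illustrative: the list of alpha values and the raised max_iter are assumptions, not tuned choices.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

data = load_diabetes()
X_train, X_val, y_train, y_val = train_test_split(
    data.data, data.target, random_state=0)

alphas = [1e-3, 1e-2, 1e-1, 1.0]  # illustrative scan (assumption)
n_kept = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_train, y_train)
    # features whose coefficient was not driven exactly to zero
    kept = [name for name, w in zip(data.feature_names, model.coef_) if w != 0.0]
    n_kept.append(len(kept))
    print(f'alpha={alpha:g}: score={model.score(X_val, y_val):.3f}, kept={kept}')
```

Stronger regularization generally keeps fewer features, at some point at the expense of the validation score, so alpha is itself a hyperparameter to be chosen on data not used for the final evaluation.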

"},{"location":"optimization/importance.html#recursive-feature-elimination","title":"Recursive Feature Elimination","text":"

Another common strategy is recursive feature elimination (RFE). Though RFE can be used for regression applications as well, we turn our attention to a classification task for the sake of variety. The following discussions are based on the Breast Cancer Wisconsin Diagnostic Dataset, which maps 30 numeric features corresponding to digitized breast mass images to a binary classification of benign or malignant.

from sklearn.datasets import load_breast_cancer\nfrom sklearn.svm import SVC\nfrom sklearn.model_selection import StratifiedKFold\n\nbreast_cancer = load_breast_cancer()\nX_train, X_val, y_train, y_val = train_test_split(breast_cancer.data, breast_cancer.target, random_state=0)\nprint(X_train.shape)\n>>> (426, 30)\nprint(y_train.shape)\n>>> (426,)\nprint(X_val.shape)\n>>> (143, 30)\nprint(y_val.shape)\n>>> (143,)\nprint(breast_cancer.feature_names)\n>>> ['mean radius' 'mean texture' 'mean perimeter' 'mean area' 'mean smoothness' 'mean compactness' 'mean concavity' 'mean concave points' 'mean symmetry' 'mean fractal dimension' 'radius error' 'texture error' 'perimeter error' 'area error' 'smoothness error' 'compactness error' 'concavity error' 'concave points error' 'symmetry error' 'fractal dimension error' 'worst radius' 'worst texture' 'worst perimeter' 'worst area' 'worst smoothness' 'worst compactness' 'worst concavity' 'worst concave points' 'worst symmetry' 'worst fractal dimension']\n

Given a classifier and a classification task, recursive feature elimination (RFE, see original paper) is the process of identifying the subset of input features leading to the best-performing model. Here we employ a support vector machine classifier (SVM) with a linear kernel to perform binary classification on the input data. We ask for the top \\(j\\in[1\\ .. \\ d]\\) most important features in a for loop, computing the classification accuracy when only these features are leveraged.

from sklearn.feature_selection import RFE\n\nfeatures = np.array(breast_cancer.feature_names)\nsvc = SVC(kernel='linear')\nfor n_features in np.arange(1, 30, 1):\n    rfe = RFE(estimator=svc, step=1, n_features_to_select=n_features)\n    rfe.fit(X_train, y_train)\n    print(f'n_features={n_features}, accuracy={rfe.score(X_val, y_val):.3f}')\n    print(f' - selected: {features[rfe.support_]}')\n\n>>> n_features=1, accuracy=0.881\n>>>  - selected: ['worst concave points']\n>>> n_features=2, accuracy=0.874\n>>>  - selected: ['worst concavity' 'worst concave points']\n>>> n_features=3, accuracy=0.867\n>>>  - selected: ['mean concave points' 'worst concavity' 'worst concave points']\n ...\n>>> n_features=16, accuracy=0.930\n>>> n_features=17, accuracy=0.965\n>>> n_features=18, accuracy=0.951\n...\n>>> n_features=27, accuracy=0.958\n>>> n_features=28, accuracy=0.958\n>>> n_features=29, accuracy=0.958\n
Here we've shown a subset of the output. In the first output lines, we see that the 'worst concave points' feature alone leads to 88.1% accuracy. Including the next two most important features actually degrades the classification accuracy. We then skip to the top 17 features, which in this case we observe to yield the best performance for the linear SVM classifier. The addition of more features does not lead to additional performance boosts. In this way, RFE can be treated as a model wrapper introducing an additional hyperparameter, n_features_to_select, which can be used to optimize model performance. A more principled optimization using k-fold cross validation with RFE is available in the scikit-learn docs.
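
A cross-validated choice of the number of features might look like the following sketch, which uses scikit-learn's RFECV. Standardizing the inputs beforehand is an assumption made here to keep the linear SVM fits fast; since the scaler is fit on the whole training split before the folds are formed, the cross-validation scores are slightly optimistic.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X_train, X_val, y_train, y_val = train_test_split(
    data.data, data.target, random_state=0)

# standardize with training statistics only (simplification, see lead-in)
scaler = StandardScaler().fit(X_train)
X_train_s, X_val_s = scaler.transform(X_train), scaler.transform(X_val)

# RFE wrapped in 5-fold stratified cross validation
rfecv = RFECV(estimator=SVC(kernel='linear'), step=1,
              cv=StratifiedKFold(n_splits=5), scoring='accuracy')
rfecv.fit(X_train_s, y_train)

print(f'selected {rfecv.n_features_} features')
print(f'validation accuracy: {rfecv.score(X_val_s, y_val):.3f}')
print(data.feature_names[rfecv.support_])
```

The selected number of features need not coincide with the value found in the loop above, because the inputs are scaled differently and the selection criterion is the cross-validated accuracy rather than the accuracy on a single validation split.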

"},{"location":"optimization/importance.html#feature-correlations","title":"Feature Correlations","text":"

In the above, we have focused specifically on interpreting the importance of single features. However, it may be that several features are correlated, sharing the responsibility for the overall prediction of the model. In this case, some measures of feature importance may inappropriately downweight correlated features in a so-called correlation bias (see Classification with Correlated Features: Unreliability of Feature Ranking and Solutions). For example, the permutation importance of \\(d\\) correlated features is shown to decrease (as a function of correlation strength) faster for higher \\(d\\) (see Correlation and Variable importance in Random Forests).

We can see these effects in action using the breast cancer dataset, following the corresponding scikit-learn example:

from sklearn.ensemble import RandomForestClassifier\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.datasets import load_breast_cancer\n\ndata = load_breast_cancer()\nX, y = data.data, data.target\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n\nclf = RandomForestClassifier(n_estimators=100, random_state=42)\nclf.fit(X_train, y_train)\nprint(\"Accuracy on test data: {:.2f}\".format(clf.score(X_test, y_test)))\n\n>>> Accuracy on test data: 0.97\n
Here we've implemented a random forest classifier and achieved a high accuracy (97%) on the benign vs. malignant predictions. The permutation importances for the 10 most important training features are:

r = permutation_importance(clf, X_train, y_train, n_repeats=10, random_state=42)\nfor i in r.importances_mean.argsort()[::-1][:10]:\n    print(f\"{data.feature_names[i]:<8}\"\n          f\"  {r.importances_mean[i]:.5f}\"\n          f\" +/- {r.importances_std[i]:.5f}\")\n\n>>> worst concave points  0.00681 +/- 0.00305\n>>> mean concave points  0.00329 +/- 0.00188\n>>> worst texture  0.00258 +/- 0.00070\n>>> radius error  0.00235 +/- 0.00000\n>>> mean texture  0.00188 +/- 0.00094\n>>> mean compactness  0.00188 +/- 0.00094\n>>> area error  0.00188 +/- 0.00094\n>>> worst concavity  0.00164 +/- 0.00108\n>>> mean radius  0.00141 +/- 0.00115\n>>> compactness error  0.00141 +/- 0.00115\n

In this case, even the most permutation-important feature, worst concave points, has a mean importance score \\(<0.007\\), which doesn't indicate much importance. This is surprising, because we saw via RFE that a linear SVM can achieve \\(\\approx 88\\%\\) classification accuracy with this feature alone. This indicates that worst concave points, in addition to other meaningful features, may belong to subclusters of correlated features. In the corresponding scikit-learn example, the authors show that subsets of correlated features can be extracted by calculating a dendrogram and selecting representative features from each correlated subset. They achieve \\(97\\%\\) accuracy (the same as with the full dataset) by selecting only five such representative variables.
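
One way to extract such correlated subsets is hierarchical clustering on the rank correlations between features. The sketch below follows the general idea of the scikit-learn example; the distance threshold t=1.0 and the rule of keeping the first feature of each cluster are illustrative assumptions.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = data.data

corr = spearmanr(X).correlation   # 30 x 30 rank-correlation matrix
corr = (corr + corr.T) / 2        # enforce exact symmetry
np.fill_diagonal(corr, 1.0)
distance = 1.0 - np.abs(corr)     # strongly correlated -> small distance

# Ward linkage on the condensed distance matrix, then cut the dendrogram
linkage = hierarchy.ward(squareform(distance))
cluster_ids = hierarchy.fcluster(linkage, t=1.0, criterion='distance')

# keep the first feature found in each cluster as its representative
representatives = [int(np.flatnonzero(cluster_ids == c)[0])
                   for c in np.unique(cluster_ids)]
print(data.feature_names[representatives])
```

A model retrained on the representative features alone can then be inspected with permutation importance without the importance being shared among near-duplicate inputs.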

"},{"location":"optimization/importance.html#feature-importance-in-decision-trees","title":"Feature Importance in Decision Trees","text":"

Here we focus on decision trees, which are particularly interpretable classifiers that often appear as ensembles (e.g. boosted decision tree (BDT) algorithms) in HEP. Consider a classification dataset \\(X=\\{x_n\\}_{n=1}^{N}\\), \\(x_n\\in\\mathbb{R}^{D}\\), with truth labels \\(Y=\\{y_n\\}_{n=1}^N\\), \\(y_n\\in\\{1,...,C\\}\\) corresponding to \\(C\\) classes. These truth labels naturally partition \\(X\\) into subsets \\(X_c\\) with class probabilities \\(p(c)=|X_c|/|X|\\). Decision trees begin with a root node \\(t_0\\) containing all of \\(X\\). The tree is grown from the root by recursively splitting the input set \\(X\\) in a principled way; internal nodes (or branch nodes) correspond to a decision of the form

\\[\\begin{aligned} &(x_n)_d\\leq\\delta \\implies\\ \\text{sample}\\ n\\ \\text{goes to left child node}\\\\ &(x_n)_d>\\delta \\implies\\ \\text{sample}\\ n\\ \\text{goes to right child node} \\end{aligned}\\]

We emphasize that the decision boundary is drawn by considering a single feature field \\(d\\) and partitioning the \\(n^\\mathrm{th}\\) sample by the value at that feature field. Decision boundaries at each internal parent node \\(t_P\\) are formed by choosing a \"split criterion,\" which describes how to partition the set of elements at this node into left and right child nodes \\(t_L\\), \\(t_R\\) with \\(X_{t_L}\\subset X_{t_P}\\) and \\(X_{t_R}\\subset X_{t_P}\\), \\(X_{t_L}\\cup X_{t_R}=X_{t_P}\\). This partitioning is optimal if \\(X_{t_L}\\) and \\(X_{t_R}\\) are pure, each containing only members of the same class. Impurity measures are used to evaluate the degree to which the set of data points at a given tree node \\(t\\) is not pure. One common impurity measure is Gini Impurity,

\\[\\begin{aligned} I(t) = \\sum_{c=1}^C p(c|t)(1-p(c|t)) \\end{aligned}\\]

Here, \\(p(c|t)\\) is the probability of drawing a member of class \\(c\\) from the set of elements at node \\(t\\). For example, the Gini impurity at the root node (corresponding to the whole dataset) is

\\[\\begin{aligned} I(t_0) = \\sum_{c=1}^C \\frac{|X_c|}{|X|}(1-\\frac{|X_c|}{|X|}) \\end{aligned}\\]

In a balanced binary dataset, this would give \\(I(t_0)=1/2\\). If the set at node \\(t\\) is pure, i.e. class labels corresponding to \\(X_t\\) are identical, then \\(I(t)=0\\). We can use \\(I(t)\\) to produce an optimal splitting from parent \\(t_P\\) to children \\(t_L\\) and \\(t_R\\) by defining an impurity gain,

\\[\\begin{aligned} \\Delta I = I(t_P) - I(t_L) - I(t_R) \\end{aligned}\\]

This quantity describes the relative impurity between a parent node and its children. If \\(X_{t_P}\\) contains only two classes, an optimal splitting would separate them into \\(X_{t_L}\\) and \\(X_{t_R}\\), producing pure children nodes with \\(I(t_L)=I(t_R)=0\\) and, correspondingly, \\(\\Delta I(t_P) = I(t_P)\\). Accordingly, good splitting decisions should maximize impurity gain. Note that the impurity gain is often weighted, for example Scikit-Learn defines:

\\[\\begin{aligned} \\Delta I(t_P) = \\frac{|X_{t_P}|}{|X|}\\bigg(I(t_P) - \\frac{|X_{t_L}|}{|X_{t_P}|} I(t_L) - \\frac{|X_{t_R}|}{|X_{t_P}|} I(t_R) \\bigg) \\end{aligned}\\]
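
The Gini impurity and the weighted impurity gain defined above can be checked numerically with a short sketch. The function names gini and weighted_gain are hypothetical helpers written for this illustration, and the six-element label array is a made-up toy node.

```python
import numpy as np

def gini(labels):
    # I(t) = sum_c p(c|t) * (1 - p(c|t))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1.0 - p)))

def weighted_gain(parent, left, right, n_total):
    # weighted impurity gain of splitting parent into left and right
    n_p, n_l, n_r = len(parent), len(left), len(right)
    return (n_p / n_total) * (gini(parent)
                              - (n_l / n_p) * gini(left)
                              - (n_r / n_p) * gini(right))

parent = np.array([0, 0, 0, 1, 1, 1])  # balanced binary node
i_parent = gini(parent)
gain_perfect = weighted_gain(parent, parent[:3], parent[3:], len(parent))
gain_imperfect = weighted_gain(parent, parent[:2], parent[2:], len(parent))
print(i_parent, gain_perfect, gain_imperfect)
```

For this toy node the impurity is 1/2, the perfect split recovers the full parent impurity as gain (1/2), and the imperfect split, which leaves one class-0 sample in the right child, yields a gain of 1/4.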

In general, a pure node cannot be split further and must therefore be a leaf. Likewise, a node for which there is no splitting yielding \\(\\Delta I > 0\\) must be labeled a leaf. These splitting decisions are made recursively at each node in a tree until some stopping condition is met. Stopping conditions may include a maximum tree depth, a maximum number of leaf nodes, or a minimum threshold on the impurity gain.

Impurity gain gives us insight into the importance of a decision. In particular, larger \\(\\Delta I\\) indicates a more important decision. If some feature \\((x_n)_d\\) is the basis for several decision splits in a decision tree, the sum of impurity gains at these splits gives insight into the importance of this feature. Accordingly, one measure of the feature importance of \\(d\\) is the average (with respect to the total number of internal nodes) impurity gain imparted by decision splits on \\(d\\). This method generalizes to the case of BDTs, in which case one would average this quantity across all weak learner trees in the ensemble.

Note that though decision trees are based on the feature \\(d\\) producing the best (maximum impurity gain) split at a given branch node, surrogate splits are often used to retain additional splits corresponding to features other than \\(d\\). Denote the feature maximizing the impurity gain \\(d_1\\) and producing a split boundary \\(\\delta_1\\). Surrogate splitting involves tracking secondary splits with boundaries \\(\\delta_2, \\delta_3,...\\) corresponding to \\(d_2,d_3,...\\) that have the highest correlation with the maximum impurity gain split. The upshot is that in the event that input data is missing a value at field \\(d_1\\), there are backup decision boundaries to use, mitigating the need to define multiple trees for similar data. Using this generalized notion of a decision tree, wherein each branch node contains a primary decision boundary maximizing impurity gain and several additional surrogate split boundaries, we can average the impurity gain produced at feature field \\(d\\) over all its occurrences as a decision split or a surrogate split. This definition of feature importance generalizes the previous to include additional correlations.

"},{"location":"optimization/importance.html#example","title":"Example","text":"

Let us now turn to an example:

import numpy as np\nimport matplotlib.pyplot as plt\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.datasets import load_wine\nfrom sklearn.inspection import DecisionBoundaryDisplay\nfrom sklearn.metrics import log_loss\nfrom sklearn.model_selection import train_test_split\n\nwine_data = load_wine() \nprint(wine_data.data.shape)\nprint(wine_data.feature_names)\nprint(np.unique(wine_data.target))\n>>> (178, 13)\n>>> ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']\n>>> [0 1 2]\n

This sklearn wine dataset has 178 entries with 13 features and truth labels corresponding to membership in one of \\(C=3\\) classes. We can train a decision tree classifier as follows:

X, y = wine_data.data, wine_data.target\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)\nclassifier = DecisionTreeClassifier(criterion='gini', splitter='best', random_state=27)\nclassifier.fit(X_train, y_train)\nX_test_pred = classifier.predict(X_test)\nprint('Test Set Performance')\nprint('Number misclassified:', sum(X_test_pred!=y_test))\nprint(f'Accuracy: {classifier.score(X_test, y_test):.3f}')\n>>> Test Set Performance\n>>> Number misclassified: 0\n>>> Accuracy: 1.000\n

In this case, the classifier has generalized perfectly, fitting the test set with \\(100\\%\\) accuracy. Let's take a look into how it makes predictions:

tree = classifier.tree_\nn_nodes = tree.node_count\nnode_features = tree.feature\nthresholds = tree.threshold\nchildren_L = tree.children_left\nchildren_R = tree.children_right\nfeature_names = np.array(wine_data.feature_names)\n\nprint(f'The tree has {n_nodes} nodes')\nfor n in range(n_nodes):\n    if children_L[n]==children_R[n]: continue # leaf node\n    print(f'Decision split at node {n}:',\n          f'{feature_names[node_features[n]]}({node_features[n]}) <=',\n          f'{thresholds[n]:.2f}')\n\n>>> The tree has 13 nodes\n>>> Decision split at node 0: color_intensity(9) <= 3.46\n>>> Decision split at node 2: od280/od315_of_diluted_wines(11) <= 2.48\n>>> Decision split at node 3: flavanoids(6) <= 1.40\n>>> Decision split at node 5: color_intensity(9) <= 7.18\n>>> Decision split at node 8: proline(12) <= 724.50\n>>> Decision split at node 9: malic_acid(1) <= 3.33\n

Here we see that several features are used to generate decision boundaries. For example, the dataset is split at the root node by a cut on the \\(\\texttt{color_intensity}\\) feature. The importance of each feature can be taken to be the average impurity gain it generates across all nodes, so we expect that one (or several) of the five unique features used at the decision splits will be the most important features by this definition. Indeed, we see,

feature_names = np.array(wine_data.feature_names)\nimportances = classifier.feature_importances_\nfor i in range(len(importances)):\n    print(f'{feature_names[i]}: {importances[i]:.3f}')\nprint('\\nMost important features', \n      feature_names[np.argsort(importances)[-3:]])\n\n>>> alcohol: 0.000\n>>> malic_acid: 0.021\n>>> ash: 0.000\n>>> alcalinity_of_ash: 0.000\n>>> magnesium: 0.000\n>>> total_phenols: 0.000\n>>> flavanoids: 0.028\n>>> nonflavanoid_phenols: 0.000\n>>> proanthocyanins: 0.000\n>>> color_intensity: 0.363\n>>> hue: 0.000\n>>> od280/od315_of_diluted_wines: 0.424\n>>> proline: 0.165\n\n>>> Most important features ['proline' 'color_intensity' 'od280/od315_of_diluted_wines']\n
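
The impurity-based importances reported by scikit-learn can be reproduced from the fitted tree by summing the weighted impurity decrease of every split on each feature and normalizing the result. The sketch below refits a tree on the full wine dataset (an assumption made to keep the example self-contained, so its numbers need not match the output above) and assumes no surrogate splits, which scikit-learn trees do not use.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
clf = DecisionTreeClassifier(criterion='gini', random_state=27).fit(X, y)
tree = clf.tree_
w = tree.weighted_n_node_samples  # (weighted) number of samples at each node

manual = np.zeros(X.shape[1])
for node in range(tree.node_count):
    left, right = tree.children_left[node], tree.children_right[node]
    if left == right:  # leaf node: both children are -1
        continue
    # weighted impurity decrease produced by the split at this node
    gain = (w[node] * tree.impurity[node]
            - w[left] * tree.impurity[left]
            - w[right] * tree.impurity[right])
    manual[tree.feature[node]] += gain

manual /= manual.sum()  # normalize so that the importances sum to one
print(np.allclose(manual, clf.feature_importances_))
```

Note that this is a normalized sum rather than an average over internal nodes; the two differ only by an overall factor, so the ranking of the features is the same.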

This is an embedded method for generating feature importance - it's cooked right into the decision tree model. Let's verify these results using a wrapper method, permutation importance:

from sklearn.inspection import permutation_importance\n\nprint(f'Initial classifier score: {classifier.score(X_test, y_test):.3f}')\n\nr = permutation_importance(classifier, X_test, y_test, n_repeats=30, random_state=0)\nfor i in r.importances_mean.argsort()[::-1]:\n    print(f\"{feature_names[i]:<8}\"\n          f\" {r.importances_mean[i]:.3f}\"\n          f\" +/- {r.importances_std[i]:.3f}\")\n\n>>> Initial classifier score: 1.000\n\n>>> color_intensity 0.266 +/- 0.040\n>>> od280/od315_of_diluted_wines 0.237 +/- 0.049\n>>> proline  0.210 +/- 0.041\n>>> flavanoids 0.127 +/- 0.025\n>>> malic_acid 0.004 +/- 0.008\n>>> hue      0.000 +/- 0.000\n>>> proanthocyanins 0.000 +/- 0.000\n>>> nonflavanoid_phenols 0.000 +/- 0.000\n>>> total_phenols 0.000 +/- 0.000\n>>> magnesium 0.000 +/- 0.000\n>>> alcalinity_of_ash 0.000 +/- 0.000\n>>> ash      0.000 +/- 0.000\n>>> alcohol  0.000 +/- 0.000\n

The tree's performance is hurt the most if the \\(\\texttt{color_intensity}\\), \\(\\texttt{od280/od315_of_diluted_wines}\\), or \\(\\texttt{proline}\\) features are permuted, consistent with the impurity gain measure of feature importance.

"},{"location":"optimization/model_optimization.html","title":"Model optimization","text":"

This page summarizes the concepts shown in a contribution on Bayesian Optimization to the ML Forum and may be edited and published elsewhere by the author.

"},{"location":"optimization/model_optimization.html#what-we-talk-about-when-we-talk-about-model-optimization","title":"What we talk about when we talk about model optimization","text":"

Given some data \\(x\\) and a family of functionals parameterized by (a vector of) parameters \\(\\theta\\) (e.g. for DNN training weights), the problem of learning consists in finding \\(\\mathrm{argmin}_\\theta\\, \\mathrm{Loss}(f_\\theta(x) - y_\\mathrm{true})\\). The treatment below focusses on gradient descent, but the formalization is completely general, i.e. it can be applied also to methods that are not explicitly formulated in terms of gradient descent (e.g. BDTs). The mathematical formalism for the problem of learning is briefly explained in a contribution on statistical learning to the ML forum: for the purposes of this documentation we will proceed through two illustrations.

The first illustration, elaborated from an image by the huawei forums shows the general idea behind learning through gradient descent in a multidimensional parameter space, where the minimum of a loss function is found by following the function's gradient until the minimum.

The cartoon illustrates the general idea behind gradient descent to find the minimum of a function in a multidimensional parameter space (figure elaborated from an image by the huawei forums).

The model to be optimized via a loss function typically is a parametric function, where the set of parameters (e.g. the network weights in neural networks) corresponds to a certain fixed structure of the network. For example, a fully connected network with two inputs, two inner layers of two neurons, and one output neuron will have ten weights (\\(2\\times2 + 2\\times2 + 2\\times1\\)), plus five biases if these are used, whose values will be changed until the loss function reaches its minimum.

When we talk about model optimization we refer to the fact that often we are interested in finding which model structure is the best to describe our data. The main concern is to design a model that has a sufficient complexity to store all the information contained in the training data. We can therefore think of parameterizing the network structure itself, e.g. in terms of the number of inner layers and number of neurons per layer: these hyperparameters define a space where we want to again minimize a loss function. Formally, the parametric function \\(f_\\theta\\) is also a function of these hyperparameters \\(\\lambda\\): \\(f_{(\\theta, \\lambda)}\\), and the \\(\\lambda\\) can be optimized.

The second illustration, also elaborated from an image by the huawei forums, broadly illustrates this concept: for each point in the hyperparameters space (that is, for each configuration of the model), the individual model is optimized as usual. The global minimum over the hyperparameters space is then sought.

The cartoon illustrates the general idea behind gradient descent to optimize the model complexity (in terms of the choice of hyperparameters) in a multidimensional parameter and hyperparameter space (figure elaborated from an image by the huawei forums)."},{"location":"optimization/model_optimization.html#caveat-which-data-should-you-use-to-optimize-your-model","title":"Caveat: which data should you use to optimize your model","text":"

In typical machine learning studies, you should divide your dataset into three parts. One is used for training the model (training sample), one is used for testing the performance of the model (test sample), and the third one is the one where you actually use your trained model, e.g. for inference (application sample). Sometimes you may get away with using test data as application data: Helge Voss (Chap 5 of Behnke et al.) states that this is acceptable under three conditions that must be simultaneously valid:

  • no hyperparameter optimization is performed;
  • no overtraining is found;
  • the number of training data is high enough to make statistical fluctuations negligible.

If you are doing any kind of hyperparameter optimization, thou shalt NOT use the test sample as application sample. You should have at least three distinct sets, and ideally you should use four (training, testing, hyperparameter optimization, application).
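
A four-way split can be obtained by applying train_test_split repeatedly. This is a minimal sketch on placeholder data; the equal fractions (one quarter per sample) are an arbitrary assumption and should be adapted to the size of the dataset at hand.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)  # placeholder data (assumption)
y = np.arange(1000) % 2

# peel off the application sample first, then split the remainder
X_rest, X_app, y_rest, y_app = train_test_split(
    X, y, test_size=0.25, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(
    X_rest, y_rest, test_size=1/3, random_state=0)
X_train, X_hpo, y_train, y_hpo = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_hpo), len(X_test), len(X_app))
```

The four samples are disjoint by construction, so the hyperparameter optimization sample never overlaps with the test or application samples.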

"},{"location":"optimization/model_optimization.html#grid-search","title":"Grid Search","text":"

The simplest hyperparameter optimization algorithm is the grid search, where you train all the models in the hyperparameter space to build the full landscape of the global loss function, as illustrated in Goodfellow, Bengio, Courville: \"Deep Learning\".

The cartoon illustrates the general idea behind grid search (image taken from Goodfellow, Bengio, Courville: \"Deep Learning\").

To perform a meaningful grid search, you have to provide a set of values within the acceptable range of each hyperparameter, then for each point in the cross-product space you have to train the corresponding model.

The main issue with grid search is that when there are unimportant hyperparameters (i.e. hyperparameters whose value has little influence on the model performance) the algorithm spends an exponentially large time (in the number of unimportant hyperparameters) in uninteresting configurations: having \\(m\\) hyperparameters and testing \\(n\\) values for each of them leads to \\(\\mathcal{O}(n^m)\\) tested configurations. While the issue may be mitigated by parallelization, when the number of hyperparameters (the dimension of the hyperparameter space) surpasses a handful, even parallelization can't help.

Another issue is that the search is binned: depending on the granularity in the scan, the global minimum may be invisible.

Despite these issues, grid search is sometimes still a feasible choice, and works best when done iteratively. For example, if you start from the set of values \\(\\{-1, 0, 1\\}\\):

  • if the best parameter is found to be at the boundary (1), then extend the range (\\(\\{1, 2, 3\\}\\)) and do the search in the new range;
  • if the best parameter is e.g. at 0, then maybe zoom in and do a search in the range \\(\\{-0.1, 0, 0.1\\}\\).
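
The two rules above can be sketched for a single hyperparameter as follows. The function names, the toy loss with its minimum at 2.37, the zoom factor of 10, and the fixed number of rounds are assumptions made for illustration; in practice each call to the loss would be a full training.

```python
def toy_loss(x):
    # Stand-in for 'train a model with hyperparameter x and return the
    # validation loss'; the minimum is placed at 2.37 for illustration.
    return (x - 2.37) ** 2

def iterative_grid_search(loss, grid, n_rounds=6):
    # grid holds three equally spaced values [low, middle, high].
    for _ in range(n_rounds):
        step = grid[1] - grid[0]
        i_best = min(range(3), key=lambda i: loss(grid[i]))
        best = grid[i_best]
        if i_best == 2:
            # Best value on the upper boundary: extend the range upwards.
            grid = [best, best + step, best + 2 * step]
        elif i_best == 0:
            # Best value on the lower boundary: extend the range downwards.
            grid = [best - 2 * step, best - step, best]
        else:
            # Best value in the middle: zoom in around it.
            grid = [best - step / 10, best, best + step / 10]
    return min(grid, key=loss)

best_x = iterative_grid_search(toy_loss, [-1, 0, 1])
print(best_x)
```

A fixed number of rounds is used here for simplicity; a real search would more likely stop once the loss no longer improves.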
"},{"location":"optimization/model_optimization.html#random-search","title":"Random search","text":"

An improvement of the grid search is the random search, which proceeds like this:

  • you provide a marginal p.d.f. for each hyperparameter;
  • you sample from the joint p.d.f. a certain number of training configurations;
  • you train for each of these configurations to build the loss function landscape.

This procedure has significant advantages over a simple grid search: random search is not binned, because you are sampling from a continuous p.d.f., so the pool of explorable hyperparameter values is larger; random search is exponentially more efficient, because it tests a unique value for each influential hyperparameter on nearly every trial.
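
The three steps above can be sketched as follows. The marginal p.d.f.s (log-uniform learning rate, uniform dropout, discrete uniform number of layers), the number of trials, and `toy_loss` are assumptions chosen for illustration; sampling each hyperparameter independently from its marginal is equivalent to sampling from the joint p.d.f. when the joint is the product of the marginals.

```python
import math
import random

rng = random.Random(42)  # fixed seed so the example is reproducible

def sample_config():
    # One draw from the joint p.d.f., built as a product of marginals.
    return {
        'learning_rate': 10 ** rng.uniform(-4, -1),  # log-uniform
        'dropout': rng.uniform(0.0, 0.5),            # uniform
        'n_layers': rng.randint(2, 5),               # discrete uniform
    }

def toy_loss(config):
    # Stand-in for 'train the model and return the validation loss'.
    # Like many real problems, it depends strongly on one hyperparameter
    # (learning rate), weakly on another, and not at all on dropout.
    return (math.log10(config['learning_rate']) + 2.0) ** 2 + 0.01 * abs(config['n_layers'] - 3)

trials = [sample_config() for _ in range(50)]
best = min(trials, key=toy_loss)

# Every trial probes a different value of the influential hyperparameter.
distinct_lr = len({t['learning_rate'] for t in trials})
print(distinct_lr, best)
```

With 50 trials the learning rate is probed at 50 different values, whereas a grid of comparable size over three hyperparameters would probe it at only three or four.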

Random search also works best when done iteratively. The differences between grid and random search are again illustrated in Goodfellow, Bengio, Courville: \"Deep Learning\".

The cartoon illustrates the general idea behind random search, as opposed to grid search (image taken from Goodfellow, Bengio, Courville: \"Deep Learning\")."},{"location":"optimization/model_optimization.html#model-based-optimization-by-gradient-descent","title":"Model-based optimization by gradient descent","text":"

Now that we have looked at the most basic model optimization techniques, we are ready to look into using gradient descent to solve a model optimization problem. We will proceed by recasting the problem as one of model selection, where the hyperparameters are the input (decision) variables, and the model selection criterion is a differentiable validation set error. The validation set error attempts to describe the complexity of the network by a single hyperparameter (details in [a contribution on statistical learning to the ML forum]). The problem may be solved with standard gradient descent, as illustrated above, if we assume that the training criterion \\(C\\) is continuous and differentiable with respect to both the parameters \\(\\theta\\) (e.g. weights) and the hyperparameters \\(\\lambda\\). Unfortunately, the gradient is seldom available (either because it has a prohibitive computational cost, or because the criterion is non-differentiable, as is the case when there are discrete variables).

A diagram illustrating the way gradient-based model optimization works has been prepared by Bengio, doi:10.1162/089976600300015187.

The diagram illustrates the way model optimization can be recast as a model selection problem, where a model selection criterion involves a differentiable validation set error (image taken from Bengio, doi:10.1162/089976600300015187)."},{"location":"optimization/model_optimization.html#model-based-optimization-by-surrogates","title":"Model-based optimization by surrogates","text":"

Sequential Model-based Global Optimization (SMBO) consists in replacing the loss function with a surrogate model of it, when the loss function (i.e. the validation set error) is not available. The surrogate is typically built as a Bayesian regression model, where one estimates the expected value of the validation set error for each hyperparameter together with the uncertainty in this expectation. The pseudocode for the SMBO algorithm is illustrated by Bergstra et al.

The diagram illustrates the pseudocode for the Sequential Model-based Global Optimization (image taken from Bergstra et al).

This procedure results in a tradeoff between exploration, i.e. proposing hyperparameters with high uncertainty, which may or may not result in substantial improvement, and exploitation, i.e. proposing hyperparameters that will likely perform as well as the current proposal (usually this means values close to the current ones). The disadvantage is that the whole procedure must run until completion before giving any usable information as output. By comparison, manual or random searches tend to give hints on the location of the minimum faster.

"},{"location":"optimization/model_optimization.html#bayesian-optimization","title":"Bayesian Optimization","text":"

We are now ready to tackle in full what is referred to as Bayesian optimization.

Bayesian optimization assumes that the unknown function \\(f(\\theta, \\lambda)\\) was sampled from a Gaussian process (GP), and that after the observations it maintains the corresponding posterior. In this context, observations are the various validation set errors for different values of the hyperparameters \\(\\lambda\\). In order to pick the next value to probe, one maximizes some estimate of the expected improvement (see below). To understand the meaning of \"sampled from a Gaussian process\", we need to define what a Gaussian process is.

"},{"location":"optimization/model_optimization.html#gaussian-processes","title":"Gaussian processes","text":"

Gaussian processes (GPs) generalize the concept of Gaussian distribution over discrete random variables to the concept of Gaussian distribution over continuous functions. Given some data and an estimate of the Gaussian noise, by fitting a function one can estimate also the noise at the interpolated points. This estimate is made by similarity with contiguous points, adjusted by the distance between points. A GP is therefore fully described by its mean and its covariance function. An illustration of Gaussian processes is given in Kevin Jamieson's CSE599 lecture notes.

The diagram illustrates the evolution of a Gaussian process, when adding interpolating points (image taken from Kevin Jamieson's CSE599 lecture notes).

GPs are great for Bayesian optimization because they provide, out of the box, the expected value (i.e. the mean of the process) and its uncertainty (covariance function).

"},{"location":"optimization/model_optimization.html#the-basic-idea-behind-bayesian-optimization","title":"The basic idea behind Bayesian optimization","text":"

Gradient descent methods are intrinsically local: the decision on the next step is taken based on the local gradient and Hessian approximations. Bayesian optimization (BO) with GP priors instead uses all the information from the previous steps, by encoding it in a model that gives the expectation and its uncertainty. The consequence is that GP-based BO can find the minimum of difficult nonconvex functions in relatively few evaluations, at the cost of performing more computations to find the next point to try in the hyperparameter space.

The BO prior is a prior over the space of the functions. GPs are especially suited to play the role of BO prior, because marginals and conditionals can be computed in closed form (thanks to the properties of the Gaussian distribution).

There are several methods to choose the acquisition function (the function that selects the next step for the algorithm), but there is no omnipurpose recipe: the best approach is problem-dependent. The acquisition function involves an accessory optimization to maximize a certain quantity; typical choices are:

  • maximize the probability of improvement over the current best value: can be calculated analytically for a GP;
  • maximize the expected improvement over the current best value: can also be calculated analytically for a GP;
  • maximize the GP Upper confidence bound: minimize \"regret\" over the course of the optimization.
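
The three choices above can be sketched for a Gaussian posterior as follows. This is a minimal illustration under stated assumptions: the objective is to be maximized (for a loss, one would use its negative), `mu` and `sigma` are the GP posterior mean and standard deviation at a candidate point, `f_best` is the best value observed so far, and the candidate values and the exploration parameter `kappa` are made up for the example.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_of_improvement(mu, sigma, f_best):
    if sigma <= 0.0:
        return 1.0 if mu > f_best else 0.0
    return norm_cdf((mu - f_best) / sigma)

def expected_improvement(mu, sigma, f_best):
    if sigma <= 0.0:
        return max(mu - f_best, 0.0)
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm_cdf(z) + sigma * norm_pdf(z)

def upper_confidence_bound(mu, sigma, kappa=2.0):
    # kappa sets the balance between exploitation (mu) and exploration (sigma).
    return mu + kappa * sigma

# Hypothetical posterior (mu, sigma) at three candidate points.
candidates = {'a': (0.9, 0.05), 'b': (0.8, 0.4), 'c': (1.0, 0.01)}
f_best = 1.0

pick_pi = max(candidates, key=lambda k: probability_of_improvement(*candidates[k], f_best))
pick_ei = max(candidates, key=lambda k: expected_improvement(*candidates[k], f_best))
pick_ucb = max(candidates, key=lambda k: upper_confidence_bound(*candidates[k]))
print(pick_pi, pick_ei, pick_ucb)
```

In this toy case the probability of improvement prefers the safe candidate `c`, while expected improvement and the upper confidence bound prefer the uncertain candidate `b`, which shows why the choice of acquisition function is problem-dependent.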
"},{"location":"optimization/model_optimization.html#historical-note","title":"Historical note","text":"

Gaussian process regression is also called kriging in geostatistics, after Daniel G. Krige (1951), who pioneered the concept later formalized by Matheron (1962).

"},{"location":"optimization/model_optimization.html#bayesian-optimization-in-practice","title":"Bayesian optimization in practice","text":"

The figure below, taken from a tutorial on BO by Martin Krasser, clarifies the procedure rather well. The task is to approximate the target function (labelled noise free objective in the figure), given some noisy samples of it (the black crosses). At the first iteration, one starts from a flat surrogate function, with a given uncertainty, and fits it to the noisy samples. To choose the next sampling location, a certain acquisition function is computed, and the value that maximizes it is chosen as the next sampling location. At each iteration, more noisy samples are added, until the distance between consecutive sampling locations is minimized (or, equivalently, a measure of the value of the best selected sample is maximized).

Practical illustration of Bayesian Optimization (images taken from a tutorial on BO by Martin Krasser)."},{"location":"optimization/model_optimization.html#limitations-and-some-workaround-of-bayesian-optimization","title":"Limitations (and some workaround) of Bayesian Optimization","text":"

There are three main limitations to the BO approach. A good overview of these limitations and of possible solutions can be found in arXiv:1206.2944.

First of all, it is unclear what an appropriate choice for the covariance function and its associated hyperparameters is. In particular, the standard squared exponential kernel is often too smooth. As a workaround, alternative kernels may be used: a common choice is the Mat\u00e9rn 5/2 kernel, which is similar to the squared exponential one but allows for non-smoothness.

Another issue is that, for certain problems, the function evaluation may take very long to compute. To overcome this, often one can replace the function evaluation with the Monte Carlo integration of the expected improvement over the GP hyperparameters, which is faster.

The third main issue is that for complex problems one would ideally like to take advantage of parallel computation. The procedure is iterative, however, and it is not easy to come up with a scheme to make it parallelizable. The referenced paper proposed sampling over the expected acquisition, conditioned on all the pending evaluations: this is computationally cheap and is intrinsically parallelizable.

"},{"location":"optimization/model_optimization.html#alternatives-to-gaussian-processes-tree-based-models","title":"Alternatives to Gaussian processes: Tree-based models","text":"

Gaussian processes model \\(P(hyperpar | data)\\) directly, but they are not the only suitable surrogate models for Bayesian optimization.

The so-called Tree-structured Parzen Estimator (TPE), described in Bergstra et al, models \\(P(data | hyperpar)\\) and \\(P(hyperpar)\\) separately, and then obtains the posterior by explicit application of Bayes' theorem. TPEs exploit the fact that the choice of hyperparameters is intrinsically graph-structured, in the sense that e.g. you first choose the number of layers, then choose the neurons per layer, etc. TPEs run over this generative process by replacing the hyperparameter priors with nonparametric densities. These generative nonparametric densities are built by classifying the hyperparameter values into those that result in a worse or a better loss than the current proposal.

TPEs have been used in CMS already around 2017 in a VHbb analysis (see repository by Sean-Jiun Wang) and in a charged Higgs to tb search (HIG-18-004, doi:10.1007/JHEP01(2020)096).

"},{"location":"optimization/model_optimization.html#implementations-of-bayesian-optimization","title":"Implementations of Bayesian Optimization","text":"
  • Implementations in R are readily available as the R-studio tuning package;
  • Scikit-learn provides a handy implementation of Gaussian processes;
  • scipy provides a handy implementation of the optimization routines;
  • hyperopt provides a handy implementation of distributed hyperparameter optimization routines;
    • GPs not coded by default, hence must rely on scikit-learn;
    • Parzen tree estimators are implemented by default (together with random search);
  • Several handy tutorials online focussed on hyperparameters optimization
    • Tutorial by Martin Krasser;
    • Tutorial by Jason Brownlee;
  • Early example of hyperopt in CMS
    • VHbb analysis (repository by Sean-Jiun Wang), for optimization of a BDT;
    • Charged Higgs (HIG-18-004, doi:10.1007/JHEP01(2020)096), for optimization of a DNN (no public link for the code, contact me if needed);
  • Several expansions and improvements (particularly targeted at HPC clusters) are available, see e.g. this talk by Eric Wulff.
"},{"location":"optimization/model_optimization.html#caveats-dont-get-too-obsessed-with-model-optimization","title":"Caveats: don't get too obsessed with model optimization","text":"

In general, optimizing model structure is a good thing. F. Chollet, for example, says \"If you want to get to the very limit of what can be achieved on a given task, you can't be content with arbitrary choices made by a fallible human\". On the other hand, for many problems hyperparameter optimization results in only small improvements, and there is a tradeoff between improvement and time spent on the task: sometimes the time spent on optimization may not be worth it, e.g. when the loss landscape in hyperparameter space is very flat (i.e. different hyperparameter sets give more or less the same results), particularly if you already know that small improvements will be eaten up by e.g. systematic uncertainties. Then again, before you perform the optimization you don't know whether the landscape is flat or whether you can expect substantial improvements. Sometimes broad grid or random searches may give you a hint on whether the landscape of the hyperparameter space is flat or not.

Sometimes you may get good (and faster) improvements by model ensembling rather than by model optimization. To do model ensembling, you first train a handful of models (either different methods such as BDT, SVM, NN, etc., or different hyperparameter sets): \\(pred\\_a = model\\_a.predict(x)\\), ..., \\(pred\\_d = model\\_d.predict(x)\\). You then pool the predictions: \\(pooled\\_pred = (pred\\_a + pred\\_b + pred\\_c + pred\\_d)/4\\). This works if all models are reasonably good: if one is significantly worse than the others, then \\(pooled\\_pred\\) may not be as good as the best model of the pool.

You can also find ways of ensembling in a smarter way, e.g. by doing weighted rather than simple averages: \\(pooled\\_pred = 0.5\\cdot pred\\_a + 0.25\\cdot pred\\_b + 0.1\\cdot pred\\_c + 0.15\\cdot pred\\_d\\), where the weights sum to one so that no further normalization is needed. Here the idea is to give more weight to better classifiers. However, you transfer the problem to having to choose the weights. These can be found empirically by using random search or other algorithms like Nelder-Mead (result = scipy.optimize.minimize(objective, pt, method='nelder-mead')), where you build simplexes (polytopes with N+1 vertices in N dimensions, a generalization of the triangle) and move them towards better values of the objective. Nelder-Mead can converge to nonstationary points, but there are extensions of the algorithm that may help.
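
Simple and weighted pooling can be sketched as follows. The four prediction lists, the labels, and the helper names are made up for illustration; the weights are the ones quoted above and sum to one. Mean squared error is used only as a convenient toy figure of merit.

```python
# Hypothetical per-event scores from four models and the true labels.
labels = [1, 0, 1, 0]
preds = {
    'a': [0.9, 0.2, 0.8, 0.1],
    'b': [0.8, 0.3, 0.6, 0.2],
    'c': [0.6, 0.5, 0.4, 0.4],
    'd': [0.7, 0.4, 0.7, 0.3],
}

def pool(preds, weights):
    # Weighted average of the model predictions, event by event.
    n_events = len(next(iter(preds.values())))
    return [sum(weights[k] * preds[k][i] for k in preds) for i in range(n_events)]

def mse(pred, truth):
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

# Simple average: every model gets the same weight.
equal_weights = {k: 1.0 / len(preds) for k in preds}
simple = pool(preds, equal_weights)

# Weighted average: more weight to the better models.
weights = {'a': 0.5, 'b': 0.25, 'c': 0.1, 'd': 0.15}
weighted = pool(preds, weights)

print(mse(simple, labels), mse(weighted, labels))
```

In this toy case the weighted pool has the lower error because model `a` is the most accurate one; with real models the weights would have to be tuned on a dedicated sample, for instance with random search or Nelder-Mead as mentioned above.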

This page summarizes the concepts shown in a contribution on Bayesian Optimization to the ML Forum. Content may be edited and published elsewhere by the author. Page author: Pietro Vischia, 2022

"},{"location":"resources/cloud_resources/index.html","title":"Cloud Resources","text":"

Work in progress.

"},{"location":"resources/dataset_resources/index.html","title":"CMS-ML Dataset Tab","text":""},{"location":"resources/dataset_resources/index.html#introduction","title":"Introduction","text":"

Welcome to the CMS-ML Dataset tab! This tab is designed to provide accurate, up-to-date, and relevant data for a variety of purposes. We strive to make it a useful resource for your analysis and decision-making needs. We are working on benchmarking more datasets and presenting them in a user-friendly format. This tab will be continuously updated to reflect the latest developments. Explore, analyze, and derive insights with ease!

"},{"location":"resources/dataset_resources/index.html#1-jetnet","title":"1. JetNet","text":""},{"location":"resources/dataset_resources/index.html#links","title":"Links","text":"

Github Repository

Zenodo

"},{"location":"resources/dataset_resources/index.html#description","title":"Description","text":"

JetNet is a project aimed at enhancing accessibility and reproducibility in jet-based machine learning. It offers easy-to-access and standardized interfaces for several datasets, including JetNet, TopTagging, and QuarkGluon. Additionally, JetNet provides standard implementations of various generative evaluation metrics such as Fr\u00e9chet Physics Distance (FPD), Kernel Physics Distance (KPD), Wasserstein-1 (W1), Fr\u00e9chet ParticleNet Distance (FPND), coverage, and Minimum Matching Distance (MMD). Beyond these, it includes a differentiable implementation of the energy mover's distance and other general jet utilities, making it a comprehensive resource for researchers and practitioners in the field.

"},{"location":"resources/dataset_resources/index.html#nature-of-objects","title":"Nature of Objects","text":"
  • Objects: Gluon (g), Top Quark (t), Light Quark (q), W boson (w), and Z boson (z) jets of ~1 TeV transverse momentum (\\(p_T\\))
  • Number of Objects: N = 177252, 177945, 170679, 177172, 176952 for g, t, q, w, z jets respectively.
"},{"location":"resources/dataset_resources/index.html#format-of-dataset","title":"Format of Dataset","text":"
  • File Type: HDF5
  • Structure: Each file has particle_features and jet_features arrays, containing the list of particles' features per jet and the corresponding jet's features, respectively. particle_features is of shape [N, 30, 4], where N is the total number of jets, 30 is the max number of particles per jet, and 4 is the number of particle features, in order: [\\(\\eta\\), \\(\\varphi\\), \\(p_T\\), mask]. See Zenodo for definitions of these. jet_features is of shape [N, 4], where 4 is the number of jet features, in order: [\\(p_T\\), \\(\\eta\\), mass, # of particles].
"},{"location":"resources/dataset_resources/index.html#related-projects","title":"Related Projects","text":"
  • Top tagging benchmark
  • Particle Cloud Generation with Message Passing Generative Adversarial Networks
"},{"location":"resources/dataset_resources/index.html#2-top-tagging-benchmark-dataset","title":"2. Top Tagging Benchmark Dataset","text":""},{"location":"resources/dataset_resources/index.html#links_1","title":"Links","text":"

Zenodo

"},{"location":"resources/dataset_resources/index.html#description_1","title":"Description","text":"

A set of MC simulated training/testing events for the evaluation of top quark tagging architectures.

  • 14 TeV, hadronic tops for signal, QCD dijets background, Delphes ATLAS detector card with Pythia8
  • No MPI/pile-up included
  • Clustering of particle-flow entries (produced by Delphes E-flow) into anti-kT 0.8 jets in the pT range [550,650] GeV
  • All top jets are matched to a parton-level top within \u2206R = 0.8, and to all top decay partons within 0.8
  • Jets are required to have |eta| < 2
  • The leading 200 jet constituent four-momenta are stored, with zero-padding for jets with fewer than 200
  • Constituents are sorted by pT, with the highest pT one first
  • The truth top four-momentum is stored as truth_px etc.
  • A flag (1 for top, 0 for QCD) is kept for each jet. It is called is_signal_new
  • The variable \"ttv\" (= test/train/validation) is kept for each jet. It indicates to which dataset the jet belongs. It is redundant as the different sets are already distributed as different files.

"},{"location":"resources/dataset_resources/index.html#nature-of-objects_1","title":"Nature of Objects","text":"
  • Objects: 14 TeV, hadronic tops for signal, QCD dijets background, Delphes ATLAS detector card with Pythia8
  • Number of Objects: In total 1.2M training events, 400k validation events and 400k test events.
"},{"location":"resources/dataset_resources/index.html#format-of-dataset_1","title":"Format of Dataset","text":"
  • File Type: HDF5
  • Structure: Use \u201ctrain\u201d for training, \u201cval\u201d for validation during the training and \u201ctest\u201d for final testing and reporting results. For details, see the Zenodo link
"},{"location":"resources/dataset_resources/index.html#related-projects_1","title":"Related Projects","text":"
  • Butter, Anja; Kasieczka, Gregor; Plehn, Tilman and Russell, Michael (2017). Based on data from 10.21468/SciPostPhys.5.3.028 (1707.08966)
  • Kasieczka, Gregor et al (2019). Dataset used for arXiv:1902.09914 (The Machine Learning Landscape of Top Taggers)
"},{"location":"resources/dataset_resources/index.html#more-dataset-coming-in","title":"More dataset coming in!","text":"

Have any questions? Want your dataset shown on this page? Contact the ML Knowledge Subgroup!

"},{"location":"resources/fpga_resources/index.html","title":"FPGA Resource","text":"

Work in progress.

"},{"location":"resources/gpu_resources/cms_resources/lxplus_gpu.html","title":"lxplus-gpu.cern.ch","text":""},{"location":"resources/gpu_resources/cms_resources/lxplus_gpu.html#how-to-use-it","title":"How to use it?","text":"

lxplus-gpu are special lxplus nodes with GPU support. You can access these nodes by executing

ssh <your_user_name>@lxplus-gpu.cern.ch\n

The configuration of the software environment for lxplus-gpu is described in the Software Environments page.

"},{"location":"resources/gpu_resources/cms_resources/lxplus_htcondor.html","title":"HTCondor With GPU resources","text":"

In general, HTCondor supports GPU jobs if there are some worker nodes which are configured with GPU devices. CMS Connect and lxplus both have access to worker nodes equipped with GPUs.

"},{"location":"resources/gpu_resources/cms_resources/lxplus_htcondor.html#how-to-require-gpus-in-htcondor","title":"How to require GPUs in HTCondor","text":"

You can request GPU support for your jobs by adding the following line to the HTCondor submission file.

request_gpus = n # n equal to the number of GPUs required\n
"},{"location":"resources/gpu_resources/cms_resources/lxplus_htcondor.html#further-documentation","title":"Further documentation","text":"

There are good materials providing detailed documentation on how to run HTCondor jobs with GPU support at both machines.

The configuration of the software environment for lxplus-gpu and HTCondor is described in the Software Environments page. Moreover, the page Using container explains step by step how to build a docker image to be run in HTCondor jobs.

"},{"location":"resources/gpu_resources/cms_resources/lxplus_htcondor.html#more-available-resources","title":"More available resources","text":"
  1. Complete documentation can be found in the GPUs section of the CERN Batch Docs, where a Tensorflow example is supplied. This documentation also contains instructions on advanced HTCondor configuration, for instance constraining the GPU device or the CUDA version.
  2. A good example of submitting a GPU HTCondor job @ Lxplus is the weaver-benchmark project. It provides a concrete example of how to set up the environment for the weaver framework and run the training and testing process within a single job. A detailed description can be found in the ParticleNet section of this documentation.

    In principle, this example can be run elsewhere as HTCondor jobs. However, paths to the datasets should be modified to meet the requirements.

  3. CMS Connect also provides documentation on GPU job submission, which includes a Tensorflow example as well.

    When submitting GPU jobs @ CMS Connect, especially for Machine Learning purposes, EOS space @ CERN is not accessible as a directory; therefore one should consider using xrootd utilities as documented in this page.

"},{"location":"resources/gpu_resources/cms_resources/ml_cern_ch.html","title":"ml.cern.ch","text":"

ml.cern.ch is a Kubeflow based ML solution provided by CERN.

"},{"location":"resources/gpu_resources/cms_resources/ml_cern_ch.html#kubeflow","title":"Kubeflow","text":"

Kubeflow is a Kubernetes-based ML toolkit that aims to make deployments of ML workflows simple, portable and scalable. In Kubeflow, the pipeline is an important concept: Machine Learning workflows are described as Kubeflow pipelines for execution.

"},{"location":"resources/gpu_resources/cms_resources/ml_cern_ch.html#how-to-access","title":"How to access","text":"

ml.cern.ch only accepts connections from within the CERN network. Therefore, if you are outside of CERN, you will need to use a network tunnel (e.g. via ssh -D dynamic port forwarding as a SOCKS5 proxy). The main website is shown below.

"},{"location":"resources/gpu_resources/cms_resources/ml_cern_ch.html#examples","title":"Examples","text":"

After logging into the main website, you can click on the Examples entry to browse a GitLab repository containing many examples. For instance, below are two examples from that repository with a well-documented readme file.

  1. mnist-kfp is an example on how to use jupyter notebooks to create a Kubeflow pipeline (kfp) and how to access CERN EOS files.
  2. katib gives an example of how to use Katib to perform hyperparameter tuning for jet tagging with ParticleNet.
"},{"location":"resources/gpu_resources/cms_resources/swan.html","title":"SWAN","text":""},{"location":"resources/gpu_resources/cms_resources/swan.html#preparation","title":"Preparation","text":"
  1. Registration:

    To request GPU resources for SWAN: according to this thread, one can create a ticket through this link to ask for GPU support at SWAN; it is currently in beta and limited to a small scale.

  2. Setup SWAN with GPU resources:

    1. Once the registration is done, one can log in to SWAN with Kerberos support and then create a SWAN environment.

      \ud83d\udca1 Note: When configuring the SWAN environment you will be given your choice of software stack. Be careful to use a software release with GPU support as well as an appropriate CUDA version. If you need to install additional software, it must be compatible with your chosen CUDA version.

Another important option is the environment script, which will be discussed later in this document.

"},{"location":"resources/gpu_resources/cms_resources/swan.html#working-with-swan","title":"Working with SWAN","text":"
  1. After creation, one will be taken to the SWAN main directory My Project, where all existing projects are displayed. A new project can be created by clicking the upper right \"+\" button. After creation one will be redirected to the newly created project, at which point the \"+\" button on the upper right panel can be used to create a new notebook.

  2. It is possible to use the terminal for installing new packages or monitoring computational resources.

    1. For package installation, one can install packages with package management tools, e.g. pip for python. To use the installed packages, you will need to wrap the environment configuration in a script, which will be executed by SWAN. Detailed documentation can be found by clicking the upper right \"?\" button.

    2. In addition to using top and htop to monitor ordinary resources, you can use nvidia-smi to monitor GPU usage.

"},{"location":"resources/gpu_resources/cms_resources/swan.html#examples","title":"Examples","text":"

After installing the packages, you can use GPU-based machine learning algorithms. Two examples are supplied.

  1. The first example aims at using a CNN to perform handwritten digits classification with MNIST dataset. The whole notebook can be found at pytorch_mnist. This example is modified from an official pytorch example.

  2. The second example is modified from the simple MLP example from weaver-benchmark. The whole notebook can be found at toptagging_mlp.

"},{"location":"resources/gpu_resources/cms_resources/notebooks/pytorch_mnist.html","title":"Pytorch mnist","text":"
from __future__ import print_function\nimport argparse\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport torch.optim as optim\nfrom torchvision import datasets, transforms\nfrom torch.optim.lr_scheduler import StepLR\n
class Net(nn.Module):\n    def __init__(self):\n        super(Net, self).__init__()\n        self.conv1 = nn.Conv2d(1, 32, 3, 1)\n        self.conv2 = nn.Conv2d(32, 64, 3, 1)\n        self.dropout1 = nn.Dropout(0.25)\n        self.dropout2 = nn.Dropout(0.5)\n        self.fc1 = nn.Linear(9216, 128)\n        self.fc2 = nn.Linear(128, 10)\n\n    def forward(self, x):\n        x = self.conv1(x)\n        x = F.relu(x)\n        x = self.conv2(x)\n        x = F.relu(x)\n        x = F.max_pool2d(x, 2)\n        x = self.dropout1(x)\n        x = torch.flatten(x, 1)\n        x = self.fc1(x)\n        x = F.relu(x)\n        x = self.dropout2(x)\n        x = self.fc2(x)\n        output = F.log_softmax(x, dim=1)\n        return output\n
def train(args, model, device, train_loader, optimizer, epoch):\n    model.train()\n    for batch_idx, (data, target) in enumerate(train_loader):\n        data, target = data.to(device), target.to(device)\n\n        optimizer.zero_grad()\n        output = model(data)\n        loss = F.nll_loss(output, target)\n        loss.backward()\n        optimizer.step()\n        if batch_idx % args[\"log_interval\"] == 0:\n            print('Train Epoch: {} [{}/{} ({:.0f}%)]\\tLoss: {:.6f}'.format(\n                epoch, batch_idx * len(data), len(train_loader.dataset),\n                100. * batch_idx / len(train_loader), loss.item()))\n            if args[\"dry_run\"]:\n                break\n
def test(model, device, test_loader):\n    model.eval()\n    test_loss = 0\n    correct = 0\n    with torch.no_grad():\n        for data, target in test_loader:\n            data, target = data.to(device), target.to(device)\n            output = model(data)\n            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss\n            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability\n            correct += pred.eq(target.view_as(pred)).sum().item()\n\n    test_loss /= len(test_loader.dataset)\n\n    print('\\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\\n'.format(\n        test_loss, correct, len(test_loader.dataset),\n        100. * correct / len(test_loader.dataset)))\n
torch.cuda.is_available() # Check if cuda is available\n
train_kwargs = {\"batch_size\":64}\ntest_kwargs = {\"batch_size\":1000}\n
cuda_kwargs = {'num_workers': 1,\n               'pin_memory': True,\n               'shuffle': True}\ntrain_kwargs.update(cuda_kwargs)\ntest_kwargs.update(cuda_kwargs)\n
transform=transforms.Compose([\n    transforms.ToTensor(),\n    transforms.Normalize((0.1307,), (0.3081,))\n    ])\n
dataset1 = datasets.MNIST('./data', train=True, download=True,\n                   transform=transform)\ndataset2 = datasets.MNIST('./data', train=False,\n                   transform=transform)\ntrain_loader = torch.utils.data.DataLoader(dataset1,**train_kwargs)\ntest_loader = torch.utils.data.DataLoader(dataset2, **test_kwargs)\n
device = torch.device(\"cuda\")\nmodel = Net().to(device)\noptimizer = optim.Adadelta(model.parameters(), lr=1.0)\nscheduler = StepLR(optimizer, step_size=1, gamma=0.7)\n
args = {\"dry_run\":False, \"log_interval\":100}\nfor epoch in range(1, 14 + 1):\n    train(args, model, device, train_loader, optimizer, epoch)\n    test(model, device, test_loader)\n    scheduler.step()\n
"},{"location":"resources/gpu_resources/cms_resources/notebooks/toptagging_mlp.html","title":"Toptagging mlp","text":"

import torch\nimport torch.nn as nn\nfrom torch.utils.data.dataset import Dataset\nimport pandas as pd\nimport numpy as np\nimport uproot3\nimport torch.optim as optim\nfrom torch.optim.lr_scheduler import StepLR\nimport torch.nn.functional as F\nimport awkward0\n
class MultiLayerPerceptron(nn.Module):\n    r\"\"\"Parameters\n    ----------\n    input_dims : int\n        Input feature dimensions.\n    num_classes : int\n        Number of output classes.\n    layer_params : list\n        List of the feature size for each layer.\n    \"\"\"\n\n    def __init__(self, input_dims, num_classes,\n                 layer_params=(256,64,16),\n                 **kwargs):\n\n        super(MultiLayerPerceptron, self).__init__(**kwargs)\n        channels = [input_dims] + list(layer_params) + [num_classes]\n        layers = []\n        for i in range(len(channels) - 1):\n            layers.append(nn.Sequential(nn.Linear(channels[i], channels[i + 1]),\n                                        nn.ReLU()))\n        self.mlp = nn.Sequential(*layers)\n\n    def forward(self, x):\n        # x: the feature vector initially read from the data structure, in dimension (N, C, P)\n        x = x.flatten(start_dim=1) # (N, L), where L = C * P\n        return self.mlp(x)\n\n    def predict(self,x):\n        # softmax over the class dimension (dim=1) of the (N, num_classes) output\n        pred = F.softmax(self.forward(x), dim=1)\n        ans = []\n        for t in pred:\n            if t[0] > t[1]:\n                ans.append(1)\n            else:\n                ans.append(0)\n        return torch.tensor(ans)\n

def train(args, model, device, train_loader, optimizer, epoch):\n    model.train()\n    for batch_idx, (data, target) in enumerate(train_loader):\n        data, target = data.to(device), target.to(device)\n        optimizer.zero_grad()\n        output = model(data)\n        loss = F.nll_loss(output, target)\n        loss.backward()\n        optimizer.step()\n        if batch_idx % args[\"log_interval\"] == 0:\n            print('Train Epoch: {} [{}/{} ({:.0f}%)]\\tLoss: {:.6f}'.format(\n                epoch, batch_idx * len(data), len(train_loader.dataset),\n                100. * batch_idx / len(train_loader), loss.item()))\n            if args[\"dry_run\"]:\n                break\n
input_branches = [\n                  'Part_Etarel',\n                  'Part_Phirel',\n                  'Part_E_log',\n                  'Part_P_log'\n                 ]\n\noutput_branches = ['is_signal_new']\n
train_dataset = uproot3.open(\"TopTaggingMLP/train.root\")[\"Events\"].arrays(input_branches+output_branches,namedecode='utf-8')\ntrain_dataset = {name:train_dataset[name].astype(\"float32\") for name in input_branches+output_branches}\ntest_dataset = uproot3.open(\"/eos/user/c/coli/public/weaver-benchmark/top_tagging/samples/prep/top_test_0.root\")[\"Events\"].arrays(input_branches+output_branches,namedecode='utf-8')\ntest_dataset = {name:test_dataset[name].astype(\"float32\") for name in input_branches+output_branches}\n
for ds in [train_dataset,test_dataset]:\n    for name in ds.keys():\n        if isinstance(ds[name],awkward0.JaggedArray):\n            ds[name] = ds[name].pad(30,clip=True).fillna(0).regular().astype(\"float32\")\n
class PF_Features(Dataset):\n    def __init__(self,mode = \"train\"):\n        if mode == \"train\":\n            self.x = {key:train_dataset[key] for key in input_branches}\n            self.y = {'is_signal_new':train_dataset['is_signal_new']}\n        elif mode == \"test\":\n            self.x = {key:test_dataset[key] for key in input_branches}\n            self.y = {'is_signal_new':test_dataset['is_signal_new']}\n        elif mode == \"val\":\n            self.x = {key:test_dataset[key] for key in input_branches}\n            self.y = {'is_signal_new':test_dataset['is_signal_new']}\n\n    def __len__(self):\n        return len(self.y['is_signal_new'])\n\n    def __getitem__(self,idx):\n        X = [self.x[key][idx].copy() for key in input_branches]\n        X = np.vstack(X)\n        y = self.y['is_signal_new'][idx].copy()\n        return X,y\n
torch.cuda.is_available() # Check if cuda is available\n
True\n
device = torch.device(\"cuda\")\n
train_kwargs = {\"batch_size\":1000}\ntest_kwargs = {\"batch_size\":10}\ncuda_kwargs = {'num_workers': 1,\n               'pin_memory': True,\n               'shuffle': True}\ntrain_kwargs.update(cuda_kwargs)\ntest_kwargs.update(cuda_kwargs)\n
model = MultiLayerPerceptron(input_dims = 4 * 30, num_classes=2).to(device)\n
optimizer = optim.Adam(model.parameters(), lr=0.01)\n
train_loader = torch.utils.data.DataLoader(PF_Features(mode=\"train\"),**train_kwargs)\ntest_loader = torch.utils.data.DataLoader(PF_Features(mode=\"test\"),**test_kwargs)\n
loss_func = torch.nn.CrossEntropyLoss()\n
args = {\"dry_run\":False, \"log_interval\":500}\nfor epoch in range(1,10+1):\n    for batch_idx, (data, target) in enumerate(train_loader):\n        inputs = data.to(device)#.flatten(start_dim=1)\n        target = target.long().to(device)\n        optimizer.zero_grad()\n        output = model.forward(inputs)\n        loss = loss_func(output,target)\n        loss.backward()\n        optimizer.step()\n        if batch_idx % args[\"log_interval\"] == 0:\n            print('Train Epoch: {} [{}/{} ({:.0f}%)]\\tLoss: {:.6f}'.format(\n                epoch, batch_idx * len(data), len(train_loader.dataset),\n                100. * batch_idx / len(train_loader), loss.item()))\n
"},{"location":"software_envs/containers.html","title":"Using containers","text":"

Containers are a great solution to isolate a software environment, especially in batch systems like lxplus. At the moment two container solutions are supported: Apptainer (previously called Singularity) and Docker.

"},{"location":"software_envs/containers.html#using-singularity","title":"Using Singularity","text":"

The unpacked.cern.ch service, mounted on CVMFS, contains many Singularity images, some of which are suitable for machine learning applications. A description of each image is beyond the scope of this document. However, if you find an image that is useful for your application, you can use it by running a Singularity container with the appropriate options. For example:

singularity run --nv --bind <bind_mount_path> /cvmfs/unpacked.cern.ch/<path_to_image>\n

"},{"location":"software_envs/containers.html#examples","title":"Examples","text":"

After installing the packages, you can use GPU-based machine learning algorithms. Two examples are supplied.

  1. The first example aims at using a CNN to perform handwritten digits classification with MNIST dataset. The whole notebook can be found at pytorch_mnist. This example is modified from an official pytorch example.

  2. The second example is modified from the simple MLP example from weaver-benchmark. The whole notebook can be found at toptagging_mlp.

"},{"location":"software_envs/containers.html#using-docker","title":"Using Docker","text":"

Docker is not supported at the moment in the interactive node of lxplus (like lxplus-gpu). However Docker is supported on HTCondor for job submission.

This option can be very handy for users, as HTCondor can pull images from any public registry, like DockerHub or GitLab registry. The user can follow this workflow:

  1. Define a custom image on top of a commonly available pytorch or tensorflow image
  2. Add the desired packages and configuration
  3. Push the docker image to a registry
  4. Use the image in an HTCondor job

The rest of the page is a step by step tutorial for this workflow.

"},{"location":"software_envs/containers.html#define-the-image","title":"Define the image","text":"
  1. Define a file Dockerfile

    FROM pytorch/pytorch:latest\n\nADD localfolder_with_code /opt/mycode\n\n\nRUN  cd /opt/mycode && pip install -e . # or pip install requirements\n\n# Install the required Python packages\nRUN pip install \\\n    numpy \\\n    sympy \\\n    scikit-learn \\\n    numba \\\n    opt_einsum \\\n    h5py \\\n    cytoolz \\\n    tensorboardx \\\n    seaborn \\\n    rich \\\n    pytorch-lightning==1.7\n\n# Alternatively, install the packages from a requirements file\n# (the destination path below is only an example):\n# ADD requirements.txt /opt/requirements.txt\n# RUN pip install -r /opt/requirements.txt\n
  2. Build the image

    docker build -t username/pytorch-condor-gpu:tag .\n

    and push it (after having set up the credentials with docker login hub.docker.com)

    docker push username/pytorch-condor-gpu:tag\n
  3. Set up the condor job with a submission file submitfile as:

    universe                = docker\ndocker_image            = username/pytorch-condor-gpu:tag\nexecutable              = job.sh\nwhen_to_transfer_output = ON_EXIT\noutput                  = $(ClusterId).$(ProcId).out\nerror                   = $(ClusterId).$(ProcId).err\nlog                     = $(ClusterId).$(ProcId).log\nrequest_gpus            = 1\nrequest_cpus            = 2\n+Requirements           = OpSysAndVer =?= \"CentOS7\"\n+JobFlavour = \"espresso\"\nqueue 1\n
  4. For testing purpose one can start a job interactively and debug

    condor_submit -interactive submitfile\n
"},{"location":"software_envs/lcg_environments.html","title":"LCG environments","text":""},{"location":"software_envs/lcg_environments.html#software-environment","title":"Software Environment","text":"

The software environment for ML application trainings can be set up in different ways. On this page we focus on the CERN lxplus environment.

"},{"location":"software_envs/lcg_environments.html#lcg-release-software","title":"LCG release software","text":"

After choosing a suitable software bundle with CUDA support at http://lcginfo.cern.ch/, one can set up an LCG environment by executing

source /cvmfs/sft.cern.ch/lcg/views/<name of bundle>/x86_64-centos*-gcc11-opt/setup.sh\n

On lxplus-gpu nodes, usually equipped with AlmaLinux 9.1 (also called Centos9), one should use the proper lcg release. At the time of writing (May 2023) the recommended environment to use GPUs is:

source /cvmfs/sft.cern.ch/lcg/views/LCG_103cuda/x86_64-centos9-gcc11-opt/setup.sh\n
"},{"location":"software_envs/lcg_environments.html#customized-environments","title":"Customized environments","text":"

One can create a custom Python environment using the virtualenv or venv tools, in order to avoid interfering with the global Python environment.

The user has the choice of building a virtual environment from scratch or on top of an LCG release.

"},{"location":"software_envs/lcg_environments.html#virtual-environment-from-scratch","title":"Virtual environment from scratch","text":"

The first approach is cleaner but requires downloading the full set of libraries needed for pytorch or TensorFlow (very heavy). Moreover the compatibility with the computing environment (usually lxplus-gpu) is not guaranteed.

  1. Create the environment in a folder of choice, usually called myenv

    python3 -m venv --system-site-packages myenv\nsource myenv/bin/activate   # activate the environment\n# Add following line to .bashrc if you want to activate this environment by default (not recommended)\n# source \"/afs/cern.ch/user/<first letter of your username>/<username>/<path-to-myenv-folder>/myenv/bin/activate\"\n
  2. To install packages properly, one should carefully check the CUDA version with nvidia-smi (as shown in figure before), and then pick a matching package version. PyTorch is used as an example here.

    # Execute the command shown in your terminal\npip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html\npip install jupyterlab matplotlib scikit-hep # install other packages if they are needed\n
"},{"location":"software_envs/lcg_environments.html#virtual-environment-on-top-of-lcg","title":"Virtual environment on top of LCG","text":"

Creating a virtual environment only to add packages on top of a specific LCG release can be a very effective and inexpensive way to manage the Python environment in lxplus.

N.B. A caveat is that users need to remember to activate the LCG environment before activating their virtual environment.

  1. Activate the lcg environment of choice

    source /cvmfs/sft.cern.ch/lcg/views/LCG_103cuda/x86_64-centos9-gcc11-opt/setup.sh\n
  2. Create the environment as above

    python3 -m venv --system-site-packages myenv\nsource myenv/bin/activate   # activate the environment\n
  3. Now the user can work in the environment as before, but the PyTorch and TensorFlow libraries will be available. If a single package needs to be updated, one can do

pip install --upgrade tensorflow==newer.version\n

This will install the package in the local environment.

At the next login, the user will need to perform these steps to get back the environment:

source /cvmfs/sft.cern.ch/lcg/views/LCG_103cuda/x86_64-centos9-gcc11-opt/setup.sh\nsource myenv/bin/activate\n
"},{"location":"software_envs/lcg_environments.html#conda-environments","title":"Conda environments","text":"

Using the conda package manager: the conda package manager is more convenient to install and use. To begin with, obtain an Anaconda or Miniconda installer for the Linux x86_64 platform. Then execute it on lxplus.

1. Please note that if you update your shell configuration (e.g. `.bashrc` file) by `conda init`, you may encounter failure due to inconsistent environment configuration.\n2. Installing packages via `conda` also needs special consideration on selecting proper CUDA version as discussed in `pip` part.\n
"},{"location":"training/Decorrelation.html","title":"Decorrelation","text":"

When preparing to train a machine learning algorithm, it is important to think about the correlations of the output and their impact on how the trained model is used. Generally, the goal of any training is to maximize correlations with variables of interest. For example, a classifier is trained specifically to be highly correlated with the classification categories. However, there is often another set of variables for which a high correlation with the ML algorithm's output is not desirable and could make the ML algorithm useless regardless of its overall performance.

There are numerous methods that achieve the goal of minimizing correlations of ML algorithms. Choosing the correct decorrelation method depends on the situation, e.g., which ML algorithm is being used and the type of the undesirable variables. Below, we detail various methods for common scenarios focusing on BDT (boosted decision tree) and neural network algorithms.

"},{"location":"training/Decorrelation.html#impartial-training-data","title":"Impartial Training Data","text":"

Generally, the best method for making a neural network's or BDT's output independent of some known variable is to remove any bias in the training dataset, which is commonly done by adding or removing information.

"},{"location":"training/Decorrelation.html#adding-information","title":"Adding Information","text":"
  • Training on a mix of signals with different masses can help prevent the BDT from learning the mass.
"},{"location":"training/Decorrelation.html#removing-information","title":"Removing Information","text":"
  • If you have any input variables that are highly correlated with the mass, you may want to omit them. There may be a loss of raw discrimination power with this approach, but the underlying interpretation will be more sound.
"},{"location":"training/Decorrelation.html#reweighting","title":"Reweighting","text":"
  • One method to achieve decorrelation by weighting data is reweighting the network's input samples to match a reference distribution. Example input variables include the mass, or an input to the invariant mass, like the \\(p_T\\). This method is distinct from flattening the data since it is weighted to match a target distribution rather than a flat distribution. Flattening can also require very large weights that could potentially affect training. This is one way to avoid having the network sculpt, or learn, a certain kinematic quantity, like the background mass. An example of this technique is given in EXO-19-020.
    • This is what is done for the ImageTop tagger and ParticleNet group of taggers. BDT scores from EXO-10-020 where the jet \\(p_T\\) distribution is reweighted to match a reference distribution for each sample.
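
The reweighting described above can be sketched with a simple binned ratio. This is a minimal illustration in plain Python, not the procedure used in any of the cited analyses: the function names, the bin edges and the toy \\(p_T\\) values are all assumptions made for the example.

```python
from bisect import bisect_right

def histogram(values, edges):
    # Count entries per bin; values outside the edges are ignored.
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v < edges[-1]:
            counts[bisect_right(edges, v) - 1] += 1
    return counts

def reference_weights(sample, reference, edges):
    # Per-event weight = (reference fraction) / (sample fraction) in the event's bin.
    n_s = histogram(sample, edges)
    n_r = histogram(reference, edges)
    tot_s, tot_r = sum(n_s), sum(n_r)
    weights = []
    for v in sample:
        if not (edges[0] <= v < edges[-1]):
            weights.append(0.0)  # events outside the binning get zero weight
            continue
        b = bisect_right(edges, v) - 1
        weights.append((n_r[b] / tot_r) / (n_s[b] / tot_s))
    return weights

# Toy (hypothetical) jet pT values in GeV
edges = [0.0, 100.0, 200.0, 300.0]
sample_pt = [50.0, 60.0, 70.0, 150.0, 250.0]
reference_pt = [50.0, 150.0, 160.0, 250.0, 260.0]
weights = reference_weights(sample_pt, reference_pt, edges)
```

After weighting, the binned distribution of the sample matches the shape of the reference. Coarse bins keep the weights moderate, which is the advantage over flattening mentioned above.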
"},{"location":"training/Decorrelation.html#adversarial-approach","title":"Adversarial Approach","text":"

Adversarial approaches to decorrelation revolve around including a penalty, or regularization, term in the loss function in training. The loss function can be modified to enforce uniformity in the variable of interest (e.g. mass). Check out these links (1, 2, 3) for some examples of this. One way to technically implement this type of approach is using the \"flatness loss function\" (i.e. BinFlatnessLossFunction in the hep-ml package). This type of decorrelation is what is done for the DeepAK8-MD taggers.

Another type of regularization one can do to achieve decorrelation is penalizing the loss function on a certain type of correlation, for example distance. In the seminal distance correlation in HEP-ML paper (DisCo Fever: Robust Networks Through Distance Correlation), distance is defined as distance correlation (DisCo), a measure derived from distance covariance, first introduced here. The distance correlation function measures the non-linear correlation between the NN output and some variables that you care about, e.g. jet mass. Including it as a penalty term in the loss function forces the network to minimize it, which decorrelates the two variables. An extension of this can be found in the Double DisCo method, given below, which highlights the distance correlation term in the loss function at the bottom. The Double DisCo network leverages the ABCD method for background estimation, which is why it requires two separate discriminants. Below is the Double DisCo NN architecture used in MLG-23-003. Notice the two separate discriminant paths consisting of a Dense layer, a Dropout layer, and another Dense layer before outputting a single discriminant per path.

Source: CMS AN-22-101 for MLG-23-003.
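
To make the distance correlation penalty concrete, here is a minimal sketch of the sample distance correlation between two one-dimensional variables, in plain Python. It follows the standard definition based on double-centred pairwise distance matrices. The function names and the penalty strength `lam` are assumptions for illustration. A real training would use a differentiable, vectorised version inside the ML framework.

```python
import math

def _centered_distances(values):
    # Pairwise |a_i - a_j| matrix with row, column and grand means removed.
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row = [sum(r) / n for r in d]  # equal to the column means (d is symmetric)
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]

def distance_correlation(x, y):
    # Sample distance correlation: 0 for a constant input, 1 for a linear relation.
    a = _centered_distances(x)
    b = _centered_distances(y)
    n = len(x)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    dvar_x = sum(v * v for r in a for v in r) / n**2
    dvar_y = sum(v * v for r in b for v in r) / n**2
    denom = math.sqrt(dvar_x * dvar_y)
    return math.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0

def penalised_loss(classifier_loss, nn_output, mass, lam=1.0):
    # Total loss = classification loss + lam * (distance correlation)^2
    return classifier_loss + lam * distance_correlation(nn_output, mass) ** 2

x = [1.0, 2.0, 3.0, 4.0, 5.0]
linear = [2.0 * v + 1.0 for v in x]
dcor_linear = distance_correlation(x, linear)
```

Unlike the Pearson coefficient, this measure is also sensitive to non-linear dependence, which is why it is useful as a decorrelation penalty.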

Many thanks to Kevin Pedro for his input on this section and the previous one.

"},{"location":"training/Decorrelation.html#parametric-cut","title":"Parametric Cut","text":"

When designing jet taggers, variables of interest for discriminators include N-subjettiness derived quantities. Often, these quantities will be correlated with, for example, the \\(p_T\\) of the jet. One example of this type of correlation is called \"mass sculpting\" and happens when the distribution of the discriminating variable in background begins to exhibit a shape similar to that of the signal with successive cuts. This correlation can have confounding effects in the tagger and one way to remove these effects is to parametrically cut on the discriminant.

One such prescription to remove these correlations is described here and focuses on removing the \\(p_T\\) dependence in the soft-drop mass variable \\(\\rho\\). The authors note that there is a \\(p_T\\) dependence in the N-subjettiness ratio \\(\\tau_2/\\tau_1\\) as a function of the QCD jet scaling (soft-drop) variable, defined as \\(\\rho = \\log(m^2/p_T^2)\\), which leads to mass sculpting. In order to alleviate this issue, the authors introduce a modified version of the soft-drop variable, \\(\\rho' = \\rho + \\log(p_T/\\mu)\\) where \\(\\mu\\) is chosen to be 1 GeV. It can also be noted that there is a linear dependence between \\(\\tau_2/\\tau_1\\) and \\(\\rho'\\). The authors remedy this by modelling the linear dependence with \\(\\tau_{21}' = \\tau_2/\\tau_1 - M \\times \\rho'\\) where \\(M\\) is fit from the data. Applying both these transformations flattens out the relationship between the ratio and the soft-drop variable and removes the mass sculpting effects. It is imperative that the transformations between variables are smooth, as discontinuous functions may lead to artificial features in the data.
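
The two transformations above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the toy jets are invented, and the slope is obtained here with an ordinary least-squares fit over all points, whereas in practice the fit would be restricted to the region where the dependence is linear.

```python
import math

def rho_prime(mass, pt, mu=1.0):
    # rho = log(m^2 / pT^2); rho' = rho + log(pT / mu), with mu = 1 GeV by default
    rho = math.log(mass**2 / pt**2)
    return rho + math.log(pt / mu)

def fit_slope(xs, ys):
    # Ordinary least-squares slope M of ys versus xs.
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def tau21_transformed(tau21, rho_p, slope):
    # tau21' = tau2/tau1 - M * rho'
    return tau21 - slope * rho_p

# Toy (mass, pT) pairs in GeV with an exactly linear tau21 dependence on rho'
jets = [(50.0, 300.0), (80.0, 500.0), (120.0, 800.0), (30.0, 250.0)]
rho_p = [rho_prime(m, pt) for m, pt in jets]
tau21 = [0.8 - 0.05 * r for r in rho_p]

slope = fit_slope(rho_p, tau21)
flat = [tau21_transformed(t, r, slope) for t, r in zip(tau21, rho_p)]
```

In this toy case the transformed variable is constant across jets, which is the flattening effect described above.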

"},{"location":"training/Decorrelation.html#methods-for-mass-parameterized-bdts","title":"Methods for mass parameterized BDTs","text":"

Finally, when using a BDT that is parameterized by a mass quantity of interest, the output can be decorrelated from that mass by three different methods: randomization, oversampling, and variable profiling. Randomization entails randomly pairing a mass quantity to a background training event so the BDT does not learn any meaningful associations between the mass and the output. Oversampling is a bootstrapping method where every input background event is paired with a potential mass point so the effective statistics for all the mass points are the same. Finally, variable profiling has the user profile each BDT input as a function of the mass quantity of interest. Examples of each of these methods are given below in the context of a di-higgs search.

A di-higgs multilepton search (HIG-21-002) made use of a BDT for signal discrimination, parameterized by the di-higgs invariant mass. In order to avoid correlations in the BDT output and invariant mass of the di-higgs system, they looked at decorrelation via randomization, oversampling, and variable profiling. All of the methods utilized a (more or less) 50/50 dataset train/test split where one iteration of the BDT was trained on \"even\" numbered events and the datacards were produced with the \"odd\" numbered events. This procedure was repeated for the opposite configuration. Then, to determine if the BDT was correctly interpolating the signal masses, one mass point was omitted from training and the results of this BDT were compared to a BDT trained on only this single, omitted mass point. For each train/test configuration (even/odd or odd/even), the BDT's performance gain, as well as loss, were evaluated with ROC curves with two omitted mass points (done separately).

In the randomization method, a generator-level di-higgs invariant mass was randomly assigned to each background event the BDT was trained on. For the oversampling method, every signal mass point was assigned to a duplicate of each background event. Obviously the oversampling method leads to slower execution but the same effective statistics for all backgrounds and each signal mass. Conversely, the randomization approach is quicker, but leads to reduced effective statistics. Lastly, to improve performance over lower signal masses, each BDT input variable was profiled as a function of \\(m_{HH}\\). This profile was fit with a polynomial function, and then each point in the input distribution is divided by the fit function value. This corrected ratio is used as the new input to the BDT. The authors also found that splitting the BDT training into high and low mass regions helped.
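
The difference between randomization and oversampling can be sketched as follows. This is a schematic illustration in plain Python, with hypothetical function names and toy inputs; it is not taken from the HIG-21-002 analysis code.

```python
import random

def randomize(background, mass_points, rng):
    # One copy of each background event, paired with a randomly drawn mass hypothesis.
    return [(event, rng.choice(mass_points)) for event in background]

def oversample(background, mass_points):
    # One copy of each background event for every mass hypothesis.
    return [(event, m) for event in background for m in mass_points]

background = ['bkg_event_%d' % i for i in range(6)]  # placeholders for event records
mass_points = [250, 400, 700]                         # toy signal mass hypotheses (GeV)

randomized = randomize(background, mass_points, random.Random(0))
oversampled = oversample(background, mass_points)
```

Randomization keeps the training set at its original size but gives each mass point only a fraction of the background events. Oversampling multiplies the training set by the number of mass points, so every mass point sees the full background sample at the cost of slower training.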

In the end, oversampling, especially when combined with input variable corrections, provided a sizable performance gain (5.6%) over the randomization method. This gain is determined from ROC curves made for each training iteration (even/odd or odd/even) and each method. The performance loss is also a 5% improvement over the randomization method.

For more information on these methods, see HIG-21-002. Below are some example BDT output scores for the \\(2\\ell ss\\) and \\(3 \\ell\\) channels for this analysis.

Source: HIG-21-002

So far we have seen decorrelation achieved by using inputs that are decorrelated for the classifier and regularizing the output to penalize learning correlations. Another approach can be to learn decorrelation by maximizing performance metrics that more closely align with the sensitivity of the analysis, like in this paper and their corresponding Python-based package, ThickBrick. In this case, the authors study the dependence of the event selection threshold on the signal purity in a given bin of the distribution of an observable. They demonstrate that the threshold increases with signal purity, \"implying that the threshold is stronger in the x-'bins' of higher purity.\" This parametric event selection threshold \"naturally leads to decorrelation of the event selection criteria from the event variable x.\" The failure to incorporate the dependencies on observable distributions is framed as a misalignment between the ML-based selector and the sensitivity of the physics analysis. A demo of their package, ThickBrick, was given at PyHEP2020.

"},{"location":"training/MLaaS4HEP.html","title":"MLaaS4HEP","text":""},{"location":"training/MLaaS4HEP.html#machine-learning-as-a-service-for-hep","title":"Machine Learning as a Service for HEP","text":"

MLaaS for HEP is a set of Python-based modules to support reading HEP data and streaming them to the ML tool of the user's choice. It consists of three independent layers:

  • Data Streaming layer to handle remote data, see reader.py
  • Data Training layer to train ML model for given HEP data, see workflow.py
  • Data Inference layer, see tfaas_client.py

The MLaaS4HEP repository can be found here.

The general architecture of MLaaS4HEP looks like this:

Even though this architecture was originally developed for dealing with HEP ROOT files, we extend it to other data formats. As of right now, the following data formats are supported: JSON, CSV, Parquet, and ROOT. All of the formats support reading files from the local file system or HDFS, while the ROOT format supports reading files via the XRootD protocol.

The pre-trained models can be easily uploaded to TFaaS inference server for serving them to clients. The TFaaS documentation can be found here.

"},{"location":"training/MLaaS4HEP.html#dependencies","title":"Dependencies","text":"

Here is a list of the dependencies:

  • pyarrow for reading data from HDFS file system
  • uproot for reading ROOT files
  • numpy, pandas for data representation
  • modin for fast pandas support
  • numba for speeding up individual functions

"},{"location":"training/MLaaS4HEP.html#installation","title":"Installation","text":"

The easiest way to install and run MLaaS4HEP and TFaaS is to use pre-built docker images

# run MLaaS4HEP docker container\ndocker run veknet/mlaas4hep\n# run TFaaS docker container\ndocker run veknet/tfaas\n

"},{"location":"training/MLaaS4HEP.html#reading-root-files","title":"Reading ROOT files","text":"

MLaaS4HEP python repository provides the reader.py module that defines a DataReader class able to read either local or remote ROOT files (via xrootd) in chunks. It is based on the uproot framework.

Basic usage

# setup the proper environment, e.g.\n# export PYTHONPATH=/path/src/python # path to MLaaS4HEP python framework\n# export PATH=/path/bin:$PATH # path to MLaaS4HEP binaries\n\n# get help and option description\nreader --help\n\n# here is a concrete example of reading local ROOT file:\nreader --fin=/opt/cms/data/Tau_Run2017F-31Mar2018-v1_NANOAOD.root --info --verbose=1 --nevts=2000\n\n# here is an example of reading remote ROOT file:\nreader --fin=root://cms-xrd-global.cern.ch//store/data/Run2017F/Tau/NANOAOD/31Mar2018-v1/20000/6C6F7EAE-7880-E811-82C1-008CFA165F28.root --verbose=1 --nevts=2000 --info\n\n# both of aforementioned commands produce the following output\nReading root://cms-xrd-global.cern.ch//store/data/Run2017F/Tau/NANOAOD/31Mar2018-v1/20000/6C6F7EAE-7880-E811-82C1-008CFA165F28.root\n# 1000 entries, 883 branches, 4.113945007324219 MB, 0.6002757549285889 sec, 6.853425235896175 MB/sec, 1.6659010326328503 kHz\n# 1000 entries, 883 branches, 4.067909240722656 MB, 1.3497390747070312 sec, 3.0138486148558896 MB/sec, 0.740883937302516 kHz\n###total time elapsed for reading + specs computing: 2.2570559978485107 sec; number of chunks 2\n###total time elapsed for reading: 1.9500117301940918 sec; number of chunks 2\n\n--- first pass: 1131872 events, (648-flat, 232-jagged) branches, 2463 attrs\nVMEM used: 29.896704 (MB) SWAP used: 0.0 (MB)\n<__main__.RootDataReader object at 0x7fb0cdfe4a00> init is complete in 2.265552043914795 sec\nNumber of events  : 1131872\n# flat branches   : 648\nCaloMET_phi values in [-3.140625, 3.13671875] range, dim=N/A\nCaloMET_pt values in [0.783203125, 257.75] range, dim=N/A\nCaloMET_sumEt values in [820.0, 3790.0] range, dim=N/A\n

More examples about using uproot may be found here and here.

"},{"location":"training/MLaaS4HEP.html#how-to-train-ml-models-on-hep-root-data","title":"How to train ML models on HEP ROOT data","text":"

The MLaaS4HEP framework allows ML models to be trained in different ways:

  • using the full dataset (i.e. the entire amount of events stored in input ROOT files)
  • using chunks, as subsets of a dataset, whose dimension can be chosen directly by the user and can vary between 1 and the total number of events
  • using local or remote ROOT files.

The training phase is managed by the workflow.py module which performs the following actions:

  • read all input ROOT files in chunks to compute a specs file (where the main information about the ROOT files is stored: the dimension of branches, the minimum and the maximum for each branch, and the number of events for each ROOT file)
  • perform the training cycle (each time using a new chunk of events)
    • create a new chunk of events taken proportionally from the input ROOT files
    • extract and convert each event in a list of NumPy arrays
    • normalize the events
    • fix the Jagged Arrays dimension
    • create the masking vector
    • use the chunk to train the ML model provided by the user

A schematic representation of the steps performed in the MLaaS4HEP pipeline, in particular those inside the Data Streaming and Data Training layers, is:

If the dataset is large and exceeds the amount of RAM on the training node, the user should consider the chunk approach. This allows the ML model to be trained on a different chunk each time, until the entire dataset has been read. In this case the user should pay close attention to the ML model convergence, and validate it after each chunk. For more information look at this, this and this. Each training approach has pros and cons. For instance, training on the entire dataset can guarantee the ML model convergence, but the dataset must fit into the RAM of the training node. The chunk approach allows the dataset to be split to fit the hardware resources, but it requires proper model evaluation after each chunk training. In terms of training speed, the chunk approach should be faster than training on the entire dataset, since once a chunk has been used for training it is no longer read or used again (this effect is prominent when remote ROOT files are used). Finally, the user should be aware of the potential divergence of the ML model when training on the last chunk of the dataset, and check for bias towards the last chunk. For instance, the user may implement a K-fold cross-validation approach to train on N-1 chunks (i.e. folds in this case) and use one chunk for validation.

A detailed description of how to use the workflow.py module for training an ML model reading ROOT files from the opendata portal can be found here. Note that the user has to provide several pieces of information when running the workflow.py module, e.g. the definition of the ML model; it is then the task of the MLaaS4HEP framework to perform the whole training procedure using the ML model provided by the user.
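
The chunk bookkeeping and the K-fold idea mentioned above can be sketched in plain Python. The helper names below are hypothetical and are not part of the MLaaS4HEP API; the sketch only shows how event ranges and train/validation splits over chunks could be organised.

```python
def chunk_ranges(n_events, chunk_size):
    # Split [0, n_events) into consecutive (start, stop) ranges of at most chunk_size events.
    return [(start, min(start + chunk_size, n_events))
            for start in range(0, n_events, chunk_size)]

def chunk_folds(n_chunks):
    # For each fold, train on N-1 chunks and keep one chunk for validation.
    for held_out in range(n_chunks):
        train = [i for i in range(n_chunks) if i != held_out]
        yield train, held_out

ranges = chunk_ranges(10, 4)
folds = list(chunk_folds(len(ranges)))
```

Rotating the validation chunk in this way gives a handle on whether the model is biased towards the chunk it saw last.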

For a complete description of MLaaS4HEP see this paper.

"},{"location":"training/autoencoders.html","title":"Autoencoders","text":""},{"location":"training/autoencoders.html#introduction","title":"Introduction","text":"

Autoencoders are a powerful tool that has gained popularity in HEP and beyond recently. These types of algorithms are neural networks that learn to compress data and then decompress it with minimal reconstruction error (Goodfellow, et al.).

The idea of using neural networks for dimensionality reduction or feature learning dates back to the early 1990s. Autoencoders, or \"autoassociative neural networks,\" were originally proposed as a nonlinear generalization of principal component analysis (PCA) (Kramer). More recently, connections between autoencoders and latent variable models have brought these types of algorithms into the generative modeling space.

The two main parts of an autoencoder algorithm are the encoder function \\(f(x)\\) and the decoder function \\(g(x)\\). The learning process of an autoencoder is a minimization of a loss function, \\(L(x,g(f(x)))\\), that compares the original data to the output of the decoder, similar to that of a neural network. As such, these algorithms can be trained using the same techniques, like minibatch gradient descent with backpropagation. Below is a representation of an autoencoder from Mathworks.

"},{"location":"training/autoencoders.html#constrained-autoencoders-undercomplete-and-regularized","title":"Constrained Autoencoders (Undercomplete and Regularized)","text":"

Information in this section can be found in Goodfellow, et al.

An autoencoder that is able to perfectly reconstruct the original data one-to-one, such that \\(g(f(x)) = x\\), is not very useful for extracting salient information from the data. There are several methods imposed on simple autoencoders to encourage them to extract useful aspects of the data.

One way of avoiding perfect data reconstruction is by constraining the dimension of the encoding function \\(f(x)\\) to be less than the data \\(x\\). These types of autoencoders are called undercomplete autoencoders, which force the imperfect copying of the data such that the encoding and decoding networks can prioritize the most useful aspects of the data.

However, if undercomplete encoders are given too much capacity, they will struggle to learn anything of importance from the data. Similarly, this problem occurs in autoencoders with encoder dimensionality greater than or equal to the data (the overcomplete case). In order to train any architecture of AE successfully, constraints based on the complexity of the target distribution must be imposed, apart from small dimensionality. These regularized autoencoders can have constraints on sparsity, robustness to noise, and robustness to changes in data (the derivative).

"},{"location":"training/autoencoders.html#sparse-autoencoders","title":"Sparse Autoencoders","text":"

Sparse autoencoders place a penalty to enforce sparsity in the encoding layer \\(\\mathbf{h} = f(\\mathbf{x})\\), such that the function to minimize becomes \\(L(\\mathbf{x}, g(f(\\mathbf{x}))) + \\Omega(\\mathbf{h})\\). This penalty prevents the autoencoder from learning the identity transformation and encourages it to extract useful features of the data for later tasks, such as classification. While the penalty term can be thought of as a regularizing term for a feedforward network, we can expand this view to think of the entire sparse autoencoder framework as approximating the maximum likelihood estimation of a generative model with latent variables \\(\\mathbf{h}\\). When approximating the maximum likelihood, the joint distribution \\(p_{\\text{model}}(\\mathbf{x}, \\mathbf{h})\\) can be factorized as

\\[ \\text{log} [ p_{\\text{model}}(\\mathbf{x}, \\mathbf{h})] = \\text{log} [p_{\\text{model}}(\\mathbf{h})] + \\text{log} [p_{\\text{model}}(\\mathbf{x} | \\mathbf{h})] \\]

where \\(p_{\\text{model}}(\\mathbf{h})\\) is the prior distribution over the latent variables, instead of the model's parameters. Here, we approximate the sum over all possible values of \\(\\mathbf{h}\\) by a point estimate at one highly likely value of \\(\\mathbf{h}\\). This prior term is what introduces the sparsity requirement, for example with the Laplace prior, $$ p_{\\text{model}}(h_i) = \\frac{\\lambda}{2}e^{-\\lambda|h_i|}. $$

The negative log-prior is then

$$ -\\text{log} [p_{\\text{model}}(\\mathbf{h})] = \\sum_i (\\lambda|h_i| - \\text{log}\\frac{\\lambda}{2}) = \\Omega(\\mathbf{h}) + \\text{const}. $$ This example demonstrates how the model's distribution over latent variables (prior) gives rise to a sparsity penalty.

"},{"location":"training/autoencoders.html#penalized-autoencoders","title":"Penalized Autoencoders","text":"

Similar to sparse autoencoders, a traditional penalty term can be introduced to the cost function to regularize the autoencoder, such that the function to minimize becomes $$ L(\\mathbf{x},g(f(\\mathbf{x}))) + \\Omega(\\mathbf{h},\\mathbf{x}), $$ where $$ \\Omega(\\mathbf{h},\\mathbf{x}) = \\lambda\\sum_i ||\\nabla_{\\mathbf{x}}h_i||^2. $$ Because the penalty depends on the gradient of the latent variables with respect to the input variables, the model is penalized when a slight change in \\(\\mathbf{x}\\) produces a large change in \\(\\mathbf{h}\\), which encourages it to be insensitive to those slight variations. This type of regularization leads to a contractive autoencoder (CAE).

"},{"location":"training/autoencoders.html#denoising-autoencoders","title":"Denoising Autoencoders","text":"

Another way to encourage autoencoders to learn useful features of the data is to train the algorithm to minimize a cost function that compares the original data (\\(\\mathbf{x}\\)) to the encoded and decoded version of data that has been injected with noise (\\(g(f(\\mathbf{\\tilde{x}}))\\)), $$ L(\\mathbf{x},g(f(\\mathbf{\\tilde{x}}))). $$ Denoising autoencoders then must learn to undo the effect of the noise in the encoded/decoded data. The autoencoder is able to learn the structure of the probability density function of the data (\\(p_{\\text{data}}\\)) as a function of the input variables (\\(x\\)) through this process (Alain, Bengio; Bengio, et. al.). With this type of cost function, even overcomplete, high-capacity autoencoders can avoid learning the identity transformation.

"},{"location":"training/autoencoders.html#variational-autoencoders","title":"Variational Autoencoders","text":"

Variational autoencoders (VAEs), introduced by Kingma and Welling, are similar to normal AEs. They consist of neural networks that map the input to a latent space (encoder) and back (decoder), where the latent space is a low-dimensional, variational distribution. VAEs are bidirectional, generating data or estimating distributions, and were initially designed for unsupervised learning but can also be very useful in semi-supervised and fully supervised scenarios (Goodfellow, et. al.).

VAEs are trained by maximizing the variational lower bound associated with data point \\(\\mathbf{x}\\), which is a function of the approximate posterior (inference network, or encoder), \\(q(\\mathbf{z} | \\mathbf{x})\\). Latent variable \\(\\mathbf{z}\\) is drawn from this encoder distribution, with \\(p_\\text{model}(\\mathbf{x} | \\mathbf{z})\\) viewed as the decoder network. The variational lower bound (also called the evidence lower bound or ELBO) is a trade-off between the expected log-likelihood of the visible variables given the latent variables, and the KL divergence between the approximate posterior and the model prior, shown below (Goodfellow, et. al.).

$$ \\mathcal{L}(q) = E_{\\mathbf{z} \\sim q(\\mathbf{z} | \\mathbf{x})} \\text{log}p_\\text{model}(\\mathbf{x} | \\mathbf{z}) - D_\\text{KL}(q || p). $$

Methods for optimizing the VAE by learning the variational lower bound include EM meta-algorithms like probabilistic PCA (Goodfellow, et. al.).

"},{"location":"training/autoencoders.html#applications-in-hep","title":"Applications in HEP","text":"

One of the more popular applications of AEs in HEP is anomaly detection. Because autoencoders are trained to learn latent features of a dataset, any new data that does not match those features could be classified as an anomaly and picked out by the AE. Examples of AEs for anomaly detection in HEP are listed below:

  • Anomaly detection in high-energy physics using a quantum autoencoder
  • Particle Graph Autoencoders and Differentiable, Learned Energy Mover's Distance
  • Bump Hunting in Latent Space

Another application of (V)AEs in HEP is data generation, as once the likelihood of the latent variables is approximated it can be used to generate new data. Examples of this application in HEP for simulation of various physics processes are listed below:

  • Deep generative models for fast shower simulation in ATLAS
  • Sparse Data Generation for Particle-Based Simulation of Hadronic Jets in the LHC
  • Variational Autoencoders for Jet Simulation

Finally, the latent space learned by (V)AEs gives a parsimonious and information-rich phase space from which one can make inferences. Examples of using (V)AEs to learn approximate and/or compressed representations of data are given below:

  • An Exploration of Learnt Representations of W Jets
  • Machine-Learning Compression for Particle Physics Discoveries
  • Decoding Photons: Physics in the Latent Space of a BIB-AE Generative Network

More examples of (V)AEs in HEP can be found at the HEP ML Living Review.

"},{"location":"training/autoencoders.html#references","title":"References","text":"
  • Goodfellow, et. al., 2016, Deep Learning
  • Alain, Bengio, 2013, \"What Regularized Auto-Encoders Learn from the Data Generating Distribution\"
  • Bengio, et. al., 2013, \"Generalized Denoising Auto-Encoders as Generative Models\"
  • Kramer, 1991, \"Nonlinear principal component analysis using autoassociative neural networks\"
  • Kingma, Welling, 2013, \"Auto-Encoding Variational Bayes\"
"}]} \ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml index ba7c481..84950aa 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -210,6 +210,11 @@ 2023-12-05 daily + + https://cms-ml.github.io/documentation/training/Decorrelation.html + 2023-12-05 + daily + https://cms-ml.github.io/documentation/training/MLaaS4HEP.html 2023-12-05 diff --git a/software_envs/containers.html b/software_envs/containers.html index a73f05f..52dea51 100644 --- a/software_envs/containers.html +++ b/software_envs/containers.html @@ -1,4 +1,4 @@ - Using containers - CMS Machine Learning Documentation

Using containers

Containers are a great solution to isolate a software environment, especially in batch systems like lxplus. At the moment two container solutions are supported: Apptainer (previously called Singularity) and Docker.

Using Singularity

The unpacked.cern.ch service, mounted on CVMFS, contains many Singularity images, some of which are suitable for machine learning applications. A description of each of the images is beyond the scope of this document. However, if you find an image which is useful for your application, you can use it by running a Singularity container with the appropriate options. For example:

singularity run --nv --bind <bind_mount_path> /cvmfs/unpacked.cern.ch/<path_to_image>
+ Using containers - CMS Machine Learning Documentation       

Using containers

Containers are a great solution to isolate a software environment, especially in batch systems like lxplus. At the moment two container solutions are supported: Apptainer (previously called Singularity) and Docker.

Using Singularity

The unpacked.cern.ch service, mounted on CVMFS, contains many Singularity images, some of which are suitable for machine learning applications. A description of each of the images is beyond the scope of this document. However, if you find an image which is useful for your application, you can use it by running a Singularity container with the appropriate options. For example:

singularity run --nv --bind <bind_mount_path> /cvmfs/unpacked.cern.ch/<path_to_image>
 

Examples

After installing the package, you can use GPU-based machine learning algorithms. Two examples are supplied.

  1. The first example uses a CNN to classify handwritten digits from the MNIST dataset. The whole notebook can be found at pytorch_mnist. This example is modified from an official PyTorch example.

  2. The second example is modified from the simple MLP example from weaver-benchmark. The whole notebook can be found at toptagging_mlp.

Using Docker

Docker is not supported at the moment on the interactive nodes of lxplus (like lxplus-gpu). However, Docker is supported on HTCondor for job submission.

This option can be very handy for users, as HTCondor can pull images from any public registry, like DockerHub or the GitLab registry. The user can follow this workflow:

  1. Define a custom image on top of a commonly available pytorch or tensorflow image
  2. Add the desired packages and configuration
  3. Push the Docker image to a registry
  4. Use the image in an HTCondor job

The rest of the page is a step by step tutorial for this workflow.

Define the image

  1. Define a file Dockerfile

    FROM pytorch/pytorch:latest
     
     ADD localfolder_with_code /opt/mycode
    diff --git a/software_envs/lcg_environments.html b/software_envs/lcg_environments.html
    index e584a8f..6e65d2e 100644
    --- a/software_envs/lcg_environments.html
    +++ b/software_envs/lcg_environments.html
    @@ -1,4 +1,4 @@
    - LCG environments - CMS Machine Learning Documentation       

    LCG environments

    Software Environment

    The software environment for ML application trainings can be set up in different ways. On this page we focus on the CERN lxplus environment.

    LCG release software

    After choosing a suitable software bundle with CUDA support at http://lcginfo.cern.ch/, one can set up an LCG environment by executing

    source /cvmfs/sft.cern.ch/lcg/views/<name of bundle>/**x86_64-centos*-gcc11-opt**/setup.sh
    + LCG environments - CMS Machine Learning Documentation       

    LCG environments

    Software Environment

    The software environment for ML application trainings can be set up in different ways. On this page we focus on the CERN lxplus environment.

    LCG release software

    After choosing a suitable software bundle with CUDA support at http://lcginfo.cern.ch/, one can set up an LCG environment by executing

    source /cvmfs/sft.cern.ch/lcg/views/<name of bundle>/**x86_64-centos*-gcc11-opt**/setup.sh
     

    On lxplus-gpu nodes, usually equipped with AlmaLinux 9.1 (labelled centos9 in the LCG platform names), one should use the matching LCG release. At the time of writing (May 2023) the recommended environment to use GPUs is:

    source /cvmfs/sft.cern.ch/lcg/views/LCG_103cuda/x86_64-centos9-gcc11-opt/setup.sh
     

    Customized environments

    One can create a custom Python environment using the virtualenv or venv tools, in order to avoid interfering with the global Python environment.

    The user has the choice of building a virtual environment from scratch or on top of an LCG release.

    Virtual environment from scratch

    The first approach is cleaner but requires downloading the full set of libraries needed for PyTorch or TensorFlow (very heavy). Moreover, compatibility with the computing environment (usually lxplus-gpu) is not guaranteed.

    1. Create the environment in a folder of choice, usually called myenv

      python3 -m venv --system-site-packages myenv
       source myenv/bin/activate   # activate the environment
      diff --git a/training/Decorrelation.html b/training/Decorrelation.html
      new file mode 100644
      index 0000000..8ce3b00
      --- /dev/null
      +++ b/training/Decorrelation.html
      @@ -0,0 +1 @@
      + Decorrelation - CMS Machine Learning Documentation       

      Decorrelation

      When preparing to train a machine learning algorithm, it is important to think about the correlations of the output and their impact on how the trained model is used. Generally, the goal of any training is to maximize correlations with variables of interest. For example, a classifier is trained specifically to be highly correlated with the classification categories. However, there is often another set of variables for which high correlation with the ML algorithm's output is undesirable and could make the ML algorithm useless regardless of its overall performance.

      There are numerous methods that achieve the goal of minimizing correlations of ML algorithms. Choosing the correct decorrelation method depends on the situation, e.g., which ML algorithm is being used and the type of the undesirable variables. Below, we detail various methods for common scenarios focusing on BDT (boosted decision tree) and neural network algorithms.

      Impartial Training Data

      Generally, the best method for making a neural network's or BDT's output independent of some known variable is to remove any bias in the training dataset, which is commonly done by adding or removing information.

      Adding Information

      • Training on a mix of signals with different masses can help prevent the BDT from learning the mass.

      Removing Information

      • If you have any input variables that are highly correlated with the mass, you may want to omit them. There may be a loss of raw discrimination power with this approach, but the underlying interpretation will be more sound.

      Reweighting

      • One method to achieve decorrelation by weighting data is reweighting the network's input samples to match a reference distribution. Example input variables include mass, or an input to invariant mass, like the \(p_T\). This method is distinct from flattening the data since it is weighted to match a target distribution rather than a flat distribution. Flattening can also require very large weights that could potentially affect training. This is one way to avoid having the network sculpt, or learn, a certain kinematic quantity, like the background mass. An example of this technique is given in EXO-19-020.
        • This is what is done for the ImageTop tagger and ParticleNet group of taggers.

      reweighted_BDT_scores BDT scores from EXO-19-020, where the jet \(p_T\) distribution is reweighted to match a reference distribution for each sample.
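The reweighting described above can be sketched with NumPy histograms. This is only an illustration under stated assumptions: the function name `reweight_to_reference`, the toy \(p_T\) spectra, and the binning are hypothetical and are not taken from EXO-19-020 or any tagger.

```python
import numpy as np

def reweight_to_reference(values, reference, bins):
    """Per-event weights that make the histogram of `values` match `reference`."""
    h_val, edges = np.histogram(values, bins=bins, density=True)
    h_ref, _ = np.histogram(reference, bins=edges, density=True)
    # Ratio of reference to sample density; empty sample bins get weight 0.
    ratio = np.divide(h_ref, h_val, out=np.zeros_like(h_ref), where=h_val > 0)
    # Events outside the binned range are assigned to the nearest edge bin.
    idx = np.clip(np.digitize(values, edges) - 1, 0, len(ratio) - 1)
    return ratio[idx]

rng = np.random.default_rng(0)
pt_sample = rng.exponential(200.0, size=50_000) + 200.0     # toy jet pT in GeV
pt_reference = rng.exponential(300.0, size=50_000) + 200.0  # toy reference pT
bins = np.linspace(200.0, 1200.0, 21)

weights = reweight_to_reference(pt_sample, pt_reference, bins)
print(f"min weight: {weights.min():.3f}, max weight: {weights.max():.3f}")
```

Unlike flattening, the weights here are bounded by the ratio of two similar spectra, so they tend to stay moderate as long as the reference is not too different from the sample.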

      Adversarial Approach

      Adversarial approaches to decorrelation revolve around including a penalty, or regularization, term in the loss function in training. The loss function can be modified to enforce uniformity in the variable of interest (e.g. mass). Check out these links (1, 2, 3) for some examples of this. One way to technically implement this type of approach is using the "flatness loss function" (i.e. BinFlatnessLossFunction in the hep-ml package). This type of decorrelation is what is done for the DeepAK8-MD taggers.

      Another type of regularization one can use to achieve decorrelation is penalizing the loss function on a certain measure of correlation. In the seminal distance correlation paper in HEP-ML (DisCo Fever: Robust Networks Through Distance Correlation), this measure is the distance correlation (DisCo), derived from the distance covariance, first introduced here. The distance correlation captures non-linear correlations between the NN output and some variable that you care about, e.g. jet mass. Including it as a penalty term in the loss function forces the network to minimize it, which decorrelates the two variables. An extension of this can be found in the Double DisCo method, shown below, which highlights the distance correlation term in the loss function at the bottom. The Double DisCo network leverages the ABCD method for background estimation, which is why it requires two separate discriminants. Below is the Double DisCo NN architecture used in MLG-23-003. Notice the two separate discriminant paths, each consisting of a Dense layer, a Dropout layer, and another Dense layer before outputting a single discriminant per path.

      disco_method Source: CMS AN-22-101 for MLG-23-003.
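To make the distance correlation concrete, here is a minimal NumPy sketch of the standard sample estimator for two one-dimensional variables. It is not the implementation used in the DisCo papers or in MLG-23-003, and the function name is an assumption; a training loss would need a differentiable version in the framework of choice.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation of two 1D arrays (near 0 for independent inputs)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise absolute distances within each variable.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double centering: subtract row and column means, add back the grand mean.
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2_xy = (A * B).mean()
    dcov2_xx = (A * A).mean()
    dcov2_yy = (B * B).mean()
    denom = np.sqrt(dcov2_xx * dcov2_yy)
    if denom == 0.0:
        return 0.0  # at least one input is constant
    return float(np.sqrt(dcov2_xy / denom))

# A purely non-linear dependence: Pearson correlation vanishes, DisCo does not.
x = np.linspace(-1.0, 1.0, 401)
y = x**2
pearson = np.corrcoef(x, y)[0, 1]
disco = distance_correlation(x, y)
print(f"Pearson: {pearson:.3f}, distance correlation: {disco:.3f}")
```

The example illustrates why a linear correlation coefficient is not sufficient as a penalty: the quadratic dependence is invisible to it, while the distance correlation remains clearly non-zero.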

      Many thanks to Kevin Pedro for his input on this section and the previous one.

      Parametric Cut

      When designing jet taggers, variables of interest for discriminators include N-subjettiness-derived quantities. Often, these quantities will be correlated with, for example, the \(p_T\) of the jet. One example of this type of correlation is called "mass sculpting" and happens when the distribution of the discriminating variable in background begins to exhibit a shape similar to that of the signal with successive cuts. This correlation can have confounding effects in the tagger and one way to remove these effects is to parametrically cut on the discriminant.

      One such prescription to remove these correlations is described here and focuses on removing the \(p_T\) dependence in the soft-drop mass variable \(\rho\). The authors note that there is a \(p_T\) dependence in the N-subjettiness ratio \(\tau_2/\tau_1\) as a function of the QCD jet scaling (soft-drop) variable, defined as \(\rho = \log(m^2/p_T^2)\), which leads to mass sculpting. In order to alleviate this issue, the authors introduce a modified version of the soft-drop variable, \(\rho' = \rho + \log(p_T/\mu)\), where \(\mu\) is chosen to be 1 GeV. There remains a linear dependence between \(\tau_2/\tau_1\) and \(\rho'\), which the authors remove by defining \(\tau_{21}' = \tau_2/\tau_1 - M \times \rho'\), where \(M\) is the slope fit from the data. Applying both of these transformations flattens out the relationship between the ratio and the soft-drop variable and removes the mass sculpting effects. It is imperative that the transformations between variables are smooth, as discontinuous functions may lead to artificial features in the data.
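The two transformations can be written out in a few lines. The sketch below assumes toy jets whose \(\tau_2/\tau_1\) depends linearly on \(\rho'\) by construction; the function name `ddt_transform`, the toy slope, and the use of a simple least-squares line fit for \(M\) are illustrative assumptions, not the procedure of the cited paper.

```python
import numpy as np

def ddt_transform(mass, pt, tau21, mu=1.0):
    """Return rho', the decorrelated tau21', and the fitted slope M."""
    rho = np.log(mass**2 / pt**2)          # rho = log(m^2 / pT^2)
    rho_prime = rho + np.log(pt / mu)      # rho' = rho + log(pT / mu), mu in GeV
    slope, _ = np.polyfit(rho_prime, tau21, deg=1)  # linear fit of tau21 vs rho'
    tau21_prime = tau21 - slope * rho_prime
    return rho_prime, tau21_prime, slope

rng = np.random.default_rng(0)
pt = rng.uniform(300.0, 1000.0, size=20_000)   # toy jet pT in GeV
mass = rng.uniform(20.0, 200.0, size=20_000)   # toy jet mass in GeV
# Toy tau21 with a built-in linear dependence on rho' (slope -0.06) plus noise.
tau21 = 0.7 - 0.06 * np.log(mass**2 / pt) + rng.normal(0.0, 0.05, size=20_000)

rho_prime, tau21_prime, slope = ddt_transform(mass, pt, tau21)
residual_slope = np.polyfit(rho_prime, tau21_prime, deg=1)[0]
print(f"fitted M: {slope:.4f}, residual slope after transform: {residual_slope:.2e}")
```

In a real analysis the slope would be extracted from the background in a chosen \(\rho'\) range rather than from all events at once.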

      Methods for mass parameterized BDTs

      Finally, when using a BDT that is parameterized by a mass quantity of interest, the output can be decorrelated from that mass by three different methods: randomization, oversampling, and variable profiling. Randomization entails randomly pairing a mass quantity with each background training event so the BDT does not learn any meaningful associations between the mass and the output. Oversampling is a bootstrapping method where every input background event is paired with every potential mass point so the effective statistics for all the mass points are the same. Finally, variable profiling has the user profile each BDT input as a function of the mass quantity of interest. Examples of each of these methods are given below in the context of a di-Higgs search.

      A di-Higgs multilepton search (HIG-21-002) made use of a BDT for signal discrimination, parameterized by the di-Higgs invariant mass. In order to avoid correlations between the BDT output and the invariant mass of the di-Higgs system, they looked at decorrelation via randomization, oversampling, and variable profiling. All of the methods utilized a (more or less) 50/50 dataset train/test split where one iteration of the BDT was trained on "even" numbered events and the datacards were produced with the "odd" numbered events. This procedure was repeated for the opposite configuration. Then, to determine if the BDT was correctly interpolating the signal masses, one mass point was omitted from training and the results of this BDT were compared to a BDT trained on only this single, omitted mass point. For each train/test configuration (even/odd or odd/even), the BDT's performance gain, as well as loss, were evaluated with ROC curves with two omitted mass points (done separately).

      In the randomization method, a generator-level di-Higgs invariant mass was randomly assigned to each background event the BDT was trained on. For the oversampling method, every signal mass point was assigned to a duplicate of each background event. The oversampling method leads to slower execution but gives the same effective statistics for all backgrounds and each signal mass. Conversely, the randomization approach is quicker, but leads to reduced effective statistics. Lastly, to improve performance over lower signal masses, each BDT input variable was profiled as a function of \(m_{HH}\). This profile was fit with a polynomial function, and then each point in the input distribution was divided by the fit function value. This corrected ratio was used as the new input to the BDT. The authors also found that splitting the BDT training into high and low mass regions helped.
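The three ways of attaching a mass to background events can be sketched as array operations. This is a schematic illustration only: the function names, the toy mass grid, and the direct polynomial fit to unbinned points (instead of a fit to a binned profile) are assumptions and do not reproduce the HIG-21-002 code.

```python
import numpy as np

def randomize_mass(background, mass_points, rng):
    """Randomization: attach one randomly drawn mass point to each event."""
    masses = rng.choice(mass_points, size=len(background))
    return np.column_stack([background, masses])

def oversample_mass(background, mass_points):
    """Oversampling: duplicate every event once per mass point."""
    repeated = np.repeat(background, len(mass_points), axis=0)
    masses = np.tile(mass_points, len(background))
    return np.column_stack([repeated, masses])

def profile_correct(values, masses, degree=2):
    """Variable profiling: divide an input by a polynomial fit of it vs. mass."""
    coeffs = np.polyfit(masses, values, deg=degree)
    return values / np.polyval(coeffs, masses)

rng = np.random.default_rng(0)
mass_points = np.array([250.0, 400.0, 550.0, 700.0, 850.0, 1000.0])  # toy grid, GeV
background = rng.normal(size=(1000, 3))                               # toy BDT inputs

randomized = randomize_mass(background, mass_points, rng)   # 1000 rows
oversampled = oversample_mass(background, mass_points)       # 6000 rows

# Toy input whose scale grows linearly with the mass parameter.
m = rng.choice(mass_points, size=5000)
feature = (0.5 + 0.002 * m) * rng.normal(1.0, 0.1, size=5000)
corrected = profile_correct(feature, m, degree=1)
print(randomized.shape, oversampled.shape)
```

The shapes make the trade-off visible: oversampling multiplies the training set size by the number of mass points, while randomization keeps it fixed at the cost of fewer events per mass point.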

      In the end, oversampling, especially when combined with input variable corrections, provided a sizable performance gain (5.6%) over the randomization method. This gain is determined from ROC curves made for each training iteration (even/odd or odd/even) and each method. The performance loss is also a 5% improvement over the randomization method.

      For more information on these methods, see HIG-21-002. Below are some example BDT output scores for the \(2\ell ss\) and \(3 \ell\) channels for this analysis.

      mass_param_BDT Source: HIG-21-002

      So far we have seen decorrelation achieved by using inputs that are decorrelated for the classifier and regularizing the output to penalize learning correlations. Another approach can be to learn decorrelation by maximizing performance metrics that more closely align with the sensitivity of the analysis, like in this paper and their corresponding Python-based package, ThickBrick. In this case, the authors study the dependence of the event selection threshold on the signal purity in a given bin of the distribution of an observable. They demonstrate that the threshold increases with signal purity, "implying that the threshold is stronger in the x-'bins' of higher purity." This parametric event selection threshold "naturally leads to decorrelation of the event selection criteria from the event variable x." The failure to incorporate the dependencies on observable distributions is framed as a misalignment between the ML-based selector and the sensitivity of the physics analysis. A demo of their package, ThickBrick, was given at PyHEP2020.


      Last update: December 5, 2023
      \ No newline at end of file diff --git a/training/MLaaS4HEP.html b/training/MLaaS4HEP.html index 97a8b8f..bf75a9e 100644 --- a/training/MLaaS4HEP.html +++ b/training/MLaaS4HEP.html @@ -1,4 +1,4 @@ - MLaaS4HEP - CMS Machine Learning Documentation

      MLaaS4HEP

      Machine Learning as a Service for HEP

      MLaaS for HEP is a set of Python-based modules to support reading HEP data and streaming it to the ML tool of the user's choice. It consists of three independent layers:

      • Data Streaming layer to handle remote data, see reader.py
      • Data Training layer to train an ML model for given HEP data, see workflow.py
      • Data Inference layer, see tfaas_client.py

      The MLaaS4HEP repository can be found here.

      The general architecture of MLaaS4HEP looks like this: MLaaS4HEP-architecture

      Even though this architecture was originally developed for dealing with HEP ROOT files, it has been extended to other data formats. As of right now, the following data formats are supported: JSON, CSV, Parquet, and ROOT. All of the formats support reading files from the local file system or HDFS, while the ROOT format also supports reading files via the XRootD protocol.

      The pre-trained models can be easily uploaded to TFaaS inference server for serving them to clients. The TFaaS documentation can be found here.

      Dependencies

      Here is a list of the dependencies:

      • pyarrow for reading data from the HDFS file system
      • uproot for reading ROOT files
      • numpy, pandas for data representation
      • modin for fast pandas support
      • numba for speeding up individual functions

      Installation

      The easiest way to install and run MLaaS4HEP and TFaaS is to use pre-built Docker images

      # run MLaaS4HEP docker container
      + MLaaS4HEP - CMS Machine Learning Documentation       

      MLaaS4HEP

      Machine Learning as a Service for HEP

      MLaaS for HEP is a set of Python-based modules to support reading HEP data and streaming it to the ML tool of the user's choice. It consists of three independent layers:

      • Data Streaming layer to handle remote data, see reader.py
      • Data Training layer to train an ML model for given HEP data, see workflow.py
      • Data Inference layer, see tfaas_client.py

      The MLaaS4HEP repository can be found here.

      The general architecture of MLaaS4HEP looks like this: MLaaS4HEP-architecture

      Even though this architecture was originally developed for dealing with HEP ROOT files, it has been extended to other data formats. As of right now, the following data formats are supported: JSON, CSV, Parquet, and ROOT. All of the formats support reading files from the local file system or HDFS, while the ROOT format also supports reading files via the XRootD protocol.

      The pre-trained models can be easily uploaded to TFaaS inference server for serving them to clients. The TFaaS documentation can be found here.

      Dependencies

      Here is a list of the dependencies:

      • pyarrow for reading data from the HDFS file system
      • uproot for reading ROOT files
      • numpy, pandas for data representation
      • modin for fast pandas support
      • numba for speeding up individual functions

      Installation

      The easiest way to install and run MLaaS4HEP and TFaaS is to use pre-built Docker images

      # run MLaaS4HEP docker container
       docker run veknet/mlaas4hep
       # run TFaaS docker container
       docker run veknet/tfaas
      diff --git a/training/autoencoders.html b/training/autoencoders.html
      index d1f9290..63c4652 100644
      --- a/training/autoencoders.html
      +++ b/training/autoencoders.html
      @@ -1,4 +1,4 @@
      - Autoencoders - CMS Machine Learning Documentation       

      Autoencoders

      Introduction

      Autoencoders are a powerful tool that has gained popularity in HEP and beyond recently. These types of algorithms are neural networks that learn to compress and then reconstruct data with minimal reconstruction error (Goodfellow, et. al.).

      The idea of using neural networks for dimensionality reduction or feature learning dates back to the early 1990s. Autoencoders, or "autoassociative neural networks," were originally proposed as a nonlinear generalization of principal component analysis (PCA) (Kramer). More recently, connections between autoencoders and latent variable models have brought these types of algorithms into the generative modeling space.

      The two main parts of an autoencoder are the encoder function \(f(x)\) and the decoder function \(g(h)\), which acts on the encoded representation \(h = f(x)\). The learning process of an autoencoder is the minimization of a loss function, \(L(x,g(f(x)))\), that compares the original data to the output of the decoder, as in the training of any other neural network. As such, these algorithms can be trained using the same techniques, like minibatch gradient descent with backpropagation. Below is a representation of an autoencoder from Mathworks.

      autoencoder_model
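As a minimal sketch of the encoder, decoder, and loss described above, the following trains a purely linear undercomplete autoencoder with minibatch gradient descent in NumPy. The toy data, the layer sizes, the learning rate, and the absence of non-linearities are all assumptions made to keep the example short; a realistic autoencoder would typically be built in a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 500 points in 4 dimensions that lie close to a 2D subspace.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 4))
x = latent @ mixing + 0.01 * rng.normal(size=(500, 4))

# Linear encoder f(x) = x @ W_enc and decoder g(h) = h @ W_dec (no biases).
W_enc = 0.1 * rng.normal(size=(4, 2))
W_dec = 0.1 * rng.normal(size=(2, 4))

def reconstruction_loss(data, W_enc, W_dec):
    """Mean squared error L(x, g(f(x))) averaged over all entries."""
    return np.mean((data @ W_enc @ W_dec - data) ** 2)

initial_loss = reconstruction_loss(x, W_enc, W_dec)

learning_rate = 0.05
for step in range(3000):
    batch = x[rng.choice(len(x), size=64, replace=False)]
    h = batch @ W_enc                          # encode
    residual = h @ W_dec - batch               # decode and compare to the input
    grad_out = 2.0 * residual / residual.size  # dL/d(output)
    grad_dec = h.T @ grad_out                  # backpropagate to the decoder
    grad_enc = batch.T @ (grad_out @ W_dec.T)  # and on to the encoder
    W_enc -= learning_rate * grad_enc
    W_dec -= learning_rate * grad_dec

final_loss = reconstruction_loss(x, W_enc, W_dec)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Because the latent dimension (2) is smaller than the input dimension (4), the network cannot copy its input and has to find the subspace in which the data actually varies.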

      Constrained Autoencoders (Undercomplete and Regularized)

      Information in this section can be found in Goodfellow, et. al.

      An autoencoder that is able to perfectly reconstruct the original data one-to-one, such that \(g(f(x)) = x\), is not very useful for extracting salient information from the data. There are several methods imposed on simple autoencoders to encourage them to extract useful aspects of the data.

      One way of avoiding perfect data reconstruction is by constraining the dimension of the encoding function \(f(x)\) to be less than the data \(x\). These types of autoencoders are called undercomplete autoencoders, which force the imperfect copying of the data such that the encoding and decoding networks can prioritize the most useful aspects of the data.

      However, if undercomplete encoders are given too much capacity, they will struggle to learn anything of importance from the data. Similarly, this problem occurs in autoencoders with encoder dimensionality greater than or equal to the data (the overcomplete case). In order to train any architecture of AE successfully, constraints based on the complexity of the target distribution must be imposed, apart from small dimensionality. These regularized autoencoders can have constraints on sparsity, robustness to noise, and robustness to changes in data (the derivative).

      Sparse Autoencoders

      Sparse autoencoders place a penalty to enforce sparsity in the encoding layer \(\mathbf{h} = f(\mathbf{x})\), such that the function to minimize becomes \(L(\mathbf{x}, g(f(\mathbf{x}))) + \Omega(\mathbf{h})\). This penalty prevents the autoencoder from learning the identity transformation and encourages it to extract useful features of the data for later tasks, such as classification. While the penalty term can be thought of as a regularizing term for a feedforward network, we can expand this view to think of the entire sparse autoencoder framework as approximating the maximum likelihood estimation of a generative model with latent variables \(\mathbf{h}\). When approximating the maximum likelihood, the joint distribution \(p_{\text{model}}(\mathbf{x}, \mathbf{h})\) can be factorized as

      \[ \text{log} [ p_{\text{model}}(\mathbf{x}, \mathbf{h})] = \text{log} [p_{\text{model}}(\mathbf{h})] + \text{log} [p_{\text{model}}(\mathbf{x} | \mathbf{h})] \]

      where \(p_{\text{model}}(\mathbf{h})\) is the prior distribution over the latent variables, instead of the model's parameters. Here, we approximate the sum over all possible values of \(\mathbf{h}\) by a point estimate at one highly likely value of \(\mathbf{h}\). This prior term is what introduces the sparsity requirement, for example with the Laplace prior, $$ p_{\text{model}}(h_i) = \frac{\lambda}{2}e^{-\lambda|h_i|}. $$

      The negative log-prior is then

      $$ -\text{log} [p_{\text{model}}(\mathbf{h})] = \sum_i (\lambda|h_i| - \text{log}\frac{\lambda}{2}) = \Omega(\mathbf{h}) + \text{const}. $$ This example demonstrates how the model's distribution over latent variables (prior) gives rise to a sparsity penalty.
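The link between the Laplace prior and an L1 sparsity penalty can be checked numerically. The short sketch below uses arbitrary illustrative values for \(\lambda\) and \(\mathbf{h}\) and verifies that the negative log of the Laplace prior equals \(\lambda\sum_i|h_i|\) plus a constant that does not depend on \(\mathbf{h}\).

```python
import numpy as np

lam = 0.5                                # illustrative value of lambda
h = np.array([0.0, -1.2, 0.3, 2.0])      # illustrative latent activations

# Log of the factorized Laplace prior: sum_i log(lam / 2 * exp(-lam * |h_i|)).
log_prior = np.sum(np.log(lam / 2.0) - lam * np.abs(h))

# L1 sparsity penalty Omega(h) and the h-independent constant.
omega = lam * np.sum(np.abs(h))
const = -h.size * np.log(lam / 2.0)

print(f"-log prior: {-log_prior:.4f}, Omega + const: {omega + const:.4f}")
```

Since the constant does not depend on \(\mathbf{h}\), minimizing the negative log-prior and minimizing the L1 penalty lead to the same gradients.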

      Penalized Autoencoders

      Similar to sparse autoencoders, a traditional penalty term can be introduced to the cost function to regularize the autoencoder, such that the function to minimize becomes $$ L(\mathbf{x},g(f(\mathbf{x}))) + \Omega(\mathbf{h},\mathbf{x}), $$ where $$ \Omega(\mathbf{h},\mathbf{x}) = \lambda\sum_i ||\nabla_{\mathbf{x}}h_i||^2. $$ Because the penalty depends on the gradient of the latent variables with respect to the input variables, the model is penalized when a slight change in \(\mathbf{x}\) produces a large change in \(\mathbf{h}\), which encourages it to be insensitive to those slight variations. This type of regularization leads to a contractive autoencoder (CAE).
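For a concrete case, assume a single-layer encoder \(\mathbf{h} = \sigma(W\mathbf{x} + \mathbf{b})\) with a sigmoid activation (an assumption made here for illustration; the text does not fix the encoder). The gradient of each \(h_i\) is then \(h_i(1-h_i)W_{i,:}\), so the contractive penalty has a closed form, which the sketch below compares against a finite-difference Jacobian.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b, lam):
    """lam * sum_i ||grad_x h_i||^2 for the encoder h = sigmoid(W x + b)."""
    h = sigmoid(W @ x + b)
    # grad_x h_i = h_i (1 - h_i) W[i, :], so its squared norm factorizes.
    return lam * np.sum((h * (1.0 - h)) ** 2 * np.sum(W**2, axis=1))

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 5))   # 5 inputs, 3 latent units (toy sizes)
b = rng.normal(size=3)
x = rng.normal(size=5)

analytic = contractive_penalty(x, W, b, lam=1.0)

# Cross-check with a central finite-difference estimate of the Jacobian dh/dx.
eps = 1e-6
jacobian = np.stack(
    [(sigmoid(W @ (x + eps * e) + b) - sigmoid(W @ (x - eps * e) + b)) / (2 * eps)
     for e in np.eye(5)],
    axis=1,
)
numeric = np.sum(jacobian**2)
print(f"analytic: {analytic:.6f}, finite differences: {numeric:.6f}")
```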

      Denoising Autoencoders

      Another way to encourage autoencoders to learn useful features of the data is to train the algorithm to minimize a cost function that compares the original data (\(\mathbf{x}\)) to the encoded and decoded version of data that has been injected with noise (\(g(f(\mathbf{\tilde{x}}))\)), $$ L(\mathbf{x},g(f(\mathbf{\tilde{x}}))). $$ Denoising autoencoders then must learn to undo the effect of the noise in the encoded/decoded data. The autoencoder is able to learn the structure of the probability density function of the data (\(p_{\text{data}}\)) as a function of the input variables (\(x\)) through this process (Alain, Bengio; Bengio, et. al.). With this type of cost function, even overcomplete, high-capacity autoencoders can avoid learning the identity transformation.
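The denoising cost can be sketched as a function that corrupts the input, passes it through the encoder and decoder, and compares the result to the clean data. Additive Gaussian noise, the mean squared error, and the function name `denoising_loss` are assumptions chosen for illustration. Evaluating it with identity maps shows why the identity transformation is no longer a good solution: its loss is close to the noise variance instead of zero.

```python
import numpy as np

def denoising_loss(x, encode, decode, noise_std, rng):
    """Mean squared error between clean x and the reconstruction of noisy x."""
    x_tilde = x + rng.normal(scale=noise_std, size=x.shape)  # corrupted input
    return np.mean((decode(encode(x_tilde)) - x) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(10_000, 4))   # toy clean data

def identity(v):
    return v

# An autoencoder that simply copies its input keeps all of the injected noise.
loss_identity = denoising_loss(x, identity, identity, noise_std=0.5, rng=rng)
print(f"identity map loss: {loss_identity:.3f} (noise variance: {0.5**2})")
```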

      Variational Autoencoders

      Variational autoencoders (VAEs), introduced by Kigma and Welling, are similar to normal AEs. They are comprised of neural nets, which maps the input to latent space (encoder) and back (decoder), where the latent space is a low-dimensional, variational distribution. VAEs are bidirectional, generating data or estimating distributions, and were initially designed for unsupervised learning but can also be very useful in semi-supervised and fully supervised scenarios (Goodfellow, et. al.).

      VAEs are trained by maximizing the variational lower bound associated with data point \(\mathbf{x}\), which is a function of the approximate posterior (inference network, or encoder), \(q(\mathbf{z})\). Latent variable \(\mathbf{z}\) is drawn from this encoder distribution, with \(p_\text{model}(\mathbf{x} | \mathbf{z})\) viewed as the decoder network. The variational lower bound (also called the evidence lower bound or ELBO) is a trade-off between the join log-likelihood of the visible and latent variables, and the KL divergence between the model prior and the approximate posterior, shown below (Goodfellow, et. al.).

      $$ \mathcal{L}(q) = E_{\mathbf{z} \sim q(\mathbf{z} | \mathbf{x})} \text{log}p_\text{model}(\mathbf{x} | \mathbf{z}) - D_\text{KL}(q || p) $$.

      Methods for optimizing the VAE by learning the variational lower bound include EM meta-algorithms like probabilistic PCA (Goodfellow, et. al.).

      Autoencoders

      Introduction

      Autoencoders are a powerful tool that has recently gained popularity in HEP and beyond. These algorithms are neural networks that learn to compress data into an encoded representation and then decompress it with minimal reconstruction error (Goodfellow et al.).

      The idea of using neural networks for dimensionality reduction or feature learning dates back to the early 1990s. Autoencoders, or "autoassociative neural networks," were originally proposed as a nonlinear generalization of principal component analysis (PCA) (Kramer). More recently, connections between autoencoders and latent variable models have brought these types of algorithms into the generative modeling space.

      The two main parts of an autoencoder are the encoder function \(f(x)\), which maps the input to an encoded representation, and the decoder function \(g\), which maps that representation back to the input space. An autoencoder learns by minimizing a loss function, \(L(x,g(f(x)))\), that compares the original data to the output of the decoder. Since autoencoders are themselves neural networks, they can be trained using the same techniques as any other network, like minibatch gradient descent with backpropagation. Below is a representation of an autoencoder from Mathworks.

      [Figure: autoencoder_model]
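The training procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the toy data, the layer sizes, the `tanh` activation, the learning rate, and the helper name `sgd_step` are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): 8 observed features driven by 2 hidden factors.
x_data = rng.standard_normal((512, 2)) @ rng.standard_normal((2, 8))
x_data /= x_data.std()

# One-layer encoder and decoder with a 3-dimensional code (illustrative sizes).
params = {
    "W_enc": 0.1 * rng.standard_normal((8, 3)), "b_enc": np.zeros(3),
    "W_dec": 0.1 * rng.standard_normal((3, 8)), "b_dec": np.zeros(8),
}

def f(x, p):
    """Encoder: map the input x to the code h."""
    return np.tanh(x @ p["W_enc"] + p["b_enc"])

def g(h, p):
    """Decoder: map the code h back to the input space."""
    return h @ p["W_dec"] + p["b_dec"]

def loss(x, p):
    """Mean squared reconstruction error L(x, g(f(x)))."""
    return np.mean((x - g(f(x, p), p)) ** 2)

def sgd_step(x, p, lr=0.02):
    """One minibatch gradient-descent step with hand-written backpropagation."""
    h = f(x, p)
    d_out = 2.0 * (g(h, p) - x) / x.size              # dL / d(decoder output)
    d_pre = (d_out @ p["W_dec"].T) * (1.0 - h ** 2)   # back through tanh
    grads = {
        "W_dec": h.T @ d_out, "b_dec": d_out.sum(axis=0),
        "W_enc": x.T @ d_pre, "b_enc": d_pre.sum(axis=0),
    }
    for name in p:
        p[name] -= lr * grads[name]

initial_loss = loss(x_data, params)
for step in range(2000):
    batch = x_data[rng.choice(len(x_data), size=64, replace=False)]
    sgd_step(batch, params)
final_loss = loss(x_data, params)
print(f"reconstruction loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

In practice one would use an automatic-differentiation framework instead of writing the gradients by hand; the manual version is shown only to make the role of \(f\), \(g\), and \(L\) explicit.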

      Constrained Autoencoders (Undercomplete and Regularized)

      Information in this section can be found in Goodfellow et al.

      An autoencoder that is able to perfectly reconstruct the original data one-to-one, such that \(g(f(x)) = x\), is not very useful for extracting salient information from the data. Several constraints can be imposed on simple autoencoders to encourage them to extract useful aspects of the data.

      One way of avoiding perfect data reconstruction is to constrain the dimension of the code \(f(x)\) to be smaller than that of the data \(x\). These types of autoencoders are called undercomplete autoencoders. They force an imperfect copy of the data, so that the encoding and decoding networks must prioritize the most useful aspects of the data.
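As a worked illustration of the undercomplete case, the sketch below uses the known special case of a linear encoder and decoder with squared-error loss, for which the best code of dimension \(k\) spans the same subspace as the first \(k\) principal components. The toy data and the helper names `encode`, `decode`, and `reconstruction_error` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (assumption): 10 features, essentially 3-dimensional plus small noise.
x = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 10))
x += 0.01 * rng.standard_normal(x.shape)
x -= x.mean(axis=0)

# Principal directions from the singular value decomposition of the data.
_, s, vt = np.linalg.svd(x, full_matrices=False)

def encode(x, k):
    """Linear encoder: project onto the first k principal directions."""
    return x @ vt[:k].T

def decode(h, k):
    """Linear decoder: map the k-dimensional code back to 10 features."""
    return h @ vt[:k]

def reconstruction_error(k):
    """Summed squared error of the k-dimensional linear autoencoder."""
    return np.sum((x - decode(encode(x, k), k)) ** 2)

for k in (1, 2, 3, 10):
    print(k, reconstruction_error(k))
```

The error drops sharply once the code dimension reaches the intrinsic dimension of the data (3 here) and vanishes only when the code is as large as the input, which is the uninformative perfect-copy regime described above.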

      However, if the encoder and decoder of an undercomplete autoencoder are given too much capacity, they can learn to copy the data without extracting anything of importance from it. The same problem occurs in autoencoders whose code dimension is greater than or equal to that of the data (the overcomplete case). To train any AE architecture successfully, it is therefore not enough to keep the code small: constraints matched to the complexity of the target distribution must also be imposed. These regularized autoencoders can have constraints on sparsity, robustness to noise, and robustness to changes in data (the derivative).

      Sparse Autoencoders

      Sparse autoencoders add a penalty that enforces sparsity in the encoding layer \(\mathbf{h} = f(\mathbf{x})\), such that the training criterion becomes \(L(\mathbf{x}, g(f(\mathbf{x}))) + \Omega(\mathbf{h})\). This penalty prevents the autoencoder from learning the identity transformation, so that it extracts useful features of the data to be used in later tasks, such as classification. While the penalty term can be thought of as a regularizing term for a feedforward network, we can expand this view and think of the entire sparse autoencoder framework as approximating maximum likelihood estimation of a generative model with latent variables \(\mathbf{h}\). The marginal likelihood is \(p_{\text{model}}(\mathbf{x}) = \sum_{\mathbf{h}} p_{\text{model}}(\mathbf{x}, \mathbf{h})\), and the joint distribution decomposes as

      \[ \text{log} [ p_{\text{model}}(\mathbf{x}, \mathbf{h})] = \text{log} [p_{\text{model}}(\mathbf{h})] + \text{log} [p_{\text{model}}(\mathbf{x} | \mathbf{h})] \]

      where \(p_{\text{model}}(\mathbf{h})\) is the prior distribution over the latent variables, rather than over the model's parameters. Here, we approximate the sum over all possible values of \(\mathbf{h}\) by a point estimate at one highly likely value of \(\mathbf{h}\). This prior term is what introduces the sparsity requirement, for example with the Laplace prior, $$ p_{\text{model}}(h_i) = \frac{\lambda}{2}e^{-\lambda|h_i|}. $$

      The negative log-prior is then

      $$ -\text{log} [p_{\text{model}}(\mathbf{h})] = \sum_i (\lambda|h_i| - \text{log}\frac{\lambda}{2}) = \Omega(\mathbf{h}) + \text{const}, $$ where \(\Omega(\mathbf{h}) = \lambda\sum_i |h_i|\) and the constant depends only on \(\lambda\). This example demonstrates how the model's distribution over latent variables (prior) gives rise to a sparsity penalty.
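The link between the Laplace prior and an L1 sparsity penalty can be checked numerically. In the sketch below, the function names and the example codes `h_sparse` and `h_dense` are illustrative assumptions; the check is that the negative log of the Laplace prior and \(\lambda\sum_i |h_i|\) differ only by a constant that does not depend on \(\mathbf{h}\).

```python
import numpy as np

def laplace_neg_log_prior(h, lam):
    """Negative log of the factorized Laplace prior over the code h."""
    return -np.sum(np.log(lam / 2.0) - lam * np.abs(h))

def l1_penalty(h, lam):
    """Sparsity penalty Omega(h) = lambda * sum_i |h_i|."""
    return lam * np.sum(np.abs(h))

lam = 0.7
h_sparse = np.array([0.0, 0.0, 1.5, 0.0])
h_dense = np.array([0.4, -0.3, 0.5, -0.3])

# The offset between the two quantities is the same for any code h.
offset_sparse = laplace_neg_log_prior(h_sparse, lam) - l1_penalty(h_sparse, lam)
offset_dense = laplace_neg_log_prior(h_dense, lam) - l1_penalty(h_dense, lam)
print(offset_sparse, offset_dense)
```

Because the offset is independent of \(\mathbf{h}\), minimizing the penalty and maximizing the log-prior select the same codes.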

      Penalized Autoencoders

      Similar to sparse autoencoders, a traditional penalty term can be introduced to the cost function to regularize the autoencoder, such that the function to minimize becomes $$ L(\mathbf{x},g(f(\mathbf{x}))) + \Omega(\mathbf{h},\mathbf{x}), $$ where $$ \Omega(\mathbf{h},\mathbf{x}) = \lambda\sum_i ||\nabla_{\mathbf{x}}h_i||^2. $$ Because the penalty depends on the gradient of the latent variables with respect to the input variables, the model is penalized when a slight change in \(\mathbf{x}\) produces a change in \(\mathbf{h}\), which discourages it from learning those slight variations. This type of regularization leads to a contractive autoencoder (CAE).
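For a concrete case, the contractive penalty has a closed form when the encoder is a single sigmoid layer, \(\mathbf{h} = \sigma(W\mathbf{x} + \mathbf{b})\), since then \(\partial h_i / \partial x_j = h_i(1-h_i)W_{ij}\). The sketch below assumes that encoder (the source does not specify one) and compares the closed form against a finite-difference Jacobian; all function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(x, W, b):
    """Single sigmoid layer used as the encoder (an assumption for this example)."""
    return sigmoid(W @ x + b)

def contractive_penalty(x, W, b, lam):
    """lambda * sum_i ||grad_x h_i||^2 using dh_i/dx_j = h_i (1 - h_i) W_ij."""
    h = encode(x, W, b)
    return lam * np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))

def numerical_penalty(x, W, b, lam, eps=1e-6):
    """Same quantity from a central finite-difference estimate of the Jacobian."""
    columns = [
        (encode(x + eps * e, W, b) - encode(x - eps * e, W, b)) / (2.0 * eps)
        for e in np.eye(x.size)
    ]
    jacobian = np.stack(columns, axis=1)  # shape (code dimension, input dimension)
    return lam * np.sum(jacobian ** 2)

rng = np.random.default_rng(2)
W = rng.standard_normal((3, 5))
b = rng.standard_normal(3)
x = rng.standard_normal(5)

analytic = contractive_penalty(x, W, b, lam=0.1)
numeric = numerical_penalty(x, W, b, lam=0.1)
print(analytic, numeric)
```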

      Denoising Autoencoders

      Another way to encourage autoencoders to learn useful features of the data is to train the algorithm to minimize a cost function that compares the original data (\(\mathbf{x}\)) to the encoded and decoded version of data that has been corrupted with noise (\(g(f(\mathbf{\tilde{x}}))\)), $$ L(\mathbf{x},g(f(\mathbf{\tilde{x}}))). $$ Denoising autoencoders must then learn to undo the effect of the noise rather than simply copy their input. Through this process, the autoencoder is able to learn the structure of the probability density function of the data (\(p_{\text{data}}\)) as a function of the input variables (\(x\)) (Alain and Bengio; Bengio et al.). With this type of cost function, even overcomplete, high-capacity autoencoders can avoid learning the identity transformation.
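The following sketch shows why the denoising cost discourages the identity transformation. It assumes additive Gaussian corruption and toy data lying along a single direction in a 10-dimensional space; the helper names and the hand-built "projection" autoencoder are illustrative, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(3)

def corrupt(x, sigma):
    """Additive Gaussian noise (one possible corruption process)."""
    return x + sigma * rng.standard_normal(x.shape)

def denoising_loss(x, x_tilde, f, g):
    """L(x, g(f(x_tilde))): compare the clean data to the reconstruction of the noisy data."""
    return np.mean((x - g(f(x_tilde))) ** 2)

# Toy data (assumption): points along one direction of a 10-dimensional space.
direction = np.ones(10) / np.sqrt(10)
x = rng.standard_normal((1000, 1)) * direction
x_tilde = corrupt(x, sigma=0.5)

# Identity map: copies the noise straight through to the output.
identity_loss = denoising_loss(x, x_tilde, lambda v: v, lambda h: h)

# One-dimensional code along the data direction: most of the noise is discarded.
projection_loss = denoising_loss(
    x, x_tilde,
    lambda v: v @ direction[:, None],  # encoder, output shape (N, 1)
    lambda h: h * direction,           # decoder, output shape (N, 10)
)
print(identity_loss, projection_loss)
```

The identity map pays the full noise variance, while the map that follows the structure of the data pays only the part of the noise lying along that structure.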

      Variational Autoencoders

      Variational autoencoders (VAEs), introduced by Kingma and Welling, are similar to normal AEs. They consist of neural networks that map the input to a latent space (encoder) and back (decoder), where the latent space is described by a low-dimensional variational distribution. VAEs are bidirectional, generating data or estimating distributions, and were initially designed for unsupervised learning but can also be very useful in semi-supervised and fully supervised scenarios (Goodfellow et al.).

      VAEs are trained by maximizing the variational lower bound associated with data point \(\mathbf{x}\), which is a function of the approximate posterior (inference network, or encoder), \(q(\mathbf{z} | \mathbf{x})\). The latent variable \(\mathbf{z}\) is drawn from this encoder distribution, with \(p_\text{model}(\mathbf{x} | \mathbf{z})\) viewed as the decoder network. The variational lower bound (also called the evidence lower bound or ELBO) is a trade-off between the reconstruction log-likelihood of the data given the latent variables and the KL divergence between the approximate posterior and the model prior, shown below (Goodfellow et al.).

      $$ \mathcal{L}(q) = E_{\mathbf{z} \sim q(\mathbf{z} | \mathbf{x})} \text{log}\,p_\text{model}(\mathbf{x} | \mathbf{z}) - D_\text{KL}(q(\mathbf{z} | \mathbf{x}) || p_\text{model}(\mathbf{z})). $$
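A single-sample-batch estimate of this bound can be sketched under common modeling assumptions that the text above does not fix: a diagonal Gaussian encoder \(q(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))\), a standard normal prior, and a unit-variance Gaussian decoder. Under those assumptions the KL term has the closed form \(\frac{1}{2}\sum_j(\mu_j^2 + \sigma_j^2 - 1 - \text{log}\,\sigma_j^2)\). The linear `decode` function and all numerical values are placeholders.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var)

def sample_z(mu, log_var, rng, n_samples):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterized sampling)."""
    eps = rng.standard_normal((n_samples, mu.size))
    return mu + np.exp(0.5 * log_var) * eps

def gaussian_log_likelihood(x, x_hat):
    """log p(x | z) for a unit-variance Gaussian decoder with mean x_hat."""
    return -0.5 * np.sum((x - x_hat) ** 2, axis=-1) - 0.5 * x.size * np.log(2.0 * np.pi)

def elbo_estimate(x, mu, log_var, decode, rng, n_samples=64):
    """Monte Carlo reconstruction term minus the closed-form KL term."""
    z = sample_z(mu, log_var, rng, n_samples)
    reconstruction = np.mean(gaussian_log_likelihood(x, decode(z)))
    return reconstruction - kl_to_standard_normal(mu, log_var)

rng = np.random.default_rng(4)
W = rng.standard_normal((2, 5))        # placeholder linear decoder weights
decode = lambda z: z @ W
x = rng.standard_normal(5)             # one data point with 5 features
mu = np.array([1.0, -0.5])             # encoder outputs for x (made-up values)
log_var = np.array([0.2, -0.3])

elbo = elbo_estimate(x, mu, log_var, decode, rng)
print(elbo)
```

In a real VAE, `mu` and `log_var` are outputs of the encoder network and the estimate is maximized with respect to the encoder and decoder parameters.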

      Methods that learn by maximizing a variational lower bound include EM meta-algorithms, as used for models such as probabilistic PCA (Goodfellow et al.).