Skip to content

Tribuo v4.2.0

Compare
Choose a tag to compare
@Craigacp Craigacp released this 20 Dec 22:09
· 98 commits to main since this release
1c594dc

Tribuo 4.2 adds new models, ONNX export for several types of models, a reproducibility framework for recreating Tribuo models, easy deployment of Tribuo models on Oracle Cloud, along with several smaller improvements and bug fixes. We've added more tutorials covering the new features along with multi-label classification, and further expanded the javadoc to cover all public methods.

In Tribuo 4.1.0 and earlier there is a severe bug in multi-dimensional regression models (i.e., regression tasks with multiple output dimensions). Models other than LinearSGDModel and SparseLinearModel (apart from when using the ElasticNetCDTrainer) have a bug in how the output dimension indices are constructed, and may produce incorrect outputs for all dimensions (as the output will be for a different dimension than the one named in the Regressor object). This has been fixed, and loading in models trained in earlier versions of Tribuo will patch the model to rearrange the dimensions appropriately. Unfortunately this fix cannot be applied to tree based models, and so all multi-output regression tree based models should be retrained using Tribuo 4.2 as they are irretrievably corrupt. Additionally when using standardization in multi-output regression LibSVM models dimensions past the first dimension have the model improperly stored and will also need to be retrained with Tribuo 4.2. See #177 for more details.

Note the KMeans implementation had several internal changes to support running with a java.lang.SecurityManager which will break any subclasses of KMeansTrainer. In most cases changing the signature of any overridden mStep method to match the new signature, and allowing the fjp argument to be null in single threaded execution will fix the subclass.

New models

In this release we've added Factorization Machines, Classifier Chains and HDBSCAN*. Factorization machines are a powerful non-linear predictor which uses a factorized approximation to learn a per output feature-feature interaction term in addition to a linear model. We've added Factorization Machines for multi-class classification, multi-label classification and regression. Classifier chains are an ensemble approach to multi-label classification which given a specific ordering of the labels learns a chain of classifiers where each classifier gets the features along with the predicted labels from earlier in the chain. We also added ensembles of randomly ordered classifier chains which work well in situations when the ground truth label ordering is unknown (i.e., most of the time). HDBSCAN is a hierarchical density based clustering algorithm which chooses the number of clusters based on properties of the data rather than as a hyperparameter. The Tribuo implementation can cluster a dataset, and then at prediction time it provides the cluster the given datapoint would be in without modifying the cluster structure.

  • Classifier Chains (#149), which also adds the jaccard score as a multi-label evaluation metric, and a multi-label voting combiner for use in multi-label ensembles.
  • Factorization machines (#179).
  • HDBSCAN (#196).

ONNX Export

The ONNX format is a cross-platform and cross-library model exchange format. Tribuo can already serve ONNX models via its ONNX Runtime interface, and now has the ability to export models in ONNX format for serving on edge devices, in cloud services, or in other languages like Python or C#.

In this release Tribuo supports exporting linear models (multi-class classification, multi-label classification and regression), sparse linear
regression models, factorization machines (multi-class classification, multi-label classification and regression), LibLinear models (multi-class classification and regression), LibSVM models (multi-class classification and regression), along with ensembles of those models, including arbitrary levels of ensemble nesting. We plan to expand this coverage to more models over time, however for TensorFlow we recommend users export those models as a Saved Model and use the Python tf2onnx converter.

Tribuo models exported in ONNX format preserve their provenance information in a metadata field which is accessible when the ONNX model is loaded back into Tribuo. The provenance is stored as a protobuf so could be read from other libraries or platforms if necessary.

The ONNX export support is in a separate module with no dependencies, and could be used elsewhere on the JVM to support generating ONNX graphs. We welcome contributions to build out the ONNX support in that module.

  • ONNX export for LinearSGDModels (#154), which also adds a multi-label output transformer for scoring multi-label ONNX models.
  • ONNX export for SparseLinearModel (#163).
  • Add provenance to ONNX exported models (#182).
  • Refactor ONNX tensor creation (#187).
  • ONNX ensemble export support (#186).
  • ONNX export for LibSVM and LibLinear (#191).
  • Refactor ONNX support to improve type safety (#199).
  • Extract ONNX support into separate module (#TBD).

Reproducibility Framework

Tribuo has strong model metadata support via its provenance system which records how models, datasets and evaluations are created. In this release we enhance this support by adding a push-button reproduction framework which accepts either a model provenance or a model object and rebuilds the complete training pipeline, ensuring consistent usage of RNGs and other mutable state.

This allows Tribuo to easily rebuild models to see if updated datasets could change performance, or even if the model is actually reproducible (which may be required for regulatory reasons). Over time we hope to expand this support into a full experimental framework, allowing models to be rebuilt with hyperparameter or data changes as part of the data science process or for debugging models in production.

This framework was written by Joseph Wonsil and Prof. Margo Seltzer at the University of British Columbia as part of a collaboration between Prof. Seltzer and Oracle Labs. We're excited to continue working with Joe, Margo and the rest of the lab at UBC, as this is excellent work.

Note the reproducibility framework module requires Java 16 or greater, and is thus not included in the tribuo-all meta-module.

  • Reproducibility framework (#185, with minor changes in #189 and #190).

OCI Data Science Integration

Oracle Cloud Data Science is a platform for building and deploying models in Oracle Cloud. The model deployment functionality wraps a Python runtime and deploys them with an auto-scaler at a REST endpoint. In this release we've added support for deploying Tribuo models which are ONNX exportable directly to OCI DS, allowing scale-out deployments of models from the JVM. We also added a OCIModel wrapper which scores Tribuo Example objects using a deployed model's REST endpoint, allowing easy use of cloud resources for ML on the JVM.

  • Oracle Cloud Data Science integration (#200).

Small improvements

  • Date field processor and locale support in metadata extractors (#148)
  • Multi-output response processor allowing loading different formats of multi-label and multi-dimensional regression datasets (#150)
  • ARM dev profile for compiling Tribuo on ARM platforms (#152)
  • Refactor CSVLoader so it uses CSVDataSource and parses CSV files using RowProcessor, allowing an easy transition to more complex columnar extraction (#153)
  • Configurable anomaly demo data source (#160)
  • Configurable clustering demo data source (#161)
  • Configurable classification demo data source (#162)
  • Multi-Label tutorial and configurable multi-label demo data source (#166) (also adds a multi-label tutorial) plus fix in #168 after #167
  • Add javadoc for all public methods and fields (#175) (also fixes a bug in Util.vectorNorm)
  • Add hooks for model equality checks to trees and LibSVM models (#183) (also fixes a bug in liblinear get top features)
  • XGBoost 1.5.0 (#192)
  • TensorFlow Java 0.4.0 (#195) (note this changes Tribuo's TF API slightly as TF-Java 0.4.0 has a different method of initializing the session)
  • KMeans now uses dense vectors when appropriate, speeding up training (#201)
  • Documentation updates, ONNX and reproducibility tutorials (#205)

Bug fixes

  • NPE fix for LIME explanations using models which don't support per class weights (#157)
  • Fixing a bug in multi-label evaluation which swapped FP for FN (#167)
  • Persist CSVDataSource headers in the provenance (#171)
  • Fixing LibSVM and LibLinear so they have reproducible behaviour (#172)
  • Provenance fix for TransformTrainer and an extra factory for XGBoostExternalModel so you can make them from an in memory booster (#176)
  • Fix multidimensional regression (#177) (fixes regression ids, fixes libsvm so it emits correct standardized models, adds support for per dimension feature weights in XGBoostRegressionModel)
  • Fix provenance generation for FieldResponseProcessor and BinaryResponseProcessor (#178)
  • Normalize LibSVMDataSource paths consistently in the provenance (#181)
  • KMeans and KNN now run correctly when using OpenSearch's SecurityManager (#197)

Contributors

What's Changed

  • Bumping to 4.2.0-SNAPSHOT for new development by @Craigacp in #143
  • Adding release notes for the earlier v4 releases by @Craigacp in #146
  • Adds classifier chains as a generic multi-label classifier by @Craigacp in #149
  • Adds a field processor which operates on dates by @Craigacp in #148
  • Added support for multioutputs to ResponseProcesser, with tests. by @JackSullivan in #150
  • Adding an ARM maven profile which skips the native library tests by @Craigacp in #152
  • Fixing an NPE in LIMEExplanation.getActiveFeatures() by @Craigacp in #157
  • CSVLoader refactor by @Craigacp in #153
  • Adds a ConfigurableDataSource data generator for AnomalyDetection by @Craigacp in #160
  • Adds ONNX export support to Tribuo's LinearSGDModels by @Craigacp in #154
  • Adds a ConfigurableDataSource data generator for Clustering by @Craigacp in #161
  • Adds ConfigurableDataSource data generators for Classification by @Craigacp in #162
  • Adds a tutorial on multi-label problems and a configurable data source generator for multi-label demos. by @Craigacp in #166
  • Bumping LibSVM version by @Craigacp in #170
  • Adds ONNX export support to the sparse linear models by @Craigacp in #163
  • Fixing a bug where MultiLabelConfusionMatrix swapped FP for FN by @Craigacp in #167
  • Updating the multi-label tutorial after the evaluation bug fix by @Craigacp in #168
  • CSVDataSource should persist the headers in the provenance by @Craigacp in #171
  • LibLinear and LibSVM have unmanaged global RNGs by @Craigacp in #172
  • Add Javadoc for all remaining undocumented public methods and fields. by @Craigacp in #175
  • Two small fixes for provenance. by @Craigacp in #176
  • Fixes multidimensional regression by @Craigacp in #177
  • Normalizes the URL created from paths in LibSVMDataSource by @Craigacp in #181
  • Factorization machines by @Craigacp in #179
  • ResponseProcessor State-setting and Tests by @JackSullivan in #178
  • Adding some accessors to allow deeper model equality checks by @Craigacp in #183
  • Adds Tribuo provenance as a metadata field to exported ONNX models by @Craigacp in #182
  • Bumping CI to Java 17 by @Craigacp in #188
  • Refactor onnx math by @JackSullivan in #187
  • Addition of a Reproducibility Framework by @jwons in #185
  • Updates for the reproducibility changes in the rest of Tribuo by @Craigacp in #190
  • ONNX ensemble support by @Craigacp in #186
  • Bump XGBoost to 1.5.0 by @Craigacp in #192
  • Reproducibility generics cleanup by @Craigacp in #189
  • ONNX export support for LibLinear and LibSVM by @Craigacp in #191
  • Update bug_report.md by @Craigacp in #194
  • These are the changes for an implementation of HDBSCAN* by @geoffreydstewart in #196
  • Tensorflow-Java 0.4.0 update by @Craigacp in #195
  • Single threaded K-Means training no longer uses a ForkJoinPool by @Craigacp in #197
  • Adds setInvocationCount to HdbscanTrainer by @Craigacp in #198
  • Refactor Java ONNX Interface by @JackSullivan in #199
  • Add the HDBSCAN* clustering tutorial, and add a small fix for predictions by @geoffreydstewart in #202
  • Cleaned up some doc formatting and some typos by @jhalexand in #204
  • Moving ONNX export utils out into a separate module by @Craigacp in #203
  • Oracle Cloud Data Science interop by @Craigacp in #200
  • KMeans DenseVector support by @Craigacp in #201
  • Documentation updates for 4.2 by @Craigacp in #205
  • Tribuo v4.2 release by @Craigacp in #206

New Contributors

Full Changelog: v4.1.0...v4.2.0