v0.7.beta
Pre-release
What's Changed
- Fix ray nightly import by @jppgks in #2196
- Restructured split config and added datetime splitting by @tgaddair in #2132
- enh: Implements `InferenceModule` as a pipelined module with separate preprocessor, predictor, and postprocessor modules by @brightsparc in #2105
- Explicitly pass data credentials when reading binary files from a RayBackend by @jeffreyftang in #2198
- MlflowCallback: do not end run on_trainer_train_teardown by @jppgks in #2201
- Fail hyperopt with full import error when Ray not installed by @tgaddair in #2203
- Make convert_predictions() backend-aware by @hungcs in #2200
- feat: MVP for explanations using Integrated Gradients from captum by @jppgks in #2205
- [Torchscript] Adds GPU-enabled input types for Vector and Timeseries by @geoffreyangus in #2197
- feat: Added model type GBM (LightGBM tree learner), as an alternative to ECD by @jppgks in #2027
- [Torchscript] Parallelized Text/Sequence Preprocessing by @geoffreyangus in #2206
- feat: Adding feature type shared parameter capability for hyperopt by @arnavgarg1 in #2133
- Bump up version to 0.6.dev. by @justinxzhao in #2209
- Define `FloatOrAuto` and `IntegerOrAuto` schema fields, and use them. by @justinxzhao in #2219
- Define a dataclass for parameter metadata. by @justinxzhao in #2218
- Add explicit handling for zero-length image byte buffers to avoid cryptic errors by @jeffreyftang in #2210
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2231
- Create dataset util to form repeatable train/vali/test split by @amholler in #2159
- Bug fix: Use safe rename which works across filesystems when writing checkpoints by @dantreiman in #2225
- Add parameter metadata to the trainer schema. by @justinxzhao in #2224
- Add an explicit call to merge_with_defaults() when loading a config from a model directory. by @justinxzhao in #2226
- Fixes flaky test test_datetime_split[dask] by @dantreiman in #2232
- Fixes prediction saving for models with Set output by @geoffreyangus in #2211
- Make ExpectedImpact JSON serializable by @hungcs in #2233
- standardised quotation marks, added missing word by @Marvjowa in #2236
- Add boolean postprocessing to dataset type inference for automl by @magdyksaleh in #2193
- Update get_repeatable_train_val_test_split to handle non-stratified split w/ no existing split by @amholler in #2237
- Update R2 score to handle single sample computation by @arnavgarg1 in #2235
- Input/Output Feature Schema Refactor by @connor-mccorm in #2147
- Fix nan in entmax loss and flaky sparsemax/entmax loss tests by @dantreiman in #2238
- Fix preprocessing dataset split API backwards compatibility upgrade bug. by @justinxzhao in #2239
- Removing duplicates in constants from recent PRs by @arnavgarg1 in #2240
- Add attention scores of the vit encoder as an additional return value by @Dennis-Rall in #2192
- Unnest Audio Feature Preprocessing Config by @connor-mccorm in #2242
- Fixed handling of invalid number values to treat as missing values by @tgaddair in #2247
- Support saving numpy predictions to remote FS by @hungcs in #2245
- Use global constant for description.json by @hungcs in #2246
- Removed import warnings when LightGBM and Ray not requested by @tgaddair in #2249
- Adds ability to read images from numpy files and numpy arrays by @geoffreyangus in #2212
- Hyperopt steps per epoch not being computed correctly by @arnavgarg1 in #2175
- Fixed splitting when providing pre-split inputs by @tgaddair in #2248
- Added Backwards Compatibility for Audio Feature Preprocessing by @connor-mccorm in #2254
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2256
- Fix: Don't skip saving the model if the save path already exists. by @justinxzhao in #2264
- Load best weights outside of finally block, since load may throw an exception by @dantreiman in #2268
- Reduce number of distributed tests. by @justinxzhao in #2270
- [WIP] Adds `inference_utils.py` by @geoffreyangus in #2213
- Run github checks for pushes and merges to *-stable. by @justinxzhao in #2266
- Add ludwig logo and version to CLI help text. by @justinxzhao in #2258
- Add hyperopt_statistics.json constant by @hungcs in #2276
- fix: Make `BaseTrainerConfig` an abstract class by @ksbrar in #2273
- [Torchscript] Adds `--device` argument to `export_torchscript` CLI command by @geoffreyangus in #2275
- Use pytest tmpdir fixture wherever temporary directories are used in tests. by @justinxzhao in #2274
- adding configs used in benchmarking by @abidwael in #2263
- Fixes #2279 by @noahlh in #2284
- adding hardware usage and software packages tracker by @abidwael in #2195
- benchmarking utils by @abidwael in #2260
- dataclasses for summarizing benchmarking results by @abidwael in #2261
- Benchmarking core by @abidwael in #2262
- Fixed default eval_batch_size when setting batch_size=auto by @tgaddair in #2286
- Remove obsolete postprocess_inference_graph function. by @justinxzhao in #2267
- [Torchscript] Adds BERT tokenizer + partial HF tokenizer support by @geoffreyangus in #2272
- Support passing ground_truth as df for visualizations by @hungcs in #2281
- catching urllib3 exception by @abidwael in #2294
- Run pytest workflow on release branches. by @justinxzhao in #2291
- Save checkpoint if train_steps is smaller than batcher's steps_per_epoch by @dantreiman in #2298
- Fix typo in amazon review datasets: s/review_tile/review_title by @dantreiman in #2300
- Refactor non-distributed automl utils into a separate directory. by @justinxzhao in #2296
- Don't skip normalization in TabNet during inference on a single row. by @dantreiman in #2299
- Fix error in postproc_predictions calculation in model.evaluate() by @arnavgarg1 in #2304
- Test for parameter updates in Ludwig components by @jimthompson5802 in #2194
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2311
- Use warnings to suppress repeated logs for failed image reads by @arnavgarg1 in #2312
- Use ray dataset and drop type casting in binary_feature prediction post processing for speedup by @magdyksaleh in #2293
- Add size_bytes to DatasetInfo and DataSource by @jeffreyftang in #2306
- Fixes TensorDtype TypeError in Ray nightly by @geoffreyangus in #2320
- Add configuration section for global feature parameters by @arnavgarg1 in #2208
- Ensures unit tests are deleting artifacts during teardown by @geoffreyangus in #2310
- Fixes unit test that had empty Dask partitions after splitting by @geoffreyangus in #2313
- Serve json numpy encoding by @jeffkinnison in #2316
- fix: Mlflow config being injected in hyperopt config by @hungcs in #2321
- Update tests that use preprocessing to match new defaults config structure by @arnavgarg1 in #2323
- Bump test timeout to 60 minutes by @tgaddair in #2325
- Set a default value for size_bytes in DatasetInfo by @jeffreyftang in #2331
- Pin nightly versions to fix CI by @geoffreyangus in #2327
- Log number of failed image reads by @arnavgarg1 in #2317
- Add test with encoder dependencies for global defaults by @arnavgarg1 in #2342
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2334
- Add wine quality notebook to demonstrate using config defaults by @arnavgarg1 in #2333
- fix: GBM tests failing after new release from upstream dependency by @jppgks in #2347
- fix: restore overwrite of eval_batch_size on GBM schema by @jppgks in #2345
- Removes empty partitions after dropping rows and splitting datasets by @geoffreyangus in #2328
- fix: Properly serialize `ParameterMetadata` to JSON by @ksbrar in #2348
- Test for parameter updates in Ludwig Components - Part 2 by @jimthompson5802 in #2252
- refactor: Replace bespoke marshmallow fields that accept multiple types with a new 'combinatorial' `OneOfField` that accepts other fields as arguments. by @ksbrar in #2285
- Use Ray Datasets to read binary files in parallel by @tgaddair in #2241
- typos: Update README.md by @andife in #2358
- Respect the resource requests in RayPredictor by @magdyksaleh in #2359
- Resource tracker threading by @abidwael in #2352
- Allow writing init_config results to remote filesystems by @tgaddair in #2364
- Fixed export_mlflow command to not assume an existing registered_model_name by @tgaddair in #2369
- fix: Fixes to serialization, and update to allow set repo location. by @brightsparc in #2367
- Add amazon employee access challenge kaggle dataset by @justinxzhao in #2349
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2362
- Wrap read of cached training set metadata in try/except for robustness by @jeffreyftang in #2373
- Reduce dropout prob in test_conv1d_stack by @dantreiman in #2380
- fever: change broken download links by @jppgks in #2381
- Add default split config by @hungcs in #2379
- Fix CI: Skip failing ray GBM tests by @justinxzhao in #2391
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2389
- Triton ensemble export by @abidwael in #2251
- Fix: Random dataset splitting with 0.0 probability for optional validation or test sets. by @justinxzhao in #2382
- Print final training report as tabulated text. by @justinxzhao in #2383
- Add Ray 2.0 to CI by @tgaddair in #2337
- add GBM configs to benchmarking by @jppgks in #2395
- Optional artifact logging for MLFlow by @ShreyaR in #2255
- Simplify ludwig.benchmarking.benchmark API and add ludwig benchmark CLI by @abidwael in #2394
- rename kaggle_api_key to kaggle_key by @jppgks in #2384
- use new URL for yosemite dataset by @jppgks in #2385
- Encoder refactor V2 by @dantreiman in #2370
- re-enable GBM tests after new lightgbm-ray release by @jppgks in #2393
- Added option to log artifact location while creating mlflow experiment by @ShreyaR in #2397
- Treat dataset columns as object dtype during first pass of handle_missing_values by @jeffreyftang in #2398
- fix: ParameterMetadata JSON serialization bug by @ksbrar in #2399
- Adds registry to organize backward compatibility updates around versions and config sections by @dantreiman in #2335
- Include split column in explanation df by @connor-mccorm in #2405
- Fix AimCallback to model_name as Run.name by @alberttorosyan in #2413
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2410
- Hotfix: features eligible for shared params hyperopt by @arnavgarg1 in #2417
- Nest FC Params in Decoder by @connor-mccorm in #2400
- Hyperopt Backwards Compatibility by @connor-mccorm in #2419
- Investigating test_resnet_block_layer intermittent test failure by @dantreiman in #2414
- fix: Remove duplicate option from `cell_type` field schema by @ksbrar in #2428
- Test for parameter updates in Ludwig Combiners - Part 3 by @jimthompson5802 in #2332
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2430
- Hotfix: Proc column missing in output feature schema by @arnavgarg1 in #2435
- Nest hyperopt parameters into decoder object by @arnavgarg1 in #2436
- Fix: Make the twitter bots modeling example runnable by @justinxzhao in #2433
- Add MLG-ULB creditcard fraud dataset by @jppgks in #2425
- Bugfix: non-number inputs to GBM by @jppgks in #2418
- GBM: log intermediate progress by @jppgks in #2421
- Fix: Upgrade ludwig config before schema validation by @connor-mccorm in #2441
- Log warning for calibration if validation set is trivially small by @dantreiman in #2440
- Fixes calibration and adds example scripts by @dantreiman in #2431
- Add medical no-show appointments dataset by @jppgks in #2387
- Added conditional check for UNK token insertion into category feature vocab by @arnavgarg1 in #2429
- Ensure synthetic dataset unit tests to clean up extra files. by @justinxzhao in #2442
- Added feature specific parameter test for hyperopt by @arnavgarg1 in #2329
- Fixed version transformation to accept user configs without ludwig_version by @tgaddair in #2424
- Fix multiple partition predict by @magdyksaleh in #2422
- Cache jsonschema validator to reduce memory pressure by @tgaddair in #2444
- [tests] Added more explicit lifecycle management to Ray clusters during tests by @tgaddair in #2447
- Fix: explicit keyword args for seaborn plot fn by @jppgks in #2454
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2453
- Extended hyperopt to support nested configuration block parameters by @tgaddair in #2445
- Consolidate missing value strategy to only include bfill and ffill by @arnavgarg1 in #2457
- fix: Switched Learning Rate to NonNegativeFloat Field by @connor-mccorm in #2446
- Support GitHub Codespaces by @jppgks in #2463
- Enh: quality-of-life improvements for `export_torchscript` by @geoffreyangus in #2459
- Disables `batch_size: auto` for CPU-only training by @geoffreyangus in #2455
- bugfix: triton model version as a string by @abidwael in #2461
- Updating images to Ray 2.0.0 and CUDA 11.3 by @abidwael in #2390
- Loss, Split, and Defaults Schema Additions by @connor-mccorm in #2439
- More precise resource usage tracking by @abidwael in #2363
- Summarizing performance metrics and resource usage results by @abidwael in #2372
- Better gbm defaults based on benchmarking results by @jppgks in #2466
- Infer single distinct value columns as category instead of binary by @arnavgarg1 in #2467
- fix: Add explicit schema in to_parquet() during saving predictions by @hungcs in #2420
- Publish docker images from release branches by @tgaddair in #2470
- Add backwards-compatibility logic for model progress tracker by @jeffreyftang in #2468
- Backwards compatibility for class_weights by @connor-mccorm in #2469
- Test for parameter updates in Ludwig Decoders - Part 4 by @jimthompson5802 in #2354
- Fixed backwards compatibility for training_set_metadata and bfill by @tgaddair in #2472
- Fixed backwards compatibility for models with level metadata in saved configs by @tgaddair in #2475
- Fix profiler: account for missing values when running in docker by @jppgks in #2477
- Add L-BFGS optimizer by @jppgks in #2478
- fix: Automatically assign title to OneOfOptionsField by @ksbrar in #2480
- fix: handle 'numerical' entries in preprocessing config during backwards compatibility upgrade by @jeffreyftang in #2484
- fix: mark update_class_weights_in_features transformation for version 0.6 by @jeffreyftang in #2481
- Fixed usage of checkpoints for AutoML in Ray 2.0 by @tgaddair in #2485
- [fix flaky test] Relax loss constraint for unit tests for lbfgs optimizer. by @justinxzhao in #2486
- Fixed stratified splitting with Dask by @tgaddair in #1883
- Replace custom Union marshmallow fields with Oneof fields, and default allow_none=True everywhere. by @justinxzhao in #2482
- Resource isolation for dataset preprocessing on ray backends by @magdyksaleh in #2404
- Pin transformers < 4.22 until issues resolved by @tgaddair in #2495
- Fix flaky ray nightly image test by @arnavgarg1 in #2493
- Added workflow to auto cherry-pick into release branches by @tgaddair in #2500
- Enable hyperopt to be launched from a ray client by @ShreyaR in #2501
- GBM: support hyperopt by @jppgks in #2490
- Fixes saved_weights_in_checkpoint docstring, mark as internal only by @dantreiman in #2506
- Fix test length of predictions by @tgaddair in #2507
- Fixed support for distributed datasets in create_auto_config by @tgaddair in #2508
- Config-first Datasets API (ludwig.datasets refactor) by @dantreiman in #2479
- Add in-memory dataset size calculation to dataset statistics by @arnavgarg1 in #2509
- Surfacing dataset statistics in hyperopt by @arnavgarg1 in #2515
- Adds multimodal benchmark datasets from AutoGluon paper by @dantreiman in #2512
- Adds goodbooks dataset by @dantreiman in #2514
- GBM: correctly compute early stopping by @jppgks in #2517
- Fixes mnist dataset image files not exporting by @dantreiman in #2520
- Fix get_best_model in hyperopt for Ray 1.12 by @arnavgarg1 in #2527
- Populate Parameter Metadata by @connor-mccorm in #2503
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2532
- Update README to be consistent with ludwig.ai home page. by @justinxzhao in #2530
- Add missing declarative ML image in README by @arnavgarg1 in #2533
- fix: Add missing titles/descriptions to various schemas by @ksbrar in #2516
- Cleanup: move to per-module loggers instead of the global logging object. by @justinxzhao in #2531
- Updated schedule logic for placement groups for ray backend by @magdyksaleh in #2523
- Nit: Parameter update tests grammar. by @justinxzhao in #2537
- Hyperopt: Log warning with num_extra_trials if all grid search parameters and num_samples > 1 by @arnavgarg1 in #2535
- Adds model configs to ludwig.datasets by @dantreiman in #2540
- ZScore Normalization Failure When Using Constant Value Number Feature by @arnavgarg1 in #2543
- Adds class names to calibration plot title, reformats Brier scores as grouped bar chart by @dantreiman in #2545
- Pin ray nightly version to avoid new test failures by @arnavgarg1 in #2548
- Added tests for init_config and render_config CLI commands by @tgaddair in #2551
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2554
- Ensure bfill/ffill leave no residual NaNs in the dataset during preprocessing by @arnavgarg1 in #2553
- Comprehensive configs: Explicitly list and save all parameter values for input and output features in configs. by @justinxzhao in #2460
- Fixing SettingWithCopyWarning when using `get_repeatable_train_val_test_split` by @abidwael in #2562
- Replace numerical with number in dataset zoo configs. by @justinxzhao in #2558
- Benchmarking toolkit wrap up by @abidwael in #2462
- Migrate to Raincloud plots for hyperopt report by @arnavgarg1 in #2561
- Remove global torchtext version-specific tokenizer availability warnings. by @justinxzhao in #2547
- Only create hyperopt pair plots when there is more than 1 parameter by @arnavgarg1 in #2560
- fix: Limit frequency array to top_n_classes in F1 viz by @hungcs in #2565
- int: unpin Dask version by @geoffreyangus in #2550
- Fixed typehint and removed unused utility function by @magdyksaleh in #2570
- AutoML: stratify imbalanced datasets by @jppgks in #2525
- Use Ray Air Checkpoint to sync files between trial workers by @tgaddair in #2577
- GBM bugfix: matching predictions LightGBM, hummingbird by @jppgks in #2574
- specify seed in RayDataset shuffling by @abidwael in #2566
- update logging message when `early_stop: -1` by @abidwael in #2585
- update docker with torch wheel by @abidwael in #2584
- Refactors test_ray.py to minimize duplicate training jobs by @geoffreyangus in #2573
- Explanation API and feature importance for GBM by @jppgks in #2564
- Remove duplicate option by @connor-mccorm in #2593
- Quick fix: Don't show calibration validation set warnings unless calibration is actually enabled by @dantreiman in #2595
- Fixed issue when uploading output directory artifacts to remote filesystems by @tgaddair in #2598
- Add API Annotations to Ludwig by @arnavgarg1 in #2596
- Tweaks to the README (forward-ported from release-0.6) by @justinxzhao in #2603
- Extend test coverage for non-conventional booleans by @jppgks in #2601
- Fix assertions in training_determinism tests by @arnavgarg1 in #2606
- Ensure no ghost ray instances are running in tests by @arnavgarg1 in #2607
- Allow explicitly plumbing through nics by @tgaddair in #2605
- bug: fix relative import in optimizers.py by @ksbrar in #2600
- GBM: increase boosting_rounds_per_checkpoint to reduce evaluation overhead by @jppgks in #2612
- regression tests: add GBM model trained on v0.6.1 by @jppgks in #2611
- Relax test constraint to reduce flakiness in test_ray by @arnavgarg1 in #2610
- Add splitter that deterministically splits on an ID column by @tgaddair in #2615
- fix(explain): missing columns for fixed split by @jppgks in #2616
- Fixed hyperopt trial syncing to remote filesystems for Ray 2.0 by @tgaddair in #2617
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2622
- feat: adds `max_batch_size` to auto batch size functionality by @geoffreyangus in #2579
- Set commonly used parameters by @connor-mccorm in #2619
- Factor out defaults mixin change by @connor-mccorm in #2628
- Add type to custom combiner by @connor-mccorm in #2627
- Remove hyperopt from config when running train through cli by @arnavgarg1 in #2631
- Ensure resource availability for ray datasets workloads when running on cpu clusters by @arnavgarg1 in #2524
- Speed up horovod hyperopt tests and solve OOMs by @arnavgarg1 in #2599
- [explain] add API annotations by @jppgks in #2635
- Added storage backend API to allow injecting dynamic credentials by @tgaddair in #2630
- Update version to 0.7.dev by @justinxzhao in #2625
- Unpin Ray nightly in CI by @tgaddair in #2614
- Skip Horovod 0.26 installation, add packaging to requirements.txt by @arnavgarg1 in #2642
- [Annotations] Callbacks by @arnavgarg1 in #2641
- Fix automl by @connor-mccorm in #2639
- accepting dictionary as input to `benchmarking.benchmark` by @abidwael in #2626
- Fixed automl APIs to work with remote filesystems by @tgaddair in #2650
- Adds minimum split size, ensures random split is never smaller than minimum for local backend by @dantreiman in #2623
- Categorical passthrough encoder training failure fix by @abidwael in #2649
- Changes learning_curves to use "step" or "epoch" as x-axis label. by @dantreiman in #2578
- Remove Trainer `type` Param by @connor-mccorm in #2647
- Model performance in GitHub actions by @abidwael in #2568
- Fixed race condition in schema validation by @tgaddair in #2653
- Fixed --gpu_memory_limit in CLI to interpret as fraction of GPU memory by @tgaddair in #2658
- Stopgap solution for test_training_determinism by @connor-mccorm in #2665
- Added min and max to sample ratio by @connor-mccorm in #2655
- Set internal only flags by @connor-mccorm in #2659
- Add support for running pytest github action locally with act by @dantreiman in #2661
- Enforcing a 1 to 1 matching in names between Ludwig datasets and AutoGluon paper by @abidwael in #2666
- Added default arg to get_schema by @connor-mccorm in #2667
- remove duplicate `news_popularity` dataset by @abidwael in #2668
- Switch defaults to use mixins and improve test by @connor-mccorm in #2669
- Documents running local tests with act by @dantreiman in #2672
- Config Object by @connor-mccorm in #2426
- Unpin protobuf by @justinxzhao in #2673
- Check vocab size of category features, error out if only one category. Also adds error.py for custom error types. by @dantreiman in #2670
- Ordered Schema by @connor-mccorm in #2671
- Fix Regression Test Configs by @connor-mccorm in #2678
- Testing always() inside expansion in condition by @dantreiman in #2681
- Add protos to the Ludwig project: DatasetProfile messages and Whylogs messages. by @justinxzhao in #2674
- Allow Ray Tune callbacks to be passed into hyperopt and log model config by @jeffkinnison in #2640
- Check for nans before testing equality in test_training_determinism by @dantreiman in #2687
- Set saved_weights_in_checkpoint on encoder, not input feature by @dantreiman in #2690
- Use fully rendered config dictionary when accessing model.config by @tgaddair in #2685
- bug: Set `additionalProperties` to `True` for preprocessing schemas. by @ksbrar in #2620
- Bump support for torch 1.11.0 by @justinxzhao in #2691
- Fix validator for reduce_learning_rate_on_plateau by @carlogrisetti in #2692
- Use TensorArray to speed up writing predictions with Ray by @tgaddair in #2684
- Dataset size checks in preprocess_for_training by @dantreiman in #2688
- Remove Duplicate Schema Fields by @connor-mccorm in #2679
- Speed up tune_batch_size by using synthetic batches by @tgaddair in #2680
- Add `bucketing_field` Param to Trainer by @connor-mccorm in #2694
- Fix InputDataError to be serializable by @tgaddair in #2695
- Adds PublicAPI annotation to api.py by @dantreiman in #2698
- Cleanup: move to per-module loggers instead of the global logging object. (2) by @justinxzhao in #2699
- Adds Ray implementation of IntegratedGradientsExplainer that distributes across cluster resources by @tgaddair in #2697
- Fixed bug with non-category outputs in RayIntegratedGradientsExplainer by @tgaddair in #2702
- Fix example values for max_batch_size in trainer parameter metadata by @connor-mccorm in #2705
- Fix incorrect internal_only flags on audio feature metadata by @connor-mccorm in #2704
- add customer churn datasets by @abidwael in #2703
- Add Kaggle test splits by @abidwael in #2675
- Fix ComparatorCombiner by @jppgks in #2689
- Actually print the torchinfo summary in print_model_summary() by @justinxzhao in #2696
- Add H&M fashion recommendation dataset by @jppgks in #2708
- Fix GBM ray nightly test by @jppgks in #2676
- Adds DeveloperAPI and PublicAPI annotations to AutoML by @dantreiman in #2701
- Remove obsolete v0 whylogs callback. by @justinxzhao in #2713
- fill_value / computed_fill_value fix by @connor-mccorm in #2714
- Add path to RayDataset by @tgaddair in #2716
- Fixed Horovod to be an optional import when doing Hyperopt by @tgaddair in #2717
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2722
- Adds annotation to download_one method in benchmarks by @dantreiman in #2712
- fix: Prevent shared parameter_metadata instances between `defaults` and `_features`. by @ksbrar in #2715
- Added ngram tokenizer by @tgaddair in #2723
- Revert "Add H&M fashion recommendation dataset (#2708)" by @jppgks in #2724
- Optimize search space for hyperopt tests to decrease test durations by @arnavgarg1 in #2730
- Add custom to_dask() to infer Dask metadata from Datasets schema. by @arnavgarg1 in #2728
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2735
- Bump Ludwig to Ray 2.0 by @arnavgarg1 in #2729
- Parameter Metadata Updates by @connor-mccorm in #2736
- Removes some vestigial code and replaces Tensorflow with PyTorch in comments by @dantreiman in #2731
- @DeveloperAPI annotations for backend module by @dantreiman in #2707
- int: Refactor `test_ray.py` to limit number of full train jobs by @geoffreyangus in #2637
- BaseTrainer: add empty barrier() by @jppgks in #2734
- Use whylogs to generate dataset profiles for pandas and dask dataframes. by @justinxzhao in #2710
- Add IntegerOptions marshmallow field by @ksbrar in #2739
- Downgrade to Ray 2.0 in CI to get green Ludwig CIs again. by @justinxzhao in #2742
- Adds @DeveloperAPI annotations to combiner classes by @dantreiman in #2744
- Use clearer error messages in ludwig serving, and enable serving to work with configs that have stratified splitting on target columns. by @justinxzhao in #2740
- Update Ray GPU Docker image to CUDA 11.6 by @tgaddair in #2747
- Fix #1735 by @herrmann in #2746
- Enable dataset window autosizing by @jeffkinnison in #2721
- Downgrade to PyTorch 1.12.1 in Docker due to NCCL + CUDA compatibility by @tgaddair in #2750
- Replicate ludwig type inference, using the whylogs dataset profile. by @justinxzhao in #2743
- fix: `Encountered unknown symbol 'foo'` warning in Category feature preprocessing by @geoffreyangus in #2662
- Expand ~ in dataset download paths by @dantreiman in #2754
- Updates twitter bots example to new datasets API by @dantreiman in #2753
- fix: refactor `IntegerOptions` field by @ksbrar in #2755
- Added ray datasets repartitioning in cases of multiple train workers by @ShreyaR in #2756
- fix: Fix metadata object-to-JSON serialization for oneOf fields and add full schema serialization test. by @ksbrar in #2758
- refactor: Add `ProtectedString` field (alias of `StringOptions` that only allows one string) by @ksbrar in #2757
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2761
- Updates ludwig docker readme by @dantreiman in #2760
- Annotates ludwig.datasets API by @dantreiman in #2751
- Annotate MLFlow callback, and utility functions by @dantreiman in #2749
- Drishi sarcasmdataset 1 by @drishi in #2725
- Add local_rank to BaseTrainer by @tgaddair in #2766
- Public datasets by @connor-mccorm in #2752
- Fix typo by @connor-mccorm in #2767
- Correctly infer bool and object types in autoML by @arnavgarg1 in #2765
- feat: Hyperopt schema v0, part 1: Move output feature metrics from feature classes to feature configs. by @ksbrar in #2759
- Fix by @connor-mccorm in #2769
- Add ray version to runners by @arnavgarg1 in #2771
- Annotate Ludwig encoders and decoders by @arnavgarg1 in #2773
- Move preprocess callbacks inside model.preprocess by @jeffreyftang in #2772
- Fix benchmark tests, update latest metrics, and use the local backend for GBM benchmark tests by @abidwael in #2748
- Ensure correct output reduction for text encoders like MT5 and add warning messages when not supported by @arnavgarg1 in #2774
- CVE-2007-4559 Patch by @TrellixVulnTeam in #2770
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2776
- Fix double counting of training loss by @arnavgarg1 in #2775
- feat: Hyperopt schema v0, part 2: Make `BaseMarshmallowConfig` abstract by @ksbrar in #2779
- feat: Hyperopt schema v0, part 3: Enable optional min/max support for `FloatTupleMarshmallowField` fields by @ksbrar in #2780
- feat: Hyperopt schema v0, part 4: Add and use new hyperopt registry, search algorithm instantiation by @ksbrar in #2781
- Added exponential retry for mlflow, remote dataset loading by @ShreyaR in #2738
- Add synthetic test data integration test utils, and use them for loss value decrease tests. by @justinxzhao in #2789
- feat: Hyperopt schema v0, part 5: Add basic search algorithm, scheduler, executor, and hyperopt schemas. by @ksbrar in #2784
- Add benchmark as a pytest marker to avoid warnings. by @justinxzhao in #2786
- feat: Hyperopt schema v0, part 6: Enable new hyperopt schema by @ksbrar in #2785
- Add sentencepiece as a requirement, which is necessary for some hf models like mt5. by @justinxzhao in #2782
- [Annotations] Ludwig data modules by @arnavgarg1 in #2793
- [Annotations] Add DeveloperAPI annotations to Ludwig utils - Part 1 by @arnavgarg1 in #2794
- [Annotations] Annotations for Ludwig's utils - Part 2 by @arnavgarg1 in #2797
- [Annotations] Add annotations for schema module (part 1) - Model Config, Split, Trainer, Optimizers, Utils by @arnavgarg1 in #2798
- [Annotations] Annotate Schema Part 2: decoders, encoders, defaults, combiners, loss and preprocessing by @arnavgarg1 in #2799
- Add new data utility functions for buffers and files, and rename registry by @arnavgarg1 in #2796
- [Annotations] Ludwig Schema - Part 3: Features, Hyperopt and Metadata by @arnavgarg1 in #2800
- [Annotations] Add annotations for Ludwig's data utils (file readers) by @arnavgarg1 in #2795
- Proceed with model training even if saving preprocessed data fails. by @justinxzhao in #2783
- Improve warnings about backwards compatibility and dataset splitting. by @justinxzhao in #2788
- Generate structural change warnings and log_once functionality by @arnavgarg1 in #2801
- Broadcast progress tracker dict to all workers by @arnavgarg1 in #2804
- Start fresh training run if files for resuming training are missing by @arnavgarg1 in #2787
- LightGBMRayTrainer repartition datasets with fewer blocks than Ray actors by @jeffkinnison in #2806
- Add InterQuartileTransformer normalization strategy for Number Features by @arnavgarg1 in #2805
- Add negative sampling to ludwig.data by @jppgks in #2711
- Rectify output features in dataset config by @abidwael in #2768
- int: Add JSON markup to support unique input feature names. by @ksbrar in #2792
- int: Replace `StringOptions` usage with `ProtectedString` in split schemas by @ksbrar in #2808
- int: Replace `StringOptions` with `ProtectedString` for combiner schema `type` fields by @ksbrar in #2809
- refactor: Replace `StringOptions` with `ProtectedString` for encoder/decoder schema `type` fields by @ksbrar in #2810
- Upload Datasets to Remote Location by @connor-mccorm in #2764
- [Annotations] Annotate AutoML utils by @arnavgarg1 in #2812
- [Annotations] Ludwig Visualizations by @arnavgarg1 in #2813
- [Annotations] Logging Level Registry by @arnavgarg1 in #2814
- refactor: Replace `StringOptions` with `ProtectedString` for loss/hyperopt schema `type` fields by @ksbrar in #2816
- Define custom Ludwig types and replace Dict[str, Any] type hints with them. by @justinxzhao in #2556
- Config Object Bug Fix by @connor-mccorm in #2817
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2803
- AutoML libraries that use DatasetProfile instead of DatasetInfo by @justinxzhao in #2802
- Remove Sentencepiece by @connor-mccorm in #2821
- fix: account for `max_batch_size` config param in batch size tuning on cpu by @geoffreyangus in #2693
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2823
- refactor: Add filtering based on `model_type` for feature, combiner, and model type schemas by @ksbrar in #2815
- [TorchScript] Add user-defined HF Bert tokenizers by @geoffreyangus in #2733
- [Annotations] Move feature registries into accessor functions by @arnavgarg1 in #2818
- [Annotations] Encoder and Decoder Registries by @arnavgarg1 in #2819
- Speed Up Ray Image Tests by @geoffreyangus in #2828
- fix: Restrict allowed top-level config keys by @ksbrar in #2826
- Moves image decoding out of Ray Datasets to Dask Dataframe by @geoffreyangus in #2737
- Improve type hints and remove dead code for DatasetLoader module by @arnavgarg1 in #2833
- Update stratified split with a more specific exception for underpopulated classes by @jeffkinnison in #2831
- Add Ludwig contributors to README by @arnavgarg1 in #2835
- Fix key error in AutoML model select by @ShreyaR in #2824
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2836
- Drop incomplete batches for Ray and Pandas to prevent Batchnorm computation errors by @arnavgarg1 in #2778
- Catch and surface Runtime exceptions during preprocessing by @arnavgarg1 in #2839
- fix: Mark `width` and `height` as `internal_only` for image encoders by @ksbrar in #2842
- Select best batch size to maximize training throughput by @tgaddair in #2843
- Make batch_size=auto more consistent by using median of 5 steps by @tgaddair in #2846
- Make trainable=False default for all pretrained models by @tgaddair in #2844
- fix: Add back missing split fields by @ksbrar in #2848
- Pin scikit-learn<1.2.0 by @tgaddair in #2850
- text_encoder: RoBERTa max_sequence_length by @rudolfolah in #2852
- Fix TorchText version in tokenizers ahead of torch 1.13.0 upgrade by @geoffreyangus in #2838
- Fix `trainable=False` to freeze all params for HF encoders by @tgaddair in #2855
- Add support for automatic mixed precision (AMP) training by @tgaddair in #2857
- Evaluate training set in the training loop by @tgaddair in #2856
- Extend parameter guidance documentation for regularization, and add explicit maxes to Non-Negative floats by @justinxzhao in #2849
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2860
- Fixes for the roberta encoder: explicitly set max sequence length, and fix output shape computation by @justinxzhao in #2861
- Enables Set output feature on Ray by @geoffreyangus in #2791
- Add go module for dataset profile protos. by @justinxzhao in #2834
- fix: Upgrade `expected_impact` for `trainable` to `MEDIUM` on all encoders. by @ksbrar in #2865
- support stratified split with low cardinality features by @abidwael in #2863
- fix: load spacy model for lemmatization in `EnglishLemmatizeFilterTokenizer` to work by @abidwael in #2868
- Token-level explanations by @jppgks in #2864
- Replace `learning rate: auto` with feature type and encoder-based heuristics by @abidwael in #2854
- Set RayBackend Config to use single worker for tests by @arnavgarg1 in #2853
- Remove `_to_tensors_fn` from Ray Datasets by @geoffreyangus in #2866
- Remove ludwig-dev Dockerfile by @arnavgarg1 in #2873
- Support Ray GPU image with Torch 1.13 and CUDA 11.6 by @arnavgarg1 in #2869
- Use native LightGBM for intermittent eval during training by @jeffkinnison in #2829
- Set default validation metrics based on the output feature type. by @justinxzhao in #2820
- Auto resize images for ViTEncoder when use_pretrained is True or False by @arnavgarg1 in #2862
- TLE Backwards Compatibility Fixes by @jppgks in #2875
- Do not drop batch size dimension for single inputs by @jppgks in #2878
- Save GBM after training if not previously saved by @jeffkinnison in #2880
- Fix TLE - Pt. 2 by @connor-mccorm in #2881
- Tle fix by @connor-mccorm in #2883
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2885
- Convert schema metadata to YAML by @tgaddair in #2884
- Automatically infer vector_size for vector features when not provided by @tgaddair in #2888
- Support MLFlowCallback logging to an existing run by @tgaddair in #2892
- Fix dataset synthesizer by @connor-mccorm in #2894
- Add a clear error message about invalid column names in GBM datasets by @jeffkinnison in #2879
- Explicitly track all metrics related to the best evaluation in the progress tracker. by @justinxzhao in #2827
- Added DistributedStrategy interface with support for DDP by @tgaddair in #2890
- Adopt PyTorch official LRScheduler API by @tgaddair in #2877
- Annotate Confusion Matrix with updated cmap by @arnavgarg1 in #2899
- Dynamically resize confusion matrix and f1 plots by @arnavgarg1 in #2900
- Update backward compatibility tests for LR progress tracker changes made in #2877. by @justinxzhao in #2904
- fix: Fix vague initializer JSON schema titles. by @ksbrar in #2909
- Support Distributed Training And Ray Tune with Ray 2.1 by @arnavgarg1 in #2709
- Expand vision models to support pre-trained models by @jimthompson5802 in #2408
- Add ECD Descriptions by @connor-mccorm in #2897
- Simplify titanic example to read config in-line, and skip saving processed input. by @justinxzhao in #2912
- Adds quick fix for pretrained models not loading by modifying state_dict keys on load. by @dantreiman in #2911
- fix: Schema split conditions should pass in [TYPE] and not string by @hungcs in #2917
- Refactor metrics and metric tables and support adding more in-training metrics. by @justinxzhao in #2901
- Updated AutoML configs for latest schema and added validation tests by @tgaddair in #2921
- Adds backwards compatibility for legacy image encoders by @tgaddair in #2916
- Pin Torch to `>=1.13.0` by @connor-mccorm in #2914
- Hyperopt invalid GBM config by @jppgks in #2926
- Store mlflow tracking URI to ensure consistency across processes by @tgaddair in #2927
- Update automl heuristics for fine-tuning and multi-modal tasks by @tgaddair in #2922
- Bump torch version for benchmark tests by @connor-mccorm in #2929
- Fix signing key by @tgaddair in #2928
- Adds safe_move_directory to fs_utils by @arnavgarg1 in #2931
- Added separate AutoML APIs for feature inference and config generation by @tgaddair in #2932
- Dynamic resizing for Confusion Matrix, Brier, F1 Plot, etc. by @arnavgarg1 in #2936
- Raise RuntimeError only for category output features with vocab size 1 by @arnavgarg1 in #2923
- Bump min python to 3.8 by @tgaddair in #2930
- Evaluate training set in the training loop (GBM) by @jppgks in #2907
- [automl] Exclude text fields with low avg words by @tgaddair in #2941
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #2944
- Fix pre-commit by removing manually specified blacken-docs dep. by @justinxzhao in #2949
- Rotate Brier Plot X-axis labels to 45 degree angle by @arnavgarg1 in #2948
- Retry HuggingFace pretrained model download on failure by @jeffkinnison in #2951
- Disable AUROC for CATEGORY features. by @justinxzhao in #2950
- Deactivate GBM random forest boosting type by @jeffkinnison in #2954
- Make batch_size=auto the default by @tgaddair in #2845
- Twitter bots test small improvements by @dantreiman in #2955
- Disable bagging when using GOSS GBM boosting type by @jeffkinnison in #2956
- Add missing standardize_image key to metadata by @jppgks in #2946
- Integrated Gradients: reset sample_ratio to 1.0 if set by @jppgks in #2945
- Increase CI pytest time out to 75 minutes by @jimthompson5802 in #2958
- Add `sacremoses` as a dependency for transformer_xl encoder by @arnavgarg1 in #2961
- Move all config validation to its own standalone module, `config_validation`. by @justinxzhao in #2959
- Fixes longformer encoder by passing in pretrained_kwargs correctly by @arnavgarg1 in #2963
- Expected Impact Calibration by @connor-mccorm in #2960
- Update Camembert by @geoffreyangus in #2966
- fix: Fix `epochs` suggested range by @ksbrar in #2965
- fix: enable binary dense encoder by @abidwael in #2957
- GBM DART boosting type incompatible with early stopping by @jeffkinnison in #2964
- Improving metadata config descriptions by @w4nderlust in #2933
- Fix ludwig-gpu image by @tgaddair in #2974
- Skip test_ray_outputs by @arnavgarg1 in #2935
- Enable custom HF BERT models with default tokenizer config by @geoffreyangus in #2973
- Update CamemBERT in schema by @geoffreyangus in #2975
- Set reduce_output to sum for XLM encoder by @arnavgarg1 in #2972
- Skipped mercedes_benz_greener.ecd.yaml benchmark test by @tgaddair in #2980
- Add sentencepiece as a requirement for MT5 text encoder by @arnavgarg1 in #2967
- Disable CTRL Encoder by @connor-mccorm in #2976
- MT5 reduce_output can't be cls_pooled - set to sum by default by @arnavgarg1 in #2981
- Populate hyperopt defaults using schema by @arnavgarg1 in #2968
- Revert "Add sentencepiece as a requirement for MT5 text encoder (#2967)" and disable MT5 Encoder by @arnavgarg1 in #2982
- Change default reduce_output strategy to sum for CamemBERT by @arnavgarg1 in #2984
- Set `max_failures` for Tuner to 0 by @geoffreyangus in #2987
- Fix TLE OOM for BERT-like models by @jppgks in #2990
- Reorder Advanced Parameters by @connor-mccorm in #2979
- [Hyperopt] Modify _get_best_model_path to grab it from the Checkpoint object with ExperimentAnalysis by @arnavgarg1 in #2985
- GBM: disable goss boosting type by @jppgks in #2986
- Adds HuggingFace pretrained encoder unit tests by @geoffreyangus in #2962
- [Hyperopt] Set default num_samples based on parameter space by @arnavgarg1 in #2997
- LR Scheduler Adjustments by @connor-mccorm in #2996
- fix: Force populate combiner registry inside of `get_schema` function. by @ksbrar in #2970
- fix: Fix validation and serialization for `Boolean` and `OneOfOptionsField` fields by @ksbrar in #2992
- Ray 2.2 compatibility by @arnavgarg1 in #2910
- Compute fixed text embeddings (e.g., BERT) during preprocessing by @tgaddair in #2867
- Use iloc to fetch first audio value. by @justinxzhao in #3006
- Fix Internal Only Param by @connor-mccorm in #3008
- Ludwig Dataclass by @connor-mccorm in #3005
- Cap batch_size=auto at 128 for CPU training by @tgaddair in #3007
- Added ghost batch norm option for concat combiner by @tgaddair in #3001
- Refactored norm layer and added additional norm at the start of the FCStack by @tgaddair in #3011
- Fix assignment that undoes tensor move to CPU by @jeffkinnison in #3012
- [Explain] Detach inputs before numpy processing by @jppgks in #3014
- Handle CUDA OOMs in explanations with retry and batch size halving by @tgaddair in #3015
- fix: Remove `ecd_ray_legacy` model type alias. by @ksbrar in #3013
- Explain fixes by @jppgks in #3016
- Remove `null` GBM trainer config options by @jeffkinnison in #2989
- Disable reuse_actors in hyperopt by @arnavgarg1 in #3017
- Skip Sarcos dataset during benchmark tests by @arnavgarg1 in #3020
- Explain: improve docstring about IntegratedGradient baseline for number features by @jppgks in #3018
- Upgrade isort to fix pre-commit. by @justinxzhao in #3027
- Limit batch size tuning to ≤20% of dataset size by @geoffreyangus in #3003
- [schema] Mark skip internal only by @jppgks in #3022
- Add specificity metric for binary features by @jppgks in #3025
- Added FSDP distributed strategy by @tgaddair in #3026
New Contributors
- @Marvjowa made their first contribution in #2236
- @Dennis-Rall made their first contribution in #2192
- @abidwael made their first contribution in #2263
- @noahlh made their first contribution in #2284
- @jeffkinnison made their first contribution in #2316
- @andife made their first contribution in #2358
- @alberttorosyan made their first contribution in #2413
- @herrmann made their first contribution in #2746
- @drishi made their first contribution in #2725
- @TrellixVulnTeam made their first contribution in #2770
- @rudolfolah made their first contribution in #2852
Full Changelog: v0.5.3...v0.7.beta