Release v4.2.0 · tensorflow/datasets

API:

Add tfds build to the CLI. See documentation.
DownloadManager now returns pathlib-like objects
Datasets returned by tfds.as_numpy are compatible with len(ds)
New tfds.features.Dataset to represent nested datasets
Add tfds.ReadConfig(add_tfds_id=True) to add a unique id to each example, available as ex['tfds_id'] (e.g. b'train.tfrecord-00012-of-01024__123')
Add num_parallel_calls option to tfds.ReadConfig to override the default AUTOTUNE option
tfds.ImageFolder now supports tfds.decode.SkipDecoder
Add multichannel audio support to tfds.features.Audio
Better tfds.as_dataframe visualization (ffmpeg video if installed, bounding boxes,...)
Add try_gcs to tfds.builder(..., try_gcs=True)
Simpler BuilderConfig definition: the class-level VERSION and RELEASE_NOTES are applied to all BuilderConfigs. The config description is now optional.
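
As an illustration of the ex['tfds_id'] value listed above, the sketch below splits the example id b'train.tfrecord-00012-of-01024__123' into its visible parts. It uses only the standard library and does not call tfds. The names TfdsId and parse_tfds_id are hypothetical, and reading the trailing integer as a position within the shard is an assumption based on the example string, not a documented guarantee.

```python
from typing import NamedTuple


class TfdsId(NamedTuple):
    """Parts of a tfds_id string (hypothetical helper, not part of tfds)."""
    shard_file: str   # e.g. "train.tfrecord-00012-of-01024"
    shard_index: int  # e.g. 12
    num_shards: int   # e.g. 1024
    position: int     # integer after "__" (assumed: position within the shard)


def parse_tfds_id(raw: bytes) -> TfdsId:
    """Split an id shaped like b'<file>-<shard>-of-<total>__<n>' into fields."""
    text = raw.decode("utf-8")
    shard_file, sep, position = text.rpartition("__")
    if not sep:
        raise ValueError(f"no '__' separator in tfds_id: {text!r}")
    prefix, sep, num_shards = shard_file.rpartition("-of-")
    if not sep:
        raise ValueError(f"no '-of-' shard marker in tfds_id: {text!r}")
    _, _, shard_index = prefix.rpartition("-")
    return TfdsId(shard_file, int(shard_index), int(num_shards), int(position))


parsed = parse_tfds_id(b"train.tfrecord-00012-of-01024__123")
print(parsed)
```

Treating the id as an opaque unique key is the safer default; parsing it like this is mainly useful for debugging which shard an example came from.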

Breaking compatibility changes:

Removed configs for all text datasets. Only the plain text version is kept. For example: multi_nli/plain_text -> multi_nli.
To guarantee better determinism, new validations are performed on the keys when creating a dataset (to avoid filenames as keys (non-deterministic) and restrict keys to str, bytes and int). New errors likely indicate an issue in the dataset implementation.
tfds.core.benchmark now returns a pd.DataFrame (instead of a dict)
tfds.units is not visible anymore from the public API
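
To make the key restriction above concrete, here is a minimal sketch of that kind of check. It is not the validation code used by tfds; check_example_key and is_valid_key are hypothetical names, and the sketch only rejects path objects by type, so a filename passed as a plain str would not be caught here.

```python
import os
from pathlib import PurePosixPath

# Key types named in the release note.
ALLOWED_KEY_TYPES = (str, bytes, int)


def check_example_key(key):
    """Raise TypeError for keys that are path objects or of an unsupported type."""
    if isinstance(key, os.PathLike):
        # Paths often embed machine-specific directories, so they make
        # poor deterministic keys.
        raise TypeError(f"path objects are not valid keys: {key!r}")
    if not isinstance(key, ALLOWED_KEY_TYPES):
        raise TypeError(f"key must be str, bytes or int, got {type(key).__name__}")
    return key


def is_valid_key(key) -> bool:
    """Return True if check_example_key accepts the key."""
    try:
        check_example_key(key)
    except TypeError:
        return False
    return True


for candidate in ("img_0001", b"id-7", 42, 1.5, PurePosixPath("/tmp/a.png")):
    print(repr(candidate), is_valid_key(candidate))
```

A dataset that fails such a check can usually be fixed by yielding a stable identifier (an index or a record id) as the key instead of a path.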

Bug fixes:

Support 0-len sequence with images of dynamic shape (Fix #2616)
Progress bar correctly updated when copying files.
Many bug fixes (GPath consistency with pathlib, s3 compatibility, TQDM visual artifacts, GCS crash on windows, re-download when checksums updated,...)
Better debugging and error messages (e.g. human readable size,...)
Allow max_examples_per_splits=0 in tfds build --max_examples_per_splits=0 to test _split_generators only (without _generate_examples).

And of course, many new datasets and dataset updates.

Thank you to the community for their many valuable contributions and for supporting us in this project!
