Releases: tensorflow/datasets
v4.8.1
Changed
- Added file `valid_tags.txt` to not break builds.
- TFDS no longer relies on TensorFlow DTypes. We chose NumPy DTypes to keep the typing expressiveness while dropping the heavy dependency on TensorFlow. We migrated all our internal datasets. Please migrate accordingly:
  - `tf.bool`: `np.bool_`
  - `tf.string`: `np.str_`
  - `tf.int64`, `tf.int32`, etc.: `np.int64`, `np.int32`, etc.
  - `tf.float64`, `tf.float32`, etc.: `np.float64`, `np.float32`, etc.
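The migration is a one-to-one dtype swap; a minimal sketch of the mapping as a lookup table (illustrative only, using plain NumPy):

```python
import numpy as np

# One-to-one replacement table for the TF -> NumPy dtype migration.
TF_TO_NP = {
    'tf.bool': np.bool_,
    'tf.string': np.str_,
    'tf.int32': np.int32,
    'tf.int64': np.int64,
    'tf.float32': np.float32,
    'tf.float64': np.float64,
}

# e.g. a feature previously declared with dtype=tf.int64 now uses dtype=np.int64
new_dtype = TF_TO_NP['tf.int64']
```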
v4.8.0
Added
- [API] `DatasetBuilder`'s description and citations can be specified in dedicated `README.md` and `CITATIONS.bib` files, within the dataset package (see https://www.tensorflow.org/datasets/add_dataset).
- Tags can be associated to datasets, in the `TAGS.txt` file. For now, they are only used in the generated documentation.
- [API][Experimental] New `ViewBuilder` to define datasets as transformations of existing datasets. Also adds `tfds.transform` with functionality to apply transformations.
- Loggers are also called on `tfds.as_numpy(...)`; the base `Logger` class has a new corresponding method.
- `tfds.core.DatasetBuilder` can have a default limit for the number of simultaneous downloads. `tfds.download.DownloadConfig` can override it.
- `tfds.features.Audio` supports storing raw audio data for lazy decoding.
- The number of shards can be overridden when preparing a dataset: `builder.download_and_prepare(download_config=tfds.download.DownloadConfig(num_shards=42))`. Alternatively, you can configure the min and max shard size if you want TFDS to compute the number of shards for you, but want to have control over the shard sizes.
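As a rough illustration of the size-bound option, a shard count could be derived from a maximum shard size like this (hypothetical helper, not the actual TFDS logic):

```python
def pick_num_shards(total_size, max_shard_size):
    # Smallest shard count that keeps every shard at or under max_shard_size.
    return max(1, -(-total_size // max_shard_size))  # ceiling division

# 10 GiB of data with 1 GiB max shards -> 10 shards
print(pick_num_shards(10 * 1024**3, 1024**3))
```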
Changed
Deprecated
Removed
Fixed
Security
v4.7.0
Added
- [API] Added `TfDataBuilder`, which is handy for storing experimental ad hoc TFDS datasets in notebook-like environments so that they can be versioned, described, and easily shared with teammates.
- [API] Added options to create format-specific dataset builders, including a number of NLP-specific builders.
- [API] Added `tfds.beam.inc_counter` to reduce `beam.metrics.Metrics.counter` boilerplate.
- [API] Added options to group existing TFDS datasets into dataset collections and to perform simple operations over them.
- [Documentation] Various updates.
- [TFDS CLI] Supports custom configs through JSON (e.g. `tfds build my_dataset --config='{"name": "my_custom_config", "description": "Abc"}'`).
- New datasets:
- conll2003
- universal_dependency 2.10
- bucc
- i_naturalist2021
  - mtnt (Machine Translation of Noisy Text)
- placesfull
- tatoeba
- user_libri_audio
- user_libri_text
- xtreme_pos
- yahoo_ltrc
- Updated datasets:
  - C4 was updated to version 3.1.
  - common_voice was updated to a more recent snapshot.
  - wikipedia was updated with the `20220620` snapshot.
- New dataset collections, such as xtreme and LongT5.
Changed
- The base `Logger` class expects more information to be passed to the `as_dataset` method. This should only be relevant to people who have implemented and registered custom `Logger` class(es).
- You can set `DEFAULT_BUILDER_CONFIG_NAME` in a `DatasetBuilder` to change the default config if it shouldn't be the first builder config defined in `BUILDER_CONFIGS`.
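A minimal sketch of how such a default-config lookup might behave (illustrative stand-in, not the actual TFDS internals):

```python
from dataclasses import dataclass

@dataclass
class BuilderConfig:
    name: str

class MyDatasetBuilder:
    BUILDER_CONFIGS = [BuilderConfig('first'), BuilderConfig('second')]
    DEFAULT_BUILDER_CONFIG_NAME = 'second'  # override the implicit default

    @classmethod
    def default_builder_config(cls):
        # Without an explicit name, fall back to the first declared config.
        if cls.DEFAULT_BUILDER_CONFIG_NAME is None:
            return cls.BUILDER_CONFIGS[0]
        return next(c for c in cls.BUILDER_CONFIGS
                    if c.name == cls.DEFAULT_BUILDER_CONFIG_NAME)
```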
Deprecated
Removed
Fixed
- Various datasets
- On Linux, when loading a dataset from a directory that is not your home (`~`) directory, a new `~` directory is no longer created in the current directory (fixes #4117).
Security
v4.6.0
Added
- Support for community datasets on GCS.
- [API] `tfds.builder_from_directory` and `tfds.builder_from_directories`; see https://www.tensorflow.org/datasets/external_tfrecord#directly_from_folder.
- [API] Dash ("-") support in split names.
- [API] `file_format` argument to the `download_and_prepare` method, allowing the user to specify an alternative file format in which to store the prepared data (e.g. "riegeli").
- [API] `file_format` added to the `DatasetInfo` string representation.
- [API] Expose the return value of Beam pipelines. This allows users to read the Beam metrics.
- [API] Expose the feature `tf_example_spec` publicly.
- [API] `doc` kwarg on `Feature`s, to describe a feature.
- [Documentation] Feature descriptions are shown in the TFDS catalog.
- [Documentation] More metadata about HuggingFace datasets in the TFDS catalog.
- [Performance] Parallel load of metadata files.
- [Testing] TFDS tests are now run using GitHub Actions, with misc improvements such as caching and sharding.
- [Testing] Improvements to MockFs.
- New datasets.
Changed
- [API] `num_shards` is now optional in the shard name.
Removed
- TFDS pathlib API, migrated to a self-contained `etils.epath` (see https://github.com/google/etils).
Fixed
- Various datasets.
- Dataset builders that are defined ad hoc (e.g. in Colab).
- Better `DatasetNotFoundError` messages.
- Don't set `deterministic` globally but only locally in interleave, so it applies to interleave and not to all transformations.
- Google Drive downloader.
As always, thank you to all contributors!
v4.5.2
v4.5.1
v4.5.0
This is the last version of TFDS supporting Python 3.6. Future versions will use Python 3.7.
- Better split API:
  - Splits can be selected using shards: `split='train[3shard]'`
  - Underscores are supported in numbers for better readability: `split='train[:500_000]'`
  - Select the union of all splits with `split='all'`
  - `tfds.even_splits` is more precise and flexible:
    - Returns splits of exactly the same size when passed `tfds.even_splits('train', n=3, drop_remainder=True)`
    - Works on subsplits `tfds.even_splits('train[:75%]', n=3)`, or even nested ones
    - Can be composed with other splits: `tfds.even_splits('train', n=3)[0] + 'test'`
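The even-splits behaviour can be sketched as plain index arithmetic (hypothetical helper, not the TFDS implementation):

```python
def even_split_bounds(total, n, drop_remainder=False):
    """Partition `total` examples into `n` contiguous (start, end) ranges."""
    if drop_remainder:
        size = total // n  # every split gets exactly the same size
        return [(i * size, (i + 1) * size) for i in range(n)]
    bounds, start = [], 0
    for i in range(n):
        # Earlier splits absorb the remainder, one extra example each.
        size = total // n + (1 if i < total % n else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

# 10 examples over 3 splits, dropping the remainder -> three splits of 3
print(even_split_bounds(10, 3, drop_remainder=True))
```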
- FeatureConnectors:
  - Faster dataset generation (using tfrecords)
  - Features now have `serialize_example`/`deserialize_example` methods to encode/decode an example to proto: `example_bytes = features.serialize_example(example_data)`
  - `Audio` now supports `encoding='zlib'` for better compression
  - Feature specs are exposed in proto for better compatibility with other languages
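Zlib audio encoding is lossless byte compression; the underlying idea can be sketched with the standard library (illustrative only, not the TFDS codepath):

```python
import zlib
import numpy as np

audio = np.arange(16000, dtype=np.int16)  # fake one-second 16 kHz clip
compressed = zlib.compress(audio.tobytes())

# Decompression restores the exact original samples (lossless round-trip).
restored = np.frombuffer(zlib.decompress(compressed), dtype=np.int16)
assert np.array_equal(audio, restored)
```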
- Better testing:
  - Mock dataset now supports nested datasets
  - Customizable number of sub-examples
- Documentation update:
  - Community datasets: https://www.tensorflow.org/datasets/community_catalog/overview
  - New guide on TFDS and determinism
- RLDS:
  - Nested dataset features are supported
  - New datasets: Robomimic, D4RL Ant Maze, RLU Real World RL, and RLU Atari with ordered episodes
- Misc:
  - Create a beam pipeline using TFDS as input with `tfds.beam.ReadFromTFDS`
  - Support setting the file format in `tfds build --file_format=tfrecord`
  - Typing annotations exposed in `tfds.typing`
  - `tfds.ReadConfig` has a new `assert_cardinality=False` argument to disable cardinality
  - Add `tfds.display_progress_bar(True)` for functional control
  - Support for a huge number of shards (>99999)
  - `DatasetInfo` exposes `.release_notes`
And of course, new datasets, bug fixes,...
Thank you to all our contributors for improving TFDS!
v4.4.0
API:
- Add `PartialDecoding` support, to decode only a subset of the features (for performance)
- The catalog now exposes links to KnowYourData visualisations
- `tfds.as_numpy` supports datasets with `None`
- Datasets generated with `disable_shuffling=True` are now read in generation order.
- Loading datasets from files now supports custom `tfds.features.FeatureConnector`
- `tfds.testing.mock_data` now supports:
  - non-scalar tensors with dtype `tf.string`
  - `builder_from_files` and path-based community datasets
- File format automatically restored (for datasets generated with `tfds.builder(..., file_format=)`).
- Many new reinforcement learning datasets
- Various bug fixes and internal improvements like:
  - Dynamically set the number of worker threads during extraction
  - Update the progress bar during download even if downloads are cached
Dataset creation:
- Add `tfds.features.LabeledImage` for semantic segmentation (like image, but with additional `info.features['image_label'].name` label metadata)
- Add float32 support for `tfds.features.Image` (e.g. for depth maps)
- All FeatureConnectors can now have a `None` dimension anywhere (previously restricted to the first position).
- `tfds.features.Tensor()` can have an arbitrary number of dynamic dimensions (`Tensor(..., shape=(None, None, 3, None))`)
- `tfds.features.Tensor` can now be serialised as bytes instead of float/int values (to allow better compression): `Tensor(..., encoding='zlib')`
- Add script to add TFDS metadata files to existing TF-records (see doc).
- New guide on common implementation gotchas
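A dynamic (`None`) dimension simply means "match any size here"; a shape check consistent with that rule might look like this (hypothetical helper, not TFDS code):

```python
def shape_is_compatible(spec, shape):
    # A `None` entry in the spec matches any size in that dimension.
    return len(spec) == len(shape) and all(
        s is None or s == d for s, d in zip(spec, shape))

print(shape_is_compatible((None, None, 3, None), (480, 640, 3, 10)))  # -> True
print(shape_is_compatible((None, None, 3, None), (480, 640, 4, 10)))  # -> False
```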
Thank you all for your support and contribution!
v4.3.0
API:
• Add `dataset.info.splits['train'].num_shards` to expose the number of shards to the user
• Add `tfds.features.Dataset` to have a field containing sub-datasets (e.g. used in RL datasets)
• Add dtype and `tf.uint16` support for `tfds.features.Video`
• Add `DatasetInfo.license` field for redistribution information
• Better `tfds.benchmark(ds)` (compatible with any iterator, not just `tf.data`; better Colab representation)
Other:
• Faster `tfds.as_numpy()` (avoids an extra `tf.Tensor` <> `np.array` copy)
• Better `tfds.as_dataframe` visualisation (Sequence, ragged tensors, semantic masks with `use_colormap`)
• (experimental) Community dataset support, to allow dynamically importing datasets defined outside the TFDS repository.
• (experimental) Add a Hugging Face compatibility wrapper to use Hugging Face datasets directly in TFDS.
• (experimental) Riegeli format support
• (experimental) Add `DatasetInfo.disable_shuffling` to force examples to be read in generation order.
• Add `.copy` and `.format` methods to GPath objects
• Many bug fixes
Testing:
• Supports custom `BuilderConfig` in `DatasetBuilderTest`
• `DatasetBuilderTest` now has a `dummy_data` class property which can be used in `setUpClass`
• Add `add_tfds_id` and cardinality support to `tfds.testing.mock_data`
And of course, many new datasets and datasets updates.
We would like to thank all the TFDS contributors!
v4.2.0
API:
- Add `tfds build` to the CLI. See documentation.
- `DownloadManager` now returns Pathlib-like objects
- Datasets returned by `tfds.as_numpy` are compatible with `len(ds)`
- New `tfds.features.Dataset` to represent nested datasets
- Add `tfds.ReadConfig(add_tfds_id=True)` to add a unique id to each example, `ex['tfds_id']` (e.g. `b'train.tfrecord-00012-of-01024__123'`)
- Add `num_parallel_calls` option to `tfds.ReadConfig` to overwrite the default `AUTOTUNE` option
- `tfds.ImageFolder` now supports `tfds.decode.SkipDecoding`
- Add multichannel audio support to `tfds.features.Audio`
- Better `tfds.as_dataframe` visualization (ffmpeg video if installed, bounding boxes, ...)
- Add `try_gcs` to `tfds.builder(..., try_gcs=True)`
- Simpler `BuilderConfig` definition: class `VERSION` and `RELEASE_NOTES` are applied to all `BuilderConfig`s. Config description is now optional.
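The `tfds_id` shown above follows a recognizable pattern; a hypothetical reconstruction of it (format inferred from the example only, not from TFDS source):

```python
def make_tfds_id(split, shard_index, num_shards, example_index):
    # Mirrors the example id b'train.tfrecord-00012-of-01024__123':
    # shard index and shard count are zero-padded to five digits.
    return (f'{split}.tfrecord-{shard_index:05d}-of-{num_shards:05d}'
            f'__{example_index}').encode()

print(make_tfds_id('train', 12, 1024, 123))  # -> b'train.tfrecord-00012-of-01024__123'
```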
Breaking compatibility changes:
- Removed configs for all text datasets. Only the plain-text version is kept. For example: `multi_nli/plain_text` -> `multi_nli`.
- To guarantee better determinism, new validations are performed on the keys when creating a dataset (to avoid filenames as keys, which are non-deterministic, and to restrict keys to `str`, `bytes`, and `int`). New errors likely indicate an issue in the dataset implementation.
- `tfds.core.benchmark` now returns a `pd.DataFrame` (instead of a `dict`)
- `tfds.units` is no longer visible from the public API
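The key restriction could be enforced by a check along these lines (a minimal sketch; the real TFDS validation is more involved):

```python
def validate_key(key):
    # Dataset example keys must be deterministic: only str, bytes, or int.
    if not isinstance(key, (str, bytes, int)):
        raise TypeError(
            f'Invalid example key of type {type(key).__name__}; '
            'keys must be str, bytes, or int.')
    return key

print(validate_key('image_00042'))
```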
Bug fixes:
- Support 0-len sequences with images of dynamic shape (fixes #2616)
- Progress bar correctly updated when copying files.
- Many bug fixes (GPath consistency with pathlib, S3 compatibility, TQDM visual artifacts, GCS crash on Windows, re-download when checksums are updated, ...)
- Better debugging and error messages (e.g. human-readable sizes, ...)
- Allow `max_examples_per_splits=0` in `tfds build --max_examples_per_splits=0` to test `_split_generators` only (without `_generate_examples`).
And of course, many new datasets and datasets updates.
Thank you to the community for the many valuable contributions and for supporting us in this project!