
V4 changes, Indra adaptors #2733

Merged
merged 88 commits into from
Apr 4, 2024
Changes from 86 commits
88 commits
e4a5b71
Rename DeeplakeQuery to Indra.
khustup2 Jan 5, 2024
ad30bb8
Fixed black.
khustup2 Jan 5, 2024
c860dd6
Merge branch 'main' into v4
khustup2 Jan 22, 2024
6ca7f5c
Added indra storage provider.
khustup2 Jan 22, 2024
fed36e6
Careful switch to indra provider.
khustup2 Jan 22, 2024
3766f29
Fixed lint.
khustup2 Jan 22, 2024
ffee1ae
One more fix.
khustup2 Jan 22, 2024
291c650
Fixed black.
khustup2 Jan 23, 2024
6516620
Switch local read only to Indra.
khustup2 Jan 23, 2024
d1bff43
Fixed black.
khustup2 Jan 23, 2024
2842836
Added v4 arg.
khustup2 Feb 2, 2024
d488ab7
Merge branch 'main' into v4
khustup2 Feb 2, 2024
abfed5b
Fixed darglint.
khustup2 Feb 3, 2024
5f4de19
Merge branch 'main' into v4
khustup2 Feb 4, 2024
de92f24
Fix.
khustup2 Feb 4, 2024
aa8b0de
Merge branch 'main' into v4
khustup2 Feb 7, 2024
3c192a1
Merge branch 'main' into v4
khustup2 Feb 13, 2024
6260cb7
Minor.
khustup2 Feb 13, 2024
36fef4d
Cleanup indra provider.
khustup2 Feb 14, 2024
da2af7e
Fixes.
khustup2 Feb 15, 2024
1d371b5
Bump libdeeplake version.
khustup2 Feb 15, 2024
e401079
v4 CI Tests
zaaram Feb 15, 2024
88838aa
Merge pull request #2771 from activeloopai/v4-DevOps
khustup2 Feb 15, 2024
6bde53a
Merge branch 'main' into v4
khustup2 Feb 18, 2024
095e4f0
Merge branch 'v4' of github.com:activeloopai/deeplake into v4
khustup2 Feb 18, 2024
7e78642
Bump libdeeplake version for test.
khustup2 Feb 18, 2024
2599b9b
Reimplement rename with deecopy+delete.
khustup2 Feb 19, 2024
c268f32
Bump libdeeplake version.
khustup2 Feb 20, 2024
4116be2
Merge branch 'main' into v4
khustup2 Feb 20, 2024
38d9995
Merge branch 'main' of github.com:activeloopai/deeplake into main
khustup2 Feb 24, 2024
03f0f2d
Switch to batch request for indra tensor bytes.
khustup2 Feb 25, 2024
6c6b0db
Bump libdeeplake version.
khustup2 Feb 25, 2024
278a1b3
Merge branch 'batch-bytes' into v4
khustup2 Feb 25, 2024
c3c5964
Fixed tests.
khustup2 Feb 26, 2024
9dd47ae
Merge branch 'batch-bytes' into v4
khustup2 Feb 26, 2024
8c80b59
Merge branch 'main' into v4
khustup2 Feb 29, 2024
5de5584
Replace v4 flag with indra flag.
khustup2 Feb 29, 2024
1bb14b4
Indra read only view.
khustup2 Mar 1, 2024
cb61ed9
Fixed error.
khustup2 Mar 1, 2024
6b0f2cf
Fixed black.
khustup2 Mar 1, 2024
f9800fd
Fix mypy.
khustup2 Mar 1, 2024
cc8d6f1
Fixes.
khustup2 Mar 3, 2024
34b7d4b
Step to get rid of deeplake_ds.
khustup2 Mar 3, 2024
93da017
Last.
khustup2 Mar 3, 2024
d6675b2
Remove deeplake_ds.
khustup2 Mar 4, 2024
80afefd
Fixed black.
khustup2 Mar 4, 2024
f262dbd
Fixed failures.
khustup2 Mar 4, 2024
c4e42ff
Further adaptations.
khustup2 Mar 4, 2024
390ee9e
Fixed linter.
khustup2 Mar 4, 2024
c70a940
Fix mypy.
khustup2 Mar 4, 2024
46aa379
More.
khustup2 Mar 5, 2024
a997e1c
Fixed linter.
khustup2 Mar 5, 2024
e723c14
Merge branch 'main' into v4
khustup2 Mar 7, 2024
598d7d0
Bump libdeeplake version.
khustup2 Mar 7, 2024
a09bc4f
Final fixes for indra adaptors.
khustup2 Mar 7, 2024
f4512af
token and cache_size errors fixed.
khustup2 Mar 7, 2024
fd5287c
Bump libdeeplake version.
khustup2 Mar 8, 2024
baef9d8
Merge branch 'main' into v4
khustup2 Mar 8, 2024
ec92f04
Handle index in non-linear views.
khustup2 Mar 11, 2024
e99c7dd
Merge branch 'main' into v4
khustup2 Mar 14, 2024
548b8a9
Merge branch 'main' into v4
khustup2 Mar 16, 2024
251d057
Prepare indra materialization usage.
khustup2 Mar 17, 2024
bb5e833
Materialize indra view.
khustup2 Mar 17, 2024
cae1c23
Added indra view load test with optimize=True
khustup2 Mar 17, 2024
74965dd
Fixed black.
khustup2 Mar 17, 2024
32e8563
Added indra flag to ingest api.
khustup2 Mar 21, 2024
539e090
Merge branch 'main' into v4
khustup2 Mar 21, 2024
fde3acc
Fix.
khustup2 Mar 21, 2024
deed131
Merge branch 'main' into v4
khustup2 Mar 22, 2024
b2a5716
Merge branch 'main' into v4
khustup2 Mar 23, 2024
ee31dfa
Merge branch 'v4' of github.com:activeloopai/deeplake into v4
khustup2 Mar 23, 2024
6059c6e
Added ingest dataframe with indra.
khustup2 Mar 24, 2024
7d13829
Adapt test to indra.
khustup2 Mar 24, 2024
8f79d8d
Bump libdeeplake version.
khustup2 Mar 25, 2024
42c8e6f
Bump libdeeplake version.
khustup2 Mar 26, 2024
8bb0c61
Merge branch 'main' into v4
khustup2 Apr 1, 2024
f2a4a1a
Bump libdeeplake version.
khustup2 Apr 1, 2024
9127617
Merge branch 'main' into v4
khustup2 Apr 2, 2024
2e0cc0e
set endpoint.
khustup2 Apr 2, 2024
5078432
Merge branch 'main' into v4
khustup2 Apr 2, 2024
8568cb7
Bump libdeeplake version.
khustup2 Apr 2, 2024
69a00e6
Fixed linter.
khustup2 Apr 2, 2024
7522062
Reset workflow.
khustup Apr 3, 2024
880dbbc
Restore shuffling.
khustup Apr 3, 2024
03b6f6f
Merge branch 'main' into v4
khustup2 Apr 3, 2024
f18cd6c
Merge branch 'v4' of github.com:activeloopai/deeplake into v4
khustup2 Apr 3, 2024
6578d81
Bump libdeeplake version.
khustup2 Apr 4, 2024
132f78e
Fixed sonar.
khustup2 Apr 4, 2024
7 changes: 7 additions & 0 deletions conftest.py
@@ -19,6 +19,13 @@

deeplake.client.config.USE_STAGING_ENVIRONMENT = True

try:
from indra import api # type: ignore

api.backend.set_endpoint("https://app-staging.activeloop.dev")
except ImportError:
pass

from deeplake.constants import *
from deeplake.tests.common import SESSION_ID
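The conftest change above guards the `indra` import so the test suite runs whether or not the optional package is installed. A generic sketch of the same pattern, using a hypothetical `optional_import` helper (not part of the PR):

```python
def optional_import(module_name):
    """Return the imported module, or None if it is not installed.

    Mirrors the try/except ImportError guard added to conftest.py:
    optional backends are configured only when actually present.
    """
    try:
        return __import__(module_name)
    except ImportError:
        return None


indra = optional_import("indra")  # None on machines without the indra package
if indra is not None:
    # Only then is it safe to touch indra-specific configuration,
    # e.g. pointing the backend at a staging endpoint.
    pass
```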

77 changes: 68 additions & 9 deletions deeplake/api/dataset.py
@@ -13,6 +13,7 @@
from deeplake.auto.unstructured.yolo.yolo import YoloDataset
from deeplake.client.log import logger
from deeplake.core.dataset import Dataset, dataset_factory
from deeplake.core.dataset.indra_dataset_view import IndraDatasetView
from deeplake.core.tensor import Tensor
from deeplake.core.meta.dataset_meta import DatasetMeta
from deeplake.util.connect_dataset import connect_dataset_entry
@@ -43,6 +44,7 @@
DEFAULT_READONLY,
DATASET_META_FILENAME,
DATASET_LOCK_FILENAME,
USE_INDRA,
)
from deeplake.util.access_method import (
check_access_method,
@@ -101,6 +103,7 @@
lock_enabled: Optional[bool] = True,
lock_timeout: Optional[int] = 0,
index_params: Optional[Dict[str, Union[int, str]]] = None,
indra: bool = USE_INDRA,
):
"""Returns a :class:`~deeplake.core.dataset.Dataset` object referencing either a new or existing dataset.

@@ -173,6 +176,7 @@
lock_timeout (int): Number of seconds to wait before throwing a LockException. If None, wait indefinitely
lock_enabled (bool): If true, the dataset manages a write lock. NOTE: Only set to False if you are managing concurrent access externally
index_params: Optional[Dict[str, Union[int, str]]] = None : The index parameters used while creating vector store is passed down to dataset.
indra (bool): Flag indicating whether indra api should be used to create the dataset. Defaults to false

..
# noqa: DAR101
@@ -225,6 +229,7 @@
token=token,
memory_cache_size=memory_cache_size,
local_cache_size=local_cache_size,
indra=indra,
)

feature_report_path(path, "dataset", {"Overwrite": overwrite}, token=token)
@@ -378,6 +383,7 @@
lock_timeout: Optional[int] = 0,
verbose: bool = True,
index_params: Optional[Dict[str, Union[int, str]]] = None,
indra: bool = USE_INDRA,
) -> Dataset:
"""Creates an empty dataset

@@ -402,6 +408,7 @@
lock_timeout (int): Number of seconds to wait before throwing a LockException. If None, wait indefinitely
lock_enabled (bool): If true, the dataset manages a write lock. NOTE: Only set to False if you are managing concurrent access externally.
index_params: Optional[Dict[str, Union[int, str]]]: Index parameters used while creating vector store, passed down to dataset.
indra (bool): Flag indicating whether indra api should be used to create the dataset. Defaults to false

Returns:
Dataset: Dataset created using the arguments provided.
@@ -441,6 +448,7 @@
token=token,
memory_cache_size=memory_cache_size,
local_cache_size=local_cache_size,
indra=indra,
)

feature_report_path(
@@ -508,6 +516,7 @@
access_method: str = "stream",
unlink: bool = False,
reset: bool = False,
indra: bool = USE_INDRA,
check_integrity: Optional[bool] = None,
lock_timeout: Optional[int] = 0,
lock_enabled: Optional[bool] = True,
@@ -578,6 +587,7 @@
setting ``reset=True`` will reset HEAD changes and load the previous version.
check_integrity (bool, Optional): Performs an integrity check by default (None) if the dataset has 20 or fewer tensors.
Set to ``True`` to force integrity check, ``False`` to skip integrity check.
indra (bool): Flag indicating whether indra api should be used to create the dataset. Defaults to false

..
# noqa: DAR101
@@ -624,6 +634,7 @@
token=token,
memory_cache_size=memory_cache_size,
local_cache_size=local_cache_size,
indra=indra,
)
feature_report_path(
path,
@@ -644,6 +655,12 @@
f"A Deep Lake dataset does not exist at the given path ({path}). Check the path provided or in case you want to create a new dataset, use deeplake.empty()."
)

if indra and read_only:
from indra import api # type: ignore

Codecov warning: added line deeplake/api/dataset.py#L659 not covered by tests.

ids = api.load_from_storage(storage.core)
return IndraDatasetView(indra_ds=ids)
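The branch above routes loads to the indra engine only when both `indra` and `read_only` are set; writable loads stay on the classic path. A toy sketch of that dispatch rule (illustrative only, not Deep Lake's actual code):

```python
def choose_backend(indra: bool, read_only: bool) -> str:
    """Mirror the guard in deeplake.load: the indra engine serves
    read-only views; any writable load uses the classic engine."""
    if indra and read_only:
        return "indra"
    return "classic"


# The indra path is taken only for read-only access.
assert choose_backend(indra=True, read_only=True) == "indra"
assert choose_backend(indra=True, read_only=False) == "classic"
assert choose_backend(indra=False, read_only=True) == "classic"
```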

Codecov warning: added lines deeplake/api/dataset.py#L661-L662 not covered by tests.

dataset_kwargs: Dict[str, Union[None, str, bool, int, Dict]] = {
"path": path,
"read_only": read_only,
@@ -812,10 +829,10 @@

feature_report_path(old_path, "rename", {}, token=token)

ds = deeplake.load(old_path, verbose=False, token=token, creds=creds)
ds.rename(new_path)
deeplake.deepcopy(old_path, new_path, verbose=False, token=token, creds=creds)
deeplake.delete(old_path, token=token, creds=creds)

return ds # type: ignore
return deeplake.load(new_path, verbose=False, token=token, creds=creds)
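The hunk above (commit "Reimplement rename with deecopy+delete") replaces in-place rename with copy-then-delete, returning a handle loaded from the new path. A toy illustration of the same ordering on a plain dict-backed store (hypothetical helper, not Deep Lake storage code):

```python
import copy


def rename_entry(store: dict, old_key: str, new_key: str):
    """Copy-then-delete rename, mirroring the deepcopy+delete approach.

    The copy happens first so a failure never leaves the data missing;
    the original is removed only after the new entry exists.
    """
    if old_key not in store:
        raise KeyError(old_key)
    if new_key in store:
        raise ValueError(f"{new_key} already exists")
    store[new_key] = copy.deepcopy(store[old_key])  # 1. duplicate under new name
    del store[old_key]                              # 2. delete the original
    return store[new_key]                           # 3. hand back the new copy


store = {"old/path": [1, 2, 3]}
renamed = rename_entry(store, "old/path", "new/path")
```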

@staticmethod
@spinner
@@ -1491,6 +1508,7 @@
num_workers: int = 0,
token: Optional[str] = None,
connect_kwargs: Optional[Dict] = None,
indra: bool = USE_INDRA,
**dataset_kwargs,
) -> Dataset:
"""Ingest images and annotations in COCO format to a Deep Lake Dataset. The source data can be stored locally or in the cloud.
@@ -1544,6 +1562,7 @@
num_workers (int): The number of workers to use for ingestion. Set to ``0`` by default.
token (Optional[str]): The token to use for accessing the dataset and/or connecting it to Deep Lake.
connect_kwargs (Optional[Dict]): If specified, the dataset will be connected to Deep Lake, and connect_kwargs will be passed to :meth:`Dataset.connect <deeplake.core.dataset.Dataset.connect>`.
indra (bool): Flag indicating whether indra api should be used to create the dataset. Defaults to false
**dataset_kwargs: Any arguments passed here will be forwarded to the dataset creator function. See :func:`deeplake.empty`.

Returns:
@@ -1582,7 +1601,12 @@
structure = unstructured.prepare_structure(inspect_limit)

ds = deeplake.empty(
dest, creds=dest_creds, verbose=False, token=token, **dataset_kwargs
dest,
creds=dest_creds,
verbose=False,
token=token,
indra=indra,
**dataset_kwargs,
)
if connect_kwargs is not None:
connect_kwargs["token"] = token or connect_kwargs.get("token")
@@ -1613,6 +1637,7 @@
num_workers: int = 0,
token: Optional[str] = None,
connect_kwargs: Optional[Dict] = None,
indra: bool = USE_INDRA,
**dataset_kwargs,
) -> Dataset:
"""Ingest images and annotations (bounding boxes or polygons) in YOLO format to a Deep Lake Dataset. The source data can be stored locally or in the cloud.
@@ -1661,6 +1686,7 @@
num_workers (int): The number of workers to use for ingestion. Set to ``0`` by default.
token (Optional[str]): The token to use for accessing the dataset and/or connecting it to Deep Lake.
connect_kwargs (Optional[Dict]): If specified, the dataset will be connected to Deep Lake, and connect_kwargs will be passed to :meth:`Dataset.connect <deeplake.core.dataset.Dataset.connect>`.
indra (bool): Flag indicating whether indra api should be used to create the dataset. Defaults to false
**dataset_kwargs: Any arguments passed here will be forwarded to the dataset creator function. See :func:`deeplake.empty`.

Returns:
@@ -1708,7 +1734,12 @@
structure = unstructured.prepare_structure()

ds = deeplake.empty(
dest, creds=dest_creds, verbose=False, token=token, **dataset_kwargs
dest,
creds=dest_creds,
verbose=False,
token=token,
indra=indra,
**dataset_kwargs,
)
if connect_kwargs is not None:
connect_kwargs["token"] = token or connect_kwargs.get("token")
@@ -1738,6 +1769,7 @@
shuffle: bool = True,
token: Optional[str] = None,
connect_kwargs: Optional[Dict] = None,
indra: bool = USE_INDRA,
**dataset_kwargs,
) -> Dataset:
"""Ingest a dataset of images from a local folder to a Deep Lake Dataset. Images should be stored in subfolders by class name.
@@ -1758,6 +1790,7 @@
shuffle (bool): Shuffles the input data prior to ingestion. Since data arranged in folders by class is highly non-random, shuffling is important in order to produce optimal results when training. Defaults to ``True``.
token (Optional[str]): The token to use for accessing the dataset.
connect_kwargs (Optional[Dict]): If specified, the dataset will be connected to Deep Lake, and connect_kwargs will be passed to :meth:`Dataset.connect <deeplake.core.dataset.Dataset.connect>`.
indra (bool): Flag indicating whether indra api should be used to create the dataset. Defaults to false
**dataset_kwargs: Any arguments passed here will be forwarded to the dataset creator function see :func:`deeplake.empty`.

Returns:
@@ -1839,6 +1872,7 @@
dest_creds=dest_creds,
progressbar=progressbar,
token=token,
indra=indra,
**dataset_kwargs,
)
return ds
@@ -1861,7 +1895,12 @@
unstructured = ImageClassification(source=src)

ds = deeplake.empty(
dest, creds=dest_creds, token=token, verbose=False, **dataset_kwargs
dest,
creds=dest_creds,
token=token,
verbose=False,
indra=indra,
**dataset_kwargs,
)
if connect_kwargs is not None:
connect_kwargs["token"] = token or connect_kwargs.get("token")
@@ -1892,6 +1931,7 @@
progressbar: bool = True,
summary: bool = True,
shuffle: bool = True,
indra: bool = USE_INDRA,
**dataset_kwargs,
) -> Dataset:
"""Download and ingest a kaggle dataset and store it as a structured dataset to destination.
@@ -1911,6 +1951,7 @@
progressbar (bool): Enables or disables ingestion progress bar. Set to ``True`` by default.
summary (bool): Generates ingestion summary. Set to ``True`` by default.
shuffle (bool): Shuffles the input data prior to ingestion. Since data arranged in folders by class is highly non-random, shuffling is important in order to produce optimal results when training. Defaults to ``True``.
indra (bool): Flag indicating whether indra api should be used to create the dataset. Defaults to false
**dataset_kwargs: Any arguments passed here will be forwarded to the dataset creator function. See :func:`deeplake.dataset`.

Returns:
@@ -1956,6 +1997,7 @@
progressbar=progressbar,
summary=summary,
shuffle=shuffle,
indra=indra,
**dataset_kwargs,
)

@@ -1972,6 +2014,7 @@
progressbar: bool = True,
token: Optional[str] = None,
connect_kwargs: Optional[Dict] = None,
indra: bool = USE_INDRA,
**dataset_kwargs,
):
"""Convert pandas dataframe to a Deep Lake Dataset. The contents of the dataframe can be parsed literally, or can be treated as links to local or cloud files.
@@ -2021,6 +2064,7 @@
progressbar (bool): Enables or disables ingestion progress bar. Set to ``True`` by default.
token (Optional[str]): The token to use for accessing the dataset.
connect_kwargs (Optional[Dict]): A dictionary containing arguments to be passed to the dataset connect method. See :meth:`Dataset.connect`.
indra (bool): Flag indicating whether indra api should be used to create the dataset. Defaults to false
**dataset_kwargs: Any arguments passed here will be forwarded to the dataset creator function. See :func:`deeplake.empty`.

Returns:
@@ -2045,15 +2089,30 @@
structured = DataFrame(src, column_params, src_creds, creds_key)

dest = convert_pathlib_to_string_if_needed(dest)
ds = deeplake.empty(
dest, creds=dest_creds, token=token, verbose=False, **dataset_kwargs
)
if indra:
from indra import api

ds = api.dataset_writer(
dest, creds=dest_creds, token=token, **dataset_kwargs
)

Codecov warning: added lines deeplake/api/dataset.py#L2093 and #L2095 not covered by tests.
else:
ds = deeplake.empty(
dest,
creds=dest_creds,
token=token,
verbose=False,
**dataset_kwargs,
)
if connect_kwargs is not None:
connect_kwargs["token"] = token or connect_kwargs.get("token")
ds.connect(**connect_kwargs)

structured.fill_dataset(ds, progressbar) # type: ignore

if indra:
ids = api.load_from_storage(ds.storage)
return IndraDatasetView(indra_ds=ids)

Codecov warning: added lines deeplake/api/dataset.py#L2113-L2114 not covered by tests.

return ds # type: ignore

@staticmethod
6 changes: 3 additions & 3 deletions deeplake/auto/tests/test_ingestion.py
@@ -231,7 +231,7 @@ def test_csv(memory_ds: Dataset, dataframe_ingestion_data: dict):
assert ds[tensors_names[2]].htype == "text"
assert ds[tensors_names[2]].dtype == str
np.testing.assert_array_equal(
ds[tensors_names[2]].numpy().reshape(-1), df[df_keys[2]].values
np.array(ds[tensors_names[2]].numpy()).reshape(-1), df[df_keys[2]].values
)


@@ -273,7 +273,7 @@ def test_dataframe_basic(
assert ds[df_keys[2]].htype == "text"
assert ds[df_keys[2]].dtype == str
np.testing.assert_array_equal(
ds[df_keys[2]].numpy().reshape(-1), df[df_keys[2]].values
np.array(ds[df_keys[2]].numpy()).reshape(-1), df[df_keys[2]].values
)


@@ -342,7 +342,7 @@ def test_dataframe_array(memory_ds: Dataset):
)

np.testing.assert_array_equal(
ds[df_keys[2]][0:3].numpy().reshape(-1), df[df_keys[2]].values[0:3]
np.array(ds[df_keys[2]][0:3].numpy()).reshape(-1), df[df_keys[2]].values[0:3]
)
assert ds[df_keys[2]].dtype == df[df_keys[2]].dtype
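The test changes above wrap `.numpy()` in `np.array(...)` before `reshape(-1)`, presumably because the indra-backed view can hand back a list of per-sample arrays rather than one ndarray. A minimal sketch of why the wrapper makes both shapes reshape cleanly:

```python
import numpy as np

# A backend returning one array per sample yields a plain Python list;
# np.array(...) stacks it into an ndarray so .reshape(-1) works either way.
per_sample = [np.array(["a"]), np.array(["b"]), np.array(["c"])]  # list, not ndarray
flat = np.array(per_sample).reshape(-1)
np.testing.assert_array_equal(flat, np.array(["a", "b", "c"]))
```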

2 changes: 2 additions & 0 deletions deeplake/constants.py
@@ -352,3 +352,5 @@

# Size of dataset view to expose as indra dataset wrapper.
INDRA_DATASET_SAMPLES_THRESHOLD = 10000000

USE_INDRA = os.environ.get("DEEPLAKE_USE_INDRA", "false").strip().lower() == "true"
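The `USE_INDRA` constant above parses an environment variable: trim whitespace, lowercase, and compare against the literal `"true"`. A sketch of that parsing as a reusable helper (`env_flag` is hypothetical, not part of the PR), which also shows a consequence of the strict comparison: values like `"1"` do not enable the flag.

```python
import os


def env_flag(name: str, default: str = "false") -> bool:
    """Parse a boolean env var the way USE_INDRA does:
    strip whitespace, lowercase, compare against 'true'."""
    return os.environ.get(name, default).strip().lower() == "true"


os.environ["DEEPLAKE_USE_INDRA"] = "  True "
assert env_flag("DEEPLAKE_USE_INDRA") is True   # whitespace and case tolerated

os.environ["DEEPLAKE_USE_INDRA"] = "1"
assert env_flag("DEEPLAKE_USE_INDRA") is False  # only the literal 'true' enables it
```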