Looks like it's impossible to load local ChromaDB to HF. #168

Open
kojinglick-ctec opened this issue Sep 6, 2024 · 4 comments · May be fixed by #171

@kojinglick-ctec

❯ cdp export "file://chroma_data/diftg" | cdp ds-put "hf://kojinglick-ctec/text-embedding-exploration"
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/[email protected]/Dev/latent-space/env/lib/python3.10/site-packages/chroma_dp/hugging │
│ face/__init__.py:342 in hf_export                                                                │
│                                                                                                  │
│   339 │   │   │   for key in doc.metadata.keys():                                                │
│   340 │   │   │   │   if f"metadata.{key}" not in features:                                      │
│   341 │   │   │   │   │   features[f"metadata.{key}"] = _infer_hf_type(doc.metadata[key])        │
│ ❱ 342 │   │   │   │   _batch[f"metadata.{key}"].append(doc.metadata[key])                        │
│   343 │   │                                                                                      │
│   344 │   │   if len(_batch["document"]) >= _batch_size:                                         │
│   345 │   │   │   if dataset is None:                                                            │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │         _batch = {                                                                           │ │
│ │                  │   'id': ['doc_0'],                                                        │ │
│ │                  │   'document': [                                                           │ │
│ │                  │   │   'I have some tips for anyone who wants to get into high yield       │ │
│ │                  container gardening'+13                                                     │ │
│ │                  │   ],                                                                      │ │
│ │                  │   'embedding': [                                                          │ │
│ │                  │   │   [                                                                   │ │
│ │                  │   │   │   -0.06528378278017044,                                           │ │
│ │                  │   │   │   0.022876497358083725,                                           │ │
│ │                  │   │   │   -0.06239813193678856,                                           │ │
│ │                  │   │   │   0.03368566185235977,                                            │ │
│ │                  │   │   │   -0.0018341012764722109,                                         │ │
│ │                  │   │   │   -0.04636296629905701,                                           │ │
│ │                  │   │   │   -0.008506285957992077,                                          │ │
│ │                  │   │   │   0.0280376635491848,                                             │ │
│ │                  │   │   │   -0.026711950078606606,                                          │ │
│ │                  │   │   │   0.029601674526929855,                                           │ │
│ │                  │   │   │   ... +1014                                                       │ │
│ │                  │   │   ]                                                                   │ │
│ │                  │   ]                                                                       │ │
│ │                  }                                                                           │ │
│ │    _batch_size = 100                                                                         │ │
│ │       _dataset = 'kojinglick-ctec/text-embedding-exploration'                                │ │
│ │   _doc_feature = 'text_chunk'                                                                │ │
│ │ _embed_feature = 'embedding'                                                                 │ │
│ │        _hf_uri = HFImportUri(                                                                │ │
│ │                  │   dataset='kojinglick-ctec/text-embedding-exploration',                   │ │
│ │                  │   dataset_name='/text-embedding-exploration',                             │ │
│ │                  │   limit=None,                                                             │ │
│ │                  │   offset=None,                                                            │ │
│ │                  │   split=None,                                                             │ │
│ │                  │   stream=False,                                                           │ │
│ │                  │   id_feature=None,                                                        │ │
│ │                  │   doc_feature=None,                                                       │ │
│ │                  │   embed_feature=None,                                                     │ │
│ │                  │   is_remote=True,                                                         │ │
│ │                  │   meta_features=None,                                                     │ │
│ │                  │   private=False,                                                          │ │
│ │                  │   batch_size=100                                                          │ │
│ │                  )                                                                           │ │
│ │    _id_feature = 'id'                                                                        │ │
│ │         _limit = -1                                                                          │ │
│ │ _meta_features = []                                                                          │ │
│ │        _offset = 0                                                                           │ │
│ │       _private = False                                                                       │ │
│ │         _split = 'train'                                                                     │ │
│ │     batch_size = 100                                                                         │ │
│ │        dataset = None                                                                        │ │
│ │            doc = EmbeddableTextResource(                                                     │ │
│ │                  │   id='doc_0',                                                             │ │
│ │                  │   metadata={'author': 'ECO', 'title': 'CONTAINER GARDENING'},             │ │
│ │                  │   embedding=[                                                             │ │
│ │                  │   │   -0.06528378278017044,                                               │ │
│ │                  │   │   0.022876497358083725,                                               │ │
│ │                  │   │   -0.06239813193678856,                                               │ │
│ │                  │   │   0.03368566185235977,                                                │ │
│ │                  │   │   -0.0018341012764722109,                                             │ │
│ │                  │   │   -0.04636296629905701,                                               │ │
│ │                  │   │   -0.008506285957992077,                                              │ │
│ │                  │   │   0.0280376635491848,                                                 │ │
│ │                  │   │   -0.026711950078606606,                                              │ │
│ │                  │   │   0.029601674526929855,                                               │ │
│ │                  │   │   ... +1014                                                           │ │
│ │                  │   ],                                                                      │ │
│ │                  │   text_chunk='I have some tips for anyone who wants to get into high      │ │
│ │                  yield container gardening'+13                                               │ │
│ │                  )                                                                           │ │
│ │    doc_feature = 'text_chunk'                                                                │ │
│ │  embed_feature = 'embedding'                                                                 │ │
│ │       features = {                                                                           │ │
│ │                  │   'id': Value(dtype='string', id=None),                                   │ │
│ │                  │   'embedding': Sequence(                                                  │ │
│ │                  │   │   feature=Value(dtype='float32', id=None),                            │ │
│ │                  │   │   length=-1,                                                          │ │
│ │                  │   │   id=None                                                             │ │
│ │                  │   ),                                                                      │ │
│ │                  │   'document': Value(dtype='string', id=None),                             │ │
│ │                  │   'metadata.author': Value(dtype='string', id=None)                       │ │
│ │                  }                                                                           │ │
│ │     id_feature = 'id'                                                                        │ │
│ │            inf = <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>                │ │
│ │            key = 'author'                                                                    │ │
│ │          limit = -1                                                                          │ │
│ │           line = '{"id":"doc_0","metadata":{"author":"ECO","title":"CONTAINER                │ │
│ │                  GARDENING"},"embeddi'+21838                                                 │ │
│ │  meta_features = []                                                                          │ │
│ │         offset = 0                                                                           │ │
│ │        private = False                                                                       │ │
│ │          split = 'train'                                                                     │ │
│ │            uri = 'hf://kojinglick-ctec/text-embedding-exploration'                           │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 'metadata.author'

It's currently impossible to work out what the --meta-features flag does. I've tried every combination.
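For anyone following along, the failure can be reproduced in isolation. This is only a minimal sketch, not the chroma_dp code: `batch` and `features` are simplified stand-ins modelled on the locals in the traceback, `add_metadata` is a hypothetical helper, and the `"string"` tag is a placeholder for whatever `_infer_hf_type` returns.

```python
# Stand-ins for the locals shown in the traceback (hypothetical, simplified).
batch = {"id": [], "document": [], "embedding": []}
features = {}


def add_metadata(batch, features, metadata):
    """Mirror the shape of the metadata loop shown in the traceback."""
    for key, value in metadata.items():
        column = f"metadata.{key}"
        if column not in features:
            # Placeholder for the inferred Hugging Face feature type.
            features[column] = "string"
        # No list was ever created for this column, so this raises KeyError.
        batch[column].append(value)


try:
    add_metadata(batch, features, {"author": "ECO", "title": "CONTAINER GARDENING"})
    error = None
except KeyError as exc:
    error = exc.args[0]

print(error)     # the missing column name
print(features)  # the feature was registered before the append failed
```

The sketch ends in the same state as the locals above: `features` has gained `metadata.author`, while the batch still has only `id`, `document`, and `embedding`.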

@tazarov
Contributor

tazarov commented Sep 12, 2024

@kojinglick-ctec, thanks for reporting this. I'll look into it and deliver a fix (as needed) this week.

@tazarov tazarov added this to the 0.0.12 milestone Sep 12, 2024
@tazarov tazarov self-assigned this Sep 12, 2024
@tazarov tazarov added the bug Something isn't working label Sep 12, 2024
@tazarov tazarov linked a pull request Sep 12, 2024 that will close this issue
@tazarov
Contributor

tazarov commented Sep 12, 2024

@kojinglick-ctec, can you install the PR version with the fix and test against it?

pip install git+https://github.com/amikos-tech/chromadb-data-pipes.git@trayan-09-12-fix_hf_batch_metadata_keys

I'll try to test it myself later today if you haven't had the chance.

@kojinglick-ctec
Author

I'll give it a shot in the next day or so. Thanks so much for getting to this so quickly.

I'd say go ahead and test it, and I can add a review to the PR!

@kojinglick-ctec
Author

I took a look at the newer version and I'm still getting the same issue.

❯ cdp export file://chroma_data/diftg | cdp ds-put "hf://kojinglick-ctec/test" --private
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/[email protected]/Dev/latent-space/env/lib/python3.10/site-packages/chroma_dp/hugging │
│ face/__init__.py:343 in hf_export                                                                │
│                                                                                                  │
│   340 │   │   │   │   if f"metadata.{key}" not in features:                                      │
│   341 │   │   │   │   │   features[f"metadata.{key}"] = _infer_hf_type(doc.metadata[key])        │
│   342 │   │   │   │   │   _batch[f"metadata.{key}"] = []                                         │
│ ❱ 343 │   │   │   │   _batch[f"metadata.{key}"].append(doc.metadata[key])                        │
│   344 │   │                                                                                      │
│   345 │   │   if len(_batch["document"]) >= _batch_size:                                         │
│   346 │   │   │   if dataset is None:                                                            │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │         _batch = {                                                                           │ │
│ │                  │   'id': ['diftg_1088'],                                                   │ │
│ │                  │   'document': [                                                           │ │
│ │                  │   │   'Fury is a gift - it kills doubt in its tracks and enables you to   │ │
│ │                  act decisively'                                                             │ │
│ │                  │   ],                                                                      │ │
│ │                  │   'embedding': [                                                          │ │
│ │                  │   │   [                                                                   │ │
│ │                  │   │   │   -0.01718083955347538,                                           │ │
│ │                  │   │   │   -0.004701937083154917,                                          │ │
│ │                  │   │   │   -0.04010680317878723,                                           │ │
│ │                  │   │   │   -0.012514934875071049,                                          │ │
│ │                  │   │   │   -0.038050323724746704,                                          │ │
│ │                  │   │   │   -0.06639696657657623,                                           │ │
│ │                  │   │   │   0.05012612044811249,                                            │ │
│ │                  │   │   │   -0.0002598187711555511,                                         │ │
│ │                  │   │   │   0.02284279279410839,                                            │ │
│ │                  │   │   │   -0.005357010290026665,                                          │ │
│ │                  │   │   │   ... +1014                                                       │ │
│ │                  │   │   ]                                                                   │ │
│ │                  │   ]                                                                       │ │
│ │                  }                                                                           │ │
│ │    _batch_size = 100                                                                         │ │
│ │       _dataset = 'kojinglick-ctec/test'                                                      │ │
│ │   _doc_feature = 'text_chunk'                                                                │ │
│ │ _embed_feature = 'embedding'                                                                 │ │
│ │        _hf_uri = HFImportUri(                                                                │ │
│ │                  │   dataset='kojinglick-ctec/test',                                         │ │
│ │                  │   dataset_name='/test',                                                   │ │
│ │                  │   limit=None,                                                             │ │
│ │                  │   offset=None,                                                            │ │
│ │                  │   split=None,                                                             │ │
│ │                  │   stream=False,                                                           │ │
│ │                  │   id_feature=None,                                                        │ │
│ │                  │   doc_feature=None,                                                       │ │
│ │                  │   embed_feature=None,                                                     │ │
│ │                  │   is_remote=True,                                                         │ │
│ │                  │   meta_features=None,                                                     │ │
│ │                  │   private=False,                                                          │ │
│ │                  │   batch_size=100                                                          │ │
│ │                  )                                                                           │ │
│ │    _id_feature = 'id'                                                                        │ │
│ │         _limit = -1                                                                          │ │
│ │ _meta_features = []                                                                          │ │
│ │        _offset = 0                                                                           │ │
│ │       _private = True                                                                        │ │
│ │         _split = 'train'                                                                     │ │
│ │     batch_size = 100                                                                         │ │
│ │        dataset = Dataset({                                                                   │ │
│ │                  │   features: ['id', 'document', 'embedding', 'metadata.author',            │ │
│ │                  'metadata.title'],                                                          │ │
│ │                  │   num_rows: 100                                                           │ │
│ │                  })                                                                          │ │
│ │            doc = EmbeddableTextResource(                                                     │ │
│ │                  │   id='diftg_1088',                                                        │ │
│ │                  │   metadata={'author': '3S', 'title': 'FURY'},                             │ │
│ │                  │   embedding=[                                                             │ │
│ │                  │   │   -0.01718083955347538,                                               │ │
│ │                  │   │   -0.004701937083154917,                                              │ │
│ │                  │   │   -0.04010680317878723,                                               │ │
│ │                  │   │   -0.012514934875071049,                                              │ │
│ │                  │   │   -0.038050323724746704,                                              │ │
│ │                  │   │   -0.06639696657657623,                                               │ │
│ │                  │   │   0.05012612044811249,                                                │ │
│ │                  │   │   -0.0002598187711555511,                                             │ │
│ │                  │   │   0.02284279279410839,                                                │ │
│ │                  │   │   -0.005357010290026665,                                              │ │
│ │                  │   │   ... +1014                                                           │ │
│ │                  │   ],                                                                      │ │
│ │                  │   text_chunk='Fury is a gift - it kills doubt in its tracks and enables   │ │
│ │                  you to act decisively'                                                      │ │
│ │                  )                                                                           │ │
│ │    doc_feature = 'text_chunk'                                                                │ │
│ │  embed_feature = 'embedding'                                                                 │ │
│ │       features = {                                                                           │ │
│ │                  │   'id': Value(dtype='string', id=None),                                   │ │
│ │                  │   'embedding': Sequence(                                                  │ │
│ │                  │   │   feature=Value(dtype='float32', id=None),                            │ │
│ │                  │   │   length=-1,                                                          │ │
│ │                  │   │   id=None                                                             │ │
│ │                  │   ),                                                                      │ │
│ │                  │   'document': Value(dtype='string', id=None),                             │ │
│ │                  │   'metadata.author': Value(dtype='string', id=None),                      │ │
│ │                  │   'metadata.title': Value(dtype='string', id=None)                        │ │
│ │                  }                                                                           │ │
│ │     id_feature = 'id'                                                                        │ │
│ │            inf = <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>                │ │
│ │            key = 'author'                                                                    │ │
│ │          limit = -1                                                                          │ │
│ │           line = '{"id":"diftg_1088","metadata":{"author":"3S","title":"FURY"},"embedding":… │ │
│ │  meta_features = []                                                                          │ │
│ │         offset = 0                                                                           │ │
│ │        private = True                                                                        │ │
│ │          split = 'train'                                                                     │ │
│ │            uri = 'hf://kojinglick-ctec/test'                                                 │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 'metadata.author'

That being said, the behavior has changed when I try to set the --meta-features flag. Quick question about this flag: if I want both "metadata.author" and "metadata.title", how would I set it?
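The locals in this second traceback hint at why the patch may not fully help: `dataset` already holds 100 rows, so one batch has been flushed; `features` still lists both metadata columns; and the current `_batch` contains only `id`, `document`, and `embedding`. If the batch is rebuilt after each flush while `features` persists (an assumption read off the locals, not confirmed against the source), the `not in features` guard skips the list initialisation for the second batch. Below is a rough sketch of that failure and one possible guard using `dict.setdefault`. `new_batch`, `add_doc`, `run`, the `"text"` key, and the `"string"` type tag are all hypothetical stand-ins, not chroma_dp names.

```python
def new_batch():
    # Hypothetical: the batch as it would look right after a flush.
    return {"id": [], "document": [], "embedding": []}


def add_doc(batch, features, doc, safe):
    batch["id"].append(doc["id"])
    batch["document"].append(doc["text"])
    batch["embedding"].append(doc["embedding"])
    for key, value in doc["metadata"].items():
        column = f"metadata.{key}"
        if column not in features:
            features[column] = "string"  # placeholder for the inferred type
            batch[column] = []           # only runs the first time a key is seen
        if safe:
            # Create the list on demand, whether or not the feature is known.
            batch.setdefault(column, []).append(value)
        else:
            batch[column].append(value)


docs = [
    {"id": "doc_0", "text": "a", "embedding": [0.1],
     "metadata": {"author": "ECO", "title": "CONTAINER GARDENING"}},
    {"id": "diftg_1088", "text": "b", "embedding": [0.2],
     "metadata": {"author": "3S", "title": "FURY"}},
]


def run(safe, batch_size=1):
    features, batch, flushed = {}, new_batch(), []
    for doc in docs:
        add_doc(batch, features, doc, safe)
        if len(batch["document"]) >= batch_size:
            flushed.append(batch)
            batch = new_batch()  # features persists, the batch does not
    return flushed


try:
    run(safe=False)
    unsafe_error = None
except KeyError as exc:
    unsafe_error = exc.args[0]

flushed = run(safe=True)
print(unsafe_error)
print(flushed[1]["metadata.author"])
```

With `safe=False` the second document fails on `metadata.author`, matching the traceback; with `safe=True` both batches carry their metadata columns. A batch size of 1 is used here only to trigger the flush quickly.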
