Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: allow 1D numpy array #1185

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

tibor-reiss
Copy link
Contributor

Fixes #1002

This is the simplest fix which allows a 1d numpy array to be passed in (additionally, anything which can be converted in the function util.get_vector). This will raise a TypeError if the input is incorrect, which is converted later to WeaviateInvalidInputError.

Note that passing in something else then types.VECTOR, e.g. a numpy array, will raise a mypy error.

@dirkkul
Copy link
Collaborator

dirkkul commented Jul 19, 2024

Hey, thank you for your contribution! Would you be able to add a test here: integration/test_batch_v4.py? Numpy is available as a test dependency

@weaviate-git-bot
Copy link

To avoid any confusion in the future about your contribution to Weaviate, we work with a Contributor License Agreement. If you agree, you can simply add a comment to this PR that you agree with the CLA so that we can merge.

beep boop - the Weaviate bot 👋🤖

PS:
Are you already a member of the Weaviate Slack channel?

@tibor-reiss
Copy link
Contributor Author

agree with the CLA

agree with the CLA

@tibor-reiss tibor-reiss force-pushed the fix_1002_numpy_support branch from b196d53 to 02abfbd Compare July 25, 2024 20:20
@tibor-reiss
Copy link
Contributor Author

tibor-reiss commented Jul 26, 2024

Hi @dirkkul, I think should be fine now. The two previous tests which failed (DB issue and timeout on grpc) were not connected to this PR.

@tibor-reiss tibor-reiss force-pushed the fix_1002_numpy_support branch from 802a74d to a2f116b Compare July 27, 2024 04:44
@tibor-reiss
Copy link
Contributor Author

@dirkkul friendly reminder: can this branch be merged?

@tibor-reiss tibor-reiss force-pushed the fix_1002_numpy_support branch from a2f116b to 5120b1a Compare September 11, 2024 19:46
@@ -60,6 +61,15 @@
DATE3 = datetime.datetime.strptime("2019-06-10", "%Y-%m-%d").replace(tzinfo=datetime.timezone.utc)


def get_numpy_vector(input_list: list) -> Any:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have numpy in our dev requirements, so you can assume it is present for tests

Copy link
Contributor Author

@tibor-reiss tibor-reiss Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, simplified
(and rebased - hopefully that fixes the failing test)

@@ -57,7 +57,7 @@ def pack_vector(vector: Any) -> bytes:
collection=obj.collection,
vector_bytes=(
pack_vector(obj.vector)
if obj.vector is not None and isinstance(obj.vector, list)
if obj.vector is not None and not isinstance(obj.vector, dict)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think handling of vector input should be at the input level, eg in insert_many and batch.add_objects and not here

Copy link
Contributor Author

@tibor-reiss tibor-reiss Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please elaborate what you mean by this and some hints how to accomplish?

I was staring at the code and don't see how I could push the validation upstream, since obj.vector can be both None and a dict. Couple of lines below where vectors is constructed: if obj.vector is not None and isinstance(obj.vector, dict), which means either vector_bytes or vectors will be not-None, but not both at the same time.
For readability, we could move these two next to each other - just cosmetics.

Note: the only thing I would do is to move pack_vector outside as is _pack_named_vectors, and rename the latter to _pack_named_vector. But that is again just cosmetics :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the correct place would be 'weaviate/collections/data/data.py::insert_manyand then the cleanup should be herevector=obj.vector`. But keep in mind that vectors can be lists and dictionaries (for named vectors) and both should be supported

Copy link
Contributor Author

@tibor-reiss tibor-reiss Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @dirkkul, thanks for coming back. However, I still don't see how this could be achieved without a major refactoring.

There are 3 possibilities in the current program flow, starting from data/_DataCollectionAsync.insert_many():

type(obj) type(obj.vector) _BatchObject.vector vector_bytes vectors
!DataObject n/a None None None
DataObject !dict obj.vector pack_vector(...) None
DataObject dict obj.vector None _pack_named_vectors(...)

So move one needs to move some of the logic from __grpc_objects() into insert_many(), e.g. creating a list of tuples containing (_BatchObject, vector_bytes, vectors). Is this what you are suggesting or am I on the wrong path please?

Update: the last commit contains a first try at refactoring. Let me know your thoughts please.

@tibor-reiss tibor-reiss force-pushed the fix_1002_numpy_support branch from e36d66c to 3649fc5 Compare September 13, 2024 18:08
@tibor-reiss tibor-reiss force-pushed the fix_1002_numpy_support branch from 3649fc5 to 9f21009 Compare September 25, 2024 18:43
@tibor-reiss tibor-reiss requested a review from a team as a code owner September 25, 2024 19:47
@tibor-reiss tibor-reiss force-pushed the fix_1002_numpy_support branch from c564504 to e98b0aa Compare September 25, 2024 19:49
@tibor-reiss tibor-reiss requested a review from dirkkul September 25, 2024 19:50
@dirkkul
Copy link
Collaborator

dirkkul commented Sep 26, 2024

Hey, I think you are mixing two things up:

  • how the users supplies vectors
  • how we hand them internally (vector bytes, packing etc)

We would want that user supplied vectors that are numpy arrays are converted to python lists. So if you do

                _BatchObject(
                    collection=self.name,
                    vector=_get_vector_v4(obj.vector),
                    uuid=str(obj.uuid if obj.uuid is not None else uuid_package.uuid4()),
                    properties=cast(dict, obj.properties),
                    tenant=self._tenant,
                    references=obj.references,
                    index=idx,
                )

it should take care of things.

@tibor-reiss tibor-reiss force-pushed the fix_1002_numpy_support branch 2 times, most recently from addbb98 to 00a60a8 Compare September 26, 2024 18:38
@tibor-reiss tibor-reiss force-pushed the fix_1002_numpy_support branch from 00a60a8 to 27422ac Compare September 26, 2024 18:40
@tibor-reiss
Copy link
Contributor Author

Hi @dirkkul, thanks for your patience, fingers crossed I got it now...

@tibor-reiss
Copy link
Contributor Author

Friendly ping @dirkkul

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

insert_many silently discards vectors if they're NumPy arrays
3 participants