Restructure Documents to support bulk embedding #87

tomusher · 2024-09-13T13:32:15Z

Forgive the large PR here - this started out as me implementing bulk embedding and then going down a spiral of refactoring things because the existing structure made bulk operations quite difficult.

This change refactors how Documents are generated and used in order to:

Decouple them from content types (resolves #73, "EmbeddableFieldsDocumentConverter currently depends on a content_type field on base model")
Support bulk embeddings (resolves #84, "Bulk embedding support")
Simplify the code base and make individual components more testable.
Hopefully make it easier to filter searches by querysets.

In more detail:

Where we used to have a Document dataclass that was converted back and forth to Embedding model instances, there is now only a Document Django model. Documents represent a chunk of something to be stored/queried on by a storage provider.
DocumentConverters are responsible for converting something to Documents, and Documents back to their relevant something.
A Document no longer has references to objects or content types - each document can have a list of object_keys, which are an arbitrary representation of the thing it relates to. A DocumentConverter knows how to both generate keys for the object it's working with, and retrieve a relevant object based on a key.
In the case of the ModelDocumentConverter, this can create Documents based on a ModelKey, and (relatively) efficiently get all Django models that correspond to a list of Documents.
DocumentConverters are now composed of separate operators that can be individually specified/overridden to make it easier to test and reason about each operation a converter is involved with.
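
As a rough illustration of the structure described above, the following sketch uses plain dataclasses in place of the Django model. Every name in it (`ToyDocument`, `ToyConverter`, `Page`, `page_key`, `split_sentences`) and the "page:<pk>" key format are hypothetical and not taken from the wagtail-vector-index codebase. It only shows the general shape: a converter built from swappable operators that turns objects into documents carrying opaque keys, and resolves a batch of documents back to objects.

```python
from dataclasses import dataclass, field
from typing import Callable, Sequence


@dataclass
class ToyDocument:
    """Stand-in for the Document model: a chunk of content plus opaque keys."""

    content: str
    object_keys: list[str] = field(default_factory=list)


@dataclass
class Page:
    """Hypothetical source object that gets converted to documents."""

    pk: int
    body: str


def page_key(page: Page) -> str:
    # Assumed key format for illustration; the real ModelKey may differ.
    return f"page:{page.pk}"


def split_sentences(page: Page) -> list[str]:
    # Naive chunking operator: one chunk per non-empty sentence.
    return [part.strip() for part in page.body.split(".") if part.strip()]


@dataclass
class ToyConverter:
    """Converter composed of individually replaceable operators."""

    make_key: Callable[[Page], str]
    chunk: Callable[[Page], list[str]]

    def to_documents(self, page: Page) -> list[ToyDocument]:
        key = self.make_key(page)
        return [ToyDocument(content=c, object_keys=[key]) for c in self.chunk(page)]

    def bulk_from_documents(
        self, documents: Sequence[ToyDocument], store: dict[str, Page]
    ) -> list[Page]:
        # Collect distinct keys in first-seen order, then resolve each key once.
        seen: dict[str, None] = {}
        for doc in documents:
            for key in doc.object_keys:
                seen.setdefault(key, None)
        return [store[key] for key in seen if key in store]


pages = [Page(1, "Cats purr. Dogs bark."), Page(2, "Birds sing.")]
converter = ToyConverter(make_key=page_key, chunk=split_sentences)
store = {page_key(p): p for p in pages}
docs = [d for p in pages for d in converter.to_documents(p)]
restored = converter.bulk_from_documents(docs, store)
```

Because the operators are plain callables, a test can swap in a different `chunk` or `make_key` without touching the rest of the converter.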


emilytoppm

Generally looks great! Added a couple questions/nitpicks

emilytoppm · 2024-09-19T14:49:13Z

src/wagtail_vector_index/storage/django.py

+    ) -> AsyncGenerator[models.Model, None]:
+        """A copy of `bulk_from_documents`, but async"""
+        # Force evaluate generators to allow value to be reused
+        documents = tuple(documents)


Are there circumstances where evaluating this could cause a query, and thus fail with a SynchronousOnlyOperation - e.g. if passed a queryset?

tomusher · Sep 20, 2024

Yeah, you're right - I've changed these methods to accept Sequences so there's no need to force evaluation. You could still conceivably pass a Sequence that can't be evaluated in an async context, but I'm not sure if there's a way to guard against that - any ideas welcome!

emilytoppm · Sep 23, 2024

Looks good! Honestly, I can't think of a great way to impose that restriction either - i.e. get rid of the chance of .only etc. deferring fields we want - without forcing it to no longer be a sequence of documents, and instead plain dataclasses... which seems like overkill.
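
To make the concern in this thread concrete, the sketch below shows why a generator argument would have to be materialised before it can be iterated twice, and how accepting a Sequence avoids the copy. `abulk_lookup`, `collect`, and the dict-based `store` are hypothetical stand-ins, not the actual `bulk_from_documents` API, and no Django is involved.

```python
import asyncio
from typing import AsyncGenerator, Sequence


async def abulk_lookup(
    keys: Sequence[str], store: dict[str, str]
) -> AsyncGenerator[str, None]:
    """Yield stored values for `keys`, iterating `keys` twice without copying."""
    # First pass: work out which keys exist (stand-in for one bulk query).
    found = {key for key in keys if key in store}
    # Second pass: yield results in the caller's order.
    for key in keys:
        if key in found:
            yield store[key]


async def collect(keys: Sequence[str], store: dict[str, str]) -> list[str]:
    return [value async for value in abulk_lookup(keys, store)]


store = {"a": "alpha", "b": "beta"}

# A list is a Sequence, so both passes see every key.
from_list = asyncio.run(collect(["b", "missing", "a"], store))

# A generator is exhausted by the first pass, so the second pass yields nothing.
# A type checker flags this call, which is what the Sequence annotation buys.
from_generator = asyncio.run(
    collect((k for k in ["b", "a"]), store)  # type: ignore[arg-type]
)
```

The annotation only documents intent and lets a type checker catch generators; as noted above, it cannot stop a caller from passing something that is lazily evaluated when iterated.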


emilytoppm · 2024-09-19T15:43:11Z

src/wagtail_vector_index/storage/models.py

@@ -5,7 +5,7 @@
 from django.db.models import Q


-class DocumentQuerySet(models.QuerySet):
+class DocumentManager(models.Manager):


Wouldn't the QuerySet-derived manager be preferable here, to allow chaining custom operations and default queryset operations regardless of order? Not a big deal, but I'm not sure I get this change.

tomusher · Sep 19, 2024

Yes, definitely preferable, but I ran into typing issues using the QuerySet-derived manager.

I think this is resolved in django-stubs by typeddjango/django-stubs#738, but not in django-types, which we're using in this project.

tomusher · Sep 20, 2024

I've moved this back to be a QuerySet-derived manager and added a workaround for the typing issue.


tomusher added 4 commits September 13, 2024 15:40

Restructure Documents

1326b79

Remove object_key field from Document

e8c768c

Update bulk_generate_documents to use object_keys

6928fdd

Use staticmethod consistently

8bf89c6

tomusher force-pushed the feature/document-refactor branch from 341d1eb to 8bf89c6 Compare September 13, 2024 15:45

tomusher added 3 commits September 13, 2024 15:47

Fix tests after rebase

6e2f102

Typing fixes

dc9faa8

Fix async typing error

5004e87

zerolab reviewed Sep 13, 2024

src/wagtail_vector_index/storage/django.py

tomusher added 3 commits September 13, 2024 17:24

Fix for_keys when using Postgres

4b6c7ba

Use Python stdlib batched if it exists

59f5ab0

Ignore type warnings when import is unavailable

657b6e9

emilytoppm reviewed Sep 19, 2024


tomusher added 4 commits September 20, 2024 12:56

Revert Document manager to be derived from QuerySet

39275f3

Added a workaround for typing issues with django-types

bulk_from_documents now accepts Sequences to prevent need to force evaluate generators

c73193d

Add as_manager on DocumentQuerySet which casts returned type to DocumentManager

ef70ab8

Fixed cases where generator was being passed to bulk_from_documents

c7be208

emilytoppm approved these changes Sep 23, 2024


tomusher changed the title from "WIP: Restructure Documents to support bulk embedding" to "Restructure Documents to support bulk embedding" Sep 23, 2024

tomusher merged commit c08d077 into main Sep 23, 2024
6 of 7 checks passed

zerolab deleted the feature/document-refactor branch September 23, 2024 15:00

tomusher mentioned this pull request Sep 26, 2024

Bulk embedding support #84

Closed
