Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restructure Documents to support bulk embedding #87

Merged
merged 14 commits into from
Sep 23, 2024

Conversation

tomusher
Copy link
Member

Forgive the large PR here - this started out as me implementing bulk embedding and then going down a spiral of refactoring things because the existing structure made bulk operations quite difficult.

This change refactors how Documents are generated and used in order to:

In more detail:

  • Where we used to have a Document dataclass that was converted back and forth to Embedding model instances, there is now only a Document Django model. Documents represent a chunk of something to be stored/queried on by a storage provider.
  • DocumentConverters are responsible for converting something to Documents, and Documents back to their relevant something.
  • A Document no longer has references to objects or content types - each document can have list of object_keys which are an arbitrary representation of the thing it relates to. A DocumentConverter knows how to both generate keys for the object it's working with, and retrieve a relevant object based on a key.
  • In the case of the ModelDocumentConverter, this can create Documents based on a ModelKey, and (relatively) efficiently get all Django models that correspond to a list of Documents.
  • DocumentConverters are now composed of separate operators that can be individually specified/overridden to make it easier to test and reason about each operation a converter is involved with.

Copy link
Member

@emilytoppm emilytoppm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks great! Added a couple questions/nitpicks

) -> AsyncGenerator[models.Model, None]:
"""A copy of `bulk_from_documents`, but async"""
# Force evaluate generators to allow value to be reused
documents = tuple(documents)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there circumstances where evaluating this could cause a query, and thus fail with a SynchronousOnlyOperation - eg if passed a queryset?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you're right - I've changed these methods to accept Sequences to prevent the need to force evaluate. You could still conceivably pass a Sequence that can't be evaluated in an async context but I'm not sure if there's a way to guard against that - any ideas welcome!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Honestly I can't think of a great way to impose that restriction either - ie get rid of the chance of .only etc deferring fields we want - without forcing it to no longer be a sequence of documents, and instead plain dataclasses... which seems like overkill.

src/wagtail_vector_index/storage/django.py Outdated Show resolved Hide resolved
@@ -5,7 +5,7 @@
from django.db.models import Q


class DocumentQuerySet(models.QuerySet):
class DocumentManager(models.Manager):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't the QuerySet derived manager be preferable here for allowing chaining custom operations + default queryset operations regardless of order? Not a big deal, but I'm not sure I get this change

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes definitely preferable but I ran in to typing issues using the QuerySet derived manager.

I think this is resolved in django-stubs by typeddjango/django-stubs#738, but not in django-types which we're using in this project.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved this back to be a QuerySet derived manager and added a workaround for the typing issue.

@tomusher tomusher changed the title WIP: Restructure Documents to support bulk embedding Restructure Documents to support bulk embedding Sep 23, 2024
@tomusher tomusher merged commit c08d077 into main Sep 23, 2024
6 of 7 checks passed
@zerolab zerolab deleted the feature/document-refactor branch September 23, 2024 15:00
@tomusher tomusher mentioned this pull request Sep 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

EmbeddableFieldsDocumentConverter currently depends on a content_type field on base model
3 participants