feat: 4018 feature add out of the box embedding support via clientfeedbackintegrations (#4454)

<!-- Thanks for your contribution! As part of our Community Growers
initiative 🌱, we're donating Justdiggit bunds in your name to reforest
sub-Saharan Africa. To claim your Community Growers certificate, please
contact David Berenstein in our Slack community or fill in this form
https://tally.so/r/n9XrxK once your PR has been merged. -->

# Description

I added support for `sentence-transformers` via the introduction of the
`SentenceTransformersExtractor`. During the implementation, I fixed some
minor issues.

- replaced `vector_settings_by_name`, which had been defined separately for
the `remote` dataset and the `local` dataset, with the method we defined in
the shared `base`.
- updated the `__repr__` of the dataset to include the
`vector_settings`.
- rewrote some code of the `TextDescriptivesExtractor` to align the
structure and usage of both extractors.
- resolved a bug in the `TextDescriptivesExtractor` where, during
simultaneous IO, records at the wrong indices could be updated with
embeddings (see the sketch after this list).
- removed unit tests for the `TextDescriptivesExtractor` and created
integration tests instead.
- removed some unused imports
 
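A minimal sketch of the index-safe pattern behind that fix (hypothetical code, not the actual patch): each result is pinned to an explicit record index, so completion order can never cause a write to the wrong record.

```python
# Hypothetical sketch only: keep an explicit future -> index mapping instead
# of relying on the order in which concurrent tasks complete.
from concurrent.futures import ThreadPoolExecutor


def compute_embedding(text: str) -> list[float]:
    # placeholder for the real embedding call
    return [float(len(text))]


def update_records_safely(records: list[dict]) -> list[dict]:
    with ThreadPoolExecutor() as pool:
        futures = {
            pool.submit(compute_embedding, record["text"]): index
            for index, record in enumerate(records)
        }
        for future, index in futures.items():
            # write back via the stored index, not the iteration order
            records[index]["vectors"] = {"example": future.result()}
    return records
```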
Closes #4018 

**Type of change**

(Please delete options that are not relevant. Remember to title the PR
according to the type of change)

- [X] New feature (non-breaking change which adds functionality)
- [X] Refactor (change restructuring the codebase without changing
functionality)
- [X] Improvement (change adding some improvement to an existing
functionality)

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And
ideally, reference `tests`)

- [X] integration/**/**/integration/test_textdescriptives.py
- [X] integration/**/**/integration/test_sentencetransformers.py

**Checklist**

- [x] I added relevant documentation
- [x] I followed the style guidelines of this project
- [x] I did a self-review of my code
- [x] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK)
(see text above)
- [x] I have added relevant notes to the `CHANGELOG.md` file (See
https://keepachangelog.com/)

---------

Co-authored-by: Gabriel Martín Blázquez <[email protected]>
davidberenstein1957 and gabrielmbmb authored Jan 14, 2024
1 parent c95326a commit 27e3f09
Showing 27 changed files with 1,269 additions and 274 deletions.
4 changes: 0 additions & 4 deletions .github/workflows/package.yml
@@ -82,10 +82,6 @@ jobs:
searchEngineDockerEnv: '{"discovery.type": "single-node", "xpack.security.enabled": "false"}'
coverageReport: coverage-elasticsearch-8.8.2
runsOn: extended-runner
- searchEngineDockerImage: opensearchproject/opensearch:2.4.1
searchEngineDockerEnv: '{"discovery.type": "single-node", "plugins.security.disabled": "true"}'
coverageReport: coverage-opensearch-2.4.1
runsOn: ubuntu-latest
name: Run end2end tests
uses: ./.github/workflows/end2end-examples.yml
needs: check_repo_files
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -25,6 +25,8 @@ These are the section headers that we use:
### Changed

- Module `argilla.cli.server` definitions have been moved to `argilla.server.cli` module. ([#4472](https://github.com/argilla-io/argilla/pull/4472))
- [breaking] Changed `vector_settings_by_name` for generic `property_by_name` usage, which will return `None` instead of raising an error ([#4454](https://github.com/argilla-io/argilla/pull/4454))
- Added pydantic v2 support using the python SDK ([#4459](https://github.com/argilla-io/argilla/pull/4459))
- The constant definition `ES_INDEX_REGEX_PATTERN` in module `argilla._constants` is now private. ([#4472](https://github.com/argilla-io/argilla/pull/4474))
- `nan` values in metadata properties will raise a 422 error when creating/updating records. ([#4300](https://github.com/argilla-io/argilla/issues/4300))
- `None` values are now allowed in metadata properties. ([#4300](https://github.com/argilla-io/argilla/issues/4300))
@@ -36,6 +38,8 @@ These are the section headers that we use:
### Removed

- The deprecated `python -m argilla database` command has been removed. ([#4472](https://github.com/argilla-io/argilla/pull/4472))
- Added `vector_settings` to the `__repr__` method of the `FeedbackDataset` and `RemoteFeedbackDataset`. ([#4454](https://github.com/argilla-io/argilla/pull/4454))
- Added integration for `sentence-transformers` using `SentenceTransformersExtractor` to configure `vector_settings` in `FeedbackDataset` and `FeedbackRecord`. ([#4454](https://github.com/argilla-io/argilla/pull/4454))

## [1.21.0](https://github.com/argilla-io/argilla/compare/v1.20.0...v1.21.0)

@@ -53,7 +57,7 @@ These are the section headers that we use:

### Changed

- More productive and simpler shortcuts system ([#4215](https://github.com/argilla-io/argilla/pull/4215))
- More productive and simpler shortcut system ([#4215](https://github.com/argilla-io/argilla/pull/4215))
- Move `ArgillaSingleton`, `init` and `active_client` to a new module `singleton`. ([#4347](https://github.com/argilla-io/argilla/pull/4347))
- Updated `argilla.load` functions to also work with `FeedbackDataset`s. ([#4347](https://github.com/argilla-io/argilla/pull/4347))
- [breaking] Updated `argilla.delete` functions to also work with `FeedbackDataset`s. It now raises an error if the dataset does not exist. ([#4347](https://github.com/argilla-io/argilla/pull/4347))
@@ -304,7 +304,6 @@
"\n",
"* *model*: the language of the model.\n",
"* *metrics*: the metrics to be extracted.\n",
"* *fields*: the field names to extract metrics from.\n",
"* *visible_for_annotators*: whether the metadata is visible for annotators.\n",
"* *show_progress*: whether to show the progress bar.\n",
"\n",
@@ -336,7 +335,6 @@
"tde = TextDescriptivesExtractor(\n",
" model = \"en\",\n",
" metrics = None,\n",
" fields = [\"question\"],\n",
" visible_for_annotators = False,\n",
" show_progress = True,\n",
")"
@@ -349,7 +347,7 @@
"outputs": [],
"source": [
"# Update the records\n",
"updated_records = tde.update_records(records)"
"updated_records = tde.update_records(records, fields=[\"question\"])"
]
},
{
@@ -435,7 +433,6 @@
"tde = TextDescriptivesExtractor(\n",
" model = \"en\",\n",
" metrics = [\"descriptive_stats\", \"readability\"],\n",
" fields = [\"context\"],\n",
" visible_for_annotators = True,\n",
" show_progress = True,\n",
")"
@@ -448,7 +445,7 @@
"outputs": [],
"source": [
"# Update the dataset\n",
"tde.update_dataset(remote_dataset)"
"tde.update_dataset(remote_dataset, fields=[\"context\"])"
]
},
{
@@ -38,5 +38,5 @@ active_learning
weak_supervision
semantic_search
job_scheduling
text_descriptives_as_metadata
add_text_descriptives_as_metadata
```
@@ -80,7 +80,6 @@ See https://github.com/opensearch-project/k-NN/issues/1286
This may result in unexpected results when combining filtering with vector search with this engine.
:::


## Add vectors to your data

The first and most important thing to do before leveraging semantic search is to turn text into a numerical representation: a vector. In practical terms, you can think of a vector as an array or list of numbers. You can associate this list of numbers with an Argilla Record by using the aforementioned `vectors` field. But the question is: **how do you create these vectors?**
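For illustration, a minimal sketch of that association (the key name and the numbers are placeholders):

```python
import argilla as rg

# A vector is just a list of floats stored under an arbitrary key name
record = rg.TextClassificationRecord(
    text="Argilla is awesome",
    vectors={"my-vector": [0.1, -0.3, 0.7]},
)
```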
@@ -103,38 +102,7 @@ If you run into issues when logging records with large vectors using `rg.log`, w

SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. There are dozens of [pre-trained models available](https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads) on the Hugging Face Hub.

The code below will load a dataset from the Hub, encode the `text` field, and create the `vectors` field which will contain only one key (`mini-lm-sentence-transformers`).

```{note}
Vector keys are arbitrary names that will be used as a name for the vector and shown in the UI if there's more than 1 so users can decide which vector to use for finding similar records. Remember you can associate several vectors to one record by using different keys.
```

```{warning}
Due to the vector dimension limitation of Elasticsearch and Opensearch Lucene-based engines, currently, you cannot register vectors with dimensions greater than `1024`.
```

To run the code below you need to install `sentence_transformers` and `datasets` with pip: `pip install sentence_transformers datasets`

```python
from sentence_transformers import SentenceTransformer

from datasets import load_dataset

# Load a small, fast sentence-transformers model on CPU
encoder = SentenceTransformer("BAAI/bge-small-en", device="cpu")

# Load dataset
dataset = load_dataset("PolyAI/banking77", split="test")

# Encode text field using batched computation
dataset = dataset.map(lambda batch: {"vectors": encoder.encode(batch["text"])}, batch_size=32, batched=True)

# Turn vectors into a dictionary
dataset = dataset.map(
    lambda r: {"vectors": {"mini-lm-sentence-transformers": r["vectors"]}}
)
```
Given its versatile, open-source nature, we have decided to add a native integration with SentenceTransformers. This integration allows you to easily add embeddings to your records or datasets using the `SentenceTransformersExtractor`, which is based on the [sentence-transformers](https://sbert.net/) library. The integration is documented [here](/practical_guides/create_update_dataset/vectors.md).

### OpenAI `Embeddings`

@@ -146,8 +146,9 @@ vectors_settings = [
)
]
```

```{note}
You can also define vector settings after the dataset has been configured or add them to an existing dataset in Argilla. To do that use the `add_vector_settings` method as explained [here](/practical_guides/create_update_dataset/vectors.md).
You can also define vector settings after the dataset has been configured or add them to an existing dataset in Argilla. To do that, use the `add_vector_settings` method, as shown in the sketch below. In addition, you can now add embeddings of your fields as vectors automatically with the `SentenceTransformersExtractor`. For more info, take a look [here](/practical_guides/create_update_dataset/vectors.md).
```
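For instance, a minimal sketch of adding vector settings to an already-configured dataset (the dataset name, vector name, and dimensions are placeholders):

```python
import argilla as rg

dataset = rg.FeedbackDataset.from_argilla("my_dataset", workspace="my_workspace")

# `dimensions` must match the size of the embeddings you plan to add
dataset.add_vector_settings(
    rg.VectorSettings(name="sentence-embedding", dimensions=384)
)
```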

##### Define `guidelines`
19 changes: 12 additions & 7 deletions docs/_source/practical_guides/create_update_dataset/metadata.md
@@ -137,11 +137,10 @@ You can easily add text descriptives to your records or datasets using the `Text

- `model` (optional): The language of the spacy model that will be used. Defaults to `en`. Check [here](https://spacy.io/usage/models) the available languages and models.
- `metrics` (optional): A list of metrics to extract. The default extracted metrics are: `n_tokens`, `n_unique_tokens`, `n_sentences`, `perplexity`, `entropy`, and `flesch_reading_ease`. You can select your metrics according to the following groups `descriptive_stats`, `readability`, `dependency_distance`, `pos_proportions`, `coherence`, `quality`, and `information_theory`. For more information about each group, check this documentation [page](https://hlasse.github.io/TextDescriptives/descriptivestats.html).
- `fields` (optional): A list of field names to extract metrics from. All fields will be used by default.
- `visible_for_annotators` (optional): Whether the extracted metrics should be visible to annotators. Defaults to `True`.
- `show_progress` (optional): Whether to show a progress bar when extracting metrics. Defaults to `True`.

For a practical example, check our [tutorial on adding text descriptives as metadata](/tutorials_and_integrations/integrations/add_text_descriptives_as_metadata.html).
For a practical example, check our [tutorial on adding text descriptives as metadata](/tutorials_and_integrations/integrations/add_text_descriptives_as_metadata.ipynb).
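As a rough illustration of the result (the key names below are an assumption, following a field-prefixed naming pattern; the exact keys depend on the selected metrics):

```python
# Hypothetical example: metadata added to a record for a field named "question"
record.metadata
# {"question_n_tokens": 12.0, "question_n_unique_tokens": 11.0, ...}
```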

::::{tab-set}

@@ -157,12 +156,16 @@
tde = TextDescriptivesExtractor(
    model="en",
    metrics=None,
    fields=None,
    visible_for_annotators=True,
    show_progress=True,
)

dataset = tde.update_dataset(dataset)
dataset = tde.update_dataset(
    dataset=dataset,
    fields=None,  # None means using all fields
    update_records=True,  # Also, update the records in the dataset
    overwrite=False,  # Whether to overwrite existing vectors
)
```
:::

@@ -178,17 +181,19 @@
tde = TextDescriptivesExtractor(
    model="en",
    metrics=None,
    fields=None,
    visible_for_annotators=True,
    show_progress=True,
)

records = tde.update_records(records)
records = tde.update_records(
    records=records,
    fields=None,  # None means using all fields
    overwrite=False,  # Whether to overwrite existing vectors
)
```

:::


::::


58 changes: 58 additions & 0 deletions docs/_source/practical_guides/create_update_dataset/vectors.md
@@ -124,6 +124,64 @@ dataset.update_records(modified_records)
You can also follow the same strategy to modify existing vectors.
```
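For example, a minimal sketch of that strategy (the vector name and the `embed` function are placeholders for your own setup):

```python
# Overwrite an existing vector on fetched records; "my-vector" must be a
# configured vector setting and `embed` stands in for your own model call
modified_records = []
for record in dataset.records:
    record.vectors["my-vector"] = embed(record.fields["text"])
    modified_records.append(record)

dataset.update_records(modified_records)
```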

### Add Sentence Transformers `vectors`

You can easily add semantic embeddings to your records or datasets using the `SentenceTransformersExtractor` based on the [sentence-transformers](https://sbert.net/) library. This extractor is available in the Python SDK and can be used to configure settings for a dataset and extract embeddings from a list of records. The `SentenceTransformersExtractor` has the following arguments:

- `model`: The name of the model to use for extracting embeddings. You can find a list of available models [here](https://www.sbert.net/docs/pretrained_models.html).
- `show_progress` (optional): Whether to show a progress bar when extracting embeddings. Defaults to `True`.

For a practical example, check our [tutorial on adding sentence transformer embeddings as vectors](/tutorials_and_integrations/integrations/add_sentence_transformers_embeddings_as_vectors.ipynb).

::::{tab-set}

:::{tab-item} Dataset

This can be used to update the dataset configuration with `VectorSettings` for the fields of a `FeedbackDataset` or a `RemoteFeedbackDataset`, and optionally update its records too.

```python
from argilla.client.feedback.integrations.sentencetransformers import SentenceTransformersExtractor

dataset = ... # FeedbackDataset or RemoteFeedbackDataset

ste = SentenceTransformersExtractor(
    model="TaylorAI/bge-micro-v2",
    show_progress=True,
)

dataset = ste.update_dataset(
    dataset=dataset,
    fields=None,  # None means using all fields
    update_records=True,  # Also, update the records in the dataset
    overwrite=False,  # Whether to overwrite existing vectors
)
```
:::

:::{tab-item} Records

This can be used to update a list of `FeedbackRecords` with `vector` values for their fields.

```python
from argilla.client.feedback.integrations.sentencetransformers import SentenceTransformersExtractor

records = [...] # FeedbackRecords or RemoteFeedbackRecords

ste = SentenceTransformersExtractor(
    model="TaylorAI/bge-micro-v2",
    show_progress=True,
)

records = ste.update_records(
    records=records,
    fields=None,  # None means using all fields
    overwrite=False,  # Whether to overwrite existing vectors
)
```

:::

::::
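As a follow-up to the records example, a sketch of persisting the enriched records (assuming a `FeedbackDataset` whose vector settings match the extracted embeddings):

```python
# Add the enriched records to a dataset; its vector settings must match
# the names and dimensions of the embeddings produced by the extractor
dataset.add_records(records)
```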

## Other datasets
