feat: 4018 feature add out of the box embedding support via clientfeedbackintegrations (#4454)

<!-- Thanks for your contribution! As part of our Community Growers
initiative 🌱, we're donating Justdiggit bunds in your name to reforest
sub-Saharan Africa. To claim your Community Growers certificate, please
contact David Berenstein in our Slack community or fill in this form
https://tally.so/r/n9XrxK once your PR has been merged. -->

# Description

I added support for `sentence-transformers` via the introduction of the
`SentenceTransformersExtractor`. During the implementation, I fixed some
minor issues.

- replaced `vector_settings_by_name`, which had been defined separately for
the `remote` dataset and the `local` dataset, with the method we defined in
the shared `base`.
- updated the `__repr__` of the dataset to include the
`vector_settings`.
- rewrote some code of the `TextDescriptivesExtractor` to align the
structure and usage of both extractors.
- resolved a bug in the `TextDescriptivesExtractor` where, during
simultaneous IO, records at the wrong indices could be updated with
embeddings (see the sketch after this list).
- removed unit tests for the `TextDescriptivesExtractor` and created
integration tests instead.
- removed some unused imports
 
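A minimal sketch of the index-safe pattern behind that fix (hypothetical code, not the actual patch): each result is pinned to an explicit record index, so completion order can never cause a write to the wrong record.

```python
# Hypothetical sketch only: keep an explicit future -> index mapping instead
# of relying on the order in which concurrent tasks complete.
from concurrent.futures import ThreadPoolExecutor


def compute_embedding(text: str) -> list[float]:
    # placeholder for the real embedding call
    return [float(len(text))]


def update_records_safely(records: list[dict]) -> list[dict]:
    with ThreadPoolExecutor() as pool:
        futures = {
            pool.submit(compute_embedding, record["text"]): index
            for index, record in enumerate(records)
        }
        for future, index in futures.items():
            # write back via the stored index, not the iteration order
            records[index]["vectors"] = {"example": future.result()}
    return records
```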
Closes #4018 

**Type of change**

(Please delete options that are not relevant. Remember to title the PR
according to the type of change)

- [X] New feature (non-breaking change which adds functionality)
- [X] Refactor (change restructuring the codebase without changing
functionality)
- [X] Improvement (change adding some improvement to an existing
functionality)

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And
ideally, reference `tests`)

- [X] integration/**/**/integration/test_textdescriptives.py
- [X] integration/**/**/integration/test_sentencetransformers.py

**Checklist**

- [x] I added relevant documentation
- [x] I followed the style guidelines of this project
- [x] I did a self-review of my code
- [x] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK)
(see text above)
- [x] I have added relevant notes to the `CHANGELOG.md` file (See
https://keepachangelog.com/)

---------

Co-authored-by: Gabriel Martín Blázquez <[email protected]>
davidberenstein1957 and gabrielmbmb authored Jan 14, 2024
1 parent c95326a commit 27e3f09
Showing 27 changed files with 1,269 additions and 274 deletions.
4 changes: 0 additions & 4 deletions .github/workflows/package.yml
@@ -82,10 +82,6 @@ jobs:
searchEngineDockerEnv: '{"discovery.type": "single-node", "xpack.security.enabled": "false"}'
coverageReport: coverage-elasticsearch-8.8.2
runsOn: extended-runner
- searchEngineDockerImage: opensearchproject/opensearch:2.4.1
searchEngineDockerEnv: '{"discovery.type": "single-node", "plugins.security.disabled": "true"}'
coverageReport: coverage-opensearch-2.4.1
runsOn: ubuntu-latest
name: Run end2end tests
uses: ./.github/workflows/end2end-examples.yml
needs: check_repo_files
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -25,6 +25,8 @@ These are the section headers that we use:
### Changed

- Module `argilla.cli.server` definitions have been moved to `argilla.server.cli` module. ([#4472](https://github.com/argilla-io/argilla/pull/4472))
- [breaking] Changed `vector_settings_by_name` for generic `property_by_name` usage, which will return `None` instead of raising an error ([#4454](https://github.com/argilla-io/argilla/pull/4454))
- Added pydantic v2 support using the python SDK ([#4459](https://github.com/argilla-io/argilla/pull/4459))
- The constant definition `ES_INDEX_REGEX_PATTERN` in module `argilla._constants` is now private. ([#4472](https://github.com/argilla-io/argilla/pull/4474))
- `nan` values in metadata properties will raise a 422 error when creating/updating records. ([#4300](https://github.com/argilla-io/argilla/issues/4300))
- `None` values are now allowed in metadata properties. ([#4300](https://github.com/argilla-io/argilla/issues/4300))
@@ -36,6 +38,8 @@ These are the section headers that we use:
### Removed

- The deprecated `python -m argilla database` command has been removed. ([#4472](https://github.com/argilla-io/argilla/pull/4472))
- Added `vector_settings` to the `__repr__` method of the `FeedbackDataset` and `RemoteFeedbackDataset`. ([#4454](https://github.com/argilla-io/argilla/pull/4454))
- Added integration for `sentence-transformers` using `SentenceTransformersExtractor` to configure `vector_settings` in `FeedbackDataset` and `FeedbackRecord`. ([#4454](https://github.com/argilla-io/argilla/pull/4454))

## [1.21.0](https://github.com/argilla-io/argilla/compare/v1.20.0...v1.21.0)

@@ -53,7 +57,7 @@ These are the section headers that we use:

### Changed

- More productive and simpler shortcuts system ([#4215](https://github.com/argilla-io/argilla/pull/4215))
- More productive and simpler shortcut system ([#4215](https://github.com/argilla-io/argilla/pull/4215))
- Move `ArgillaSingleton`, `init` and `active_client` to a new module `singleton`. ([#4347](https://github.com/argilla-io/argilla/pull/4347))
- Updated `argilla.load` functions to also work with `FeedbackDataset`s. ([#4347](https://github.com/argilla-io/argilla/pull/4347))
- [breaking] Updated `argilla.delete` functions to also work with `FeedbackDataset`s. It now raises an error if the dataset does not exist. ([#4347](https://github.com/argilla-io/argilla/pull/4347))
@@ -304,7 +304,6 @@
"\n",
"* *model*: the language of the model.\n",
"* *metrics*: the metrics to be extracted.\n",
"* *fields*: the field names to extract metrics from.\n",
"* *visible_for_annotators*: whether the metadata is visible for annotators.\n",
"* *show_progress*: whether to show the progress bar.\n",
"\n",
@@ -336,7 +335,6 @@
"tde = TextDescriptivesExtractor(\n",
" model = \"en\",\n",
" metrics = None,\n",
" fields = [\"question\"],\n",
" visible_for_annotators = False,\n",
" show_progress = True,\n",
")"
@@ -349,7 +347,7 @@
"outputs": [],
"source": [
"# Update the records\n",
"updated_records = tde.update_records(records)"
"updated_records = tde.update_records(records, fields=[\"question\"])"
]
},
{
@@ -435,7 +433,6 @@
"tde = TextDescriptivesExtractor(\n",
" model = \"en\",\n",
" metrics = [\"descriptive_stats\", \"readability\"],\n",
" fields = [\"context\"],\n",
" visible_for_annotators = True,\n",
" show_progress = True,\n",
")"
@@ -448,7 +445,7 @@
"outputs": [],
"source": [
"# Update the dataset\n",
"tde.update_dataset(remote_dataset)"
"tde.update_dataset(remote_dataset, fields=[\"context\"])"
]
},
{
@@ -38,5 +38,5 @@ active_learning
weak_supervision
semantic_search
job_scheduling
text_descriptives_as_metadata
add_text_descriptives_as_metadata
```
@@ -80,7 +80,6 @@ See https://github.com/opensearch-project/k-NN/issues/1286
This may result in unexpected results when combining filtering with vector search with this engine.
:::


## Add vectors to your data

The first and most important thing to do before leveraging semantic search is to turn text into a numerical representation: a vector. In practical terms, you can think of a vector as an array or list of numbers. You can associate this list of numbers with an Argilla Record by using the aforementioned `vectors` field. But the question is: **how do you create these vectors?**
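For illustration, a minimal sketch of that association (the key name and the numbers are placeholders):

```python
import argilla as rg

# A vector is just a list of floats stored under an arbitrary key name
record = rg.TextClassificationRecord(
    text="Argilla is awesome",
    vectors={"my-vector": [0.1, -0.3, 0.7]},
)
```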
@@ -103,38 +102,7 @@ If you run into issues when logging records with large vectors using `rg.log`, w

SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. There are dozens of [pre-trained models available](https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads) on the Hugging Face Hub.

The code below will load a dataset from the Hub, encode the `text` field, and create the `vectors` field which will contain only one key (`mini-lm-sentence-transformers`).

```{note}
Vector keys are arbitrary names that will be used as a name for the vector and shown in the UI if there's more than 1 so users can decide which vector to use for finding similar records. Remember you can associate several vectors to one record by using different keys.
```

```{warning}
Due to the vector dimension limitation of Elasticsearch and Opensearch Lucene-based engines, currently, you cannot register vectors with dimensions greater than `1024`.
```

To run the code below you need to install `sentence_transformers` and `datasets` with pip: `pip install sentence_transformers datasets`

```python
from sentence_transformers import SentenceTransformer

from datasets import load_dataset

# Load a small, fast sentence-transformers model on CPU
encoder = SentenceTransformer("BAAI/bge-small-en", device="cpu")

# Load dataset
dataset = load_dataset("PolyAI/banking77", split="test")

# Encode text field using batched computation
dataset = dataset.map(lambda batch: {"vectors": encoder.encode(batch["text"])}, batch_size=32, batched=True)

# Turn vectors into a dictionary
dataset = dataset.map(
    lambda r: {"vectors": {"mini-lm-sentence-transformers": r["vectors"]}}
)
```
Given its versatile, open-source nature, we have decided to add a native integration with SentenceTransformers. This integration allows you to easily add embeddings to your records or datasets using the `SentenceTransformersExtractor`, which is based on the [sentence-transformers](https://sbert.net/) library. The integration is documented [here](/practical_guides/create_update_dataset/vectors.md).

### OpenAI `Embeddings`

@@ -146,8 +146,9 @@ vectors_settings = [
)
]
```

```{note}
You can also define vector settings after the dataset has been configured or add them to an existing dataset in Argilla. To do that use the `add_vector_settings` method as explained [here](/practical_guides/create_update_dataset/vectors.md).
You can also define vector settings after the dataset has been configured or add them to an existing dataset in Argilla. To do that, use the `add_vector_settings` method, as shown in the sketch below. In addition, you can now add embeddings of your fields as vectors automatically with the `SentenceTransformersExtractor`. For more info, take a look [here](/practical_guides/create_update_dataset/vectors.md).
```
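For instance, a minimal sketch of adding vector settings to an already-configured dataset (the dataset name, vector name, and dimensions are placeholders):

```python
import argilla as rg

dataset = rg.FeedbackDataset.from_argilla("my_dataset", workspace="my_workspace")

# `dimensions` must match the size of the embeddings you plan to add
dataset.add_vector_settings(
    rg.VectorSettings(name="sentence-embedding", dimensions=384)
)
```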

##### Define `guidelines`
19 changes: 12 additions & 7 deletions docs/_source/practical_guides/create_update_dataset/metadata.md
@@ -137,11 +137,10 @@ You can easily add text descriptives to your records or datasets using the `Text

- `model` (optional): The language of the spacy model that will be used. Defaults to `en`. Check [here](https://spacy.io/usage/models) the available languages and models.
- `metrics` (optional): A list of metrics to extract. The default extracted metrics are: `n_tokens`, `n_unique_tokens`, `n_sentences`, `perplexity`, `entropy`, and `flesch_reading_ease`. You can select your metrics according to the following groups `descriptive_stats`, `readability`, `dependency_distance`, `pos_proportions`, `coherence`, `quality`, and `information_theory`. For more information about each group, check this documentation [page](https://hlasse.github.io/TextDescriptives/descriptivestats.html).
- `fields` (optional): A list of field names to extract metrics from. All fields will be used by default.
- `visible_for_annotators` (optional): Whether the extracted metrics should be visible to annotators. Defaults to `True`.
- `show_progress` (optional): Whether to show a progress bar when extracting metrics. Defaults to `True`.

For a practical example, check our [tutorial on adding text descriptives as metadata](/tutorials_and_integrations/integrations/add_text_descriptives_as_metadata.html).
For a practical example, check our [tutorial on adding text descriptives as metadata](/tutorials_and_integrations/integrations/add_text_descriptives_as_metadata.ipynb).
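As a rough illustration of the result (the key names below are an assumption, following a field-prefixed naming pattern; the exact keys depend on the selected metrics):

```python
# Hypothetical example: metadata added to a record for a field named "question"
record.metadata
# {"question_n_tokens": 12.0, "question_n_unique_tokens": 11.0, ...}
```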

::::{tab-set}

@@ -157,12 +156,16 @@
tde = TextDescriptivesExtractor(
    model="en",
    metrics=None,
    fields=None,
    visible_for_annotators=True,
    show_progress=True,
)

dataset = tde.update_dataset(dataset)
dataset = tde.update_dataset(
    dataset=dataset,
    fields=None,  # None means using all fields
    update_records=True,  # Also, update the records in the dataset
    overwrite=False,  # Whether to overwrite existing vectors
)
```
:::

@@ -178,17 +181,19 @@
tde = TextDescriptivesExtractor(
    model="en",
    metrics=None,
    fields=None,
    visible_for_annotators=True,
    show_progress=True,
)

records = tde.update_records(records)
records = tde.update_records(
    records=records,
    fields=None,  # None means using all fields
    overwrite=False,  # Whether to overwrite existing vectors
)
```

:::


::::


58 changes: 58 additions & 0 deletions docs/_source/practical_guides/create_update_dataset/vectors.md
@@ -124,6 +124,64 @@ dataset.update_records(modified_records)
You can also follow the same strategy to modify existing vectors.
```
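For example, a minimal sketch of that strategy (the vector name and the `embed` function are placeholders for your own setup):

```python
# Overwrite an existing vector on fetched records; "my-vector" must be a
# configured vector setting and `embed` stands in for your own model call
modified_records = []
for record in dataset.records:
    record.vectors["my-vector"] = embed(record.fields["text"])
    modified_records.append(record)

dataset.update_records(modified_records)
```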

### Add Sentence Transformers `vectors`

You can easily add semantic embeddings to your records or datasets using the `SentenceTransformersExtractor` based on the [sentence-transformers](https://sbert.net/) library. This extractor is available in the Python SDK and can be used to configure settings for a dataset and extract embeddings from a list of records. The `SentenceTransformersExtractor` has the following arguments:

- `model`: The name of the model to use for extracting embeddings. You can find a list of available models [here](https://www.sbert.net/docs/pretrained_models.html).
- `show_progress` (optional): Whether to show a progress bar when extracting embeddings. Defaults to `True`.

For a practical example, check our [tutorial on adding sentence transformer embeddings as vectors](/tutorials_and_integrations/integrations/add_sentence_transformers_embeddings_as_vectors.ipynb).

::::{tab-set}

:::{tab-item} Dataset

This can be used to update the dataset configuration with `VectorSettings` for the fields of a `FeedbackDataset` or a `RemoteFeedbackDataset`, and optionally update its records too.

```python
from argilla.client.feedback.integrations.sentencetransformers import SentenceTransformersExtractor

dataset = ... # FeedbackDataset or RemoteFeedbackDataset

ste = SentenceTransformersExtractor(
    model="TaylorAI/bge-micro-v2",
    show_progress=True,
)

dataset = ste.update_dataset(
    dataset=dataset,
    fields=None,  # None means using all fields
    update_records=True,  # Also, update the records in the dataset
    overwrite=False,  # Whether to overwrite existing vectors
)
```
:::

:::{tab-item} Records

This can be used to update a list of `FeedbackRecords` with `vector` values for their fields.

```python
from argilla.client.feedback.integrations.sentencetransformers import SentenceTransformersExtractor

records = [...] # FeedbackRecords or RemoteFeedbackRecords

ste = SentenceTransformersExtractor(
    model="TaylorAI/bge-micro-v2",
    show_progress=True,
)

records = ste.update_records(
    records=records,
    fields=None,  # None means using all fields
    overwrite=False,  # Whether to overwrite existing vectors
)
```

:::

::::
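As a follow-up to the records example, a sketch of persisting the enriched records (assuming a `FeedbackDataset` whose vector settings match the extracted embeddings):

```python
# Add the enriched records to a dataset; its vector settings must match
# the names and dimensions of the embeddings produced by the extractor
dataset.add_records(records)
```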

## Other datasets
