Merge branch 'main' into develop

argilla-io · Nov 13, 2023 · ee7f073
2 parents bfe4a0e + c9ed2b3
commit ee7f073
Showing 64 changed files with 1,049 additions and 1,718 deletions.
diff --git a/.github/workflows/package.yml b/.github/workflows/package.yml
@@ -14,6 +14,8 @@ on:
       - "main"
       - "develop"
       - "releases/**"
+    tags:
+      - "*"
   pull_request:
     branches:
       - "main"
@@ -76,17 +78,17 @@ jobs:
     strategy:
       matrix:
         include:
-          - searchEngineDockerImage: docker.elastic.co/elasticsearch/elasticsearch:8.0.1
+          - searchEngineDockerImage: docker.elastic.co/elasticsearch/elasticsearch:8.8.0
             searchEngineDockerEnv: '{"discovery.type": "single-node", "xpack.security.enabled": "false"}'
-            coverageReport: coverage-elasticsearch-8.0.1
+            coverageReport: coverage-elasticsearch-8.8.0
             runsOn: extended-runner
-          - searchEngineDockerImage: docker.elastic.co/elasticsearch/elasticsearch:7.17.11
+          - searchEngineDockerImage: docker.elastic.co/elasticsearch/elasticsearch:8.6.0
             searchEngineDockerEnv: '{"discovery.type": "single-node", "xpack.security.enabled": "false"}'
-            coverageReport: coverage-elasticsearch-7.17.11
+            coverageReport: coverage-elasticsearch-8.6.0
             runsOn: extended-runner
-          - searchEngineDockerImage: opensearchproject/opensearch:1.3.11
+          - searchEngineDockerImage: opensearchproject/opensearch:2.8.0
             searchEngineDockerEnv: '{"discovery.type": "single-node", "plugins.security.disabled": "true"}'
-            coverageReport: coverage-opensearch-1.3.11
+            coverageReport: coverage-opensearch-2.8.0
             runsOn: extended-runner
     name: Run unit tests with extra engines
     uses: ./.github/workflows/run-python-tests.yml

diff --git a/.github/workflows/run-python-tests.yml b/.github/workflows/run-python-tests.yml
@@ -74,7 +74,7 @@ jobs:
           path: ~/.cache/pip
           key: ${{ runner.os }}-pip-${{ env.CACHE_NUMBER }}-${{ hashFiles('pyproject.toml') }}
       - name: Set huggingface hub credentials
-        if: github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/heads/releases')
+        if: github.ref == 'refs/heads/main' || github.ref == 'refs/heads/develop' || startsWith(github.ref, 'refs/heads/releases')
         run: |
           echo "HF_HUB_ACCESS_TOKEN=${{ secrets.HF_HUB_ACCESS_TOKEN }}" >> "$GITHUB_ENV"
           echo "Enable HF access token"

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -16,65 +16,70 @@ These are the section headers that we use:
 
 ## [Unreleased]()
 
+### Added
+
+- Added `metadata_properties` to the `__repr__` method of the `FeedbackDataset` and `RemoteFeedbackDataset`. ([#4192](https://github.com/argilla-io/argilla/pull/4192))
+
 ### Fixed
 
 - Fixed error in `ArgillaTrainer`, with numerical labels use `RatingQuestion` instead of `RankingQuestion` ([#4171](https://github.com/argilla-io/argilla/pull/4171))
 - Fixed error in `ArgillaTrainer`, now we can train for `extractive_question_answering` using a validation sample ([#4204](https://github.com/argilla-io/argilla/pull/4204))
 
-## [1.19.0]()
+## [1.19.0](https://github.com/argilla-io/argilla/compare/v1.18.0...v1.19.0)
 
 ### Added
 
-- Added `metadata_properties` to the `__repr__` method of the `FeedbackDataset` and `RemoteFeedbackDataset`.([#4192](https://github.com/argilla-io/argilla/pull/4192)).
-- Added `show_progress` argument to `from_huggingface()` method to make the progress bar for parsing records process optional.([#4132](https://github.com/argilla-io/argilla/pull/4132)).
-- Added a progress bar for parsing records process to `from_huggingface()` method with `trange` in `tqdm`.([#4132](https://github.com/argilla-io/argilla/pull/4132)).
-- Added to sort by `inserted_at` or `updated_at` for datasets with no metadata. ([4147](https://github.com/argilla-io/argilla/pull/4147))
-- Added `max_records` argument to `pull()` method for `RemoteFeedbackDataset`.([#4074](https://github.com/argilla-io/argilla/pull/4074))
-- Added functionality to push your models to the Hugging Face hub with `ArgillaTrainer.push_to_huggingface` ([#3976](https://github.com/argilla-io/argilla/pull/3976)). Contributed by @Racso-3141.
 - Added `POST /api/v1/datasets/:dataset_id/records/search` endpoint to search for records without user context, including responses by all users. ([#4143](https://github.com/argilla-io/argilla/pull/4143))
-- Added `filter_by` argument to `ArgillaTrainer` to filter by `response_status` ([#4120](https://github.com/argilla-io/argilla/pull/4120)).
-- Added `sort_by` argument to `ArgillaTrainer` to sort by `metadata` ([#4120](https://github.com/argilla-io/argilla/pull/4120)).
-- Added `max_records` argument to `ArgillaTrainer` to limit record used for training ([#4120](https://github.com/argilla-io/argilla/pull/4120)).
 - Added `POST /api/v1/datasets/:dataset_id/vectors-settings` endpoint for creating vector settings for a dataset. ([#3776](https://github.com/argilla-io/argilla/pull/3776))
 - Added `GET /api/v1/datasets/:dataset_id/vectors-settings` endpoint for listing the vectors settings for a dataset. ([#3776](https://github.com/argilla-io/argilla/pull/3776))
 - Added `DELETE /api/v1/vectors-settings/:vector_settings_id` endpoint for deleting a vector settings. ([#3776](https://github.com/argilla-io/argilla/pull/3776))
+- Added `PATCH /api/v1/vectors-settings/:vector_settings_id` endpoint for updating a vector settings. ([#4092](https://github.com/argilla-io/argilla/pull/4092))
 - Added `GET /api/v1/records/:record_id` endpoint to get a specific record. ([#4039](https://github.com/argilla-io/argilla/pull/4039))
 - Added support to include vectors for `GET /api/v1/datasets/:dataset_id/records` endpoint response using `include` query param. ([#4063](https://github.com/argilla-io/argilla/pull/4063))
 - Added support to include vectors for `GET /api/v1/me/datasets/:dataset_id/records` endpoint response using `include` query param. ([#4063](https://github.com/argilla-io/argilla/pull/4063))
 - Added support to include vectors for `POST /api/v1/me/datasets/:dataset_id/records/search` endpoint response using `include` query param. ([#4063](https://github.com/argilla-io/argilla/pull/4063))
-- Added `PATCH /api/v1/vectors-settings/:vector_settings_id` endpoint for updating a vector settings. ([#4092](https://github.com/argilla-io/argilla/pull/4092))
+- Added `show_progress` argument to the `from_huggingface()` method to make the progress bar for the record-parsing process optional. ([#4132](https://github.com/argilla-io/argilla/pull/4132))
+- Added a progress bar for the record-parsing process to the `from_huggingface()` method, using `trange` from `tqdm`. ([#4132](https://github.com/argilla-io/argilla/pull/4132))
+- Added support for sorting by `inserted_at` or `updated_at` for datasets with no metadata. ([#4147](https://github.com/argilla-io/argilla/pull/4147))
+- Added `max_records` argument to the `pull()` method of `RemoteFeedbackDataset`. ([#4074](https://github.com/argilla-io/argilla/pull/4074))
+- Added functionality to push your models to the Hugging Face hub with `ArgillaTrainer.push_to_huggingface` ([#3976](https://github.com/argilla-io/argilla/pull/3976)). Contributed by @Racso-3141.
+- Added `filter_by` argument to `ArgillaTrainer` to filter by `response_status` ([#4120](https://github.com/argilla-io/argilla/pull/4120)).
+- Added `sort_by` argument to `ArgillaTrainer` to sort by `metadata` ([#4120](https://github.com/argilla-io/argilla/pull/4120)).
+- Added `max_records` argument to `ArgillaTrainer` to limit the number of records used for training ([#4120](https://github.com/argilla-io/argilla/pull/4120)).
 - Added `add_vector_settings` method to local and remote `FeedbackDataset`. ([#4055](https://github.com/argilla-io/argilla/pull/4055))
 - Added `update_vectors_settings` method to local and remote `FeedbackDataset`. ([#4122](https://github.com/argilla-io/argilla/pull/4122))
 - Added `delete_vectors_settings` method to local and remote `FeedbackDataset`. ([#4130](https://github.com/argilla-io/argilla/pull/4130))
 - Added `vector_settings_by_name` method to local and remote `FeedbackDataset`. ([#4055](https://github.com/argilla-io/argilla/pull/4055))
-- Added `ARGILLA_SEARCH_ENGINE` environment variable to configure the search engine to use. ([#4019](https://github.com/argilla-io/argilla/pull/4019))
 - Added `find_similar_records` method to local and remote `FeedbackDataset`. ([#4023](https://github.com/argilla-io/argilla/pull/4023))
+- Added `ARGILLA_SEARCH_ENGINE` environment variable to configure the search engine to use. ([#4019](https://github.com/argilla-io/argilla/pull/4019))
 
 ### Changed
 
+- [breaking] Remove support for Elasticsearch < 8.5 and OpenSearch < 2.4. ([#4173](https://github.com/argilla-io/argilla/pull/4173))
+- [breaking] Users working with OpenSearch engines must use version >=2.4 and set `ARGILLA_SEARCH_ENGINE=opensearch`. ([#4019](https://github.com/argilla-io/argilla/pull/4019) and [#4111](https://github.com/argilla-io/argilla/pull/4111))
 - [breaking] Changed `FeedbackDataset.*_by_name()` methods to return `None` when no match is found ([#4101](https://github.com/argilla-io/argilla/pull/4101)).
+- [breaking] The `limit` query parameter for the `GET /api/v1/datasets/:dataset_id/records` endpoint now only accepts values greater than or equal to `1` and less than or equal to `1000`. ([#4143](https://github.com/argilla-io/argilla/pull/4143))
+- [breaking] The `limit` query parameter for the `GET /api/v1/me/datasets/:dataset_id/records` endpoint now only accepts values greater than or equal to `1` and less than or equal to `1000`. ([#4143](https://github.com/argilla-io/argilla/pull/4143))
+- Update `GET /api/v1/datasets/:dataset_id/records` endpoint to fetch records using the search engine. ([#4142](https://github.com/argilla-io/argilla/pull/4142))
+- Update `GET /api/v1/me/datasets/:dataset_id/records` endpoint to fetch records using the search engine. ([#4142](https://github.com/argilla-io/argilla/pull/4142))
 - Update `POST /api/v1/datasets/:dataset_id/records` endpoint to allow to create records with `vectors` ([#4022](https://github.com/argilla-io/argilla/pull/4022))
+- Update `PATCH /api/v1/datasets/:dataset_id` endpoint to allow updating `allow_extra_metadata` attribute. ([#4112](https://github.com/argilla-io/argilla/pull/4112))
 - Update `PATCH /api/v1/datasets/:dataset_id/records` endpoint to allow to update records with `vectors`. ([#4062](https://github.com/argilla-io/argilla/pull/4062))
 - Update `PATCH /api/v1/records/:record_id` endpoint to allow to update record with `vectors`. ([#4062](https://github.com/argilla-io/argilla/pull/4062))
+- Update `POST /api/v1/me/datasets/:dataset_id/records/search` endpoint to allow to search records with vectors. ([#4019](https://github.com/argilla-io/argilla/pull/4019))
 - Update `BaseElasticAndOpenSearchEngine.index_records` method to also index record vectors. ([#4062](https://github.com/argilla-io/argilla/pull/4062))
 - Update `FeedbackDataset.__init__` to allow passing a list of vector settings. ([#4055](https://github.com/argilla-io/argilla/pull/4055))
 - Update `FeedbackDataset.push_to_argilla` to also push vector settings. ([#4055](https://github.com/argilla-io/argilla/pull/4055))
 - Update `FeedbackDatasetRecord` to support the creation of records with vectors. ([#4043](https://github.com/argilla-io/argilla/pull/4043))
-- Update `POST /api/v1/me/datasets/:dataset_id/records/search` endpoint to allow to search records with vectors. ([#4019](https://github.com/argilla-io/argilla/pull/4019))
-- [breaking] Users working with OpenSearch engines must use version >=2.4 and set `ARGILLA_SEARCH_ENGINE=opensearch`. ([#4019](https://github.com/argilla-io/argilla/pull/4019) and [#4111](https://github.com/argilla-io/argilla/pull/4111))
-- Update `PATCH /api/v1/datasets/:dataset_id` endpoint to allow updating `allow_extra_metadata` attribute. ([#4112](https://github.com/argilla-io/argilla/pull/4112))
 - Using cosine similarity to compute similarity between vectors. ([#4124](https://github.com/argilla-io/argilla/pull/4124))
-- Update `GET /api/v1/datasets/:dataset_id/records` endpoint to fetch record using the search engine. ([#4142](https://github.com/argilla-io/argilla/pull/4142))
-- Update `GET /api/v1/me/datasets/:dataset_id/records` endpoint to fetch record using the search engine. ([#4142](https://github.com/argilla-io/argilla/pull/4142))
-- [breaking] `limit` query parameter for `GET /api/v1/datasets/:dataset_id/records` endpoint is now only accepting values greater or equal than `1` and less or equal than `1000`. ([#4143](https://github.com/argilla-io/argilla/pull/4143))
-- [breaking] `limit` query parameter for `GET /api/v1/me/datasets/:dataset_id/records` endpoint is now only accepting values greater or equal than `1` and less or equal than `1000`. ([#4143](https://github.com/argilla-io/argilla/pull/4143))
-- Now client class `DatasetConfig` is setting `use_enum_values` config value to `True`. Closes [#4089](https://github.com/argilla-io/argilla/issues/4089) ([#4172](https://github.com/argilla-io/argilla/pull/4172))
 
 ### Fixed
 
 - Fixed svg images out of screen with too large images ([#4047](https://github.com/argilla-io/argilla/pull/4047))
 - Fixed creating records with responses from multiple users. Closes [#3746](https://github.com/argilla-io/argilla/issues/3746) and [#3808](https://github.com/argilla-io/argilla/issues/3808) ([#4142](https://github.com/argilla-io/argilla/pull/4142))
-
+- Fixed deleting or updating responses as an owner for annotators. (Commit [403a66d](https://github.com/argilla-io/argilla/commit/403a66d16d816fa8a62e3f76314ccc90e0073297))
+- Fixed passing user_id when getting records by id. (Commit [98c7927](https://github.com/argilla-io/argilla/commit/98c792757a21da05bac89b7f625e7e5792ad59f9))
+- Fixed non-basic tags serialized when pushing a dataset to the Hugging Face Hub. Closes [#4089](https://github.com/argilla-io/argilla/issues/4089) ([#4200](https://github.com/argilla-io/argilla/pull/4200))
 
 ## [1.18.0](https://github.com/argilla-io/argilla/compare/v1.17.0...v1.18.0)
 

diff --git a/docs/_source/_common/sdk_feedback_semantic_search.md b/docs/_source/_common/sdk_feedback_semantic_search.md
@@ -0,0 +1,39 @@
+In the Python SDK, you can also get a list of feedback records that are semantically close to a given embedding with the `find_similar_records` method. These are the arguments of this function:
+
+- `vector_name`: The `name` of the vector to use in the search.
+- `value`: A vector to use for the similarity search in the form of a `List[float]`. It is necessary to include a `value` **or** a `record`.
+- `record`: A `FeedbackRecord` to use as part of the search. It is necessary to include a `value` **or** a `record`.
+- `max_results` (optional): The maximum number of results for this search. The default is `50`.
+
+This returns a list of tuples, each containing a record and its similarity score (between 0 and 1).
+
+```python
+ds = rg.FeedbackDataset.from_argilla("my_dataset", workspace="my_workspace")
+
+# using text embeddings (`embedder_model` stands for your own embedding model)
+similar_records = ds.find_similar_records(
+    vector_name="my_vector",
+    value=embedder_model.embeddings("My text is here"),
+    # value=embedder_model.embeddings("My text is here").tolist()  # for numpy arrays
+)
+
+# using another record
+similar_records = ds.find_similar_records(
+    vector_name="my_vector",
+    record=ds.records[0],
+    max_results=5,
+)
+
+# work with the resulting tuples
+for record, score in similar_records:
+    ...
+```
+
+You can also combine filters and semantic search like this:
+
+```python
+similar_records = (dataset
+    .filter_by(metadata=[rg.TermsMetadataFilter(values=["Positive"])])
+    .find_similar_records(vector_name="vector", value=model.encode("Another text").tolist())
+)
+```
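
Note on the `find_similar_records` doc above: the changelog in this commit says similarity is computed with cosine similarity, and the doc says scores fall between 0 and 1. The following is a minimal brute-force sketch of that idea, not Argilla's implementation. The function names and the `(1 + cosine) / 2` rescaling are assumptions made for illustration; the real scoring happens inside the search engine.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def find_similar(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    max_results: int = 50,
) -> List[Tuple[str, float]]:
    """Return (record_id, score) pairs, most similar first.

    Assumption: the score is rescaled as (1 + cosine) / 2 so it lies in [0, 1].
    """
    scored = [
        (record_id, (1.0 + cosine_similarity(query, vector)) / 2.0)
        for record_id, vector in vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:max_results]


# Hypothetical toy data: three 2-dimensional vectors keyed by record id.
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
results = find_similar([1.0, 0.0], vectors, max_results=2)

for record_id, score in results:
    print(record_id, score)
```

The result has the same shape as the list of `(record, score)` tuples described in the doc, so the `for record, score in similar_records` loop pattern applies unchanged.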
diff --git a/docs/_source/_common/ui_feedback_semantic_search.md b/docs/_source/_common/ui_feedback_semantic_search.md
@@ -0,0 +1,7 @@
+In Feedback datasets, you can also retrieve records based on their similarity with another record. To do that, make sure you have added `vector_settings` to your [dataset configuration](/practical_guides/create_dataset.md#define-vectors) and that your [records include vectors](/practical_guides/create_dataset.md#configure-the-records).
+
+In the UI, go to the record you'd like to use for the semantic search and click on `Find similar` at the top right corner of the record card. If there is more than one vector, you will be asked to select which vector to use. You can also select whether you want the most or least similar records and the number of results you would like to see.
+
+At any time, you can expand or collapse the record that was used for the search as a reference. If you want to undo the search, just click on the cross next to the reference record.
+
+![Snapshot of semantic search in a Feedback Dataset from Argilla's UI](/_static/images/llms/feedback_semantic_search.png)
diff --git a/docs/_source/_static/images/llms/feedback_semantic_search.png b/docs/_source/_static/images/llms/feedback_semantic_search.png
diff --git a/docs/_source/_static/images/llms/snapshot-feedback-demo.png b/docs/_source/_static/images/llms/snapshot-feedback-demo.png
diff --git a/docs/_source/_static/images/llms/snapshot-feedback-submitted.png b/docs/_source/_static/images/llms/snapshot-feedback-submitted.png
diff --git a/docs/_source/community/developer_docs.md b/docs/_source/community/developer_docs.md
@@ -193,7 +193,14 @@ To install Elasticsearch or Opensearch, and to work with Argilla on your server
 To install ElasticSearch or OpenSearch, you can refer to the [Setup and Installation](/getting_started/installation/deployments/docker.md) guide.
 
 :::{note}
-Argilla supports ElasticSearch versions 8.8, 8.5, 8.0, and 7.17 and OpenSearch versions 1.3 and 2.3.
+Argilla supports ElasticSearch versions >=8.5, and OpenSearch versions >=2.4.
+:::
+
+:::{note}
+For vector search in OpenSearch, filters are applied in a `post_filter` step, because a bug makes queries that combine filtering with k-NN search fail when sent from Argilla.
+See https://github.com/opensearch-project/k-NN/issues/1286
+
+As a result, combining filtering with vector search on this engine may return unexpected results.
 :::
 
 ### Relational Database and Migration

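Note on the `post_filter` caveat in the developer docs hunk above: one way to see why post-filtering can surprise users is that the k nearest neighbours are selected first and the filter is applied afterwards, so fewer than k results (possibly none) may survive. The sketch below illustrates this with plain Python lists; the record tuples, labels, and function names are hypothetical and do not reflect Argilla's or OpenSearch's internals.

```python
from typing import Callable, List, Tuple

# (record_id, label, similarity score); hypothetical toy records
Record = Tuple[str, str, float]


def knn_then_post_filter(
    records: List[Record], k: int, keep: Callable[[Record], bool]
) -> List[Record]:
    """Select the k most similar records first, then drop those failing the filter."""
    top_k = sorted(records, key=lambda r: r[2], reverse=True)[:k]
    return [r for r in top_k if keep(r)]


def filter_then_knn(
    records: List[Record], k: int, keep: Callable[[Record], bool]
) -> List[Record]:
    """Apply the filter first, then select the k most similar of what remains."""
    candidates = [r for r in records if keep(r)]
    return sorted(candidates, key=lambda r: r[2], reverse=True)[:k]


def is_positive(record: Record) -> bool:
    return record[1] == "positive"


records = [
    ("r1", "negative", 0.95),
    ("r2", "negative", 0.90),
    ("r3", "positive", 0.80),
    ("r4", "positive", 0.70),
]

post = knn_then_post_filter(records, k=2, keep=is_positive)
pre = filter_then_knn(records, k=2, keep=is_positive)

print("post-filter:", post)
print("pre-filter: ", pre)
```

With this toy data the two most similar records are both `negative`, so the post-filtered search returns nothing even though two matching records exist.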
diff --git a/docs/_source/getting_started/installation/configurations/server_configuration.md b/docs/_source/getting_started/installation/configurations/server_configuration.md
@@ -4,6 +4,18 @@ This section explains advanced operations and settings for running the Argilla S
 
 By default, the Argilla Server will look for your Elasticsearch (ES) endpoint at `http://localhost:9200`. You can customize this by setting the `ARGILLA_ELASTICSEARCH` environment variable. Have a look at the list of available [environment variables](#environment-variables) to further configure the Argilla server.
 
+Starting with Argilla `1.19.0`, you must configure the search engine explicitly to work with Feedback datasets. Set the
+environment variable `ARGILLA_SEARCH_ENGINE=opensearch` or `ARGILLA_SEARCH_ENGINE=elasticsearch`, depending on the backend you're using.
+The default value for this variable is `elasticsearch`. The minimum supported version is `8.5.0` for Elasticsearch and `2.4.0` for OpenSearch.
+Please review your backend and upgrade it if necessary.
+
+:::{warning}
+For vector search in OpenSearch, filters are applied in a `post_filter` step, because a bug makes queries that combine filtering with k-NN search fail when sent from Argilla.
+See https://github.com/opensearch-project/k-NN/issues/1286
+
+As a result, combining filtering with vector search on this engine may return unexpected results.
+:::
+
 ## Launching
 ### Using a proxy
 
@@ -52,6 +64,8 @@ You can set the following environment variables to further configure your server
 
 - `ARGILLA_ELASTICSEARCH`: URL of the connection endpoint of the Elasticsearch instance (Default: `http://localhost:9200`).
 
+- `ARGILLA_SEARCH_ENGINE`: (Only for Feedback datasets) Search engine to use. Valid values are "elasticsearch" and "opensearch" (Default: "elasticsearch").
+
 - `ARGILLA_ELASTICSEARCH_SSL_VERIFY`: If "False", disables SSL certificate verification when connecting to the Elasticsearch backend.
 
 - `ARGILLA_ELASTICSEARCH_CA_PATH`: Path to CA cert for ES host. For example: `/full/path/to/root-ca.pem` (Optional)
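
Note on the server configuration changes above: before upgrading, it may help to check a deployment against the documented requirements (`ARGILLA_SEARCH_ENGINE` set to `elasticsearch` or `opensearch`, with Elasticsearch >= 8.5.0 or OpenSearch >= 2.4.0). The sketch below is one possible pre-flight check, not part of Argilla. The helper names are hypothetical, the backend version is passed in by the caller rather than queried, and only plain numeric versions such as `8.8.0` are handled.

```python
import os
from typing import Mapping, Tuple

# Minimum versions as stated in the documentation changed by this commit.
MIN_VERSIONS = {"elasticsearch": (8, 5, 0), "opensearch": (2, 4, 0)}


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'major.minor.patch' (missing parts default to 0)."""
    parts = [int(part) for part in version.split(".")[:3]]
    parts += [0] * (3 - len(parts))
    return (parts[0], parts[1], parts[2])


def engine_from_env(environ: Mapping[str, str] = os.environ) -> str:
    """Read ARGILLA_SEARCH_ENGINE, falling back to the documented default."""
    return environ.get("ARGILLA_SEARCH_ENGINE", "elasticsearch")


def check_search_engine(engine: str, version: str) -> None:
    """Raise if the engine name is invalid or the version is below the minimum."""
    if engine not in MIN_VERSIONS:
        raise ValueError(f"unsupported ARGILLA_SEARCH_ENGINE value: {engine!r}")
    minimum = MIN_VERSIONS[engine]
    if parse_version(version) < minimum:
        required = ".".join(str(n) for n in minimum)
        raise RuntimeError(f"{engine} {version} is older than the required {required}")


# Example with an explicit mapping instead of the real process environment.
engine = engine_from_env({"ARGILLA_SEARCH_ENGINE": "opensearch"})
check_search_engine(engine, "2.8.0")  # passes silently
```

Passing the environment as a mapping keeps the check easy to test; calling `engine_from_env()` with no argument reads the real process environment.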