Merge branch 'develop' into improving-text-embeddings-with-llms
alvarobartt committed Jun 3, 2024
2 parents 1a4903e + e61b598 commit 811fab4
Showing 46 changed files with 1,209 additions and 293 deletions.
42 changes: 42 additions & 0 deletions .github/workflows/codspeed.yml
@@ -0,0 +1,42 @@
name: Benchmarks

on:
push:
branches:
- "main"
pull_request:

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
benchmarks:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4

- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: "3.12"
# Looks like it's not working very well for other people:
# https://github.com/actions/setup-python/issues/436
# cache: "pip"
# cache-dependency-path: pyproject.toml

- uses: actions/cache@v3
id: cache
with:
path: ${{ env.pythonLocation }}
key: ${{ runner.os }}-python-${{ env.pythonLocation }}-${{ hashFiles('pyproject.toml') }}-benchmarks-v00

- name: Install dependencies
if: steps.cache.outputs.cache-hit != 'true'
run: ./scripts/install_dependencies.sh

- name: Run benchmarks
uses: CodSpeedHQ/action@v2
with:
token: ${{ secrets.CODSPEED_TOKEN }}
run: pytest tests/ --codspeed
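
The new workflow collects benchmarks via `pytest tests/ --codspeed`, which runs any test marked for `pytest-codspeed` (added to the `tests` extra in this commit). A minimal sketch of such a benchmark (the workload here is a hypothetical stand-in, not one of the repository's actual tests):

```python
import pytest


def fibonacci(n: int) -> int:
    """Toy workload standing in for the real code under measurement."""
    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)


@pytest.mark.benchmark  # picked up when pytest runs with --codspeed
def test_fibonacci() -> None:
    assert fibonacci(20) == 6765
```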
18 changes: 8 additions & 10 deletions .github/workflows/test.yml
@@ -9,6 +9,12 @@ on:
types:
- opened
- synchronize
workflow_dispatch:
inputs:
tmate_session:
description: Starts the workflow with tmate enabled.
required: false
default: "false"

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
@@ -19,7 +25,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11"]
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
fail-fast: false

steps:
@@ -42,14 +48,7 @@ jobs:

- name: Install dependencies
if: steps.cache.outputs.cache-hit != 'true'
run: |
python_version=$(python -c "import sys; print(sys.version_info[:2])")
pip install -e .[dev,tests,anthropic,argilla,cohere,groq,hf-inference-endpoints,hf-transformers,litellm,llama-cpp,ollama,openai,outlines,vertexai,vllm]
if [ "${python_version}" != "(3, 8)" ]; then
pip install -e .[mistralai,instructor]
fi;
pip install git+https://github.com/argilla-io/LLM-Blender.git
run: ./scripts/install_dependencies.sh

- name: Lint
run: make lint
@@ -59,4 +58,3 @@ jobs:

- name: Integration Tests
run: make integration-tests
timeout-minutes: 5
5 changes: 2 additions & 3 deletions .pre-commit-config.yaml
@@ -11,11 +11,10 @@ repos:
- --fuzzy-match-generates-todo

- repo: https://github.com/charliermarsh/ruff-pre-commit
rev: v0.1.4
rev: v0.4.5
hooks:
- id: ruff
args:
- --fix
args: [--fix]
- id: ruff-format

ci:
4 changes: 2 additions & 2 deletions Makefile
@@ -2,12 +2,12 @@ sources = src/distilabel tests

.PHONY: format
format:
ruff --fix $(sources)
ruff check --fix $(sources)
ruff format $(sources)

.PHONY: lint
lint:
ruff $(sources)
ruff check $(sources)
ruff format --check $(sources)

.PHONY: unit-tests
24 changes: 24 additions & 0 deletions docs/sections/learn/advanced/fs_to_pass_data.md
@@ -0,0 +1,24 @@
# Using a file system to pass batch data between steps

In some situations, a batch can contain so much data that it is faster to write it to disk and read it back in the next step than to pass it between steps through the queue. To solve this, `distilabel` uses [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/), which allows providing a file system configuration and indicating whether that file system should be used to pass data between steps, both through the `run` method of the `distilabel` pipelines:

```python
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
...

if __name__ == "__main__":
distiset = pipeline.run(
...,
storage_parameters={"protocol": "gcs", "path": "gcs://my-bucket"},
use_fs_to_pass_data=True
)
```

The code above sets up a file system (in this case Google Cloud Storage) and enables the `use_fs_to_pass_data` flag to specify that the data of the batches should be passed between steps using the file system. The `storage_parameters` argument is optional; if it's not provided while `use_fs_to_pass_data==True`, `distilabel` will use the local file system.

!!! NOTE

As a `GlobalStep` receives all the data from the previous steps accumulated in one single batch, it's very likely that this batch will be too big to be passed through the queue. In that case, and even if `use_fs_to_pass_data==False`, `distilabel` will use the file system to pass the data to the `GlobalStep`.
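
Since `storage_parameters` is optional, a minimal sketch of the local-file-system fallback described above (reusing the hypothetical pipeline from the previous snippet) would be:

```python
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
    ...

if __name__ == "__main__":
    # Without storage_parameters, batches are written to the local file system.
    distiset = pipeline.run(use_fs_to_pass_data=True)
```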

2 changes: 1 addition & 1 deletion docs/sections/pipeline_samples/examples/index.md
@@ -10,7 +10,7 @@ Generate RPG characters following a `pydantic.BaseModel` with `outlines` in `distilabel`

This script makes use of [`LlamaCppLLM`][distilabel.llms.llamacpp.LlamaCppLLM] and the structured output capabilities thanks to [`outlines`](https://outlines-dev.github.io/outlines/welcome/) to generate RPG characters that adhere to a JSON schema.

It makes use of a local model which can be downlaoded using curl (explained in the script itself), and can be exchanged with other `LLMs` like [`vLLM`][distilabel.llms.vllm.vLLM].
It makes use of a local model which can be downloaded using curl (explained in the script itself), and can be exchanged with other `LLMs` like [`vLLM`][distilabel.llms.vllm.vLLM].

??? Run

7 changes: 4 additions & 3 deletions mkdocs.yml
@@ -142,6 +142,7 @@ nav:
- Caching: "sections/learn/advanced/caching.md"
- Distiset: "sections/learn/advanced/distiset.md"
- Structured Generation: "sections/learn/advanced/structured_generation.md"
- Using the file system to pass batch data: "sections/learn/advanced/fs_to_pass_data.md"
- Pipeline Samples:
- "sections/pipeline_samples/index.md"
- Examples: "sections/pipeline_samples/examples/index.md"
@@ -164,9 +165,9 @@ nav:
- GlobalStep: "api/step/global_step.md"
- "@step": "api/step/decorator.md"
- Step Gallery:
- Argilla: "api/step_gallery/argilla.md"
- Columns: "api/step_gallery/columns.md"
- Extra: "api/step_gallery/extra.md"
- Argilla: "api/step_gallery/argilla.md"
- Columns: "api/step_gallery/columns.md"
- Extra: "api/step_gallery/extra.md"
- Task:
- "api/task/index.md"
- GeneratorTask: "api/task/generator_task.md"
15 changes: 11 additions & 4 deletions pyproject.toml
@@ -17,6 +17,7 @@ classifiers = [
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: Implementation :: CPython",
"Programming Language :: Python :: Implementation :: PyPy",
]
@@ -45,7 +46,7 @@ distilabel = "distilabel.cli.app:app"
"distilabel/components-gallery" = "distilabel.utils.mkdocs.components_gallery:ComponentsGalleryPlugin"

[project.optional-dependencies]
dev = ["ruff == 0.2.2", "pre-commit >= 3.5.0"]
dev = ["ruff == 0.4.5", "pre-commit >= 3.5.0"]
docs = [
"mkdocs-material >= 9.5.0",
"mkdocstrings[python] >= 0.24.0",
@@ -57,11 +58,17 @@ docs = [
"CairoSVG >= 2.7.1",
"mknotebooks >= 0.8.0",
]
tests = ["pytest >= 7.4.0", "pytest-asyncio", "nest-asyncio", "pytest-timeout"]
tests = [
"pytest >= 7.4.0",
"pytest-asyncio",
"nest-asyncio",
"pytest-timeout",
"pytest-codspeed",
]

# Optional LLMs, integrations, etc
anthropic = ["anthropic >= 0.20.0"]
argilla = ["argilla >= 1.23.0"]
argilla = ["argilla >= 1.29.0"]
cohere = ["cohere >= 5.2.0"]
groq = ["groq >= 0.4.1"]
hf-inference-endpoints = ["huggingface_hub >= 0.19.0"]
@@ -74,7 +81,7 @@ ollama = ["ollama >= 0.1.7"]
openai = ["openai >= 1.0.0"]
outlines = ["outlines >= 0.0.40"]
vertexai = ["google-cloud-aiplatform >= 1.38.0"]
vllm = ["vllm >= 0.2.1", "filelock >= 3.13.4"]
vllm = ["vllm >= 0.4.0", "outlines == 0.0.34", "filelock >= 3.13.4"]

[project.urls]
Documentation = "https://distilabel.argilla.io/"
12 changes: 12 additions & 0 deletions scripts/install_dependencies.sh
@@ -0,0 +1,12 @@
#!/bin/bash

python_version=$(python -c "import sys; print(sys.version_info[:2])")

python -m pip install uv

uv pip install --system -e ".[dev,tests,anthropic,argilla,cohere,groq,hf-inference-endpoints,hf-transformers,litellm,llama-cpp,ollama,openai,outlines,vertexai]"
if [ "${python_version}" != "(3, 8)" ]; then
uv pip install --system -e .[mistralai,instructor]
fi

uv pip install --system git+https://github.com/argilla-io/LLM-Blender.git
8 changes: 4 additions & 4 deletions src/distilabel/llms/anthropic.py
@@ -33,7 +33,7 @@
from distilabel.llms.base import AsyncLLM
from distilabel.llms.typing import GenerateOutput
from distilabel.mixins.runtime_parameters import RuntimeParameter
from distilabel.steps.tasks.typing import ChatType
from distilabel.steps.tasks.typing import StandardInput
from distilabel.utils.itertools import grouper

if TYPE_CHECKING:
@@ -163,7 +163,7 @@ def model_name(self) -> str:
@validate_call
async def agenerate( # type: ignore
self,
input: ChatType,
input: StandardInput,
max_tokens: int = 128,
stop_sequences: Union[List[str], None] = None,
temperature: float = 1.0,
@@ -223,7 +223,7 @@ async def agenerate( # type: ignore
@override
def generate(
self,
inputs: List["ChatType"],
inputs: List["StandardInput"],
num_generations: int = 1,
**kwargs: Any,
) -> List["GenerateOutput"]:
@@ -232,7 +232,7 @@ def generate(
"""

async def agenerate(
inputs: List["ChatType"], **kwargs: Any
inputs: List["StandardInput"], **kwargs: Any
) -> "GenerateOutput":
"""Internal function to parallelize the asynchronous generation of responses."""
tasks = [
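
Note that the `ChatType` to `StandardInput` rename above only touches the type annotations; the runtime shape of an input is still a chat in the OpenAI-style message format. A quick illustration (the content strings are made up):

```python
# A StandardInput is a list of role/content messages.
standard_input = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a haiku about type aliases."},
]
```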
24 changes: 16 additions & 8 deletions src/distilabel/llms/base.py
@@ -37,7 +37,7 @@
InstructorStructuredOutputType,
)
from distilabel.steps.tasks.structured_outputs.outlines import StructuredOutputType
from distilabel.steps.tasks.typing import ChatType
from distilabel.steps.tasks.typing import FormattedInput, StandardInput
from distilabel.utils.docstring import Docstring

if in_notebook():
@@ -94,7 +94,7 @@ def model_name(self) -> str:
@abstractmethod
def generate(
self,
inputs: List["ChatType"],
inputs: List["FormattedInput"],
num_generations: int = 1,
**kwargs: Any,
) -> List["GenerateOutput"]:
@@ -187,7 +187,9 @@ def generate_parsed_docstring(self) -> "Docstring":
"""
return parse_google_docstring(self.generate)

def get_last_hidden_states(self, inputs: List["ChatType"]) -> List["HiddenState"]:
def get_last_hidden_states(
self, inputs: List["StandardInput"]
) -> List["HiddenState"]:
"""Method to get the last hidden states of the model for a list of inputs.
Args:
@@ -231,6 +233,7 @@ class AsyncLLM(LLM):
"""

_event_loop: "asyncio.AbstractEventLoop" = PrivateAttr(default=None)
_new_event_loop: bool = PrivateAttr(default=False)

@property
def generate_parameters(self) -> List[inspect.Parameter]:
@@ -257,14 +260,16 @@ def event_loop(self) -> "asyncio.AbstractEventLoop":
self._event_loop = asyncio.get_running_loop()
if self._event_loop.is_closed():
self._event_loop = asyncio.new_event_loop() # type: ignore
self._new_event_loop = True
except RuntimeError:
self._event_loop = asyncio.new_event_loop()
self._new_event_loop = True
asyncio.set_event_loop(self._event_loop)
return self._event_loop

@abstractmethod
async def agenerate(
self, input: "ChatType", num_generations: int = 1, **kwargs: Any
self, input: "FormattedInput", num_generations: int = 1, **kwargs: Any
) -> List[Union[str, None]]:
"""Method to generate a `num_generations` responses for a given input asynchronously,
and executed concurrently in `generate` method.
@@ -273,7 +278,7 @@ async def agenerate(

def generate(
self,
inputs: List["ChatType"],
inputs: List["FormattedInput"],
num_generations: int = 1,
**kwargs: Any,
) -> List["GenerateOutput"]:
Expand All @@ -282,7 +287,7 @@ def generate(
"""

async def agenerate(
inputs: List["ChatType"], **kwargs: Any
inputs: List["FormattedInput"], **kwargs: Any
) -> List[List[Union[str, None]]]:
"""Internal function to parallelize the asynchronous generation of responses."""
tasks = [
@@ -301,8 +306,11 @@ def __del__(self) -> None:
"""Closes the event loop when the object is deleted."""
if sys.meta_path is None:
return
if self.event_loop is not None:
self.event_loop.close()

if self._new_event_loop:
if self._event_loop.is_running():
self._event_loop.stop()
self._event_loop.close()

@staticmethod
def _prepare_structured_output(
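
The new `_new_event_loop` flag ensures `__del__` only closes event loops the instance created itself, never a running loop borrowed from the caller. A self-contained sketch of that ownership pattern (not the actual `AsyncLLM` class, just the idea):

```python
import asyncio
from typing import Optional


class LoopOwner:
    """Closes only event loops it created, mirroring AsyncLLM's bookkeeping."""

    def __init__(self) -> None:
        self._loop: Optional[asyncio.AbstractEventLoop] = None
        self._new_event_loop = False

    @property
    def loop(self) -> asyncio.AbstractEventLoop:
        if self._loop is None or self._loop.is_closed():
            try:
                # Reuse the caller's running loop if there is one.
                self._loop = asyncio.get_running_loop()
            except RuntimeError:
                # No running loop: create one and remember that we own it.
                self._loop = asyncio.new_event_loop()
                asyncio.set_event_loop(self._loop)
                self._new_event_loop = True
        return self._loop

    def close(self) -> None:
        # Never close a loop we merely borrowed.
        if self._new_event_loop and self._loop is not None:
            if self._loop.is_running():
                self._loop.stop()
            if not self._loop.is_closed():
                self._loop.close()
```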
10 changes: 5 additions & 5 deletions src/distilabel/llms/cohere.py
@@ -30,7 +30,7 @@

from distilabel.llms.base import AsyncLLM
from distilabel.mixins.runtime_parameters import RuntimeParameter
from distilabel.steps.tasks.typing import ChatType
from distilabel.steps.tasks.typing import StandardInput
from distilabel.utils.itertools import grouper

if TYPE_CHECKING:
@@ -132,7 +132,7 @@ def load(self) -> None:
self.structured_output = structured_output

def _format_chat_to_cohere(
self, input: "ChatType"
self, input: "StandardInput"
) -> Tuple[Union[str, None], List["ChatMessage"], str]:
"""Formats the chat input to the Cohere Chat API conversational format.
@@ -169,7 +169,7 @@ def _format_chat_to_cohere(
@validate_call
async def agenerate( # type: ignore
self,
input: ChatType,
input: StandardInput,
temperature: Optional[float] = None,
max_tokens: Optional[int] = None,
k: Optional[int] = None,
@@ -241,15 +241,15 @@ async def agenerate( # type: ignore
@override
def generate(
self,
inputs: List["ChatType"],
inputs: List["StandardInput"],
num_generations: int = 1,
**kwargs: Any,
) -> List["GenerateOutput"]:
"""Method to generate a list of responses asynchronously, returning the output
synchronously awaiting for the response of each input sent to `agenerate`."""

async def agenerate(
inputs: List["ChatType"], **kwargs: Any
inputs: List["StandardInput"], **kwargs: Any
) -> "GenerateOutput":
"""Internal function to parallelize the asynchronous generation of responses."""
tasks = [
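
The `_format_chat_to_cohere` helper above splits a `StandardInput` chat into the triple Cohere's Chat API expects: an optional system prompt, the prior turns, and the final message. A rough sketch of that split, with simplified field names on the Cohere side (the real helper lives in `src/distilabel/llms/cohere.py`):

```python
from typing import Dict, List, Optional, Tuple


def split_chat_for_cohere(
    chat: List[Dict[str, str]],
) -> Tuple[Optional[str], List[Dict[str, str]], str]:
    """Split an OpenAI-style chat into (system, history, last message)."""
    system: Optional[str] = None
    history: List[Dict[str, str]] = []
    for message in chat[:-1]:
        if message["role"] == "system":
            system = message["content"]
        else:
            # Cohere labels turns USER / CHATBOT rather than user / assistant.
            role = "USER" if message["role"] == "user" else "CHATBOT"
            history.append({"role": role, "message": message["content"]})
    return system, history, chat[-1]["content"]
```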