Merge branch 'develop' into improving-text-embeddings-with-llms
alvarobartt committed Jun 3, 2024
2 parents 1a4903e + e61b598 commit 811fab4
Showing 46 changed files with 1,209 additions and 293 deletions.
42 changes: 42 additions & 0 deletions .github/workflows/codspeed.yml
@@ -0,0 +1,42 @@
name: Benchmarks

on:
push:
branches:
- "main"
pull_request:

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
benchmarks:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4

- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: "3.12"
# Looks like it's not working very well for other people:
# https://github.com/actions/setup-python/issues/436
# cache: "pip"
# cache-dependency-path: pyproject.toml

- uses: actions/cache@v3
id: cache
with:
path: ${{ env.pythonLocation }}
key: ${{ runner.os }}-python-${{ env.pythonLocation }}-${{ hashFiles('pyproject.toml') }}-benchmarks-v00

- name: Install dependencies
if: steps.cache.outputs.cache-hit != 'true'
run: ./scripts/install_dependencies.sh

- name: Run benchmarks
uses: CodSpeedHQ/action@v2
with:
token: ${{ secrets.CODSPEED_TOKEN }}
run: pytest tests/ --codspeed
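
The new workflow collects benchmarks via `pytest tests/ --codspeed`, which runs any test marked for `pytest-codspeed` (added to the `tests` extra in this commit). A minimal sketch of such a benchmark (the workload here is a hypothetical stand-in, not one of the repository's actual tests):

```python
import pytest


def fibonacci(n: int) -> int:
    """Toy workload standing in for the real code under measurement."""
    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)


@pytest.mark.benchmark  # picked up when pytest runs with --codspeed
def test_fibonacci() -> None:
    assert fibonacci(20) == 6765
```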
18 changes: 8 additions & 10 deletions .github/workflows/test.yml
@@ -9,6 +9,12 @@ on:
types:
- opened
- synchronize
workflow_dispatch:
inputs:
tmate_session:
description: Starts the workflow with tmate enabled.
required: false
default: "false"

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
@@ -19,7 +25,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11"]
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
fail-fast: false

steps:
@@ -42,14 +48,7 @@ jobs:

- name: Install dependencies
if: steps.cache.outputs.cache-hit != 'true'
run: |
python_version=$(python -c "import sys; print(sys.version_info[:2])")
pip install -e .[dev,tests,anthropic,argilla,cohere,groq,hf-inference-endpoints,hf-transformers,litellm,llama-cpp,ollama,openai,outlines,vertexai,vllm]
if [ "${python_version}" != "(3, 8)" ]; then
pip install -e .[mistralai,instructor]
fi;
pip install git+https://github.com/argilla-io/LLM-Blender.git
run: ./scripts/install_dependencies.sh

- name: Lint
run: make lint
@@ -59,4 +58,3 @@ jobs:

- name: Integration Tests
run: make integration-tests
timeout-minutes: 5
5 changes: 2 additions & 3 deletions .pre-commit-config.yaml
@@ -11,11 +11,10 @@ repos:
- --fuzzy-match-generates-todo

- repo: https://github.com/charliermarsh/ruff-pre-commit
rev: v0.1.4
rev: v0.4.5
hooks:
- id: ruff
args:
- --fix
args: [--fix]
- id: ruff-format

ci:
4 changes: 2 additions & 2 deletions Makefile
@@ -2,12 +2,12 @@ sources = src/distilabel tests

.PHONY: format
format:
ruff --fix $(sources)
ruff check --fix $(sources)
ruff format $(sources)

.PHONY: lint
lint:
ruff $(sources)
ruff check $(sources)
ruff format --check $(sources)

.PHONY: unit-tests
24 changes: 24 additions & 0 deletions docs/sections/learn/advanced/fs_to_pass_data.md
@@ -0,0 +1,24 @@
# Using a file system to pass batch data between steps

In some situations, a batch can contain so much data that it is faster to write it to disk and read it back in the next step than to pass it between steps through the queue. To solve this, `distilabel` uses [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/), which allows providing a file system configuration and indicating whether that file system should be used to pass data between steps, both through the `run` method of the `distilabel` pipelines:

```python
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
...

if __name__ == "__main__":
distiset = pipeline.run(
...,
storage_parameters={"protocol": "gcs", "path": "gcs://my-bucket"},
use_fs_to_pass_data=True
)
```

The code above sets up a file system (in this case Google Cloud Storage) and enables the `use_fs_to_pass_data` flag to specify that the data of the batches should be passed between steps using the file system. The `storage_parameters` argument is optional; if it's not provided while `use_fs_to_pass_data==True`, `distilabel` will use the local file system.

!!! NOTE

As a `GlobalStep` receives all the data from the previous steps accumulated in one single batch, it's very likely that this batch will be too big to be passed through the queue. In that case, and even if `use_fs_to_pass_data==False`, `distilabel` will use the file system to pass the data to the `GlobalStep`.
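
Since `storage_parameters` is optional, a minimal sketch of the local-file-system fallback described above (reusing the hypothetical pipeline from the previous snippet) would be:

```python
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
    ...

if __name__ == "__main__":
    # Without storage_parameters, batches are written to the local file system.
    distiset = pipeline.run(use_fs_to_pass_data=True)
```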

2 changes: 1 addition & 1 deletion docs/sections/pipeline_samples/examples/index.md
@@ -10,7 +10,7 @@ Generate RPG characters following a `pydantic.BaseModel` with `outlines` in `distilabel`

This script makes use of [`LlamaCppLLM`][distilabel.llms.llamacpp.LlamaCppLLM] and the structured output capabilities thanks to [`outlines`](https://outlines-dev.github.io/outlines/welcome/) to generate RPG characters that adhere to a JSON schema.

It makes use of a local model which can be downlaoded using curl (explained in the script itself), and can be exchanged with other `LLMs` like [`vLLM`][distilabel.llms.vllm.vLLM].
It makes use of a local model which can be downloaded using curl (explained in the script itself), and can be exchanged with other `LLMs` like [`vLLM`][distilabel.llms.vllm.vLLM].

??? Run

7 changes: 4 additions & 3 deletions mkdocs.yml
@@ -142,6 +142,7 @@ nav:
- Caching: "sections/learn/advanced/caching.md"
- Distiset: "sections/learn/advanced/distiset.md"
- Structured Generation: "sections/learn/advanced/structured_generation.md"
- Using the file system to pass batch data: "sections/learn/advanced/fs_to_pass_data.md"
- Pipeline Samples:
- "sections/pipeline_samples/index.md"
- Examples: "sections/pipeline_samples/examples/index.md"
@@ -164,9 +165,9 @@ nav:
- GlobalStep: "api/step/global_step.md"
- "@step": "api/step/decorator.md"
- Step Gallery:
- Argilla: "api/step_gallery/argilla.md"
- Columns: "api/step_gallery/columns.md"
- Extra: "api/step_gallery/extra.md"
- Argilla: "api/step_gallery/argilla.md"
- Columns: "api/step_gallery/columns.md"
- Extra: "api/step_gallery/extra.md"
- Task:
- "api/task/index.md"
- GeneratorTask: "api/task/generator_task.md"
15 changes: 11 additions & 4 deletions pyproject.toml
@@ -17,6 +17,7 @@ classifiers = [
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: Implementation :: CPython",
"Programming Language :: Python :: Implementation :: PyPy",
]
@@ -45,7 +46,7 @@ distilabel = "distilabel.cli.app:app"
"distilabel/components-gallery" = "distilabel.utils.mkdocs.components_gallery:ComponentsGalleryPlugin"

[project.optional-dependencies]
dev = ["ruff == 0.2.2", "pre-commit >= 3.5.0"]
dev = ["ruff == 0.4.5", "pre-commit >= 3.5.0"]
docs = [
"mkdocs-material >= 9.5.0",
"mkdocstrings[python] >= 0.24.0",
@@ -57,11 +58,17 @@ docs = [
"CairoSVG >= 2.7.1",
"mknotebooks >= 0.8.0",
]
tests = ["pytest >= 7.4.0", "pytest-asyncio", "nest-asyncio", "pytest-timeout"]
tests = [
"pytest >= 7.4.0",
"pytest-asyncio",
"nest-asyncio",
"pytest-timeout",
"pytest-codspeed",
]

# Optional LLMs, integrations, etc
anthropic = ["anthropic >= 0.20.0"]
argilla = ["argilla >= 1.23.0"]
argilla = ["argilla >= 1.29.0"]
cohere = ["cohere >= 5.2.0"]
groq = ["groq >= 0.4.1"]
hf-inference-endpoints = ["huggingface_hub >= 0.19.0"]
@@ -74,7 +81,7 @@ ollama = ["ollama >= 0.1.7"]
openai = ["openai >= 1.0.0"]
outlines = ["outlines >= 0.0.40"]
vertexai = ["google-cloud-aiplatform >= 1.38.0"]
vllm = ["vllm >= 0.2.1", "filelock >= 3.13.4"]
vllm = ["vllm >= 0.4.0", "outlines == 0.0.34", "filelock >= 3.13.4"]

[project.urls]
Documentation = "https://distilabel.argilla.io/"
12 changes: 12 additions & 0 deletions scripts/install_dependencies.sh
@@ -0,0 +1,12 @@
#!/bin/bash

python_version=$(python -c "import sys; print(sys.version_info[:2])")

python -m pip install uv

uv pip install --system -e ".[dev,tests,anthropic,argilla,cohere,groq,hf-inference-endpoints,hf-transformers,litellm,llama-cpp,ollama,openai,outlines,vertexai]"
if [ "${python_version}" != "(3, 8)" ]; then
uv pip install --system -e .[mistralai,instructor]
fi

uv pip install --system git+https://github.com/argilla-io/LLM-Blender.git
8 changes: 4 additions & 4 deletions src/distilabel/llms/anthropic.py
@@ -33,7 +33,7 @@
from distilabel.llms.base import AsyncLLM
from distilabel.llms.typing import GenerateOutput
from distilabel.mixins.runtime_parameters import RuntimeParameter
from distilabel.steps.tasks.typing import ChatType
from distilabel.steps.tasks.typing import StandardInput
from distilabel.utils.itertools import grouper

if TYPE_CHECKING:
@@ -163,7 +163,7 @@ def model_name(self) -> str:
@validate_call
async def agenerate( # type: ignore
self,
input: ChatType,
input: StandardInput,
max_tokens: int = 128,
stop_sequences: Union[List[str], None] = None,
temperature: float = 1.0,
@@ -223,7 +223,7 @@ async def agenerate( # type: ignore
@override
def generate(
self,
inputs: List["ChatType"],
inputs: List["StandardInput"],
num_generations: int = 1,
**kwargs: Any,
) -> List["GenerateOutput"]:
@@ -232,7 +232,7 @@ def generate(
"""

async def agenerate(
inputs: List["ChatType"], **kwargs: Any
inputs: List["StandardInput"], **kwargs: Any
) -> "GenerateOutput":
"""Internal function to parallelize the asynchronous generation of responses."""
tasks = [
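
Note that the `ChatType` to `StandardInput` rename above only touches the type annotations; the runtime shape of an input is still a chat in the OpenAI-style message format. A quick illustration (the content strings are made up):

```python
# A StandardInput is a list of role/content messages.
standard_input = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a haiku about type aliases."},
]
```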
24 changes: 16 additions & 8 deletions src/distilabel/llms/base.py
@@ -37,7 +37,7 @@
InstructorStructuredOutputType,
)
from distilabel.steps.tasks.structured_outputs.outlines import StructuredOutputType
from distilabel.steps.tasks.typing import ChatType
from distilabel.steps.tasks.typing import FormattedInput, StandardInput
from distilabel.utils.docstring import Docstring

if in_notebook():
@@ -94,7 +94,7 @@ def model_name(self) -> str:
@abstractmethod
def generate(
self,
inputs: List["ChatType"],
inputs: List["FormattedInput"],
num_generations: int = 1,
**kwargs: Any,
) -> List["GenerateOutput"]:
@@ -187,7 +187,9 @@ def generate_parsed_docstring(self) -> "Docstring":
"""
return parse_google_docstring(self.generate)

def get_last_hidden_states(self, inputs: List["ChatType"]) -> List["HiddenState"]:
def get_last_hidden_states(
self, inputs: List["StandardInput"]
) -> List["HiddenState"]:
"""Method to get the last hidden states of the model for a list of inputs.
Args:
@@ -231,6 +233,7 @@ class AsyncLLM(LLM):
"""

_event_loop: "asyncio.AbstractEventLoop" = PrivateAttr(default=None)
_new_event_loop: bool = PrivateAttr(default=False)

@property
def generate_parameters(self) -> List[inspect.Parameter]:
@@ -257,14 +260,16 @@ def event_loop(self) -> "asyncio.AbstractEventLoop":
self._event_loop = asyncio.get_running_loop()
if self._event_loop.is_closed():
self._event_loop = asyncio.new_event_loop() # type: ignore
self._new_event_loop = True
except RuntimeError:
self._event_loop = asyncio.new_event_loop()
self._new_event_loop = True
asyncio.set_event_loop(self._event_loop)
return self._event_loop

@abstractmethod
async def agenerate(
self, input: "ChatType", num_generations: int = 1, **kwargs: Any
self, input: "FormattedInput", num_generations: int = 1, **kwargs: Any
) -> List[Union[str, None]]:
"""Method to generate a `num_generations` responses for a given input asynchronously,
and executed concurrently in `generate` method.
@@ -273,7 +278,7 @@ async def agenerate(

def generate(
self,
inputs: List["ChatType"],
inputs: List["FormattedInput"],
num_generations: int = 1,
**kwargs: Any,
) -> List["GenerateOutput"]:
Expand All @@ -282,7 +287,7 @@ def generate(
"""

async def agenerate(
inputs: List["ChatType"], **kwargs: Any
inputs: List["FormattedInput"], **kwargs: Any
) -> List[List[Union[str, None]]]:
"""Internal function to parallelize the asynchronous generation of responses."""
tasks = [
@@ -301,8 +306,11 @@ def __del__(self) -> None:
"""Closes the event loop when the object is deleted."""
if sys.meta_path is None:
return
if self.event_loop is not None:
self.event_loop.close()

if self._new_event_loop:
if self._event_loop.is_running():
self._event_loop.stop()
self._event_loop.close()

@staticmethod
def _prepare_structured_output(
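
The new `_new_event_loop` flag ensures `__del__` only closes event loops the instance created itself, never a running loop borrowed from the caller. A self-contained sketch of that ownership pattern (not the actual `AsyncLLM` class, just the idea):

```python
import asyncio
from typing import Optional


class LoopOwner:
    """Closes only event loops it created, mirroring AsyncLLM's bookkeeping."""

    def __init__(self) -> None:
        self._loop: Optional[asyncio.AbstractEventLoop] = None
        self._new_event_loop = False

    @property
    def loop(self) -> asyncio.AbstractEventLoop:
        if self._loop is None or self._loop.is_closed():
            try:
                # Reuse the caller's running loop if there is one.
                self._loop = asyncio.get_running_loop()
            except RuntimeError:
                # No running loop: create one and remember that we own it.
                self._loop = asyncio.new_event_loop()
                asyncio.set_event_loop(self._loop)
                self._new_event_loop = True
        return self._loop

    def close(self) -> None:
        # Never close a loop we merely borrowed.
        if self._new_event_loop and self._loop is not None:
            if self._loop.is_running():
                self._loop.stop()
            if not self._loop.is_closed():
                self._loop.close()
```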
10 changes: 5 additions & 5 deletions src/distilabel/llms/cohere.py
@@ -30,7 +30,7 @@

from distilabel.llms.base import AsyncLLM
from distilabel.mixins.runtime_parameters import RuntimeParameter
from distilabel.steps.tasks.typing import ChatType
from distilabel.steps.tasks.typing import StandardInput
from distilabel.utils.itertools import grouper

if TYPE_CHECKING:
@@ -132,7 +132,7 @@ def load(self) -> None:
self.structured_output = structured_output

def _format_chat_to_cohere(
self, input: "ChatType"
self, input: "StandardInput"
) -> Tuple[Union[str, None], List["ChatMessage"], str]:
"""Formats the chat input to the Cohere Chat API conversational format.
@@ -169,7 +169,7 @@ def _format_chat_to_cohere(
@validate_call
async def agenerate( # type: ignore
self,
input: ChatType,
input: StandardInput,
temperature: Optional[float] = None,
max_tokens: Optional[int] = None,
k: Optional[int] = None,
@@ -241,15 +241,15 @@ async def agenerate( # type: ignore
@override
def generate(
self,
inputs: List["ChatType"],
inputs: List["StandardInput"],
num_generations: int = 1,
**kwargs: Any,
) -> List["GenerateOutput"]:
"""Method to generate a list of responses asynchronously, returning the output
synchronously awaiting for the response of each input sent to `agenerate`."""

async def agenerate(
inputs: List["ChatType"], **kwargs: Any
inputs: List["StandardInput"], **kwargs: Any
) -> "GenerateOutput":
"""Internal function to parallelize the asynchronous generation of responses."""
tasks = [
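
The `_format_chat_to_cohere` helper above splits a `StandardInput` chat into the triple Cohere's Chat API expects: an optional system prompt, the prior turns, and the final message. A rough sketch of that split, with simplified field names on the Cohere side (the real helper lives in `src/distilabel/llms/cohere.py`):

```python
from typing import Dict, List, Optional, Tuple


def split_chat_for_cohere(
    chat: List[Dict[str, str]],
) -> Tuple[Optional[str], List[Dict[str, str]], str]:
    """Split an OpenAI-style chat into (system, history, last message)."""
    system: Optional[str] = None
    history: List[Dict[str, str]] = []
    for message in chat[:-1]:
        if message["role"] == "system":
            system = message["content"]
        else:
            # Cohere labels turns USER / CHATBOT rather than user / assistant.
            role = "USER" if message["role"] == "user" else "CHATBOT"
            history.append({"role": role, "message": message["content"]})
    return system, history, chat[-1]["content"]
```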