# 1.2.0

## ✨ Release highlights

### Structured generation with `instructor`, `InferenceEndpointsLLM` now supports structured generation and `StructuredGeneration` task

- `instructor` has been integrated, bringing support for structured generation with `OpenAILLM`, `AnthropicLLM`, `LiteLLM`, `MistralLLM`, `CohereLLM` and `GroqLLM`.
Structured generation with `instructor` example:

```python
from typing import List

from distilabel.llms import MistralLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, Field


class Node(BaseModel):
    id: int
    label: str
    color: str


class Edge(BaseModel):
    source: int
    target: int
    label: str
    color: str = "black"


class KnowledgeGraph(BaseModel):
    # Note: `Field(..., default_factory=list)` is invalid in pydantic
    # (a field cannot be both required and have a default factory), so
    # only `default_factory` is used here.
    nodes: List[Node] = Field(default_factory=list)
    edges: List[Edge] = Field(default_factory=list)


with Pipeline(
    name="Knowledge-Graphs",
    description=(
        "Generate knowledge graphs to answer questions, this type of dataset can be used to "
        "steer a model to answer questions with a knowledge graph."
    ),
) as pipeline:
    sample_questions = [
        "Teach me about quantum mechanics",
        "Who is who in The Simpsons family?",
        "Tell me about the evolution of programming languages",
    ]

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": "You are a knowledge graph expert generator. Help me understand by describing everything as a detailed knowledge graph.",
                "instruction": f"{question}",
            }
            for question in sample_questions
        ],
    )

    text_generation = TextGeneration(
        name="knowledge_graph_generation",
        llm=MistralLLM(
            model="open-mixtral-8x22b", structured_output={"schema": KnowledgeGraph}
        ),
    )

    load_dataset >> text_generation
```
- `InferenceEndpointsLLM` now supports structured generation.
- New `StructuredGeneration` task that allows defining the schema of the structured generation per input row.
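To make the per-row schema idea concrete, below is a minimal sketch of what an input row for the `StructuredGeneration` task could look like. The column names (`instruction`, `structured_output`) and the `{"format": ..., "schema": ...}` layout are illustrative assumptions based on the description above, not a confirmed API; check the task's documentation for the exact format.

```python
import json

# Hypothetical input row for the `StructuredGeneration` task: each row
# carries its own output schema, so different rows of the same dataset can
# request differently structured outputs. Column names are assumptions.
row = {
    "instruction": "Create a user profile with a name and an age.",
    "structured_output": {
        "format": "json",
        "schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
            },
            "required": ["name", "age"],
        },
    },
}

# The schema is plain JSON-serializable data, so it can live in a regular
# dataset column alongside the instruction.
serialized = json.dumps(row["structured_output"])
```

Because the schema travels with the row, a single task instance can serve a dataset that mixes several target structures.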
### New tasks for generating datasets for training embedding models

`sentence-transformers` v3 was recently released, and we couldn't resist the urge to add a few new tasks for creating datasets to train embedding models!
- New `GenerateSentencePair` task that generates a `positive` sentence for an input `anchor`, and optionally also a `negative` sentence. The task allows creating different kinds of data by specifying the `action` to perform with respect to the `anchor`: paraphrase, generate a semantically similar sentence, generate a query, or generate an answer.
- Implemented "Improving Text Embeddings with Large Language Models", adding the following tasks derived from the paper:
  - `EmbeddingTaskGenerator`, which allows generating new embedding-related tasks using an `LLM`.
  - `GenerateTextRetrievalData`, which allows creating text retrieval data with an `LLM`.
  - `GenerateShortTextMatchingData`, which allows creating short texts matching the input data.
  - `GenerateLongTextMatchingData`, which allows creating long texts matching the input data.
  - `GenerateTextClassificationData`, which allows creating text classification data from the input data.
  - `MonolingualTripletGenerator`, which allows creating monolingual triplets from the input data.
  - `BitextRetrievalGenerator`, which allows creating bitext retrieval data from the input data.
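As a rough illustration of the data the `GenerateSentencePair` task described above produces, the sketch below shows the shape of a generated row when a negative sentence is also requested, together with the four kinds of `action` the description mentions. The column names and action labels are inferred from the prose above and should be treated as illustrative, not as the task's exact output contract.

```python
# Illustrative shape of a row produced by `GenerateSentencePair` when a
# negative sentence is also generated (column names inferred from the
# feature description, not guaranteed to match the real output columns).
row = {
    "anchor": "The quick brown fox jumps over the lazy dog.",
    "positive": "A fast brown fox leaps above a sleepy dog.",
    "negative": "Stock markets closed slightly lower on Friday.",
}

# The four kinds of generation described for the task's `action` setting
# (labels are paraphrases of the prose, not verified parameter values):
actions = ["paraphrase", "semantically-similar", "query", "answer"]
```

Triplets of this shape (anchor, positive, negative) are the standard input for contrastive losses used when training embedding models with `sentence-transformers`.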
### New `Step`s for loading data from different sources and saving/loading `Distiset` to disk

We've added a few new steps that allow loading data from different sources:

- `LoadDataFromDisk` allows loading a `Distiset` or `datasets.Dataset` that was previously saved using the `save_to_disk` method.
- `LoadDataFromFileSystem` allows loading a `datasets.Dataset` from a file system.
Thanks to @rasdani for helping us test these new tasks!
In addition, we have added a `save_to_disk` method to `Distiset`, akin to `datasets.Dataset.save_to_disk`, that allows saving the generated distiset to disk along with the `pipeline.yaml` and `pipeline.log`.
`save_to_disk` example:

```python
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
    ...

if __name__ == "__main__":
    distiset = pipeline.run(...)
    distiset.save_to_disk(dataset_path="my-distiset")
```
### `MixtureOfAgentsLLM` implementation

We've added a new `LLM` called `MixtureOfAgentsLLM`, derived from the paper "Mixture-of-Agents Enhances Large Language Model Capabilities". This new `LLM` allows generating improved outputs thanks to the collective expertise of several `LLM`s.
`MixtureOfAgentsLLM` example:

```python
from distilabel.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM

llm = MixtureOfAgentsLLM(
    aggregator_llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
    proposers_llms=[
        InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        ),
        InferenceEndpointsLLM(
            model_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
            tokenizer_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
        ),
        InferenceEndpointsLLM(
            model_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
            tokenizer_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
        ),
    ],
    rounds=2,
)

llm.load()

output = llm.generate(
    inputs=[
        [
            {
                "role": "user",
                "content": "My favorite witty review of The Rings of Power series is this: Input:",
            }
        ]
    ]
)
```
### Saving cache and passing batches to `GlobalStep`s optimizations

- The cache logic of the `_BatchManager` has been improved to incrementally update the cache, making the process much faster.
- The data of the input batches of the `GlobalStep`s will be passed to the step using the file system, as this is faster than passing it using the queue. This is possible thanks to the new integration of `fsspec`, which can be configured to use a file system or cloud storage as a backend for passing the data of the batches.
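As a sketch of how the `fsspec` backend might be pointed at cloud storage, the snippet below builds a hypothetical storage configuration using an `fsspec`-style URL; the `storage_parameters` name and the `"path"` key are assumptions based on the description above, not a confirmed API surface.

```python
# Hypothetical configuration for the fsspec-backed batch passing described
# above: an fsspec-style URL selects the backend (local disk, S3, GCS, ...).
# The `storage_parameters` name and the "path" key are illustrative only.
storage_parameters = {"path": "gcs://my-bucket/pipeline-batches"}

# fsspec infers the backend implementation from the URL scheme:
scheme = storage_parameters["path"].split("://", 1)[0]
```

Leaving the path unset would fall back to a local file system, which is typically the right choice when all pipeline processes run on a single machine.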
### `BasePipeline` and `_BatchManager` refactor

The logic around `BasePipeline` and `_BatchManager` has been refactored, which will make it easier to implement new pipelines in the future.
### Added `ArenaHard` as an example of how to use `distilabel` to implement a benchmark

`distilabel` can be easily used to create an `LLM` benchmark. To showcase this, we decided to implement Arena Hard as an example: Benchmarking with `distilabel`: Arena Hard.
## 📚 Improved documentation structure

We have updated the documentation structure to make it clearer and more self-explanatory, as well as more visually appealing 😏.
## What's Changed

- Add `prometheus.md` by @alvarobartt in #656
- Reduce time required to execute `_cache` method by @gabrielmbmb in #672
- [DOCS] Update theme styles and images by @leiyre in #667
- Fix circular import due to `DISTILABEL_METADATA_KEY` by @plaguss in #675
- Add `CITATION.cff` by @alvarobartt in #677
- Deprecate conversation support in `TextGeneration` in favour of `ChatGeneration` by @alvarobartt in #676
- Add functionality to load/save distisets to/from disk by @plaguss in #673
- Integration instructor by @plaguss in #654
- Fix docs of saving/loading distiset from disk by @plaguss in #679
- Pass data of batches using file system by @gabrielmbmb in #678
- Add `python==3.12` by @gabrielmbmb in #615
- Add `codspeed` benchmarks by @gabrielmbmb in #674
- Add `StructuredGeneration` task and support for `grammar` in `InferenceEndpointsLLM` by @alvarobartt in #680
- Fix `InferenceEndpointsLLM` not using cached token by @gabrielmbmb in #690
- Add `GenerateSentencePair` task by @gabrielmbmb in #689
- Fix prepend batches by @gabrielmbmb in #696
- Fix `EvolQuality._apply_random_mutation` not properly injecting `response` in template by @davidberenstein1957 in #703
- [FEATURE] Include new `GeneratorStep` classes to load datasets from different formats by @plaguss in #691
- Add citation readme by @plaguss in #712
- Move navigation to top bar by @plaguss in #708
- Fix `install_dependencies.sh` by @alvarobartt in #713
- Add context to guide the generate sentence pair task if informed by @plaguss in #706
- Add examples to the LLMs to be shown in the components gallery by @plaguss in #714
- Gather `HF_TOKEN` internally when calling `Distiset.push_to_hub` if token is `None` by @plaguss in #707
- Implement "Improving Text Embeddings with LLMs" by @alvarobartt in #683
- Add `ArenaHard` benchmark and `ArenaHardResults` step by @alvarobartt in #670
- Refactor `Pipeline` and `BasePipeline` classes by @gabrielmbmb in #704
- Fix `AzureOpenAILLM` load method setting the correct path to mock the internal class by @plaguss in #725
- Components examples steps by @plaguss in #715
- Add examples for tasks in the components gallery by @plaguss in #724
- [FEATURE] Refactor of structured generation and use schemas defined in a dataset by @plaguss in #688
- Update docs document phrasing and funnel by @davidberenstein1957 in #718
- docs: 728 docs api reference tasktyping cannot be imported during doc build by @davidberenstein1957 in #729
- docs: 730 docs add an index to the guide overview by @davidberenstein1957 in #731
- Add `MixtureOfAgentsLLM` by @gabrielmbmb in #735
- Add `examples/arena_hard.py` and remove from `distilabel` core by @alvarobartt in #741
- Add serving LLM section in the docs by @gabrielmbmb in #742
- `distilabel` v1.2.0 by @alvarobartt in #659
## New Contributors

**Full Changelog**: 1.1.1...1.2.0