# 1.2.0

## ✨ Release highlights

### Structured generation with `instructor`, `InferenceEndpointsLLM` now supports structured generation and `StructuredGeneration` task

- `instructor` has been integrated, bringing support for structured generation with `OpenAILLM`, `AnthropicLLM`, `LiteLLM`, `MistralLLM`, `CohereLLM` and `GroqLLM`.
Structured generation with `instructor` example:

```python
from typing import List

from distilabel.llms import MistralLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, Field


class Node(BaseModel):
    id: int
    label: str
    color: str


class Edge(BaseModel):
    source: int
    target: int
    label: str
    color: str = "black"


class KnowledgeGraph(BaseModel):
    # Note: `Field(..., default_factory=list)` is invalid in pydantic
    # (a field cannot be both required and have a default factory), so
    # only `default_factory` is used here.
    nodes: List[Node] = Field(default_factory=list)
    edges: List[Edge] = Field(default_factory=list)


with Pipeline(
    name="Knowledge-Graphs",
    description=(
        "Generate knowledge graphs to answer questions, this type of dataset can be used to "
        "steer a model to answer questions with a knowledge graph."
    ),
) as pipeline:
    sample_questions = [
        "Teach me about quantum mechanics",
        "Who is who in The Simpsons family?",
        "Tell me about the evolution of programming languages",
    ]

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": "You are a knowledge graph expert generator. Help me understand by describing everything as a detailed knowledge graph.",
                "instruction": f"{question}",
            }
            for question in sample_questions
        ],
    )

    text_generation = TextGeneration(
        name="knowledge_graph_generation",
        llm=MistralLLM(
            model="open-mixtral-8x22b", structured_output={"schema": KnowledgeGraph}
        ),
    )

    load_dataset >> text_generation
```
- `InferenceEndpointsLLM` now supports structured generation.
- New `StructuredGeneration` task that allows defining the schema of the structured generation per input row.
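To make the per-row schema idea concrete, below is a minimal sketch of what an input row for the `StructuredGeneration` task could look like. The column names (`instruction`, `structured_output`) and the `{"format": ..., "schema": ...}` layout are illustrative assumptions based on the description above, not a confirmed API; check the task's documentation for the exact format.

```python
import json

# Hypothetical input row for the `StructuredGeneration` task: each row
# carries its own output schema, so different rows of the same dataset can
# request differently structured outputs. Column names are assumptions.
row = {
    "instruction": "Create a user profile with a name and an age.",
    "structured_output": {
        "format": "json",
        "schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
            },
            "required": ["name", "age"],
        },
    },
}

# The schema is plain JSON-serializable data, so it can live in a regular
# dataset column alongside the instruction.
serialized = json.dumps(row["structured_output"])
```

Because the schema travels with the row, a single task instance can serve a dataset that mixes several target structures.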
### New tasks for generating datasets for training embedding models

`sentence-transformers` v3 was recently released, and we couldn't resist the urge to add a few new tasks for creating datasets to train embedding models!
- New `GenerateSentencePair` task that generates a `positive` sentence for an input `anchor`, and optionally also a `negative` sentence. The task allows creating different kinds of data by specifying the `action` to perform with respect to the `anchor`: paraphrase, generate a semantically similar sentence, generate a query, or generate an answer.
- Implemented "Improving Text Embeddings with Large Language Models", adding the following tasks derived from the paper:
  - `EmbeddingTaskGenerator`, which allows generating new embedding-related tasks using an `LLM`.
  - `GenerateTextRetrievalData`, which allows creating text retrieval data with an `LLM`.
  - `GenerateShortTextMatchingData`, which allows creating short texts matching the input data.
  - `GenerateLongTextMatchingData`, which allows creating long texts matching the input data.
  - `GenerateTextClassificationData`, which allows creating text classification data from the input data.
  - `MonolingualTripletGenerator`, which allows creating monolingual triplets from the input data.
  - `BitextRetrievalGenerator`, which allows creating bitext retrieval data from the input data.
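As a rough illustration of the data the `GenerateSentencePair` task described above produces, the sketch below shows the shape of a generated row when a negative sentence is also requested, together with the four kinds of `action` the description mentions. The column names and action labels are inferred from the prose above and should be treated as illustrative, not as the task's exact output contract.

```python
# Illustrative shape of a row produced by `GenerateSentencePair` when a
# negative sentence is also generated (column names inferred from the
# feature description, not guaranteed to match the real output columns).
row = {
    "anchor": "The quick brown fox jumps over the lazy dog.",
    "positive": "A fast brown fox leaps above a sleepy dog.",
    "negative": "Stock markets closed slightly lower on Friday.",
}

# The four kinds of generation described for the task's `action` setting
# (labels are paraphrases of the prose, not verified parameter values):
actions = ["paraphrase", "semantically-similar", "query", "answer"]
```

Triplets of this shape (anchor, positive, negative) are the standard input for contrastive losses used when training embedding models with `sentence-transformers`.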
### New `Step`s for loading data from different sources and saving/loading `Distiset` to disk

We've added a few new steps that allow loading data from different sources:

- `LoadDataFromDisk` allows loading a `Distiset` or `datasets.Dataset` that was previously saved using the `save_to_disk` method.
- `LoadDataFromFileSystem` allows loading a `datasets.Dataset` from a file system.
Thanks to @rasdani for helping us test these new tasks!
In addition, we have added a `save_to_disk` method to `Distiset`, akin to `datasets.Dataset.save_to_disk`, that allows saving the generated distiset to disk along with the `pipeline.yaml` and `pipeline.log`.
`save_to_disk` example:

```python
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
    ...

if __name__ == "__main__":
    distiset = pipeline.run(...)
    distiset.save_to_disk(dataset_path="my-distiset")
```
### `MixtureOfAgentsLLM` implementation

We've added a new `LLM` called `MixtureOfAgentsLLM`, derived from the paper "Mixture-of-Agents Enhances Large Language Model Capabilities". This new `LLM` allows generating improved outputs thanks to the collective expertise of several `LLM`s.
`MixtureOfAgentsLLM` example:

```python
from distilabel.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM

llm = MixtureOfAgentsLLM(
    aggregator_llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
    proposers_llms=[
        InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        ),
        InferenceEndpointsLLM(
            model_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
            tokenizer_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
        ),
        InferenceEndpointsLLM(
            model_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
            tokenizer_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
        ),
    ],
    rounds=2,
)

llm.load()

output = llm.generate(
    inputs=[
        [
            {
                "role": "user",
                "content": "My favorite witty review of The Rings of Power series is this: Input:",
            }
        ]
    ]
)
```
### Saving cache and passing batches to `GlobalStep`s optimizations

- The cache logic of the `_BatchManager` has been improved to incrementally update the cache, making the process much faster.
- The data of the input batches of the `GlobalStep`s will be passed to the step using the file system, as this is faster than passing it using the queue. This is possible thanks to the new integration of `fsspec`, which can be configured to use a file system or cloud storage as a backend for passing the data of the batches.
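As a sketch of how the `fsspec` backend might be pointed at cloud storage, the snippet below builds a hypothetical storage configuration using an `fsspec`-style URL; the `storage_parameters` name and the `"path"` key are assumptions based on the description above, not a confirmed API surface.

```python
# Hypothetical configuration for the fsspec-backed batch passing described
# above: an fsspec-style URL selects the backend (local disk, S3, GCS, ...).
# The `storage_parameters` name and the "path" key are illustrative only.
storage_parameters = {"path": "gcs://my-bucket/pipeline-batches"}

# fsspec infers the backend implementation from the URL scheme:
scheme = storage_parameters["path"].split("://", 1)[0]
```

Leaving the path unset would fall back to a local file system, which is typically the right choice when all pipeline processes run on a single machine.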
### `BasePipeline` and `_BatchManager` refactor

The logic around `BasePipeline` and `_BatchManager` has been refactored, which will make it easier to implement new pipelines in the future.
### Added `ArenaHard` as an example of how to use `distilabel` to implement a benchmark

`distilabel` can be easily used to create an `LLM` benchmark. To showcase this, we decided to implement Arena Hard as an example: Benchmarking with `distilabel`: Arena Hard.
## 📚 Improved documentation structure

We have updated the documentation structure to make it clearer and more self-explanatory, as well as more visually appealing 😏.
## What's Changed

- Add `prometheus.md` by @alvarobartt in #656
- Reduce time required to execute `_cache` method by @gabrielmbmb in #672
- [DOCS] Update theme styles and images by @leiyre in #667
- Fix circular import due to `DISTILABEL_METADATA_KEY` by @plaguss in #675
- Add `CITATION.cff` by @alvarobartt in #677
- Deprecate conversation support in `TextGeneration` in favour of `ChatGeneration` by @alvarobartt in #676
- Add functionality to load/save distisets to/from disk by @plaguss in #673
- Integration instructor by @plaguss in #654
- Fix docs of saving/loading distiset from disk by @plaguss in #679
- Pass data of batches using file system by @gabrielmbmb in #678
- Add `python==3.12` by @gabrielmbmb in #615
- Add `codspeed` benchmarks by @gabrielmbmb in #674
- Add `StructuredGeneration` task and support for `grammar` in `InferenceEndpointsLLM` by @alvarobartt in #680
- Fix `InferenceEndpointsLLM` not using cached token by @gabrielmbmb in #690
- Add `GenerateSentencePair` task by @gabrielmbmb in #689
- Fix prepend batches by @gabrielmbmb in #696
- Fix `EvolQuality._apply_random_mutation` not properly injecting `response` in template by @davidberenstein1957 in #703
- [FEATURE] Include new `GeneratorStep` classes to load datasets from different formats by @plaguss in #691
- Add citation readme by @plaguss in #712
- Move navigation to top bar by @plaguss in #708
- Fix `install_dependencies.sh` by @alvarobartt in #713
- Add context to guide the generate sentence pair task if informed by @plaguss in #706
- Add examples to the LLMs to be shown in the components gallery by @plaguss in #714
- Gather `HF_TOKEN` internally when calling `Distiset.push_to_hub` if token is `None` by @plaguss in #707
- Implement "Improving Text Embeddings with LLMs" by @alvarobartt in #683
- Add `ArenaHard` benchmark and `ArenaHardResults` step by @alvarobartt in #670
- Refactor `Pipeline` and `BasePipeline` classes by @gabrielmbmb in #704
- Fix `AzureOpenAILLM` load method setting the correct path to mock the internal class by @plaguss in #725
- Components examples steps by @plaguss in #715
- Add examples for tasks in the components gallery by @plaguss in #724
- [FEATURE] Refactor of structured generation and use schemas defined in a dataset by @plaguss in #688
- Update docs document phrasing and funnel by @davidberenstein1957 in #718
- docs: 728 docs api reference tasktyping cannot be imported during doc build by @davidberenstein1957 in #729
- docs: 730 docs add an index to the guide overview by @davidberenstein1957 in #731
- Add `MixtureOfAgentsLLM` by @gabrielmbmb in #735
- Add `examples/arena_hard.py` and remove from `distilabel` core by @alvarobartt in #741
- Add serving LLM section in the docs by @gabrielmbmb in #742
- `distilabel` v1.2.0 by @alvarobartt in #659
## New Contributors

**Full Changelog**: 1.1.1...1.2.0