Releases: argilla-io/distilabel
1.1.0
Distilabel 1.1.0
Two new tasks implemented!
`Genstruct` task (#600)
You can now use the `Genstruct` task, as described in https://huggingface.co/NousResearch/Genstruct-7B, to generate synthetic instruction fine-tuning datasets from a raw document:
from distilabel.llms import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadDataFromDicts
from distilabel.steps.tasks import Genstruct

with Pipeline(name="harry-potter-genstruct") as pipeline:
    load_hub_dataset = LoadDataFromDicts(
        name="load_dataset",
        data=[
            {
                "title": "Harry Potter and the Sorcerer's Stone",
                "content": "An orphaned boy enrolls in a school of wizardry, where he learns the truth about himself, his family and the terrible evil that haunts the magical world.",
            },
            {
                "title": "Harry Potter and the Chamber of Secrets",
                "content": "Harry Potter lives his second year at Hogwarts with Ron and Hermione when a message on the wall announces that the legendary Chamber of Secrets has been opened. The trio soon realize that, to save the school, it will take a lot of courage.",
            },
        ],
    )

    task = Genstruct(
        name="task",
        llm=TransformersLLM(
            model="NousResearch/Genstruct-7B",
            torch_dtype="float16",
            chat_template="{{ messages[0]['content'] }}",
            device="cuda:0",
        ),
        num_generations=2,
        group_generations=False,
        output_mappings={"model_name": "model"},
    )
`PrometheusEval` task (#610)
A new `PrometheusEval` task, based on the recently published paper "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models":
from distilabel.llms import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset
from distilabel.steps.tasks import PrometheusEval

with Pipeline(name="prometheus") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        repo_id="HuggingFaceH4/instruction-dataset",
        split="test",
        output_mappings={"prompt": "instruction", "completion": "generation"},
    )

    task = PrometheusEval(
        name="task",
        llm=vLLM(
            model="prometheus-eval/prometheus-7b-v2.0",
            chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
        ),
        mode="absolute",
        rubric="factual-validity",
        reference=False,
        num_generations=1,
        group_generations=False,
    )

    load_dataset >> task
Connect the steps in the pipeline with `>>` (#490)
You can now connect your steps using Python's binary right-shift operator:
from distilabel.llms import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import CombineColumns, LoadHubDataset
from distilabel.steps.tasks import EvolInstruct

with Pipeline(name="Pipe name") as pipeline:
    load_hub_dataset = LoadHubDataset(name="load_dataset", batch_size=8)
    evol_instruction_complexity_1 = EvolInstruct(
        llm=OpenAILLM(model="gpt-3.5-turbo"),
    )
    evol_instruction_complexity_2 = EvolInstruct(
        llm=InferenceEndpointsLLM(model_id="mistralai/Mixtral-8x7B-Instruct-v0.1"),
    )
    combine_columns = CombineColumns(
        columns=["response"],
        output_columns=["candidates"],
    )

    # Fan out to both tasks, then merge their outputs.
    (
        load_hub_dataset
        >> [evol_instruction_complexity_1, evol_instruction_complexity_2]
        >> combine_columns
    )
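To make the mechanics more concrete, here is a minimal sketch of how a `>>` connection operator can be built on Python's `__rshift__` and `__rrshift__` hooks. The `Node` class and its `downstream` attribute are hypothetical names used only for illustration; this is not distilabel's implementation.

```python
from __future__ import annotations


class Node:
    """Hypothetical stand-in for a pipeline step (not distilabel's Step class)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.downstream: list[Node] = []

    def __rshift__(self, other: Node | list[Node]) -> Node | list[Node]:
        # `a >> b` connects a -> b; `a >> [b, c]` fans out to both.
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            self.downstream.append(target)
        # Returning the right-hand side lets the chain continue.
        return other

    def __rrshift__(self, other: list[Node]) -> Node:
        # `[b, c] >> d`: lists do not define `>>`, so Python falls back to the
        # right operand's reflected method and every list item connects to d.
        for source in other:
            source.downstream.append(self)
        return self


load, evol_1, evol_2, combine = (
    Node(n) for n in ("load", "evol_1", "evol_2", "combine")
)
load >> [evol_1, evol_2] >> combine

print([n.name for n in load.downstream])    # ['evol_1', 'evol_2']
print([n.name for n in evol_1.downstream])  # ['combine']
```

Returning the right-hand operand from `__rshift__` is what allows the chained form; the reflected method handles the case where the left-hand side is a plain list.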
Routing batch function (#595)
Thanks to the new `routing_batch_function`, each batch of an upstream step can be routed conditionally to a list of specific downstream steps. In addition, we have included a `sample_n_steps` routing batch function, which makes it easier to replicate the setup described in the original UltraFeedback paper:
import random

from distilabel.llms import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline, routing_batch_function
from distilabel.steps import CombineColumns, LoadHubDataset
from distilabel.steps.tasks import TextGeneration


@routing_batch_function()
def sample_two_steps(steps: list[str]) -> list[str]:
    return random.sample(steps, 2)


with Pipeline("pipe-name", description="My first pipe") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    tasks = []
    for llm in (
        OpenAILLM(model="gpt-4-0125-preview"),
        MistralLLM(model="mistral-large-2402"),
        VertexAILLM(model="gemini-1.0-pro"),
    ):
        tasks.append(
            TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
        )

    combine_generations = CombineColumns(
        name="combine_generations",
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    load_dataset >> sample_two_steps >> tasks >> combine_generations
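The routing idea itself can be illustrated without distilabel. The sketch below builds a router that picks `n` downstream step names for each batch; `sample_n_steps_sketch` and the step names are hypothetical, and the fixed seed exists only to make the example reproducible. It is not the library's `sample_n_steps`.

```python
import random


def sample_n_steps_sketch(n: int, seed: int = 0):
    """Hypothetical helper: build a router that picks `n` downstream steps per batch."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible

    def route(steps: list[str]) -> list[str]:
        # Sample without replacement, so a batch never goes to the same step twice.
        return rng.sample(steps, n)

    return route


downstream = ["text_generation_a", "text_generation_b", "text_generation_c"]
route = sample_n_steps_sketch(n=2)

# Each batch is dispatched to its own sample of two downstream steps.
for batch_id in range(3):
    print(batch_id, route(downstream))
```

Because the sample is drawn per batch, different batches can end up at different pairs of steps, which is the behaviour the UltraFeedback-style setup relies on.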
Generate structured outputs using `outlines` (#601)
You can generate outputs constrained to `JSON` or a `regex` using `TransformersLLM`, `LlamaCppLLM` or `vLLM` thanks to the integration with [outlines](https://github.com/outlines-dev/outlines):
from enum import Enum

from distilabel.llms import LlamaCppLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, StringConstraints, conint
from typing_extensions import Annotated


class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"


class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"
    mithril = "mithril"


class Character(BaseModel):
    name: Annotated[str, StringConstraints(max_length=30)]
    age: conint(gt=1, lt=3000)
    armor: Armor
    weapon: Weapon


with Pipeline("RPG-characters") as pipeline:
    system_prompt = (
        "You are a leading role play gamer. You have seen thousands of different characters and their attributes."
        " Please return a JSON object with common attributes of an RPG character."
    )

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": system_prompt,
                "instruction": f"Give me a character description for a {char}",
            }
            for char in ["dwarf", "elf", "human", "ork"]
        ],
    )

    text_generation = TextGeneration(
        name="text_generation_rpg",
        llm=LlamaCppLLM(
            model_path="model/path",  # type: ignore
            structured_output={"format": "json", "schema": Character},
        ),
    )

    load_dataset >> text_generation
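As a rough illustration of what the `Character` schema constrains, the stdlib-only sketch below checks a JSON string against the same rules by hand. `is_valid_character` and the sample strings are hypothetical and written only for this example; with structured generation the constraints are meant to be enforced while the model generates, so a check like this mainly shows what a conforming output looks like.

```python
import json

# Allowed values, copied from the Weapon and Armor enums above.
WEAPONS = {"sword", "axe", "mace", "spear", "bow", "crossbow"}
ARMORS = {"leather", "chainmail", "plate", "mithril"}


def is_valid_character(raw: str) -> bool:
    """Hypothetical helper: check a JSON string against the Character rules."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    name, age = data.get("name"), data.get("age")
    armor, weapon = data.get("armor"), data.get("weapon")
    return (
        isinstance(name, str) and len(name) <= 30   # max_length=30
        and type(age) is int and 1 < age < 3000     # conint(gt=1, lt=3000)
        and isinstance(armor, str) and armor in ARMORS
        and isinstance(weapon, str) and weapon in WEAPONS
    )


good = '{"name": "Gimli", "age": 139, "armor": "chainmail", "weapon": "axe"}'
bad = '{"name": "Gimli", "age": 1, "armor": "chainmail", "weapon": "laser"}'
print(is_valid_character(good), is_valid_character(bad))  # True False
```

The sketch ignores extra keys and is deliberately simpler than real schema validation.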
New `GroqLLM` (#583)
New integration with Groq. Special mention to @kcentric, who did the initial work prior to the refactor for 1.0.0:
from distilabel.llms.groq import GroqLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="text-generation-groq") as pipeline:
    ...
    text_generation_with_groq = TextGeneration(
        llm=GroqLLM(model="llama3-70b-8192"),
    )
    ...
Easily test your pipeline with a `dry_run` (#635)
with Pipeline(...) as pipeline:
    ...
    distiset = pipeline.dry_run(
        parameters=...,  # The same argument as `Pipeline.run`
        batch_size=1  # Optional, will be set to 1 by default.
    )
[05/13/24 16:22:30] INFO ['distilabel.pipeline.local'] 🌵 Dry run mode local.py:103
INFO ['distilabel.pipeline.local'] 📝 Pipeline data will be ... local.py:125
`pipeline.log` file is dumped to the Hugging Face repository (#568)
From now on, when you call `distiset.push_to_hub`, the `pipeline.log` file is automatically uploaded to your dataset repository alongside the `pipeline.yaml`, so you can keep track of the execution.
New `distilabel_metadata` column to store internal data (#586)
You can now optionally enable the addition of a metadata column. This column may store other things in the future, but for the moment it is handy for keeping the raw output from an LLM: if the task does some post-processing via `format_output`, the original output is kept so that nothing is lost.
You can include the metadata at the task level as:
TextGeneration(..., add_raw_output=True|False)
And directly determine whether you want this column in your final `Distiset`:
with Pipeline(..., enable_metadata=True|False):
    ...
This way you can drop the whole column if you do not need it.
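The idea behind the raw-output column can be sketched in plain Python. Everything below is hypothetical and written only for illustration: the toy `format_output`, the `process` function, and in particular the `raw_output` key name are assumptions, not distilabel's actual internals.

```python
def format_output(raw: str) -> dict:
    """Toy post-processing step: keep only the text after an 'Answer:' marker."""
    _, _, answer = raw.partition("Answer:")
    return {"generation": answer.strip()}


def process(raw: str, add_raw_output: bool = False) -> dict:
    """Format the LLM output and optionally keep the untouched original."""
    row = format_output(raw)
    if add_raw_output:
        # The nested key name is an assumption made for this sketch.
        row["distilabel_metadata"] = {"raw_output": raw}
    return row


raw = "Let me think step by step. Answer: 42"
print(process(raw))
print(process(raw, add_raw_output=True))
```

With the flag enabled, the text discarded by the post-processing step is still available in the metadata column for later inspection.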
All the changes in this PR
- Allow nested connect calls and overload rshift method to connect steps by @plaguss in #490
- Fix `llm_blender` installation by @alvarobartt in #557
- Warn user a...
1.0.3
What's Changed
- Add `stop` and `stop_sequences` in `LLM.generate` subclasses by @alvarobartt in #585
Full Changelog: 1.0.2...1.0.3
1.0.2
What's Changed
- Fix `RuntimeParameter` validation when provided as `_Step` attr by @alvarobartt in #564
- Add `seed` with `random.randint` to ensure cache is not used by @alvarobartt in #571
Full Changelog: 1.0.1...1.0.2
1.0.1
What's Changed
- Fix typo in readme and remove the ToArgilla step by @dvsrepo in #548
- Fix `model_validator` in `InferenceEndpoints` due to `Pipeline` pickling by @alvarobartt in #552
Full Changelog: 1.0.0...1.0.1
1.0.0
What's Changed
- Add `Step` abstract class and new `Pipeline` by @gabrielmbmb in #338
- Add runtime parameters validation by @gabrielmbmb in #345
- Pipeline local execution by @gabrielmbmb in #346
- Add `Task` (minimal implementation) by @alvarobartt in #347
- Refactor `_BatchManager` to have list of batches per step by @gabrielmbmb in #353
- Refactor getting parameters from `Step.process` method by @gabrielmbmb in #355
- Add `LLM`, `OpenAILLM`, `TransformersLLM`, and `LlamaCppLLM` by @alvarobartt in #354
- Fix `Task` and `TextGeneration` by @alvarobartt in #356
- Add `combine_dicts` function and `CombineColumns` class by @alvarobartt in #358
- Add `PushToHub` step and fix `typing` by @alvarobartt in #357
- Add serialization for the new components by @plaguss in #349
- Fix `OpenAILLM.api_key` due to `SecretStr` and `StepInput` wrong imports by @alvarobartt in #359
- Add `GlobalStep`, fix `_BatchManager`, and add `logging` by @alvarobartt in #362
- Migrate vllm to the new API by @plaguss in #361
- Update `_BatchManager` to work with `GlobalStep`s and `input_batch_size` per step by @gabrielmbmb in #366
- Clean up outdated / unused files by @alvarobartt in #369
- Add `input_mappings` and `output_mappings` attributes by @gabrielmbmb in #367
- Move batching from `Task` to `LLM`, fix `vLLM.generate` and add `DISTILABEL_LOG_LEVEL` by @alvarobartt in #371
- Improve runtime parameter definition by @gabrielmbmb in #372
- Add `AsyncOpenAI` and update `OpenAILLM` accordingly by @alvarobartt in #381
- Update serde by @gabrielmbmb in #382
- Add `MistralLLM` and add `generation_kwargs` as `RuntimeParameters` by @alvarobartt in #383
- Move `steps` out of `pipeline` by @gabrielmbmb in #384
- Add tests and docstring for `Task` and subclasses by @alvarobartt in #385
- Add `step` decorator by @gabrielmbmb in #387
- Add `input` propagation through `Task.process` by @alvarobartt in #399
- Improve `Pipeline` error handling by @gabrielmbmb in #400
- Fix `combine_dicts` and `StepInput` import in `PushToHub` by @alvarobartt in #401
- Improve `GlobalStep` error handling by @gabrielmbmb in #402
- Changed " by italics in EvolInstruct tutorial where one "" was missing by @ignacioct in #398
- Add `get_last_hidden_states` method and update `TransformersLLM` by @gabrielmbmb in #414
- docs: correct small typos in tutorial by @sdiazlor in #419
- docs: readme positioning by @davidberenstein1957 in #386
- Add `num_generations` and `group_generations` parameters to `Task` by @gabrielmbmb in #416
- Add `Argilla` and `PromptCompletionToArgilla` by @alvarobartt in #420
- Add `EvolInstruct` and `EvolInstructGenerator` tasks by @alvarobartt in #407
- Wrap optional `LLM` dependencies under `load` by @alvarobartt in #428
- Add `ComplexityScorer` task by @gabrielmbmb in #421
- Implement caching mechanism for the pipelines by @plaguss in #370
- Add method to Pipeline to handle keyboard interruptions via ctrl+c by @plaguss in #406
- Add `GenerateEmbeddings` task by @gabrielmbmb in #427
- Add `api_key` within `LLM.load` and add `llm_kwargs` as `RuntimeParameter` by @alvarobartt in #432
- Add `GeneratorStep.process` validation in `DAG` and smaller fixes by @alvarobartt in #435
- Add `EvolComplexity` task by @davidberenstein1957 in #415
- Add `QualityScorer` Task by @ignacioct in #425
- Add `CudaDevicePlacementMixin` class by @gabrielmbmb in #436
- Return `distiset` from `Pipeline.run` by @plaguss in #417
- Update README.md by @strickvl in #451
- Add `InferenceEndpointsLLM` by @alvarobartt in #439
- Fix `Distiset` after `PushToHub` and smaller fixes by @alvarobartt in #452
- Fix `Step.process_applying_mappings` by @alvarobartt in #453
- Add `AnyscaleLLM` by @davidberenstein1957 in #447
- Add general function to obtain schema for parquet writer by @plaguss in #454
- Add `TogetherLLM` by @davidberenstein1957 in #449
- Fix `LLM` subclasses based on `OpenAILLM` by @alvarobartt in #455
- Improve batching and caching by @gabrielmbmb in #457
- Add `EvolQuality` task by @davidberenstein1957 in #429
- Add `VertexAILLM` by @davidberenstein1957 in #445
- Add `use_cache` to `BasePipeline` by @plaguss in #463
- Add `AnthropicLLM` by @sdiazlor in #444
- Add `multiprocess` dependency by @gabrielmbmb in #467
- Add `UltraFeedback` by @alvarobartt in #464
- Add `OllamaLLM` by @davidberenstein1957 in #405
- Add `RuntimeParametersMixin` and `LLM` runtime parameters by @gabrielmbmb in #466
- Add `LiteLLM` by @davidberenstein1957 in #441
- Add CLI by @gabrielmbmb in #471
- Set `_batch_manager` to `None` after run by @gabrielmbmb in #473
- Add create_distiset function by @plaguss in #480
- Add `overload` to `step` decorator by @gabrielmbmb in #474
- Move Enum to Dict[str, str] to avoid serialization errors during caching by @plaguss in #482
- Include a dataset card and the `pipeline.yaml` on `Distiset.push_to_hub` by @plaguss in #479
- Add `PairRM` task for ranking responses by @plaguss in #450
- Update `_WriteBuffer` to write several parquet files by @gabrielmbmb in #483
- Extend `Argilla` integration `TextGeneration`, `Preference`, and more by @alvarobartt in #472
- Add `DeitaFiltering` step by @gabrielmbmb in #481
- Add `InstructionBacktranslation` by @alvarobartt in #486
- Fix huggingface_hub TextGenerationError import by @Wauplin in #485
- Improve azure openai support by @BramVanroy in #461
- Add `SelfInstruct` task by @ignacioct in #456
- Use `QueueHandler` for `Pipeline` logging by @gabrielmbmb in #489
- Improve `_stop` and `logging` by @gabrielmbmb in #491
- Fix creating empty `Dataset` in `create_distiset` function by @gabrielmbmb in #492
- Add imports from `__init__` modules by @gabrielmbmb in #493
- `batch_size` and `input_batch_size` runtime parameters by @gabrielmbmb in #495
- Update serialization method of _BatchManager to write each step on its own file by @plaguss in #496
- Fix `asyncio` in `AsyncLLM` to use the running event loop if any by @alvarobartt in #501
- Added authentication header to allow private/gated dataset use by @bjoernpl in https://github.com/argilla-io/distila...
0.6.0
What's Changed
- Fix typo in docstring of to_argilla metrics_ to metric_ by @burtenshaw in #334
- Implement a JSON responding OpenAI LLM as JSONOpenAILLM by @burtenshaw in #331
- Add examples for the deita paper tasks by @plaguss in #329
- Add checkpoint strategy to automatically push to hub by @plaguss in #321
- docs: update tutorials avoid argilla installation error by @sdiazlor in #337
- Fix `CustomDataset.load_from_disk` with `str`/`Path` objects by @plaguss in #341
- Clarify number of generations produced when using LLMPool in docs by @davanstrien in #339
- Refactor _build_dataset piece for speed by @plaguss in #344
- Fix documentation and type variables in `CustomDataset` checkpoint methods by @plaguss in #342
- US Spelling and other typo correction on Distilabel tutorials by @ignacioct in #324
- docs: add a tutorial for evolinstruct by @sdiazlor in #327
- Fix Openai api error with OpenAI-compatible providers by @jphme in #351
- Add fix for labels not returned by openai api by @plaguss in #364
- Refactor model availability check in is_serverless_endpoint_available by @davanstrien in #363
New Contributors
- @burtenshaw made their first contribution in #334
- @jphme made their first contribution in #351
Full Changelog: 0.5.0...0.6.0
0.5.0
What's Changed
- fix: Correct import error by @plaguss in #279
- fix: Filter examples for which len generations != len ratings by @plaguss in #284
- feat: Add sentence transformers support for the to argilla method by @davidberenstein1957 in #262
- feat: Add text descriptives support to the to argilla methods by @davidberenstein1957 in #271
- feat: Add `to_argilla` method to `EvolInstructTask` generated datasets by @plaguss in #291
- docs: Shorten titles tutorials and update core example by @davidberenstein1957 in #289
- feat: Add new serialization strategy by @plaguss in #288
- feat: Review `OllamaLLM` and `TogetherInferenceLLM` by @alvarobartt in #305
- refactor: Remove Metadata for Ratings by @ignacioct in #303
- docs: Add missing VertexAI information within `README.md` and `docs/index.md` by @alvarobartt in #308
- feat: Add functionality to push tasks to the HuggingFace hub and download them automatically. by @plaguss in #297
- feat: Add `ComplexityScorer` and `QualityScorer` tasks from Deita by @plaguss in #302
- fix: Fix logging visualization of labeller pipelines by @plaguss in #310
- feat: Add `Improving Text Embeddings with LLMs` tutorial by @alvarobartt in #313
- feat: Add `EvolComplexity` and `EvolQuality` by @davidberenstein1957 in #299
- feat: Add `validate_prompts` method to LLMs to help validating the prompts by @plaguss in #314
- fix: typo in clean an existing preference dataset by @sdiazlor in #312
- feat: Add new column for sft fine tuning with `prepare_dataset` by @plaguss in #309
- docs: Custom Task Documentation by @ignacioct in #275
- refactor: Align the `LLM` subclasses args by @alvarobartt in #315
- feat: Include rationale of the model responses on `prepare_dataset` if available by @plaguss in #317
- feat: Add embedding tutorial to docs by @ignacioct in #319
- feat: Add `MistralAILLM` by @plaguss in #293
- feat: Use `ollama` Python client within `OllamaLLM` by @sdiazlor in #307
Full Changelog: 0.4.0...0.5.0
0.4.0
What's Changed
- docs: Notus end2end example for preference and instruction generation by @ignacioct in #145
- docs: binders anchors by @ignacioct in #235
- feat: Add support for dedicated and serverless inference endpoints via inference API by @philschmid in #238
- docs: Update links to arxiv landing pages rather than PDFs by @davanstrien in #249
- feat: add ETA to progress bar and fix not showing the progress bar if irrelevant by @ignacioct in #253
- feat: Add Evol instruct task by @plaguss in #237
- docs: rename `enable_checkpoints` to `checkpoint_strategy` by @davidberenstein1957 in #257
- feat: Fixing progress bar and ETA by @ignacioct in #260
- fix: resolved error with self instruct to argilla method by @plaguss in #265
- chore: Add extra check in llmpool to ensure all the tasks share the same parent class by @plaguss in #266
- fix: fix for Notus tutorial after bug in record unwrap by @ignacioct in #267
- feat: add customizable criteria for query generation in SelfInstructTask by @ignacioct in #269
- docs: add a tutorial on "clean a DPO/preference dataset with distilabel" by @sdiazlor in #270
- feat: Add new functionality to binarize preference datasets directly from distilabel by @plaguss in #264
- feat: add support `ollama` api by @davidberenstein1957 in #250
New Contributors
- @philschmid made their first contribution in #238
- @davanstrien made their first contribution in #249
- @sdiazlor made their first contribution in #270
Full Changelog: 0.3.0...0.4.0
0.3.0
What's Changed
- Add `VertexAILLM` & `VertexAIEndpointLLM` classes by @gabrielmbmb in #204
- Add draft with social cards by @plaguss in #197
- Relax `LLMPool` check to match parent `Task` instead by @plaguss in #210
- Align `README.md` with `docs/` and minor fixes / improvements by @alvarobartt in #214
- Add `TogetherInferenceLLM` by @alvarobartt in #215
- Add checking valid `inputs` before calling `_generate` by @gabrielmbmb in #216
- Add `TogetherInferenceLLM` tests by @alvarobartt in #217
- Add Vertex AI `LLM`s documentation by @gabrielmbmb in #222
- Documentation review by @alvarobartt in #223
- Rename `for_text_quality` to `for_overall_quality` method in `UltraFeedbackTask` by @alvarobartt in #224
- Add Anyscale endpoints by @plaguss in #213
- Feature dataset checkpoint strategy by @plaguss in #194
- Fix `rating` parsing in `RatingToArgillaMixin.to_argilla_record` by @alvarobartt in #227
- Add badges to readme by @plaguss in #226
- Fix badges by @dvsrepo in #228
- Update `LICENSE` and add `LICENSE_HEADER` by @davidberenstein1957 in #221
Full Changelog: 0.2.1...0.3.0
0.2.1
What's Changed
- Fix `PrometheusTask` could not be imported by @gabrielmbmb in #190
- Fix `LLM.return_futures` by @gabrielmbmb in #192
- Remove learn section from docs until developed by @plaguss in #188
- Add markdown to fields by default by @plaguss in #189
- Fix `PrometheusTask` and `UltraCMTask` could not be chained with `TextGenerationTask` by @gabrielmbmb in #195
- Add missing `use_markdown` for every field by @plaguss in #196
- Add `to_argilla_{dataset,record}` for `CritiqueTask` by @gabrielmbmb in #198
- Update `generate_prompt` in `Task` subclasses to always return `Prompt` by @alvarobartt in #199
- Add `CritiqueTask` documentation by @alvarobartt in #200
- Fix `UltraCMTask` scoring range and align `argilla` imports by @alvarobartt in #201
Full Changelog: 0.2.0...0.2.1