Releases · argilla-io/distilabel

20 May 14:02

plaguss

1.1.0

690013a

1.1.0 Latest

Distilabel 1.1.0

Two new tasks implemented!

`Genstruct` task (#600)

You can now use the `Genstruct` task, as described in https://huggingface.co/NousResearch/Genstruct-7B, to generate synthetic instruction fine-tuning datasets from raw documents:

from distilabel.llms import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadDataFromDicts
from distilabel.steps.tasks import Genstruct

with Pipeline(name="harry-potter-genstruct") as pipeline:
    load_hub_dataset = LoadDataFromDicts(
        name="load_dataset",
        data=[
            {
                "title": "Harry Potter and the Sorcerer's Stone",
                "content": "An orphaned boy enrolls in a school of wizardry, where he learns the truth about himself, his family and the terrible evil that haunts the magical world.",
            },
            {
                "title": "Harry Potter and the Chamber of Secrets",
                "content": "Harry Potter lives his second year at Hogwarts with Ron and Hermione when a message on the wall announces that the legendary Chamber of Secrets has been opened. The trio soon realize that, to save the school, it will take a lot of courage.",
            },
        ],
    )

    task = Genstruct(
        name="task",
        llm=TransformersLLM(
            model="NousResearch/Genstruct-7B",
            torch_dtype="float16",
            chat_template="{{ messages[0]['content'] }}",
            device="cuda:0",
        ),
        num_generations=2,
        group_generations=False,
        output_mappings={"model_name": "model"},
    )

    # Connect the data loader to the Genstruct task
    load_hub_dataset >> task

`PrometheusEval` task (#610)

A new `PrometheusEval` task is available, based on the recently published paper "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models":

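from distilabel.llms import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset
from distilabel.steps.tasks import PrometheusEval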

with Pipeline(name="prometheus") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        repo_id="HuggingFaceH4/instruction-dataset",
        split="test",
        output_mappings={"prompt": "instruction", "completion": "generation"},
    )

    task = PrometheusEval(
        name="task",
        llm=vLLM(
            model="prometheus-eval/prometheus-7b-v2.0",
            chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
        ),
        mode="absolute",
        rubric="factual-validity",
        reference=False,
        num_generations=1,
        group_generations=False,
    )
    
    load_dataset >> task

Connect the steps in the pipeline with `>>` (#490)

You can now connect your steps using Python's right-shift operator (`>>`):

from distilabel.llms import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import CombineColumns, LoadHubDataset
from distilabel.steps.tasks import EvolInstruct

with Pipeline(name="Pipe name") as pipeline:
    load_hub_dataset = LoadHubDataset(name="load_dataset", batch_size=8)
    evol_instruction_complexity_1 = EvolInstruct(
        llm=OpenAILLM(model="gpt-3.5-turbo"),
    )
    evol_instruction_complexity_2 = EvolInstruct(
        llm=InferenceEndpointsLLM(model_id="mistralai/Mixtral-8x7B-Instruct-v0.1"),
    )

    combine_columns = CombineColumns(
        columns=["response"],
        output_columns=["candidates"],
    )

    (
        load_hub_dataset
        >> [evol_instruction_complexity_1, evol_instruction_complexity_2]
        >> combine_columns
    )
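
To make the mechanics concrete, here is a minimal, library-independent sketch of how a `>>` syntax like this can be built with Python's `__rshift__` and `__rrshift__` hooks. The `Step` class, its `connect` method and the `downstream` attribute are hypothetical names used only for illustration; this is not distilabel's actual implementation.

```python
from typing import List, Union


class Step:
    """Hypothetical stand-in for a pipeline step (not the distilabel class)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.downstream: List["Step"] = []

    def connect(self, other: "Step") -> None:
        # Record an edge from this step to `other`.
        self.downstream.append(other)

    def __rshift__(self, other: Union["Step", List["Step"]]):
        # `step >> other` or `step >> [a, b]`: connect to one or many steps.
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            self.connect(target)
        # Return the right-hand side so that chains keep flowing.
        return other

    def __rrshift__(self, other: List["Step"]) -> "Step":
        # `[a, b] >> step`: a plain list has no `__rshift__`, so Python
        # falls back to this method on the right-hand operand.
        for source in other:
            source.connect(self)
        return self


load = Step("load")
evol_1 = Step("evol_1")
evol_2 = Step("evol_2")
combine = Step("combine")

load >> [evol_1, evol_2] >> combine

fan_out = [step.name for step in load.downstream]
fan_in = [evol_1.downstream[0].name, evol_2.downstream[0].name]
```

Because `>>` is left-associative, `load >> [evol_1, evol_2]` is evaluated first and returns the list, which is then connected to `combine` through `__rrshift__`.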

Routing batch function (#595)

Thanks to the new `routing_batch_function`, each batch of an upstream step can be routed conditionally to a list of specific downstream steps. In addition, we have included a `sample_n_steps` routing batch function, which makes it easier to replicate the setup described in the original UltraFeedback paper:

import random
from distilabel.llms import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline, routing_batch_function
from distilabel.steps import CombineColumns, LoadHubDataset
from distilabel.steps.tasks import TextGeneration

@routing_batch_function()
def sample_two_steps(steps: list[str]) -> list[str]:
    return random.sample(steps, 2)

with Pipeline("pipe-name", description="My first pipe") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    tasks = []
    for llm in (
        OpenAILLM(model="gpt-4-0125-preview"),
        MistralLLM(model="mistral-large-2402"),
        VertexAILLM(model="gemini-1.0-pro"),
    ):
        tasks.append(
            TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
        )

    combine_generations = CombineColumns(
        name="combine_generations",
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    load_dataset >> sample_two_steps >> tasks >> combine_generations
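
The routing idea can be illustrated without the library. The sketch below assumes a routing function that receives the names of the downstream steps and returns the subset that should get the current batch; `make_sampler` and `dispatch` are hypothetical helpers written for this example, not distilabel APIs.

```python
import random
from typing import Callable, Dict, List

RoutingFn = Callable[[List[str]], List[str]]


def make_sampler(n: int, seed: int = 0) -> RoutingFn:
    """Build a routing function that picks `n` distinct downstream steps."""
    rng = random.Random(seed)  # seeded only to make the example repeatable

    def route(steps: List[str]) -> List[str]:
        return rng.sample(steps, n)

    return route


def dispatch(
    batches: List[str], steps: List[str], route: RoutingFn
) -> Dict[str, List[str]]:
    """Send every batch only to the steps chosen by the routing function."""
    received: Dict[str, List[str]] = {step: [] for step in steps}
    for batch in batches:
        for step in route(steps):
            received[step].append(batch)
    return received


steps = ["gpt-4", "mistral-large", "gemini-pro"]
batches = ["batch-0", "batch-1", "batch-2", "batch-3"]
received = dispatch(batches, steps, make_sampler(2))
```

Each batch ends up in exactly two of the three steps, so the work is spread across models instead of every model processing every batch.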

Generate structured outputs using `outlines` (#601)

You can generate JSON- or regex-constrained outputs using `TransformersLLM`, `LlamaCppLLM` or `vLLM` thanks to the integration with [outlines](https://github.com/outlines-dev/outlines):

from enum import Enum

from distilabel.llms import LlamaCppLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, StringConstraints, conint
from typing_extensions import Annotated

class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"

class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"
    mithril = "mithril"

class Character(BaseModel):
    name: Annotated[str, StringConstraints(max_length=30)]
    age: conint(gt=1, lt=3000)
    armor: Armor
    weapon: Weapon

with Pipeline("RPG-characters") as pipeline:
    system_prompt = (
        "You are a leading role play gamer. You have seen thousands of different characters and their attributes."
        " Please return a JSON object with common attributes of an RPG character."
    )

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": system_prompt,
                "instruction": f"Give me a character description for a {char}",
            }
            for char in ["dwarf", "elf", "human", "ork"]
        ],
    )

    text_generation = TextGeneration(
        name="text_generation_rpg",
        llm=LlamaCppLLM(
            model_path="model/path",  # type: ignore
            structured_output={"format": "json", "schema": Character},
        ),
    )
    load_dataset >> text_generation
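
As a rough illustration of what the `Character` schema above constrains, the following standard-library sketch checks a JSON string against the same rules (name of at most 30 characters, age strictly between 1 and 3000, armor and weapon taken from the enums). The `check_character` helper is hypothetical and only validates after the fact; it does not reproduce how outlines constrains generation.

```python
import json
from enum import Enum


class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"


class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"
    mithril = "mithril"


def check_character(raw: str) -> dict:
    """Parse a JSON string and raise ValueError if it breaks the schema."""
    data = json.loads(raw)
    name, age = data["name"], data["age"]
    if not isinstance(name, str) or len(name) > 30:
        raise ValueError("name must be a string of at most 30 characters")
    if not isinstance(age, int) or not 1 < age < 3000:
        raise ValueError("age must be an integer with 1 < age < 3000")
    # Enum lookup by value raises ValueError for unknown members.
    Armor(data["armor"])
    Weapon(data["weapon"])
    return data


valid = check_character(
    '{"name": "Gimli", "age": 139, "armor": "plate", "weapon": "axe"}'
)

try:
    check_character(
        '{"name": "Gimli", "age": 139, "armor": "plate", "weapon": "laser"}'
    )
    rejected = False
except ValueError:
    rejected = True
```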

New `GroqLLM` (#583)

New integration with Groq. Special mention to @kcentric, who did the initial work prior to the refactor for 1.0.0:

from distilabel.llms.groq import GroqLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="text-generation-groq") as pipeline:
    ...
    text_generation_with_groq = TextGeneration(
        llm=GroqLLM(model="llama3-70b-8192"),
    )
    ...

Easily test your pipeline doing a `dry_run` (#635)

with Pipeline(...) as pipeline:
    ...
    distiset = pipeline.dry_run(
        parameters=...,  # The same argument as `Pipeline.run`
        batch_size=1  # Optional, will be set to 1 by default.
    )

[05/13/24 16:22:30] INFO     ['distilabel.pipeline.local'] 🌵 Dry run mode    local.py:103
                    INFO     ['distilabel.pipeline.local'] 📝 Pipeline data will be ...    local.py:125
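
Conceptually, a dry run pushes a single small batch through every step so that wiring errors surface quickly. The sketch below is a simplified, library-independent illustration of that idea; the `dry_run` function and the two toy steps are hypothetical and are not the `Pipeline.dry_run` implementation.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
StepFn = Callable[[List[Row]], List[Row]]


def dry_run(rows: List[Row], steps: List[StepFn], batch_size: int = 1) -> List[Row]:
    """Run only the first `batch_size` rows through each step, in order."""
    batch = rows[:batch_size]
    for step in steps:
        batch = step(batch)
    return batch


def add_length(batch: List[Row]) -> List[Row]:
    # Toy step: annotate each row with the length of its instruction.
    return [{**row, "length": str(len(row["instruction"]))} for row in batch]


def shout(batch: List[Row]) -> List[Row]:
    # Toy step: upper-case the instruction.
    return [{**row, "instruction": row["instruction"].upper()} for row in batch]


rows = [{"instruction": "hi"}, {"instruction": "hello"}, {"instruction": "hey"}]
result = dry_run(rows, [add_length, shout])
```

Only the first row is processed here, which mirrors the default `batch_size` of 1 mentioned in the snippet above.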

`Pipeline.log` file is dumped to the Hugging Face repository (#568)

From now on, when you call `distiset.push_to_hub`, the `pipeline.log` file will be automatically uploaded to your dataset repository alongside the `pipeline.yaml`, to keep track of the execution.

New `distilabel_metadata` column to store internal data (#586)

You can now optionally enable an additional metadata column. This column may store other things in the future, but for the moment it is handy for keeping the raw output from an LLM: if the task does some post-processing via `format_output`, the original output is kept so that nothing is lost.

You can include the metadata at the task level as:

TextGeneration(..., add_raw_output=True|False)

And directly determine whether you want this column in your final Distiset:

with Pipeline(..., enable_metadata=True|False):
    ...

This way you can decide to drop the column altogether.
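
The value of keeping the raw output is easiest to see with a lossy post-processing step. In the sketch below, `format_output` and `process` are hypothetical functions, and the `raw_output_<task name>` key layout is an assumption made for illustration rather than a documented guarantee.

```python
import re
from typing import Any, Dict, Optional


def format_output(raw: str) -> Dict[str, Optional[int]]:
    """Hypothetical post-processing: keep only the numeric rating."""
    match = re.search(r"Rating:\s*(\d+)", raw)
    return {"rating": int(match.group(1)) if match else None}


def process(raw: str, task_name: str, add_raw_output: bool) -> Dict[str, Any]:
    """Format the LLM output and optionally keep the original text."""
    row: Dict[str, Any] = dict(format_output(raw))
    if add_raw_output:
        # Assumed key layout; the raw text survives even if parsing is lossy.
        row["distilabel_metadata"] = {f"raw_output_{task_name}": raw}
    return row


raw = "Rating: 4\nRationale: mostly accurate, one minor error."
with_metadata = process(raw, "text_generation", add_raw_output=True)
without_metadata = process(raw, "text_generation", add_raw_output=False)
```

With the flag enabled, the rationale that `format_output` discards is still recoverable from the metadata column.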

All the changes in this PR

Allow nested connect calls and overload rshift method to connect steps by @plaguss in #490
Fix llm_blender installation by @alvarobartt in #557
Warn user a...

Contributors

gabrielmbmb, alvarobartt, and 3 other contributors

25 Apr 12:48

gabrielmbmb

1.0.3

9f38b49

1.0.3

What's Changed

Add stop and stop_sequences in LLM.generate subclasses by @alvarobartt in #585

Full Changelog: 1.0.2...1.0.3

Contributors

alvarobartt

24 Apr 11:43

alvarobartt

1.0.2

712d32c

1.0.2

What's Changed

Fix RuntimeParamater validation when provided as _Step attr by @alvarobartt in #564
Add seed with random.randint to ensure cache is not used by @alvarobartt in #571

Full Changelog: 1.0.1...1.0.2

Contributors

alvarobartt

19 Apr 10:11

gabrielmbmb

1.0.1

b870f39

1.0.1

What's Changed

Fix typo in readme and remove the ToArgilla step by @dvsrepo in #548
Fix model_validator in InferenceEndpoints due to Pipeline pickling by @alvarobartt in #552

Full Changelog: 1.0.0...1.0.1

Contributors

dvsrepo and alvarobartt

17 Apr 07:42

gabrielmbmb

1.0.0

23c5fe5

1.0.0

What's Changed

Add Step abstract class and new Pipeline by @gabrielmbmb in #338
Add runtime parameters validation by @gabrielmbmb in #345
Pipeline local execution by @gabrielmbmb in #346
Add Task (minimal implementation) by @alvarobartt in #347
Refactor _BatchManager to have list of batches per step by @gabrielmbmb in #353
Refactor getting parameters from Step.process method by @gabrielmbmb in #355
Add LLM, OpenAILLM, TransformersLLM, and LlamaCppLLM by @alvarobartt in #354
Fix Task and TextGeneration by @alvarobartt in #356
Add combine_dicts function and CombineColumns class by @alvarobartt in #358
Add PushToHub step and fix typing by @alvarobartt in #357
Add serialization for the new components by @plaguss in #349
Fix OpenAILLM.api_key due to SecretStr and StepInput wrong imports by @alvarobartt in #359
Add GlobalStep, fix _BatchManager, and add logging by @alvarobartt in #362
Migrate vllm to the new API by @plaguss in #361
Update _BatchManager to work with GlobalSteps and input_batch_size per step by @gabrielmbmb in #366
Clean up outdated / unused files by @alvarobartt in #369
Add input_mappings and output_mappings attributes by @gabrielmbmb in #367
Move batching from Task to LLM, fix vLLM.generate and add DISTILABEL_LOG_LEVEL by @alvarobartt in #371
Improve runtime parameter definition by @gabrielmbmb in #372
Add AsyncOpenAI and update OpenAILLM accordingly by @alvarobartt in #381
Update serde by @gabrielmbmb in #382
Add MistralLLM and add generation_kwargs as RuntimeParameters by @alvarobartt in #383
Move steps out of pipeline by @gabrielmbmb in #384
Add tests and docstring for Task and subclasses by @alvarobartt in #385
Add step decorator by @gabrielmbmb in #387
Add input propagation through Task.process by @alvarobartt in #399
Improve Pipeline error handling by @gabrielmbmb in #400
Fix combine_dicts and StepInput import in PushToHub by @alvarobartt in #401
Improve GlobalStep error handling by @gabrielmbmb in #402
Changed " by italics in EvolInstruct tutorial where one "" was missing by @ignacioct in #398
Add get_last_hidden_states method and update TransformersLLM by @gabrielmbmb in #414
docs: correct small typos in tutorial by @sdiazlor in #419
docs: readme positioning by @davidberenstein1957 in #386
Add num_generations and group_generations parameters to Task by @gabrielmbmb in #416
Add Argilla and PromptCompletionToArgilla by @alvarobartt in #420
Add EvolInstruct and EvolInstructGenerator tasks by @alvarobartt in #407
Wrap optional LLM dependencies under load by @alvarobartt in #428
Add ComplexityScorer task by @gabrielmbmb in #421
Implement caching mechanism for the pipelines by @plaguss in #370
Add method to Pipeline to handle keyboard interruptions via ctrl+c by @plaguss in #406
Add GenerateEmbeddings task by @gabrielmbmb in #427
Add api_key within LLM.load and add llm_kwargs as RuntimeParameter by @alvarobartt in #432
Add GeneratorStep.process validation in DAG and smaller fixes by @alvarobartt in #435
Add EvolComplexity task by @davidberenstein1957 in #415
Add QualityScorer Task by @ignacioct in #425
Add CudaDevicePlacementMixin class by @gabrielmbmb in #436
Return distiset from Pipeline.run by @plaguss in #417
Update README.md by @strickvl in #451
Add InferenceEndpointsLLM by @alvarobartt in #439
Fix Distiset after PushToHub and smaller fixes by @alvarobartt in #452
Fix Step.process_applying_mappings by @alvarobartt in #453
Add AnyscaleLLM by @davidberenstein1957 in #447
Add general function to obtain schema for parquet writer by @plaguss in #454
Add TogetherLLM by @davidberenstein1957 in #449
Fix LLM subclasses based on OpenAILLM by @alvarobartt in #455
Improve batching and caching by @gabrielmbmb in #457
Add EvolQuality task by @davidberenstein1957 in #429
Add VertexAILLM by @davidberenstein1957 in #445
Add use_cache to BasePipeline by @plaguss in #463
Add AnthropicLLM by @sdiazlor in #444
Add multiprocess dependency by @gabrielmbmb in #467
Add UltraFeedback by @alvarobartt in #464
Add OllamaLLM by @davidberenstein1957 in #405
Add RuntimeParametersMixin and LLM runtime parameters by @gabrielmbmb in #466
Add LiteLLM by @davidberenstein1957 in #441
Add CLI by @gabrielmbmb in #471
Set _batch_manager to None after run by @gabrielmbmb in #473
Add create_distiset function by @plaguss in #480
Add overload to step decorator by @gabrielmbmb in #474
Move Enum to Dict[str, str] to avoid serialization errors during caching by @plaguss in #482
Include a dataset card and the pipeline.yaml on Distiset.push_to_hub by @plaguss in #479
Add PairRM task for ranking responses by @plaguss in #450
Update _WriteBuffer to write several parquet files by @gabrielmbmb in #483
Extend Argilla integration TextGeneration, Preference, and more by @alvarobartt in #472
Add DeitaFiltering step by @gabrielmbmb in #481
Add InstructionBacktranslation by @alvarobartt in #486
Fix huggingface_hub TextGenerationError import by @Wauplin in #485
Improve azure openai support by @BramVanroy in #461
Add SelfInstruct task by @ignacioct in #456
Use QueueHandler for Pipeline logging by @gabrielmbmb in #489
Improve _stop and logging by @gabrielmbmb in #491
Fix creating empty Dataset in create_distiset function by @gabrielmbmb in #492
Add imports from __init__ modules by @gabrielmbmb in #493
batch_size and input_batch_size runtime parameters by @gabrielmbmb in #495
Update serialization method of _BatchManager to write each step on its own file by @plaguss in #496
Fix asyncio in AsyncLLM to use the running event loop if any by @alvarobartt in #501
Added authentication header to allow private/gated dataset use by @bjoernpl in https://github.com/argilla-io/distila...

Contributors

BramVanroy, strickvl, and 10 other contributors

01 Mar 17:57

gabrielmbmb

0.6.0

fce6c2d

0.6.0

What's Changed

Fix typo in docstring of to_argilla metrics_ to metric_ by @burtenshaw in #334
Implement a JSON responding OpenAI LLM as JSONOpenAILLM by @burtenshaw in #331
Add examples for the deita paper tasks by @plaguss in #329
Add checkpoint strategy to automatically push to hub by @plaguss in #321
docs: update tutorials avoid argilla installation error by @sdiazlor in #337
Fix CustomDataset.load_from_disk with str/Path objects by @plaguss in #341
Clalrify number of generations produced when using LLMPool in docs by @davanstrien in #339
Refactor _build_dataset piece for speed by @plaguss in #344
Fix documentation and type variables in CustomDataset checkpoint methods by @plaguss in #342
US Spelling and other typo correction on Distilabel tutorials by @ignacioct in #324
docs: add a tutorial for evolinstruct by @sdiazlor in #327
Fix Openai api error with OpenAI-compatible providers by @jphme in #351
Add fix for labels not returned by openai api by @plaguss in #364
Refactor model availability check in is_serverless_endpoint_available by @davanstrien in #363

New Contributors

@burtenshaw made their first contribution in #334
@jphme made their first contribution in #351

Full Changelog: 0.5.0...0.6.0

Contributors

jphme, davanstrien, and 4 other contributors

02 Feb 16:21

plaguss

0.5.0

8ccf116

0.5.0

What's Changed

fix: Correct import error by @plaguss in #279
fix: Filter examples for which len generations != len ratings by @plaguss in #284
feat: Add sentence transformers support for the to argilla method by @davidberenstein1957 in #262
feat: Add text descriptives support to the to argilla methods by @davidberenstein1957 in #271
feat: Add to_argilla method to EvolInstructTask generated datasets by @plaguss in #291
docs: Shorten titles tutorials and update core example by @davidberenstein1957 in #289
feat: Add new serialization strategy by @plaguss in #288
feat: Review OllamaLLM and TogetherInferenceLLM by @alvarobartt in #305
refactor: Remove Metadata for Ratings by @ignacioct in #303
docs: Add missing VertexAI information within README.md and docs/index.md by @alvarobartt in #308
feat: Add functionality to push tasks to the HuggingFace hub and download them automatically. by @plaguss in #297
feat: Add ComplexityScorer and QualityScorer tasks from Deita by @plaguss in #302
fix: Fix logging visualization of labeller pipelines by @plaguss in #310
feat: Add Improving Text Embeddings with LLMs tutorial by @alvarobartt in #313
feat: Add EvolComplexity and EvolQuality by @davidberenstein1957 in #299
feat: Add validate_prompts method to LLMs to help validating the prompts by @plaguss in #314
fix: typo in clean an existing preference dataset by @sdiazlor in #312
feat: Add new column for sft fine tuning with prepare_dataset by @plaguss in #309
docs: Custom Task Documentation by @ignacioct in #275
refactor: Align the LLM subclasses args by @alvarobartt in #315
feat: Include rationale of the model responses on prepare_dataset if available by @plaguss in #317
feat: Add embedding tutorial to docs by @ignacioct in #319
feat: Add MistralAILLM by @plaguss in #293
feat: Use ollama Python client within OllamaLLM by @sdiazlor in #307

Full Changelog: 0.4.0...0.5.0

Contributors

davidberenstein1957, alvarobartt, and 3 other contributors

19 Jan 15:20

davidberenstein1957

0.4.0

2abe11a

0.4.0

What's Changed

docs: Notus end2end example for preference and instruction generation by @ignacioct in #145
docs: binders anchors by @ignacioct in #235
feat: Add support for dedicated and serverless inference endpoints via inference API by @philschmid in #238
docs: Update links to arxiv landing pages rather than PDFs by @davanstrien in #249
feat: add ETA to progress bar and fix not showing the progress bar if irrelavant by @ignacioct in #253
feat: Add Evol instruct task by @plaguss in #237
docs: rename enable_checkpoints to checkpoint_strategy by @davidberenstein1957 in #257
feat: Fixing progress bar and ETA by @ignacioct in #260
fix: resolved error with self instruct to argilla method by @plaguss in #265
chore: Add extra check in llmpool to ensure all the tasks share the same parent class by @plaguss in #266
fix: fix for Notus tutorial after bug in record unwrap by @ignacioct in #267
feat: add customizable criteria for query generation in SelfInstructTask by @ignacioct in #269
docs: add a tutorial on "clean a DPO/preference dataset with distilabel" by @sdiazlor in #270
feat: Add new functionality to binarize preference datasets directly from distilabel by @plaguss in #264
feat: add support ollama api by @davidberenstein1957 in #250

New Contributors

@philschmid made their first contribution in #238
@davanstrien made their first contribution in #249
@sdiazlor made their first contribution in #270

Full Changelog: 0.3.0...0.4.0

Contributors

davanstrien, davidberenstein1957, and 4 other contributors

09 Jan 15:34

alvarobartt

0.3.0

ba3891a

0.3.0

What's Changed

Add VertexAILLM & VertexAIEndpointLLM classes by @gabrielmbmb in #204
Add draft with social cards by @plaguss in #197
Relax LLMPool check to match parent Task instead by @plaguss in #210
Align README.md with docs/ and minor fixes / improvements by @alvarobartt in #214
Add TogetherInferenceLLM by @alvarobartt in #215
Add checking valid inputs before calling _generate by @gabrielmbmb in #216
Add TogetherInferenceLLM tests by @alvarobartt in #217
Add Vertex AI LLMs documentation by @gabrielmbmb in #222
Documentation review by @alvarobartt in #223
Rename for_text_quality to for_overall_quality method in UltraFeedbackTask by @alvarobartt in #224
Add Anyscale endpoints by @plaguss in #213
Feature dataset checkpoint strategy by @plaguss in #194
Fix rating parsing in RatingToArgillaMixin.to_argilla_record by @alvarobartt in #227
Add badges to readme by @plaguss in #226
Fix badges by @dvsrepo in #228
Update LICENSE and add LICENSE_HEADER by @davidberenstein1957 in #221

Full Changelog: 0.2.1...0.3.0

Contributors

dvsrepo, davidberenstein1957, and 3 other contributors

27 Dec 13:06

alvarobartt

0.2.1

9835760

0.2.1

What's Changed

Fix PrometheusTask could not be imported by @gabrielmbmb in #190
Fix LLM.return_futures by @gabrielmbmb in #192
Remove learn section from docs until developed by @plaguss in #188
Add markdown to fields by default by @plaguss in #189
Fix PrometheusTask and UltraCMTask could not be chained with TextGenerationTask by @gabrielmbmb in #195
Add missing use_markdown for every field by @plaguss in #196
Add to_argilla_{dataset,record} for CritiqueTask by @gabrielmbmb in #198
Update generate_prompt in Task subclasses to always return Prompt by @alvarobartt in #199
Add CritiqueTask documentation by @alvarobartt in #200
Fix UltraCMTask scoring range and align argilla imports by @alvarobartt in #201

Full Changelog: 0.2.0...0.2.1

Contributors

gabrielmbmb, alvarobartt, and plaguss
