Releases: argilla-io/distilabel

1.1.0

20 May 14:02
690013a

Distilabel 1.1.0

Two new tasks implemented!

Genstruct task (#600)

You can now use the Genstruct task, as described in https://huggingface.co/NousResearch/Genstruct-7B, to generate synthetic instruction fine-tuning datasets from raw documents:

from distilabel.llms import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadDataFromDicts
from distilabel.steps.tasks import Genstruct

with Pipeline(name="harry-potter-genstruct") as pipeline:
    load_hub_dataset = LoadDataFromDicts(
        name="load_dataset",
        data=[
            {
                "title": "Harry Potter and the Sorcerer's Stone",
                "content": "An orphaned boy enrolls in a school of wizardry, where he learns the truth about himself, his family and the terrible evil that haunts the magical world.",
            },
            {
                "title": "Harry Potter and the Chamber of Secrets",
                "content": "Harry Potter lives his second year at Hogwarts with Ron and Hermione when a message on the wall announces that the legendary Chamber of Secrets has been opened. The trio soon realize that, to save the school, it will take a lot of courage.",
            },
        ],
    )

    task = Genstruct(
        name="task",
        llm=TransformersLLM(
            model="NousResearch/Genstruct-7B",
            torch_dtype="float16",
            chat_template="{{ messages[0]['content'] }}",
            device="cuda:0",
        ),
        num_generations=2,
        group_generations=False,
        output_mappings={"model_name": "model"},
    )

    # Connect the loader to the task so the documents reach Genstruct.
    load_hub_dataset >> task

PrometheusEval task (#610)

A new PrometheusEval task, based on the recently published paper "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models":

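from distilabel.llms import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset
from distilabel.steps.tasks import PrometheusEval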

with Pipeline(name="prometheus") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        repo_id="HuggingFaceH4/instruction-dataset",
        split="test",
        output_mappings={"prompt": "instruction", "completion": "generation"},
    )

    task = PrometheusEval(
        name="task",
        llm=vLLM(
            model="prometheus-eval/prometheus-7b-v2.0",
            chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
        ),
        mode="absolute",
        rubric="factual-validity",
        reference=False,
        num_generations=1,
        group_generations=False,
    )
    
    load_dataset >> task

Connect the steps in the pipeline with >> (#490)

You can now connect your steps using Python's right-shift operator (>>):

from distilabel.llms import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import CombineColumns, LoadHubDataset
from distilabel.steps.tasks import EvolInstruct

with Pipeline(name="Pipe name") as pipeline:
    load_hub_dataset = LoadHubDataset(name="load_dataset", batch_size=8)
    evol_instruction_complexity_1 = EvolInstruct(
        llm=OpenAILLM(model="gpt-3.5-turbo"),
    )
    evol_instruction_complexity_2 = EvolInstruct(
        llm=InferenceEndpointsLLM(model_id="mistralai/Mixtral-8x7B-Instruct-v0.1"),
    )

    combine_columns = CombineColumns(
        columns=["response"],
        output_columns=["candidates"],
    )

    (
        load_hub_dataset 
        >> [evol_instruction_complexity_1, evol_instruction_complexity_2] 
        >> combine_columns
    )
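
To see why a chain such as step >> [step, step] >> step can work at all, here is a minimal, library-free sketch of operator overloading. It is not distilabel's implementation: the Step class, its downstream attribute and the connect method below are hypothetical names used only for illustration.

```python
class Step:
    """Toy stand-in for a pipeline step (not distilabel's Step class)."""

    def __init__(self, name):
        self.name = name
        self.downstream = []  # names of the steps fed by this step

    def connect(self, other):
        self.downstream.append(other.name)

    def __rshift__(self, other):
        # Handles `step >> step` and `step >> [step, step]`.
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            self.connect(target)
        return other  # returning the right-hand side allows chaining

    def __rrshift__(self, others):
        # Handles `[step, step] >> step`: list does not implement >>,
        # so Python falls back to the right operand's reflected method.
        for source in others:
            source.connect(self)
        return self


load = Step("load")
evol_1 = Step("evol_1")
evol_2 = Step("evol_2")
combine = Step("combine")

load >> [evol_1, evol_2] >> combine
```

The key design point is that each operator call returns its right-hand operand, so the next >> in the chain starts from the steps that were just connected.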

Routing batch function (#595)

Thanks to the new routing_batch_function, each batch of an upstream step can be routed conditionally to a list of specific downstream steps. In addition, we have included a sample_n_steps routing batch function, which makes it easier to replicate the pipeline defined in the original UltraFeedback paper:

import random
from distilabel.llms import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline, routing_batch_function
from distilabel.steps import CombineColumns, LoadHubDataset
from distilabel.steps.tasks import TextGeneration

@routing_batch_function()
def sample_two_steps(steps: list[str]) -> list[str]:
    return random.sample(steps, 2)

with Pipeline("pipe-name", description="My first pipe") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    tasks = []
    for llm in (
        OpenAILLM(model="gpt-4-0125-preview"),
        MistralLLM(model="mistral-large-2402"),
        VertexAILLM(model="gemini-1.0-pro"),
    ):
        tasks.append(
            TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
        )

    combine_generations = CombineColumns(
        name="combine_generations",
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    load_dataset >> sample_two_steps >> tasks >> combine_generations
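
The effect of a sampling routing function can be sketched without distilabel. In the snippet below, every batch is sent to two of the three downstream steps, chosen at random. The helper sample_n, the step names and the batch ids are hypothetical, and the fixed seed is there only to make the sketch repeatable.

```python
import random


def sample_n(steps, n, rng):
    """Pick n distinct downstream step names for a single batch."""
    return rng.sample(steps, n)


# Hypothetical downstream step names.
downstream = ["text_generation_a", "text_generation_b", "text_generation_c"]
rng = random.Random(0)  # seeded generator, for repeatability only

# Decide independently for each batch which steps will receive it.
routes = {batch_id: sample_n(downstream, 2, rng) for batch_id in range(4)}

for batch_id, targets in routes.items():
    print(f"batch {batch_id} -> {targets}")
```

Because the choice is made per batch, different batches generally reach different subsets of models, while each batch still receives exactly two generations.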

Generate structured outputs using outlines (#601)

You can generate JSON or regex-constrained outputs using TransformersLLM, LlamaCppLLM or vLLM thanks to the integration with [outlines](https://github.com/outlines-dev/outlines):

from enum import Enum

from distilabel.llms import LlamaCppLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, StringConstraints, conint
from typing_extensions import Annotated

class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"

class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"
    mithril = "mithril"

class Character(BaseModel):
    name: Annotated[str, StringConstraints(max_length=30)]
    age: conint(gt=1, lt=3000)
    armor: Armor
    weapon: Weapon

with Pipeline("RPG-characters") as pipeline:
    system_prompt = (
        "You are a leading role play gamer. You have seen thousands of different characters and their attributes."
        " Please return a JSON object with common attributes of an RPG character."
    )

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": system_prompt,
                "instruction": f"Give me a character description for a {char}",
            }
            for char in ["dwarf", "elf", "human", "ork"]
        ],
    )

    text_generation = TextGeneration(
        name="text_generation_rpg",
        llm=LlamaCppLLM(
            model_path="model/path",  # type: ignore
            structured_output={"format": "json", "schema": Character},
        ),
    )
    load_dataset >> text_generation
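
Downstream of the pipeline it can still be useful to check that a generated string parses and respects the same constraints as the Character schema. The following is a standard-library-only sketch, independent of distilabel, outlines and pydantic. The function validate_character and the sample string are hypothetical.

```python
import json
from enum import Enum


class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"


class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"
    mithril = "mithril"


def validate_character(raw):
    """Parse a JSON string and enforce the Character constraints by hand."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    name, age = data["name"], data["age"]
    if not isinstance(name, str) or len(name) > 30:
        raise ValueError("name must be a string of at most 30 characters")
    if isinstance(age, bool) or not isinstance(age, int) or not 1 < age < 3000:
        raise ValueError("age must be an integer with 1 < age < 3000")
    # Enum lookup by value raises ValueError for unknown members.
    return {
        "name": name,
        "age": age,
        "armor": Armor(data["armor"]),
        "weapon": Weapon(data["weapon"]),
    }


# Hypothetical model output, written by hand for this sketch.
sample = '{"name": "Gimli", "age": 139, "armor": "chainmail", "weapon": "axe"}'
character = validate_character(sample)
```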

New GroqLLM (#583)

New integration with Groq. Special mention to @kcentric, who did the initial work prior to the refactor for 1.0.0:

from distilabel.llms.groq import GroqLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="text-generation-groq") as pipeline:
    ...
    text_generation_with_groq = TextGeneration(
        llm=GroqLLM(model="llama3-70b-8192"),
    )
    ...

Easily test your pipeline doing a dry_run (#635)

with Pipeline(...) as pipeline:
    ...
    distiset = pipeline.dry_run(
        parameters=...,  # The same argument as `Pipeline.run`
        batch_size=1  # Optional, will be set to 1 by default.
    )
[05/13/24 16:22:30] INFO     ['distilabel.pipeline.local'] 🌵  Dry run mode    local.py:103
                    INFO     ['distilabel.pipeline.local'] 📝 Pipeline data will be ...    local.py:125

Pipeline.log file is dumped to the Hugging Face repository (#568)

From now on, when you call distiset.push_to_hub, the pipeline.log file is automatically uploaded to your dataset repository alongside pipeline.yaml to keep track of the execution.

New distilabel_metadata column to store internal data (#586)

You can now optionally enable the addition of a metadata column. This column may store other things in the future, but for now it is handy for keeping the raw output from an LLM: if the task post-processes that output via format_output, the original output is kept so nothing is lost.

You can include the metadata at the task level as:

TextGeneration(..., add_raw_output=True|False)

And directly determine whether you want this column in your final Distiset:

with Pipeline(..., enable_metadata=True|False):
    ...

This way you can decide to remove the column altogether.
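
The idea of keeping the raw output next to the post-processed one can be sketched in plain Python. This is not distilabel's code: format_output here is a hypothetical parser, build_row is a hypothetical helper, and the "raw_output" key inside the metadata column is an assumed name.

```python
import json


def format_output(raw_output):
    """Hypothetical post-processing: extract "answer" from a JSON string."""
    try:
        return json.loads(raw_output)["answer"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # parsing failed, the formatted value is lost


def build_row(instruction, raw_output, add_raw_output=True):
    """Build one dataset row, optionally preserving the unparsed output."""
    row = {"instruction": instruction, "generation": format_output(raw_output)}
    if add_raw_output:
        # Assumed key name; the point is that the original text survives.
        row["distilabel_metadata"] = {"raw_output": raw_output}
    return row


good = build_row("What is 2 + 2?", '{"answer": "4"}')
bad = build_row("What is 2 + 2?", "The answer is 4.")
plain = build_row("What is 2 + 2?", "The answer is 4.", add_raw_output=False)
```

In the bad row the parser returns None, yet the original text is still available in the metadata column. In the plain row that text is gone for good.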

All the changes in this release

  • Allow nested connect calls and overload rshift method to connect steps by @plaguss in #490
  • Fix llm_blender installation by @alvarobartt in #557
  • Warn user a...

1.0.3

25 Apr 12:48
9f38b49

What's Changed

  • Add stop and stop_sequences in LLM.generate subclasses by @alvarobartt in #585

Full Changelog: 1.0.2...1.0.3

1.0.2

24 Apr 11:43
712d32c

What's Changed

  • Fix RuntimeParameter validation when provided as _Step attr by @alvarobartt in #564
  • Add seed with random.randint to ensure cache is not used by @alvarobartt in #571

Full Changelog: 1.0.1...1.0.2

1.0.1

19 Apr 10:11
b870f39

What's Changed

  • Fix typo in readme and remove the ToArgilla step by @dvsrepo in #548
  • Fix model_validator in InferenceEndpoints due to Pipeline pickling by @alvarobartt in #552

Full Changelog: 1.0.0...1.0.1

1.0.0

17 Apr 07:42
23c5fe5

What's Changed


0.6.0

01 Mar 17:57
fce6c2d

What's Changed

  • Fix typo in docstring of to_argilla metrics_ to metric_ by @burtenshaw in #334
  • Implement a JSON responding OpenAI LLM as JSONOpenAILLM by @burtenshaw in #331
  • Add examples for the deita paper tasks by @plaguss in #329
  • Add checkpoint strategy to automatically push to hub by @plaguss in #321
  • docs: update tutorials avoid argilla installation error by @sdiazlor in #337
  • Fix CustomDataset.load_from_disk with str/Path objects by @plaguss in #341
  • Clarify number of generations produced when using LLMPool in docs by @davanstrien in #339
  • Refactor _build_dataset piece for speed by @plaguss in #344
  • Fix documentation and type variables in CustomDataset checkpoint methods by @plaguss in #342
  • US Spelling and other typo correction on Distilabel tutorials by @ignacioct in #324
  • docs: add a tutorial for evolinstruct by @sdiazlor in #327
  • Fix Openai api error with OpenAI-compatible providers by @jphme in #351
  • Add fix for labels not returned by openai api by @plaguss in #364
  • Refactor model availability check in is_serverless_endpoint_available by @davanstrien in #363

New Contributors

Full Changelog: 0.5.0...0.6.0

0.5.0

02 Feb 16:21
8ccf116

What's Changed

  • fix: Correct import error by @plaguss in #279
  • fix: Filter examples for which len generations != len ratings by @plaguss in #284
  • feat: Add sentence transformers support for the to argilla method by @davidberenstein1957 in #262
  • feat: Add text descriptives support to the to argilla methods by @davidberenstein1957 in #271
  • feat: Add to_argilla method to EvolInstructTask generated datasets by @plaguss in #291
  • docs: Shorten titles tutorials and update core example by @davidberenstein1957 in #289
  • feat: Add new serialization strategy by @plaguss in #288
  • feat: Review OllamaLLM and TogetherInferenceLLM by @alvarobartt in #305
  • refactor: Remove Metadata for Ratings by @ignacioct in #303
  • docs: Add missing VertexAI information within README.md and docs/index.md by @alvarobartt in #308
  • feat: Add functionality to push tasks to the HuggingFace hub and download them automatically. by @plaguss in #297
  • feat: Add ComplexityScorer and QualityScorer tasks from Deita by @plaguss in #302
  • fix: Fix logging visualization of labeller pipelines by @plaguss in #310
  • feat: Add Improving Text Embeddings with LLMs tutorial by @alvarobartt in #313
  • feat: Add EvolComplexity and EvolQuality by @davidberenstein1957 in #299
  • feat: Add validate_prompts method to LLMs to help validating the prompts by @plaguss in #314
  • fix: typo in clean an existing preference dataset by @sdiazlor in #312
  • feat: Add new column for sft fine tuning with prepare_dataset by @plaguss in #309
  • docs: Custom Task Documentation by @ignacioct in #275
  • refactor: Align the LLM subclasses args by @alvarobartt in #315
  • feat: Include rationale of the model responses on prepare_dataset if available by @plaguss in #317
  • feat: Add embedding tutorial to docs by @ignacioct in #319
  • feat: Add MistralAILLM by @plaguss in #293
  • feat: Use ollama Python client within OllamaLLM by @sdiazlor in #307

Full Changelog: 0.4.0...0.5.0

0.4.0

19 Jan 15:20
2abe11a

What's Changed

  • docs: Notus end2end example for preference and instruction generation by @ignacioct in #145
  • docs: binders anchors by @ignacioct in #235
  • feat: Add support for dedicated and serverless inference endpoints via inference API by @philschmid in #238
  • docs: Update links to arxiv landing pages rather than PDFs by @davanstrien in #249
  • feat: add ETA to progress bar and fix not showing the progress bar if irrelevant by @ignacioct in #253
  • feat: Add Evol instruct task by @plaguss in #237
  • docs: rename enable_checkpoints to checkpoint_strategy by @davidberenstein1957 in #257
  • feat: Fixing progress bar and ETA by @ignacioct in #260
  • fix: resolved error with self instruct to argilla method by @plaguss in #265
  • chore: Add extra check in llmpool to ensure all the tasks share the same parent class by @plaguss in #266
  • fix: fix for Notus tutorial after bug in record unwrap by @ignacioct in #267
  • feat: add customizable criteria for query generation in SelfInstructTask by @ignacioct in #269
  • docs: add a tutorial on "clean a DPO/preference dataset with distilabel" by @sdiazlor in #270
  • feat: Add new functionality to binarize preference datasets directly from distilabel by @plaguss in #264
  • feat: add support ollama api by @davidberenstein1957 in #250

New Contributors

Full Changelog: 0.3.0...0.4.0

0.3.0

09 Jan 15:34
ba3891a

What's Changed

Full Changelog: 0.2.1...0.3.0

0.2.1

27 Dec 13:06
9835760

What's Changed

Full Changelog: 0.2.0...0.2.1