docs: add tutorial argilla haystack integration #4597

Draft
wants to merge 5 commits into
base: main
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -19,7 +19,7 @@ These are the section headers that we use:
## [1.24.0](https://github.com/argilla-io/argilla/compare/v1.23.0...v1.24.0)

>[!NOTE]
> This release does not contain any new features, but it includes a major change in the `argilla-server` dependency.
> This release does not contain any new features, but it includes a major change in the `argilla-server` dependency.
> The package is using the `argilla-server` dependency defined [here](https://github.com/argilla-io/argilla-server). ([#4537](https://github.com/argilla-io/argilla/pull/4537))

### Changed
@@ -10,7 +10,7 @@ In this guide, you'll learn to deploy your own Argilla app and use it for data l

## Your first Argilla Space

In this section, you'll learn to deploy an Argilla Space and use it for data annotation and training a sentiment classifier with [SetFit](https://github.com/huggingface/setfit/), an amazing few-shot learning library.
In this section, you'll learn to deploy an Argilla Space and use it for human feedback collection.

### Deploy Argilla on Spaces

@@ -57,84 +57,66 @@ Once Argilla is running, you can use the UI with the Direct URL. This URL gives

If everything goes well, you are ready to use the Argilla Python client from an IDE such as Colab, Jupyter, or VS Code.

If you want a quick step-by-step example, keep reading. If you want an end-to-end tutorial, go to this [tutorial and use Colab or Jupyter](https://docs.argilla.io/en/latest/tutorials/notebooks/training-textclassification-setfit-fewshot.html).
If you want a quick step-by-step example, keep reading. If you prefer an end-to-end tutorial, go to this [tutorial and use Colab or Jupyter](/getting_started/quickstart_workflow_feedback.ipynb).

First, we need to pip install `datasets` and `argilla` on Colab or your local machine:
First, we need to pip install `argilla` on Colab or your local machine:

```bash
pip install datasets argilla
pip install argilla -U
```

Then, you can read the example dataset using the `datasets` library. This dataset is a CSV file uploaded to the Hub using the drag-and-drop feature.

```python
from datasets import load_dataset

dataset = load_dataset("dvilasuero/banking_app", split="train").shuffle()
```

You can create your first dataset by logging it into Argilla using your endpoint URL:
Then, you can connect to Argilla using your endpoint URL.

```python
import os

import argilla as rg

# if you connect to your public app endpoint (uses default API key)
# If you connect to your public app endpoint (uses default API key)
rg.init(api_url="[your_space_url]", api_key="admin.apikey")

# if you connect to your private app endpoint (uses default API key)
# If you connect to your private app endpoint (uses default API key)
rg.init(api_url="[your_space_url]", api_key="admin.apikey", extra_headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"})

# transform dataset into Argilla's format and log it
rg.log(rg.read_datasets(dataset, task="TextClassification"), name="bankingapp_sentiment")
```
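
The two `rg.init` calls above differ only in the extra `Authorization` header. As a minimal sketch, the keyword arguments for either case could be built by a small helper. `build_init_kwargs` is a hypothetical name introduced here for illustration, not part of the Argilla client, and it assumes the token lives in the `HF_TOKEN` environment variable as in the snippet above.

```python
import os


def build_init_kwargs(api_url, api_key, private=False, env=None):
    """Return keyword arguments for rg.init (hypothetical helper, not Argilla API)."""
    env = os.environ if env is None else env
    kwargs = {"api_url": api_url, "api_key": api_key}
    if private:
        # Private Spaces need a Hugging Face token; fail early if it is missing.
        token = env.get("HF_TOKEN")
        if not token:
            raise RuntimeError("HF_TOKEN must be set to reach a private Space")
        kwargs["extra_headers"] = {"Authorization": f"Bearer {token}"}
    return kwargs


# Usage (requires the argilla package):
# import argilla as rg
# rg.init(**build_init_kwargs("[your_space_url]", "admin.apikey", private=True))
```

Failing early on a missing token gives a clearer error than the `KeyError` raised by `os.environ['HF_TOKEN']`.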

Congrats! You now have a dataset available from the Argilla UI to start browsing and labeling. In the code above, we've used one of the many integrations with Hugging Face libraries, which let you read hundreds of datasets available on the Hub.

### Data labeling and model training
Now let's create a dataset with two labels ("sadness" and "joy"). Don't forget to replace `<your-workspace>` with the workspace where the dataset will be created.

At this point, you can label your data directly using your Argilla Space and read the training data to train your model of choice.
> To check your workspaces, go to "My settings" on the UI. If you need to create a new one, consult the [docs](/getting_started/installation/configurations/workspace_management.md).
> Here, we are using a task template; see the docs to [create a fully custom dataset](/practical_guides/create_update_dataset/create_dataset.md).

```python
# this will read our current dataset and turn it into a clean dataset for training
dataset = rg.load("bankingapp_sentiment").prepare_for_training()
dataset = rg.FeedbackDataset.for_text_classification(
    labels=["sadness", "joy"],
    multi_label=False,
    use_markdown=True,
    guidelines=None,
    metadata_properties=None,
    vectors_settings=None,
)
dataset.push_to_argilla(name="my-first-dataset", workspace="<your-workspace>")
```

You can also get the full dataset and push it to the Hub for reproducibility and versioning:

```python
# save full argilla dataset for reproducibility
rg.load("bankingapp_sentiment").to_datasets().push_to_hub("bankingapp_sentiment")
```
Now, we will add the records. Create a list with the records you want to add and ensure that you match the fields with the ones specified in the previous step.

Finally, this is how you can train a SetFit model using data from your Argilla Space:
> You can also use `pandas` or `load_dataset` to [read an existing dataset and create records from it](/practical_guides/create_update_dataset/records.md#add-records).

```python
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitModel, SetFitTrainer

# Create train test split
dataset = dataset.train_test_split()

# Load SetFit model from Hub
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    loss_class=CosineSimilarityLoss,
    batch_size=8,
    num_iterations=20,
)

# Train and evaluate
trainer.train()
metrics = trainer.evaluate()
records = [
    rg.FeedbackRecord(
        fields={
            "text": "I am so happy today",
        },
    ),
    rg.FeedbackRecord(
        fields={
            "text": "I feel sad today",
        },
    )
]
dataset.add_records(records)
```
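
The note above mentions building records from an existing dataset. As a rough sketch under stated assumptions, the snippet below uses only the standard-library `csv` module to turn rows into `fields` dictionaries keyed by `text`, the field name used in the records above. `rows_to_fields` and the sample CSV are hypothetical and introduced only for illustration.

```python
import csv
import io


def rows_to_fields(csv_text, text_column="text"):
    """Turn CSV rows into `fields` dicts with a single `text` key (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = []
    for row in reader:
        value = (row.get(text_column) or "").strip()
        if value:  # skip rows whose text cell is empty or whitespace
            fields.append({"text": value})
    return fields


# Illustrative data only; replace with the contents of your own file.
sample = "text\nI am so happy today\nI feel sad today\n"
fields = rows_to_fields(sample)

# With the argilla package installed, the dicts can be wrapped as in the step above:
# records = [rg.FeedbackRecord(fields=f) for f in fields]
# dataset.add_records(records)
```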

As a next step, you can check the [Argilla Tutorials](https://docs.argilla.io/en/latest/tutorials/tutorials.html) section. All the tutorials can be run using Colab or local Jupyter Notebooks, so you can start building datasets with Argilla and Spaces!
Congrats! You now have a dataset available in Argilla to start browsing and labeling.

As a next step, you can check the [Argilla Tutorials](/tutorials_and_integrations/tutorials/tutorials.md) section. All the tutorials can be run using Colab or local Jupyter Notebooks, so you can start building datasets with Argilla and Spaces!

## Feedback and support

@@ -190,15 +172,18 @@ Additionally, the `LOAD_DATASETS` will let you configure the sample datasets tha
2. `full`: Load all the sample datasets for NLP tasks (TokenClassification, TextClassification, Text2Text)
3. `none`: No datasets being loaded.

## Setting up HF Authentication
## Setting up sign in with Hugging Face

From version `1.23.0` you can enable Hugging Face authentication for your Argilla Space. This feature allows you to give access to your Argilla Space to users that are logged in to the Hugging Face Hub.

```{note}
This feature is especially useful for public crowdsourcing projects. If you would like to have more control over who can log in to the Space, you can set this up on a private space so that only members of your Organization can sign in. Alternatively, you may want to [create users](/getting_started/installation/configurations/user_management.md#create-a-user) and use their credentials instead.
```
```{warning}
To work with stable datasets and keep all the contributions, we highly recommend using the persistent storage layer offered by Hugging Face. For more info, check the ["Setting up persistent storage"](#setting-up-persistent-storage) section.
```

To enable this feature, you will first need to [create an OAuth App in Hugging Face](https://huggingface.co/docs/hub/oauth#creating-an-oauth-app). To do that, go to your user settings in Hugging Face and select *Connected Apps* > *Create App*. Once inside, choose a name for your app and complete the form with the following information:
To set up the sign-in page, you first need to [create an OAuth App in Hugging Face](https://huggingface.co/docs/hub/oauth#creating-an-oauth-app). To do that, go to your user settings in Hugging Face and select *Connected Apps* > *Create App*. Once inside, choose a name for your app and complete the form with the following information:

* **Homepage URL:** [Your Argilla Space Direct URL](/getting_started/installation/deployments/huggingface-spaces.md#your-argilla-space-url).
* **Logo URL:** `[Your Argilla Space Direct URL]/favicon.ico`
@@ -210,7 +195,7 @@ This will create a Client ID and an App Secret that you will need to add as vari
1. **Name:** `OAUTH2_HUGGINGFACE_CLIENT_ID` - **Value:** [Your Client ID]
2. **Name:** `OAUTH2_HUGGINGFACE_CLIENT_SECRET` - **Value:** [Your App Secret]

Alternatively, you can provide the environment variables in the `.oauth.yaml` file like so:
Finally, you need to change the `.oauth.yaml` file located in the Files page of your Space (see below for what this file looks like). Once you have merged the change, go back to the *Settings* and do a *Factory rebuild*. Once the Space has restarted, you and your collaborators can sign in to your Space using your Hugging Face accounts.

```yaml
# This attribute will enable or disable the Hugging Face authentication
@@ -238,11 +223,3 @@ providers:
allowed_workspaces:
- name: admin
```

```{warning}
Be aware that the `.oauth.yaml` file is public in the case of public spaces, or may be accessible by other members of your organization if it is a private space.

Therefore, we recommend setting these variables as environment secrets.
```

Now check that the `enabled` parameter is set to `true` in your `.oauth.yaml` file and go back to the *Settings* to do a *Factory rebuild*. Once the Space has restarted, you and your collaborators can sign in to your Space using your Hugging Face accounts.
@@ -30,6 +30,11 @@ Add text descriptives to your metadata to simplify the data annotation and filte

Add semantic representations to your records using vector embeddings to simplify the data annotation and search process.
```
```{grid-item-card} Haystack: Monitoring LLMs for Agents
:link: use_argilla_callback_in_haystack-v1.html

Learn how to use Argilla to monitor LLMs with Haystack Agents.
```
````

```{toctree}
@@ -40,4 +45,5 @@ process_documents_with_unstructured
monitor_endpoints_with_fastapi
add_text_descriptives_as_metadata
add_sentence_transformers_embeddings_as_vectors
use_argilla_callback_in_haystack-v1
```

Large diffs are not rendered by default.