Added the second sharegpt format #3490

Merged
merged 5 commits into from May 1, 2024
111 changes: 94 additions & 17 deletions data/README.md
@@ -1,4 +1,4 @@
If you are using a custom dataset, please provide your dataset definition in the following format in `dataset_info.json`.
If you are using a custom dataset, please add your **dataset description** to `dataset_info.json` according to the following format. We also provide several examples in the next section.

```json
"dataset_name": {
@@ -33,7 +33,7 @@ If you are using a custom dataset, please provide your dataset definition in the
}
```

Given above, you can use the custom dataset via specifying `--dataset dataset_name`.
After that, you can load the custom dataset by specifying `--dataset dataset_name`.
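A minimal sketch of how a description could be added to `dataset_info.json` programmatically. The helper name `register_dataset` is hypothetical and not part of the project; the sketch only assumes that the file holds one JSON object keyed by dataset name, as the format above suggests.

```python
import json
from pathlib import Path


def register_dataset(info_path, name, description):
    """Add or overwrite one dataset description in a dataset_info.json file.

    Hypothetical helper: assumes the file is a single JSON object that maps
    dataset names to their descriptions.
    """
    path = Path(info_path)
    # Start from the existing registry if the file is already there.
    info = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    info[name] = description
    # ensure_ascii=False keeps non-ASCII dataset names readable in the file.
    path.write_text(json.dumps(info, indent=2, ensure_ascii=False), encoding="utf-8")
    return info
```

Calling `register_dataset("dataset_info.json", "dataset_name", {"file_name": "data.json"})` would then make the name available to `--dataset dataset_name`, under the assumption stated above.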

----

@@ -54,10 +54,11 @@ Currently we support dataset in **alpaca** or **sharegpt** format, the dataset i
]
```

Regarding the above dataset, the `columns` in `dataset_info.json` should be:
Regarding the above dataset, the description in `dataset_info.json` should be:

```json
"dataset_name": {
"file_name": "data.json",
"columns": {
"prompt": "instruction",
"query": "input",
@@ -70,28 +71,60 @@ Regarding the above dataset, the `columns` in `dataset_info.json` should be:

The `query` column is concatenated with the `prompt` column to form the user prompt, i.e. `prompt\nquery`. The `response` column holds the model response.
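The concatenation rule can be sketched as follows. This is an illustration, not the project's loader: `build_user_prompt` is a hypothetical name, and the handling of an empty `query` is an assumption.

```python
def build_user_prompt(record, columns):
    """Join the `prompt` and `query` columns into one user prompt."""
    prompt = record.get(columns["prompt"], "")
    query_field = columns.get("query")
    query = record.get(query_field, "") if query_field else ""
    # Assumption: when the query is empty, the prompt is used on its own
    # instead of ending with a dangling newline.
    return f"{prompt}\n{query}" if query else prompt


columns = {"prompt": "instruction", "query": "input", "response": "output"}
record = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
user_prompt = build_user_prompt(record, columns)
```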

The `system` column will be used as the system prompt. The `history` column is a list of string tuples representing prompt-response pairs in the history. Note that the responses in the history **will also be used for training**.
The `system` column will be used as the system prompt. The `history` column is a list of string tuples representing prompt-response pairs in the history. Note that the responses in the history **will also be used for training** in supervised fine-tuning.
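To illustrate why the history matters for training, here is a small sketch that unrolls a record into its prompt-response pairs. The function names and the `history` field name in the sample record are hypothetical, and the `query` column is left out for brevity.

```python
def collect_turns(record, columns):
    """Return every (prompt, response) pair of a record, oldest first."""
    history = []
    if "history" in columns:
        history = record.get(columns["history"]) or []
    # Each history entry is a [prompt, response] pair.
    turns = [(old_prompt, old_response) for old_prompt, old_response in history]
    # The current turn comes last.
    turns.append((record[columns["prompt"]], record[columns["response"]]))
    return turns


def supervised_targets(turns):
    """Responses that would be trained on, including those from the history."""
    return [response for _, response in turns]


columns = {"prompt": "instruction", "response": "output", "history": "history"}
record = {
    "instruction": "second question",
    "output": "second answer",
    "history": [["first question", "first answer"]],
}
turns = collect_turns(record, columns)
targets = supervised_targets(turns)
```

Under this reading, both `"first answer"` and `"second answer"` end up as training targets.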

For the pre-training datasets, only the `prompt` column will be used for training.
For the **pre-training datasets**, only the `prompt` column will be used for training, for example:

For the preference datasets, the `response` column should be a string list whose length is 2, with the preferred answers appearing first, for example:
```json
[
{"text": "document"},
{"text": "document"}
]
```

Regarding the above dataset, the description in `dataset_info.json` should be:

```json
{
"instruction": "user instruction",
"input": "user input",
"output": [
"chosen answer",
"rejected answer"
]
"dataset_name": {
"file_name": "data.json",
"columns": {
"prompt": "text"
}
}
```
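As a rough sketch of what "only the `prompt` column is used" means for a pre-training dataset, the mapping can be applied like this (`pretraining_texts` is a hypothetical helper, not a function of the project):

```python
def pretraining_texts(records, columns):
    """Pick the field mapped to `prompt` from every record; other fields are ignored."""
    text_field = columns["prompt"]
    return [record[text_field] for record in records]


records = [{"text": "document"}, {"text": "document"}]
texts = pretraining_texts(records, {"prompt": "text"})
```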

Remember to set `"ranking": true` for the preference datasets.
For the **preference datasets**, the `response` column should be a list of two strings, with the preferred answer first, for example:

```json
[
{
"instruction": "user instruction",
"input": "user input",
"output": [
"chosen answer",
"rejected answer"
]
}
]
```

Regarding the above dataset, the description in `dataset_info.json` should be:

```json
"dataset_name": {
"file_name": "data.json",
"ranking": true,
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output"
}
}
```
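A preference record can be checked before training with a sketch like the one below. The name `split_preference` is hypothetical, and raising `ValueError` on malformed records is a design assumption rather than the project's behavior.

```python
def split_preference(record, columns):
    """Return (chosen, rejected) from a preference record.

    The `response` column must be a list of exactly two strings,
    with the preferred answer first.
    """
    responses = record[columns["response"]]
    is_valid = (
        isinstance(responses, list)
        and len(responses) == 2
        and all(isinstance(item, str) for item in responses)
    )
    if not is_valid:
        raise ValueError("response must be a list of two strings: [chosen, rejected]")
    chosen, rejected = responses
    return chosen, rejected


columns = {"prompt": "instruction", "query": "input", "response": "output"}
record = {
    "instruction": "user instruction",
    "input": "user input",
    "output": ["chosen answer", "rejected answer"],
}
chosen, rejected = split_preference(record, columns)
```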

----

The dataset in sharegpt format should follow the below format:
The dataset in **sharegpt** format should follow the format below:

```json
[
@@ -112,10 +145,12 @@ The dataset in sharegpt format should follow the below format:
]
```

Regarding the above dataset, the `columns` in `dataset_info.json` should be:
Regarding the above dataset, the description in `dataset_info.json` should be:

```json
"dataset_name": {
"file_name": "data.json",
"formatting": "sharegpt",
"columns": {
"messages": "conversations",
"system": "system",
@@ -132,4 +167,46 @@ Regarding the above dataset, the `columns` in `dataset_info.json` should be:

where the `messages` column should be a list following the `u/a/u/a/u/a` order, i.e. alternating user (`u`) and assistant (`a`) turns.
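The ordering requirement can be checked with a short sketch. `follows_ua_order` is a hypothetical helper; the tag names are passed in by the caller, and the sketch assumes that a valid conversation is non-empty, starts with a user turn, and ends with an assistant turn.

```python
def follows_ua_order(messages, role_tag, user_tag, assistant_tag):
    """Check that messages alternate user/assistant, starting with the user."""
    # Assumption: complete conversations consist of user/assistant pairs.
    if not messages or len(messages) % 2 != 0:
        return False
    expected = (user_tag, assistant_tag)
    return all(
        message.get(role_tag) == expected[index % 2]
        for index, message in enumerate(messages)
    )


# Hypothetical tag values, chosen only for this illustration.
good = [{"role": "user"}, {"role": "assistant"}, {"role": "user"}, {"role": "assistant"}]
bad = [{"role": "user"}, {"role": "user"}]
good_ok = follows_ua_order(good, "role", "user", "assistant")
bad_ok = follows_ua_order(bad, "role", "user", "assistant")
```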

Pre-training datasets and preference datasets are incompatible with the sharegpt format yet.
We also support datasets in the **openai** format:

```json
[
{
"messages": [
{
"role": "system",
"content": "system prompt (optional)"
},
{
"role": "user",
"content": "user instruction"
},
{
"role": "assistant",
"content": "model response"
}
]
}
]
```

Regarding the above dataset, the description in `dataset_info.json` should be:

```json
"dataset_name": {
"file_name": "data.json",
"formatting": "sharegpt",
"columns": {
"messages": "messages"
},
"tags": {
"role_tag": "role",
"content_tag": "content",
"user_tag": "user",
"assistant_tag": "assistant",
"system_tag": "system"
}
}
```
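The role of the `tags` mapping can be sketched as follows: the tags tell a loader which keys and values to look for in each message. `split_conversation` is a hypothetical function, and treating the system message as an optional first entry is an assumption based on the example above.

```python
TAGS = {
    "role_tag": "role",
    "content_tag": "content",
    "user_tag": "user",
    "assistant_tag": "assistant",
    "system_tag": "system",
}


def split_conversation(messages, tags):
    """Split a message list into (system_prompt, [(user, assistant), ...])."""
    role, content = tags["role_tag"], tags["content_tag"]
    system = ""
    # Assumption: an optional system message, if present, comes first.
    if messages and messages[0][role] == tags["system_tag"]:
        system = messages[0][content]
        messages = messages[1:]
    if len(messages) % 2 != 0:
        raise ValueError("messages must come in user/assistant pairs")
    pairs = []
    for user_msg, assistant_msg in zip(messages[0::2], messages[1::2]):
        if user_msg[role] != tags["user_tag"] or assistant_msg[role] != tags["assistant_tag"]:
            raise ValueError("messages must alternate user/assistant")
        pairs.append((user_msg[content], assistant_msg[content]))
    return system, pairs


example = [
    {"role": "system", "content": "system prompt (optional)"},
    {"role": "user", "content": "user instruction"},
    {"role": "assistant", "content": "model response"},
]
system, pairs = split_conversation(example, TAGS)
```

Keeping the tag names in a mapping rather than hard-coding them is what lets the same `sharegpt` formatting cover both layouts.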

Pre-training datasets and preference datasets are **not yet compatible** with the sharegpt format.
111 changes: 94 additions & 17 deletions data/README_zh.md
@@ -1,4 +1,4 @@
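If you are using a custom dataset, please make sure to provide the dataset definition in `dataset_info.json` in the following format.
If you are using a custom dataset, please make sure to add a **dataset description** to `dataset_info.json` in the following format. We also provide some examples below.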

```json
"dataset_name": {
@@ -33,7 +33,7 @@
}
```

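After adding it, you can use the custom dataset by specifying the `--dataset dataset_name` argument.
Then you can load the custom dataset by using the `--dataset dataset_name` argument.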

----

@@ -54,10 +54,11 @@
]
```

For data in the above format, the `columns` in `dataset_info.json` should be:
For data in the above format, the description in `dataset_info.json` should be:

```json
"dataset_name": {
"file_name": "data.json",
"columns": {
"prompt": "instruction",
"query": "input",
@@ -70,28 +71,60 @@

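The content of the `query` column is concatenated with the content of the `prompt` column to form the user instruction, i.e. `prompt\nquery`. The content of the `response` column is the model response.

The content of the `system` column is used as the system prompt. The `history` column is a list of string pairs, each holding the instruction and response of one round in the history. Note that the responses in the history **are also used for training**.
The content of the `system` column is used as the system prompt. The `history` column is a list of string pairs, each holding the instruction and response of one round in the history. Note that during supervised fine-tuning, the responses in the history **are also used for training**.

For pre-training datasets, only the content of the `prompt` column is used for model training.
For **pre-training datasets**, only the content of the `prompt` column is used for model training, for example:

For preference datasets, the `response` column should be a list of two strings, with the better answer first, for example: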
```json
[
{"text": "document"},
{"text": "document"}
]
```

For data in the above format, the description in `dataset_info.json` should be:

```json
{
"instruction": "user instruction",
"input": "user input",
"output": [
"high-quality answer",
"low-quality answer"
]
"dataset_name": {
"file_name": "data.json",
"columns": {
"prompt": "text"
}
}
```

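Adding a preference dataset additionally requires specifying `"ranking": true`.
For **preference datasets**, the `response` column should be a list of two strings, with the better answer first, for example:

```json
[
{
"instruction": "user instruction",
"input": "user input",
"output": [
"high-quality answer",
"low-quality answer"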
]
}
]
```

For data in the above format, the description in `dataset_info.json` should be:

```json
"dataset_name": {
"file_name": "data.json",
"ranking": true,
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output"
}
}
```

----

Datasets in the sharegpt format, in turn, are organized as follows:
Datasets in the **sharegpt** format are organized as follows:

```json
[
@@ -112,10 +145,12 @@
]
```

For data in the above format, the `columns` in `dataset_info.json` should be:
For data in the above format, the description in `dataset_info.json` should be:

```json
"dataset_name": {
"file_name": "data.json",
"formatting": "sharegpt",
"columns": {
"messages": "conversations",
"system": "system",
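@@ -132,4 +167,46 @@

where the `messages` column should be a list following the `user/model/user/model/user/model` order.

Pre-training datasets and preference datasets do not yet support the sharegpt format.
We also support datasets in the **openai** format: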

```json
[
{
"messages": [
{
"role": "system",
"content": "system prompt (optional)"
},
{
"role": "user",
"content": "user instruction"
},
{
"role": "assistant",
"content": "model response"
}
]
}
]
```

For data in the above format, the description in `dataset_info.json` should be:

```json
"dataset_name": {
"file_name": "data.json",
"formatting": "sharegpt",
"columns": {
"messages": "messages"
},
"tags": {
"role_tag": "role",
"content_tag": "content",
"user_tag": "user",
"assistant_tag": "assistant",
"system_tag": "system"
}
}
```

Pre-training datasets and preference datasets **do not yet support** the sharegpt format.