How to use the data

Datasets Supported by the Framework

We provide the following datasets for the experiments in this framework.

English Instruction Datasets

Stanford Alpaca
Hello-SimpleAI/HC3
databricks-dolly-15k
mosaicml/dolly_hhrlhf
GPT-4 Generated Data
Alpaca CoT
UltraChat
OpenAssistant/oasst1
ShareGPT_Vicuna_unfiltered
timdettmers/openassistant-guanaco
Evol-Instruct

Chinese Instruction Datasets

Stanford Alpaca (zh)
Alpaca-GPT-4 (zh)
BELLE 2M (zh)
BELLE 1M (zh)
BELLE 0.5M (zh)
BELLE Dialogue 0.4M (zh)
BELLE School Math 0.25M (zh)
BELLE Multiturn Chat 0.8M (zh)
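InstructionWild (a collection of natural instructions gathered from the web)
HuatuoGPT-sft-data-v1 (Huatuo, a Chinese medical instruction dataset)
100PoisonMpts ("100 Bottles of Poison for AI"): a Chinese LLM governance dataset
COIG (Chinese Open Instruction Generalist project)
COIG-PC (Prompt Collection): phase two of the COIG dataset
ShareChat (a project that invites everyone to help translate high-quality ShareGPT data)
SmileConv (real mental-health mutual-support QA pairs rewritten by ChatGPT into multi-turn mental health support conversations)
OL-CC (OpenLabel-Chinese Conversations Dataset): an open-source Chinese conversational instruction set, crowdsourced and written by humans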

RLHF Datasets

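CValues: an open-source value alignment dataset with 145k examples. For each prompt it includes three types of response, ranked as refusal with positive advice (safe and responsible) > mainly refusal (safe) > risky reply (unsafe). It can be used to improve the safety of SFT models or to train reward models.
CValues-Comparison: a value comparison dataset for Chinese large language models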

Dataset format

The dataset_info.yaml file lists all the datasets that can be used in the experiments. Each entry follows the format below and mainly includes the following fields.

dataset_name:
  hf_hub_url: # "the name of the dataset repository on the HuggingFace hub. (if specified, ignore below 3 arguments)",
  local_path: # "the name of the dataset file in this directory. (required if above are not specified)",
  dataset_format: # "the format of the dataset. (required), e.g., alpaca, dolly, etc.",
  multi_turn:  # "whether the dataset is multi-turn. (default: False)"
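
The field rules above can be checked before training starts. The following is a minimal sketch, not the framework's own loader: check_dataset_entry is a hypothetical helper that takes one entry already parsed into a plain dict and applies only the rules stated in the comments above (dataset_format is required, local_path is required when hf_hub_url is missing, multi_turn defaults to False).

```python
from typing import Any, Dict


def check_dataset_entry(name: str, entry: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: validate one dataset_info.yaml entry (already a dict)."""
    hf_hub_url = entry.get('hf_hub_url')
    local_path = entry.get('local_path')
    dataset_format = entry.get('dataset_format')

    # dataset_format is marked as required.
    if not dataset_format:
        raise ValueError(f'{name}: dataset_format is required')
    # local_path is required when hf_hub_url is not given.
    if not hf_hub_url and not local_path:
        raise ValueError(f'{name}: set hf_hub_url or local_path')

    return {
        'hf_hub_url': hf_hub_url,
        'local_path': local_path,
        'dataset_format': dataset_format,
        # multi_turn falls back to False when omitted or left empty.
        'multi_turn': bool(entry.get('multi_turn') or False),
    }


entry = check_dataset_entry(
    'alpaca', {'hf_hub_url': 'tatsu-lab/alpaca', 'dataset_format': 'alpaca'})
print(entry)
```

An empty YAML value such as "local_path:" is parsed as None, which is why the sketch treats None and a missing key the same way.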

For example, the following is the dataset information of the Stanford Alpaca dataset. While training, the framework will load the dataset from the HuggingFace hub.

alpaca:
  hf_hub_url: tatsu-lab/alpaca
  local_path:
  dataset_format: alpaca
  multi_turn: False

If you want to load the dataset from local files, please specify the local_path field.

alpaca:
  hf_hub_url: tatsu-lab/alpaca
  local_path: path/to/alpaca.json
  dataset_format: alpaca
  multi_turn: False

Custom datasets

If you are using a custom dataset, please provide your dataset definition in dataset_info.yaml.

hf_hub_url/local_path

By default, the framework will load the datasets from the HuggingFace hub. If you want to use the datasets from local files, please specify the local_path field.

dataset_format

The dataset_format field specifies the format of the dataset and determines how the dataset is processed. Currently, we support the following dataset formats.

alpaca: Alpaca dataset
dolly: Dolly dataset
gpt4: GPT-4 generated dataset
alpaca_cot: Alpaca CoT dataset
oasst1: OpenAssistant/oasst1 dataset
sharegpt: Multi-turn ShareGPT dataset

If your dataset is not in one of the above formats, there are two ways to use it.

The first way is to implement the format_dataset function in data_utils.

For example, the following is the _format_dolly15k function for the Dolly dataset.

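from datasets import Dataset


def _format_dolly15k(dataset: Dataset) -> Dataset:
    """Format Dolly-15k dataset."""
    # Rename Dolly's columns to the Alpaca-style 'input' and 'output' names.
    dataset = dataset.rename_column('context', 'input')
    dataset = dataset.rename_column('response', 'output')
    return dataset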
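
To illustrate how a dataset_format value could select a processing function, here is a minimal sketch using plain lists of dicts instead of a Dataset object. The FORMATTERS registry and the format_records function are hypothetical names for illustration only; the framework's actual dispatch in data_utils may look different.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def format_dolly(records: List[Record]) -> List[Record]:
    """Rename Dolly's 'context'/'response' keys to 'input'/'output'."""
    formatted = []
    for record in records:
        record = dict(record)  # copy so the caller's data is not modified
        record['input'] = record.pop('context')
        record['output'] = record.pop('response')
        formatted.append(record)
    return formatted


def format_alpaca(records: List[Record]) -> List[Record]:
    """Alpaca records already use instruction/input/output."""
    return records


# Hypothetical registry keyed by the dataset_format value.
FORMATTERS: Dict[str, Callable[[List[Record]], List[Record]]] = {
    'alpaca': format_alpaca,
    'dolly': format_dolly,
}


def format_records(records: List[Record], dataset_format: str) -> List[Record]:
    """Look up the formatter for dataset_format and apply it."""
    if dataset_format not in FORMATTERS:
        raise ValueError(f'Unsupported dataset_format: {dataset_format}')
    return FORMATTERS[dataset_format](records)


sample = [{'instruction': 'Add 1 and 2.', 'context': '', 'response': '3'}]
print(format_records(sample, 'dolly'))
```

With a registry like this, supporting a new format means writing one function and adding one dictionary entry.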

The second way is to convert your dataset to one of the above formats.

For example, the following code converts databricks-dolly-15k to the Alpaca format.

import json


def convert_dolly_alpaca(in_file, out_file):
    # Read the Dolly records (expects a JSON array of objects).
    with open(in_file, 'r', encoding='utf-8') as file:
        contents = json.load(file)

    # Map Dolly fields to Alpaca fields: context -> input, response -> output.
    new_content = []
    for content in contents:
        new_content.append({
            'instruction': content['instruction'],
            'input': content['context'],
            'output': content['response'],
        })

    print(f'#out: {len(new_content)}')
    with open(out_file, 'w', encoding='utf-8') as file:
        json.dump(new_content, file, indent=2, ensure_ascii=False)
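
If your copy of the Dolly data is stored as JSON Lines (one JSON object per line) rather than a single JSON array, json.load will not parse it in one call. The sketch below shows one way to handle that case; parse_jsonl and dolly_to_alpaca are hypothetical helper names, and the Dolly field names (instruction, context, response) are assumed to match the ones used elsewhere in this document.

```python
import json
from typing import Dict, Iterable, List


def parse_jsonl(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse JSON Lines input: one JSON object per non-empty line."""
    records = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


def dolly_to_alpaca(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Map Dolly keys to Alpaca keys: context -> input, response -> output."""
    return [
        {
            'instruction': record['instruction'],
            'input': record.get('context', ''),
            'output': record['response'],
        }
        for record in records
    ]


lines = ['{"instruction": "Say hi", "context": "", "response": "hi"}', '']
print(dolly_to_alpaca(parse_jsonl(lines)))
```

An open file object can be passed to parse_jsonl directly, since iterating over a text file yields its lines.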

multi_turn

If your dataset is multi-turn, please set multi_turn: True in dataset_info.yaml. The framework will automatically process the multi-turn dataset.

The following example shows the format of a multi-turn dataset.

[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "human",
        "value": "Who are you?"
      },
      {
        "from": "gpt",
        "value": "I am Vicuna, a language model trained by researchers from Large Model Systems Organization (LMSYS)."
      },
      {
        "from": "human",
        "value": "What can you do?"
      },
      {
        "from": "gpt",
        "value": "I can chat with you."
      }
    ]
  },
  {
    "id": "identity_1",
    "conversations": [
      {
        "from": "human",
        "value": "Who are you?"
      },
      {
        "from": "gpt",
        "value": "My name is Vicuna, and I'm a language model developed by Large Model Systems Organization (LMSYS)."
      }
    ]
  }
]
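
Before training on a converted file, it can help to check that every record matches the layout above. The following is a minimal sketch: check_conversations is a hypothetical helper, and the requirement that turns alternate between human and gpt, starting with human, is an assumption taken from the example rather than a documented rule of the framework.

```python
from typing import Any, Dict, List


def check_conversations(records: List[Dict[str, Any]]) -> None:
    """Hypothetical check: raise ValueError if a record breaks the layout."""
    for record in records:
        if 'id' not in record or 'conversations' not in record:
            raise ValueError(f'Missing id or conversations: {record}')
        turns = record['conversations']
        if not turns:
            raise ValueError(f"{record['id']}: no turns")
        for index, turn in enumerate(turns):
            # Assumption: turns alternate human, gpt, human, gpt, ...
            expected = 'human' if index % 2 == 0 else 'gpt'
            if turn.get('from') != expected:
                raise ValueError(
                    f"{record['id']}: turn {index} should be from {expected}")
            if not isinstance(turn.get('value'), str):
                raise ValueError(
                    f"{record['id']}: turn {index} has no text value")


good = [{
    'id': 'identity_0',
    'conversations': [
        {'from': 'human', 'value': 'Who are you?'},
        {'from': 'gpt', 'value': 'I am Vicuna.'},
    ],
}]
check_conversations(good)  # passes silently
```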

For now, we only support the multi-turn dataset in the above format. If your dataset is not in the above format, please convert it. We also provide the following code to convert the Dolly dataset to the above format. You can find the code in convert_alpaca.

import argparse
import json
from typing import Any, Dict, List

from datasets import load_dataset

def convert_dolly_vicuna(raw_data: List[Dict[str, Any]]):
    collect_data = []
    for i, content in enumerate(raw_data):
        if len(content['context'].strip()) > 1:
            q, a = content['instruction'] + '\nInput:\n' + content[
                'context'], content['response']
        else:
            q, a = content['instruction'], content['response']

        collect_data.append({
            'id':
            f'alpaca_{i}',
            'conversations': [
                {
                    'from': 'human',
                    'value': q
                },
                {
                    'from': 'gpt',
                    'value': a
                },
            ],
        })
    print(f'Original: {len(raw_data)}, Converted: {len(collect_data)}')
    return collect_data

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--in-file', type=str)
    parser.add_argument('--out-file', type=str)
    args = parser.parse_args()

    raw_data = load_dataset('json', data_files=args.in_file)['train']
    new_data = convert_dolly_vicuna(raw_data)
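    # Write the converted conversations as a JSON array.
    with open(args.out_file, 'w', encoding='utf-8') as file:
        json.dump(new_data, file, indent=2, ensure_ascii=False)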


if __name__ == '__main__':
    main()

How to use in training scripts

In the data/ directory, we provide several dataset info files used in the experiments. The following script shows how to use the alpaca_zh.yaml dataset info file.

python train.py \
  --model_name_or_path  facebook/opt-125m \
  --dataset_cfg alpaca_zh.yaml \
  --output_dir work_dir/full-finetune \
  --num_train_epochs 3 \
  --per_device_train_batch_size 4 \
  --per_device_eval_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --evaluation_strategy "steps" \
  --save_strategy "steps" \
  --eval_steps 1000 \
  --save_steps 1000 \
  --save_total_limit 5 \
  --logging_steps 1 \
  --learning_rate 2e-5 \
  --weight_decay 0. \
  --warmup_ratio 0.03 \
  --optim "adamw_torch" \
  --lr_scheduler_type "cosine" \
  --gradient_checkpointing True \
  --model_max_length 128 \
  --do_train \
  --do_eval

You can use alpaca_zh.yaml directly, or create a custom dataset config and set the dataset_cfg argument to your_dataset_info.yaml.
