
How to use the data

Datasets Supported by the Framework

We provide the following datasets for the experiments in this framework.

English Instruction Datasets

Chinese Instruction Datasets

RLHF Datasets

  • CValues dataset description: an open-source value-alignment dataset of 145k examples. For each prompt, it provides three types of responses, ranked as rejection with positive guidance (safe and responsible) > rejection only (safe) > risky response (unsafe). It can be used to improve the safety of SFT models or to train reward models.
  • CValues-Comparison: a Chinese values-comparison dataset for large language models

Dataset format

The dataset_info.yaml file lists all the datasets that can be used in the experiments. Each entry follows the format below and mainly includes the following fields.

dataset_name:
  hf_hub_url: # "the name of the dataset repository on the HuggingFace hub (if specified, the three arguments below are ignored)",
  local_path: # "the path to the dataset file in this directory (required if hf_hub_url is not specified)",
  dataset_format: # "the format of the dataset (required), e.g., alpaca, dolly, etc.",
  multi_turn:  # "whether the dataset is multi-turn (default: False)"

For example, the following is the dataset information for the Stanford Alpaca dataset. During training, the framework will load the dataset from the HuggingFace hub.

alpaca:
  hf_hub_url: tatsu-lab/alpaca
  local_path:
  dataset_format: alpaca
  multi_turn: False

If you want to load the dataset from local files, please specify the local_path field.

alpaca:
  hf_hub_url: tatsu-lab/alpaca
  local_path: path/to/alpaca.json
  dataset_format: alpaca
  multi_turn: False

Custom datasets

If you are using a custom dataset, please provide your dataset definition in dataset_info.yaml.

hf_hub_url/local_path

By default, the framework loads datasets from the HuggingFace hub. If you want to load a dataset from local files, please specify the local_path field.
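
As a rough illustration, the selection between the two fields can be sketched with the datasets library as follows. This is only a minimal sketch under the assumption that a non-empty local_path takes precedence; it is not the framework's actual loading code.

from datasets import load_dataset

def load_from_info(info: dict):
    """Load one dataset entry; prefer a local file when local_path is set (assumption)."""
    if info.get('local_path'):
        # Load from a local JSON file.
        return load_dataset('json', data_files=info['local_path'])['train']
    # Otherwise download (or reuse the cache) from the HuggingFace hub.
    return load_dataset(info['hf_hub_url'])['train']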

dataset_format

The dataset_format field specifies the format of the dataset and determines how it is processed. Currently, the following dataset formats are supported; a minimal example record in the Alpaca format is shown after the list.

  • alpaca: Alpaca dataset
  • dolly: Dolly dataset
  • gpt4: GPT-4 generated dataset
  • alpaca_cot: Alpaca CoT dataset
  • oasst1: OpenAssistant/oasst1 dataset
  • sharegpt: Multi-turn ShareGPT dataset
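
For reference, a single record in the Alpaca format is a JSON object with instruction, input, and output fields. The values below are made up for illustration only.

# A minimal Alpaca-format record (field names follow the Alpaca dataset;
# the values are placeholders, not real data).
alpaca_example = {
    'instruction': 'Summarize the following paragraph.',
    'input': 'Large language models are trained on large text corpora ...',
    'output': 'The paragraph describes how large language models are trained.',
}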

If your dataset is not in the above format, there are two ways to use it.

  • The first way is to implement the format_dataset function in data_utils.

For example, the following is the _format_dolly15k function for the Dolly dataset.

from datasets import Dataset

def _format_dolly15k(dataset: Dataset) -> Dataset:
    """Format the Dolly-15k dataset by renaming its columns to the standard names."""
    dataset = dataset.rename_column('context', 'input')
    dataset = dataset.rename_column('response', 'output')
    return dataset
  • The second way is to convert your dataset to one of the above formats.

For example, the following code converts databricks-dolly-15k to the Alpaca format.

import json

def convert_dolly_alpaca(in_file, out_file):
    """Convert a Dolly-style JSON file to the Alpaca format."""
    with open(in_file, 'r') as file:
        contents = json.load(file)
        new_content = []
        for i, content in enumerate(contents):
            new_content.append({
                'instruction': content['instruction'],
                'input': content['context'],
                'output': content['response'],
            })

    print(f'#out: {len(new_content)}')
    with open(out_file, 'w') as file:
        json.dump(new_content, file, indent=2, ensure_ascii=False)
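
A possible invocation of the helper above, assuming the Dolly data has already been saved as a single JSON array (which is what json.load expects); the file names are placeholders.

# Placeholder paths; point these at your own files.
convert_dolly_alpaca('databricks-dolly-15k.json', 'dolly_alpaca.json')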

multi_turn

If your dataset is multi-turn, please set multi_turn: True in dataset_info.yaml. The framework will automatically process the multi-turn dataset.

The following example shows the format of a multi-turn dataset.

[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "human",
        "value": "Who are you?"
      },
      {
        "from": "gpt",
        "value": "I am Vicuna, a language model trained by researchers from Large Model Systems Organization (LMSYS)."
      },
      {
        "from": "human",
        "value": "What can you do?"
      },
      {
        "from": "gpt",
        "value": "I can chat with you."
      }
    ]
  },
  {
    "id": "identity_1",
    "conversations": [
      {
        "from": "human",
        "value": "Who are you?"
      },
      {
        "from": "gpt",
        "value": "My name is Vicuna, and I'm a language model developed by Large Model Systems Organization (LMSYS)."
      }
    ]
  }
]

For now, we only support multi-turn datasets in the above format. If your dataset is in a different format, please convert it first. We also provide the following code to convert the Dolly dataset to this format; you can find it in convert_alpaca.

import argparse
import json
from typing import Any, Dict, List

from datasets import load_dataset

def convert_dolly_vicuna(raw_data: List[Dict[str, Any]]):
    collect_data = []
    for i, content in enumerate(raw_data):
        if len(content['context'].strip()) > 1:
            q, a = content['instruction'] + '\nInput:\n' + content[
                'context'], content['response']
        else:
            q, a = content['instruction'], content['response']

        collect_data.append({
            'id': f'alpaca_{i}',
            'conversations': [
                {'from': 'human', 'value': q},
                {'from': 'gpt', 'value': a},
            ],
        })
    print(f'Original: {len(raw_data)}, Converted: {len(collect_data)}')
    return collect_data

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--in-file', type=str)
    parser.add_argument('--out-file', type=str)
    args = parser.parse_args()

    raw_data = load_dataset('json', data_files=args.in_file)['train']
    new_data = convert_dolly_vicuna(raw_data)
    with open(args.out_file, 'w') as file:
        json.dump(new_data, file, indent=2, ensure_ascii=False)


if __name__ == '__main__':
    main()
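
If you prefer to call the converter from Python instead of the command line, the same steps look roughly like this; the file names are placeholders.

import json
from datasets import load_dataset

# Placeholder paths; point these at your own files.
raw_data = load_dataset('json', data_files='databricks-dolly-15k.jsonl')['train']
new_data = convert_dolly_vicuna(raw_data)
with open('dolly_multi_turn.json', 'w') as file:
    json.dump(new_data, file, indent=2, ensure_ascii=False)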

How to use in training scripts

In the data/ directory, we provide several dataset info configs used in the experiments. The following script shows how to use the alpaca_zh.yaml dataset info config.

python train.py \
  --model_name_or_path  facebook/opt-125m \
  --dataset_cfg alpaca_zh.yaml \
  --output_dir work_dir/full-finetune \
  --num_train_epochs 3 \
  --per_device_train_batch_size 4 \
  --per_device_eval_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --evaluation_strategy "steps" \
  --save_strategy "steps" \
  --eval_steps 1000 \
  --save_steps 1000 \
  --save_total_limit 5 \
  --logging_steps 1 \
  --learning_rate 2e-5 \
  --weight_decay 0. \
  --warmup_ratio 0.03 \
  --optim "adamw_torch" \
  --lr_scheduler_type "cosine" \
  --gradient_checkpointing True \
  --model_max_length 128 \
  --do_train \
  --do_eval

You can use alpaca_zh.yaml directly, or create a custom dataset config and set the dataset_cfg argument to your_dataset_info.yaml.