We provide the following datasets for the experiments in this framework.
- Stanford Alpaca
- Hello-SimpleAI/HC3
- databricks-dolly-15k
- mosaicml/dolly_hhrlhf
- GPT-4 Generated Data
- Alpaca CoT
- UltraChat
- OpenAssistant/oasst1
- ShareGPT_Vicuna_unfiltered
- timdettmers/openassistant-guanaco
- Evol-Instruct
- Stanford Alpaca (zh)
- Alpaca-GPT-4 (zh)
- BELLE 2M (zh)
- BELLE 1M (zh)
- BELLE 0.5M (zh)
- BELLE Dialogue 0.4M (zh)
- BELLE School Math 0.25M (zh)
- BELLE Multiturn Chat 0.8M (zh)
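- InstructionWild (natural instructions collected from the web)
- HuatuoGPT-sft-data-v1 (Chinese medical instruction dataset, Huatuo)
- 100PoisonMpts ("100 Bottles of Poison for AI"): a Chinese LLM governance dataset
- COIG (Chinese Open Instruction Generalist project)
- COIG-PC (Prompt Collection): the second phase of the COIG dataset
- ShareChat (a community project to translate high-quality ShareGPT data)
- SmileConv (multi-turn mental health support dialogues, rewritten by ChatGPT from real psychological mutual-help QA)
- OL-CC (OpenLabel-Chinese Conversations Dataset): an open-source Chinese conversational instruction set, crowdsourced and written by humans
- CValues: an open-source value alignment dataset of 145k samples. For each prompt it includes three types of responses, ranked as: rejection with positive advice (safe and responsibility) > rejection only (safe) > risky response (unsafe). It can be used to improve the safety of SFT models or to train reward models.
- CValues-Comparison: a Chinese LLM values comparison dataset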
The dataset_info.yaml file lists all the datasets that can be used in the experiments. Each entry has the following format and fields.
dataset_name:
  hf_hub_url: # "the name of the dataset repository on the HuggingFace hub. (if specified, ignore below 3 arguments)",
  local_path: # "the name of the dataset file in this directory. (required if above are not specified)",
  dataset_format: # "the format of the dataset. (required), e.g., alpaca, dolly, etc.",
  multi_turn: # "whether the dataset is multi-turn. (default: False)"
For example, the following is the dataset information of the Stanford Alpaca dataset. While training, the framework will load the dataset from the HuggingFace hub.
alpaca:
  hf_hub_url: tatsu-lab/alpaca
  local_path:
  dataset_format: alpaca
  multi_turn: False
If you want to load the dataset from local files, please specify the local_path field.
alpaca:
  hf_hub_url: tatsu-lab/alpaca
  local_path: path/to/alpaca.json
  dataset_format: alpaca
  multi_turn: False
If you are using a custom dataset, please provide your dataset definition in dataset_info.yaml.
By default, the framework loads datasets from the HuggingFace hub. To use local files instead, specify the local_path field.
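The field rules above can be checked before training starts. The following is a minimal sketch, not the framework's own loader: `normalize_dataset_info` is a hypothetical helper, and it assumes the entry has already been parsed into a plain dict (for example with a YAML parser).

```python
from typing import Any, Dict, Optional


def normalize_dataset_info(name: str, info: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Check one dataset_info.yaml entry and fill in defaults (hypothetical helper)."""
    info = dict(info or {})
    # Empty YAML values are parsed as None, so treat empty and missing alike.
    hub_url = info.get('hf_hub_url') or None
    local_path = info.get('local_path') or None
    # At least one data source has to be given.
    if hub_url is None and local_path is None:
        raise ValueError(f'{name}: set hf_hub_url or local_path')
    # dataset_format is documented as required.
    if not info.get('dataset_format'):
        raise ValueError(f'{name}: dataset_format is required')
    return {
        'hf_hub_url': hub_url,
        'local_path': local_path,
        'dataset_format': info['dataset_format'],
        # multi_turn defaults to False when omitted or left empty.
        'multi_turn': bool(info.get('multi_turn', False)),
    }


entry = normalize_dataset_info('alpaca', {
    'hf_hub_url': 'tatsu-lab/alpaca',
    'local_path': None,
    'dataset_format': 'alpaca',
})
print(entry)
```

Failing early on a missing dataset_format or a missing source gives a clearer error than a failure deep inside data loading.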
The dataset_format field specifies the format of the dataset and determines which processing method is applied to it. Currently, we support the following dataset formats.
- alpaca: Alpaca dataset
- dolly: Dolly dataset
- gpt4: GPT-4 generated dataset
- alpaca_cot: Alpaca CoT dataset
- oasst1: OpenAssistant/oasst1 dataset
- sharegpt: Multi-turn ShareGPT dataset
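One way to picture how dataset_format selects a processing method is a lookup table from format name to formatting function. The sketch below is illustrative only: the registry, the function names, and the use of plain lists of dicts (instead of HuggingFace `Dataset` objects) are assumptions, and only two formats are shown.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def format_alpaca(records: List[Record]) -> List[Record]:
    """Alpaca records already use instruction/input/output."""
    return records


def format_dolly(records: List[Record]) -> List[Record]:
    """Rename Dolly's context/response fields to input/output."""
    return [
        {
            'instruction': r['instruction'],
            'input': r.get('context', ''),
            'output': r['response'],
        }
        for r in records
    ]


# Hypothetical registry: dataset_format value -> formatting function.
FORMATTERS: Dict[str, Callable[[List[Record]], List[Record]]] = {
    'alpaca': format_alpaca,
    'dolly': format_dolly,
}


def format_records(records: List[Record], dataset_format: str) -> List[Record]:
    """Apply the formatter registered for dataset_format."""
    try:
        formatter = FORMATTERS[dataset_format]
    except KeyError:
        raise ValueError(f'Unsupported dataset_format: {dataset_format!r}') from None
    return formatter(records)


sample = [{'instruction': 'Say hi', 'context': '', 'response': 'Hi!'}]
print(format_records(sample, 'dolly'))
```

With this layout, supporting a new format means writing one function and adding one registry entry.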
If your dataset is not in the above format, there are two ways to use it.
- The first way is to implement the format_dataset function in data_utils. For example, the following is the _format_dolly15k function for the Dolly dataset.
from datasets import Dataset


def _format_dolly15k(dataset: Dataset) -> Dataset:
    """Format Dolly-15k dataset."""
    # Map Dolly's column names onto the Alpaca-style input/output columns.
    dataset = dataset.rename_column('context', 'input')
    dataset = dataset.rename_column('response', 'output')
    return dataset
- The second way is to convert your dataset to one of the above formats. For example, the following code converts databricks-dolly-15k to the Alpaca format.
import json


def convert_dolly_alpaca(in_file, out_file):
    # Assumes in_file holds a single JSON array of Dolly records.
    with open(in_file, 'r', encoding='utf-8') as file:
        contents = json.load(file)
    new_content = []
    for content in contents:
        new_content.append({
            'instruction': content['instruction'],
            # Dolly stores the optional input under 'context'
            # and the answer under 'response'.
            'input': content['context'],
            'output': content['response'],
        })
    print(f'#out: {len(new_content)}')
    with open(out_file, 'w', encoding='utf-8') as file:
        json.dump(new_content, file, indent=2, ensure_ascii=False)
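If the source file is stored as JSON Lines (one JSON object per line) rather than as a single JSON array, the same conversion can read it line by line. The following is a sketch under that assumption: `convert_dolly_jsonl_to_alpaca` is a hypothetical name, and the demo writes a one-record temporary file purely for illustration.

```python
import json
import os
import tempfile
from typing import Dict, List


def convert_dolly_jsonl_to_alpaca(in_file: str, out_file: str) -> List[Dict[str, str]]:
    """Convert a JSON Lines file of Dolly records to an Alpaca-format JSON array."""
    records = []
    with open(in_file, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            content = json.loads(line)
            records.append({
                'instruction': content['instruction'],
                'input': content.get('context', ''),
                'output': content['response'],
            })
    with open(out_file, 'w', encoding='utf-8') as file:
        json.dump(records, file, indent=2, ensure_ascii=False)
    return records


# Demo on a temporary one-record file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'dolly.jsonl')
    dst = os.path.join(tmp, 'dolly_alpaca.json')
    with open(src, 'w', encoding='utf-8') as f:
        f.write(json.dumps({'instruction': 'Add 1 and 2', 'context': '', 'response': '3'}) + '\n')
    converted = convert_dolly_jsonl_to_alpaca(src, dst)
    with open(dst, encoding='utf-8') as f:
        written = json.load(f)
print(converted)
```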
If your dataset is multi-turn, please set multi_turn: True in dataset_info.yaml. The framework will then process it as a multi-turn dataset automatically.
The following example shows the format of a multi-turn dataset.
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "human",
        "value": "Who are you?"
      },
      {
        "from": "gpt",
        "value": "I am Vicuna, a language model trained by researchers from Large Model Systems Organization (LMSYS)."
      },
      {
        "from": "human",
        "value": "What can you do?"
      },
      {
        "from": "gpt",
        "value": "I can chat with you."
      }
    ]
  },
  {
    "id": "identity_1",
    "conversations": [
      {
        "from": "human",
        "value": "Who are you?"
      },
      {
        "from": "gpt",
        "value": "My name is Vicuna, and I'm a language model developed by Large Model Systems Organization (LMSYS)."
      }
    ]
  }
]
For now, we only support the multi-turn dataset in the above format. If your dataset is not in the above format, please convert it. We also provide the following code to convert the Dolly dataset to the above format. You can find the code in convert_alpaca.
import argparse
import json
from typing import Any, Dict, List

from datasets import load_dataset


def convert_dolly_vicuna(raw_data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert Dolly records to the multi-turn conversation format."""
    collect_data = []
    for i, content in enumerate(raw_data):
        # Append the context to the question when one is provided.
        if len(content['context'].strip()) > 1:
            q = content['instruction'] + '\nInput:\n' + content['context']
        else:
            q = content['instruction']
        a = content['response']
        collect_data.append({
            'id': f'alpaca_{i}',
            'conversations': [
                {'from': 'human', 'value': q},
                {'from': 'gpt', 'value': a},
            ],
        })
    print(f'Original: {len(raw_data)}, Converted: {len(collect_data)}')
    return collect_data


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--in-file', type=str)
    parser.add_argument('--out-file', type=str)
    args = parser.parse_args()
    raw_data = load_dataset('json', data_files=args.in_file)['train']
    new_data = convert_dolly_vicuna(raw_data)
    # Write the converted records as a UTF-8 JSON array.
    with open(args.out_file, 'w', encoding='utf-8') as file:
        json.dump(new_data, file, indent=2, ensure_ascii=False)


if __name__ == '__main__':
    main()
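Since only the multi-turn layout shown earlier is supported, it can help to sanity-check a converted file before training. The checker below is a sketch, not part of the framework: `check_conversations` is a hypothetical helper, and it assumes that turns strictly alternate between "human" and "gpt", starting with "human", as in the example records.

```python
from typing import Any, Dict, List

ROLES = ('human', 'gpt')


def check_conversations(records: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means the records look well-formed."""
    problems = []
    for idx, record in enumerate(records):
        label = record.get('id', f'#{idx}')
        turns = record.get('conversations')
        if not isinstance(turns, list) or not turns:
            problems.append(f'{label}: missing or empty "conversations"')
            continue
        for pos, turn in enumerate(turns):
            # Assumption: turns alternate, starting with "human".
            expected = ROLES[pos % 2]
            if not isinstance(turn, dict) or turn.get('from') != expected:
                problems.append(f'{label}: turn {pos} should be from "{expected}"')
            elif not isinstance(turn.get('value'), str):
                problems.append(f'{label}: turn {pos} has no string "value"')
    return problems


good = [{'id': 'identity_1', 'conversations': [
    {'from': 'human', 'value': 'Who are you?'},
    {'from': 'gpt', 'value': 'I can chat with you.'},
]}]
bad = [{'id': 'broken', 'conversations': [{'from': 'gpt', 'value': 'Hi'}]}]
print(check_conversations(good))
print(check_conversations(bad))
```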
In the data/ directory, we provide several dataset info files used in the experiments. The following script shows how to use the alpaca_zh.yaml dataset info file.
python train.py \
    --model_name_or_path facebook/opt-125m \
    --dataset_cfg alpaca_zh.yaml \
    --output_dir work_dir/full-finetune \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --eval_steps 1000 \
    --save_steps 1000 \
    --save_total_limit 5 \
    --logging_steps 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --optim "adamw_torch" \
    --lr_scheduler_type "cosine" \
    --gradient_checkpointing True \
    --model_max_length 128 \
    --do_train \
    --do_eval
You can use alpaca_zh.yaml directly, or create a custom dataset config and set the dataset_cfg argument to your_dataset_info.yaml.