Custom pre-training dataset
#1467
Replies: 3 comments 2 replies
-
For pre-training, the JSON data accepts the content of only one field. Take the following example JSON file:
[
{
"completion": "昭通机场(ZPZT)是位于中国云南昭通的民用机场,始建于1935年,1960年3月开通往返航班“昆明-昭通”,原来属军民合用机场。1986年机场停止使用。1991年11月扩建,于1994年2月恢复通航。是西南地区「文明机场」,通航城市昆明。"
},
{
"completion": "派森百是中国大陆第一家生产非浓缩还原橙汁的公司,总部位于重庆渝中区,工厂位于重庆忠县。\n产品\n派森百公司主要生产非浓缩还原橙汁,即无添加剂不加水的所谓NFC(Not From Concentrated)橙汁。"
}
]
The dataset definition in dataset_info.json should then be:
"dataset_name": {
  "file_name": "xx.json",
  "columns": {
    "prompt": "completion"
  }
}
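The mapping above can be generated and sanity-checked with a small script. This is only a sketch using the standard library: the helper `build_dataset_entry`, the dataset name `my_pretrain_data`, and the placeholder records are assumptions made for illustration, not part of LLaMA-Factory.

```python
import json
import tempfile
from pathlib import Path


def build_dataset_entry(data_file: Path, field: str) -> dict:
    """Return a dataset_info.json entry that maps `prompt` to `field`.

    Hypothetical helper: it only checks that every record carries the
    chosen field as a string, then builds the mapping shown in the reply.
    """
    records = json.loads(data_file.read_text(encoding="utf-8"))
    for i, record in enumerate(records):
        if not isinstance(record.get(field), str):
            raise ValueError(f"record {i} has no string field {field!r}")
    return {"file_name": data_file.name, "columns": {"prompt": field}}


# Demo with placeholder data written to a temporary directory.
workdir = Path(tempfile.mkdtemp())
data_file = workdir / "xx.json"
data_file.write_text(
    json.dumps(
        [{"completion": "first document"}, {"completion": "second document"}],
        ensure_ascii=False,
    ),
    encoding="utf-8",
)

entry = build_dataset_entry(data_file, "completion")
dataset_info = {"my_pretrain_data": entry}  # hypothetical dataset name
(workdir / "dataset_info.json").write_text(
    json.dumps(dataset_info, ensure_ascii=False, indent=2), encoding="utf-8"
)
print(json.dumps(dataset_info, indent=2))
```

Checking the field up front means a mistyped column name fails at this step rather than later inside the training run.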
-
Does all pre-training data have to be stored in a single JSON file? Can it be loaded from a folder that contains multiple JSON files?
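Whether the loader can read a folder directly is not settled in this thread. One workaround that does not depend on it is to merge the files into a single JSON array beforehand. A minimal sketch, assuming every file holds a JSON array of records with the same field; the function name `merge_json_files` and the file names are hypothetical:

```python
import json
import tempfile
from pathlib import Path


def merge_json_files(folder: Path, output: Path) -> int:
    """Concatenate every *.json array in `folder` into `output`.

    Returns the number of merged records. Files are read in sorted
    order so the result is reproducible.
    """
    merged = []
    for path in sorted(folder.glob("*.json")):
        if path.resolve() == output.resolve():
            continue  # do not re-read a previous merge result
        records = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(records, list):
            raise ValueError(f"{path.name} does not contain a JSON array")
        merged.extend(records)
    output.write_text(
        json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(merged)


# Demo with two small placeholder files.
folder = Path(tempfile.mkdtemp())
(folder / "part1.json").write_text(
    json.dumps([{"completion": "a"}]), encoding="utf-8"
)
(folder / "part2.json").write_text(
    json.dumps([{"completion": "b"}, {"completion": "c"}]), encoding="utf-8"
)
merged_file = folder / "merged.json"
count = merge_json_files(folder, merged_file)
print(count)
```

The merged file can then be referenced through a single `file_name` entry in dataset_info.json.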
-
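I have a question for you. I want to run continued (incremental) pre-training on a domain-specific dataset (descriptions of roughly 400,000 products), mixed with general-domain data at a domain-to-general ratio of about 1:7. I plan to use ChatGLM3-6b, and I want the model to learn the domain knowledge while avoiding catastrophic forgetting as far as possible. There are a few questions I would like to ask you: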
-
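How do I customize the pre-training dataset format according to the required template? Do I have to modify the dataset itself, or is there an easier way? After defining my custom dataset, I ran into the following problem during training: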
Traceback (most recent call last):
File "/data/zym_conda/envs/llm-fine-tune/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/data/zym_conda/envs/llm-fine-tune/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/data/zym_proj/huan/projects/LLaMA-Factory/src/llmtuner/tuner/tune.py", line 24, in run_exp
run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
File "/data/zym_proj/huan/projects/LLaMA-Factory/src/llmtuner/tuner/pt/workflow.py", line 23, in run_pt
dataset = get_dataset(model_args, data_args)
File "/data/zym_proj/huan/projects/LLaMA-Factory/src/llmtuner/dsets/loader.py", line 118, in get_dataset
dataset = dataset.rename_column(getattr(dataset_attr, column_name), column_name)
File "/data/zym_conda/envs/llm-fine-tune/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/data/zym_conda/envs/llm-fine-tune/lib/python3.10/site-packages/datasets/fingerprint.py", line 511, in wrapper
out = func(dataset, *args, **kwargs)
File "/data/zym_conda/envs/llm-fine-tune/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2203, in rename_column
raise ValueError(
ValueError: Original column name ['author', 'title', 'content', 'id'] not in the dataset. Current columns in the dataset: ['author', 'title', 'content', 'id']
Any guidance would be much appreciated!
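Reading the traceback, the value passed to `rename_column` is the whole list `['author', 'title', 'content', 'id']` rather than a single column name, which suggests (this is an inference, not confirmed in the thread) that the `columns` mapping in dataset_info.json pointed at a list of fields. Since pre-training accepts only one field, the simplest fix is probably to map `prompt` to one column such as `content`. If several fields should end up in the training text, they can be joined into one field first. A sketch under those assumptions: the field names come from the error message, while `to_single_field`, the target field `text`, and `converted.json` are hypothetical:

```python
import json


def to_single_field(records, fields=("title", "content"), target="text"):
    """Join the chosen fields of each record into a single string field.

    Empty or missing fields are skipped; the remaining ones are joined
    with a newline in the order given by `fields`.
    """
    converted = []
    for record in records:
        parts = [str(record[name]).strip() for name in fields if record.get(name)]
        converted.append({target: "\n".join(parts)})
    return converted


# Placeholder records with the columns named in the error message.
sample = [
    {"author": "A", "title": "T1", "content": "Body one.", "id": 1},
    {"author": "B", "title": "", "content": "Body two.", "id": 2},
]
converted = to_single_field(sample)

# Matching dataset_info.json entry: a single column name, not a list.
dataset_entry = {"file_name": "converted.json", "columns": {"prompt": "text"}}
print(json.dumps(converted, ensure_ascii=False, indent=2))
```

With the data reduced to one field, the `columns` entry holds a single string, which is what the rename step in the traceback expects.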