[BUG/Help] <title>After building Q&A data following the official multi-turn Q&A dataset format, running the script command fails with Traceback (most recent call last): File "/mnt/ChatGLM2-6B/ptuning/main.py", line 411, in <module> main() File "/mnt/ChatGLM2-6B/ptuning/main.py", line 229, in main train_dataset = train_dataset.map( File "/root/anaconda3/envs/GLM2/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 592, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) File "/root/anaconda3/envs/GLM2/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 557, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) File "/root/anaconda3/envs/GLM2/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3180, in map with Pool(len(kwargs_per_job)) as pool: #666

Open
1 task done
nevesaynever1 opened this issue Feb 22, 2024 · 0 comments

Comments

@nevesaynever1

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

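The dataset is built in the following format:
{"content": "The fan on my Great Wall H3 won't spin. The relay is good. The fuse is good, the sensor is new and the fan is new too, so why is this happening? The relay is just missing a signal wire", "summaty": "Can you read the data stream with a computer? What is the water temperature?", "history": []}
{"content": "95", "summaty": "What is the temperature difference between the upper and lower water hoses? Has all the air been bled out?", "history": [["The fan on my Great Wall H3 won't spin. The relay is good. The fuse is good, the sensor is new and the fan is new too, so why is this happening? The relay is just missing a signal wire", "Can you read the data stream with a computer? What is the water temperature?"]]}
{"content": "Yes. The upper and lower water hoses are both fine", "summaty": "Then you need to check the wiring. The fan relay is normally engaged under the computer's control, so if there is an open circuit or the computer has failed, the relay will not engage!", "history": [["The fan on my Great Wall H3 won't spin. The relay is good. The fuse is good, the sensor is new and the fan is new too, so why is this happening? The relay is just missing a signal wire", "Can you read the data stream with a computer? What is the water temperature?"], ["95", "What is the temperature difference between the upper and lower water hoses? Has all the air been bled out?"]]}
{"content": "Hello", "summaty": "Hello", "history": []}
{"content": "May I ask who you are?", "summaty": "I am chatGLm2-6B", "history": [["Hello", "Hello"]]}
{"content": "Are you an excellent AI assistant?", "summaty": "Yes, I am.", "history": [["Hello", "Hello"], ["May I ask who you are?", "I am chatGLm2-6B"]]}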
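The records above follow a pattern: one JSON line per turn, with `history` holding every earlier [prompt, response] pair of the same conversation. A minimal sketch of generating such lines is shown below. `build_records` is a hypothetical helper, not part of the ChatGLM2-6B repository, and the key names are passed in explicitly; the values used here are the ones given to `--prompt_column`, `--response_column` and `--history_column` in this report's train.sh.

```python
import json


def build_records(turns, prompt_key, response_key, history_key):
    """Expand one conversation into one record per turn.

    `turns` is a list of (prompt, response) pairs in order. Each record
    carries all earlier pairs of the same conversation as its history.
    """
    records = []
    history = []
    for prompt, response in turns:
        records.append({
            prompt_key: prompt,
            response_key: response,
            # Copy the pairs so later turns do not mutate earlier records.
            history_key: [list(pair) for pair in history],
        })
        history.append((prompt, response))
    return records


if __name__ == "__main__":
    conversation = [
        ("Hello", "Hello"),
        ("May I ask who you are?", "I am chatGLm2-6B"),
    ]
    # Key names assumed to match the column flags passed to main.py.
    for record in build_records(conversation, "content", "summary", "history"):
        print(json.dumps(record, ensure_ascii=False))
```

Generating the lines this way keeps the key names in one place, so they can be compared directly against the column flags.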

The train.sh script is:
PRE_SEQ_LEN=128
LR=1e-2

CUDA_VISIBLE_DEVICES=0 python main.py \
    --do_train \
    --train_file train.json \
    --validation_file dev.json \
    --preprocessing_num_workers 10 \
    --prompt_column content \
    --response_column summary \
    --history_column history \
    --overwrite_cache \
    --model_name_or_path THUDM/chatglm2-6b \
    --output_dir ../output \
    --overwrite_output_dir \
    --max_source_length 256 \
    --max_target_length 256 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --predict_with_generate \
    --max_steps 3000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN

Running the script above produces the following error:
Traceback (most recent call last):
  File "/mnt/ChatGLM2-6B/ptuning/main.py", line 411, in <module>
    main()
  File "/mnt/ChatGLM2-6B/ptuning/main.py", line 229, in main
    train_dataset = train_dataset.map(
  File "/root/anaconda3/envs/GLM2/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/root/anaconda3/envs/GLM2/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/root/anaconda3/envs/GLM2/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3180, in map
    with Pool(len(kwargs_per_job)) as pool:
Is this caused by the data format? I checked, and it matches the format of the official data.
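The traceback as posted stops at the `Pool(...)` line and does not include the final exception, so the cause cannot be confirmed from this report alone. One discrepancy is visible in the text, though: the data uses the key `summaty`, while train.sh passes `--response_column summary`. A minimal sketch for checking a JSON Lines file against the column names follows. `find_missing_columns` is a hypothetical helper written for illustration and is not part of the ChatGLM2-6B repository.

```python
import json


def find_missing_columns(jsonl_text, required):
    """Return (line_number, missing_keys) for every JSON line that lacks a required key."""
    problems = []
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines between records
        record = json.loads(line)
        missing = [key for key in required if key not in record]
        if missing:
            problems.append((number, missing))
    return problems


if __name__ == "__main__":
    # One record copied from the report, with its key spelling left as posted.
    sample = '{"content": "Hello", "summaty": "Hello", "history": []}'
    # Column names taken from the train.sh flags in the report.
    print(find_missing_columns(sample, ["content", "summary", "history"]))
    # -> [(1, ['summary'])]
```

In practice the text would come from reading train.json and dev.json; an empty result means every record carries all three columns.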

Expected Behavior

No response

Steps To Reproduce

1. Build train.json in the multi-turn format shown under Current Behavior.
2. Run the train.sh script shown under Current Behavior.
3. The run fails with the traceback shown under Current Behavior, which ends at `with Pool(len(kwargs_per_job)) as pool:` in datasets/arrow_dataset.py.

Environment

- OS: centos
- Python:3.9
- Transformers:4.30.2
- PyTorch:2.0.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : CU118

Anything else?

No response
