
Multi-GPU training doesn't seem to saturate each card; how can I get every GPU's compute load up to full utilization? #66

Open
RayneSun opened this issue Jul 19, 2023 · 13 comments

Comments

@RayneSun

I set CUDA_VISIBLE_DEVICES and device_map, and when running on 2 A100s both cards do show memory usage, but the GPU compute load is always high on one card and very low on the other.

@jianzhnie
Owner

Which training method are you using?

@RayneSun
Author

LoRA, training Baichuan-13B.

@jianzhnie
Owner

That shouldn't happen; when I train, the cards are basically all fully utilized.

@RayneSun
Author

[screenshot of GPU utilization]
It looks roughly like this; it behaves a bit like pipeline parallelism.

@RayneSun
Author

Could it be because I'm not using DeepSpeed? Could you please share the shell script you use to run Baichuan-13B?

@jianzhnie
Owner

Model parallelism may be enabled at this spot; try commenting out those two lines.
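
For reference, the loading code in question presumably looks something like the sketch below (a hypothetical reconstruction, not the exact lines in train_lora.py). With device_map="auto", Accelerate places different layers on different GPUs, so only one GPU is computing at any moment, which matches the pipeline-like utilization pattern described above:

```python
# Hypothetical sketch (not the exact lines in train_lora.py) of the kind of
# loading code that turns on naive model parallelism via device_map="auto".
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "../Baichuan-13B-Chat",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",  # splits layers across GPU 0 and GPU 1; only one GPU is busy at a time
)
```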

@RayneSun
Author

```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=1 train_lora.py \
    --model_name_or_path ../Baichuan-13B-Chat \
    --dataset_name train.json,test.json \
    --data_dir ../../data/toolbench \
    --load_from_local yes \
    --output_dir baichuan-lora \
    --max_steps 50000 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 1000 \
    --learning_rate 5e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.07 \
    --optim "adamw_torch" \
    --lr_scheduler_type "linear" \
    --model_max_length 2560 \
    --source_max_len 2048 \
    --target_max_len 512 \
    --logging_steps 5 \
    --do_train \
    --gradient_checkpointing True \
    --trust_remote_code true \
    --lora_target_modules W_pack \
    --deepspeed "ds_config_zero3_auto.json"
```

@RayneSun
Author

I commented out the two lines you mentioned, but when I run training, only one card still shows high utilization.

@RayneSun
Author

RayneSun commented Jul 21, 2023

Also, I removed the device_map configuration from train_lora:
[screenshot of the edited code]

because keeping it raises this error:
ValueError: DeepSpeed Zero-3 is not compatible with low_cpu_mem_usage=True or with passing a device_map.
Could this be related?
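
For context, a ZeRO-3 run expects the model to be loaded without a device_map (and without low_cpu_mem_usage), since DeepSpeed itself partitions the parameters across the data-parallel processes. A minimal sketch under that assumption (not the exact train_lora.py code):

```python
# Minimal sketch of a ZeRO-3-compatible load (assumed, not the exact
# train_lora.py code): no device_map and no low_cpu_mem_usage, so the
# HF Trainer plus DeepSpeed can partition parameters across the ranks.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "../Baichuan-13B-Chat",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)
```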

@RayneSun
Author

I think I found the problem: the launch argument needs to be set to --nproc_per_node=2.
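
With --nproc_per_node=2, torchrun starts one process per visible GPU so both cards do data-parallel work. A quick hypothetical sanity check (check_ranks.py is not part of this repo) to confirm each process binds its own GPU:

```python
# check_ranks.py: hypothetical sanity check, launched with
#   CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 check_ranks.py
# to confirm that each of the two processes binds exactly one GPU.
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} -> cuda:{torch.cuda.current_device()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```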

@wgzhendong

> I think I found the problem: the launch argument needs to be set to --nproc_per_node=2.

Were you able to train to completion? I'm running the same training code as you and it crashed after 200 steps.

@RayneSun
Author

In the end I didn't use DeepSpeed, but then training was actually much slower.
