[BUG] OOM error during QLoRA training of Qwen-72B-chat-Int4 on a single machine with 8x A100 #1043
Comments
@KevinFan0 @JustinLin610 @JianxinMa I am getting the same error when using a single machine with 8 x V100, 32 GB each, even with batch size = 1. There are no other processes running apart from this one. Have you managed to solve this somehow? Many thanks. Here is the error log (see below):

| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4 |
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-02-06 00:36:46,618] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Loading checkpoint shards: 100%|██████████| 21/21 [00:48<00:00, 1.71s/it]
  File "/home/mentox/project/qwen_72b_int4/finetune.py", line 374, in <module>
  File "/home/mentox/project/qwen_72b_int4/finetune.py", line 367, in train
  File "/home/mentox/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1539, in train
  File "/home/mentox/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1690, in _inner_training_loop
  File "/home/mentox/miniconda3/lib/python3.11/site-packages/accelerate/accelerator.py", line 1219, in prepare
  File "/home/mentox/miniconda3/lib/python3.11/site-packages/deepspeed/__init__.py", line 171, in initialize
  File "/home/mentox/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1103, in _configure_distributed_model
  File "/home/mentox/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/home/mentox/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 5 more times]
    self._buffers[key] = fn(buf)
  File "/home/mentox/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
Reduce model_max_length: drop it from 8192 to 512 first, then increase it step by step.
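For example, one way to apply that suggestion to the script posted further down in this issue (a sketch only; it assumes the launcher is the finetune/finetune_qlora_ds.sh shown there, with model_max_length hard-coded at 8192 and the same default model and data paths):

# Hypothetical probe run: start with a short context, then raise model_max_length
# step by step once the short run fits in memory.
sed -i 's/--model_max_length 8192/--model_max_length 512/' finetune/finetune_qlora_ds.sh
bash finetune/finetune_qlora_ds.sh -m Qwen__Qwen-72B-Chat-Int4 -d sft_train.json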
With 8x A100 you can even do full-parameter finetuning.
@WangJianQ-cmd Thanks for your reply. Unfortunately I am still getting OOM even when I reduce model_max_length to values lower than 512. The same applies for any value larger than 512. Thanks
@Yuxiang1995 Thanks for getting back to me! I only have 8 V100, not A100. Do you think it would still be possible?
Hi, did you manage to solve the problem? I encountered the same problem.
This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.
Is there an existing issue / discussion for this?
Is there an existing answer for this in FAQ?
Current Behavior
When fine-tuning Qwen-72B-chat-Int4 with QLoRA on 8x A100, setting model_max_length to 8192 triggers an OOM error. My training script is below. Is there any way to fix this? My training samples are not particularly long.
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`
GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())')
NNODES=${NNODES:-1}
NODE_RANK=${NODE_RANK:-0}
MASTER_ADDR=${MASTER_ADDR:-localhost}
MASTER_PORT=${MASTER_PORT:-6001}
MODEL="Qwen__Qwen-72B-Chat-Int4"
DATA="sft_train.json"
function usage() {
echo '
Usage: bash finetune/finetune_qlora_ds.sh [-m MODEL_PATH] [-d DATA_PATH]
'
}
while [[ "$1" != "" ]]; do
case $1 in
-m | --model )
shift
MODEL=$1
;;
-d | --data )
shift
DATA=$1
;;
-h | --help )
usage
exit 0
;;
* )
echo "Unknown argument ${1}"
exit 1
;;
esac
shift
done
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE
--nnodes $NNODES
--node_rank $NODE_RANK
--master_addr $MASTER_ADDR
--master_port $MASTER_PORT
"
# Remember to use --fp16 instead of --bf16 due to autogptq
torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --fp16 True \
    --output_dir /home/qs/output \
    --num_train_epochs 100 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000000 \
    --save_total_limit 10 \
    --learning_rate 3e-4 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "tensorboard" \
    --model_max_length 8192 \
    --lazy_preprocess True \
    --use_lora \
    --q_lora \
    --gradient_checkpointing \
    --deepspeed finetune/ds_config_zero2.json
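The --deepspeed flag points at finetune/ds_config_zero2.json, which is not reproduced in this issue. For orientation, the sketch below writes an illustrative ZeRO stage-2 config with optimizer CPU offload; the file name and values are assumptions, not the repository's shipped config, and offloading mainly helps when optimizer states (rather than activations) are what overflow, which may not be the bottleneck for small LoRA adapters.

# Illustrative DeepSpeed config (assumed values, not the repo's ds_config_zero2.json).
# ZeRO stage 2 shards optimizer states and gradients across the GPUs;
# offload_optimizer additionally moves optimizer states to CPU RAM,
# trading step time for GPU memory.
cat > finetune/ds_config_zero2_offload.json <<'EOF'
{
  "fp16": { "enabled": "auto" },
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_clipping": "auto"
}
EOF
# Then launch with: --deepspeed finetune/ds_config_zero2_offload.json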
This is the error message:
File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/root/miniconda3/lib/python3.10/site-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 2; 79.35 GiB total capacity; 69.52 GiB already allocated; 7.83 GiB free; 69.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
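The allocator hint at the end of that message can be tried directly before launching torchrun; a minimal sketch (the 512 MiB split size is an arbitrary starting value, and this only mitigates fragmentation, not a genuine shortage of memory):

# Cap the size of cached allocator blocks to reduce fragmentation-related OOMs.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512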
Expected Behavior
Training should run without an OOM error.
Steps To Reproduce
sh finetune_lora_ds.sh
Environment
Anything else?
No response