[BUG] OOM error during QLoRA training of Qwen-72B-chat-Int4 on a single machine with 8x A100 #1043
Comments
@KevinFan0 @JustinLin610 @JianxinMa I am getting the same error when using a single machine with 8 x V100, 32 GB each, even with batch size = 1. There are no other processes running apart from this one. Have you managed to solve this somehow? Many thanks. Here is the error log (see below):

| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4 |
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-02-06 00:36:46,618] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Loading checkpoint shards: 100%|██████████| 21/21 [00:48<00:00, 1.71s/it]
  File "/home/mentox/project/qwen_72b_int4/finetune.py", line 374, in <module>
  File "/home/mentox/project/qwen_72b_int4/finetune.py", line 367, in train
  File "/home/mentox/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1539, in train
  File "/home/mentox/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1690, in _inner_training_loop
  File "/home/mentox/miniconda3/lib/python3.11/site-packages/accelerate/accelerator.py", line 1219, in prepare
  File "/home/mentox/miniconda3/lib/python3.11/site-packages/deepspeed/__init__.py", line 171, in initialize
  File "/home/mentox/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1103, in _configure_distributed_model
  File "/home/mentox/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/home/mentox/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 5 more times]
    self._buffers[key] = fn(buf)
  File "/home/mentox/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
Reduce model_max_length: drop it from 8192 to 512 first, then increase it step by step.
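For example, one way to apply that suggestion to the script posted further down in this issue (a sketch only; it assumes the launcher is the finetune/finetune_qlora_ds.sh shown there, with model_max_length hard-coded at 8192 and the same default model and data paths):

# Hypothetical probe run: start with a short context, then raise model_max_length
# step by step once the short run fits in memory.
sed -i 's/--model_max_length 8192/--model_max_length 512/' finetune/finetune_qlora_ds.sh
bash finetune/finetune_qlora_ds.sh -m Qwen__Qwen-72B-Chat-Int4 -d sft_train.json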
With 8x A100 you can even do full-parameter finetuning.
@WangJianQ-cmd Thanks for your reply. Unfortunately I am still getting OOM even when I reduce model_max_length to values lower than 512. The same applies for any value larger than 512. Thanks
@Yuxiang1995 Thanks for getting back to me! I only have 8 V100, not A100. Do you think it would still be possible?
Hi, did you manage to solve the problem? I encountered the same problem.
This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.
Is there an existing issue / discussion for this?
Is there an existing answer for this in FAQ?
Current Behavior
When fine-tuning Qwen-72B-chat-Int4 with QLoRA on 8x A100, setting model_max_length to 8192 triggers an OOM error. My training script is below. Is there any way to fix this? My training samples are not particularly long.
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`
GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())')
NNODES=${NNODES:-1}
NODE_RANK=${NODE_RANK:-0}
MASTER_ADDR=${MASTER_ADDR:-localhost}
MASTER_PORT=${MASTER_PORT:-6001}
MODEL="Qwen__Qwen-72B-Chat-Int4"
DATA="sft_train.json"
function usage() {
echo '
Usage: bash finetune/finetune_qlora_ds.sh [-m MODEL_PATH] [-d DATA_PATH]
'
}
while [[ "$1" != "" ]]; do
case $1 in
-m | --model )
shift
MODEL=$1
;;
-d | --data )
shift
DATA=$1
;;
-h | --help )
usage
exit 0
;;
* )
echo "Unknown argument ${1}"
exit 1
;;
esac
shift
done
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE
--nnodes $NNODES
--node_rank $NODE_RANK
--master_addr $MASTER_ADDR
--master_port $MASTER_PORT
"
# Remember to use --fp16 instead of --bf16 due to autogptq
torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --fp16 True \
    --output_dir /home/qs/output \
    --num_train_epochs 100 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000000 \
    --save_total_limit 10 \
    --learning_rate 3e-4 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "tensorboard" \
    --model_max_length 8192 \
    --lazy_preprocess True \
    --use_lora \
    --q_lora \
    --gradient_checkpointing \
    --deepspeed finetune/ds_config_zero2.json
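The --deepspeed flag points at finetune/ds_config_zero2.json, which is not reproduced in this issue. For orientation, the sketch below writes an illustrative ZeRO stage-2 config with optimizer CPU offload; the file name and values are assumptions, not the repository's shipped config, and offloading mainly helps when optimizer states (rather than activations) are what overflow, which may not be the bottleneck for small LoRA adapters.

# Illustrative DeepSpeed config (assumed values, not the repo's ds_config_zero2.json).
# ZeRO stage 2 shards optimizer states and gradients across the GPUs;
# offload_optimizer additionally moves optimizer states to CPU RAM,
# trading step time for GPU memory.
cat > finetune/ds_config_zero2_offload.json <<'EOF'
{
  "fp16": { "enabled": "auto" },
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_clipping": "auto"
}
EOF
# Then launch with: --deepspeed finetune/ds_config_zero2_offload.json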
This is the error message:
File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/root/miniconda3/lib/python3.10/site-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 2; 79.35 GiB total capacity; 69.52 GiB already allocated; 7.83 GiB free; 69.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
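The allocator hint at the end of that message can be tried directly before launching torchrun; a minimal sketch (the 512 MiB split size is an arbitrary starting value, and this only mitigates fragmentation, not a genuine shortage of memory):

# Cap the size of cached allocator blocks to reduce fragmentation-related OOMs.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512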
Expected Behavior
Training should run without an OOM error.
Steps To Reproduce
sh finetune_lora_ds.sh
Environment
Anything else?
No response