
Fine-tuning chatglm3 on an RTX 4090 fails with: Current loss scale already at minimum - cannot decrease scale anymore #130

Open
450586509 opened this issue Jan 17, 2024 · 1 comment

Comments

@450586509

ss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4, reducing to 2
[2024-01-17 10:10:24,477] [INFO] [logging.py:96:log_dist] [Rank 0] step=16, skipped=16, lr=[0.0001], mom=[(0.9, 0.95)]
[2024-01-17 10:10:24,478] [INFO] [timer.py:260:stop] epoch=0/micro_step=64/global_step=16, RunningAvgSamplesPerSec=7.620608764200451, CurrSamplesPerSec=7.801349699356398, MemAllocated=13.44GB, MaxMemAllocated=14.82GB
0%| | 67/114599 [00:09<4:29:54, 7.07batch/s][2024-01-17 10:10:25,080] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2, reducing to 1
[2024-01-17 10:10:25,080] [INFO] [logging.py:96:log_dist] [Rank 0] step=17, skipped=17, lr=[0.0001], mom=[(0.9, 0.95)]
[2024-01-17 10:10:25,081] [INFO] [timer.py:260:stop] epoch=0/micro_step=68/global_step=17, RunningAvgSamplesPerSec=7.547169766053195, CurrSamplesPerSec=6.64997792221485, MemAllocated=13.44GB, MaxMemAllocated=14.82GB
0%| | 71/114599 [00:10<4:44:42, 6.70batch/s]
Traceback (most recent call last):
File "/root/ChatGLM-Finetuning/train.py", line 234, in
main()
File "/root/ChatGLM-Finetuning/train.py", line 195, in main
model.step()
File "/root/miniconda3/envs/ft/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2148, in step
self._take_model_step(lr_kwargs)
File "/root/miniconda3/envs/ft/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2054, in _take_model_step
self.optimizer.step()
File "/root/miniconda3/envs/ft/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1778, in step
self._update_scale(self.overflow)
File "/root/miniconda3/envs/ft/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2029, in _update_scale
self.loss_scaler.update_scale(has_overflow)
File "/root/miniconda3/envs/ft/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
[2024-01-17 10:10:28,152] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1997
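What the log shows is DeepSpeed's dynamic fp16 loss scaling: every time the backward pass produces inf/NaN gradients, the optimizer step is skipped and the loss scale is halved (4 → 2 → 1 in the log). Once the scale reaches its minimum, the next overflow raises the exception in the traceback. The following is a minimal, simplified sketch of that mechanism (not DeepSpeed's actual `loss_scaler.py`, whose class and defaults differ):

```python
# Simplified sketch of dynamic fp16 loss scaling as used by DeepSpeed.
# On each overflow the scale is halved; once it hits the minimum, a
# further overflow raises, which is exactly the failure in the log above.

MIN_SCALE = 1


class DynamicLossScaler:
    def __init__(self, init_scale=65536):
        self.scale = init_scale

    def update_scale(self, has_overflow):
        if has_overflow:
            if self.scale <= MIN_SCALE:
                raise Exception(
                    "Current loss scale already at minimum - "
                    "cannot decrease scale anymore."
                )
            # Halve the scale and skip this optimizer step.
            self.scale //= 2
        # (The real scaler also grows the scale again after a run of
        # overflow-free steps; omitted here for brevity.)


scaler = DynamicLossScaler(init_scale=4)
scaler.update_scale(True)   # 4 -> 2, as in the log
scaler.update_scale(True)   # 2 -> 1
# a third consecutive overflow would raise, matching the traceback
```

In other words, the exception is a symptom, not the root cause: the training run is producing overflowing fp16 gradients on every step (note `skipped=17` at `step=17`), so the scaler runs out of room to shrink.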

@450586509 450586509 changed the title "Running on an RTX 4090 fails with: Current loss scale already at minimum - cannot decrease scale anymore" to "Fine-tuning chatglm3 on an RTX 4090 fails with: Current loss scale already at minimum - cannot decrease scale anymore" Jan 17, 2024
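Since the RTX 4090 (Ada architecture) supports bfloat16 natively, one commonly suggested workaround for persistent fp16 overflow is to train in bf16 instead, which needs no loss scaling at all. A sketch of the relevant DeepSpeed config change (assuming the repo's `ds_config` JSON is editable; field names follow DeepSpeed's documented config schema):

```json
{
  "fp16": { "enabled": false },
  "bf16": { "enabled": true }
}
```

If staying on fp16, other levers worth trying are a lower learning rate, gradient clipping, and checking the data/model for NaN-producing inputs; any of these can stop the every-step overflow that drives the scale down to its minimum.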
@origi6615

Hi, did you manage to solve this?
