GOOGLE COLLAB works well for 2 days, then breaks. Why? #2827

Open
LIQUIDMIND111 opened this issue May 4, 2024 · 4 comments

@LIQUIDMIND111

I get a good model for a day or two, then on the next training run I get this:

Traceback (most recent call last):
  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 803, in <module>
    main()
  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 535, in main
    import bitsandbytes as bnb
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/__init__.py", line 6, in <module>
    from .autograd._functions import (
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 5, in <module>
    import bitsandbytes.functional as F
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/functional.py", line 13, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py", line 41, in <module>
    lib = CUDALibrary_Singleton.get_instance().lib
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py", line 37, in get_instance
    cls._instance.initialize()
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py", line 27, in initialize
    raise Exception('CUDA SETUP: Setup Failed!')
Exception: CUDA SETUP: Setup Failed!
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '/content/diffusers/examples/dreambooth/train_dreambooth.py', '--image_captions_filename', '--train_only_unet', '--save_starting_step=500', '--save_n_steps=0', '--Session_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/NicoleTEST768-TEXT4NXI', '--pretrained_model_name_or_path=/content/stable-diffusion-v1-5', '--instance_data_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/NicoleTEST768-TEXT4NXI/instance_images', '--output_dir=/content/models/NicoleTEST768-TEXT4NXI', '--captions_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/NicoleTEST768-TEXT4NXI/captions', '--instance_prompt=', '--seed=869457', '--resolution=768', '--mixed_precision=fp16', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--use_8bit_adam', '--learning_rate=2e-06', '--lr_scheduler=linear', '--lr_warmup_steps=0', '--max_train_steps=1500']' returned non-zero exit status 1.
Something went wrong
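
For anyone trying to narrow this down: the first traceback fails at `import bitsandbytes as bnb`, before any training starts, so the failure can probably be reproduced in a single notebook cell without launching the full script. A minimal sketch, assuming the same Colab runtime; `probe_import` is a hypothetical helper name, not part of the notebook or of bitsandbytes:

```python
import importlib


def probe_import(module_name: str) -> tuple[bool, str]:
    """Try to import a module and report whether it succeeded.

    Returns (True, "ok") on success, otherwise (False, "<ErrorType>: <message>").
    """
    try:
        importlib.import_module(module_name)
    except Exception as exc:
        # The traceback above shows bitsandbytes raising a plain Exception
        # during import, so catch broadly and report the type and message.
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"


if __name__ == "__main__":
    ok, detail = probe_import("bitsandbytes")
    print(f"bitsandbytes import ok={ok} ({detail})")
```

If the import fails here as well, the problem is in the bitsandbytes CUDA setup on that runtime rather than in the training arguments.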

@LIQUIDMIND111
Author

The CUDA setup always fails.

@LIQUIDMIND111
Author

Same issue for me. Looks like the owner either doesn't know how to fix this or isn't fussed anymore

It's working now, but only after I changed the Google runtime from L4 to T4; yesterday I used an A100 with no issues. Maybe it's an error on both sides, the Google GPU and the Colab page?

@LIQUIDMIND111
Author

@TheLastBen I found the glitch: with an L4 GPU the run fails with the CUDA setup error, while on an A100 or T4 there is no error. The downside is that we are paying for Google credits or Pro and still cannot use the faster GPUs, because the A100 is not always available and costs 11.30 credits per hour, compared with 4 credits an hour for the L4. So when the A100 is unavailable, we end up paying only for more time instead of a faster GPU, since the L4 gives the CUDA error.

Are you aware of this issue?
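
Until there is a fix, a quick check of which GPU the runtime was assigned can save a wasted session. A rough sketch, assuming `nvidia-smi` is on the PATH as it normally is on Colab GPU runtimes; the function names and the two lists are hypothetical and only encode what was reported in this thread, not an official compatibility table:

```python
import subprocess
from typing import Optional

# GPU models as reported in this thread only (assumption, not a support matrix).
REPORTED_FAILING = ("L4",)
REPORTED_WORKING = ("T4", "A100")


def classify_gpu(name: str) -> str:
    """Classify a GPU name string by the reports in this thread."""
    # Split names such as "NVIDIA A100-SXM4-40GB" into whole tokens so that
    # "L4" does not accidentally match a different model like "L40S".
    tokens = name.replace("-", " ").split()
    if any(t in REPORTED_FAILING for t in tokens):
        return "reported-failing"
    if any(t in REPORTED_WORKING for t in tokens):
        return "reported-working"
    return "unknown"


def current_gpu_name() -> Optional[str]:
    """Return the first GPU name from nvidia-smi, or None if unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    lines = out.strip().splitlines()
    return lines[0].strip() if lines else None


if __name__ == "__main__":
    name = current_gpu_name()
    if name is None:
        print("No GPU detected (nvidia-smi unavailable or failed).")
    else:
        print(f"{name}: {classify_gpu(name)}")
```

If it prints `reported-failing`, switching the runtime type before starting training is the workaround that worked here.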

@TheLastBen
Owner

I'm aware, I'll try to find a fix
