Issue in train in colab #42

fermions75 opened this issue Apr 9, 2023 · 7 comments

@fermions75

When I run training in Colab, this error is shown:

Something went wrong
Connection errored out.

How can I solve this?

@alior101 commented Apr 9, 2023

Getting this error too...

@lxe (Owner) commented Apr 11, 2023

I'm guessing it's running out of RAM? Are you using the high-RAM environment?
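
One quick way to check which RAM profile the runtime actually got is a sketch like this (psutil ships in Colab's default image):

```python
# Report total/available system RAM in the Colab runtime.
import psutil

mem = psutil.virtual_memory()
print(f"Total RAM:     {mem.total / 1e9:.1f} GB")
print(f"Available RAM: {mem.available / 1e9:.1f} GB")
```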

lxe added the colab label on Apr 11, 2023

@fermions75 (Author)
No, I did not. I just tried using Colab Pro. I used the base model cerebras/Cerebras-GPT-2.7B. When I press Train, the following error shows in Colab:
[screenshot: Colab error output]
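
For reference, loading that base model in half precision looks roughly like the snippet below; this is a sketch of the usual transformers pattern, not the exact code the notebook runs.

```python
# Illustrative load of the base model named above; assumes transformers
# and accelerate are installed, as in the Colab notebook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "cerebras/Cerebras-GPT-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # ~2 bytes/param, so ~5.4 GB of weights
    device_map="auto",          # lets accelerate place weights on the GPU
)
```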

@MillionthOdin16

No, it's broken. It works on Hugging Face now, but it can't download LoRAs.

@rs189 commented Apr 22, 2023

I have the same issue. I even tried running it without Gradio's tunnel, using another third-party tunnel instead, but I get the same error.
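
For context, the two launch modes being compared look roughly like this; the Interface here is a placeholder, not the app's real UI:

```python
# Placeholder Gradio app to illustrate the launch options involved.
import gradio as gr

demo = gr.Interface(fn=lambda text: text, inputs="text", outputs="text")

# What the Colab notebook uses: Gradio's built-in share tunnel.
# demo.launch(share=True)

# Local server only, to be exposed via a third-party tunnel instead:
demo.launch(server_name="0.0.0.0", server_port=7860, share=False)
```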

@Clybius commented Apr 23, 2023

I should note that for me Colab does in fact work, but only in an A100 Colab instance with more than 64 GB of RAM. Usage seemed to spike to ~36+ GB, which is more than the maximum for the free tier / standard RAM profile. This leads me to think it's just the RAM limitation of the lower Colab tiers.

Trying it on the standard RAM profile with a V100 (~20-24 GB of RAM), I hit the issue listed in the original post.
Trying it locally on a machine with 32 GB of RAM and a P100, I have the same problem: RAM spikes, the machine invokes the OOM killer, and the process is ended.
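
A minimal way to confirm the spike on any of these setups is to log RAM from a background thread while training runs; a sketch, assuming psutil is available:

```python
# Log system RAM every couple of seconds while training runs, to catch
# the spike before the OOM killer ends the process.
import threading
import time

import psutil

def log_ram(interval_s: float = 2.0) -> None:
    while True:
        used_gb = psutil.virtual_memory().used / 1e9
        print(f"RAM used: {used_gb:.1f} GB")
        time.sleep(interval_s)

threading.Thread(target=log_ram, daemon=True).start()
```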

@rs189 commented Apr 24, 2023

> I should note that for me Colab does in fact work, but only in an A100 Colab instance with more than 64 GB of RAM. Usage seemed to spike to ~36+ GB, which is more than the maximum for the free tier / standard RAM profile. This leads me to think it's just the RAM limitation of the lower Colab tiers.
>
> Trying it on the standard RAM profile with a V100 (~20-24 GB of RAM), I hit the issue listed in the original post. Trying it locally on a machine with 32 GB of RAM and a P100, I have the same problem: RAM spikes, the machine invokes the OOM killer, and the process is ended.

What model and dataset are you using to generate and train? In my case this happens even with the half-precision 7B LLaMA model and the default "unhelpful" example. I can even generate with it on my PC, which has only 8 GB of VRAM; I can't train, however. But I don't believe fine-tuning a half-precision 7B LLaMA should demand more than the 15 GB of VRAM that Colab provides for free. As you can see, the crash ("Connection errored out") occurs well before RAM and/or VRAM is saturated.

[screenshot: the error appears before RAM/VRAM are saturated]
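
For rough context, the weights alone dominate the load-time footprint; a back-of-the-envelope sketch (illustrative arithmetic, not measurements from this repo):

```python
# Rough weight-memory arithmetic for a 7B-parameter model.
params = 7e9
fp16_gb = params * 2 / 1e9  # ~14 GB: why fp16 inference fits a 15 GB GPU
fp32_gb = params * 4 / 1e9  # ~28 GB: if the checkpoint is materialized
                            # in fp32 in system RAM before casting
print(f"fp16 weights: {fp16_gb:.0f} GB, fp32 weights: {fp32_gb:.0f} GB")
```

If the loader materializes the checkpoint in system RAM before casting and moving it to the GPU (an assumption, not verified in this thread), that alone would match the ~30+ GB RAM spikes reported above.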
