Refresh instead of timing out #15

Open
paplorinc opened this issue Apr 19, 2023 · 3 comments

Comments

@paplorinc
Contributor

paplorinc commented Apr 19, 2023

My current training run takes 35 hours, so it will time out unless we refresh or increase the timeout substantially.

@zetavg
Owner

zetavg commented Apr 19, 2023

I'm thinking of not relying on Gradio's loading for the training process; I don't think it's suitable for tasks that last minutes or hours. The progress can't be monitored on multiple devices, and once the page is closed or disconnected there's no way to hook back into the training progress, so you have to rely on the terminal to monitor or abort it.

Instead, we can put the training into a subprocess, run it in the background, and let the UI poll for its status, which lets us see and control the progress from multiple devices. We'd have to craft a loading UI and block other features, such as inference, during fine-tuning, though.
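A minimal sketch of the subprocess-plus-polling idea, to make it concrete. It is not the project's actual implementation: `BackgroundJob` and its method names are hypothetical, and only the standard-library `subprocess.Popen` API is assumed. Any number of clients can call `status()` because the state lives in the server process rather than in a single browser session.

```python
import subprocess
import sys


class BackgroundJob:
    """Hypothetical wrapper: runs a command in a subprocess and reports its state."""

    def __init__(self, command):
        self.command = command
        self.process = None
        self.aborted = False

    def is_running(self):
        # poll() returns None while the child process is still alive.
        return self.process is not None and self.process.poll() is None

    def start(self):
        # Only one run at a time; a UI could use is_running() to block
        # other features (such as inference) while training is active.
        if self.is_running():
            raise RuntimeError("a job is already running")
        self.aborted = False
        self.process = subprocess.Popen(self.command)

    def status(self):
        # Cheap to call repeatedly, so a UI can poll it on a timer.
        if self.process is None:
            return "idle"
        code = self.process.poll()
        if code is None:
            return "running"
        if self.aborted:
            return "aborted"
        return "finished" if code == 0 else "failed"

    def abort(self):
        # Ask the child to stop, then wait until it has actually exited.
        if self.is_running():
            self.aborted = True
            self.process.terminate()
            self.process.wait()
```

In use, the server would create one job, for example `BackgroundJob([sys.executable, "train.py"])` (where `train.py` is a placeholder script name), call `start()` once, and have every open page poll `status()` at an interval.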

Another thing I want to do is add CLI support, so I can run long fine-tuning jobs on SkyPilot's managed spot instances, or terminate the machine automatically after fine-tuning ends to save cost.

@paplorinc
Contributor Author

Nice, let me know how I can help!

@zetavg
Owner

zetavg commented Apr 24, 2023

Update: this has now been merged into main.

I just implemented it on the dev-2 branch. It's now possible to track the training progress on multiple devices, even on phones. Please feel free to give it a try and see if there are any issues.

I'll merge it back to main after testing on Colab (no free resources available right now).

One known issue: some steps, such as loading the base model or mapping the training dataset, can't be aborted immediately by clicking the abort button in the UI; the abort only takes effect once that step finishes.
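One common pattern that produces this behavior is a cooperative abort flag that is only checked between steps. The sketch below illustrates that pattern; it is an assumption about the general mechanism, not a description of how the dev-2 branch is written, and `run_steps` and the step names are hypothetical.

```python
import threading


def run_steps(steps, abort_event):
    """Run (name, callable) pairs in order, checking the abort flag between steps.

    A step that is already executing cannot be interrupted from here, so an
    abort requested mid-step is only honored once that step returns.
    """
    completed = []
    for name, step in steps:
        if abort_event.is_set():
            return completed, "aborted"
        step()  # blocking call, e.g. loading a model or mapping a dataset
        completed.append(name)
    return completed, "finished"
```

If the abort button sets the event while the first step is still running, that step still runs to completion and only the following steps are skipped.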

