Refresh instead of timing out #15

paplorinc · 2023-04-19T17:37:30Z

My current training run takes 35 hours, so it will time out unless we refresh or substantially increase the timeout.

zetavg · 2023-04-19T17:53:55Z

I'm thinking of not relying on Gradio's loading state for the training process; I don't think it's suitable for tasks that last minutes or hours. You can't monitor the progress from multiple devices, and you can't hook back into the training progress once the page is closed or disconnected, so you have to rely on the terminal to monitor the progress or abort it.

Instead, we can put the training into a subprocess, run it in the background, and let the UI poll for its status, which would let us see and control the progress from multiple devices. We'd have to craft a loading UI and block other features, such as inference, during fine-tuning, though.
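A minimal sketch of that subprocess-plus-polling idea, not the project's actual implementation. It assumes progress is shared through a JSON status file; `WORKER_SOURCE`, `new_status_path`, `start_training`, and `read_status` are hypothetical names, and the worker only counts steps instead of training anything.

```python
import json
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for the real fine-tuning loop: it only counts steps
# and records its progress in a JSON status file.
WORKER_SOURCE = """
import json, os, sys, time
status_path, total = sys.argv[1], int(sys.argv[2])
for step in range(1, total + 1):
    time.sleep(0.01)  # pretend to train
    tmp = status_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "total": total, "done": step == total}, f)
    os.replace(tmp, status_path)  # swap in the finished file in one step
"""


def new_status_path():
    """Pick a fresh location for the status file (assumption: local disk)."""
    return os.path.join(tempfile.mkdtemp(), "status.json")


def start_training(status_path, total_steps):
    """Launch the worker in the background and return immediately."""
    return subprocess.Popen(
        [sys.executable, "-c", WORKER_SOURCE, status_path, str(total_steps)]
    )


def read_status(status_path):
    """What a UI poll would call; any page on any device can read this."""
    if not os.path.exists(status_path):
        return {"step": 0, "total": None, "done": False}
    with open(status_path) as f:
        return json.load(f)
```

Because the status lives outside the page session, closing or reloading the browser tab does not lose track of the run; a new page only has to call `read_status` again.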

Another thing I want to do is add CLI support, so I can run long fine-tuning jobs on SkyPilot's managed spot instances, or terminate the machine automatically after fine-tuning ends to save cost.

paplorinc · 2023-04-19T18:06:13Z

Nice, let me know how I can help!

zetavg · 2023-04-24T03:30:54Z

Update: this has now been merged into main.

I just implemented it on the dev-2 branch. Now it's possible to track the training progress on multiple devices, even on phones. Please feel free to give it a try and see if there are any issues.

I'll merge it back to main after testing on Colab (no free resource now).

The current known issue is that some steps, such as loading the base model or mapping the training dataset, can't be aborted immediately by clicking the abort button in the UI; the abort only takes effect once that step finishes.
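One plausible reason for this behavior, sketched under the assumption that the abort is a flag checked only between steps (the thread does not say how the abort is actually wired). `run_stages` and the stage names are hypothetical.

```python
import threading


def run_stages(stages, abort_event):
    """Run named stages in order, checking the abort flag only between them."""
    completed = []
    for name, stage in stages:
        if abort_event.is_set():
            return completed, "aborted"
        stage()  # a blocking call: nothing inside it looks at abort_event
        completed.append(name)
    return completed, "finished"


# Simulate the abort button being clicked while the first stage is running:
# the stage itself sets the flag, but it still runs to completion.
abort = threading.Event()
stages = [
    ("load_base_model", abort.set),
    ("map_dataset", lambda: None),
    ("train", lambda: None),
]
result = run_stages(stages, abort)
```

Here `result` shows the first stage as completed even though the abort arrived while it was running; the flag is only noticed before the next stage starts.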
