Testing/notes for CUDA implementation

Current version is tested the following instance config:

A10 GPU, 24Gb
200 Gb RAM
30 core intel cpu
1.4 Tb SSD drive

Running finetune for 7B llama works reasonably fast, but GPU utilization is bad - we can do much better.

Several immediate observations here:

one CPU core is 100% utilized. What exactly is it doing, moving data around, supposedly?
no disk reads are happening - we most likely just serve all files from cache.
each burst of writes is a forward pass - we save inputs. GPU util is especially bad there.
On backward/combined pass GPU utilization is slightly better but not much.
The time before first forward pass is generation - we use batch of size 1 here so utilization is very low but that's to be expected
GPU memory util is low as well - we can go with much larger batch size

What should we do?

With this amount of memory we don't need to write to disk ever. We can just move layers back and forth between main memory and GPU
Optionally prefetch and save async.

Let's check 7B model:

offloading to disk:

2023-09-05 22:11:56,099 starting iteration 5
2023-09-05 22:12:16,246 starting iteration 6
2023-09-05 22:12:36,387 starting iteration 7
2023-09-05 22:12:56,433 starting iteration 8

Each iteration takes ~20s

After offloading the layers to RAM rather than disk, we get considerable speedup:

2023-09-05 22:19:27,544 starting iteration 5
2023-09-05 22:19:35,563 starting iteration 6
2023-09-05 22:19:43,580 starting iteration 7
2023-09-05 22:19:51,603 starting iteration 8

Each iteration is ~8 seconds.

For 70B model on disk:

2023-09-06 00:08:59,143 starting iteration 5
2023-09-06 00:10:41,738 starting iteration 6
2023-09-06 00:12:24,215 starting iteration 7
2023-09-06 00:14:04,903 starting iteration 8

~100s / iteration

70B model on RAM:

2023-09-05 23:14:31,709 starting iteration 5
2023-09-05 23:15:46,593 starting iteration 6
2023-09-05 23:17:01,593 starting iteration 7
2023-09-05 23:18:16,730 starting iteration 8

~75s / iteration

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

a10.md

a10.md

Testing/notes for CUDA implementation

Files

a10.md

Latest commit

History

a10.md

File metadata and controls

Testing/notes for CUDA implementation