Skip to content

Latest commit

 

History

History
64 lines (51 loc) · 2.07 KB

a10.md

File metadata and controls

64 lines (51 loc) · 2.07 KB

Testing/notes for CUDA implementation

Current version is tested the following instance config:

  1. A10 GPU, 24Gb
  2. 200 Gb RAM
  3. 30 core intel cpu
  4. 1.4 Tb SSD drive

Running finetune for 7B llama works reasonably fast, but GPU utilization is bad - we can do much better.

finetune a10

Several immediate observations here:

  1. one CPU core is 100% utilized. What exactly is it doing, moving data around, supposedly?
  2. no disk reads are happening - we most likely just serve all files from cache.
  3. each burst of writes is a forward pass - we save inputs. GPU util is especially bad there.
  4. On backward/combined pass GPU utilization is slightly better but not much.
  5. The time before first forward pass is generation - we use batch of size 1 here so utilization is very low but that's to be expected
  6. GPU memory util is low as well - we can go with much larger batch size

What should we do?

  1. With this amount of memory we don't need to write to disk ever. We can just move layers back and forth between main memory and GPU
  2. Optionally prefetch and save async.

Let's check 7B model:

offloading to disk:

2023-09-05 22:11:56,099 starting iteration 5
2023-09-05 22:12:16,246 starting iteration 6
2023-09-05 22:12:36,387 starting iteration 7
2023-09-05 22:12:56,433 starting iteration 8

Each iteration takes ~20s

After offloading the layers to RAM rather than disk, we get considerable speedup:

2023-09-05 22:19:27,544 starting iteration 5
2023-09-05 22:19:35,563 starting iteration 6
2023-09-05 22:19:43,580 starting iteration 7
2023-09-05 22:19:51,603 starting iteration 8

Each iteration is ~8 seconds.

For 70B model on disk:

2023-09-06 00:08:59,143 starting iteration 5
2023-09-06 00:10:41,738 starting iteration 6
2023-09-06 00:12:24,215 starting iteration 7
2023-09-06 00:14:04,903 starting iteration 8

~100s / iteration

70B model on RAM:

2023-09-05 23:14:31,709 starting iteration 5
2023-09-05 23:15:46,593 starting iteration 6
2023-09-05 23:17:01,593 starting iteration 7
2023-09-05 23:18:16,730 starting iteration 8

~75s / iteration