
feat: prefetching #152

Open · wants to merge 1 commit into main
Conversation

daquexian (Contributor) commented on Jul 7, 2023

Add a prefetching strategy: `*5+3` means that the weights of the first 5 layers always stay in GPU memory, and when the i-th layer is executed, the weights of the (i+3)-th layer are prefetched asynchronously into GPU memory (and dropped once the execution of the (i+3)-th layer finishes).

The old stream strategy such as `*5+` now becomes an abbreviation of `*5+0`.
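
For example, with the `rwkv` package's strategy string (the model path below is a placeholder, and the `+3` suffix is the syntax added by this PR):

```python
from rwkv.model import RWKV

# Keep the first 5 layers resident on the GPU; while layer i runs,
# asynchronously prefetch the weights of layer i+3.
model = RWKV(model='/path/to/RWKV-4-World-7B', strategy='cuda fp16 *5+3')
out, state = model.forward([187], None)
```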

The following profiler screenshot shows the overlap between the computation CUDA stream (blue) and the memcpy CUDA stream (green):

[screenshot: computation stream overlapping with memcpy stream]
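
For context, this overlap comes from the standard two-stream prefetch pattern in PyTorch. A minimal sketch of that pattern (illustrative only, not the code in this PR; `prefetch` and `wait_for_prefetch` are made-up names):

```python
import torch

copy_stream = torch.cuda.Stream()  # dedicated memcpy stream (green)

def prefetch(w_cpu):
    # Issue host-to-device copies on the side stream so they can overlap
    # with kernels running on the default compute stream (blue).
    # The source tensors must live in pinned memory, otherwise
    # non_blocking=True silently falls back to a blocking copy.
    with torch.cuda.stream(copy_stream):
        return {k: v.to('cuda', non_blocking=True) for k, v in w_cpu.items()}

def wait_for_prefetch(w_gpu):
    # Before a layer reads the prefetched weights, make the compute stream
    # wait for the copies to finish, and tell the caching allocator the
    # tensors are now used on the compute stream.
    torch.cuda.current_stream().wait_stream(copy_stream)
    for v in w_gpu.values():
        v.record_stream(torch.cuda.current_stream())
    return w_gpu
```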

However, the prefetching feature doesn't necessarily speed up inference compared to the old stream strategy under the same memory budget (e.g. `*10+10` vs `*20+0`), because memcpy is much slower than computation and cannot be fully overlapped. Here are some benchmarks of the 7B world model (bf16, RWKV_JIT_ON=1, RWKV_CUDA_ON=0, A100 80G):

| Strategy | GPU Mem | Time |
| --- | --- | --- |
| `*10+0` | 6306MB | 0.7756s |
| `*15+0` | 8382MB | 0.6054s |
| `*10+10` | 10502MB | 0.6912s |
| `*15+5` | 10498MB | 0.5655s |
| `*20+0` | 10456MB | 0.7067s |
| No stream | 15184MB | 0.0567s |

The same 7B world model (fp16, RWKV_JIT_ON=1, RWKV_CUDA_ON=1, A100 80G):

| Strategy | GPU Mem | Time |
| --- | --- | --- |
| `*10+0` | 6346MB | 0.6043s |
| `*15+0` | 8422MB | 0.5046s |
| `*10+10` | 10602MB | 0.5930s |
| `*15+5` | 10600MB | 0.4961s |
| `*20+0` | 10498MB | 0.5532s |
| No stream | 15184MB | 0.0195s |

BTW, it may be helpful to have a `prepare` API that prefetches the weights manually, to reduce the latency of `forward` in some scenarios.
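
A sketch of what that could look like (purely hypothetical; no `prepare` method exists yet, and `do_other_work` is a stand-in for whatever the caller does between requests):

```python
model.prepare([token])                     # start prefetching the streamed layers
do_other_work()                            # e.g. tokenization, batching
out, state = model.forward([token], None)  # weights are already in flight
```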

Signed-off-by: daquexian <daquexian566@gmail.com>