How to make quantised models work faster on CPU machines #132

Closed Answered by NirantK
TheRabidWolverine asked this question in Q&A
Hey @TheRabidWolverine, couple of things:

  1. Quantization isn't only for CPU. Models can be quantized for CUDA or Apple runtimes too. We've simply chosen to prefer CPU because we had difficulty setting up tests for the CUDA and Apple runtimes.
  2. What governs the speed gain? Primarily the model size. Quantization does two things: the operations are cheaper, and the model itself is smaller.
  3. FastEmbed does use more CPU processes: for larger datasets, we do data-parallel processing. That can have a RAM impact as well, since we load more data into RAM for the parallel workers.
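
The size reduction in point 2 is easy to see with a minimal sketch (not FastEmbed's actual quantization code): a naive symmetric int8 quantization of a hypothetical float32 weight matrix shrinks it to a quarter of its size.

```python
import numpy as np

# Hypothetical weight matrix standing in for one model layer.
weights_fp32 = np.random.rand(1024, 1024).astype(np.float32)

# Naive symmetric int8 quantization: map the float range onto [-127, 127].
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)

print(weights_fp32.nbytes)  # 4194304 bytes (float32)
print(weights_int8.nbytes)  # 1048576 bytes, i.e. 4x smaller
```

Real quantizers (e.g. ONNX Runtime's) use per-channel scales and calibration, but the storage arithmetic is the same: int8 weights are 4x smaller than float32, which also means less memory traffic per operation.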

I hope this answers your questions. Please feel free …

Answer selected by NirantK
This discussion was converted from issue #131 on February 23, 2024 05:04.