[Feature] Multi-Machine Support for Distributed Inference #1046

ronaldmannak opened this issue Apr 28, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@ronaldmannak

The growth in size of open-source models is outpacing the growth of memory capacity of Mac computers. The latest 70B version of Llama 3 is already pushing the limits of a fully loaded Mac Pro. The upcoming 400B version of Llama 3 will exceed available memory entirely unless heavily quantized.

While memory limits may increase in future Mac Pro and Mac Studio models, it is likely that LLMs will continue to grow in size at an even faster rate. This poses a challenge for running the latest large open-source models with MLX. Without changes, MLX could be restricted to handling small to medium-sized models or heavily quantized versions of large models, resulting in inevitable inaccuracies.

MLX may become unsuitable for scenarios where local GPT-4-equivalent open-source models are needed for cost and/or privacy reasons. I'm particularly thinking of small and medium-sized businesses and power users.

If we set aside lossy options like quantization, there are alternative approaches to consider:

  1. Optimizing memory usage, for instance with Air_LLM, which loads and unloads layers on demand. It is unclear whether every LLM supports unloading entire layers, and this method may be inefficient since the layers have to be cycled through memory for each generated token (see the sketch right after this list).

  2. Implementing multi-machine support for distributed inference, where inference is distributed across multiple Macs. I shared a tweet about this possible solution and received significant interest, even though it was just a spontaneous idea. One way this approach could work is to split the model across multiple Macs (Mini, Studio, or Pro) connected via IP over Thunderbolt (see the sketch at the end of this comment).
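
To illustrate why option 1 can be slow, here is a minimal, hypothetical sketch (plain NumPy and dummy weights, not Air_LLM's actual API) of cycling layer weights through memory for every generated token; the repeated disk reads are where the inefficiency comes from:

```python
# Hypothetical sketch of per-token layer cycling (not Air_LLM's real API).
# Each "layer" lives on disk and is loaded, applied, and freed in turn,
# so every generated token pays the full load cost for every layer.
import os
import tempfile
import numpy as np

hidden, n_layers, n_tokens = 64, 4, 3
tmp = tempfile.mkdtemp()

# Pretend these are the model's layer weights, stored on disk.
for i in range(n_layers):
    w = np.random.randn(hidden, hidden).astype(np.float32)
    np.save(os.path.join(tmp, f"layer_{i}.npy"), w)

x = np.random.randn(1, hidden).astype(np.float32)
for token in range(n_tokens):
    h = x
    for i in range(n_layers):
        w = np.load(os.path.join(tmp, f"layer_{i}.npy"))  # load layer from disk
        h = np.tanh(h @ w)                                 # apply it
        del w                                              # free it before the next layer
    print(f"token {token}: reloaded {n_layers} layers from disk")
```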

I am not proposing a definitive solution, but if there is interest in this topic, this discussion could serve as a starting point for further exploration of the possibilities.
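
For concreteness, here is a rough, hypothetical sketch of the split-model idea in option 2: the first machine runs the lower half of the layers and streams the hidden state over TCP (e.g. IP over Thunderbolt) to a second machine that runs the upper half. The port, toy layers, and command-line roles below are purely illustrative; none of this is an MLX API.

```python
# Hypothetical two-machine pipeline split (toy layers, not a real model).
# Run "python split.py server" on the second Mac and
# "python split.py client <server-ip>" on the first; the hidden state
# crosses the Thunderbolt/IP link between the two halves of the "model".
import pickle
import socket
import sys
import numpy as np

HIDDEN, PORT = 64, 50007

def run_layers(x, n):
    # Stand-in for half of the transformer stack.
    for _ in range(n):
        x = np.tanh(x @ np.random.randn(HIDDEN, HIDDEN).astype(np.float32))
    return x

if sys.argv[1] == "server":                 # second Mac: upper layers
    with socket.create_server(("", PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            h = pickle.loads(conn.recv(1 << 20))  # receive hidden state (small payload)
            out = run_layers(h, n=2)              # finish the forward pass
            conn.sendall(pickle.dumps(out))       # send the result back
else:                                        # first Mac: lower layers
    h = run_layers(np.random.randn(1, HIDDEN).astype(np.float32), n=2)
    with socket.create_connection((sys.argv[2], PORT)) as conn:
        conn.sendall(pickle.dumps(h))
        print(pickle.loads(conn.recv(1 << 20)).shape)
```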

@sck-at-ucy

"Implementing multi-machine support for distributed inference": I am not sure how this would work in technical detail but this is an idea that would be very attractive in the context of using MLX for physics/engineering ML/AI+HPC applications.

awni added the enhancement (New feature or request) label on Apr 28, 2024
@ivanfioravanti

I think we should extend the concept to distributed training as well as inference, like the DeepSpeed library. This could enable incredible scenarios powered by Apple Silicon chips.

@awni
Member

awni commented Apr 28, 2024

As a first step we are looking at adding the communication primitives you would use to implement both of these: ops like send, receive, broadcast, and reduce. Basically, the ops in this table / MPI.

Both distributed training and inference should be implementable on top of those ops. Exactly what those APIs look like and where they live is still TBD. But we need those ops as a first step either way so we can do distributed work with MLX in a flexible way.
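
For reference, these primitives map directly onto what MPI already provides. The snippet below is a minimal mpi4py illustration of send/receive, broadcast, and reduce; mpi4py is used only to show the ops themselves, not as the eventual MLX API, which is still TBD:

```python
# Illustration of the MPI-style primitives (send/recv, broadcast, reduce)
# using mpi4py; run with e.g. `mpirun -np 2 python primitives.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Broadcast: rank 0 shares a buffer with every rank.
data = np.arange(4, dtype=np.float32) if rank == 0 else np.empty(4, dtype=np.float32)
comm.Bcast(data, root=0)

# Send / receive: point-to-point between rank 0 and rank 1.
if rank == 0:
    comm.Send(data * 2, dest=1, tag=0)
elif rank == 1:
    buf = np.empty(4, dtype=np.float32)
    comm.Recv(buf, source=0, tag=0)

# Reduce: element-wise sum of every rank's contribution onto rank 0.
total = np.empty(4, dtype=np.float32)
comm.Reduce(data, total, op=MPI.SUM, root=0)
if rank == 0:
    print("reduced:", total)
```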

@ronaldmannak
Author

@awni That sounds like a good and doable first step. If I have time this week, I'm happy to take a first stab at it.

FWIW, there is a relevant discussion about how the accuracy of the Llama 3 models may be disproportionately affected by quantization, which might be a side effect of using a large training set. This would be another argument for distributed inference. See tweet and PR.

@fblissjr

fblissjr commented May 7, 2024

This is something I'd be happy to test and contribute to as well. I remember seeing the original tweet, and I just got a Thunderbolt 4 cable connected between two Macs (a MacBook Pro and an M2 Ultra).

I can see this use case being common for Apple users with a work machine and a home machine.

@fblissjr

fblissjr commented May 7, 2024

> @awni That sounds like a good and doable first step. If I have time this week, I'm happy to take a first stab at it.
>
> FWIW, there is a relevant discussion about how the accuracy of the Llama 3 models may be disproportionately affected by quantization, which might be a side effect of using a large training set. This would be another argument for distributed inference. See tweet and PR.

From what I've read, this is a llama.cpp issue more than just a Llama 3 quantization issue.

@awni
Member

awni commented May 10, 2024

In progress #1097

@fblissjr

> In progress #1097

amazing. so ready.

@sck-at-ucy

So exciting to see this moving along!! We have several Mac Studios with M2 Ultras in the lab that we could hook up to test distributed computing when we reach that point of maturity. I'd be happy to get involved in the testing.
