[Feature] Multi-Machine Support for Distributed Inference #1046

ronaldmannak opened this issue Apr 28, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@ronaldmannak

The growth in size of open-source models is outpacing the growth of memory capacity of Mac computers. The latest 70B version of Llama 3 is already pushing the limits of a fully loaded Mac Pro. The upcoming 400B version of Llama 3 will exceed available memory entirely unless heavily quantized.

While memory limits may increase in future Mac Pro and Mac Studio models, it is likely that LLMs will continue to grow in size at an even faster rate. This poses a challenge for running the latest large open-source models with MLX. Without changes, MLX could be restricted to handling small to medium-sized models or heavily quantized versions of large models, resulting in inevitable inaccuracies.

MLX may become unsuitable for scenarios where local GPT-4-equivalent open-source models are needed for cost and/or privacy reasons. I'm particularly thinking of small and medium-sized businesses and power users.

If we set aside lossy options like quantization, there are alternative approaches to consider:

  1. Optimizing memory usage, for instance with Air_LLM, which loads and unloads layers on demand. It is unclear whether every LLM supports unloading entire layers, and this method may be inefficient since the layers have to be cycled through memory for each generated token (see the sketch right after this list).

  2. Implementing multi-machine support for distributed inference, where inference is distributed across multiple Macs. I shared a tweet about this possible solution and received significant interest, even though it was just a spontaneous idea. One way this approach could work is to split the model across multiple Macs (Mini, Studio, or Pro) connected via IP over Thunderbolt (see the sketch at the end of this comment).
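
To illustrate why option 1 can be slow, here is a minimal, hypothetical sketch (plain NumPy and dummy weights, not Air_LLM's actual API) of cycling layer weights through memory for every generated token; the repeated disk reads are where the inefficiency comes from:

```python
# Hypothetical sketch of per-token layer cycling (not Air_LLM's real API).
# Each "layer" lives on disk and is loaded, applied, and freed in turn,
# so every generated token pays the full load cost for every layer.
import os
import tempfile
import numpy as np

hidden, n_layers, n_tokens = 64, 4, 3
tmp = tempfile.mkdtemp()

# Pretend these are the model's layer weights, stored on disk.
for i in range(n_layers):
    w = np.random.randn(hidden, hidden).astype(np.float32)
    np.save(os.path.join(tmp, f"layer_{i}.npy"), w)

x = np.random.randn(1, hidden).astype(np.float32)
for token in range(n_tokens):
    h = x
    for i in range(n_layers):
        w = np.load(os.path.join(tmp, f"layer_{i}.npy"))  # load layer from disk
        h = np.tanh(h @ w)                                 # apply it
        del w                                              # free it before the next layer
    print(f"token {token}: reloaded {n_layers} layers from disk")
```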

I am not proposing a definitive solution, but if there is interest in this topic, this discussion could serve as a starting point for further exploration of the possibilities.
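
For concreteness, here is a rough, hypothetical sketch of the split-model idea in option 2: the first machine runs the lower half of the layers and streams the hidden state over TCP (e.g. IP over Thunderbolt) to a second machine that runs the upper half. The port, toy layers, and command-line roles below are purely illustrative; none of this is an MLX API.

```python
# Hypothetical two-machine pipeline split (toy layers, not a real model).
# Run "python split.py server" on the second Mac and
# "python split.py client <server-ip>" on the first; the hidden state
# crosses the Thunderbolt/IP link between the two halves of the "model".
import pickle
import socket
import sys
import numpy as np

HIDDEN, PORT = 64, 50007

def run_layers(x, n):
    # Stand-in for half of the transformer stack.
    for _ in range(n):
        x = np.tanh(x @ np.random.randn(HIDDEN, HIDDEN).astype(np.float32))
    return x

if sys.argv[1] == "server":                 # second Mac: upper layers
    with socket.create_server(("", PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            h = pickle.loads(conn.recv(1 << 20))  # receive hidden state (small payload)
            out = run_layers(h, n=2)              # finish the forward pass
            conn.sendall(pickle.dumps(out))       # send the result back
else:                                        # first Mac: lower layers
    h = run_layers(np.random.randn(1, HIDDEN).astype(np.float32), n=2)
    with socket.create_connection((sys.argv[2], PORT)) as conn:
        conn.sendall(pickle.dumps(h))
        print(pickle.loads(conn.recv(1 << 20)).shape)
```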

@sck-at-ucy

"Implementing multi-machine support for distributed inference": I am not sure how this would work in technical detail but this is an idea that would be very attractive in the context of using MLX for physics/engineering ML/AI+HPC applications.

awni added the enhancement (New feature or request) label on Apr 28, 2024
@ivanfioravanti

I think we should extend the concept to distributed training as well as inference, like the DeepSpeed library. This could enable incredible scenarios powered by Apple Silicon chips.

@awni
Member

awni commented Apr 28, 2024

As a first step we are looking at adding the communication primitives you would use to implement both of these: ops like send, receive, broadcast, and reduce. Basically, the ops in this table / MPI.

Both distributed training and inference should be implementable on top of those ops. Exactly what those APIs look like and where they live is still TBD. But we need those ops as a first step either way so we can do distributed work with MLX in a flexible way.
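
For reference, these primitives map directly onto what MPI already provides. The snippet below is a minimal mpi4py illustration of send/receive, broadcast, and reduce; mpi4py is used only to show the ops themselves, not as the eventual MLX API, which is still TBD:

```python
# Illustration of the MPI-style primitives (send/recv, broadcast, reduce)
# using mpi4py; run with e.g. `mpirun -np 2 python primitives.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Broadcast: rank 0 shares a buffer with every rank.
data = np.arange(4, dtype=np.float32) if rank == 0 else np.empty(4, dtype=np.float32)
comm.Bcast(data, root=0)

# Send / receive: point-to-point between rank 0 and rank 1.
if rank == 0:
    comm.Send(data * 2, dest=1, tag=0)
elif rank == 1:
    buf = np.empty(4, dtype=np.float32)
    comm.Recv(buf, source=0, tag=0)

# Reduce: element-wise sum of every rank's contribution onto rank 0.
total = np.empty(4, dtype=np.float32)
comm.Reduce(data, total, op=MPI.SUM, root=0)
if rank == 0:
    print("reduced:", total)
```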

@ronaldmannak
Author

@awni That sounds like a good and doable first step. If I have time this week, I'm happy to take a first stab at it.

FWIW, there is a relevant discussion about how the accuracy of the Llama 3 models may be disproportionately affected by quantization, which might be a side effect of using a large training set. This would be another argument for distributed inference. See tweet and PR.

@fblissjr

fblissjr commented May 7, 2024

This is something I'd be happy to test and contribute to as well. I remember seeing the original tweet, and I just got a Thunderbolt 4 cable connected between two Macs (a MacBook Pro and an M2 Ultra).

I can see this use case being common for Apple users with a work machine and a home machine.

@fblissjr

fblissjr commented May 7, 2024

> @awni That sounds like a good and doable first step. If I have time this week, I'm happy to take a first stab at it.
>
> FWIW, there is a relevant discussion about how the accuracy of the Llama 3 models may be disproportionately affected by quantization, which might be a side effect of using a large training set. This would be another argument for distributed inference. See tweet and PR.

From what I've read, this is a llama.cpp issue more than just a Llama 3 quantization issue.

@awni
Member

awni commented May 10, 2024

In progress #1097

@fblissjr

> In progress #1097

amazing. so ready.

@sck-at-ucy

So exciting to see this moving along!! We have several Mac Studios with M2 Ultras in the lab that we could hook up to test distributed computing when we reach that point of maturity. I'd be happy to get involved in the testing.
