Meta: Wider model support for PowerInfer #93

Open
1 of 3 tasks
hodlen opened this issue Dec 27, 2023 · 10 comments
Labels
tracker Track related issues and linked to a Project item

Comments

@hodlen
Collaborator

hodlen commented Dec 27, 2023

PowerInfer currently optimizes for LLMs (Large Language Models) that utilize the ReLU activation function, leveraging their internal activation locality. However, many of the trending models do not use ReLU activation, creating a significant gap in PowerInfer's applicability.

This ongoing issue tracks our efforts to onboard new LLMs, particularly those in high demand within the community, and to continually enhance our existing ReLU-based LLMs.

Onboarding Progress

We're actively fine-tuning models into ReLU sparse models:

  • Mistral 7B (Now released as Bamboo)

Inviting broader participation, we're also:

  • Releasing guidelines and reference implementations for converting LLMs to ReLU-based models (see the sketch below for the general idea).
  • Open-sourcing our predictor training code used during and after ReLU LLM fine-tuning.
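As a taste of what such a conversion involves, here is a minimal sketch of swapping a LLaMA-style model's FFN activation to ReLU before sparsity-oriented fine-tuning. The checkpoint id and module layout below are illustrative assumptions, not the official conversion recipe.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Illustrative checkpoint id; any LLaMA-style model with a gated MLP works the same way.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

for layer in model.model.layers:
    # LLaMA's MLP computes down_proj(act_fn(gate_proj(x)) * up_proj(x));
    # swapping the activation for ReLU gives a ReGLU-style FFN that is then
    # fine-tuned to recover quality while gaining activation sparsity.
    layer.mlp.act_fn = nn.ReLU()

model.config.hidden_act = "relu"  # keep the config consistent for later reloads
```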

Onboarding New Models

We recognize that fine-tuning upstream models is computationally intensive, and the requirement for high-quality data often surpasses our current capabilities. As such, we are actively seeking industrial collaborations to unlock more of PowerInfer's potential and bring state-of-the-art models to a wider audience. For direct inquiries and partnership discussions, please contact us at yzmizeyu@sjtu.edu.cn.

We will also focus on models that have garnered significant interest in our community 🌟. Your input and feedback are highly valued and encouraged! 💬👍

@hodlen hodlen added the tracker Track related issues and linked to a Project item label Dec 27, 2023
@hodlen hodlen pinned this issue Dec 27, 2023
@linkerlin

I believe a statistical method could be employed: set all outputs of a non-ReLU activation function that fall below, for instance, the 30th percentile to zero, obtaining sparsity guarantees similar to those that ReLU provides.
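For illustration, a minimal sketch of that thresholding idea, assuming PyTorch tensors of FFN activations; this is a sketch of the proposal, not something PowerInfer currently implements.

```python
import torch

def percentile_sparsify(x: torch.Tensor, pct: float = 30.0) -> torch.Tensor:
    # Zero out every activation below the chosen percentile of this tensor.
    threshold = torch.quantile(x, pct / 100.0)
    return torch.where(x > threshold, x, torch.zeros_like(x))

acts = torch.randn(4096)                  # hypothetical FFN activations (e.g. SiLU outputs)
sparse_acts = percentile_sparsify(acts, pct=30.0)
print((sparse_acts == 0).float().mean())  # roughly 30% of the entries are now zero
```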

@samvanity

It's also important to keep MoE models in mind when you expand PowerInfer's compatibility. The ceiling for consumer-grade GPUs is around 3_0 quantization for an 8x7B, so if PowerInfer can easily handle 5_k_m or even 6_k for an 8x7B, it will really be good news.

Create a ReLU version of the popular Mixtral Instruct v0.1 (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) and the Dolphin fine-tune (https://huggingface.co/cognitivecomputations/dolphin-2.7-mixtral-8x7b), and people will start taking this project seriously.

@YixinSong-e
Collaborator

YixinSong-e commented Jan 26, 2024

Thank you for your insight. We are actually training Mixtral now; please wait for our updates. :)

@llCurious

Hi @YixinSong-e. I notice that you provide ReLU-LLaMA on HF. I ran the model and found that its sparsity (the fraction of values at or below zero) is much lower than that of OPT models, which can reach about 99%; ReLU-LLaMA only achieves about 70-80%. This presumably degrades the sparse matmul, given the much lower sparsity observed.
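For reference, one way to reproduce this kind of measurement is to hook each layer's FFN activation and count non-positive outputs. The checkpoint id and module path below are assumptions for illustration, not a statement of how the numbers above were obtained.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "SparseLLM/ReluLLaMA-7B"  # assumed checkpoint id for a ReLU-tuned LLaMA
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

stats = []

def hook(_module, _inputs, output):
    # Fraction of activations at or below zero in this layer for this input.
    stats.append((output <= 0).float().mean().item())

for layer in model.model.layers:
    # Assumes act_fn is an nn.Module (true for recent transformers versions).
    layer.mlp.act_fn.register_forward_hook(hook)

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    model(**inputs)

print(f"mean activation sparsity across layers: {sum(stats) / len(stats):.2%}")
```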

@YixinSong-e
Collaborator

YixinSong-e commented Jan 26, 2024

Hello @llCurious. Yes, for now ReLU-LLaMA has limited sparsity because of its GLU variant, so its acceleration ratio is also lower than OPT's. Interestingly, we found that with the ReGLU activation function, even activation values that are not exactly zero can often still be ignored.

To push for more sparsity in GLU-based models, we are currently running experiments on Mistral, which we will release soon.
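To make that point concrete, here is a small sketch (with illustrative shapes and an assumed 1e-2 cutoff) showing that exact zeros from ReLU propagate through the ReGLU element-wise product, and that counting near-zero entries as well gives a higher "effective" sparsity.

```python
import torch

hidden, inter = 4096, 11008
x = torch.randn(1, hidden)
gate = torch.nn.Linear(hidden, inter, bias=False)
up = torch.nn.Linear(hidden, inter, bias=False)

# ReGLU FFN branch: zeros produced by ReLU stay zero after the element-wise product.
h = torch.relu(gate(x)) * up(x)

exact = (h == 0).float().mean()              # sparsity from exact zeros only
effective = (h.abs() < 1e-2).float().mean()  # also counting near-zero entries
print(f"exact: {exact:.2%}, effective: {effective:.2%}")
```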

@llCurious

Thanks for your reply. One question: in my understanding, ReGLU uses element-wise multiplication, which means the zero values after ReLU remain zero, theoretically yielding the same sparsity level as ReLU?

BTW, I wonder how you calculate the CDF in Figure 5 (power-law activation).

@YixinSong-e
Collaborator

YixinSong-e commented Jan 27, 2024

First, it is correct that zero values after ReLU remain zero. Further, some values of the product of the ReLU branch and the GLU output end up very close to zero and can also be ignored. We will provide a detailed explanation of this phenomenon in a paper (in the coming weeks).

Second, we collect the activation counts of all neurons over a given corpus, sort the neurons in descending order of activation count, and then compute the CDF of the activation counts.
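A rough sketch of that procedure, assuming per-token activation matrices collected from a corpus; the synthetic data at the bottom is only there to make the snippet runnable and is not from the paper.

```python
import numpy as np

def activation_cdf(batches):
    """CDF of per-neuron activation counts, with neurons sorted hottest-first."""
    counts = None
    for acts in batches:                    # acts: (tokens, intermediate_size)
        fired = (acts > 0).sum(axis=0)      # how many tokens activated each neuron
        counts = fired if counts is None else counts + fired
    counts = np.sort(counts)[::-1]          # descending activation counts
    return np.cumsum(counts) / counts.sum()

# Synthetic example: firing probabilities skewed so a few "hot" neurons dominate.
rng = np.random.default_rng(0)
probs = rng.power(0.3, 11008)
batches = [rng.random((512, 11008)) < probs for _ in range(4)]
cdf = activation_cdf(batches)
print(f"top 20% of neurons cover {cdf[int(0.2 * len(cdf)) - 1]:.0%} of activations")
```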

@guyk1971

Do you have plans to release the code for the profiler that collects the activation statistics? It would be great to be able to evaluate various models and operating points. Thanks!

@llCurious

Hi @hodlen. I notice that you provide ReLU-Falcon-40B on HF. Do you have tuned ReLU-Falcon-7B weights?

@hodlen
Collaborator Author

hodlen commented Mar 4, 2024

We haven't tuned the Falcon 7B model and currently have no plan to do so. After reviewing benchmark performance, we've opted to focus our tuning efforts on Mistral 7B, which has proven to be a more robust foundation model at this scale.
