Block sparse MM MoEs #782
Conversation
@awni I think this is ready for a review. I didn't implement the …
llms/mlx_lm/tuner/utils.py (outdated):

elif isinstance(layer, (SwitchLinear, QuantizedSwitchLinear)):
    LoRALayer = LoRASwitchLinear
Should we add `use_dora` to the condition here? Or throw if trying to DoRA-fy a switch linear layer (until we support DoRA)?
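A minimal sketch of the second option, assuming the dispatch structure in llms/mlx_lm/tuner/utils.py and a `use_dora` flag in the enclosing scope; the first branch and the error message are illustrative, not the file's actual code:

```python
# Illustrative dispatch; only SwitchLinear, QuantizedSwitchLinear, and
# LoRASwitchLinear are named in this thread, the rest is assumed.
if isinstance(layer, (nn.Linear, nn.QuantizedLinear)):
    LoRALayer = DoRALinear if use_dora else LoRALinear
elif isinstance(layer, (SwitchLinear, QuantizedSwitchLinear)):
    if use_dora:
        # Fail loudly until DoRA supports switch (MoE) linear layers.
        raise ValueError("DoRA is not yet supported for SwitchLinear layers")
    LoRALayer = LoRASwitchLinear
```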
🚀 thanks for fixing the quantized models!
Implements all MoEs with the new `mx.block_sparse_mm` and `mx.block_sparse_qmm` (the latter is at ml-explore/mlx#1124).

One notable change is in the quantization predicate, which now accepts modules that define `to_quantized()` and delegates the check for the existence of quantized weights to the module via `is_quantized()`. This allows us to define a `SwitchMLP` for each model, along with a quantized equivalent, and automatically convert to the quantized one. Sketches of both pieces follow.
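First, a rough sketch of how routed expert dispatch can use the new op. The shapes and the signature `mx.block_sparse_mm(a, b, lhs_indices=None, rhs_indices=None)` are assumptions based on this PR's description (later MLX releases expose the op as `mx.gather_mm`):

```python
import mlx.core as mx

num_experts, d_model, d_hidden, tokens = 4, 8, 16, 10

# Stacked expert weights: one (d_model, d_hidden) matrix per expert.
w = mx.random.normal((num_experts, d_model, d_hidden))
# Tokens reshaped so each is a (1, d_model) matrix in its own batch slot.
x = mx.random.normal((tokens, 1, 1, d_model))
# Router output: the expert index each token is sent to (top-1 here).
indices = mx.random.randint(0, num_experts, (tokens, 1))

# Each token is multiplied only by its routed expert's weight matrix,
# instead of densely evaluating every expert and masking the results.
y = mx.block_sparse_mm(x, w, rhs_indices=indices)
print(y.shape)  # (tokens, 1, 1, d_hidden) under the assumed semantics
```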
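Second, a hedged sketch of the predicate change. The class and function bodies here are hypothetical stand-ins illustrating the opt-in pattern, not the PR's actual code:

```python
import mlx.nn as nn

class SwitchMLP(nn.Module):
    """Per-model MoE MLP; expert weights are elided in this sketch."""

    def to_quantized(self, group_size: int = 64, bits: int = 4):
        # Opting in: return the quantized twin of this module.
        return QuantizedSwitchMLP(group_size, bits)


class QuantizedSwitchMLP(nn.Module):
    def __init__(self, group_size: int, bits: int):
        super().__init__()
        self.group_size = group_size
        self.bits = bits
        # ... quantized expert weights would be created here ...

    def is_quantized(self) -> bool:
        # The check for already-quantized weights is delegated to the
        # module instead of being hard-coded in the loader.
        return True


def quantization_predicate(path: str, module: nn.Module) -> bool:
    # Quantize stock linear layers plus any module that knows how to
    # convert itself (e.g. SwitchMLP above).
    return isinstance(module, nn.Linear) or hasattr(module, "to_quantized")
```

With this shape, convert-time code only needs the predicate and load-time code only asks the module `is_quantized()`, so the per-model files stay free of quantization special cases.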
Performance comparison at 4 bits:

Before: [benchmark table]
After: [benchmark table]