Block sparse MM MoEs #782
Conversation
@awni I think this is ready for a review. I didn't implement the …
llms/mlx_lm/tuner/utils.py (outdated):

elif isinstance(layer, (SwitchLinear, QuantizedSwitchLinear)):
    LoRALayer = LoRASwitchLinear
Should we add `use_dora` to the condition here? Or throw if trying to DoRA-fy a switch linear layer (until we support DoRA)?
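A minimal sketch of the second option, assuming the dispatch structure in llms/mlx_lm/tuner/utils.py and a `use_dora` flag in the enclosing scope; the first branch and the error message are illustrative, not the file's actual code:

```python
# Illustrative dispatch; only SwitchLinear, QuantizedSwitchLinear, and
# LoRASwitchLinear are named in this thread, the rest is assumed.
if isinstance(layer, (nn.Linear, nn.QuantizedLinear)):
    LoRALayer = DoRALinear if use_dora else LoRALinear
elif isinstance(layer, (SwitchLinear, QuantizedSwitchLinear)):
    if use_dora:
        # Fail loudly until DoRA supports switch (MoE) linear layers.
        raise ValueError("DoRA is not yet supported for SwitchLinear layers")
    LoRALayer = LoRASwitchLinear
```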
🚀 thanks for fixing the quantized models!
Implements all MoEs with the new `mx.block_sparse_mm` and `mx.block_sparse_qmm` (the latter is at ml-explore/mlx#1124).

One notable change is in the quantization predicate, which now accepts modules that define `to_quantized()` and delegates the check for the existence of quantized weights to the module via `is_quantized()`. This allows us to define a `SwitchMLP` for each model, along with a quantized equivalent, and automatically convert to the quantized one. Sketches of both pieces follow.
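First, a rough sketch of how routed expert dispatch can use the new op. The shapes and the signature `mx.block_sparse_mm(a, b, lhs_indices=None, rhs_indices=None)` are assumptions based on this PR's description (later MLX releases expose the op as `mx.gather_mm`):

```python
import mlx.core as mx

num_experts, d_model, d_hidden, tokens = 4, 8, 16, 10

# Stacked expert weights: one (d_model, d_hidden) matrix per expert.
w = mx.random.normal((num_experts, d_model, d_hidden))
# Tokens reshaped so each is a (1, d_model) matrix in its own batch slot.
x = mx.random.normal((tokens, 1, 1, d_model))
# Router output: the expert index each token is sent to (top-1 here).
indices = mx.random.randint(0, num_experts, (tokens, 1))

# Each token is multiplied only by its routed expert's weight matrix,
# instead of densely evaluating every expert and masking the results.
y = mx.block_sparse_mm(x, w, rhs_indices=indices)
print(y.shape)  # (tokens, 1, 1, d_hidden) under the assumed semantics
```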
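Second, a hedged sketch of the predicate change. The class and function bodies here are hypothetical stand-ins illustrating the opt-in pattern, not the PR's actual code:

```python
import mlx.nn as nn

class SwitchMLP(nn.Module):
    """Per-model MoE MLP; expert weights are elided in this sketch."""

    def to_quantized(self, group_size: int = 64, bits: int = 4):
        # Opting in: return the quantized twin of this module.
        return QuantizedSwitchMLP(group_size, bits)


class QuantizedSwitchMLP(nn.Module):
    def __init__(self, group_size: int, bits: int):
        super().__init__()
        self.group_size = group_size
        self.bits = bits
        # ... quantized expert weights would be created here ...

    def is_quantized(self) -> bool:
        # The check for already-quantized weights is delegated to the
        # module instead of being hard-coded in the loader.
        return True


def quantization_predicate(path: str, module: nn.Module) -> bool:
    # Quantize stock linear layers plus any module that knows how to
    # convert itself (e.g. SwitchMLP above).
    return isinstance(module, nn.Linear) or hasattr(module, "to_quantized")
```

With this shape, convert-time code only needs the predicate and load-time code only asks the module `is_quantized()`, so the per-model files stay free of quantization special cases.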
Performance comparison at 4 bits:

Before: [benchmark table]
After: [benchmark table]