Does the number of slots * the number of experts have to be the same as the length of the token? #2

ZQpengyu · 2023-11-16T02:37:23Z

Suppose I have a sequence of length 512, but num_slot * num_expert = 9. It should still be able to run. In this case, would there be a performance drop?

andersonbcdefg · 2023-12-11T18:30:17Z

I'd have to validate that empirically, but I would definitely expect a performance drop here. Because of the softmax, my intuition is that one token is likely to dominate each expert, and some tokens get short shrift. This might be fine if you have, like, 2x more tokens than slots, or maybe even 4x or 8x, but at >50 tokens per slot (512 / 9 ≈ 57 in your case) I imagine you'd be skimping a bit, and that's likely to hurt results.
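To make the shape question concrete, here is a minimal NumPy sketch of the routing step. It assumes the layer follows the usual Soft MoE formulation (dispatch weights are a softmax over tokens for each slot, combine weights are a softmax over slots for each token). It is not this repo's actual code, and `soft_moe_routing`, `slot_embeds`, and the sizes are made up for illustration. Nothing in it requires the number of slots to match the sequence length.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe_routing(tokens, slot_embeds):
    # Hypothetical helper, assuming the standard Soft MoE routing scheme.
    logits = tokens @ slot_embeds        # (n_tokens, n_slots)
    dispatch = softmax(logits, axis=0)   # per slot: weights over all tokens
    combine = softmax(logits, axis=1)    # per token: weights over all slots
    slot_inputs = dispatch.T @ tokens    # (n_slots, dim), each a token mixture
    return dispatch, combine, slot_inputs

rng = np.random.default_rng(0)
n_tokens, dim = 512, 16
num_experts, num_slots = 3, 3            # 3 * 3 = 9 slots in total (assumed split)

tokens = rng.normal(size=(n_tokens, dim))
slot_embeds = rng.normal(size=(dim, num_experts * num_slots))

dispatch, combine, slot_inputs = soft_moe_routing(tokens, slot_embeds)

# Stand-in for the expert MLPs: identity, just to show the shapes line up.
slot_outputs = slot_inputs
out = combine @ slot_outputs             # back to (n_tokens, dim)

print(slot_inputs.shape, out.shape)      # (9, 16) (512, 16)
print(dispatch.max(axis=0))              # largest single-token weight in each slot
```

The last line is one way to check the "one token dominates" intuition on real activations: if the largest dispatch weight in each slot is close to 1, most of the 512 tokens contribute almost nothing to what the experts see.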
