updt init #210
base: main
Conversation
@@ -112,7 +116,7 @@ def __init__(
         self.skip_moe = (int(os.environ.get('SKIP_MOE', '0')) != 0)

         self.num_local_experts = experts.pop('count_per_node', 1)
-        self.register_buffer('_num_global_experts', torch.tensor(MOELayer.global_expert_count(self.num_local_experts, self.group)))
+        self._num_global_experts = MOELayer.global_expert_count(self.num_local_experts, self.group)
May I ask what the difference is compared with register_buffer()?
Sharding libraries generally shard all network parameters and buffers.
This is a constant that shouldn't be treated like a parameter or a buffer.
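The distinction can be sketched like this (a minimal illustration assuming PyTorch; the class names are hypothetical, not the PR's actual code): a registered buffer becomes part of the module's `state_dict()`, so sharding and checkpointing machinery will pick it up, while a plain Python attribute is invisible to `state_dict()` and is left alone.

```python
import torch

class WithBuffer(torch.nn.Module):
    def __init__(self, num_global_experts):
        super().__init__()
        # Registered as a buffer: appears in state_dict(), so sharding
        # libraries such as FSDP will treat it as sharded/gathered state.
        self.register_buffer('_num_global_experts', torch.tensor(num_global_experts))

class WithPlainInt(torch.nn.Module):
    def __init__(self, num_global_experts):
        super().__init__()
        # Plain attribute: just an int on the module, not part of its state.
        self._num_global_experts = num_global_experts

print(list(WithBuffer(8).state_dict().keys()))    # ['_num_global_experts']
print(list(WithPlainInt(8).state_dict().keys()))  # []
```

Since the value is derived deterministically from the config and the process group, keeping it out of the state dict also avoids any checkpoint-format coupling.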
Makes sense, but it seems to be a global value right now, so all layers use the same global_expert_count? I also found a few issues in some other changes, where I am still investigating whether they affect existing features.
Yes, all layers use the same global_expert_count.
Is there any reason for different ranks to get different results from MOELayer.global_expert_count?
Just setting it as a Python int seems like a valid approach.
(Using a buffer results in FSDP treating it as part of the state to be sharded, and this produces issues in some corner cases.)
I also found a few issues in some other changes, where I am still investigating whether they affect existing features.
👀 lmk what you find
Any updates on this?
For Pyramid MoE, I think different MoE layers don't share the same global expert count, so this will be incompatible with many of those cases.
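To illustrate the Pyramid MoE concern: if each layer configures its own count_per_node, the global expert count derived from it differs per layer, so it cannot be a single module-wide constant. A rough sketch (the helper below is an assumption that mirrors the usual world-size multiplication, not the PR's actual code):

```python
def global_expert_count(num_local_experts: int, world_size: int) -> int:
    # Assumed convention: a positive value means experts per rank.
    return num_local_experts * world_size

# Pyramid-style config: a different count_per_node for each MoE layer.
layer_configs = [{'count_per_node': 2}, {'count_per_node': 4}]
counts = [global_expert_count(cfg['count_per_node'], world_size=8)
          for cfg in layer_configs]
print(counts)  # [16, 32] -- two different global counts, one per layer
```

With per-layer counts like these, a shared global_expert_count would silently be wrong for every layer but one.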
We're talking about this line:
Am I missing something?
We could remove it from this PR and open a follow-up PR.
Sure. The feature
If your changes can avoid the above problems, we are happy to accept your request.
To keep the original design unchanged, and to stay compatible with legacy checkpoint files, I suggest your modification at least follow this: if
All tensor naming is still the same. If I only make layer init more conventional (along with using a more conventional
I just moved parameter init; I didn't change how init happens. The seeding is unchanged. The parameters are still initialized in an nn.Linear layer and copied over. I'll open a new PR without
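The "initialized in an nn.Linear layer and copied over" pattern amounts to something like the following sketch (variable names are illustrative, not the PR's actual code): nn.Linear performs its conventional init, and the resulting tensors are copied into the fused parameter layout, so the init distribution is unchanged.

```python
import torch

in_features, out_features = 16, 32

# Let nn.Linear do its standard (Kaiming-uniform) initialization.
ref = torch.nn.Linear(in_features, out_features)

# Allocate the fused parameters and copy the initialized values over.
fused_weight = torch.nn.Parameter(torch.empty(out_features, in_features))
fused_bias = torch.nn.Parameter(torch.empty(out_features))
with torch.no_grad():
    fused_weight.copy_(ref.weight)
    fused_bias.copy_(ref.bias)

print(torch.equal(fused_weight, ref.weight))  # True
```

Because the same RNG draws happen inside nn.Linear either way, moving where the copy occurs does not change seeded reproducibility.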
- Clean up __init__ / param init for FusedExpertsNetwork (done in another PR)
- enable FusedExpertsNetwork to run without bias (done in another PR)
- device=meta helpers
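The device=meta item presumably refers to building modules without allocating real storage, which helps with deferred or sharded initialization; a minimal illustration (assuming a reasonably recent PyTorch that accepts the `device` keyword on module constructors):

```python
import torch

# Construct a layer on the 'meta' device: shapes and dtypes are tracked,
# but no real memory is allocated for the parameters.
lin = torch.nn.Linear(1024, 1024, device='meta')

print(lin.weight.device)   # meta
print(lin.weight.is_meta)  # True
```

Materialization onto a real device can then happen later, e.g. rank by rank, without paying the full memory cost up front.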