
updt init #210

Open · vchiley wants to merge 4 commits into main
Conversation

@vchiley commented Aug 10, 2023

  • Clean up __init__ / param init for FusedExpertsNetwork (done in another PR)
  • Enable FusedExpertsNetwork to run without bias (done in another PR)
  • Make _num_global_experts a plain attribute instead of a buffer while still adding it to the state dict (see the sketch after this list)
    • helps when initializing the network with device=meta helpers
    • helps the layer play nice with torch distributed frameworks
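
One possible way to keep the value in the checkpoint without registering it as a buffer is PyTorch's get_extra_state / set_extra_state hooks. This is only a hypothetical sketch (MOELayerSketch is a made-up stand-in); the actual PR may implement it differently:

```python
import torch

class MOELayerSketch(torch.nn.Module):
    """Sketch only, not the real tutel MOELayer: keep a plain-int constant in
    the checkpoint without making it a parameter or a buffer."""

    def __init__(self, num_local_experts: int, world_size: int):
        super().__init__()
        # Plain Python int: not a parameter, not a buffer, so sharding
        # frameworks (e.g. FSDP) ignore it entirely.
        self._num_global_experts = num_local_experts * world_size

    def get_extra_state(self):
        # Serialized into state_dict() under the "<prefix>_extra_state" key.
        return {"_num_global_experts": self._num_global_experts}

    def set_extra_state(self, state):
        # Called by load_state_dict() with whatever get_extra_state() returned.
        self._num_global_experts = state["_num_global_experts"]
```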

```diff
@@ -112,7 +116,7 @@ def __init__(
         self.skip_moe = (int(os.environ.get('SKIP_MOE', '0')) != 0)

         self.num_local_experts = experts.pop('count_per_node', 1)
-        self.register_buffer('_num_global_experts', torch.tensor(MOELayer.global_expert_count(self.num_local_experts, self.group)))
+        self._num_global_experts = MOELayer.global_expert_count(self.num_local_experts, self.group)
```

Contributor:

May I ask what the difference is compared to register_buffer()?

Contributor (Author):

Sharding libraries generally shard all network parameters and buffers. This is a constant, which shouldn't be treated like a parameter or a buffer.
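
For context, here is a minimal illustration of the difference (toy code, not from this PR; the class names are made up):

```python
import torch

class WithBuffer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Registered buffer: shows up in state_dict() and in module.buffers(),
        # so sharding / checkpointing frameworks treat it as model state.
        self.register_buffer('_num_global_experts', torch.tensor(8))

class WithPlainInt(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python attribute: invisible to state_dict() and buffers().
        self._num_global_experts = 8

print(WithBuffer().state_dict())    # OrderedDict([('_num_global_experts', tensor(8))])
print(WithPlainInt().state_dict())  # OrderedDict()
```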

Contributor:

Makes sense, but it seems like this is a global thing right now, so all layers use the same global_expert_count? I also found a few issues in some other changes; I am still investigating whether they affect existing features.

Contributor (Author):

Yes, all layers use the same global_expert_count. Is there any reason for different ranks to output different things from MOELayer.global_expert_count? Just setting it as a Python int seems like a valid approach. (Using a buffer results in FSDP treating it as part of the state which should be sharded, and this produces issues in some corner cases.)

Contributor (Author):

> I also found a few issues in some other changes; I am still investigating whether they affect existing features.

👀 Let me know what you find.

@vchiley commented Sep 12, 2023

> I also found a few issues in some other changes; I am still investigating whether they affect existing features.

Any updates on this?

@ghostplant commented

For Pyramid MoE, I think different MoE layers don't share the same global expert counts, so it will be incompatible with a lot of those cases.

@vchiley commented Sep 25, 2023

> For Pyramid MoE, I think different MoE layers don't share the same global expert counts, so it will be incompatible with a lot of those cases.

We're talking about this line:

self._num_global_experts = MOELayer.global_expert_count(self.num_local_experts, self.group)

MOELayer.global_expert_count is a static method which takes the arguments of the specific MoE layer. self._num_global_experts is then set per layer, and one MoE layer's _num_global_experts shouldn't interfere with the next layer's, i.e. it shouldn't affect Pyramid MoE in any way (see the sketch below).

Am I missing something?
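
A simplified sketch of that per-layer behavior (illustrative only, not the exact tutel implementation; the class and the fallback for non-distributed runs are assumptions):

```python
import torch
import torch.distributed as dist

class MOELayerSketch(torch.nn.Module):
    """Sketch only: each layer instance computes its own count from its own
    arguments, so nothing is shared between layers."""

    @staticmethod
    def global_expert_count(num_local_experts: int, group=None) -> int:
        # World size of this layer's process group; fall back to 1 when
        # torch.distributed isn't initialized (e.g. a single-GPU run).
        world_size = dist.get_world_size(group) if dist.is_initialized() else 1
        return num_local_experts * world_size

    def __init__(self, num_local_experts: int = 1, group=None):
        super().__init__()
        self.num_local_experts = num_local_experts
        self.group = group
        # Plain Python int, set independently for every layer instance.
        self._num_global_experts = self.global_expert_count(num_local_experts, group)

# A Pyramid-MoE-style stack can therefore give each layer a different count:
layers = [MOELayerSketch(num_local_experts=n) for n in (1, 2, 4)]
print([layer._num_global_experts for layer in layers])  # [1, 2, 4] on a single process
```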

@vchiley commented Sep 26, 2023

> self._num_global_experts = MOELayer.global_expert_count(self.num_local_experts, self.group)

Is this change (num_global_experts no longer being a buffer) the only issue with the PR?

We could remove it from this PR and open a follow-up PR.

@ghostplant commented Oct 1, 2023

> self._num_global_experts = MOELayer.global_expert_count(self.num_local_experts, self.group)
>
> Is this change (num_global_experts no longer being a buffer) the only issue with the PR?
>
> We could remove it from this PR and open a follow-up PR.

Sure. The non-ffn-ending-bias-training feature is relatively more acceptable. However, for this feature, what I worry about is whether it may have a negative impact on the two things below:

  • Previous SWIN-V2 checkpoint files containing MoE layers may fail to load after your changes, since data from those old state_dicts is no longer compatible with the new parameter registration. Many users with private checkpoints for private models may also hit failures after this update.
  • The original expert parameter initialization seeding was specially designed so that initial parameter values are deterministic regardless of whether expert parameters are sharded or not. In other words, 2 GPUs training 2 experts in total should have exactly the same training loss as 4 GPUs training 2 experts (i.e. each GPU training half of an expert). Let's assume "2 GPUs training 2 experts" uses seeds A + B on the first GPU and C + D on the second GPU. In "4 GPUs training 2 experts" mode, we seed A on the first GPU, B on the second, C on the third, and D on the last. This design of parameter seeding keeps the initialization of expert parameters stable even if you change the number of GPUs while keeping the global expert count unchanged (see the sketch after this comment). Your new changes seem to break this seed setting, right?

If your changes can avoid the above problems, we are happy to accept your request.
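
For illustration, here is a minimal sketch of that seeding idea (hypothetical code, not tutel's actual implementation; the function, shapes, and base seed are made up). Each global expert slice gets a seed derived only from its global slice index, so the initial values don't depend on how many GPUs the experts are spread over:

```python
import torch

def init_expert_params(global_experts: int, slices_per_expert: int,
                       rank: int, world_size: int,
                       d_model: int = 4, d_ffn: int = 8, base_seed: int = 1234):
    """Hypothetical sketch: every (expert, slice) pair has a fixed global seed,
    and each rank only materializes the slices it owns.  Because the seed
    depends on the global slice index alone, '2 GPUs x 2 experts' and
    '4 GPUs x 2 experts' produce bitwise-identical expert weights."""
    total_slices = global_experts * slices_per_expert
    assert total_slices % world_size == 0, "slices must divide evenly across ranks"
    owned = total_slices // world_size
    cols_per_slice = d_ffn // slices_per_expert
    slices = []
    for i in range(owned):
        global_slice = rank * owned + i  # seeds A, B, C, D <-> indices 0, 1, 2, 3
        g = torch.Generator().manual_seed(base_seed + global_slice)
        slices.append(torch.randn(d_model, cols_per_slice, generator=g))
    return torch.cat(slices, dim=1)

# "2 GPUs training 2 experts": rank 0 owns slices (A, B), rank 1 owns (C, D).
two_gpu = [init_expert_params(2, 2, r, 2) for r in range(2)]
# "4 GPUs training 2 experts": each rank owns exactly one slice.
four_gpu = [init_expert_params(2, 2, r, 4) for r in range(4)]
assert torch.equal(torch.cat(two_gpu, dim=1), torch.cat(four_gpu, dim=1))
```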

@ghostplant commented Oct 1, 2023

To keep the original design unchanged, and so stay compatible with legacy checkpoint files, I suggest your modification follow this at least: if bias=True, which should be the default setting, then the parameter registration, seeding order, and seed initialization design should all be identical to the original's in all cases.

@vchiley commented Oct 5, 2023

All tensor naming is still the same. If bias=True, the generated state dict will be the same.

I only make the layer init more conventional (along with using a more conventional reset_parameters), e.g. like a torch.nn.Linear layer. (Standard PyTorch attaches params in __init__ instead of in an auxiliary self.update(ctx) fn; after the params are attached, it calls reset_parameters.) A sketch of that convention is below.

I just moved parameter init; I didn't change how init happens. The seeding is unchanged. The parameters are still initialized in an nn.Linear layer and copied over.
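
For reference, the conventional structure being described (sketch only, not the actual FusedExpertsNetwork code; the shapes and the kaiming/zeros init here are placeholders, since in the PR the values still come from an nn.Linear and are copied over):

```python
import math
import torch

class FusedExpertsSketch(torch.nn.Module):
    """Sketch of the torch.nn.Linear-style convention: attach parameters in
    __init__, then initialize them in reset_parameters()."""

    def __init__(self, num_local_experts: int, d_model: int, d_ffn: int, bias: bool = True):
        super().__init__()
        # Parameters are created directly in __init__ (no auxiliary update(ctx) step).
        self.w_in = torch.nn.Parameter(torch.empty(num_local_experts, d_model, d_ffn))
        self.w_out = torch.nn.Parameter(torch.empty(num_local_experts, d_ffn, d_model))
        if bias:
            self.b_in = torch.nn.Parameter(torch.empty(num_local_experts, d_ffn))
            self.b_out = torch.nn.Parameter(torch.empty(num_local_experts, d_model))
        else:
            self.register_parameter('b_in', None)
            self.register_parameter('b_out', None)
        self.reset_parameters()

    def reset_parameters(self):
        # Same pattern as torch.nn.Linear.reset_parameters().
        for w in (self.w_in, self.w_out):
            torch.nn.init.kaiming_uniform_(w, a=math.sqrt(5))
        for b in (self.b_in, self.b_out):
            if b is not None:
                torch.nn.init.zeros_(b)
```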

I'll open a new PR without self._num_global_experts.
