
updt init #210

Open · vchiley wants to merge 4 commits into main
Conversation

@vchiley commented Aug 10, 2023

  • Clean up __init__ / param init for FusedExpertsNetwork (done in another PR)
  • Enable FusedExpertsNetwork to run without bias (done in another PR)
  • Make _num_global_experts a plain attribute instead of a buffer while still adding it to the state dict (see the sketch after this list)
    • helps when initializing the network with device=meta helpers
    • helps the layer play nice with torch distributed frameworks
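
One possible way to keep the value in the checkpoint without registering it as a buffer is PyTorch's get_extra_state / set_extra_state hooks. This is only a hypothetical sketch (MOELayerSketch is a made-up stand-in); the actual PR may implement it differently:

```python
import torch

class MOELayerSketch(torch.nn.Module):
    """Sketch only, not the real tutel MOELayer: keep a plain-int constant in
    the checkpoint without making it a parameter or a buffer."""

    def __init__(self, num_local_experts: int, world_size: int):
        super().__init__()
        # Plain Python int: not a parameter, not a buffer, so sharding
        # frameworks (e.g. FSDP) ignore it entirely.
        self._num_global_experts = num_local_experts * world_size

    def get_extra_state(self):
        # Serialized into state_dict() under the "<prefix>_extra_state" key.
        return {"_num_global_experts": self._num_global_experts}

    def set_extra_state(self, state):
        # Called by load_state_dict() with whatever get_extra_state() returned.
        self._num_global_experts = state["_num_global_experts"]
```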

```diff
@@ -112,7 +116,7 @@ def __init__(
         self.skip_moe = (int(os.environ.get('SKIP_MOE', '0')) != 0)

         self.num_local_experts = experts.pop('count_per_node', 1)
-        self.register_buffer('_num_global_experts', torch.tensor(MOELayer.global_expert_count(self.num_local_experts, self.group)))
+        self._num_global_experts = MOELayer.global_expert_count(self.num_local_experts, self.group)
```

Contributor:

May I ask what the difference is compared to register_buffer()?

Contributor (Author):

Sharding libraries generally shard all network parameters and buffers. This is a constant, which shouldn't be treated like a parameter or a buffer.
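
For context, here is a minimal illustration of the difference (toy code, not from this PR; the class names are made up):

```python
import torch

class WithBuffer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Registered buffer: shows up in state_dict() and in module.buffers(),
        # so sharding / checkpointing frameworks treat it as model state.
        self.register_buffer('_num_global_experts', torch.tensor(8))

class WithPlainInt(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python attribute: invisible to state_dict() and buffers().
        self._num_global_experts = 8

print(WithBuffer().state_dict())    # OrderedDict([('_num_global_experts', tensor(8))])
print(WithPlainInt().state_dict())  # OrderedDict()
```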

Contributor:

Makes sense, but it seems like this is a global thing right now, so all layers use the same global_expert_count? I also found a few issues in some other changes; I am still investigating whether they affect existing features.

Contributor (Author):

Yes, all layers use the same global_expert_count. Is there any reason for different ranks to output different things from MOELayer.global_expert_count? Just setting it as a Python int seems like a valid approach. (Using a buffer results in FSDP treating it as part of the state which should be sharded, and this produces issues in some corner cases.)

Contributor (Author):

> I also found a few issues in some other changes; I am still investigating whether they affect existing features.

👀 Let me know what you find.

@vchiley commented Sep 12, 2023

> I also found a few issues in some other changes; I am still investigating whether they affect existing features.

Any updates on this?

@ghostplant commented

For Pyramid MoE, I think different MoE layers don't share the same global expert counts, so it will be incompatible with a lot of those cases.

@vchiley commented Sep 25, 2023

> For Pyramid MoE, I think different MoE layers don't share the same global expert counts, so it will be incompatible with a lot of those cases.

We're talking about this line:

self._num_global_experts = MOELayer.global_expert_count(self.num_local_experts, self.group)

MOELayer.global_expert_count is a static method which takes the arguments of the specific MoE layer. self._num_global_experts is then set per layer, and one MoE layer's _num_global_experts shouldn't interfere with the next layer's, i.e. it shouldn't affect Pyramid MoE in any way (see the sketch below).

Am I missing something?
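
A simplified sketch of that per-layer behavior (illustrative only, not the exact tutel implementation; the class and the fallback for non-distributed runs are assumptions):

```python
import torch
import torch.distributed as dist

class MOELayerSketch(torch.nn.Module):
    """Sketch only: each layer instance computes its own count from its own
    arguments, so nothing is shared between layers."""

    @staticmethod
    def global_expert_count(num_local_experts: int, group=None) -> int:
        # World size of this layer's process group; fall back to 1 when
        # torch.distributed isn't initialized (e.g. a single-GPU run).
        world_size = dist.get_world_size(group) if dist.is_initialized() else 1
        return num_local_experts * world_size

    def __init__(self, num_local_experts: int = 1, group=None):
        super().__init__()
        self.num_local_experts = num_local_experts
        self.group = group
        # Plain Python int, set independently for every layer instance.
        self._num_global_experts = self.global_expert_count(num_local_experts, group)

# A Pyramid-MoE-style stack can therefore give each layer a different count:
layers = [MOELayerSketch(num_local_experts=n) for n in (1, 2, 4)]
print([layer._num_global_experts for layer in layers])  # [1, 2, 4] on a single process
```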

@vchiley commented Sep 26, 2023

> self._num_global_experts = MOELayer.global_expert_count(self.num_local_experts, self.group)

Is this change (num_global_experts no longer being a buffer) the only issue with the PR?

We could remove it from this PR and open a follow-up PR.

@ghostplant commented Oct 1, 2023

> self._num_global_experts = MOELayer.global_expert_count(self.num_local_experts, self.group)
>
> Is this change (num_global_experts no longer being a buffer) the only issue with the PR?
>
> We could remove it from this PR and open a follow-up PR.

Sure. The non-ffn-ending-bias-training feature is relatively more acceptable. However, for this feature, what I worry about is whether it may have a negative impact on the two things below:

  • Previous SWIN-V2 checkpoint files containing MoE layers may fail to load after your changes, since data from those old state_dicts is no longer compatible with the new parameter registration. Many users with private checkpoints for private models may also hit failures after this update.
  • The original expert parameter initialization seeding was specially designed so that initial parameter values are deterministic regardless of whether expert parameters are sharded or not. In other words, 2 GPUs training 2 experts in total should have exactly the same training loss as 4 GPUs training 2 experts (i.e. each GPU training half of an expert). Let's assume "2 GPUs training 2 experts" uses seeds A + B on the first GPU and C + D on the second GPU. In "4 GPUs training 2 experts" mode, we seed A on the first GPU, B on the second, C on the third, and D on the last. This design of parameter seeding keeps the initialization of expert parameters stable even if you change the number of GPUs while keeping the global expert count unchanged (see the sketch after this comment). Your new changes seem to break this seed setting, right?

If your changes can avoid the above problems, we are happy to accept your request.
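
For illustration, here is a minimal sketch of that seeding idea (hypothetical code, not tutel's actual implementation; the function, shapes, and base seed are made up). Each global expert slice gets a seed derived only from its global slice index, so the initial values don't depend on how many GPUs the experts are spread over:

```python
import torch

def init_expert_params(global_experts: int, slices_per_expert: int,
                       rank: int, world_size: int,
                       d_model: int = 4, d_ffn: int = 8, base_seed: int = 1234):
    """Hypothetical sketch: every (expert, slice) pair has a fixed global seed,
    and each rank only materializes the slices it owns.  Because the seed
    depends on the global slice index alone, '2 GPUs x 2 experts' and
    '4 GPUs x 2 experts' produce bitwise-identical expert weights."""
    total_slices = global_experts * slices_per_expert
    assert total_slices % world_size == 0, "slices must divide evenly across ranks"
    owned = total_slices // world_size
    cols_per_slice = d_ffn // slices_per_expert
    slices = []
    for i in range(owned):
        global_slice = rank * owned + i  # seeds A, B, C, D <-> indices 0, 1, 2, 3
        g = torch.Generator().manual_seed(base_seed + global_slice)
        slices.append(torch.randn(d_model, cols_per_slice, generator=g))
    return torch.cat(slices, dim=1)

# "2 GPUs training 2 experts": rank 0 owns slices (A, B), rank 1 owns (C, D).
two_gpu = [init_expert_params(2, 2, r, 2) for r in range(2)]
# "4 GPUs training 2 experts": each rank owns exactly one slice.
four_gpu = [init_expert_params(2, 2, r, 4) for r in range(4)]
assert torch.equal(torch.cat(two_gpu, dim=1), torch.cat(four_gpu, dim=1))
```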

@ghostplant commented Oct 1, 2023

To keep the original design unchanged, and so stay compatible with legacy checkpoint files, I suggest your modification follow this at least: if bias=True, which should be the default setting, then the parameter registration, seeding order, and seed initialization design should all be identical to the original's in all cases.

@vchiley commented Oct 5, 2023

All tensor naming is still the same. If bias=True, the generated state dict will be the same.

I only make the layer init more conventional (along with using a more conventional reset_parameters), e.g. like a torch.nn.Linear layer. (Standard PyTorch attaches params in __init__ instead of in an auxiliary self.update(ctx) fn; after the params are attached, it calls reset_parameters.) A sketch of that convention is below.

I just moved parameter init; I didn't change how init happens. The seeding is unchanged. The parameters are still initialized in an nn.Linear layer and copied over.
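
For reference, the conventional structure being described (sketch only, not the actual FusedExpertsNetwork code; the shapes and the kaiming/zeros init here are placeholders, since in the PR the values still come from an nn.Linear and are copied over):

```python
import math
import torch

class FusedExpertsSketch(torch.nn.Module):
    """Sketch of the torch.nn.Linear-style convention: attach parameters in
    __init__, then initialize them in reset_parameters()."""

    def __init__(self, num_local_experts: int, d_model: int, d_ffn: int, bias: bool = True):
        super().__init__()
        # Parameters are created directly in __init__ (no auxiliary update(ctx) step).
        self.w_in = torch.nn.Parameter(torch.empty(num_local_experts, d_model, d_ffn))
        self.w_out = torch.nn.Parameter(torch.empty(num_local_experts, d_ffn, d_model))
        if bias:
            self.b_in = torch.nn.Parameter(torch.empty(num_local_experts, d_ffn))
            self.b_out = torch.nn.Parameter(torch.empty(num_local_experts, d_model))
        else:
            self.register_parameter('b_in', None)
            self.register_parameter('b_out', None)
        self.reset_parameters()

    def reset_parameters(self):
        # Same pattern as torch.nn.Linear.reset_parameters().
        for w in (self.w_in, self.w_out):
            torch.nn.init.kaiming_uniform_(w, a=math.sqrt(5))
        for b in (self.b_in, self.b_out):
            if b is not None:
                torch.nn.init.zeros_(b)
```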

I'll open a new PR without self._num_global_experts.
