[bug] fix xavier uniform init for output layers #814

Open · wants to merge 1 commit into main
Conversation

hjlee1371

In TransformerConfig, the initialization for the output layers is supposed to be specified via output_layer_init_method:

output_layer_init_method: Callable = None
"""Method to initialize weights of the output layer of both attention and MLP blocks. If None,
will be set to megatron.core.utils.scaled_init_method_normal(init_method_std) which is torch nn
init normal with mean=0.0 and std=init_method_std / math.sqrt(2.0 * num_layers)."""

However, when Xavier uniform initialization is selected with --init-method-xavier-uniform, the output layers still use scaled_init_method instead. This PR fixes the bugs encountered when using Xavier uniform initialization.
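
A minimal sketch of the intended selection logic, assuming a scaled_init_method_normal helper matching the docstring quoted above; the choose_output_layer_init function and its argument names are illustrative, not the actual Megatron-LM code:

```python
import math

import torch


def scaled_init_method_normal(sigma, num_layers):
    """Normal init with mean=0.0 and std = sigma / sqrt(2 * num_layers),
    as described in the output_layer_init_method docstring above."""
    std = sigma / math.sqrt(2.0 * num_layers)

    def init_(tensor):
        return torch.nn.init.normal_(tensor, mean=0.0, std=std)

    return init_


def choose_output_layer_init(use_xavier_uniform, init_method_std, num_layers):
    # Hypothetical helper: when --init-method-xavier-uniform is set, the output
    # layers of the attention and MLP blocks should also use Xavier uniform,
    # rather than falling back to the scaled normal init (the bug this PR fixes).
    if use_xavier_uniform:
        return torch.nn.init.xavier_uniform_
    return scaled_init_method_normal(init_method_std, num_layers)
```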
