
Incorrect transformer mask size #2344

Open · egaznep opened this issue Jan 21, 2024 · 8 comments
Labels
bug Something isn't working

Comments

egaznep commented Jan 21, 2024

Describe the bug

PyTorch's multi-head attention code enforces that the data size and the mask size match:

https://github.com/pytorch/pytorch/blob/df4e3d9d08f3d5d5439c3626be4bf29659488cdf/torch/nn/functional.py#L5442-L5444

However, the SpeechBrain code generates masks according to the longest wav_len in the batch, which can be shorter than the padded sequence length, triggering the check linked above.

```python
src_key_padding_mask = ~length_to_mask(abs_len).bool()
```

Solution:

The function length_to_mask accepts an optional max_len argument, which could be used here (see the sketch after the excerpt below). Should I open a PR?

```python
def length_to_mask(length, max_len=None, dtype=None, device=None):
    """Creates a binary mask for each sequence.

    Reference: https://discuss.pytorch.org/t/how-to-generate-variable-length-mask/23397/3

    Arguments
    ---------
    length : torch.LongTensor
        Containing the length of each sequence in the batch. Must be 1D.
    max_len : int
        Max length for the mask, also the size of the second dimension.
    dtype : torch.dtype, default: None
        The dtype of the generated mask.
    device: torch.device, default: None
        The device to put the mask variable.
    """
    ...
```
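For illustration, here is a minimal sketch of the proposed fix, assuming the mask is built from the relative wav_len the way TransformerASR does it (the names src and wav_len are taken from that context; this is not the merged implementation):

```python
# Hypothetical sketch of the proposed fix: pass the padded time dimension as
# max_len so the mask width always matches src.shape[1], even when the longest
# sequence in this batch is shorter than the padding length.
abs_len = torch.round(wav_len * src.shape[1])
src_key_padding_mask = ~length_to_mask(abs_len, max_len=src.shape[1]).bool()
```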

Expected behaviour

The mask should be generated with the size that PyTorch expects. Instead, an exception is thrown.

To Reproduce

No response

Environment Details

speechbrain==0.5.16  (checking the develop branch, this still seems to be an issue)
pytorch==2.1.1

Relevant Log Output

No response

Additional Context

No response

egaznep added the bug label Jan 21, 2024
Adel-Moumen (Collaborator) commented:
Hello @egaznep, thanks for opening this issue!

Could you please have a look @TParcollet? Thanks :)

TParcollet (Collaborator) commented:
Hello @egaznep, I am not sure I understand the issue here. Could you provide a code snippet showing the error explicitly? The function length_to_mask() is expected to produce masks that account for padding, i.e. the real size of the input tensor, for each sequence. Could you detail a bit more what you are trying to achieve?

egaznep (Author) commented Jan 30, 2024

@TParcollet Here is a minimal working (or in this case crashing) example:

```python
import torch
import torch.nn as nn
from speechbrain.lobes.models.transformer.TransformerASR import TransformerASR


# Instantiate the TransformerASR model
model = TransformerASR(
    tgt_vocab=720,
    input_size=80,
    d_model=512,
    nhead=1,
    num_encoder_layers=1,
    num_decoder_layers=1,
)

# Generate some dummy input with different lengths
input_lengths = torch.tensor([l for l in range(10, 101, 10)])
input_data = [torch.randn(length, 80) for length in input_lengths]
input_targets = [torch.randint(low=0, high=720, size=(length.item(),)) for length in input_lengths]

# Pad the input sequences to have the same length
input_data = nn.utils.rnn.pad_sequence(input_data, batch_first=True)
input_targets = nn.utils.rnn.pad_sequence(input_targets, batch_first=True)
input_lengths = input_lengths / 100.0  # relative lengths, as expected by wav_len

print(input_data.shape, input_targets.shape, input_lengths.shape)

output = model.forward(input_data, input_targets, wav_len=input_lengths)  # works
output = model.forward(input_data[:-1], input_targets[:-1], wav_len=input_lengths[:-1])  # fails
```

The first call to model.forward has wav_lens of 10, 20, ..., 100, so all sequences are padded to 100. The mask computation does not fail, because it can infer the proper mask size from the longest wav_len.

The second call to model.forward, however, has wav_lens of 10, 20, ..., 90, while the inputs are still padded to 100, since that is the longest length in the whole set. The mask computation fails because the mask generator does not check how long the padded sequences are, only how large the largest wav_len is.
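For completeness, a minimal sketch of where the shapes diverge, assuming length_to_mask from speechbrain.dataio.dataio and the abs_len computation used in TransformerASR:

```python
import torch
from speechbrain.dataio.dataio import length_to_mask

feats = torch.randn(9, 100, 80)              # batch padded to 100 frames elsewhere
wav_len = torch.arange(10, 100, 10) / 100.0  # relative lengths 0.1 .. 0.9

abs_len = torch.round(wav_len * feats.shape[1])  # 10, 20, ..., 90
mask = ~length_to_mask(abs_len).bool()
print(mask.shape)   # torch.Size([9, 90]) -> does not match feats.shape[1] == 100

fixed = ~length_to_mask(abs_len, max_len=feats.shape[1]).bool()
print(fixed.shape)  # torch.Size([9, 100]) -> matches the padded input
```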

TParcollet (Collaborator) commented:
Hello, thanks. SpeechBrain padding is relative to the batch, not the dataset: the max length in wav_lens is the max length of the batch.

egaznep (Author) commented Jan 31, 2024

I had this error while training a model using DistributedDataParallel on 2 GPUs. Could it be that the initial data sampler does the padding relative to the complete batch, but the error I am facing occurs when the batch is split across the individual GPUs?

TParcollet (Collaborator) commented Feb 1, 2024

@Gastron correct me if I am wrong, but as far as I know, the DDP sampler is per-process, hence the padding should be relative to the batch of each process. @egaznep any chance you could give us an example where this happens?

Adel-Moumen (Collaborator) commented:
Hello @egaznep, any news on your side, please?

egaznep (Author) commented Apr 8, 2024

I was swamped with other projects until now, and I am out of office this week. I will try to reproduce this when I am back, but I suspect it is more likely an issue with that specific project and not really related to SpeechBrain internals. Thank you for reminding me.
