
Incorrect transformer mask size #2344

Open · egaznep opened this issue Jan 21, 2024 · 8 comments
Labels
bug Something isn't working

Comments

egaznep commented Jan 21, 2024

Describe the bug

PyTorch's multi-head attention code enforces that the data size and the mask size match:

https://github.com/pytorch/pytorch/blob/df4e3d9d08f3d5d5439c3626be4bf29659488cdf/torch/nn/functional.py#L5442-L5444

However, the SpeechBrain code generates masks according to the longest wav_len in the batch, which can be shorter than the padded sequence length, triggering the check linked above.

```python
src_key_padding_mask = ~length_to_mask(abs_len).bool()
```

Solution:

The function length_to_mask accepts an optional max_len argument, which could be used here (see the sketch after the excerpt below). Should I open a PR?

```python
def length_to_mask(length, max_len=None, dtype=None, device=None):
    """Creates a binary mask for each sequence.

    Reference: https://discuss.pytorch.org/t/how-to-generate-variable-length-mask/23397/3

    Arguments
    ---------
    length : torch.LongTensor
        Containing the length of each sequence in the batch. Must be 1D.
    max_len : int
        Max length for the mask, also the size of the second dimension.
    dtype : torch.dtype, default: None
        The dtype of the generated mask.
    device: torch.device, default: None
        The device to put the mask variable.
    """
    ...
```
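For illustration, here is a minimal sketch of the proposed fix, assuming the mask is built from the relative wav_len the way TransformerASR does it (the names src and wav_len are taken from that context; this is not the merged implementation):

```python
# Hypothetical sketch of the proposed fix: pass the padded time dimension as
# max_len so the mask width always matches src.shape[1], even when the longest
# sequence in this batch is shorter than the padding length.
abs_len = torch.round(wav_len * src.shape[1])
src_key_padding_mask = ~length_to_mask(abs_len, max_len=src.shape[1]).bool()
```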

Expected behaviour

The mask should be generated with the size that PyTorch expects. Instead, an exception is thrown.

To Reproduce

No response

Environment Details

speechbrain==0.5.16  (checking the develop branch, this still seems to be an issue)
pytorch==2.1.1

Relevant Log Output

No response

Additional Context

No response

egaznep added the bug label Jan 21, 2024
Adel-Moumen (Collaborator) commented:
Hello @egaznep, thanks for opening this issue!

Could you please have a look @TParcollet? Thanks :)

TParcollet (Collaborator) commented:
Hello @egaznep, I am not sure I understand the issue here. Could you provide a code snippet showing the error explicitly? The function length_to_mask() is expected to produce masks that account for padding, i.e. the real size of the input tensor, for each sequence. Could you detail a bit more what you are trying to achieve?

egaznep (Author) commented Jan 30, 2024

@TParcollet Here is a minimal working (or in this case crashing) example:

```python
import torch
import torch.nn as nn
from speechbrain.lobes.models.transformer.TransformerASR import TransformerASR


# Instantiate the TransformerASR model
model = TransformerASR(
    tgt_vocab=720,
    input_size=80,
    d_model=512,
    nhead=1,
    num_encoder_layers=1,
    num_decoder_layers=1,
)

# Generate some dummy input with different lengths
input_lengths = torch.tensor([l for l in range(10, 101, 10)])
input_data = [torch.randn(length, 80) for length in input_lengths]
input_targets = [torch.randint(low=0, high=720, size=(length.item(),)) for length in input_lengths]

# Pad the input sequences to have the same length
input_data = nn.utils.rnn.pad_sequence(input_data, batch_first=True)
input_targets = nn.utils.rnn.pad_sequence(input_targets, batch_first=True)
input_lengths = input_lengths / 100.0  # relative lengths, as expected by wav_len

print(input_data.shape, input_targets.shape, input_lengths.shape)

output = model.forward(input_data, input_targets, wav_len=input_lengths)  # works
output = model.forward(input_data[:-1], input_targets[:-1], wav_len=input_lengths[:-1])  # fails
```

The first call to model.forward has wav_lens of 10, 20, ..., 100, so all sequences are padded to 100. The mask computation does not fail, because it can infer the proper mask size from the longest wav_len.

The second call to model.forward, however, has wav_lens of 10, 20, ..., 90, while the inputs are still padded to 100, since that is the longest length in the whole set. The mask computation fails because the mask generator does not check how long the padded sequences are, only how large the largest wav_len is.
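For completeness, a minimal sketch of where the shapes diverge, assuming length_to_mask from speechbrain.dataio.dataio and the abs_len computation used in TransformerASR:

```python
import torch
from speechbrain.dataio.dataio import length_to_mask

feats = torch.randn(9, 100, 80)              # batch padded to 100 frames elsewhere
wav_len = torch.arange(10, 100, 10) / 100.0  # relative lengths 0.1 .. 0.9

abs_len = torch.round(wav_len * feats.shape[1])  # 10, 20, ..., 90
mask = ~length_to_mask(abs_len).bool()
print(mask.shape)   # torch.Size([9, 90]) -> does not match feats.shape[1] == 100

fixed = ~length_to_mask(abs_len, max_len=feats.shape[1]).bool()
print(fixed.shape)  # torch.Size([9, 100]) -> matches the padded input
```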

TParcollet (Collaborator) commented:
Hello, thanks. SpeechBrain padding is relative to the batch, not the dataset: the max length in wav_lens is the max length of the batch.

egaznep (Author) commented Jan 31, 2024

I had this error while training a model using DistributedDataParallel on 2 GPUs. Could it be that the initial data sampler does the padding relative to the complete batch, but the error I am facing occurs when the batch is split across the individual GPUs?

TParcollet (Collaborator) commented Feb 1, 2024

@Gastron correct me if I am wrong, but as far as I know, the DDP sampler is per-process, hence the padding should be relative to the batch of each process. @egaznep any chance you could give us an example where this happens?

Adel-Moumen (Collaborator) commented:
Hello @egaznep, any news on your side, please?

egaznep (Author) commented Apr 8, 2024

I was swamped with other projects until now, and I am out of office this week. I will try to reproduce this when I am back, but I suspect it is more likely an issue with that specific project and not really related to SpeechBrain internals. Thank you for reminding me.
