Fix llama model sdpa attention forward function masking bug when output_attentions=True #30652

Aladoro · 2024-05-04T15:58:07Z

What does this PR do?

Very simple fix to a nasty issue I have recently encountered. Due to its simplicity, I opened a PR directly without raising an issue first to avoid redundancy. Please, let me know if I should also raise an issue, and I'll do that right away.

Description

When output_attentions is True, the sdpa implementation's forward method falls back to the eager implementation's forward method. However, the mask passed to it is still None whenever sdpa's 'AttentionMaskConverter._ignore_causal_mask_sdpa' returns True (which occurs whenever the input is unmasked, since sdpa then defers causal masking to the PyTorch sdpa implementation).
This inconsistency causes the model to run the eager implementation with no causal attention mask if the original input is unmasked (e.g., if a single input sequence is encoded or all encoded input sequences have the same length) and output_attentions=True.
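
The control flow described above can be sketched in a few lines of plain Python. This is an illustrative model only: `make_mask`, `can_skip_mask`, the `fixed` switch, and the list-of-lists mask are hypothetical stand-ins, not the transformers implementation, and padding is not encoded in the returned mask. The sketch shows the mask builder returning None for an unpadded input, and a guard on output_attentions that keeps an explicit mask for the eager fallback.

```python
NEG_INF = float("-inf")


def causal_mask(seq_len):
    """Explicit additive causal mask: 0.0 where attention is allowed, -inf elsewhere."""
    return [[0.0 if k <= q else NEG_INF for k in range(seq_len)] for q in range(seq_len)]


def can_skip_mask(attention_mask):
    """Hypothetical stand-in for the 'ignore causal mask' check: True when nothing is padded."""
    return all(all(tok == 1 for tok in row) for row in attention_mask)


def make_mask(attention_mask, attn_implementation, output_attentions, fixed=True):
    """Return None when the fused sdpa kernel can apply causality itself, else an explicit mask."""
    seq_len = len(attention_mask[0])
    sdpa_handles_causality = attn_implementation == "sdpa" and can_skip_mask(attention_mask)
    if fixed:
        # The eager fallback used for output_attentions=True needs a real mask.
        sdpa_handles_causality = sdpa_handles_causality and not output_attentions
    # Simplification: padding is not encoded in the explicit mask returned here.
    return None if sdpa_handles_causality else causal_mask(seq_len)


unpadded = [[1, 1, 1]]
# Without the guard, the eager fallback would receive no mask at all.
print(make_mask(unpadded, "sdpa", output_attentions=True, fixed=False))  # None
# With the guard, an explicit causal mask is built instead.
print(make_mask(unpadded, "sdpa", output_attentions=True, fixed=True) is not None)  # True
```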

Did you read the contributor guideline, Pull Request section?
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.

Tagging @ArthurZucker and @younesbelkada


Aladoro · 2024-05-04T16:04:53Z

A minimal example of this erroneous behavior can be reproduced via:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Meta-Llama-3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, 
                                             device_map='cuda', 
                                             torch_dtype=torch.bfloat16
                                             )

tokenizer.pad_token_id = tokenizer.eos_token_id

inputs = tokenizer(["Today is the day I went to the store and ..."],
                    return_tensors="pt").to('cuda')

expanded_batch_size = 1


outputs = model.generate(
    input_ids = inputs['input_ids'].expand(expanded_batch_size, -1),
    attention_mask = inputs['attention_mask'].expand(expanded_batch_size, -1),
    do_sample=False,
    max_new_tokens=5, 
    return_dict_in_generate=True,
    )


input_length = inputs.input_ids.shape[1]
sequences= outputs.sequences

for sequence in sequences:
    decoded_sequence = tokenizer.decode(sequence)
    print(decoded_sequence)

# separator
print('-'*20)


outputs = model.generate(
    input_ids = inputs['input_ids'].expand(expanded_batch_size, -1),
    attention_mask = inputs['attention_mask'].expand(expanded_batch_size, -1),
    do_sample=False,
    max_new_tokens=5, 
    return_dict_in_generate=True,
    output_attentions=True,  # the only change: sdpa attention now falls back to the eager forward
    )


input_length = inputs.input_ids.shape[1]
sequences= outputs.sequences

# without the fix, the generated outputs are garbage since no causal mask is applied
for sequence in sequences:
    decoded_sequence = tokenizer.decode(sequence)
    print(decoded_sequence)
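
The same failure mode can be observed without loading a checkpoint. The sketch below is a hedged, CPU-only illustration on random tensors; `eager_attention` is a hypothetical helper written for this example, not the model's attention class. It compares PyTorch's `scaled_dot_product_attention` with `is_causal=True` against a manual softmax attention, once with no mask and once with an explicit causal mask.

```python
import math

import torch
import torch.nn.functional as F


def eager_attention(q, k, v, mask=None):
    """Manual softmax attention; applies `mask` additively only if one is given."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    if mask is not None:
        scores = scores + mask
    return torch.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
batch, heads, seq_len, head_dim = 1, 2, 5, 8
q, k, v = (torch.randn(batch, heads, seq_len, head_dim) for _ in range(3))

# Reference: the fused kernel applies causality itself, so no mask tensor is passed.
sdpa_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Explicit additive causal mask: -inf strictly above the diagonal, 0 elsewhere.
causal = torch.full((seq_len, seq_len), float("-inf")).triu(diagonal=1)

unmasked_out = eager_attention(q, k, v, mask=None)
masked_out = eager_attention(q, k, v, mask=causal)

matches_without_mask = torch.allclose(sdpa_out, unmasked_out, atol=1e-5)
matches_with_mask = torch.allclose(sdpa_out, masked_out, atol=1e-5)
```

Only the explicitly masked variant is expected to agree with the fused kernel, which mirrors why generation degrades when the eager path receives a None mask.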


ArthurZucker

Great catch.

The call causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype) on line 1127 needs to be skipped as well.
We need to add your small example script as a test! 🤗
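
For context, a hedged sketch of what that call is understood to do: it unmasks rows that are masked everywhere (for example, the leading rows under left padding), so that the fused sdpa kernel never sees a row made entirely of `min_dtype`. The function below is a simplified approximation written for illustration; `unmask_fully_masked_rows` is a hypothetical name, and the real helper may differ in details.

```python
import torch


def unmask_fully_masked_rows(expanded_mask, min_dtype):
    """Zero out (i.e. unmask) every row whose entries all equal `min_dtype`."""
    fully_masked = torch.all(expanded_mask == min_dtype, dim=-1, keepdim=True)
    return expanded_mask.mul(~fully_masked)


min_dtype = torch.finfo(torch.float32).min
# One left-padded sequence: position 0 is padding, so query row 0 is masked everywhere.
mask = torch.tensor(
    [[[[min_dtype, min_dtype, min_dtype],
       [min_dtype, 0.0, min_dtype],
       [min_dtype, 0.0, 0.0]]]]
)
fixed_mask = unmask_fully_masked_rows(mask, min_dtype)
```

Since the eager path does not go through the fused kernel, this step is unnecessary when output_attentions=True.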

Aladoro · 2024-05-06T13:13:25Z

> Great catch.
>
> causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype) line 1127 needs to be ignored as well.
>
> we need to add your small example script as a test! 🤗

@ArthurZucker Thanks for reviewing my pull request and all your work in maintaining this awesome repo! :) Regarding your comments:

1. Done.
2. Let me know if you would like me to write a small test script for this bug myself (i.e., check that the outputs generated with the 'eager' implementation match the outputs generated with output_attentions=True, although inherent stochasticity in the GPU kernels might make it difficult to always get 100% consistent results).

P.S. Some CircleCI tests seem to be failing on the main branch, and they are now failing here as well after I merged main.

ArthurZucker · 2024-05-06T15:17:59Z

For 2., the test is already implemented, but I don't think it tests output_attentions=True. It is probably a matter of adding a parameterized case. See this file (and the generate tests): https://github.com/huggingface/transformers/blob/main/tests/test_modeling_common.py#L3590.

Potentially add output_attentions to make sure sdpa with output attentions matches eager with or without it (which it is supposed to!)
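
A rough shape for such a parameterized check, written as a sketch rather than the actual test in tests/test_modeling_common.py. `fake_logits` is a hypothetical stand-in for running a model under a given attention implementation; a real test would load the same weights under both implementations and compare outputs with a tolerance, since the kernels are not bit-identical.

```python
import itertools
import math
import unittest


def fake_logits(attn_implementation, output_attentions):
    """Hypothetical stand-in for a model forward pass; returns a flat list of logits."""
    base = [0.10, -1.25, 3.50]
    # Simulate tiny kernel-level numerical noise between implementations.
    noise = 1e-6 if attn_implementation == "sdpa" else 0.0
    return [x + noise for x in base]


class SdpaEagerEquivalenceTest(unittest.TestCase):
    def test_outputs_match(self):
        # Cover every combination of the flag on both sides.
        for eager_flag, sdpa_flag in itertools.product([False, True], repeat=2):
            with self.subTest(eager_output_attentions=eager_flag, sdpa_output_attentions=sdpa_flag):
                eager = fake_logits("eager", eager_flag)
                sdpa = fake_logits("sdpa", sdpa_flag)
                for a, b in zip(eager, sdpa):
                    # Compare with an absolute tolerance instead of exact equality.
                    self.assertTrue(math.isclose(a, b, rel_tol=0.0, abs_tol=1e-4))
```

Using `subTest` keeps the four flag combinations in one test method while still reporting each failing combination separately.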

ArthurZucker · 2024-05-06T15:19:18Z

Feel free to rebase; it might be fixed on main, or the tests may just be flaky.

Aladoro · 2024-05-06T17:43:01Z

> Feel free to rebase it might be fixed on main / be flaky

Just did :)

HuggingFaceDocBuilderDev · 2024-05-07T10:07:17Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Aladoro · 2024-05-07T13:26:40Z

@ArthurZucker Let me know if you think this fix is ready for merging, or if you'd like to add the tests to the same PR!

ArthurZucker

Would be nice to just add the test in this PR 😉


Aladoro · 2024-05-12T12:26:33Z

> Would be nice to just add the test in this PR 😉

Alright - I made the addition of output_attentions=True to the sdpa equivalence test, as you suggested ;) (Black code re-formatting seems to have messed up the diff, but the changes are minimal...)

@ArthurZucker - Let me know if there are any outstanding issues or if there is something else missing before merging ^^

ArthurZucker

Ok, let's make sure you rebase as Gemma was updated a bit and commit with [run-slow] so that slow tests are run!

tests/test_modeling_common.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

ArthurZucker

LGTM, fyi @fxmarty @gante and @ydshieh

ArthurZucker · 2024-05-15T13:15:04Z

(Merging once the CIs are all green!)

Aladoro · 2024-05-15T13:46:00Z

@ArthurZucker thanks for your suggestions! I also propagated the same changes to the new jetmoe model. All default checks are now passing ^^

ArthurZucker · 2024-05-15T17:48:24Z

Thanks for the fix

Aladoro added 2 commits May 4, 2024 15:23

Fix llama model forward function with attention=True, same-length enc…

514c1c3

…oded sequence.

Fix style

6c0fa7b

Aladoro added 2 commits May 4, 2024 16:30

propagate fix to modeling_cohere, gemma, dbrx, and olmo (which copy t…

0d91bea

…he same sdpa masking logic from llama)

Fix style

894c14b

ArthurZucker reviewed May 6, 2024


Aladoro and others added 2 commits May 6, 2024 12:48

ignore unnecessary sdpa mask converter when output_attentions=True

2584308

Merge branch 'huggingface:main' into fix-llama-mask-output-attn

c7bdc95

Merge branch 'huggingface:main' into fix-llama-mask-output-attn

8d793a3

ArthurZucker mentioned this pull request May 9, 2024

Add torch.compile for Mistral #30642

Merged

4 tasks

ArthurZucker reviewed May 9, 2024


Aladoro and others added 2 commits May 12, 2024 09:20

Merge branch 'huggingface:main' into fix-llama-mask-output-attn

fc143ac

add tests checking sdpa and eager outputs match when output_attention…

3e0fada

…s=True

ArthurZucker mentioned this pull request May 15, 2024

GemmaForCausalLM Causal Masking Not Working #30813

Closed

4 tasks

ArthurZucker reviewed May 15, 2024


tests/test_modeling_common.py (review thread outdated, resolved)

Aladoro and others added 2 commits May 15, 2024 15:10

Split if statements in two lines

9acc119

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

Merge branch 'huggingface:main' into fix-llama-mask-output-attn

08dbd4b

ArthurZucker approved these changes May 15, 2024


Aladoro added 3 commits May 15, 2024 13:18

Fix formatting

9b79aee

Add fix to new jetmoe model

dd69923

Add missing output_attentions argument to jetmoe mask creation

ad4aded

ArthurZucker merged commit 4b3eb19 into huggingface:main May 15, 2024
22 checks passed

cmathw mentioned this pull request May 15, 2024

Update Gemma to reflect upstream HF changes TransformerLensOrg/TransformerLens#596

Merged

7 tasks

Aladoro deleted the fix-llama-mask-output-attn branch May 15, 2024 23:32
