
Problem with the absence of attention_mask when using sliding_window #1565

Open
DarijaNS opened this issue Feb 15, 2024 · 0 comments

Comments


DarijaNS commented Feb 15, 2024

Hello!

I am trying to fine-tune an ELECTRA model on my own dataset, as described HERE, with the following model arguments:

import logging

import torch

from simpletransformers.language_modeling import (
    LanguageModelingArgs,
    LanguageModelingModel,
)

cuda_available = torch.cuda.is_available()

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

TRAIN_FILE = "train_set.txt"
VAL_FILE = "val_set.txt"

model_args = LanguageModelingArgs()

model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.num_train_epochs = 3
model_args.dataset_type = "simple"

model_args.sliding_window = True
model_args.max_seq_length = 512
model_args.train_batch_size = 32
model_args.gradient_accumulation_steps = 32

model_args.config = {
    "embedding_size": 768,
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_attention_heads": 12,
}

model_args.vocab_size = 32000

model_args.evaluate_during_training = True
model_args.evaluate_during_training_silent = False
model_args.evaluate_during_training_verbose = True
model_args.manual_seed = 42

model = LanguageModelingModel(
    model_type="electra",
    model_name="electra",
    discriminator_name="classla/bcms-bertic",
    generator_name="classla/bcms-bertic-generator",
    args=model_args,
    use_cuda=cuda_available,
)

model.train_model(TRAIN_FILE, eval_file=VAL_FILE)

Whenever I set model_args.sliding_window = True, I get this warning: We strongly recommend passing in an attention_mask since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.

I took a closer look at the source code in language_modeling_model.py and noticed that an attention_mask is only created when use_hf_datasets is enabled:

....
                inputs = inputs.to(self.device)
                attention_mask = (
                    batch["attention_mask"].to(self.device)
                    if self.args.use_hf_datasets
                    else None
                )
                token_type_ids = (
                    batch["token_type_ids"].to(self.device)
                    if self.args.use_hf_datasets and "token_type_ids" in batch
                    else None
                )
...

I assume that without sliding_window no padding is added, which is why the warning does not appear in that case.

Did I understand that correctly? I also evaluated the model on a test set whose examples are all shorter than max_seq_length, both with and without this parameter, and found drastic differences in eval loss and perplexity.
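A quick sanity check with plain transformers should show why the missing mask matters. Something along these lines (just a sketch; I reuse the discriminator checkpoint I already have, and I have not verified the exact numbers it prints) compares the encoder outputs for a padded input with and without an attention_mask:

import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: the discriminator checkpoint and its tokenizer are used directly.
tokenizer = AutoTokenizer.from_pretrained("classla/bcms-bertic")
model = AutoModel.from_pretrained("classla/bcms-bertic")
model.eval()

# A short sentence padded to a fixed length, the way the last sliding-window chunk would be padded.
enc = tokenizer(
    "Ovo je kratka rečenica.",
    padding="max_length",
    max_length=32,
    return_tensors="pt",
)

with torch.no_grad():
    with_mask = model(
        input_ids=enc["input_ids"], attention_mask=enc["attention_mask"]
    ).last_hidden_state
    # Without a mask, the pad tokens are attended to like real tokens.
    without_mask = model(input_ids=enc["input_ids"]).last_hidden_state

# If padding were harmless this would be ~0; a large value means the pad tokens leak into the representations.
print((with_mask - without_mask).abs().max())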

So my question is: is there a way to pass the attention_mask, which is generally important for LM training, or does it have no influence on the quality of fine-tuning in this particular situation? A rough idea of what I mean is sketched below.
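In the else branch of the snippet above, I imagine the mask could be derived from the padded input_ids themselves. A minimal, self-contained sketch of what I mean (the token ids and pad id here are made up; in the real code the pad id would presumably come from the model's tokenizer, e.g. tokenizer.pad_token_id):

import torch

# Hypothetical pad id; in practice it would come from the tokenizer.
pad_token_id = 0

# A batch of two windows: the second one is padded out to the window length.
inputs = torch.tensor([
    [101, 345, 678, 910, 102],
    [101, 345, 102,   0,   0],
])

# Mask is 1 for real tokens, 0 for padding, same shape as inputs.
attention_mask = (inputs != pad_token_id).long()
print(attention_mask)
# tensor([[1, 1, 1, 1, 1],
#         [1, 1, 1, 0, 0]])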

Thank you in advance!
Darija
