
NER with BERT-based Model: Unexpected Panic During Prediction #455

Open
mmich-pl opened this issue May 1, 2024 · 0 comments
mmich-pl commented May 1, 2024

Description

I am working on a university assignment that involves extracting Named Entities (NE) from Polish text using a BERT-based model. I chose the FastPDN model from Hugging Face (clarin-pl/FastPDN) and prepared it using the utils/convert_model.py script.

I created a TokenClassificationConfig based on one of the examples. The config and special_tokens_map files were downloaded from Hugging Face, and likewise vocab.json, except that the Bert tokenizer expects a plain-text vocabulary, so I extracted all keys from the JSON and saved them to a txt file, one per line (a sketch of this extraction step follows the config below).

    let input = ["Nazywam się Jan Kowalski i mieszkam we Wrocławiu."];

    let config = TokenClassificationConfig::new(
        ModelType::Bert,
        ModelResource::Torch(Box::new(LocalResource::from(PathBuf::from(model_path)))),
        LocalResource::from(PathBuf::from(model_config_path)),
        LocalResource::from(PathBuf::from(vocab_path)),
        Some(LocalResource::from(PathBuf::from(merge_path))), // merges resource only relevant with ModelType::Roberta
        false, // lower_case
        false, // strip_accents
        None,  // add_prefix_space
        LabelAggregationOption::Mode,
    );
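
For reference, here is a minimal sketch of that vocabulary extraction step (my own helper, assuming serde_json is available; note that for a BERT-style plain-text vocabulary the line number becomes the token id, so the lines should follow the id order from vocab.json):

    use std::collections::HashMap;
    use std::fs;

    // Hypothetical helper: read vocab.json ("token" -> id) and write a
    // plain-text vocabulary with one token per line, ordered by id.
    fn vocab_json_to_txt(json_path: &str, txt_path: &str) -> Result<(), Box<dyn std::error::Error>> {
        let raw = fs::read_to_string(json_path)?;
        let vocab: HashMap<String, usize> = serde_json::from_str(&raw)?;
        let mut entries: Vec<(String, usize)> = vocab.into_iter().collect();
        entries.sort_by_key(|&(_, id)| id); // line number must match token id
        let tokens: Vec<String> = entries.into_iter().map(|(token, _)| token).collect();
        fs::write(txt_path, tokens.join("\n"))?;
        Ok(())
    }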

Initially, I encountered issues with tokenization when using the BertTokenizer. The output tokens did not match the expected format, leading to incorrect predictions when using the predict_full_entities method.

    // args: vocab path, lower_case, strip_accents, special token mapping path
    let tokenizer = BertTokenizer::from_file_with_special_token_mapping(vocab_path, false, false, special_tokens)?;
    println!("{:?}", tokenizer.tokenize(input[0]));

    let ner_model = NERModel::new_with_tokenizer(config, TokenizerOption::Bert(tokenizer))?;
    let output = ner_model.predict_full_entities(&input);
    for entity in output {
        println!("{entity:?}");
    }

As output, I got:

["<unk>", "się", "Jan", "<unk>", "i", "<unk>", "we", "<unk>", "."]
[]

Upon switching to a tokenizer created from a tokenizer.json file (using TokenizerOption::from_hf_tokenizer_file), the tokenization improved significantly. The tokens now correctly represent the words and punctuation in the input text.

    let tok_opt = TokenizerOption::from_hf_tokenizer_file(tokenizer_path, special_tokens).unwrap();
    println!("{:?}", tok_opt.tokenize(input[0]));
    let ner_model = NERModel::new_with_tokenizer(config, tok_opt)?;

which prints:

["Nazy", "wam</w>", "się</w>", "Jan</w>", "Kowalski</w>", "i</w>", "mieszkam</w>", "we</w>", "Wrocławiu</w>", ".</w>"]

However, I now encounter a runtime panic during the prediction phase:

thread 'main' panicked at <path>/rust-bert/src/pipelines/token_classification.rs:1113:51:
slice index starts at 50 but ends at 49
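
For context, this is the generic panic Rust produces when a slice range's start exceeds its end, as in this minimal reproduction (independent of rust-bert):

    fn main() {
        let v = vec![0u8; 60];
        let _ = &v[50..49]; // panics: "slice index starts at 50 but ends at 49"
    }

So it looks like token_classification.rs constructs an inverted or empty range for this input, possibly around a 50-token boundary, though I have not dug into the cause.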

Environment:

  • Rust version: 1.77.2
  • PyTorch version: 2.2.0
  • tch version: v0.15.0
  • rust-bert: local copy of the repository (current version of the main branch)

I would be grateful if you could help.

EDIT: trying to use BertTokenizer was a complete mistake on my part: the model apparently uses a customized tokenizer that differs slightly from the base BERT one (the "</w>" suffixes in the output above suggest a BPE-style tokenizer rather than BERT's WordPiece).
