
📰 Named entity recognition (NER) and entity linking (EL) on a dataset of patents


kinivi/patent_ner_linking


Named entity recognition (NER) and hyponym/hypernym relationship detection on a dataset of patents

The main goals of this project are:

  • Train an NER model on a dataset of patents from a specific domain
  • Fine-tune the model with Prodigy
  • Implement automatic detection of hyponyms/hypernyms with Hearst patterns
  • Validate the detection results with several methods, including Wikidata

Structure

Setup

  1. Install dependencies from requirements.txt
  2. Unpack data:
    tar -xvf G06K.txt.gz
  3. Open project.ipynb and run the first cell to check that all imports work properly
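
The import check in step 3 can also be approximated outside the notebook. This is a minimal sketch, not the notebook's first cell: the helper name `missing_modules` and the module names in the usage line are placeholders, since the actual dependency list lives in requirements.txt.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on this system."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for malformed names or broken parent packages
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Placeholder names: replace with the modules listed in requirements.txt.
    print(missing_modules(["json", "re"]))
```

An empty list means every listed module can be located; any name that is returned still needs to be installed.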

Notebook structure

Here is a brief overview of the project.ipynb parts.

Data processing


In this section, the patent text is read and processed to extract potential named entities using a curated list of terms, manyterms.lower.txt.
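
The term-based extraction described here can be sketched as follows. This is an illustration only, not the notebook's code: it assumes the term list holds one term per line, and the function names and the `TERM` label are hypothetical.

```python
import re


def load_terms(lines):
    """Normalize a term list: strip whitespace, lowercase, drop empty lines."""
    terms = {line.strip().lower() for line in lines if line.strip()}
    # Longest terms first, so "neural network" is preferred over "network".
    return sorted(terms, key=lambda t: (-len(t), t))


def find_term_spans(text, terms, label="TERM"):
    """Return non-overlapping (start, end, label) character spans of known terms."""
    if not terms:
        return []
    alternation = "|".join(re.escape(t) for t in terms)
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return [(m.start(), m.end(), label) for m in pattern.finditer(text)]


if __name__ == "__main__":
    terms = load_terms(["neural network", "network", "image sensor"])
    text = "A neural network processes the image sensor output."
    for start, end, label in find_term_spans(text, terms):
        print(text[start:end], label)
```

In practice the terms would come from the file, for example `load_terms(open("manyterms.lower.txt", encoding="utf-8"))`. Character spans of this kind are a common input format for NER training data.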

Training NER model


Next, we train the model on the created dataset.
Additionally, if you have access to Prodigy, you can apply active learning to tune the model.
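
The README does not say which input format the training step expects. Many NER trainers accept token-level BIO tags, so the sketch below shows one way to convert character spans into that form. The tokenizer (words and single punctuation marks) and the function name `spans_to_bio` are assumptions made for illustration.

```python
import re


def spans_to_bio(text, spans):
    """Convert (start, end, label) character spans into (token, BIO tag) pairs."""
    tagged = []
    # Simple tokenizer: runs of word characters, or a single punctuation mark.
    for m in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            if start <= m.start() and m.end() <= end:
                # The first token of a span gets B-, the following ones get I-.
                prefix = "B" if m.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((m.group(), tag))
    return tagged


if __name__ == "__main__":
    text = "A neural network processes images."
    for token, tag in spans_to_bio(text, [(2, 16, "TERM")]):
        print(token, tag)
```

Tokens that only partly overlap a span are tagged `O` here. A real pipeline would normally rely on the tokenizer of the chosen NER library so that spans and tokens stay aligned.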

Hearst patterns for hyponym detection


This section is dedicated to extracting candidate entity links (such as hypernym relations) using Hearst patterns.
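
As an illustration of the idea, the sketch below implements a single classic Hearst pattern, "X such as A, B and C", with plain regular expressions. It is deliberately simplified and is not the notebook's implementation: the hypernym is taken to be the single word before "such as", and everything up to the next sentence punctuation is treated as the hyponym list. A fuller version would match noun phrases with a parser.

```python
import re

# One Hearst pattern: "<hypernym> such as <hyponym>, <hyponym> and <hyponym>".
SUCH_AS = re.compile(r"(?P<hyper>\w+),? such as (?P<hypos>[^.;:()]+)", re.IGNORECASE)

# Separators inside the hyponym list: commas, "and", "or" (including ", and").
LIST_SEP = re.compile(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")


def extract_such_as(sentence):
    """Return (hyponym, hypernym) candidate pairs found by the 'such as' pattern."""
    pairs = []
    for m in SUCH_AS.finditer(sentence):
        hypernym = m.group("hyper").lower()
        for item in LIST_SEP.split(m.group("hypos")):
            item = item.strip().lower()
            if item:
                pairs.append((item, hypernym))
    return pairs


if __name__ == "__main__":
    sentence = "The device uses features such as fingerprints, iris patterns and face images."
    print(extract_such_as(sentence))
```

Because the list is read up to the next punctuation mark, trailing words such as verbs can leak into the last hyponym, which is one reason the extracted pairs need a separate validation step.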

Automatic validation of the results


After extraction, we validate the results automatically, using the Wiki API, WordNet or spaCy embeddings. Here is an example of the validation table after processing:

[Image: example validation table]
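
One of the validation options, embedding similarity, can be sketched without any external library. The example below is an assumption-laden illustration rather than the notebook's method: the two-dimensional vectors are made-up toy values standing in for real word embeddings, and the 0.5 threshold is arbitrary.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def validate_pairs(pairs, vectors, threshold=0.5):
    """Score each (hyponym, hypernym) pair and flag those at or above the threshold."""
    rows = []
    for hyponym, hypernym in pairs:
        if hyponym in vectors and hypernym in vectors:
            score = cosine(vectors[hyponym], vectors[hypernym])
        else:
            score = None  # no embedding available, so the pair cannot be scored
        rows.append({
            "hyponym": hyponym,
            "hypernym": hypernym,
            "score": score,
            "valid": score is not None and score >= threshold,
        })
    return rows


if __name__ == "__main__":
    # Toy vectors for illustration only; real ones would come from an embedding model.
    toy_vectors = {"fingerprint": [1.0, 0.0], "biometric": [1.0, 0.0], "invoice": [0.0, 1.0]}
    for row in validate_pairs([("fingerprint", "biometric"), ("invoice", "biometric")], toy_vectors):
        print(row)
```

Cosine similarity is symmetric, so it can only indicate that two terms are related, not which one is the hypernym. That is a reason to combine it with a directional resource such as WordNet or Wikidata.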
