Named Entity Extraction with OpenCV, Pytesseract, Spacy (OCR + NER), BIO Labelling

MvMukesh/AutoKYC-ExtractionEngine

AutoKYC Extraction Engine | DEMO

Named entity extraction from financial documents with OpenCV, Pytesseract, and spaCy (OCR + NER)


Development Stages

Training Architecture (NER Model)

Architecture

Text Detection Workflow

Image Preprocessing (suppressing unwanted distortions, enhancing important image features)

  1. Binarization
  2. Rescaling
  3. Dilation

Image Segmentation (splitting the image by)

  1. Single character
  2. Word
  3. Line
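
Word-level and line-level segmentation can be sketched from the dictionary that `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` returns. The helper name `group_words_into_lines` and the sample data below are made up for illustration; this is a sketch under those assumptions, not the repository's code.

```python
from collections import defaultdict


def group_words_into_lines(data):
    """Group word-level OCR rows into lines.

    `data` is assumed to have the shape of pytesseract's image_to_data DICT
    output: parallel lists keyed by block_num, par_num, line_num, text, etc.
    """
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        word = text.strip()
        if not word:  # rows that are not words carry empty text
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append(word)
    # Sorting by (block, paragraph, line) keeps the reading order.
    return [" ".join(words) for _, words in sorted(lines.items())]


# Hand-written sample in the same shape (not real OCR output).
sample = {
    "block_num": [1, 1, 1, 1, 2, 2],
    "par_num":   [1, 1, 1, 1, 1, 1],
    "line_num":  [0, 1, 1, 2, 0, 1],
    "text":      ["", "Jane", "Doe", "Engineer", "", "jane@example.com"],
}
segmented = group_words_into_lines(sample)
```

For character-level segmentation, `pytesseract.image_to_boxes` returns one box per recognized character.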

Labeling - BIO/IOB Tagging Format

This was the most time-consuming task: it took more than 10 hours a day for several weeks.
Lesson learned: collecting good data in real life is not a cakewalk.
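
As an illustration of the BIO (IOB) scheme: the first token of an entity gets a `B-` prefix, the following tokens of the same entity get `I-`, and everything else is tagged `O`. The sketch below uses hypothetical label names (`NAME`, `ORG`, `EMAIL`) and a hypothetical helper `to_bio`; the project's actual label set may differ.

```python
def to_bio(tokens, spans):
    """Convert entity spans to BIO tags.

    tokens: list of token strings.
    spans: list of (start, end, label) in token indices, with `end` exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


tokens = ["Jane", "Doe", "Acme", "Corp", "jane@acme.com", "Tel"]
spans = [(0, 2, "NAME"), (2, 4, "ORG"), (4, 5, "EMAIL")]
bio_tags = to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```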

Bounding Boxes

Input - Real Time

Eyeballing the scanned result of a very common and easy input you might get in real time. Real inputs can be anything in the range from crazy to very crazy.

NER Prediction

This is the NER prediction on the scanned result of the business card above.
Finding the organisation and the name is still a bit difficult. Clearly I have to increase the business card data from 3000+ cards to maybe 10000+, and in parallel I need to refine my approach a bit more.



Problem Statement

Develop a customized Named Entity Recognizer to extract entities from scanned document images such as:

  1. Invoice
  2. Business Card [my focus] || Extract Entities like: Name, Phone, Email, Organisation and Website link
  3. Shipping Bill etc

Technologies used

  1. Computer Vision modules were used to:

    1. scan the document
    2. identify the location of text
    3. extract text from the image
  2. Natural Language Processing was used to:

    1. extract entities from text
    2. clean text
    3. parse entities from text
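
Text cleaning and simple parsing can be sketched with regular expressions. The patterns and the helper names `clean_text` and `parse_contacts` are illustrative assumptions and deliberately loose; they complement the trained NER model rather than replace it.

```python
import re

# Loose, illustrative patterns (assumptions, not the project's rules).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def clean_text(raw):
    """Replace stray symbols from OCR with spaces and collapse whitespace."""
    kept = re.sub(r"[^\w\s@.+:/()-]", " ", raw)
    return re.sub(r"\s+", " ", kept).strip()


def parse_contacts(text):
    """Pull e-mail addresses and phone numbers out of cleaned text."""
    return {"email": EMAIL_RE.findall(text), "phone": PHONE_RE.findall(text)}


raw = "Jane  Doe |  jane.doe@example.com ~ +1 (555) 010-9999\n"
cleaned = clean_text(raw)
contacts = parse_contacts(cleaned)
```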

Python Libraries used in Computer Vision Module

Python Libraries used in Natural Language Processing

Flow to Extract Entities

  1. Location of Entity
  2. Text of Corresponding Entity
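
One way to get both outputs, the location and the text of each entity, is to merge consecutive tokens that share a BIO label and take the union of their bounding boxes. The sketch below assumes each token arrives as `(text, tag, (left, top, width, height))`; that tuple layout and the name `merge_entities` are assumptions for illustration.

```python
def merge_entities(tokens):
    """Merge BIO-tagged tokens into entities with a union bounding box.

    tokens: iterable of (text, tag, (left, top, width, height)).
    Returns a list of dicts {"label", "text", "box"}, where box is
    (left, top, right, bottom).
    """
    entities = []
    current = None
    for text, tag, (left, top, width, height) in tokens:
        right, bottom = left + width, top + height
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current and current["label"] == label:
            # Continue the open entity: extend its text and grow its box.
            l, t, r, b = current["box"]
            current["text"] += " " + text
            current["box"] = (min(l, left), min(t, top),
                              max(r, right), max(b, bottom))
        elif prefix in ("B", "I"):
            # "B-" (or a stray "I-") opens a new entity.
            current = {"label": label, "text": text,
                       "box": (left, top, right, bottom)}
            entities.append(current)
        else:
            current = None  # "O" closes any open entity
    return entities


tagged = [
    ("Jane", "B-NAME", (10, 10, 40, 12)),
    ("Doe", "I-NAME", (55, 11, 30, 12)),
    ("Tel", "O", (10, 30, 20, 12)),
    ("jane@example.com", "B-EMAIL", (10, 50, 120, 12)),
]
entities = merge_entities(tagged)
```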

Some more NER use-cases

Improvements:

  1. I am using the spaCy NER model, which is a BERT architecture, i.e. I have to provide more data to this model to see a performance improvement
  2. I can also improve the Data Preparation Framework
  3. I am using Pytesseract (a Python wrapper for Google's Tesseract OCR) to extract text; it has some limitations:

    1. Image resolution must be at least 200 dpi, or width and height must be at least 300 pixels
    2. Text must not be rotated or skewed
    3. Text must not have effects applied to it
    4. Text must not be blurred
    5. Text must not be cursive handwriting
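
The size rule in the list above can be checked before an image is sent to OCR. A minimal sketch, assuming the width, height, and (optionally) DPI of the image are already known; the helper name `meets_ocr_size_rule` is hypothetical and the thresholds are copied from the list.

```python
MIN_DPI = 200       # from the limitation list above
MIN_SIDE_PX = 300   # from the limitation list above


def meets_ocr_size_rule(width_px, height_px, dpi=None):
    """Return True if the image satisfies the resolution-or-size rule."""
    if dpi is not None and dpi >= MIN_DPI:
        return True
    return width_px >= MIN_SIDE_PX and height_px >= MIN_SIDE_PX


checks = [
    meets_ocr_size_rule(1200, 800),          # large enough in pixels
    meets_ocr_size_rule(250, 900),           # too narrow, DPI unknown
    meets_ocr_size_rule(250, 150, dpi=300),  # small, but high DPI
]
```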

References

What Next