Checkbox Detection

Checkbox detector model built on YOLOv8
View Demo · Report Bug · Request Feature

Table of Contents
  1. Updates
  2. About The Project
  3. Getting Started
  4. Works Cited
  5. Contact

Updates

In this project, I provide two models (a classification model and a detection model) trained from existing YOLOv8 weights. They are uploaded to my Hugging Face Space for this project. If you use or fine-tune the models in any part of your work, please cite this repository. Thank you, and don't forget to give this repo a 🌟!
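
The detection model can be loaded directly with the Ultralytics API once the weights are downloaded from the Space. A minimal inference sketch, assuming the weights are saved locally (the file name checkbox_detector.pt is hypothetical):

    from ultralytics import YOLO

    # Load the fine-tuned detection weights (file name is hypothetical; use the
    # file downloaded from the Hugging Face Space).
    model = YOLO("checkbox_detector.pt")

    # Run detection on a scanned document and print each predicted box.
    results = model.predict("scanned_form.png")
    for box in results[0].boxes:
        print(int(box.cls), float(box.conf), box.xyxy[0].tolist())  # class id, confidence, pixel coords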

About The Project

The biggest challenge in approaching this problem is the lack of public datasets containing documents with checkbox annotations: what exists is either images of checkboxes alone or images of scanned documents without checkbox labels. As a result, the solution comes down to generating a sufficiently large annotated dataset of documents with checkboxes.

Although the idea of using the Copy-Paste technique to augment data is simple, making the augmented dataset work well with the existing YOLO architecture was the most difficult part and took a lot of trial and error. Throughout this process, I experimented with different ways to paste the checkboxes onto the documents, including pasting boxes contiguously in horizontal and vertical directions, pasting distractors, adding "background" images, and pasting while avoiding text blocks (using the Document Layout Analysis model I created). I ended up with over 10,000 images for the training dataset, and to test the model's performance, an additional 150 human-annotated documents were used as the validation dataset. The annotations are in YOLO format (one line per box: class id followed by the normalized center x, center y, width, and height), as in the example below; a minimal sketch of the pasting step follows the example.

    1 0.402831 0.965 0.048906 0.032
    0 0.904762 0.856 0.018018 0.014
    0 0.189189 0.7005 0.036036 0.029
    0 0.388031 0.2395 0.037323 0.029
    1 0.0199485 0.2185 0.037323 0.029
    1 0.741313 0.96 0.046332 0.032
    1 0.677606 0.1045 0.047619 0.041
    0 0.956242 0.9045 0.041184 0.033
    1 0.838481 0.6575 0.037323 0.029
    1 0.837838 0.514 0.03861 0.03
    0 0.0456885 0.8305 0.032175 0.027
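
Below is a minimal sketch of the pasting step described above: it pastes a single checkbox crop onto a document page and writes the corresponding YOLO annotation line. File names and the class mapping (here 1 = checked, 0 = unchecked) are assumptions for illustration; the actual pipeline additionally handles distractors, background images, and text-block avoidance.

    import random

    import cv2

    # Paste one checkbox crop onto a document page (naive version: no blending,
    # no overlap checks, no text-block avoidance).
    doc = cv2.imread("document.png")   # e.g. a page from RVL-CDIP
    box = cv2.imread("checkbox.png")   # a cropped checkbox image
    H, W = doc.shape[:2]
    h, w = box.shape[:2]

    x = random.randint(0, W - w)       # top-left corner of the paste location
    y = random.randint(0, H - h)
    doc[y:y + h, x:x + w] = box

    # YOLO format: class_id x_center y_center width height, normalized to [0, 1].
    class_id = 1                       # assumed mapping: 1 = checked, 0 = unchecked
    label = f"{class_id} {(x + w / 2) / W:.6f} {(y + h / 2) / H:.6f} {w / W:.6f} {h / H:.6f}"

    cv2.imwrite("augmented.png", doc)
    with open("augmented.txt", "w") as f:
        f.write(label + "\n")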

The model was trained on a P100 GPU for 200 epochs. In the end, under the supervision and mentorship of my advisor, I was able to achieve notable inference results, with the model reaching relatively high precision and recall after ~100 epochs.
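
For reference, the training step with Ultralytics comes down to a few lines. A sketch with illustrative paths and hyperparameters (other than epochs=200, these are not the exact settings used here); the dataset YAML would point to the generated training images and the 150 human-annotated validation documents:

    from ultralytics import YOLO

    # Start from pretrained YOLOv8 weights and fine-tune on the generated dataset.
    # "checkbox.yaml", imgsz, and device are illustrative values.
    model = YOLO("yolov8n.pt")
    model.train(data="checkbox.yaml", epochs=200, imgsz=640, device=0)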

(back to top)

Built With

  • Python
  • Ultralytics YOLOv8
  • OpenCV
  • NumPy / Matplotlib
  • Albumentations
  • PyTorch
  • Gradio

(back to top)

Getting Started

Prerequisites

For generating data

  1. opencv-python: 4.7.0
  2. matplotlib: 3.7.1
  3. numpy: 1.25.2
  4. albumentations: 1.3.1

For training

  1. ultralytics: 8.0.153
  2. gradio
  3. torch
  4. ruamel.yaml

Installation

  1. Clone the repo

    git clone https://github.com/LynnHaDo/Checkbox-Detection.git
  2. Install packages

    pip install opencv-python
    pip install matplotlib
    pip install numpy
    pip install albumentations
    pip install ultralytics
    pip install gradio
    pip install torch
    pip install ruamel.yaml
  3. Dataset:

  • Source documents: RVL-CDIP
  • Checkbox images: currently not publicly available
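
Once the dataset is assembled, Ultralytics expects a small dataset YAML describing it. A sketch with assumed paths and class names (not an actual file from this repo):

    # checkbox.yaml -- dataset description for Ultralytics YOLO.
    # Paths and class names below are assumptions for illustration.
    path: datasets/checkbox
    train: images/train   # generated training images (10,000+)
    val: images/val       # 150 human-annotated validation documents
    names:
      0: unchecked
      1: checked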

(back to top)

Works Cited

  1. Ultralytics YOLOv8

    authors:
     - family-names: Jocher
       given-names: Glenn
       orcid: "https://orcid.org/0000-0001-5950-6979"
     - family-names: Chaurasia
       given-names: Ayush
       orcid: "https://orcid.org/0000-0002-7603-6750"
     - family-names: Qiu
       given-names: Jing
       orcid: "https://orcid.org/0000-0003-3783-7069"
    title: "YOLO by Ultralytics"
    version: 8.0.0
    date-released: 2023-01-10
    license: AGPL-3.0
    url: "https://github.com/ultralytics/ultralytics"
  2. RVL-CDIP dataset

    @inproceedings{harley2015icdar,
     title = {Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval},
     author = {Adam W Harley and Alex Ufkes and Konstantinos G Derpanis},
     booktitle = {International Conference on Document Analysis and Recognition ({ICDAR})},
     year = {2015}
     }
  3. Doclaynet-base dataset

    @article{doclaynet2022,
     title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation},
     doi = {10.1145/3534678.3539043},
     url = {https://doi.org/10.1145/3534678.3539043},
     author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
     year = {2022},
     isbn = {9781450393850},
     publisher = {Association for Computing Machinery},
     address = {New York, NY, USA},
     booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
     pages = {3743–3751},
     numpages = {9},
     location = {Washington DC, USA},
     series = {KDD '22}
     }
  4. XFUND dataset

    @inproceedings{xu-etal-2022-xfund,
     title = "{XFUND}: A Benchmark Dataset for Multilingual Visually Rich Form Understanding",
     author = "Xu, Yiheng  and
       Lv, Tengchao  and
       Cui, Lei  and
       Wang, Guoxin  and
       Lu, Yijuan  and
       Florencio, Dinei  and
       Zhang, Cha  and
       Wei, Furu",
     booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
     month = may,
     year = "2022",
     address = "Dublin, Ireland",
     publisher = "Association for Computational Linguistics",
     url = "https://aclanthology.org/2022.findings-acl.253",
     doi = "10.18653/v1/2022.findings-acl.253",
     pages = "3214--3224",
     abstract = "Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually rich document understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. However, the existed research work has focused only on the English domain while neglecting the importance of multilingual generalization. In this paper, we introduce a human-annotated multilingual form understanding benchmark dataset named XFUND, which includes form understanding samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese). Meanwhile, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually rich document understanding. Experimental results show that the LayoutXLM model has significantly outperformed the existing SOTA cross-lingual pre-trained models on the XFUND dataset. The XFUND dataset and the pre-trained LayoutXLM model have been publicly available at https://aka.ms/layoutxlm.",
     }

Contact

Linh Do - do24l@mtholyoke.edu / dohalinh2303@gmail.com (personal)

Project Link: https://github.com/LynnHaDo/Checkbox-Detection

LinkedIn: https://linkedin.com/in/Linh Do

(back to top)