Table of contents

  1. Benchmarks
  2. Papers
  3. Datasets
  4. Useful links

Benchmarks

  1. Best OCR by Text Extraction Accuracy in 2021, https://research.aimultiple.com/ocr-accuracy/
  2. Best OCR Software of 2021, https://nanonets.com/blog/ocr-software-best-ocr-software/
  3. Comparison of OCR tools: how to choose the best tool for your project, https://medium.com/dida-machine-learning/comparison-of-ocr-tools-how-to-choose-the-best-tool-for-your-project-bd21fb9dce6b
  4. Our Search for the Best OCR Tool, and What We Found, 2019, https://source.opennews.org/articles/so-many-ocr-options/ (https://github.com/factful/ocr_testing)
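
The benchmarks above compare tools by text extraction accuracy. As a rough illustration of how such a score can be computed, here is a minimal sketch of the character error rate (CER), assuming accuracy is reported as edit distance against a ground-truth transcript. The function names are hypothetical and are not taken from any of the linked benchmarks, which may define accuracy differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop so the row buffer stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the length of the ground-truth text."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Calling `character_error_rate(ground_truth, ocr_output)` returns 0.0 for a perfect transcription; the value can exceed 1.0 when the OCR output is much longer than the reference.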

Papers

  • DavarOCR: A Toolbox for OCR and Multi-Modal Document Understanding, [code]

    Liang Qiao, Hui Jiang, Ying Chen, Can Li, Pengfei Li, Zaisheng Li, Baorui Zou, Dashan Guo, Yingda Xu, Yunlu Xu, Zhanzhan Cheng, Yi Niu. ACM MM 2022.

    This paper presents DavarOCR, an open-source toolbox for OCR and document understanding tasks. DavarOCR currently implements 19 advanced algorithms, covering 9 different task forms. DavarOCR provides detailed usage instructions and the trained models for each algorithm. Compared with previous open-source OCR toolboxes, DavarOCR has relatively more complete support for the sub-tasks of the cutting-edge technology of document understanding. In order to promote the development and application of OCR technology in academia and industry, we pay more attention to the use of modules that different sub-domains of technology can share.
  • TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, [code/data]

    Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. arXiv 2021.

    Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
  • Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR documents

    Amit Gupte, Alexey Romanov, Sahitya Mantravadi, Dalitso Banda, Jianjie Liu, Raza Khan, Lakshmanan Ramu Meenal, Benjamin Han, Soundar Srinivasan. Document Intelligence Workshop at KDD 2021.

    Document digitization is essential for the digital transformation of our societies, yet a crucial step in the process, Optical Character Recognition (OCR), is still not perfect. Even commercial OCR systems can produce questionable output depending on the fidelity of the scanned documents. In this paper, we demonstrate an effective framework for mitigating OCR errors for any downstream NLP task, using Named Entity Recognition (NER) as an example. We first address the data scarcity problem for model training by constructing a document synthesis pipeline, generating realistic but degraded data with NER labels. We measure the NER accuracy drop at various degradation levels and show that a text restoration model, trained on the degraded data, significantly closes the NER accuracy gaps caused by OCR errors, including on an out-of-domain dataset. For the benefit of the community, we have made the document synthesis pipeline available as an open-source project.
  • Text Recognition in the Wild: A Survey

    Xiaoxue Chen, Lianwen Jin, Yuanzhi Zhu, Canjie Luo, T. Wang. arXiv 2020.

    The history of text can be traced back over thousands of years. Rich and precise semantic information carried by text is important in a wide range of vision-based application scenarios. Therefore, text recognition in natural scenes has been an active research field in computer vision and pattern recognition. In recent years, with the rise and development of deep learning, numerous methods have shown promise in terms of innovation, practicality, and efficiency. This paper aims to (1) summarize the fundamental problems and the state-of-the-art associated with scene text recognition; (2) introduce new insights and ideas; (3) provide a comprehensive review of publicly available resources; (4) point out directions for future work. In summary, this literature review attempts to present the entire picture of the field of scene text recognition. It provides a comprehensive reference for people entering this field, and could be helpful to inspire future research. Related resources are available at the authors' GitHub repository.
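
The "Lights, Camera, Action!" entry above describes generating degraded data at various degradation levels. The sketch below illustrates that general idea on plain text only, by injecting typical OCR character confusions at a configurable rate. It is not the paper's pipeline (which synthesizes documents); the `CONFUSIONS` table, the `degrade` function, and the `level` parameter are all hypothetical names chosen for this example.

```python
import random

# Hypothetical table of common OCR confusions (an assumption for this sketch,
# not taken from the paper): each key may be replaced by its value.
CONFUSIONS = {"l": "1", "O": "0", "o": "0", "S": "5", "B": "8", "m": "rn"}


def degrade(text: str, level: float, seed: int = 0) -> str:
    """Replace confusable characters with probability `level` (0.0 to 1.0)."""
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must be between 0.0 and 1.0")
    rng = random.Random(seed)  # seeded so the same inputs give the same output
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < level:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)
```

Running `degrade(sentence, level)` for several values of `level` yields progressively noisier copies of the same labeled sentence, which can then be fed to a downstream model to observe how its accuracy changes with noise.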

Datasets

  1. Total-Text paper repo - scene text detection dataset
  2. Synth90k - popular dataset of single-word synthetic images (90k words, 9M images)
  3. SROIE - scanned receipts OCR and information extraction
  4. FUNSD - A dataset for Text Detection, Optical Character Recognition, Spatial Layout Analysis and Form Understanding
  5. RDCL2019 - ICDAR Competition on Recognition of Documents with Complex Layouts
  6. REID2019 - ICDAR Competition on Recognition of Early Indian printed Documents
  7. RETAS OCR Evaluation Dataset - scanned books from Project Gutenberg

Useful links

  1. https://github.com/mindee/doctr - an alternative to the Tesseract project
  2. https://mindee.com/
  3. https://github.com/open-mmlab/mmocr
  4. https://github.com/Belval/TextRecognitionDataGenerator
  5. http://tc11.cvc.uab.es/datasets/type/
  6. https://www.primaresearch.org/
  7. http://iapr-tc11.org/mediawiki/index.php?title=IAPR-TC11:Reading_Systems