Checks OCR + RAG

Software to extract key information from printed and handwritten text on bank checks, using object detection techniques, cloud ML services and Retrieval Augmented Generation (RAG). The solution provides enhanced transparency by reporting confidence levels in the OCR results.

How it works

The program ships with annotations, organized by bank, that mark where each detail to extract sits on a check. It uses Amazon Textract to extract the text from each check, then applies an Intersection over Union (IoU) algorithm to retrieve the specific details from the Amazon Textract results. The Textract responses are cached to avoid paying for the same check twice if it is placed in the unprocessed folder again. The confidence scores reported by Amazon Textract are used to visualize the generated data and highlight potentially inaccurate results, which commonly occur when check details are misplaced or when the check was not horizontally aligned during scanning.
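The IoU matching step can be illustrated with a minimal sketch. It assumes that annotated boxes and Textract boxes use the same normalized coordinates (Textract reports Left, Top, Width and Height as ratios of the page size). The Box class, the best_match helper and the 0.3 threshold are hypothetical names and values chosen for illustration, not the project's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in normalized page coordinates (0.0 to 1.0)."""
    left: float
    top: float
    width: float
    height: float


def iou(a: Box, b: Box) -> float:
    """Intersection over Union: overlap area divided by combined area."""
    x1 = max(a.left, b.left)
    y1 = max(a.top, b.top)
    x2 = min(a.left + a.width, b.left + b.width)
    y2 = min(a.top + a.height, b.top + b.height)
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a.width * a.height + b.width * b.height - inter
    return inter / union if union > 0 else 0.0


def best_match(field_box: Box, blocks: list, min_iou: float = 0.3):
    """Return the Textract LINE block that best overlaps an annotated box.

    `blocks` follows the shape of the "Blocks" list in a Textract response.
    `min_iou` is an assumed threshold; returns None if nothing reaches it.
    """
    best, best_score = None, min_iou
    for block in blocks:
        if block.get("BlockType") != "LINE":
            continue
        bb = block["Geometry"]["BoundingBox"]
        candidate = Box(bb["Left"], bb["Top"], bb["Width"], bb["Height"])
        score = iou(field_box, candidate)
        if score >= best_score:
            best, best_score = block, score
    return best
```

A block returned this way still carries its Text and Confidence values, so the confidence can travel with the extracted detail into the report.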

When the LLM feature is enabled, the program uses Retrieval Augmented Generation (RAG) built with LangChain and Chroma DB (which stores a user-updatable context), with the GPT-3.5 Turbo model by default (users can specify another OpenAI model). The LLM corrects potential OCR errors, which improves the accuracy of the results.
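To show why a user-maintained context helps, here is a deliberately simplified sketch of the retrieve-then-prompt idea. It is not the project's RAG chain: instead of LangChain, Chroma DB embeddings and an OpenAI call, it uses fuzzy string matching from the Python standard library as a stand-in for retrieval and stops at building the prompt. The function names and the prompt wording are hypothetical.

```python
from difflib import get_close_matches


def retrieve_candidates(ocr_text, known_names, limit=3, cutoff=0.6):
    """Stand-in for the retrieval step: find known names similar to the OCR text.

    A real RAG setup would query a vector database; fuzzy matching is used
    here only so the example runs without external services.
    """
    return get_close_matches(ocr_text, known_names, n=limit, cutoff=cutoff)


def build_prompt(ocr_text, candidates):
    """Augment the question sent to the model with the retrieved context."""
    context = "\n".join(f"- {name}" for name in candidates) or "- (no match found)"
    return (
        "Known client names:\n"
        f"{context}\n\n"
        f"OCR result: {ocr_text}\n"
        "Return the most likely correct client name."
    )


if __name__ == "__main__":
    names = ["John Smith", "Jane Doe"]
    ocr_value = "Jhon Smith"
    print(build_prompt(ocr_value, retrieve_candidates(ocr_value, names)))
```

The point of the sketch is the data flow: the closer the context lists are to your real clients and accounts, the more useful the retrieved candidates are to the model.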

Note that the current annotations cover only a subset of banks in Ecuador (see the Supported Banks list at the end of this document). Additionally, as of now, the program exclusively supports OpenAI models.

Prerequisites

Docker installed on your system
AWS credentials for Amazon Textract
OpenAI credentials if you want to use the LLM feature

If you don't have Docker installed, you need the following instead of Docker:

A recent version of Python installed on your system.
A Python package manager installed on your system (like pip or conda).

Setup

Create a folder (let's call it checks).
Inside checks, create an empty folder named unprocessed.
Inside checks, create an empty folder named processed.
Clone the repository into your checks folder:

git clone https://github.com/este6an13/checks-ocr.git

Set your AWS credentials for Amazon Textract as environment variables.

For the program to work you need to use these names for the variables:

TEXTRACT_AWS_ACCESS_KEY_ID
TEXTRACT_AWS_SECRET_ACCESS_KEY_ID
TEXTRACT_AWS_REGION

If you want to use the LLM feature, add your OpenAI credentials as well. For the program to work, use this name for your environment variable:

OPENAI_API_KEY
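Because the variable names must match exactly, it can help to check them before running the program. The following is a small sketch and not part of the repository; the missing_variables helper is a hypothetical name.

```python
import os

# Names required by the program, as listed above.
REQUIRED = (
    "TEXTRACT_AWS_ACCESS_KEY_ID",
    "TEXTRACT_AWS_SECRET_ACCESS_KEY_ID",
    "TEXTRACT_AWS_REGION",
)
# Only needed when the LLM feature is used.
LLM_ONLY = ("OPENAI_API_KEY",)


def missing_variables(use_llm=False, environ=None):
    """Return the expected variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    names = REQUIRED + (LLM_ONLY if use_llm else ())
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables(use_llm=True)
    print("Missing:", ", ".join(missing) if missing else "none")
```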

Warning

If someone can inspect your container while it's running, they will be able to see your secrets. The container is automatically removed after it finishes running in an attempt to mitigate this risk.

If you don't have Docker installed in your system, follow these extra steps:

Enter the repository folder:

cd checks-ocr

Create a virtual environment (you need a recent version of Python installed):

The following commands work on Windows.
On Linux or macOS, the usual equivalents are python3 -m venv env to create the environment and source env/bin/activate to activate it.

py -m venv env

Activate the virtual environment:

cd env/Scripts
activate
cd ../..

Install the required dependencies (you need pip installed in your system for this):

pip install -r requirements.txt

Usage

Move your checks in PDF format to the unprocessed folder.

Only PDF format is currently supported.
Make sure the check is on the first page of the PDF.
To get better results, make sure the checks are horizontally aligned and that all of their content is visible.

If you have Docker installed in your system:

Build the Docker image by running the following script from the checks folder:

bash checks-ocr/build.sh

Start the application by running the following script from the checks folder:

bash checks-ocr/run.sh

To use the LLM feature add the --llm option when running the script:

bash checks-ocr/run.sh --llm

You can manually add the client_names.txt and account_names.txt files to the checks-ocr/data/data folder and populate them with the lists of clients and accounts of your organization. This helps the LLM generate better results. You can also modify the checks-ocr/data/data/territories.txt file if needed.

After you modify any of these files you can tell the program to update its internal vector databases to start using the new context in the RAG chain. To do so, pass the --update option to the script like this:

bash checks-ocr/run.sh --llm --update client_names --update account_names --update territories

You only need to pass the --update option for the files that you modified.

You can also choose a different LLM with the --model-name option:

bash checks-ocr/run.sh --llm --model-name gpt-4-0125-preview

The program will use the gpt-3.5-turbo-0125 model by default.

Note

Without the --llm option, the --update and --model-name options won't take effect.

If you don't have Docker installed in your system, follow these steps instead:

Navigate to the src folder:

cd checks-ocr
cd src

Run the program with the following command:

py main.py

If running the program this way, pass the arguments in the following format:

py main.py --llm --update=client_names --update=account_names --update=territories --model-name=gpt-4-0125-preview

Note the equal symbol between the option and its value.
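For readers who want to see how options like these are commonly handled, here is a sketch using argparse from the Python standard library. It is an assumption for illustration only: the README does not say how main.py parses its arguments, and build_parser is a hypothetical name. The default model name is the one documented above.

```python
import argparse


def build_parser():
    """Parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="main.py")
    # Flag without a value: enables the LLM feature.
    parser.add_argument("--llm", action="store_true")
    # Repeatable option: each --update adds one context name to a list.
    parser.add_argument(
        "--update",
        action="append",
        default=[],
        choices=["client_names", "account_names", "territories"],
    )
    parser.add_argument("--model-name", default="gpt-3.5-turbo-0125")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.llm, args.update, args.model_name)
```

argparse itself accepts both --update=client_names and --update client_names, so this sketch does not explain the equal-sign requirement; when running the real program, keep to the format shown above.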

After starting the program, wait for it to finish processing.

Note

It takes around 8 seconds to process a new check. If using the LLM feature, it may take 10 seconds per new check.

Check the results in the created data.xlsx file. Click on the ID of each row to see the check and fix any mistake or bad formatting if you need to.

Remember

The processed checks are moved to the processed folder automatically.
Images of the checks are generated and saved in the checks-ocr/images folder.
If you move the images from that folder, you won't be able to see them from the data.xlsx file when clicking their ID.
The cache folder stores the responses received from Amazon Textract, so if you move a check back to the unprocessed folder and remove its row from the data.xlsx file, the program will re-process the image without making any new call to Amazon Textract. This is useful if you want to manually adjust the BOXES of a given bank in the checks-ocr/src/constants/__init__.py file and re-run the processing for a given set of checks.
The cells of the generated data.xlsx file are colored according to the confidence reported by Amazon Textract. Red cells indicate a confidence lower than 90. Violet cells indicate content that looks inconsistent, which suggests that the BOXES coordinates did not capture the detail precisely. This happens when a check's details are not where they usually are or when they overlap other details on the check.
You can update the context that the LLM uses to generate results by adding and updating the client_names.txt, account_names.txt, and territories.txt files located in the checks-ocr/data/data folder. This way, users have complete control over the context used by the program and can make updates as needed. To apply these changes to the internal vector databases, run the program with the --update option, as explained earlier. You only need to do this once after making any changes to the files.
Remember to close the data.xlsx file before running the script, otherwise the program won't be able to write the collected data and you will have to run it again.
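The caching behavior described in this list can be pictured with a short sketch. How the project actually names and stores its cache entries is not documented here, so keying by a hash of the file contents and storing JSON files are assumptions, and cache_key and get_or_fetch are hypothetical helpers. The fetch argument stands for whatever function performs the paid OCR call.

```python
import hashlib
import json
from pathlib import Path


def cache_key(pdf_path):
    """Derive a stable key from the file contents (assumed scheme)."""
    return hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()


def get_or_fetch(pdf_path, cache_dir, fetch):
    """Return the cached response for a check, calling `fetch` only on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / f"{cache_key(pdf_path)}.json"
    if entry.exists():
        # Cache hit: no remote call, no extra cost.
        return json.loads(entry.read_text(encoding="utf-8"))
    response = fetch(pdf_path)
    entry.write_text(json.dumps(response), encoding="utf-8")
    return response
```

With a scheme like this, re-processing a check after adjusting box coordinates only repeats the local matching step, since the stored response is reused.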

Supported Banks

Ecuador

Banco Internacional
Banco Pichincha
Banco Guayaquil
Banco del Austro
BanEcuador
Produbanco

Direct Dependencies

boto3
coverage
chromadb
langchain
langchain-openai
openpyxl
pandas
pillow
pyarrow
pymupdf
pytest
unidecode

For a comprehensive list that includes both direct and transitive dependencies, please refer to the requirements.txt file.
