Document Content Extractor

The Document Content Extractor is a tool for extracting semantic information from a PDF text file. It is useful when the structure and format of the text to be parsed are dynamic and the information to be extracted is semantic rather than syntactic in nature. It is not as fast as a static parser, but it can extract information that would be hard to encode in a static parser, and it does not need to be modified every time the format of the underlying document changes.
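To illustrate the trade-off described above, here is a minimal sketch of a static parser that breaks when the wording of a document changes. The pattern, the function name `parse_invoice_number`, and the sample text are all made up for illustration and do not come from this repository.

```python
import re

# A static parser hard-codes one specific layout of the document.
# (Hypothetical pattern, chosen only for this illustration.)
INVOICE_PATTERN = re.compile(r"Invoice No:\s*(\d+)")


def parse_invoice_number(text):
    """Return the invoice number if the text matches the expected layout."""
    match = INVOICE_PATTERN.search(text)
    return match.group(1) if match else None


# Made-up sample documents: same information, different wording.
old_layout = "Invoice No: 1042\nTotal: 300.00"
new_layout = "The reference for this invoice is 1042.\nAmount due: 300.00"

old_result = parse_invoice_number(old_layout)  # layout matches the pattern
new_result = parse_invoice_number(new_layout)  # same fact, but the pattern misses it
```

The second document still states the invoice number, but the static pattern no longer finds it. A semantic extractor is meant to handle this kind of change without a code update, at the cost of speed.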

The tool leverages the OpenAI Completions API to process the text.
It is fully configurable; refer to the section on fine-tuning the adapter below for more information.
The tool can be used either as a standalone Python application or as a microservice running inside a larger application.
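The general idea of completion-based extraction can be sketched as follows. This is not the repository's actual code: `build_prompt`, `extract_fields`, the field names, and the prompt wording are hypothetical, and the completion call is passed in as a plain function so the sketch runs without network access. In practice, `complete` would wrap a call to the OpenAI API.

```python
import json


def build_prompt(document_text, fields):
    """Build an extraction prompt from the document text and the wanted fields.

    `fields` maps a field name to a short description of what to extract.
    """
    field_list = "\n".join(
        f"- {name}: {description}" for name, description in fields.items()
    )
    return (
        "Extract the following fields from the document below.\n"
        "Reply with a single JSON object keyed by field name; "
        "use null for anything the document does not state.\n\n"
        f"Fields:\n{field_list}\n\n"
        f"Document:\n{document_text}\n"
    )


def extract_fields(document_text, fields, complete):
    """Send the prompt to `complete` and keep only the requested fields.

    `complete` is any callable that takes a prompt string and returns the
    model's reply as a string (assumed here to be valid JSON).
    """
    raw_reply = complete(build_prompt(document_text, fields))
    parsed = json.loads(raw_reply)
    return {name: parsed.get(name) for name in fields}


def fake_complete(prompt):
    """Stand-in for a real completion call; returns a canned, made-up reply."""
    return json.dumps(
        {"signer_name": "Jane Doe", "closing_date": "2023-05-01", "extra": "ignored"}
    )


# Hypothetical configuration: which fields to pull out of the document.
fields = {
    "signer_name": "full name of the person signing",
    "closing_date": "date of the closing, in ISO format",
}
sample_text = "Jane Doe will sign at the closing on May 1st, 2023."
result = extract_fields(sample_text, fields, fake_complete)
```

Keeping the completion call behind a plain function argument means the same extraction logic can be driven from a command-line script or from a request handler in a microservice, and it can be tested without calling the API.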

↓ Click on each link below to be redirected to the appropriate section of the Wiki

How does it work?

Code Walkthrough

Getting Started

Adjusting Configuration for Accurate Content Extraction

Frequently Asked Questions

⚠️ Caution: Before using this application on your data, verify that the OpenAI API meets the necessary requirements for the data you want to process by visiting their security and compliance page.


atbasu/document-content-extractor
