Copy Spotter

About

This program processes the pdf, txt, docx, and odt files found in the given input directory, finds similar sentences, and calculates a similarity percentage. It then displays a similarity table with links to side-by-side comparisons in which the similar sentences are highlighted.
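To illustrate the idea of scoring similarity from shared runs of words, here is a minimal sketch built on Python's standard `difflib`. It is not the package's actual implementation: the function names `matching_blocks` and `similarity_percentage` are hypothetical, and the scoring formula (matched words divided by the length of the first text) is an assumption chosen for simplicity.

```python
from difflib import SequenceMatcher


def matching_blocks(words_a, words_b, block_size=2):
    """Return runs of consecutive words shared by both lists, at least block_size long."""
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        words_a[m.a : m.a + m.size]
        for m in matcher.get_matching_blocks()
        if m.size >= block_size
    ]


def similarity_percentage(text_a, text_b, block_size=2):
    """Share of words in text_a that fall inside a matching block, as a percentage.

    Assumed formula for illustration only; the real tool may score differently.
    """
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    if not words_a:
        return 0.0
    matched = sum(len(block) for block in matching_blocks(words_a, words_b, block_size))
    return 100.0 * matched / len(words_a)
```

For example, `similarity_percentage("the quick brown fox jumps", "a quick brown fox sleeps")` counts the three-word run "quick brown fox" out of five words. Raising `block_size` above 3 would discard that run.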

Usage

$ pip install copy-spotter
$ copy-spotter [-s] [-o] [-h] input_directory

Positional Arguments:

input_directory: A single directory that contains all the files to compare (pdf, txt, docx, odt). See data/pdf/plagiarism for an example.

input_directory/
│
├── file_1.docx
├── file_2.pdf
└── file_3.pdf
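Before running the tool, it can help to check which files in a directory have one of the supported extensions. The following is a small sketch using only the standard library. `SUPPORTED_EXTENSIONS` and `list_supported_files` are hypothetical names, and the case-insensitive extension match is an assumption rather than documented behavior.

```python
from pathlib import Path

# Extensions listed in this README; matching is case-insensitive here by assumption.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".docx", ".odt"}


def list_supported_files(input_directory):
    """Return the sorted files directly inside input_directory with a supported extension."""
    return sorted(
        path
        for path in Path(input_directory).iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Calling `list_supported_files("data/pdf/plagiarism")` would return the `Path` objects that the sketch considers eligible. Subdirectories are not searched.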

Optional Arguments:

-s, --block-size: Set the minimum number of consecutive matching words required for a passage to count as similar. (Default is 2)
-o, --out_dir: Set the output directory for the HTML files. (Default is a new directory called results)
-h, --help: Show this message and exit.
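The options above map naturally onto an `argparse` parser. The sketch below mirrors the documented flags and defaults only. It is a hypothetical reconstruction, not the package's own source, and `build_parser` is an illustrative name. The `-h/--help` option is provided by `argparse` automatically.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative, not the real source)."""
    parser = argparse.ArgumentParser(prog="copy-spotter")
    parser.add_argument(
        "input_directory",
        help="directory that contains all files (pdf, txt, docx, odt)",
    )
    parser.add_argument(
        "-s", "--block-size", type=int, default=2,
        help="minimum number of consecutive and similar words detected",
    )
    parser.add_argument(
        "-o", "--out_dir", default=None,
        help="output directory for html files (None here stands for the default 'results')",
    )
    return parser


# Parse the same arguments as the second example in this README.
args = build_parser().parse_args(["data/pdf/plagiarism", "-s", "5", "-o", "results/output"])
print(args.input_directory, args.block_size, args.out_dir)
```

Note that `argparse` exposes `--block-size` as `args.block_size`, since hyphens in long option names become underscores in the attribute name.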

Examples

# Analyze documents in 'data/pdf/plagiarism', with default settings
$ copy-spotter data/pdf/plagiarism

# Analyze with custom block size and specify output directory
$ copy-spotter data/pdf/plagiarism -s 5 -o results/output

Development Setup:

# Clone this repository
$ git clone https://github.com/Wazzabeee/copy_spotter

# Go into the repository
$ cd copy_spotter

# Install requirements
$ pip install -r requirements.txt
$ pip install -r requirements_lint.txt

# Install pre-commit
$ pip install pre-commit
$ pre-commit install

# Run tests
$ pip install pytest
$ pytest tests/

# Run package locally
$ python -m scripts.main [-s] [-o] [-h] input_directory

Recommendations

Please make sure that all text files are closed before running the program.
For best results, provide text files written in the same language.
PDF files made from scanned images won't be processed correctly.
Ensure you have write access when using the package.
If a specific file is not processed correctly, feel free to contact me so that I can address the issue.
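The write-access recommendation can be checked up front. This is a minimal sketch, assuming the output goes to the default directory called results or to the path given with -o. `can_write_results` is a hypothetical helper, not part of the package.

```python
import os
from pathlib import Path


def can_write_results(out_dir="results"):
    """Return True if out_dir, or its nearest existing parent, is writable."""
    path = Path(out_dir).resolve()
    # Walk up until we reach a directory that already exists.
    while not path.exists():
        path = path.parent
    return os.access(path, os.W_OK)
```

If `can_write_results("results/output")` returns False, choosing another output directory with -o is likely the simplest fix.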

TODO

Add more tests on existing functions
Implement OCR with tesseract for scanned documents
Add custom naming option for pdf files
