DataIntegrationDuplicateDetection

[Uni] Data Integration Exercise 3 - Data Deduplication

In this exercise we were supposed to parse a dataset and find duplicate rows in it. The rows don't have to be exact duplicates; most are fuzzy duplicates (with typos, missing attributes, etc.). The main problem was the size of the dataset (94,000 rows), which made the brute-force approach very time- and memory-consuming. A handful of other algorithms had to be implemented to achieve a decent runtime.
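To illustrate why brute force is a problem and what an alternative can look like: comparing every pair among 94,000 rows means roughly 4.4 billion comparisons. The README does not name the algorithms that were implemented, so the following is only a minimal sketch of one common technique, the sorted-neighborhood method, and not necessarily what `main.py` does. The function names, the sort key, the window size, and the threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def sorted_neighborhood(rows, key, window=3, threshold=0.85):
    """Return index pairs of likely fuzzy duplicates.

    rows: list of strings, one per record (assumed already normalized).
    key: function that builds the sort key for a row (hypothetical choice).
    window: size of the sliding window, including the row itself.
    threshold: minimum similarity for two rows to count as duplicates.
    """
    # Sort row indices by key so that similar rows end up close together.
    order = sorted(range(len(rows)), key=lambda i: key(rows[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Compare each row only with the next (window - 1) rows in sort order,
        # instead of with every other row in the dataset.
        for j in order[pos + 1 : pos + window]:
            if similarity(rows[i], rows[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


if __name__ == "__main__":
    # Made-up sample rows, not taken from inputDB.csv.
    sample = [
        "john smith 1980",
        "jane doe 1975",
        "jon smith 1980",
        "jane do 1975",
        "alice brown 1990",
    ]
    print(sorted_neighborhood(sample, key=lambda row: row))
```

With a fixed window, the number of comparisons grows linearly with the number of rows rather than quadratically. The trade-off is that duplicates whose sort keys differ early (for example a typo in the first character) can land far apart and be missed, which is why such approaches are often run several times with different keys.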

Files in this repository:

- .idea
- .gitignore
- README.md
- SSNcompare.py
- inputDB.csv
- main.py
- merge_results.py
- presentation.pptx
- smallInput.csv
- smallInput2.csv
- smalltest.csv

LorToso/DataIntegrationDuplicateDetection

Folders and files

Latest commit

History

Repository files navigation

DataIntegrationDuplicateDetection

About

Topics

Resources

Stars

Watchers

Forks

Languages