LorToso/DataIntegrationDuplicateDetection

DataIntegrationDuplicateDetection

[Uni] Data Integration Exercise 3 - Data Deduplication

In this exercise, the task was to parse a dataset and find duplicate rows in it. The rows do not have to be exact duplicates; most are fuzzy duplicates that differ by typos, missing attributes, and similar noise. The main problem was the size of the dataset (94,000 rows), which made the brute-force approach very time- and memory-consuming. A handful of other algorithms had to be implemented to achieve a decent runtime.
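To make the scale concrete: comparing every row with every other row means n(n-1)/2 comparisons, which for 94,000 rows is roughly 4.4 billion. The sketch below illustrates one common way to cut this down, the sorted neighborhood method, which sorts rows by a cheap key and only compares rows that end up close together. This is an assumption for illustration only: the README does not say which algorithms were actually implemented, and the names `similarity` and `sorted_neighborhood`, the sorting key, the window size, and the threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Mean per-attribute string similarity in [0, 1].

    Attributes that are empty on either side are skipped, so a missing
    value does not count against a pair.
    """
    scores = [
        SequenceMatcher(None, x.lower(), y.lower()).ratio()
        for x, y in zip(a, b)
        if x and y
    ]
    return sum(scores) / len(scores) if scores else 0.0


def sorted_neighborhood(rows, key, window=3, threshold=0.85):
    """Return index pairs (i, j) with i < j that look like duplicates.

    Rows are sorted by `key`; each row is compared only with the next
    `window - 1` rows in that order, giving about n * window comparisons
    instead of n * (n - 1) / 2.
    """
    order = sorted(range(len(rows)), key=lambda i: key(rows[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if similarity(rows[i], rows[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical toy data: (first name, last name, city).
rows = [
    ("John", "Smith", "Berlin"),
    ("Alice", "Meyer", "Hamburg"),
    ("Jon", "Smith", "Berlin"),   # typo in first name
    ("Alice", "Meier", ""),       # typo in last name, missing city
    ("Bob", "Jones", "Munich"),
]

# Assumed key: first three letters of the last name, then the first initial.
duplicates = sorted_neighborhood(
    rows, key=lambda r: (r[1][:3].lower(), r[0][:1].lower())
)
print(duplicates)  # [(0, 2), (1, 3)]
```

The trade-off is recall: two duplicates whose keys sort far apart are never compared, so in practice the method is often run several times with different keys and the resulting pairs are merged.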
