🧼🔎 SelfClean

A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates, and label errors.

Publications: SelfClean Paper | Data Cleaning Protocol Paper (ML4H24@NeurIPS)

NOTE: Make sure to have git-lfs installed before pulling the repository to ensure the pre-trained models are pulled correctly (git-lfs install instructions).

This project is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International license.

Installation

Install SelfClean via PyPI:

# upgrade pip to its latest version
pip install -U pip

# install selfclean
pip install selfclean

# Alternatively, use an explicit Python version (replace XX with the minor version)
python3.XX -m pip install selfclean

Getting Started

You can run SelfClean in a few lines of code:

import copy

from selfclean import SelfClean

selfclean = SelfClean()

# run on a PyTorch dataset
# (`dataset` is a placeholder for your own dataset object, passed as a shallow copy)
issues = selfclean.run_on_dataset(
    dataset=copy.copy(dataset),
)
# or run on an image folder
issues = selfclean.run_on_image_folder(
    input_path="path/to/images",
)

# get the data quality issue rankings
df_near_duplicates = issues.get_issues("near_duplicates", return_as_df=True)
df_irrelevants = issues.get_issues("irrelevants", return_as_df=True)
df_label_errors = issues.get_issues("label_errors", return_as_df=True)
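
Once a ranking is available, a common next step is to review the highest-ranked candidates and drop the ones confirmed as issues. The sketch below illustrates that step in plain Python. It is not part of the SelfClean API: the helper names `top_k_candidates` and `keep_indices`, the example data, and the assumption that a ranking can be reduced to (sample index, score) pairs ordered with the most likely issues first are all hypothetical. Check the actual layout of the returned rankings before adapting it.

```python
from typing import Iterable, List, Sequence, Tuple


def top_k_candidates(ranking: Sequence[Tuple[int, float]], k: int) -> List[int]:
    """Return the sample indices of the first k entries of a ranking.

    Assumes `ranking` is already ordered so the most likely issues come first.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return [index for index, _score in ranking[:k]]


def keep_indices(num_samples: int, flagged: Iterable[int]) -> List[int]:
    """Return the indices that remain after dropping the flagged samples."""
    flagged_set = set(flagged)
    return [i for i in range(num_samples) if i not in flagged_set]


# Hypothetical ranking of (sample index, score), most suspicious first.
example_ranking = [(7, 0.01), (2, 0.05), (5, 0.40), (0, 0.90)]

# Pick the two highest-ranked candidates for manual review.
to_review = top_k_candidates(example_ranking, k=2)

# Indices of an 8-sample dataset that survive if both candidates are removed.
remaining = keep_indices(num_samples=8, flagged=to_review)
```

Reviewing only the top of the ranking keeps manual effort bounded; how many candidates to inspect depends on the dataset and is left as the parameter `k`.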

Examples: The examples/ directory contains notebooks that show how to analyze and clean datasets with SelfClean. They cover benchmark datasets such as:

Imagenette 🖼️ (Open in NBViewer | GitHub | Colab)
Oxford-IIIT Pet 🐶 (Open in NBViewer | GitHub | Colab)

Development Environment

Run make for a list of possible targets.

Run these commands to install the requirements for the development environment:

make init
make install

To run linters on all files:

pre-commit run --all-files

We use the following packages for code and test conventions:

black for code style
isort for import sorting
pytest for running tests
