How certain are we that cleanlab can find errors on the dataset? #696

MocktaiLEngineer · 2023-05-04T19:09:42Z


Doesn't this become a chicken-and-egg problem? If the dataset is fairly noisy, a model trained on it will almost certainly perform poorly because of that noise. In that case, how can cleanlab find the errors in the dataset?

Thank you for this wonderful library! 👍

Answered by cgnorthcutt


cgnorthcutt (Maintainer) · 2023-05-04T21:29:56Z

Heya @MocktaiLEngineer, great question.

Obvious answer

If your model's accuracy on a perfect version of your dataset (no outliers, no label issues, etc.) were only 50%, you shouldn't expect Cleanlab to boost it beyond that.

Rule of thumb answer

How accurately cleanlab finds issues rises with the accuracy of your model and falls with the amount of error in your dataset. For example, if a model's error rate on your dataset is 30% (70% accuracy) and your dataset contains 20% errors, you might expect the accuracy of the errors/issues found by cleanlab to be something like 100% - (30% + 20%) = 50%. This has some minimal theoretical justification in the theory section of this paper, but it is largely an empirical 'rule of thumb' and should be taken with a grain (or two) of salt.
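To make the arithmetic concrete, here is a minimal sketch of that rule of thumb. The function name `rule_of_thumb_issue_accuracy` is hypothetical (it is not part of cleanlab), and clamping the result to the range 0 to 1 is an added assumption so that very noisy inputs do not yield a negative estimate.

```python
def rule_of_thumb_issue_accuracy(model_error_rate, dataset_error_rate):
    """Rough estimate of how accurate the flagged issues might be.

    Implements 100% - (model error + dataset error) from the answer above,
    using fractions instead of percentages. Hypothetical helper, not a
    cleanlab function.
    """
    estimate = 1.0 - (model_error_rate + dataset_error_rate)
    # Assumption: clamp to [0, 1] so the estimate stays a valid fraction.
    return max(0.0, min(1.0, estimate))


# The example from the answer: 30% model error, 20% dataset error.
print(rule_of_thumb_issue_accuracy(0.30, 0.20))
```

Like the rule itself, treat the number as a ballpark figure, not a guarantee.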

Practical answer (if you need best results)

As long as your model has more signal than noise, you can work with that signal, make some improvements, retrain, and repeat. But you'll need an interface and some other tools, and that's what Cleanlab Studio is for (free to try link here).
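The "improve, retrain, repeat" loop can be illustrated with a toy sketch. This is not cleanlab's algorithm and not how Cleanlab Studio works: it assumes a made-up 1-D dataset, a leave-one-out nearest-centroid classifier as the "model", and it auto-accepts the model's prediction where a human reviewer would normally decide. All names (`flag_suspects`, `clean_iteratively`) are hypothetical.

```python
def centroid(values):
    return sum(values) / len(values)


def flag_suspects(xs, labels):
    """Return (index, predicted_label) for points whose given label
    disagrees with a leave-one-out nearest-centroid prediction."""
    suspects = []
    for i, (x, given) in enumerate(zip(xs, labels)):
        # "Train" on every point except i, so the prediction is out-of-sample.
        groups = {}
        for j, (xj, yj) in enumerate(zip(xs, labels)):
            if j != i:
                groups.setdefault(yj, []).append(xj)
        predicted = min(groups, key=lambda c: abs(x - centroid(groups[c])))
        if predicted != given:
            suspects.append((i, predicted))
    return suspects


def clean_iteratively(xs, labels, max_rounds=5):
    """Flag suspects, fix them, retrain, and repeat until nothing is flagged."""
    labels = list(labels)
    for _ in range(max_rounds):
        suspects = flag_suspects(xs, labels)
        if not suspects:
            break
        for i, predicted in suspects:
            # Assumption: accept the model's label; in practice a person reviews it.
            labels[i] = predicted
    return labels


# Toy data: two well-separated clusters with two labels flipped (indices 1 and 7).
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 8.0, 8.5, 9.0, 9.5, 10.0]
noisy_labels = [0, 1, 0, 0, 0, 1, 1, 0, 1, 1]

print(flag_suspects(xs, noisy_labels))
print(clean_iteratively(xs, noisy_labels))
```

The loop only works here because the clusters carry far more signal than the two flipped labels carry noise, which is the condition the answer describes.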
