How certain are we that cleanlab can find errors on the dataset? #696
Doesn't this become a chicken-and-egg problem? If the dataset is fairly noisy, the model will almost certainly perform poorly on it because of that noise. In that case, how will cleanlab find errors in the dataset? Thank you for this wonderful library! 👍
Replies: 1 comment
Heya @MocktaiLEngineer, great question.
Obvious answer
If your model's performance on a perfect version of your dataset (no outliers, no label issues, etc) was only 50%, you shouldn't expect Cleanlab to boost you beyond that.
Rule of thumb answer
The accuracy of cleanlab is correlated with the accuracy of your model and the amount of error in your dataset. For example, if a model's error rate on your dataset is 30% (70% accuracy) and your dataset contains 20% errors, you might expect the accuracy of errors/issues found by cleanlab to be something like 100% - (30% + 20%) = 50%. This has some minimal theoretical justification in the theory section of this paper, but is largely an empirical 'rule of thumb' and should be used as such with a grain (or two) of salt.
Practical answer (if you need best results)
As long as your model has more signal than noise, you can work with that signal, make some improvements, retrain, and repeat. But you'll need an interface and some other tools and that's what Cleanlab Studio is for (free to try link here).
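To make the rule-of-thumb arithmetic concrete, here is a minimal Python sketch of it. The function name `rule_of_thumb_issue_accuracy` is hypothetical (it is not part of cleanlab), and clipping the result to the range 0 to 1 is an added assumption so that very noisy inputs do not produce a negative estimate. Treat the output as a rough heuristic, not a guarantee.

```python
def rule_of_thumb_issue_accuracy(model_error_rate: float, dataset_error_rate: float) -> float:
    """Rough heuristic from the reply above: 100% - (model error + dataset error).

    Both rates are fractions in [0, 1]. This helper is illustrative only
    and is not a cleanlab API.
    """
    for name, value in (
        ("model_error_rate", model_error_rate),
        ("dataset_error_rate", dataset_error_rate),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")

    estimate = 1.0 - (model_error_rate + dataset_error_rate)
    # Assumption: clip to [0, 1] so the estimate stays a valid fraction.
    return max(0.0, min(1.0, estimate))


if __name__ == "__main__":
    # The example from the reply: 30% model error, 20% dataset errors.
    estimate = rule_of_thumb_issue_accuracy(0.30, 0.20)
    print(f"Estimated accuracy of issues found: {estimate:.0%}")
```

With the numbers from the example (30% model error, 20% label errors) this gives roughly 50%. Once the two error rates sum to 100% or more, the clipped estimate drops to zero, which is the "more noise than signal" case.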