New feature: extend package to support gold (verified) labels #1004
Labels: help-wanted, needs triage
We want to extend some of the core methods in this package, e.g.:
- `cleanlab.filter.find_label_issues`
- `cleanlab.multiannotator.get_label_quality_multiannotator`

to be more useful in settings where some gold labels are available.
The gold labels are already verified as correct (say, by an expert) and can simply be specified by the user via an optional `verified_labels` argument. This could be, say, a sparse array that only contains classes at indices i corresponding to datapoints whose ground-truth label has been verified as `verified_labels[i]`. Most of the time, users will probably use `verified_labels` only to specify which labels are correct. But occasionally they may also use this argument to specify which labels are wrong, providing the correct label for those datapoints. For datapoints i which are verified as mislabeled but for which no correct label is known, we could allow `verified_labels[i]` to be a missing value, say.

What can be done with these gold labels?
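To make the proposed argument concrete, here is one possible sketch of what a user might pass. The sparse dict representation and the `None` convention for "verified mislabeled, correct class unknown" are assumptions for illustration, not an existing cleanlab API:

```python
# Hypothetical sketch (not an existing cleanlab API): represent
# verified_labels sparsely, keyed by datapoint index. Indices absent
# from the dict have not been verified at all; None marks a datapoint
# verified as mislabeled whose correct class is unknown.
labels = [0, 1, 1, 2, 0]  # the original (possibly noisy) labels

verified_labels = {
    0: 0,     # given label confirmed correct by an expert
    3: 1,     # given label (2) confirmed wrong; true class is 1
    4: None,  # confirmed mislabeled, but correct class unknown
}

# Split the verifications by what they tell us:
confirmed_correct = [i for i, y in verified_labels.items() if y == labels[i]]
confirmed_wrong = [i for i, y in verified_labels.items()
                   if y is not None and y != labels[i]]
wrong_unknown = [i for i, y in verified_labels.items() if y is None]
print(confirmed_correct, confirmed_wrong, wrong_unknown)  # [0] [3] [4]
```

A dict keeps the argument sparse; a dense array with a missing-value sentinel (e.g. `np.nan`) would work equally well as long as the two "unknown" cases (unverified vs. verified-wrong-but-unknown) remain distinguishable.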
We can do hyperparameter tuning of all cleanlab arguments to ensure the set of returned label quality scores and issues aligns best with the gold/verified information. This is different from the cleanlab argument hyperparameter tuning done in this example, which is instead about maximizing the predictive accuracy of an ML model.

Here we are interested in maximizing label error detection performance with respect to the gold labels. For instance, when `verified_labels` only contains verifications that certain given labels are correct, we can optimize the false positive rate of label error detection. If `verified_labels` contains verifications that certain labels are correct and some that are incorrect, then we can optimize more interesting label error detection metrics such as AUROC, AUPRC, precision@k, etc.
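One way this tuning could look in practice is a small grid search that picks whichever scoring configuration best ranks the verified errors. The scoring functions below are simplified stand-ins for cleanlab's per-example label quality scores, and the AUROC helper is a naive pairwise implementation; everything here is an illustrative assumption, not existing cleanlab code:

```python
import itertools

def label_quality_scores(pred_probs, labels, method):
    # Stand-in for cleanlab-style per-example label quality scoring:
    # lower score = more likely mislabeled.
    if method == "self_confidence":
        return [p[y] for p, y in zip(pred_probs, labels)]
    if method == "normalized_margin":
        return [p[y] - max(q for c, q in enumerate(p) if c != y)
                for p, y in zip(pred_probs, labels)]
    raise ValueError(method)

def auroc(scores, is_error):
    # AUROC of detecting errors when ranking by *ascending* score,
    # computed by pairwise comparison (fine for small verified sets).
    pos = [s for s, e in zip(scores, is_error) if e]
    neg = [s for s, e in zip(scores, is_error) if not e]
    pairs = list(itertools.product(pos, neg))
    if not pairs:
        return float("nan")
    return sum((p < n) + 0.5 * (p == n) for p, n in pairs) / len(pairs)

# Toy data: model probabilities, given labels, and expert verifications.
pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
labels = [0, 1, 1, 0]
verified = {0: 0, 2: 0, 3: 1}  # index -> verified true class
is_error = [verified[i] != labels[i] for i in verified]

# Grid search: keep the scoring method whose scores best rank the
# verified errors (evaluated only on the verified subset).
best = max(
    ("self_confidence", "normalized_margin"),
    key=lambda m: auroc([label_quality_scores(pred_probs, labels, m)[i]
                         for i in verified], is_error),
)
print(best)
```

The same loop could sweep any cleanlab argument (filter method, score aggregation, thresholds), and could swap AUROC for precision@k or, when only correct-label verifications exist, the false positive rate on that verified subset.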