New feature: extend package to gold (verified) labels #1004

Open
jwmueller opened this issue Feb 13, 2024 · 0 comments
Labels
help-wanted (We need your help to add this, but it may be more challenging than a "good first issue"), needs triage

We want to extend some of the core methods in this package, e.g.:

cleanlab.filter.find_label_issues

cleanlab.multiannotator.get_label_quality_multiannotator

to be more useful in settings where there are some gold labels available.

The gold labels have already been verified as correct (say, by an expert) and can be specified by the user via an optional verified_labels argument. This could be, say, a sparse array that only contains classes at the indices i of datapoints whose ground-truth label has been verified, with verified_labels[i] giving that verified label.

Most of the time, users will probably use verified_labels only to confirm which given labels are correct. Occasionally they may also use it to mark which given labels are wrong, by specifying the correct label for those datapoints. For datapoints i that are verified as mislabeled but for which no correct label exists, we could allow verified_labels[i] to be, say, a missing value.
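As a rough illustration of the three cases above, here is a minimal sketch. It assumes verified_labels is passed as a plain dict from datapoint index to verified class, with None standing in for the missing value. Both the dict representation and the helper name verified_error_mask are hypothetical and not part of cleanlab.

```python
import numpy as np

def verified_error_mask(labels, verified_labels):
    """Hypothetical helper: for each verified datapoint, report whether
    its given label is known to be wrong.

    verified_labels maps index -> verified class, or None when the given
    label is verified as wrong but no correct label is available.
    """
    verified_idx = np.array(sorted(verified_labels), dtype=int)
    is_error = np.array(
        [
            verified_labels[i] is None or verified_labels[i] != labels[i]
            for i in verified_idx
        ],
        dtype=bool,
    )
    return verified_idx, is_error

# Toy data: five given labels, three of which have been verified.
labels = np.array([0, 1, 2, 1, 0])
verified_labels = {
    0: 0,     # given label confirmed correct
    2: 1,     # given label wrong, correct class supplied
    4: None,  # given label wrong, no correct class exists
}
verified_idx, is_error = verified_error_mask(labels, verified_labels)
```

Datapoints absent from the dict are simply unverified, which keeps the argument sparse when only a small fraction of the dataset has been reviewed.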

What can be done with these gold labels?

We can do hyperparameter tuning of all cleanlab arguments to ensure that the returned label quality scores and issues align as closely as possible with the gold/verified information. This is different from the cleanlab argument hyperparameter tuning done in this example, which is instead about maximizing the predictive accuracy of an ML model.

Here we are interested in maximizing label error detection performance with respect to the gold labels. For instance, when verified_labels only contains verifications of given labels that are correct, we can optimize the false positive rate of label error detection. If verified_labels contains verifications of some labels that are correct and some that are incorrect, then we can optimize most of the interesting label error detection metrics, such as AUROC, AUPRC, precision@k, etc.
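The evaluation described here could look roughly like the sketch below, which uses only NumPy. The function names, the toy scores, and the single threshold being tuned are all assumptions for illustration; the threshold stands in for whichever cleanlab argument is being tuned, and the selection rule (recall minus false positive rate) is just one possible objective.

```python
import numpy as np

def false_positive_rate(is_issue, is_error):
    """Fraction of verified-correct labels that were flagged as issues."""
    ok = ~is_error
    if not ok.any():
        raise ValueError("need at least one verified-correct label")
    return float(is_issue[ok].mean())

def detection_auroc(quality_scores, is_error):
    """AUROC of label error detection, where lower quality scores should
    indicate errors. Computed over all (error, correct) pairs; ties count half."""
    err = quality_scores[is_error]
    ok = quality_scores[~is_error]
    if len(err) == 0 or len(ok) == 0:
        raise ValueError("need both verified-correct and verified-wrong labels")
    diff = ok[None, :] - err[:, None]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

# Toy label quality scores for the verified datapoints only.
quality_scores = np.array([0.9, 0.2, 0.8, 0.4, 0.1])
is_error = np.array([False, True, False, False, True])

auroc = detection_auroc(quality_scores, is_error)

# Hypothetical hyperparameter: flag a label as an issue if its score < threshold.
best_threshold, best_objective = None, -np.inf
for threshold in [0.15, 0.3, 0.5]:
    is_issue = quality_scores < threshold
    recall = float(is_issue[is_error].mean())
    objective = recall - false_positive_rate(is_issue, is_error)
    if objective > best_objective:
        best_threshold, best_objective = threshold, objective
```

When verified_labels only confirms correct labels, is_error is all False, so only false_positive_rate is defined and the AUROC helper raises an error, which matches the distinction drawn above.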

@jwmueller jwmueller added the help-wanted We need your help to add this, but it may be more challenging than a "good first issue" label Apr 11, 2024