
Added new CLOS train test split tutorial notebook #1071

Open. mturk24 wants to merge 23 commits into master from add-train-test-clos-tutorial.

Conversation

mturk24 (Contributor) commented Mar 28, 2024

Summary

Added a new tutorial that shows how to use CLOS on train-test splits of your data to improve ML performance.

There is currently an issue preventing me from fully building the docs, so I can't yet confirm how quickly (or whether) the new tutorial builds.

Also modified the index files needed to include this tutorial in the main sidebar of the CLOS tutorials. This tutorial also replaces the tabular Datalab tutorial.

Latest update: the bug in the tutorial has been fixed and the index files have been updated appropriately. The latest commits contain the fixes/improvements to the tutorial, and the data in S3 has been updated.
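
For reference, a minimal sketch of the kind of workflow the tutorial covers, assuming a pandas DataFrame df with numeric feature columns and a "label" column; the dataset, model, and column names here are placeholder assumptions, not the notebook's actual code:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from cleanlab import Datalab

# Hold out a test set for evaluation (df is an assumed pandas DataFrame with a "label" column)
train_df, test_df = train_test_split(df, test_size=0.25, random_state=0)
X_train, y_train = train_df.drop(columns="label"), train_df["label"]
X_test, y_test = test_df.drop(columns="label"), test_df["label"]

# Baseline model evaluated on the held-out test set
model = LogisticRegression().fit(X_train, y_train)
print("baseline accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Audit the training data with CLOS (Datalab) for label errors, outliers,
# and (near-)duplicate issues before cleaning the data and retraining
lab = Datalab(data=train_df, label_name="label")
lab.find_issues(features=X_train.to_numpy())
print(lab.get_issue_summary())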

codecov bot commented Mar 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.18%. Comparing base (e0b7615) to head (f2257b2).
Report is 17 commits behind head on master.

❗ Current head f2257b2 differs from pull request most recent head 2096c5d. Consider uploading reports for the commit 2096c5d to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1071      +/-   ##
==========================================
- Coverage   96.20%   96.18%   -0.03%     
==========================================
  Files          76       76              
  Lines        6005     5996       -9     
  Branches     1070      992      -78     
==========================================
- Hits         5777     5767      -10     
  Misses        135      135              
- Partials       93       94       +1     

☔ View full report in Codecov by Sentry.

… iid issues and filtered training data based on exact duplicates between training and test sets
…revious version following the model eval on clean training + test data. Fixed section on using Datalab on training data to clean the data
…up notebook and added more on hyperparameter optimization section. This section still needs to be improved.
… and cleaned up some of the code, put data used into s3 bucket
…ar before DCAI workflow tutorial, and renamed it to improving_ml_performance, also removed datalab tabular tutorial since this tutorial is replacing that one
@mturk24 mturk24 requested review from jwmueller and elisno April 4, 2024 01:52
@mturk24 mturk24 changed the title Added WIP new CLOS train test split tutorial notebook Added new CLOS train test split tutorial notebook Apr 4, 2024
@mturk24 mturk24 requested a review from sanjanag April 4, 2024 01:56
mturk24 (Contributor, Author) commented Apr 4, 2024

Also adding @sanjanag as a reviewer (since she was very helpful and involved in this).

@@ -7,4 +7,3 @@ Datalab Tutorials
Detecting Common Data Issues with Datalab <datalab_quickstart>
Advanced Data Auditing with Datalab <datalab_advanced>
Text Dataset <text>
Tabular Data (Numeric/Categorical) <tabular>
Member:

WDYT about actually keeping the tabular tutorial?

I don't think this new tutorial is really a close replacement for it: the tabular tutorial is about quickly detecting issues in a tabular dataset, and it also makes explicit that Datalab can be used for tabular data.

Contributor Author:

I'm not opposed to that, and given this one isn't being added into the datalab folder, that aligns well.

@jwmueller jwmueller removed the request for review from elisno April 4, 2024 05:09
mturk24 (Contributor, Author) commented Apr 4, 2024

I found a workaround for this issue using this approach, so I was able to build the docs successfully. I'm not sure what the expected build time is, but I'm going to compare the build time with and without the new notebook more thoroughly.

jwmueller (Member):

Can you resolve the merge conflicts? Thanks!

"exact_duplicates_indices = exact_duplicates.index\n",
"\n",
"# Filter the indices to drop by which indices in exact duplicates are <= to the index cutoff\n",
"indices_of_duplicates_to_drop = [idx for idx in exact_duplicates_indices if idx <= train_idx_cutoff]"
Member:

@mturk24 Are you sure this is correct? We only want to drop the data points that are:

  • in the training dataset
  • exactly duplicated with some data point in the test set.

I don't see the test set appearing in any of the code here. It seems like you've just dropped all the training data points that are exact duplicates, even if they're duplicated only with other training data points. We don't want that.
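
For illustration, a minimal sketch of the behavior being requested; preprocessed_train_data is taken from the notebook snippets in this thread, while preprocessed_test_data is a hypothetical counterpart for the test split, and the comparison logic is a hedged example rather than the notebook's actual code:

# Drop a training row only if an identical row also appears in the test set.
# Assumes both DataFrames share the same feature columns.
feature_cols = list(preprocessed_test_data.columns)

train_rows = preprocessed_train_data[feature_cols].apply(tuple, axis=1)
test_rows = set(preprocessed_test_data[feature_cols].apply(tuple, axis=1))

indices_of_duplicates_to_drop = preprocessed_train_data.index[train_rows.isin(test_rows)]
cleaned_train_data = preprocessed_train_data.drop(index=indices_of_duplicates_to_drop)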

mturk24 (Contributor, Author) commented Apr 16, 2024:

You're correct; I'm adjusting the code now to drop only the data points that are:

  • in the training dataset
  • exactly duplicated with some data point in the test set.

When rerunning the code and adjusting the logic, I actually find that there are NO data points that are exact duplicates between the training and test sets.

However, there are data points where is_near_duplicate_issue = True and near_duplicate_score is quite low, as you can see in the screenshot below (715 is the cutoff index between the training and test sets):

[Screenshot, 2024-04-16: Datalab near-duplicate results around the train/test cutoff index]

Therefore, the tutorial's behavior in this section will be to drop no rows from the training set, since the only exact duplicates are within the training set, not between the training and test sets.

I recall adding exact duplicates from the training set to our test set, but I can adjust the test set to ensure we do in fact have exact duplicates that are detected and dropped from the training set here.
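
As a hedged sketch of how that check could be reproduced, assuming lab is a Datalab object fit on the concatenated training and test data and 715 is the last training index (the column names follow cleanlab's near_duplicate issue report and may differ by version):

# Inspect near-duplicate findings from Datalab (fit on the combined train + test data)
near_dups = lab.get_issues("near_duplicate")
flagged = near_dups[near_dups["is_near_duplicate_issue"]].sort_values("near_duplicate_score")

train_idx_cutoff = 715  # cutoff index between training and test rows
print("flagged rows in train:", int((flagged.index <= train_idx_cutoff).sum()))
print("flagged rows in test: ", int((flagged.index > train_idx_cutoff).sum()))
print(flagged.head())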

Member:

See Slack; your proposal is not what we want.

mturk24 (Contributor, Author) commented Apr 30, 2024:

Addressed this in one of the latest commits.

"source": [
"# Define training index cutoff and find the exact duplicate indices to reference\n",
"train_idx_cutoff = len(preprocessed_train_data) - 1\n",
"exact_duplicates_indices = exact_duplicates.index\n",
Member:

I think this needs to be subsetted to the set of exact_duplicates where at least one of the datapoints is from the test set.

@@ -154,6 +154,11 @@ Link to Cleanlab Studio docs: `help.cleanlab.ai <https://help.cleanlab.ai/>`_
Datalab Tutorials <tutorials/datalab/index>
Member:

Please rebase to make this look like the master branch version of the index, except for inserting your new Improving ML Performance tutorial in the sidebar right in between Datalab Tutorials and CleanLearning Tutorials. I think that's the best place for it (it's technically a Datalab tutorial too, but it is so important that I think it should be outside of the Datalab sidebar).

Contributor Author:

Yep, I just made sure my version matches the master branch version, with my tutorial added in between Datalab and CleanLearning.

@@ -5,6 +5,7 @@ Tutorials
:maxdepth: 1

datalab/
improving_ml_performance
Member:

This is the correct place for the new tutorial, so make the other index.rst file match this location.

Contributor Author:

Just fixed the other index.rst file to match this in the latest commit.

@jwmueller jwmueller requested review from jwmueller and removed request for sanjanag April 16, 2024 03:50
jwmueller (Member) left a comment:

I think there may be a bug early on in this tutorial, so I will stop reviewing until you've had a look and pinged me about it (it seems like all subsequent results are affected if this step changes).

Specifically, this is what we want to do: drop from the training set the extra duplicated copies of test data points found in that training set.

But I think your code is simply dropping extra copies of any exact duplicate of a training data point, regardless of whether the set of exact duplicates contains only training data (and no test data).

…s from training data that are exact duplicat with test set, updated seed usage to be proper, and fixed unit tests accordingly
@mturk24 mturk24 force-pushed the add-train-test-clos-tutorial branch from 2bfafe9 to 43dfe63 Compare April 30, 2024 21:33
@mturk24 mturk24 closed this Apr 30, 2024
@mturk24 mturk24 force-pushed the add-train-test-clos-tutorial branch from 43dfe63 to 13442e2 Compare April 30, 2024 21:39
@mturk24 mturk24 reopened this Apr 30, 2024
@mturk24 mturk24 force-pushed the add-train-test-clos-tutorial branch from 43dfe63 to 83d4209 Compare April 30, 2024 21:45
…torial added between datalab and cleanlearning
@mturk24 mturk24 requested a review from jwmueller April 30, 2024 21:50