
Added new CLOS train test split tutorial notebook #1071

Open. mturk24 wants to merge 23 commits into master from add-train-test-clos-tutorial.

Conversation

mturk24 (Contributor) commented Mar 28, 2024

Summary

Added a new tutorial that shows how to use CLOS on train-test splits of your data to improve ML performance.

There is currently an issue preventing me from fully building the docs, so I can't yet confirm how quickly (or whether) the new tutorial builds.

Also modified the index files needed to include this tutorial in the main sidebar of the CLOS tutorials. This tutorial also replaces the tabular Datalab tutorial.

Latest update: the bug in the tutorial has been fixed and the index files have been updated appropriately. The latest commits contain the fixes/improvements to the tutorial, and the data in S3 has been updated.
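
For reference, a minimal sketch of the kind of workflow the tutorial covers, assuming a pandas DataFrame df with numeric feature columns and a "label" column; the dataset, model, and column names here are placeholder assumptions, not the notebook's actual code:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from cleanlab import Datalab

# Hold out a test set for evaluation (df is an assumed pandas DataFrame with a "label" column)
train_df, test_df = train_test_split(df, test_size=0.25, random_state=0)
X_train, y_train = train_df.drop(columns="label"), train_df["label"]
X_test, y_test = test_df.drop(columns="label"), test_df["label"]

# Baseline model evaluated on the held-out test set
model = LogisticRegression().fit(X_train, y_train)
print("baseline accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Audit the training data with CLOS (Datalab) for label errors, outliers,
# and (near-)duplicate issues before cleaning the data and retraining
lab = Datalab(data=train_df, label_name="label")
lab.find_issues(features=X_train.to_numpy())
print(lab.get_issue_summary())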

codecov bot commented Mar 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.18%. Comparing base (e0b7615) to head (f2257b2).
Report is 17 commits behind head on master.

❗ Current head f2257b2 differs from pull request most recent head 2096c5d. Consider uploading reports for the commit 2096c5d to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1071      +/-   ##
==========================================
- Coverage   96.20%   96.18%   -0.03%     
==========================================
  Files          76       76              
  Lines        6005     5996       -9     
  Branches     1070      992      -78     
==========================================
- Hits         5777     5767      -10     
  Misses        135      135              
- Partials       93       94       +1     

☔ View full report in Codecov by Sentry.

… iid issues and filtered training data based on exact duplicates between training and test sets
…revious version following the model eval on clean training + test data. Fixed section on using Datalab on training data to clean the data
…up notebook and added more on hyperparameter optimization section. This section still needs to be improved.
… and cleaned up some of the code, put data used into s3 bucket
…ar before DCAI workflow tutorial, and renamed it to improving_ml_performance, also removed datalab tabular tutorial since this tutorial is replacing that one
@mturk24 mturk24 requested review from jwmueller and elisno April 4, 2024 01:52
@mturk24 mturk24 changed the title Added WIP new CLOS train test split tutorial notebook Added new CLOS train test split tutorial notebook Apr 4, 2024
@mturk24 mturk24 requested a review from sanjanag April 4, 2024 01:56
mturk24 (Contributor, Author) commented Apr 4, 2024

Also adding @sanjanag as a reviewer (since she was very helpful and involved in this).

@@ -7,4 +7,3 @@ Datalab Tutorials
Detecting Common Data Issues with Datalab <datalab_quickstart>
Advanced Data Auditing with Datalab <datalab_advanced>
Text Dataset <text>
Tabular Data (Numeric/Categorical) <tabular>
Member:

WDYT about actually keeping the tabular tutorial?

I don't think this new tutorial is really a close replacement for it: the tabular tutorial is about quickly detecting issues in a tabular dataset, and it also makes explicit that Datalab can be used for tabular data.

Contributor Author:

I'm not opposed to that, and given this one isn't being added into the datalab folder, that aligns well.

@jwmueller jwmueller removed the request for review from elisno April 4, 2024 05:09
mturk24 (Contributor, Author) commented Apr 4, 2024

I found a workaround for this issue using this approach, so I was able to build the docs successfully. I'm not sure what the expected build time is, but I'm going to compare the build time with and without the new notebook more thoroughly.

jwmueller (Member):

Can you resolve the merge conflicts? Thanks!

"exact_duplicates_indices = exact_duplicates.index\n",
"\n",
"# Filter the indices to drop by which indices in exact duplicates are <= to the index cutoff\n",
"indices_of_duplicates_to_drop = [idx for idx in exact_duplicates_indices if idx <= train_idx_cutoff]"
Member:

@mturk24 Are you sure this is correct? We only want to drop the data points that are:

  • in the training dataset
  • exactly duplicated with some data point in the test set.

I don't see the test set appearing in any of the code here. It seems like you've just dropped all the training data points that are exact duplicates, even if they're duplicated only with other training data points. We don't want that.
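
For illustration, a minimal sketch of the behavior being requested; preprocessed_train_data is taken from the notebook snippets in this thread, while preprocessed_test_data is a hypothetical counterpart for the test split, and the comparison logic is a hedged example rather than the notebook's actual code:

# Drop a training row only if an identical row also appears in the test set.
# Assumes both DataFrames share the same feature columns.
feature_cols = list(preprocessed_test_data.columns)

train_rows = preprocessed_train_data[feature_cols].apply(tuple, axis=1)
test_rows = set(preprocessed_test_data[feature_cols].apply(tuple, axis=1))

indices_of_duplicates_to_drop = preprocessed_train_data.index[train_rows.isin(test_rows)]
cleaned_train_data = preprocessed_train_data.drop(index=indices_of_duplicates_to_drop)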

mturk24 (Contributor, Author) commented Apr 16, 2024:

You're correct; I'm adjusting the code now to drop only the data points that are:

  • in the training dataset
  • exactly duplicated with some data point in the test set.

When rerunning the code and adjusting the logic, I actually find that there are NO data points that are exact duplicates between the training and test sets.

However, there are data points where is_near_duplicate_issue = True and near_duplicate_score is quite low, as you can see in the screenshot below (715 is the cutoff index between the training and test sets):

[Screenshot, 2024-04-16: Datalab near-duplicate results around the train/test cutoff index]

Therefore, the tutorial's behavior in this section will be to drop no rows from the training set, since the only exact duplicates are within the training set, not between the training and test sets.

I recall adding exact duplicates from the training set to our test set, but I can adjust the test set to ensure we do in fact have exact duplicates that are detected and dropped from the training set here.
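
As a hedged sketch of how that check could be reproduced, assuming lab is a Datalab object fit on the concatenated training and test data and 715 is the last training index (the column names follow cleanlab's near_duplicate issue report and may differ by version):

# Inspect near-duplicate findings from Datalab (fit on the combined train + test data)
near_dups = lab.get_issues("near_duplicate")
flagged = near_dups[near_dups["is_near_duplicate_issue"]].sort_values("near_duplicate_score")

train_idx_cutoff = 715  # cutoff index between training and test rows
print("flagged rows in train:", int((flagged.index <= train_idx_cutoff).sum()))
print("flagged rows in test: ", int((flagged.index > train_idx_cutoff).sum()))
print(flagged.head())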

Member:

See Slack; your proposal is not what we want.

mturk24 (Contributor, Author) commented Apr 30, 2024:

Addressed this in one of the latest commits.

"source": [
"# Define training index cutoff and find the exact duplicate indices to reference\n",
"train_idx_cutoff = len(preprocessed_train_data) - 1\n",
"exact_duplicates_indices = exact_duplicates.index\n",
Member:

I think this needs to be subsetted to the set of exact_duplicates where at least one of the datapoints is from the test set.

@@ -154,6 +154,11 @@ Link to Cleanlab Studio docs: `help.cleanlab.ai <https://help.cleanlab.ai/>`_
Datalab Tutorials <tutorials/datalab/index>
Member:

Please rebase to make this look like the master branch version of the index, except for inserting your new Improving ML Performance tutorial in the sidebar right in between Datalab Tutorials and CleanLearning Tutorials. I think that's the best place for it (it's technically a Datalab tutorial too, but it is so important that I think it should be outside of the Datalab sidebar).

Contributor Author:

Yep, I just made sure my version matches the master branch version, with my tutorial added in between Datalab and CleanLearning.

@@ -5,6 +5,7 @@ Tutorials
:maxdepth: 1

datalab/
improving_ml_performance
Member:

This is the correct place for the new tutorial, so make the other index.rst file match this location.

Contributor Author:

Just fixed the other index.rst file to match this in the latest commit.

@jwmueller jwmueller requested review from jwmueller and removed request for sanjanag April 16, 2024 03:50
jwmueller (Member) left a comment:

I think there may be a bug early on in this tutorial, so I will stop reviewing until you've had a look and pinged me about it (it seems like all subsequent results are affected if this step changes).

Specifically, this is what we want to do: drop from the training set the extra duplicated copies of test data points found in that training set.

But I think your code is simply dropping extra copies of any exact duplicate of a training data point, regardless of whether the set of exact duplicates contains only training data (and no test data).

…s from training data that are exact duplicat with test set, updated seed usage to be proper, and fixed unit tests accordingly
@mturk24 mturk24 force-pushed the add-train-test-clos-tutorial branch from 2bfafe9 to 43dfe63 Compare April 30, 2024 21:33
@mturk24 mturk24 closed this Apr 30, 2024
@mturk24 mturk24 force-pushed the add-train-test-clos-tutorial branch from 43dfe63 to 13442e2 Compare April 30, 2024 21:39
@mturk24 mturk24 reopened this Apr 30, 2024
@mturk24 mturk24 force-pushed the add-train-test-clos-tutorial branch from 43dfe63 to 83d4209 Compare April 30, 2024 21:45
…torial added between datalab and cleanlearning
@mturk24 mturk24 requested a review from jwmueller April 30, 2024 21:50