Added new CLOS train test split tutorial notebook #1071
base: master
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##           master    #1071      +/-   ##
==========================================
- Coverage   96.20%   96.18%   -0.03%
==========================================
  Files          76       76
  Lines        6005     5996       -9
  Branches     1070      992      -78
==========================================
- Hits         5777     5767      -10
  Misses        135      135
- Partials       93       94       +1

View full report in Codecov by Sentry.
… iid issues and filtered training data based on exact duplicates between training and test sets
…revious version following the model eval on clean training + test data. Fixed section on using Datalab on training data to clean the data
…up notebook and added more on hyperparameter optimization section. This section still needs to be improved.
… and cleaned up some of the code, put data used into s3 bucket
…ar before DCAI workflow tutorial, and renamed it to improving_ml_performance, also removed datalab tabular tutorial since this tutorial is replacing that one
Also adding @sanjanag as reviewer (since she was very helpful/involved in this)
@@ -7,4 +7,3 @@ Datalab Tutorials
   Detecting Common Data Issues with Datalab <datalab_quickstart>
   Advanced Data Auditing with Datalab <datalab_advanced>
   Text Dataset <text>
-  Tabular Data (Numeric/Categorical) <tabular>
WDYT about actually keeping the tabular tutorial?
I actually don't think this new tutorial is really a close replacement for that tabular tutorial, which is about quickly detecting issues in a tabular dataset. Also the tabular tutorial makes it explicit that Datalab can be used for tabular data
I am not opposed to that - and given this one isn't being added into the datalab folder, that aligns well.
I was able to work around this issue using that approach, so I could build the docs successfully. I'm not sure what the expected build runtime is, but I'm going to compare build times with and without the new notebook more thoroughly.
Can you resolve the merge conflicts? Thanks!
…als and adjusted intro section as well
exact_duplicates_indices = exact_duplicates.index

# Filter the indices to drop by which indices in exact duplicates are <= to the index cutoff
indices_of_duplicates_to_drop = [idx for idx in exact_duplicates_indices if idx <= train_idx_cutoff]
@mturk24 Are you sure this is correct? We only want to drop the data points that are:
- in the training dataset
- exactly duplicated with some data point in the test set.
I don't see the test set appearing anywhere in this code. It seems like you've just dropped all the training data points that are exact duplicates, even if they're duplicated only with other training data points. We don't want that.
You're correct, I am adjusting the code now to only drop the data points that are:
- in the training dataset
- exactly duplicated with some data point in the test set.
When rerunning the code and adjusting the logic, I actually find that there are NO data points that are exact duplicates between the training and test sets.
However, there are data points where is_near_duplicate_issue = True and near_duplicate_score is quite low, as you can see in the screenshot below (715 is the cutoff index between training and test sets):
Therefore, the behavior in the tutorial will be to drop no rows in the training set in this section since the only EXACT duplicates are intra-training set and not between training and test sets.
I recall adding exact duplicates to our test set from the training set, but can adjust the test set to ensure we do in fact have exact duplicates that are detected and dropped from the training set here
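The cross-split check being discussed here can be sketched as follows (a minimal sketch with made-up duplicate sets and the 715 cutoff mentioned in this thread; duplicate_sets is a hypothetical stand-in for whatever index groups the duplicate audit actually returns):

```python
# Hypothetical duplicate sets (lists of row indices into the combined
# train+test table); 715 is the train/test cutoff index from this thread.
train_idx_cutoff = 715
duplicate_sets = [
    [12, 40],      # both rows in the training data
    [700, 716],    # spans the cutoff: one training row, one test row
    [800, 801],    # both rows in the test data
]

# A duplicate set only matters for train/test leakage if it mixes
# training rows (index <= cutoff) with test rows (index > cutoff).
cross_split_sets = [
    s for s in duplicate_sets
    if any(i <= train_idx_cutoff for i in s)
    and any(i > train_idx_cutoff for i in s)
]
print(cross_split_sets)  # only the set(s) mixing train and test rows
```

Sets that lie entirely on one side of the cutoff (intra-train or intra-test duplicates) are ignored, which matches the intended tutorial behavior of dropping no rows when no duplicates cross the split.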
see slack, your proposal is not what we want
Addressed this in one of the latest commits
# Define training index cutoff and find the exact duplicate indices to reference
train_idx_cutoff = len(preprocessed_train_data) - 1
exact_duplicates_indices = exact_duplicates.index
I think this needs to be subsetted to the set of exact_duplicates where at least one of the datapoints is from the test set.
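A minimal sketch of that subsetting, using a toy combined train+test table (the names and data here are hypothetical; the real tutorial uses preprocessed_train_data and the duplicate output from Datalab, neither of which is reproduced here):

```python
import pandas as pd

# Toy combined data: rows 0..train_idx_cutoff are training, the rest are test.
combined = pd.DataFrame({
    "feature": [1.0, 2.0, 3.0, 1.0, 2.0, 5.0],
    "label":   [0,   1,   0,   0,   1,   1],
})
train_idx_cutoff = 3  # rows 0-3 train, rows 4-5 test

# Group rows by their full content to find exact-duplicate sets.
groups = combined.groupby(list(combined.columns)).indices

indices_to_drop = []
for _, idxs in groups.items():
    idxs = list(idxs)
    if len(idxs) < 2:
        continue  # unique row, not a duplicate set
    # Keep only duplicate sets that contain at least one test-set row...
    if any(i > train_idx_cutoff for i in idxs):
        # ...and drop only the training-set members of those sets.
        indices_to_drop.extend(i for i in idxs if i <= train_idx_cutoff)

print(sorted(indices_to_drop))
```

Here rows 0 and 3 are exact duplicates but both live in the training split, so they are kept; row 1 duplicates test row 4, so only row 1 is dropped.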
@@ -154,6 +154,11 @@ Link to Cleanlab Studio docs: `help.cleanlab.ai <https://help.cleanlab.ai/>`_
   Datalab Tutorials <tutorials/datalab/index>
Please rebase to make this look like the master branch version of the index, except for inserting your new Improving ML Performance tutorial in the sidebar right in between Datalab Tutorials and CleanLearning Tutorials. I think that's the best place for it (it's technically a Datalab tutorial too, but is so important I think it should be outside of the Datalab sidebar).
Yep, just made sure my version is the master branch version with my tutorial added in between Datalab and CleanLearning.
@@ -5,6 +5,7 @@ Tutorials
   :maxdepth: 1

   datalab/
+  improving_ml_performance
this is the correct place for the new tutorial, so make the other index.rst file match this location
Just fixed the other index.rst file to match this in latest commit
I think there may be a bug early on in this tutorial, so will stop reviewing until you've had a look and pinged me about it (because it seems like all subsequent results are affected if this step changes).
Specifically this is what we want to do: drop the extra duplicated copies of test data points found in our training set from this training set.
But I think your code is simply dropping extra copies of any exact duplicate of a training data point, regardless of whether the set of exact duplicates contains only training data (and no test data).
…s from training data that are exact duplicates with the test set, updated seed usage to be proper, and fixed unit tests accordingly
Force-pushed from 2bfafe9 to 43dfe63
Force-pushed from 43dfe63 to 13442e2
Force-pushed from 43dfe63 to 83d4209
…torial added between datalab and cleanlearning
Summary
Added a new tutorial that shows how to improve ML performance using train-test splits on your data with CLOS.
There is currently an issue preventing me from fully building the docs to see how quickly (and whether successfully) the new tutorial builds.
Also modified the index files needed to include this in the main sidebar of the CLOS tutorials. This replaces the tabular Datalab tutorial as well.
Latest update: the bug in the tutorial has been fixed and the index files have been updated appropriately. The latest commits show the fixes/improvements to the tutorial, and the data in S3 has been updated.