added datapoint for a small dataset #1249

Open · wants to merge 1 commit into main

Conversation

@levscaut (Contributor) commented Oct 19, 2023

Why are these changes needed?

Currently, the default LGBMClassifier does not have a datapoint for the small dataset generated from the code snippet below:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, train_size=0.5)
```

Applying LGBMClassifier to this dataset yields lower performance than expected:

```python
from flaml.default import LGBMClassifier

lgbm = LGBMClassifier().fit(X_train, y_train)
lgbm.score(X_test, y_test)
```

Hence, I added a datapoint with the meta-features of this dataset to the LGBMClassifier default config. I'm not sure whether this new datapoint will break other scenarios that use the default LGBMClassifier; test/default passes on my machine.
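For context, here is a minimal sketch of the kind of meta-features such a datapoint would describe; the names and values below (n_rows, n_features, n_classes, percent_numeric) are illustrative assumptions, not the exact schema stored under flaml/default.

```python
# Illustrative only: simple size-based meta-features for the iris half-split above.
# The actual meta-features and normalization used by flaml.default may differ.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, train_size=0.5, random_state=0
)

meta_features = {
    "n_rows": X_train.shape[0],            # 75 training instances
    "n_features": X_train.shape[1],        # 4 numeric features
    "n_classes": len(np.unique(y_train)),  # 3 classes
    "percent_numeric": 1.0,                # iris is fully numeric
}
print(meta_features)
```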

According to my tests, XGBClassifier and RandomForestClassifier don't have this issue; they perform well on this dataset.

Related issue number

#1247


@sonichi (Collaborator) commented Oct 21, 2023

The change is simple. Some performance tests will be needed to approve this PR.

@levscaut (Contributor, Author)

> The change is simple. Some performance tests will be needed to approve this PR.

Thanks for the review! I'm happy to help run performance tests if needed.

@amueller

@sonichi how were the original 5 selected? I did some similar work a couple of years ago and used a greedy approximation because I was going for a ranking, not a hard subset. Did you use a larger benchmark set and some partitioning of the space or integer programming to get to the subset?

I can run my benchmark suite on the branch and see if it helps improve accuracy on the datasets I'm looking at. But we probably also want to run against whatever system / benchmark you originally used.
Maybe at least running against the AutoML benchmark?

@sonichi (Collaborator) commented Oct 24, 2023

> @sonichi how were the original 5 selected? I did some similar work a couple of years ago and used a greedy approximation because I was going for a ranking, not a hard subset. Did you use a larger benchmark set and some partitioning of the space or integer programming to get to the subset?
>
> I can run my benchmark suite on the branch and see if it helps improve accuracy on the datasets I'm looking at. But we probably also want to run against whatever system / benchmark you originally used. Maybe at least running against the AutoML benchmark?

I used a new greedy algorithm in the zero-shot AutoML paper to select the portfolio from a large set of candidate configurations.
I agree that we should run against the AutoML benchmark for the multiclass tasks.
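For readers following along, here is a rough sketch of the greedy idea as described above (my reading of the approach, not the actual FLAML code): starting from an empty portfolio, repeatedly add the candidate configuration that most reduces the remaining gap to the per-task oracle score on the meta-training tasks. The `scores` data below is a placeholder.

```python
# Toy greedy portfolio construction: scores[c][t] is assumed to be the validation
# score of candidate configuration c on meta-training task t (placeholder data).
def build_portfolio(scores: dict, k: int) -> list:
    tasks = list(next(iter(scores.values())))
    oracle = {t: max(s[t] for s in scores.values()) for t in tasks}

    def regret(port):
        # Total gap to the oracle if the best portfolio member were chosen per task.
        return sum(oracle[t] - max(scores[c][t] for c in port) for t in tasks)

    portfolio = []
    for _ in range(min(k, len(scores))):
        best_next = min(
            (c for c in scores if c not in portfolio),
            key=lambda c: regret(portfolio + [c]),
        )
        portfolio.append(best_next)
    return portfolio


# Tiny example: "a" covers task1 best, "b" covers task2 best.
scores = {
    "a": {"task1": 0.9, "task2": 0.5},
    "b": {"task1": 0.6, "task2": 0.9},
    "c": {"task1": 0.7, "task2": 0.7},
}
print(build_portfolio(scores, k=2))  # -> ['b', 'a']
```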

@amueller

Can you please link the paper? I'm not sure which one that is.

@sonichi (Collaborator) commented Oct 24, 2023

@amueller

With the PR, the model no longer fails catastrophically on my benchmark (a subset of OpenML CC-18 with a 50/50 train/test split and 10-fold cross-validation), but it's still not competitive. I assume making it perform well would at least require running the greedy algorithm again on an expanded benchmark. Let me check the paper for details.
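In case it helps anyone reproduce a rough version of this, the harness is essentially the following (a sketch; the dataset names, seed, and scoring below are placeholders rather than my exact benchmark setup):

```python
# Rough sketch: 50/50 train/test split, then score flaml's zero-shot default
# LGBMClassifier on the held-out half. The 10-fold CV is omitted here for brevity.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from flaml.default import LGBMClassifier

for name in ["vehicle", "mfeat-factors"]:  # placeholder subset of CC-18 multiclass datasets
    X, y = fetch_openml(name, version=1, as_frame=True, return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.5, stratify=y, random_state=0
    )
    clf = LGBMClassifier().fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))
```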

@amueller

OK, looks like the portfolio building is mostly the same as in my work and in autosklearn 2.0, apart from some minor differences and the use of meta-features instead of just iterating through configurations. I'm somewhat surprised by how well the meta-feature-based zero-shot works, tbh. Very cool!
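To make the meta-feature based lookup concrete (my paraphrase of the idea, not FLAML's internals): a new dataset's meta-features are matched against stored entries, and the configuration attached to the nearest entry is used zero-shot. Everything below is made up for illustration.

```python
# Toy zero-shot lookup: pick the stored entry whose (normalized) meta-features are
# closest to the new dataset. The entries, scaling, and hyperparameter values here
# are invented for illustration; FLAML's real config files are structured differently.
import numpy as np

STORED = {
    # name -> (meta-feature vector [rows, features, classes], suggested hyperparameters)
    "large_tabular": (np.array([100_000, 50, 10]), {"n_estimators": 2000, "num_leaves": 100}),
    "small_iris_like": (np.array([75, 4, 3]), {"n_estimators": 100, "num_leaves": 7}),
}
SCALE = np.array([100_000, 100, 10])  # assumed normalization constants


def suggest_config(meta: np.ndarray) -> dict:
    nearest = min(STORED, key=lambda k: np.linalg.norm((meta - STORED[k][0]) / SCALE))
    return STORED[nearest][1]


print(suggest_config(np.array([75, 4, 3])))  # -> the "small_iris_like" configuration
```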

@levscaut (Contributor, Author)

Is the code for the portfolio mining process still in the repo? I can help with the experiment if we have more specific information, like the experiment code or which extra datasets to include.

@sonichi (Collaborator) commented Oct 26, 2023

@amueller

The list of datasets used for the original work is in the paper. I'm not 100% sure how datasets were selected for the AutoML benchmark vs. CC-18 (which is what I'm using). Also, I'm currently using a somewhat non-standard splitting strategy that splits the data 50/50.
It might be interesting to also include subsampled versions of the datasets in the initial pool to broaden the space of datasets that are explored, but that very much feels like a research question that goes beyond a simple PR.

@levscaut (Contributor, Author)

Agreed on that; running over the whole benchmark is a bit too much work for this PR. It would be great if there were a simple performance test to check how this new point affects existing flaml default usage, and then we could decide whether to keep or discard this change.

@sonichi (Collaborator) commented Nov 1, 2023

> Agreed on that; running over the whole benchmark is a bit too much work for this PR. It would be great if there were a simple performance test to check how this new point affects existing flaml default usage, and then we could decide whether to keep or discard this change.

There are around 15 multi-class tasks in the benchmark, so it's manageable to run just default.lightgbm before and after. We can merge if the performance doesn't degrade. The performance likely wouldn't change, because the added dataset is not similar to them.

@levscaut (Contributor, Author) commented Nov 6, 2023

I've been digging into the zero-shot paper for experiment details. I selected all the multiclass tasks from the paper, followed the 10-fold evaluation using the default LGBM, and got the following results before and after adding this datapoint:

| dataset | mean_score_old | mean_score_new | duration (minutes) |
| --- | --- | --- | --- |
| car | 0.901042 | 0.901042 | 1.61618 |
| cnae-9 | 0.853704 | 0.853704 | 12.0663 |
| fabert | 0.712031 | 0.712031 | 76.7194 |
| mfeat-factors | 0.9695 | 0.9695 | 17.4963 |
| segment | 0.944589 | 0.944589 | 3.97898 |
| vehicle | 0.781373 | 0.781373 | 1.91733 |
| connect-4 | 0.649939 | 0.649939 | 7.5523 |
| Fashion-MNIST | 0.903129 | 0.903129 | 698.402 |
| Helena | 0.0657854 | 0.055831 | 251.748 |
| Jannis | 0.71568 | 0.71568 | 39.1677 |
| jungle_chess_2pcs_raw_endgame_complete | 0.679398 | 0.679398 | 3.86058 |
| Shuttle | 0.885672 | 0.828362 | 8.92685 |
| Volkert | 0.688921 | 0.688921 | 142.019 |
| Covertype | 0.609783 | 0.609783 | 100.821 |
| Dionis | 0.17365 | 0.172989 | 4770.91 |
| dilbert | 0.9894 | 0.9894 | 735.268 |
| Robert | 0.5194 | 0.5194 | 7950.23 |

Unfortunately, it appears that this particular datapoint does slightly diminish performance on a subset of datasets (Helena, Shuttle, and Dionis). I will investigate whether I can adjust the datapoint to prevent any negative effects on current tasks.

@amueller commented Nov 6, 2023

I'm not sure that tweaks based on such a small number of tasks will be very robust. You don't have any other datasets to confirm that additional changes generalize, right? So I suspect you're likely to overfit to the three tasks you just identified.
