Uncertainty and Prediction Intervals in LightGBM #6382

Open · tiagoleonmelo opened this issue Mar 25, 2024 · 15 comments

tiagoleonmelo commented Mar 25, 2024

Summary

I have been looking into possible approaches for obtaining a prediction interval for a LightGBM model. The model is already trained, and it is not a quantile regressor, since its purpose is binary classification. I would like a prediction interval with a strong statistical guarantee (for example, 95% confidence) that the predicted probability of label=1 lies within it. For example, p(y_0=1 | x=x_0) = 0.74, and the 95% CI goes from 0.68 to 0.90.

To do this, I've run into a few resources. There's MAPIE, a model-agnostic library implementing parametric and non-parametric approaches to obtain prediction intervals and sets; there are also bootstrapping-based approaches, as proposed in this paper and explored in this blog.

I have just now run into this discussion here in the LightGBM repository. I was wondering if anyone knows a bit more about this. I had also thought about something along those lines: since each tree in the GBM will be a model trained on a bootstrap, if we only use some of the trees, we are essentially obtaining a model trained on a bootstrap of the sample. So my plan to get a prediction interval for a single instance y0 would look something like this:

import numpy as np

preds = []
for _ in range(N):
    # shuffle the order of the trees, then predict with only the first
    # alpha * total_iterations of them, giving one "pseudo-model" prediction
    pred = gbm.booster_.shuffle_models().predict(y0, num_iteration=int(alpha * total_iterations))
    preds.append(pred)

lower_bound = np.quantile(preds, q=0.025)
upper_bound = np.quantile(preds, q=0.975)

Would this be a good way of using the shuffle_models() method?

mayer79 (Contributor) commented Mar 31, 2024

In the question, you use both terms "prediction interval" and "confidence interval". These are completely different.

  • A prediction interval is a statement about the random variable $Y | X = x$. Slightly optimistic prediction intervals can be obtained via quantile regression. As you mentioned, for binary $Y$, it is almost always $[0, 1]$, that is, completely useless. So: yes, we can get approximate prediction intervals, but they do not make sense for binary responses.
  • A confidence interval is a statement about a distributional property, often $E(Y | X = x)$. You can get an approximate confidence interval for the expected response, e.g., via Bootstrapping.

tiagoleonmelo (Author) commented

Right, in my case I would be trying to replicate a prediction interval despite being in a binary classification context, not a confidence interval in the "traditional" sense (e.g. model accuracy is between 0.8 and 0.9 with 95% confidence).

From what I read, I gather that prediction sets would be pretty much useless (since, like you say, they are almost always [0, 1]). However, it doesn't seem unreasonable to me to create a prediction interval around the predicted probability of y = 1. I imagine some samples are easier to classify than others, and though this may be communicated through a lower/higher probability, having an upper and lower boundary for the prediction is much more informative (not to mention a distribution of predictions).

To reiterate the idea I proposed: my hypothesis is that if we bootstrap the models being used, we are to some extent getting a confidence interval, but only for a single sample, if that makes sense, which we could then interpret as a prediction interval. Since each weak learner in a GBM is trained on a bootstrap of the dataset, when we randomly sample (with replacement) the weak learners used to make a prediction, we should be "close enough" to the real Bootstrapping method.

I also found that when alpha < 1, the upper and lower bounds I generate sometimes end up not including the value predicted by the "real" model. I figure that in those cases all models are required to generate that value.

Boiling down:

  • Is model sampling a viable way to compute a CI?
  • Can CIs be computed over a single sample, and if so, are they a good proxy for prediction intervals?

mayer79 (Contributor) commented Apr 2, 2024

CI and PI are well-defined concepts. It does not make sense to redefine them or to call a CI a PI.

Maybe you are indeed looking for a CI for the parameter $E(Y | X = x)$. You can get such a CI via Bootstrapping. In probabilistic classification, this parameter equals $P(Y = 1 | X = x)$, and you would read a corresponding CI as:

"We are 95% confident that the true probability of having a "1" for observations with feature values $x$ is between 0.11 and 0.14.

The interval depends on the feature values $x$, so it is kind of individual. But it is not individual in the sense that you have a single $Y$ and then say: $Y$ is between 0.11 and 0.14. ($Y$ is either 0 or 1, so it can't be between 0.11 and 0.14.)

Edit: How to find an approximate 95% CI for $P(Y = 1 | X = x)$ via percentile Bootstrap?

  1. Refit your model $B$ times on Bootstrap resamples of the training data. This gives you a set of $B$ models. $B$ should be as large as possible, e.g., 999.
  2. Whenever you want to calculate a 95% confidence interval for $P(Y = 1 | X = x)$, you get the $B$ predictions of $x$ and calculate their empirical 2.5% and 97.5% quantiles. They form a very approximate 95% percentile Bootstrap CI for the parameter of interest.

This is just a standard percentile Bootstrap; nothing special about Boosted Trees. What matters is that one reads the result correctly, i.e., "I am about 95% confident that the true probability of having a "1" for such observations is between a and b."
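
A rough sketch of what these two steps could look like with LightGBM's scikit-learn interface (the helper function, its name, and its parameters are illustrative assumptions, not part of any library):

import numpy as np
import lightgbm as lgb

def percentile_bootstrap_ci(X_train, y_train, X_new, B=999, level=0.95, params=None):
    # Step 1: refit the model B times on bootstrap resamples of the training rows
    # (X_train and y_train are assumed to be NumPy arrays here).
    rng = np.random.default_rng(0)
    n = len(y_train)
    preds = np.empty((B, X_new.shape[0]))
    for b in range(B):
        idx = rng.choice(n, size=n, replace=True)
        model = lgb.LGBMClassifier(**(params or {}))
        model.fit(X_train[idx], y_train[idx])
        preds[b] = model.predict_proba(X_new)[:, 1]   # P(Y = 1 | X = x) per replicate
    # Step 2: empirical 2.5% and 97.5% quantiles of the B predictions
    lo, hi = np.quantile(preds, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return lo, hi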

tiagoleonmelo (Author) commented Apr 3, 2024

Thank you very much for the comment and clarification. Indeed, I was mixing up CIs and PIs, and a CI is exactly what I am looking for in this case: some interval that reads as you put it:

"We are 95% confident that the true probability of having a "1" for observations with feature values x is between 0.11 and 0.14."

Regarding the second bit: thanks for the explanation as well. It sounds reasonable to me, and as you say, there is nothing about that method that makes it inherently about Boosted Trees. However, if we know we are working with Boosted Trees, would it be valid to modify the method so that we don't have to refit B models, and instead generate those models "on the fly" by bootstrap-sampling the weak learners of a GBM B times?

Edit:
In a way I suppose this boils down to: for the purpose of CI estimation, are B models trained on bootstrapped samples of a dataset D equivalent to B models "ensembled" from N models trained on bootstraps of D?
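
As a sketch of what "generating the models on the fly" could mean in code, reusing shuffle_models() from the first post (the function name, frac, and the other parameters are illustrative assumptions):

import numpy as np

def weak_learner_bootstrap_ci(gbm, X_new, n_rep=1000, frac=0.8, level=0.95):
    # gbm is a fitted lightgbm.LGBMClassifier; each repetition shuffles the order
    # of its trees and predicts with only a fraction of them, standing in for a
    # model refit on a resample of the data.
    booster = gbm.booster_
    n_iter = booster.current_iteration()
    preds = np.empty((n_rep, X_new.shape[0]))
    for b in range(n_rep):
        booster.shuffle_models()
        preds[b] = booster.predict(X_new, num_iteration=int(frac * n_iter))
    lo, hi = np.quantile(preds, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return lo, hi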

mayer79 (Contributor) commented Apr 3, 2024

These are interesting ideas. It would be great to have a simulation study investigating the coverage probability of such intervals! Typical questions would be:

  • How close is the actual coverage to the nominal coverage of, say, 95%?
  • What is the effect of row subsampling?

Intuitively, it is not so clear how well the approach works. But if it works, it would be fantastic.
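
A minimal sketch of such a simulation, where the true conditional probability is known by construction so the actual coverage can be measured; the data-generating process and the reuse of the percentile_bootstrap_ci sketch from the earlier comment are illustrative assumptions.

import numpy as np
from scipy.special import expit

rng = np.random.default_rng(42)

# Simulated data where P(Y = 1 | X = x) is known exactly.
n_train, n_test, p = 5000, 200, 5
X = rng.normal(size=(n_train + n_test, p))
true_prob = expit(X[:, 0] - 0.5 * X[:, 1])
y = rng.binomial(1, true_prob)
X_train, y_train = X[:n_train], y[:n_train]
X_test, true_prob_test = X[n_train:], true_prob[n_train:]

# Build 95% intervals for the test points with the method under study,
# e.g. the percentile-bootstrap sketch from the earlier comment.
lo, hi = percentile_bootstrap_ci(X_train, y_train, X_test, B=199)

# Actual coverage: fraction of test points whose interval contains the truth.
coverage = np.mean((lo <= true_prob_test) & (true_prob_test <= hi))
print(f"empirical coverage at nominal 95%: {coverage:.3f}")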

tiagoleonmelo (Author) commented Apr 4, 2024

Thank you for the discussion and your insights! I don't think I have the time/resources to formally conduct such a study, but will most definitely give the method a try.

If I'm thinking about this correctly, however, it's not very easy to compare actual and nominal coverages in binary classification. For example, consider $P(Y = 1 | X = x_0) = 0.13$ with $CI_{95}(x_0) = [0.09, 0.20]$ and $y_0 = 1$. We cannot know if our CI spans the actual probability, because we don't have that probability.

To get around this, I have an idea I would appreciate your input on: we use a toy dataset whose label is a real number between 0 and 1. We set some threshold on this real number to create a binary target and then remove the original real-valued target. The hypothesis is that the model will have to learn a probability distribution similar to the label we have just removed. The problem is finding such a dataset, since regression targets usually have a much larger domain, I think.

Maybe there is some other way through model calibration, something that deals with the frequency of each label? Though that avenue isn't as clear to me.

Ultimately, I suppose we can generate the "official" confidence intervals by training $B$ separate models, and compare the generated CIs with our experimental CIs.

tiagoleonmelo (Author) commented

Some quick and dirty results for a 95% CI, comparing the two methods with B = 10000 on a public classification dataset.

[image: plot comparing the 95% CIs from the refit-bootstrap and weak-learner-bootstrap methods]

I'm somewhat happy with this plot. The main issue with the proposed method seems to be that the generated intervals are too broad, especially at the ends of the spectrum (i.e. when the model is more sure of its prediction). I suppose now we would just need some way of performing the sanity check of comparing the nominal with the actual coverage within a binary classification context.

mayer79 (Contributor) commented Apr 11, 2024

@tiagoleonmelo Thanks for the additional thoughts.

"Ultimately, I suppose we can generate the "official" confidence intervals by training $B$ separate models, and compare the generated CIs with our experimental CIs."

Interesting idea! I think your plot does exactly this, right? Is the model itself using row subsampling? (I guess: yes). Can you do me a favor and add a second plot, now on logit scale?

I will write more, but need a little bit of time.

tiagoleonmelo (Author) commented Apr 12, 2024

Yes, the plot is exactly that and yes, the models are built by sampling the rows, not the columns. Here is the same plot, but now on the logit scale (i.e. using the raw model scores instead of their predicted probabilities).

[image: the same comparison plot, now on the logit scale]

This is a much better visualisation of the differences of the intervals. I'll make sure to keep this in mind in the work ahead, thanks!
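
For reference, a small sketch of two equivalent ways to get the logit-scale values for a fitted binary LGBMClassifier (gbm and X_test are illustrative names): the raw margin LightGBM exposes directly, or the logit transform of the predicted probability.

from scipy.special import logit

raw_scores = gbm.booster_.predict(X_test, raw_score=True)   # raw margin (logit scale)
same_thing = logit(gbm.predict_proba(X_test)[:, 1])         # logit of predicted probability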

In the meantime, I have done quite a bit of reading on this topic (and encountered a few roadblocks along the way 😄). From what I gathered, uncertainty in predictions within a binary classification domain is still somewhat of a blind spot when it comes to confidence quantification. Moreover, it seems that it is impossible to validate confidence intervals in frequentist approaches, for the reasons I mentioned above. For this reason, I will need an analytical approach that can provide these strong guarantees and replace the traditional Monte-Carlo hits/misses that I was planning on computing.

I ran into this paper recently and I am really enjoying reading it. It touches on many points that are relevant to this discussion. My main take-aways from their discussion so far are: 1. Venn-Abers can be used to establish probability intervals (yet another term to mix up 😛) and 2. it only makes sense to discuss intervals after a model has been calibrated.

Keeping these points in mind, my current plan of action after finishing the paper is to study the intervals generated through Venn-Abers without calibrating the model, and then compare these intervals with the intervals we get from real bootstrapping and weak-learner bootstrapping. Regarding the calibration point, I am not sure how I will factor it into the experiments. Ideally all models should be calibrated, but the solution I am looking for does not require re-training any models, so I will be comparing and establishing intervals for uncalibrated models, even if that risks less sensible intervals.

tiagoleonmelo (Author) commented

It's not possible to compare the Inductive Venn-Abers Predictor intervals with the bootstrap-based ones, since one accounts for model calibration and the other doesn't. That being said, I'm backtracking and now attempting to further validate the real bootstrap method with 10000 models. Do you know of any resources that showcase validity guarantees for this family of methods?

For instance, in this case it's impossible to compute empirical coverage - how can we know that the bootstrap intervals are indeed correct 95% of the time?

mayer79 (Contributor) commented Apr 17, 2024

I am not 100% certain. Maybe you could start with a known, deterministic model.

tiagoleonmelo (Author) commented

Not sure what you mean... even with deterministic models we would have the same issue, right? We could train 1000 Decision Trees that classify each sample as either 0 or 1, but we would still suffer the issue of not knowing whether their corresponding 95% percentile bootstrap includes the true label.

mayer79 (Contributor) commented Apr 17, 2024

I am only talking about confidence intervals, as prediction intervals are completely irrelevant in classification. With CIs, you would calculate the proportion of intervals that cover the true conditional probability for different feature values.

tiagoleonmelo (Author) commented

What's the true conditional probability in this case?

francescomandruvs commented

@tiagoleonmelo Hey, did you try something else? I find this topic very interesting and I'm also trying to get a CI for my binary classification model. I have done a calibration plot and my current model is well calibrated; however, I would still like to have these confidence intervals. I tried a Venn-Abers predictor, but the lower/upper bound probabilities are so close to the predicted one that the CIs are almost useless (I have probably made some mistakes).
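
For anyone wanting to reproduce such a calibration check, a minimal sketch using scikit-learn's calibration_curve (model, X_valid, and y_valid are illustrative names for a fitted classifier and held-out data):

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

prob_pos = model.predict_proba(X_valid)[:, 1]
frac_pos, mean_pred = calibration_curve(y_valid, prob_pos, n_bins=10)

plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("mean predicted probability")
plt.ylabel("observed frequency of y = 1")
plt.legend()
plt.show()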
