ENH: maybe, S- and M-estimator for multivariate mean, scatter for large k_vars #9244
Comments
thinking a bit more: All the standard theory applies. We just use a different norm than the base norm or, equivalently, a different, transformed random variable in the base norm. Aside: I haven't checked. Does the Rocke norm in his parameterization satisfy the assumptions on an M-scale rho?
This is close to what Rocke is doing, in the limit as k -> inf.

For large k: the mean of the chi distribution is approximately sqrt(k - 0.5) and var = 0.5.
However, in smaller samples the center M shifts with the width when using a predefined fixed asymptotic rejection point M+c.

One problem: That mean is for maha given the normal distribution. If the distribution is not normal, then the mean will differ and the loss function is not concentrated around the mean/central observations.
My guess is that if we rescale by the median of maha to match the median of chi/chi2, then we recenter the maha to the appropriate part of the loss/weight function.

AFAIU, for the usual norms, we get good results for elliptical distributions; only the scale is inconsistent when computing an estimate for the covariance or variances if we don't have a normal distribution. (Also, the breakdown point is calibrated for the normal distribution and will be different for non-normal distributions.)

Possibly: There is still the problem of the extra computation for the additional scaling.
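The large-k approximation for the chi distribution quoted above can be checked numerically. This is only a quick sketch using the standard library; the helper names `chi_mean` and `chi_var` are made up for illustration and are not statsmodels functions.

```python
import math


def chi_mean(k):
    """Exact mean of the chi distribution with k degrees of freedom."""
    # E[chi_k] = sqrt(2) * Gamma((k + 1) / 2) / Gamma(k / 2), computed in logs
    return math.sqrt(2.0) * math.exp(math.lgamma((k + 1) / 2) - math.lgamma(k / 2))


def chi_var(k):
    """Exact variance of chi_k: E[chi_k^2] - E[chi_k]^2 = k - mean^2."""
    return k - chi_mean(k) ** 2


# Compare the exact moments with the approximations sqrt(k - 0.5) and 0.5
for k in (2, 5, 10, 20, 50, 200):
    print(k, chi_mean(k), math.sqrt(k - 0.5), chi_var(k))
```

The approximation of the mean is already close for moderate k, while the variance approaches 0.5 from below.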
trying out just shifting d in the tukeybiweight norm by subtracting the median of the chi distribution.

I guess just implementing the translated biweight (tbiweight) and biflat in Rocke will be more straightforward.

Aside: The main problem is that the biweight S- and MM-estimators which we have now are not good at k_vars greater than 10 or 15.
(The results in Rocke are for a fixed breakdown point as k increases. We could also limit efficiency for larger k_vars.)
(This is just an idea, I have not seen it in the literature, but I did not look at the high-dimension robust cov literature)
Problem
As k_vars increases, the efficiency of CovS increases, but so does the bias.
According to Rocke, this is because the distribution of the mahalanobis distance concentrates around its mean and we reject too few points.
(or something like that. I did not go through the details.)
To avoid this problem with standard norms like Bisquare, he proposes a new norm that rejects on both sides (low and high values) and is adjusted to changing k_vars.
Idea
We use chisquare as reference distribution for the mahalanobis distances.
Instead of using the mahalanobis distances we use maha distances that are scaled and demeaned so the usual norm would reject on both sides.
https://en.wikipedia.org/wiki/Chi-squared_distribution#Related_distributions
as k->inf: $(\chi_k^2 - k)/\sqrt{2k} \xrightarrow{d} N(0,1)$
and
If $X \sim \chi_\nu^2$ and $c > 0$, then $cX \sim \Gamma(k=\nu/2, \theta=2c)$ (gamma distribution)
see also https://modelassist.epixanalytics.com/space/EA/26575265/Normal+approximation+to+the+Chi+Squared+distribution
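One rough way to judge how quickly the normal limit becomes usable is to look at skewness, which is zero for the normal distribution. The sketch below compares the skewness of the chi-square distribution, sqrt(8/k), with that of the chi distribution (its square root), using the known closed-form moments. The function names are hypothetical and only the standard library is used.

```python
import math


def chi_moments(k):
    """Mean, variance and skewness of the chi distribution with k df."""
    mean = math.sqrt(2.0) * math.exp(math.lgamma((k + 1) / 2) - math.lgamma(k / 2))
    var = k - mean ** 2
    # Closed form for the chi distribution: skew = mean / sd^3 * (1 - 2 * var)
    skew = mean / var ** 1.5 * (1.0 - 2.0 * var)
    return mean, var, skew


def chi2_skew(k):
    """Skewness of the chi-square distribution with k df."""
    return math.sqrt(8.0 / k)


# Smaller skewness suggests a better normal approximation at the same k
for k in (2, 5, 10, 20, 50):
    print(k, chi2_skew(k), chi_moments(k)[2])
```

Skewness is only one criterion. A comparison of tail probabilities would be needed before relying on either approximation for calibrating rejection points.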
So we could define a new norm, rho, that adjusts to k_vars directly
rho(d2) = rho_base(c * d2 - m)
or maybe better use z-scoring for chi distribution for d = sqrt(d2) instead of chisquare. The approximation should be better.
That is, expectations for breakdown point and efficiency are computed using the chi distribution with loc, scale != (0, 1).
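A minimal sketch of the z-scored variant, assuming the large-k approximations loc = sqrt(k_vars - 0.5) and scale = sqrt(0.5) for the chi distribution and a Tukey biweight as base norm. The names `biweight_weight` and `zscored_maha_weight` are hypothetical, and the tuning constant c = 4.685 is just the usual regression default, not calibrated for any breakdown point or efficiency here.

```python
import math


def biweight_weight(z, c=4.685):
    """Tukey biweight weight: (1 - (z/c)^2)^2 for |z| <= c, else 0."""
    u = z / c
    return (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def zscored_maha_weight(d, k_vars, c=4.685):
    """Weight for a mahalanobis distance d after z-scoring with chi moments.

    Assumption: loc and scale are the large-k approximations for the chi
    distribution, not exact moments and not robustly rescaled.
    """
    loc = math.sqrt(k_vars - 0.5)
    scale = math.sqrt(0.5)
    return biweight_weight((d - loc) / scale, c)


# Distances far below or far above the center both get weight zero
k_vars = 50
for d in (3.0, 5.0, 7.0, 9.0, 11.0):
    print(d, zscored_maha_weight(d, k_vars))
```

With k_vars = 50 the center is near 7.04, so d = 3 and d = 11 are both rejected while d = 7 gets a weight close to one. Whether this two-sided rejection gives useful breakdown and efficiency properties is exactly what would need to be verified.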
standard recommendation is CovMM for k_vars <= 15, or maybe <= 10
Somewhere (Maronna article to compare robust cov?) it is recommended to use OGK for large k_vars, and Rocke norm for intermediate k_vars.
The above "zscored" maha might be an alternative to OGK for large k_vars for M and DetS.
Needs to be verified.