
Is there any possibility that an ordinary supervised model performs better than an outlier algorithm in this task? #12

Open
Minqi824 opened this issue Jul 13, 2020 · 5 comments


@Minqi824

I have tried some datasets from the Outlier Detection DataSets (ODDS) website, like the Annthyroid dataset (http://odds.cs.stonybrook.edu/annthyroid-dataset/).

However, when I compare some ordinary supervised models (e.g., SVM and Random Forest), the results indicate that SVM and RF are much better than anomaly detection algorithms like OC-SVM and Isolation Forest.

I was wondering about the reason for these strange results, because theoretically the outlier detection algorithms should perform better on an outlier detection task. Could anyone help me figure out this problem? Thanks!
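
For concreteness, here is a minimal sketch of the kind of comparison described above, using scikit-learn. Since the ODDS files ship as .mat files, a synthetic imbalanced dataset (roughly matching Annthyroid's 7.42% positive rate) stands in for the real data; the dataset, hyperparameters, and the choice of ROC-AUC are my assumptions, not something specified in this thread.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, OneClassSVM

# Synthetic stand-in for Annthyroid: ~7% positive (outlier) class.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.93],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Supervised models: trained with the ground-truth labels.
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
svm = SVC(probability=True, random_state=0).fit(X_tr, y_tr)

# Unsupervised detectors: never see y during fitting.
iforest = IsolationForest(random_state=0).fit(X_tr)
ocsvm = OneClassSVM(nu=0.07).fit(X_tr)  # nu ~ expected outlier fraction

print("RF    AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
print("SVM   AUC:", roc_auc_score(y_te, svm.predict_proba(X_te)[:, 1]))
# score_samples is "higher = more normal", so negate it for outlier scores.
print("IF    AUC:", roc_auc_score(y_te, -iforest.score_samples(X_te)))
print("OCSVM AUC:", roc_auc_score(y_te, -ocsvm.score_samples(X_te)))
```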

@odb9402

odb9402 commented Jul 13, 2020

[Sorry for my bad English skills]
The way I see it, the difference between ordinary classification (on imbalanced data) and outlier detection is whether the problem is supervised or unsupervised.

As far as I know, OC-SVM performs outlier detection without known anomalies. Even though the ODDS data tells you which samples are abnormal, in a real problem we usually do not know which samples are abnormal. If we do not know which data are anomalous, SVM and RF cannot even be used.

If the exact labels of the anomalies are given, the high performance of SVM looks reasonable to me.
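
To make that supervision difference concrete: an OC-SVM is fit on the features alone, while SVC or RandomForest cannot be fit at all without y. A tiny sketch with placeholder arrays (names, shapes, and the 7% rate are illustrative only):

```python
import numpy as np
from sklearn.svm import SVC, OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))             # unlabeled feature matrix (placeholder)
y = (rng.random(500) < 0.07).astype(int)  # labels we often do NOT have in practice

ocsvm = OneClassSVM(nu=0.07).fit(X)  # fits on features alone, no labels needed
clf = SVC().fit(X, y)                # a supervised model cannot be fit without y
```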

@Minqi824
Author

Thanks for your great answer!
I agree with your point, since many anomaly detection tasks may have no labels at all (or labeling may be very costly). This may be one of the reasons why we usually compare supervised learning models against other supervised learning models, and anomaly detection algorithms against other anomaly detection algorithms.

Another confusion is why these supervised algorithms (like SVM and RF) perform well even on a highly imbalanced dataset (e.g., the Annthyroid dataset in ODDS, which contains 7.42% positive samples). Intuitively speaking, an ordinary classification model might classify all samples into the majority class (negative samples) and fail to detect the anomalous samples, but the empirical results indicate that my intuition may be wrong.
Could you please explain the above problems, or even try some models on the Annthyroid dataset? Thanks a lot!
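
A quick way to probe this confusion is to compare a majority-class baseline against a real classifier on imbalance-aware metrics: the baseline gets high accuracy but zero F1, which is exactly the failure mode described. This sketch uses synthetic data at roughly Annthyroid's positive rate; the data and parameters are assumptions, not results from the thread.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic data with ~7% positives, loosely mimicking Annthyroid (assumption).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.93],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# "Predict everything as the majority class" baseline vs. a real classifier.
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

for name, model in [("all-negative baseline", dummy), ("random forest", rf)]:
    pred = model.predict(X_te)
    print(f"{name:22s} accuracy={accuracy_score(y_te, pred):.3f} "
          f"F1={f1_score(y_te, pred, zero_division=0):.3f}")
```

If the minority class remains separable in feature space, a discriminative model can still learn its boundary from the few positive examples despite the skewed prior, so F1 stays well above the trivial baseline even though plain accuracy barely distinguishes the two.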

@Minqi824
Author

Actually, I have tried most of the datasets in ODDS (http://odds.cs.stonybrook.edu/annthyroid-dataset/) and uploaded the results to my GitHub repository (https://github.com/jmq19950824/Anomaly-Detection/blob/master/ODDs.ipynb).

The results indicate that even a binary classification algorithm (SVM here) can solve the anomaly detection task well. Can anyone explain this result?

@yzhao062
Owner

As a rule of thumb: if you have labels, using supervised models is preferred, even for anomaly detection.
See Charu Aggarwal, Outlier Analysis, Second Edition, page 26.

@Minqi824
Author

@yzhao062, great answer, thanks a lot. I notice that there is a sentence there: "Supervised outlier detection is a (difficult) special case of the classification problem. The main characteristic of this problem is that the labels are extremely unbalanced in terms of relative presence. Since anomalies are far less common than normal points, it is possible for off-the-shelf classifiers to predict all test points as normal points and still achieve excellent accuracy."

I tried some supervised models (like Random Forest) on some extremely unbalanced datasets, such as the Credit Card Fraud Detection (CCFD) dataset on Kaggle (https://www.kaggle.com/mlg-ulb/creditcardfraud), where the positive samples make up only 0.172% of the whole dataset (i.e., it is extremely unbalanced).
However, Random Forest still performs well on this dataset (about 0.7-0.8 F1-score). Could you please explain these results? Thanks!
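
For anyone who wants to check that number, here is a minimal sketch of the experiment, assuming creditcard.csv from the Kaggle link has been downloaded locally (the file path and split settings are my assumptions; the dataset's 'Class' column marks fraud as 1):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Path is an assumption: download creditcard.csv from the Kaggle link first.
df = pd.read_csv("creditcard.csv")  # 'Class' column: 1 = fraud, 0 = normal
X, y = df.drop(columns=["Class"]), df["Class"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3,
                                          random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, rf.predict(X_te), digits=3))
```

classification_report prints per-class precision and recall, so the fraud-class F1 can be read directly rather than being masked by the roughly 99.8% accuracy a trivial all-negative predictor would get on this dataset.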
