Implementation of LODA (Lightweight On-line Detection of Anomalies) #1342

hoanganhngo610 · 2023-10-20T08:51:13Z

LODA (Lightweight On-line Detection of Anomalies) is a popular static/incremental anomaly detection algorithm. LODA is implemented in various frameworks for anomaly detection, including PyOD and PySAD.

Unlike various other algorithms, this implementation of LODA relies primarily on numpy, since the function np.histogram plays an important part in the learning phase and calculation of anomaly score. This implementation is also an adaptation of previous implementations of PyOD and PySAD to River.

hoanganhngo610 · 2023-10-20T08:53:15Z

@MaxHalford Would you mind having a look at this PR, and also on why River build is failing for 3.11? Thank you so much in advance!

MaxHalford · 2023-10-20T10:05:29Z

Hey @hoanganhngo610!

Good job on taking the initiative to implement this.

The main feedback I have is that we should avoid using numpy for new implementations. As you know, we prefer working exclusively with dicts. I know you want to use numpy's histogram: we have sketch.Histogram which may do the trick. Could you take a look at using that instead?

Regarding CI, I'm aware we have an issue. It's on my todo list :)

hoanganhngo610 · 2023-10-21T08:24:43Z

Hey @MaxHalford ,

Thank you so much for your response, and also thank you so much for having a look at my PR so promptly.

Regarding the implementation using dictionaries and particularly sketch.Histogram, actually that was the first idea coming to my mind. However, upon comparing the results returned by np.Histogram versus, sketch.Histogram, unfortunately, the results are significantly different.

For example, we can take a look at the example provided in River with sketch.Histogram itself:

from river import sketch
import numpy as np
np.random.seed(42)
values = np.hstack((
    np.random.normal(-3, 1, 1000),
    np.random.normal(3, 1, 1000),
))
hist = sketch.Histogram(max_bins=10)
for x in values:
    hist = hist.update(x)
for bin in hist:
    print(bin)

which will return

[-6.19837, -5.84386]: 4
[-5.52527, -4.29012]: 79
[-4.28605, -2.54958]: 583
[-2.54880, -1.36430]: 281
[-1.35756, 0.37807]: 57
[0.46562, 1.77854]: 110
[1.78993, 2.70345]: 253
[2.70739, 4.37628]: 544
[4.37893, 5.44241]: 85
[5.54414, 6.09777]: 4

However, with numpy, under

np.histogram(values, bins=10)

the result would be

(array([ 12, 205, 464, 277,  44,  35, 234, 446, 257,  26]),
 array([-6.19836796, -4.96875389, -3.73913983, -2.50952576, -1.2799117 ,
        -0.05029763,  1.17931644,  2.4089305 ,  3.63854457,  4.86815863,
         6.0977727 ]))

As such, to ensure that the results are consistent with other implementation in PyOD or PySAD, I decided to continue with the use of numpy instead of dictionary. Hope that this answer is satisfactory and can justify my implementation!

MaxHalford · 2023-10-21T08:44:28Z

@hoanganhngo610 I believe the difference comes from the fact the histogram definitions in both cases. The histogram in numpy has equal bin widths, while the one in River is adaptive. The thing is, equal bin widths is not difficult to implement online.

I've actually started #1344 to remove dict2numpy and numpy2dict. They were not actually used anymore. I really believe we should not allow ourselves to use numpy anymore. I get it, it's practical. But as you know there's a price to pay to initialize numpy arrays online. They're made to do computation on batch data, not online data.

hoanganhngo610 · 2023-10-21T09:20:57Z

@MaxHalford Thank you for clarifying that! In that case, I would proceed to modify this PR to use dictionaries andsketch.Histogram as discussed.

MaxHalford · 2023-10-21T09:37:04Z

Actually what I'm saying is that River's histogram is not based on equi-width bins. If you use it, it might not yield optimal performance. I suggest doing some speed/accuracy benchmarks with River vs. numpy. If it's more accurate with numpy's equi-width bins, then let's implement them with dictionaries, because it's trivial.

Implementation of LODA (Lightweight On-line Detection of Anomalies)

60521a4

hoanganhngo610 requested review from MaxHalford and smastelini as code owners October 20, 2023 08:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implementation of LODA (Lightweight On-line Detection of Anomalies) #1342

Implementation of LODA (Lightweight On-line Detection of Anomalies) #1342

hoanganhngo610 commented Oct 20, 2023

hoanganhngo610 commented Oct 20, 2023 •

edited

MaxHalford commented Oct 20, 2023

hoanganhngo610 commented Oct 21, 2023

MaxHalford commented Oct 21, 2023

hoanganhngo610 commented Oct 21, 2023

MaxHalford commented Oct 21, 2023

Implementation of LODA (Lightweight On-line Detection of Anomalies) #1342

Are you sure you want to change the base?

Implementation of LODA (Lightweight On-line Detection of Anomalies) #1342

Conversation

hoanganhngo610 commented Oct 20, 2023

hoanganhngo610 commented Oct 20, 2023 • edited

MaxHalford commented Oct 20, 2023

hoanganhngo610 commented Oct 21, 2023

MaxHalford commented Oct 21, 2023

hoanganhngo610 commented Oct 21, 2023

MaxHalford commented Oct 21, 2023

hoanganhngo610 commented Oct 20, 2023 •

edited