Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NaN in SLO dashboard #531

Open
jenniejia opened this issue Jan 23, 2024 · 5 comments
Open

NaN in SLO dashboard #531

jenniejia opened this issue Jan 23, 2024 · 5 comments

Comments

@jenniejia
Copy link

Hello Guys,

We just integrated Sloth SLO Spec to our systems. We used two SLO dashboard from here https://sloth.dev/introduction/dashboards/

Run some traffic and Observed the SLO dashboard, see this "NaN" issue, this is why I am here.. it is Year 2024.. By reading some dev thread. Do we have better solution instead of doing >0 or vector(1) things to avoid NaNs in the first place. But how to fix this PromQL query? SLI window is 5 min

NaN_issue

Thanks,
Jennie

@jenniejia
Copy link
Author

Actually I just edit the PromQL to like this
slo:period_error_budget_remaining:ratio{sloth_service="${service}", sloth_slo="${slo}"} >0 or on() vector(1)

It seems fixed the issue.. am I correct here?

NaN_fix

@tokheim
Copy link

tokheim commented Jan 23, 2024

So root issue is probably that with zero traffic errorQuery/totalQuery evaluates to NaN especially for the short 5min window size. Unless you use the feature in #241 slo:period_error_budget_remaining:ratio is just the average of the 5 min windows over 30 days, so as soon as the underlying metric records a NaN value the error budget will evaluate to NaN for the next 30 days even if underlying query is fixed.

I'd recommend you try to fix the underlying total query to never evaluate to 0, possibly with >0 or on() vector(1)-logic. These are the metrics that matter for alerting, and I suspect the dashboard will report 100% if value was NaN at any point in time, no matter what budget was actually spent.

Reason why this needs to be handled in the slo query is that how these edge cases should be handled depends on the rest of the query. Still you could argue for derived metrics like slo:period_error_budget_remaining:ratio always should ignore NaN values to make the problem less visible.

@jenniejia
Copy link
Author

Thank you! But I did a try, it did not work.. I do login to the Prometheus UI to check the recording rule..
One thing I am not clear the following two statement which is correct? since I read previous issue #231

>0 or on() vector(1)

>0 or vector(1)

Here is my updated Sloth Spec, I update the "total_query" to add >0 or vector(1)

sli:
events:
error_query: sum(rate(nginx_requests{path="/auth",service="myservice",status=~"(5..|499|409)"}[{{.window}}]))
total_query: (sum(rate(nginx_requests{path="/auth",service="myservice"}[{{.window}}])) >0) or vector(1)

@tokheim
Copy link

tokheim commented Jan 24, 2024

Your updated Sloth spec looks correct, though I don't have access to a prometheus server to test the query at the moment. Dashboard might still show NaN until window period is over. You could use prometheus admin endpoints to clear metric history, rename or relabel the slo, or wait it out.

You can read about on() here. An expression like (up{instance="x"} > 0) or vector(1) would normally produce two timeseries due to the label mismatch, while (up{instance="x"} > 0) or on() vector(1) produces a single time series like you'd want in a SLO. Since you do a sum(...) your expression doesn't have a label so either should work, but using on() is maybe the safer choice.

@jenniejia
Copy link
Author

Thank you! after update the total_query and rename the SLO. I got the negative value . i.e. -448%, can you explain to me why and if it make senses?
Here is my slo query
total_query: (sum(rate(nginx_requests{path="/auth",service="myservice"}[{{.window}}])) >0) or on() vector(1)
attached screenshot
negative_slo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants