Fix data corruption in remote write if max_sample_age is applied #14078
base: main
Conversation
Unfortunately, I had no luck with writing a test that would cover the case yet. It is, however, possible to reproduce the bug with 100% "success":

Run Prometheus locally with the following config:

```yaml
global:
  scrape_interval: 5s
  external_labels:
    __replica__: prometheus-local-test-0
    cluster: local-test
    mimir_cluster_id: test-cluster

remote_write:
  - url: http://foo.bar # URL of some remote write endpoint with known IP so it can be blocked
    queue_config:
      max_shards: 1
      min_shards: 1
      batch_send_deadline: 5s
      capacity: 100
      sample_age_limit: 30s
      max_samples_per_send: 100
    metadata_config:
      send: true
      send_interval: 1m

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090
  - job_name: node-exporter # running with default configuration
    static_configs:
      - targets:
          - localhost:9100
```

While it is running, block the communication to the remote write endpoint IP. Wait until you see a `context deadline exceeded` log in Prometheus, then enable the traffic again. At this point the issue happens. If you run with a debugger inspecting the …
Mentioning @marctc and @tpaschalis, since you were the authors of the original code and have the most context; hope you don't mind.
Sorry, but for a bug that sounds as bad as what you're describing, we need a test case to ensure we've fixed the problem and that it doesn't happen again. I think the number of functions at play here, and the fact that some of them are closures within other functions, is obscuring the underlying problem and also making things hard to test.

My suggestion would be to add a test that uses a version of the test client that can fail and enforce a retry (a sketch of such a client is below), since you're saying this only happens when we need to retry sending an existing write request. Beyond that, we likely need to refactor some of the code path here so that it's easier to test and less likely to break again.
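A minimal sketch, in Go, of what such a failing client could look like. The `Store` signature and all names here are simplified, hypothetical stand-ins rather than Prometheus's actual `WriteClient` interface; the point is only to reject the first few attempts so the sender is forced down the retry path:

```go
package remote

import (
	"context"
	"errors"
	"sync"
)

// flakyClient simulates a remote write endpoint that rejects the first
// `failures` requests, forcing the sender to retry the same payload.
type flakyClient struct {
	mtx      sync.Mutex
	failures int      // how many initial Store calls should fail
	received [][]byte // successfully "stored" payloads, for later assertions
}

func (c *flakyClient) Store(ctx context.Context, req []byte) error {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	if c.failures > 0 {
		c.failures--
		return errors.New("simulated transient failure")
	}
	// Copy the payload: the caller may reuse its buffer for later sends,
	// which is exactly the kind of aliasing this bug is about.
	c.received = append(c.received, append([]byte(nil), req...))
	return nil
}
```

A test could then enqueue samples old enough to trigger the age filter, let the client fail once or twice, and compare what is eventually received against what was enqueued; with the in-place mutation bug, the two would differ.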
Hi @cstyan, thanks for the answer. Yes, I agree. We have been using this bug fix successfully in production, since we needed it ASAP. I'll ping you once I manage to reproduce it, if you don't mind.
I've pinged Paschalis and Marc internally as well; they're aware of the problem and are looking into it.
That's good to know; if it becomes pressing we can just move forward with your PR as is. Having a test is still ideal, though. Part of me feels that while the fix here works, it might not be the correct fix: it seems like what's happening is that we're modifying the underlying slice of buffered data but not returning a reference to the proper range of the modified slice after a few retries; roughly the hazard sketched below.
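For illustration, here is that hazard as a self-contained Go sketch, with plain integers standing in for `prompb.TimeSeries` and the predicate standing in for the sample-age check (none of this is the actual Prometheus code). An in-place filter compacts the keepers to the front of the slice, so a retry that re-reads the original buffer sees shuffled data:

```go
package main

import "fmt"

// filterInPlace compacts the kept values to the front of s, mutating
// its backing array, and returns the shortened prefix.
func filterInPlace(s []int, keep func(int) bool) []int {
	n := 0
	for _, v := range s {
		if keep(v) {
			s[n] = v
			n++
		}
	}
	return s[:n]
}

func main() {
	buf := []int{1, 2, 3, 4} // stands in for the buffered time series

	// First attempt: drop "too old" entries (here: the odd ones).
	sent := filterInPlace(buf, func(v int) bool { return v%2 == 0 })
	fmt.Println(sent) // [2 4]

	// A retry that re-reads the original buffer no longer sees
	// [1 2 3 4] but the partially overwritten [2 4 3 4].
	fmt.Println(buf)
}
```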
👍
Just to understand: the difference proposed here is to not mutate the input `timeSeries` slice, right? Otherwise the logic looks the same. Do you think that's why retries are affected? I'd also like to have a test to make sure we don't accidentally regress on this.
@tpaschalis yes, this seemed to fix the issue using the reproduction steps described above. I'm trying to reproduce it in a test case, but with no luck yet.
Fixes #13979

The PR #13002 added the option to drop old samples in remote write. Unfortunately, we bumped into a bug causing data to be corrupted (metrics' labels getting merged with other metrics' labels was the most obvious symptom), reported in #13979.

After trying to reproduce the issue, it turned out to be strictly connected to situations where retries happen and the `filter` in `buildTimeSeries` is applied. We investigated, and it appears that the issue is in the newly added `buildTimeSeries` function, which modifies the `timeSeries` argument, causing the corruption.

The suggested change, which avoids modifying the original `timeSeries`, seems to fix the issue, but it is naive and we are not sure how optimal it is. A rough sketch of the non-mutating approach is below.
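As a minimal sketch of the non-mutating approach (a simplified generic signature, not the actual `buildTimeSeries`, which operates on `prompb.TimeSeries`): allocate a fresh slice for the kept entries instead of compacting the input in place, so the buffered request stays intact if the send has to be retried.

```go
package remote

// filterSeries is a sketch only, not the actual Prometheus code: it keeps
// the entries matching the predicate in a newly allocated slice, leaving
// `in` untouched so a retried send still reads unmodified data.
func filterSeries[T any](in []T, keep func(T) bool) []T {
	out := make([]T, 0, len(in))
	for _, ts := range in {
		if keep(ts) {
			out = append(out, ts)
		}
	}
	return out
}
```

The obvious cost is an extra allocation per send, which is presumably why the original code filtered in place; that trade-off is what the "not sure how optimal" remark above refers to.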