Fix data corruption in remote write if max_sample_age is applied #14078

FUSAKLA · 2024-05-10T21:41:03Z

The PR #13002 added the option to drop old samples in remote write.

Unfortunatelly we bumped into a bug causing data to be corrupted (metrics withlabels getting merged with other metrics labels was the most obvious) reported here #13979

after trying to repproduce the issue it showed up it is strictly connected to situations, when retries does happen and the filter in buildTimeSeries is applied.

We did investigate and it appears that the issue is in the newly added buildTimeSeries function which does modify the timeSeries argument causing the corruption.

The suggested change, which avoids modification of the original timeSeries seems to fix the issue, but is naive and not sure how optimal.

FUSAKLA · 2024-05-10T22:02:31Z

Unfortunately, I had no luck with writing a test that would cover the case yet

Yet, it is possible to reproduce the bug with 100% "success"

Run locally Prometheus with following config

global:
  scrape_interval: 5s
  external_labels:
    __replica__: prometheus-local-test-0
    cluster: local-test
    mimir_cluster_id: test-cluster

remote_write:
  - url: http://foo.bar # URL of some remote write endpoint with known IP si it can be blocked
    queue_config:
      max_shards: 1
      min_shards: 1
      batch_send_deadline: 5s
      capacity: 100
      sample_age_limit: 30s
    metadata_config:
      send: true
      send_interval: 1m
      max_samples_per_send: 100

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090

  - job_name: node-exporter # running with default configuration
    static_configs:
      - targets:
          - localhost:9100

When it is running, block the communication to the remote write endpoint IP using sudo iptables -I OUTPUT -d <ip> -j DROP

Wait until you see a context deadline exceeded, log in the Prometheus and enable the traffic again sudo iptables -F OUTPUT

At this point the issue happens. If you run with debugger inspecting the pendingData variable in the runShard function, you can find for example up metrics with duplicated labels like job, mimir_cluster_id etc

FUSAKLA · 2024-05-10T22:08:32Z

Mentioning @marctc and @tpaschalis as you were authors of the original code and have most context, hope you don't mind.

cstyan · 2024-05-15T01:31:17Z

Sorry, but for a bug that sounds as bad as what you're describing we need a test case to ensure we've fixed the problem/this doesn't happen again. I think because of the number of functions at play here and the fact that some of them are closures within other functions that is obscuring the underlying problem and also making it hard to test things here.

My suggestion would be to add a test that uses a version of the test client that can fail and enforce a retry since you're saying this only happens when we need to retry sending an existing write request.

Beyond that, we likely need to refactor some of the code path here so that it's easier to test and less likely to break again.

FUSAKLA · 2024-05-15T12:56:30Z

Hi @cstyan, thanks for answer.

Yes, I agree. We used this bug fix successfully in our production, since we needed it asap,
but I'm planning to look into writing the test, so I can verify it's really not happening anymore (it's hard to find the corrupted data).

I'll ping you once I manage to reproduce it, if you don't mind.

cstyan · 2024-05-15T18:33:40Z

I've pinged Paschalis and Marc internally as well, they're aware of the problem and are looking into it.

We used this bug fix successfully in our production, since we needed it asap

That's good to know, if it becomes pressing we can just move forward with your PR as is. Having a test is still ideal, some part of me feels like while the fix here works it might not be the correct fix because it seems like what's happening is we're modifying the underlying slice of buffered data but not returning a reference to the proper range for the modified slice after a few retries.

I'll ping you once I manage to reproduce it, if you don't mind.

👍

tpaschalis

Just to understand, the difference proposed here is to not mutate the input timeSeries slice, right? Otherwise the logic looks the same.

Do you think that's why it's impacted on retries? I'd also like to have a test to make sure we don't accidentally regress on this.

Signed-off-by: Martin Chodur <m.chodur@seznam.cz>

FUSAKLA · 2024-05-28T09:58:20Z

@tpaschalis yes, this seemed to fix the issue using the above described way to reproduce.

I'm trying to reproduce it in a test case, but with no luck yet.
Here is a text case I'm working on right now #14157

Signed-off-by: Martin Chodur <m.chodur@seznam.cz>

Signed-off-by: David Vavra <sevenood@gmail.com>

david-vavra · 2024-06-12T06:38:19Z

I rebased this over #14157 which adds a test case which reproduces the bug reported in #13979.

FUSAKLA marked this pull request as ready for review May 10, 2024 22:04

FUSAKLA requested review from cstyan, bwplotka and tomwilkie as code owners May 10, 2024 22:04

FUSAKLA mentioned this pull request May 10, 2024

Corrupting data written to remote storage in case sample_age_limit is hit #13979

Open

tpaschalis reviewed May 27, 2024

View reviewed changes

fix: try to reproduce the bug from prometheus#13979 in a test case

eaf26b3

Signed-off-by: Martin Chodur <m.chodur@seznam.cz>

feat: improve logging and locking in the test case

de8d132

Signed-off-by: Martin Chodur <m.chodur@seznam.cz>

david-vavra mentioned this pull request Jun 11, 2024

fix: try to reproduce the bug from https://github.com/prometheus/prom… #14157

Draft

fix: try to reproduce the bug from prometheus#13979 in a test case

25e3354

Signed-off-by: David Vavra <sevenood@gmail.com>

david-vavra force-pushed the fus-fix-max-sample-age branch from b44a4e1 to b286bc9 Compare June 11, 2024 20:40

fix: malformation of remote write data if samples are dropped

049a42f

Signed-off-by: David Vavra <sevenood@gmail.com>

david-vavra force-pushed the fus-fix-max-sample-age branch from b286bc9 to 049a42f Compare June 11, 2024 20:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix data corruption in remote write if max_sample_age is applied #14078

Fix data corruption in remote write if max_sample_age is applied #14078

FUSAKLA commented May 10, 2024 •

edited

FUSAKLA commented May 10, 2024 •

edited

FUSAKLA commented May 10, 2024

cstyan commented May 15, 2024

FUSAKLA commented May 15, 2024

cstyan commented May 15, 2024 •

edited

tpaschalis left a comment

FUSAKLA commented May 28, 2024

david-vavra commented Jun 12, 2024

Fix data corruption in remote write if max_sample_age is applied #14078

Are you sure you want to change the base?

Fix data corruption in remote write if max_sample_age is applied #14078

Conversation

FUSAKLA commented May 10, 2024 • edited

FUSAKLA commented May 10, 2024 • edited

FUSAKLA commented May 10, 2024

cstyan commented May 15, 2024

FUSAKLA commented May 15, 2024

cstyan commented May 15, 2024 • edited

tpaschalis left a comment

Choose a reason for hiding this comment

FUSAKLA commented May 28, 2024

david-vavra commented Jun 12, 2024

FUSAKLA commented May 10, 2024 •

edited

FUSAKLA commented May 10, 2024 •

edited

cstyan commented May 15, 2024 •

edited