vmagent panic on remoteWrite.streamAggr.dedupInterval #6205

Open
alexintech opened this issue Apr 29, 2024 · 8 comments · Fixed by #6206
Assignees
Labels
bug Something isn't working vmagent

Comments

@alexintech

Describe the bug

vmagent crashes periodically when the -remoteWrite.streamAggr.dedupInterval="0s,120s" flag is set.

To Reproduce

vmagent is configured with the following remoteWrite.streamAggr.dedupInterval setting:

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
  name: vmagent-multi-retention
  namespace: victoria-metrics
spec:
  image:
    tag: v1.101.0
  selectAllByDefault: true
  replicaCount: 1
  scrapeInterval: 20s
  scrapeTimeout: 10s
  externalLabels:
    cluster: mycluster
  extraArgs:
    promscrape.streamParse: 'true'
    remoteWrite.streamAggr.dedupInterval: "0s,120s"
  statefulMode: true
  statefulStorage:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 20Gi
  remoteWrite:
    - url: "http://vminsert-vmcluster-retention-1m.victoria-metrics.svc:8480/insert/0/prometheus/api/v1/write"
    - url: "http://vminsert-vmcluster-retention-3m.victoria-metrics.svc:8480/insert/0/prometheus/api/v1/write"
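
For readers unfamiliar with per-URL flags, here is a minimal Go sketch of how a value such as "0s,120s" can be read positionally against the two remoteWrite URLs above. It is an illustration only: parseIntervals and usesDeduplicator are hypothetical helpers, not vmagent functions, and the sketch uses the standard library's time.ParseDuration rather than vmagent's own flag parser. The "interval greater than zero means the deduplicator path" rule is taken from the discussion later in this thread.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// parseIntervals is a hypothetical helper: it splits a comma-separated
// flag value and parses each item with the standard library.
func parseIntervals(flagValue string) ([]time.Duration, error) {
	parts := strings.Split(flagValue, ",")
	out := make([]time.Duration, 0, len(parts))
	for _, p := range parts {
		d, err := time.ParseDuration(strings.TrimSpace(p))
		if err != nil {
			return nil, fmt.Errorf("cannot parse %q: %w", p, err)
		}
		out = append(out, d)
	}
	return out, nil
}

// usesDeduplicator models the rule described in the thread:
// a non-zero interval selects the deduplicator path.
func usesDeduplicator(d time.Duration) bool { return d > 0 }

func main() {
	// Short labels standing in for the two remoteWrite URLs, in config order.
	urls := []string{"retention-1m", "retention-3m"}

	intervals, err := parseIntervals("0s,120s")
	if err != nil {
		panic(err)
	}
	for i, u := range urls {
		fmt.Printf("%s: dedupInterval=%s deduplicator=%v\n",
			u, intervals[i], usesDeduplicator(intervals[i]))
	}
}
```

Under these assumptions the first URL takes the normal path and the second takes the deduplicator path, which is the mixed setup that triggers the crash.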

Version

./vmagent-prod --version
vmagent-20240425-145801-tags-v1.101.0-0-g5334f0c2c

Logs

panic: runtime error: index out of range [6] with length 0

goroutine 15146 [running]:
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.(*writeRequest).copyTimeSeries(0xc000000008, 0xc004a236e0, 0xc000a796e8)
	github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/pendingseries.go:207 +0x6a9
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.(*writeRequest).tryPush(0xc000000008, {0xc000a72008, 0x283, 0xc0004f8820?})
	github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/pendingseries.go:192 +0x6d
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.(*pendingSeries).TryPush(0xc000000000, {0xc000a72008?, 0x40c025?, 0x10?})
	github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/pendingseries.go:64 +0x67
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.(*remoteWriteCtx).tryPushInternal(0x8?, {0xc000a72008?, 0x0?, 0xc00013c510?})
	github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:1015 +0x1c5
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.(*remoteWriteCtx).TryPush(0xc000099b60, {0xc000a72008?, 0x10a20?, 0xc0000a3950?})
	github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:957 +0x605
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.tryPushBlockToRemoteStorages.func1(0xc00117aeac?)
	github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:593 +0x65
created by github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.tryPushBlockToRemoteStorages in goroutine 49
	github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:591 +0xea

Screenshots

No response

Used command-line flags

command-line flags
-httpListenAddr=":8429"
-promscrape.config="/etc/vmagent/config_out/vmagent.env.yaml"
-promscrape.streamParse="true"
-remoteWrite.maxDiskUsagePerURL="1073741824"
-remoteWrite.streamAggr.dedupInterval="0s,2m0s"
-remoteWrite.tmpDataPath="/vmagent_pq/vmagent-remotewrite-data"
-remoteWrite.url="secret"

Additional information

No response

@alexintech alexintech added the bug Something isn't working label Apr 29, 2024
@hagen1778
Collaborator

Thanks for the report!
This looks like a race condition. @AndrewChubatiuk would you mind taking a look?

@AndrewChubatiuk AndrewChubatiuk self-assigned this Apr 29, 2024
@jiekun
Contributor

jiekun commented Apr 29, 2024

It only happens when you have multiple remote write targets where:

  1. some of them run with a deduplicator, and
  2. others don't.

A remote write with a deduplicator pushes data here:

rwctx.deduplicator.Push(tss)

and then clears the slice with clear(tss).

A remote write without a deduplicator pushes data here:

ok := rwctx.tryPushInternal(tss)

And here's the critical part:


The goroutine without a deduplicator refers to the timeseries data by index, tsSrc := &src[i], and that data might already have been cleared.

The goroutine with a deduplicator, by contrast, works on a copy of each timeseries:

for _, ts := range tss {

It can be reproduced whenever you have:

  1. some remote writes that take the deduplicator path (dedupInterval != 0s), and
  2. some remote writes that take the normal path (dedupInterval = 0s).

Hope this helps.
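
The hazard described above can be sketched in a few lines of Go. This is a simplified, sequential model under stated assumptions, not vmagent's code: the TimeSeries, Label and Sample types and the dedupPath, directPath, clearSeries and cloneSeries functions are hypothetical stand-ins. It shows that clearing a slice in one consumer wipes the elements seen by every other holder of the same slice, and that handing the clearing consumer its own copy avoids this. It does not claim to reproduce the change made in #6206.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the real time series types.
type Label struct{ Name, Value string }

type Sample struct {
	Value     float64
	Timestamp int64
}

type TimeSeries struct {
	Labels  []Label
	Samples []Sample
}

// clearSeries zeroes every element in place, which is the effect the
// built-in clear(tss) has on a slice.
func clearSeries(tss []TimeSeries) {
	for i := range tss {
		tss[i] = TimeSeries{}
	}
}

// dedupPath models a consumer that reads each series through a copy
// (for _, ts := range tss) and then clears its input.
func dedupPath(tss []TimeSeries) int {
	n := 0
	for _, ts := range tss {
		n += len(ts.Samples)
	}
	clearSeries(tss)
	return n
}

// directPath models a consumer that reads elements by index through a
// pointer (tsSrc := &src[i]), so it sees whatever is in the slice now.
func directPath(src []TimeSeries) int {
	n := 0
	for i := range src {
		tsSrc := &src[i]
		n += len(tsSrc.Samples)
	}
	return n
}

// cloneSeries returns a new slice with copies of the element headers,
// so zeroing the clone leaves the original elements untouched.
func cloneSeries(tss []TimeSeries) []TimeSeries {
	out := make([]TimeSeries, len(tss))
	copy(out, tss)
	return out
}

// newBatch builds two series holding three samples in total.
func newBatch() []TimeSeries {
	return []TimeSeries{
		{Labels: []Label{{"__name__", "a"}}, Samples: []Sample{{1, 1000}, {2, 2000}}},
		{Labels: []Label{{"__name__", "b"}}, Samples: []Sample{{3, 1000}}},
	}
}

func main() {
	shared := newBatch()
	dedupPath(shared) // clears the slice that directPath also reads
	fmt.Println("samples seen after sharing the slice:", directPath(shared))

	shared = newBatch()
	dedupPath(cloneSeries(shared)) // clears only its private copy
	fmt.Println("samples seen after passing a copy:", directPath(shared))
}
```

In vmagent the two paths run in separate goroutines, so the reader can observe elements being zeroed partway through its loop, which is consistent with the index out of range panic in the logs above.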

@AndrewChubatiuk
Contributor

@alexintech just curious if you change the order - 120s,0s will it also cause an error?

@f41gh7
Contributor

f41gh7 commented Apr 29, 2024

It'd be great to build vmagent with the race detector (make vmagent-race) and test it for possible data races.

Note that this significantly reduces the application's performance and must be used only for testing.

@AndrewChubatiuk
Contributor

The most obvious cause is this, as mentioned by @jiekun. I've reproduced the issue as well and tested these changes.
@alexintech you can try this if you want

@alexintech
Author

@alexintech just curious if you change the order - 120s,0s will it also cause an error?

The same error, but it crashes quicker, just after the start.

@alexintech you can try this if you want

I'll check

@alexintech
Author

@alexintech you can try this if you want

seems that it's working!

hagen1778 added a commit that referenced this issue May 6, 2024
…multiple remote write contexts (#6206)

When at least one remote write has deduplication configured it cleans up
timeseries while they can be in use by another remote write without
deduplication

#6205
---------

Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
hagen1778 pushed a commit that referenced this issue May 6, 2024
…multiple remote write contexts (#6206)

(cherry picked from commit 8797718)
@hagen1778
Collaborator

Re-opening the issue since #6206 isn't released yet. It will be included in the next release.

5 participants