
vmagent's remotewrite speed decreases with time, but restores after restart #6246

Closed

laixintao opened this issue May 9, 2024 · 11 comments

Labels: question, vmagent

Comments

@laixintao
Contributor

Describe the bug

Yesterday, I received the alert "RemoteWriteConnectionIsSaturated", suggesting that the data vmagent scrapes is larger than its sending speed. So I changed -remoteWrite.rateLimit=50000000 to -remoteWrite.rateLimit=80000000 at point A of this picture, and I also upgraded vmagent from 1.82.1 to 1.93.12.

image

Problem solved. (But from the monitoring, vmagent_remotewrite_conn_bytes_written_total seems even lower than before. Is it because the VictoriaMetrics remote write protocol is enabled by default in the new version?)

Then at point B, the issue occurred:

  • vmagent's remote write speed decreased
  • data was pending on vmagent's local disk, not being sent to vminsert
  • but it didn't reach vmagent's remote write speed limit, which is 80000000 (see the query sketch below)
  • I have 6 vmagents with the same config; 3 of them hit the same issue at different times

At point C, I updated the config again to -remoteWrite.rateLimit=100000000 and restarted vmagent; problem solved.
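
For reference, a sketch of how the third bullet can be checked (hedged: the dashboard template variables are assumptions, and the 80000000 threshold simply mirrors the -remoteWrite.rateLimit value above, not the actual panel used):

# hedged sketch: per-instance compressed write rate vs. the configured
# -remoteWrite.rateLimit of 80000000 bytes/s; instances staying below the
# limit should not be throttled by the rate limiter
sum(rate(vmagent_remotewrite_conn_bytes_written_total{job=~"$job", instance=~"$instance"}[$__rate_interval])) by (instance)
  < 80000000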

To Reproduce

It happened once this morning, so I cannot reproduce it.

Version

1.93.12

Logs

No errors from vmagent stdout.

Screenshots

No response

Used command-line flags

No response

Additional information

I have searched the release logs from 1.82.1 -> 1.93.12 and didn't see any obvious bugfix related to this, only in 1.93.13:

BUGFIX: downgrade Go builder from 1.22.0 to 1.21.7, since 1.22.0 contains the bug, which can lead to deadlocked HTTP connections to remote storage systems, scrape targets and service discovery endpoints at vmagent. This may result in incorrect service discovery, target scraping and failed sending samples to remote storage.

@laixintao laixintao added the bug label May 9, 2024
@jiekun
Contributor

jiekun commented May 9, 2024

vmagent_remotewrite_conn_bytes_written_total seems even lower than before

  1. The VM remote-write protocol with zstd compression was introduced in v1.88.0.
  2. Since v1.88.0, vmagent sends a handshake request to vminsert at the start-up phase if no protocol is specified via command-line flags.

vmagent's remote write speed decreased ...

I guess the issue might occur on the remote-write target(s). It would be helpful to have:

  1. some screenshots of the remote-write target(s) status. I have seen a similar case happen on vmagent when our vmstorage could not handle large amounts of data and slow inserts went up to ~80%.
# I recommend finding these troubleshooting queries on https://grafana.com/orgs/victoriametrics/dashboards
max(
    rate(vm_slow_row_inserts_total{job=~"$job_storage"}[$__rate_interval]) 
    / rate(vm_rows_added_to_storage_total{job=~"$job_storage"}[$__rate_interval])
)

image

  2. (Since it recovered quickly,) logs from vmstorage (and possibly vminsert), to see if something went wrong.

@laixintao
Contributor Author

Thank you so much for your information.

vm remote-write protocol with zstd compression is introduced in v1.88.0.

That explains the bandwidth reduction. However, can I confirm that -remoteWrite.rateLimit limits the rate after compression, right?

And it seems it's not slow inserts.

image

The metrics this cluster collects are stable. As far as I know, the slow path is only hit when a metric is inserted for the first time, so with no huge changes there should not be any slow queries or slow inserts.

Also, no logs were found from vminsert and vmstorage.

From the vmstorage metrics, nothing seems wrong; only the source reduced its ingestion speed.

image

(I don't think it's a vmstorage issue, because 3 out of 6 vmagents got the issue; if it were a vmstorage issue, all of the vmagents should have trouble sending data.)

Thanks again for your information!

@jiekun
Contributor

jiekun commented May 9, 2024

-remoteWrite.rateLimit limits the rate after compression, right?

Correct.

c.rl.Register(len(block))

I don't think it's a vmstorage issue, because 3 out of 6 vmagents got the issue; if it were a vmstorage issue, all of the vmagents should have trouble sending data.

This makes sense, and the slow inserts look absolutely fine.

I'm not able to locate the root cause for you right now. If both vmstorage and vminsert are fine, then some metrics from vmagent may help.

Since it's not reproducible, I recommend checking the vmagent dashboard to see if any issue happens with the scrape targets. For example, some scrape targets are down, so some of your vmagents failed to retrieve the metrics (at different times).
image

@laixintao
Contributor Author

Thanks for the info.

Since it's not reproducible, I recommend checking the vmagent dashboard to see if any issue happens with the scrape targets. For example, some scrape targets are down, so some of your vmagents failed to retrieve the metrics (at different times).

Targets should be ok, as the scraped rows have not changed, and the pending data on vmagent's local disk was increasing, suggesting that the metrics were scraped but could not be sent to the remote.

image
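
The local-disk pending data mentioned above can be graphed with something like the following (a hedged sketch; vmagent_remotewrite_pending_data_bytes comes from vmagent's exposed metrics, while the label filters are assumptions, not the actual panel shown here):

# hedged sketch: size of vmagent's on-disk persistent queue per instance;
# a steadily growing value means scraped data is buffered locally instead of being sent
sum(vmagent_remotewrite_pending_data_bytes{job=~"$job", instance=~"$instance"}) by (instance)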

@laixintao
Contributor Author

It happened again today. I suspect it is a vmagent issue, because after restarting, vmagent behaves ok.

Some abnormal vmagent panels I have noticed:

Push delay increased

image

The target's unique labels changed, but I suspect that this is not true, as I inspected the target's /metrics path and nothing seemed to have changed at that time.

image

After the unique samples decreased, vmagent didn't recover; it kept pending data locally and the push delay remained high. After restarting, everything became normal.

@jiekun
Contributor

jiekun commented May 10, 2024

@laixintao Thank you for the extra monitoring metrics.
Did you see any request rate/traffic changes in the remote-write panels?
e.g.

# the same PromQL as you mentioned in the issue
sum(rate(vmagent_remotewrite_conn_bytes_written_total{job=~"$job", instance=~"$instance"}[$__rate_interval])) by(job, pod) > 0

# check the response status code with this one
sum(rate(vmagent_remotewrite_requests_total{job=~"$job", instance=~"$instance", url=~"$url"}[$__rate_interval])) by (job, url, status_code, pod) > 0

image

After the unique samples decreased, vmagent didn't recover,

I would like to share my thoughts here. The first direction I am considering is whether there are some limitations on your network, such as blocking all requests larger than a certain size (e.g., xx MiB). In such cases, vmagent might encounter failures in sending these (big) requests and continue buffering and retrying them. In this scenario, reducing the number of unique samples won't address the issue of retrying requests.

In this case, you should have some error logs from vmagent, as well as some abnormal metrics via the PromQLs above.
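
A couple of extra queries that could surface such retries (a hedged sketch; the metric names are taken from vmagent's exposed metrics list, while the label filters and groupings are assumptions):

# hedged sketch: retry rate per remote-write URL; a sustained non-zero value
# points at requests being rejected or timing out and then re-queued
sum(rate(vmagent_remotewrite_retries_count_total{job=~"$job", instance=~"$instance"}[$__rate_interval])) by (instance, url)

# hedged sketch: data dropped from the persistent queue (should normally stay at 0)
sum(rate(vmagent_remotewrite_packets_dropped_total{job=~"$job", instance=~"$instance"}[$__rate_interval])) by (instance)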

after restarting, everything became normal.

May I confirm with you how vmagent is deployed (e.g. which flags are used, especially those related to persistence)? Is it a StatefulSet or a Deployment? Will it load the persistent queue after a restart? If it's deployed as a Deployment, it could lose the retry queue, so everything might appear to be back to normal.

@laixintao
Contributor Author

laixintao commented May 10, 2024

image

For the second metric, same.

image

For networking, I think it's fine: they are in the same IDC, and the path is vmagent -> vminsert (on the same server as vmagent) -> vmstorage, with no proxy in the middle, so it's pretty simple. I have checked the logs; still no logs from vmagent, only some errors requesting HTTP SD, but that should not be a problem, as vmagent should use the targets from the last successful SD.

Is it a StatefulSet or Deployment?
Sorry, I am not sure what this is; they are deployed on bare-metal servers.

Will it load the persistent queue after a restart?
Yes, all cached data was loaded and sent to vmstorage, no data loss.

I have upgraded those vmagents to v1.101 (latest) to see if they still have this issue or not.

@jiekun
Contributor

jiekun commented May 10, 2024

Thanks for the additional info.

Sorry I could not help with this issue. In case I'm going in the wrong direction, it would be appreciated if we could have some input from the maintainers @f41gh7 :) thanks

@hagen1778
Collaborator

BUGFIX: downgrade Go builder from 1.22.0 to 1.21.7, since 1.22.0 contains the bug, which can lead to deadlocked HTTP

I think this was exactly the issue, since vmagent communicates with vminserts via HTTP. If connections get deadlocked one by one, you'd see a gradual ingestion delay. From the vminsert perspective, it should look like the number of active TCP connections decreases over time.
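
One way to check this from the vminsert side (a hedged sketch; vm_tcplistener_conns is exposed by VictoriaMetrics components, while the $job_insert filter is an assumption):

# hedged sketch: established TCP connections on vminsert listeners;
# a slow, steady decline would match the deadlocked-connection theory
sum(vm_tcplistener_conns{job=~"$job_insert"}) by (instance, name)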

I recommend updating to the latest LTS https://docs.victoriametrics.com/changelog/#v19314 or to upstream versions.

@hagen1778 hagen1778 self-assigned this May 13, 2024
@hagen1778 hagen1778 changed the title from "remotewrite speed decrease, causing data pending on vmagent's local disk" to "vmagent's remotewrite speed decreases with time, but restores after restart" May 13, 2024
@laixintao
Contributor Author

Thanks for the confirmation! I agree this is exactly the issue!

  • It only happens on version 1.93.12; before, I used 1.89 with no issue.
  • I upgraded the cluster to 1.101 after it happened twice, and that resolved the issue.

Version changes and remote_write_connections:

image

Thanks!

(btw I think we need to add a warning to the changelog of 1.93.12 here https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.93.12 , cc @valyala )

@laixintao
Contributor Author

link to golang/go#65705

@hagen1778 hagen1778 added the question label and removed the bug and need more info labels May 15, 2024