
vmagent's remotewrite speed decreases with time, but restores after restart #6246

Closed

laixintao opened this issue May 9, 2024 · 11 comments

Labels: question, vmagent

Comments

@laixintao
Contributor

Describe the bug

Yesterday, I received the alert "RemoteWriteConnectionIsSaturated", suggesting that the data vmagent scrapes is larger than its sending speed. So I changed -remoteWrite.rateLimit=50000000 to -remoteWrite.rateLimit=80000000 at point A of this picture, and I also upgraded vmagent from 1.82.1 to 1.93.12.

image

Problem solved. (But from the monitoring, vmagent_remotewrite_conn_bytes_written_total seems even lower than before. Is it because the VictoriaMetrics remote write protocol is enabled by default in the new version?)

Then at point B, the issue occurred:

  • vmagent's remote write speed decreased
  • data was pending on vmagent's local disk, not being sent to vminsert
  • but it didn't reach vmagent's remote write speed limit, which is 80000000 (see the query sketch below)
  • I have 6 vmagents with the same config; 3 of them hit the same issue at different times

At point C, I updated the config again to -remoteWrite.rateLimit=100000000 and restarted vmagent; problem solved.
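
For reference, a sketch of how the third bullet can be checked (hedged: the dashboard template variables are assumptions, and the 80000000 threshold simply mirrors the -remoteWrite.rateLimit value above, not the actual panel used):

# hedged sketch: per-instance compressed write rate vs. the configured
# -remoteWrite.rateLimit of 80000000 bytes/s; instances staying below the
# limit should not be throttled by the rate limiter
sum(rate(vmagent_remotewrite_conn_bytes_written_total{job=~"$job", instance=~"$instance"}[$__rate_interval])) by (instance)
  < 80000000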

To Reproduce

It happened once this morning, so I cannot reproduce it.

Version

1.93.12

Logs

No errors from vmagent stdout.

Screenshots

No response

Used command-line flags

No response

Additional information

I have searched the release logs from 1.82.1 -> 1.93.12 and didn't see any obvious bugfix related to this, only in 1.93.13:

BUGFIX: downgrade Go builder from 1.22.0 to 1.21.7, since 1.22.0 contains the bug, which can lead to deadlocked HTTP connections to remote storage systems, scrape targets and service discovery endpoints at vmagent. This may result in incorrect service discovery, target scraping and failed sending samples to remote storage.

@laixintao laixintao added the bug label May 9, 2024
@jiekun
Contributor

jiekun commented May 9, 2024

vmagent_remotewrite_conn_bytes_written_total seems even lower than before

  1. The VM remote-write protocol with zstd compression was introduced in v1.88.0.
  2. Since v1.88.0, vmagent sends a handshake request to vminsert at the start-up phase if no protocol is specified via command-line flags.

vmagent's remote write speed decreased ...

I guess the issue might occur on the remote-write target(s). It would be helpful to have:

  1. some screenshots of the remote-write target(s) status. I have seen a similar case happen on vmagent when our vmstorage could not handle large amounts of data and slow inserts went up to ~80%.
# I recommend finding these troubleshooting queries on https://grafana.com/orgs/victoriametrics/dashboards
max(
    rate(vm_slow_row_inserts_total{job=~"$job_storage"}[$__rate_interval]) 
    / rate(vm_rows_added_to_storage_total{job=~"$job_storage"}[$__rate_interval])
)

image

  2. (Since it recovered quickly,) logs from vmstorage (and possibly vminsert), to see if something went wrong.

@laixintao
Contributor Author

Thank you so much for your information.

vm remote-write protocol with zstd compression is introduced in v1.88.0.

That explains the bandwidth reduction. However, can I confirm that -remoteWrite.rateLimit limits the rate after compression, right?

And it seems it's not slow inserts.

image

The metrics this cluster collects are stable. As far as I know, the slow path is only hit when a metric is inserted for the first time, so with no huge changes there should not be any slow queries or slow inserts.

Also, no logs were found from vminsert and vmstorage.

From the vmstorage metrics, nothing seems wrong; only the source reduced its ingestion speed.

image

(I don't think it's a vmstorage issue, because 3 out of 6 vmagents got the issue; if it were a vmstorage issue, all of the vmagents should have trouble sending data.)

Thanks again for your information!

@jiekun
Contributor

jiekun commented May 9, 2024

-remoteWrite.rateLimit limits the rate after compression, right?

Correct.

c.rl.Register(len(block))

I don't think it's a vmstorage issue, because 3 out of 6 vmagents got the issue; if it were a vmstorage issue, all of the vmagents should have trouble sending data.

This makes sense, and the slow inserts look absolutely fine.

I'm not able to locate the root cause for you right now. If both vmstorage and vminsert are fine, then some metrics from vmagent may help.

Since it's not reproducible, I recommend checking the vmagent dashboard to see if any issue happens with the scrape targets. For example, some scrape targets are down, so some of your vmagents failed to retrieve the metrics (at different times).
image

@laixintao
Contributor Author

Thanks for the info.

Since it's not reproducible, I recommend checking the vmagent dashboard to see if any issue happens with the scrape targets. For example, some scrape targets are down, so some of your vmagents failed to retrieve the metrics (at different times).

Targets should be ok, as the scraped rows have not changed, and the pending data on vmagent's local disk was increasing, suggesting that the metrics were scraped but could not be sent to the remote.

image
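
The local-disk pending data mentioned above can be graphed with something like the following (a hedged sketch; vmagent_remotewrite_pending_data_bytes comes from vmagent's exposed metrics, while the label filters are assumptions, not the actual panel shown here):

# hedged sketch: size of vmagent's on-disk persistent queue per instance;
# a steadily growing value means scraped data is buffered locally instead of being sent
sum(vmagent_remotewrite_pending_data_bytes{job=~"$job", instance=~"$instance"}) by (instance)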

@laixintao
Contributor Author

It happened again today. I suspect it is a vmagent issue, because after restarting, vmagent behaves ok.

Some abnormal vmagent panels I have noticed:

Push delay increased

image

The target's unique labels changed, but I suspect that this is not true, as I inspected the target's /metrics path and nothing seemed to have changed at that time.

image

After the unique samples decreased, vmagent didn't recover; it kept pending data locally and the push delay remained high. After restarting, everything became normal.

@jiekun
Contributor

jiekun commented May 10, 2024

@laixintao Thank you for the extra monitoring metrics.
Did you see any request rate/traffic changes in the remote-write panels?
e.g.

# the same PromQL as you mentioned in the issue
sum(rate(vmagent_remotewrite_conn_bytes_written_total{job=~"$job", instance=~"$instance"}[$__rate_interval])) by(job, pod) > 0

# check the response status code with this one
sum(rate(vmagent_remotewrite_requests_total{job=~"$job", instance=~"$instance", url=~"$url"}[$__rate_interval])) by (job, url, status_code, pod) > 0

image

After the unique samples decreased, vmagent didn't recover,

I would like to share my thoughts here. The first direction I am considering is whether there are some limitations on your network, such as blocking all requests larger than a certain size (e.g., xx MiB). In such cases, vmagent might encounter failures in sending these (big) requests and continue buffering and retrying them. In this scenario, reducing the number of unique samples won't address the issue of retrying requests.

In this case, you should have some error logs from vmagent, as well as some abnormal metrics via the PromQLs above.
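
A couple of extra queries that could surface such retries (a hedged sketch; the metric names are taken from vmagent's exposed metrics list, while the label filters and groupings are assumptions):

# hedged sketch: retry rate per remote-write URL; a sustained non-zero value
# points at requests being rejected or timing out and then re-queued
sum(rate(vmagent_remotewrite_retries_count_total{job=~"$job", instance=~"$instance"}[$__rate_interval])) by (instance, url)

# hedged sketch: data dropped from the persistent queue (should normally stay at 0)
sum(rate(vmagent_remotewrite_packets_dropped_total{job=~"$job", instance=~"$instance"}[$__rate_interval])) by (instance)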

after restarting, everything became normal.

May I confirm with you how vmagent is deployed (e.g. which flags are used, especially those related to persistence)? Is it a StatefulSet or a Deployment? Will it load the persistent queue after a restart? If it's deployed as a Deployment, it could lose the retry queue, so everything might appear to be back to normal.

@laixintao
Contributor Author

laixintao commented May 10, 2024

image

For the second metric, same.

image

For networking, I think it's fine: they are in the same IDC, and the path is vmagent -> vminsert (on the same server as vmagent) -> vmstorage, with no proxy in the middle, so it's pretty simple. I have checked the logs; still no logs from vmagent, only some errors requesting HTTP SD, but that should not be a problem, as vmagent should use the targets from the last successful SD.

Is it a StatefulSet or Deployment?
Sorry, I am not sure what this is; they are deployed on bare-metal servers.

Will it load the persistent queue after a restart?
Yes, all cached data was loaded and sent to vmstorage, no data loss.

I have upgraded those vmagents to v1.101 (latest) to see if they still have this issue or not.

@jiekun
Contributor

jiekun commented May 10, 2024

Thanks for the additional info.

Sorry I could not help with this issue. In case I'm going in the wrong direction, it would be appreciated if we could have some input from the maintainers @f41gh7 :) thanks

@hagen1778
Collaborator

BUGFIX: downgrade Go builder from 1.22.0 to 1.21.7, since 1.22.0 contains the bug, which can lead to deadlocked HTTP

I think this was exactly the issue, since vmagent communicates with vminserts via HTTP. If connections get deadlocked one by one, you'd see a gradual ingestion delay. From the vminsert perspective, it should look like the number of active TCP connections decreases over time.
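
One way to check this from the vminsert side (a hedged sketch; vm_tcplistener_conns is exposed by VictoriaMetrics components, while the $job_insert filter is an assumption):

# hedged sketch: established TCP connections on vminsert listeners;
# a slow, steady decline would match the deadlocked-connection theory
sum(vm_tcplistener_conns{job=~"$job_insert"}) by (instance, name)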

I recommend updating to the latest LTS https://docs.victoriametrics.com/changelog/#v19314 or to upstream versions.

@hagen1778 hagen1778 self-assigned this May 13, 2024
@hagen1778 hagen1778 changed the title from "remotewrite speed decrease, causing data pending on vmagent's local disk" to "vmagent's remotewrite speed decreases with time, but restores after restart" May 13, 2024
@laixintao
Contributor Author

Thanks for the confirmation! I agree this is exactly the issue!

  • It only happens on version 1.93.12; before, I used 1.89 with no issue.
  • I upgraded the cluster to 1.101 after it happened twice, and that resolved the issue.

Version changes and remote_write_connections:

image

Thanks!

(btw I think we need to add a warning to the changelog of 1.93.12 here https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.93.12 , cc @valyala )

@laixintao
Contributor Author

link to golang/go#65705

@hagen1778 hagen1778 added the question label and removed the bug and need more info labels May 15, 2024