vmagent's remotewrite speed decreases with time, but restores after restart #6246
Comments
I guess the issue might occur on the remote-write target(s). It would be helpful to have:
```
# I recommend the troubleshooting queries on https://grafana.com/orgs/victoriametrics/dashboards
max(
  rate(vm_slow_row_inserts_total{job=~"$job_storage"}[$__rate_interval])
    / rate(vm_rows_added_to_storage_total{job=~"$job_storage"}[$__rate_interval])
)
```
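Slow inserts are typically driven by time-series churn, so a companion check worth running is the rate of newly registered series — a sketch, assuming the standard `vm_new_timeseries_created_total` counter exposed by vmstorage, with the job filter mirroring the query above:

```
# Hypothetical companion query: a high rate of newly registered
# series (churn) is the usual cause of slow inserts on vmstorage.
sum(rate(vm_new_timeseries_created_total{job=~"$job_storage"}[$__rate_interval]))
```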
Thank you so much for your information.
That explains the bandwidth reduction. However, can I confirm that it doesn't seem to be slow inserts? The metrics this cluster collects are stable, and as far as I know a lookup only happens when a metric is inserted for the first time, so with no huge change there should be no slow queries and no slow inserts. Also, no logs were found from vminsert or vmstorage. From the vmstorage metrics, nothing seems wrong; only the source reduced its ingestion speed. (I don't think it's a vmstorage issue, because 3 out of 6 vmagents got the issue; if it were a vmstorage issue, all of the vmagents should have trouble sending data.) Thanks again for your information!
Thanks for the info.
Targets should be OK, as the scraped rows did not change, while vmagent's local on-disk pending data was increasing — suggesting that the metrics were scraped but could not be sent to the remote.
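For reference, this on-disk backlog can be watched directly; a minimal sketch, assuming vmagent's standard `vmagent_remotewrite_pending_data_bytes` gauge (the `url` grouping label is an assumption):

```
# Steadily growing pending bytes mean data is buffered on disk
# faster than it can be pushed to the remote storage.
sum(vmagent_remotewrite_pending_data_bytes) by (instance, url)
```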
It happened again today. I suspect it is a vmagent issue, because after restarting, vmagent behaves OK. Some abnormal vmagent panels I noticed: push delay increased, and the target's unique labels changed — but I suspect the latter is not real, as I inspected the target's labels. After the unique samples decreased, vmagent didn't recover: it kept pending data locally and the push delay remained high; after restarting, everything became normal.
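The push delay described above can be approximated as backlog divided by send throughput — a hedged sketch built from the two vmagent metrics already mentioned in this thread (treat it as a rough estimate, since on-disk pending bytes and compressed wire bytes are not measured identically):

```
# Rough backlog-drain time in seconds: bytes pending on disk divided
# by the rate at which bytes go out over remote-write connections.
# A value that keeps growing matches the "push delay increased" symptom.
sum(vmagent_remotewrite_pending_data_bytes) by (instance)
  / sum(rate(vmagent_remotewrite_conn_bytes_written_total[5m])) by (instance)
```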
@laixintao Thank you for the extra monitoring metrics.
I would like to share my thoughts here. The first direction I am considering is whether there are some limitations on your network, such as blocking all requests larger than a certain size (e.g., xx MiB). In such cases, vmagent might encounter failures when sending these (big) requests and would keep buffering and retrying them. In this scenario, reducing the number of unique samples won't address the issue of retrying requests, and you should see some error logs from vmagent.
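One way to test this hypothesis from the metrics side — a sketch, assuming vmagent's per-response counter `vmagent_remotewrite_requests_total` carries a `status_code` label (both names are assumptions worth verifying against your vmagent version):

```
# Non-2xx remote-write responses would indicate something on the path
# (proxy, firewall) rejecting requests, e.g. oversized ones.
sum(rate(vmagent_remotewrite_requests_total{status_code!~"2.."}[5m])) by (url, status_code)
```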
May I confirm with you how …
For the second metric, same. For networking, I think it's fine: they are in the same IDC, and the path is vmagent -> vminsert (on the same server as vmagent) -> vmstorage; there is no proxy in the middle, so it's pretty simple. I have checked the logs: still no logs from vmagent, only some errors when requesting HTTP SD, but that should not be a problem, as vmagent should keep using the targets from the last successful SD.
I have upgraded those vmagents to v1.101 (latest) to see if they still have this issue or not.
Thanks for the extra info. Sorry I did not help with this issue. In case I'm going in the wrong direction, it would be appreciated if we could have some input from the maintainers @f41gh7 :) thanks
I think this was exactly the issue, since vmagent communicates with vminserts via HTTP. If connections get deadlocked one by one, you'd see gradually growing ingestion delay. From the vminsert perspective, it should look like the number of active TCP connections decreases with time. I recommend updating to the latest LTS (https://docs.victoriametrics.com/changelog/#v19314) or to upstream versions.
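That symptom should be directly observable on the vminsert side; a minimal sketch, assuming the `vm_tcplistener_conns` gauge that cluster components expose per TCP listener (the `$job_insert` variable is an assumption mirroring the dashboard variables above):

```
# A slow, steady decline here would match HTTP connections from
# vmagent deadlocking one by one until a restart resets them.
sum(vm_tcplistener_conns{job=~"$job_insert"}) by (instance)
```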
Thanks for confirmation! I agree this is exactly the issue!
Version changes and remote_write_connections: Thanks! (BTW, I think we need to add a warning to the changelog of 1.93.12 here: https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.93.12, cc @valyala)
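For anyone following along, the panel referenced above presumably charts vmagent's established remote-write connections; a hypothetical query, assuming a vmagent-side gauge named `vmagent_remotewrite_conns` (the metric name is an assumption and may differ by version):

```
# Connection counts recovering to a stable level after the
# upgrade/restart would support the deadlock explanation.
sum(vmagent_remotewrite_conns) by (instance, url)
```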
Link to golang/go#65705
Describe the bug
Yesterday, I received the alert "RemoteWriteConnectionIsSaturated", suggesting that the volume vmagent scrapes exceeds its sending speed (a sketch of this alert's logic is included after the To Reproduce section below). So I changed `-remoteWrite.rateLimit=50000000` to `-remoteWrite.rateLimit=80000000` at point A of this picture, and I also upgraded vmagent from 1.82.1 to 1.93.12. Problem solved. (But from the monitoring, `vmagent_remotewrite_conn_bytes_written_total` seems even lower than before. Is it because the VictoriaMetrics remote write protocol is enabled by default in the new version?) Then at point B, the issue occurred:
At point C, I updated the config again to `-remoteWrite.rateLimit=100000000` and restarted vmagent; problem solved.
To Reproduce
It happened once this morning, so I cannot reproduce it.
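For context, here is a hedged sketch of the idea behind the RemoteWriteConnectionIsSaturated alert; the exact expression in the stock VictoriaMetrics alerting rules may differ, and the 0.9 threshold and grouping labels are assumptions:

```
# Each remote-write queue is saturated when it spends close to 100%
# of wall-clock time sending; compare send time per second against
# the number of configured queues.
sum(rate(vmagent_remotewrite_send_duration_seconds_total[5m])) by (instance, url)
  > 0.9 * max(vmagent_remotewrite_queues) by (instance, url)
```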
Version
1.93.12
Logs
No errors from vmagent stdout.
Screenshots
No response
Used command-line flags
No response
Additional information
I have searched the release logs from 1.82.1 -> 1.93.12 and didn't see any obvious bugfix related to this; only in 1.93.13: