McRouter not forwarding gets / sets / deletes for largish values #383

danielbeardsley opened this issue Oct 7, 2021 · 2 comments

@danielbeardsley

Mcrouter has been performing very well for us for years, but recently we've started to notice a problem that's undermining our faith. Sometimes, for several minutes at a time, Mcrouter just stops forwarding sets/deletes (and some gets) for a particularly large value to both machines in our pool. During those windows, the set and delete counts are way off, and the commands only make it to one of the two machines. At other times the counts are even, and for other slabs they are even.

Here's an example of the meta-data about one of the keys in question in memcache:
key={the key name in question} exp=1631856704 la=1631856415 cas=1382190321 fetch=yes cls=39 size=831500

The size is 831 KB (under the 1 MB memcached limit, and we don't have value splitting turned on) and the expiration time is ~5 minutes.

Here's a graph of command counts between our two memcached machines in the pool. At other times, and for other slabs, the values are nearly identical. But occasionally (maybe once a day, though it's becoming more frequent) we see these imbalances, which lead to almost constant cache misses (because we use AllFastestRoute for gets).
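For readers unfamiliar with AllFastestRoute: it fans the request out to all children and returns the first reply, so when a set only lands on one of the two machines, subsequent gets race a hit from one pool against a miss from the other. A toy Python model of that effect (the `Pool` class and the random "fastest" choice are illustrative assumptions, not mcrouter internals):

```python
import random

class Pool:
    """Toy stand-in for one memcached machine behind mcrouter."""
    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

def all_fastest_get(pools, key):
    # AllFastestRoute returns whichever child replies first; since both
    # machines are equally loaded, model "fastest" as a random child.
    return random.choice(pools).get(key)

a, b = Pool(), Pool()
a.set("big-key", "x" * 831_500)   # the set only reached pool A

hits = sum(all_fastest_get([a, b], "big-key") is not None for _ in range(10_000))
print(f"hit rate: {hits / 10_000:.0%}")  # roughly half the gets miss,
                                         # since pool B never saw the set
```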

[Graph: per-command counts on the two memcached machines, diverging during an anomaly]

A reduced view of our mcrouter config:
[Screenshot: reduced view of the mcrouter routing config]

Of note: each time we see one of these anomalies, Mcrouter seemingly randomly stops sending the commands to one of the two machines (but not the same machine each time).

CC @djmetzle @andyg0808 @sctice-ifixit

@djvaporize
Contributor

Hi there -
My initial instinct is that it could be the server-timeout setting: the value may be too large to transfer within the timeout window. Can you check that setting first?

https://github.com/facebook/mcrouter/blob/main/mcrouter/mcrouter_options_list.h#L617
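As a rough sanity check on the timeout theory, here is a back-of-the-envelope calculation of how long an 831,500-byte value takes to transfer at various effective link speeds (the bandwidth figures are assumptions for illustration, not measurements from this setup):

```python
VALUE_BYTES = 831_500   # the value size from the issue
TIMEOUT_MS = 1000       # the reporter's server_timeout_ms (the default)

for mbit in (5, 10, 100):
    transfer_ms = VALUE_BYTES * 8 / (mbit * 1_000_000) * 1000
    verdict = "over" if transfer_ms > TIMEOUT_MS else "under"
    print(f"{mbit:>4} Mbit/s: {transfer_ms:7.1f} ms ({verdict} the {TIMEOUT_MS} ms timeout)")
```

On a healthy link the transfer fits comfortably inside the default timeout; only a degraded or congested link (single-digit Mbit/s effective throughput) pushes an 831 KB transfer past 1000 ms.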

Additionally, 831 KB is not so large that mcrouter shouldn't be able to handle it. That said, I have seen others set a [smaller] value threshold via the big-value route (see https://github.com/facebook/mcrouter/blob/main/mcrouter/mcrouter_options_list.h#L119). The tradeoff is that pieces of the data get distributed across more than one machine.
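For context on the big-value route mentioned above: values over the threshold are split into chunks stored under derived keys, with the original key holding metadata so a get can reassemble them. A simplified Python sketch of the idea (the `__chunked__` marker and `:chunk:` key-naming scheme here are invented for illustration; mcrouter's actual format differs):

```python
THRESHOLD = 500_000  # the --big-value-split-threshold used later in the thread

def split_set(store, key, value):
    """Store a large value as chunks plus an index entry under the key."""
    if len(value) <= THRESHOLD:
        store[key] = value
        return
    chunks = [value[i:i + THRESHOLD] for i in range(0, len(value), THRESHOLD)]
    for i, chunk in enumerate(chunks):
        store[f"{key}:chunk:{i}"] = chunk     # hypothetical chunk-key scheme
    store[key] = f"__chunked__:{len(chunks)}"  # index entry

def split_get(store, key):
    """Reassemble a chunked value transparently."""
    value = store.get(key)
    if value is None or not value.startswith("__chunked__:"):
        return value
    n = int(value.split(":")[1])
    return "".join(store[f"{key}:chunk:{i}"] for i in range(n))

store = {}
split_set(store, "big-key", "x" * 831_500)   # the 831,500-byte value from the issue
assert split_get(store, "big-key") == "x" * 831_500
print("chunks stored:", sum(1 for k in store if ":chunk:" in k))  # 2
```

Each chunk then fits well under the threshold, at the cost that a single logical value may span machines.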

Also, could you share more of your options and routing configuration (with proprietary info redacted, please)? E.g. command-line parameters, etc. That would help in understanding the problem. An easy way is to dump the preprocessed config on the routing side:

https://github.com/facebook/mcrouter/wiki/Admin-requests#get-__mcrouter__preprocessed_config

Thanks!

@danielbeardsley
Author

> My initial instinct is that it could be the server-timeout setting where the value is too large to transfer in the time window. Can you first check this setting?

Man, that sounds like just the thing. But we're at the default of 1000 ms; still, I may experiment.

Here are the CLI options:

usr/local/bin/mcrouter -f /var/run/mcrouter/mcrouter.conf -a /var/spool/mcrouter --port 11222 --probe-timeout-initial 100 --big-value-split-threshold 500000 --timeouts-until-tko 3 --use-asynclog-version2

And confirming that default value:

$> echo "get __mcrouter__.options" | nc 127.0.0.1 11222 | grep server_timeout
server_timeout_ms 1000

Note: to solve our issue we added --big-value-split-threshold. The underlying problem is still there; we've just sidestepped it (831 KB values weren't being propagated to both pools).

Here's our config:

{
  "pools": {
    "A": {
      "servers": [
        "10.0.1.X:11211"
      ]
    },
    "B": {
      "servers": [
        "10.0.1.Y:11211"
      ]
    }
  },
  "route": {
    "type": "OperationSelectorRoute",
    "operation_policies": {
      "get": {
        "type": "AllFastestRoute",
        "children": [
          "PoolRoute|A",
          "PoolRoute|B"
        ]
      },
      "set": {
        "type": "AllFastestRoute",
        "children": [
          "PoolRoute|A",
          "PoolRoute|B"
        ]
      },
      "add": {
        "type": "AllSyncRoute",
        "children": [
          "PoolRoute|A",
          "PoolRoute|B"
        ]
      },
      "delete": {
        "type": "AllAsyncRoute",
        "children": [
          "PoolRoute|A",
          "PoolRoute|B"
        ]
      },
      "incr": {
        "type": "AllSyncRoute",
        "children": [
          "PoolRoute|A",
          "PoolRoute|B"
        ]
      },
      "decr": {
        "type": "AllSyncRoute",
        "children": [
          "PoolRoute|A",
          "PoolRoute|B"
        ]
      }
    }
  }
}
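A toy model of what this OperationSelectorRoute config expresses: each operation type gets its own fan-out policy across the two pools. This is an illustrative sketch of the policy semantics, not mcrouter code (the `Pool` class and synchronous dispatch are simplifying assumptions):

```python
POLICIES = {
    "get":    "AllFastestRoute",  # first reply to arrive wins
    "set":    "AllFastestRoute",
    "add":    "AllSyncRoute",     # wait for every child's reply
    "delete": "AllAsyncRoute",    # fire-and-forget to every child
    "incr":   "AllSyncRoute",
    "decr":   "AllSyncRoute",
}

class Pool:
    """Toy stand-in for PoolRoute|A / PoolRoute|B."""
    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value
        return "STORED"

    def get(self, key):
        return self.data.get(key)

    def delete(self, key):
        self.data.pop(key, None)
        return "DELETED"

def route(op, pools, key, *args):
    """Dispatch an operation to all pools per the configured policy."""
    policy = POLICIES.get(op, "AllSyncRoute")
    replies = [getattr(p, op)(key, *args) for p in pools]  # send to all children
    if policy == "AllAsyncRoute":
        return None           # caller does not wait for replies
    if policy == "AllFastestRoute":
        return replies[0]     # stand-in for "whichever reply arrives first"
    return replies            # AllSyncRoute: collect all replies

pools = [Pool(), Pool()]
route("set", pools, "k", "v")
print(route("get", pools, "k"))  # v
```

The key property for this bug: every write policy here fans out to both pools, so whenever a large set or delete only reaches one machine, the divergence has to be happening below the routing layer.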
