
Performance and scaling nomad-driver-podman under high allocation loads #175

Open

jdoss opened this issue Jun 20, 2022 · 6 comments

@jdoss
Contributor

jdoss commented Jun 20, 2022

I have been using the Podman driver for my workloads, and my current project involves a very large Nomad cluster and launching 16,000 containers across the cluster. I have noticed some performance and scaling issues that have impacted my deployment of the workloads. I am wondering if there are any specific steps I could take to improve the stability of my deployments and optimize the number of containers per client node. Here are the two major issues I am seeing when launching jobs in batches of 4,000:

  1. The deployment will pile allocations onto a small number of client nodes, where some will have one or two containers running and others will have 50+. This seems to cause the second issue.
  2. The Podman socket, I assume, gets overloaded. Under high allocation load, the Podman driver becomes unavailable in the Web UI and allocations start failing.

The failed allocations tend to snowball a client node into an unusable state because the Podman socket cannot fully recover to accept new allocations. This leads to a large number of failed allocations.

Does anyone have any recommendations for changing my jobs so they spread out more evenly across my client nodes? I think I need more time between container starts. I am using these settings in my job:

update {
  stagger = "30s"
  max_parallel = 1
  min_healthy_time = "15s"
  progress_deadline = "30m"
}

restart {
  attempts = 10
  interval = "30m"
  delay    = "2m"
  mode     = "fail"
}

scaling {
  enabled = true
  min     = 0
  max     = 20000
}

Also, any thoughts on why the Podman socket gets overwhelmed by the driver? My client nodes use Fedora CoreOS, which has pretty decent sysctl settings out of the box, and I am using the Nomad-recommended settings as well:

  - path: /etc/sysctl.d/30-nomad-bridge-iptables.conf
    contents:
      inline: |
        net.bridge.bridge-nf-call-arptables=1
        net.bridge.bridge-nf-call-ip6tables=1
        net.bridge.bridge-nf-call-iptables=1
  - path: /etc/sysctl.d/31-nomad-dynamic-ports.conf
    contents:
      inline: |
        net.ipv4.ip_local_port_range=49152 65535
  - path: /etc/sysctl.d/32-nomad-max-user.conf
    contents:
      inline: |
        fs.inotify.max_user_instances=16384
        fs.inotify.max_user_watches=1048576
  - path: /etc/sysctl.d/33-nomad-nf-conntrack-max.conf
    contents:
      inline: |
        net.netfilter.nf_conntrack_max=524288
$ cat /proc/sys/fs/file-max
9223372036854775807

Does anyone else use the Podman driver for high allocation workloads?

@rina-spinne

Not a solution here, but I have seen a similar behaviour to your second case with a smaller cluster.

Most of the time, the podman socket's CPU usage is very high even when most of the services are idle. The more services running on a node, the greater the CPU usage. If any CPU-intensive task happens, the socket stops responding and allocations start failing for a while.
From a quick diagnostic, most of the logged requests are health checks, so it might be that the podman socket can't handle too many requests at the same time.

I haven't had the chance to debug further, but it might be a bug in podman's service. I don't use Docker so I don't know how it behaves, but I doubt this behaviour is normal there, since I have seen Docker machines running more containers.

@towe75
Collaborator

towe75 commented Jun 27, 2022

@jdoss regarding problem 1: I think this is not directly related to this driver. Nomad uses, by default, the bin-pack strategy to place tasks. It will always try to fill up a node before it considers another one. An alternative is the so-called spread scheduler, which will distribute work evenly.
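
For illustration only (not from our configs, so double-check against the Nomad docs), a job-level spread block looks roughly like this; spreading on the node ID here is just an example attribute:

job "example" {
  # Ask the scheduler to spread allocations across nodes instead of
  # bin-packing them onto a few; the attribute is an example target.
  spread {
    attribute = "${node.unique.id}"
    weight    = 100
  }

  # ... groups and tasks ...
}

If I recall correctly, there is also a cluster-wide variant: setting scheduler_algorithm = "spread" in the servers' default_scheduler_config, if you want this behaviour for every job.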

Problem 2: I am aware of this problem; it happens in our environment as well. The root cause is unclear for now, and we have a systemd timer to check/clean up the socket periodically as a workaround.

@rina-spinne getting metrics/stats from a single container is somewhat expensive. Running many containers concurrently and polling stats at a frequent pace can quickly cause quite a bit of load. Maybe you can tune the collection_interval configuration option? It has an aggressive default of just 1 second. A good solution is to align it with your metrics collector's interval; this way you end up with 30s or 60s for a typical Prometheus setup.
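
As a rough sketch only (the exact name and placement of the option should be checked against the driver README), tuning it in the client's plugin block would look something like:

plugin "nomad-driver-podman" {
  config {
    # Assumed placement: collect container stats every 30s instead of the
    # aggressive 1s default, e.g. to match a Prometheus scrape interval.
    collection_interval = "30s"
  }
}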

@jdoss
Contributor Author

jdoss commented Jul 14, 2022

@jdoss regarding problem 1: I think this is not directly related to this driver. Nomad uses, by default, the bin-pack strategy to place tasks. It will always try to fill up a node before it considers another one. An alternative is the so-called spread scheduler, which will distribute work evenly.

Thanks for this tip. I will modify my jobs to see if I can spread things out to prevent the socket from getting overloaded and report back.

Problem 2: I am aware of this problem; it happens in our environment as well. The root cause is unclear for now, and we have a systemd timer to check/clean up the socket periodically as a workaround.

Would you be able to share this unit and timer?

@jdoss
Contributor Author

jdoss commented Jul 14, 2022

@towe75 @rina-spinne I opened containers/podman#14941 to see if the Podman team has any thoughts on this issue. If you have any additional context to add to that issue, I am sure that would help track things down.

@jdoss
Contributor Author

jdoss commented Mar 2, 2023

Maybe related: hashicorp/nomad#16246

@towe75
Collaborator

towe75 commented Mar 10, 2023

@jdoss I do not think that it's related.
I don't know enough about your environment to recommend something specific. A rule of thumb in our cluster is: keep the number of containers below 70 for a 2-core machine (e.g. an AWS m5a.large). We found that the overhead for logging, scraping, process management, etc. is rather high when going above 100 containers on such a node. But this depends, of course, on a lot of things and is likely not true for your workload.
