
Performance and scaling nomad-driver-podman under high allocation loads #175

Open

jdoss opened this issue Jun 20, 2022 · 6 comments

@jdoss
Contributor

jdoss commented Jun 20, 2022

I have been using the Podman driver for my workloads, and my current project involves a very large Nomad cluster and launching 16,000 containers across the cluster. I have noticed some performance and scaling issues that have impacted my deployment of the workloads. I am wondering if there are any specific steps I could take to improve the stability of my deployments and optimize the number of containers per client node. Here are the two major issues I am seeing when launching jobs in batches of 4,000:

  1. The deployment will pile allocations onto a small number of client nodes, where some will have one or two containers running and others will have 50+. This seems to cause the second issue.
  2. The Podman socket, I assume, gets overloaded. Under high allocation load, the Podman driver becomes unavailable in the Web UI and allocations start failing.

The failed allocations tend to snowball a client node into an unusable state because the Podman socket cannot fully recover to accept new allocations. This leads to a large number of failed allocations.

Does anyone have any recommendations for changing my jobs so they spread out more evenly across my client nodes? I think I need more time between container starts. I am using these settings in my job:

update {
  stagger = "30s"
  max_parallel = 1
  min_healthy_time = "15s"
  progress_deadline = "30m"
}

restart {
  attempts = 10
  interval = "30m"
  delay    = "2m"
  mode     = "fail"
}

scaling {
  enabled = true
  min     = 0
  max     = 20000
}

Also, any thoughts on why the Podman socket gets overwhelmed by the driver? My client nodes use Fedora CoreOS, which has pretty decent sysctl settings out of the box, and I am using the Nomad-recommended settings as well:

  - path: /etc/sysctl.d/30-nomad-bridge-iptables.conf
    contents:
      inline: |
        net.bridge.bridge-nf-call-arptables=1
        net.bridge.bridge-nf-call-ip6tables=1
        net.bridge.bridge-nf-call-iptables=1
  - path: /etc/sysctl.d/31-nomad-dynamic-ports.conf
    contents:
      inline: |
        net.ipv4.ip_local_port_range=49152 65535
  - path: /etc/sysctl.d/32-nomad-max-user.conf
    contents:
      inline: |
        fs.inotify.max_user_instances=16384
        fs.inotify.max_user_watches=1048576
  - path: /etc/sysctl.d/33-nomad-nf-conntrack-max.conf
    contents:
      inline: |
        net.netfilter.nf_conntrack_max=524288
$ cat /proc/sys/fs/file-max
9223372036854775807

Does anyone else use the Podman driver for high allocation workloads?

@rina-spinne

Not a solution here, but I have seen a similar behaviour to your second case with a smaller cluster.

Most of the time, the podman socket's CPU usage is very high even when most of the services are idle. The more services running on a node, the greater the CPU usage. If any CPU-intensive task happens, the socket stops responding and allocations start failing for a while.
From a quick diagnostic, most of the logged requests are health checks, so it might be that the podman socket can't handle too many requests at the same time.

I haven't had the chance to debug further, but it might be a bug in podman's service. I don't use Docker so I don't know how it behaves, but I doubt this behaviour is normal there, since I have seen Docker machines running more containers.

@towe75
Collaborator

towe75 commented Jun 27, 2022

@jdoss regarding problem 1: I think this is not directly related to this driver. Nomad uses, by default, the bin-pack strategy to place tasks. It will always try to fill up a node before it considers another one. An alternative is the so-called spread scheduler, which will distribute work evenly.
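
For illustration only (not from our configs, so double-check against the Nomad docs), a job-level spread block looks roughly like this; spreading on the node ID here is just an example attribute:

job "example" {
  # Ask the scheduler to spread allocations across nodes instead of
  # bin-packing them onto a few; the attribute is an example target.
  spread {
    attribute = "${node.unique.id}"
    weight    = 100
  }

  # ... groups and tasks ...
}

If I recall correctly, there is also a cluster-wide variant: setting scheduler_algorithm = "spread" in the servers' default_scheduler_config, if you want this behaviour for every job.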

Problem 2: I am aware of this problem; it happens in our environment as well. The root cause is unclear for now, and we have a systemd timer to check/clean up the socket periodically as a workaround.

@rina-spinne getting metrics/stats from a single container is somewhat expensive. Running many containers concurrently and polling stats at a frequent pace can quickly cause quite a bit of load. Maybe you can tune the collection_interval configuration option? It has an aggressive default of just 1 second. A good solution is to align it with your metrics collector's interval; this way you end up with 30s or 60s for a typical Prometheus setup.
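
As a rough sketch only (the exact name and placement of the option should be checked against the driver README), tuning it in the client's plugin block would look something like:

plugin "nomad-driver-podman" {
  config {
    # Assumed placement: collect container stats every 30s instead of the
    # aggressive 1s default, e.g. to match a Prometheus scrape interval.
    collection_interval = "30s"
  }
}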

@jdoss
Contributor Author

jdoss commented Jul 14, 2022

@jdoss regarding problem 1: I think this is not directly related to this driver. Nomad uses, by default, the bin-pack strategy to place tasks. It will always try to fill up a node before it considers another one. An alternative is the so-called spread scheduler, which will distribute work evenly.

Thanks for this tip. I will modify my jobs to see if I can spread things out to prevent the socket from getting overloaded and report back.

Problem 2: I am aware of this problem; it happens in our environment as well. The root cause is unclear for now, and we have a systemd timer to check/clean up the socket periodically as a workaround.

Would you be able to share this unit and timer?

@jdoss
Contributor Author

jdoss commented Jul 14, 2022

@towe75 @rina-spinne I opened containers/podman#14941 to see if the Podman team has any thoughts on this issue. If you have any additional context to add to that issue, I am sure that would help track things down.

@jdoss
Contributor Author

jdoss commented Mar 2, 2023

Maybe related: hashicorp/nomad#16246

@towe75
Collaborator

towe75 commented Mar 10, 2023

@jdoss I do not think that it's related.
I don't know enough about your environment to recommend something specific. A rule of thumb in our cluster is: keep the number of containers below 70 for a 2-core machine (e.g. an AWS m5a.large). We found that the overhead for logging, scraping, process management, etc. is rather high when going above 100 containers on such a node. But this depends, of course, on a lot of things and is likely not true for your workload.
