
[Feature request] Topology Spread Constraints & Per-Deployment Affinity #856

kevin-lindsay-1 opened this issue Oct 18, 2021 · 25 comments

@kevin-lindsay-1
Sponsor Contributor

It's fairly normal for k8s deployments in a multi-zone cluster to have topology spread constraints or affinity rules, and it's equally normal for those rules to be defined per deployment, since the pod label selectors typically differ per deployment.

Expected Behaviour

Be able to set anti-affinity and topology spread constraints per deployment.

Current Behaviour

Cannot set affinity per deployment, and topology spread constraints are not in the chart.

Are you a GitHub Sponsor (Yes/No?)

Check at: https://github.com/sponsors/openfaas

  • Yes
  • No
  • No, but I sponsor Alex

List All Possible Solutions and Workarounds

Affinity is currently a global option for what appears to be all deployments. Per-deployment overrides would be a reasonable option.

Topology Spread Constraints are not currently in the chart, and I don't think a global configuration option would really even work, as the labelSelector would need to be different per-deployment.

Which Solution Do You Recommend?

Add per-deployment overrides for affinity and make topology spread constraints configurable per deployment, rather than implementing a global variant.

Steps to Reproduce (for bugs)

Context

Trying to have HA for faas-netes/openfaas.

Your Environment

  • FaaS-CLI version ( Full output from: faas-cli version ):
    0.13.13

  • Docker version docker version (e.g. Docker 17.0.05 ):
    20.10.8

  • What version and distribution of Kubernetes are you using? kubectl version
    server v1.21.3
    client v1.22.2

  • Operating System and version (e.g. Linux, Windows, MacOS):
    MacOS

  • Link to your project or a code example to reproduce issue:

  • What network driver are you using and what CIDR? i.e. Weave net / Flannel

@alexellis
Member

Hi @kevin-lindsay-1

I think what we are looking for here is more description of the actual behaviour vs. the desired behaviour. What is happening now in the cluster, why is that bad, and what would make that better?

Have you looked into Profiles for affinity yet? I didn't see you mention it in your issue.

Regards,

Alex

@kevin-lindsay-1
Sponsor Contributor Author

Hey @alexellis

Right now I have no guarantee of regional, zonal, or nodal high availability on services. We currently use EKS with spot instances preferred, and part of how EKS works is that a node might be gracefully drained and cordoned at any time when the node's underlying spot instance is reclaimed.

As such, making sure multiple pods for each service exist and that they're not on the same node (and preferably not in the same zone) is a strong desire. We currently do this through Topology Spread Constraints.

Typically you need to set something like:

labelSelector:
  matchLabels:
    app: gateway

per app. With the way affinity is currently exposed, I don't think I have enough expressivity to easily select OF components and specify their desired affinities, especially if they differ in some way now or down the road.
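
For concreteness, a sketch of the kind of constraint we currently apply, which has to be repeated per component with its own selector (the maxSkew, topology key, and label values here are illustrative):

# e.g. for the gateway pods
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: gateway
# ...and the queue-worker needs its own copy with app: queue-worker, and so on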


Have you looked into Profiles for affinity yet? I didn't see you mention it in your issue.

Interesting; I never really looked into Profiles, as the CRD exposes most of what I want, and since it's easy enough to write a helm chart with values per environment, this really hasn't come up.

Now that I look at Profiles, I guess I'm wondering why I'd use them instead of the CRD. One use case I can think of is reasonable defaults for a set of functions, but our helmfile setup already has defaults on top of a custom chart, which already has defaults. I don't really think a third layer would be particularly advantageous here at this time, and I would instead probably prefer to have what Profiles expose available directly in the CRD.

On that note, I'm starting to wonder if it would be easier for the CRD to simply expose a deployment spec or something, because manually exposing every option that Kubernetes offers on a Deployment seems like a never-ending chore.

@alexellis
Member

I don't follow the comparison of "CRD" vs Profiles.

You can use the OpenFaaS REST API or Function CRD to create your function; there's an annotation to tie that function to a Profile.

Profiles enable affinity/anti-affinity rules.
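
For reference, a Profile carrying an affinity rule looks roughly like this, and a function opts in via the com.openfaas.profile annotation (a sketch; the names and the affinity rule itself are illustrative):

apiVersion: openfaas.com/v1
kind: Profile
metadata:
  name: spread-functions
  namespace: openfaas
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: topology.kubernetes.io/zone
            labelSelector:
              matchLabels:
                faas_function: my-fn

# the function then opts in, e.g. in stack.yml:
#   annotations:
#     com.openfaas.profile: spread-functions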

Clarify for me: Is this request for functions or the core services?

Do you also run the core services on spot instances?

@kevin-lindsay-1
Sponsor Contributor Author

I don't follow the comparison of "CRD" vs Profiles.

To me, both just look like different ways to specify certain things in k8s config. I've only given a cursory look to profiles, but they look to me like a way to apply multiple sets of certain k8s config to multiple functions, similar to having multiple queue workers for different "types" of functions.

Personally, I don't really see the need for Profiles as I understand them now; if the currently exposed options:

  • runtimeClassName
  • tolerations
  • affinity
  • podSecurityContext

were just part of the deployment created when the CRD is applied, it doesn't seem like Profiles would add anything. Those are all just normal k8s config options, just like annotations on the CRD, so it's unclear to me what the separation of concerns is between these two.

That, and there could be some kind of overlap between what the function's CRD exposes and what Profiles expose. For example, if the CRD let you specify a podSecurityContext, which would override which, the CRD or the Profile?

Clarify for me: Is this request for functions or the core services?

This is for the core services, because currently we don't expect functions to be always present (we expect to use scale-to-zero as much as reasonably possible for a given workload), and as such are not particularly concerned with the spread of the pods at this time. It will probably come, and would definitely be better to have than not, but right now the focus is core services.

Do you also run the core services on spot instances?

Yes; with the exception of OpenFaaS, all of our current workloads are highly available and individual pods are "stateless", i.e. assuming that a pod handling a workload is not interrupted by a SIGKILL, everything has time to gracefully terminate within the eviction period given to us by EKS' spot instances.

The question/issue I have regards queues in OpenFaaS: I've noticed in local testing that terminating certain pods seems to cause queues to be cleared while they should still have a workload. That's definitely not ideal, and I would like to know whether that is expected given the current implementation/architecture, and if so, whether persistent queues could be implemented. I've only recently begun to look into that, but I expect to create a separate issue for it once I do a little more digging.

Assuming that a queue is in-memory, and that termination of a specific core service, graceful or not, currently clears a queue, then perhaps a PVC or the like could be used to flush the contents of a queue during graceful termination, or some kind of "hand-off", where pod A (draining) waits for pod B (pending/running) to come online, and once available, pod A sends its queue to pod B.

@kevin-lindsay-1
Sponsor Contributor Author

FTR, I could probably pretty easily add these to the chart, as long as I know whether or not we'd like to use global defaults or something.

@LucasRoesler
Member

Just a small thing while I read and think about the thread: the comment about queue persistence probably deserves its own issue, because it will require us to modify this

The options are file and SQL, per this doc: https://docs.nats.io/nats-streaming-server/configuring/persistence. Both options come with issues that need to be carefully considered, and we would need to provide a lot more documentation about error states.
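
For context, a hypothetical sketch of what switching the NATS container to the file store might look like (flags per the nats-streaming-server docs; the image tag, path, and cluster id are illustrative, not the chart's current configuration):

containers:
  - name: nats
    image: nats-streaming:0.22.0
    args:
      - --store
      - file          # persist the channel/message store to disk instead of memory
      - --dir
      - /data         # backed by the PVC mounted below
      - --cluster_id
      - faas-cluster
    volumeMounts:
      - name: nats-data
        mountPath: /data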

@kevin-lindsay-1
Sponsor Contributor Author

kevin-lindsay-1 commented Oct 19, 2021

Maybe a cluster would solve both NATS high availability and this shared state? Assuming you could mount a file store via a PVC and let Raft take care of the rest, that would allow for multiple NATS pods at once and be persistent, killing two birds with one stone.

A SQL database running in k8s would probably have the same issue, meaning I'd either have to set up a database external to k8s, or it would need some kind of k8s-level persistence, so probably just more config for the same end value.

Although, you could theoretically create a db in the OF chart, allowing you to turn it off and change the endpoint, port, and auth. That would basically give you in-memory storage by default (the db is lost upon termination), and persistence if you want it (the db is remote).

PVC and cluster sounds simpler, though.

@kevin-lindsay-1
Sponsor Contributor Author

kevin-lindsay-1 commented Oct 19, 2021

PVC and cluster sounds simpler, though.

That said, when using it in AWS with a (potentially) multi-regional, multi-zonal, multi-nodal spread, EFS or FSx are probably the best choices, as both appear to support multi-zone/region use cases.

People installing a NATS cluster and expecting multi-zonal persistence will likely need something similar, which is potentially an extra wrench to throw in, but we would probably be able to handle the persistence just fine with EFS; people say "it's kinda slow" at lower workloads, but I don't think that's necessarily a big deal for something like a queue.

I imagine most people who need both HA and multi-AZ persistence have a good storageclass option at their disposal from their preferred k8s implementation/cloud provider.

@kevin-lindsay-1
Sponsor Contributor Author

I have broken out the NATS issues discussed into two separate issues, persistence and high availability; that way we can keep discussion here focused on Topology Spread and Affinity.

@kevin-lindsay-1
Sponsor Contributor Author

kevin-lindsay-1 commented Oct 20, 2021

@alexellis if my proposal of just exposing different topologySpreadConstraints and affinity for each deployment sounds good to you, I can just go ahead and make a PR for that.

For the sake of keeping the existing pattern:

# ./values.yaml

affinity: {...} # [current] global "default"

topologySpreadConstraints: # [proposed] global "default"
  - {...}
  - {...}

gateway:
  affinity: {...} # [proposed] overrides global
  topologySpreadConstraints: # [proposed] overrides global
    - ...
      labelSelector:
        app: gateway
    - ...
      labelSelector:
        app: gateway

faasIdler:
  affinity: {...} # [proposed] overrides global
  topologySpreadConstraints: # [proposed] overrides global
    - ...
      labelSelector:
        app: faas-idler
    - ...
      labelSelector:
        app: faas-idler

@alexellis
Member

The additional context is useful, but please don't take us not getting to requests within 24 hours as meaning that we are giving silent approval of the design.

@kevin-lindsay-1
Sponsor Contributor Author

kevin-lindsay-1 commented Oct 20, 2021

Understood. I'm not trying to pressure you; I'd just rather be a little noisy in a (hopefully) useful way than quiet in an unhelpful way. I'm not trying to signal that I think the response is slow.

@LucasRoesler
Member

@kevin-lindsay-1 regarding your proposal here #856 (comment): is this meant to apply just to the openfaas components (gateway, provider, nats, etc.), or is it a global value that applies to all functions? It looks like it is just the components, right?

@kevin-lindsay-1
Sponsor Contributor Author

@LucasRoesler only the core OF components in faas-netes/openfaas. I'm not worried about functions in this immediate moment, because their transience makes it less of a big deal to me; why worry about availability or failover when you're scaling to zero?

That, and most of our functions at this time are primarily async.

@LucasRoesler
Member

In that case, it seems like we could treat this the same as affinity, tolerations, and nodeSelector. These are empty and omitted by default, but the template can insert the required values where needed. This is essentially what you proposed here #856 (comment).
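
A minimal sketch of that pattern for the gateway template, assuming a hypothetical .Values.gateway.topologySpreadConstraints / .Values.gateway.affinity value and plain pass-through (not the chart's actual template):

{{- /* inside the gateway Deployment's pod template spec */}}
{{- with .Values.gateway.topologySpreadConstraints }}
      topologySpreadConstraints:
        {{- toYaml . | nindent 8 }}
{{- end }}
{{- with .Values.gateway.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
{{- end }}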

@kevin-lindsay-1
Sponsor Contributor Author

kevin-lindsay-1 commented Oct 21, 2021

@LucasRoesler yep. The only thing is that I don't think a single global rule will work for either TSCs or affinity, as you need to select the pods, which typically requires a labelSelector such as app: <app name>. With a global rule, you can't select the correct app's pods.

As it stands, I'm not even sure I could use the existing global affinity rule; given that you need a selector of some sort that differs per deployment, I don't think a single affinity rule could work. If something exists that would let me avoid having different affinity rules per service, I would consider it, but as far as I'm aware nothing like that exists, so I'm not sure what a global affinity rule would be useful for.

The same gotcha applies to TSCs, to my knowledge.

@LucasRoesler
Member

Yeah, but perhaps we could automatically add the deployment name as a label to the global constraint, or something like that, to make writing a generic constraint a bit easier and less repetitive. You would write one constraint, and the final constraint would be what was written plus one additional label.

So, if the global TSC is

  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar

then the helm chart generates

  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
        app: gateway

for the gateway deployment, and likewise for the others.
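
A hypothetical Helm sketch of that merge for the gateway deployment, assuming the global value lives at .Values.topologySpreadConstraints and each constraint sets maxSkew, topologyKey, and whenUnsatisfiable (not an actual chart template):

# copy each global constraint and append the per-deployment label
topologySpreadConstraints:
{{- range .Values.topologySpreadConstraints }}
- maxSkew: {{ .maxSkew }}
  topologyKey: {{ .topologyKey }}
  whenUnsatisfiable: {{ .whenUnsatisfiable }}
  labelSelector:
    matchLabels:
      {{- with .labelSelector }}
      {{- range $k, $v := .matchLabels }}
      {{ $k }}: {{ $v }}
      {{- end }}
      {{- end }}
      app: gateway
{{- end }}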

Having a Deployment specific override would still be useful though.

@kevin-lindsay-1
Sponsor Contributor Author

kevin-lindsay-1 commented Oct 21, 2021

I would be fine with the global rules automatically adding their respective label selectors; for the most part there's no real reason to have the overrides if those work appropriately, as most, if not all of the core services would have the same affinity/spread needs.

Whether or not that's a breaking API change for affinity is a different question.


This is also possibly out of scope for this issue, but it may also be useful to have a default preferred affinity between certain core services, so that, for example, the core services would prefer to be scheduled on the same node where possible, rather than spread throughout the cluster unnecessarily.

That might throw a wrench in, though, because, for example, I have an external NATS cluster, which may need its own preferences covered. It should theoretically be able to be scheduled alongside the other pods, though, just as with anti-affinity/topology spread.

I could probably set the NATS cluster's affinity to match OF pods, though, so it may be a moot point.

@LucasRoesler
Member

Just some bookkeeping here, for future reference, etc.

This is related to, but different from, #815, which asks for the same TSC configuration but at the pod level.

@alexellis changed the title from "Topology Spread Constraints & Per-Deployment Affinity" to "[Feature request] Topology Spread Constraints & Per-Deployment Affinity" on Nov 1, 2021
@alexellis
Member

Kevin mentioned that this became more important since his team moved to using spot instances. I know that other openfaas users make use of spot instances or GKE's preemptible VMs.

The spread or affinity rule would schedule the three replicas of the gateway onto three separate nodes before adding duplicate replicas to the same hosts again, so that if a VM is reclaimed by the cloud provider, there are other replicas available to serve requests whilst a new host is provisioned.
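
For example, a preferred pod anti-affinity rule along these lines could do that for the gateway (a sketch; the weight and the hostname topology key are illustrative choices):

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: gateway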

Affinity is currently a global option for what appears to be all deployments. Per-deployment overrides would be a reasonable option.

So, is there a way we can achieve this result without making the chart much more verbose and complex to manage? i.e. with a single rule that templates the deployment labels etc into the right place?

#856 (comment)

@kevin-lindsay-1
Sponsor Contributor Author

kevin-lindsay-1 commented Nov 2, 2021

is there a way we can achieve this result without making the chart much more verbose and complex to manage?

I would not currently have a need for overrides if the label selectors were automatically added per deployment.


I would also consider a default affinity rule for services that uses preferredDuringSchedulingIgnoredDuringExecution to prefer nodes with other OF components on them, à la:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: app
            operator: In
            values:
            - openfaas

This rule basically says "I prefer to be on a node labeled app=openfaas".

and I think the following, which matches pod labels rather than node labels, achieves basically the same result:

affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - openfaas
          topologyKey: kubernetes.io/hostname

I believe the current affinity rule (with non-templated labels) supports this; I'm unsure whether a future templated-for-you iteration would.

@alexellis
Member

I would not currently have a need for overrides if the label selectors were automatically added per deployment.

I can see labels, so could you help me understand which labels are not being added? Perhaps you could show an example of what you got and what you expected?

kubectl get deploy -n openfaas gateway -o yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    meta.helm.sh/release-name: openfaas
    meta.helm.sh/release-namespace: openfaas
  creationTimestamp: "2021-11-01T17:50:20Z"
  generation: 1
  labels:
    app: openfaas
    app.kubernetes.io/managed-by: Helm
    chart: openfaas-8.1.3
    component: gateway
    heritage: Helm
    release: openfaas


spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: gateway
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/port: "8082"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: gateway

I can see app: gateway at each level. Which additional labels are needed at the pod level?

@kevin-lindsay-1
Sponsor Contributor Author

kevin-lindsay-1 commented Nov 4, 2021

@alexellis I'm referring to what @LucasRoesler was thinking in #856 (comment). Effectively, the chart could have a global value and not need an override, if each component's affinity/TSC rules had that component's labels templated in.

For example:

  • with pod topology spread constraints, I could see the pod's component label being used to identify which component is being spread.
  • with affinity rules, I could see pods having a default rule of preferring to be scheduled on the same node as other openfaas components, via the app label.

Right now, I would probably not need per-component overrides, as long as each component referring to itself doesn't conflict with other affinity/TSC settings a user might reasonably want to make, which I can't really think of off the top of my head. It seems like most use cases for spread will be trying to spread components out, while affinity rules will be trying to keep core components together.

@alexellis
Member

Given the amount of conversation it's taking to understand the problem, I think your original suggestion of having one of these per deployment in the chart may fit better. It will default to off, but can be enabled if required.

values.yaml will have your overrides for images and configuration

gateway:
  image: ghcr.io/openfaas/gateway:latest

values-topology.yaml would look like this:

gateway:
  affinity: {...} # [proposed] overrides global
  topologySpreadConstraints: # [proposed] overrides global
    - ...
      labelSelector:
        app: gateway
    - ...
      labelSelector:
        app: gateway

faasIdler:
  affinity: {...} # [proposed] overrides global
  topologySpreadConstraints: # [proposed] overrides global
    - ...
      labelSelector:
        app: faas-idler
    - ...
      labelSelector:
        app: faas-idler

To make this happen, do we just need to output whatever is put into the chart values into the deployment and quote it, or are additional, more complex templating rules needed for dynamic fields?

@kevin-lindsay-1
Sponsor Contributor Author

Should be as simple as passing it straight through.

And yeah, keeping them separate is definitely simpler. There's an existing global rule for affinity; I'm not sure whether that would need to stick around.
