While reviewing a PR to upgrade ingress-nginx, my brain got caught up thinking about Kubernetes pod scheduling. Curiosity about the optimal way to meet the company's resiliency requirements for the ingress controller led me to discover a relatively new Kubernetes feature - Pod Topology Spread Constraints. In this blog post, I'm going to show you an example of how to fine-tune Kubernetes scheduling using these constraints to spread your workload in a more resilient way.

# Default scheduling

I run a low-traffic beta workload as a Kubernetes deployment with 2 replicas, waiting in the task backlog to be optimized before going GA. The deployment, with low resource requests and limits, runs in a GKE cluster with 10+ nodes in a single node pool spanning 2 availability zones. To make it more resilient, I'd like every pod replica to be scheduled on a different Kubernetes node and in a different availability zone. The Kubernetes scheduler does a reasonable job out of the box, even without additional configuration. However, combined with other external factors (other workloads, node scale up/down events, preemptible node restarts, node-pool upgrades, …), the pod replicas sometimes end up scheduled on the same node in a single availability zone - which is exactly what I don't want for a resilient workload.
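
If you want to follow along, you will need a multi-zone cluster similar to the one used in the example below. A rough, hedged sketch of creating one with gcloud - the cluster name, region, zones, and machine type are just placeholders:

```bash
# Regional cluster with 2 zones and 2 nodes per zone (4 nodes total);
# all names and locations are placeholders - adjust to your project
gcloud container clusters create fine-spread-demo \
  --region europe-west1 \
  --node-locations europe-west1-b,europe-west1-c \
  --num-nodes 2 \
  --machine-type e2-small
```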

Let's go straight to the example. Let's have a GKE cluster with 2 nodes in each of 2 availability zones (4 nodes total) and our workload defined as a deployment with 2 replicas:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fine-spread-app
  namespace: edu
spec:
  replicas: 2
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: fine-spread-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fine-spread-app
    spec:
      containers:
      - name: dummy
        image: k8s.gcr.io/pause:3.2
        resources:
          requests:
            cpu: 50m
            memory: 50Mi
          limits:
            cpu: 500m
            memory: 500Mi
```

Create a namespace, apply the manifest, and check the pods - depending on how the stars align, you might be (un)lucky enough to see both replicas scheduled on the same node, as in the output further below.
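
If you want to follow along, here is a minimal sketch of the commands, assuming the deployment above is saved in a (hypothetical) file named `fine-spread-app.yaml`:

```bash
# Create the namespace referenced in the manifest and apply the deployment
kubectl create namespace edu
kubectl apply -f fine-spread-app.yaml
```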

```
$ kubectl get pod -n edu -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
NAME                             NODE
fine-spread-app-7589d49465-29wr9   gke-default-0c04c813-0d1z
fine-spread-app-7589d49465-8l7v7   gke-default-0c04c813-0d1z
```

If the node gke-default-0c04c813-0d1z goes down, all pod replicas go down with it, making the deployment unavailable until the pods are rescheduled elsewhere. By default, this is outside your control. The good news is that you can easily prevent it.
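
You can observe the effect yourself by evicting the pods from that node - a hedged sketch using kubectl drain (the node name is the one from the output above):

```bash
# Evict all pods from the node; with default scheduling both replicas
# become unavailable until they are rescheduled onto other nodes
kubectl drain gke-default-0c04c813-0d1z --ignore-daemonsets

# Make the node schedulable again once you are done experimenting
kubectl uncordon gke-default-0c04c813-0d1z
```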

# Inter-pod anti-affinity

To ensure that multiple pod replicas are each scheduled on a different node, you can use [inter-pod anti-affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-to-node/#affinity-and-anti-affinity) constraints. Let's add them to the existing manifest (please check the linked doc for the details of how they work):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fine-spread-app
  namespace: edu
spec:
  replicas: 2
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: fine-spread-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fine-spread-app
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - fine-spread-app
              topologyKey: kubernetes.io/hostname
      containers:
      - name: dummy
        image: k8s.gcr.io/pause:3.2
```
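
A quick note on the variant used here: preferredDuringSchedulingIgnoredDuringExecution is a soft rule - the scheduler tries to honor it but may still co-locate the replicas if it has no better option. The strict variant, requiredDuringSchedulingIgnoredDuringExecution, refuses to place a second replica onto a node that already runs one. A hedged sketch of what that would look like instead:

```yaml
      affinity:
        podAntiAffinity:
          # Hard rule: never schedule two matching pods onto the same node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/name: fine-spread-app
            topologyKey: kubernetes.io/hostname
```

The trade-off is that with the hard rule, extra replicas stay Pending when there are not enough nodes available.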

After applying the manifest, the pods get rescheduled to different nodes:

```
$ kubectl get pod -n edu -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
NAME                             NODE
fine-spread-app-856975d9d4-2ptg8   gke-default-0c04c813-0d1z
fine-spread-app-856975d9d4-dp9sg   gke-default-0c04c813-7pfr
```
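
As a side note, to see which availability zone each node belongs to, you can list the nodes with the standard zone label (recent GKE versions set topology.kubernetes.io/zone automatically):

```bash
# Show each node together with its availability zone
kubectl get nodes -L topology.kubernetes.io/zone
```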

As long as the cluster has more nodes with free resources than the requested number of replicas, this already helps the app's resiliency. This is also what the PR I mentioned in the intro was trying to address. Reviewing it, I got curious whether I could improve it further. Since the cluster spans 2 availability zones, can we also schedule each pod replica in a different AZ? Yes, we can - as the docs explain, simply change the topologyKey in the podAntiAffinity constraint to prevent the replicas from being scheduled in the same zone:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fine-spread-app
  namespace: edu
spec:
  replicas: 2
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: fine-spread-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fine-spread-app
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - fine-spread-app
              topologyKey: topology.kubernetes.io/zone
      containers:
      - name: dummy
        image: k8s.gcr.io/pause:3.2
```

For the example deployment with 2 replicas, this solution fully meets the requirement to have the pods scheduled each on a different node and in a different AZ. But what if we need more replicas, e.g. 4? The podAntiAffinity example might again end up scheduling multiple pods on the same node. Searching for a solution, my first thought was to simply add another podAffinityTerm to the podAntiAffinity spec (sketched below). However, checking the docs, I noticed there might be a better solution - a fresh new Kubernetes feature providing much more flexibility.
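
For reference, that first idea would have looked roughly like this - a hedged sketch of a podAntiAffinity with two preferred terms, one per topology key (not the approach we end up using):

```yaml
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          # Prefer a different node...
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/name: fine-spread-app
              topologyKey: kubernetes.io/hostname
          # ...and also prefer a different zone
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/name: fine-spread-app
              topologyKey: topology.kubernetes.io/zone
```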

# Pod Topology Spread Constraints

The fairly recent Kubernetes v1.19 release graduated a feature called Pod Topology Spread Constraints to stable, which lets you “control how Pods are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization.” This fits perfectly with this post's goal of better controlling the pod scheduling in our cluster.

Using Pod Topology Spread Constraints allows us to further improve the existing deployment - simply change the podAntiAffinity topologyKey back to kubernetes.io/hostname and add a topologySpreadConstraints section to the deployment's template.spec:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fine-spread-app
  namespace: edu
spec:
  replicas: 4
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: fine-spread-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fine-spread-app
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - fine-spread-app
              topologyKey: kubernetes.io/hostname
      containers:
      - name: dummy
        image: k8s.gcr.io/pause:3.2
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: fine-spread-app
```

Apply the manifest and the pods are now scheduled according to the requirements - maxSkew: 1 means the number of matching pods may differ by at most 1 between the zones (note that we have also increased the replica count to 4):

```
$ kubectl get pod -n edu -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
NAME                                 NODE
fine-spread-app-6c97884994-cptjm   gke-default-92685fea-81d3
fine-spread-app-6c97884994-f4w22   gke-default-0c04c813-rbwf
fine-spread-app-6c97884994-qr58f   gke-default-92685fea-7pfr
fine-spread-app-6c97884994-s4qj8   gke-default-0c04c813-0d1z
```

Wait, do we need to combine podAntiAffinity and topologySpreadConstraints? No!

# Multiple topologySpreadConstraints 

The feature's documentation [provides a clarifying comparison with _PodAffinity_/_PodAntiAffinity_](https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#comparison-with-podaffinity-podantiaffinity):
> * For `PodAffinity`, you can try to pack any number of Pods into qualifying topology domain(s)
> * For `PodAntiAffinity`, only one Pod can be scheduled into a single topology domain.
>
> For finer control, you can specify topology spread constraints to distribute Pods across different topology domains - to achieve either high availability or cost-saving. This can also help on rolling update workloads and scaling out replicas smoothly.

This means we can fully replace the `podAntiAffinity` spec with multiple `topologySpreadConstraints` items, as the doc further explains:
> When a Pod defines more than one topologySpreadConstraint, those constraints are ANDed: The kube-scheduler looks for a node for the incoming Pod that satisfies all the constraints.

So let's constrain:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fine-spread-app
  namespace: edu
spec:
  replicas: 4
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: fine-spread-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fine-spread-app
    spec:
      containers:
      - name: dummy
        image: k8s.gcr.io/pause:3.2
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: fine-spread-app
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: fine-spread-app
```

Apply the manifest and watch the pods being distributed just like in the previous example - each on a different node, spread across the 2 availability zones. And this is the level of pod scheduling control I wanted to demonstrate: spreading the workload to make it more resilient.
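
One knob worth calling out: both constraints above use whenUnsatisfiable: ScheduleAnyway, which makes the spread a soft preference that the scheduler may still violate. If you would rather have pods stay Pending than break the spread, the docs describe the stricter DoNotSchedule value - a sketch of the zone constraint with it:

```yaml
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        # Strict: keep pods Pending instead of violating the zone spread
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: fine-spread-app
```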

# Summary

Workload scheduling in Kubernetes is an interesting topic to explore if you want to better utilize your servers, make your workloads more resilient, or prevent unexpected surprises. I wrote this post to show you a practical example of how to control scheduling using Pod (Anti-)Affinity and Pod Topology Spread Constraints. The former has been available since the early Kubernetes versions, but sometimes it might not be flexible enough for your needs. Pod Topology Spread Constraints, available since v1.19 when the EvenPodsSpread feature gate reached the GA stage, provide much more flexibility. There is, of course, much more to explore beyond the scope of this introductory post - check the referenced Kubernetes documentation to learn more.

Let me know if you liked the article or if you'd be interested in reading more about different Kubernetes features! 👋