Finer Kubernetes pod scheduling with topology spread constraints
Reviewing a PR to upgrade ingress-nginx, my brain got stuck on Kubernetes pod scheduling. Curiosity about the best way to meet the company's resiliency requirements for the ingress controller led me to a relatively new Kubernetes feature - Pod Topology Spread Constraints. In this blog post, I'll show you an example of how to fine-tune Kubernetes scheduling with these constraints to spread your workload in a more resilient way.
Default scheduling
I run a low-traffic beta workload as a Kubernetes deployment with 2 replicas; optimizing it before going GA is still waiting in the task backlog. The deployment, with low resource requests and limits, runs in a GKE cluster with 10+ nodes in a single node pool spanning 2 availability zones. To make it more resilient, I'd like every pod replica scheduled on a different K8s node and in a different availability zone. The Kubernetes scheduler actually does a decent job out of the box, even without additional configuration. However, combined with other external factors (other workloads, node scale up/down events, preemptible node restarts, node-pool upgrades, ...), pod replicas sometimes end up scheduled on the same node in a single availability zone. Which is exactly what I don't want for a resilient workload.
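If you want to see how the nodes themselves are spread before tuning anything, one quick way (not specific to this example, just a handy command) is to list them together with their zone label:

```
# List nodes with an extra column showing their availability zone label.
$ kubectl get nodes -L topology.kubernetes.io/zone
```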
Let's go straight to the example: a GKE cluster with 2 nodes in each of 2 availability zones (4 nodes total) and our workload defined as a deployment with 2 replicas:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fine-spread-app
  namespace: edu
spec:
  replicas: 2
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: fine-spread-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fine-spread-app
    spec:
      containers:
      - name: dummy
        image: k8s.gcr.io/pause:3.2
        resources:
          requests:
            cpu: 50m
            memory: 50Mi
          limits:
            cpu: 500m
            memory: 500Mi
```
Create a namespace, apply the manifest, check the pods and, depending on how the stars align, you might be (un)lucky enough to get all replicas scheduled on the same node:
```
$ kubectl get pod -n edu -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
NAME                               NODE
fine-spread-app-7589d49465-29wr9   gke-default-0c04c813-0d1z
fine-spread-app-7589d49465-8l7v7   gke-default-0c04c813-0d1z
```
If the node `gke-default-0c04c813-0d1z` goes down, all pod replicas go down with it, leaving the deployment unavailable until they are rescheduled elsewhere. By default, this is outside your control. The good news is that you can easily prevent it.
Inter-pod anti-affinity
To make the scheduler spread the pod replicas across different nodes, you can use inter-pod anti-affinity constraints. Let's add them to the existing manifest as a soft `preferredDuringSchedulingIgnoredDuringExecution` rule (please check the [inter-pod affinity and anti-affinity documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity) for details on how it works):
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fine-spread-app
  namespace: edu
spec:
  replicas: 2
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: fine-spread-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fine-spread-app
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - fine-spread-app
              topologyKey: kubernetes.io/hostname
      containers:
      - name: dummy
        image: k8s.gcr.io/pause:3.2
```
After applying the manifest, the pods are rescheduled onto different nodes:
```
$ kubectl get pod -n edu -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
NAME                               NODE
fine-spread-app-856975d9d4-2ptg8   gke-default-0c04c813-0d1z
fine-spread-app-856975d9d4-dp9sg   gke-default-0c04c813-7pfr
```
As long as the cluster has more nodes with free resources than the requested number of replicas, this indeed improves the app's resiliency. It is also what the PR I mentioned in the intro was trying to address. Reviewing it, I got curious whether I could improve it further. Since the cluster spans 2 availability zones, can we schedule each pod replica in a different AZ? Yes, we can - as the doc explains, simply change the `topologyKey` in the `podAntiAffinity` constraint to prevent scheduling the replicas in the same zone:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fine-spread-app
  namespace: edu
spec:
  replicas: 2
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: fine-spread-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fine-spread-app
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - fine-spread-app
              topologyKey: topology.kubernetes.io/zone
      containers:
      - name: dummy
        image: k8s.gcr.io/pause:3.2
```
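Before moving on, a quick way to double-check which zone each replica landed in (a small ad-hoc snippet, not a single built-in command) is to look up the zone label of each pod's node:

```
# For every pod of the deployment, print its node together with the node's zone label.
$ for node in $(kubectl get pod -n edu -l app.kubernetes.io/name=fine-spread-app \
      -o jsonpath='{.items[*].spec.nodeName}'); do
    kubectl get node "$node" -L topology.kubernetes.io/zone --no-headers
  done
```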
For the example deployment with 2 replicas, this solution fully meets the requirement: the pods are scheduled each on a different node and in a different AZ. But what if we need more replicas, e.g. 4? The `podAntiAffinity` example might again end up scheduling multiple pods on the same node. Searching for a solution, I first thought I would simply add another `podAffinityTerm` to the `podAntiAffinity` spec. However, checking the docs, I noticed there might be a better option - a relatively new Kubernetes feature that provides much more flexibility.
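For completeness, here is roughly what that two-term `podAntiAffinity` would have looked like - a sketch of the approach I considered (the weights are illustrative), not the one I ended up using:

```yaml
# Sketch only: one soft anti-affinity term per topology level.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: fine-spread-app
        topologyKey: kubernetes.io/hostname
    - weight: 50
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: fine-spread-app
        topologyKey: topology.kubernetes.io/zone
```

It would work, but as the next section shows, the new feature gives finer control over how evenly the pods are spread.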
Pod Topology Spread Constraints
The relatively recent Kubernetes release v1.19 added a new stable feature called Pod Topology Spread Constraints to “control how Pods are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization.” This fits perfectly with this post's goal of better controlling pod scheduling in our cluster.
Using Pod Topology Spread Constraints allows us to further improve the existing deployment - simply change the `podAntiAffinity` `topologyKey` back to `kubernetes.io/hostname` and add a `topologySpreadConstraints` section to the deployment's `template.spec`:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fine-spread-app
  namespace: edu
spec:
  replicas: 4
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: fine-spread-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fine-spread-app
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - fine-spread-app
              topologyKey: kubernetes.io/hostname
      containers:
      - name: dummy
        image: k8s.gcr.io/pause:3.2
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: fine-spread-app
```
Apply the manifest and the pods are now scheduled according to the requirements (note that we have also increased the replica count to 4):
```
$ kubectl get pod -n edu -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
NAME                               NODE
fine-spread-app-6c97884994-cptjm   gke-default-92685fea-81d3
fine-spread-app-6c97884994-f4w22   gke-default-0c04c813-rbwf
fine-spread-app-6c97884994-qr58f   gke-default-92685fea-7pfr
fine-spread-app-6c97884994-s4qj8   gke-default-0c04c813-0d1z
```
Wait, do we need to combine `podAntiAffinity` and `topologySpreadConstraints`? No!
Multiple topologySpreadConstraints
The Pod Topology Spread Constraints documentation [provides a clarifying comparison with _PodAffinity_/_PodAntiAffinity_](https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#comparison-with-podaffinity-podantiaffinity):
> * For `PodAffinity`, you can try to pack any number of Pods into qualifying topology domain(s)
> * For `PodAntiAffinity`, only one Pod can be scheduled into a single topology domain.
>
> For finer control, you can specify topology spread constraints to distribute Pods across different topology domains - to achieve either high availability or cost-saving. This can also help on rolling update workloads and scaling out replicas smoothly.
This means we can fully replace the `podAntiAffinity` spec with multiple `topologySpreadConstraints` items, as the doc further explains:
> When a Pod defines more than one topologySpreadConstraint, those constraints are ANDed: The kube-scheduler looks for a node for the incoming Pod that satisfies all the constraints.
So let's add both constraints:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fine-spread-app
  namespace: edu
spec:
  replicas: 4
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: fine-spread-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fine-spread-app
    spec:
      containers:
      - name: dummy
        image: k8s.gcr.io/pause:3.2
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: fine-spread-app
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: fine-spread-app
```
Apply the manifest and watch the pods being distributed as in the previous example - each on a different node, spread across the 2 availability zones. This is the level of pod scheduling control I wanted to demonstrate: spreading the workload to make it more resilient.
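One more knob worth knowing about: the examples above use `whenUnsatisfiable: ScheduleAnyway`, which makes the spreading a soft preference. If you want the scheduler to refuse placements that would violate the skew, switch to `DoNotSchedule` - shown below as a small variation of the zone constraint above. Keep in mind that pods then stay Pending if the constraint cannot be satisfied, e.g. during a zone outage or a node-pool upgrade:

```yaml
# Hard variant: a pod that would exceed maxSkew stays Pending instead of being scheduled.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app.kubernetes.io/name: fine-spread-app
```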
Summary
Workload scheduling in Kubernetes is an interesting topic to explore if you want to better utilize your servers, make your workload more resilient, or prevent unexpected surprises. I wrote this post to show a practical example of controlling scheduling with Pod (Anti-)Affinity and Pod Topology Spread Constraints. The former has been available since the early Kubernetes versions, but sometimes it is not flexible enough for your needs. Pod Topology Spread Constraints, available since v1.19 when the `EvenPodsSpread` feature gate reached GA, give you much more flexibility. There is, of course, much more to explore beyond the scope of this introductory post - check the referenced Kubernetes documentation to learn more.
Let me know if you liked the article or if you'd like to read more about different Kubernetes features! 👋