Lab 6: HA Control Plane Design and Simulation
Time: 40 minutes
Objective: Design an HA control-plane topology and validate quorum/failure behavior in a multi-control-plane kind simulation.
The Story
A single control-plane node outage should not become a full platform outage. This lab builds operational intuition for HA behavior so you can reason quickly about quorum, failure tolerance, and recovery order before touching production clusters.
CKA Objectives Mapped
- Configure high-availability control plane concepts
- Understand etcd quorum requirements
- Validate failure domains and recovery sequencing
Background: What HA Control Plane Really Means
High availability for the Kubernetes control plane means preserving API access and control-state continuity when a control-plane node fails. The scheduler and controller manager are replicated for process availability, but etcd quorum is the hard gate for write safety and durable cluster state. Without etcd quorum, the API may still answer some reads briefly, yet the control plane cannot reliably commit state changes.
Quorum math is why odd member counts matter. With N etcd members, quorum is floor(N/2) + 1, so a 3-member control plane tolerates one failure, while a 5-member setup tolerates two. A 4-member design is worse than 3: it still tolerates only one failure while adding another member that can fail and more operational complexity, which is why production guidance favors odd counts. A 2-member design is worse still, since it tolerates no failures at all: losing either member drops below quorum.
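The quorum arithmetic can be checked with a small shell sketch. The `quorum` and `tolerated` helpers below are illustrative names, not part of etcdctl or any other tool:

```shell
# Quorum is floor(N/2) + 1; tolerated failures is N minus quorum.
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated=$(tolerated "$n")"
done
```

Note that even counts add no fault tolerance: 4 members tolerate one failure, the same as 3, and 2 members tolerate none.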
An HA API endpoint is a stable address in front of multiple API servers, typically via a load balancer and health checks. Think of it like an internal ALB or NLB target set for API servers: clients and kubelets use one endpoint, while back-end control-plane nodes can rotate during failures. This stable endpoint is what keeps kubelets and operators from reconfiguring clients every time one node is replaced.
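As a concrete sketch, a minimal HAProxy frontend for three API servers might look like the fragment below. The server names and IP addresses are hypothetical placeholders, not values from this lab:

```
frontend kube-apiserver
    bind *:6443
    mode tcp
    default_backend apiservers

backend apiservers
    mode tcp
    option tcp-check
    balance roundrobin
    # Hypothetical control-plane addresses; replace with your own.
    server cp1 10.0.0.11:6443 check
    server cp2 10.0.0.12:6443 check
    server cp3 10.0.0.13:6443 check
```

Health checks remove a failed API server from rotation, so clients keep using the single stable endpoint while back-end nodes come and go.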
When one control-plane node is down and quorum remains, existing workloads continue and new scheduling can still happen because surviving API server, scheduler, and controller-manager instances keep reconciling state. If quorum is lost, pods already running may continue temporarily because kubelet acts locally, but cluster-level operations such as new deployments, scaling, and controller updates degrade or halt. That is why recovery order starts with API and quorum validation before workload-level triage.
This lab uses kind to simulate those behaviors safely, not to replicate production HA networking and certificate distribution in full detail. The goal is operational reasoning: identify tolerated failures, verify quorum, and follow a deterministic recovery sequence. For production implementation details, use the kubeadm HA guide: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/.
Scope Note
This lab is an architecture and operations simulation.
kind helps visualize HA behavior, but it is not a production kubeadm HA deployment.
Part 1: Build a Multi-Control-Plane Cluster
Create a kind config:
cat <<'EOF' > /tmp/kind-ha.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: ha
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
EOF
Create the cluster:
kind create cluster --name ha --config /tmp/kind-ha.yaml
kubectl config use-context kind-ha
kubectl get nodes -o wide
Starter assets for this lab are in starter/:
- kind-ha-cluster.yaml
- simulate-control-plane-failure.sh
- ha-runbook-template.md
Expected:
- 3 control-plane nodes
- 2 worker nodes
Part 2: Inspect Control Plane and etcd Quorum
List control-plane pods:
kubectl -n kube-system get pods -o wide | grep -E 'etcd|kube-apiserver|kube-controller-manager|kube-scheduler'
Inspect etcd members:
for pod in $(kubectl -n kube-system get pods -l component=etcd -o name); do
echo "=== $pod ==="
kubectl -n kube-system exec "${pod#pod/}" -- sh -c '
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
member list | head -5'
done
Write down:
- number of members
- quorum requirement (floor(N/2) + 1)
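To connect the member list to the quorum requirement, you can count members directly from captured `etcdctl member list` output. The sample lines below are illustrative, not real output from this cluster:

```shell
# Hypothetical captured output: one line per etcd member.
sample='8e9e05c52164694d, started, ha-control-plane, https://...:2380, https://...:2379, false
1d2c4e6f8a0b3c5d, started, ha-control-plane2, https://...:2380, https://...:2379, false
3f5a7c9e1b2d4f6a, started, ha-control-plane3, https://...:2380, https://...:2379, false'

# grep -c . counts non-empty lines, i.e. members.
n=$(printf '%s\n' "$sample" | grep -c .)
echo "members=$n quorum=$(( n / 2 + 1 ))"
```

With 3 members this prints a quorum of 2, which is why the cluster survives the single-node failure simulated in Part 3.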
Part 3: Simulate Control-Plane Failure
Find the node containers:
docker ps --format '{{.Names}}' | grep '^ha-'
Stop one control-plane node:
docker stop ha-control-plane2
Validate that the API and workloads still function:
kubectl get nodes
kubectl create namespace ha-lab
kubectl -n ha-lab create deployment smoke --image=nginx:1.27 --replicas=2
kubectl -n ha-lab rollout status deploy/smoke
Interpretation:
- The API stayed available because 2 of 3 etcd members preserved quorum
- Existing and new workloads continue while a single control-plane node is down
Part 4: Recover Failed Control Plane
Start the failed node:
docker start ha-control-plane2
kubectl get nodes -w
Confirm that etcd and control-plane pods stabilize:
kubectl -n kube-system get pods -o wide | grep control-plane2
Part 5: HA Design Worksheet
Answer these in your notes:
- Where is the stable API endpoint in an HA kubeadm design?
- What breaks when quorum is lost?
- How many control-plane failures can this topology tolerate?
- What is your incident runbook order for one failed control-plane node?
Suggested runbook order:
- Confirm API health
- Check etcd quorum/membership
- Restore failed node or replace member
- Re-validate scheduler/controller activity
- Verify workload health and events
Verification Checklist
You are done when:
- HA cluster created with 3 control-plane nodes
- etcd member list inspected and quorum explained
- One control-plane node failure simulated and recovered
- You produced a short HA incident runbook
Reinforcement Scenarios
- 37-jerry-scheduler-missing
- jerry-node-notready-kubelet
Cleanup
kubectl config use-context kind-lab || true
kind delete cluster --name ha
rm -f /tmp/kind-ha.yaml
