Public Post-Mortem — Kubernetes Production Incident

Incident summary

Our production Kubernetes cluster (EKS) went down while we were installing the Calico CNI plugin.

Note: CNI stands for Container Network Interface.

The event was triggered by an attempt to install the CNI plugin “Calico” on top of the running Kubernetes cluster.

The event was detected by observing the behavior of the ingress controller, which routes all customer requests.

This critical incident affected all users except those who were accessing the portal with read access and were served from their browser cache.

Impact

This incident impacted all external users, but not all internal users, as some internal users communicate with the cluster through the kubectl client.

Response & Recovery

  • We deleted all the CRDs (Custom Resource Definitions) that had been added while installing Calico.
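
The cleanup step above can be sketched as follows. This is a minimal illustration, not the exact commands we ran: it assumes the CRDs to remove are the ones in Calico's `crd.projectcalico.org` API group, and the helper name `calico_crd_delete_commands` is hypothetical. It only builds the `kubectl delete crd` commands from a list of CRD names (for example the output of `kubectl get crd -o name`), so they can be reviewed before anything is executed.

```python
# Assumption: the CRDs added by the Calico install belong to this API group.
CALICO_CRD_GROUP = "crd.projectcalico.org"


def calico_crd_delete_commands(crd_names: list[str]) -> list[list[str]]:
    """Build one `kubectl delete crd <name>` command per Calico CRD.

    `crd_names` may contain bare names or the "<resource>/<name>" form
    printed by `kubectl get crd -o name`.
    """
    commands = []
    for raw in crd_names:
        # Drop surrounding whitespace and any "<resource>/" prefix.
        name = raw.strip().rsplit("/", 1)[-1]
        if name.endswith("." + CALICO_CRD_GROUP):
            commands.append(["kubectl", "delete", "crd", name])
    return commands


if __name__ == "__main__":
    sample = [
        "customresourcedefinition.apiextensions.k8s.io/felixconfigurations.crd.projectcalico.org",
        "certificates.cert-manager.io",
    ]
    for command in calico_crd_delete_commands(sample):
        print(" ".join(command))
```

Keep in mind that deleting a CRD also deletes every custom resource of that kind, so reviewing the generated list first is worth the extra minute.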

Root cause

By applying the CNI plugin, we added some CRDs which block visibility among namespaces. Since the Ingress controller has to forward requests to a specific service regardless of its namespace, it was no longer able to get a response from any service.

What went well?

  • The cluster infrastructure is versioned and reproducible thanks to Git ❤️ Terraform.

What went badly?

  • We detected the issue only after roughly 1 hour and 30 minutes.
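
Detection relied on someone noticing how the ingress controller behaved. A minimal sketch of automating that observation is shown below, assuming a health URL that is served through the ingress. The function name `ingress_is_healthy` and the URL `https://portal.example.com/healthz` are placeholders for illustration, not our real endpoint or tooling.

```python
import urllib.request


def ingress_is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if `url` answers with a 2xx/3xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError (DNS failure, refused connection), HTTPError (4xx/5xx)
        # and socket timeouts are all subclasses of OSError.
        return False


if __name__ == "__main__":
    # Hypothetical endpoint routed through the ingress controller.
    url = "https://portal.example.com/healthz"
    print("healthy" if ingress_is_healthy(url) else "UNHEALTHY")
```

Run on a schedule from outside the cluster, a check like this would exercise the same path as customer requests.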

Lessons learned

  • By k8s design, you should have one and only one cluster: any computing resource must join this single cluster. However, this incident taught us that there are use cases where running multiple clusters makes sense. Here, a second cluster can serve as a staging environment where we make sure that all software versions are compatible (Docker version, kubelet version, ingress controller version, ...). Once we have established confidence, we can reproduce the change on the production cluster. This is not a big deal for us, as the cluster is provisioned by Terraform. We just need to externalize the cluster name as a Terraform variable and switch between two values: prod and staging. That’s it!
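
The prod/staging switch described above could be wrapped in a small helper like the one below. This is only a sketch under stated assumptions: the Terraform variable is assumed to be named `cluster_name`, the helper `terraform_apply_command` is hypothetical, and each environment is assumed to have its own Terraform state (otherwise changing the name would make Terraform plan a replacement of the existing cluster).

```python
# Assumption: the only two environments are the ones named in this post.
ALLOWED_ENVIRONMENTS = ("prod", "staging")


def terraform_apply_command(environment: str) -> list[str]:
    """Build the `terraform apply` command for one environment.

    Rejects anything other than "prod" or "staging" so that a typo
    cannot provision an unexpected cluster.
    """
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(
            f"unknown environment {environment!r}; expected one of {ALLOWED_ENVIRONMENTS}"
        )
    # `cluster_name` is a hypothetical variable name declared in the Terraform code.
    return ["terraform", "apply", "-var", f"cluster_name={environment}"]


if __name__ == "__main__":
    # Try the change on staging first, then repeat it on prod.
    for env in ("staging", "prod"):
        print(" ".join(terraform_apply_command(env)))
```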

Bonus:

Udemy course about running EKS in production: https://www.udemy.com/course/aws-eks-kubernetes

Software engineer, Cloud Architect, 5/5 AWS|GCP|PSM Certified, Owner of kubernetes.tn
