Public Post-Mortem — Kubernetes Production Incident

Incident summary

Our production Kubernetes cluster (EKS) went down while we were installing the Calico CNI plugin.

Impact

This incident impacted all external users. Not all internal users were affected, because some of them communicate with the cluster through the kubectl client.

Response & Recovery

  • Delete all CRDs (Custom Resource Definitions) that were added while installing Calico.
  • Delete the whole Ingress Controller.
  • Install the Ingress Controller Helm chart again.
  • Update the DNS record so that the wildcard domain name points to the new load balancer (bound to the Ingress Controller).
  • Re-run the last successful Terraform plan.
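
The first step above can be sketched in Python. This is only an illustration, not the exact commands we ran: it assumes that Calico registers its CRDs under the `crd.projectcalico.org` API group, and `calico_crds` and `delete_commands` are hypothetical helper names. The script only builds the `kubectl delete` commands from the text printed by `kubectl get crd -o name`, so they can be reviewed before anything is executed.

```python
# Assumption: Calico registers its CRDs under this API group suffix.
CALICO_GROUP_SUFFIX = ".crd.projectcalico.org"


def calico_crds(kubectl_output: str) -> list[str]:
    """Pick the Calico CRDs out of the text printed by `kubectl get crd -o name`."""
    names = []
    for line in kubectl_output.splitlines():
        line = line.strip()
        if line.endswith(CALICO_GROUP_SUFFIX):
            names.append(line)
    return names


def delete_commands(crds: list[str]) -> list[list[str]]:
    """Build one `kubectl delete` argv per CRD. Nothing is executed here."""
    return [["kubectl", "delete", name] for name in crds]


# Hypothetical sample of `kubectl get crd -o name` output.
sample = """\
customresourcedefinition.apiextensions.k8s.io/ippools.crd.projectcalico.org
customresourcedefinition.apiextensions.k8s.io/certificates.cert-manager.io
customresourcedefinition.apiextensions.k8s.io/networkpolicies.crd.projectcalico.org
"""

for argv in delete_commands(calico_crds(sample)):
    print(" ".join(argv))
```

Printing the commands first, instead of running them directly, leaves room for a quick review before deleting cluster-wide resources.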

Root cause

By applying the CNI plugin, we added some CRDs that block visibility between namespaces. Since the Ingress Controller has to forward requests to a specific service regardless of its namespace, it was no longer able to get a response from any service.
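
To make this failure mode concrete, here is a minimal sketch of a standard Kubernetes NetworkPolicy that would explicitly allow traffic from the Ingress Controller's namespace into an application namespace. It is not what we applied during the incident (we removed the Calico CRDs instead). The namespace names `ingress-nginx` and `my-app` and the helper `allow_from_namespace` are assumptions made for illustration.

```python
import json


def allow_from_namespace(target_ns: str, source_ns: str) -> dict:
    """Build a NetworkPolicy manifest allowing ingress into target_ns from source_ns."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-from-{source_ns}", "namespace": target_ns},
        "spec": {
            # An empty podSelector selects every pod in target_ns.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [
                        {
                            "namespaceSelector": {
                                # Recent Kubernetes versions set this label
                                # automatically on every namespace.
                                "matchLabels": {
                                    "kubernetes.io/metadata.name": source_ns
                                }
                            }
                        }
                    ]
                }
            ],
        },
    }


# Hypothetical namespace names, for illustration only.
policy = allow_from_namespace("my-app", "ingress-nginx")
print(json.dumps(policy, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed manifest could be saved to a file and applied with `kubectl apply -f`.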

What went well?

  • The cluster infrastructure is versioned and reproducible thanks to Git ❤️ Terraform.
  • All k8s applications, including the Ingress Controller, are versioned and reproducible thanks to Git ❤️ Helm.

What went badly?

  • We detected the issue only after about 1 hour and 30 minutes.
  • We detected the issue passively rather than through an alert.

Lessons learned

  • Our assumption was that, by k8s design, you should have one and only one cluster, and any computing resource must join this single cluster. However, this incident taught us that there are use cases where running multiple clusters makes sense. In our case, a second cluster can serve as a staging environment where we make sure that all software versions are compatible (Docker version, kubelet version, Ingress Controller version, and so on). Once we have established confidence, we can reproduce the change on the production cluster. This is not a big deal for us, as the cluster is provisioned by Terraform. We just need to externalize the cluster name as a Terraform variable and switch between two values: prod and staging. That’s it!
  • If you codify 99% of your operations, you are not safe. Codify 100% of your operations. This is not a waste of time; it is how you establish reliability and become a rock-solid SRE.
  • The monitoring system must be reactive and, better still, proactive. We have Prometheus, but so far we have only “tasted” the Grafana dashboards. We have to configure alerts on top of it.
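
The alerting gap in the last point can be illustrated with a small sketch. It reads the JSON returned by the Prometheus HTTP query API (`/api/v1/query`) for the built-in `up` metric and lists the scrape targets that are down. The function name `down_targets`, the job names and the sample payload are hypothetical. In a real setup, this logic would more likely live in a Prometheus alerting rule routed through Alertmanager than in a script.

```python
def down_targets(response: dict) -> list[str]:
    """Return "job/instance" for every target whose `up` sample is 0."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query failed")
    down = []
    for series in response["data"]["result"]:
        # An instant-vector sample has the shape [unix_timestamp, "value"].
        _timestamp, value = series["value"]
        if float(value) == 0.0:
            labels = series["metric"]
            down.append(f'{labels.get("job", "?")}/{labels.get("instance", "?")}')
    return down


# Hypothetical response to the query `up`, for illustration only.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "up", "job": "ingress-controller",
                           "instance": "10.0.1.12:10254"},
                "value": [1700000000.0, "0"],
            },
            {
                "metric": {"__name__": "up", "job": "kubelet",
                           "instance": "10.0.1.5:10250"},
                "value": [1700000000.0, "1"],
            },
        ],
    },
}

print(down_targets(sample))
```

Any non-empty result here is a condition that should notify someone, instead of waiting for a person to look at a Grafana dashboard.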

Bonus:

Udemy course about running EKS in production: https://www.udemy.com/course/aws-eks-kubernetes

Software engineer, Cloud Architect, 5/5 AWS|GCP|PSM Certified, Owner of kubernetes.tn
