Public Post-Mortem — Kubernetes Production Incident

Incident summary

Impact

Response & Recovery

  • Delete all CRDs (Custom Resource Definitions) that were added when installing Calico.
  • Delete the whole ingress controller.
  • Install the Helm chart of the ingress controller again.
  • Update the DNS record so that the wildcard domain name points to the new load balancer (bound to the ingress controller).
  • Re-run the last successful Terraform plan.
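The recovery steps above could be captured as a dry-run script so that they are codified, not typed by hand. This is only a sketch under assumptions: the release name, chart reference, and namespace are hypothetical placeholders, the Calico CRDs are identified by the `crd.projectcalico.org` API group, and "re-run the last successful Terraform plan" is interpreted as a fresh `terraform plan` followed by `terraform apply`. The DNS update depends on the DNS provider and is left as a manual step.

```python
import subprocess

# Hypothetical names, not taken from the incident: adjust to your setup.
RELEASE = "ingress"
CHART = "example-repo/ingress-controller"
NAMESPACE = "ingress"

# Calico registers its CRDs under this API group.
CALICO_CRD_SUFFIX = ".crd.projectcalico.org"


def calico_crds(crd_names):
    """Keep only the CRD names that belong to Calico's API group."""
    return [name for name in crd_names if name.endswith(CALICO_CRD_SUFFIX)]


def recovery_plan(crd_names):
    """Return the recovery steps, in order, as a list of argv lists."""
    plan = [["kubectl", "delete", "crd", name] for name in calico_crds(crd_names)]
    plan.append(["helm", "uninstall", RELEASE, "--namespace", NAMESPACE])
    plan.append(["helm", "install", RELEASE, CHART, "--namespace", NAMESPACE])
    # The DNS record update is provider-specific and stays manual here.
    plan.append(["terraform", "plan", "-out=recovery.tfplan"])
    plan.append(["terraform", "apply", "recovery.tfplan"])
    return plan


def run(plan, dry_run=True):
    """Print every command; execute it only when dry_run is False."""
    for argv in plan:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Example input; in practice the names would come from `kubectl get crd`.
    run(recovery_plan(["ippools.crd.projectcalico.org"]), dry_run=True)
```

Keeping the plan as plain data makes it easy to review the commands before anything is executed against the cluster.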

Root cause

What went well?

  • The cluster infrastructure is versioned and reproducible thanks to Git ❤️ Terraform.
  • All k8s applications, including the ingress controller, are versioned and reproducible thanks to Git ❤️ Helm.

What went badly?

  • The issue was detected only after roughly 1 hour and 30 minutes.
  • The issue was detected passively, not through an alert.
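An active check is one way to shorten this detection time. The following is a minimal sketch, not what was running during the incident: it probes a URL served through the ingress controller and raises an alert once several probes in a row have failed. The URL, the threshold, and the function names are hypothetical.

```python
import urllib.error
import urllib.request

# Hypothetical endpoint exposed through the ingress controller.
HEALTH_URL = "https://app.example.com/healthz"


def probe(url, timeout=5.0):
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as err:
        # 4xx still proves that the ingress path is reachable.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar errors.
        return False


def should_alert(results, threshold=3):
    """Alert once the last `threshold` probe results are all failures."""
    recent = results[-threshold:]
    return len(recent) == threshold and not any(recent)
```

A scheduler would call `probe(HEALTH_URL)` at a fixed interval, append each result to a list, and notify someone when `should_alert` returns True. Requiring several consecutive failures avoids paging on a single dropped request.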

Lessons learned

  • By k8s design, you should have one and only one cluster, and any computing resource must join this single cluster. However, this incident taught us that there are use cases for running multiple clusters. In our case, a second cluster can serve as a staging environment where we make sure that all software versions are compatible (Docker version, kubelet version, ingress controller version, …). Once we have established confidence, we can reproduce the change on the production cluster. This is not a big deal for us, as the cluster is provisioned by Terraform. We just need to externalize the cluster name as a Terraform variable and switch between two values: prod and staging. That’s it!
  • If you codify 99% of your operations, you are not safe. Codify 100% of your operations. This is not a waste of time; it is how you establish reliability and become an SRE like a hammer.
  • The monitoring system must be reactive and, even better, proactive. We have Prometheus, but so far we have only “tasted” Grafana dashboards. We have to configure alerts on top of it.
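The staging/prod switch from the first lesson could look like the sketch below. It assumes a Terraform variable named `cluster_name` (a hypothetical name) and, as an addition that is not in the original setup, one Terraform workspace per environment so that each cluster keeps its own state. Without separate state, applying with a different cluster name would change the existing cluster instead of creating a second one.

```python
# Order matters: changes are tried on staging before prod.
ENVIRONMENTS = ("staging", "prod")


def rollout_commands(environment):
    """Return the Terraform commands for one environment as argv lists."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    return [
        # Assumption: a workspace with the same name as the environment exists.
        ["terraform", "workspace", "select", environment],
        # `cluster_name` is a hypothetical variable name.
        ["terraform", "apply", f"-var=cluster_name={environment}"],
    ]


if __name__ == "__main__":
    for env in ENVIRONMENTS:
        for argv in rollout_commands(env):
            print(" ".join(argv))
```

Printing the commands instead of running them keeps the sketch safe to execute; a real rollout would pause after staging until the versions are confirmed compatible.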

Bonus:

Software engineer, Cloud Architect, 5/5 AWS|GCP|PSM Certified, Owner of kubernetes.tn
