Public Post-Mortem — Kubernetes Production Incident

Incident summary

Note: CNI stands for container network interface.

The event was triggered by an attempt to install the CNI plugin “Calico” on top the running kubernetes cluster.

The event was detected by observing the behavior of the ingress controller which is the router of all and all customer requests.

This critical incident affected all users except whose accessing the portal with read access and browser cache.

Impact

Response & Recovery

  • Delete the whole ingress Controller
  • Install the Helm Chart of the Ingress Controller again.
  • Update the DNS record to point the wildcard domain name to the new Loadbalancer (bound to the ingress controller).
  • Re-Run the last successful Terraform Plan

Root cause

What went well?

  • All k8s Applications, including the ingress controller, are versioned and reproducible thanks to Git ❤️ Helm.

What went badly?

  • Detecting the issue passively.

Lessons learned

  • If you codify 99% of your operations , you are not safe. Codify 100% of all your operations. This is not a waste of time , however ,this is how you establish reliability & become an SRE like a hammer.
  • Monitoring system must be reactive and better, proactive. We have Prometheus, however, we just used to “taste” Grafana dashboards. We have to configure our alerts on top of it.

Bonus:

Software engineer, Cloud Architect, 5/5 AWS|GCP|PSM Certified, Owner of kubernetes.tn

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store