May 28, 2020 was a hard day indeed. The well-known public container registry, quay.io, was down for several hours.
The same day, we decided to upgrade the EKS Kubernetes cluster from 1.14 to 1.15.
While we were rolling out the upgrade by terminating some worker nodes and letting the Auto Scaling group spin up new nodes, the incident occurred.
The main router of all user requests (the Ingress Controller) could not run on any node; its pods failed with the error ImagePullBackOff.
After some investigation, I realized that none of the images hosted on quay.io could be pulled.
I checked quay.io, and it was a disaster: quay.io was down.
I kept watching for updates on status.quay.io, but nothing looked good.
In fact, this was the longest outage we had encountered with a public cloud-native service.
I was lucky to be running another cluster with the same ingress controller (router), but with a later version.
1. In the intact cluster, I checked which node the ingress-controller pod was running on:
k -n kube-system get pod -o wide | grep ingress-controller
# -o wide will show the Node name where the pod is running
ingress-nginx-ingress-controller-b6544fd67-9c8nc 1/1 Running 0 21m 10.0.5.196 ip-10-0-5-230.ap-southeast-1.compute.internal <none> <none>
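For scripting this lookup, the node name can be cut out of the same `-o wide` output. This is only a minimal sketch: `node_of` is a hypothetical helper, and it assumes the node name is the seventh whitespace-separated field, which holds for the output shown above but may shift with other kubectl versions or flags.

```shell
# Hypothetical helper: print the NODE column for pods whose name contains $1.
# Reads `kubectl get pod -o wide` output on stdin.
# Assumption: NODE is the 7th whitespace-separated field, as in the output above.
node_of() {
  awk -v name="$1" 'index($1, name) > 0 { print $7 }'
}

# Usage (k is the kubectl alias used in this post):
#   k -n kube-system get pod -o wide | node_of ingress-controller
```

Matching on the pod name with `index` rather than a regular expression avoids surprises if the search string contains characters such as dots.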
2. I logged in to that node and re-tagged the ingress-controller image, which was still in the node's local Docker cache:
ssh ip-10-0-5-230
docker tag quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.30.0 abdennour/nginx-ingress-controller:0.30.0
3. I pushed the new image tag to my Docker Hub account (docker.io):
docker login ...
docker push abdennour/nginx-ingress-controller:0.30.0
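Steps 2 and 3 can be folded into a small script if several cached images need the same treatment. The sketch below is an assumption-laden illustration, not what I ran during the incident: `mirror_ref` and `mirror_image` are hypothetical helper names, the image must already be in the node's local Docker cache, and `docker login` must have been done beforehand.

```shell
# Hypothetical helper: keep the last path component (name:tag) of an image
# reference and prefix it with a Docker Hub namespace.
mirror_ref() {
  src="$1"
  namespace="$2"
  printf '%s/%s\n' "$namespace" "${src##*/}"
}

# Hypothetical helper: tag a locally cached image under the mirror name, then push it.
# Assumes the image is in the local Docker cache and `docker login` was already run.
mirror_image() {
  dst="$(mirror_ref "$1" "$2")"
  docker tag "$1" "$dst" && docker push "$dst"
}

# Usage:
#   mirror_image quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.30.0 abdennour
```

Keeping the name and tag unchanged means only the repository prefix has to change in the chart values.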
4. I updated the Helm chart values (./cluster-plugins/charts/values/nginx-ingress.yaml) to use the new image (abdennour/nginx-ingress-controller) instead of the quay.io image, which I commented out:
# repository: quay.io/kubernetes-ingress-controller/nginx-ingress-controller
5. I upgraded the Helm release:
docker-compose run --rm helm3 -n kube-system
docker-compose run --rm helm3 \
-n kube-system upgrade ingress stable/nginx-ingress \
--version 1.34.2 \
I got some errors, so:
- I deleted the whole Helm release of the ingress controller,
- and I reinstalled the Helm chart.
- A new ELB was created, so I updated the DNS record of the wildcard domain (*.company.com) to point to the new ELB's DNS name.
After treating the ingress controller as cattle instead of a pet, all web applications are reachable by users again.
I am still having issues with the other quay.io images.
But I am now satisfied, as the business is alive again.
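To see which other quay.io images a cluster still depends on, the image list can be filtered as sketched below. `quay_images` is a hypothetical helper; the kubectl command in the usage comment lists only regular containers, so images used by init containers would need a separate query.

```shell
# Hypothetical helper: read image references on stdin (separated by spaces or
# newlines) and print the distinct ones hosted on quay.io, one per line.
quay_images() {
  tr -s '[:space:]' '\n' | grep '^quay\.io/' | sort -u
}

# Usage:
#   kubectl get pods --all-namespaces \
#     -o jsonpath='{.items[*].spec.containers[*].image}' | quay_images
```

Each image in the resulting list is a candidate for the same re-tag and push workaround, as long as some node still has it cached.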