quay.io is down, my Kubernetes cluster cannot pull images

Overview

May 28, 2020 was a hard day. The well-known public container registry, quay.io, was down for several hours.

The same day, we had decided to upgrade our EKS Kubernetes cluster from 1.14 to 1.15.

While we were rolling out the upgrade by terminating some worker nodes and letting the Auto Scaling group spin up new ones, the incident occurred.

The main router for all user requests (the ingress controller) could not start on any node: its pods failed with the error ImagePullBackOff.

Root Cause

After some investigation, I realized that no pod whose image is hosted on quay.io could pull that image.
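One way to speed up that kind of investigation is to list every pod stuck on an image pull. The following is a minimal sketch, not the exact commands used during the incident: failing_pulls is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and prints
# namespace/name for every pod whose STATUS is an image-pull failure.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
failing_pulls() {
  awk 'NR > 1 && ($4 == "ImagePullBackOff" || $4 == "ErrImagePull") {
    print $1 "/" $2
  }'
}

# Usage against a live cluster:
#   kubectl get pods -A | failing_pulls
```

Running `kubectl describe pod` on one of the results should then show the registry error in the pod's events.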

I checked quay.io, and it was a disaster: quay.io was down.

I kept watching for updates on status.quay.io, but nothing looked good.

In fact, this was the longest outage we had ever encountered with a public cloud-native service.

Workaround

Luckily, I was running another cluster with the same ingress controller (router), but on a later version.

1. In the intact cluster, I checked which node the ingress-controller pod was running on (k is an alias for kubectl):
k -n kube-system get pod -o wide | grep ingress-controller
# -o wide will show the Node name where the pod is running
ingress-nginx-ingress-controller-b6544fd67-9c8nc 1/1 Running 0 21m 10.0.5.196 ip-10-0-5-230.ap-southeast-1.compute.internal <none> <none>

2. I logged in to that node and re-tagged the ingress-controller image under my own repository:

ssh ip-10-0-5-230$ docker tag quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.30.0 abdennour/nginx-ingress-controller:0.30.0

3. I pushed the new image tag to my account on Docker Hub (docker.io):

docker login ...
docker push abdennour/nginx-ingress-controller:0.30.0

4. I updated the Helm chart values (./cluster-plugins/charts/values/nginx-ingress.yaml) to use the new image instead of the quay.io image:

rbac:
  create: true
controller:
  image:
    # repository: quay.io/kubernetes-ingress-controller/nginx-ingress-controller
    repository: abdennour/nginx-ingress-controller
    tag: "0.30.0"

5. I upgraded the Helm release:

docker-compose run --rm helm3 \
  -n kube-system upgrade ingress stable/nginx-ingress \
  --version 1.34.2 \
  -f cluster-plugins/charts/values/nginx-ingress.yaml
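Steps 2 and 3 can be wrapped in a small script so the same rescue works for other images. This is only a sketch under assumptions: mirror_ref and mirror_image are hypothetical helper names, the image must already be in the node's local Docker cache, and `docker login` must already have been run for the target account.

```shell
# Hypothetical helper: map a source image reference to the same name:tag
# under another account (everything up to the last "/" is dropped).
mirror_ref() {
  src=$1        # e.g. quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.30.0
  namespace=$2  # e.g. abdennour (a Docker Hub account)
  printf '%s/%s\n' "$namespace" "${src##*/}"
}

# Hypothetical helper: re-tag a locally cached image and push the new tag.
# Assumes the image is already in the local Docker cache and that
# `docker login` has been run for the target account.
mirror_image() {
  dst=$(mirror_ref "$1" "$2") || return 1
  docker tag "$1" "$dst" && docker push "$dst"
}

# Usage, on a node that still has the image cached:
#   mirror_image quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.30.0 abdennour
```

The sketch only handles references of the form name:tag. An image pinned by digest would need different handling.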

Results

I got some errors, so:

  • I deleted the whole Helm release of the ingress controller,
  • I reinstalled the Helm chart from scratch,
  • and, since a new ELB had been created, I updated the DNS record of the wildcard domain (*.company.com) to point to the new ELB DNS name.
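The DNS switch in the last step can be scripted as well. The sketch below assumes the zone is hosted in Route 53, which the article does not state. cname_change_batch is a hypothetical helper, and the service name and zone ID in the usage comments are placeholders.

```shell
# Hypothetical helper: print a Route 53 change batch that points a record
# at a load balancer hostname via a CNAME UPSERT.
# Assumption: the DNS zone lives in Route 53 (not stated in the article).
cname_change_batch() {
  record=$1   # e.g. *.company.com
  target=$2   # the DNS name of the new ELB
  cat <<EOF
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "$record",
      "Type": "CNAME",
      "TTL": 60,
      "ResourceRecords": [{ "Value": "$target" }]
    }
  }]
}
EOF
}

# Usage (service name and zone ID are placeholders):
#   elb=$(kubectl -n kube-system get svc <ingress-service> \
#         -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
#   cname_change_batch '*.company.com' "$elb" > change.json
#   aws route53 change-resource-record-sets \
#     --hosted-zone-id <zone-id> --change-batch file://change.json
```

A short TTL is used here so that a later switch takes effect quickly. That value is an illustrative choice, not something taken from the incident.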

After treating the ingress controller as cattle rather than a pet, all web applications are now reachable by users again.

I am still having issues with the other quay.io images.
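To see how many other workloads are exposed, it helps to list every quay.io image the cluster currently references. This is a rough sketch with a hypothetical helper name (quay_images). The jsonpath query in the usage comment only covers regular containers, not init containers.

```shell
# Hypothetical helper: reads whitespace-separated image references on stdin
# and prints the unique ones hosted on quay.io, one per line.
quay_images() {
  tr -s '[:space:]' '\n' | grep '^quay\.io/' | sort -u
}

# Usage against a live cluster:
#   kubectl get pods --all-namespaces \
#     -o jsonpath='{.items[*].spec.containers[*].image}' | quay_images
```

Each image in that list is a candidate for the same re-tag and push workaround, as long as some node still has it cached.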

But I am satisfied for now, as the business is alive again.

Software engineer, Cloud Architect, 5/5 AWS|GCP|PSM Certified, Owner of kubernetes.tn
