Managed GKE control-plane failure resulting in platform outage on 10th May, 2022

By Karolis Rusenas · May 13, 2022

On May 10th, 2022, Webhook Relay suffered a major outage due to a failure of the underlying Google Cloud GKE container platform, which hosts our main region where the controllers and backend APIs reside. We have been using GKE for a long time (since 2017) and most of the time it just works.

A rough timeline of the outage (all times EST):

  • 12:00 The healthcheck system flares up: backend APIs, the frontend and tunnels in the EU region are down. We begin mitigating the outage.
  • 12:10 All information in the Kubernetes cluster seems inconsistent; it looks like the nodes are either down or in the middle of an upgrade. We have seen this before, a few years ago, when an automated maintenance upgrade on GKE went bad, however this is not a maintenance window.
  • 12:20 Node pool recycling is taking a long time, the cluster is timing out and operations fail multiple times before they succeed. The root cause is still unclear.
  • 12:40 We find the root cause: the API server certificates. We start a manual certificate rotation.
  • 13:20 During the certificate rotation the nodes should be recreated, however nothing happens in our GKE cluster, so we recreate the node pools manually again (see the sketch after this timeline).
  • 13:50 Workloads start running, however service-account-based authentication inside the cluster is failing: backend services cannot connect to the database, and the object storage and pub/sub services aren't functional.
  • 14:30 By now we are already duplicating the main cluster's services into a newly created cluster. The new cluster is also having hiccups: its control plane becomes unresponsive every few minutes.
  • 15:00 It is not possible to connect to the Clickhouse database where we store the webhook logs, as the port-forward commands fail. Logs are not available through the Kubernetes API either.
  • 15:30 The main GKE cluster is still in a crippled state; the certificate rotation, while reported as successful, doesn't seem to be effective. The GKE console on GCP works, so it is possible to see logs and at least update the deployments.
  • 16:00 Most of the services are already running in the duplicated cluster, however since we rely on load balancers and PVCs that are still attached to the primary cluster, we need to detach them from there first.
  • 16:50 We manage to delete the Kubernetes Service records from the primary cluster, which allows us to attach them to the new cluster.
  • 17:10 The new backend is running, however webhook logs are going to a fresh disk.
  • 17:30 The StatefulSet is detached as well, and all writes are switched back to the main disk.
  • 18:30 The temporary Clickhouse instance's data is exported and reinserted into the main instance.
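For reference, recycling a node pool by hand is roughly a matter of creating a replacement pool and then removing the old one once workloads have rescheduled. A rough sketch of the commands involved, with placeholder pool names and the cluster name elided as elsewhere in this post:

# Create a replacement pool for the workloads to reschedule onto
gcloud container node-pools create recovery-pool \
    --cluster <cluster name> --zone europe-west1-d --num-nodes 3

# Once workloads have rescheduled onto the new pool, remove the old one
gcloud container node-pools delete default-pool \
    --cluster <cluster name> --zone europe-west1-d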

The initial alert

After receiving the first healthcheck notification that the service went down, we thought the request had potentially just been interrupted on the client side, or had hit a server that was evicted due to a memory or CPU utilization issue. We knew that these errors are transient and that recovery should be quick and automatic. However, after checking the main dashboard and seeing it offline, we immediately started mitigating the outage.

Troubleshooting

The main problem with this outage was the lack of information provided by the GKE console. Things didn't look completely right, but they didn't look obviously wrong either. What did work:

  • Ability to see, create and manage node pools. Nodes were all reaching a healthy state and starting the workloads. However, all operations were very slow.
  • Ability to edit deployments, view logs
  • Ability to upgrade control plane version

What didn’t work:

  • kubectl was able to list and retrieve objects, however it couldn’t view logs or do port forwarding
  • kubectl couldn’t update objects

So the cluster was running and the workloads were running, however some of the workload information being returned was definitely stale, and the running workloads were having trouble accessing various helper services that Google Cloud provides. It took us a while to pinpoint the problem: the Kubernetes API server certificates. On older GKE clusters, certificates used to be issued with a 5-year expiration (clusters now come with a 30-year expiration), and the certificate rotation had never happened. You can manually start the rotation with:

gcloud container clusters update <cluster name> --zone europe-west1-d --start-credential-rotation
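Starting the rotation is only the first step, though: the documented flow is to start the rotation, recreate (or upgrade) the node pools so the nodes pick up the new credentials, and then explicitly complete it. Roughly, with a placeholder node pool name:

# Recreate the nodes, e.g. by upgrading the pool in place
gcloud container clusters upgrade <cluster name> --zone europe-west1-d \
    --node-pool default-pool

# Finish the rotation; the old credentials stop working after this
gcloud container clusters update <cluster name> --zone europe-west1-d \
    --complete-credential-rotation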

Unfortunately the rotation, while reported as successful, didn't actually seem to do anything. We also found a useful command which shows the certificate validity of a GKE cluster:

gcloud container clusters describe <cluster name> --zone europe-north1-a \
    --format "value(masterAuth.clusterCaCertificate)" \
    | base64 --decode \
    | openssl x509 -text \
    | grep Validity -A 2
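The part to look at is the validity window. The output has this shape (the dates below are made up for illustration; a cluster created in 2017 with a 5-year CA certificate would show a Not After date in 2022):

        Validity
            Not Before: May 10 12:00:00 2017 GMT
            Not After : May 10 12:00:00 2022 GMT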

Recovery

GKE has multiple availability zones that increase your application's reliability in case one of the zones fails. However, in this case that didn't matter :)

Manual certificate rotation didn't help. Upgrading the control plane also didn't help. The pods were running, but the backing services were not accessible: backend services could not connect to Cloud SQL, Pub/Sub or GCS. Once we noticed the authentication errors, we started moving services to a new cluster.
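The duplication itself was mostly a matter of pointing kubectl at both clusters and re-applying manifests. A simplified sketch with placeholder cluster, context and resource names; in practice the exported YAML also needs cluster-specific fields (clusterIP, resourceVersion and similar) stripped before it applies cleanly:

# Fetch credentials so kubectl has a context for each cluster
gcloud container clusters get-credentials <old cluster> --zone europe-west1-d
gcloud container clusters get-credentials <new cluster> --zone europe-west1-d

# Dump workloads and configuration from the crippled cluster...
kubectl --context <old context> -n default \
    get deploy,statefulset,svc,configmap,secret -o yaml > workloads.yaml

# ...and re-apply them in the new cluster
kubectl --context <new context> apply -f workloads.yaml

# A Service holding a reserved load balancer IP can only bind it in one
# cluster at a time, which is why it had to be deleted from the primary
# cluster first (see the 16:50 timeline entry)
kubectl --context <old context> delete svc <public service>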

The length of the downtime

The downtime would have been much shorter if we had known that the managed GKE cluster was totalled. It always seemed as if it was about to start working again. Had we known the real cluster state, we would have made different choices: while a complete migration to a new cluster is not without its issues (moving persistent disks, detaching load balancers), it would have been a lot quicker.

Lessons learnt

The main lesson here for us was that it's important to time-box the recovery of existing infrastructure. While throughout the outage it looked like the services were about to start working, we should have pulled the plug on the cluster much sooner and started from scratch with a new one. The sunk-cost fallacy made us reluctant to ditch the salvage efforts, even though ditching them would have been the right call.

As part of the work during the outage, we improved our deployment manifests so that we can quickly switch between clusters without a complicated persistent disk data migration.
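For example, a pre-provisioned PersistentVolume that points at an existing Google Cloud persistent disk by name lets a new cluster mount the same data without copying it. A minimal sketch, assuming the legacy in-tree GCE PD volume type and made-up disk, volume and claim names:

# Hypothetical names; the disk itself must already exist in the GCP project
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: clickhouse-data
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: standard
  gcePersistentDisk:
    pdName: clickhouse-data-disk   # existing disk in the GCP project
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clickhouse-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  volumeName: clickhouse-data      # bind to the pre-provisioned volume above
  resources:
    requests:
      storage: 100Gi
EOF

The workloads then mount the claim as usual, so attaching the same data to a freshly created cluster becomes a matter of applying the same manifests there.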