You can do it calmly; relax!

How to debug your Kubernetes cluster without going mad!

Mark McWiggins

--

I’ve been working with Kubernetes for several months and at times thought I was going mad myself! But I learned some debugging techniques that I think should be widely applicable; here they are.

Make sure the application in your container is working first.

If you just deploy your containers to the cluster without checking them first, you're apt to end up digging through logs to figure out what's going wrong. Running the image locally first saves you that step:

docker build -t gcr.io/whatever:latest .
docker run -it gcr.io/whatever:latest

You may have to do a more complicated debugging loop on GCP, as the permissions can be a challenge to get right. If you don't have permission to do the run step above locally, you can run it in Google Cloud Shell, which has all the permissions you need. From there:

docker pull gcr.io/whatever:latest
docker run -it gcr.io/whatever:latest

It takes at least three parts to get a Kubernetes cluster serving traffic: Pods, a Service, and an Ingress

I had a hard time finding a complete example of this, so I've posted mine to GitHub.
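For orientation, here is a minimal sketch of the three parts wired together. The names, image, and port are placeholders rather than the exact manifests from my repo, and the Ingress uses the current networking.k8s.io/v1 API:

# Deployment: runs the Pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: whatever-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: whatever-app
  template:
    metadata:
      labels:
        app: whatever-app
    spec:
      containers:
      - name: whatever-app
        image: gcr.io/whatever:latest
        ports:
        - containerPort: 8080
---
# Service: a stable address in front of the Pods.
# NodePort is what the GKE Ingress controller typically expects as a backend.
apiVersion: v1
kind: Service
metadata:
  name: whatever-svc
spec:
  type: NodePort
  selector:
    app: whatever-app
  ports:
  - port: 80
    targetPort: 8080
---
# Ingress: routes outside HTTP traffic to the Service.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: whatever-ingress
spec:
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: whatever-svc
            port:
              number: 80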

Liveness and Readiness Probes are essential in Kubernetes

You also need to make sure you have liveness and readiness probes set up. If the liveness probe fails, the container is considered dead and a new one is started; if the readiness probe fails, the Pod is marked not ready and taken out of load balancing.

According to the Kubernetes docs, there are several probe types available, but an HTTP GET against a URL that should return 200 seems to be the most common. I was getting a Flask app deployed, so that's what I needed. The default route '/' was already taken by the main app logic, so I just had to add another route for the health check:

from flask import Flask
app = Flask(__name__)

@app.route('/healthz', methods=['GET'])
def healthz():
    return 'OK', 200
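In the Deployment's container spec, the probes then point at that route. A minimal sketch, assuming the app listens on port 8080 as in the examples above:

# Goes under the container entry in the Deployment spec.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10   # give the app time to start before the first check
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10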

Make sure the firewall isn’t blocking your liveness probe!

By default, stock Kubernetes comes up with a firewall configuration that blocks everything but a few standard ports. Since our app ran on port 8080, I had to open that port in the firewall rule (kubernetes-minion-all is its name) to get the cluster to work.
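On GCP that amounts to editing the rule with gcloud. A sketch, assuming the rule name above; note that --allow replaces the whole allow list, so describe the rule first and re-specify everything it already permits plus your port:

# See what the rule currently allows before changing it.
gcloud compute firewall-rules describe kubernetes-minion-all

# Re-specify the full allow list with port 8080 added
# (tcp:80,tcp:443 are placeholders for whatever describe showed).
gcloud compute firewall-rules update kubernetes-minion-all --allow tcp:80,tcp:443,tcp:8080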

If you do need to look at the logs, make sure you use the Kubernetes logs, not the cloud system logs.

I have been working these last several months with GCP after years with AWS, where the only way to get some log information was to dig into CloudWatch. I initially tried looking at the GCP GKE cluster logs to see what was going wrong but generally found them worthless.

Only when I went back to the Kubernetes logs did I find anything useful. My first attempt was:

kubectl logs -p podname [-c containername] [-n namespace]

If you run this command, you get what looks like a nonsensical answer: the logs of the previous container, the one you already know wasn't working, or an error that the previous container isn't available. That's because -p is shorthand for --previous. To see the current container's logs, drop the -p (or override it with --previous=false):

kubectl logs podname [-c containername] [-n namespace]
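In practice my loop looks like this; the pod and namespace names here are placeholders, so list the Pods first to find the real generated name:

# Find the exact generated pod name first.
kubectl get pods -n mynamespace

# Logs from the current container (no -p):
kubectl logs whatever-app-5d9c7b6f4-abcde -n mynamespace

# Logs from the previous, crashed container:
kubectl logs -p whatever-app-5d9c7b6f4-abcde -n mynamespace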

I’d like to meet the Kubernetes developer who made this design decision; she owes me a cup of tea!


On GKE especially, expect your cluster to take much longer than you’d think to show up HEALTHY.

I ran the same cluster on “stock” Kubernetes (still targeting GCP) and then on GKE. Stock Kubernetes came up HEALTHY in minutes, whereas GKE took over an hour, showing status “Unknown” until then.
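If you'd rather not sit refreshing the console, a small polling loop works. This is a sketch: the cluster name and zone are placeholders, and I'm assuming RUNNING is the status string the API reports once the console shows the cluster healthy:

# Poll GKE until the cluster reports RUNNING.
while true; do
  status=$(gcloud container clusters describe whatever-cluster --zone us-central1-a --format='value(status)')
  echo "cluster status: ${status}"
  [ "${status}" = "RUNNING" ] && break
  sleep 60
done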

It took patience; hope this saves you some time and trouble!

Best of luck making your Kubernetes cluster sing and dance!
