Kubernetes Horror Stories

Kubernetes is a complex container management system. So it’s no surprise that it's often the lead character in application or infrastructure horror stories.

Oct 27th, 2020 12:28pm by Serkan Özal


Feature image via Pixabay.

The stories summarized in this article were found in the repository, Kubernetes Failure Stories. Each individual story is also linked below.

Kubernetes is a feature-rich, complex container management system that runs across all environments — multiple public clouds, on-premises, and hybrid. It’s no surprise, therefore, that Kubernetes is often the lead character in application or infrastructure horror stories.

In this article, we introduce five such scary-but-true stories, which are described in full detail in our white paper. Warning: Even these abridged versions are not for the faint of heart — but they may save you from experiencing your own real-life and costly horror story.

Doomsday Preppers: Running Out of Resources

Serkan Özal

Serkan is co-founder and CTO of Thundra. He has 10+ years of expertise in software development, is an AWS Certified PRO and has a patent on distributed environments. He mainly works on serverless architectures, distributed systems and monitoring tools.

Moonlight, a service that matches software developers with companies looking to hire, uses Kubernetes via Google Kubernetes Engine (GKE) to host its web-based application.

The story starts on a Friday, with connectivity problems with the Redis database that the Moonlight API uses for every authenticated request to validate sessions. It’s a critical component in their workflow. At the very same time, Google Cloud reported network service disruptions with packet losses, so the Moonlight team assumed that Google Cloud was causing the connectivity errors. But then things got worse. The following Tuesday, the dreaded doomsday scenario occurred: Moonlight’s website crashed completely.

With the help of the Google Cloud support team, they tracked application and resource usage, and the root cause was identified. GKE was scheduling high-CPU-consuming pods to the same node, which consumed 100% of the node’s CPU. The node would thus go into a kernel panic and become unresponsive. At first, only the Redis pods were failing, but as the pattern continued, all pods serving the traffic went offline.

The Moonlight team had designed a three-replica deployment of the web application in the cluster. They assumed that there would be one pod per node, so that two nodes could fail before the system went down. Unfortunately, they didn’t know that the Kubernetes scheduler can place several replicas on the same node unless inter-pod anti-affinity rules specify which pods must never share a node. The problem was solved by using such rules to keep CPU-intensive and critical applications away from each other, resulting in a more reliable system.
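To make the anti-affinity idea concrete, here is a minimal sketch in Python that builds a Deployment manifest as a dictionary and prints it as JSON, which Kubernetes accepts alongside YAML. The deployment name, the `app: web` label, and the image are hypothetical placeholders; this is not Moonlight's actual configuration.

```python
import json


def anti_affinity_for(app_label: str) -> dict:
    """Build a pod `affinity` stanza that forbids two pods carrying the
    same `app` label from being scheduled onto the same node."""
    return {
        "podAntiAffinity": {
            # "required..." is a hard rule: the scheduler refuses to co-locate.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    # One pod per distinct value of this node label,
                    # that is, one pod per node.
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


# Hypothetical three-replica web deployment (names and image are assumptions).
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {"app": "web"}},
        "template": {
            "metadata": {"labels": {"app": "web"}},
            "spec": {
                "affinity": anti_affinity_for("web"),
                "containers": [{"name": "web", "image": "example/web:1.0"}],
            },
        },
    },
}

if __name__ == "__main__":
    print(json.dumps(deployment, indent=2))
```

Because the rule is a hard requirement, a replica that cannot find a node without a matching pod stays unscheduled. The softer `preferredDuringSchedulingIgnoredDuringExecution` variant trades that guarantee for flexibility.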

Story via Moonlight.

Falling Bridges: Unresponsive Webhooks

Jetstack helps companies create and operate cloud native landscapes with Kubernetes, including provisioning multi-tenant applications. To enforce custom policies at admission time, they started using the Open Policy Agent (OPA) as an admission controller webhook.

One fine day, the team was using Terraform to upgrade the development clusters running in GKE, after having successfully performed the same procedure for the pre-production environment. However, 20 minutes into the upgrade, Terraform timed out. As if that wasn’t bad enough, the API server started to time out the incoming requests. With nodes unable to deliver their statuses to the control plane, GKE assumed they were broken and started replacing them. The pattern continued until the entire cluster collapsed.

Jetstack and the Google Cloud support team discovered that GKE halted the upgrade when it was unable to complete the upgrade of the second master. The root cause: a mismatched namespace configuration that resulted in an unresponsive OPA webhook. The lesson learned was that webhooks, which are a single point of failure in a cluster, must be monitored closely and configured with care.
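As an illustration of "configured with care," the following Python sketch assembles a `ValidatingWebhookConfiguration` with three common safeguards: a short timeout, a fail-open policy, and a namespace selector that keeps the webhook away from system namespaces. The webhook name, service name, and namespace are hypothetical, and this is not presented as the fix Jetstack applied.

```python
import json


def guarded_webhook(name: str, service_name: str, service_namespace: str) -> dict:
    """Build a validating webhook configuration with defensive defaults."""
    return {
        "apiVersion": "admissionregistration.k8s.io/v1",
        "kind": "ValidatingWebhookConfiguration",
        "metadata": {"name": name},
        "webhooks": [
            {
                "name": f"{name}.example.com",  # hypothetical qualified name
                "clientConfig": {
                    "service": {"name": service_name, "namespace": service_namespace}
                },
                "rules": [
                    {
                        "apiGroups": ["*"],
                        "apiVersions": ["*"],
                        "operations": ["CREATE", "UPDATE"],
                        "resources": ["*"],
                        # Skip cluster-scoped objects such as Nodes.
                        "scope": "Namespaced",
                    }
                ],
                # Fail open: an unreachable webhook does not block the API server.
                "failurePolicy": "Ignore",
                # Allowed range is 1 to 30 seconds.
                "timeoutSeconds": 5,
                # Never intercept system namespaces or the webhook's own namespace.
                "namespaceSelector": {
                    "matchExpressions": [
                        {
                            "key": "kubernetes.io/metadata.name",
                            "operator": "NotIn",
                            "values": ["kube-system", service_namespace],
                        }
                    ]
                },
                "sideEffects": "None",
                "admissionReviewVersions": ["v1"],
            }
        ],
    }


config = guarded_webhook("policy-check", "opa", "opa")

if __name__ == "__main__":
    print(json.dumps(config, indent=2))
```

The trade-off is deliberate: with `failurePolicy` set to `Ignore`, policies are not enforced while the webhook is down, which is why monitoring the webhook's availability still matters.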

Story via Jetstack.

Finding Nemo: The Missing CNI Configuration

MindTickle closes knowledge and skill gaps in customer-facing sales teams. Their horror story started while migrating their Kubernetes cluster management from kops to Amazon Elastic Kubernetes Service (EKS). Suddenly, the platform was running slowly. AWS support learned from a packet capture that all internal network requests were fine, but calls outside the Kubernetes cluster were failing.

Over the course of a week, they looked for errors in the usual suspects (nodes, configuration, load balancers) but found none. AWS Kubernetes experts then told them to check the CNI setting AWS_VPC_K8S_CNI_EXTERNALSNAT, which was set to true. With that value, the CNI plugin’s own source network address translation (SNAT) is disabled, so the plugin was not translating each pod’s private IP address to the address of the primary network interface of the Amazon EC2 node on which the pod was running. After enabling SNAT in their CNI configuration, they were able to complete the migration smoothly.
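The effect of the flag can be modeled in a few lines of Python. This is a deliberately simplified sketch of the plugin's documented behavior, not its implementation: traffic that stays inside the VPC keeps the pod's address, and traffic leaving the VPC is rewritten to the node's primary address unless external SNAT is requested. The function name and all IP addresses are hypothetical.

```python
import ipaddress


def source_ip_on_the_wire(
    pod_ip: str,
    node_primary_ip: str,
    dest_ip: str,
    vpc_cidr: str,
    external_snat: bool,
) -> str:
    """Return the source address a pod's packet carries when it leaves the node.

    external_snat mirrors AWS_VPC_K8S_CNI_EXTERNALSNAT: when True, the plugin
    leaves translation to an external NAT device and does no SNAT itself.
    """
    dest_in_vpc = ipaddress.ip_address(dest_ip) in ipaddress.ip_network(vpc_cidr)
    if dest_in_vpc or external_snat:
        return pod_ip  # no translation by the plugin
    return node_primary_ip  # plugin SNATs to the node's primary address


if __name__ == "__main__":
    for flag in (False, True):
        src = source_ip_on_the_wire(
            pod_ip="10.0.1.23",
            node_primary_ip="10.0.1.5",
            dest_ip="93.184.216.34",
            vpc_cidr="10.0.0.0/16",
            external_snat=flag,
        )
        print(f"external_snat={flag}: packet leaves with source {src}")
```

In this model, a cluster with the flag set to true and no NAT device able to translate the pod addresses sends external traffic with a pod source address, which matches the symptom of internal calls succeeding while external calls fail.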

The moral of the story: Suspect every component in the stack, even in a managed service, and add observability tools to understand what is happening behind the scenes in your workflows.

Story via Yash Mehrotra.

The Alchemist: Convert CPU into Bitcoin

JW Player is a video platform backed by an HTML5 online video player. Early last year, the JW Player DevOps team found out that their development and staging clusters had been hijacked by cryptocurrency miners. They discovered the intrusion when they started getting high-load alerts in one of JW Player’s legitimate services. After four days of warnings, the team realized that the same gcc process was running at 100% CPU in the nodes of their clusters.

In checking out this gcc process, they saw that the parent application was a Weave Scope Docker container. Closer scrutiny revealed that this gcc application was not the GNU Compiler Collection, but a cryptocurrency miner with the filename gcc! When the team investigated further, they discovered that anyone with the load balancer URL could access the Weave Scope dashboard without authentication. Such a visitor could also open a shell and execute commands, a classic channel for malicious activity.

Story via Brian Choy.

Vertical Limit: No IP for Pods in GKE

LoveHolidays helps people find their dream vacation, but the company soon found itself enmeshed in a nightmare in their relatively large Kubernetes cluster in GKE. A Slack message notified the team that a deployment had been stuck for over an hour, with a warning that there were no available nodes among the 256 nodes in the cluster. Further investigation revealed that the subnetwork was exhausted. This seemed very strange, since they were using a /16 subnet mask for the pod secondary IP range, which yields 65,536 pod IP addresses, and there were far fewer pods in the cluster.

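In a nutshell, the root cause of the issue was how GKE handles pod IPs and nodes. By default, GKE reserves a /24 block (256 addresses) of the pod secondary range for each node, regardless of how many pods the node actually runs. A /16 range holds only 256 such blocks, so the cluster reached its configured limit at 256 nodes unexpectedly, without the possibility of further expansion.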
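The arithmetic behind that limit is easy to check. The sketch below assumes GKE's default of reserving a /24 block of the pod secondary range per node; the function name and the 10.0.0.0/16 range are hypothetical.

```python
import ipaddress


def max_nodes(pod_range: str, per_node_prefix: int = 24) -> int:
    """Number of per-node blocks of size /per_node_prefix that fit in pod_range."""
    network = ipaddress.ip_network(pod_range)
    if per_node_prefix < network.prefixlen:
        raise ValueError("per-node block is larger than the pod range")
    return 2 ** (per_node_prefix - network.prefixlen)


if __name__ == "__main__":
    pod_range = "10.0.0.0/16"  # hypothetical pod secondary range
    total_ips = ipaddress.ip_network(pod_range).num_addresses
    print(f"{pod_range} holds {total_ips} pod IPs")
    print(f"with a /24 per node: at most {max_nodes(pod_range)} nodes")
    print(f"with a /26 per node: at most {max_nodes(pod_range, 26)} nodes")
```

The same range supports more nodes when each node reserves a smaller block, which is why the per-node pod limit and the size of the secondary range are worth planning together before a cluster grows.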

Story via Dmitri Lerko.

In Conclusion

This article introduced you to five horror stories in which Kubernetes is the lead actor. Clusters got stuck, deployments failed, web applications went down, and teams spent sleepless nights trying to find and resolve the underlying cause. It’s essential to learn from these stories and to analyze your Kubernetes landscapes. If you avoid these mistakes, you’ll have more reliable, scalable and robust Kubernetes setups.

For all the gory details, check out our white paper.
