Running Kubernetes in Production: A Field Guide

January 8, 2024

Kubernetes is the de facto standard for container orchestration, but moving from a "hello world" deployment to running mission-critical applications in production is a significant leap. After years of managing production clusters, I've collected my share of operational scars and wisdom. Here’s a field guide to what really matters.

The Foundation: What Worked Well

1. GitOps from Day One

Treating our Kubernetes manifests like application code, stored in Git, was a game-changer. Using a tool like Argo CD to continuously sync cluster state with our Git repository gave us:

A Single Source of Truth: Anyone can see the desired state of the cluster in Git.
Auditability and Rollbacks: Every change is a commit, making it easy to see who changed what and to roll back by reverting that commit.
Increased Velocity: Developers can make infrastructure changes through pull requests, streamlining the deployment process.
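
To make the sync idea concrete, here is a minimal sketch of the comparison a GitOps tool performs between the desired state in Git and the live state in the cluster. It is not how Argo CD is implemented; the function name `plan_sync`, the resource keys, and the action names are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: resource key -> manifest content.
State = Dict[str, dict]


def plan_sync(desired: State, live: State) -> List[Tuple[str, str]]:
    """Return (action, resource_key) pairs that would make live match desired."""
    actions: List[Tuple[str, str]] = []
    for key in sorted(desired):
        if key not in live:
            actions.append(("create", key))   # in Git, not in the cluster
        elif desired[key] != live[key]:
            actions.append(("update", key))   # drifted from Git
    for key in sorted(live):
        if key not in desired:
            actions.append(("prune", key))    # in the cluster, not in Git
    return actions


if __name__ == "__main__":
    desired = {
        "Deployment/web": {"replicas": 3, "image": "web:1.4"},
        "Service/web": {"port": 80},
    }
    live = {
        "Deployment/web": {"replicas": 2, "image": "web:1.4"},
        "ConfigMap/old": {"data": {}},
    }
    for action, key in plan_sync(desired, live):
        print(action, key)
```

Because Git is the only input for the desired state, reverting a commit changes `desired` and the next sync plans the rollback on its own.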

The Hard Lessons: What Didn't Work

1. Ignoring Resource Requests and Limits

In the beginning, we didn't enforce setting resource requests and limits on our deployments. This led to "noisy neighbor" problems where a single runaway application could consume all the resources on a node, causing other critical applications to be evicted.

The Fix: We implemented an admission controller (like Kyverno) to reject any deployment that didn't define resource requests and limits. This forces good behavior and keeps a single runaway workload from starving its neighbors.
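
The rule such a policy enforces can be sketched in a few lines. This is not Kyverno's policy syntax, just a plain-Python illustration of the check, operating on a Deployment manifest already parsed into a dict. The function names and the choice to require both CPU and memory under both `requests` and `limits` are assumptions; your policy may be looser.

```python
from typing import List

# Assumption: both resources are required under both sections.
REQUIRED_RESOURCES = ("cpu", "memory")
REQUIRED_SECTIONS = ("requests", "limits")


def missing_resources(deployment: dict) -> List[str]:
    """List every missing resources.<section>.<name> field, per container."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    problems: List[str] = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources") or {}
        for section in REQUIRED_SECTIONS:
            declared = resources.get(section) or {}
            for resource in REQUIRED_RESOURCES:
                if resource not in declared:
                    problems.append(f"{name}: missing resources.{section}.{resource}")
    return problems


def admit(deployment: dict) -> bool:
    """Mimic the admission decision: allow only when nothing is missing."""
    return not missing_resources(deployment)


if __name__ == "__main__":
    deployment = {
        "spec": {"template": {"spec": {"containers": [
            {"name": "web",
             "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}}},
        ]}}}
    }
    print(admit(deployment))
    for problem in missing_resources(deployment):
        print(problem)
```

Returning the full list of problems, rather than failing on the first one, lets the rejection message tell the developer everything to fix in a single pass.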

Production Readiness Checklist

Before any service goes live, it must meet these criteria:

Observability: Are dashboards, alerts, and structured logging in place?
Resource Limits: Are CPU and memory requests/limits defined?
Health Probes: Are liveness and readiness probes correctly configured?
High Availability: Does the deployment have multiple replicas and Pod Disruption Budgets?
Security: Is the container scanned for vulnerabilities? Are network policies in place to restrict traffic?
Runbook: Is there clear documentation on how to respond if an alert fires?
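
Some of these items can be checked mechanically before a human reviews the rest. The sketch below covers only the health probe and high availability items, again on a Deployment manifest parsed into a dict. The function name `readiness_gaps`, the two-replica threshold, and the `has_pdb` flag (which the caller would have to determine, since a real PodDisruptionBudget matches pods by label selector) are assumptions for illustration.

```python
from typing import List

MIN_REPLICAS = 2  # assumption: "multiple replicas" means at least two


def readiness_gaps(deployment: dict, has_pdb: bool) -> List[str]:
    """Return the checklist items this Deployment does not yet satisfy."""
    gaps: List[str] = []
    spec = deployment.get("spec", {})

    # Kubernetes defaults to 1 replica when the field is omitted.
    if spec.get("replicas", 1) < MIN_REPLICAS:
        gaps.append(f"fewer than {MIN_REPLICAS} replicas")
    if not has_pdb:
        gaps.append("no Pod Disruption Budget")

    pod_spec = spec.get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        for probe in ("livenessProbe", "readinessProbe"):
            if probe not in container:
                gaps.append(f"{name}: no {probe}")
    return gaps


if __name__ == "__main__":
    deployment = {
        "spec": {
            "replicas": 1,
            "template": {"spec": {"containers": [
                {"name": "api", "readinessProbe": {"httpGet": {"path": "/ready", "port": 8080}}},
            ]}},
        }
    }
    for gap in readiness_gaps(deployment, has_pdb=False):
        print(gap)
```

A check like this only confirms that probes exist, not that they are correctly configured; that judgment, along with observability and the runbook, still needs a reviewer.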

Running Kubernetes in production is a journey of continuous improvement.
