Running Kubernetes in Production: A Field Guide

Kubernetes is the de facto standard for container orchestration, but moving from a "hello world" deployment to running mission-critical applications in production is a significant leap. After years of managing production clusters, I've collected my share of operational scars and wisdom. Here’s a field guide to what really matters.
The Foundation: What Worked Well
1. GitOps from Day One
Treating our Kubernetes manifests like application code, stored in Git, was a game-changer. Using tools like ArgoCD to sync our cluster state with our Git repository provided:
- A Single Source of Truth: Anyone can see the desired state of the cluster in Git.
- Auditability and Rollbacks: Every change is a commit, making it easy to see who changed what and to roll back by reverting that commit.
- Increased Velocity: Developers can make infrastructure changes through pull requests, streamlining the deployment process.
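To make the "sync cluster state with Git" idea concrete, here is a minimal sketch of the comparison a GitOps tool performs conceptually. It is not how ArgoCD is implemented: the names `resource_key` and `plan_sync` are hypothetical, manifests are assumed to be plain dictionaries, and a real tool also has to account for server-defaulted fields, ordering, and health checks.

```python
from typing import Any

Manifest = dict[str, Any]
Key = tuple[str, str, str]


def resource_key(manifest: Manifest) -> Key:
    """Identify a resource by (kind, namespace, name)."""
    meta = manifest.get("metadata", {})
    return (
        manifest.get("kind", ""),
        meta.get("namespace", "default"),
        meta.get("name", ""),
    )


def plan_sync(desired: list[Manifest], live: list[Manifest]) -> dict[str, list[Key]]:
    """Compare desired state (from Git) with live state (from the cluster).

    Returns the resources to create, update, or prune so that the
    cluster matches what is committed. Only `spec` is compared here,
    which is a simplifying assumption for the sketch.
    """
    desired_by_key = {resource_key(m): m for m in desired}
    live_by_key = {resource_key(m): m for m in live}

    plan: dict[str, list[Key]] = {"create": [], "update": [], "prune": []}
    for key, manifest in desired_by_key.items():
        if key not in live_by_key:
            plan["create"].append(key)  # in Git, missing from the cluster
        elif live_by_key[key].get("spec") != manifest.get("spec"):
            plan["update"].append(key)  # present in both, but drifted
    for key in live_by_key:
        if key not in desired_by_key:
            plan["prune"].append(key)  # in the cluster, no longer in Git
    return plan
```

Because Git is the only input for the desired side, a rollback is the same operation as any other change: revert the commit and let the next sync compute the plan.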
The Hard Lessons: What Didn't Work
1. Ignoring Resource Requests and Limits
In the beginning, we didn't enforce resource requests and limits on our deployments. This led to "noisy neighbor" problems: a single runaway application could consume most of a node's CPU or memory, starving other workloads and, under memory pressure, getting critical applications evicted.
- The Fix: We added an admission controller (such as Kyverno) that rejects any deployment without resource limits defined. This enforces good behavior at deploy time and helps keep one workload from destabilizing a node.
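The check such a policy performs is simple to state in code. The sketch below is not a Kyverno policy and not our actual configuration. It only illustrates the rule in Python, assuming the Deployment manifest has been parsed into a dictionary. The function names `containers_missing_limits` and `admit` are hypothetical.

```python
from typing import Any


def containers_missing_limits(deployment: dict[str, Any]) -> list[str]:
    """Return the names of containers that lack a CPU or memory limit."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    missing = []
    for container in pod_spec.get("containers", []):
        # `resources` or `limits` may be absent or explicitly null.
        resources = container.get("resources") or {}
        limits = resources.get("limits") or {}
        if "cpu" not in limits or "memory" not in limits:
            missing.append(container.get("name", "<unnamed>"))
    return missing


def admit(deployment: dict[str, Any]) -> tuple[bool, str]:
    """Mimic an admission decision: (allowed, message)."""
    missing = containers_missing_limits(deployment)
    if missing:
        return False, "containers without CPU and memory limits: " + ", ".join(missing)
    return True, "ok"
```

Running the same check in CI against the manifests in Git gives developers the rejection message in their pull request rather than at deploy time.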
Production Readiness Checklist
Before any service goes live, it must meet these criteria:
- Observability: Are dashboards, alerts, and structured logging in place?
- Resource Limits: Are CPU and memory requests/limits defined?
- Health Probes: Are liveness and readiness probes correctly configured?
- High Availability: Does the deployment have multiple replicas and Pod Disruption Budgets?
- Security: Is the container scanned for vulnerabilities? Are network policies in place to restrict traffic?
- Runbook: Is there clear documentation on how to respond if an alert fires?
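Some of these items can be checked mechanically before review. As a rough sketch, the function below covers the health probe and replica items for a Deployment manifest parsed into a dictionary. The name `readiness_gaps` and the two-replica threshold are assumptions made for illustration. Pod Disruption Budgets, dashboards, and runbooks live outside the Deployment object and are not covered.

```python
from typing import Any


def readiness_gaps(deployment: dict[str, Any], min_replicas: int = 2) -> list[str]:
    """List checklist items a Deployment manifest fails to meet."""
    gaps = []
    spec = deployment.get("spec", {})

    # Kubernetes defaults to a single replica when the field is omitted.
    replicas = spec.get("replicas", 1)
    if replicas < min_replicas:
        gaps.append(f"replicas is {replicas}, expected at least {min_replicas}")

    pod_spec = spec.get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        for probe in ("livenessProbe", "readinessProbe"):
            if not container.get(probe):
                gaps.append(f"container {name} has no {probe}")
    return gaps
```

An empty result only means the fields are present. Whether a probe is correctly configured (the right endpoint, sensible timeouts) still needs a human to look at it.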
Running Kubernetes in production is a journey of continuous improvement.