Part 3: Applying eBPF for Real-World Monitoring in Kubernetes

A few weeks after I got comfortable with eBPF, it started earning its place in my daily workflow.

You know those issues that only happen in production — where logs tell you nothing useful, metrics show a spike, and everyone’s staring at Grafana wondering what changed? That’s where eBPF shines.

In this post, I’ll walk through three real-world scenarios where eBPF made the difference between guessing and knowing.

Scenario 1: The Case of the Slow Microservice

We had a payment service that occasionally took 4–5 seconds to respond. CPU looked fine, memory stable, network metrics didn’t show congestion.

The team was ready to blame the database (as always). But I wanted proof.

Step 1 — Observe Network Behavior

I ran:

sudo tcptracer-bpfcc

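This traces TCP connection events on the node (connect, accept, and close), including which process started each connection and where it’s going. For timing, the sibling tools tcplife-bpfcc (connection lifetime) and tcpconnlat-bpfcc (connect latency) fill in how long things took.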

Within seconds, I spotted patterns like this:

PID   COMM     SADDR            DADDR            LAT(ms)
1983  java     10.42.0.12:43210 10.42.1.25:5432  4023

There it was — 4 seconds of latency between the pod and the database. The pod wasn’t busy; the network path was.

We discovered a misconfigured CNI policy that was rerouting traffic through an internal proxy, adding unnecessary hops. One config change later, latency dropped back to under 100ms.

Lesson learned: eBPF doesn’t just measure; it explains where time is lost.
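When the trace scrolls by faster than you can read it, a small filter helps. The following is a minimal sketch, assuming rows shaped like the simplified sample above (PID, COMM, SADDR, DADDR, LAT(ms)); real tool output has more columns and varies by tool and version, so the field positions would need adjusting. The names ConnSample, parse_conn_lines, and slow_connections are hypothetical helpers, not part of bcc.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class ConnSample:
    pid: int
    comm: str
    saddr: str
    daddr: str
    lat_ms: float


def parse_conn_lines(lines: Iterable[str]) -> Iterator[ConnSample]:
    """Parse rows shaped like: PID COMM SADDR DADDR LAT(ms)."""
    for line in lines:
        fields = line.split()
        # Skip blank lines, the header row, and anything malformed.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        try:
            lat_ms = float(fields[4])
        except ValueError:
            continue
        yield ConnSample(int(fields[0]), fields[1], fields[2], fields[3], lat_ms)


def slow_connections(lines: Iterable[str], threshold_ms: float = 1000.0) -> List[ConnSample]:
    """Return samples at or above the threshold, slowest first."""
    slow = [s for s in parse_conn_lines(lines) if s.lat_ms >= threshold_ms]
    return sorted(slow, key=lambda s: s.lat_ms, reverse=True)


# Usage with the sample row from this scenario.
captured = [
    "PID   COMM     SADDR            DADDR            LAT(ms)",
    "1983  java     10.42.0.12:43210 10.42.1.25:5432  4023",
]
for s in slow_connections(captured):
    print(f"{s.comm} (pid {s.pid}) {s.saddr} -> {s.daddr}: {s.lat_ms:.0f} ms")
```

The 1000 ms threshold is an arbitrary default for illustration; pick one that matches what “slow” means for your service.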


Scenario 2: The Noisy Neighbor

Every Kubernetes cluster eventually faces this: one pod quietly hogs resources while everyone else suffers.

I started noticing some nodes running hot — not overloaded, but just enough to trigger alerts.

Instead of guessing which service was the culprit, I went straight to:

sudo profile-bpfcc -F 99

This samples CPU stack traces at 99 Hz, roughly every 10 ms on each CPU. When I converted the output to a flamegraph, one process stood out like a lighthouse: a pod running a background task that was stuck in a retry loop.

A single deployment’s bad code was eating ~40% of the node’s CPU.

We fixed the retry logic, and the cluster immediately calmed down.

Lesson learned: Traditional CPU metrics tell you that there’s a problem; eBPF tells you who caused it.
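If you want a quick “who is burning CPU” number before rendering a flamegraph, the folded output of the profiler can be summed per process. This is a minimal sketch, assuming the samples were captured with the folded-output option (profile-bpfcc -F 99 -f), where each line is a semicolon-joined stack that starts with the process name and ends with a sample count. The function name cpu_share_by_process and the sample stacks are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def cpu_share_by_process(folded_lines: Iterable[str]) -> Dict[str, float]:
    """Return each process's fraction (0 to 1) of all captured samples.

    Each folded line looks like 'comm;frame1;frame2 COUNT'.
    """
    samples: Counter = Counter()
    for line in folded_lines:
        # The count is the last space-separated token; the stack is the rest.
        stack, _, count = line.strip().rpartition(" ")
        if not stack or not count.isdigit():
            continue  # skip blank or malformed lines
        comm = stack.split(";", 1)[0]
        samples[comm] += int(count)
    total = sum(samples.values())
    if total == 0:
        return {}
    # most_common() orders processes from most to fewest samples.
    return {comm: n / total for comm, n in samples.most_common()}


# Usage with made-up folded stacks (illustrative only).
folded = [
    "java;start_thread;retry_task;do_request 40",
    "nginx;main;ngx_process_events 35",
    "kubelet;runtime_main;sync_loop 25",
]
for comm, share in cpu_share_by_process(folded).items():
    print(f"{comm}: {share:.0%}")
```

Note that the shares are relative to the samples captured, not to the node’s total capacity, and idle time is not sampled by default, so treat the result as a ranking rather than a utilization figure.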


Scenario 3: Database Timeouts That Weren’t the Database

This one was tricky. A service that handled large file uploads kept throwing “database timeout” errors. We scaled up the DB, increased connection pools — nothing helped.

I logged into the node hosting one of the pods and ran biosnoop-bpfcc, which traces block I/O operations (disk reads and writes) along with the latency of each request.

Output:

PID   COMM     DISK   I/O(ms)
3012  java     xvda1  742
3012  java     xvda1  718

That meant individual disk operations were taking 700+ ms, before the request ever reached the database. It turned out the app was saving each upload to a local temp file before streaming it, and the EBS volume backing that node had degraded performance.

We moved the workload to a node group with SSD-backed volumes. Problem solved.

Lesson learned: Sometimes “database latency” is just slow I/O in disguise. eBPF helped us see through it.
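A couple of slow rows can be a fluke, so it helps to summarize latency per disk over a longer capture. The sketch below assumes rows shaped like the simplified sample above (PID, COMM, DISK, I/O(ms)); actual biosnoop output carries additional columns such as a timestamp, sector, and byte count, so the field indexes would need to change for raw output. The function name disk_latency_summary is hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List


def disk_latency_summary(lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Summarize I/O latency per disk from rows shaped like: PID COMM DISK I/O(ms)."""
    latencies: Dict[str, List[float]] = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip blank lines, the header row, and anything malformed.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        try:
            lat_ms = float(fields[3])
        except ValueError:
            continue
        latencies[fields[2]].append(lat_ms)
    return {
        disk: {"count": len(vals), "mean_ms": mean(vals), "max_ms": max(vals)}
        for disk, vals in latencies.items()
    }


# Usage with the two rows from this scenario.
captured = [
    "PID   COMM     DISK   I/O(ms)",
    "3012  java     xvda1  742",
    "3012  java     xvda1  718",
]
for disk, stats in disk_latency_summary(captured).items():
    print(f"{disk}: n={stats['count']} mean={stats['mean_ms']:.0f} ms max={stats['max_ms']:.0f} ms")
```

A consistently high mean on one device points at the volume itself, while a high max with a low mean suggests occasional stalls worth correlating with upload traffic.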


Bonus: When You Want Kubernetes-Native eBPF

Once you’re comfortable with these tools, you’ll want something that integrates natively with Kubernetes. Two tools stand out:

  • Pixie (an open-source CNCF project contributed by New Relic): automatic eBPF instrumentation for Kubernetes.
    You query telemetry with PxL, Pixie’s Python-like query language, rather than SQL. Running a bundled script such as px run px/http_data is a quick way to inspect HTTP requests and spot slow ones across pods.
  • Cilium + Hubble — uses eBPF for networking and observability.
    It gives you real-time pod-to-pod flow visibility and security insights, no sidecars needed.

These tools essentially package up the power of eBPF and make it production-ready for teams that don’t want to manually SSH into nodes.


Bringing It All Together

Here’s what my mental checklist looks like now whenever something goes wrong in Kubernetes:

| Problem Type | eBPF Tool | What It Shows | Real-World Use |
|---|---|---|---|
| Slow requests | tcptracer-bpfcc / tcplife-bpfcc | Network latency, retries | Detect slow backend calls |
| High CPU | profile-bpfcc | Flamegraph of CPU hotspots | Catch runaway loops |
| Missing files | opensnoop-bpfcc | File open attempts | Debug bad mounts or configs |
| Memory leaks | memleak-bpfcc | Unreleased allocations | Find growing pods |
| Disk bottlenecks | biosnoop-bpfcc | Slow I/O operations | Spot degraded volumes |

With these in my toolkit, I rarely start a debugging session with “Let’s redeploy and hope it works.” Instead, I start with evidence.


What eBPF Changed for Me

Before eBPF, I relied on logs, metrics, and hunches. After eBPF, I rely on facts.

It’s not a silver bullet — you still need to understand your systems. But it fills the gaps between what your apps think they’re doing and what the kernel actually knows.

And that visibility changes everything.


Final Thoughts

In this 3-part series, we covered:

  1. The Why — Why Kubernetes observability is so hard and how eBPF helps.
  2. The How — Installing and using basic eBPF tools safely.
  3. The Real — Applying eBPF to real-world production issues.

The next time your cluster misbehaves, instead of jumping between dashboards, try asking the kernel itself. It’s been quietly watching everything all along.


If this series helped you, I’d love to hear your stories — what problems did eBPF help you solve? Drop a comment or connect with me; I’m building more practical guides around real debugging cases and performance engineering at scale.
