
When etcd Crashes, Check Your Disks First: A Pod CrashLoopBackOff Debugging Story

Setting up a cloud-edge continuum testbed for a computer vision demo taught us something fundamental about distributed systems: etcd doesn’t forgive slow storage.

The Demo Setup

We’ve been building a demonstration for MLSysOps, a framework that uses custom policies (simple or ML-based) to shape the deployment and runtime behavior of applications across the Cloud-Edge-IoT continuum. The idea is to show how telemetry-driven policies can dynamically adapt where and how an application runs, without the developer or operator having to intervene manually.

The demo stack is straightforward: Karmada as the continuum orchestrator sitting on top of individual k3s clusters. The application is a computer vision pipeline doing real-time object detection. The hardware? A NUC, a Raspberry Pi, and a Jetson AGX Orin. More info on setting up the testbed can be found in the MLSysOps GitHub repository.

The demo flow goes like this: the object detection workload deploys and runs locally on the Raspberry Pi. As the Pi starts to struggle (frame rate drops, inference latency climbs), the MLSysOps agent picks this up through telemetry and transparently switches the vAccel backend to point at the Jetson AGX Orin. Suddenly the workload is offloaded to serious GPU hardware, real-time object detection kicks in, and the policy change is the only thing that made it happen. No redeployment, no manual intervention. That’s the story we want to tell.

But before we could tell that story, we had to actually get the cluster running. And that turned out to be more interesting than expected.

A Four-Node Cluster on Three Physical Machines

Here’s the constraint we were working with. Karmada uses etcd to persist its own state, separate from the etcd instances backing individual Kubernetes clusters. In k3s this is embedded in the k3s binary (no need to manually handle etcd instances). This means the Karmada host effectively needs to be its own node, distinct from the clusters it’s orchestrating. With only three physical machines and a desire to keep the demo self-contained, we spun up two VMs on the NUC: one to act as the Karmada host and one to serve as the control plane for the k3s cluster, with the Raspberry Pi and the Jetson as worker nodes. Logical, pragmatic, and as it turned out, the source of a very subtle problem.
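
For the curious, the wiring looked roughly like this. The hostnames, cluster name, and kubeconfig paths below are placeholders, not the exact ones from our testbed; the full steps live in the MLSysOps repository:

# On the k3s control-plane VM: install the k3s server
curl -sfL https://get.k3s.io | sh -

# On the Raspberry Pi and the Jetson: join as k3s agents
# (the token comes from /var/lib/rancher/k3s/server/node-token on the server)
curl -sfL https://get.k3s.io | K3S_URL=https://<control-plane-vm>:6443 K3S_TOKEN=<token> sh -

# On the Karmada host VM: register the k3s cluster with Karmada
karmadactl join demo-cluster \
  --kubeconfig /etc/karmada/karmada-apiserver.config \
  --cluster-kubeconfig /path/to/k3s.yaml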

The Symptom: Pods That Wouldn’t Stay Up

After getting Karmada installed, we started noticing that Karmada’s own pods were crashing every five to ten minutes. Regularly. Predictably. Maddening.

The crashes weren’t immediately informative. The pods would come back up, run for a while, and crash again. Nothing in the application layer seemed wrong. The k3s clusters themselves looked healthy. We went through the usual suspects (resource limits, networking, configuration drift between restarts) and came up empty.

The investigation got genuinely painstaking. We started pulling on every thread we could find in the logs, correlating timestamps, looking for patterns in what was dying and when.

The Root Cause: etcd and I/O Latency

Eventually the logs pointed somewhere unexpected: etcd was timing out. Not crashing due to a bug or misconfiguration in the Karmada setup itself, but because the underlying storage wasn’t responding fast enough for etcd’s expectations.

etcd is a strongly consistent, distributed key-value store, and that consistency comes at a cost: it is extraordinarily sensitive to I/O latency. etcd uses a write-ahead log and relies on fsync calls completing within tight time windows. When storage is slow, even intermittently, etcd starts missing its internal heartbeat and election deadlines. Leader elections fail. The cluster loses quorum. Pods that depend on the API server start dying.

The VMs on the NUC were sharing the host’s storage, and under the default configuration, the I/O performance wasn’t consistent enough to keep etcd happy. Bumping the etcd timeout thresholds helped a little but didn’t solve the problem; it just moved the failure threshold slightly. The root issue was the storage itself.
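
For reference, the knobs in question are etcd’s heartbeat interval and election timeout. A sketch of what "bumping" looks like (the values are illustrative; in Karmada’s case they end up as arguments on the etcd pods the installer deploys, typically in the karmada-system namespace):

# etcd defaults: --heartbeat-interval=100 (ms), --election-timeout=1000 (ms).
# Raising them makes etcd tolerate slower disks at the cost of slower
# failure detection. It buys time; it doesn't fix the disk.
etcd --heartbeat-interval=500 \
     --election-timeout=5000 \
     ...   # plus the existing flags (data dir, peer/client URLs, certs)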

The fix came from a different direction: ZFS tuning on the NUC. After optimizing the ZFS storage backend, adjusting settings that affect how aggressively writes are committed and how I/O is scheduled, the latency profile improved enough that etcd stopped timing out. The pod crashes stopped. The cluster became stable. FWIW, we used the following:

zfs set sync=disabled default    # disable synchronous writes
zfs set compression=lz4 default  # use LZ4 compression
zfs set atime=off default        # don't record access times
zfs set recordsize=8k default    # better for small etcd writes
  • sync=disabled: Normally ZFS waits for data to be physically written to disk before confirming a write operation to the application (synchronous writes). Disabling sync means ZFS acknowledges writes immediately without waiting, which dramatically reduces write latency. This is why etcd stopped timing out: fsync calls returned instantly instead of blocking. The tradeoff is that in a power loss scenario, the last few seconds of writes could be lost. For a demo VM, that’s an acceptable risk; for production etcd storing critical state, you’d think harder about it.

  • compression=lz4: Enables transparent compression on all data using the lz4 algorithm. lz4 is specifically chosen here because it’s extremely fast: fast enough that the CPU overhead of compressing/decompressing is usually outweighed by the reduction in I/O, since you’re reading and writing less data to disk. For a VM image with lots of repetitive data (filesystem metadata, logs, Kubernetes objects), this typically gives decent compression ratios with negligible performance cost.

  • atime=off: By default, filesystems record the last access time on every file every time it’s read. That turns reads into metadata writes, adding write pressure on top of read-heavy workloads. Disabling atime means reads are just reads. Almost every performance-tuned Linux system does this.

  • recordsize=8k: ZFS uses a configurable record size (default 128K) when writing data. etcd’s underlying storage engine (bbolt) works with small, random reads and writes. Setting recordsize to 8K aligns ZFS’s I/O unit closer to what etcd actually does, reducing write amplification — instead of reading and rewriting a 128K block to change a few bytes, ZFS only touches an 8K block.

Taken together these four settings basically tell ZFS: stop being cautious, be fast. The sync=disabled setting is the one that almost certainly stopped the etcd crashes on its own; the other three are good housekeeping that reduce overall I/O pressure and keep things humming. In a production setting you’d want to be more careful, particularly with sync=disabled, but for a demo environment running on shared VM storage it’s a very pragmatic solution.
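
If you want to sanity-check that the settings actually landed on the dataset backing the VM disks, zfs get reports them in one shot (swap in your own dataset name):

zfs get sync,compression,atime,recordsize default   # 'default' is the dataset we tuned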

The Lesson: When etcd Crashes, Look at Your Disks

This is the pattern worth internalizing. If you’re running Karmada (or any Kubernetes-adjacent system that embeds etcd) and you’re seeing periodic pod crashes that don’t have an obvious application-level cause, the first question to ask is: how is the storage performing under etcd’s workload?

The etcd documentation actually calls this out: it recommends SSDs and warns against running etcd on storage that’s shared with other I/O-heavy workloads. In a production cluster you’d typically have dedicated storage for etcd nodes. In a demo environment running VMs on shared hardware, that’s easy to overlook.

A few diagnostics worth running if you suspect this problem: etcd exposes metrics in Prometheus format. The ones to watch are etcd_disk_wal_fsync_duration_seconds and etcd_disk_backend_commit_duration_seconds. If the 99th percentile on either of these is consistently above 100ms, you have a storage problem, not a configuration problem.
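
As a rough sketch (the Prometheus host and time window are placeholders), the p99 values can be pulled straight from the Prometheus HTTP API:

# p99 WAL fsync latency over the last 5 minutes
curl -sG 'http://<prometheus-host>:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'

# same idea for backend commits
curl -sG 'http://<prometheus-host>:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))'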

You can also run a quick benchmark against the storage path etcd uses with a tool like fio to get a baseline read/write latency profile before you ever install etcd on a machine.
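
A minimal fio run that mimics etcd’s fsync-heavy write pattern looks something like this; the sizes and target directory are placeholders, and the shape of the command follows the fio-based check the etcd community recommends:

# small sequential writes, each followed by fdatasync, into the directory
# that will (or does) hold the etcd data dir
fio --name=etcd-fsync-test --directory=/var/lib/etcd-test \
    --rw=write --ioengine=sync --fdatasync=1 \
    --size=22m --bs=2300
# check the reported fdatasync latency percentiles; etcd wants the p99
# in the low single-digit milliseconds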

Back to the Demo

Once the cluster was stable, the actual demo came together quickly. The MLSysOps policy layer does what it’s supposed to do: telemetry comes in showing the Raspberry Pi falling behind on frame rate, the policy fires, the vAccel backend switches to the Jetson AGX Orin, and object detection snaps to real time. The network hop is there but the GPU makes it irrelevant. It’s a compelling demonstration of what adaptive, policy-driven orchestration can do in a heterogeneous edge environment. We just had to fight through a disk I/O problem to get there.

Sometimes the most useful debugging sessions are the ones where the answer turns out to be completely orthogonal to where you were looking. etcd taught us that distributed systems have strong opinions about their infrastructure. It’s worth listening to them.

Check out our demo: