Relay Latency

Methodology

Per-packet DERP relay round-trip time, measured using derp_test_client in ping/echo mode.

  • Path: client sends DERP SendPacket with embedded nanosecond timestamp, echo responder bounces it back through the relay. Two full relay traversals per sample.
  • Samples: 5,000 pings per run, first 500 discarded as warmup (4,500 measured per run)
  • Runs: 10 per load level per server
  • Background load: dedicated client VMs run derp_scale_test at target rate. Ping/echo clients do not generate bulk traffic.
  • Load levels: idle, 25/50/75/100/150% of TS ceiling
  • Total: 480 runs, 2,160,000 latency samples
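
The sample accounting above can be sketched in a few lines. This is a minimal illustration, not the derp_test_client implementation: the payload layout (sequence number plus monotonic timestamp), the function names, and the nearest-rank percentile method are all assumptions made for the example.

```python
import struct
import time

WARMUP = 500  # pings discarded per run, per the methodology above

def make_ping(seq):
    """Hypothetical payload: 8-byte sequence number + 8-byte send time (ns)."""
    return struct.pack("!QQ", seq, time.monotonic_ns())

def read_echo(payload):
    """Return (seq, rtt_ns) for an echoed payload; spans both relay traversals."""
    seq, sent_ns = struct.unpack("!QQ", payload[:16])
    return seq, time.monotonic_ns() - sent_ns

def percentile(sorted_vals, per_mille):
    """Nearest-rank percentile using integer math (990 = p99, 999 = p999)."""
    rank = -(-per_mille * len(sorted_vals) // 1000)  # ceiling division
    return sorted_vals[max(rank, 1) - 1]

def summarize(rtts_ns):
    """Drop the warmup samples, then report p50/p99/p999 in microseconds."""
    measured = sorted(rtts_ns[WARMUP:])
    return {
        "n": len(measured),
        "p50_us": percentile(measured, 500) / 1000,
        "p99_us": percentile(measured, 990) / 1000,
        "p999_us": percentile(measured, 999) / 1000,
    }
```

With 4,500 measured samples per run, the nearest-rank p999 lands on the 4,496th sorted sample, so it is set by the top handful of samples in each run. The stated total follows from 480 runs × 4,500 samples = 2,160,000.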

Figure: Relay latency vs load level, 8 and 16 vCPU

8 vCPU — HD flat, TS degrades

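| Load | HD p50 | HD p99 | HD p999 | TS p50 | TS p99 | TS p999 |
|------|--------|--------|---------|--------|--------|---------|
| Idle | 114 μs | 129 μs | 143 μs | 112 μs | 129 μs | 162 μs |
| 25% | 115 μs | 138 μs | 158 μs | 117 μs | 148 μs | 233 μs |
| 50% | 122 μs | 149 μs | 172 μs | 119 μs | 157 μs | 251 μs |
| 75% | 124 μs | 152 μs | 171 μs | 119 μs | 163 μs | 252 μs |
| 100% | 121 μs | 147 μs | 169 μs | 121 μs | 185 μs | 272 μs |
| 150% | 121 μs | 153 μs | 184 μs | 124 μs | 218 μs | 289 μs |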

HD p99 stays nearly flat under load: 129–153 μs from idle through 150% (+19%). TS p99 rises from 129 to 218 μs (+69%). At 150% load, HD is 1.42x better on p99 and 1.57x better on p999.

16 vCPU — HD dominates

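| Load | HD p50 | HD p99 | HD p999 | TS p50 | TS p99 | TS p999 |
|------|--------|--------|---------|--------|--------|---------|
| Idle | 106 μs | 119 μs | 133 μs | 104 μs | 117 μs | 145 μs |
| 50% | 110 μs | 127 μs | 140 μs | 105 μs | 138 μs | 258 μs |
| 100% | 109 μs | 130 μs | 144 μs | 107 μs | 190 μs | 275 μs |
| 150% | 105 μs | 127 μs | 141 μs | 109 μs | 214 μs | 286 μs |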

At 150% load, HD p99 is 127 μs and TS p99 is 214 μs, so HD is 1.69x better on p99 and 2.03x better on p999. HD's latency actually decreases slightly from 100% to 150% load (p99 130 to 127 μs, p50 109 to 105 μs): the io_uring busy-spin loop reduces syscall overhead.
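
The ratios quoted for the 8 and 16 vCPU configurations can be reproduced directly from the table values. The snippet below is only a worked check of that arithmetic; the dictionary layout and function names are illustrative.

```python
# (p99, p999) in microseconds at 150% load, copied from the tables above.
AT_150 = {
    "8 vCPU":  {"HD": (153, 184), "TS": (218, 289)},
    "16 vCPU": {"HD": (127, 141), "TS": (214, 286)},
}

def advantage(cfg):
    """Return (p99 ratio, p999 ratio) of TS latency over HD latency."""
    hd, ts = AT_150[cfg]["HD"], AT_150[cfg]["TS"]
    return round(ts[0] / hd[0], 2), round(ts[1] / hd[1], 2)

def growth_pct(idle_us, loaded_us):
    """Percentage increase from the idle value to the loaded value."""
    return round((loaded_us - idle_us) / idle_us * 100)

print(advantage("8 vCPU"))    # (1.42, 1.57)
print(advantage("16 vCPU"))   # (1.69, 2.03)
print(growth_pct(129, 218))   # 69  (TS p99 at 8 vCPU, idle to 150%)
print(growth_pct(129, 153))   # 19  (HD p99 at 8 vCPU, idle to 150%)
```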

2 vCPU — both marginal

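| Load | HD p50 | HD p99 | TS p50 | TS p99 |
|------|--------|--------|--------|--------|
| Idle | 109 μs | 143 μs | 101 μs | 128 μs |
| 100% | 120 μs | 166 μs | 113 μs | 157 μs |
| 150% | 117 μs | 147 μs | 122 μs | 171 μs |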

Both at their limits. HD slightly better at 150% (147 vs 171 μs p99), TS slightly better at idle.

Known Issue: 4 vCPU Backpressure Stall

HD at 4 vCPU (2 workers) has intermittent multi-millisecond stalls at >=50% load. Three consecutive runs at 100% load hit p99 of 593 / 2,579 / 3,923 μs -- roughly 5x, 20x, and 30x worse than the normal ~130 μs.

Root Cause

The backpressure mechanism oscillates. When the send queue fills, recv_paused is set. Recv stops, the queue drains, recv_paused clears, a burst floods in, and the queue fills again. The cycle is ~42 ms at 2 Gbps per worker. Ping packets that arrive during the recv-paused window sit in the kernel TCP buffer for tens of milliseconds until recv resumes, which shows up as visible latency spikes.

At 8+ vCPU this doesn't happen: each worker handles ~25% of the traffic, per-worker send pressure is 4x lower, and the kTLS throughput per core has headroom. The send queue rarely reaches the high-water mark.
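
The pause/drain/burst cycle described above can be illustrated with a toy discrete-time model. This is a sketch under loud assumptions, not the server's code: the tick granularity, the queue marks, the arrival and drain rates, and the kernel buffer cap `kbuf` are all hypothetical, and the model makes no attempt to reproduce the measured ~42 ms period.

```python
def simulate(ticks, high=1024, low_div=4, arrive=12, drain=8, kbuf=512):
    """Toy model of recv_paused oscillation. Returns (pauses, max_wait_ticks)."""
    low = high // low_div              # resume threshold (1/4 of high by default)
    queue = backlog = wait = max_wait = pauses = 0
    recv_paused = False
    for _ in range(ticks):
        if recv_paused:
            # Packets pile up in the (capped) kernel TCP buffer while recv is off.
            backlog = min(backlog + arrive, kbuf)
            wait += 1
        else:
            # After resume, the whole backlog floods into the send queue at once.
            queue += arrive + backlog
            backlog = 0
        queue = max(queue - drain, 0)  # sender drains a fixed amount per tick
        if not recv_paused and queue >= high:
            recv_paused, pauses, wait = True, pauses + 1, 0
        elif recv_paused and queue <= low:
            recv_paused = False
            max_wait = max(max_wait, wait)
    return pauses, max_wait

print(simulate(2000))  # (11, 96): 11 pause cycles, each stalling recv for 96 ticks
```

In this model the queue never settles: every resume delivers a burst that pushes the queue most of the way back to the high mark, so the pause recurs on a fixed period. A packet that arrives just after a pause begins waits for the full pause window, which is the effect the ping samples observe.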

Fix (Three Parts)

  1. Wider hysteresis for low worker counts. Resume threshold drops to 1/8 of high (from 1/4) when peer count per worker is <=12. Doubles the drain time between oscillation cycles.

  2. Minimum pause duration. Once recv_paused is set, it stays set for at least 8 CQE batch iterations (~2,048 completions) before checking the low threshold. Prevents rapid toggling.

  3. Reduced busy-spin for 2-worker configs. Spin count drops from 256 to 64 iterations, giving the kernel kTLS thread more CPU time to drain send queues.
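
The three parts can be sketched together as a small controller. This is a hedged illustration of the logic as described, not the actual implementation: the class, its field names, and the assumption that the controller is consulted once per CQE batch are hypothetical. Only the constants (1/8 vs 1/4, <=12 peers per worker, 8 batches, 64 vs 256 spins) come from the list above.

```python
from dataclasses import dataclass

@dataclass
class Backpressure:
    high: int                  # send-queue high-water mark (entries)
    peers_per_worker: int
    workers: int
    recv_paused: bool = False
    batches_paused: int = 0

    LOW_PEER_LIMIT = 12        # part 1: widen hysteresis at or below this
    MIN_PAUSE_BATCHES = 8      # part 2: minimum pause, in CQE batch iterations

    @property
    def low(self):
        """Resume threshold: 1/8 of high for low peer counts, else 1/4."""
        div = 8 if self.peers_per_worker <= self.LOW_PEER_LIMIT else 4
        return self.high // div

    @property
    def spin_iterations(self):
        """Part 3: shorter busy-spin for 2-worker configs."""
        return 64 if self.workers == 2 else 256

    def on_batch(self, queue_depth):
        """Call once per CQE batch; returns the new recv_paused state."""
        if not self.recv_paused:
            if queue_depth >= self.high:
                self.recv_paused = True
                self.batches_paused = 0
        else:
            self.batches_paused += 1
            # The low threshold is not consulted until the minimum pause elapses.
            if (self.batches_paused >= self.MIN_PAUSE_BATCHES
                    and queue_depth <= self.low):
                self.recv_paused = False
        return self.recv_paused
```

For example, `Backpressure(high=1024, peers_per_worker=10, workers=2)` resumes at a depth of 128 and spins 64 times, while a config with 40 peers per worker and 4 workers keeps the 256 resume mark and 256 spins.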

Expected Impact

| Metric | Before | After (est) |
|--------|--------|-------------|
| 4v p99 @ 100% | 825 μs (with 3 ms stalls) | <200 μs |
| 4v p99 @ 150% | 765 μs | <250 μs |
| 4v oscillation period | ~42 ms | eliminated |
| 8v/16v performance | baseline | unchanged |

The fix only activates at low peer counts. High-worker configs are unaffected.