
Relay Latency

Methodology

Per-packet DERP relay round-trip time using derp_test_client in ping/echo mode.

  • Path: the client sends a DERP SendPacket with an embedded nanosecond timestamp; the echo responder bounces it back through the relay, so each sample covers two full relay traversals (see the sketch after this list).
  • Samples: 5,000 pings per run, first 500 discarded as warmup (4,500 measured per run)
  • Runs: 10 per load level per server
  • Background load: dedicated client VMs run derp_scale_test at target rate. Ping/echo clients do not generate bulk traffic.
  • Load levels: idle, 25/50/75/100/150% of TS ceiling
  • Total: 480 runs, 2,160,000 latency samples
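
A minimal sketch of the per-sample measurement, assuming hypothetical send_ping / recv_echo helpers in place of derp_test_client's actual DERP plumbing (not shown here): the sender embeds a monotonic nanosecond timestamp in the payload, the echo responder returns it untouched, and the RTT is the elapsed time at receipt. Warmup discard and percentile extraction follow the parameters above.

```rust
use std::time::Instant;

/// One ping/echo sample: embed a monotonic nanosecond timestamp, let the echo
/// peer bounce it back through the relay, and measure the round trip.
/// `send_ping` / `recv_echo` are hypothetical stand-ins for the test client's
/// DERP SendPacket / receive plumbing.
fn measure_rtt_ns(
    send_ping: &mut impl FnMut(&[u8]),
    recv_echo: &mut impl FnMut() -> Vec<u8>,
    start: Instant,
) -> u64 {
    // Embed the send timestamp (nanoseconds since test start) in the payload.
    let sent_ns = start.elapsed().as_nanos() as u64;
    send_ping(&sent_ns.to_le_bytes());

    // The echoed payload still carries the original timestamp; the difference
    // covers client -> relay -> echo peer -> relay -> client.
    let echoed = recv_echo();
    let echoed_ns = u64::from_le_bytes(echoed[..8].try_into().expect("8-byte timestamp"));
    start.elapsed().as_nanos() as u64 - echoed_ns
}

/// Drop the first 500 pings as warmup, then report a percentile (0.99 for p99,
/// 0.999 for p999) over the remaining samples, in microseconds.
fn percentile_us(samples: &[u64], p: f64) -> f64 {
    let mut measured: Vec<u64> = samples.iter().skip(500).copied().collect();
    measured.sort_unstable();
    let idx = ((measured.len() - 1) as f64 * p).round() as usize;
    measured[idx] as f64 / 1_000.0
}
```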

Figure: relay latency vs load level, 8 and 16 vCPU

8 vCPU -- HD flat, TS degrades

Load    HD p50   HD p99   HD p999   TS p50   TS p99   TS p999
Idle    114 us   129 us   143 us    112 us   129 us   162 us
25%     115 us   138 us   158 us    117 us   148 us   233 us
50%     122 us   149 us   172 us    119 us   157 us   251 us
75%     124 us   152 us   171 us    119 us   163 us   252 us
100%    121 us   147 us   169 us    121 us   185 us   272 us
150%    121 us   153 us   184 us    124 us   218 us   289 us

HD p99 is load-invariant: 129--153 us from idle through 150%. TS p99 rises from 129 to 218 us (+69%). At 150% load, HD is 1.42x better on p99 and 1.57x better on p999.

16 vCPU -- HD dominates

Load    HD p50   HD p99   HD p999   TS p50   TS p99   TS p999
Idle    106 us   119 us   133 us    104 us   117 us   145 us
50%     110 us   127 us   140 us    105 us   138 us   258 us
100%    109 us   130 us   144 us    107 us   190 us   275 us
150%    105 us   127 us   141 us    109 us   214 us   286 us

At 150% load, HD p99 = 127 us vs TS p99 = 214 us: HD is 1.69x better on p99 and 2.03x better on p999. HD's latency actually decreases slightly from 100% to 150% load -- under sustained pressure the io_uring busy-spin loop picks up completions before it has to fall back to a blocking wait, reducing syscall overhead.
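
For context, here is a sketch of the kind of bounded busy-spin completion loop described above, written against the io-uring crate; the HD worker's actual structure is not shown in this report, and SPIN_LIMIT mirrors the 256-iteration default mentioned in the fix section below. The point is that under sustained load, completions usually arrive during the spin window, so the blocking io_uring_enter (and its syscall cost) is skipped.

```rust
use io_uring::IoUring;

/// Spin budget before falling back to a blocking wait (the 256 default
/// referenced in the fix section below).
const SPIN_LIMIT: u32 = 256;

/// Bounded busy-spin on the completion queue: poll in user space for up to
/// SPIN_LIMIT iterations, and only enter the kernel if nothing has completed.
fn reap_completions(ring: &mut IoUring) -> std::io::Result<usize> {
    let mut spins = 0;
    loop {
        // Drain whatever is already in the completion queue, no syscall needed.
        // (A real worker would dispatch each CQE; here we only count them.)
        let reaped = {
            let mut cq = ring.completion();
            cq.sync();
            cq.count()
        };
        if reaped > 0 {
            return Ok(reaped);
        }
        spins += 1;
        if spins >= SPIN_LIMIT {
            // Nothing arrived during the spin window: pay the syscall cost of
            // a blocking io_uring_enter and wait for at least one completion.
            ring.submit_and_wait(1)?;
            spins = 0;
        }
        std::hint::spin_loop();
    }
}
```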

2 vCPU -- both marginal

Load    HD p50   HD p99   TS p50   TS p99
Idle    109 us   143 us   101 us   128 us
100%    120 us   166 us   113 us   157 us
150%    117 us   147 us   122 us   171 us

Both servers are at their limits: HD is slightly better at 150% (147 vs 171 us p99), TS slightly better at idle.

Known Issue: 4 vCPU Backpressure Stall

HD at 4 vCPU (2 workers) has intermittent multi-millisecond stalls at >=50% load. Three consecutive runs at 100% load hit p99 of 593 / 2,579 / 3,923 us -- 20-30x worse than the normal ~130 us.

Root Cause

The backpressure mechanism oscillates. When the send queue fills, recv_paused is set; recv stops, the queue drains, recv_paused clears, a burst of backlogged reads floods in, and the queue fills again. The cycle period is ~42 ms at 2 Gbps per worker, and ping packets arriving during the recv-paused window sit in the kernel TCP buffer for tens of milliseconds, which is what produces the visible latency spikes.
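
A sketch of the oscillating toggle, with illustrative queue constants (the actual high-water value is not given in this report); recv_paused is the flag named above.

```rust
// Illustrative constants: the real queue sizes are not stated in this report.
const HIGH_WATER: usize = 4096;          // pause recv when the send queue reaches this
const RESUME_AT: usize = HIGH_WATER / 4; // original resume threshold

struct Worker {
    send_queue_len: usize,
    recv_paused: bool,
}

impl Worker {
    /// Called after every enqueue/dequeue on the send queue.
    fn update_backpressure(&mut self) {
        if !self.recv_paused && self.send_queue_len >= HIGH_WATER {
            // Queue full: stop reading from peers. Pings arriving now sit in
            // the kernel TCP buffer until recv resumes.
            self.recv_paused = true;
        } else if self.recv_paused && self.send_queue_len <= RESUME_AT {
            // Queue drained to 1/4 of high water: resume recv. The backlog
            // that built up while paused floods in as a burst, refilling the
            // queue and restarting the ~42 ms cycle.
            self.recv_paused = false;
        }
    }
}
```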

At 8+ vCPU this doesn't happen: each worker handles ~25% of the traffic, per-worker send pressure is 4x lower, and the kTLS throughput per core has headroom. The send queue rarely reaches the high-water mark.

Fix (Three Parts)

  1. Wider hysteresis for low worker counts. The resume threshold drops to 1/8 of the high-water mark (from 1/4) when the peer count per worker is <=12, doubling the drain time between oscillation cycles. (All three parts are sketched after this list.)

  2. Minimum pause duration. Once recv_paused is set, it stays set for at least 8 CQE batch iterations (~2,048 completions) before checking the low threshold. Prevents rapid toggling.

  3. Reduced busy-spin for 2-worker configs. Spin count drops from 256 to 64 iterations, giving the kernel kTLS thread more CPU time to drain send queues.
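
A sketch of how the three adjustments could fit together, extending the toggle from the previous listing; the thresholds, minimum pause, and spin counts mirror the values listed above, while the struct fields and the high-water value are illustrative.

```rust
const HIGH_WATER: usize = 4096;    // illustrative, as in the previous sketch
const MIN_PAUSED_BATCHES: u32 = 8; // 8 batches (~2,048 completions), per part 2

struct Worker {
    send_queue_len: usize,
    recv_paused: bool,
    paused_batches: u32, // CQE batch iterations since recv was paused
    peers_per_worker: usize,
    num_workers: usize,
}

impl Worker {
    /// Part 1: wider hysteresis at low worker counts -- resume at 1/8 of the
    /// high-water mark (instead of 1/4) when each worker serves <= 12 peers.
    fn resume_threshold(&self) -> usize {
        if self.peers_per_worker <= 12 { HIGH_WATER / 8 } else { HIGH_WATER / 4 }
    }

    /// Part 3: reduced busy-spin for 2-worker configs, giving the kernel kTLS
    /// thread more CPU time to drain send queues.
    fn spin_limit(&self) -> u32 {
        if self.num_workers <= 2 { 64 } else { 256 }
    }

    /// Called once per CQE batch.
    fn update_backpressure(&mut self) {
        if !self.recv_paused && self.send_queue_len >= HIGH_WATER {
            self.recv_paused = true;
            self.paused_batches = 0;
        } else if self.recv_paused {
            self.paused_batches += 1;
            // Part 2: hold the pause for at least 8 batch iterations before
            // even checking the low threshold, preventing rapid toggling.
            if self.paused_batches >= MIN_PAUSED_BATCHES
                && self.send_queue_len <= self.resume_threshold()
            {
                self.recv_paused = false;
            }
        }
    }
}
```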

Expected Impact

Metric                      Before                      After (est.)
4 vCPU p99 @ 100% load      825 us (with 3 ms stalls)   <200 us
4 vCPU p99 @ 150% load      765 us                      <250 us
4 vCPU oscillation period   ~42 ms                      eliminated
8/16 vCPU performance       baseline                    unchanged

The fix only activates at low peer counts. High-worker configs are unaffected.