Relay Latency
Methodology
Per-packet DERP relay round-trip time, measured using `derp_test_client` in ping/echo mode.
- Path: client sends DERP `SendPacket` with embedded nanosecond timestamp; echo responder bounces it back through the relay. Two full relay traversals per sample.
- Samples: 5,000 pings per run, first 500 discarded as warmup (4,500 measured per run)
- Runs: 10 per load level per server
- Background load: dedicated client VMs run `derp_scale_test` at target rate. Ping/echo clients do not generate bulk traffic.
- Load levels: idle, 25/50/75/100/150% of TS ceiling
- Total: 480 runs, 2,160,000 latency samples
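As a rough illustration of the sampling rules above, the sketch below drops the warmup pings from one run and reports p50/p99/p999 in microseconds. The function names and the nearest-rank percentile definition are assumptions made for illustration; the source does not say how `derp_test_client` computes its percentiles.

```python
WARMUP = 500           # pings discarded per run (from the methodology)
PINGS_PER_RUN = 5_000  # pings sent per run (from the methodology)


def percentile(sorted_vals, permille):
    """Nearest-rank percentile of an ascending, non-empty list.

    permille is an integer (990 means p99) so the rank is computed
    with exact integer arithmetic.
    """
    rank = -(-permille * len(sorted_vals) // 1000)  # ceiling division
    return sorted_vals[max(rank, 1) - 1]


def summarize_run(rtt_ns, warmup=WARMUP):
    """Drop warmup samples, then return p50/p99/p999 in microseconds."""
    measured = sorted(rtt_ns[warmup:])
    levels = (("p50", 500), ("p99", 990), ("p999", 999))
    return {name: percentile(measured, pm) / 1_000 for name, pm in levels}


def total_samples(runs=480):
    """Measured samples across all runs (warmup excluded)."""
    return runs * (PINGS_PER_RUN - WARMUP)
```

With 480 runs of 4,500 measured pings each, `total_samples()` gives the 2,160,000 samples quoted in the list.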
8 vCPU — HD flat, TS degrades
| Load | HD p50 | HD p99 | HD p999 | TS p50 | TS p99 | TS p999 |
|---|---|---|---|---|---|---|
| Idle | 114 μs | 129 μs | 143 μs | 112 μs | 129 μs | 162 μs |
| 25% | 115 μs | 138 μs | 158 μs | 117 μs | 148 μs | 233 μs |
| 50% | 122 μs | 149 μs | 172 μs | 119 μs | 157 μs | 251 μs |
| 75% | 124 μs | 152 μs | 171 μs | 119 μs | 163 μs | 252 μs |
| 100% | 121 μs | 147 μs | 169 μs | 121 μs | 185 μs | 272 μs |
| 150% | 121 μs | 153 μs | 184 μs | 124 μs | 218 μs | 289 μs |
HD p99 is load-invariant: 129–153 μs from idle through 150%. TS p99 rises from 129 to 218 μs (+69%). At 150% load, HD is 1.42x better on p99 and 1.57x better on p999.
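The derived figures in this paragraph follow directly from the table. A small check, with the p99 and p999 values copied from the 8 vCPU rows (the variable names are illustrative only):

```python
# (p99, p999) in microseconds, copied from the 8 vCPU table.
hd = {"idle": (129, 143), "150%": (153, 184)}
ts = {"idle": (129, 162), "150%": (218, 289)}

# Relative growth of TS p99 from idle to 150% load, in percent.
ts_p99_growth = (ts["150%"][0] / ts["idle"][0] - 1) * 100

# HD advantage at 150% load: TS latency divided by HD latency.
p99_adv = ts["150%"][0] / hd["150%"][0]
p999_adv = ts["150%"][1] / hd["150%"][1]

print(f"TS p99 growth: +{ts_p99_growth:.0f}%")
print(f"HD advantage at 150%: p99 {p99_adv:.2f}x, p999 {p999_adv:.2f}x")
```

The same two divisions applied to the 16 vCPU rows give the ratios quoted in that section.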
16 vCPU — HD dominates
| Load | HD p50 | HD p99 | HD p999 | TS p50 | TS p99 | TS p999 |
|---|---|---|---|---|---|---|
| Idle | 106 μs | 119 μs | 133 μs | 104 μs | 117 μs | 145 μs |
| 50% | 110 μs | 127 μs | 140 μs | 105 μs | 138 μs | 258 μs |
| 100% | 109 μs | 130 μs | 144 μs | 107 μs | 190 μs | 275 μs |
| 150% | 105 μs | 127 μs | 141 μs | 109 μs | 214 μs | 286 μs |
At 150% load: HD p99 = 127 μs, TS p99 = 214 μs. 1.69x better on p99, 2.03x better on p999. HD's latency actually decreases slightly at 150% — the io_uring busy-spin loop reduces syscall overhead.
2 vCPU — both marginal
| Load | HD p50 | HD p99 | TS p50 | TS p99 |
|---|---|---|---|---|
| Idle | 109 μs | 143 μs | 101 μs | 128 μs |
| 100% | 120 μs | 166 μs | 113 μs | 157 μs |
| 150% | 117 μs | 147 μs | 122 μs | 171 μs |
Both at their limits. HD slightly better at 150% (147 vs 171 μs p99), TS slightly better at idle.
Known Issue: 4 vCPU Backpressure Stall
HD at 4 vCPU (2 workers) has intermittent multi-millisecond stalls at >=50% load. Three consecutive runs at 100% load hit p99 of 593 / 2,579 / 3,923 μs -- roughly 5-30x worse than the normal ~130 μs.
Root Cause
The backpressure mechanism oscillates. When the send queue fills, `recv_paused` is set. Recv stops, the queue drains, `recv_paused` clears, a burst floods in, and the queue fills again. The cycle is ~42 ms at 2 Gbps per worker -- fast enough to cause visible latency spikes because ping packets arriving during the recv-paused window queue in the kernel TCP buffer for tens of milliseconds.
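The cycle described above can be reproduced with a toy model of a two-threshold gate. This is a sketch only: the thresholds, arrival rate, and drain rate are made-up units chosen for illustration, not values from the HD source.

```python
def simulate(high=1024, low=256, arrive=96, drain=64, steps=1000):
    """Toy hysteresis model; returns how many times recv_paused was set.

    All parameters are hypothetical queue units per tick.
    """
    depth, paused, pauses = 0, False, 0
    for _ in range(steps):
        if not paused:
            depth += arrive            # recv active: packets enter the send queue
        depth = max(0, depth - drain)  # the sender drains on every tick
        if not paused and depth >= high:
            paused = True              # high-water mark reached: stop recv
            pauses += 1
        elif paused and depth <= low:
            paused = False             # low-water mark reached: resume recv
    return pauses
```

In this model, whenever arrivals outpace the drain rate the gate toggles indefinitely, and anything arriving while `paused` is true has to wait outside the queue. When the drain rate keeps up, the high-water mark is never reached and the gate stays open.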
At 8+ vCPU this doesn't happen: each worker handles ~25% of the traffic, per-worker send pressure is 4x lower, and the kTLS throughput per core has headroom. The send queue rarely reaches the high-water mark.
Fix (Three Parts)
- Wider hysteresis for low worker counts. Resume threshold drops to 1/8 of high (from 1/4) when peer count per worker is <=12. Doubles the drain time between oscillation cycles.
- Minimum pause duration. Once `recv_paused` is set, it stays set for at least 8 CQE batch iterations (~2,048 completions) before checking the low threshold. Prevents rapid toggling.
- Reduced busy-spin for 2-worker configs. Spin count drops from 256 to 64 iterations, giving the kernel kTLS thread more CPU time to drain send queues.
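The three parts could be combined roughly as follows. This is a sketch under stated assumptions, not the HD implementation: `BackpressureConfig`, `pick_config`, and `RecvGate` are hypothetical names, and applying the minimum pause to every configuration (rather than only low peer counts) is a guess, since the text does not say.

```python
from dataclasses import dataclass


@dataclass
class BackpressureConfig:
    resume_divisor: int     # resume recv when depth <= high / divisor
    min_pause_batches: int  # CQE batch iterations to hold recv_paused
    spin_iterations: int    # busy-spin count per loop


def pick_config(workers: int, peers_per_worker: int) -> BackpressureConfig:
    """Select parameters using the thresholds listed above."""
    return BackpressureConfig(
        resume_divisor=8 if peers_per_worker <= 12 else 4,
        min_pause_batches=8,  # assumption: applied to all configs
        spin_iterations=64 if workers == 2 else 256,
    )


class RecvGate:
    """Hypothetical recv_paused gate with a minimum pause duration."""

    def __init__(self, high: int, cfg: BackpressureConfig):
        self.high = high
        self.low = high // cfg.resume_divisor
        self.min_pause = cfg.min_pause_batches
        self.paused = False
        self.batches_paused = 0

    def on_batch(self, queue_depth: int) -> bool:
        """Call once per CQE batch; returns True while recv stays paused."""
        if not self.paused:
            if queue_depth >= self.high:
                self.paused = True
                self.batches_paused = 0
            return self.paused
        self.batches_paused += 1
        # The low threshold is only consulted after the minimum pause.
        if self.batches_paused >= self.min_pause and queue_depth <= self.low:
            self.paused = False
        return self.paused
```

For example, `pick_config(workers=2, peers_per_worker=10)` selects the 1/8 resume threshold and the 64-iteration spin, while `pick_config(workers=4, peers_per_worker=100)` keeps 1/4 and 256.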
Expected Impact
| Metric | Before | After (est) |
|---|---|---|
| 4v p99 @ 100% | 825 μs (with 3ms stalls) | <200 μs |
| 4v p99 @ 150% | 765 μs | <250 μs |
| 4v oscillation period | ~42 ms | eliminated |
| 8v/16v performance | baseline | unchanged |
The fix only activates at low peer counts. High-worker configs are unaffected.