Relay Latency

Methodology

Per-packet DERP relay round-trip time, measured using derp_test_client in ping/echo mode.

Path: client sends DERP SendPacket with embedded nanosecond timestamp, echo responder bounces it back through the relay. Two full relay traversals per sample.
Samples: 5,000 pings per run, first 500 discarded as warmup (4,500 measured per run)
Runs: 10 per load level per server
Background load: dedicated client VMs run derp_scale_test at target rate. Ping/echo clients do not generate bulk traffic.
Load levels: idle, 25/50/75/100/150% of TS ceiling
Total: 480 runs, 2,160,000 latency samples
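The sampling scheme above can be sketched as follows. This is a minimal illustration, not the actual derp_test_client code: the 8-byte big-endian payload layout, the helper names, and the nearest-rank percentile definition are all assumptions.

```python
import struct

WARMUP = 500          # samples discarded at the start of each run (from the text)
PINGS_PER_RUN = 5000  # pings sent per run (from the text)


def make_ping_payload(now_ns):
    """Embed a nanosecond timestamp in the payload (hypothetical 8-byte layout)."""
    return struct.pack(">Q", now_ns)


def rtt_us(echoed_payload, now_ns):
    """Round-trip time in microseconds, from the timestamp the echo returned."""
    (sent_ns,) = struct.unpack(">Q", echoed_payload[:8])
    return (now_ns - sent_ns) / 1000.0


def percentile(sorted_samples, permille):
    """Nearest-rank percentile; permille=990 means p99, 999 means p999.

    Integer arithmetic avoids floating-point rounding in the rank.
    Assumes a non-empty list sorted in ascending order.
    """
    n = len(sorted_samples)
    rank = max(1, -(-n * permille // 1000))  # ceil(n * permille / 1000)
    return sorted_samples[rank - 1]


def summarize(run_samples_us):
    """Drop the warmup samples, then report p50/p99/p999 for one run."""
    measured = sorted(run_samples_us[WARMUP:])
    return {
        "n": len(measured),
        "p50": percentile(measured, 500),
        "p99": percentile(measured, 990),
        "p999": percentile(measured, 999),
    }
```

With 4,500 measured samples per run, p999 is determined by only the slowest handful of samples in that run, which is one reason to repeat each load level 10 times.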

Figure: Relay latency vs load level, 8 and 16 vCPU

8 vCPU — HD flat, TS degrades

Load	HD p50	HD p99	HD p999	TS p50	TS p99	TS p999
Idle	114 μs	129 μs	143 μs	112 μs	129 μs	162 μs
25%	115 μs	138 μs	158 μs	117 μs	148 μs	233 μs
50%	122 μs	149 μs	172 μs	119 μs	157 μs	251 μs
75%	124 μs	152 μs	171 μs	119 μs	163 μs	252 μs
100%	121 μs	147 μs	169 μs	121 μs	185 μs	272 μs
150%	121 μs	153 μs	184 μs	124 μs	218 μs	289 μs

HD p99 is nearly load-invariant: it stays within 129–153 μs from idle through 150%. TS p99 rises from 129 to 218 μs (+69%). At 150% load, HD is 1.42x better on p99 and 1.57x better on p999.
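The comparison figures for 8 vCPU follow directly from the table. A short sketch of the arithmetic, with illustrative helper names that are not part of any benchmark tooling:

```python
def pct_increase(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def ratio(ts_value, hd_value):
    """How many times larger the TS latency is than the HD latency."""
    return ts_value / hd_value


# Values copied from the 8 vCPU table, in microseconds.
ts_p99_idle, ts_p99_150 = 129, 218
hd_p99_150 = 153
hd_p999_150, ts_p999_150 = 184, 289

print(round(pct_increase(ts_p99_idle, ts_p99_150)))  # 69
print(round(ratio(ts_p99_150, hd_p99_150), 2))       # 1.42
print(round(ratio(ts_p999_150, hd_p999_150), 2))     # 1.57
```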

16 vCPU — HD dominates

Load	HD p50	HD p99	HD p999	TS p50	TS p99	TS p999
Idle	106 μs	119 μs	133 μs	104 μs	117 μs	145 μs
50%	110 μs	127 μs	140 μs	105 μs	138 μs	258 μs
100%	109 μs	130 μs	144 μs	107 μs	190 μs	275 μs
150%	105 μs	127 μs	141 μs	109 μs	214 μs	286 μs

At 150% load, HD p99 is 127 μs and TS p99 is 214 μs: HD is 1.69x better on p99 and 2.03x better on p999. HD's latency actually decreases slightly between 100% and 150% load (p99 130 → 127 μs); the io_uring busy-spin loop reduces syscall overhead.

2 vCPU — both marginal

Load	HD p50	HD p99	TS p50	TS p99
Idle	109 μs	143 μs	101 μs	128 μs
100%	120 μs	166 μs	113 μs	157 μs
150%	117 μs	147 μs	122 μs	171 μs

Both servers are at their limits. HD is slightly better at 150% (147 vs 171 μs p99); TS is slightly better at idle (128 vs 143 μs p99) and at 100% (157 vs 166 μs p99).

Known Issue: 4 vCPU Backpressure Stall

HD at 4 vCPU (2 workers) has intermittent multi-millisecond stalls at >=50% load. Three consecutive runs at 100% load hit p99 of 593 / 2,579 / 3,923 μs, between roughly 4.5x and 30x worse than the normal ~130 μs.

Root Cause

The backpressure mechanism oscillates. When the send queue fills, recv_paused is set. Recv stops, the queue drains, recv_paused clears, a burst floods in, and the queue fills again. The cycle is ~42 ms at 2 Gbps per worker. That is enough to cause visible latency spikes, because ping packets arriving during the recv-paused window queue in the kernel TCP buffer for tens of milliseconds.
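The oscillation can be illustrated with a toy tick-based model. This is only a sketch under simplifying assumptions: the rates, thresholds, and tick granularity are invented for illustration, the kernel buffer is modelled as an unbounded backlog, and nothing here is taken from the relay's actual code.

```python
def simulate(high, low, ingress, drain, ticks):
    """Toy send queue with high/low watermark backpressure.

    Returns (number of pause events, longest pause in ticks).
    A packet arriving at the start of a pause waits roughly the
    full pause length before it is read.
    """
    queue = backlog = 0
    recv_paused = False
    pauses = longest = current = 0
    for _ in range(ticks):
        if recv_paused:
            backlog += ingress            # arrivals wait in the kernel buffer
            current += 1
            longest = max(longest, current)
        else:
            queue += ingress + backlog    # resuming pulls the backlog in as a burst
            backlog = 0
        queue = max(0, queue - drain)
        if not recv_paused and queue >= high:
            recv_paused, current = True, 0
            pauses += 1
        elif recv_paused and queue <= low:
            recv_paused = False
    return pauses, longest
```

In this model, whenever ingress exceeds drain the burst released on resume pushes the queue straight back over the high-water mark, so pauses recur; when drain keeps up with ingress the queue never reaches the mark and no pause occurs.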

At 8+ vCPU this doesn't happen: each worker handles ~25% of the traffic, per-worker send pressure is 4x lower, and the kTLS throughput per core has headroom. The send queue rarely reaches the high-water mark.

Fix (Three Parts)

Wider hysteresis for low worker counts. Resume threshold drops to 1/8 of high (from 1/4) when peer count per worker is <=12. Doubles the drain time between oscillation cycles.
Minimum pause duration. Once recv_paused is set, it stays set for at least 8 CQE batch iterations (~2,048 completions) before checking the low threshold. Prevents rapid toggling.
Reduced busy-spin for 2-worker configs. Spin count drops from 256 to 64 iterations, giving the kernel kTLS thread more CPU time to drain send queues.
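The three parts can be sketched together as follows. The class, field, and function names are hypothetical, and the placement of the counter relative to the CQE batch loop is an assumption; only the numeric values (1/8 vs 1/4, 12 peers per worker, 8 batches, 64 vs 256 spins) come from the description above.

```python
from dataclasses import dataclass

LOW_PEER_THRESHOLD = 12  # peers per worker at or below which wider hysteresis applies
MIN_PAUSE_BATCHES = 8    # CQE batch iterations that recv_paused must persist


@dataclass
class Backpressure:
    high: int                # high-water mark of the send queue
    peers_per_worker: int
    recv_paused: bool = False
    batches_paused: int = 0

    @property
    def low(self):
        """Resume threshold: high/8 at low peer counts, high/4 otherwise."""
        divisor = 8 if self.peers_per_worker <= LOW_PEER_THRESHOLD else 4
        return self.high // divisor

    def on_batch(self, queue_depth):
        """Call once per CQE batch iteration; returns the new recv_paused state."""
        if not self.recv_paused:
            if queue_depth >= self.high:
                self.recv_paused = True
                self.batches_paused = 0
        else:
            self.batches_paused += 1
            # The low threshold is only consulted after the minimum pause.
            if self.batches_paused >= MIN_PAUSE_BATCHES and queue_depth <= self.low:
                self.recv_paused = False
        return self.recv_paused


def spin_count(num_workers):
    """Busy-spin iterations: reduced for 2-worker configs, default otherwise."""
    return 64 if num_workers == 2 else 256
```

In this sketch a pause lasts at least 8 batch iterations even if the queue empties immediately, which is the property that prevents rapid toggling.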

Expected Impact

Metric	Before	After (est.)
4v p99 @ 100%	825 μs (with 3 ms stalls)	<200 μs
4v p99 @ 150%	765 μs	<250 μs
4v oscillation period	~42 ms	eliminated
8v/16v performance	baseline	unchanged

The fix only activates at low peer counts. High-worker configs are unaffected.
