Bare-Metal Profiling

Setup

  • Relay: Xeon E5-1650 v3 @ 3.5 GHz, 6C/12T, 15 MB L3, ConnectX-4 Lx 25GbE
  • Client: i5-13600KF, 16C/24T, 30 MB L3, CX4 Lx 25GbE
  • Network: 25GbE DAC direct link
  • Kernel: 6.12.74+deb13+1-amd64
  • kTLS: software only (CX4 Lx has no TLS offload)

Key Findings

CPU Breakdown (2 workers @ 5 Gbps)

| Function | % Cycles | Category |
| --- | --- | --- |
| aes_gcm_dec (kTLS decrypt) | 13.0% | kernel crypto |
| aes_gcm_enc (kTLS encrypt) | 11.8% | kernel crypto |
| rep_movs_alternative (memcpy) | 4.0% | kernel copy |
| skb_release_data | 2.2% | kernel SKB |
| ForwardMsg (HD user code) | 2.0% | user relay |
| memset_orig | 1.7% | kernel alloc |

HD's entire forwarding path -- frame parsing, hash lookup, SPSC enqueue, frame construction -- consumes 2% of cycles. kTLS encryption and decryption together consume about 25% (13.0% + 11.8%), roughly 12 times the cost of the relay logic itself.
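The comparison above can be reproduced by grouping the table rows by category. The following is a minimal sketch: the rows are copied from the table, while the `PROFILE` list and the `cycles_by_category` helper are illustrative names, not part of HD or of any profiling tool.

```python
# Rows copied from the CPU breakdown table: (function, % of cycles, category).
PROFILE = [
    ("aes_gcm_dec", 13.0, "kernel crypto"),
    ("aes_gcm_enc", 11.8, "kernel crypto"),
    ("rep_movs_alternative", 4.0, "kernel copy"),
    ("skb_release_data", 2.2, "kernel SKB"),
    ("ForwardMsg", 2.0, "user relay"),
    ("memset_orig", 1.7, "kernel alloc"),
]


def cycles_by_category(rows):
    """Sum the cycle percentages of all rows that share a category."""
    totals = {}
    for _name, pct, category in rows:
        totals[category] = totals.get(category, 0.0) + pct
    return totals


totals = cycles_by_category(PROFILE)
crypto = totals["kernel crypto"]
relay = totals["user relay"]
print(f"kernel crypto: {crypto:.1f}%  user relay: {relay:.1f}%  "
      f"ratio: {crypto / relay:.1f}x")
```

Only the six listed functions are included, so the totals cover the rows shown here and not the full profile.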

kTLS Cost

Plain TCP vs kTLS on the same hardware isolates the cost of kernel TLS:

| Workers | kTLS ceiling | TCP ceiling | kTLS tax |
| --- | --- | --- | --- |
| 2 | 3,833 Mbps | 7,383 Mbps | 48% |
| 4 | 6,300 Mbps | 8,652 Mbps | 27% |

Plain TCP 2w (7,383 Mbps) exceeds kTLS 4w (6,300 Mbps). kTLS costs more throughput than doubling workers recovers.
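The kTLS tax figures are consistent with the fraction of plain-TCP throughput lost when kTLS is enabled, that is, 1 - kTLS / TCP. That formula is an assumption inferred from the numbers, and `ktls_tax` is a hypothetical helper name; the sketch below only checks that it reproduces the published values.

```python
def ktls_tax(ktls_mbps, tcp_mbps):
    """Fraction of plain-TCP throughput lost with kTLS (assumed definition)."""
    return 1.0 - ktls_mbps / tcp_mbps


# Ceilings from the table above: workers -> (kTLS Mbps, TCP Mbps).
CEILINGS = {2: (3833, 7383), 4: (6300, 8652)}

for workers, (ktls, tcp) in CEILINGS.items():
    print(f"{workers}w: kTLS tax = {ktls_tax(ktls, tcp):.0%}")
```

With the table's ceilings this yields 48% for 2 workers and 27% for 4 workers, matching the kTLS tax column.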

Cache Cliff

| Load | LLC miss rate |
| --- | --- |
| 3 Gbps (below saturation) | 2.6% |
| 5 Gbps (kTLS 2w saturation) | 40% |

The transition is non-linear: the LLC miss rate rises from 2.6% at 3 Gbps to 40% at 5 Gbps. Below ~4 Gbps, kTLS crypto state and HD's data structures coexist in the relay's 15 MB L3. Above ~4 Gbps, the crypto working set (cipher state, IV buffers, SKBs) evicts everything else, and throughput hits a wall.

This also explains HD's higher coefficient of variation (CV) in throughput under kTLS (6-10% vs <1% for TS). The system oscillates around the cliff edge: crypto pressure evicts the working set, throughput drops, the lower packet rate reduces cache pressure, the data fits again, throughput recovers, and the cycle repeats.
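For reference, CV is the standard deviation of the throughput samples divided by their mean. The sketch below shows the calculation on two hypothetical sample sets; the numbers are made up for illustration and are not measurements from this benchmark, and the use of the population standard deviation is an assumption, since the source does not say which estimator was used.

```python
from statistics import mean, pstdev


def coefficient_of_variation(samples):
    """Population standard deviation divided by the mean."""
    return pstdev(samples) / mean(samples)


# HYPOTHETICAL throughput samples in Mbps, invented for illustration only.
oscillating = [3500, 4100, 3600, 4200, 3550, 4050]  # swings around a cliff
steady = [4090, 4110, 4100, 4095, 4105, 4100]       # stable throughput

print(f"oscillating CV: {coefficient_of_variation(oscillating):.1%}")
print(f"steady CV:      {coefficient_of_variation(steady):.2%}")
```

A run that alternates between two throughput levels a few hundred Mbps apart lands in the 6-10% range, while a run that stays within about 10 Mbps of its mean stays well below 1%.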

Comparison to TS

| Config | Throughput | Loss |
| --- | --- | --- |
| TS TLS | 4,100 Mbps | 37% |
| HD kTLS 2w | 3,833 Mbps | 3% |
| HD kTLS 4w | 6,680 Mbps | 8% |
| HD TCP 2w | 7,383 Mbps | low |
| HD TCP 4w | 8,652 Mbps | <1% |

HD kTLS 2w delivers slightly less throughput than TS TLS (3.8 vs 4.1 Gbps) but with dramatically less loss (3% vs 37%). HD kTLS 4w is 1.6x TS. HD TCP 2w is 1.8x TS.
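The multipliers quoted above are each configuration's throughput divided by the TS TLS throughput. A minimal sketch using the values from the comparison table (`relative_to_ts` is an illustrative helper name):

```python
TS_TLS_MBPS = 4100  # TS TLS throughput from the comparison table

# HD throughput figures from the comparison table, in Mbps.
HD_MBPS = {
    "HD kTLS 2w": 3833,
    "HD kTLS 4w": 6680,
    "HD TCP 2w": 7383,
    "HD TCP 4w": 8652,
}


def relative_to_ts(mbps, baseline=TS_TLS_MBPS):
    """Throughput expressed as a multiple of the TS TLS baseline."""
    return mbps / baseline


for config, mbps in HD_MBPS.items():
    print(f"{config}: {relative_to_ts(mbps):.1f}x TS")
```

This uses the 6,680 Mbps figure that the comparison table lists for HD kTLS 4w.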