Hyper-DERP: C++/io_uring DERP relay - Same throughput as Tailscale's derper, half the cores
The Interview
I work for a company that produces IR cameras for industrial applications. I built a Raspberry Pi edge device and the accompanying software. Among other things, it will be able to forward data and control streams from one industrial network into another. I had some rough ideas about how I wanted the relay to work but hadn't gotten serious about it.
Then I had an interview at NetBird, a VPN startup in Berlin that had just gotten Series A funding. In preparation for the interview I looked over their code, got to their relay, and no further.
The relay was written in Go with userspace TLS. Every packet makes its way into userspace, gets decrypted, has its header rewritten, and is encrypted again before being sent back out. All the while fighting the Go runtime: goroutine scheduling, garbage collection, and context switches.
So naturally I did what every reasonable person would do: rip out the data plane and replace it with C.
I started by decoupling the NetBird relay's data plane, which worked out fine. I ran benchmarks over loopback and soon realized that NetBird's relay isn't much of a benchmark. It's a startup prototype, not serious systems engineering. Outrunning it was hardly sport.
Then I had a look at Tailscale's derper. Built by a proper engineering team, with years of production hardening and real effort and thought behind it. I had found a worthy opponent.
