device, device/afalg: use AF_ALG for ChaCha20-Poly1305 on linux/arm #57

Merged
bradfitz merged 1 commit into tailscale from bradfitz/pi_af_alg on May 5, 2026


Conversation


@bradfitz bradfitz commented May 5, 2026

golang.org/x/crypto/chacha20poly1305 ships no assembly on 32-bit
ARM, so the data-path AEAD falls back to a slow pure-Go
implementation. Linux's kernel crypto API exposes the same
algorithm via AF_ALG, including a NEON-accelerated
rfc7539(chacha20-neon,poly1305-neon) driver, which is faster than
pure Go even after accounting for sendmsg/recv overhead.

Add a device/afalg package that implements crypto/cipher.AEAD via
an AF_ALG socket, and wire it up via a build-tagged newDataAEAD on
linux/arm. The kernel's win is conditional on it picking a NEON
driver: on a NEON-less ARMv6 it falls back to the scalar
chacha20-arm driver, which is roughly on par with pure Go, and the
per-op syscall overhead then turns AF_ALG into a net loss. Gate
selection on HWCAP_NEON, plus a known-answer self-test (the
RFC 8439 vector) in case the kernel lacks the algorithm or
produces wrong output; fall back to chacha20poly1305 otherwise.
Handshake/cookie crypto stays on Go.
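
The gating described above could look roughly like the following. This is a minimal sketch, not the code from this PR: `hasNEON`, `chooseBackend`, and `errKAT` are hypothetical names, and the self-test is passed in as a plain function rather than running the RFC 8439 vector. The constants `AT_HWCAP = 16` and `HWCAP_NEON = 1 << 12` are the 32-bit ARM Linux values; the parser assumes the little-endian 32-bit auxv layout that `/proc/self/auxv` has on linux/arm.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	atHWCAP   = 16      // AT_HWCAP entry type in the auxiliary vector
	hwcapNEON = 1 << 12 // HWCAP_NEON bit on 32-bit ARM Linux
)

// errKAT stands in for a known-answer self-test failure (hypothetical).
var errKAT = errors.New("known-answer test mismatch")

// hasNEON scans a 32-bit little-endian auxiliary vector, laid out as
// (type, value) uint32 pairs, and reports whether AT_HWCAP has the
// NEON bit set. It stops at AT_NULL (type 0).
func hasNEON(auxv []byte) bool {
	for i := 0; i+8 <= len(auxv); i += 8 {
		typ := binary.LittleEndian.Uint32(auxv[i:])
		val := binary.LittleEndian.Uint32(auxv[i+4:])
		if typ == 0 {
			return false
		}
		if typ == atHWCAP {
			return val&hwcapNEON != 0
		}
	}
	return false
}

// chooseBackend returns "afalg" only when NEON is present and the
// self-test passes; every other case falls back to "go".
func chooseBackend(neon bool, selfTest func() error) string {
	if !neon {
		return "go" // scalar kernel driver plus syscalls is a net loss
	}
	if err := selfTest(); err != nil {
		return "go" // kernel lacks the algorithm or gave wrong output
	}
	return "afalg"
}

func main() {
	// Synthetic auxv: one AT_HWCAP entry with NEON set, then AT_NULL.
	auxv := make([]byte, 16)
	binary.LittleEndian.PutUint32(auxv[0:], atHWCAP)
	binary.LittleEndian.PutUint32(auxv[4:], hwcapNEON)

	pass := func() error { return nil }
	fail := func() error { return errKAT }

	fmt.Println(chooseBackend(hasNEON(auxv), pass)) // afalg
	fmt.Println(chooseBackend(hasNEON(auxv), fail)) // go
	fmt.Println(chooseBackend(false, pass))         // go
}
```

Keeping the decision in one small function makes the fallback order explicit: the cheap capability check runs first, and the self-test only runs when AF_ALG could actually win.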

Benchmarks at 1420-byte plaintext, the typical WireGuard packet,
showing both Go and AF_ALG numbers and which path is actually
selected after the runtime check:

                                  Go          AF_ALG      selected
  amd64 Xeon (AVX)                1923 MB/s   712 MB/s    Go
  arm64 Cortex-A53                158 MB/s    78 MB/s     Go
  Pi 1 ARMv6 (no NEON)            6.3 MB/s    4.7 MB/s    Go
  Pi 3 in armv7+NEON personality  50 MB/s     73 MB/s     AF_ALG (+47%)

Only linux/arm with NEON wins; everywhere else keeps the existing
Go (x/crypto) path.
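
For readers who want to reproduce numbers of this kind, a throughput measurement over any `crypto/cipher.AEAD` can be sketched as below. This is an illustration under stated assumptions, not the benchmark added in this PR: `sealMBps` and `standInAEAD` are hypothetical helpers, AES-GCM from the standard library stands in for the ChaCha20-Poly1305 implementations so the snippet has no external dependencies, and MB/s is taken to mean 10^6 bytes per second as in `go test -bench` output.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
	"time"
)

// packetSize is the 1420-byte plaintext size used in the table above.
const packetSize = 1420

// sealMBps seals a size-byte plaintext iters times and returns the
// throughput in MB/s (1 MB = 1e6 bytes). The nonce is reused across
// iterations, which is acceptable only because this is a benchmark.
func sealMBps(aead cipher.AEAD, size, iters int) float64 {
	nonce := make([]byte, aead.NonceSize())
	plaintext := make([]byte, size)
	dst := make([]byte, 0, size+aead.Overhead())

	start := time.Now()
	for i := 0; i < iters; i++ {
		dst = aead.Seal(dst[:0], nonce, plaintext, nil)
	}
	elapsed := time.Since(start).Seconds()
	return float64(size) * float64(iters) / 1e6 / elapsed
}

// standInAEAD returns AES-256-GCM with an all-zero key. It is a
// stand-in for whichever AEAD is being compared.
func standInAEAD() (cipher.AEAD, error) {
	block, err := aes.NewCipher(make([]byte, 32))
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

func main() {
	aead, err := standInAEAD()
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.1f MB/s\n", sealMBps(aead, packetSize, 100000))
}
```

Because the helper takes the interface rather than a concrete type, the same loop can be pointed at each candidate implementation in turn, ideally on real hardware.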

Updates tailscale/tailscale#7053

@bradfitz bradfitz requested review from raggi and sailorfrag May 5, 2026 19:32
Comment thread device/afalg/afalg_linux.go Outdated
//
// On modern amd64/arm64 with optimized assembly in x/crypto, the per-
// packet syscall overhead of AF_ALG is likely to make this slower than
// the Go implementation. Benchmark before assuming a win.
Member commented:

We should add that benchmark and either drop this comment or redirect it to the benchmark.

Member Author replied:

Done. I also mentioned benchmarking on real hardware, since qemu-user and qemu-system are both misleading in different ways.

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
@bradfitz bradfitz force-pushed the bradfitz/pi_af_alg branch from cfddaae to 5f3609e Compare May 5, 2026 19:56
@bradfitz bradfitz merged commit 30595e7 into tailscale May 5, 2026
14 checks passed