device, device/afalg: use AF_ALG for ChaCha20-Poly1305 on linux/arm#57
Merged
raggi
approved these changes
May 5, 2026
//
// On modern amd64/arm64 with optimized assembly in x/crypto, the per-
// packet syscall overhead of AF_ALG is likely to make this slower than
// the Go implementation. Benchmark before assuming a win.
Member
we should add that benchmark and drop this comment/redirect it to the benchmark
Member
Author
Done. Also mentioned benchmarking on real hardware, since qemu-user and qemu-system are both misleading in different ways.
golang.org/x/crypto/chacha20poly1305 ships no assembly on 32-bit
ARM, so the data-path AEAD falls back to a slow pure-Go
implementation. Linux's kernel crypto API exposes the same
algorithm via AF_ALG, including a NEON-accelerated
rfc7539(chacha20-neon,poly1305-neon) driver, which is faster than
pure Go even after accounting for sendmsg/recv overhead.
Add a device/afalg package that implements crypto/cipher.AEAD via
an AF_ALG socket and wire it up via build-tagged newDataAEAD on
linux/arm. The kernel's win is conditional on it picking a NEON
driver: on a NEON-less ARMv6 it falls back to scalar chacha20-arm
which is roughly on par with pure Go, and the per-op syscall
overhead then turns AF_ALG into a net loss. Gate selection on
HWCAP_NEON, plus a known-answer self-test (the RFC 8439 vector)
in case the kernel lacks the algorithm or produces wrong output;
fall back to chacha20poly1305 otherwise. Handshake/cookie crypto
stays on Go.
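
The gating described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `selectAEAD`, `passesSelfTest`, `knownAnswer`, `wrongOutput`, and `pick` are hypothetical names, the NEON check is passed in as a plain bool, and AES-GCM from the standard library stands in for both the kernel-backed and pure-Go ChaCha20-Poly1305 implementations so the example runs without extra dependencies. The expected bytes are derived from the reference path here rather than taken from the RFC 8439 vector.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

// knownAnswer is one test vector: sealing plaintext with nonce and aad
// must produce exactly want (ciphertext followed by the tag).
type knownAnswer struct {
	nonce, aad, plaintext, want []byte
}

// passesSelfTest reports whether a reproduces the vector and can open it.
func passesSelfTest(a cipher.AEAD, kat knownAnswer) bool {
	got := a.Seal(nil, kat.nonce, kat.plaintext, kat.aad)
	if !bytes.Equal(got, kat.want) {
		return false
	}
	pt, err := a.Open(nil, kat.nonce, got, kat.aad)
	return err == nil && bytes.Equal(pt, kat.plaintext)
}

// selectAEAD returns candidate only when the CPU feature is present, the
// candidate exists, and it passes the self-test. Otherwise it returns
// fallback. The bool reports whether candidate was chosen.
func selectAEAD(hasNEON bool, candidate, fallback cipher.AEAD, kat knownAnswer) (cipher.AEAD, bool) {
	if !hasNEON || candidate == nil || !passesSelfTest(candidate, kat) {
		return fallback, false
	}
	return candidate, true
}

// newStandIn builds an AES-256-GCM AEAD. It is only a stdlib stand-in
// for the real ChaCha20-Poly1305 implementations (an assumption).
func newStandIn(key []byte) cipher.AEAD {
	block, err := aes.NewCipher(key)
	if err != nil {
		panic(err)
	}
	a, err := cipher.NewGCM(block)
	if err != nil {
		panic(err)
	}
	return a
}

// wrongOutput simulates a backend that produces bad ciphertext.
type wrongOutput struct{ cipher.AEAD }

func (w wrongOutput) Seal(dst, nonce, plaintext, aad []byte) []byte {
	out := w.AEAD.Seal(dst, nonce, plaintext, aad)
	out[len(out)-1] ^= 0xff // corrupt the last tag byte
	return out
}

// pick runs the selection for a simulated kernel state: "ok", "wrong",
// or "missing". It reports whether the candidate path was chosen.
func pick(hasNEON bool, kernel string) bool {
	key := make([]byte, 32)
	fallback := newStandIn(key)
	kat := knownAnswer{
		nonce:     make([]byte, fallback.NonceSize()),
		aad:       []byte("header"),
		plaintext: []byte("known-answer self-test"),
	}
	// Assumption: derive the expected bytes from the reference path
	// instead of hard-coding a published vector.
	kat.want = fallback.Seal(nil, kat.nonce, kat.plaintext, kat.aad)

	var candidate cipher.AEAD // nil models "kernel lacks the algorithm"
	switch kernel {
	case "ok":
		candidate = newStandIn(key)
	case "wrong":
		candidate = wrongOutput{newStandIn(key)}
	}
	_, used := selectAEAD(hasNEON, candidate, fallback, kat)
	return used
}

func main() {
	fmt.Println(pick(true, "ok"))      // true
	fmt.Println(pick(false, "ok"))     // false
	fmt.Println(pick(true, "wrong"))   // false
	fmt.Println(pick(true, "missing")) // false
}
```

Only the first case takes the candidate path; a missing CPU feature, a missing algorithm, or wrong output each fall back, which mirrors the three failure modes the commit message names.
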
Benchmarks at 1420-byte plaintext, the typical WireGuard packet,
showing both Go and AF_ALG numbers and which path is actually
selected after the runtime check:
                                      Go           AF_ALG      selected
    amd64 Xeon (AVX)                  1923 MB/s    712 MB/s    Go
    arm64 Cortex-A53                  158 MB/s     78 MB/s     Go
    Pi 1 ARMv6 (no NEON)              6.3 MB/s     4.7 MB/s    Go
    Pi 3 in armv7+NEON personality    50 MB/s      73 MB/s     AF_ALG (+47%)
Only linux/arm with NEON wins; everywhere else keeps the existing
pure-Go path.
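
A throughput measurement of this kind could look roughly like the sketch below. It is an illustration under stated assumptions, not the benchmark added in this PR: `sealMBps` and `newStandIn` are hypothetical helpers, and AES-GCM from the standard library stands in for the AEAD under test so the example has no external dependencies. In practice each implementation being compared would be passed in as a `cipher.AEAD`.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
	"testing"
)

// packetSize is the plaintext size quoted in the commit message.
const packetSize = 1420

// sealMBps measures Seal throughput for one AEAD in MB/s (1e6 bytes/s).
func sealMBps(a cipher.AEAD, size int) float64 {
	nonce := make([]byte, a.NonceSize())
	plaintext := make([]byte, size)
	dst := make([]byte, 0, size+a.Overhead())
	res := testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(size))
		for i := 0; i < b.N; i++ {
			// Nonce reuse is tolerable only because this is a
			// benchmark and the output is discarded.
			dst = a.Seal(dst[:0], nonce, plaintext, nil)
		}
	})
	if res.T <= 0 {
		return 0
	}
	// Bytes per iteration times iterations, divided by elapsed seconds.
	return float64(res.Bytes) * float64(res.N) / res.T.Seconds() / 1e6
}

// newStandIn builds an AES-256-GCM AEAD as a stdlib stand-in (an
// assumption; the PR compares ChaCha20-Poly1305 implementations).
func newStandIn() cipher.AEAD {
	block, err := aes.NewCipher(make([]byte, 32))
	if err != nil {
		panic(err)
	}
	a, err := cipher.NewGCM(block)
	if err != nil {
		panic(err)
	}
	return a
}

func main() {
	mbps := sealMBps(newStandIn(), packetSize)
	fmt.Printf("%d-byte Seal: %.0f MB/s\n", packetSize, mbps)
}
```

As noted in the review thread, numbers like these are only meaningful on real hardware; qemu-user and qemu-system are both misleading in different ways.
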
Updates tailscale/tailscale#7053
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
cfddaae to 5f3609e