The first release candidate of Go 1.20 is out![1] This is the first release I participated in as an independent maintainer, after leaving Google to become a professional Open Source maintainer. (By the way, that’s going great, and I’m going to write more about it here soon!)

I’m pretty happy with the work that’s landing in it. There are both exciting new APIs, and invisible deep backend improvements that are going to make code more maintainable and secure in the long run. All the main work mentioned in the planning post got done, and then some (but not the “stretch goals”). The whole release is pretty exciting, too, and you should check out the release notes (although the cryptography parts might not be complete yet).

crypto/ecdh

The standard library is gaining a new package: crypto/ecdh. Here’s what I said about it in the Go 1.20 planning post.

The most visible change will be the landing of the new crypto/ecdh package I proposed and implemented earlier this year. The package provides a safe, []byte-based, easy to use API for Elliptic Curve Diffie-Hellman over Curve25519 and NIST curves (P-256 and company, but no P-224 if we can get away with it).

crypto/ecdh was made possible by a long-running refactor of the elliptic curve implementations in the standard library. Between Go 1.17 and Go 1.19, most critical code was moved to safer low-level APIs under crypto/internal/nistec and crypto/internal/edwards25519, large pieces were replaced with code generated from fiat-crypto's formally verified models, making every curve constant time, most group logic was replaced with modern complete formulas, and even the assembly was massaged to implement the same functions on all architectures and fit the nistec API. Some assembly is gone, actually!

(Here are all the changes. A couple nifty uses of generics in there if you're curious.)

The goal of the package is to replace the major use case for the soon-to-be-deprecated crypto/elliptic API, which has a hardcoded dependency on the variable-time, large, and complex math/big package. crypto/elliptic is now no more than a compatibility wrapper. Any more advanced uses of crypto/elliptic can switch to filippo.io/nistec, an exported version of crypto/internal/nistec, or filippo.io/edwards25519, an exported version of crypto/internal/edwards25519.

What's left to do in Go 1.20 then?

First, actually landing the new package, which is already submitted! Then, adding and reviewing new tests (including Wycheproof integration by Roland), which actually revealed there are fewer (!!) edge cases than I had originally documented. Finally, reviewing the BoringCrypto integration by Russ.

The package landed successfully, and all the mentioned Go 1.20 work got done. The full crypto/elliptic deprecation will actually have to wait until Go 1.22, because of the deprecation policy:

If function F1 is being replaced by function F2 and the first release in which F2 is available is Go 1.N, then an official deprecation notice for F1 should not be added until Go 1.N+2. This ensures that Go developers only see F1 as deprecated when all supported Go versions include F2 and they can easily switch.

The idea is to make sure projects never have to support both APIs at the same time: the deprecation warnings only show up once every supported Go version includes the replacement. I still have an outstanding CL to deprecate the very low-level operations (point addition, and custom curves) that are not being replaced by anything in the standard library. (Instead, they should migrate to third-party modules like filippo.io/nistec or filippo.io/edwards25519.)

There was one last-minute API change: we got a request on the issue tracker for a PrivateKey interface, to allow using private keys stored on hardware or remote modules, like the popular crypto.Signer. We went back and forth on it a bit with Russ Cox and concluded that PrivateKey doesn’t need to be an interface, but rather needs to implement one, like ecdsa.PrivateKey implements crypto.Signer. This led to moving the ECDH method from the Curve interface to the PrivateKey type. We don’t define the interface ourselves because we don’t consume it anywhere, but an application that wishes to accept both crypto/ecdh-implemented keys and hardware-backed ones can define something like

type ecdhPrivateKey interface {
	Curve() ecdh.Curve
	ECDH(remote *ecdh.PublicKey) ([]byte, error)
	Equal(x crypto.PrivateKey) bool
	Public() crypto.PublicKey
	PublicKey() *ecdh.PublicKey
}

thanks to the magic of Go’s implicitly implemented interfaces. This still uses the ecdh.PublicKey concrete type, but values of that type can be easily constructed for hardware keys with Curve.NewPublicKey. What it does not support is other curves, which I am ok with.
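
For illustration, here’s a minimal sketch of the hardware-backed side of this, assuming a hypothetical hsm package that can export the raw encoding of a public key. Curve.NewPublicKey is the real crypto/ecdh constructor.

func hsmPublicKey(curve ecdh.Curve, k *hsm.Key) (*ecdh.PublicKey, error) {
	raw, err := k.PublicBytes() // hypothetical HSM call returning the encoded point
	if err != nil {
		return nil, err
	}
	// NewPublicKey checks the encoding and, for NIST curves, that the
	// point is actually on the curve.
	return curve.NewPublicKey(raw)
}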

Also, this makes the Curve interface solely an abstraction to produce keys on a certain curve, while the ECDH operation itself is a method of the private key, which feels more correct and elegant. Concretely, we implemented this by adding a private method to Curve, which the concrete types returned by P256(), X25519(), etc. implement, and which PrivateKey.ECDH calls. The reason is making it clear to the linker and to vulncheck that if you only ever call X25519(), the P-256 code is not reachable, so it doesn’t have to be linked into the binary and you don’t need to be notified of its vulnerabilities. We have a test for this. (Yes, I love static analysis tests.)
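
Roughly, the shape of that mechanism is the following (a simplified sketch, not the exact code that landed):

type Curve interface {
	GenerateKey(rand io.Reader) (*PrivateKey, error)
	NewPrivateKey(key []byte) (*PrivateKey, error)
	NewPublicKey(key []byte) (*PublicKey, error)

	// ecdh is unexported, so only the concrete types returned by
	// P256(), X25519(), etc. can implement Curve.
	ecdh(local *PrivateKey, remote *PublicKey) ([]byte, error)
}

// ECDH dispatches through the curve stored in the key, so if X25519()
// is the only constructor ever called, the P-256 implementation of
// ecdh is provably unreachable.
func (k *PrivateKey) ECDH(remote *PublicKey) ([]byte, error) {
	return k.curve.ecdh(k, remote)
}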

Finally, we implemented support for parsing and marshaling public and private keys in PKIX and PKCS #8 format, respectively. NIST keys don’t have different OIDs to distinguish ECDH keys from ECDSA keys, so they always parse into crypto/ecdsa keys, which then have a new ECDH() method to return the equivalent crypto/ecdh key.
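
The conversion looks something like this sketch, assuming derBytes holds the DER encoding of a NIST-curve public key:

pub, err := x509.ParsePKIXPublicKey(derBytes)
if err != nil {
	return nil, err
}
ecdsaPub, ok := pub.(*ecdsa.PublicKey)
if !ok {
	return nil, errors.New("not an EC public key")
}
ecdhPub, err := ecdsaPub.ECDH() // the new Go 1.20 conversion method
if err != nil {
	return nil, err // e.g. an unsupported curve like P-224
}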

Since it also implements X25519 alongside ECDH over NIST curves, crypto/ecdh will in due time replace golang.org/x/crypto/curve25519, too, finally bringing down the number of 25519 implementations in crypto/... and x/crypto to one!

(CL 450816, CL 450815, CL 450335, CL 425463, CL 402555, CL 398914, CL 423363, CL 451115)

bigmod replaces math/big

happy dance

math/big is not exposed to attacker-controlled inputs anymore, nor is it used in any repetitive operations that can leak information through timing side-channels. As I explained in the planning post, this was a major ongoing goal:

math/big is a general-purpose big integer library, it's not constant time, and it's full of complex code that while unnecessary for cryptography has repeatedly led to security vulnerabilities in crypto packages. While it was a convenient way to bootstrap the Go crypto standard library, math/big does not belong in crypto code in 2022.

Two packages needed to be ported: crypto/rsa and crypto/ecdsa.

For crypto/rsa we needed a new library to operate on large integers. By declaring key generation out of scope[2] we can focus on implementing a handful of operations modulo a prime number, where essentially only multiplication is performance sensitive.

We had two implementations to choose from, both based on Montgomery multiplication, one from Lúcás Meier and one from Thomas Pornin. I spent some time reading both, and benchmarked them, and Meier’s turned out simpler and quite a bit faster, despite Pornin’s having amazing documentation. This was probably because Pornin’s was derived from BearSSL’s C implementation, and it tackled key generation, too.

I ended up rewriting a lot of the implementation. I got it to be faster by removing bounds checks inside loops[3], and removed a lot of complexity by dropping a reduction optimization that involved divisions and heuristics. Those two pretty much offset each other in terms of performance. (CL 326012)
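
The bounds check trick from footnote 3 boils down to something like this sketch, here on a constant-time conditional assignment (the names are made up, and the real code is more involved):

// assign sets x to y if cond is 1, and leaves it unchanged if cond is
// 0, in constant time. Copying the limbs slices to locals matters:
// writes through a struct field could, as far as the compiler knows,
// alias and modify the field itself, forcing a reload and a bounds
// check on every iteration; with locals the length is provably stable.
func (x *nat) assign(cond uint, y *nat) {
	mask := -cond // all ones if cond is 1, all zeros if cond is 0
	xLimbs, yLimbs := x.limbs, y.limbs
	for i := range xLimbs {
		xLimbs[i] ^= mask & (xLimbs[i] ^ yLimbs[i])
	}
}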

I also attempted to wrap the whole thing in a safer API where each integer has an associated modulus, but that ended up making the code more unwieldy and slower, so I dropped it. (CL 445018)

A major concern was what we should do with the big.Ints exposed as part of the rsa.PrivateKey structure. I managed to convince myself that it’s ok to leak the precise bit size of all these values, as they only leak the key size (which is not secret). We then made a short list of big.Int methods that are allowed to be used in cryptographic operations because they are simple, safe, fast, and leak only the bit size of the integer: Bits, SetBits, Bytes, Sign, and BitLen. They now all have a scary comment requiring a security team review for any changes, which are unlikely anyway, and there’s an upcoming static analysis test that ensures only those methods are reachable from relevant cryptography functions. (CL 402554)

BitLen was actually leaking more than the bit length on platforms that don’t use a hardware instruction, because bits.Len uses a lookup table. I changed that first to a simple loop, which turned out to be a performance regression, and then to setting all bits after the first set one before calling bits.Len, on Russ Cox’s suggestion. (CL 454617)
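
The final version is, roughly, the following sketch: smear the highest set bit into every lower position, so the lookup table only ever observes values of the form 2^k - 1, which reveal nothing beyond the length itself.

// bitLen is a sketch of a bits.Len variant that leaks only the bit
// length of n through timing, not its value.
func bitLen(n uint) int {
	n |= n >> 1
	n |= n >> 2
	n |= n >> 4
	n |= n >> 8
	n |= n >> 16
	n |= n >> 32 // a no-op on 32-bit platforms, where Go defines wide shifts as zero
	return bits.Len(n)
}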

We expected the performance of this code to be worse than math/big, since the latter has optimized assembly cores. Indeed, it was quite a bit slower. I set a target of < 20% slowdown on amd64 for RSA-2048, since I know of wide-scale deployments out there that depend on its performance. We got under that threshold with three changes:

  1. Montgomery multiplications are performed in the “Montgomery domain” which means we operate on x * 2^k mod N instead of x. One way to get there is to shift x left by k (either with the complex code I mentioned removing above, or by doubling k times). Another way is by multiplying by 2^k mod N. The latter is faster, but requires computing 2^k mod N, which is slow. We switched to the latter, and stored the precomputed value in a private field of PrivateKey.Precomputed. I wished all the PrecomputedValues fields had been private from the start, by the way, but we made it work. (CL 445019, CL 445020)
  2. Big integers have variable length, so they have to be allocated at runtime on the heap, and handled by the garbage collector. However, we know the size needed to do RSA-2048 operations! I added a constructor which returns a zero-length backing slice with a fixed pre-allocated capacity, which ends up inlined and allocated on the caller’s stack. In the general case append() will reallocate the backing array, but if the number stays small enough it will use the stack-allocated space, and save a GC allocation. This dropped allocations by 97.5% and I like how it didn’t change any of the semantics of the surrounding code. (There’s a sketch of this trick right after the list.) (CL 445021)
  3. Finally, I dropped in an Avo-generated assembly core for amd64 to get across the finish line. (CL 452095)
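
Here’s what the preallocation trick from point 2 looks like, as a sketch with assumed names and sizes:

const preallocTarget = 2048 // bits; an assumption, the real constant may differ
const preallocLimbs = (preallocTarget + 63) / 64

// newNat inlines into its caller, so the zero-length slice with a
// fixed, constant capacity can be allocated on the caller's stack.
// Later appends only spill to the heap if the value outgrows
// preallocLimbs.
func newNat() []uint {
	return make([]uint, 0, preallocLimbs)
}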

Surprisingly enough, even with the assembly core we didn’t match the performance of math/big. I would have expected to exceed it, in fact, because despite operating in constant time, we did not have to waste time on blinding, and could pre-allocate space based on the knowledge of the common RSA key sizes. My theory is that on today’s CPUs with add-with-carry instructions, the lore about 63-bit limbs making Montgomery multiplication faster by making carry handling easier is actually outdated. Even pure Go, unlike C, can do addition chains where the carry is stored in CPU flags thanks to compiler support for bits.Add. I will test this by switching to 64-bit limbs in Go 1.21, which might even make it possible to reuse the math/big assembly cores.
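
The kind of carry chain in question looks like this minimal sketch:

// add computes x += y and returns the final carry. The compiler
// recognizes the bits.Add pattern and keeps the carry in the CPU
// flags, emitting ADD/ADC-style chains even from pure Go.
func add(x, y []uint) (c uint) {
	for i := range x {
		x[i], c = bits.Add(x[i], y[i], c)
	}
	return c
}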

I am pretty happy with the resulting code, and I think it’s pretty robust. Amazingly, it clocks in at just 400 lines of Go, plus 100 lines of Avo generator. All the bugs I introduced while developing were edge cases triggered by certain key sizes, rather than by certain values (the common failure mode in constant-time code), so I wrote a test that exercises every operation at every key size up to 4096 bits. It takes fifteen minutes. (CL 443195, CL 450796)

I am optimistic the code will land at this point, but at the time of the freeze it was not clear whether the performance hit would be tolerable, so I have a revert CL ready to make the life of the Release team easier. The only complaint we got was about a very dramatic slowdown in -race mode, which is curious, but we decided it’s not a blocker. (CL 452255)

While at it, I deprecated and de-optimized multi-prime keys. These are keys made of more than two primes. They make private key operations a little faster, because they can operate modulo smaller primes, but they require their own Chinese Remainder Theorem path, you can compromise the security of a key by picking too many primes for the key size, and exactly two projects across GitHub were using them. (CL 445017, CL 453257, CL 459976)

Throughout the process, I used benchdiff against a separate early commit that improved the benchmarks, to have a stable base reference. (CL 433476, CL 451196)

With crypto/rsa out of the way, the last missing piece was crypto/ecdsa. It turned out to be pretty much a full rewrite, but not a particularly interesting one. The new code uses generics over crypto/internal/nistec types for elliptic curve group operations, and the new RSA backend (extracted into crypto/internal/bigmod) for scalar operations. The unfortunate math/big-based APIs Sign and Verify are now wrappers around the []byte-based SignASN1 and VerifyASN1, which use cryptobyte. The scariest part of ECDSA, nonce generation, has extensive tests. The old math/big code is technically still there, but only reachable for custom curves, which were never a good idea and are now deprecated. (CL 450055, CL 353849, CL 453256, CL 459977)

Now that big.Int is not reachable from (non-deprecated) cryptography anymore, the team is unlikely to consider math/big bugs security vulnerabilities. If you’re using it for cryptography, it’s a good time to consider a rewrite. There’s a warning in that sense in the docs now. (CL 455135)

More elliptic curves

crypto/ed25519 now implements Ed25519ctx and Ed25519ph, through an Options struct that can be passed to PrivateKey.Sign or the new VerifyWithOptions. Using context strings to domain-separate signatures is very good hygiene, and Ed25519ctx is criminally underused. It doesn’t help that the RFC doesn’t have a test vector for Ed25519ph with a context string, which we should really fix in CCTV. (CL 373076, CL 404274, CL 459975)
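
Using it looks something like this, assuming priv and pub are an Ed25519 key pair (the context string is made up, and message is not pre-hashed, so this is Ed25519ctx):

opts := &ed25519.Options{Context: "example.com/v1 handshake"}
sig, err := priv.Sign(nil, message, opts) // rand is ignored, Ed25519 is deterministic
if err != nil {
	return err
}
if err := ed25519.VerifyWithOptions(pub, message, sig, opts); err != nil {
	return err // a signature made with a different context won't verify
}

For Ed25519ph, Options.Hash is set to crypto.SHA512 and message is replaced with its SHA-512 digest.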

The rewrite of the edwards25519 scalar field landed and replaced the last bits of unreadable ref10 code with fiat-crypto generated code. As mentioned in the planning post, you can read more on a previous Cryptography Dispatches issue, and there's an overview of the overall edwards25519 rewrite in the CL that landed it after years out-of-tree. (CL 420454)

TLS and X.509

I didn’t get to do the pass of TLS work that I called a stretch goal in the planning post, in part because of the significant complexity of the bigmod work, but I got to review and participate in some nice work by others, primarily Roland Shoemaker.

First, the new certificate cache shares in-memory representations of certificates amongst TLS connections. We talked about it on Cryptography Dispatches, because it’s really neat. (CL 426455, CL 426454, CL 427155)

A very long-running proposal to make TLS clients work in environments like scratch containers that don’t have a root store finally landed on a nice solution. The new SetFallbackRoots API makes it possible to supply a CertPool to use as a fallback if the system doesn’t have a viable one (or a platform verifier). I am pretty happy with the fallback semantics, as they are less likely to get used as a global kludge to override the default verification process. We plan to provide a package in its own x/crypto submodule that calls it automatically, so most users will never use the API directly but will just import that package and keep it up to date. There’s already a follow-up conversation about how to express constraints like those imposed by the Mozilla root program on TrustCor[4] in a CertPool. Root stores and how Go handles them are a deep topic that might deserve a dedicated issue soon. (CL 449235)
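
Usage is a one-time setup call, sketched here assuming rootsPEM is a PEM bundle compiled into the binary:

pool := x509.NewCertPool()
if !pool.AppendCertsFromPEM(rootsPEM) {
	log.Fatal("no roots parsed from the bundled PEM")
}
// Only consulted if the system doesn't have a viable root store
// or platform verifier.
x509.SetFallbackRoots(pool)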

TLS handshakes now return a CertificateVerificationError if they fail because of, well, certificate verification. The nice thing about Go errors being values is that you can add fields to them. Here the error carries the certificates that didn’t verify, in a field loudly named UnverifiedCertificates so no one ends up trusting them by mistake. 🤞 (CL 449336)
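
Retrieving them is a matter of errors.As, as in this sketch:

_, err := tls.Dial("tcp", "example.com:443", nil)
var certErr *tls.CertificateVerificationError
if errors.As(err, &certErr) {
	for _, cert := range certErr.UnverifiedCertificates {
		log.Printf("failed to verify: %v", cert.Subject)
	}
}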

RFC 8422, Section 5.1.2, makes the Supported Point Formats extension optional, and a missing extension means uncompressed points are supported, which are the only allowed option in these TLS 1.3 days.[5] Since everyone sends the extension for backwards compatibility, we didn’t notice we were actually requiring it. My bad. Fixed and backported to be a nice ecosystem player and not force everyone else to keep sending the extension to make Go happy. This would maybe have been caught by BoGo, a reminder to prioritize integrating it. It might also be only the second interoperability issue in the TLS 1.3 implementation since its inception, which is a stat I am pretty happy about. (CL 425295)

crypto/x509 got a couple of follow-up changes to previous work. First, pkix.Name is a pretty unfortunate type, because it’s at best an approximation of an X.509 distinguished name. This caused us pain before. The new CreateRevocationList function was using the issuer’s Subject field, which is a pkix.Name, causing the CRL not to match up against some issuers that have weird DNs. It now uses RawSubject, which we have all over the place specifically because of this. The whole crypto/x509/pkix package is a bit unfortunate, to be honest. Second, CL 428636 made Go reject duplicate extensions in certificates and CSRs, but mistakenly also disallowed duplicate CSR attributes. Now fixed. This probably should have been backported but wasn’t, maybe because it was fixed without opening an issue, which is how we keep track of these things, so I’m opening a backport now. (CL 418834, CL 428636, CL 460236)

Speaking of crypto/x509, we have now formalized in the package docs what was already informal policy: of the monstrous sprawl that is X.509, crypto/x509 targets a profile[6] that’s compatible with the WebPKI. That is, if it’s not needed to correctly and securely interpret publicly trusted certificates, we are likely not to implement it. This makes some folks sad, like those cursed to deal with TPMs or S/MIME, and I feel for them, but regrettably it’s the only way to keep the implementation sane. (CL 266541)

Flashes

Some more interesting stuff that happened around Go and cryptography, mostly not by me.

Nothing uses the global math/rand.Seed anymore. This is related to work to make it possible to improve math/rand, which I am hoping to nudge towards using a cryptographically secure PRNG by default. I got ChaCha8 to take less than twice the time of the current insecure PRNG. Probably writing an issue about this at some point. (CL 445395)

As GODEBUG flags (the ones we use to allow opt-in and opt-out of various cryptography changes) are getting formalized into an official backwards-compatibility policy, Russ made an efficient API to make it affordable to check them in hot code paths. Case in point, GODEBUG=x509sha1=1 is now checked at verification time, rather than at startup, so applications can drop an os.Setenv in main(). Disabling verification of SHA-1 certificates is definitely the change that got the most pushback of all the ones I’ve done, mostly from Kubernetes. (CL 449504, CL 445496)

The RSA-OAEP encryption algorithm uses a hash for two things: to hash the domain-separation label[7], and to generate an XOR mask. For some reason those two can be different hashes. No, I don’t know why. Anyway some systems actually do use different hashes for those things. No, I don’t know why. I pushed back some on supporting this in Go, but eventually caved on adding decryption-side support, so Go programs can decrypt ciphertexts generated by other systems. The nice thing about doing it decryption-only is that we could just sneak a field into OAEPOptions which is passed to the PrivateKey.Decrypt method (for crypto.Decrypter support), and didn’t have to make new DecryptOAEP/EncryptOAEP functions. So now there’s OAEPOptions.Hash and OAEPOptions.MGF1Hash. Sure, why not. (CL 418874)
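
Decryption-side usage looks like this sketch, assuming priv is an *rsa.PrivateKey and the hash choices match whatever the other system did:

plaintext, err := priv.Decrypt(nil, ciphertext, &rsa.OAEPOptions{
	Hash:     crypto.SHA256, // hashes the label
	MGF1Hash: crypto.SHA1,   // generates the XOR mask
})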

Something we do a lot in cryptography is XOR a pair of byte slices (like a key stream and a plaintext). The optimized implementation is now exposed in crypto/subtle, and should be reusable across a number of packages. This might go well with the cycle of work on symmetric ciphers I have planned. (CL 421435)
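
It’s a single function, used like this sketch (keystream and plaintext are assumptions):

dst := make([]byte, len(plaintext))
// XORBytes sets dst[i] = keystream[i] ^ plaintext[i] and returns the
// length of the shorter input; it panics if dst is too small.
n := subtle.XORBytes(dst, keystream, plaintext)
ciphertext := dst[:n]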

Speaking of subtle, Russ renamed the package with the functions to enforce aliasing rules around crypto packages from crypto/internal/subtle to crypto/internal/alias. Not sure why I called it subtle in the first place. Probably lack of imagination. (CL 424194)

There’s an internal package called internal/testenv which is very useful to restrict tests based on the environment, for example if a builder doesn’t have network. It now has a SkipIfOptimizationOff function, which is great for the inlining tests I scattered around since I figured out how to use the inliner to save allocations. (CL 422038)

crypto/sha512 uses the hardware ARM64 SHA-512 instructions when available now, making it 3-4x faster. Given there’s no Avo for ARM64 assembly, this CL’s use of macros and comments was as close as it gets to AssemblyPolicy compliance. (CL 180257)

Speaking of crypto/sha512, all the SHA-1 and SHA-2 packages are now a little faster for small messages, because they make one Write call instead of two. The change being identical for all three packages reminds me that we might want to refactor all Merkle–Damgård hashes to use shared code. (CL 436917)

Since Go 1.19, Go+BoringCrypto is behind a (permanent) GOEXPERIMENT flag rather than a fork, for ease of maintenance reasons. (It was tedious to resolve conflicts every time we touched anything.) In Go 1.20, the module was updated, and it now supports arm64 and 4096-bit RSA keys. RSA PSS signature salt handling was fixed, too, and the clever crypto/internal/boring/bcache package, which works with the GC to keep around the C counterparts of some Go key values, is now type-safe thanks to generics. (CL 423357, CL 423362, CL 426659, CL 447655, CL 451656)

The crypto packages use a lot of new stuff added elsewhere, like sync/atomic types (CL 428477, CL 426088, CL 426087, CL 422296), fmt.Appendf (CL 435696), bytes.Clone (CL 435279), unsafe.{Slice,SliceData,StringData} (CL 435285, CL 428154), and encoding/binary append functions (CL 431035).

Finally, go vet just got better at catching the common mistake of capturing the loop variable, which has caused a Let’s Encrypt incident, in preparation for the exceptional language change that should solve this problem once and for all. I mention it amongst crypto stuff because it caught a mistake in one of my new RSA tests. Yay for static analysis! (CL 450738)
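
The mistake, for reference (process and certs are placeholders):

for _, cert := range certs {
	go func() {
		// Today, cert is a single variable shared across iterations,
		// so every goroutine likely observes the last element. The
		// classic fix is cert := cert at the top of the loop body.
		process(cert)
	}()
}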

The picture

Happy new year! Rome goes hard with fireworks.

Fireworks exploding low in the sky, against a smoky background, with a tree barely visible in the mist.

It was an exciting cycle. If you’re interested in this work, look out for a Go 1.21 planning email in the next month or so, and if your company would like to support me, while getting unlimited access to advice and the reciprocal value of talking to maintainers, reply to this email and let’s talk about a retainer contract. You can also follow me on Mastodon, or reply with any feedback. _o/


  1. The second release candidate came out sixty seconds before I clicked send :) ↩︎

  2. Even if key generation leaks a bit or two through side-channels, the attacker only has one shot to observe it, so the leaks don’t add up, and the attacker can’t force the program into a math/big edge case because key generation has no inputs aside from the bit size. We’re ignoring weird use cases like trying to deterministically generate keys from a key stream, which we also intentionally defeat by making the key generation process non-deterministic. That was done mostly to avoid Hyrum's Law preventing future changes to the key generation algorithm, but it’s a good example of tightly owning the scope of a function turning out useful down the line. ↩︎

  3. This stumped me for a while, until I realized I needed to assign the slices to a local variable. If they are a struct field, the compiler has to assume that any write might end up modifying them (because of aliasing) and has to drop the facts it learned about the relationship between indexes and the length of the slice. ↩︎

  4. Had some fun participating in the distrust conversation. ↩︎

  5. Point formats in TLS are a lightning course in real-world cryptography. Compressed points are almost universally better because they save space and, more importantly, decompressing them tends to yield either a valid point or an error, while uncompressed points can be mistakenly used without checking that they are on the curve, leading to invalid curve attacks. However, point compression was patented until relatively recently, so we all settled on uncompressed points for NIST curves in TLS. Except that was then made configurable, because “agility”, which was worse than either option, and we regretted it, and then made all options except uncompressed points illegal. Even more upstream, NIST curves were defined in terms of abstract field elements, instead of specifying everything all the way down to the byte representation. So here we are, still dealing with the fallout of unnecessary complexity in TLS, caused by patents, and fundamentally by under-specification. ↩︎

  6. A term of art for a subset of a specification. ↩︎

  7. The label hash length is also used to set the randomly-generated seed length, which has nothing to do with the label, I think?? ↩︎