4 Aug 2022

A Wide Reduction Trick

In line with the original spirit of Cryptography Dispatches, this is a quick¹ issue to talk about a neat bit of cryptography engineering I encountered.

The structure of an ECC implementation

Elliptic curve cryptography implementations all roughly share the following structure: there’s a base field implementation, the group logic, a scalar field implementation, and a higher level protocol (key exchange, signatures, …) over the group.

The base field is the set of numbers modulo a large prime number (such as 2^255-19, from which Curve25519 takes its name). The implementation provides arithmetic operations in the field, such as modular addition, subtraction, multiplication, and inversion, as well as encoding and decoding of field elements to/from bytes.

The group is the set of points on the elliptic curve. The implementation represents a point in some coordinate system where each coordinate is a base field element, and then applies a point addition formula which is defined in terms of base field operations.

The scalar field is, like the base field, the set of numbers modulo a large prime. However, this large prime is not selected by the designers, but is instead the order of the group. That is, the number of times you need to add a point to itself before you get back to that point.² In practice, the scalar field implementation is often very different from the base field implementation, for a few reasons:

the performance of scalar field operations is less important to the high-level protocol performance than that of base field operations;
the prime modulus can’t be selected to have a nice optimization-friendly shape, with cute reduction identities and such;
some high-level protocols only need a very restricted set of scalar field operations, or none at all (such as ECDH);
some other high-level protocols need scalar field operations that aren’t relevant to the base field, such as a “wide” modular reduction from a value numerically much higher than the field order to a field element.

In this issue we’re going to talk about those wide modular reductions.

fiat-crypto

Field implementations are where some of the most subtle and terrifying bugs hide. Most bugs in group logic, protocol implementations, and the way these all interact can be reached with careful, extensive testing.

Carry bugs in field implementations can easily occur randomly for one in 2^64 inputs, only on some specific architecture, with no way to share test cases between implementations.³ Mentally reviewing the logic is awful, and a tiny mistake is enough to exfiltrate keys.

Enter fiat-crypto. This un-Googleable project uses formal methods to prove the logic of field arithmetic operations correct, and generates code in a variety of languages to implement those operations. The code is not the most readable but the API is usable and the most common bugs are ruled out by the formal proof.

Between Go 1.17 and Go 1.19 I progressively replaced with it all NIST curves base field implementations in the Go standard library (except the most aggressively optimized assembly ones).

What about the scalar fields though?

Aside: the Christmas tree

Let’s take a detour in cryptography engineering history.

Most cryptography implementations aren’t written from scratch, and instead have somewhat of a lineage. The current Go edwards25519 implementation traces back to a port from C to Go by Adam Langley of the “ref10” public domain implementation by Daniel Bernstein, so called because it was distributed as the crypto_sign/ed25519/ref10 subfolder of a benchmarking tarball called SUPERCOP.⁴ It was then extracted for use in github.com/gtank/ristretto255 by George Tankersley, Henry de Valence, and me, merged with an old 2017 assembly optimization by George Tankersley, and finally exposed as filippo.io/edwards25519. One year ago, I re-upstreamed it, replacing the code in the standard library it started from.

Over the years, all the base field and group logic from ref10 was replaced, but the scalar field implementation was still the same. In fact, most of the edwards25519 implementations I am aware of use that scalar implementation, consisting of two large functions: sc_muladd which computes a * b + c and sc_reduce which reduces a 512-bit value. This giant uncommented blob of hardcoded integer constants is so infamous that it’s affectionately referred to as “the Christmas tree” amongst practitioners, due to the shape of the large addition round of multiplication terms.

(Imagine a pen-and-paper columnar multiplication between two large numbers with the same digit count. The rightmost digit of the result is the sum of one product, the second rightmost of two, and so on until the middle one, and then back down until the leftmost is the sum of one product again, ignoring carries. That’s what gives the Christmas tree its shape. There is also an explanation in our edwards25519 base field implementation.)

Anyway, George Tankersley’s branch for the PR that replaced scMulAdd with fiat-crypto was called gtank/war-on-christmas. That is all.

Wide reduction

fiat-crypto provides only one decoding function, which requires the input to already be lower than the field order. This is mostly not a problem with base fields, where non-canonical “overflowing” values are rejected.⁵ However, we often need to operate on and reduce scalar values higher than the scalar field order:

the simplest constant-time way to select a random scalar, e.g. for EdDSA or hashing to the curve, is to take a random 512-bit value and reduce it modulo the scalar field order, making the bias negligibly small;
Ed25519 “clamping”—which is actually useless in a signature scheme but for some reason got copied from RFC 7748 to RFC 8032—sets the second most significant bit in a 32-byte string representing the scalar, making it by definition at least 2^254, while the scalar field order is a little over 2^252;
ECDSA takes a base field element and interprets it as a scalar field element because it’s a horrible, horrible protocol, and the base field is a little larger than the scalar field.

What do we do then if we have a value that might be as high as 2^512-1 and a field implementation that can only take inputs lower than the order (2^252 and change)?

My answer was “we find the lovely, helpful fiat-crypto authors at a cryptography workshop in Amsterdam and ask them to add a wide reduction function to fiat-crypto” but it turns out that implementing that in the Montgomery backend (what you need for weirdly shaped primes like this) is not easy, especially if you need to reduce from above the square of the order, which we do.

Thankfully Frank Denis had a much, much better answer. The kind so simple (for a person who spends most of their time working on this stuff) that it clicks immediately and then you want to write a newsletter issue about.

I represent the value as a+b*2^192+c*2^384 https://t.co/O0dwPKbzAL
— Frank ⚡ (@jedisct1) May 13, 2022

The trick

Here’s the idea: take a large number, such as 712376532, and imagine we need to operate over it with a calculator that can only count up to 1041, and then wraps around to 0. The good news is that we want all our results modulo 1042.

First, we notice we can write 712376532 mod 1042 as

712376532 = 712 * 1000000 + 376 * 1000 + 532 mod 1042

“But wait”, you’ll say, “1000000 is more than 1042!” Yes, but since we are operating modulo 1042 we can precompute its reduction, like this

712376532 = 712 * 722 + 376 * 1000 + 532 mod 1042

These are now all numbers that we can put into our calculator!

We do the same thing to use fiat-crypto for the edwards25519 scalar field. We cut the 64 bytes wide value up into three pieces, two of 21 bytes and one of 22 bytes; we decode each of them with fiat-crypto, since they are all guaranteed to be less than 2^176; we multiply the two most significant ones by 2^336 (for which we precomputed the reduction) and 2^168; and finally we add it all together.

x = c * 2^336 + b * 2^168 + a  mod l

The result is the correct value of the 512-bit wide input modulo the field order, but we got there through two additions and two multiplications of values below the field order, which is what fiat-crypto provides us.

With this, the last piece of the ref10 scalar implementation is gone, resulting in an overall change, ignoring autogenerated code, of 268 insertions(+), 893 deletions(-).

Any day we get to delete 900 lines of unreadable crypto code is a good day.

The picture

An especially good day if you get to do it on the side of a beautiful lake.

A picture of the side of a lake at night. In the foreground, a cement bench seen from behind with a yellow backpack on it, next to a tree that frames the picture at the top. Beyond it, a small pier with wooden poles stretches into the pitch balck water. In the distance the lights of a town on the other side of the lake.

Look, I swear it started quick. ↩
I’m ignoring cofactors in this explanation. You don’t need them today. The technically correct definition of the order of the scalar field is the order of the prime-order subgroup of the elliptic curve, so the number of times you need to add a point on the prime-order subgroup like the canonical generator to itself before you get back to the point. ↩
However, you can make fuzzer mutators that leverage knowledge about the internals of the field implementation to much more efficiently generate inputs that are likely to tickle edge cases without knowing in advance what bugs you’re looking for. There is such a fuzzer in filippo.io/edwards25519 and there was one in the stdlib P224 before I torched that whole implementation. I plan to write a retrospective analysis some day showing how many bugs it could have found in various implementations. ↩
SUPERCOP implementations are distributed without READMEs or version control. We all use Frank Denis’s unofficial repository to track them and this whole arrangement has been problematic in the past. Let’s just say there are many reasons I am happy to replace the last vestiges of SUPERCOP code in crypto/ed25519. ↩
This is actually also a problem for the Curve25519 base field, since RFC 7748 in its effort to make every 32-byte string a valid public key accepts the non-canonical values 2^255-19 through 2^255-1 as valid representations of 0 to 18. RFC 8032 rejects them, but in practice Ed25519 implementations allow them. A future Dispatches issue will talk about how we’re testing these edge cases in Go, and how perfect backwards compatibility is critical to avoid forks in consensus applications. Such a small range can be handled with a conditional subtraction, which however needs to be implemented manually: subtract 2^255-19 and choose whether to keep the result or the input based on the final borrow value. This is not a problem in Go because we have a modern, well-tested, custom implementation of the Curve25519 base field. ↩