RSA key generation is both conceptually simple and one of the worst implementation tasks in the field of cryptography engineering. Even benchmarking it is tricky and involves some math: here’s how we generated a stable but representative “average case” instead of using the ordinary statistical approach.
RSA key generation
Say you want to generate a 2048-bit RSA key. The idea is that you generate random 1024-bit numbers until you find two that are prime, you call them p and q, and compute N = p × q and d = 65537⁻¹ mod φ(N)[1] (and then some more stuff to make operations faster, but you could stop there). The computation of d is where the RSA magic lies, but today we are focusing on the first part.
There is almost nothing special about selecting prime candidates. You draw an appropriately sized random number from a CSPRNG, and to avoid wasting time, you set the least significant bit and the two most significant bits: large even numbers are not prime, and setting the top two guarantees N won’t come out too small.
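A candidate generator along those lines might look like this (a sketch, assuming the bit size is a multiple of 8, as 1024 is):

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// candidate draws a random number of the given bit size and forces the
// least significant bit (so it's odd) and the two most significant bits
// (so the product of two such numbers always has the full 2×bits length).
// Assumes bits is a multiple of 8.
func candidate(bits int) (*big.Int, error) {
	b := make([]byte, bits/8)
	if _, err := rand.Read(b); err != nil {
		return nil, err
	}
	x := new(big.Int).SetBytes(b)
	x.SetBit(x, 0, 1)      // odd
	x.SetBit(x, bits-1, 1) // top bit
	x.SetBit(x, bits-2, 1) // second-from-top bit
	return x, nil
}

func main() {
	x, err := candidate(1024)
	if err != nil {
		panic(err)
	}
	fmt.Println(x.BitLen(), x.Bit(0)) // 1024 1
}
```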
Checking if a number x is prime is generally done with the Miller-Rabin test[2], a probabilistic algorithm where you pick a “base” and use it to run some computations on x. It will either conclusively prove x is composite (i.e. not prime), or fail to do so. Figuring out how many Miller-Rabin tests you need to run is surprisingly difficult: initially you will learn the probability of a test failing for a composite is 1/4, which suggests you need 40 rounds to reach 2⁻⁸⁰; then you learn that’s only the upper bound for worst-case values of x,[3] while random values have a much much lower chance of failure; eventually you also realize that it doesn’t matter that much because you only run all the iterations on the prime, while most composites are rejected in the first iteration. Anyway, BoringSSL has a table and we’ll want 5 Miller-Rabin tests with random bases for a 1024-bit prime.
Miller-Rabin is kinda slow though, and most numbers have small divisors, so it’s usually more efficient to quickly reject those by doing “trial divisions” or a GCD with the first handful of primes. The first few dozen primes are usually a major win, but using more and more primes has diminishing returns.
There are a million and one things that can go wrong, but interestingly enough you have to go out of your way to get them wrong: if you generate large candidates fully at random, all those failure cases have cryptographically negligible probability.
To recap, to generate an RSA key you generate two primes. To generate a prime you pick random numbers, try to rule them out with trial divisions, and then do a few Miller-Rabin tests on them.
Benchmarking it
Now, how are we supposed to benchmark that? Luck drastically affects runtime: you’re essentially benchmarking a lottery. While debugging a performance regression, Russ Cox ran hundreds of measurements and still got noisy and, in places, suspect results. It’s also not fast enough that you can run millions of measurements and let things average out.
One might be tempted to normalize the measurements by dividing the runtime by the number of candidates tested, but this unevenly dilutes all the final computations, and is still perturbed by how many candidates are caught by trial division and how many proceed to the Miller-Rabin tests. Similarly, benchmarking Miller-Rabin in isolation ignores the final computations, and doesn’t measure the impact of trial divisions.
What we can do is use math to figure out what an average representative sequence of candidates looks like, and benchmark that. Since the key generation process is repeatable,[4] we can pre-generate a golden sequence of candidates, and even share it across implementations to benchmark apples to apples.
Generating an average sequence
First, we need to figure out how many composites we should expect on average before each prime. The prime-counting function approximation tells us there are approximately Li(x) primes less than x, which works out[5] to one prime every 354 odd integers of 1024 bits.
Then, we normalize the small divisors of the composites. A random number has a 1/p chance of being divisible by p, and based on that we can calculate how many composites divisible by the first n primes we’d expect to encounter before a prime. For example, we’d expect 33% of numbers to be divisible by 3, 46% to be divisible by 3 or 5, 69% of numbers to be divisible by one of the first 10 primes, 80% to be divisible by one of the first 50 primes, and so on. Flipping that around, we make 118 of our 353 composites divisible by 3, 47 divisible by 5 but not by 3, 27 divisible by 7 but not by 3 or 5, and so on. This will make the number of successful trial divisions representative, and will even let us do comparative benchmarks between different trial division thresholds without regenerating the inputs.
Beyond setting the top and bottom bits like keygen will, we also unset the second-least significant bit and set the third-least significant bit of each candidate to normalize the number of iterations of the inner loop of Miller-Rabin, which depends on the trailing zeroes of x-1.
We don’t need to worry about composites failing Miller-Rabin tests: if 5 tests are enough to get to 2⁻¹¹² then one test fails with at most 2⁻²² chance, which is not cryptographically negligible but will not show up in benchmarks. Similarly, we don’t need to worry about e not being invertible modulo φ(N): we use 65537 as e, which is prime, so only 1/65537 numbers aren’t coprime with it.
Script, vectors, and results
The result is remarkably stable and should be representative both in terms of absolute runtime and in terms of CPU time spent in different functions, allowing meaningful profiling. Generating 20 random average traces and benchmarking them yields variance of less than 1%.
You can see it in use in this Go standard library code review. The script to generate the traces, as well as ten ready-to-use traces, are available in CCTV, and you’re welcome to use them to benchmark your implementations!
If you got this far, you might also want to follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @filippo@abyssdomain.expert.
The picture
One day a friend was driving me to the SFO airport from Redwood Park and we were late. Like, flight begins boarding in a few minutes late. But then we came up to this view, and had to stop to take it in. The people in the other car had set out a little camping chair to watch the sun set over the clouds below. I have an incredible video of driving down into the clouds. Made the flight!
My maintenance work is funded by the awesome Geomys clients: Interchain, Smallstep, Ava Labs, Teleport, SandboxAQ, Charm, Tailscale, and Sentry. Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement.)
Here are a few words from some of them!
Teleport — For the past five years, attacks and compromises have been shifting from traditional malware and security breaches to identifying and compromising valid user accounts and credentials with social engineering, credential theft, or phishing. Teleport Identity is designed to eliminate weak access patterns through access monitoring, minimize attack surface with access requests, and purge unused permissions via mandatory access reviews.
Ava Labs — We at Ava Labs, maintainer of AvalancheGo (the most widely used client for interacting with the Avalanche Network), believe the sustainable maintenance and development of open source cryptographic protocols is critical to the broad adoption of blockchain technology. We are proud to support this necessary and impactful work through our ongoing sponsorship of Filippo and his team.
SandboxAQ — SandboxAQ’s AQtive Guard is a unified cryptographic management software platform that helps protect sensitive data and ensures compliance with authorities and customers. It provides a full range of capabilities to achieve cryptographic agility, acting as an essential cryptography inventory and data aggregation platform that applies current and future standardization organizations mandates. AQtive Guard automatically analyzes and reports on your cryptographic security posture and policy management, enabling your team to deploy and enforce new protocols, including quantum-resistant cryptography, without re-writing code or modifying your IT infrastructure.
Charm — If you’re a terminal lover, join the club. Charm builds tools and libraries for the command line. Everything from styling terminal apps with Lip Gloss to making your shell scripts interactive with Gum. Charm builds libraries in Go to enhance CLI applications while building with these libraries to deliver CLI and TUI-based apps.
Or, if you want to make your life harder and your code more complex for no practical benefit, you can use λ(N) instead of φ(N), but that’s for a different rant. ↩︎
There is also the Lucas test, and doing both a round of Miller-Rabin with base 2 and a Lucas test is called the Baillie–PSW test. There are no known composites that pass Baillie–PSW, which sounds great, but the Lucas test is a major pain to implement. ↩︎
In an adversarial setting, you also need to worry about the attacker forcing or adapting to your selection of bases. The amazingly-named Prime and Prejudice: Primality Testing Under Adversarial Conditions by Albrecht et al. pulls a number of fun tricks, but the main one boils down to the observation that if you hardcode the bases or generate them from x, they are not random. ↩︎
It’s not strictly speaking deterministic, because the tests are randomized, but the chance of coming to a different conclusion is cryptographically negligible, and even the chance of major deviations in runtime is very small, as we will see. ↩︎
I did a quick Monte Carlo simulation to check this was correct, and it was really fun to see the value swing and converge to the expected value. Math! ↩︎