I like tests. I especially like reusable test vector libraries. Sometimes test vectors are lovingly handcrafted to target obscure edge-cases. Those vectors belong in Wycheproof or with the upstream specification. Sometimes, though, vectors are produced by sheer brute force. Enumerate every possible input and check the output. Try a million random inputs and see what happens. Combine all possible input sizes for every parameter. Make one very, very large input.

These vectors can be tremendously effective and require no prior insight into the bugs you’re hunting. If you run 300 000 random tests, you have a 99% chance of hitting any 2⁻¹⁶ edge case.[1]
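To sanity-check that figure, the probability of hitting at least one occurrence across independent trials follows from the complement rule; here's a small Go calculation (the function name is mine, for illustration):

```go
package main

import (
	"fmt"
	"math"
)

// hitProbability returns the chance that at least one of n independent
// random tests triggers an event that occurs with probability p:
// 1 - (1-p)^n.
func hitProbability(p float64, n int) float64 {
	return 1 - math.Pow(1-p, float64(n))
}

func main() {
	// 300,000 random tests against a 2^-16 edge case.
	fmt.Printf("%.4f\n", hitProbability(math.Exp2(-16), 300000)) // ≈ 0.99
}
```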

For example, you can get pretty good coverage of the internals of the new post-quantum algorithm ML-KEM by generating random key pairs, passing the public key to Encapsulate and the resulting ciphertext to Decapsulate, and then making sure keys, ciphertext, and shared secret are as expected.[2] The reference implementation offers a program to produce a corpus of 10 000 of these known-answer tests.

The catch is that—unless the result is self-evidently correct[3]—you need to actually check in inputs and expected outputs somewhere. Those ML-KEM vectors run into the tens of megabytes, even compressed. Checking in the reference implementation is also undesirable, with its nontrivial size, incompatible build system, and different supported environments.

Here’s a trick I call accumulated test vectors: define a test that draws random inputs from a deterministic source,[4] like an extendable-output function such as SHAKE128; accumulate the output(s) into a hash function; and check in the expected final hash. It’s a simple idea that has probably been reinvented many times, but that I have not seen documented before.
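Here's a minimal Go sketch of the pattern, using the standard library's crypto/sha3 package (Go 1.24+). SHA3-256 stands in for the algorithm under test, and the function name and sizes are illustrative; the only thing you'd check in is the final hex digest:

```go
package main

import (
	"crypto/sha3"
	"encoding/hex"
	"fmt"
)

// accumulatedVector runs n random tests of a function (here SHA3-256, as a
// stand-in for the algorithm under test), drawing inputs from a
// deterministic SHAKE128 stream and writing every output into a second
// SHAKE128 used as an accumulator. Only the final 16-byte digest needs to
// be checked in.
func accumulatedVector(n int) string {
	r := sha3.NewSHAKE128() // deterministic source, empty input
	a := sha3.NewSHAKE128() // accumulator

	for i := 0; i < n; i++ {
		buf := make([]byte, 64)
		r.Read(buf)
		out := sha3.Sum256(buf)
		a.Write(out[:])
	}

	sum := make([]byte, 16)
	a.Read(sum)
	return hex.EncodeToString(sum)
}

func main() {
	fmt.Println(accumulatedVector(10000))
}
```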

Here’s all you need to run 10 000 random ML-KEM tests equivalent to the ones from the reference implementation:

Instantiate SHAKE128 with an empty input; call it r. Instantiate another SHAKE128; call it a. Then:

  1. Draw an ML-KEM-768 seed from r and pass it to KeyGen_internal. Write ek to a.
  2. Draw a message from r and pass it to Encaps_internal. Write ct and k to a.
  3. Run Decaps and check that k matches.
  4. Draw an invalid ciphertext from r and pass it to Decaps. Write the rejection k to a.

Repeat steps 1–4 until you've run 10 000 iterations, then draw 16 bytes from a; they should match 8a518cc63da366322a8e7a818c7a0d63.
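For a sense of what this looks like in code, here's a hedged Go sketch with the standard library's crypto/mlkem and crypto/sha3 (Go 1.24+). The standard library does not expose the derandomized Encaps_internal, so this sketch uses the randomized Encapsulate, checks only the roundtrip, and accumulates only the deterministic ek; it will not reproduce the hash above, and the function name is mine:

```go
package main

import (
	"bytes"
	"crypto/mlkem"
	"crypto/sha3"
	"encoding/hex"
	"fmt"
)

// mlkemAccumulated draws ML-KEM-768 seeds from a deterministic SHAKE128
// stream, derives key pairs, accumulates the encapsulation keys into a
// second SHAKE128, and checks the Encaps/Decaps roundtrip along the way.
func mlkemAccumulated(n int) string {
	r := sha3.NewSHAKE128() // deterministic input source
	a := sha3.NewSHAKE128() // accumulator

	for i := 0; i < n; i++ {
		seed := make([]byte, mlkem.SeedSize)
		r.Read(seed)
		dk, err := mlkem.NewDecapsulationKey768(seed)
		if err != nil {
			panic(err)
		}
		ek := dk.EncapsulationKey()
		a.Write(ek.Bytes()) // deterministic from the seed

		// Encapsulate is randomized in the stdlib, so ct and k are
		// only roundtrip-checked, not accumulated.
		k, ct := ek.Encapsulate()
		k2, err := dk.Decapsulate(ct)
		if err != nil || !bytes.Equal(k, k2) {
			panic("Encaps/Decaps roundtrip failed")
		}
	}

	sum := make([]byte, 16)
	a.Read(sum)
	return hex.EncodeToString(sum)
}

func main() {
	fmt.Println(mlkemAccumulated(100))
}
```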

Quite a long way from tens of megabytes of JSON!

The approach also scales well with available CPU time. In Go we run 100 iterations for -short mode (in presubmit checks), 10 000 by default (taking a couple of seconds in CI), and 1 000 000 on demand (during development). That last one would have required more than a gigabyte of vectors; instead, we check in just three hashes in total.

Like JSON vectors, accumulated vectors are reusable across implementations, as long as the generator and accumulator are simple enough and broadly available. The ML-KEM vectors are available on CCTV.

Beyond purely random inputs, accumulated tests can be defined to systematically explore large ranges of input parameters and sizes. For example, we have an accumulated cSHAKE128 test that iterates through the combinations of N sizes 0 to 200, S sizes 0 to 200, and input sizes 100, 168, 200 (to cover values below, matching, and above the “rate” or block size). This exercises multiple joints at once: length prefix encoding, input chunking, and padding. It was written as a generic test to cover a specific bug, but it immediately caught an unrelated issue during a refactor.
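The sweep can be sketched in Go with crypto/sha3 (Go 1.24+); the ranges here are trimmed down from the 0–200 of the real test to keep the example fast, but the structure is the same:

```go
package main

import (
	"crypto/sha3"
	"encoding/hex"
	"fmt"
)

// cSHAKESweep systematically covers combinations of N, S, and input sizes,
// drawing all byte strings from a deterministic SHAKE128 stream and
// accumulating every cSHAKE128 output into a second SHAKE128.
func cSHAKESweep() string {
	r := sha3.NewSHAKE128() // deterministic source for N, S, and inputs
	a := sha3.NewSHAKE128() // accumulator

	for nLen := 0; nLen <= 20; nLen++ {
		for sLen := 0; sLen <= 20; sLen++ {
			// 100, 168, 200: below, matching, and above the
			// 168-byte cSHAKE128 rate.
			for _, msgLen := range []int{100, 168, 200} {
				N := make([]byte, nLen)
				S := make([]byte, sLen)
				msg := make([]byte, msgLen)
				r.Read(N)
				r.Read(S)
				r.Read(msg)

				c := sha3.NewCSHAKE128(N, S)
				c.Write(msg)
				out := make([]byte, 32)
				c.Read(out)
				a.Write(out)
			}
		}
	}

	sum := make([]byte, 16)
	a.Read(sum)
	return hex.EncodeToString(sum)
}

func main() { fmt.Println(cSHAKESweep()) }
```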

Finally, you can do something similar to “compress” individual large vectors. For example, we now test cSHAKE personalization strings bigger than 512 MiB (whose size in bits overflows a uint32) by drawing them from SHAKE128 and checking the expected output. Using a deterministic source lets us compare the result with other implementations easily, although it took a couple of tries because the first implementation we tried also had a bug.
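A sketch of that idea in Go, again with crypto/sha3 (Go 1.24+): the personalization string is drawn from SHAKE128 instead of being stored, so nothing large gets checked in. The real test would use a size above 512 MiB; the size and function name below are illustrative:

```go
package main

import (
	"crypto/sha3"
	"encoding/hex"
	"fmt"
	"io"
)

// largeS generates a cSHAKE128 personalization string of the given size
// from a deterministic SHAKE128 stream, then returns the hex digest of a
// short message. Only the expected digest needs to be checked in, not the
// huge input.
func largeS(size int) string {
	r := sha3.NewSHAKE128()
	S := make([]byte, size)
	io.ReadFull(r, S)

	c := sha3.NewCSHAKE128(nil, S)
	c.Write([]byte("hello"))
	out := make([]byte, 32)
	c.Read(out)
	return hex.EncodeToString(out)
}

func main() { fmt.Println(largeS(1 << 20)) }
```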

The main disadvantage of accumulated vectors is that if the test fails it offers no insight for debugging. I think this is well justified by the advantages, and mostly mitigated in common scenarios: when first developing an implementation, intermediate values are more useful than test vectors, and when making changes you generally have an idea of what you touched and potentially broke. A lot of cryptography engineering involves blindly looking for a bug relying only on a binary result code anyway! It’s part of the fun.

If you got this far, you might also want to follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @filippo@abyssdomain.expert.

The picture

One of my favorite places in Rome: Tiber Island, in the middle of the river. The building occupying half of it is an active hospital with an emergency room.[5] The coast is a great pacing track for long phone calls.

I’m somewhat surprised I haven’t posted this Rome picture yet. I must have taken a hundred variations of it by now: day, night, sunset, sunny, storm, flood, migratory birds. I love the water, the symmetry, and the variance over time.

A nighttime view of Tiber Island in the middle of the Tiber River, featuring a lit stone bridge connecting the island, historic buildings with warm lighting framed by trees, and reflections of lights on the calm black water on either side.

My maintenance work is funded by the awesome Geomys clients: Interchain, Smallstep, Ava Labs, Teleport, SandboxAQ, Charm, and Tailscale. Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement.)

Here are a few words from some of them!

Teleport — For the past five years, attacks and compromises have been shifting from traditional malware and security breaches to identifying and compromising valid user accounts and credentials with social engineering, credential theft, or phishing. Teleport Identity is designed to eliminate weak access patterns through access monitoring, minimize attack surface with access requests, and purge unused permissions via mandatory access reviews.

Ava Labs — We at Ava Labs, maintainer of AvalancheGo (the most widely used client for interacting with the Avalanche Network), believe the sustainable maintenance and development of open source cryptographic protocols is critical to the broad adoption of blockchain technology. We are proud to support this necessary and impactful work through our ongoing sponsorship of Filippo and his team.

SandboxAQ — SandboxAQ’s AQtive Guard is a unified cryptographic management software platform that helps protect sensitive data and ensures compliance with authorities and customers. It provides a full range of capabilities to achieve cryptographic agility, acting as an essential cryptography inventory and data aggregation platform that applies current and future mandates from standardization organizations. AQtive Guard automatically analyzes and reports on your cryptographic security posture and policy management, enabling your team to deploy and enforce new protocols, including quantum-resistant cryptography, without rewriting code or modifying your IT infrastructure.

Charm — If you’re a terminal lover, join the club. Charm builds tools and libraries for the command line. Everything from styling terminal apps with Lip Gloss to making your shell scripts interactive with Gum. Charm builds libraries in Go to enhance CLI applications while building with these libraries to deliver CLI and TUI-based apps.


  1. I strongly believe that in a well-designed cryptographic component, edge cases should be testable or completely unreachable (i.e. < 2⁻¹²⁰ chance). Anything with a random chance > 2⁻⁴⁰ we can brute force our way to, but ideally they’d all have random chance > 2⁻¹⁶ so that they can be tested organically without targeting. ↩︎

  2. This only covers successful computations. You also need negative test vectors for e.g. rejected encapsulation keys. Those lend themselves better to hand-crafting. ↩︎

  3. If you can tell from the result whether it’s correct or not without comparing it with a known answer, you can more simply apply fuzzing with invariants, a.k.a. property testing. ↩︎

  4. Deterministically defining how to draw inputs is easiest when the algorithm provides testable derandomized interfaces, without using randomness from the sky. ↩︎

  5. I was the support EMT for an ambulance drop-off there last year, and it was kinda weird to see this place I always walk past in such a different light. ↩︎