2 Nov 2022

Why Did the OpenSSL Punycode Vulnerability Happen

Some room-temperature takes on yesterday’s not-quite-RCE vulnerabilities in OpenSSL 3.0, and on what there is to learn about safe cryptography engineering.

A recap

Yesterday OpenSSL published version 3.0.7, which was pre-announced to contain a fix for a CRITICAL vulnerability, the first one since 2016 and since Heartbleed before that. The vulnerability was downgraded to HIGH. You can read why on the OpenSSL blog.

The vulnerability is a stack write overflow in certificate verification. You can choose to overflow an unlimited number of . bytes, or four arbitrary bytes. The four-byte overflow is a little subtler, but the . is fairly straightforward.

There is a function, ossl_punycode_decode that decodes Punycode. Punycode is a way to encode Unicode as ASCII, used to represent Unicode strings in the ASCII-only world of DNS. ossl_punycode_decode takes an output buffer, and if the buffer runs out it keeps parsing and verifying the Punycode but discards the rest of the output.

I did not have the heart to figure out why it works like this. Maybe there’s a good reason to do progressive parsing. Maybe it’s an artifact of how C makes you do memory management or of OpenSSL’s C style. Anyway.

You can see where this is going. Sometimes the “should we discard the output or add it to the buffer” check is faulty, and the function keeps appending to the buffer past its end.

This is the relevant snippet of code. The reading key is that result is both the eventual return value, and the thing that controls whether output gets added to outbuf. Once outbuf runs out result gets set to 0, but line 303 doesn’t check result before adding a ..

The crasher PoC is simply a domain with a lot of Punycode labels, like xn--a.xn--a.xn--a.xn--a.xn--a.xn--a.xn--a.xn--a.xn--a.xn--a.xn--a.xn--a.xn--a.xn--a.xn--a.xn--a.xn--a.xn--a.xn--a.xn--a.xn--a.xn--a.xn--a.xn--a. All those .s end up on the stack even if the outbuf was smaller.

ossl_punycode_decode is finally used (only) in the process of enforcing an X.509 chain feature called Name Constraints on email addresses, which is how this is reachable from certificate verification.

Colm has a nice writeup and a PoC repo. I live-streamed some analysis and played with Colm’s PoC to confirm our understanding of the vulnerability. You can watch the recording.

On severity

This was initially announced as a CRITICAL, and then downgraded to HIGH. Honestly, I think might even be a MEDIUM, because of where you need to put these strings to reach the parser.

The vulnerability is in certificate verification, so most TLS servers are not affected right off the bat. Only servers that accept client certificates do certificate verification.
Name Constraints are only checked after a chain has been built to a trust anchor. The code is not part of certificate parsing itself. This is very good. The malicious certificates need to be part of a chain ultimately signed by a trusted root, and most environments trust their roots.

- I’m told Basic Constraints are checked before this, too, so you can’t pull off a trick where you get a certificate from Let’s Encrypt for filippo.io and then pretend it’s an intermediate root and use it to sign a malicious certificate.
IMHO most importantly, the malicious string must be in the intermediate root, not in the leaf. It’s conceivable that there might be systems out there that issue untrusted parties leaf certificates for arbitrary email addresses, but I can’t see an attacker getting their hands on an intermediate root with a custom Name Constraint that chains to a trusted root.

So yeah, not the end of the world.

Lots of people criticized the OpenSSL project for causing alarm, but as someone who ran a vulnerability reporting program for a large project (Go) I think they did ok. Vulnerability triage is a high-pressure activity you do on a clock, and you have to make judgement calls. Upgrading the severity at the last moment is way worse than downgrading it, so they IMHO erred on the right side. Maybe it would have been nice to get a heads up of the downgrade, but you’re sometimes doing analysis until the very last moment.

It’s also the first CRITICAL in more than five years, so it’s not like they are constantly crying wolf.

One thing I’ll say about the process is that it would have been nice to know the CVE number in advance so we could coordinate effectively instead of using made up names, which is the point of CVEs.

Why did it make it to a release

There are process lessons to learn, though.

Hanno, as always bridging the gap between available tools and concrete results, pointed out that this vulnerability could be found in a second by LibFuzzer with the simplest possible fuzzing function.

I’m transcribing it here just to make a point about how simple it is. ossl_punycode_decode is not a carefully chosen internal function, it has a man page.

#include "crypto/punycode.h"

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    unsigned int out[32];
    unsigned int outlen = 32;
    ossl_punycode_decode((const char *)data, size, out, &outlen);
    return 0;
}

I’m a big fan of asking “how could we have found this bug in advance without knowing anything about it?” Sometimes the answer is complicated, like “we would have had to write a fuzzer for all arithmetic operations that mixes values that we know in advance are likely to hit edge cases even if we don’t know which ones and then compare against a known-good implementation” (which is what we did in the Go standard library anyway!).

This time it looks like the answer was “we should have used a fuzzer at all”. That’s disappointing. This was newly added code. This was a parser written in C. This was a perfect match for fuzzing. It didn’t require any novel tool or technique.

We could talk about expanding tooling for fuzzing coverage reports, and I think we can do a lot in general to make fuzzer behavior more understandable. However here the issue would have been prevented by just having a policy of “when fuzzing makes perfect sense and is very easy, use it”.

I’m not sure what the next step is or who should take it, but there seems to be a large margin for cheap process improvements that would yield a large security benefit in critical software infrastructure.

Why was this code even necessary

There’s an even wider scope to consider, though. Why is that code there? Could it have been even more secure by not existing?

As curl author Daniel Stenberg said, “I’ve never even considered to decode punycode. Why does something like OpenSSL need to decode this?”

The answer is: an explicit IETF design choice, that made punycode decoding part of X.509 verification, without even a line of acknowledgement in the Security Considerations.

As we said, Punycode is about encoding Unicode in domain names. You might be tempted to say “ah, why would we want Unicode in delicate cryptography stuff like this”. I invite you to resist that urge. Internationalization is not the issue, internationalization is the job. We make things to serve users, and users (mostly outside the US and UK) communicate using more than ASCII, and their domain names and email addresses should be written in their language and work as well as English ones. The general effort to expand i18n is good.

How did that lead to “OpenSSL has a Punycode parser” though? Well, folks (rightfully) wanted Unicode email addresses in X.509 certificates, but punycode is only defined for domain names. What if there’s Unicode in the local part (the stuff before the @) of email addresses?

There were a number of ways to solve this!

One was to just define how to encode the local part in ASCII, staying true to the idea of punycode as the way to abstract encoding issues. Here’s a draft that does just that by prepending a : and then encoding with Base64. Roughly one page long. No Name Constraints interactions, barely any new code path. We could have had nice things!

But nope. Here’s RFC 8398. It adds a new Subject Alternative Name type that’s a UTF-8 email address, SmtpUTF8Mailbox.¹ But what if the issuer (i.e. the intermediate root) has a Name Constraint that says it can only issue for a certain domain? That Name Constraint will be encoded as Punycode… how does it apply to a SmtpUTF8Mailbox in a leaf? Well, you decode the Punycode before doing the comparison, of course. The sections on matching and Name Constraints are longer than the whole alternative draft, and there’s a whole other RFC of amendments to path verification.²

What’s really short instead is the Security Considerations section with no mention of the fact that this adds a whole new parser to the critical path of certificate verification, so I’m not sure the risk and cost of software complexity was taken into account, making this as much an IETF specification failure as a C language failure.

In conclusion, this is why we resist adding even trivial features to crypto/x509, Go’s implementation of a X.509 parser and verifier. I seem to remember a feature request for exposing otherName fields of certificates, which includes SmtpUTF8Mailbox. It felt pretty innocuous: just another field we could add to the struct, five lines of code maybe. However, the moment we expose it we need to apply Name Constraints to it! Not doing so would be a security vulnerability. Without knowing it, we could have opted into all this complexity and ended up importing a Punycode parser in crypto/x509. Still, it would not have been written in C, I guess.

The picture

Florence is always pretty.

It’s even more complex, actually: if there’s Unicode in the local part, you use SmtpUTF8Mailbox, if there’s Unicode only in the domain you’re still expected to use punycode. You could end up with most certificates in the organization using punycode and only a few using SmtpUTF8Mailbox, which would surely work out great. RFC 8399, Section 2.5 summarizes the new logic X.509 verifiers need to apply. ↩
There are some good choices in there, like deprecating NCs on the local parts of email addresses that no one was using, instead of defining UTF-8 constraints. But that also opened the door to keeping encoding issues entirely out of Name Constraints while still supporting international domains through Punycode, so it really feels like a missed opportunity. Punycode was supposed to be the abstraction layer, to keep encoding out of the rest of the stack. ↩