
What the May 8 Let's Encrypt Incident Means for Your Renewal Margin

Two and a half hours, a missing EKU, and the math on the new shortlived profile

May 15, 2026 · [cyphrs] Team · 10 min read

Two and a half hours that nobody quite talked about

On May 8, Let's Encrypt halted all issuance for roughly two and a half hours. The community-facing announcement was, in characteristic LE style, polite and terse: "Gulp, we have been made aware of a potential incident and are shutting down all issuance."

Issuance was restored at 21:03 UTC. The incident post has since been updated to "Resolved." The thread on r/letsencrypt is 43 upvotes deep and the HN thread (item 48067790, if you want to find it) has the technical postmortem. The internet kept working. Most people didn't notice.

I don't think most people should have noticed, in fact. Two and a half hours is a short outage, especially compared to the kind of public-CA wobbles we've had in years past. Let's Encrypt's response was clean: they identified the issue, rolled back, and were issuing again before the average North American operator finished lunch. By the operational standard most of us use, this was a non-event.

It's worth dwelling on anyway, because the math under the surface has quietly changed.

What actually broke

The root cause, now public, is the kind of thing that looks small when you write it down and big when you stare at it for a while. LE's Generation X root was cross-signing the new Generation Y root, so existing relying parties (which only trust X) would still validate Y-issued chains during the transition. Cross-signing is the standard way you migrate root trust. Nothing about that is unusual.

The cross-sign was missing the serverAuth Extended Key Usage. Mozilla bug 2038351 has the trail. Without serverAuth on the cross-sign, validators that walk the chain back to X and check EKU constraints (which is, more or less, all of the relying parties) would refuse to consider the new Y intermediate valid for TLS server authentication. The two new ACME profiles that issue from Y, called "tlsserver" and "shortlived," stopped working. LE noticed quickly, rolled back issuance to the Gen X intermediate at 21:03 UTC, and the outage ended.
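A toy model makes the failure easier to see. The sketch below is not real X.509 path validation: the Cert class, the certificate names, and the clientAuth value on the broken cross-sign are all hypothetical, and it assumes the strict reading in which an EKU extension on a CA certificate constrains everything issued beneath it. The incident report only tells us serverAuth was absent, not what the extension actually contained.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

SERVER_AUTH = "serverAuth"
ANY_EKU = "anyExtendedKeyUsage"


@dataclass(frozen=True)
class Cert:
    """Hypothetical stand-in for a parsed certificate."""
    subject: str
    eku: Optional[FrozenSet[str]]  # None models "no EKU extension present"


def path_allows(path: List[Cert], purpose: str) -> bool:
    """Return True if no certificate in the path rules out `purpose`.

    Assumption: a certificate with an EKU extension permits only the
    purposes it lists (or everything, if it lists anyExtendedKeyUsage).
    """
    for cert in path:
        if cert.eku is None:
            continue  # unconstrained in this simplified model
        if purpose not in cert.eku and ANY_EKU not in cert.eku:
            return False
    return True


leaf = Cert("example.com", frozenset({SERVER_AUTH}))
issuing_ca = Cert("Gen Y issuing CA", frozenset({SERVER_AUTH}))
# Hypothetical contents: an EKU extension that does not include serverAuth.
broken_cross_sign = Cert("Gen Y root, signed by Gen X", frozenset({"clientAuth"}))
fixed_cross_sign = Cert("Gen Y root, signed by Gen X", frozenset({SERVER_AUTH}))

broken_ok = path_allows([leaf, issuing_ca, broken_cross_sign], SERVER_AUTH)
fixed_ok = path_allows([leaf, issuing_ca, fixed_cross_sign], SERVER_AUTH)
print(broken_ok, fixed_ok)
```

The leaf and the issuing CA are identical in both paths. Only the cross-sign changes, which is why a single certificate was enough to take two profiles offline.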

Postmortem, in one sentence

A new intermediate was added to the chain with a cross-sign from the old root that lacked an EKU the relying parties require, and two and a half hours later the operator rolled back to the old intermediate and issuance resumed.

This is, to be clear, the kind of mistake any PKI operator could make. The EKU semantics around cross-signed intermediates are genuinely subtle, and the validators don't all agree about how strict to be. The relevant CABF guidance has changed twice in the last three years. I'm not piling on LE here; they noticed faster than most CAs would have and the rollback path worked. That's the system functioning.

The interesting question is what the same outage would have meant if the new shortlived profile had been live for a year and most LE customers were on it.

The shortlived profile changes the math

LE's "shortlived" ACME profile issues certificates with a roughly 6-day validity. The intended renewal cadence is about every 2.5 days, which means at any given moment a typical certificate is somewhere in the middle of its lifetime, with three to four days of remaining validity. The profile is designed for the long-haul trajectory of public-web TLS, where 47-day certificates land in 2029 and the industry has been pushed toward issuance frequencies that look more like DNS than the annual-renewal world we grew up in.

On a 90-day certificate, two and a half hours of unavailability at the public CA is operationally invisible. You probably have a renewal scheduled for some point in the next two weeks. If May 8's outage delays it, you reschedule the cron job and forget it happened. The margin for upstream failure is enormous.

On a 6-day certificate, the same two and a half hours is a meaningfully bigger fraction of your safety budget. Not catastrophic, but no longer negligible. And it's the wrong kind of unavailability to model with classical SLOs, because the failure mode isn't a request that errors out and gets retried; it's a renewal that doesn't happen, with downstream consequences that won't show up until the cert expires.

The actual math is uncomfortable to write down, mostly because we haven't been asked to think this way before.

Doing the math

Let's assume you're on the shortlived profile. Your typical certificate is somewhere in the middle of its 6-day life, with about 3.5 days of remaining validity on average. Your renewal job runs every 2.5 days. When a renewal attempt fails, you generally have one or two more attempts before the certificate expires entirely.
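The arithmetic behind those figures is short enough to write out. This is a sketch under one assumption that is mine, not LE's: renewals run on a fixed schedule counted from issuance, with no ad hoc retries in between. The function name and dictionary keys are made up for illustration.

```python
import math


def renewal_headroom(lifetime_h: float, renew_every_h: float) -> dict:
    """Headroom for one certificate on a fixed renewal schedule.

    Attempts are assumed to happen at renew_every_h, 2 * renew_every_h, ...
    and only those strictly before expiry count.
    """
    if not 0 < renew_every_h < lifetime_h:
        raise ValueError("renewal interval must be shorter than the lifetime")
    scheduled_attempts = math.ceil(lifetime_h / renew_every_h) - 1
    return {
        # Validity remaining when the first renewal comes due.
        "left_at_first_attempt_h": lifetime_h - renew_every_h,
        # Further scheduled attempts available if the first one fails.
        "retries_before_expiry": scheduled_attempts - 1,
        # Validity remaining at the final scheduled attempt.
        "left_at_last_attempt_h": lifetime_h - scheduled_attempts * renew_every_h,
    }


headroom = renewal_headroom(lifetime_h=6 * 24, renew_every_h=2.5 * 24)
print(headroom)
```

Under that assumption, a 6-day certificate renewed every 2.5 days has 84 hours (3.5 days) left when its renewal comes due, one more scheduled attempt if that renewal fails, and 24 hours left at that last attempt. A client that retries sooner than the next scheduled run would get more attempts than this model counts.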

Here is what a few hypothetical LE outages cost you, expressed as a fraction of remaining margin:

LE outage length	90-day profile	6-day shortlived profile
2.5 hours (May 8 actual)	~0.1% of margin, invisible	~4% of remaining margin
12 hours	~0.5% of margin, invisible	~20% of remaining margin
24 hours	~1% of margin, noticeable	~40% of remaining margin
48 hours	~2% of margin, plan B activates	renewals miss the next intended cycle; expiry risk

The numbers aren't precise, because the actual impact depends on where each individual cert sits in its lifetime when the outage starts. A cert renewed two hours before the incident is fine. A cert that was supposed to renew an hour into the incident is now waiting for the next scheduled attempt, and if that attempt also falls inside the outage, you're cutting into the no-margin band.
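The position-in-lifetime effect can be sketched as a small simulation. Everything here is a simplification for illustration: time is measured in hours since issuance, attempts happen only on the fixed schedule, and the function name is hypothetical.

```python
from typing import Optional


def validity_left_at_renewal(
    lifetime_h: float,
    renew_every_h: float,
    outage_start_h: float,
    outage_len_h: float,
) -> Optional[float]:
    """Hours of validity left when a scheduled renewal finally succeeds.

    Returns None if every scheduled attempt before expiry lands inside
    the outage window [outage_start_h, outage_start_h + outage_len_h).
    """
    outage_end_h = outage_start_h + outage_len_h
    attempt_h = renew_every_h
    while attempt_h < lifetime_h:
        if not (outage_start_h <= attempt_h < outage_end_h):
            return lifetime_h - attempt_h
        attempt_h += renew_every_h  # wait for the next scheduled run
    return None


LIFETIME_H, RENEW_H = 6 * 24, 2.5 * 24

# Renewal ran two hours before a 2.5-hour incident began.
renewed_before = validity_left_at_renewal(LIFETIME_H, RENEW_H, 62, 2.5)
# Renewal was due one hour into the same incident.
due_during = validity_left_at_renewal(LIFETIME_H, RENEW_H, 59, 2.5)
# A three-day outage that swallows both scheduled attempts.
long_outage = validity_left_at_renewal(LIFETIME_H, RENEW_H, 59, 72)
print(renewed_before, due_during, long_outage)
```

The same 2.5-hour incident leaves one certificate with 84 hours of validity and another with 24, purely because of where each renewal fell. The third case is the no-margin band: no scheduled attempt succeeds before expiry.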

The thing the table shows, even with rough numbers, is that the move from 90-day to 6-day certificates isn't just a 15x change in renewal volume. It's also roughly a 40x change in the operational cost of any given public-CA outage. And that ratio gets worse, not better, as the certificate lifetime keeps shrinking through the CABF SC-081 step-down (200 days now, 100 days in March 2027, 47 days in March 2029).
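For anyone who wants to plug in their own numbers, the table reduces to one division. The denominators below are my reading of how the rough figures were derived, not something the table states: the full 90-day lifetime for the default profile, and the 2.5-day renewal interval for shortlived. Swap in whatever margin matches your own schedule.

```python
HOURS_PER_DAY = 24

# Assumed denominators (see the note above); adjust to your own renewal policy.
NINETY_DAY_MARGIN_H = 90 * HOURS_PER_DAY    # 2160 hours
SHORTLIVED_MARGIN_H = 2.5 * HOURS_PER_DAY   # 60 hours


def outage_share(outage_h: float, margin_h: float) -> float:
    """Fraction of the available margin consumed by an outage."""
    return outage_h / margin_h


for outage_h in (2.5, 12, 24, 48):
    long_share = outage_share(outage_h, NINETY_DAY_MARGIN_H)
    short_share = outage_share(outage_h, SHORTLIVED_MARGIN_H)
    print(f"{outage_h:>5} h  90-day: {long_share:6.2%}  shortlived: {short_share:6.1%}")

# How much more of the margin the same outage consumes on the shortlived profile.
cost_ratio = NINETY_DAY_MARGIN_H / SHORTLIVED_MARGIN_H
print(f"ratio: {cost_ratio:.0f}x")
```

With these denominators the ratio comes out at 36x, which lands at roughly 40x once you divide the rounded percentages instead. The exact multiplier matters less than the fact that it scales inversely with the margin.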

"Just have a backup CA" doesn't quite work

The first thing every senior infra engineer says when they read the table above is "fine, run a second public CA as a backup." I've said it too. The reason it doesn't quite solve the problem is worth working through.

Public CAs are a small, increasingly interdependent set of operators. They share underlying trust infrastructure (root program policies, log requirements, lint suites). When something CABF-policy-related goes sideways, the failure mode is correlated across CAs in a way that breaks the "use a backup" assumption. The Mozilla bug that bit LE on May 8 wasn't an LE-only issue; it was a validation question about cross-signed EKUs that could land in any CA's chain rebuilds. A second public CA that's also doing a chain transition that quarter is not actually a redundant supplier.

There's a more pedestrian failure mode too, which is that operationally, very few teams actually run dual issuance. The CSR generation, the ACME client, the cert-manager Kubernetes annotations, the deployment automation, all of it is usually wired to one issuer URL. Standing up a parallel pipeline against a second CA is the kind of "we'll do it later" project that lives on a roadmap for two years and never ships, because in steady state it adds operational cost and saves zero developer hours.
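For a sense of what "wired to more than one issuer URL" means at the smallest scale, here is a sketch of ordered failover. The request_cert callable stands in for whatever ACME client you actually run, IssuanceError is a made-up exception, and the backup directory URL is a placeholder rather than a real CA. Only the Let's Encrypt production directory URL is real.

```python
from typing import Callable, Sequence, Tuple


class IssuanceError(Exception):
    """Raised by the (hypothetical) ACME wrapper when an order cannot complete."""


def issue_with_fallback(
    domain: str,
    directories: Sequence[str],
    request_cert: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each ACME directory in order; return (directory_used, certificate)."""
    failures = []
    for directory_url in directories:
        try:
            return directory_url, request_cert(directory_url, domain)
        except IssuanceError as exc:
            failures.append(f"{directory_url}: {exc}")
    raise IssuanceError("all issuers failed: " + "; ".join(failures))


PRIMARY = "https://acme-v02.api.letsencrypt.org/directory"
BACKUP = "https://acme.backup-ca.example/directory"  # hypothetical second CA


def fake_request_cert(directory_url: str, domain: str) -> str:
    """Test double that simulates the primary CA having halted issuance."""
    if directory_url == PRIMARY:
        raise IssuanceError("issuance halted")
    return f"PEM for {domain}"


used, pem = issue_with_fallback(
    "internal.example.com", [PRIMARY, BACKUP], fake_request_cert
)
print(used)
```

The loop is the easy part. The account registration, validation setup, and deployment path behind each URL are where the two-year roadmap item lives, and none of it helps when both issuers are hit by the same policy-level problem.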

So in practice, the fallback most teams have for a multi-hour LE outage is "hope it ends before the next renewal cycle." On a 90-day cert that's a fine plan. On a 6-day cert that's a bet against the calendar.

The asymmetry with internal trust

Here's the part that's not a sales pitch and that holds up regardless of which vendor's CA you might end up using: an outage at a public CA is fundamentally different in kind from an outage at a private CA you operate.

When LE has an incident, you are a downstream consumer of someone else's PKI making decisions about someone else's chain. You can't roll it back. You can't change the EKU on the cross-sign. You can't decide that, just for this week, your existing certs are valid for 30 more days while you sort it out. You wait.

When your own private CA has an incident, all of those levers are available. You can extend existing cert lifetimes by re-issuing from a parallel intermediate. You can pause renewals temporarily without anyone's relying-party validator caring. You can do controlled rollouts of chain changes against a staging environment that has the same trust root as production. The blast radius is shaped by your own operational discipline, not by a policy change you read about on a mailing list.

Public-CA outages aren't necessarily more frequent than private-CA outages. They are, by construction, less under your control. The shorter the certificate lifetime, the more that lack of control costs you.

That asymmetry has been true for as long as public CAs have existed. The reason it's worth re-stating now is that the move toward shortlived public certificates dramatically amplifies it. We're scaling the cost of public-CA dependency without scaling the corresponding ability to mitigate.

What this argument is not

Let's be clear about what this argument is not. It's not a complaint about Let's Encrypt, which has been the most important piece of public-web infrastructure shipped in the last decade and which, by every operational metric I care about, handled this incident competently. And it's not a complaint about shortlived certificates per se: for browser-facing TLS, where compromise windows matter and the validator pool is the entire internet, shorter lifetimes are a genuinely good idea. I'm not telling anyone to stop using public CAs for the public web.

It's an argument that the shortlived profile changes the right answer to one specific question: which workloads belong on a public CA at all.

For your public website, the marketing site, the API gateway that browsers terminate against, public CAs remain the right answer. The validator pool is the entire internet, and only a public CA can issue chains that pool will trust. Shortlived, automated, public, fine.

For internal service-to-service mTLS, client authentication, device identity, AI agent identity, workload identity inside your perimeter, the picture is different. The relying parties are inside your perimeter too. They don't need a public CA's trust; they need your trust. And the operational cost of treating those workloads as public-CA dependents is exactly the cost the May 8 incident illustrated, on the lifetime curve the public-CA roadmap is committed to.

What to actually do this week

If you're operating any production workload on public certificates, the May 8 incident is a useful prompt for an honest audit. Three concrete questions.

Question 1

For every public certificate you currently renew, can you name a browser or external relying party that needs it?

If the answer is "the load balancer terminates it and the backend talks plain HTTP behind it," the cert is genuinely public-facing. If the answer is "two internal services trust it because we already had a public CA wired up," the cert was on the public CA for convenience, not necessity.

Question 2

If LE (or your public CA of choice) were unavailable for 24 hours starting at 2am tomorrow, which of your renewals would fail and what would the customer impact be?

Run the calendar. Most teams have never done this exercise on the 6-day shortlived profile because they're still on the 90-day default. The answer is informative either way.

Question 3

What's the smallest unit of your infrastructure that could be on a private CA tomorrow without breaking anything browser-facing?

Often the answer is "everything between the load balancer and the backend." That's a few hundred certificates, none of which were ever going to be seen by a browser, all of which inherit public-CA dependency for no architectural reason.

None of these questions require buying anything from anyone, including us. They're audit questions. They surface the certificates whose public-CA dependency is accidental rather than necessary. Once you know which certificates fall in that category, the decision about what to do with them gets a lot simpler.
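If your certificate inventory already lives in a spreadsheet or a CMDB export, the first and third questions can be roughed out in a few lines. The record shape, the relying-party labels, and the category names below are all hypothetical; the point is only the split between public by necessity and public by convenience.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical labels for relying parties outside your perimeter.
EXTERNAL_RELYING_PARTIES = {"browser", "external-client"}


@dataclass
class CertUse:
    name: str
    issuer_is_public: bool
    relying_parties: List[str]


def classify(cert: CertUse) -> str:
    """Sort a certificate into one of three audit buckets."""
    if not cert.issuer_is_public:
        return "already private"
    if any(rp in EXTERNAL_RELYING_PARTIES for rp in cert.relying_parties):
        return "public by necessity"
    return "public by convenience"


def audit(inventory: List[CertUse]) -> Dict[str, List[str]]:
    """Group certificate names by bucket."""
    buckets: Dict[str, List[str]] = {}
    for cert in inventory:
        buckets.setdefault(classify(cert), []).append(cert.name)
    return buckets


inventory = [
    CertUse("www.example.com", True, ["browser"]),
    CertUse("orders-to-billing mTLS", True, ["internal-service"]),
    CertUse("device fleet identity", False, ["internal-service"]),
]
result = audit(inventory)
print(result)
```

The "public by convenience" bucket is the list to take into the third question.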

And for everything that stays on public PKI

The audit above moves workloads off public PKI when they shouldn't have been on it in the first place. For the certificates that legitimately stay public – your edge, your customer-facing endpoints, the things that genuinely need a browser-trusted chain – the May 8 incident points at a second, quieter problem: the renewal authority itself is now load-bearing infrastructure, and almost nobody treats it that way.

Every public renewal exercises four upstream dependencies: the ACME account that signs the order, the DNS records and delegation chain that prove ownership, the persistent _acme-challenge configuration that survives between renewals, and the deployment path that installs the certificate before the old one expires. When cert lifetimes were 398 days and DV reuse windows were generous, each of those was a one-time setup. At a 47-day cadence with compressed DV reuse, every renewal touches every link in the chain.

The natural audit follow-on:

→ Which ACME accounts hold authority over which domains? Is any single account a fan-in point across the estate?
→ For each domain that uses DNS-01 validation, where does the _acme-challenge record actually resolve – on your authoritative servers, a delegated zone, a managed DNS provider's API?
→ If the team that controls the delegated _acme-challenge zone changes ownership or sunsets the API, do your renewals silently start failing on the next cycle?
→ When a renewed cert lands, do you verify the new one is actually the certificate being served – or only that issuance succeeded?
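The last question in that list is the cheapest to automate. A minimal sketch using only the Python standard library: fetch the certificate an endpoint is serving and compare its SHA-256 fingerprint with the one you just issued. The function names are mine, and the network fetch is left uncalled here so the example runs offline.

```python
import hashlib
import socket
import ssl


def sha256_fingerprint(der: bytes) -> str:
    """Hex SHA-256 digest of a DER-encoded certificate."""
    return hashlib.sha256(der).hexdigest()


def fetch_served_cert_der(host: str, port: int = 443, timeout: float = 5.0) -> bytes:
    """Return the DER bytes of the leaf certificate the endpoint presents."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert(binary_form=True)


def served_matches_issued(issued_der: bytes, served_der: bytes) -> bool:
    """True if the served certificate is byte-for-byte the one that was issued."""
    return sha256_fingerprint(issued_der) == sha256_fingerprint(served_der)


# Offline demonstration with placeholder bytes instead of real certificates.
new_cert = b"placeholder-der-for-new-cert"
old_cert = b"placeholder-der-for-old-cert"
deployed_ok = served_matches_issued(new_cert, new_cert)
still_stale = not served_matches_issued(new_cert, old_cert)
print(deployed_ok, still_stale)
```

In a real check you would pass fetch_served_cert_der("your.host.example") as served_der, and convert a PEM file from disk with ssl.PEM_cert_to_DER_cert before comparing. One connection samples one backend, so behind a load balancer it is worth repeating the fetch or targeting each node directly.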

These aren't theoretical. The May 8 incident was a CA-side wobble, a chain problem at the issuer, that resolved in two and a half hours. The next incident – there will be one – may be on a different surface: a delegated DNS zone whose ownership changed, a managed-DNS provider API quota that runs out during a renewal storm, an _acme-challenge record that got mass-deleted by a cleanup script. Each is a separate failure mode of the same load-bearing system.
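The delegated-zone failure modes can be caught before the next renewal by walking the delegation on paper. The sketch below works from exported records rather than live DNS queries, so the input dictionaries, hostnames, and status strings are all hypothetical: cnames maps an alias to its target, and txt_names is the set of names that can actually hold a challenge TXT record.

```python
from typing import Dict, List, Set, Tuple


def follow_delegation(
    name: str,
    cnames: Dict[str, str],
    txt_names: Set[str],
    max_hops: int = 8,
) -> Tuple[List[str], str]:
    """Follow CNAMEs from `name` and report where the chain ends.

    Status is "ok" when the chain reaches a name that can hold the TXT
    record, "dangling" when it points at nothing, "loop" on a cycle.
    """
    chain = [name]
    current = name
    for _ in range(max_hops):
        if current in txt_names:
            return chain, "ok"
        if current not in cnames:
            return chain, "dangling"
        current = cnames[current]
        if current in chain:
            return chain + [current], "loop"
        chain.append(current)
    return chain, "too many hops"


cnames = {"_acme-challenge.example.com": "example-com.acme.dns-team.example"}

healthy = follow_delegation(
    "_acme-challenge.example.com", cnames, {"example-com.acme.dns-team.example"}
)
# Same delegation after a cleanup script removed the target record.
after_cleanup = follow_delegation("_acme-challenge.example.com", cnames, set())
print(healthy[1], after_cleanup[1])
```

Run against a regular export of your zones, a check like this turns a silent failure on the next cycle into a report you read the same day the record disappears.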

This is also where Cyphrs's ACME ARI product crosses from "automate renewals" into "govern the renewal authority itself" – ACME account custody, DNS-PERSIST mapping, delegated DNS authority, and active post-deployment verification, all treated as first-class governable surfaces rather than ambient infrastructure. The audit questions above are useful regardless of whether you decide to use us for any of it.

The throughline

May 8 was a clean incident with a competent response. The reason it's worth writing about a week later isn't that anything went badly. It's that the underlying operational model (every workload renews from the public CA, public CAs occasionally wobble, the wobble is short relative to the cert lifetime so nobody cares) is being quietly retired by the move to shortlived profiles.

When the cert lifetime is 6 days, a wobble of "only" 2.5 hours is 4% of your margin. At 47 days, the math is gentler but the trajectory is the same. We're heading toward a world where public-CA dependency for internal workloads is more expensive than it has any reason to be, and where the right architectural answer (internal trust for internal workloads) was already correct on every other axis (cost, control, identity model, EKU support) before the renewal-margin math even entered the picture.

The May 8 incident just made it concrete. That's all it really did. But sometimes concrete is what you need.

Cyphrs builds a private CA for the workloads that never needed to be on a public one. If you're working through the audit questions above, we'd be happy to look at the result with you.
