The Certificate Automation Trap
Why every ACME approach fails in a constrained enterprise, and what the community keeps missing
April 7, 2026 · [cyphrs] Team · 11 min read
Someone finally wrote it all down
There's a post on r/sysadmin from this past week that I've read three times now. The title is "How to approach SSL certificate automation in this environment?" and the body is about 1,200 words of someone methodically explaining why nothing works.
Here's their setup: 700 servers. Roughly 50/50 Windows and Linux. No outbound internet access. 150 domains. Some requiring EV certificates. And they need automation, because the 200-day certificate lifetime that went into effect on March 15 makes manual renewal at this scale somewhere between difficult and impossible.
They didn't just complain about it. They listed every approach they'd tried, explained exactly where each one broke, and concluded: "all solutions seem hacky at best."
51 upvotes. 53 comments. The most engaged certificate automation thread on r/sysadmin in the past two weeks. And I think the reason it resonated is that this person wasn't asking a vague question. They'd done the work. They'd tried the tools. And they still couldn't make it work.
Every path has a wall
What makes this post useful (beyond the catharsis) is that it lays out the failure modes clearly. Each ACME validation method ran into a specific, concrete constraint. Not a theoretical objection. A real wall.
Certbot with HTTP-01 challenge validation. Load balancers strip the challenge tokens before they reach the backend, so the validation never completes. Not a configuration error on their end; the load balancer architecture simply isn't compatible with the way HTTP-01 hands off proof of domain ownership (the first sketch after this list shows what the challenge actually requires).
DNS-01 challenge validation. Their DNS provider doesn't support per-zone delegation, so granting an ACME client permission to create DNS TXT records means granting write access to every zone. That's a security risk nobody in a 700-server shop is going to sign off on. Correctly, in my opinion. (The second sketch below shows exactly what the client needs to write, and why scoped access matters.)
Centralized certificate issuance with push distribution. Works in theory. But pushing private keys and certificates to 700 servers means a deployment pipeline with access to every service's TLS identity. The attack surface is enormous, and now you have a single point of compromise that can impersonate any service in your environment.
Servers pull certificates via SFTP from a central store. Operationally, this means managing SFTP credentials for 700 endpoints, handling failures and retries, monitoring for stale certificates on servers that missed a pull cycle. At scale, it's another ops burden layered on top of the original problem.
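To make the first wall concrete, here's a minimal sketch of what HTTP-01 demands from the server side, in Go. The token and key authorization values are placeholders, not real challenge material. Per RFC 8555, the CA fetches `http://<domain>/.well-known/acme-challenge/<token>` on port 80 and compares the response body byte for byte against the key authorization. Anything in the path that rewrites the URL or the response, like the poster's load balancers, fails the validation before it ever reaches the backend.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// token -> key authorization, as provisioned by an ACME client.
	// Placeholder values, not real challenge material.
	challenges := map[string]string{
		"tok123": "tok123.accountKeyThumbprint",
	}

	http.HandleFunc("/.well-known/acme-challenge/", func(w http.ResponseWriter, r *http.Request) {
		token := strings.TrimPrefix(r.URL.Path, "/.well-known/acme-challenge/")
		keyAuth, ok := challenges[token]
		if !ok {
			http.NotFound(w, r)
			return
		}
		// The CA compares this body verbatim. A load balancer that
		// rewrites the path or intercepts the response breaks the
		// validation before this handler ever runs.
		fmt.Fprint(w, keyAuth)
	})

	// HTTP-01 is always fetched over plain HTTP on port 80.
	log.Fatal(http.ListenAndServe(":80", nil))
}
```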
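And the DNS-01 side, for contrast. The client hashes the same key authorization with SHA-256, base64url-encodes the digest, and writes it as a TXT record at `_acme-challenge.<domain>` (again per RFC 8555; the value below is a placeholder). The write itself is trivial, which is exactly why the permissions question dominates: the only thing standing between an ACME client and your zones is how narrowly the provider lets you scope that write access.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

func main() {
	// keyAuthorization = token + "." + account key thumbprint
	// (RFC 8555). Placeholder value here.
	keyAuth := "tok123.accountKeyThumbprint"

	digest := sha256.Sum256([]byte(keyAuth))
	txt := base64.RawURLEncoding.EncodeToString(digest[:])

	// This is the record the ACME client must be able to write,
	// which is why it needs DNS write access at all.
	fmt.Printf("_acme-challenge.example.com. 120 IN TXT %q\n", txt)
}
```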
These aren't exotic edge cases. Load balancers that terminate TLS or rewrite requests are standard enterprise infrastructure. DNS providers with coarse-grained permissions are common (especially when the DNS provider is the same one the org has used for a decade). And the security team is always going to push back on centralized key distribution, because they should.
53 comments, same category of answer
The thread got 53 replies. Most of them were helpful, specific, and grounded in genuine experience. But almost every suggestion fell into the same bucket: try a different ACME client. Use lego instead of certbot. Try acme.sh with the DNS-01 hook for your specific provider. Look into Caddy's built-in ACME handling.
Nobody stepped back and asked: do all 700 of these servers actually need certificates from a public certificate authority?
"all solutions seem hacky at best"
r/sysadmin, April 2026
That quote is doing a lot of work. Because the poster is right, and the reason everything feels hacky is that they're fighting a protocol (ACME, backed by public CA validation) that was designed for a completely different environment. ACME assumes outbound internet. It assumes the server can respond to a challenge from a public endpoint. It assumes DNS writes are available. None of those assumptions hold in this person's network.
And that's not a bug in ACME. ACME works great for what it was built to do: let a web server prove to a public CA that it controls a domain, so browsers can trust it. The problem is scope creep. We started using ACME (and public certificates generally) for services that browsers will never see, because public CA certificates were free and convenient and the tooling existed.
The question nobody asked
Of those 700 servers, how many face the public internet?
The poster mentions no outbound internet. That's a strong indicator that most of these servers are internal. They probably serve internal APIs, databases, middleware, monitoring dashboards, admin panels, message queues. Services that talk to each other, not to external browsers. The fact that they need TLS (they absolutely do) doesn't mean they need public TLS. Those are two different requirements that get conflated constantly.
| Needs a public CA | Fine with an internal CA |
|---|---|
| Customer-facing websites | Internal APIs and microservices |
| Public APIs used by third parties | Database connections |
| Email servers (SMTP, IMAP) serving external users | Admin dashboards behind a VPN |
| Anything browsers or external clients need to trust | Message queues, caches, monitoring |
| | Service-to-service mTLS |
My guess is that in a 700-server environment with no outbound internet, the split is probably 80/20 or 90/10 in favor of internal. Maybe more. Which means the poster is fighting public CA validation requirements for hundreds of services that will never be validated by a browser.
The constraints dissolve when you change the architecture
An internal CA doesn't need outbound internet. It doesn't need HTTP-01 challenges or DNS TXT records. It doesn't need to prove anything to Google or Let's Encrypt. It issues certificates to services you own, on networks you control, for trust relationships you define.
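To be concrete about what "issues certificates to services you own" means, here's a minimal sketch using Go's standard library. This is illustrative, not a production CA: keys live in memory, errors just panic, and there's no revocation story. Every name and lifetime in it is my assumption, not something from the thread.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"math/big"
	"time"
)

func must[T any](v T, err error) T {
	if err != nil {
		panic(err)
	}
	return v
}

func main() {
	// 1. CA key pair and self-signed CA certificate. Long-lived,
	// because nothing outside your network ever has to trust it.
	caKey := must(ecdsa.GenerateKey(elliptic.P256(), rand.Reader))
	caTmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "corp-internal-ca"}, // hypothetical name
		NotBefore:             time.Now(),
		NotAfter:              time.Now().AddDate(5, 0, 0),
		IsCA:                  true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageCRLSign,
		BasicConstraintsValid: true,
	}
	caDER := must(x509.CreateCertificate(rand.Reader, caTmpl, caTmpl, &caKey.PublicKey, caKey))
	caCert := must(x509.ParseCertificate(caDER))

	// 2. Leaf certificate for an internal service. No HTTP-01, no
	// DNS TXT record: the CA vouches for the name because you run
	// both sides of the trust relationship.
	leafKey := must(ecdsa.GenerateKey(elliptic.P256(), rand.Reader))
	leafTmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: "rabbitmq.internal.example"}, // hypothetical host
		DNSNames:     []string{"rabbitmq.internal.example"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().AddDate(0, 0, 30), // short lifetimes are cheap internally
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}
	leafDER := must(x509.CreateCertificate(rand.Reader, leafTmpl, caCert, &leafKey.PublicKey, caKey))
	_ = leafDER // PEM-encode and deliver to the service from here
}
```

In practice you'd put this behind an issuance API rather than a script, and tools like Smallstep's step-ca expose an internal ACME endpoint, so the same clients the poster already tried can keep working; they just point at your authority instead of a public one.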
Look at what happens to each of the poster's constraints:
| Constraint | Public CA | Internal CA |
|---|---|---|
| No outbound internet | Blocks all ACME validation | Irrelevant. CA is internal. |
| Load balancers strip HTTP-01 | Breaks HTTP challenge flow | No HTTP challenge needed |
| DNS lacks per-zone perms | All-or-nothing DNS write risk | No DNS validation required |
| Mixed OS (Win + Linux) | Different ACME clients per OS | Agent-based, OS-agnostic distribution |
| EV requirements | Still needs public CA for these | Public CA for the (few) that need it |
Four of the five constraints disappear entirely. The EV requirement stays, but EV certs are only relevant for the handful of domains that face external users. Those get a public cert through whatever manual or semi-automated process works for that small number. The other 600+ servers? They get certificates from an authority that lives on the same network they do.
This keeps happening
The 700-server thread isn't an anomaly. The same week, a DevOps engineer on r/devops asked how people track certificate expirations at scale (48 upvotes, 68 comments). Their description: spreadsheets that rot, shared calendars nobody owns, reminder emails that get ignored. Another poster on r/sysadmin is managing 120 SaaS SSO certificates and doing 3 to 4 manual renewals per month. And there's the person on r/devops who tried to point a public DNS record at a private IP just so they could use Let's Encrypt for an internal RabbitMQ instance. 45 comments correcting the approach, but the underlying desire is exactly right: trusted TLS for internal services without the operational pain of public CA validation.
Each of these threads is a different person discovering the same gap. They need TLS internally. The tools they know about (Let's Encrypt, ACME, public CAs) don't fit their internal environment. And there's no obvious, accessible alternative that just works.
Meanwhile, CyberArk published a blog this week calling the current state of certificate management "the endless game of Whack-A-Cert." Their stat: 67% of organizations experience certificate-related outages monthly. And that number comes from the 398-day era. We're now at 200 days, headed to 100, then 47. The math only goes one direction.
Automation isn't the answer. It's half the answer.
I want to be careful here, because I'm not arguing against automation. You absolutely need automated certificate lifecycle management. At 47-day lifetimes, manual renewal is mathematical suicide. But automation only works when the underlying architecture makes sense. Automating a bad design just means you fail faster and more efficiently.
The poster's problem isn't that they chose the wrong ACME client. The problem is that ACME (and public CA validation more broadly) assumes an environment that doesn't match their reality. No amount of client-side tooling fixes the fact that their load balancers eat HTTP challenges, their DNS provider won't grant scoped permissions, and their servers can't reach the internet.
What fixes it is recognizing that the certificate estate isn't one problem. It's two.
External trust is about proving identity to the outside world. Browsers, partners, third-party APIs. This requires a public CA, always will, and you automate it with ACME where you can.
Internal trust is about services proving identity to each other, inside your network. This requires a CA you control. The certificates come from you, the trust chain is yours, and the operational constraints of public validation don't apply.
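"The trust chain is yours" cashes out as something like this on the client side: distribute the internal root to the fleet (via config management or the OS trust store) and have services trust it. A minimal Go sketch, with a hypothetical file path and endpoint:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

func main() {
	// Hypothetical location for the internal root; in practice it
	// lands here via config management or the OS trust store.
	pemBytes, err := os.ReadFile("/etc/pki/internal/root-ca.pem")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pemBytes) {
		log.Fatal("no certificates parsed from root-ca.pem")
	}

	// This client trusts exactly one authority: yours.
	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool},
		},
	}
	resp, err := client.Get("https://rabbitmq.internal.example:15671") // hypothetical endpoint
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println(resp.Status)
}
```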
Once you make that split, the 700-server problem shrinks dramatically. Maybe 50 servers need public certificates. You solve those with whatever combination of ACME and vendor tooling works for your load balancer setup. The other 650 get certificates from an internal authority that already lives on their network.
The compression timeline makes this urgent
This isn't just an architectural preference. It's becoming a timeline problem. TechRadar ran a piece last week headlined "Why October 1, 2026, could be the day SSL/TLS certificates break the Internet." The reasoning: the 200-day certificates issued in the first weeks after the March 15 cutover all expire in October. Organizations that haven't automated will hit a wall of simultaneous renewals.
Now (since March 15, 2026). Maximum certificate lifetime is 200 days. All public CAs must comply. The first cohort of 200-day certs is already ticking.
October 2026. First wave hits. The 200-day certs issued in March start expiring. Organizations without automation face simultaneous renewals across their entire public certificate estate.
March 2027. 100-day maximum. Renewals go from roughly twice a year to nearly quarterly. The operational cadence doubles.
March 2029. 47-day maximum. You're renewing every public certificate roughly every month. At 700 servers, that's a full-time job or a fully automated pipeline. There's no middle ground.
Every service that doesn't need a public cert but currently has one is a service you'll be renewing on that compressed schedule for no reason. That's not a security improvement. That's operational waste.
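The arithmetic behind that is worth making explicit. Assuming certs are renewed at two-thirds of their lifetime (a common safety margin; my assumption, not the poster's), the fleet-wide renewal load looks like this:

```go
package main

import "fmt"

func main() {
	const servers = 700.0
	for _, lifetime := range []float64{398, 200, 100, 47} {
		renewEvery := lifetime * 2 / 3        // days between renewals per cert
		perMonth := servers * 30 / renewEvery // fleet-wide renewals per month
		fmt.Printf("%3.0f-day certs: renew every ~%3.0f days -> ~%4.0f renewals/month\n",
			lifetime, renewEvery, perMonth)
	}
}
```

At 47 days that's roughly 670 renewals a month across 700 servers, about 22 per working day. Every internal-only service you move off public CAs comes off that forced cadence and onto a schedule you choose.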
The gap in the conversation
I keep coming back to those 53 comments. The community showed up. People shared their configurations, their ACME client preferences, their workarounds for specific DNS providers and load balancer configurations. That's how sysadmin communities work, and it's valuable.
But the conversation stayed inside a frame that assumes every certificate must come from a public authority. And that frame is the problem. Not because public CAs are bad (they're essential for what they do), but because they're being used for things they weren't designed for, in environments that actively resist their validation model.
The reason that frame persists is probably historical. For a long time, running your own CA meant standing up Microsoft ADCS, dealing with Active Directory integration, managing CRLs, and generally wading through a level of PKI complexity that most sysadmins correctly wanted nothing to do with. "Just use Let's Encrypt for everything" was a sane response to that reality. But the tooling has changed. The r/selfhosted community is already exploring lightweight private CAs. The complexity barrier is lower than it used to be.
And the cost of not making the split is about to go up. A lot.
Certificate automation solves the speed problem. But if you're automating certificates for 600 internal services through a public CA, you've automated the wrong thing. The architecture question comes first.