TechUnfiltered.dev
Cloud Infrastructure · #aws #ssl #saas #caddy #cloudfront #lets-encrypt

How I Solved AWS ALB's 25 SSL Certificate Limit for SaaS Custom Domains

Apr 22, 2026 · 18 min read · Updated today

Everything works perfectly — until your 26th customer.

That's when I discovered a hard limit in AWS: an Application Load Balancer only supports 25 SSL certificates per listener by default. For a SaaS product offering white-labeled custom domains, that's not a limitation — it's a wall.

This is the story of how I hit that wall, walked through a few dead ends, and eventually shipped a Caddy-based dynamic SSL layer that's been handling thousands of domains in production without a single incident. And then — because the landscape never stops moving — how AWS released their own native answer a year later.


The Initial Setup (Which Worked Fine)

The product is a SaaS checkout platform. Every tenant gets a sandbox subdomain the moment they sign up:

text
storename_a1b2.sandbox.acme-saas.io

The SSL story here is trivial. One wildcard ACM certificate for *.sandbox.acme-saas.io, attached to the ALB's 443 listener. Zero operational overhead. Infinite tenants. Everything just works.

This was the architecture for the first few months. No complaints.


Then Customers Wanted Their Own Domains

The moment you offer white-labeling, the request shows up: “Can we use our own domain instead?”

Now tenants want to point their domain at the platform:

text
shop.client-brand.com     →  our infra
store.another-tenant.xyz  →  our infra

For customers already using Cloudflare with orange-cloud proxy enabled, this is easy. Cloudflare terminates SSL at their edge and forwards the request to our ALB over a generic wildcard. We didn't have to issue anything.

But Cloudflare isn't universal. Plenty of customers don't use it. Plenty don't want to. So for the rest, we have to issue the SSL ourselves.

The first version was the obvious one:

  1. Customer adds a CNAME from their domain to our ALB.
  2. We request an ACM certificate for shop.client-brand.com.
  3. Customer adds the DNS validation record.
  4. We attach the issued cert to the ALB's 443 listener.
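The two AWS-side steps (2 and 4) boil down to a pair of API calls. Here is a minimal sketch of the payloads, assuming boto3-style `acm.request_certificate(**params)` and `elbv2.add_listener_certificates(**params)` calls — the ARNs are placeholders:

```python
def acm_request_params(customer_domain: str) -> dict:
    """Payload for acm.request_certificate(**params) — step 2."""
    return {
        "DomainName": customer_domain,
        "ValidationMethod": "DNS",  # customer adds the validation CNAME in step 3
    }


def attach_cert_params(listener_arn: str, cert_arn: str) -> dict:
    """Payload for elbv2.add_listener_certificates(**params) — step 4."""
    return {
        "ListenerArn": listener_arn,
        "Certificates": [{"CertificateArn": cert_arn}],
    }


params = acm_request_params("shop.client-brand.com")
```

It is exactly this second call — `add_listener_certificates` — that starts failing once the listener is full.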

This works. For a while.


The Wall: ALB's 25-Certificate Limit

Somewhere around the 25th customer, ACM issued the certificate fine — but the ALB refused to attach it.

AWS Application Load Balancer allows a maximum of 25 certificates per listener by default.

You can request a quota increase, but it's capped around 100. For a platform meant to grow into hundreds or thousands of tenants, this number is a ceiling, not a floor. Provisioning a new ALB every 25 customers isn't architecture — it's paperwork.

The solution needed to scale to thousands of domains without any per-customer infrastructure work.


Dead End #1: CloudFront with a Single Distribution

The first instinct: move SSL termination from ALB to CloudFront. CloudFront supports SNI. Problem solved.

Except it isn't. Here's what you run into:

  • A CloudFront distribution allows one SSL certificate attached at a time.
  • That single certificate must cover every alternate domain name on the distribution.
  • ACM certs have a default limit of 10 SAN entries per certificate (increasable, but capped).
  • A wildcard like *.client-brand.com doesn't help when your next customer's domain is shop.other-brand.xyz. Different TLDs. Different root domains. No wildcard works across them.

Single-distribution strategy: dead on arrival for arbitrary multi-tenant custom domains.


Dead End #2: One CloudFront Distribution Per Tenant + Lambda Automation

If one distribution won't work, spin up one per tenant and automate it.

The Lambda pipeline looks clean on paper:

One-distribution-per-tenant with a Lambda onboarding pipeline. Works on paper — breaks down operationally once you hit hundreds of distributions to manage.

Technically this works. But operationally it's painful:

  • CloudFront has a soft limit on distributions per account.
  • Every new distribution is 15–20 minutes to deploy.
  • At 500 tenants, you're managing 500 distributions — configuration drift, invalidation cost, auditability all get worse with every one you add.
  • Tearing down a tenant means tearing down a distribution. Not fun.

It's a solution. It's not an elegant one. And at the time, I wanted something I wouldn't have to babysit.


The Breakthrough: Caddy + Let's Encrypt on ECS

I started looking at NGINX + Let's Encrypt + some orchestration layer. That's when I ran into Caddy.

Caddy has one feature that changes everything for multi-tenant SaaS: on-demand TLS. It issues a Let's Encrypt certificate the first time a domain hits it, caches it, and serves every subsequent request from cache. No pre-provisioning. No cert limits. No API dance.

If you're fuzzy on where NLBs and ALBs actually live in your AWS network, the VPCs, subnets, and routing breakdown is worth reading first — the rest of this architecture assumes you know how traffic moves through a VPC.

The architecture I landed on:

Traffic flow: NLB → Caddy (ECS Fargate) → internal ALB → app services. Caddy checks S3 for a cached cert, calls the verify API to confirm the domain is authorized, then issues or reuses a Let's Encrypt certificate.

Traffic flow:

  1. Customer browser hits shop.client-brand.com over HTTPS.
  2. DNS points to the NLB (layer 4, raw TCP passthrough).
  3. NLB forwards to Caddy running on ECS Fargate.
  4. Caddy checks S3 for an existing certificate. If found, it uses it. If not, it asks our API whether this domain is authorized, then issues a fresh Let's Encrypt certificate and stores it in S3.
  5. SSL terminates at Caddy. The request is forwarded to the internal ALB with the original Host header preserved.
  6. ALB routes to the right app service based on Host.

SSL termination happens at the Caddy layer. The internal ALB only ever needs one certificate — for the fixed internal domain Caddy talks to it on.


The Actual Configuration

Here's the Dockerfile. The key part is building Caddy with the S3 storage plugin so multiple Caddy tasks share the same cert store.

Dockerfile
FROM caddy:builder AS builder

RUN xcaddy build \
    --with github.com/ss098/certmagic-s3

FROM caddy:alpine

COPY --from=builder /usr/bin/caddy /usr/bin/caddy
COPY Caddyfile /etc/caddy/Caddyfile

ENV S3_REGION=us-east-1
ENV S3_BUCKET=acme-saas-certs
ENV S3_PREFIX=production/certs/

EXPOSE 80 443 2019

CMD ["caddy", "run", "--config", "/etc/caddy/Caddyfile", "--adapter", "caddyfile"]

And the Caddyfile. This is where the real magic lives:

Caddyfile
{
    email ops@acme-saas.io

    storage s3 {
        host "s3.us-east-1.amazonaws.com"
        bucket "acme-saas-certs"
        use_iam_provider true
        prefix "production/certs"
        insecure false
    }

    on_demand_tls {
        # Caddy calls this before issuing a cert for any new domain.
        # With query param ?domain=shop.client-brand.com
        # If it returns 200, Caddy issues. Otherwise it refuses.
        ask https://api.acme-saas.io/domains/verify
    }
}

:443 {
    tls {
        on_demand
    }

    reverse_proxy http://internal-alb.us-east-1.elb.amazonaws.com {
        # Keep the original Host header so the app knows which tenant is loading.
        header_up Host {host}
    }
}

:80 {
    redir https://{host}{uri}
}

:2019 {
    metrics /metrics
}

Two things worth calling out:

  • ask endpoint. Without this, Caddy would happily try to issue a certificate for any domain that resolves to your NLB. That's a Let's Encrypt rate-limit disaster and an open relay for abuse. The ask endpoint is the gatekeeper.
  • S3 storage. Run two or more Caddy tasks for HA, and they need to share a cert store. Without the S3 plugin, each task issues its own cert for the same domain and you burn through Let's Encrypt rate limits fast.

Domain Verification: Stopping Abuse

The ask endpoint is the most security-critical piece of this whole setup. If it returns 200 for arbitrary domains, attackers can point random hostnames at your NLB and rack up Let's Encrypt requests until you're rate-limited globally.

The flow:

  1. Customer adds their domain in our dashboard.
  2. We generate a unique verification token and return two DNS records they need to add: a TXT record (proves ownership) and a CNAME (points traffic to us).
  3. When the customer clicks “Verify,” we do a live DNS lookup from our backend to confirm the TXT record exists and matches what we issued.
  4. Only after TXT verification passes does the domain become “active” in our database.
  5. When Caddy's ask endpoint hits our API for a certificate, it checks this “active” flag.

The UUID generation and verification logic, in Python:

domain_verification.py
import uuid
import dns.resolver

SAAS_ROOT = "acme-saas.io"
TXT_RECORD_NAME = "acme_verify"


def generate_verification_record(customer_domain: str) -> dict:
    """
    Called when a customer adds their domain in the dashboard.
    Returns the exact DNS records they need to add.
    """
    token = uuid.uuid4().hex
    txt_value = f"{token}.verify.{SAAS_ROOT}"

    return {
        "txt_record": {
            "type": "TXT",
            "name": f"{TXT_RECORD_NAME}.{customer_domain}",
            "value": txt_value,
        },
        "cname_record": {
            "type": "CNAME",
            "name": customer_domain,
            "value": f"nlb.{SAAS_ROOT}",
        },
        "token": token,
    }


def verify_txt_record(customer_domain: str, expected_token: str) -> bool:
    """
    Live DNS lookup against authoritative nameservers.
    Returns True only if the customer has actually added the record.
    """
    expected_value = f"{expected_token}.verify.{SAAS_ROOT}"
    record_name = f"{TXT_RECORD_NAME}.{customer_domain}"

    try:
        answers = dns.resolver.resolve(record_name, "TXT", lifetime=5)
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer, dns.resolver.Timeout):
        return False

    for rdata in answers:
        for txt_bytes in rdata.strings:
            if txt_bytes.decode("utf-8") == expected_value:
                return True

    return False

And the endpoint Caddy hits before issuing a certificate:

caddy_ask_api.py
from fastapi import FastAPI, HTTPException, Query

app = FastAPI()


@app.get("/domains/verify")
def caddy_ask_endpoint(domain: str = Query(...)):
    """
    Caddy calls this before issuing any cert.
    Return 200 only if the domain is verified and active.
    """
    # `db` is the app's existing database handle (psycopg-style API), not shown here.
    record = db.fetch_one(
        "SELECT status FROM custom_domains WHERE domain = %s",
        [domain],
    )

    if not record or record["status"] != "active":
        raise HTTPException(status_code=403, detail="domain not authorized")

    return {"ok": True}

One live DNS lookup, one database check, no shared state beyond what's already in your app's primary DB. That's the entire verification layer.
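Wired together, the whole lifecycle is a small state machine: pending → active, with the ask endpoint gating issuance on the active flag. A condensed in-memory sketch (a dict stands in for the real database; helper names are illustrative):

```python
import uuid

# domain -> {"token": str, "status": "pending" | "active"}
domains: dict[str, dict] = {}


def add_domain(domain: str) -> str:
    """Dashboard step: register the domain, hand back the TXT token."""
    token = uuid.uuid4().hex
    domains[domain] = {"token": token, "status": "pending"}
    return token


def mark_verified(domain: str, seen_txt_token: str) -> bool:
    """Runs after the live DNS lookup confirms the TXT record."""
    record = domains.get(domain)
    if record and record["token"] == seen_txt_token:
        record["status"] = "active"
        return True
    return False


def caddy_ask(domain: str) -> int:
    """What the ask endpoint returns: 200 lets Caddy issue, 403 refuses."""
    record = domains.get(domain)
    return 200 if record and record["status"] == "active" else 403


token = add_domain("shop.client-brand.com")
assert caddy_ask("shop.client-brand.com") == 403  # not verified yet
mark_verified("shop.client-brand.com", token)
assert caddy_ask("shop.client-brand.com") == 200
assert caddy_ask("evil.example.com") == 403       # never registered
```

The important invariant: a domain can never reach "active" without its token having been seen in live DNS, so the ask endpoint never approves a hostname you don't control the relationship for.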


ECS Sizing for Caddy

Running Caddy in production taught me a few things about sizing. Here's what's been running fine for months, serving real traffic:

| Setting | Value | Reason |
|---|---|---|
| CPU | 0.5 vCPU | Enough for SSL handshakes under normal load |
| Memory | 1 GB | Prevents OOM during cert generation bursts |
| Compute | Fargate ARM64 (Graviton) | Best price-performance |
| Min tasks | 2 | HA across AZs, zero-downtime deploys |
| Max tasks | 4–8 | Headroom for handshake spikes |
| Autoscaling target | 70% CPU | Scales before handshake queues build |

Two tasks comfortably handle a large volume of tenants. The bottleneck isn't Caddy — it's usually your app.
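For reference, the sizing above maps onto Fargate task-definition values like these. A sketch of the boto3-style `register_task_definition` parameters — the family name is hypothetical, and Fargate expresses CPU in 1/1024-vCPU units and memory in MiB:

```python
# Subset of the ecs.register_task_definition(**task_definition) payload.
task_definition = {
    "family": "caddy-ssl-edge",
    "requiresCompatibilities": ["FARGATE"],
    "runtimePlatform": {
        "cpuArchitecture": "ARM64",          # Graviton
        "operatingSystemFamily": "LINUX",
    },
    "cpu": "512",      # 0.5 vCPU
    "memory": "1024",  # 1 GB
}

# Service autoscaling knobs from the table above.
autoscaling = {
    "min_tasks": 2,              # HA across AZs
    "max_tasks": 8,
    "target_cpu_percent": 70,    # scale before handshake queues build
}
```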


Trade-offs to Know About

Nothing's free.

  • First-hit latency. The very first HTTPS request to a new domain takes 3–5 seconds while Caddy completes the ACME exchange with Let's Encrypt. Every request after that is served from the cached certificate with no extra latency until renewal. For most use cases this is fine. If you need zero cold start for new tenants, pre-issue certs from a background job instead of waiting for the first visitor.
  • Verify-API dependency. If your ask endpoint is down, Caddy can't issue new certificates. Already-issued certs in S3 keep working fine — this only affects new domains.
  • Let's Encrypt rate limits. 50 new certificates per registered domain per week, 5 duplicate certificates per week. Don't manually delete S3 cert objects unless you're sure the domain is gone — re-issuing can eat into your weekly quota fast. (If you want a deeper read on why external rate limits quietly break systems at scale, I've written about how to design a rate limiter that actually works at scale — same mental model applies to consuming one.)
  • Header trust. Your app must trust X-Forwarded-For and X-Forwarded-Host from Caddy to identify the tenant and the real client IP.

None of these have bitten us in production. But they're real and you should know about them going in.
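The pre-issue idea from the first bullet is simple to sketch: a background job makes one HTTPS request to each newly activated domain, which triggers Caddy's on-demand issuance before any real customer traffic arrives. A standard-library-only sketch — field names like `cert_issued` are illustrative:

```python
import urllib.request


def domains_needing_prewarm(records: list[dict]) -> list[str]:
    """Pick domains that are verified but have never been hit over HTTPS."""
    return [r["domain"] for r in records
            if r["status"] == "active" and not r.get("cert_issued")]


def prewarm(domain: str, timeout: float = 15.0) -> bool:
    """One HTTPS request; the TLS handshake itself forces cert issuance."""
    try:
        urllib.request.urlopen(f"https://{domain}/", timeout=timeout)
        return True
    except Exception:
        return False


queue = domains_needing_prewarm([
    {"domain": "shop.client-brand.com", "status": "active"},
    {"domain": "store.another-tenant.xyz", "status": "active", "cert_issued": True},
])
```

Run it from a scheduled task right after a domain flips to active, and the 3–5 second first-hit penalty is paid by the job, not the customer.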


Then AWS Shipped the Native Answer

In April 2025, AWS released CloudFront SaaS Manager. It's the first-party solution to exactly this problem.

The model is a new distribution type called a multi-tenant distribution. One distribution, many tenants, each tenant gets their own ACM certificate attached at the tenant level.

CloudFront SaaS Manager: multiple tenant domains share one distribution with per-tenant ACM certs. The origin ALB only needs a single internal certificate, regardless of tenant count.

Key numbers:

  • Up to 2,000 domains per multi-tenant distribution.
  • Each tenant gets a per-domain ACM certificate, auto-validated and auto-renewed.
  • One fixed internal origin (your ALB with a single internal cert) serves every tenant.
  • ACM, SNI, renewal, edge caching — all managed by AWS.

The trick that makes this work is the same insight as the Caddy setup: you don't need per-customer SSL between the edge and your ALB. CloudFront terminates SSL at the edge with the tenant's cert, then talks to your ALB over a fixed internal domain using one internal cert. Your app reads the Host header to route.

I haven't migrated yet — what's in production is still the Caddy stack, and it's still working beautifully. But for any new platform starting today, SaaS Manager is the obvious starting point.

If you want the full tenant-onboarding code — ACM certificate request, DNS validation handoff, manual verify button and automated polling, CloudFront tenant creation — I've put both a Node.js and a Python reference on a gist: CloudFront SaaS Manager onboarding flow (Node.js + Python).


Cost Comparison

Three scenarios:

  • UC1 — 5 domains, ~50 GB/month.
  • UC2 — 50 domains, ~500 GB/month.
  • UC3 — 500 domains, ~10 TB/month.

Option A — CloudFront SaaS Manager + ALB

| Component | UC1 (5 domains) | UC2 (50 domains) | UC3 (500 domains) |
|---|---|---|---|
| CloudFront data transfer | $0 (free tier) | $0 (free tier) | ~$765 (9 TB @ $0.085/GB) |
| CloudFront requests | $0 (free tier) | $0 (free tier) | ~$40 (40M @ $1/M) |
| SaaS Manager tenants | $0 (first 10 free) | ~$20 | ~$50 |
| ACM SSL certs | $0 | $0 | $0 |
| ALB | ~$16 | ~$20 | ~$45 |
| App compute (EC2/ECS) | ~$8 | ~$60 | ~$200 |
| Route 53 hosted zone | $0.50 | $0.50 | $0.50 |
| Total / month | ~$24.50 | ~$100.50 | ~$1,100 |

Option B — NLB + Caddy (ECS) + ALB

| Component | UC1 (5 domains) | UC2 (50 domains) | UC3 (500 domains) |
|---|---|---|---|
| NLB | ~$16 | ~$20 | ~$55 |
| Caddy on ECS Fargate | ~$9 (0.25 vCPU) | ~$18 (0.5 vCPU) | ~$75 (2 tasks HA) |
| S3 (cert storage) | ~$1 | ~$2 | ~$5 |
| Internal ALB | ~$16 | ~$20 | ~$45 |
| App compute (EC2/ECS) | ~$8 | ~$60 | ~$200 |
| Data transfer out (no CDN) | ~$5 | ~$45 | ~$900 (10 TB @ $0.09/GB) |
| SSL certs (Let's Encrypt) | $0 | $0 | $0 |
| Total / month | ~$55 | ~$165 | ~$1,280 |

Side-by-side

| Use case | Option A | Option B | A saves |
|---|---|---|---|
| UC1 — 5 domains, 50 GB | ~$24.50 | ~$55 | ~$30/mo |
| UC2 — 50 domains, 500 GB | ~$100.50 | ~$165 | ~$64/mo |
| UC3 — 500 domains, 10 TB | ~$1,100 | ~$1,280 | ~$180/mo |

Two things stand out. At low traffic, the gap is almost entirely the extra load balancer (NLB + ALB vs one CloudFront + ALB). At high traffic, CloudFront's edge caching starts saving serious money — at a 30% cache hit rate, you avoid ~3 TB of origin egress, which alone is worth ~$270/month and widens the gap to roughly $450.
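The cache-hit arithmetic is worth making explicit. A quick model of UC3's egress, using the $0.09/GB origin rate from the tables above:

```python
def origin_egress_saved(total_gb: float, cache_hit_rate: float,
                        egress_per_gb: float = 0.09) -> float:
    """Dollars/month of origin egress avoided by CloudFront edge caching."""
    return total_gb * cache_hit_rate * egress_per_gb


# UC3: 10 TB/month at a 30% cache hit rate — ~3 TB never leaves the origin.
saved = origin_egress_saved(10_000, 0.30)
gap = 180 + saved  # the base ~$180/mo gap widens once caching kicks in

print(round(saved))  # 270
```

Higher cache hit rates (static assets, product images) push the savings further; the model scales linearly.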

Caddy's one real advantage here isn't cost — it's onboarding speed. New domain becomes live in under 60 seconds, no client-side DNS validation step. With ACM, the customer has to add a validation CNAME and wait. For non-technical customers, that friction is real.


What I'd Do Today

If I were starting from scratch in 2026:

  • < 25 domains, no growth pressure: ALB + ACM directly. Don't overbuild.
  • 25–500 domains, steady growth: CloudFront SaaS Manager. Let AWS handle it.
  • 500+ domains or very fast onboarding required: Either works. SaaS Manager is lower ops. Caddy is faster for first-hit and gives you more control.

The answer depends on how much you trust AWS to own the problem versus how much control you want. Both are valid.


Closing Thoughts

What looked like a simple infrastructure decision turned into a lesson in scale.

At one point, the ALB-plus-ACM architecture was correct. It worked. It was clean. It solved the problem in front of me. But it didn't solve the problem that was coming.

That's the uncomfortable reality of building systems — the solution that works today can quietly become tomorrow's bottleneck. Hitting the 25-certificate limit wasn't a failure. It was a signal. A signal that assumptions need revisiting, that “working” isn't the same as “scalable,” and that sometimes you have to step outside the managed-service happy path to move forward.

Building the Caddy layer wasn't a hack. It was the right call at that time, given the constraints and the tools that existed. And now with CloudFront SaaS Manager, there's a native answer that simplifies a lot of this complexity.

Does that make the earlier solution wrong? No. It just proves something worth remembering:

Good architecture isn't about picking the perfect solution. It's about making the best decision with the information you have — and being ready to evolve when the landscape changes.


Your Turn

There's no comment section on this site, and that's intentional.

If this was useful, if you're running a similar setup and want to compare notes, or if you think this approach is wrong and have a better one — tell me. I'm on X and LinkedIn and I read every message.

The best feedback I've gotten on past articles has come from engineers pointing out where I got it wrong. That's the loop worth having.


Frequently Asked Questions

How many SSL certificates can an AWS ALB hold per listener?

AWS Application Load Balancer supports 25 SSL certificates per HTTPS listener by default. The hard cap after a quota increase is roughly 100. For any SaaS platform offering custom domains, this becomes a blocker well before you'd expect.


Written by

Raunak Gupta

Director of Products at CodeClouds. Started as a developer in 2014, never stopped writing code. 12 years of building and shipping — web platforms, checkout systems, HR tools, AWS/GCP infrastructure — and still living with every decision I made. I write about the calls that worked, the ones that didn't, and what I'd do differently. No tutorials. No theory. Just what actually happened.
