uTechUnfiltered.dev
Scalable System Design · #aws #kafka #sns #distributed-systems #architecture

SNS vs Kafka: When You Actually Need Pub/Sub vs an Event Log

May 27, 2026 · 17 min read · Updated May 27, 2026

Most "SNS vs Kafka" comparisons miss the only thing that matters: SNS forgets, Kafka remembers.

Teams pick the wrong one for predictable reasons. A team spins up a Kafka cluster to send "new user signed up" emails. Another team ships SNS for a system that needs an audit trail — then loses three days of data when a webhook consumer 500s. Both mistakes come from treating SNS and Kafka as interchangeable pub/sub tools.

They aren't. SNS is a delivery system. Kafka is a storage system. Once that distinction clicks, almost every other trade-off — ordering, replay, backpressure, operational cost — follows from it. This article covers the decisions you'll actually face in production, and when each one breaks.

This is the companion to SQS vs Kafka: When to Use What in Real Systems. Read that for the queue-vs-log angle. This one covers the third corner — push-based pub/sub vs an event log.

TL;DR

SNS is a delivery system: it pushes a message to every subscriber, then forgets it existed. Kafka is a storage system: it appends events to a durable log that consumers pull from at their own offset, with replay. Default to SNS for fanout, notifications, and AWS-native decoupling at thousands of messages/sec. Reach for Kafka only when replay, stream processing, or sustained high throughput is a real requirement — not a hypothetical one. You don't have to choose: bridge SNS into Kafka when a specific durability need shows up.


The Real Problem: Why Teams Get This Wrong

Three predictable failure modes show up over and over.

Overengineering with Kafka. A team running 200 events/second deploys a three-broker MSK cluster because someone said "we might need replay someday." Six months in, they've spent more engineering time on partition rebalancing than on the feature replay was supposed to enable. They could have shipped SNS + SQS in an afternoon and never thought about it again.

Underestimating SNS. A team picks SNS because it's simple. Then the requirements shift: they need to replay events after a bad deploy, or bootstrap a new service with last week's data. SNS can't do either. Now they're writing recovery scripts against database snapshots and S3 logs at 2 AM.

Choosing based on resume aesthetics. Kafka has mindshare. It's what big tech uses. So teams adopt it for a workload a single SNS topic and a Lambda would have handled. The cluster sits at 3% utilization, but the operational cost is fixed.

The question isn't "which is better." It's: do you need messages delivered, or events stored?


The Mental Model: Delivery vs. Storage

Fix this before looking at anything else.

SNS is a delivery system. A producer publishes a message. SNS pushes it to every subscriber as fast as it can — SQS queues, Lambda, HTTPS endpoints, email, SMS. Then it forgets the message ever existed. If a subscriber is down past the retry window, the message is gone.

Kafka is a storage system. A producer appends a record to a topic. Kafka writes it to disk, replicates it, and waits. Consumers pull at their own pace, tracking their own position (offset). Records sit there for hours, days, or forever, until the retention policy deletes them. Delivery is the consumer's problem, not the broker's.

This is the entire architectural split:

Two brokers, two contracts — SNS delivers and forgets, Kafka stores and lets consumers pull at their own pace.

Everything else — partitions, consumer groups, DLQs, FIFO — is implementation detail on top of this split.
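To make the two contracts concrete, here is a toy in-memory model. It is only a sketch: `PushTopic` and `EventLog` are hypothetical names invented for illustration, not SNS or Kafka APIs, and nothing here is durable or distributed.

```python
from typing import Callable


class PushTopic:
    """Toy push broker: deliver to whoever is subscribed right now, keep nothing."""

    def __init__(self) -> None:
        self.subscribers: list[Callable[[str], None]] = []

    def publish(self, message: str) -> None:
        for deliver in self.subscribers:
            deliver(message)
        # No copy is retained: a subscriber added later never sees this message.


class EventLog:
    """Toy log: append and retain; each reader chooses its own offset."""

    def __init__(self) -> None:
        self.records: list[str] = []

    def append(self, record: str) -> int:
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def read_from(self, offset: int) -> list[str]:
        return self.records[offset:]


topic, log = PushTopic(), EventLog()
early: list[str] = []
topic.subscribers.append(early.append)

for event in ("order.placed", "order.paid"):
    topic.publish(event)
    log.append(event)

late: list[str] = []
topic.subscribers.append(late.append)  # joins after two events were published
topic.publish("order.shipped")
log.append("order.shipped")
```

The late subscriber on the push side only ever holds `order.shipped`, while any reader of the log can call `read_from(0)` and get all three events.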


SNS in Production: The Distributed Megaphone

SNS is a smart router with dumb consumers. The broker is proactive. When you Publish, SNS immediately tries to push the payload to every subscriber.

This makes SNS unbeatable at one thing: fanout to heterogeneous consumers. An OrderPlaced event needs to trigger a shipping label, a confirmation email, a fraud check, and a warehouse update. Four subscribers, four protocols, zero replication code on your side. AWS handles the broadcast at scale.

python
import boto3

sns = boto3.client("sns")

sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:order-events",
    Message='{"order_id": "ord_42", "status": "placed", "amount": 9900}',
    MessageAttributes={
        "event_type": {"DataType": "String", "StringValue": "order.placed"},
    },
)

The operational tax is close to zero. No brokers to patch, no partitions to design, no offsets to monitor. You pay per million publishes and per delivery.

The trade-off is that the broker owns retries, not the consumer. If your HTTPS subscriber is down, SNS retries on a schedule, then moves the message to a DLQ if you configured one, and silently drops it if you didn't. You're fighting backpressure at the point of delivery, not at the point of consumption. There is no "rewind." There is no "let me read yesterday's events." Whatever SNS delivered, your downstream system either caught or didn't.
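Attaching a DLQ is a per-subscription setting. As a rough sketch, the helper below builds the arguments for the SNS `SetSubscriptionAttributes` call with a `RedrivePolicy`. The helper name and both ARNs are placeholders made up for illustration.

```python
import json


def redrive_policy_request(subscription_arn: str, dlq_arn: str) -> dict:
    """Build kwargs for SNS SetSubscriptionAttributes that attach an SQS DLQ."""
    return {
        "SubscriptionArn": subscription_arn,
        "AttributeName": "RedrivePolicy",
        "AttributeValue": json.dumps({"deadLetterTargetArn": dlq_arn}),
    }


# Placeholder ARNs for illustration only.
request = redrive_policy_request(
    "arn:aws:sns:us-east-1:123456789012:order-events:example-subscription-id",
    "arn:aws:sqs:us-east-1:123456789012:order-events-dlq",
)

# With credentials configured, the request would be sent as:
#   boto3.client("sns").set_subscription_attributes(**request)
```

The SQS queue also needs a queue policy that allows SNS to send messages to it; without that, failed deliveries still have nowhere to land.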

One of the cleanest SNS deployments I've worked on was account lifecycle fanout. UserCreated and SubscriptionUpgraded events needed to hit four completely different systems — onboarding emails, billing workflows, audit logging, and a couple of Lambda enrichment jobs. SNS fit because the consumers were independent, stateless, and only cared about the event in the moment. Nobody needed historical replay, and nobody wanted to run a Kafka cluster just to distribute notifications.

It worked until a downstream reporting Lambda silently failed for several hours and someone realised the only recovery story was "hope the producer emits it again." In other words, there wasn't one. That's the moment you stop thinking about fanout and start thinking about durable event streams.


Kafka in Production: The Distributed Ledger

Kafka inverts the model. The broker is passive — it writes bytes to a log and waits. The consumer is smart. It tracks its own position (offset) in each partition and decides when and how fast to read.

This solves backpressure naturally. A slow consumer doesn't take the broker down or trigger a retry storm. It just falls behind its offset. Lag is observable, recoverable, and self-healing once the consumer catches up.
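Lag is just arithmetic on offsets. A minimal sketch, assuming you have already fetched the per-partition end offsets and the group's committed offsets (kafka-python exposes these through `KafkaConsumer.end_offsets()` and `KafkaConsumer.committed()`); the numbers below are made up.

```python
def partition_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Lag per partition = log end offset minus the group's committed offset."""
    return {
        # Simplification: a partition with no commit is treated as unread from offset 0.
        partition: end - committed.get(partition, 0)
        for partition, end in end_offsets.items()
    }


# Hypothetical snapshot of a three-partition topic.
end_offsets = {0: 1_500_000, 1: 1_498_200, 2: 1_501_750}
committed = {0: 1_499_990, 1: 1_482_330, 2: 1_501_750}

lag = partition_lag(end_offsets, committed)
total_lag = sum(lag.values())
```

Partition 1 is the one falling behind here. The broker is unaffected; the only thing that has changed is how far that consumer's pointer trails the end of the log.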

python
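from kafka import KafkaConsumer, TopicPartition  # kafka-python client


def process(payload: bytes) -> None:
    """Placeholder: replace with your own (idempotent) event handler."""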

consumer = KafkaConsumer(
    bootstrap_servers="kafka-broker:9092",
    group_id="fraud-checker",
    enable_auto_commit=False,
)

tp = TopicPartition("order-events", 0)
consumer.assign([tp])

# Replay: rewind to a known-good offset from before the bad deploy
consumer.seek(tp, 1_482_330)

for msg in consumer:
    process(msg.value)
    consumer.commit()

That seek() call is the killer feature. A bug ships at 2 PM. You catch it at 4 PM. You fix the code, reset the consumer group to the 1:55 PM offset, and reprocess the last two hours. With SNS, those events are gone, and you're reconstructing them from logs and database state.
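In practice you rarely know the offset for "1:55 PM" by heart. One way to find it, sketched here under the assumption that you use kafka-python, is `offsets_for_times()`, which maps a timestamp in milliseconds to the first offset at or after it. The helper name `rewind_to` is made up for this example.

```python
from datetime import datetime, timezone


def rewind_to(consumer, partition, when: datetime) -> int:
    """Seek `partition` to the first record at or after `when`; return that offset."""
    timestamp_ms = int(when.timestamp() * 1000)
    found = consumer.offsets_for_times({partition: timestamp_ms})
    match = found.get(partition)
    if match is None:
        # The lookup yields None when no record is that recent.
        raise LookupError(f"no record at or after {when.isoformat()}")
    consumer.seek(partition, match.offset)
    return match.offset


# Usage with a real consumer (the date is illustrative; `consumer` and `tp`
# are a KafkaConsumer and a TopicPartition that has been assigned to it):
#   rewind_to(consumer, tp, datetime(2026, 5, 27, 13, 55, tzinfo=timezone.utc))
```

Pass a timezone-aware `datetime`; a naive one is interpreted in the local timezone of whatever machine runs the script, which is an easy way to rewind to the wrong hour.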

But Kafka's parallelism is bound to partitions. A topic with 10 partitions tops out at 10 active consumers in a group. Add an 11th and it sits idle. Unlike SQS, you can't "just add workers" to drain a backlog — you have to plan partition counts upfront, and repartitioning a hot topic in production is genuinely painful.
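The idle 11th consumer falls straight out of the rule that each partition has exactly one owner within a group. The sketch below uses a simplified round-robin stand-in, not Kafka's actual assignor, and the worker names are invented.

```python
def assign_partitions(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Round-robin partitions across consumers; each partition gets exactly one owner."""
    assignment: dict[str, list[int]] = {name: [] for name in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


group = [f"worker-{i}" for i in range(11)]  # 11 consumers in one group
assignment = assign_partitions(10, group)   # topic with 10 partitions
idle = [name for name, owned in assignment.items() if not owned]
```

With 10 partitions and 11 workers, `idle` ends up holding the one worker that received nothing. Adding a 12th or 20th worker only makes that list longer.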

One Kafka deployment that earned every bit of its operational tax: an event pipeline for order and inventory mutations, with downstream consumers powering fulfillment, analytics, fraud detection, and financial reconciliation. We sized the main topic at 12 partitions based on conservative projections of 15–20K events/sec peak. A year later traffic had tripled, consumer lag during flash sales was crossing 40 million messages, and adding more consumers stopped helping — we'd already maxed out parallelism at the partition count.

Repartitioning while preserving per-key ordering turned into a multi-week migration with dual writes, staged consumer cutovers, and a lot of operational anxiety. Plan partition counts pessimistically — 2–5× your current peak, not your current peak.
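The sizing rule is simple enough to write down. In this sketch the per-consumer throughput of 2,500 events/sec is an assumed figure for illustration only; measure your own handler before trusting any number like it.

```python
import math


def partitions_needed(
    peak_events_per_sec: float,
    per_consumer_events_per_sec: float,
    headroom: float,
) -> int:
    """Partitions required so that one consumer per partition can absorb peak * headroom."""
    return math.ceil(peak_events_per_sec * headroom / per_consumer_events_per_sec)


PEAK = 20_000          # events/sec, the upper end of the projection above
PER_CONSUMER = 2_500   # events/sec one consumer can process (assumed)

sized_for_today = partitions_needed(PEAK, PER_CONSUMER, headroom=1)
sized_for_3x = partitions_needed(PEAK, PER_CONSUMER, headroom=3)
sized_for_5x = partitions_needed(PEAK, PER_CONSUMER, headroom=5)
```

Under these assumptions, sizing for today's peak gives 8 partitions, while tripled traffic needs 24. A topic sized near the first number has no room to absorb the second.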

The same Kafka that punished us on partitions also saved us on replay. A schema change shipped a broken inventory consumer that corrupted stock counts for nearly three hours. We fixed the code, rewound the consumer group offset, and rebuilt state directly from the log. On SNS, that's a database reconstruction job. On Kafka, it's a seek().


The Differences That Actually Matter

Forget feature matrices. Seven behaviors decide the call.

| Behavior | SNS | Kafka |
| --- | --- | --- |
| Retention | Transient: deleted after delivery (or DLQ) | Durable: configurable, hours to forever |
| Consumer state | Owned by broker | Owned by consumer (offsets) |
| Delivery model | Push | Pull |
| Ordering | FIFO topics: per group, ~300 msg/s cap | Per partition, scales linearly |
| Replay | Not possible | Native (seek to any offset) |
| Fanout cost | Effectively free (one publish, N pushes) | One consumer group per subscriber, plus broker load |
| Operational tax | Near zero | High, even on MSK |

A few of these need translation.

Consumer state ownership is the hidden lever. With SNS, the broker knows what was delivered. With Kafka, the consumer knows what was read. That's why Kafka can replay: the data is still there, only the pointer moves. It's also why Kafka consumers can crash, restart, and resume exactly where they left off without the broker doing anything special.

Ordering on SNS FIFO is a trap if you don't read the throughput caps. 300 messages/sec per message group, 3,000 with batching. Past that you're sharding message groups, which means the global ordering you thought you had isn't global at all. Kafka gives you ordering per partition and scales by adding partitions.
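Here is roughly what "sharding message groups" looks like, and why ordering stops being global once you do it. This is a sketch: the function names are invented, the 300 msg/s default is the per-group cap quoted above, and the returned string is what you would pass as `MessageGroupId` when publishing to a FIFO topic.

```python
import hashlib
import math


def groups_needed(target_msgs_per_sec: int, per_group_cap: int = 300) -> int:
    """How many message groups are needed to stay under the per-group cap."""
    return math.ceil(target_msgs_per_sec / per_group_cap)


def message_group_for(ordering_key: str, num_groups: int) -> str:
    """Map a key to a stable group ID, so one key always lands in the same group."""
    digest = hashlib.sha256(ordering_key.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % num_groups
    return f"group-{shard}"


num_groups = groups_needed(1_200)  # 1,200 msg/s target at 300 msg/s per group
group_id = message_group_for("ord_42", num_groups)
```

Events for `ord_42` stay ordered relative to each other because they always hash to the same group. Events for two different orders may sit in different groups, and nothing orders those groups against each other.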

Fanout cost flips at scale. SNS fanout is free in engineering time but costs per delivery. Kafka fanout is "free" per message but each new consumer group adds broker load and offset bookkeeping. At 10 subscribers and low volume, SNS wins. At 3 subscribers and 50K events/sec, Kafka wins.


The Exactly-Once Myth

Stop trying to solve this at the broker layer. In any real distributed system you're getting at-least-once delivery:

  • SNS retries on transient failures → duplicates possible.
  • Kafka producer retries, consumer rebalances → duplicates possible.
  • Kafka's "exactly-once semantics" only holds inside a single Kafka transaction. The moment you write to a database or call an external API, it's at-least-once again.

The fix isn't broker config. It's idempotent consumers — design every handler so that processing the same event twice produces the same result. Use an event ID, a dedupe table, an upsert, or a conditional update. Once you accept at-least-once, both SNS and Kafka become easier to reason about.
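A dedupe table is the most direct version of this. The sketch below uses an in-memory SQLite database so it runs anywhere; the table and function names are invented, and a real handler would run the same pattern against its own datastore.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE stock (sku TEXT PRIMARY KEY, quantity INTEGER NOT NULL)")
db.execute("INSERT INTO stock VALUES ('sku_1', 10)")
db.commit()


def handle_once(event_id: str, sku: str, delta: int) -> bool:
    """Apply the stock change once per event_id; return False for a duplicate."""
    with db:  # one transaction: the dedupe marker and the side effect commit together
        inserted = db.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        ).rowcount
        if inserted == 0:
            return False  # already processed: skip the side effect
        db.execute(
            "UPDATE stock SET quantity = quantity + ? WHERE sku = ?",
            (delta, sku),
        )
        return True


first = handle_once("evt_1", "sku_1", -2)
second = handle_once("evt_1", "sku_1", -2)  # redelivery of the same event
quantity = db.execute("SELECT quantity FROM stock WHERE sku = 'sku_1'").fetchone()[0]
```

The important design choice is that the marker row and the stock update share a transaction. If they were committed separately, a crash between the two would either lose the update or apply it twice.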


Anti-Patterns Worth Naming

Kafka-as-a-database. Teams use compacted topics and KTables to store current state, then write services that query Kafka instead of Postgres. It works until someone needs a join, a secondary index, or a point-in-time read. Use Kafka to move the events. Put the state in a database.

SNS-as-an-audit-log. SNS will deliver your event. If the consumer 500s past the retry window, SNS will drop it (or DLQ it, if you remembered to configure one). Anything that needs compliance-grade durability — financial transactions, audit trails, CDC — belongs in Kafka or a log-backed system. Not SNS.

Infinite fanout on Kafka. Replicating SNS's "every subscriber gets a copy" pattern in Kafka means one consumer group per downstream service. That works up to a point, then broker load and rebalance times become a real problem. If your dominant pattern is heterogeneous fanout, SNS is the right primitive, not Kafka.

I've also seen the inverse — a team using SNS as a de facto audit system because "it's already in AWS and works." The original setup was correctly scoped to lightweight notifications: emails, Slack alerts, internal automations. Then business-critical consumers got bolted onto the same fanout pipeline — billing reconciliation, compliance logging — and nobody revisited the guarantees. The system mostly worked, which is the dangerous part.

One downstream HTTPS consumer started failing intermittently during traffic spikes, and several hours of financial events disappeared after retry exhaustion. The data loss wasn't even the worst part. The worst part was realising there was no clean replay path because SNS had never been designed to retain history. Migrating to Kafka afterward wasn't hard. Rebuilding trust in the event pipeline took far longer than standing up the brokers.


When SNS Is Enough

Default to SNS when:

  • You need decoupling, not durability — Service A telling Service B something happened.
  • You're already in AWS and want IAM, Lambda, and SQS integration for free.
  • Heterogeneous fanout is the dominant pattern (email + SMS + webhook + queue).
  • Throughput is in the thousands/sec, not millions/sec.
  • Notifications are the product (push, SMS, email).

When Kafka Becomes Necessary

You actually need Kafka when:

  • You need replay — historical reprocessing is in your operational toolkit, not a nice-to-have.
  • You need stateful stream processing — windowed aggregations, joins, sessionization.
  • You're piping change data capture from a database into downstream systems.
  • Throughput is sustained millions/sec on a single logical stream.
  • Multiple independent consumers need to read the same events at their own pace, indefinitely.

If none of those apply, you don't need Kafka. You need SNS + SQS, and you'll save a quarter of an SRE's calendar.


Migration: SNS+SQS → Kafka (Without the Rewrite)

The honest path for most teams isn't "pick one." It's "start with SNS, add Kafka when a specific use case demands it."

The bridge pattern: pipe SNS into Kafka. Subscribe a Lambda (or an MSK Connect source) to your SNS topic, write each event into a Kafka topic. Your existing fanout consumers keep working unchanged. New consumers — analytics, ML pipelines, the audit log — read from Kafka and get replayability for free.

Bridge SNS into Kafka — existing push consumers stay untouched, durable replay shows up on the other side.
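A minimal sketch of the forwarder, assuming a Lambda subscribed to the SNS topic and a kafka-python style producer passed in by the caller. The function name, the `sns_message_id` header, and the choice of `order_id` as the partition key are assumptions for illustration; it also assumes every message body is JSON containing that field.

```python
import json


def forward_sns_to_kafka(event: dict, producer, topic: str, key_field: str = "order_id") -> int:
    """Forward each SNS record in a Lambda event to Kafka; return the count sent."""
    records = event.get("Records", [])
    for record in records:
        sns = record["Sns"]
        body = json.loads(sns["Message"])
        producer.send(
            topic,
            # Keying by a domain field keeps one entity's events on one partition.
            key=str(body[key_field]).encode("utf-8"),
            value=sns["Message"].encode("utf-8"),
            # Carry the SNS message ID so Kafka consumers can dedupe on it.
            headers=[("sns_message_id", sns["MessageId"].encode("utf-8"))],
        )
    producer.flush()  # wait for buffered sends to complete before the invocation ends
    return len(records)
```

In a real handler you would create the producer once outside the handler function and call `forward_sns_to_kafka(event, producer, "order-events")` from it. Because SNS delivery is at-least-once, the same event can be forwarded twice, which is why the message ID travels along as a header.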

You don't have to choose one for the whole company. Most production systems I've seen end up with both, and the boundary lines up with the natural split: notifications go through SNS, durable event streams go through Kafka.

The mistake I've seen most often when teams set up this bridge is treating it as a temporary scaffold. It isn't. The SNS topic stays the contract for "an event happened," the Bridge Lambda stays the boring forwarder, and new subscribers don't get a vote on which side they read from. Push consumers stay on SNS. Durable consumers go through Kafka. The moment teams let new subscribers pick "whichever's easier," the boundary blurs and you end up with the same OrderPlaced event arriving via two different paths with two different retry semantics. Pick the seam once and defend it.


What Breaks in Production

The failure modes worth knowing about:

  • SNS silent drops. No DLQ configured + a subscriber that 500s past the retry window = lost events with no alert. Check every SNS subscription for a DLQ. Today.
  • Kafka consumer lag during rebalances. When a consumer joins or leaves a group, Kafka pauses the entire group while it reassigns partitions. On a busy topic, a flaky pod can trigger rebalances every few minutes and starve the group.
  • The partition wall. You provisioned 6 partitions because traffic was low. Traffic isn't low anymore. Repartitioning means losing per-key ordering for the duration of the migration. Plan partition count assuming 2–5× your current peak.
  • SNS message size limits. 256 KB. If your payload is bigger, you're putting it in S3 and sending a pointer. Congratulations, you've reinvented the claim-check pattern the hard way.
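The DLQ check from the first bullet is easy to script. This sketch takes any client that exposes the boto3 SNS methods `list_subscriptions_by_topic` and `get_subscription_attributes`; the function name is invented, and it assumes a subscription without a `RedrivePolicy` attribute has no DLQ.

```python
def subscriptions_without_dlq(sns_client, topic_arn: str) -> list[str]:
    """Return ARNs of confirmed subscriptions on the topic that have no RedrivePolicy."""
    missing: list[str] = []
    kwargs = {"TopicArn": topic_arn}
    while True:
        page = sns_client.list_subscriptions_by_topic(**kwargs)
        for sub in page.get("Subscriptions", []):
            arn = sub["SubscriptionArn"]
            if not arn.startswith("arn:"):
                continue  # e.g. "PendingConfirmation": nothing to inspect yet
            attrs = sns_client.get_subscription_attributes(SubscriptionArn=arn)
            if not attrs["Attributes"].get("RedrivePolicy"):
                missing.append(arn)
        token = page.get("NextToken")
        if not token:
            return missing
        kwargs["NextToken"] = token


# With credentials configured (the topic ARN is the one from the publish example):
#   import boto3
#   subscriptions_without_dlq(
#       boto3.client("sns"),
#       "arn:aws:sns:us-east-1:123456789012:order-events",
#   )
```

Run it per topic and alert on a non-empty result, so a newly added subscription without a DLQ gets caught before it drops anything.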

Here's a Kafka-specific failure mode that doesn't show up in any feature matrix. We were running MSK on kafka.t3.small with two brokers and replication_factor=2 in Terraform. We hadn't explicitly set min.insync.replicas; Kafka had historically picked a sane default from the broker count, so it never came up.

In October 2025, a patch release on Kafka 3.7 changed the behavior to require min.insync.replicas to be set explicitly. Terraform started failing. Nothing about our code had changed; the broker contract had. A few hours of digging through release notes later, the fix was a one-liner: set min.insync.replicas=1, because we only had two brokers. With three brokers, the previous default would still have been valid. With SNS, none of this would have been a problem because there's nothing to configure.
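The reason the value had to be 1 on two brokers comes down to simple arithmetic. A sketch, assuming producers use `acks=all`, in which case a write succeeds only while at least `min.insync.replicas` replicas are in sync; the function name is made up.

```python
def tolerated_broker_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Replicas that can drop out before acks=all writes start being rejected."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")
    return replication_factor - min_insync_replicas


two_brokers_relaxed = tolerated_broker_failures(2, 1)  # the fix described above
two_brokers_strict = tolerated_broker_failures(2, 2)   # any single outage blocks writes
three_brokers = tolerated_broker_failures(3, 2)        # a common three-broker setting
```

The relaxed two-broker setting keeps writes flowing through a broker restart, at the price that an acknowledged write may exist on only one replica at that moment.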

That's the Kafka tax in miniature. Running it means tracking upstream behavior changes across broker versions, partition rebalances, consumer group internals — none of which exist on a managed pub/sub service. The replay is worth it when you actually need it. The overhead is the price you keep paying whether you needed it that quarter or not.


When NOT to Use Either

Not every async problem needs a broker.

  • Synchronous calls with idempotency beat async fanout when the consumer count is one and you need confirmation before responding.
  • A boring SQS queue is the right answer for "background job, one worker, one consumer." No SNS, no Kafka.
  • A database outbox + polling is sometimes the simplest reliable pattern for "I need to publish an event after this transaction commits." Don't drag a broker into a problem a SELECT ... FOR UPDATE SKIP LOCKED would have solved.
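The outbox pattern from the last bullet, sketched with in-memory SQLite so it runs as-is. Table and function names are invented. SQLite has no `FOR UPDATE SKIP LOCKED`, so this version assumes a single poller; on Postgres with several pollers, the `SELECT` is where that clause would go.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
db.execute(
    "CREATE TABLE outbox ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " payload TEXT NOT NULL,"
    " published INTEGER NOT NULL DEFAULT 0)"
)


def place_order(order_id: str) -> None:
    """Write the order and its event in one transaction, so neither exists alone."""
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        event = json.dumps({"order_id": order_id, "status": "placed"})
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (event,))


def drain_outbox(publish) -> int:
    """Publish pending rows in insertion order, marking each one after it is sent."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(payload)  # a crash after this line means a resend: at-least-once
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


place_order("ord_42")
sent: list[str] = []
first_pass = drain_outbox(sent.append)
second_pass = drain_outbox(sent.append)  # nothing left to publish
```

Here `publish` is just `list.append`; in a real system it would be an SQS send or an SNS publish. The resend-on-crash behaviour is another reason the consumers on the other end need to be idempotent.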

Summary

  • SNS is a delivery system (push, forget). Kafka is a storage system (write, retain, pull).
  • Default to SNS for fanout, notifications, and AWS-native decoupling. Reach for Kafka only when replay, stream processing, or sustained high throughput is a real requirement — not a hypothetical one.
  • Solve duplicates with idempotent consumers, not exactly-once broker config.
  • Configure DLQs on every SNS subscription. Plan partition counts assuming 2–5× your current peak on Kafka.
  • You don't have to pick one — bridge SNS → Kafka when a specific durability or replay need shows up. Don't migrate everything.

Most teams reaching for Kafka don't have a streaming problem — they have a reliability and ownership problem. They need better retry handling, DLQs, idempotent consumers, and clearer service boundaries. Not a distributed event log with partition planning and rebalance storms attached. Kafka becomes invaluable once replayability and long-lived event history are real operational requirements. Introducing it before that point is one of the fastest ways to turn a simple async architecture into a platform engineering project.

Frequently Asked Questions

What is the difference between SNS and Kafka?

SNS is a managed pub/sub service that pushes messages to subscribers and then deletes them. Kafka is a distributed event log that stores messages on disk and lets consumers pull them at their own pace, with full replay. SNS optimizes for delivery and fanout simplicity; Kafka optimizes for durability, replayability, and high-throughput stream processing.

Written by Raunak Gupta

Director of Products at CodeClouds. Started as a developer in 2014, never stopped writing code. 12 years of building and shipping — web platforms, checkout systems, HR tools, AWS/GCP infrastructure — and still living with every decision I made. I write about the calls that worked, the ones that didn't, and what I'd do differently. No tutorials. No theory. Just what actually happened.
