How I Stopped Worrying and Learned to Love Cell-Based Architecture
A frank account of moving from one big shared system to many small isolated cells. The migration sucked. The blast-radius math afterward was a religious experience.
For three years we ran a single big production stack. Big shared Postgres. Big shared Kafka. Big shared cache layer. Every team's code shipped to the same fleet. Every customer's data lived in the same database. Every feature flag was global.
It was beautiful, in the way a high-wire act is beautiful: easy to admire, hard to live with. Every outage was a global outage. Every deploy felt like surgery. Every "can we run a heavy migration" required a change advisory board (CAB) meeting and three hours of follow-up.
Then we moved to cell-based architecture, and one Tuesday afternoon I noticed I hadn't been paged in 11 weeks, and I stopped to write down what changed.
The argument, in one sentence
A cell is a self-contained vertical slice of your stack — its own database, its own cache, its own service tier — that handles a subset of customers. When a cell breaks, only its customers feel it. When a cell deploys, only its customers see the new code. When a cell scales, only its hardware grows.
The pitch is blast-radius math: with N roughly equal cells, the worst-case single-cell outage hits 1/N of customers. With N=1, your worst case is everyone. With N=20, your worst case is 5%. That's not a small difference. That's a different relationship to risk.
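A minimal sketch of that arithmetic, assuming the failure stays inside exactly one cell (a simplification). The function name and the customer counts are made up for illustration. Passing real cell sizes instead of equal ones shows how a lopsided cell pushes the worst case above 1/N.

```python
from fractions import Fraction


def worst_case_blast_radius(cell_sizes: list[int]) -> Fraction:
    """Fraction of all customers hit if the largest single cell fails.

    Assumes the failure is contained to exactly one cell.
    """
    total = sum(cell_sizes)
    if total == 0:
        raise ValueError("no customers in any cell")
    return Fraction(max(cell_sizes), total)


# One shared stack: the worst case is everyone.
single = worst_case_blast_radius([10_000])
# Twenty equal cells: the worst case is 1/20 of customers.
twenty = worst_case_blast_radius([500] * 20)

print(float(single))  # 1.0
print(float(twenty))  # 0.05
```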
What we did NOT do
We did not microservice everything. Cells are not microservices. A cell is allowed to be a monolith. The slicing happens at the customer-isolation boundary, not the code boundary.
This is the part I think most teams misread. "Cell-based" doesn't mean "break the app into 50 services." It means "run 20 copies of the same app, each handling a slice of customers, none of them sharing state." Architecturally, your code can be a single monorepo deployable. The discipline is in the runtime topology.
Picking the cell key
The most consequential decision in this whole project was picking the partitioning key. We considered:
- Customer ID. Simple. The natural unit of isolation.
- Region. Good for latency, terrible for compliance — customers don't always sit in one region.
- Tenant tier (free/pro/enterprise). Tempting from an SLO perspective. Bad from a migration perspective.
We picked customer ID. Each customer is hashed to a cell at signup. Assignments are static: a customer never moves cells without an explicit migration job. This made every other decision easier. It also made one decision harder: what about customers who outgrow their cell?
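One way to get "hash at signup, never move implicitly" is to hash once and then store the result, so that routing reads the stored assignment instead of re-hashing. The sketch below assumes that design. `CellDirectory`, its method names, and the in-memory dict are all hypothetical stand-ins, not a description of the production router.

```python
import hashlib


class CellDirectory:
    """Maps customers to cells: hash once at signup, look up afterwards."""

    def __init__(self, cell_count: int) -> None:
        if cell_count < 1:
            raise ValueError("need at least one cell")
        self.cell_count = cell_count
        # Stand-in for a durable store; a real directory would persist this.
        self._assignments: dict[str, int] = {}

    def assign_at_signup(self, customer_id: str) -> int:
        """Pick a cell by stable hash and record it. Idempotent."""
        if customer_id in self._assignments:
            return self._assignments[customer_id]
        digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
        cell = int.from_bytes(digest[:8], "big") % self.cell_count
        self._assignments[customer_id] = cell
        return cell

    def cell_for(self, customer_id: str) -> int:
        """Routing lookup: read the stored assignment, never re-hash."""
        return self._assignments[customer_id]

    def migrate(self, customer_id: str, target_cell: int) -> None:
        """The only way an assignment changes: an explicit migration job."""
        if customer_id not in self._assignments:
            raise KeyError(customer_id)
        if not 0 <= target_cell < self.cell_count:
            raise ValueError(f"unknown cell {target_cell}")
        self._assignments[customer_id] = target_cell


directory = CellDirectory(cell_count=20)
home = directory.assign_at_signup("customer-42")
directory.cell_count = 21  # a new cell comes online
print(directory.cell_for("customer-42") == home)  # True
```

Because the lookup reads the stored value, growing the cell count changes where new signups land but leaves existing customers where they are.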
The answer there is "cell migration as a planned job." We wrote a runbook. It runs roughly four times a year. Each migration takes ~6 hours of planning and ~30 minutes of cutover. It's not free; it's not common.
The migration: not painless
Let me not romanticise this. The move to cells was 14 months of work. We had to:
- Stand up cell-zero (a new cell, identical infrastructure to existing prod, zero customers).
- Build a router service that maps incoming requests → cell.
- Make every cross-customer feature (admin dashboards, internal reporting, fraud detection) explicit about which cells it queries.
- Move customers in waves: 1% → 10% → 50% → 100%, with rollback hooks at each step.
- Decommission the legacy big-shared stack — which sounds easy and is actually where 4 of the 14 months went.
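The wave step in the list above can be sketched as a loop over cumulative percentages with a health check after each wave. This is an illustration under assumptions: `run_waves` and its `move`, `rollback`, and `healthy` hooks are hypothetical names, and undoing only the wave that just failed is one possible rollback policy, not necessarily the one we used.

```python
from typing import Callable, Sequence

# Cumulative share of customers per wave, taken from the list above.
WAVES = (0.01, 0.10, 0.50, 1.00)


def run_waves(
    customers: Sequence[str],
    move: Callable[[str], None],
    rollback: Callable[[str], None],
    healthy: Callable[[], bool],
    waves: Sequence[float] = WAVES,
) -> int:
    """Move customers in cumulative waves; undo the current wave if unhealthy.

    Returns how many customers remain migrated when the run stops.
    """
    done = 0
    for share in waves:
        target = max(done, round(len(customers) * share))
        batch = customers[done:target]
        for customer in batch:
            move(customer)
        if not healthy():
            # Rollback hook: only the wave that just ran is undone.
            for customer in reversed(batch):
                rollback(customer)
            return done
        done = target
    return done


# Demo with in-memory stand-ins for the real hooks.
customers = [f"cust-{i}" for i in range(200)]
in_new_cells: set[str] = set()

migrated = run_waves(
    customers,
    move=in_new_cells.add,
    rollback=in_new_cells.discard,
    # Pretend the new cells fall over once they hold more than 20 customers.
    healthy=lambda: len(in_new_cells) <= 20,
)
print(migrated)  # 20
```

In the demo the 50% wave trips the health check, so its 80 customers are rolled back and the run stops with the first 20 still migrated.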
The hardest part wasn't the technical migration. It was the internal tooling. Every dashboard, every report, every customer-support tool we owned had implicit assumptions about a single global database. We had to teach them about cells. Some tools we kept; many we rebuilt. The customer-success team has feelings about this.
What changed, concretely
- Outages got smaller. Our biggest incident in the last 12 months hit 4.7% of customers. Pre-cell, the same root cause would have hit 100%.
- Deploys got bolder. We deploy to one cell at a time. If a deploy goes bad, we catch it on cell 1 and stop. Pre-cell, every deploy was a coin flip on the entire customer base.
- Heavy migrations got safer. A schema migration on cell 7 affects 5% of customers. We schedule them in business hours now, because the worst case is recoverable.
- Capacity planning got specific. We know which cell hosts which large customer. We can scale that one cell. No more "throw 20% more capacity at the entire fleet because one customer onboarded."
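The one-cell-at-a-time deploy described above comes down to a loop that halts on the first failed health check. A minimal sketch, with `rollout`, `deploy`, and `healthy` as hypothetical names and the health check left as a callback:

```python
from typing import Callable, Sequence


def rollout(
    cells: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> list[str]:
    """Deploy to cells in order and stop at the first unhealthy one.

    Returns the cells that were deployed and passed their health check.
    """
    passed: list[str] = []
    for cell in cells:
        deploy(cell)
        if not healthy(cell):
            # Halt: the remaining cells keep running the old version.
            break
        passed.append(cell)
    return passed


touched: list[str] = []
passed = rollout(
    ["cell-1", "cell-2", "cell-3"],
    deploy=touched.append,
    healthy=lambda cell: False,  # simulate a bad build
)
print(touched)  # ['cell-1']
print(passed)   # []
```

A bad build touches only the first cell in the list, which is the whole point of ordering deploys this way.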
What got worse
I'd be lying if I said it was all upside. The honest list:
- Ops cost went up roughly 1.7x. 20 cells means 20 databases, 20 caches, 20 deploy pipelines. We pay for that. The math still works out because we have fewer outages, but it's not a free win.
- Cross-customer queries are harder. Reporting on "all customers above $X MRR" used to be a SELECT. Now it's a fan-out. We built an aggregator. It's fine. It's not delightful.
- Engineering velocity dipped for two quarters. Every new feature now has to ask "is this cell-local or cross-cell?" Most are cell-local. The ones that aren't are 3x the work they were before.
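To make the fan-out in the cross-customer bullet concrete, here is a rough sketch of an aggregator that runs the same query against every cell and merges the rows. It is not our aggregator: `fan_out`, `query_cell`, and the fake per-cell data are invented for the example. Returning failures separately is a design choice, so that a missing cell is visible rather than silently shrinking the report.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def fan_out(
    cells: list[str],
    query_cell: Callable[[str], Iterable[dict]],
    max_workers: int = 8,
) -> tuple[list[dict], dict[str, Exception]]:
    """Run the same query against every cell and merge the rows.

    Returns (rows, failures) so the caller can see which cells are missing.
    """
    rows: list[dict] = []
    failures: dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {cell: pool.submit(query_cell, cell) for cell in cells}
        for cell, future in futures.items():
            try:
                rows.extend(future.result())
            except Exception as exc:  # one bad cell must not sink the report
                failures[cell] = exc
    return rows, failures


# Fake per-cell data standing in for 20 separate databases.
FAKE_CELLS = {
    "cell-1": [{"customer": "acme", "mrr": 900}, {"customer": "bolt", "mrr": 120}],
    "cell-2": [{"customer": "cyan", "mrr": 450}],
}


def above_mrr(threshold: int) -> Callable[[str], list[dict]]:
    def query(cell: str) -> list[dict]:
        return [row for row in FAKE_CELLS[cell] if row["mrr"] > threshold]
    return query


rows, failures = fan_out(list(FAKE_CELLS), above_mrr(400))
print(sorted(row["customer"] for row in rows))  # ['acme', 'cyan']
print(failures)  # {}
```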
Would I do it again
Without hesitation, yes — but I'd do two things differently.
First: I'd start with two cells, not one, on day zero. Even if both run on the same hardware. Once you have two, the discipline of "don't write code that assumes one" is enforced. Adding a 20th cell is operational work; converting from one to two is architectural work.
Second: I'd invest in observability tooling earlier. "Which cell is this request from" should be a first-class label on every metric and every log line. We bolted it on later, and we paid for it in late-night debugging.
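For log lines, one way to make the cell a first-class label is a logging filter that stamps every record from a per-request context variable. This is a sketch using Python's standard `logging` and `contextvars` modules. The `current_cell` variable, the `CellFilter` class, and the log format are assumptions for illustration, and metrics would need the equivalent label in whatever metrics library you use.

```python
import contextvars
import io
import logging

# Set once per request by the router or middleware (hypothetical wiring).
current_cell = contextvars.ContextVar("current_cell", default="unknown")


class CellFilter(logging.Filter):
    """Stamp every record with the cell handling the current request."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.cell = current_cell.get()
        return True


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("app")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s cell=%(cell)s %(message)s")
    )
    handler.addFilter(CellFilter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)

token = current_cell.set("cell-7")
log.info("schema migration started")
current_cell.reset(token)
log.info("idle")

print(buffer.getvalue(), end="")
# INFO cell=cell-7 schema migration started
# INFO cell=unknown idle
```

Putting the filter on the handler means every line that reaches the output carries the label, including lines from code that never heard of cells.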
The honest summary
Cell-based architecture is more work, more cost, and more discipline. It is also the single most effective reliability lever I've ever pulled. The "cost of cells" is paid in dollars and engineering time, both of which are finite but renewable. The "cost of a global outage" is paid in customer trust, which is finite and not so easily renewed.
If your business has reached the point where a single global outage would be genuinely existential (and you'd be surprised how early that point comes), start drafting a cell strategy. You don't need to ship it next quarter. You need to know how it would work, and you need the team to believe it's possible. Both of those compound.