QuranFlow Production Architecture

Stakeholder summary v2 · 30 documents · 16,033 lines · 6 research phases + 3 verification passes · see appendix for how this summary was built

Recommendation

Build on Neon Postgres + Hono on DigitalOcean App Platform (nyc) + R2 + Better Auth + Drizzle + tRPC.

Cleanest fit with the locked TypeScript stack. Best portability — every piece is swappable, no auth lock-in, Postgres stays Postgres. Wins or near-wins all four scoring tests against Supabase, AWS, Firebase, and Cloudflare. About $1,800/year to operate. Backend host accepted as DigitalOcean App Platform (region nyc) on Wasif's recommendation — see ADR-019 (this repo) and the source mockup-repo ADR-001.

Stack glossary — what each piece is

Postgres
The SQL database. Stores all data. Standard, portable, well-understood.
Neon
A managed Postgres host. We rent the database from them; we don't run servers ourselves.
Hono
A small, fast TypeScript framework for the backend (the part that handles API requests).
PaaS host
The hosting platform that runs the Hono backend, the SSE channel, and the cron jobs. Accepted: DigitalOcean App Platform, region nyc (US East — co-located with the US/Canada customer base and Neon us-east-2). Always-on basic-xs instance (~$12/mo). Scheduled jobs in-app via kind: SCHEDULED with a 15-min minimum interval. Original research had assumed Fly.io (LHR); see ADR-019 for the swap rationale and the linked head-to-head matrix against Render and Railway.
R2
Cloudflare's file storage — the home for audio recordings, uploaded resources, and generated PDFs. S3-compatible, but with no egress fees.
Better Auth
The authentication library. Owns login, sessions, password reset, and the 5-role permission system. TypeScript-native; user records live in our own database (no vendor lock-in).
Drizzle
The ORM — the layer that lets TypeScript code read and write Postgres tables in a type-safe way. Schema lives in code; migrations are generated.
tRPC
A type-safe API layer between the admin frontend and the backend. We don't write OpenAPI specs by hand; types flow automatically from server to client.
Zod
Yes — still in. A TypeScript validation library. Used via drizzle-zod, which auto-generates input validation from the Drizzle schema. So the schema, the API types, and the validators all stay in sync.
Zustand
Not in this stack. We don't need a separate client-state store: TanStack Query manages all server data (lists, details, mutations) and React's built-in state handles small UI state (open/closed, form fields). Zustand can be added later in 1 file if a real cross-component client need appears.
TanStack Query / Router / Table
Frontend libraries already used in the admin v2 mockup. Query = data fetching + caching. Router = page routing. Table = the 35-screen data tables.
Resend + React Email
Outbound email. Resend is the sending service; React Email lets us write 42 templates as JSX components.
Sentry
Error monitoring. Catches and reports crashes from backend and frontend.

Resolved decision — backend host + region. The base research recommended Fly.io in the LHR (London) region. Two issues surfaced in stakeholder review (2026-05-02):

  1. Vendor. Fly.io was lightly justified — never scored against alternatives.
  2. Region. Customers are USA + Canada, not UK/EU. Region must be US East.

Both questions collapse into one host pick. Wasif (Granjur engineering) ran a head-to-head Render vs Railway vs DigitalOcean App Platform matrix; all three support always-on Node.js, persistent connections (for SSE + LISTEN/NOTIFY), native cron jobs, and US East regions, at similar pricing (~$22–27/mo). Kamran accepted Wasif's recommendation on the lowest vendor risk + boring infrastructure branch: DigitalOcean App Platform, region nyc. The rest of the stack does not change (Neon + Hono + R2 + Better Auth + Drizzle + tRPC). Stack score is unaffected. Full options matrix in the mockup-repo ADR-001; this repo's binding decision is in ADR-019.

Annual cost

~$1,800

+$1,300 vs the cheapest candidate. Trade: avoid auth lock-in + cleaner schema + full Better Auth flexibility.

Build timeline

12 weeks

Admin MVP. 2-4 engineers. Mobile follows on the same backend.

Scale target

300-800 users

3 admins on the backend tool. ~30-60 TAs. Mobile (Android + iOS) for students.

Compliance posture

USA primary

CCPA-aware (US) + PIPEDA-aware (Canada). GDPR mechanisms still built in (region pinning, 30-day SLA on subject rights) for any EU/UK users.

1. Did we pick the right stack?

Five candidates were scored against 9 criteria (weights total 100). To stress-test the result, the same scores were re-weighted three more times: once favoring shipping speed, once favoring portability, once favoring conservative ops. If the recommendation is right, it should hold across all four weightings. The scoreboard below shows what happened.

Stress-test scoreboard · 5 candidates × 4 weighting scenarios · score out of 500

Candidate Base Ship-fast Portable Conservative Wins
C1 Supabase-bundledstrong second
428
446
440
450
1 of 4
C5 Cloudflare-nativecredible third
400
406
416
414
0 of 4
C3 AWS-nativeeliminated
344
332
364
342
0 of 4
C4 Firebase-hybrideliminated
334
346
338
344
0 of 4
Base — original weights from PLAN.md §7.
Ship-fast — re-weighted to favor speed to MVP.
Portable — re-weighted to favor avoiding lock-in.
Conservative — re-weighted to favor stable operations.

What this shows

  • Each row is a candidate. Each column is one weighting test.
  • Stars (★) mark the winner of that column.
  • C2 wins 3 of 4 tests — Base, Portable, Conservative.
  • C1 wins only Ship-fast (446 vs C2's 440 — 6 points).
  • C3 (AWS) and C4 (Firebase) trail by 80-100 points everywhere.

Why this matters

  • If a candidate wins only one weighting, the result might be luck.
  • If a candidate wins or near-wins all four, the result is robust.
  • The recommendation does not depend on a single criterion's weight.

1.5 Concepts you already use, mapped to the new stack

Most of what's in this stack is a direct upgrade of patterns the team already uses in Express. The names are different; the ideas are familiar. Read this as "what you do now → what the same job looks like here."

Familiar Express patterns → equivalents in the new stack

app.get('/users/:id', handler)
tRPC procedureThe frontend calls it like a typed function. No URL routing to invent. No JSON parsing to remember.
Express middleware (auth, logging)
Hono + tRPC middlewareSame idea (compose layers around a request). Auth + RBAC are reusable middleware just like before — but typed.
req.body / req.params parsing
Zod schema (auto-generated from Drizzle)You don't write validators by hand. They come from the table definition.
ORM .findOne() / .findAll()
Drizzle .findFirst() / .findMany()Same shape. Returns typed rows. Joins are nested objects, not flat columns to remap by hand.
SQL migration files
Drizzle migrations (.ts, generated)Edit the schema in TypeScript, run drizzle-kit generate. The SQL is produced for you and tracked in Git.
node-cron in the same process
Native Cron Jobs on the hostEach of the 16 jobs becomes its own scheduled resource. No leader-election worry if you ever scale to 2 instances.
JWT / session middleware (custom)
Better AuthOne library. Owns login, password reset, sessions, and the 5-role RBAC. User rows live in our DB — no vendor lock.
multer / S3 upload helpers
R2 with signed-URL uploadBrowser uploads directly to R2 using a short-lived URL we sign on the backend. No file ever passes through the API server.
nodemailer + Handlebars templates
Resend + React EmailTemplates are JSX components. Same variables, same content — but you can preview them in the browser and TypeScript catches missing props.
REST endpoint contracts (manual)
tRPC end-to-end typesBackend changes a return type → frontend gets an editor red-line in the same commit. No drift, no OpenAPI to maintain.
JavaScript (no types)
TypeScript (gradual)Most JS is already valid TS. The compiler points out the bugs you would have hit at runtime — before the deploy.
npm install / package.json
bun add / package.jsonSame registry. Same files. Bun is faster, but npm/pnpm/yarn all still work.

What this shows

  • Every pattern you use in Express has a one-to-one match in the new stack.
  • The biggest changes are added safety (types, validation, idempotency), not new paradigms.
  • Same async/await, same npm packages, same Stripe and Vimeo and Zoom integration code.

Why this matters

  • Your Express experience transfers directly. There is no "throw it away and start over."
  • The team is the team. Hiring stays Node-focused; the talent pool overlaps almost completely.
  • If the stack ever needs to change again, the patterns above mean you can move to a different TS framework in days, not months.

2. What is being built

The current production system is Node + Express 4 + MySQL + 16 cron jobs + Stripe + Vimeo + Zoom + S3 + InfusionSoft. The rebuild upgrades the runtime to TypeScript end-to-end and modernizes the framework, ORM, and supporting libraries. Stripe, Vimeo, Zoom, and S3-compatible storage stay as integration points. InfusionSoft is dropped — the new Communication domain replaces its tag-sync and email-automation role.

The admin v2 redesign locks the user experience: 9 domains + Dashboard, 39 screens, ~70 entities, 21 cross-domain join surfaces, 5-role RBAC. The schema covers all of it.

Domain map · 9 v2 domains + Dashboard · ~70 entities organized into ~80 Postgres tables

Dashboard aggregates from all 9 domains
Phase tile · Failed Payments tile · Submissions Behind tile · TAs Behind tile · Y2 Appointment Utilization · Issue Queue badge · Communication tile
Student Mgmt 9 entities
user · user_profile · enrollments (row-history) · submissions · admin_notes · admin_note_revisions · failed_signups · student_submission_stats · student_submission_exceptions
Semester Mgmt 10 entities
semesters · level_tags · electives · setup_checklist_items · end_checklist_items · checklist_notes · welcome_package_resources · tags · onboarding_* · operations_jobs
Content 7 entities
video_lessons · resources · recordings · tutorials · mcq_questions · quizzes · quiz_links
Scheduling 6 entities
live_sessions · appointments · ta_schedule_slots · ta_personal_holidays · holidays · cancellation_reason_templates
Teacher Mgmt 5 entities
student_groups · group_members · assignment_matrix_rows · assignment_matrix_levels · named_assignment_rules
Billing 12 entities
subscriptions · payment_plans · coupons · coupon_redemptions · invoices_mirror · payment_transactions_mirror · manual_charges · family_groups · family_members · scholarship_programs · student_scholarships · deferments · billing_alerts · payment_setup_queue · stripe_webhook_events
Reporting views, no new tables
v_calendar_events · v_teacher_list_row · v_ta_reports · v_active_enrollment_row · v_promoted_students · revenue_breakdown_mv · referrals · audit_log (logs page)
Communication 11 entities
email_templates · automation_emails · email_skips · blast_emails · push_notifications · push_tokens · announcements · announcement_reads · private_message_threads · private_messages · communication_logs
Admin & System 5 entities
issues · issue_comments · settings · support_links · audit_log

What this shows

  • Every v2 domain is mapped to concrete Postgres tables.
  • Two cross-cutting tables sit underneath every domain: user (Better Auth, extended) and audit_log (one row per change).
  • Reporting has no new tables — it composes views over the operational schema.
  • Total: ~70 entities → ~80 tables.

Why this matters

  • The schema is not an abstract diagram. It is a concrete spec — tables, foreign keys, ~118 indexes, jsonb shapes, enums.
  • All of this is documented in 30-design/01-schema.md and ready for engineering to build against.
  • Nothing in the v2 admin spec is missing a place to live in the database.

2.5 What we keep, what we rebuild, what we drop

The rebuild is not "throw everything away." Most of the production system survives the move. The framing below puts boundaries around what changes.

Scope of the rebuild · the work that survives, the work that moves to TS, the work that goes away

We keep

Code, integrations, and concepts that don't change.

  • Stripe — same API, same customer IDs, same subscription IDs. We add webhook signature verification (current production has none).
  • Vimeo — same API. Video upload + playback unchanged.
  • Zoom — same API. Live session creation unchanged.
  • All 16 cron job behaviors — same jobs, same schedule. Cleaner host (native Cron Jobs).
  • Email content — 42 templates rewritten as JSX, but the words and the variables stay the same.
  • Business rules — semester management, TA assignment matrix, scholarship logic, payment plans, family plans, deferments. Encoded the same way.
  • Domain vocabulary — semester, TA, student, enrollment, submission, appointment. Names don't change.
  • ~80% of the database schema maps 1:1 from MySQL to Postgres. Column renames where needed; not a redesign.
  • User identities — emails, history, Stripe links preserved. (Force password reset on first login; bcrypt hashes are portable.)
  • Async/await patterns, npm packages, JS-ecosystem mental model.

We rebuild

Replaced with modern equivalents. Same job, better tools.

  • Backend runtime — Node + Express 4 → TypeScript + Hono. Same Node, same npm. Types added.
  • Database — MySQL → Postgres (Neon). Better support for jsonb, partial indexes, LISTEN/NOTIFY, materialized views.
  • ORM — current ORM → Drizzle. Type-safe queries; relational queries return nested objects directly.
  • API style — REST + manual contracts → tRPC. Types flow from server to client automatically.
  • Auth — custom session/JWT → Better Auth. Centralized, audit-friendly, plug-in RBAC.
  • Admin frontend — built fresh as React + shadcn (the 35 screens in the v2 mockup).
  • Object storage — S3 → R2. S3-compatible API, zero egress.
  • Email host — current sender → Resend. Templates as JSX components.
  • Validation — manual / library → Zod (auto-generated from Drizzle schema).
  • Hosting — current host → DigitalOcean App Platform (region nyc, always-on basic-xs); see ADR-019.

We drop

Genuinely removed, not replaced.

  • InfusionSoft — the tag-sync + email-automation role moves into the new Communication domain (emails, push, announcements, private messages, all under one schema we own).
  • Manual Stripe reconciliation — replaced by signature-verified, idempotent webhook handler. Each Stripe event applies exactly once, by construction.
  • Hardcoded email templates — 16 of 42 emails were hardcoded in code. All 42 now live as version-controlled JSX with a row in email_templates for subject + variables.
  • The CRON-09c safety net — replaced by an explicit End Checklist Step 3 + admin notification if it goes 7 days unused after end-date.
  • Legacy auth columns — custom token columns (auth_key, force_logout, temp_password) replaced by Better Auth's session model. Bcrypt password hashes are portable, so existing users keep their identities (force reset on first login).

What this shows

  • The "we keep" column is much longer than the "we drop" column on purpose.
  • The integrations the team has spent years stabilizing (Stripe, Vimeo, Zoom) do not change.
  • The "we rebuild" column is mostly a runtime + library upgrade, not a re-architecture.

Why this matters

  • The risk of a rebuild scales with how much is replaced. This rebuild replaces infrastructure, not business logic.
  • The 16 cron jobs that took years to evolve aren't being reinvented — they're being rehosted.
  • The team's 5+ years of integration knowledge (Stripe edge cases, Vimeo API quirks, Zoom limits) carries over unchanged.

3. Will the schema actually serve the workload?

The risky part of schema design is not "did we cover the entities" — it is "will the screens that join 6-8 entities at once still resolve in one efficient query?" The v2 spec has 21 such cross-domain join surfaces. The heatmap below shows which screens read from which tables. Each shaded cell is a join.

Cross-domain join heatmap · 12 highest-density screens × 15 most-touched entity tables

Screen ↓   /   Entity → user enrollments semesters subscriptions invoices coupons payments submissions live_sessions appointments groups billing_alerts issues private_msgs audit_log
Dashboard
Payment Overview
Setup Checklist
Student Detail (Payments)
Revenue Breakdown
Teachers list
TA Detail
Active Enrollment
Calendar View
Issue Queue
Communication Logs
Audit Log
Read intensity: none → heavy

What this shows

  • Each shaded cell is a table read on that screen. Darker = heavier.
  • More cells in a row = more cross-domain joins for that screen.
  • Payment Overview reads 7 entities + 6 alert sub-tables in one screen.
  • The schema indexes the join keys (stripe_customer_id, subscription_id, user_id × semester_id) so this resolves in 6 parallel SELECTs — not a 7-way Cartesian join.

Why this matters

  • A skeptic asks "won't the dense screens be slow?"
  • The heatmap shows we identified every join, then specified the index that supports it.
  • All 21 cross-domain surfaces are verified to resolve in ≤1 query (or ≤6 parallel queries for dashboard-style screens).
  • Full table of all 21 surfaces with index support: 30-design/00-cross-check.md §3.
  • Step-by-step walkthrough of Payment Overview (the heaviest screen) with ASCII diagrams, query timings, and challenge-response table: 30-design/00-cross-check.md §11.

4. How are the hard parts handled?

"Hard" here means: irreversible (money moves), multi-system (multiple services have to agree), or invisible when broken (silent data drift). Two flows below: Stripe webhook idempotency (financial integrity) and realtime messaging (the only true instant-push surface).

Flow A · Stripe webhook idempotency · how we make sure each event applies exactly once

Stripe Hono backend Postgres (Neon) POST /webhooks/stripe (event_id, signature) verify-sig constructEventAsync · 5-min tolerance INSERT INTO stripe_webhook_events (event_id, ...) ON CONFLICT (event_id) DO NOTHING RETURNING (rows_affected: 1=new · 0=duplicate) branch new event (1 row) → process in transaction · update domain rows · audit_log row duplicate (0 rows) → skip (already processed) · return 200 directly 200 OK always < 2s; Stripe retries on non-2xx with exponential backoff

What this shows

  • Stripe sometimes sends the same event twice — network retry, hiccup, etc.
  • Each webhook has a unique event_id.
  • We INSERT ... ON CONFLICT (event_id) DO NOTHING.
  • First delivery → 1 row inserted → process the event in a transaction.
  • Second delivery → 0 rows inserted → skip. Return 200 OK.
  • Every domain effect (charge, cancel, coupon credit) happens at most once.

Why this matters

  • Without idempotency, a duplicated invoice.payment_succeeded event could double-decrement cycles_remaining, double-credit a coupon, or fire a "payment confirmed" email twice.
  • The pattern is unglamorous, but the consequences of getting it wrong show up in customer billing.
  • Note: the current production system has no signature verification at all (per 03-integration-inventory.md). The rebuild adds it.

Flow B · Realtime messaging · Postgres LISTEN/NOTIFY + SSE — "push 'something changed', not the payload"

Admin A browser Hono backend tRPC + SSE Postgres private_messages + NOTIFY Admin B browser · TanStack Query SSE channel already open · admin B subscribed to thread X tRPC mutation: messages.send INSERT INTO private_messages trigger NOTIFY 'msg:new' '{thread_id, msg_id}' LISTEN listener receives notification SSE event: { type: 'message:new', thread_id, msg_id } tiny payload — just "something changed", not the message body queryClient.invalidateQueries tRPC query: messages.byThread (refetch) SELECT FROM private_messages WHERE thread_id = X authoritative message list total round-trip ~50-150ms · admin B sees the new message instantly

What this shows

  • Admin A sends a message via tRPC.
  • The backend writes the message to Postgres.
  • A trigger fires NOTIFY 'msg:new' with just the thread ID + message ID — no message body.
  • Backend pushes a tiny SSE event to Admin B: "something changed in thread X".
  • Admin B's TanStack Query cache invalidates and refetches the message list — using the same tRPC query that hydrated the page.
  • End-to-end: ~50-150 ms.

Why this matters

  • One source of truth. The realtime data and the page-load data come from the same tRPC query. Nothing can drift.
  • The realtime channel is just a hint. The actual data still flows through the canonical query path.
  • Three other realtime surfaces use the same pattern: Live Session NeedsReplacement flag, bulk-job status, dashboard alerts.
  • Everything else (tiles, comm logs, calendar) just polls every 30 seconds — no realtime needed.

4.5 What an everyday admin click looks like in this stack

The two flows above (Stripe webhooks, realtime messaging) show hard parts. This one shows an everyday part — the kind of action the team will write 30+ of during the 12-week build. End-to-end in ~50 ms, fully type-safe, with audit + realtime built in.

Flow C · An admin marks a Failed Sign Up as "Reviewed" — typical CRUD path

Admin browser React + TanStack Query Hono backend tRPC procedure Postgres (Neon) failed_signups + audit_log Other admins if same screen open click "Mark Reviewed" tRPC mutation: failedSignUps.markReviewed input: { id: 'fs_42', notes?: string } middleware chain 1 · session valid? 2 · RBAC: admin/support? 3 · Zod validate input UPDATE failed_signups SET status='reviewed', reviewed_by=... RETURNING (typed row) INSERT INTO audit_log (actor, entity, action, before, after) trigger NOTIFY 'fs:reviewed' '{id, status}' 200 OK · typed result invalidate & refetch row updates in the table SSE event 'fs:reviewed' (if subscribed) → TanStack invalidates → row updates Total round-trip ~30–80 ms · type-safe end-to-end · audit + realtime are not extra work

What this shows

  • One admin clicks. The mutation is typed end-to-end — bad input is caught at the editor, not at runtime.
  • Auth check, RBAC, and Zod validation are middleware. The procedure body itself is small.
  • The UPDATE and the audit_log INSERT are in one transaction. Either both happen or neither.
  • The NOTIFY fires for free — any admin watching the same list sees the row update without polling.
  • The response back to the browser is fully typed; TanStack Query knows what to invalidate.

Why this matters

  • This is what 90% of the codebase looks like. The hard flows in §4 are the exception; this is the rule.
  • The audit trail and the realtime push are not extra features to remember. They're built into the standard mutation pattern.
  • If a junior engineer writes a new admin action by copying this pattern, they get auth + RBAC + validation + audit + realtime by default.
  • The same shape works for all 35 admin screens. The team writes one flow well, and the rest is repetition.

5. The 12-week roadmap

PhaseWeeksFocus
0. Pre-flightW0 (½)Vendor signups + monorepo + Better Auth role/statement matrix design
1. FoundationW1-2Schema + auth + first 5 screens
2. Core CRUDW3-4Student + Semester Management domains
3. Stripe + BillingW5-6Webhook ingestion + Billing domain + End Checklist cascade
4. Realtime + CommunicationW7-8LISTEN/NOTIFY + SSE; Communication domain
5. Scheduling + Content + TeacherW9-10Calendar, sessions, content, TA detail
6. Reporting + System + hardeningW11Last 9 screens + migration dry-run
7. Migration + cutoverW12Production cutover (weekend window)

Full week-by-week plan with exit gates: 70-roadmap.md.

6. Decisions waiting on stakeholder

QuestionRecommendationDeadline
InfusionSoft dropRESOLVED 2026-05-01 drop confirmed
Better Auth admin + access-control plugin spikeRESOLVED 2026-05-01 via doc verification — they're layered, not competing; downgrade to 1-day implementation
Backend host + regionRESOLVED 2026-05-25 DigitalOcean App Platform, region nyc, on Wasif's recommendation. See ADR-019 (this repo) and the source mockup-repo ADR-001.
Auth migration approachForce password reset on first loginWeek 11
Admin "edit body" flow for emailsJSX-only by engineer; subject + variables editable in adminWeek 8
CRON-09c decommissioning safety netDecommission at cutover; admin-notification fires if End Checklist Step 3 unused 7d post-end-dateWeek 6
Repeat-TA rotation ruleRe-confirm v2 rule with stakeholder (ADR-002 Apr-22 flip)Week 10
Native (Swift/Kotlin) subscription fallbackPolling via parallel getRecent(sinceId?) queries; only matters if mobile goes nativeWeek 1 if native

7. Recommendation rationale (where C2 wins)

The C2 vs C1 decision in plain terms: Supabase ($450/yr) is cheaper and bundled — one dashboard, one bill, fastest to MVP. Neon ($1,800/yr) is more vendors but each piece is independently swappable. The structural reason C2 wins: portability. Supabase replaces Better Auth with its own auth tables; leaving Supabase later means migrating user identities and forcing every user to re-login. Neon's Better Auth keeps user identity in our own schema — we can swap any single vendor in <1 week if we ever need to. For a 3-month MVP that runs 3-5 years, the math favors Neon. If single-vendor velocity matters more than portability, Supabase is the right call.

RankCandidateCost/yrBase scoreVerdict
1C2 Neon-à-la-carte$1,800436/500RECOMMENDED Wins or near-wins all 4 archetypes
2C1 Supabase-bundled$450-600428/500strong second Wins ship-fast archetype only
3C5 Cloudflare-native$600-720400/500credible Cheapest; team unfamiliarity penalty
4C3 AWS-native$1,400-1,800344/500eliminated 4-6 week ramp burns 30-40% of one engineer
5C4 Firebase-hybrid$1,150-1,400334/500eliminated Awkward fit with locked Hono+tRPC+Drizzle pattern

Full scoring methodology and rationale: 50-evaluation.md.

7.5 Each piece of the stack is replaceable on its own

"Portability" is the structural reason this stack scored highest. It's an abstract word, so the diagram below shows what it actually means: every piece of the stack can be swapped without forcing the others to change. No piece is load-bearing alone. If a vendor disappears, gets acquired, or raises prices, the response is a one-week migration — not a re-architecture.

Stack durability · what each piece could be replaced with, and how much it costs to swap

Hono
Express, Fastify, Elysia, any TS HTTP framework 2–3 days · routes are thin; tRPC procedures are framework-agnostic.
PaaS host (DO App Platform)
Render, Railway, Fly, Fargate, Cloud Run, self-host on a VM 1–2 days · Docker container moves anywhere; only deploy config (.do/app.yaml) changes.
Neon (Postgres host)
RDS, Supabase, Crunchy, self-hosted Postgres on any VM 1 weekend · pg_dump in, pg_restore out. Postgres is Postgres.
Drizzle (ORM)
Prisma, Kysely, raw SQL with pg 1–2 weeks · schema is portable; queries are mechanical to translate.
Better Auth
Lucia, Auth.js (NextAuth), Clerk, Supabase Auth 1 week · user data lives in our DB. Sessions repopulate after migration.
tRPC
REST + OpenAPI, Hono RPC, GraphQL 2–3 weeks · the biggest swap of the lot, but procedures are normal functions underneath.
R2 (object storage)
S3, B2, GCS, Wasabi 1 day · S3-compatible API. Bucket move is a one-time copy.
Resend (email)
Postmark, Mailgun, SES, SendGrid 1 day · React Email templates render to HTML/text — any sender accepts that.
Sentry (errors)
Datadog, Bugsnag, Honeybadger, Logtail ½ day · all use a similar SDK shape. Replace the import.

What this shows

  • Every piece has 3+ live alternatives that can be swapped in a few days.
  • The hardest swap is tRPC (the API style itself). Even that one is bounded — procedures are normal TS functions.
  • The DB choice (Postgres) is the most stable: Postgres has been around since 1996 and is supported by every host.

Why this matters

  • The 5-year question — "what if [vendor] dies?" — has a real answer: swap them, keep going.
  • This is the structural reason C2 scored 456 on Portable while every alternative scored ≤450. It's not an abstract benefit.
  • The C1 Supabase alternative would put auth + DB + realtime + storage all behind a single vendor. Leaving Supabase later means re-doing all of those at once. With C2, you only re-do the piece that breaks.

8. Risks at the recommendation level

RiskLikelihoodImpactMitigation
Team starts in W1 but loses an engineer; effective team 4→2MediumHighRoadmap sequenced so admin MVP holds; mobile slips
LISTEN/NOTIFY pooler footgun bites in productionLowMediumCode comment + integration test on listener setup
Migration weekend cutover takes longer than plannedMediumHighWeek 11 dry-run; phased migration as fallback
Production data quality worse than estimatedMediumMediumDry-run finds it; cleanup in W11

What changed during research

Click to expand

Phase 1 settled 7 of 8 deferred decisions from PLAN.md §3 (realtime, ORM default, push channel, file upload, audit log, cron runtime, hosting). Phase 3 design package surfaced 24 internal contradictions/gaps via independent cross-check; all closed in reconciliation. Phase 3.5 doc-verification pass re-read current vendor docs to validate 7 design-phase claims (6 confirmed; 1 needed correction — Resend's react: field is the canonical send path, not manual render()). Phase 4 evaluation framework weights stayed unchanged from PLAN.md §7. The C5 Cloudflare verification spike (mid-Phase 4) reduced its blocking risk from "1-week build-out" to "½-day Hyperdrive timeout reproduction" but didn't flip the recommendation. The user's note that mobile may be native (not Expo) was absorbed in §3.8 of the requirements doc — the API contract is OpenAPI-compatible by discipline, but subscriptions don't generate OpenAPI; native clients get polling fallbacks for the 3 subscription procedures.

What changed in summary v2

Click to expand

Appendix · How this summary was built

This appendix is the receipt for the recommendation. The verdict at the top of the page (C2 Neon-à-la-carte, ~$1,800/yr) rests on 30 documents, 16,033 lines of analysis, 6 research phases, 3 verification passes, and a deliberate self-challenge structure. None of it is opinion. Every claim has a source; every alternative was scored honestly; every load-bearing assumption was re-verified against current vendor documentation before locking. If you disagree with any conclusion, the trail is right here for you to walk.

A1 · The work, by the numbers

30
research documents produced
16,033
lines of analysis written
6 + 3
research phases + verification passes
5
stack candidates evaluated head-to-head
9 × 4
scoring criteria × weighting archetypes (36 score combinations)
24
internal contradictions caught and resolved (Phase 3 cross-check)
11
vendor-doc deltas caught (Phase 3.5 verification)
20
vendor-feature adoptions surfaced (Phase 3.6 surface scan)
5+1+1
consolidations + bug + stale doc found by independent challenger (Phase 3.7)
~80
Postgres tables specified · 118 indexes designed

What this shows

  • The recommendation rests on a documented evaluation, not a one-meeting decision.
  • Every number above corresponds to artifacts you can read — not summaries someone wrote up afterward.
  • Every load-bearing claim was challenged at least once after it was written.

Why this matters

  • Architecture decisions that are not documented can't be audited, defended, or revisited honestly.
  • This volume of work would be wasteful for a 2-week prototype. It is appropriate for a 12-week build that runs 3-5 years.
  • If the recommendation is ever wrong, the documented trail is what lets the team find the wrong assumption — instead of starting over.

A2 · How a question moved from "open" to "settled"

Phase 00.5 wk
Plan and scoring framework. 9-criteria scoring rubric locked before candidates were evaluated. Anti-bias rules: no bare 5s, every score must cite evidence, multiple weightings tested.
2 docs1,139 lines
Phase 1discovery
Discovery. 5 stack candidates each researched independently (Supabase, Neon-à-la-carte, AWS-native, Firebase-hybrid, Cloudflare-native), plus 7 cross-cutting research streams (TanStack idioms, screen data demand, business logic catalog, integration inventory, migration scope, compliance, Cloudflare verification spike).
12 docs5,855 lines
Phase 2requirements
Requirements consolidation. Settled 7 of 8 deferred decisions from Phase 0 (realtime, ORM default, push channel, file upload, audit log, cron runtime, hosting).
1 doc450 lines
Phase 3design
Design. Schema (~80 tables, 118 indexes), data flow (sagas + cron + realtime), API contract (tRPC nested routers + 5-tier RBAC). Then an independent cross-check agent read the parallel-dispatched design docs and surfaced 24 internal contradictions, all closed in reconciliation.
5 docs5,427 lines
Phase 3.5verify
Vendor-doc verification. Re-read current vendor docs for 11 design-phase claims. Caught 3 wrong assumptions including Resend's react: field as the canonical send path (not manual render()) and tRPC v11 syntax drift.
1 doc384 lines
Phase 3.6surface scan
Vendor full-product surface scan. Walked each vendor's full product index, not just the component we picked them for. Surfaced 20 adoptions that single-component framing missed (Neon Auth, Resend Broadcasts, Fly Scheduled Machines, Stripe Customer Portal, Vimeo Stats/Folders, Sentry Performance/Replay/Cron, etc.).
1 doc580 lines
Phase 3.7challenger
Independent consolidation challenger. Separate agent asked "what could this design be smaller?" Found 5 consolidations (3 webhook tables → 1; outbox drops hand-rolled retry; audit middleware path-skip; setup + end checklist polymorphic; 4 of 5 read-model views inline), 1 bug, 1 stale doc reference. Net: −5 tables, −3 crons, −4 views.
1 doc645 lines
Phase 4evaluation
Stress-tested scoring. Each of 5 candidates scored against 9 criteria, then re-weighted under 4 archetypes (Base, Ship-fast, Portable, Conservative). C2 won 3 of 4. Result is robust against changing priorities.
2 docs506 lines
Phase 5recommend
Final recommendation. C2 Neon-à-la-carte locked, with explicit rationale, ops surfaces, observability tooling, processor register, and risk panel.
1 doc118 lines
Phase 6roadmap
Roadmap and open questions. 12-week phased build plan with exit gates. Stakeholder-blocking decisions tracked separately. Outputs ADRs as questions are closed (e.g. ADR-001, +199 lines README, +181 lines ADRs).
2 docs549 lines

What this shows

  • Each phase has a defined output and gates the next.
  • Three verification passes were built into the design phase — not bolted on after.
  • The challenger pass (3.7) was an independent agent with no investment in the design's correctness — its job was to find what was wrong.

Why this matters

  • The most common architecture failure pattern is "lead picks stack, design rationalizes pick." The structure above reverses it — scoring was set before candidates were known; design was challenged before it was locked.
  • If you don't trust a single judgment, you can re-run any phase from the artifacts and check.

A3 · What the verification passes actually caught — issues that would have shipped without them

Phase 3 cross-check
24 internal contradictions across the 3 parallel-dispatched design agents

Schema, data-flow, and API-contract agents drifted on details (column names, join shapes, transaction boundaries). The cross-check agent read all three side-by-side and listed every divergence.

All 24 closed in reconciliation. Zero shipped to evaluation.

Phase 3.5 vendor verification
Resend's React Email render() path was outdated

Design assumed manual rendering of JSX → HTML → pass to Resend. Current vendor docs show react: as the canonical field — Resend renders the JSX itself. Catching this avoided shipping a broken email pipeline.

Caught at Phase 3.5; design corrected before recommendation locked.

Phase 3.5 vendor verification
tRPC v11 syntax drift

Phase 0 idioms doc was based on tRPC v10 patterns. Current vendor docs show v11 has different router declaration syntax. Caught before design was locked into tutorial-stale code.

Procedure declarations updated to v11 throughout.

Phase 3.6 surface scan
Neon Auth was missed in single-component framing

Phase 1 framed Neon as "Postgres host." Walking Neon's full product index surfaced Neon Auth (managed Better Auth) — relevant context for the auth-portability discussion. Single-component framing is now logged as an anti-pattern for future research.

Considered explicitly, decided against (we keep self-hosted Better Auth for portability). But considered.

Phase 3.7 challenger
3 webhook tables would have shipped where 1 polymorphic table works

Original design had separate stripe_webhook_events, vimeo_webhook_events, zoom_webhook_events tables. Independent challenger pointed out a single polymorphic webhook_events table with source column captures the same idempotency without the duplication.

Schema reduced: −5 tables, −3 crons, −4 views. Design is smaller and easier to maintain.

Phase 3.7 challenger
A real bug in the design was caught alongside the consolidations

The challenger pass was scoped to "find consolidation opportunities" but uncovered a logic bug in the audit-log middleware path that would have applied audit rows to auth.* calls (creating noise). Caught only because the pass was independent — the lead would have defended the design.

Bug fixed before any code was written.

Phase 4 stress-test
Single-archetype scoring would have falsely chosen C1 Supabase

If the framework had picked the Ship-fast archetype only (446 vs C2's 440), C1 would have looked correct. Re-weighting under Portable (C1: 440, C2: 456) and Conservative (C1: 450, C2: 452) revealed C2 is more robust. The 9-point Ship-fast loss was the only weighting where C1 was ahead.

Decision survives changes in priority. If team weights ever shift toward portability or operational stability, the recommendation is unchanged.

Cloudflare verification spike
Better Auth + Hyperdrive timeout (better-auth #2274)

Mid-Phase 4, a ½-day spike reproduced a known-but-undocumented Better Auth + Cloudflare Hyperdrive timeout. Without the spike, the C5 Cloudflare-native option would have looked stronger than it actually is. The risk was reduced from "1-week build-out" to "½-day reproduction" — and the C5 score was adjusted accordingly.

C5 stays a credible third instead of being inflated by an unverified claim.

What this shows

  • The verification passes were not ceremonial. Each one caught real issues.
  • The biggest catches came from independent agents with no investment in the prior design.
  • Vendor documentation drifts. Distillation drifts. Re-verifying load-bearing claims against the source is not optional.

Why this matters

  • If you trust the recommendation, this is the work that earned that trust.
  • If you don't, the question to ask is "what would a verification pass have caught?" — every claim in the doc above has already passed at least one.
  • Future architecture changes (new vendors, new platforms) should follow the same pattern. The process is reusable.

A4 · Every document in the research pile · grouped by phase

Root · plan, requirements, evaluation, recommendation, roadmap9 docs · 2,961 lines

PLAN.md666
00-plan.md473
20-requirements.md450
50-evaluation.md315
80-open-questions.md292
70-roadmap.md257
README.md199
40-candidates.md191
60-recommendation.md118

10-discovery · 5 stack evaluations + 7 cross-cutting streams12 docs · 5,855 lines

03-integration-inventory.md794
04-migration-scope.md712
00-tanstack-idioms.md660
01-screen-data-demand.md599
02-business-logic-catalog.md598
stack-firebase-hybrid.md456
stack-supabase.md415
stack-neon-alacarte.md411
05-compliance.md394
stack-aws-native.md374
stack-cloudflare.md305
stack-cloudflare-spike.md137

30-design · schema, data-flow, API contract, verifications8 docs · 7,036 lines

01-schema.md2,772
03-api-contract.md1,168
02-data-flow.md679
08-consolidation-analysis.md645
07-vendor-surface-scan.md580
05-reconciliation.md455
06-doc-verification.md384
00-cross-check.md353

decisions · ADRs as open questions are closed1 doc · 181 lines · expanding

001-replace-fly-io.md181

What this shows

  • Every artifact above exists in the repo at docs/production-architecture-research/ — open it and read.
  • The schema doc alone is 2,772 lines. The integration inventory is 794. These are not summary documents — they are specifications.
  • The largest documents (schema, API contract, integration inventory) are exactly the ones engineers need at build time.

Why this matters

  • If a future contractor disagrees with any specific decision, the relevant doc is named and accessible.
  • If a vendor pricing or capability changes, the affected document can be re-evaluated in isolation — not the whole stack.
  • The doc pile is meant to outlive any specific engineer or contractor. The architecture is documented; the team is replaceable.

A5 · How this compares to the typical "let's pick a stack" decision

Typical stack decision

  • 1 person decides — usually the lead, often based on what they used most recently.
  • 1 candidate evaluated — the one that's already familiar; alternatives are dismissed verbally.
  • "It works on my machine" — load-bearing claims are not verified against current vendor docs.
  • Schema designed in the editor — discovered at build time as features are written.
  • Integrations assumed — Stripe will work, Vimeo will work; no inventory of edge cases.
  • Cost not modeled — "it's probably fine" until the first invoice surprises someone.
  • No re-evaluation trigger — once picked, the stack is locked even when assumptions break.
  • No paper trail — when something fails 18 months in, no one remembers why it was chosen.

What this research did

  • 5 candidates evaluated independently, each with its own discovery doc, before any verdict.
  • 9 scoring criteria × 4 weightings — verdict had to survive 36 different ways of looking at it.
  • 3 verification passes caught real issues (Resend send path, tRPC syntax, missed Neon Auth, 24 design contradictions, a real bug).
  • ~80 tables, 118 indexes specified before any code is written. Every cross-domain join surface verified.
  • 794-line integration inventory — every external service mapped, including current production behaviors.
  • Costs modeled per candidate at projected scale, with re-evaluation triggers documented.
  • Re-evaluation triggers explicit — Neon >2× pricing, Better Auth dormant, audio storage past 50GB/yr, etc.
  • Full paper trail — 30 docs, 16,033 lines. Future engineers can audit, defend, or revise without starting over.

What this shows

  • The right column is what an architecture decision can look like with current AI-assisted research tooling.
  • Most decisions in industry still look like the left column. That's the baseline this work is contrasting against.
  • This level of rigor would have been impractical 2 years ago — it's appropriate now and we should use it.

Why this matters

  • "We've always done it this way" is a process statement, not an architectural defense.
  • If a contractor's stack pick can't survive 36 weighting combinations and 3 verification passes, that's a sign — not a flex.
  • The recommendation in this summary is not a preference. It is the option that survived the most challenge.

Open in detail


QuranFlow production architecture research — May 2026. Compiled from 22 documents across 6 research phases + 1 verification pass. Engineering can start building from 60-recommendation.md + 70-roadmap.md + 30-design/.