QuranFlow Production Architecture

Stakeholder summary v2 · 30 documents · 16,033 lines · 6 research phases + 3 verification passes · see appendix for how this summary was built

Recommendation

Build on Neon Postgres + Hono on DigitalOcean App Platform (nyc) + R2 + Better Auth + Drizzle + tRPC.

Cleanest fit with the locked TypeScript stack. Best portability — every piece is swappable, no auth lock-in, Postgres stays Postgres. Wins or near-wins all four scoring tests against Supabase, AWS, Firebase, and Cloudflare. About $1,800/year to operate. Backend host accepted as DigitalOcean App Platform (region nyc) on Wasif's recommendation — see ADR-019 (this repo) and the source mockup-repo ADR-001.

Stack glossary — what each piece is

Postgres: The SQL database. Stores all data. Standard, portable, well-understood.
Neon: A managed Postgres host. We rent the database from them; we don't run servers ourselves.
Hono: A small, fast TypeScript framework for the backend (the part that handles API requests).
PaaS host: The hosting platform that runs the Hono backend, the SSE channel, and the cron jobs. Accepted: DigitalOcean App Platform, region nyc (US East — co-located with the US/Canada customer base and Neon us-east-2). Always-on basic-xs instance (~$12/mo). Scheduled jobs in-app via kind: SCHEDULED with a 15-min minimum interval. Original research had assumed Fly.io (LHR); see ADR-019 for the swap rationale and the linked head-to-head matrix against Render and Railway.
R2: Cloudflare's file storage — the home for audio recordings, uploaded resources, and generated PDFs. S3-compatible, but with no egress fees.
Better Auth: The authentication library. Owns login, sessions, password reset, and the 5-role permission system. TypeScript-native; user records live in our own database (no vendor lock-in).
Drizzle: The ORM — the layer that lets TypeScript code read and write Postgres tables in a type-safe way. Schema lives in code; migrations are generated.
tRPC: A type-safe API layer between the admin frontend and the backend. We don't write OpenAPI specs by hand; types flow automatically from server to client.
Zod: A TypeScript validation library, still in the stack. Used via drizzle-zod, which generates input validation schemas from the Drizzle schema, so the schema, the API types, and the validators all stay in sync.
Zustand: Not in this stack. We don't need a separate client-state store: TanStack Query manages all server data (lists, details, mutations) and React's built-in state handles small UI state (open/closed, form fields). Zustand can be added later in 1 file if a real cross-component client need appears.
TanStack Query / Router / Table: Frontend libraries already used in the admin v2 mockup. Query = data fetching + caching. Router = page routing. Table = the 35-screen data tables.
Resend + React Email: Outbound email. Resend is the sending service; React Email lets us write 42 templates as JSX components.
Sentry: Error monitoring. Catches and reports crashes from backend and frontend.

Backend host: DigitalOcean App Platform, region nyc. Co-located with Neon (us-east-2) for sub-10ms DB latency. Rationale + alternatives matrix in ADR-019.

Annual cost

~$1,800

+$1,300 vs the cheapest candidate. Trade: avoid auth lock-in + cleaner schema + full Better Auth flexibility.

Build timeline

12 weeks

Admin MVP. 2-4 engineers. Mobile follows on the same backend.

Scale target

300-800 users

3 admins on the backend tool. ~30-60 TAs. Mobile (Android + iOS) for students.

Compliance posture

USA primary

CCPA-aware (US) + PIPEDA-aware (Canada). GDPR mechanisms still built in (region pinning, 30-day SLA on subject rights) for any EU/UK users.

1. Did we pick the right stack?

Five candidates were scored against 9 criteria (weights total 100). To stress-test the result, the same scores were re-weighted three more times: once favoring shipping speed, once favoring portability, once favoring conservative ops. If the recommendation is right, it should hold across all four weightings. The scoreboard below shows what happened.

Stress-test scoreboard · 5 candidates × 4 weighting scenarios · score out of 500

Candidate	Base	Ship-fast	Portable	Conservative	Wins
C2 Neon-à-la-carte (recommended)	★ 436	440	★ 456	★ 452	3 of 4
C1 Supabase-bundled (strong second)	428	★ 446	440	450	1 of 4
C5 Cloudflare-native (credible third)	400	406	416	414	0 of 4
C3 AWS-native (eliminated)	344	332	364	342	0 of 4
C4 Firebase-hybrid (eliminated)	334	346	338	344	0 of 4

Base — original weights from PLAN.md §7.

Ship-fast — re-weighted to favor speed to MVP.

Portable — re-weighted to favor avoiding lock-in.

Conservative — re-weighted to favor stable operations.

What this shows

Each row is a candidate. Each column is one weighting test.
Stars (★) mark the winner of that column.
C2 wins 3 of 4 tests — Base, Portable, Conservative.
C1 wins only Ship-fast (446 vs C2's 440 — 6 points).
C3 (AWS) and C4 (Firebase) trail the column winner by roughly 90-120 points in every test.

Why this matters

If a candidate wins only one weighting, the result might be luck.
If a candidate wins or near-wins all four, the result is robust.
The recommendation does not depend on a single criterion's weight.
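
The re-weighting test can be sketched in a few lines. This is a minimal illustration, assuming each criterion is scored 1-5 and the total is the sum of weight × score (which is consistent with weights totaling 100 and scores out of 500, but the authoritative formula is in 50-evaluation.md). The three criteria names, the candidates A and B, and all numbers are hypothetical.

```typescript
// Hypothetical criteria and numbers, for illustration only.
type Criterion = "portability" | "shipSpeed" | "opsStability";
type Weights = Record<Criterion, number>; // percentages, must total 100
type Scores = Record<Criterion, number>; // one 1-5 score per criterion

// Weighted total: sum of weight x score, so the maximum is 100 x 5 = 500.
function weightedTotal(scores: Scores, weights: Weights): number {
  const criteria = Object.keys(weights) as Criterion[];
  const weightSum = criteria.reduce((sum, c) => sum + weights[c], 0);
  if (weightSum !== 100) {
    throw new Error(`weights total ${weightSum}, expected 100`);
  }
  return criteria.reduce((sum, c) => sum + weights[c] * scores[c], 0);
}

// Returns the name of the highest-scoring candidate under one weighting.
function columnWinner(candidates: Record<string, Scores>, weights: Weights): string {
  let winner = "";
  let best = -Infinity;
  for (const [name, scores] of Object.entries(candidates)) {
    const total = weightedTotal(scores, weights);
    if (total > best) {
      best = total;
      winner = name;
    }
  }
  return winner;
}

const sampleCandidates: Record<string, Scores> = {
  A: { portability: 5, shipSpeed: 3, opsStability: 4 },
  B: { portability: 3, shipSpeed: 5, opsStability: 4 },
};
const baseWeights: Weights = { portability: 40, shipSpeed: 30, opsStability: 30 };
const shipFastWeights: Weights = { portability: 20, shipSpeed: 50, opsStability: 30 };

console.log(columnWinner(sampleCandidates, baseWeights)); // A (410 vs 390)
console.log(columnWinner(sampleCandidates, shipFastWeights)); // B (430 vs 370)
```

The same scores produce a different winner once the weights shift, which is why a recommendation that holds across all four weightings is more trustworthy than one that wins a single column.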

1.5 Concepts you already use, mapped to the new stack

Most of what's in this stack is a direct upgrade of patterns the team already uses in Express. The names are different; the ideas are familiar. Read this as "what you do now → what the same job looks like here."

Familiar Express patterns → equivalents in the new stack

Express pattern	New-stack equivalent	Notes
app.get('/users/:id', handler)	tRPC procedure	The frontend calls it like a typed function. No URL routing to invent. No JSON parsing to remember.
Express middleware (auth, logging)	Hono + tRPC middleware	Same idea (compose layers around a request). Auth + RBAC are reusable middleware just like before, but typed.
req.body / req.params parsing	Zod schema (auto-generated from Drizzle)	You don't write validators by hand. They come from the table definition.
ORM .findOne() / .findAll()	Drizzle .findFirst() / .findMany()	Same shape. Returns typed rows. Joins are nested objects, not flat columns to remap by hand.
SQL migration files	Drizzle migrations (SQL, generated from the .ts schema)	Edit the schema in TypeScript, run drizzle-kit generate. The SQL is produced for you and tracked in Git.
node-cron in the same process	Native Cron Jobs on the host	Each of the 16 jobs becomes its own scheduled resource. No leader-election worry if you ever scale to 2 instances.
JWT / session middleware (custom)	Better Auth	One library. Owns login, password reset, sessions, and the 5-role RBAC. User rows live in our DB, so there is no vendor lock-in.
multer / S3 upload helpers	R2 with signed-URL upload	Browser uploads directly to R2 using a short-lived URL we sign on the backend. No file ever passes through the API server.
nodemailer + Handlebars templates	Resend + React Email	Templates are JSX components. Same variables, same content, but you can preview them in the browser and TypeScript catches missing props.
REST endpoint contracts (manual)	tRPC end-to-end types	Backend changes a return type → frontend gets an editor red-line in the same commit. No drift, no OpenAPI to maintain.
JavaScript (no types)	TypeScript (gradual)	Most JS is already valid TS. The compiler points out the bugs you would have hit at runtime, before the deploy.
npm install / package.json	bun add / package.json	Same registry. Same files. Bun is faster, but npm/pnpm/yarn all still work.

What this shows

Every pattern you use in Express has a one-to-one match in the new stack.
The biggest changes are added safety (types, validation, idempotency), not new paradigms.
Same async/await, same npm packages, same Stripe and Vimeo and Zoom integration code.

Why this matters

Your Express experience transfers directly. There is no "throw it away and start over."
The team is the team. Hiring stays Node-focused; the talent pool overlaps almost completely.
If the stack ever needs to change again, the patterns above mean you can move to a different TS framework in days, not months.
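
The "types flow from server to client" idea can be shown without any library. The sketch below is not tRPC's API; it is a dependency-free stand-in, assuming a hypothetical userById procedure, that shows the underlying mechanism: the client derives every call signature from the router's type, so a server-side change surfaces as a compile error on the client.

```typescript
// "Server" side: procedures are ordinary typed functions grouped in a router.
// (Hypothetical procedure and fields, not the project's real API.)
const appRouter = {
  userById(input: { id: number }) {
    return { id: input.id, displayName: `user-${input.id}`, role: "ta" as const };
  },
};

// The client imports only this type, never the implementation.
type AppRouter = typeof appRouter;

// Derive a client signature for every procedure from the router type.
type Caller<R> = {
  [K in keyof R]: R[K] extends (input: infer I) => infer O ? (input: I) => O : never;
};

// In-process stand-in for the network layer that tRPC would provide.
function createCaller(router: AppRouter): Caller<AppRouter> {
  return router;
}

const client = createCaller(appRouter);
const user = client.userById({ id: 7 });

// user.displayName is known to be a string and user.role to be "ta".
// Renaming displayName on the server would turn the next line into a compile error.
console.log(user.displayName, user.role);
```

In the real stack the call crosses the network, but the editor experience is the same: the frontend never restates the response shape by hand.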

2. What is being built

The current production system is Node + Express 4 + MySQL + 16 cron jobs + Stripe + Vimeo + Zoom + S3 + InfusionSoft. The rebuild upgrades the runtime to TypeScript end-to-end and modernizes the framework, ORM, and supporting libraries. Stripe, Vimeo, Zoom, and S3-compatible storage stay as integration points. InfusionSoft is dropped — the new Communication domain replaces its tag-sync and email-automation role.

The admin v2 redesign locks the user experience: 9 domains + Dashboard, 39 screens, ~70 entities, 21 cross-domain join surfaces, 5-role RBAC. The schema covers all of it.

Domain map · 9 v2 domains + Dashboard · ~70 entities organized into ~80 Postgres tables

Dashboard (aggregates from all 9 domains): Phase tile · Failed Payments tile · Submissions Behind tile · TAs Behind tile · Y2 Appointment Utilization · Issue Queue badge · Communication tile
Student Mgmt (9 entities): user · user_profile · enrollments (row-history) · submissions · admin_notes · admin_note_revisions · failed_signups · student_submission_stats · student_submission_exceptions
Semester Mgmt (10 entities): semesters · level_tags · electives · setup_checklist_items · end_checklist_items · checklist_notes · welcome_package_resources · tags · onboarding_* · operations_jobs
Content (7 entities): video_lessons · resources · recordings · tutorials · mcq_questions · quizzes · quiz_links
Scheduling (6 entities): live_sessions · appointments · ta_schedule_slots · ta_personal_holidays · holidays · cancellation_reason_templates
Teacher Mgmt (5 entities): student_groups · group_members · assignment_matrix_rows · assignment_matrix_levels · named_assignment_rules
Billing (12 entities): subscriptions · payment_plans · coupons · coupon_redemptions · invoices_mirror · payment_transactions_mirror · manual_charges · family_groups · family_members · scholarship_programs · student_scholarships · deferments · billing_alerts · payment_setup_queue · stripe_webhook_events
Reporting (views, no new tables): v_calendar_events · v_teacher_list_row · v_ta_reports · v_active_enrollment_row · v_promoted_students · revenue_breakdown_mv · referrals · audit_log (logs page)
Communication (11 entities): email_templates · automation_emails · email_skips · blast_emails · push_notifications · push_tokens · announcements · announcement_reads · private_message_threads · private_messages · communication_logs
Admin & System (5 entities): issues · issue_comments · settings · support_links · audit_log

What this shows

Every v2 domain is mapped to concrete Postgres tables.
Two cross-cutting tables sit underneath every domain: user (Better Auth, extended) and audit_log (one row per change).
Reporting has no new tables — it composes views over the operational schema.
Total: ~70 entities → ~80 tables.

Why this matters

The schema is not an abstract diagram. It is a concrete spec — tables, foreign keys, ~118 indexes, jsonb shapes, enums.
All of this is documented in 30-design/01-schema.md and ready for engineering to build against.
Nothing in the v2 admin spec is missing a place to live in the database.

2.5 What we keep, what we rebuild, what we drop

The rebuild is not "throw everything away." Most of the production system survives the move. The framing below puts boundaries around what changes.

Scope of the rebuild · the work that survives, the work that moves to TS, the work that goes away

We keep

Code, integrations, and concepts that don't change.

Stripe — same API, same customer IDs, same subscription IDs. We add webhook signature verification (current production has none).
Vimeo — same API. Video upload + playback unchanged.
Zoom — same API. Live session creation unchanged.
All 16 cron job behaviors — same jobs, same schedule. Cleaner host (native Cron Jobs).
Email content — 42 templates rewritten as JSX, but the words and the variables stay the same.
Business rules — semester management, TA assignment matrix, scholarship logic, payment plans, family plans, deferments. Encoded the same way.
Domain vocabulary — semester, TA, student, enrollment, submission, appointment. Names don't change.
~80% of the database schema maps 1:1 from MySQL to Postgres. Column renames where needed; not a redesign.
User identities — emails, history, Stripe links preserved. (Force password reset on first login; bcrypt hashes are portable.)
Async/await patterns, npm packages, JS-ecosystem mental model.

We rebuild

Replaced with modern equivalents. Same job, better tools.

Backend runtime — Node + Express 4 → TypeScript + Hono. Same Node, same npm. Types added.
Database — MySQL → Postgres (Neon). Better support for jsonb, partial indexes, LISTEN/NOTIFY, materialized views.
ORM — current ORM → Drizzle. Type-safe queries; relational queries return nested objects directly.
API style — REST + manual contracts → tRPC. Types flow from server to client automatically.
Auth — custom session/JWT → Better Auth. Centralized, audit-friendly, plug-in RBAC.
Admin frontend — built fresh as React + shadcn (the 35 screens in the v2 mockup).
Object storage — S3 → R2. S3-compatible API, zero egress.
Email host — current sender → Resend. Templates as JSX components.
Validation — manual / library → Zod (auto-generated from Drizzle schema).
Hosting — current host → DigitalOcean App Platform (region nyc, always-on basic-xs); see ADR-019.

We drop

Genuinely removed, not replaced.

InfusionSoft — the tag-sync + email-automation role moves into the new Communication domain (emails, push, announcements, private messages, all under one schema we own).
Manual Stripe reconciliation — replaced by signature-verified, idempotent webhook handler. Each Stripe event applies exactly once, by construction.
Hardcoded email templates — 16 of 42 emails were hardcoded in code. All 42 now live as version-controlled JSX with a row in email_templates for subject + variables.
The CRON-09c safety net — replaced by an explicit End Checklist Step 3 + admin notification if it goes 7 days unused after end-date.
Legacy auth columns — custom token columns (auth_key, force_logout, temp_password) replaced by Better Auth's session model. Bcrypt password hashes are portable, so existing users keep their identities (force reset on first login).
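
The rule that replaces the CRON-09c safety net is small enough to state as a pure function. This is a minimal sketch under stated assumptions: the function name and parameters are hypothetical, and the boundary (notify at exactly 7 days or later) is a guess at the intended behavior.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;
const GRACE_DAYS = 7;

// Hypothetical inputs: the semester end date and when (if ever) End Checklist Step 3 was run.
function shouldNotifyAdmin(
  semesterEndDate: Date,
  step3CompletedAt: Date | null,
  now: Date,
): boolean {
  if (step3CompletedAt !== null) return false; // step already done, nothing to flag
  const elapsedMs = now.getTime() - semesterEndDate.getTime();
  return elapsedMs >= GRACE_DAYS * DAY_MS; // assumption: fires at exactly 7 days or later
}

const endDate = new Date("2026-06-01T00:00:00Z");
console.log(shouldNotifyAdmin(endDate, null, new Date("2026-06-07T23:59:59Z"))); // false
console.log(shouldNotifyAdmin(endDate, null, new Date("2026-06-08T00:00:00Z"))); // true
console.log(
  shouldNotifyAdmin(endDate, new Date("2026-06-02T09:00:00Z"), new Date("2026-06-30T00:00:00Z")),
); // false
```

Keeping the rule free of database and clock access makes it easy to unit-test; the scheduled job would only supply the three inputs.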

What this shows

The "we keep" column is much longer than the "we drop" column on purpose.
The integrations the team has spent years stabilizing (Stripe, Vimeo, Zoom) do not change.
The "we rebuild" column is mostly a runtime + library upgrade, not a re-architecture.

Why this matters

The risk of a rebuild scales with how much is replaced. This rebuild replaces infrastructure, not business logic.
The 16 cron jobs that took years to evolve aren't being reinvented — they're being rehosted.
The team's 5+ years of integration knowledge (Stripe edge cases, Vimeo API quirks, Zoom limits) carries over unchanged.

3. Will the schema actually serve the workload?

The risky part of schema design is not "did we cover the entities" — it is "will the screens that join 6-8 entities at once still resolve in one efficient query?" The v2 spec has 21 such cross-domain join surfaces. The heatmap below shows which screens read from which tables. Each shaded cell is a join.

Cross-domain join heatmap · 12 highest-density screens × 15 most-touched entity tables

Heatmap rows (screens): Dashboard · Payment Overview · Setup Checklist · Student Detail (Payments) · Revenue Breakdown · Teachers list · TA Detail · Active Enrollment · Calendar View · Issue Queue · Communication Logs · Audit Log
Heatmap columns (entity tables): user · enrollments · semesters · subscriptions · invoices · coupons · payments · submissions · live_sessions · appointments · groups · billing_alerts · issues · private_msgs · audit_log
Cell shading: read intensity, none → heavy

What this shows

Each shaded cell is a table read on that screen. Darker = heavier.
More cells in a row = more cross-domain joins for that screen.
Payment Overview reads 7 entities + 6 alert sub-tables in one screen.
The schema indexes the join keys (stripe_customer_id, subscription_id, user_id × semester_id) so this resolves in 6 parallel SELECTs — not a 7-way Cartesian join.

Why this matters

A skeptic asks "won't the dense screens be slow?"
The heatmap shows we identified every join, then specified the index that supports it.
All 21 cross-domain surfaces are verified to resolve in a single query (or at most 6 parallel queries for dashboard-style screens).
Full table of all 21 surfaces with index support: 30-design/00-cross-check.md §3.
Step-by-step walkthrough of Payment Overview (the heaviest screen) with ASCII diagrams, query timings, and challenge-response table: 30-design/00-cross-check.md §11.
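
The "parallel SELECTs instead of one wide join" approach can be sketched as follows. This is an illustration only: fakeSelect stands in for an indexed database query, the delays are invented, and the choice of these six tables is an assumption (the actual query list is in 30-design/00-cross-check.md).

```typescript
type Row = Record<string, unknown>;

// Stand-in for one indexed SELECT. A real version would query Postgres.
function fakeSelect(table: string, userId: string, delayMs: number): Promise<Row[]> {
  return new Promise((resolve) => {
    setTimeout(() => resolve([{ table, userId }]), delayMs);
  });
}

// Issue every query at once; total latency tracks the slowest query, not the sum.
async function loadPaymentOverview(userId: string) {
  const [subscriptions, invoices, payments, coupons, enrollments, alerts] = await Promise.all([
    fakeSelect("subscriptions", userId, 20),
    fakeSelect("invoices_mirror", userId, 15),
    fakeSelect("payment_transactions_mirror", userId, 25),
    fakeSelect("coupon_redemptions", userId, 10),
    fakeSelect("enrollments", userId, 10),
    fakeSelect("billing_alerts", userId, 30),
  ]);
  return { subscriptions, invoices, payments, coupons, enrollments, alerts };
}

loadPaymentOverview("user-1").then((overview) => {
  console.log(Object.keys(overview).length); // 6
});
```

Each query hits one indexed key, so the result sets stay small and the screen assembles them in application code rather than asking the planner to join everything at once.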

4. How are the hard parts handled?

"Hard" here means: irreversible (money moves), multi-system (multiple services have to agree), or invisible when broken (silent data drift). Two flows below: Stripe webhook idempotency (financial integrity) and realtime messaging (the only true instant-push surface).

Flow A · Stripe webhook idempotency · how we make sure each event applies exactly once

What this shows

Stripe sometimes sends the same event twice — network retry, hiccup, etc.
Each webhook has a unique event_id.
We INSERT ... ON CONFLICT (event_id) DO NOTHING.
First delivery → 1 row inserted → process the event in a transaction.
Second delivery → 0 rows inserted → skip. Return 200 OK.
Every domain effect (charge, cancel, coupon credit) happens at most once.

Why this matters

Without idempotency, a duplicated invoice.payment_succeeded event could double-decrement cycles_remaining, double-credit a coupon, or fire a "payment confirmed" email twice.
The pattern is unglamorous, but the consequences of getting it wrong show up in customer billing.
Note: the current production system has no signature verification at all (per 03-integration-inventory.md). The rebuild adds it.
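
The insert-then-check pattern above can be modeled in memory. This is a minimal sketch, not the production handler: a Set stands in for the unique index on event_id in stripe_webhook_events, the class and field names are hypothetical, and the real version would run the insert and the domain effect in one database transaction so a failure rolls both back.

```typescript
type StripeEvent = { id: string; type: string; amountCents: number };

class WebhookProcessor {
  // Models the unique index on stripe_webhook_events(event_id).
  private seenEventIds = new Set<string>();
  creditedCents = 0;

  // Mirrors INSERT ... ON CONFLICT (event_id) DO NOTHING:
  // returns the number of rows inserted (1 on first delivery, 0 on a repeat).
  private insertEvent(eventId: string): number {
    if (this.seenEventIds.has(eventId)) return 0;
    this.seenEventIds.add(eventId);
    return 1;
  }

  // Either outcome is answered with 200 OK so Stripe stops retrying.
  handle(event: StripeEvent): "processed" | "duplicate" {
    if (this.insertEvent(event.id) === 0) return "duplicate";
    this.creditedCents += event.amountCents; // the domain effect, applied once
    return "processed";
  }
}

const processor = new WebhookProcessor();
const sampleEvent: StripeEvent = {
  id: "evt_123",
  type: "invoice.payment_succeeded",
  amountCents: 4900,
};

console.log(processor.handle(sampleEvent)); // processed
console.log(processor.handle(sampleEvent)); // duplicate
console.log(processor.creditedCents); // 4900
```

The key property is that the duplicate check and the "claim" of the event are the same operation, so two concurrent deliveries cannot both pass the check.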

Flow B · Realtime messaging · Postgres LISTEN/NOTIFY + SSE — "push 'something changed', not the payload"

What this shows

Admin A sends a message via tRPC.
The backend writes the message to Postgres.
A trigger calls pg_notify on the msg:new channel with just the thread ID + message ID, no message body.
Backend pushes a tiny SSE event to Admin B: "something changed in thread X".
Admin B's TanStack Query cache invalidates and refetches the message list — using the same tRPC query that hydrated the page.
End-to-end: ~50-150 ms.

Why this matters

One source of truth. The realtime data and the page-load data come from the same tRPC query. Nothing can drift.
The realtime channel is just a hint. The actual data still flows through the canonical query path.
Three other realtime surfaces use the same pattern: Live Session NeedsReplacement flag, bulk-job status, dashboard alerts.
Everything else (tiles, comm logs, calendar) just polls every 30 seconds — no realtime needed.
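
The "hint, not payload" pattern has two halves: a tiny server-sent event and a client cache that treats it as an invalidation signal. The sketch below is dependency-free and hypothetical: ThreadCache is a stand-in for what TanStack Query's invalidation does, and the event name reuses msg:new from the flow above.

```typescript
type Hint = { threadId: string; messageId: string };

// SSE wire format: "event:" and "data:" lines, terminated by a blank line.
function formatSseEvent(eventName: string, hint: Hint): string {
  return `event: ${eventName}\ndata: ${JSON.stringify(hint)}\n\n`;
}

// Client side: a hint only marks a thread stale; data always comes from the fetcher.
class ThreadCache {
  private stale = new Set<string>();
  private messages = new Map<string, string[]>();
  private fetchMessages: (threadId: string) => string[];
  fetchCount = 0;

  constructor(fetchMessages: (threadId: string) => string[]) {
    this.fetchMessages = fetchMessages;
  }

  onHint(hint: Hint): void {
    this.stale.add(hint.threadId);
  }

  read(threadId: string): string[] {
    const cached = this.messages.get(threadId);
    if (cached !== undefined && !this.stale.has(threadId)) return cached;
    const fresh = this.fetchMessages(threadId); // the canonical query path
    this.fetchCount += 1;
    this.messages.set(threadId, fresh);
    this.stale.delete(threadId);
    return fresh;
  }
}

const serverThread: string[] = ["first message"];
const cache = new ThreadCache(() => serverThread.slice());

console.log(cache.read("t1").length); // 1
serverThread.push("second message");
console.log(cache.read("t1").length); // 1 (still cached, no hint received yet)
cache.onHint({ threadId: "t1", messageId: "m2" });
console.log(cache.read("t1").length); // 2
```

Because the hint carries no message body, a dropped or duplicated hint can only cause a late or extra refetch, never wrong data.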

4.5 What an everyday admin click looks like in this stack

The two flows above (Stripe webhooks, realtime messaging) show hard parts. This one shows an everyday part — the kind of action the team will write 30+ of during the 12-week build. End-to-end in ~50 ms, fully type-safe, with audit + realtime built in.

Flow C · An admin marks a Failed Sign Up as "Reviewed" — typical CRUD path

What this shows

One admin clicks. The mutation is typed end-to-end — bad input is caught at the editor, not at runtime.
Auth check, RBAC, and Zod validation are middleware. The procedure body itself is small.
The UPDATE and the audit_log INSERT are in one transaction. Either both happen or neither.
The NOTIFY fires for free — any admin watching the same list sees the row update without polling.
The response back to the browser is fully typed; TanStack Query knows what to invalidate.

Why this matters

This is what 90% of the codebase looks like. The hard flows in §4 are the exception; this is the rule.
The audit trail and the realtime push are not extra features to remember. They're built into the standard mutation pattern.
If a junior engineer writes a new admin action by copying this pattern, they get auth + RBAC + validation + audit + realtime by default.
The same shape works for all 35 admin screens. The team writes one flow well, and the rest is repetition.
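
The standard mutation shape (RBAC check, input validation, then UPDATE plus audit_log INSERT in one transaction) can be sketched without the real libraries. Everything here is a stand-in: Db is an in-memory model with copy-on-write transactions, the role names and the action string are hypothetical, and parseInput plays the part that a generated Zod schema would play.

```typescript
type Role = "admin" | "ta" | "student";
type Actor = { userId: string; role: Role };
type FailedSignup = { id: string; status: "new" | "reviewed" };
type AuditRow = { actorId: string; action: string; entityId: string };

class Db {
  failedSignups = new Map<string, FailedSignup>();
  auditLog: AuditRow[] = [];

  // Runs work against a copy and commits only if work returns without throwing.
  transaction<T>(work: (tx: Db) => T): T {
    const tx = new Db();
    this.failedSignups.forEach((row, id) => tx.failedSignups.set(id, { ...row }));
    tx.auditLog = this.auditLog.slice();
    const result = work(tx); // a throw here skips the commit below
    this.failedSignups = tx.failedSignups;
    this.auditLog = tx.auditLog;
    return result;
  }
}

// Stand-in for schema validation of the procedure input.
function parseInput(input: unknown): string {
  if (typeof input === "object" && input !== null) {
    const id = (input as { id?: unknown }).id;
    if (typeof id === "string") return id;
  }
  throw new Error("BAD_REQUEST");
}

function markReviewed(db: Db, actor: Actor, input: unknown): FailedSignup {
  if (actor.role !== "admin") throw new Error("FORBIDDEN"); // RBAC check
  const id = parseInput(input);
  return db.transaction((tx) => {
    const row = tx.failedSignups.get(id);
    if (!row) throw new Error("NOT_FOUND");
    row.status = "reviewed";
    tx.auditLog.push({ actorId: actor.userId, action: "failed_signup.reviewed", entityId: id });
    return row;
  });
}

const db = new Db();
db.failedSignups.set("fs-1", { id: "fs-1", status: "new" });
const updated = markReviewed(db, { userId: "admin-1", role: "admin" }, { id: "fs-1" });
console.log(updated.status, db.auditLog.length); // reviewed 1
```

The procedure body stays a handful of lines because the cross-cutting concerns live outside it; a new admin action changes only the entity, the field, and the action string.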

5. The 12-week roadmap

Phase	Weeks	Focus
0. Pre-flight	W0 (½)	Vendor signups + monorepo + Better Auth role/statement matrix design
1. Foundation	W1-2	Schema + auth + first 5 screens
2. Core CRUD	W3-4	Student + Semester Management domains
3. Stripe + Billing	W5-6	Webhook ingestion + Billing domain + End Checklist cascade
4. Realtime + Communication	W7-8	LISTEN/NOTIFY + SSE; Communication domain
5. Scheduling + Content + Teacher	W9-10	Calendar, sessions, content, TA detail
6. Reporting + System + hardening	W11	Last 9 screens + migration dry-run
7. Migration + cutover	W12	Production cutover (weekend window)

Full week-by-week plan with exit gates: 70-roadmap.md.

6. Decisions waiting on stakeholder

Question	Recommendation	Deadline
InfusionSoft drop	RESOLVED 2026-05-01: drop confirmed	n/a
Better Auth admin + access-control plugin spike	RESOLVED 2026-05-01 via doc verification: they're layered, not competing; downgrade to 1-day implementation	n/a
Backend host + region	RESOLVED 2026-05-25: DigitalOcean App Platform, region `nyc`, on Wasif's recommendation. See ADR-019 (this repo) and the source mockup-repo ADR-001.	n/a
Auth migration approach	Force password reset on first login	Week 11
Admin "edit body" flow for emails	JSX-only by engineer; subject + variables editable in admin	Week 8
CRON-09c decommissioning safety net	Decommission at cutover; admin-notification fires if End Checklist Step 3 unused 7d post-end-date	Week 6
Repeat-TA rotation rule	Re-confirm v2 rule with stakeholder (ADR-002 Apr-22 flip)	Week 10
Native (Swift/Kotlin) subscription fallback	Polling via parallel `getRecent(sinceId?)` queries; only matters if mobile goes native	Week 1 if native

7. Recommendation rationale (where C2 wins)

The C2 vs C1 decision in plain terms: Supabase ($450/yr) is cheaper and bundled (one dashboard, one bill, fastest to MVP). The Neon stack ($1,800/yr) involves more vendors, but each piece is independently swappable. The structural reason C2 wins is portability. Supabase replaces Better Auth with its own auth tables; leaving Supabase later means migrating user identities and forcing every user to re-login. In the Neon stack, Better Auth keeps user identity in our own schema, so any single vendor can be swapped in about a week or less if we ever need to. For a 3-month MVP that runs 3-5 years, the math favors Neon. If single-vendor velocity matters more than portability, Supabase is the right call.

Rank	Candidate	Cost/yr	Base score	Verdict
1	C2 Neon-à-la-carte	$1,800	436/500	RECOMMENDED: Wins or near-wins all 4 archetypes
2	C1 Supabase-bundled	$450-600	428/500	Strong second: Wins ship-fast archetype only
3	C5 Cloudflare-native	$600-720	400/500	Credible: Cheapest; team unfamiliarity penalty
4	C3 AWS-native	$1,400-1,800	344/500	Eliminated: 4-6 week ramp burns 30-40% of one engineer
5	C4 Firebase-hybrid	$1,150-1,400	334/500	Eliminated: Awkward fit with locked Hono+tRPC+Drizzle pattern

Full scoring methodology and rationale: 50-evaluation.md.

7.5 Each piece of the stack is replaceable on its own

"Portability" is the structural reason this stack scored highest. It's an abstract word, so the diagram below shows what it actually means: every piece of the stack can be swapped without forcing the others to change. No piece is load-bearing alone. If a vendor disappears, gets acquired, or raises prices, the response is a one-week migration — not a re-architecture.

Stack durability · what each piece could be replaced with, and how much it costs to swap

Piece	Alternatives	Swap cost	Notes
Hono	Express, Fastify, Elysia, any TS HTTP framework	2–3 days	Routes are thin; tRPC procedures are framework-agnostic.
PaaS host (DO App Platform)	Render, Railway, Fly, Fargate, Cloud Run, self-host on a VM	1–2 days	Docker container moves anywhere; only deploy config (.do/app.yaml) changes.
Neon (Postgres host)	RDS, Supabase, Crunchy, self-hosted Postgres on any VM	1 weekend	pg_dump in, pg_restore out. Postgres is Postgres.
Drizzle (ORM)	Prisma, Kysely, raw SQL with pg	1–2 weeks	Schema is portable; queries are mechanical to translate.
Better Auth	Lucia, Auth.js (NextAuth), Clerk, Supabase Auth	1 week	User data lives in our DB. Sessions repopulate after migration.
tRPC	REST + OpenAPI, Hono RPC, GraphQL	2–3 weeks	The biggest swap of the lot, but procedures are normal functions underneath.
R2 (object storage)	S3, B2, GCS, Wasabi	1 day	S3-compatible API. Bucket move is a one-time copy.
Resend (email)	Postmark, Mailgun, SES, SendGrid	1 day	React Email templates render to HTML/text, which any sender accepts.
Sentry (errors)	Datadog, Bugsnag, Honeybadger, Logtail	½ day	All use a similar SDK shape. Replace the import.

What this shows

Every piece has 3+ live alternatives. Most swaps take days; the longest (Drizzle at 1–2 weeks, tRPC at 2–3 weeks) are still bounded.
The hardest swap is tRPC (the API style itself). Even that one is bounded — procedures are normal TS functions.
The DB choice (Postgres) is the most stable: Postgres has been around since 1996 and is supported by every host.

Why this matters

The 5-year question — "what if [vendor] dies?" — has a real answer: swap them, keep going.
This is the structural reason C2 scored 456 on Portable while the best alternative (C1) scored 440. It's not an abstract benefit.
The C1 Supabase alternative would put auth + DB + realtime + storage all behind a single vendor. Leaving Supabase later means re-doing all of those at once. With C2, you only re-do the piece that breaks.
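
In code, "swap one piece without touching the others" usually comes down to application code depending on a narrow interface rather than a vendor SDK. The sketch below illustrates that with email sending. The interface and both adapters are hypothetical; real adapters would call the vendor's SDK or HTTP API.

```typescript
type RenderedEmail = { to: string; subject: string; html: string };
type SendResult = { accepted: boolean; vendor: string };

// The only contract application code knows about.
interface EmailSender {
  send(email: RenderedEmail): SendResult;
}

// Hypothetical adapters. Real ones would call each vendor's SDK or HTTP API.
class FakeResendSender implements EmailSender {
  send(email: RenderedEmail): SendResult {
    return { accepted: email.to.includes("@"), vendor: "resend" };
  }
}

class FakePostmarkSender implements EmailSender {
  send(email: RenderedEmail): SendResult {
    return { accepted: email.to.includes("@"), vendor: "postmark" };
  }
}

// Application code depends only on the interface, so a vendor swap is one line at the call site.
function sendWelcome(sender: EmailSender, to: string): SendResult {
  return sender.send({ to, subject: "Welcome to QuranFlow", html: "<p>Welcome</p>" });
}

console.log(sendWelcome(new FakeResendSender(), "student@example.com").vendor); // resend
console.log(sendWelcome(new FakePostmarkSender(), "student@example.com").vendor); // postmark
```

The same shape applies to object storage and error reporting: one small interface per concern, one adapter per vendor.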

8. Risks at the recommendation level

Risk	Likelihood	Impact	Mitigation
Team starts in W1 but loses an engineer; effective team 4→2	Medium	High	Roadmap sequenced so admin MVP holds; mobile slips
LISTEN/NOTIFY pooler footgun bites in production	Low	Medium	Code comment + integration test on listener setup
Migration weekend cutover takes longer than planned	Medium	High	Week 11 dry-run; phased migration as fallback
Production data quality worse than estimated	Medium	Medium	Dry-run finds it; cleanup in W11

What changed during research

Phase 1 settled 7 of 8 deferred decisions from PLAN.md §3 (realtime, ORM default, push channel, file upload, audit log, cron runtime, hosting). Phase 3 design package surfaced 24 internal contradictions/gaps via independent cross-check; all closed in reconciliation. Phase 3.5 doc-verification pass re-read current vendor docs to validate 7 design-phase claims (6 confirmed; 1 needed correction — Resend's react: field is the canonical send path, not manual render()). Phase 4 evaluation framework weights stayed unchanged from PLAN.md §7. The C5 Cloudflare verification spike (mid-Phase 4) reduced its blocking risk from "1-week build-out" to "½-day Hyperdrive timeout reproduction" but didn't flip the recommendation. The user's note that mobile may be native (not Expo) was absorbed in §3.8 of the requirements doc — the API contract is OpenAPI-compatible by discipline, but subscriptions don't generate OpenAPI; native clients get polling fallbacks for the 3 subscription procedures.

What changed in summary v2

Appendix added — credibility receipts for the recommendation (2026-05-02). Five panels at the end of the doc: by-the-numbers stat tiles, vertical pipeline of all 6 phases + 3 verification passes, 8 specific issues the verification passes caught (which would have shipped without them), full document map of all 30 docs grouped by phase with line counts, and a 2-column "vs the typical stack decision" comparison. Written to anchor the recommendation against contractor skepticism with evidence rather than assertion.
Four new sections added for the engineering team (2026-05-02) — written for Express devs reading from the legacy production system:
- §1.5 Concept map — every Express pattern (route handler, middleware, ORM, validation, cron, JWT, multer, nodemailer, REST contracts) mapped one-to-one to the new stack. Shows the team that their experience transfers.
- §2.5 Keep / Rebuild / Drop — three-column scope panel making clear what survives the rebuild (Stripe, Vimeo, Zoom, business rules, vocabulary, ~80% of schema), what gets replaced (runtime, DB, ORM, API style), and what genuinely goes away (InfusionSoft, manual reconciliation).
- §4.5 Trace one click — a third sequence diagram showing a typical CRUD action ("admin marks Failed Sign Up as Reviewed") so the team sees the everyday flow, not just the hard parts.
- §7.5 Stack durability — visual swap-list showing every stack piece with replacement candidates and migration cost. Makes the "portability" argument concrete.
Production stack reference corrected — Yii2/PHP → Node + Express 4. Concept map and rebuild scope reflect the actual legacy system the team works in.
Fly.io removed; DigitalOcean App Platform accepted (Fly removed 2026-05-02; DO accepted 2026-05-25). The Fly pick was lightly justified — never scored against alternatives. After a head-to-head against Render and Railway, Kamran accepted Wasif's recommendation: DigitalOcean App Platform, region nyc (lowest vendor risk + boring infrastructure). Full options matrix in the mockup-repo ADR-001; the binding decision in this repo is ADR-019. Stack score and architecture are unaffected; only the host vendor and region change.
Compliance posture flipped from UK/EU primary → USA primary (CCPA + PIPEDA-aware, GDPR mechanisms retained).
Region question merged with host question — picking the new host also picks the US-East region. The two open decisions collapse into one.
Stack glossary added — every term (Postgres, Neon, Hono, PaaS host, R2, Better Auth, Drizzle, tRPC, Zod, Zustand, TanStack, Resend, Sentry) defined inline in plain language.
Section 1 visualization replaced — bar chart → stress-test scoreboard table with star markers showing which candidate wins each weighting test.
All "What this shows / Why it matters" boxes rewritten as side-by-side bullet lists with bigger, lighter typography (replaces dense italic paragraphs).
Zod confirmed in stack, Zustand explicitly noted as not needed (TanStack Query handles server state).

Appendix · How this summary was built

This appendix is the receipt for the recommendation. The verdict at the top of the page (C2 Neon-à-la-carte, ~$1,800/yr) rests on 30 documents, 16,033 lines of analysis, 6 research phases, 3 verification passes, and a deliberate self-challenge structure. None of it is opinion. Every claim has a source; every alternative was scored honestly; every load-bearing assumption was re-verified against current vendor documentation before locking. If you disagree with any conclusion, the trail is right here for you to walk.

A1 · The work, by the numbers

30

research documents produced

16,033

lines of analysis written

6 + 3

research phases + verification passes

5

stack candidates evaluated head-to-head

9 × 4

scoring criteria × weighting archetypes (36 score combinations)

24

internal contradictions caught and resolved (Phase 3 cross-check)

vendor-doc deltas caught (Phase 3.5 verification)

20

vendor-feature adoptions surfaced (Phase 3.6 surface scan)

5+1+1

consolidations + bug + stale doc found by independent challenger (Phase 3.7)

~80

Postgres tables specified · 118 indexes designed

What this shows

The recommendation rests on a documented evaluation, not a one-meeting decision.
Every number above corresponds to artifacts you can read — not summaries someone wrote up afterward.
Every load-bearing claim was challenged at least once after it was written.

Why this matters

Architecture decisions that are not documented can't be audited, defended, or revisited honestly.
This volume of work would be wasteful for a 2-week prototype. It is appropriate for a 12-week build that runs 3-5 years.
If the recommendation is ever wrong, the documented trail is what lets the team find the wrong assumption — instead of starting over.

A2 · How a question moved from "open" to "settled"

Phase 0 · 0.5 wk

Plan and scoring framework. 9-criteria scoring rubric locked before candidates were evaluated. Anti-bias rules: no bare 5s, every score must cite evidence, multiple weightings tested.

2 docs · 1,139 lines

▼

Phase 1 · discovery

Discovery. 5 stack candidates each researched independently (Supabase, Neon-à-la-carte, AWS-native, Firebase-hybrid, Cloudflare-native), plus 7 cross-cutting research streams (TanStack idioms, screen data demand, business logic catalog, integration inventory, migration scope, compliance, Cloudflare verification spike).

12 docs · 5,855 lines

▼

Phase 2 · requirements

Requirements consolidation. Settled 7 of 8 deferred decisions from Phase 0 (realtime, ORM default, push channel, file upload, audit log, cron runtime, hosting).

1 doc · 450 lines

▼

Phase 3 · design

Design. Schema (~80 tables, 118 indexes), data flow (sagas + cron + realtime), API contract (tRPC nested routers + 5-tier RBAC). Then an independent cross-check agent read the parallel-dispatched design docs and surfaced 24 internal contradictions, all closed in reconciliation.

5 docs · 5,427 lines

Phase 3.5 · verify

Vendor-doc verification. Re-read current vendor docs for 11 design-phase claims. Caught 3 wrong assumptions, including the Resend send path (the react: field is canonical, not manual render()) and tRPC v11 syntax drift.

1 doc · 384 lines

Phase 3.6 · surface scan

Vendor full-product surface scan. Walked each vendor's full product index, not just the component we picked them for. Surfaced 20 adoptions that single-component framing missed (Neon Auth, Resend Broadcasts, Fly Scheduled Machines, Stripe Customer Portal, Vimeo Stats/Folders, Sentry Performance/Replay/Cron, etc.).

1 doc · 580 lines

Phase 3.7 · challenger

Independent consolidation challenger. A separate agent asked "where could this design be smaller?" Found 5 consolidations (3 webhook tables → 1; outbox drops hand-rolled retry; audit middleware path-skip; setup + end checklist polymorphic; 4 of 5 read-model views inline), 1 bug, 1 stale doc reference. Net: −5 tables, −3 crons, −4 views.

1 doc · 645 lines

▼

Phase 4 · evaluation

Stress-tested scoring. Each of 5 candidates scored against 9 criteria, then re-weighted under 4 archetypes (Base, Ship-fast, Portable, Conservative). C2 won 3 of 4. The result is robust against changing priorities.

2 docs · 506 lines

▼

Phase 5 · recommend

Final recommendation. C2 Neon-à-la-carte locked, with explicit rationale, ops surfaces, observability tooling, processor register, and risk panel.

1 doc · 118 lines

▼

Phase 6 · roadmap

Roadmap and open questions. 12-week phased build plan with exit gates. Stakeholder-blocking decisions tracked separately. ADRs are written as questions close (e.g. ADR-001). The README (+199 lines) and the ADRs (+181 lines) are counted on top of the 2 docs below.

2 docs · 549 lines

What this shows

Each phase has a defined output and gates the next.
Three verification passes were built into the design phase — not bolted on after.
The challenger pass (3.7) was an independent agent with no investment in the design's correctness — its job was to find what was wrong.

Why this matters

The most common architecture failure pattern is "lead picks stack, design rationalizes pick." The structure above reverses it: the scoring rubric was locked before candidates were evaluated, and the design was challenged before it was locked.
If you don't trust a single judgment, you can re-run any phase from the artifacts and check.

A3 · What the verification passes actually caught — issues that would have shipped without them

Phase 3 cross-check

24 internal contradictions across the 3 parallel-dispatched design agents

Schema, data-flow, and API-contract agents drifted on details (column names, join shapes, transaction boundaries). The cross-check agent read all three side-by-side and listed every divergence.

All 24 closed in reconciliation. Zero shipped to evaluation.

Phase 3.5 vendor verification

Resend's React Email render() path was outdated

Design assumed manual rendering of JSX → HTML → pass to Resend. Current vendor docs show react: as the canonical field — Resend renders the JSX itself. Catching this avoided shipping a broken email pipeline.

Caught at Phase 3.5; design corrected before recommendation locked.

Phase 3.5 vendor verification

tRPC v11 syntax drift

The Phase 1 idioms doc (00-tanstack-idioms.md) was based on tRPC v10 patterns. Current vendor docs show v11 has different router declaration syntax. Caught before design was locked into tutorial-stale code.

Procedure declarations updated to v11 throughout.

Phase 3.6 surface scan

Neon Auth was missed in single-component framing

Phase 1 framed Neon as "Postgres host." Walking Neon's full product index surfaced Neon Auth (managed Better Auth) — relevant context for the auth-portability discussion. Single-component framing is now logged as an anti-pattern for future research.

Considered explicitly, decided against (we keep self-hosted Better Auth for portability). But considered.

Phase 3.7 challenger

3 webhook tables would have shipped where 1 polymorphic table works

The original design had separate stripe_webhook_events, vimeo_webhook_events, and zoom_webhook_events tables. The independent challenger pointed out that a single polymorphic webhook_events table with a source column captures the same idempotency without the duplication.

Schema reduced: −5 tables, −3 crons, −4 views. Design is smaller and easier to maintain.
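A minimal TypeScript sketch of the idempotency idea behind the single table. The source discriminator and the three provider names come from the finding above. The eventId field, the WebhookEventStore class, and the in-memory Map (standing in for a unique constraint on source plus event id in Postgres) are assumptions for illustration, not the schema in 01-schema.md.

```typescript
// Hypothetical sketch: one polymorphic store replaces three per-provider tables.
type WebhookSource = "stripe" | "vimeo" | "zoom";

interface WebhookEvent {
  source: WebhookSource; // discriminator column named in the design
  eventId: string;       // assumed: the provider's own event identifier
  payload: unknown;
}

class WebhookEventStore {
  // Stand-in for a UNIQUE (source, event_id) constraint in Postgres.
  private readonly seen = new Map<string, WebhookEvent>();

  /** Returns true if the event was recorded, false if it was a duplicate delivery. */
  recordOnce(event: WebhookEvent): boolean {
    const key = `${event.source}:${event.eventId}`;
    if (this.seen.has(key)) return false;
    this.seen.set(key, event);
    return true;
  }

  count(): number {
    return this.seen.size;
  }
}
```

Because the key includes the source, the same event id arriving from two different providers is stored twice, while a redelivery from the same provider is ignored.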

Phase 3.7 challenger

A real bug in the design was caught alongside the consolidations

The challenger pass was scoped to "find consolidation opportunities" but also uncovered a logic bug in the audit-log middleware path: it would have written audit rows for auth.* calls, creating noise. It was caught only because the pass was independent. The lead would have defended the design.

Bug fixed before any code was written.
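A minimal sketch of what a path-skip rule of this kind could look like in TypeScript. Only the auth.* namespace comes from the finding above. The shouldAudit function name, the prefix list, and the example procedure paths are hypothetical and are not taken from 03-api-contract.md.

```typescript
// Hypothetical sketch: procedures under skipped namespaces never produce audit rows.
const AUDIT_SKIP_PREFIXES: readonly string[] = ["auth."]; // "auth.*" per the finding above

function shouldAudit(procedurePath: string): boolean {
  return !AUDIT_SKIP_PREFIXES.some((prefix) => procedurePath.startsWith(prefix));
}

// Illustrative procedure paths only; real router names may differ.
const audited = ["auth.signIn", "students.update", "auth.resetPassword"].filter(shouldAudit);
```

Matching on the prefix with its trailing dot keeps an unrelated namespace such as a hypothetical authors.list inside the audit trail.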

Phase 4 stress-test

Single-archetype scoring would have falsely chosen C1 Supabase

If the framework had used the Ship-fast archetype only (C1: 446, C2: 440), C1 would have looked correct. Re-weighting under Portable (C1: 440, C2: 456) and Conservative (C1: 450, C2: 452) revealed C2 is more robust. Ship-fast, a 6-point gap, was the only weighting where C1 was ahead.

Decision survives changes in priority. If team weights ever shift toward portability or operational stability, the recommendation is unchanged.
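The comparison in this card can be reproduced from the quoted totals alone. The sketch below is illustrative TypeScript: the totals are the ones stated above, the Base archetype is left out because its totals are not quoted in this card, and the function and type names are hypothetical.

```typescript
type Candidate = "C1" | "C2";
type Totals = Record<Candidate, number>;

// Totals exactly as quoted in this card (Base is not quoted here, so it is omitted).
const quotedTotals: Record<string, Totals> = {
  "Ship-fast": { C1: 446, C2: 440 },
  Portable: { C1: 440, C2: 456 },
  Conservative: { C1: 450, C2: 452 },
};

interface ArchetypeResult {
  archetype: string;
  winner: Candidate;
  margin: number;
}

// For each archetype, report which candidate leads and by how many points.
// Assumption: a tie would be reported as C2; no tie occurs in the quoted data.
function winners(totals: Record<string, Totals>): ArchetypeResult[] {
  return Object.keys(totals).map((archetype) => {
    const t = totals[archetype];
    const winner: Candidate = t.C1 > t.C2 ? "C1" : "C2";
    return { archetype, winner, margin: Math.abs(t.C1 - t.C2) };
  });
}
```

Calling winners(quotedTotals) shows C2 ahead under two of the three weightings quoted here, with Ship-fast as the only one that favors C1.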

Cloudflare verification spike

Better Auth + Hyperdrive timeout (better-auth #2274)

Mid-Phase 4, a ½-day spike reproduced a known-but-undocumented Better Auth + Cloudflare Hyperdrive timeout. Without the spike, the C5 Cloudflare-native option would have looked stronger than it actually is. The risk was reduced from "1-week build-out" to "½-day reproduction" — and the C5 score was adjusted accordingly.

C5 stays a credible third instead of being inflated by an unverified claim.

What this shows

The verification passes were not ceremonial. Each one caught real issues.
The biggest catches came from independent agents with no investment in the prior design.
Vendor documentation drifts. Distillation drifts. Re-verifying load-bearing claims against the source is not optional.

Why this matters

If you trust the recommendation, this is the work that earned that trust.
If you don't, the question to ask is "what would a verification pass have caught?" — every claim in the doc above has already passed at least one.
Future architecture changes (new vendors, new platforms) should follow the same pattern. The process is reusable.

A4 · Every document in the research pile · grouped by phase

Root · plan, requirements, evaluation, recommendation, roadmap · 9 docs · 2,961 lines

PLAN.md · 666 lines
00-plan.md · 473 lines
20-requirements.md · 450 lines
50-evaluation.md · 315 lines
80-open-questions.md · 292 lines
70-roadmap.md · 257 lines
README.md · 199 lines
40-candidates.md · 191 lines
60-recommendation.md · 118 lines

10-discovery · 5 stack evaluations + 7 cross-cutting streams · 12 docs · 5,855 lines

03-integration-inventory.md · 794 lines
04-migration-scope.md · 712 lines
00-tanstack-idioms.md · 660 lines
01-screen-data-demand.md · 599 lines
02-business-logic-catalog.md · 598 lines
stack-firebase-hybrid.md · 456 lines
stack-supabase.md · 415 lines
stack-neon-alacarte.md · 411 lines
05-compliance.md · 394 lines
stack-aws-native.md · 374 lines
stack-cloudflare.md · 305 lines
stack-cloudflare-spike.md · 137 lines

30-design · schema, data-flow, API contract, verifications · 8 docs · 7,036 lines

01-schema.md · 2,772 lines
03-api-contract.md · 1,168 lines
02-data-flow.md · 679 lines
08-consolidation-analysis.md · 645 lines
07-vendor-surface-scan.md · 580 lines
05-reconciliation.md · 455 lines
06-doc-verification.md · 384 lines
00-cross-check.md · 353 lines

decisions · ADRs as open questions are closed · 1 doc · 181 lines · expanding

001-replace-fly-io.md · 181 lines
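The four group subtotals above can be checked against the headline figures with a few lines of arithmetic. This is a small illustrative TypeScript sketch; the DocGroup type and the totals function are names chosen for the example, and the numbers are the group subtotals listed in this section.

```typescript
interface DocGroup {
  name: string;
  docs: number;
  lines: number;
}

// Group subtotals as listed in A4.
const groups: DocGroup[] = [
  { name: "Root", docs: 9, lines: 2961 },
  { name: "10-discovery", docs: 12, lines: 5855 },
  { name: "30-design", docs: 8, lines: 7036 },
  { name: "decisions", docs: 1, lines: 181 },
];

// Sum document and line counts across all groups.
function totals(gs: DocGroup[]): { docs: number; lines: number } {
  return gs.reduce(
    (acc, g) => ({ docs: acc.docs + g.docs, lines: acc.lines + g.lines }),
    { docs: 0, lines: 0 },
  );
}

const summary = totals(groups); // → { docs: 30, lines: 16033 }
```

The result matches the 30 documents and 16,033 lines quoted at the top of the appendix.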

What this shows

Every artifact above exists in the repo at docs/production-architecture-research/ — open it and read.
The schema doc alone is 2,772 lines. The integration inventory is 794. These are not summary documents — they are specifications.
The largest documents (schema, API contract, integration inventory) are exactly the ones engineers need at build time.

Why this matters

If a future contractor disagrees with any specific decision, the relevant doc is named and accessible.
If a vendor pricing or capability changes, the affected document can be re-evaluated in isolation — not the whole stack.
The doc pile is meant to outlive any specific engineer or contractor. The architecture is documented; the team is replaceable.

A5 · How this compares to the typical "let's pick a stack" decision

Typical stack decision

1 person decides — usually the lead, often based on what they used most recently.
1 candidate evaluated — the one that's already familiar; alternatives are dismissed verbally.
"It works on my machine" — load-bearing claims are not verified against current vendor docs.
Schema designed in the editor — discovered at build time as features are written.
Integrations assumed — Stripe will work, Vimeo will work; no inventory of edge cases.
Cost not modeled — "it's probably fine" until the first invoice surprises someone.
No re-evaluation trigger — once picked, the stack is locked even when assumptions break.
No paper trail — when something fails 18 months in, no one remembers why it was chosen.

What this research did

5 candidates evaluated independently, each with its own discovery doc, before any verdict.
9 scoring criteria × 4 weightings — verdict had to survive 36 different ways of looking at it.
3 verification passes caught real issues (Resend send path, tRPC syntax, missed Neon Auth, 24 design contradictions, a real bug).
~80 tables, 118 indexes specified before any code is written. Every cross-domain join surface verified.
794-line integration inventory — every external service mapped, including current production behaviors.
Costs modeled per candidate at projected scale, with re-evaluation triggers documented.
Re-evaluation triggers explicit — Neon >2× pricing, Better Auth dormant, audio storage past 50GB/yr, etc.
Full paper trail — 30 docs, 16,033 lines. Future engineers can audit, defend, or revise without starting over.
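The re-evaluation triggers in the list above lend themselves to a mechanical check. The sketch below is illustrative TypeScript under stated assumptions: the 2× Neon pricing and 50 GB/yr audio thresholds are the ones quoted above, while the StackSignals shape, the trigger labels, and the dormantAfterDays parameter are hypothetical, since this summary does not define "dormant" numerically.

```typescript
// Hypothetical input shape; field names are assumptions for illustration.
interface StackSignals {
  neonPriceRatio: number;             // current Neon price divided by price at decision time
  audioGbPerYear: number;             // audio storage growth per year, in GB
  daysSinceBetterAuthRelease: number; // days since the last Better Auth release
}

// Returns the labels of every trigger that has fired.
// dormantAfterDays is an assumption the caller must supply; the summary gives no number.
function firedTriggers(s: StackSignals, dormantAfterDays: number): string[] {
  const fired: string[] = [];
  if (s.neonPriceRatio > 2) fired.push("neon-pricing");       // "Neon >2× pricing"
  if (s.audioGbPerYear > 50) fired.push("audio-storage");     // "audio storage past 50GB/yr"
  if (s.daysSinceBetterAuthRelease > dormantAfterDays) fired.push("better-auth-dormant");
  return fired;
}
```

An empty result means no trigger has fired. Any non-empty result names the assumption to revisit, which in turn points at the document to re-evaluate.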

What this shows

The right column is what an architecture decision can look like with current AI-assisted research tooling.
Most decisions in industry still look like the left column. That's the baseline this work is contrasting against.
This level of rigor would have been impractical 2 years ago — it's appropriate now and we should use it.

Why this matters

"We've always done it this way" is a process statement, not an architectural defense.
If a contractor's stack pick can't survive 36 weighting combinations and 3 verification passes, that's a sign — not a flex.
The recommendation in this summary is not a preference. It is the option that survived the most challenge.

Open in detail

Why this stack: 60-recommendation.md
How to build it: 70-roadmap.md
What needs answers: 80-open-questions.md
The complete picture: README.md

QuranFlow production architecture research · May 2026. Compiled from 30 documents across 6 research phases + 3 verification passes. Engineering can start building from 60-recommendation.md + 70-roadmap.md + 30-design/.