ParaEval
Architecture
ParaEval runs as two services deployed via Docker Compose: a Next.js frontend and a FastAPI Python backend. The domain logic is a pure function implemented identically in TypeScript and Python. A shared contract layer (Zod ↔ Pydantic) ensures the language boundary never becomes a data-contract boundary.
Phase 1 is complete: curated demo cases, a live decision engine, a live optional extraction endpoint, regression coverage, Docker Compose deployment, and CI/CD building both images. Phase 2 adds persistent case storage, richer policy logic, and a fuller multi-agent enrichment pipeline.
Full Stack
| Layer | Technology | Notes |
|---|---|---|
| Frontend | Next.js 15 App Router | Server-rendered public pages plus a verdict-first comparison workbench. Next route handlers proxy the public benchmark/session APIs so the browser never talks to the backend service directly. |
| Public API | FastAPI public benchmark/session routes | Read-only benchmark packet routes plus reproducible session create/evaluate/export/import flows. Public routes are separate from maintainer mutation routes. |
| Maintainer API | FastAPI maintainer routes with bearer-token auth | Refresh/build jobs, packet publication, and rollback live on a separate surface guarded by PARAEVAL_BACKEND_SECRET so benchmark mutation cannot leak into public read flows. |
| Packet registry | Append-only file store with pointer swaps | Benchmark packets, sessions, and jobs are stored as append-only versioned records. Small current-pointer files are swapped atomically so publication and rollback do not rewrite packet history in place. |
| Evaluation authority | FastAPI + Pydantic v2 + Uvicorn | The backend owns benchmark packet selection and evaluation runs. The Next app may render or cache results, but it does not silently substitute a second production decision engine. |
| Contract layer | Zod 4 + Pydantic v2 | TypeScript and Python both serialize the comparison-lab entities in camelCase so session/export/import payloads, public packets, and runs stay structurally aligned across the boundary. |
| Auth | Next auth + backend bearer token | The site can still gate maintainers at the Next layer, but the backend independently enforces bearer-token auth for mutation routes. Public benchmark reads remain unauthenticated. |
| Local fixtures | Explicit dev/test mode only | Synthetic fixture packets still exist for tests and local development, but production reads are no longer meant to fall back silently to static demo data. |
Decision Algorithm (Phase 1 — Pinned)
The algorithm is deliberately simple and deliberately pinned. Every evidence item is scored on a 0–1 scale. The policy trigger is evaluated on the average. No weighting, no mandatory source requirements, no peril-specific rules; those arrive in Phase 2. Pinning the algorithm keeps regression tests stable: a historical case that returned "met" will always return "met" until a version bump explicitly changes the model.
Phase 1 properties:
- Reproducible: the same input always returns the same output
- Auditable: the logic fits in 10 lines, readable by non-engineers
- Basis-risk aware: detects and surfaces yes/no conflicts explicitly
- Regression-safe: golden cases catch any unintended change
Deferred to Phase 2:
- Mandatory index condition: at least one gauge or API source required
- Peril-specific evidence weights (satellite carries more weight for flood)
- Missing-evidence penalty when expected sources are absent
- Source quality tiers: authoritative vs. indicative
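The pinned logic above can be sketched in a few lines. This is an illustrative reconstruction, not the production engine; the names (`Evidence`, `evaluate`) and the exact output shape are assumptions:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass(frozen=True)
class Evidence:
    source: str
    score: float   # 0-1 confidence that the trigger condition held
    supports: bool # does this item say the event happened?

def evaluate(items: list[Evidence], trigger: float) -> dict:
    """Phase 1 pinned logic: plain average compared against the policy trigger."""
    avg = mean(e.score for e in items)
    # Basis-risk awareness: surface an explicit yes/no conflict
    # rather than letting the average hide it.
    conflict = len({e.supports for e in items}) > 1
    return {
        "verdict": "met" if avg >= trigger else "not_met",
        "averageScore": round(avg, 4),
        "conflict": conflict,
    }

result = evaluate(
    [Evidence("gauge", 0.9, True), Evidence("satellite", 0.4, False)],
    trigger=0.6,
)
```

Because there is no weighting and no per-source logic, the whole decision is a pure function of its inputs, which is what makes the golden-case regression suite cheap to maintain.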
Python Backend
The backend is a FastAPI service deployed as a separate Docker container. It holds a Python port of the decision engine and serves the same demo data as the TypeScript layer. Cross-runtime parity tests run in CI to verify both engines produce identical outputs on identical inputs. The extraction pipeline (POST /extract) already supports LLM-backed evidence classification when a DeepSeek-compatible API key is configured; it remains deliberately stateless until persistence lands in Phase 2.
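The parity suite pins golden cases so both engines must agree with a recorded expectation. A minimal sketch of the idea, with a stand-in `decide` function and hypothetical case data (the real suite replays the same cases through both the TypeScript and Python engines):

```python
# Hypothetical golden cases: each one pins the exact output the
# decision engine must keep returning until a version bump.
GOLDEN_CASES = [
    {"input": {"scores": [0.8, 0.7], "trigger": 0.6}, "expected": "met"},
    {"input": {"scores": [0.2, 0.3], "trigger": 0.6}, "expected": "not_met"},
]

def decide(scores, trigger):
    """Stand-in for the Python decision engine under test."""
    avg = sum(scores) / len(scores)
    return "met" if avg >= trigger else "not_met"

def test_golden_parity():
    # In CI the same cases run through the TypeScript engine as well;
    # both runtimes must match the pinned expectation.
    for case in GOLDEN_CASES:
        assert decide(**case["input"]) == case["expected"], case

test_golden_parity()
```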
| Method | Path | Notes |
|---|---|---|
| GET | /health | Healthcheck endpoint. Returns {"status":"ok"}. Used by Docker Compose depends_on condition. |
| GET | /public/benchmark/config | Returns current benchmark packets, policy templates, and model dossiers for the workbench. |
| POST | /public/sessions | Creates a reproducible session pinned to an exact benchmark snapshot version. |
| POST | /public/sessions/{id}/evaluate | Appends a new run under the selected policy/model and returns delta-ready session state. |
| GET | /public/sessions/{id}/export | Exports the current session payload without changing its pinned snapshot version. |
| POST | /maintainer/jobs/refresh | Creates a maintained packet-build job against a fixed source adapter. Mutation route; bearer-token protected. |
| POST | /maintainer/packets/{id}/publish | Atomically promotes a specific packet version by swapping the current pointer file. |
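The publish route's pointer swap can be sketched with `os.replace`, which is atomic within a single filesystem: readers see either the old pointer or the new one, never a half-written file. The registry layout (`<registry>/<packet_id>/current.json`) and function names here are assumptions for illustration:

```python
import json
import os
import tempfile

def publish_packet(registry_dir: str, packet_id: str, version: str) -> None:
    """Promote a packet version by atomically swapping its current-pointer file.

    Packet history is append-only; only the small pointer file ever changes.
    """
    pointer_path = os.path.join(registry_dir, packet_id, "current.json")
    os.makedirs(os.path.dirname(pointer_path), exist_ok=True)
    # Write the new pointer to a temp file in the same directory, then
    # replace the old pointer in one atomic rename.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(pointer_path))
    with os.fdopen(fd, "w") as f:
        json.dump({"packetId": packet_id, "version": version}, f)
    os.replace(tmp_path, pointer_path)

def current_version(registry_dir: str, packet_id: str) -> str:
    with open(os.path.join(registry_dir, packet_id, "current.json")) as f:
        return json.load(f)["version"]
```

Rollback falls out for free: promoting an older version is just another pointer swap, since no packet record is ever rewritten in place.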
Contract Layer: Zod ↔ Pydantic
The contract layer is the guarantee that the TypeScript frontend and the Python backend speak the same language. Zod schemas define the canonical shape in TypeScript. Pydantic models mirror them exactly. Both sides use camelCase for JSON serialization (TypeScript naturally; Python via alias_generator=to_camel). Any deviation is caught by cross-runtime parity tests that assert identical decision outputs for identical inputs in both engines.
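The camelCase convention itself is simple enough to sketch without Pydantic. The real backend uses Pydantic v2's `alias_generator=to_camel` rather than hand-rolled conversion; this stdlib-only sketch just shows the wire shape both sides agree on, with illustrative field names:

```python
# Stdlib sketch of the camelCase convention the contract layer enforces.
# In production this is Pydantic's alias_generator=to_camel on the Python
# side; Zod field names are camelCase natively on the TypeScript side.
def to_camel(snake: str) -> str:
    head, *rest = snake.split("_")
    return head + "".join(word.capitalize() for word in rest)

def serialize(entity: dict) -> dict:
    """Re-key a snake_case Python dict into the camelCase wire shape."""
    return {to_camel(k): v for k, v in entity.items()}

wire = serialize({"snapshot_version": "v3", "policy_template_id": "flood-basic"})
# wire == {"snapshotVersion": "v3", "policyTemplateId": "flood-basic"}
```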
Deployment: Docker Compose + CI/CD
Both containers are built in CI and pushed to GitHub Container Registry. The production server pulls images via docker compose pull and restarts; no source code lives on the server. The backend is not exposed to the host in production; it is reachable only on the internal Docker network via http://backend:8000. The Next.js container proxies to it from route handlers; the browser never hits the Python service directly.
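A Compose fragment illustrating that topology might look like the following. The service names and the `backend:8000` internal address match the description above; the image names and healthcheck parameters are placeholders:

```yaml
# Hypothetical fragment; image names are placeholders.
services:
  frontend:
    image: ghcr.io/example/paraeval-frontend:latest
    ports:
      - "3000:3000"                        # only the Next.js app is host-exposed
    environment:
      BACKEND_URL: http://backend:8000     # internal Docker DNS name
    depends_on:
      backend:
        condition: service_healthy         # waits on the /health check below
  backend:
    image: ghcr.io/example/paraeval-backend:latest
    # no ports: reachable only on the internal Docker network
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS http://localhost:8000/health || exit 1"]
      interval: 10s
      retries: 5
```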
Repo Shape
Database Schema (Phase 2+)
Table names and column shapes are defined in the schema design. Not active in Phase 1. SQLite via node:sqlite for development; the schema is Postgres-compatible for production scale.
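Keeping the schema Postgres-compatible mostly means restricting DDL to types and constraints both engines accept. The table and column names below are hypothetical placeholders, not the actual schema design; the sketch uses Python's `sqlite3` purely to show that such DDL runs unmodified on SQLite:

```python
import sqlite3

# Hypothetical DDL; the real table names and columns live in the Phase 2
# schema design. The point: use only types and constraints that SQLite and
# Postgres both accept, so the dev schema ports to production without rewrites.
DDL = """
CREATE TABLE cases (
    id TEXT PRIMARY KEY,
    peril TEXT NOT NULL,
    created_at TEXT NOT NULL
);
CREATE TABLE evidence (
    id TEXT PRIMARY KEY,
    case_id TEXT NOT NULL REFERENCES cases(id),
    source TEXT NOT NULL,
    score REAL NOT NULL CHECK (score >= 0 AND score <= 1)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.execute("INSERT INTO cases VALUES ('c1', 'flood', '2025-01-01')")
conn.execute("INSERT INTO evidence VALUES ('e1', 'c1', 'gauge', 0.9)")
```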
Phase 2: Extraction Pipeline
The extraction pipeline is the missing piece. Phase 1 uses static demo data to prove the evaluation and decision surfaces work. Phase 2 replaces static data with a live extraction pipeline that queries real sources: gauge APIs (USGS, JMA, Thai Meteorological Department), satellite-derived indices (MODIS flood extent, NDVI anomaly, GPM precipitation), and document ingestion for field reports and loss adjuster notes.
Planned Python stack:
- LangGraph — orchestration graph for multi-step extraction runs with retries and partial results
- vLLM — local inference for document extraction (loss adjuster reports, policy schedule parsing)
- CLIMADA — catastrophe model context: expected loss at location for peril, used to weight evidence and calibrate basis-risk notes
- MLflow — extraction run tracking: which sources were queried, what was returned, how long each step took
- pytest parity tests — cross-runtime verification that the Python decision engine produces identical outputs to the TypeScript engine for all golden cases
The extraction pipeline adds operational complexity — long-running jobs, partial failures, external API rate limits, model inference latency. That complexity belongs in a separate service. The split from the Next.js monolith happens when extraction demand is real, not as a premature architectural decision.
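The retry and partial-failure behavior described above can be sketched in plain Python. In the real service this would be a LangGraph graph rather than hand-rolled loops, and the source names here are illustrative:

```python
# Stdlib sketch of retries with partial results: a dead source yields a
# failure record instead of crashing the whole extraction run.
def fetch_with_retries(name, fetch, attempts=3):
    """Try a flaky source a few times; report failure rather than raising."""
    last_error = None
    for _ in range(attempts):
        try:
            return {"source": name, "ok": True, "data": fetch()}
        except Exception as exc:
            last_error = str(exc)
    return {"source": name, "ok": False, "error": last_error}

def run_extraction(sources):
    """Query every source; partial failures are surfaced, not fatal."""
    return [fetch_with_retries(name, fn) for name, fn in sources.items()]

def usgs_gauge():
    return {"stage_m": 4.2}

def modis_flood():
    raise TimeoutError("no tiles available")

results = run_extraction({"usgs_gauge": usgs_gauge, "modis_flood": modis_flood})
```

This is exactly the shape of complexity that justifies a separate service: each source gets its own retry budget and failure record, and a downstream step can decide whether the surviving evidence is enough to evaluate.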