Chinese Restaurant Process (CRP)
A Bayesian non-parametric approach to discovering latent catastrophe regimes for parametric trigger calibration. Unlike traditional models that assume a single distribution, CRP automatically identifies distinct risk regimes (e.g., hurricanes vs. floods vs. convective storms) without prespecifying how many exist.
Dataset Management
The CRP pipeline lets you toggle between curated real datasets (IBTrACS, ERA5, EM-DAT) and the synthetic data generator. Use the dataset cards and metadata inspector below to explore coverage, availability, and why each source matters for catastrophe regimes before training.
Dataset Explorer
Double-click any dataset card to open the backend-cached metadata and exploratory analysis. Use the checkboxes to compose a blend of track, rainfall, and loss data before starting training.
Synthetic Data Companion
Synthetic data generation creates controlled regimes for debugging. The same generator powers the dataset that feeds the Train CRP Model card below, so you can preview and label synthetic samples before training, or use them as a fallback when real archives are incomplete.
IBTrACS
Double-click for metadata · Not downloaded
Track & intensity metadata for each storm.
ERA5
Double-click for metadata · Requires CDS API key
Reanalysis rainfall and pressure to capture flooding history.
EM-DAT
Double-click for metadata · Manual download required
Loss records and damage tallies for validation.
Synthetic
Double-click for EDA · Always Available
Generated corpus with known ground-truth regimes for testing.
Gumbel Copula →
Generate Synthetic Catastrophe Data
Create synthetic typhoon/weather data with known ground truth regimes for testing and validation. Features include realistic insurance-relevant parameters.
Interactive CRP Training & Inference
Train a CRP model on synthetic insurance data and perform inference to discover latent catastrophe regimes. This demonstrates the model's ability to automatically discover the appropriate number of regimes from data.
Train CRP Model
Train on synthetic data with known ground-truth regimes for testing.
Note: EM-DAT provides economic loss labels but lacks the 6 peril features needed for training. It is automatically included when selected in Dataset Management above.
Data Source
- Iterations: 600 (longer chains stabilize regime count)
- Alpha: 2.0 (balanced new regime growth)
- Gamma: 1.8 (keeps base measure anchored)
- Burn-in: 150 (discard early samples)
- Thinning: 6 (reduces autocorrelation)
These defaults are drawn from the best-performing CRP/HDP calibration runs—feel free to tweak further for faster iteration.
CRP/HDP Parameters
Alpha controls cluster creation tendency. Higher α = more regimes. Gamma is the HDP base measure concentration.
CRP Pipeline Showcase
Run the full parametric insurance pipeline on synthetic typhoon data: CRP regime discovery, trigger calibration against baselines, and alpha-drift climate signal analysis.
Why This Matters for Parametric Insurance
The Problem with Single-Distribution Models
Traditional catastrophe models (GEV, Weibull, Gaussian copulas) assume all events come from one statistical distribution. But Hurricane Katrina (2005) and a localized flash flood are fundamentally different phenomena with different feature distributions. Fitting one model to both produces poor tail estimates and unreliable parametric triggers.
Regime-Specific Triggers
CRP discovers latent regimes automatically. Each regime gets its own trigger threshold θ* optimized for that catastrophe type. A hurricane regime might use max wind speed with threshold 45 m/s, while a flood regime uses precipitation with threshold 120mm/24h. This improves payout accuracy and reduces basis risk.
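As a minimal sketch of how regime-specific triggers are applied (the regime names and threshold values below are illustrative, not calibrated outputs of the pipeline):

```python
# Hypothetical per-regime triggers: (feature, threshold) pairs
REGIME_TRIGGERS = {
    "hurricane": ("max_wind_ms", 45.0),   # payout if peak wind >= 45 m/s
    "flood": ("precip_24h_mm", 120.0),    # payout if 24h rainfall >= 120 mm
}

def trigger_fires(event: dict, regime: str) -> bool:
    """Evaluate the regime-specific parametric trigger for one event."""
    feature, threshold = REGIME_TRIGGERS[regime]
    return event[feature] >= threshold

event = {"max_wind_ms": 52.0, "precip_24h_mm": 30.0}
print(trigger_fires(event, "hurricane"))  # True: 52 m/s exceeds the 45 m/s threshold
print(trigger_fires(event, "flood"))      # False: 30 mm is below the 120 mm threshold
```

Routing each event through its assigned regime's trigger, rather than one portfolio-wide threshold, is what reduces basis risk.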
Catastrophe Features Modeled
- max_wind_ms — peak wind speed (hurricanes)
- precip_24h_mm — 24h rainfall (floods)
- surge_m — storm surge (coastal events)
- min_pressure_hpa — intensity proxy
- translation_kmh — storm movement speed
- track_deviation_km — path uncertainty
Insurance Output Statistics
- E[K] — expected number of catastrophe types
- E[α] — concentration (regime separation)
- Loss rate per regime (claims frequency)
- Feature means per regime (trigger calibration)
- Event-to-regime assignments (classification)
Key Insurance Advantage
Unlike k-means or GMM where you must choose K regimes upfront, CRP discovers the appropriate number from historical loss data. The Bayesian evidence (marginal likelihood) tells you whether you have 2, 3, or 5 distinct catastrophe types in your portfolio — no guessing required.
Mathematical Foundation
The CRP: Insurance Analogy
Imagine a catastrophe response center with infinite specialist teams. Each new event (claim) either joins an existing team handling similar events (probability proportional to team size), or forms a new specialized team (probability controlled by concentration parameter α). The normalised probabilities follow the Pólya-urn scheme:

P(event i joins regime k) = nₖ / (i − 1 + α),  P(event i opens a new regime) = α / (i − 1 + α)

where nₖ is the number of earlier events already assigned to regime k.
The expected number of occupied regimes after n events is E[Kₙ] ≈ α log(1 + n/α), growing logarithmically and reflecting the intuition that rare new peril types become increasingly unlikely as the catalogue grows. A higher α encourages more diverse regimes.
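The logarithmic growth of E[Kₙ] is easy to verify by simulating the seating process directly. A standalone sketch (not the app's training code):

```python
import math
import random

def simulate_crp(n: int, alpha: float, rng: random.Random) -> int:
    """Sequentially seat n events by the CRP; return the number of occupied regimes."""
    counts = []  # counts[k] = number of events assigned to regime k
    for i in range(n):
        # With i events seated: P(new regime) = alpha / (i + alpha),
        # P(join regime k) = counts[k] / (i + alpha).
        u = rng.random() * (i + alpha)
        if u < alpha:
            counts.append(1)          # open a new regime
        else:
            u -= alpha
            for k in range(len(counts)):
                if u < counts[k]:
                    counts[k] += 1    # join existing regime k
                    break
                u -= counts[k]
    return len(counts)

rng = random.Random(0)
alpha, n = 2.0, 1000
mean_k = sum(simulate_crp(n, alpha, rng) for _ in range(200)) / 200
expected = alpha * math.log(1 + n / alpha)  # ≈ 12.4 for alpha=2, n=1000
print(round(mean_k, 1), round(expected, 1))
```

Averaged over 200 runs, the simulated regime count lands close to the α log(1 + n/α) approximation.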
Hierarchical Dirichlet Process — Full Generative Story
The HDP adds a global base measure G₀ so that all regimes can share statistical strength — critical when rare coastal surge events have only a handful of historical observations. The two-level Chinese Restaurant Franchise representation:
- Events sit at tables (local regimes) via CRP(α).
- Tables link to global dishes (shared parameter types) via CRP(γ).
The hyperconcentration γ controls how many distinct parameter types can appear globally. A small γ forces rare-peril regimes to borrow heavily from common-peril regimes; a large γ lets each regime be fully independent. Both α and γ are inferred via conjugate Gamma hyperpriors during sampling.
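The two-level franchise can be sketched as a forward sampler. This is illustrative only (the production model infers assignments from data rather than sampling forward):

```python
import random

def crf_sample(n_events, alpha, gamma, rng):
    """Two-level Chinese Restaurant Franchise for one restaurant (sketch).
    Events sit at tables via CRP(alpha); tables order global dishes via CRP(gamma)."""
    table_counts = []   # events seated at each local table
    table_dish = []     # dish (global regime type) served at each table
    dish_counts = []    # number of tables serving each dish
    for i in range(n_events):
        u = rng.random() * (i + alpha)
        if u < alpha:                          # open a new table...
            v = rng.random() * (len(table_dish) + gamma)
            if v < gamma:                      # ...serving a brand-new dish
                dish_counts.append(0)
                dish = len(dish_counts) - 1
            else:                              # ...or an existing dish, chosen
                v -= gamma                     # proportionally to tables serving it
                dish = 0
                for d, c in enumerate(dish_counts):
                    if v < c:
                        dish = d
                        break
                    v -= c
            table_counts.append(1)
            table_dish.append(dish)
            dish_counts[dish] += 1
        else:                                  # join an existing table
            u -= alpha
            for t, c in enumerate(table_counts):
                if u < c:
                    table_counts[t] += 1
                    break
                u -= c
    return len(table_counts), len(dish_counts)

tables, dishes = crf_sample(500, alpha=2.0, gamma=1.8, rng=random.Random(1))
print(tables, dishes)  # dishes <= tables: dishes can only be opened via new tables
```

A small γ yields few dishes shared by many tables (heavy borrowing); a large γ approaches one dish per table (independent regimes).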
Collapsed Gibbs Sampling & the Predictive Likelihood
Integrating out regime parameters gives the collapsed update

p(zᵢ = k | z₋ᵢ, x) ∝ nₖ⁻ⁱ · p(xᵢ | x̄ₖ⁻ⁱ)  for an existing regime k,  and ∝ α · p(xᵢ)  for a new regime.

The key computational piece is p(xᵢ | x̄ₖ⁻ⁱ) — the posterior predictive under regime k's model after excluding event i:
Conjugate case — Normal-Inverse-Gamma prior (multivariate features):

p(xᵢ | x̄ₖ⁻ⁱ) = t_ν(xᵢ; μ, Σ)

i.e. a Student-t density whose parameters come from the conjugate update.
The degrees of freedom ν, location μ, and scale Σ update analytically from the sufficient statistics of the nₖ⁻ⁱ excluded events — making this an O(d²) rank-1 update per Gibbs step.
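As a sketch of the conjugate update, here is the univariate Normal-Inverse-Gamma posterior predictive (a Student-t); the hyperparameter defaults mu0, kappa0, a0, b0 are illustrative, and the multivariate case updates Σ analogously:

```python
import math

def nig_posterior_predictive(x_new, data, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0):
    """Log posterior-predictive density p(x_new | data) under a univariate
    Normal-Inverse-Gamma prior: a Student-t with updated nu, location, scale."""
    n = len(data)
    xbar = sum(data) / n if n else 0.0
    ss = sum((x - xbar) ** 2 for x in data)
    # Standard conjugate updates of the NIG hyperparameters
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    a_n = a0 + n / 2
    b_n = b0 + 0.5 * ss + (kappa0 * n * (xbar - mu0) ** 2) / (2 * kappa_n)
    # Predictive: Student-t, nu = 2*a_n, scale^2 = b_n*(kappa_n+1)/(a_n*kappa_n)
    nu = 2 * a_n
    scale2 = b_n * (kappa_n + 1) / (a_n * kappa_n)
    z2 = (x_new - mu_n) ** 2 / scale2
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi * scale2)
            - (nu + 1) / 2 * math.log(1 + z2 / nu))

winds = [42.0, 47.0, 51.0, 39.0]  # toy wind observations for one regime
print(nig_posterior_predictive(44.0, winds))   # near the regime mean
print(nig_posterior_predictive(200.0, winds))  # deep in the tail: much lower
```

Because only the sufficient statistics (n, x̄, Σ(x − x̄)²) enter the update, adding or removing one event is a cheap incremental change, which is the rank-1 structure exploited per Gibbs step.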
Non-conjugate case — GEV prior for wind extremes (Metropolis-within-Gibbs):

p(xᵢ | x̄ₖ⁻ⁱ) ≈ (1/M) Σₘ p(xᵢ | θₖ⁽ᵐ⁾),  θₖ⁽ᵐ⁾ ∼ p(θₖ | x̄ₖ⁻ⁱ)
Here M Monte Carlo draws are taken from the regime posterior via a Metropolis step embedded inside the outer Gibbs loop. The GEV shape ξ controls tail heaviness; a vague prior on ξ prevents pathological heavy tails when a regime has fewer than ~20 events.
Posterior on Number of Regimes & Label Switching
After burn-in, the empirical distribution of unique cluster labels across thinned MCMC samples yields P(K | x). The posterior mode is the reported regime count; the spread quantifies genuine uncertainty. Formally:

P(K = k | x) ≈ (1/S) Σₛ 1[K⁽ˢ⁾ = k]

where K⁽ˢ⁾ is the number of occupied regimes in thinned sample s of S.
Label switching: Because the likelihood is invariant to permuting regime labels, naive averaging of posterior samples is meaningless. We resolve this by using invariant summaries (regime feature means, sizes) rather than label-specific quantities, or by post-processing with an optimal-transport alignment across samples.
Parametric Trigger Calibration
Per-Regime Threshold Optimization
Once regimes are discovered, we calibrate parametric triggers separately for each. The optimal threshold θ* balances sensitivity (capturing true catastrophes) against false positives (paying for non-events). We use train/test cross-validation to prevent overfitting.
Posterior Predictive Payout Objective
Replacing the point-estimate p(x) with the full posterior predictive averaged over MCMC regime assignments naturally propagates parameter uncertainty into the optimal threshold, e.g. for a squared payout-error objective:

θ*ₖ = argmin_θ (1/S) Σₛ 𝔼_{(x, ℓ) ∼ p(· | θₖ⁽ˢ⁾)}[ (min(L_max · 1{x ≥ θ}, C) − ℓ)² ]

with ℓ the realized loss.
Where C is the coverage limit, L_max is the maximum payout, and the sum averages over S thinned MCMC samples of regime parameters θₖ — producing credible intervals on θ* that directly quantify uncertainty in the trigger.
Basis Risk Objective
For explicit basis risk control, minimize a weighted false-positive / false-negative trade-off under the regime-specific predictive, with λ reflecting insurer risk appetite:

θ*ₖ = argmin_θ [ λ · FPₖ(θ) + (1 − λ) · FNₖ(θ) ]

where FPₖ(θ) is the probability the trigger fires without a true loss event and FNₖ(θ) the probability a true event fails to trigger, both under regime k's predictive.
λ → 0 favours recall (capturing every true event), λ → 1 favours precision (no spurious payouts). The optimal θ* is found via grid search or Brent's method over the posterior predictive CDF.
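A minimal grid-search version of this objective on toy labelled wind data (the labels, values, and grid below are hypothetical, and Brent's method over a smooth predictive CDF would serve equally well):

```python
def optimal_threshold(feature, is_event, lam, grid):
    """Grid-search the basis-risk objective lam*FP_rate + (1-lam)*FN_rate
    over candidate thresholds; returns the best threshold and its objective."""
    n = len(feature)
    best_theta, best_obj = None, float("inf")
    for theta in grid:
        fp = sum(1 for x, e in zip(feature, is_event) if x >= theta and not e)
        fn = sum(1 for x, e in zip(feature, is_event) if x < theta and e)
        obj = lam * fp / n + (1 - lam) * fn / n
        if obj < best_obj:
            best_theta, best_obj = theta, obj
    return best_theta, best_obj

# Toy observations: max wind (m/s) and whether a true loss event occurred
wind = [20, 30, 38, 46, 50, 55, 60]
event = [False, False, False, True, True, True, True]
theta, obj = optimal_threshold(wind, event, lam=0.5, grid=range(10, 70, 5))
print(theta, obj)  # a threshold between 38 and 46 separates the toy data cleanly
```

Sweeping λ traces out the precision/recall frontier for the regime, which is how an insurer's risk appetite maps onto a concrete trigger value.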
Validation Against Baselines
The CRP-trigger model is validated against standard insurance baselines: GEV (Generalized Extreme Value), Weibull, Gaussian copula, and historical quantile methods. Performance metrics include:
- Basis Risk — (FP + FN) / N — lower is better
- Precision — TP / (TP + FP) — trigger accuracy
- Recall — TP / (TP + FN) — loss capture rate
- Boundary F1 — harmonic mean of precision and recall
- CRPS — Continuous Ranked Probability Score on regime forecasts — rewards calibrated uncertainty
- OLPD — Out-of-sample log predictive density — penalises overfit triggers
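The count-based metrics above take only a few lines (a sketch; CRPS and OLPD need the full predictive distribution and are omitted here):

```python
def trigger_metrics(fired, true_event):
    """Basis risk, precision, recall, and F1 for a parametric trigger,
    given per-event fired/true-event indicator lists."""
    tp = sum(f and t for f, t in zip(fired, true_event))
    fp = sum(f and not t for f, t in zip(fired, true_event))
    fn = sum(not f and t for f, t in zip(fired, true_event))
    n = len(fired)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"basis_risk": (fp + fn) / n, "precision": precision,
            "recall": recall, "f1": f1}

m = trigger_metrics([True, True, False, False, True],
                    [True, False, False, True, True])
print(m)  # one false positive and one false negative out of five events
```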
Advanced Topics
Non-Stationarity: Dynamic Alpha-Drift
Climate change makes α itself non-stationary: as high-precipitation regimes become more frequent, the effective concentration should drift upward. A minimal exponential-time model:

α(t) = α₀ · exp(β (t − t₀))

with drift rate β > 0 producing the upward trend.
A more principled approach is the Recurrent CRP (RCRP) or Dependent Dirichlet Process (DDP), where regime atom locations (GEV parameters) depend smoothly on climate covariates such as year, ENSO index, or sea-surface temperature anomaly:

θₖ(c) ∼ GP(m(c), κ(c, c′))

where c is the covariate vector.
This enables forward projection of regime probabilities under climate scenarios — directly useful for pricing parametric instruments beyond a 10-year horizon. Here GP denotes a Gaussian Process with covariance kernel κ.
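The exponential drift model is trivial to evaluate; the coefficients below (α₀ = 2.0, β = 0.02, t₀ = 2000) are illustrative values, not fitted estimates:

```python
import math

def alpha_drift(t, alpha0=2.0, beta=0.02, t0=2000):
    """Minimal exponential-time drift: alpha(t) = alpha0 * exp(beta * (t - t0))."""
    return alpha0 * math.exp(beta * (t - t0))

for year in (2000, 2010, 2025):
    print(year, round(alpha_drift(year), 2))
```

With a positive β, the effective concentration rises over time, so forward projections allow for more distinct regimes late in the pricing horizon.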
Convergence Diagnostics
With 600 iterations, burn-in of 150, and thinning of 6, the effective sample size (ESS) should be checked before trusting any posterior summary:
- Gelman-Rubin R̂ — run ≥ 2 chains; R̂ < 1.05 indicates mixing.
- ESS on α and K — ESS < 100 signals autocorrelation; increase thinning.
- Trace plots — visual check for stationarity of α̂(t) and K(t) post burn-in.
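A self-contained ESS check on a scalar trace such as α̂(t) (a simplified initial-positive-sequence truncation, a cruder variant of what packages like ArviZ compute):

```python
import random

def effective_sample_size(chain):
    """ESS = N / (1 + 2 * sum of positive-lag autocorrelations), truncating
    at the first non-positive autocorrelation (sketch estimator)."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    if var == 0:
        return float(n)
    tau = 1.0  # integrated autocorrelation time
    for lag in range(1, n // 2):
        acf = sum((chain[i] - mean) * (chain[i + lag] - mean)
                  for i in range(n - lag)) / (n * var)
        if acf <= 0:
            break
        tau += 2 * acf
    return n / tau

rng = random.Random(0)
iid = [rng.gauss(0, 1) for _ in range(500)]       # well-mixed chain
ar, x = [], 0.0
for _ in range(500):                              # sticky AR(1) chain
    x = 0.9 * x + rng.gauss(0, 1)
    ar.append(x)
print(round(effective_sample_size(iid)), round(effective_sample_size(ar)))
```

The autocorrelated chain's ESS collapses to a small fraction of its nominal length, which is exactly the situation where increasing thinning (or running longer) is warranted before trusting P(K | x).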
Scalability & Approximations
For catalogues exceeding ~10,000 events, collapsed Gibbs becomes the bottleneck. Recommended alternatives:
- Slice sampling (stick-breaking) — O(K) per step vs O(K²), no tuning required.
- Variational Bayes for HDP — closed-form ELBO updates; 10-100× faster, loses some tail accuracy.
- Mini-batch collapsed Gibbs — subsample events per sweep; suitable for streaming loss data.
Prior recommendations: α ~ Gamma(1, 1) (weakly informative), γ ~ Gamma(1, 1), GEV shape ξ ~ Gamma(2, 0.5) (prevents ξ > 1 pathologies), GEV scale σ ~ LogNormal(0, 1). Run a sensitivity analysis by doubling/halving the concentration priors and checking whether the K posterior shifts.