Chinese Restaurant Process (CRP)
A Bayesian non-parametric approach to discovering latent catastrophe regimes for parametric trigger calibration. Unlike traditional models that assume a single distribution, CRP automatically identifies distinct risk regimes (e.g., hurricanes vs. floods vs. convective storms) without prespecifying how many exist.
Dataset Management
The CRP pipeline lets you toggle between curated real datasets (IBTrACS, ERA5, EM-DAT) and the synthetic data generator. Use the dataset cards and metadata inspector below to explore coverage, availability, and why each source matters for catastrophe regimes before training.
Dataset Explorer
Double-click any dataset card to open the backend-cached metadata and exploratory analysis. Use the checkboxes to compose a blend of track, rainfall, and loss data before starting training.
Synthetic Data Companion
Synthetic data generation creates controlled regimes for debugging. The same generator powers the dataset that feeds the Train CRP Model card below, so you can preview and label synthetic samples before training, or use them as a fallback when real archives are incomplete.
IBTrACS
Double-click for metadata · Not downloaded
Track & intensity metadata for each storm.
ERA5
Double-click for metadata · Requires CDS API key
Reanalysis rainfall and pressure to capture flooding history.
EM-DAT
Double-click for metadata · Manual download required
Loss records and damage tallies for validation.
Synthetic
Double-click for EDA · Always Available
Generated corpus with known ground-truth regimes for testing.
Gumbel Copula →
Generate Synthetic Catastrophe Data
Create synthetic typhoon/weather data with known ground truth regimes for testing and validation. Features include realistic insurance-relevant parameters.
Interactive CRP Training & Inference
Train a CRP model on synthetic insurance data and perform inference to discover latent catastrophe regimes. This demonstrates the model's ability to automatically discover the appropriate number of regimes from data.
Train CRP Model
Train on synthetic data with known ground-truth regimes for testing.
Note: EM-DAT provides economic loss labels but lacks the 6 peril features needed for training. It is automatically included when selected in Dataset Management above.
Data Source
- Iterations: 600 (longer chains stabilize regime count)
- Alpha: 2.0 (balanced new regime growth)
- Gamma: 1.8 (keeps base measure anchored)
- Burn-in: 150 (discard early samples)
- Thinning: 6 (reduces autocorrelation)
These defaults are drawn from the best-performing CRP/HDP calibration runs—feel free to tweak further for faster iteration.
CRP/HDP Parameters
Alpha controls cluster creation tendency. Higher α = more regimes. Gamma is the HDP base measure concentration.
CRP Pipeline Showcase
Run the full parametric insurance pipeline on synthetic typhoon data: CRP regime discovery, trigger calibration against baselines, and alpha-drift climate signal analysis.
Why This Matters for Parametric Insurance
The Problem with Single-Distribution Models
Traditional catastrophe models (GEV, Weibull, Gaussian copulas) assume all events come from one statistical distribution. But Hurricane Katrina (2005) and a localized flash flood are fundamentally different phenomena with different feature distributions. Fitting one model to both produces poor tail estimates and unreliable parametric triggers.
Regime-Specific Triggers
CRP discovers latent regimes automatically. Each regime gets its own trigger threshold θ* optimized for that catastrophe type. A hurricane regime might use max wind speed with threshold 45 m/s, while a flood regime uses precipitation with threshold 120mm/24h. This improves payout accuracy and reduces basis risk.
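As a minimal sketch of how regime-specific triggers are applied (the regime names and threshold values below are illustrative, not calibrated outputs of the pipeline):

```python
# Hypothetical per-regime triggers: (feature, threshold) pairs
REGIME_TRIGGERS = {
    "hurricane": ("max_wind_ms", 45.0),   # payout if peak wind >= 45 m/s
    "flood": ("precip_24h_mm", 120.0),    # payout if 24h rainfall >= 120 mm
}

def trigger_fires(event: dict, regime: str) -> bool:
    """Evaluate the regime-specific parametric trigger for one event."""
    feature, threshold = REGIME_TRIGGERS[regime]
    return event[feature] >= threshold

event = {"max_wind_ms": 52.0, "precip_24h_mm": 30.0}
print(trigger_fires(event, "hurricane"))  # True: 52 m/s exceeds the 45 m/s threshold
print(trigger_fires(event, "flood"))      # False: 30 mm is below the 120 mm threshold
```

Routing each event through its assigned regime's trigger, rather than one portfolio-wide threshold, is what reduces basis risk.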
Catastrophe Features Modeled
- max_wind_ms — peak wind speed (hurricanes)
- precip_24h_mm — 24h rainfall (floods)
- surge_m — storm surge (coastal events)
- min_pressure_hpa — intensity proxy
- translation_kmh — storm movement speed
- track_deviation_km — path uncertainty
Insurance Output Statistics
- E[K] — expected number of catastrophe types
- E[α] — concentration (regime separation)
- Loss rate per regime (claims frequency)
- Feature means per regime (trigger calibration)
- Event-to-regime assignments (classification)
Key Insurance Advantage
Unlike k-means or GMM where you must choose K regimes upfront, CRP discovers the appropriate number from historical loss data. The Bayesian evidence (marginal likelihood) tells you whether you have 2, 3, or 5 distinct catastrophe types in your portfolio — no guessing required.
Mathematical Foundation
The CRP: Insurance Analogy
Imagine a catastrophe response center with infinite specialist teams. Each new event (claim) either joins an existing team handling similar events (probability proportional to team size), or forms a new specialized team (probability controlled by concentration parameter α). The normalised probabilities follow the Pólya-urn scheme:

P(event i joins regime k) = nₖ / (i − 1 + α),  P(event i opens a new regime) = α / (i − 1 + α)

where nₖ is the number of earlier events already assigned to regime k.
The expected number of occupied regimes after n events is E[Kₙ] ≈ α log(1 + n/α), growing logarithmically and reflecting the intuition that rare new peril types become increasingly unlikely as the catalogue grows. A higher α encourages more diverse regimes.
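The logarithmic growth of E[Kₙ] is easy to verify by simulating the seating process directly. A standalone sketch (not the app's training code):

```python
import math
import random

def simulate_crp(n: int, alpha: float, rng: random.Random) -> int:
    """Sequentially seat n events by the CRP; return the number of occupied regimes."""
    counts = []  # counts[k] = number of events assigned to regime k
    for i in range(n):
        # With i events seated: P(new regime) = alpha / (i + alpha),
        # P(join regime k) = counts[k] / (i + alpha).
        u = rng.random() * (i + alpha)
        if u < alpha:
            counts.append(1)          # open a new regime
        else:
            u -= alpha
            for k in range(len(counts)):
                if u < counts[k]:
                    counts[k] += 1    # join existing regime k
                    break
                u -= counts[k]
    return len(counts)

rng = random.Random(0)
alpha, n = 2.0, 1000
mean_k = sum(simulate_crp(n, alpha, rng) for _ in range(200)) / 200
expected = alpha * math.log(1 + n / alpha)  # ≈ 12.4 for alpha=2, n=1000
print(round(mean_k, 1), round(expected, 1))
```

Averaged over 200 runs, the simulated regime count lands close to the α log(1 + n/α) approximation.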
Hierarchical Dirichlet Process — Full Generative Story
The HDP adds a global base measure G₀ so that all regimes can share statistical strength — critical when rare coastal surge events have only a handful of historical observations. The two-level Chinese Restaurant Franchise representation:
- Events sit at tables (local regimes) via CRP(α).
- Tables link to global dishes (shared parameter types) via CRP(γ).
The hyperconcentration γ controls how many distinct parameter types can appear globally. A small γ forces rare-peril regimes to borrow heavily from common-peril regimes; a large γ lets each regime be fully independent. Both α and γ are inferred via conjugate Gamma hyperpriors during sampling.
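The two-level franchise can be sketched as a forward sampler. This is illustrative only (the production model infers assignments from data rather than sampling forward):

```python
import random

def crf_sample(n_events, alpha, gamma, rng):
    """Two-level Chinese Restaurant Franchise for one restaurant (sketch).
    Events sit at tables via CRP(alpha); tables order global dishes via CRP(gamma)."""
    table_counts = []   # events seated at each local table
    table_dish = []     # dish (global regime type) served at each table
    dish_counts = []    # number of tables serving each dish
    for i in range(n_events):
        u = rng.random() * (i + alpha)
        if u < alpha:                          # open a new table...
            v = rng.random() * (len(table_dish) + gamma)
            if v < gamma:                      # ...serving a brand-new dish
                dish_counts.append(0)
                dish = len(dish_counts) - 1
            else:                              # ...or an existing dish, chosen
                v -= gamma                     # proportionally to tables serving it
                dish = 0
                for d, c in enumerate(dish_counts):
                    if v < c:
                        dish = d
                        break
                    v -= c
            table_counts.append(1)
            table_dish.append(dish)
            dish_counts[dish] += 1
        else:                                  # join an existing table
            u -= alpha
            for t, c in enumerate(table_counts):
                if u < c:
                    table_counts[t] += 1
                    break
                u -= c
    return len(table_counts), len(dish_counts)

tables, dishes = crf_sample(500, alpha=2.0, gamma=1.8, rng=random.Random(1))
print(tables, dishes)  # dishes <= tables: dishes can only be opened via new tables
```

A small γ yields few dishes shared by many tables (heavy borrowing); a large γ approaches one dish per table (independent regimes).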
Collapsed Gibbs Sampling & the Predictive Likelihood
Integrating out regime parameters gives the collapsed update

p(zᵢ = k | z₋ᵢ, x) ∝ nₖ⁻ⁱ · p(xᵢ | x̄ₖ⁻ⁱ)  for an existing regime k,  and ∝ α · p(xᵢ)  for a new regime.

The key computational piece is p(xᵢ | x̄ₖ⁻ⁱ) — the posterior predictive under regime k's model after excluding event i:
Conjugate case — Normal-Inverse-Gamma prior (multivariate features):

p(xᵢ | x̄ₖ⁻ⁱ) = t_ν(xᵢ; μ, Σ)

i.e. a Student-t density whose parameters come from the conjugate update.
The degrees of freedom ν, location μ, and scale Σ update analytically from the sufficient statistics of the nₖ⁻ⁱ excluded events — making this an O(d²) rank-1 update per Gibbs step.
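As a sketch of the conjugate update, here is the univariate Normal-Inverse-Gamma posterior predictive (a Student-t); the hyperparameter defaults mu0, kappa0, a0, b0 are illustrative, and the multivariate case updates Σ analogously:

```python
import math

def nig_posterior_predictive(x_new, data, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0):
    """Log posterior-predictive density p(x_new | data) under a univariate
    Normal-Inverse-Gamma prior: a Student-t with updated nu, location, scale."""
    n = len(data)
    xbar = sum(data) / n if n else 0.0
    ss = sum((x - xbar) ** 2 for x in data)
    # Standard conjugate updates of the NIG hyperparameters
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    a_n = a0 + n / 2
    b_n = b0 + 0.5 * ss + (kappa0 * n * (xbar - mu0) ** 2) / (2 * kappa_n)
    # Predictive: Student-t, nu = 2*a_n, scale^2 = b_n*(kappa_n+1)/(a_n*kappa_n)
    nu = 2 * a_n
    scale2 = b_n * (kappa_n + 1) / (a_n * kappa_n)
    z2 = (x_new - mu_n) ** 2 / scale2
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi * scale2)
            - (nu + 1) / 2 * math.log(1 + z2 / nu))

winds = [42.0, 47.0, 51.0, 39.0]  # toy wind observations for one regime
print(nig_posterior_predictive(44.0, winds))   # near the regime mean
print(nig_posterior_predictive(200.0, winds))  # deep in the tail: much lower
```

Because only the sufficient statistics (n, x̄, Σ(x − x̄)²) enter the update, adding or removing one event is a cheap incremental change, which is the rank-1 structure exploited per Gibbs step.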
Non-conjugate case — GEV prior for wind extremes (Metropolis-within-Gibbs):

p(xᵢ | x̄ₖ⁻ⁱ) ≈ (1/M) Σₘ p(xᵢ | θₖ⁽ᵐ⁾),  θₖ⁽ᵐ⁾ ∼ p(θₖ | x̄ₖ⁻ⁱ)
Here M Monte Carlo draws are taken from the regime posterior via a Metropolis step embedded inside the outer Gibbs loop. The GEV shape ξ controls tail heaviness; a vague prior on ξ prevents pathological heavy tails when a regime has fewer than ~20 events.
Posterior on Number of Regimes & Label Switching
After burn-in, the empirical distribution of unique cluster labels across thinned MCMC samples yields P(K | x). The posterior mode is the reported regime count; the spread quantifies genuine uncertainty. Formally:

P(K = k | x) ≈ (1/S) Σₛ 1[K⁽ˢ⁾ = k]

where K⁽ˢ⁾ is the number of occupied regimes in thinned sample s of S.
Label switching: Because the likelihood is invariant to permuting regime labels, naive averaging of posterior samples is meaningless. We resolve this by using invariant summaries (regime feature means, sizes) rather than label-specific quantities, or by post-processing with an optimal-transport alignment across samples.
Parametric Trigger Calibration
Per-Regime Threshold Optimization
Once regimes are discovered, we calibrate parametric triggers separately for each. The optimal threshold θ* balances sensitivity (capturing true catastrophes) against false positives (paying for non-events). We use train/test cross-validation to prevent overfitting.
Posterior Predictive Payout Objective
Replacing the point-estimate p(x) with the full posterior predictive averaged over MCMC regime assignments naturally propagates parameter uncertainty into the optimal threshold, e.g. for a squared payout-error objective:

θ*ₖ = argmin_θ (1/S) Σₛ 𝔼_{(x, ℓ) ∼ p(· | θₖ⁽ˢ⁾)}[ (min(L_max · 1{x ≥ θ}, C) − ℓ)² ]

with ℓ the realized loss.
Where C is the coverage limit, L_max is the maximum payout, and the sum averages over S thinned MCMC samples of regime parameters θₖ — producing credible intervals on θ* that directly quantify uncertainty in the trigger.
Basis Risk Objective
For explicit basis risk control, minimize a weighted false-positive / false-negative trade-off under the regime-specific predictive, with λ reflecting insurer risk appetite:

θ*ₖ = argmin_θ [ λ · FPₖ(θ) + (1 − λ) · FNₖ(θ) ]

where FPₖ(θ) is the probability the trigger fires without a true loss event and FNₖ(θ) the probability a true event fails to trigger, both under regime k's predictive.
λ → 0 favours recall (capturing every true event), λ → 1 favours precision (no spurious payouts). The optimal θ* is found via grid search or Brent's method over the posterior predictive CDF.
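A minimal grid-search version of this objective on toy labelled wind data (the labels, values, and grid below are hypothetical, and Brent's method over a smooth predictive CDF would serve equally well):

```python
def optimal_threshold(feature, is_event, lam, grid):
    """Grid-search the basis-risk objective lam*FP_rate + (1-lam)*FN_rate
    over candidate thresholds; returns the best threshold and its objective."""
    n = len(feature)
    best_theta, best_obj = None, float("inf")
    for theta in grid:
        fp = sum(1 for x, e in zip(feature, is_event) if x >= theta and not e)
        fn = sum(1 for x, e in zip(feature, is_event) if x < theta and e)
        obj = lam * fp / n + (1 - lam) * fn / n
        if obj < best_obj:
            best_theta, best_obj = theta, obj
    return best_theta, best_obj

# Toy observations: max wind (m/s) and whether a true loss event occurred
wind = [20, 30, 38, 46, 50, 55, 60]
event = [False, False, False, True, True, True, True]
theta, obj = optimal_threshold(wind, event, lam=0.5, grid=range(10, 70, 5))
print(theta, obj)  # a threshold between 38 and 46 separates the toy data cleanly
```

Sweeping λ traces out the precision/recall frontier for the regime, which is how an insurer's risk appetite maps onto a concrete trigger value.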
Validation Against Baselines
The CRP-trigger model is validated against standard insurance baselines: GEV (Generalized Extreme Value), Weibull, Gaussian copula, and historical quantile methods. Performance metrics include:
- Basis Risk — (FP + FN) / N — lower is better
- Precision — TP / (TP + FP) — trigger accuracy
- Recall — TP / (TP + FN) — loss capture rate
- Boundary F1 — harmonic mean of precision and recall
- CRPS — Continuous Ranked Probability Score on regime forecasts — rewards calibrated uncertainty
- OLPD — Out-of-sample log predictive density — penalises overfit triggers
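The count-based metrics above take only a few lines (a sketch; CRPS and OLPD need the full predictive distribution and are omitted here):

```python
def trigger_metrics(fired, true_event):
    """Basis risk, precision, recall, and F1 for a parametric trigger,
    given per-event fired/true-event indicator lists."""
    tp = sum(f and t for f, t in zip(fired, true_event))
    fp = sum(f and not t for f, t in zip(fired, true_event))
    fn = sum(not f and t for f, t in zip(fired, true_event))
    n = len(fired)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"basis_risk": (fp + fn) / n, "precision": precision,
            "recall": recall, "f1": f1}

m = trigger_metrics([True, True, False, False, True],
                    [True, False, False, True, True])
print(m)  # one false positive and one false negative out of five events
```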
Advanced Topics
Non-Stationarity: Dynamic Alpha-Drift
Climate change makes α itself non-stationary: as high-precipitation regimes become more frequent, the effective concentration should drift upward. A minimal exponential-time model:

α(t) = α₀ · exp(β (t − t₀))

with drift rate β > 0 producing the upward trend.
A more principled approach is the Recurrent CRP (RCRP) or Dependent Dirichlet Process (DDP), where regime atom locations (GEV parameters) depend smoothly on climate covariates such as year, ENSO index, or sea-surface temperature anomaly:

θₖ(c) ∼ GP(m(c), κ(c, c′))

where c is the covariate vector.
This enables forward projection of regime probabilities under climate scenarios — directly useful for pricing parametric instruments beyond a 10-year horizon. Here GP denotes a Gaussian Process with covariance kernel κ.
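The exponential drift model is trivial to evaluate; the coefficients below (α₀ = 2.0, β = 0.02, t₀ = 2000) are illustrative values, not fitted estimates:

```python
import math

def alpha_drift(t, alpha0=2.0, beta=0.02, t0=2000):
    """Minimal exponential-time drift: alpha(t) = alpha0 * exp(beta * (t - t0))."""
    return alpha0 * math.exp(beta * (t - t0))

for year in (2000, 2010, 2025):
    print(year, round(alpha_drift(year), 2))
```

With a positive β, the effective concentration rises over time, so forward projections allow for more distinct regimes late in the pricing horizon.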
Convergence Diagnostics
With 600 iterations, burn-in of 150, and thinning of 6, the effective sample size (ESS) should be checked before trusting any posterior summary:
- Gelman-Rubin R̂ — run ≥ 2 chains; R̂ < 1.05 indicates mixing.
- ESS on α and K — ESS < 100 signals autocorrelation; increase thinning.
- Trace plots — visual check for stationarity of α̂(t) and K(t) post burn-in.
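A self-contained ESS check on a scalar trace such as α̂(t) (a simplified initial-positive-sequence truncation, a cruder variant of what packages like ArviZ compute):

```python
import random

def effective_sample_size(chain):
    """ESS = N / (1 + 2 * sum of positive-lag autocorrelations), truncating
    at the first non-positive autocorrelation (sketch estimator)."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    if var == 0:
        return float(n)
    tau = 1.0  # integrated autocorrelation time
    for lag in range(1, n // 2):
        acf = sum((chain[i] - mean) * (chain[i + lag] - mean)
                  for i in range(n - lag)) / (n * var)
        if acf <= 0:
            break
        tau += 2 * acf
    return n / tau

rng = random.Random(0)
iid = [rng.gauss(0, 1) for _ in range(500)]       # well-mixed chain
ar, x = [], 0.0
for _ in range(500):                              # sticky AR(1) chain
    x = 0.9 * x + rng.gauss(0, 1)
    ar.append(x)
print(round(effective_sample_size(iid)), round(effective_sample_size(ar)))
```

The autocorrelated chain's ESS collapses to a small fraction of its nominal length, which is exactly the situation where increasing thinning (or running longer) is warranted before trusting P(K | x).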
Scalability & Approximations
For catalogues exceeding ~10,000 events, collapsed Gibbs becomes the bottleneck. Recommended alternatives:
- Slice sampling (stick-breaking) — O(K) per step vs O(K²), no tuning required.
- Variational Bayes for HDP — closed-form ELBO updates; 10-100× faster, loses some tail accuracy.
- Mini-batch collapsed Gibbs — subsample events per sweep; suitable for streaming loss data.
Prior recommendations: α ~ Gamma(1, 1) (weakly informative), γ ~ Gamma(1, 1), GEV shape ξ ~ Gamma(2, 0.5) (prevents ξ > 1 pathologies), GEV scale σ ~ LogNormal(0, 1). Run a sensitivity analysis by doubling/halving the concentration priors and checking whether the K posterior shifts.