Research System · Bayesian ML · Insurance

ParaEval and the CRP/HDP Model: Bayesian Nonparametric Trigger Calibration for Parametric Insurance

Parametric insurance triggers are only as good as the statistical model behind them. This post walks through ParaEval — a decision-evaluation platform for parametric claims — and its CRP/HDP sub-model, which uses Bayesian nonparametric clustering to discover latent peril regimes and calibrate triggers with lower basis risk than standard actuarial baselines.

18 min read · April 6, 2026
ParaEval · CRP · HDP · Bayesian Nonparametrics · Parametric Insurance · Basis Risk · MCMC · Gibbs Sampling

The problem

Parametric triggers inherit the assumptions of the model that sets them

A parametric insurance policy pays when an observable index — wind speed, rainfall accumulation, flood depth — crosses a contractual threshold. The entire product hinges on that threshold being correctly calibrated: set it too low and the insurer overpays (false positives); set it too high and the policyholder suffers uncompensated loss (false negatives). The sum of these two errors is basis risk, and it is the single most important quality metric for any parametric product.

Industry practice typically calibrates thresholds using Generalized Extreme Value (GEV) distributions, Weibull fits, or historical quantiles. These methods assume a single, stationary generating process — an assumption that breaks when peril events cluster into distinct regimes (e.g., fast-track high-wind typhoons vs. slow-moving rain-dominant systems) and when climate non-stationarity shifts those regimes over time.

ParaEval is a decision-evaluation platform built to surface exactly this kind of structural mismatch. Its CRP/HDP sub-model replaces the single-distribution assumption with a Bayesian nonparametric mixture that discovers latent peril regimes directly from data, then derives trigger thresholds from the boundaries between loss-producing and non-loss-producing regimes.

(FP + FN) / N

Basis risk definition

Single regime

Key assumption broken

K is inferred

CRP advantage

System design

ParaEval: from raw evidence to auditable settlement decisions

ParaEval is not a model — it is a decision-evaluation platform that sits between raw event data and the payout recommendation. It structures the entire evidence-to-decision pipeline into four composable stages: evidence ingestion, trigger evaluation, settlement logic, and audit trail generation. Each stage is deterministic given its inputs — the same evidence snapshot always produces the same decision, reasoning trace, and payout recommendation. This determinism is not a convenience; it is a regulatory requirement. Parametric insurance settlements must be reproducible and explainable to auditors, regulators, and dispute panels.

The system handles multiple evidence types (weather API readings, satellite observations, uploaded documents, claims adjuster reports) and classifies each source by reliability tier: authoritative (the contractual index source), corroborating (independent sources that directionally agree), and indicative (weaker signals useful for basis-risk analysis). The decision engine applies four rule checks — authoritative source coverage, contract threshold test, cross-source corroboration, and counter-signal management — before producing a confidence score, trigger status (met / borderline / not_met), and a structured settlement memo.

Intake

Evidence layer

Each evidence item carries a provenance record: provider identity, observation timestamp, distance from insured asset, measurement unit, and reliability tier. This metadata drives the weighting logic downstream — an authoritative reading from the contractual index source carries more settlement weight than a corroborating satellite proxy.
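The provenance record can be sketched as a small data structure. Field names and tier weights below are illustrative, not ParaEval's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvidenceItem:
    provider: str       # e.g. a national met-agency feed
    observed_at: str    # ISO-8601 observation timestamp
    distance_km: float  # distance from the insured asset
    unit: str           # measurement unit, e.g. "m/s"
    tier: str           # "authoritative" | "corroborating" | "indicative"
    value: float

# Illustrative tier weights for the downstream settlement weighting.
TIER_WEIGHT = {"authoritative": 1.0, "corroborating": 0.6, "indicative": 0.3}

def settlement_weight(item: EvidenceItem) -> float:
    """Weight an evidence item by its reliability tier."""
    return TIER_WEIGHT[item.tier]
```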

Decision

Decision engine

The trigger evaluation is deterministic: yes = 1.0, partial = 0.5, no = 0.0. Confidence is the arithmetic mean of evidence scores. Status thresholds are fixed: met >= 0.7, borderline >= 0.4, not_met < 0.4. This simplicity is intentional — it makes the decision auditable and reproducible without requiring statistical expertise from the reviewer.
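The stated rules are simple enough to write out directly; this is a sketch of the scoring described above, with function names assumed:

```python
# Deterministic evidence scoring: yes = 1.0, partial = 0.5, no = 0.0.
SCORE = {"yes": 1.0, "partial": 0.5, "no": 0.0}

def confidence(evidence_outcomes: list) -> float:
    """Arithmetic mean of the per-evidence scores."""
    return sum(SCORE[o] for o in evidence_outcomes) / len(evidence_outcomes)

def trigger_status(conf: float) -> str:
    """Fixed status thresholds: met >= 0.7, borderline >= 0.4."""
    if conf >= 0.7:
        return "met"
    if conf >= 0.4:
        return "borderline"
    return "not_met"
```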

Payout

Settlement logic

Payout recommendations are tiered against policy limits. If the highest authoritative reading exceeds 120% of the trigger threshold, the full policy limit is recommended. At the threshold itself, 70% is recommended. Borderline cases carry a reduced 25% watch-list view. Each tier includes a plain-language rationale suitable for inclusion in a settlement memo.
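The tiering can be sketched as a single function; the rationale strings and function name are illustrative:

```python
def recommend_payout(max_auth_reading: float, threshold: float,
                     policy_limit: float, status: str):
    """Tiered payout recommendation following the rules described above."""
    if status == "met" and max_auth_reading >= 1.2 * threshold:
        return policy_limit, "Reading exceeds 120% of trigger: full policy limit."
    if status == "met":
        return 0.70 * policy_limit, "Trigger met at threshold: 70% of limit."
    if status == "borderline":
        return 0.25 * policy_limit, "Borderline: reduced 25% watch-list view."
    return 0.0, "Trigger not met: no payout recommended."
```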

Governance

Audit trail

Every decision produces a rule trace (four checks with pass/warn/fail outcomes), a list of blocking conditions, supporting and counter-evidence summaries, and a basis-risk classification (none/low/medium/high). This trace is the artifact that survives regulatory review.

Mathematical foundation

The Chinese Restaurant Process: a nonparametric prior over cluster structure

The Chinese Restaurant Process (CRP) is a constructive definition of the Dirichlet Process (DP) that makes its clustering behavior intuitive. Imagine a restaurant with infinitely many tables. The first customer sits at table 1. Each subsequent customer either joins an existing table with probability proportional to the number of people already seated there, or starts a new table with probability proportional to a concentration parameter alpha. This process generates a random partition of customers into groups — and it does so without fixing the number of groups in advance.

In the context of peril modeling, each "customer" is a historical typhoon event and each "table" is a latent peril regime. The CRP prior encodes a rich-get-richer dynamic: large regimes attract more events, but the concentration parameter alpha controls how readily new regimes are created. Crucially, alpha is not fixed — we place a Gamma prior on it and infer its posterior value alongside the regime assignments. A higher posterior alpha means the data is better explained by more regimes; a lower alpha means fewer, larger clusters suffice. This is the Escobar & West (1995) auxiliary variable method, and it gives the model an automatic Occam's razor: it discovers exactly as many regimes as the data warrants.
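The seating dynamic is easy to simulate. A minimal sketch of drawing one partition from a CRP prior (an illustration, not ParaEval's sampler):

```python
import random

def sample_crp_partition(n: int, alpha: float, seed: int = 0) -> list:
    """Draw one partition of n customers from a CRP(alpha) prior."""
    rng = random.Random(seed)
    counts = []       # customers seated per table
    assignments = []
    for i in range(n):
        # Existing table k with prob ∝ counts[k]; new table with prob ∝ alpha.
        weights = counts + [alpha]
        r = rng.uniform(0.0, i + alpha)   # sum(counts) == i
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(counts):
            counts.append(1)              # open a new table
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments
```

Larger alpha produces more tables; as alpha tends to zero, everyone sits together.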

Comparison

Why not just use K-means?

K-means requires specifying K in advance, assigns hard cluster memberships, and assumes spherical clusters with equal variance. The CRP mixture model infers K from data, provides a full posterior distribution over assignments, and uses a Normal-Inverse-Wishart prior that accommodates clusters with different shapes, sizes, and orientations.

Comparison

Why not a finite Gaussian mixture (GMM)?

A finite GMM with BIC/AIC model selection still requires fitting multiple models and choosing among them. The DP mixture integrates over the number of components in a single inference run. More importantly, the DP places non-zero probability on arbitrarily many components — it can accommodate future regimes not seen in the training window.

P(z_i = k \mid z_{-i}) \propto \begin{cases} n_{-i,k} \cdot p(x_i \mid X_{-i,k}) & \text{existing table } k \\ \alpha \cdot p(x_i \mid G_0) & \text{new table} \end{cases}

CRP conditional assignment. Each event i is assigned to an existing regime k with probability proportional to the regime size n_{-i,k} weighted by the likelihood of the event under that regime, or to a new regime with probability proportional to alpha weighted by the prior predictive likelihood.

G \sim \text{DP}(\alpha, G_0), \quad G_0 = \text{NIW}(\mu_0, \kappa_0, \nu_0, \Psi_0)

The generative model. G is drawn from a Dirichlet Process with concentration alpha and base distribution G_0 — a Normal-Inverse-Wishart (NIW) prior that conjugates with multivariate Gaussian cluster likelihoods.

\eta \sim \text{Beta}(\alpha + 1, n), \quad \alpha \mid K, n \sim \pi \cdot \text{Gamma}(a + K, b - \log\eta) + (1 - \pi) \cdot \text{Gamma}(a + K - 1, b - \log\eta)

Escobar & West (1995) auxiliary variable update for alpha. The auxiliary variable eta breaks the coupling between alpha and the partition, enabling a closed-form Gibbs update. This eliminates the need for Metropolis-Hastings steps on the concentration parameter.
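A sketch of this auxiliary-variable update, assuming a Gamma(a, b) prior on alpha; the function name is illustrative:

```python
import math
import random

def update_alpha(alpha, k, n, a=1.0, b=1.0, rng=None):
    """One Escobar & West auxiliary-variable Gibbs step for the DP
    concentration parameter, under a Gamma(a, b) prior (sketch)."""
    rng = rng or random.Random()
    eta = rng.betavariate(alpha + 1.0, n)      # eta | alpha, n
    rate = b - math.log(eta)                   # posterior Gamma rate
    odds = (a + k - 1.0) / (n * rate)          # pi / (1 - pi)
    pi = odds / (1.0 + odds)
    shape = a + k if rng.random() < pi else a + k - 1.0
    return rng.gammavariate(shape, 1.0 / rate)  # gammavariate takes scale
```

Because the update is closed-form, it adds negligible cost to each Gibbs sweep.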

Inference

Collapsed Gibbs sampling: integrating out cluster parameters for faster mixing

We implement collapsed Gibbs sampling following Neal (2000, Algorithm 3). "Collapsed" means we analytically integrate out the cluster-specific parameters (\mu_k, \Sigma_k) using the Normal-Inverse-Wishart conjugacy, and sample only the discrete assignment variables z and the concentration parameter \alpha. This reduces the state space dramatically — instead of sampling O(K \cdot d^2) continuous parameters per iteration, we sample O(n) discrete assignments. The collapsed sampler mixes faster and converges in fewer iterations.

Each Gibbs iteration makes a full pass over all n events. For event i, we temporarily remove it from its current cluster, compute the CRP conditional probability of assigning it to each existing cluster (using the Student-t predictive distribution that falls out of the NIW posterior) and to a new cluster (using the prior predictive), then sample from the resulting categorical distribution. After the full pass, we update \alpha via the Escobar & West auxiliary variable method. Convergence is monitored via the split-chain \hat{R} diagnostic on K (the number of active clusters) and the log-likelihood trace.

p(x_i \mid X_{-i,k}) = t_{\nu_n - d + 1}\!\left(x_i \,;\, \mu_n, \frac{\kappa_n + 1}{\kappa_n(\nu_n - d + 1)} \Psi_n\right)

The predictive distribution for a new observation x_i given the other members of cluster k is a multivariate Student-t. The parameters (mu_n, kappa_n, nu_n, Psi_n) are the NIW posterior updated with the sufficient statistics of X_{-i,k}. This is the Murphy (2012) formulation, eq. 4.215.

\kappa_n = \kappa_0 + n_k, \quad \nu_n = \nu_0 + n_k, \quad \mu_n = \frac{\kappa_0 \mu_0 + n_k \bar{x}_k}{\kappa_n}

NIW posterior update equations. The posterior mean mu_n is a precision-weighted average of the prior mean and the cluster sample mean. kappa_n and nu_n accumulate evidence from the cluster members.

\Psi_n = \Psi_0 + S_k + \frac{\kappa_0 n_k}{\kappa_n}(\bar{x}_k - \mu_0)(\bar{x}_k - \mu_0)^\top

The posterior scatter matrix Psi_n accumulates three sources of variation: the prior scatter Psi_0, the within-cluster scatter S_k, and a shrinkage term that penalizes deviation of the cluster mean from the prior mean.
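The two update equations above translate directly into code. A minimal NumPy sketch (helper name assumed):

```python
import numpy as np

def niw_posterior(X_k, mu0, kappa0, nu0, Psi0):
    """NIW posterior parameters given the cluster members X_k (n_k x d),
    following the update equations above."""
    n_k, d = X_k.shape
    xbar = X_k.mean(axis=0)
    S_k = (X_k - xbar).T @ (X_k - xbar)            # within-cluster scatter
    kappa_n = kappa0 + n_k
    nu_n = nu0 + n_k
    mu_n = (kappa0 * mu0 + n_k * xbar) / kappa_n   # precision-weighted mean
    diff = (xbar - mu0).reshape(-1, 1)
    Psi_n = Psi0 + S_k + (kappa0 * n_k / kappa_n) * (diff @ diff.T)
    return mu_n, kappa_n, nu_n, Psi_n
```

These posterior parameters feed the Student-t predictive density used in each assignment step.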

Checklist

  • Initialize assignments with K-means++ rather than random assignment — this gives the sampler a reasonable starting partition and reduces burn-in by 2-3x.
  • Set the prior scatter Psi_0 = scale * (d+1) * empirical covariance. This is weakly informative: data dominates, but the prior prevents singular covariance estimates in small clusters.
  • Use kappa_0 = 0.01 (weak prior on the mean location) so the sampler can freely discover cluster centers from data.
  • Monitor convergence via split-chain R-hat on K. Values below 1.1 indicate adequate mixing. If R-hat > 1.2, increase the chain length or adjust the prior.
  • Thin the chain (keep every 5th sample) to reduce autocorrelation in the posterior samples used for threshold calibration.

Hierarchical extension

HDP: modeling primary-to-secondary peril dependency across typhoon categories

A typhoon is not a single peril — it produces wind damage, rainfall flooding, and storm surge simultaneously. The conditional distribution of secondary peril intensity (e.g., flood depth) given the primary peril context (e.g., Saffir-Simpson category) is critical for multi-trigger products. The Hierarchical Dirichlet Process (HDP) extends the CRP to grouped data: each typhoon category gets its own Dirichlet Process mixture for flood depth, but all category-specific mixtures share a common set of global atoms drawn from a top-level DP.

This sharing mechanism is the key statistical insight — it allows rare categories (Cat 5 events) to borrow strength from more common categories (Cat 2-3 events) through shared mixture components, while still allowing category-specific distributional differences. In the ParaEval implementation, this is approximated using context-specific Gaussian Mixture Models with a global backoff model for sparse categories, following the MAP-sharing approximation rather than a full Gibbs HDP sampler (Teh et al., 2006). This is a tractable compromise for the dataset sizes typical in APAC typhoon catalogs (~500-2000 events).

Strength sharing

Why HDP over independent GMMs per category?

Cat 5 typhoons in the Western Pacific occur roughly 2-3 times per decade. Fitting an independent mixture to 10-15 events produces unstable density estimates. The HDP shares global atoms across categories, so a flood-depth component discovered in Cat 3 events can be reused for Cat 5 events with different mixing weights — borrowing statistical strength without assuming identical distributions.

Implementation

Practical approximation

The full HDP Gibbs sampler (Teh et al., 2006) is computationally expensive. ParaEval uses a MAP-sharing approximation: fit a global GMM on all flood depths, then fit category-specific GMMs with n_components capped by available data (min(K, n_c/3)). Categories with fewer than 5 events fall back to the global model. This is less elegant but stable for production.
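A sketch of the sharing/fallback logic described above. `fit_gauss` is a stand-in for a full GMM fit, since the backoff structure, not the density estimator, is the point here:

```python
import numpy as np

def fit_shared_mixtures(depths_by_cat: dict, K: int = 5):
    """MAP-sharing sketch: a global model plus capped per-category models.
    Categories with fewer than 5 events back off to the global model."""
    def fit_gauss(x):
        # Placeholder for a real GMM fit: a single Gaussian (mean, std).
        return float(x.mean()), float(x.std() + 1e-6)

    all_depths = np.concatenate(list(depths_by_cat.values()))
    global_model = fit_gauss(all_depths)
    models = {}
    for cat, x in depths_by_cat.items():
        if len(x) < 5:
            models[cat] = ("global", global_model)        # sparse: back off
        else:
            n_comp = max(1, min(K, len(x) // 3))          # cap n_components
            models[cat] = (f"local[{n_comp}]", fit_gauss(x))
    return models, global_model
```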

Product design

Insurance application

Multi-trigger parametric products (e.g., wind + flood for warehouse cover) need to price the joint exceedance probability. The HDP gives P(flood > threshold | category = c), which combined with the CRP regime-specific wind distribution yields the joint trigger probability per regime — the basis for multi-peril pricing.

G_0 \sim \text{DP}(\gamma, H), \quad G_c \sim \text{DP}(\alpha, G_0), \quad s_t \mid c_t \sim G_{c_t}

HDP generative model. H is the base distribution for flood depth. G_0 is the global flood depth distribution. G_c is the category-specific distribution for Saffir-Simpson category c. Each observed flood depth s_t is drawn from the mixture associated with its typhoon category c_t.

\text{BR}_{\text{secondary}} = \frac{1}{N} \sum_{t=1}^{N} \mathbb{1}\!\left[\hat{\tau}(c_t) \neq \mathbb{1}[s_t > \theta_{\text{flood}}]\right]

Secondary peril basis risk. The trigger prediction for event t is based on E[flood_depth | category c_t] exceeding the flood threshold. Basis risk measures how often this prediction disagrees with the observed flood outcome.

From clusters to contracts

Trigger calibration: finding the optimal boundary between loss and no-loss regimes

The CRP sampler discovers latent regimes. The trigger calibrator turns those regimes into actionable insurance thresholds. The strategy is conceptually simple: for each posterior sample of regime assignments, identify which regimes are loss-producing (majority of events have loss_occurred = True) and which are not. For each peril feature, find the 1D decision boundary between the closest loss-regime and no-loss-regime pair. This boundary is the trigger threshold theta* for that feature in that posterior sample.

Repeating across all posterior samples yields a distribution of theta* values, from which we extract the posterior mean as the point estimate and the 5th/95th percentiles as a 90% credible interval. This credible interval is the key output that traditional methods cannot produce: it directly quantifies the uncertainty in the trigger threshold due to finite data and regime assignment uncertainty. A wide CI on theta* signals that the trigger boundary is poorly resolved — a direct warning to the product designer that basis risk may be high.

S ~ 50-200

Posterior samples

90%

Credible interval

6

Feature dimensions

\theta^*(f) = \frac{1}{S} \sum_{s=1}^{S} \underset{(k_L, k_{NL})}{\arg\min} \left| \mu_{k_L}^{(s)}[f] - \mu_{k_{NL}}^{(s)}[f] \right|_{\text{boundary}}

The optimal threshold for feature f is the posterior mean of the inter-regime boundary, averaged over S posterior samples. For each sample s, we find the closest pair of loss-regime k_L and no-loss-regime k_NL along feature dimension f and compute their 1D decision boundary.

\text{boundary}(\mu_a, \sigma_a, \mu_b, \sigma_b) = \begin{cases} \frac{\mu_a + \mu_b}{2} & \text{if } \sigma_a \approx \sigma_b \\ \text{root of } \frac{(x - \mu_a)^2}{2\sigma_a^2} - \frac{(x - \mu_b)^2}{2\sigma_b^2} = \log\frac{\sigma_b}{\sigma_a} & \text{otherwise} \end{cases}

The 1D boundary between two Gaussian regimes is the point where their densities are equal. When variances differ, this is a quadratic equation with up to two roots — we take the root closest to the midpoint.
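A sketch of this boundary computation (hypothetical helper, not ParaEval's code). Setting the two Gaussian log-densities equal and expanding gives a quadratic in x:

```python
import math

def gaussian_boundary(mu_a, sigma_a, mu_b, sigma_b, tol=1e-6):
    """Equal-density point of two 1D Gaussians: the midpoint when the
    variances match, else the quadratic root nearest the midpoint."""
    mid = 0.5 * (mu_a + mu_b)
    if abs(sigma_a - sigma_b) < tol:
        return mid
    # Expand log N(x; mu_a, sigma_a) = log N(x; mu_b, sigma_b) into
    # A x^2 + B x + C = 0.
    A = 1.0 / (2.0 * sigma_a**2) - 1.0 / (2.0 * sigma_b**2)
    B = mu_b / sigma_b**2 - mu_a / sigma_a**2
    C = (mu_a**2 / (2.0 * sigma_a**2) - mu_b**2 / (2.0 * sigma_b**2)
         + math.log(sigma_a / sigma_b))
    disc = max(B * B - 4.0 * A * C, 0.0)
    roots = [(-B + s * math.sqrt(disc)) / (2.0 * A) for s in (1.0, -1.0)]
    return min(roots, key=lambda r: abs(r - mid))
```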

\text{CI}_{90\%}(\theta^*) = \left[\hat{\theta}_{5\%}, \; \hat{\theta}_{95\%}\right]

The 90% credible interval on the trigger threshold, computed from the empirical quantiles of the posterior theta* samples. This interval propagates both data uncertainty and regime assignment uncertainty into the trigger recommendation.

Climate non-stationarity

Alpha-drift index: a nonparametric early-warning signal for regime shift

Traditional catastrophe models assume stationarity — the same generating process that produced historical events will produce future ones. Climate change violates this assumption. The alpha-drift index is a novel application of the CRP concentration parameter as a non-stationarity detector. The idea is straightforward: fit the CRP sampler on rolling time windows (e.g., 5-year windows sliding across a 50-year catalog) and track the posterior mean of alpha over time. A rising alpha indicates that events in recent windows are increasingly poorly explained by the regime structure of earlier windows — the model needs more clusters to fit the data, which means new peril patterns are emerging. A falling alpha means the regime structure is consolidating.

We apply the Mann-Kendall trend test to the alpha time series. A statistically significant positive trend (p < 0.05) is a formal signal that the historical calibration basis is becoming stale — trigger thresholds should be recalibrated on more recent data, or the product should be repriced to account for structural uncertainty. The alpha-drift index can also be correlated with external climate indices (e.g., sea surface temperature anomalies, ENSO phase) to test hypotheses about the physical drivers of regime shift.

Business meaning

Interpretation for underwriters

A rising alpha-hat means "the recent event mix does not fit neatly into the regime categories we found in earlier data." This is not a prediction of more severe events — it is a prediction of more structurally different events. The distinction matters for pricing: severity changes affect expected loss; regime novelty affects model uncertainty.

Climate linkage

Correlation with SST

Sea surface temperature anomalies in the Western Pacific Warm Pool are a known driver of typhoon intensification. The alpha-drift framework provides a formal test: compute Kendall tau between alpha-hat and SST anomaly across overlapping years. A significant positive correlation supports the hypothesis that warming seas are creating novel peril regimes.

Operations

Practical cadence

Run alpha-drift analysis annually as part of the portfolio review cycle. A 5-year rolling window with a 500-iteration sampler per window processes a 500-event catalog in under 30 minutes on a single CPU. This is cheap enough to be a standard monitoring artifact.

\hat{\alpha}(t) = \mathbb{E}[\alpha \mid X_{[t-w, t]}], \quad t = t_0 + w, \ldots, T

The alpha-drift index at time t is the posterior mean of alpha estimated from events in the window [t-w, t]. Each window runs an independent CRP sampler (shortened chain for efficiency: 500 iterations, 100 burn-in).

H_0: \text{no monotonic trend in } \hat{\alpha}(t) \quad \text{(Mann-Kendall test, } p < 0.05\text{)}

The Mann-Kendall test is distribution-free and robust to outliers. A significant positive trend triggers a recalibration advisory. The Theil-Sen slope estimator provides a robust estimate of the rate of alpha increase.
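The trend test is simple to implement from scratch. A sketch of the Mann-Kendall S statistic with the normal approximation (no tie correction; names assumed):

```python
import math

def mann_kendall(series):
    """Mann-Kendall S statistic and normal-approximation z-score
    (no tie correction; sketch for the alpha-drift advisory)."""
    n = len(series)
    s = sum(
        (series[j] > series[i]) - (series[j] < series[i])
        for i in range(n - 1) for j in range(i + 1, n)
    )
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return s, z

def recalibration_advisory(alpha_series, z_crit=1.645):
    """One-sided test: flag a significant positive trend in alpha-hat."""
    _, z = mann_kendall(alpha_series)
    return z > z_crit
```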

Empirical results

CRP/HDP vs. standard actuarial baselines: a controlled comparison

We evaluate the CRP/HDP trigger against four standard actuarial methods on the same synthetic event corpus (500 events, 3 embedded true regimes, 80/20 train-test split). All methods are evaluated on identical test events using the same loss labels. The primary metric is basis risk (FP + FN)/N — lower is better. We also report precision (fraction of trigger activations that correspond to actual losses), recall (fraction of actual losses captured by the trigger), and boundary F1 (the harmonic mean of precision and recall, directly analogous to the boundary F1 reported in morpheme segmentation research). The CRP/HDP model consistently achieves the lowest basis risk because it calibrates thresholds against cluster boundaries rather than distributional quantiles — it "sees" the regime structure that quantile-based methods average over.

Baseline 1

GEV baseline

Fits a Generalized Extreme Value distribution to positive feature values and sets the trigger at the 75th percentile of the fitted GEV. This is the standard approach in catastrophe modeling for return-period estimation. Weakness: assumes a single tail distribution, cannot capture multimodal peril structure.

Baseline 2

Weibull baseline

Fits a Weibull distribution (shape, scale, location) to positive feature values with location fixed at zero. Trigger at the 75th percentile. Slightly more flexible than GEV for wind-speed modeling but still unimodal.

Baseline 3

Gaussian copula baseline

Rank-transforms the feature to Gaussian margins and sets the trigger at the 75th quantile of the original feature. In the univariate comparison, this reduces to a quantile estimate — the copula dependency structure is not captured in 1D. Included for completeness as copula methods are common in reinsurance.

Baseline 4

Historical quantile baseline

The simplest approach: set the trigger at the 75th percentile of the training feature distribution. No distributional assumption, no parametric fit. Surprisingly competitive on well-behaved data, but brittle when the training window is not representative of the test period.

~ 0.20-0.28

CRP/HDP basis risk

5-15%

Boundary F1 gain

100 events

Test set size

400 events

Training set size

\text{Basis Risk} = \frac{FP + FN}{N}, \quad \text{Boundary } F_1 = \frac{2 \cdot \text{Prec} \cdot \text{Rec}}{\text{Prec} + \text{Rec}}

Primary evaluation metrics. Basis risk is the total error rate of the trigger decision. Boundary F1 balances the two types of error — it penalizes both overpayment (FP) and underpayment (FN) equally.
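Given binary trigger decisions and loss labels, these metrics reduce to a few lines (a sketch; the function name is assumed):

```python
def trigger_metrics(triggered, loss):
    """Basis risk, precision, recall, and boundary F1 for a trigger rule
    evaluated against observed loss labels."""
    tp = sum(t and l for t, l in zip(triggered, loss))
    fp = sum(t and not l for t, l in zip(triggered, loss))
    fn = sum((not t) and l for t, l in zip(triggered, loss))
    n = len(triggered)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"basis_risk": (fp + fn) / n, "precision": prec,
            "recall": rec, "boundary_f1": f1}
```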

Experimental setup

Three embedded regimes: how the synthetic corpus is structured

Evaluation on real IBTrACS + ERA5 + EM-DAT data requires CDS API access and EM-DAT registration. For reproducibility, the demo mode uses a synthetic corpus with three embedded regimes that mirror real APAC typhoon archetypes.

Regime 0 (35% of events) represents fast-track, high-wind storms (Cat 4-5, mean wind 55 m/s, mean rain 80mm, fast translation at 28 km/h). Regime 1 (40%) represents slow-moving rain-dominant systems (Cat 2-3, mean wind 38 m/s, mean rain 250mm, slow translation at 10 km/h). Regime 2 (25%) represents surge-dominant coastal storms (Cat 3-4, mean wind 45 m/s, mean surge 4.2m, moderate translation at 18 km/h). Loss amounts are regime-correlated: wind-driven losses for Regime 0, flood-driven for Regime 1, surge-driven for Regime 2.

The loss_occurred label is set at the median economic loss, giving a balanced 50% trigger rate — a worst-case scenario for basis risk (hardest to separate). Six peril features span the multivariate event space: maximum sustained wind (m/s), 24h rainfall (mm), storm surge (m), minimum central pressure (hPa), translation speed (km/h), and track deviation from climatological mean (km).
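The corpus construction can be sketched as follows. The noise model here (multiplicative Gaussian perturbation of the regime mean profiles) is an assumption for illustration, not the exact generator:

```python
import numpy as np

# Regime mean profiles over the six features described above
# (wind m/s, rain mm, surge m, pressure hPa, translation km/h, deviation km).
# Weights follow the text; the noise scale is illustrative.
REGIMES = {
    0: (0.35, [55, 80, 3.0, 915, 28, 180]),   # fast / high-wind
    1: (0.40, [38, 250, 1.8, 955, 10, 100]),  # slow / high-rain
    2: (0.25, [45, 150, 4.2, 935, 18, 60]),   # surge-dominant
}

def sample_corpus(n=500, noise=0.10, seed=0):
    """Draw n synthetic events: pick a regime, perturb its mean profile."""
    rng = np.random.default_rng(seed)
    labels = rng.choice(list(REGIMES), size=n,
                        p=[w for w, _ in REGIMES.values()])
    means = np.array([REGIMES[int(z)][1] for z in labels], dtype=float)
    X = means * (1.0 + noise * rng.standard_normal(means.shape))
    return X, labels
```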

Cat 4-5

Regime 0: Fast / high-wind

Mean profile: 55 m/s wind, 80mm rain, 3.0m surge, 915 hPa, 28 km/h translation, 180km track deviation. These are the Category 4-5 typhoons that cause devastating wind damage with relatively modest rainfall.

Cat 2-3

Regime 1: Slow / high-rain

Mean profile: 38 m/s wind, 250mm rain, 1.8m surge, 955 hPa, 10 km/h translation, 100km track deviation. Slow-moving systems that stall and dump extreme rainfall. The primary loss driver is inland flooding, not wind.

Cat 3-4

Regime 2: Surge-dominant

Mean profile: 45 m/s wind, 150mm rain, 4.2m surge, 935 hPa, 18 km/h translation, 60km track deviation. Coastal tracks that produce large storm surge. Insured losses concentrate in port facilities and coastal infrastructure.

End-to-end

How CRP/HDP analysis feeds into the ParaEval decision pipeline

The CRP/HDP model plugs into ParaEval through a model adapter layer. When an analyst runs CRP analysis on a case, the system executes a full calibration pipeline (simulation + threshold calibration + evaluation) and produces a structured evidence item that is injected into the case's evidence stack. This evidence item carries the calibrated threshold (theta*), the model's recommendation, and an explicit reliability tag of "indicative" — meaning the CRP output informs the decision but does not override authoritative index readings.

The recommendation text is auto-generated based on the calibration metrics: if boundary F1 > 0.7 and basis risk < 0.3, the model output is flagged as "strong trigger discrimination"; if F1 > 0.5, it is "directionally useful"; below that, it is "weak for this feature." This tiered framing prevents over-reliance on model outputs that happen to be poorly calibrated for the specific peril feature and event geography. The analyst sees the CRP output alongside the weather API readings, satellite observations, and adjuster reports — as one signal among several, with its uncertainty clearly surfaced.

Architecture

Model adapter pattern

Each analysis model (CRP/HDP, future models) implements a standard adapter interface with four capabilities: simulate, calibrate, alpha_drift, and case_analysis. The service layer routes requests by model_id, making it trivial to add new models without changing the decision engine.
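The adapter interface can be sketched as a structural type; method signatures here are illustrative, not ParaEval's actual API:

```python
from typing import Any, Protocol

class AnalysisModelAdapter(Protocol):
    """The four capabilities every analysis model exposes (sketch)."""
    model_id: str
    def simulate(self, config: dict) -> dict: ...
    def calibrate(self, events: list, feature: str) -> dict: ...
    def alpha_drift(self, events: list, window_years: int) -> dict: ...
    def case_analysis(self, case_id: str) -> dict: ...

def route(adapters: dict, model_id: str) -> Any:
    """Service-layer routing by model_id: new models register in the
    adapters dict; the decision engine itself is untouched."""
    return adapters[model_id]
```

Structural typing (Protocol) means a new model only has to implement the four methods; it never subclasses anything the decision engine knows about.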

Evaluation

Benchmark comparison lab

ParaEval includes a benchmark mode where the same case can be evaluated under different policy templates, model configurations, and evidence snapshots. This produces delta summaries (confidence change, payout change) that make the impact of model choice visible to decision-makers.

Reproducibility

Versioned case packets

Every case snapshot is versioned with a build timestamp, snapshot version, and source adapter. This ensures that when a model is updated or new evidence arrives, previous decision artifacts remain reproducible from their original inputs.

Checklist

  • Never treat CRP/HDP output as the sole trigger signal. It is indicative evidence that supplements authoritative index readings.
  • Always run calibration before case analysis — the theta* estimate depends on the training corpus and feature selection.
  • Compare CRP/HDP basis risk against at least one baseline method on the same data split before trusting the threshold recommendation.
  • Document the seed, n_iter, feature, and train_ratio used in every CRP run. These parameters materially affect the output.
  • If alpha-drift analysis shows a significant positive trend, consider narrowing the training window to more recent events for threshold recalibration.

Academic foundations

References

The CRP/HDP model draws on foundational work in Bayesian nonparametrics and applies it to a novel domain — parametric insurance trigger calibration. These references cover the mathematical foundations, the inference algorithms, and the insurance context.

References

Bayesian Density Estimation and Inference Using Mixtures

Escobar, M. D. & West, M. (1995). Journal of the American Statistical Association, 90(430), 577-588.

The foundational paper for the auxiliary variable method used to update the DP concentration parameter alpha. Our implementation follows their Algorithm 1.

Markov Chain Sampling Methods for Dirichlet Process Mixture Models

Neal, R. M. (2000). Journal of Computational and Graphical Statistics, 9(2), 249-265.

Defines Algorithm 3 (collapsed Gibbs for DPMM), which is the core inference algorithm used in the CRP sampler. Neal's comparison of algorithms informs our choice of collapsed over uncollapsed sampling.

Hierarchical Dirichlet Processes

Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Journal of the American Statistical Association, 101(476), 1566-1581.

The theoretical foundation for the HDP extension that models primary-to-secondary peril dependency. ParaEval uses a MAP-sharing approximation of the full HDP for tractability.

Machine Learning: A Probabilistic Perspective

Murphy, K. P. (2012). MIT Press.

Equation 4.215 provides the Student-t predictive distribution under NIW conjugacy, which is the core computation in our collapsed Gibbs sampler.

A generalized framework for designing open-source natural hazard parametric insurance

Schmid, T. et al. (2023). ETH Zurich / Environment Systems & Decisions.

The CLIMADA framework for parametric insurance design, which treats basis risk as a quantity to be systematically measured. ParaEval's evaluation metrics are aligned with this framing.

A Tutorial on Bayesian Nonparametric Models

Gershman, S. J. & Blei, D. M. (2012). Journal of Mathematical Psychology, 56(1), 1-12.

Accessible introduction to BNP models including the CRP, DP, and HDP. Useful for readers new to the nonparametric paradigm.
