# PCG Whitening: Convergence and Fallback Control

This page focuses on one objective: ensure shared-event whitening solves on the PCG path with stable convergence and minimal diagonal fallback.

## Preconditions

- `model.likelihoods.sample.type = "correlated_gaussian"`
- `model.likelihoods.sample.shared_event_re.enabled = true`
- `model.likelihoods.sample.shared_event_re.model.group_by = "station_phase"`
- `model.likelihoods.sample.shared_event_re.solver.kind = "pcg"`

## Critical knobs

## Correlation model

- `shared_event_re.model.tau_s`: `[P, S]` RE scales.
  - Must be strictly positive to avoid `tau_zero` fallback.
  - If too small, model behaves near diagonal and fallback risk increases.

## Group caps (most important for fallback)

- `shared_event_re.limits.max_nodes`
- `shared_event_re.limits.max_rows`

If groups exceed either cap, those groups can fall back to diagonal.

## PCG convergence controls

- `shared_event_re.solver.max_iters`
- `shared_event_re.solver.min_iters`
- `shared_event_re.solver.tol`
- `shared_event_re.numerics.jitter0`
- `shared_event_re.numerics.jitter_max`

Use these to stabilize or tighten solves after cap issues are resolved.
## Safety behavior

- `shared_event_re.fallback.to_diag`
- `shared_event_re.fallback.abort_on_pcg_fallback`

Recommended workflow:

- tuning phase: `to_diag=true`, `abort_on_pcg_fallback=false`
- strict validation phase: `abort_on_pcg_fallback=true` to force immediate failure and inspect cause

## Convergence-first baseline config

```json
"shared_event_re": {
  "enabled": true,
  "model": {
    "group_by": "station_phase",
    "tau_s": [0.03, 0.04],
    "cluster": { "mode": "none", "k": 1 }
  },
  "limits": {
    "max_nodes": 50000,
    "max_rows": 2000000
  },
  "fallback": {
    "to_diag": true,
    "abort_on_pcg_fallback": false
  },
  "numerics": {
    "jitter0": 1e-8,
    "jitter_max": 1e-3
  },
  "solver": {
    "kind": "pcg",
    "max_iters": 80,
    "min_iters": 2,
    "tol": 1e-3,
    "batched": true
  }
}
```

## What to monitor

Use periodic whitening stats logs (`shared_event_re.logging.stats_log_every_epochs > 0`) and check:

- `shared_event_re/groups_pcg_mean`
- `shared_event_re/groups_fallback_diag_mean`
- `shared_event_re/max_rows_max`
- `shared_event_re/max_nodes_max`
- Console fallback reason counters: `rows_cap`, `nodes_cap`, `tau_zero`

Healthy target:

- `groups_pcg_mean > 0`
- `groups_fallback_diag_mean` close to zero
- fallback reasons remain zero in steady state

## Fallback elimination playbook

If fallback is nonzero:

1. Check fallback reason summary first.
2. If `rows_cap > 0`, raise `limits.max_rows`.
3. If `nodes_cap > 0`, raise `limits.max_nodes`.
4. If `tau_zero > 0`, fix `tau_s` values and config plumbing.
5. If fallback persists after caps/tau are fixed:
   - raise `solver.max_iters`,
   - increase `jitter0` gradually,
   - tighten or relax `tol` depending on instability vs. stagnation.

If all groups still fall back:

- temporarily set `abort_on_pcg_fallback=true`,
- rerun and inspect the first failing context,
- verify group sizes are within caps and `tau_s` is valid.

## Notes on optional solver knobs

`node_bin_edges`, warm starts, cache sizing, sparse-bin merge, and precompute controls are operational knobs. They can improve robustness in difficult workloads, but fallback elimination should first be solved with:

- valid `tau_s`,
- sufficient `max_nodes`/`max_rows`,
- stable `max_iters`/`tol`/`jitter0`.
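The ordering in the fallback elimination playbook earlier on this page can be sketched as a small triage function. This is an illustrative helper only: the name `triage_fallback` and its inputs are hypothetical, and it assumes the console reason counters (`rows_cap`, `nodes_cap`, `tau_zero`) and the `groups_fallback_diag_mean` metric have already been parsed from the logs into plain Python values.

```python
def triage_fallback(reasons: dict[str, int],
                    groups_fallback_diag_mean: float) -> list[str]:
    """Map fallback reason counters to playbook actions, in playbook order.

    Hypothetical helper: the reason names and config keys come from this page,
    the function itself is only a sketch.
    """
    actions: list[str] = []

    # Steps 2-4: cap and tau issues are addressed before any solver tuning.
    if reasons.get("rows_cap", 0) > 0:
        actions.append("raise limits.max_rows")
    if reasons.get("nodes_cap", 0) > 0:
        actions.append("raise limits.max_nodes")
    if reasons.get("tau_zero", 0) > 0:
        actions.append("fix tau_s values and config plumbing")

    # Step 5: fallback persists with no cap/tau reason, so tune the solver.
    if not actions and groups_fallback_diag_mean > 0:
        actions.append(
            "raise solver.max_iters, increase jitter0 gradually, adjust tol"
        )

    return actions
```

An empty list corresponds to the healthy target: no fallback reasons and a `groups_fallback_diag_mean` of zero. Solver tuning is suggested only when no cap or `tau_s` reason is present, which mirrors the playbook's advice to resolve those first.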