PCG Whitening: Convergence and Fallback Control
This page focuses on one objective: keeping shared-event whitening solves on the PCG path, with stable convergence and minimal diagonal fallback.
Preconditions
- model.likelihoods.sample.type = "correlated_gaussian"
- model.likelihoods.sample.shared_event_re.enabled = true
- model.likelihoods.sample.shared_event_re.model.group_by = "station_phase"
- model.likelihoods.sample.shared_event_re.solver.kind = "pcg"
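The preconditions above can be checked mechanically before a run. The following is a minimal sketch, assuming the configuration has been loaded into a nested Python dict whose nesting mirrors the dotted paths. The helper names `lookup` and `unmet_preconditions` are hypothetical and not part of the project.

```python
from typing import Any

# Dotted keys and required values, copied from the preconditions above.
REQUIRED = {
    "model.likelihoods.sample.type": "correlated_gaussian",
    "model.likelihoods.sample.shared_event_re.enabled": True,
    "model.likelihoods.sample.shared_event_re.model.group_by": "station_phase",
    "model.likelihoods.sample.shared_event_re.solver.kind": "pcg",
}

_MISSING = object()  # sentinel for "key not present"


def lookup(config: dict, dotted_key: str) -> Any:
    """Walk a nested dict along a dotted path; return _MISSING if absent."""
    node: Any = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return _MISSING
        node = node[part]
    return node


def unmet_preconditions(config: dict) -> list:
    """Return one message per precondition the config does not satisfy."""
    problems = []
    for key, expected in REQUIRED.items():
        actual = lookup(config, key)
        if actual is _MISSING:
            problems.append(f"{key} is missing (expected {expected!r})")
        # Compare types as well, so that 1 is not accepted in place of True.
        elif type(actual) is not type(expected) or actual != expected:
            problems.append(f"{key} = {actual!r} (expected {expected!r})")
    return problems
```

An empty result means all four preconditions hold. Any message in the list names the key to fix before tuning the knobs below.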
Critical knobs
Correlation model
- shared_event_re.model.tau_s: [P, S] RE scales.
  - Must be strictly positive to avoid tau_zero fallback.
  - If too small, the model behaves nearly diagonally and fallback risk increases.
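A small validation step can catch an invalid tau_s before it turns into tau_zero fallback. This is a sketch that only enforces what the text states (two entries, strictly positive). The finiteness check and the helper name `tau_s_errors` are assumptions added for illustration. No threshold for "too small" is given on this page, so none is checked here.

```python
import math


def tau_s_errors(tau_s):
    """Return reasons why a value would not be a valid [P, S] tau_s pair."""
    if not isinstance(tau_s, (list, tuple)) or len(tau_s) != 2:
        return ["tau_s must be a two-element [P, S] sequence"]
    errors = []
    for phase, value in zip(("P", "S"), tau_s):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"tau_s[{phase}] is not a number: {value!r}")
        elif not math.isfinite(value) or value <= 0:
            errors.append(f"tau_s[{phase}] must be finite and > 0, got {value!r}")
    return errors
```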
Group caps (most important for fallback)
- shared_event_re.limits.max_nodes
- shared_event_re.limits.max_rows
If groups exceed either cap, those groups can fall back to diagonal.
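If group sizes are known ahead of time, the cap rule can be checked offline. The sketch below assumes that "exceed" means strictly greater than the cap and that each group is described by a hypothetical `(n_nodes, n_rows)` pair. The reason names reuse the console counters mentioned on this page, while the order in which the two caps are tested is an assumption.

```python
def cap_fallback_reason(n_nodes, n_rows, max_nodes, max_rows):
    """Return the first cap a group exceeds, or None if it is within both."""
    if n_rows > max_rows:
        return "rows_cap"
    if n_nodes > max_nodes:
        return "nodes_cap"
    return None


def groups_over_caps(groups, limits):
    """Map group name -> cap reason for every group that exceeds a cap.

    groups: {name: (n_nodes, n_rows)}  (hypothetical layout)
    limits: {"max_nodes": int, "max_rows": int}
    """
    flagged = {}
    for name, (n_nodes, n_rows) in groups.items():
        reason = cap_fallback_reason(
            n_nodes, n_rows, limits["max_nodes"], limits["max_rows"]
        )
        if reason is not None:
            flagged[name] = reason
    return flagged
```

Any group that appears in the result is a candidate for diagonal fallback, which suggests raising the corresponding cap before touching solver settings.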
PCG convergence controls
- shared_event_re.solver.max_iters
- shared_event_re.solver.min_iters
- shared_event_re.solver.tol
- shared_event_re.numerics.jitter0
- shared_event_re.numerics.jitter_max
Use these to stabilize or tighten solves after cap issues are resolved.
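To make the role of these knobs concrete, here is a generic textbook Jacobi-preconditioned conjugate gradient in pure Python. It is not the project's solver. The choice of preconditioner, the relative-residual stopping rule, and the use of a single fixed `jitter` on the diagonal are all assumptions made for illustration. How jitter0 escalates toward jitter_max is not described on this page and is not modeled.

```python
def pcg_jacobi(A, b, tol=1e-3, max_iters=80, min_iters=2, jitter=1e-8):
    """Solve (A + jitter*I) x = b with Jacobi-preconditioned CG.

    A is a dense symmetric positive-definite matrix given as a list of rows.
    Returns (x, iterations_used, converged).
    """
    n = len(b)
    # Diagonal jitter improves conditioning of a nearly singular system.
    A_reg = [[A[i][j] + (jitter if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    inv_diag = [1.0 / A_reg[i][i] for i in range(n)]  # Jacobi preconditioner

    def matvec(v):
        return [sum(A_reg[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    x = [0.0] * n
    r = list(b)                                # residual for x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5 or 1.0           # avoid dividing by zero

    for k in range(1, max_iters + 1):
        Ap = matvec(p)
        pAp = dot(p, Ap)
        if pAp <= 0.0:
            # Zero search direction (already solved) or loss of positivity.
            return x, k - 1, dot(r, r) ** 0.5 <= tol * b_norm
        alpha = rz / pAp
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        # min_iters forces a few iterations even if the residual looks small.
        if k >= min_iters and dot(r, r) ** 0.5 <= tol * b_norm:
            return x, k, True
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iters, False                 # hit the iteration cap
```

In this sketch, a `False` third return value corresponds to running out of iterations, which is the situation where raising max_iters, loosening tol, or adding jitter would be the natural levers.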
Safety behavior
- shared_event_re.fallback.to_diag
- shared_event_re.fallback.abort_on_pcg_fallback
Recommended workflow:
- tuning phase: to_diag=true, abort_on_pcg_fallback=false
- strict validation phase: abort_on_pcg_fallback=true, to force immediate failure and inspect the cause
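The two phases differ only in the fallback settings, so switching between them can be scripted. This is a sketch under stated assumptions: the phase names `"tuning"` and `"strict_validation"` and the helper `apply_phase` are hypothetical, and the strict phase leaves to_diag untouched because the workflow above does not specify it.

```python
import copy

# Fallback overrides per phase, taken from the recommended workflow above.
PHASE_OVERRIDES = {
    "tuning": {"to_diag": True, "abort_on_pcg_fallback": False},
    "strict_validation": {"abort_on_pcg_fallback": True},
}


def apply_phase(shared_event_re_cfg, phase):
    """Return a copy of the shared_event_re config with phase overrides."""
    if phase not in PHASE_OVERRIDES:
        raise ValueError(f"unknown phase: {phase!r}")
    cfg = copy.deepcopy(shared_event_re_cfg)  # never mutate the caller's dict
    cfg.setdefault("fallback", {}).update(PHASE_OVERRIDES[phase])
    return cfg
```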
Convergence-first baseline config
"shared_event_re": {
  "enabled": true,
  "model": {
    "group_by": "station_phase",
    "tau_s": [0.03, 0.04],
    "cluster": { "mode": "none", "k": 1 }
  },
  "limits": {
    "max_nodes": 50000,
    "max_rows": 2000000
  },
  "fallback": {
    "to_diag": true,
    "abort_on_pcg_fallback": false
  },
  "numerics": {
    "jitter0": 1e-8,
    "jitter_max": 1e-3
  },
  "solver": {
    "kind": "pcg",
    "max_iters": 80,
    "min_iters": 2,
    "tol": 1e-3,
    "batched": true
  }
}
What to monitor
Use periodic whitening stats logs (shared_event_re.logging.stats_log_every_epochs > 0) and check:
- shared_event_re/groups_pcg_mean
- shared_event_re/groups_fallback_diag_mean
- shared_event_re/max_rows_max
- shared_event_re/max_nodes_max
- Console fallback reason counters: rows_cap, nodes_cap, tau_zero
Healthy target:
- groups_pcg_mean > 0
- groups_fallback_diag_mean close to zero
- fallback reasons remain zero in steady state
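The healthy targets can be turned into an automated check on logged values. The sketch below assumes the stats and the console reason counters have already been collected into plain dicts, since how they are read from the logs is not covered here. The `diag_eps` threshold for "close to zero" is a hypothetical value chosen for illustration, not one from this page.

```python
REASONS = ("rows_cap", "nodes_cap", "tau_zero")


def health_issues(stats, reason_counts, diag_eps=1e-6):
    """Return a list of deviations from the healthy targets (empty if healthy).

    stats: {"shared_event_re/groups_pcg_mean": float, ...}
    reason_counts: {"rows_cap": int, "nodes_cap": int, "tau_zero": int}
    diag_eps: hypothetical tolerance for "close to zero".
    """
    issues = []
    if stats.get("shared_event_re/groups_pcg_mean", 0.0) <= 0.0:
        issues.append("no groups are solved on the PCG path")
    if stats.get("shared_event_re/groups_fallback_diag_mean", 0.0) > diag_eps:
        issues.append("diagonal fallback is not close to zero")
    for reason in REASONS:
        count = reason_counts.get(reason, 0)
        if count > 0:
            issues.append(f"fallback reason {reason} = {count}")
    return issues
```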
Fallback elimination playbook
If fallback is nonzero:
- Check the fallback reason summary first.
  - If rows_cap > 0, raise limits.max_rows.
  - If nodes_cap > 0, raise limits.max_nodes.
  - If tau_zero > 0, fix tau_s values and config plumbing.
- If fallback persists after caps/tau are fixed:
  - raise solver.max_iters,
  - increase jitter0 gradually,
  - tighten or relax tol depending on instability vs. stagnation.
- If all groups still fall back:
  - temporarily set abort_on_pcg_fallback=true,
  - rerun and inspect the first failing context,
  - verify group sizes are within caps and tau_s is valid.
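The ordering of the playbook above can be sketched as a small triage function: reason counters are handled first, and solver knobs are suggested only when no cap or tau reason is reported. The function name `next_actions` and its inputs are hypothetical. It only restates the playbook and does not inspect any real solver state.

```python
def next_actions(reason_counts, fallback_diag_mean):
    """Return suggested playbook actions, most urgent first.

    reason_counts: {"rows_cap": int, "nodes_cap": int, "tau_zero": int}
    fallback_diag_mean: logged groups_fallback_diag_mean value.
    """
    if fallback_diag_mean <= 0:
        return []  # no fallback observed, nothing to do
    actions = []
    if reason_counts.get("rows_cap", 0) > 0:
        actions.append("raise limits.max_rows")
    if reason_counts.get("nodes_cap", 0) > 0:
        actions.append("raise limits.max_nodes")
    if reason_counts.get("tau_zero", 0) > 0:
        actions.append("fix tau_s values and config plumbing")
    if not actions:
        # Caps and tau look fine, so move on to the convergence knobs.
        actions = [
            "raise solver.max_iters",
            "increase numerics.jitter0 gradually",
            "tighten or relax solver.tol",
        ]
    return actions
```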
Notes on optional solver knobs
node_bin_edges, warm starts, cache sizing, sparse-bin merge, and precompute controls are operational knobs. They can improve robustness in difficult workloads, but fallback elimination should first be solved with:
- valid tau_s,
- sufficient max_nodes/max_rows,
- stable max_iters/tol/jitter0.