PCG Whitening: Convergence and Fallback Control

This page focuses on one objective: ensure shared-event whitening solves on the PCG path with stable convergence and minimal diagonal fallback.

Preconditions

model.likelihoods.sample.type = "correlated_gaussian"
model.likelihoods.sample.shared_event_re.enabled = true
model.likelihoods.sample.shared_event_re.model.group_by = "station_phase"
model.likelihoods.sample.shared_event_re.solver.kind = "pcg"

Critical knobs

Correlation model

shared_event_re.model.tau_s: [P, S] RE scales.
- Must be strictly positive to avoid tau_zero fallback.
- If too small, model behaves near diagonal and fallback risk increases.

Group caps (most important for fallback)

shared_event_re.limits.max_nodes
shared_event_re.limits.max_rows

If groups exceed either cap, those groups can fall back to diagonal.

PCG convergence controls

shared_event_re.solver.max_iters
shared_event_re.solver.min_iters
shared_event_re.solver.tol
shared_event_re.numerics.jitter0
shared_event_re.numerics.jitter_max

Use these to stabilize or tighten solves after cap issues are resolved.

Safety behavior

shared_event_re.fallback.to_diag
shared_event_re.fallback.abort_on_pcg_fallback

Recommended workflow:

tuning phase: to_diag=true, abort_on_pcg_fallback=false
strict validation phase: abort_on_pcg_fallback=true to force immediate failure and inspect cause

Convergence-first baseline config

"shared_event_re": {
  "enabled": true,
  "model": {
    "group_by": "station_phase",
    "tau_s": [0.03, 0.04],
    "cluster": { "mode": "none", "k": 1 }
  },
  "limits": {
    "max_nodes": 50000,
    "max_rows": 2000000
  },
  "fallback": {
    "to_diag": true,
    "abort_on_pcg_fallback": false
  },
  "numerics": {
    "jitter0": 1e-8,
    "jitter_max": 1e-3
  },
  "solver": {
    "kind": "pcg",
    "max_iters": 80,
    "min_iters": 2,
    "tol": 1e-3,
    "batched": true
  }
}

What to monitor

Use periodic whitening stats logs (shared_event_re.logging.stats_log_every_epochs > 0) and check:

shared_event_re/groups_pcg_mean
shared_event_re/groups_fallback_diag_mean
shared_event_re/max_rows_max
shared_event_re/max_nodes_max
Console fallback reason counters: rows_cap, nodes_cap, tau_zero

Healthy target:

groups_pcg_mean > 0
groups_fallback_diag_mean close to zero
fallback reasons remain zero in steady state

Fallback elimination playbook

If fallback is nonzero:

Check fallback reason summary first.
If rows_cap > 0, raise limits.max_rows.
If nodes_cap > 0, raise limits.max_nodes.
If tau_zero > 0, fix tau_s values and config plumbing.
If fallback persists after caps/tau are fixed:
- raise solver.max_iters,
- increase jitter0 gradually,
- tighten or relax tol depending on instability vs. stagnation.

If all groups still fall back:

temporarily set abort_on_pcg_fallback=true,
rerun and inspect the first failing context,
verify group sizes are within caps and tau_s is valid.

Notes on optional solver knobs

node_bin_edges, warm starts, cache sizing, sparse-bin merge, and precompute controls are operational knobs. They can improve robustness in difficult workloads, but fallback elimination should first be solved with:

valid tau_s,
sufficient max_nodes/max_rows,
stable max_iters/tol/jitter0.