Diagnostics Catalog

This page maps diagnostics settings to runtime behavior and key metrics.

Configuration location

Primary controls:

  • observability.wandb.*

  • observability.diagnostics.*

Important: the W&B metric group switches take effect only when the W&B runtime itself is enabled.

Core diagnostics switches

  • observability.diagnostics.pair_count_stats_enable

    • Enables pair-count stats diagnostics.

  • observability.diagnostics.sgld_log_gnoise

    • Enables additional gradient-noise logging paths.

  • observability.diagnostics.sgld_log_temperature

    • Enables additional temperature logging paths.

  • observability.diagnostics.display_precond_every

    • Controls cadence for some preconditioner console diagnostics.

  • observability.diagnostics.profile_shared_event_re

    • Enables shared-event profiling summaries in epoch logging.

W&B group gating

Configured under:

  • observability.diagnostics.wandb.enabled

  • observability.diagnostics.wandb.groups

Common group names seen in configs:

  • core

  • noise

  • sampler

  • ess_online

  • precond

  • fixed_eval

  • resid_rms

  • priors

  • latent_field

  • corr_error

In the current runtime, explicit group gating is applied mainly to:

  • core

  • noise

  • sampler

  • ess_online

Other names may be consumed by logger-side code paths depending on run mode and config.
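
The gating rule described above can be sketched as a small filter. This is an illustration only, not SPIDER's logger code: the function name filter_metrics_by_group, the "<group>/<name>" key convention, and the core/loss and precond/step_scale keys are assumptions made for the example.

```python
from typing import Dict, Iterable, Mapping


def filter_metrics_by_group(
    metrics: Mapping[str, float],
    wandb_enabled: bool,
    enabled_groups: Iterable[str],
) -> Dict[str, float]:
    """Return only the metrics whose group is enabled.

    Assumption: the group is the part of the key before the first "/".
    """
    # Group switches have no effect when W&B itself is disabled.
    if not wandb_enabled:
        return {}
    allowed = set(enabled_groups)
    return {
        key: value
        for key, value in metrics.items()
        if key.split("/", 1)[0] in allowed
    }


# Hypothetical metric keys, used only to exercise the filter.
example_metrics = {
    "core/loss": 1.0,
    "ess_online/ess_median": 50.0,
    "precond/step_scale": 2.0,
}
kept = filter_metrics_by_group(example_metrics, True, ["core", "ess_online"])
```

With the groups above, kept holds the core and ess_online entries and drops the precond entry. Passing wandb_enabled=False returns an empty dict regardless of the group list.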

Sampler diagnostics metrics

When the optimizer exposes diagnostics methods, SPIDER logs:

  • grad_noise_to_langevin_med

  • grad_noise_to_langevin_gm

  • grad_noise_to_langevin_med_dt

  • grad_noise_to_langevin_gm_dt

  • per-group variants (for example grad_noise_to_langevin_med_hypocenter)

  • t_eff_var_over_target

  • t_eff_var_over_target_gm

  • grad_noise_var_med

  • langevin_noise_var_med

Interpretation quick guide:

  • grad_noise_to_langevin_* compares minibatch gradient-noise variance to injected Langevin-noise variance.

  • t_eff_var_over_target near 1 suggests the sampler's effective temperature is close to its target; values far from 1 point to miscalibration.

See also Sampler Health Checks.
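
To make the ratio metrics concrete, here is a minimal sketch of how a variance ratio and its two summaries could be computed. It assumes that the _med and _gm suffixes stand for median and geometric mean, and that the ratio is taken per parameter; neither is confirmed by this page, and the function name noise_ratio_summary is hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def noise_ratio_summary(
    grad_noise_var: Sequence[float],
    langevin_noise_var: Sequence[float],
) -> Tuple[float, float]:
    """Summarize per-parameter gradient-noise / Langevin-noise variance ratios.

    Returns (median, geometric mean). All variances must be positive.
    """
    if len(grad_noise_var) != len(langevin_noise_var):
        raise ValueError("variance sequences must have the same length")
    ratios = [g / l for g, l in zip(grad_noise_var, langevin_noise_var)]
    # Assumed meaning of the "_med" and "_gm" suffixes.
    return statistics.median(ratios), statistics.geometric_mean(ratios)


# Ratios here are 0.25, 1.0 and 4.0.
med, gm = noise_ratio_summary([1.0, 4.0, 16.0], [4.0, 4.0, 4.0])
```

Under this reading, a summary well above 1 means minibatch gradient noise dominates the injected noise, and a summary well below 1 means the injected Langevin noise dominates.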

Shared-event whitening diagnostics metrics

Common metrics emitted during runs include:

  • shared_event_re/groups_pcg_mean

  • shared_event_re/groups_fallback_diag_mean

  • shared_event_re/max_rows_max

  • shared_event_re/max_nodes_max

  • shared_event_re/time_ms_sum

  • shared_event_re/whitening_solve_ms_sum

  • shared_event_re/whitening_cache_hits

  • shared_event_re/whitening_cache_misses

  • shared_event_re/whitening_pcg_pack_ms_sum

  • shared_event_re/whitening_pcg_kernel_ms_sum

  • shared_event_re/whitening_pcg_unpack_ms_sum

  • shared_event_re/whitening_pcg_leftover_ms_sum

When triaging convergence or fallback problems, check these first:

  • groups_pcg_mean

  • groups_fallback_diag_mean

  • max_rows_max, max_nodes_max

  • fallback reason counters in logs (rows_cap, nodes_cap, tau_zero)

See PCG Whitening: Convergence and Fallback Control.
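
A rough triage helper over the metrics listed above might look like the following. It assumes that groups_pcg_mean and groups_fallback_diag_mean count groups handled by each path and that those two paths cover all groups; this page does not state that, so treat the fractions as indicative only. The function names are hypothetical.

```python
from typing import Mapping

PREFIX = "shared_event_re/"


def fallback_fraction(metrics: Mapping[str, float]) -> float:
    """Share of groups that used the diagonal fallback instead of PCG."""
    pcg = metrics.get(PREFIX + "groups_pcg_mean", 0.0)
    fallback = metrics.get(PREFIX + "groups_fallback_diag_mean", 0.0)
    total = pcg + fallback
    # Report 0.0 rather than divide by zero when nothing was logged.
    return fallback / total if total > 0 else 0.0


def cache_hit_rate(metrics: Mapping[str, float]) -> float:
    """Whitening cache hits as a share of all cache lookups."""
    hits = metrics.get(PREFIX + "whitening_cache_hits", 0.0)
    misses = metrics.get(PREFIX + "whitening_cache_misses", 0.0)
    total = hits + misses
    return hits / total if total > 0 else 0.0


# Made-up values for illustration, not output from a real run.
logged = {
    PREFIX + "groups_pcg_mean": 9.0,
    PREFIX + "groups_fallback_diag_mean": 1.0,
    PREFIX + "whitening_cache_hits": 30.0,
    PREFIX + "whitening_cache_misses": 10.0,
}
```

A rising fallback fraction is a cue to inspect the rows_cap, nodes_cap, and tau_zero counters in the logs.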

Online ESS diagnostics (ess_online)

Configured under observability.diagnostics.ess_online:

  • enabled

  • every_n_samples

  • n_events

  • seed

  • window

  • max_lag

  • dims

Common derived metrics:

  • ess_online/ess_median

  • ess_online/ess_per_s_median

  • ess_online/samples_per_s

  • ess_online/elapsed_s
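
For orientation, the sketch below shows one common way to estimate effective sample size from a window of recent samples, with a max_lag cutoff. It is not necessarily the estimator SPIDER uses: the truncation at the first non-positive autocorrelation is a simplification chosen for the example, and the function names are hypothetical.

```python
import math
from typing import Sequence


def ess_from_window(samples: Sequence[float], max_lag: int) -> float:
    """Estimate ESS for one scalar trace as n / (1 + 2 * sum of autocorrelations)."""
    n = len(samples)
    if n < 2:
        return math.nan
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    var = sum(c * c for c in centered) / n
    if var == 0.0:
        # A constant trace has no defined autocorrelation.
        return math.nan
    tau = 1.0
    for lag in range(1, min(max_lag, n - 1) + 1):
        cov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = cov / var
        if rho <= 0.0:
            # Stop at the first non-positive autocorrelation.
            break
        tau += 2.0 * rho
    return n / tau


def ess_per_second(ess: float, elapsed_s: float) -> float:
    """ESS normalized by wall-clock time."""
    return ess / elapsed_s if elapsed_s > 0 else math.nan
```

A median over several such traces (for example over the sampled events and dims) would then give a single summary value comparable in spirit to ess_online/ess_median.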

Practical logging profile

For production sampling with useful observability and moderate overhead:

  • keep core, sampler, and noise enabled,

  • enable ess_online only when actively tuning chains,

  • keep profile_shared_event_re=false unless investigating whitening behavior,

  • increase display_precond_every if logs are too verbose.
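
The profile above could be written down roughly as follows, shown here as a nested Python dict with a dotted-path lookup. The shape of wandb.groups (a list of names) and the value 100 for display_precond_every are assumptions for illustration, and get_path is a hypothetical helper rather than part of SPIDER.

```python
from typing import Any, Mapping

# Sketch of a production profile; placeholder values are marked below.
production_profile = {
    "observability": {
        "diagnostics": {
            "wandb": {
                "enabled": True,
                # Assumed to be a list of group names.
                "groups": ["core", "sampler", "noise"],
            },
            "ess_online": {"enabled": False},
            "profile_shared_event_re": False,
            # Placeholder cadence; raise it if console logs are too verbose.
            "display_precond_every": 100,
        }
    }
}


def get_path(config: Mapping[str, Any], dotted: str) -> Any:
    """Resolve a dotted key such as 'observability.diagnostics.wandb.groups'."""
    node: Any = config
    for part in dotted.split("."):
        node = node[part]
    return node


groups = get_path(production_profile, "observability.diagnostics.wandb.groups")
```

When tuning chains, flip ess_online.enabled to True and add ess_online to the group list for the duration of the investigation.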