Sampler Options and Mathematical Updates

This page documents the sampler backends available in SPIDER and the update equations they implement.

Available sampler backends

Configured at:

inference.sampler.backend

Supported values:

psgld
sghmc

Backend selection is handled in spider.optim.backends.create_sampler_backend.

Common runtime conventions

For both backends in SPIDER:

Drift uses the minibatch mean gradient \(\bar g\) scaled by the total number of observations \(N\):
- \(g_{\text{drift}} = N \,\bar g\)
The user-configured sampler learning rate is scaled per observation internally when the backend is created:
- \(\lambda_{\text{effective}} = \lambda_{\text{config}} / N\)
Noise is off in Phase 2, ramped in Phase 3, and active in Phase 4.
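
The two scaling conventions cancel in the drift term: multiplying the mean gradient by \(N\) and dividing the learning rate by \(N\) leaves a drift step of \(\lambda_{\text{config}}\,\bar g\), while the noise scale depends on the smaller effective rate. The sketch below illustrates this with made-up numbers. The helper name effective_scales is hypothetical and not part of SPIDER, and it assumes that \(\lambda\) in the update equations is the effective rate.

```python
import math

def effective_scales(lr_config, n_obs, mean_grad, temperature=1.0):
    """Illustrative helper (not SPIDER API): apply the N-scaling conventions."""
    g_drift = n_obs * mean_grad      # g_drift = N * g_bar
    lr_eff = lr_config / n_obs       # lambda_effective = lambda_config / N
    drift_step = lr_eff * g_drift    # N cancels: equals lr_config * mean_grad
    # Unpreconditioned noise scale, assuming lambda in sqrt(2*lambda*T) is lr_eff.
    noise_std = math.sqrt(2.0 * lr_eff * temperature)
    return g_drift, lr_eff, drift_step, noise_std

g_drift, lr_eff, drift_step, noise_std = effective_scales(
    lr_config=0.01, n_obs=1000, mean_grad=0.5
)
print(g_drift, lr_eff, drift_step, noise_std)
```

Under these assumptions, increasing \(N\) at a fixed configured rate leaves the drift step unchanged and shrinks the injected noise by a factor of \(1/\sqrt{N}\).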

Common controls:

temperature
beta, eps
freeze_preconditioner_sampling
grad_clip_norm

Preconditioner options

Configured at:

inference.sampler.preconditioning.enabled
inference.sampler.preconditioning.type

Supported preconditioner types:

rmsprop (diagonal)
lrd (low-rank plus diagonal)

When preconditioning is enabled, both drift and injected noise are scaled by the same metric.

pSGLD backend

Class:

spider.optim.sgld.pSGLD

With diagonal preconditioner \(G\), the implemented step is:

\[ \theta_{t+1} = \theta_t - \Big(\lambda\, G_t\, g_{\text{drift}} + \lambda\,\Gamma_t\Big) + \sqrt{2\,\lambda\,T}\,\sqrt{G_t}\,\xi_t \]

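where:

\(\lambda\) is the effective learning rate \(\lambda_{\text{effective}}\) and \(T\) is the temperature
\(\xi_t \sim \mathcal N(0, I)\)
\(\Gamma_t\) is an optional diagonal approximation to the pSGLD correction term (enabled by include_gamma)
\(G_t = (\epsilon + \sqrt{v_t})^{-1}\), applied elementwise, for RMSprop mode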

RMSprop second-moment update:

\[ v_t = \beta v_{t-1} + (1-\beta)\,\bar g_t^{\,2} \]
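
The pSGLD step and RMSprop update above can be sketched in NumPy as follows. This is a minimal illustration, not the code in spider.optim.sgld.pSGLD: the function name psgld_step and its arguments are hypothetical, the \(\Gamma_t\) correction is omitted, and noise_scale is an assumed stand-in for the phase-dependent noise gating (0 for off, 1 for fully active).

```python
import numpy as np

def psgld_step(theta, mean_grad, v, *, n_obs, lr, beta=0.99, eps=1e-5,
               temperature=1.0, noise_scale=1.0, rng=None):
    """One diagonal-RMSprop pSGLD step (sketch; Gamma correction omitted)."""
    rng = np.random.default_rng() if rng is None else rng
    v = beta * v + (1.0 - beta) * mean_grad**2   # second-moment update
    G = 1.0 / (eps + np.sqrt(v))                 # diagonal preconditioner
    g_drift = n_obs * mean_grad                  # drift gradient N * g_bar
    xi = rng.standard_normal(theta.shape)
    # Drift and noise share the same metric: G for drift, sqrt(G) for noise.
    noise = noise_scale * np.sqrt(2.0 * lr * temperature) * np.sqrt(G) * xi
    theta = theta - lr * G * g_drift + noise
    return theta, v

# Toy usage: per-observation loss 0.5 * ||theta||^2, so the mean gradient is theta.
rng = np.random.default_rng(0)
theta, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    theta, v = psgld_step(theta, theta.copy(), v, n_obs=100, lr=1e-4, rng=rng)
print(theta)
```

Here lr plays the role of the effective (already per-observation) learning rate.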

SGHMC backend

Class:

spider.optim.sghmc.SGHMC

With momentum \(p\), friction \(\alpha\) (sghmc_alpha), and diagonal \(G\):

\[ p_{t+1} = (1-\alpha)\,p_t - \lambda\,G_t\,g_{\text{drift}} + \sqrt{2\,\alpha\,\lambda\,T}\,\sqrt{G_t}\,\xi_t \]

\[ \theta_{t+1} = \theta_t + p_{t+1} \]

The RMSprop metric in SGHMC uses a bias-corrected second moment:

\[ v_t = \beta v_{t-1} + (1-\beta)\,\bar g_t^{\,2},\qquad \hat v_t = \frac{v_t}{1-\beta^t},\qquad G_t = (\epsilon + \sqrt{\hat v_t})^{-1} \]
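
A minimal NumPy sketch of the SGHMC update with the bias-corrected RMSprop metric is shown below. It is not the code in spider.optim.sghmc.SGHMC: the function name sghmc_step and its arguments are hypothetical, the step counter t is assumed to start at 1, and noise_scale is an assumed stand-in for the phase-dependent noise gating.

```python
import numpy as np

def sghmc_step(theta, p, mean_grad, v, t, *, n_obs, lr, alpha=0.1, beta=0.99,
               eps=1e-5, temperature=1.0, noise_scale=1.0, rng=None):
    """One diagonal-RMSprop SGHMC step (sketch); t is the 1-based step count."""
    rng = np.random.default_rng() if rng is None else rng
    v = beta * v + (1.0 - beta) * mean_grad**2   # second-moment update
    v_hat = v / (1.0 - beta**t)                  # bias correction
    G = 1.0 / (eps + np.sqrt(v_hat))             # diagonal metric
    g_drift = n_obs * mean_grad
    xi = rng.standard_normal(theta.shape)
    noise = noise_scale * np.sqrt(2.0 * alpha * lr * temperature) * np.sqrt(G) * xi
    p = (1.0 - alpha) * p - lr * G * g_drift + noise   # momentum with friction
    theta = theta + p
    return theta, p, v

# Toy usage: per-observation loss 0.5 * ||theta||^2, so the mean gradient is theta.
rng = np.random.default_rng(0)
theta, p, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 101):
    theta, p, v = sghmc_step(theta, p, theta.copy(), v, t,
                             n_obs=100, lr=1e-4, rng=rng)
print(theta)
```

At \(t = 1\) the bias correction gives \(\hat v_1 = \bar g_1^{\,2}\), so the first preconditioned drift has magnitude close to \(\lambda N\) in every coordinate rather than being inflated by a near-zero \(v_1\).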

LRD preconditioner math

Implemented in both samplers via _build_lrd_metric(...).

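Metric form, with diagonal part \(d_t\), low-rank basis \(U_t\) (one column per retained direction), and low-rank weights \(\lambda_t\) (not to be confused with the learning rate \(\lambda\)):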

\[ P_t = \operatorname{diag}(d_t) + U_t\,\operatorname{diag}(\lambda_t)\,U_t^\top \]

Drift preconditioning:

\[ P_t g = d_t \odot g + U_t\Big(\lambda_t \odot (U_t^\top g)\Big) \]

Noise is drawn with covariance proportional to \(P_t\):

diagonal part via \(\sqrt{d_t}\odot z_1\)
low-rank part via \(U_t(\sqrt{\lambda_t}\odot z_2)\)

with independent standard-normal \(z_1, z_2\).
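
The drift product and the noise construction above can be checked numerically without forming \(P_t\) explicitly. The sketch below uses hypothetical helper names (lrd_apply, lrd_noise) that are not SPIDER functions, and it omits the \(\sqrt{2\lambda T}\) noise scale so the sample covariance can be compared against \(P_t\) directly.

```python
import numpy as np

def lrd_apply(d, U, lam, g):
    """Compute P g = d * g + U (lam * (U^T g)) without building P."""
    return d * g + U @ (lam * (U.T @ g))

def lrd_noise(d, U, lam, rng, size):
    """Draw `size` vectors with covariance diag(d) + U diag(lam) U^T."""
    z1 = rng.standard_normal((size, d.shape[0]))    # diagonal part
    z2 = rng.standard_normal((size, lam.shape[0]))  # low-rank part
    return np.sqrt(d) * z1 + (np.sqrt(lam) * z2) @ U.T

rng = np.random.default_rng(0)
n, r = 6, 2
d = rng.uniform(0.5, 1.5, size=n)
U, _ = np.linalg.qr(rng.standard_normal((n, r)))    # orthonormal columns
lam = np.array([2.0, 0.5])
P = np.diag(d) + U @ np.diag(lam) @ U.T             # dense reference metric

g = rng.standard_normal(n)
samples = lrd_noise(d, U, lam, rng, size=20000)
emp_cov = samples.T @ samples / samples.shape[0]
print(np.max(np.abs(lrd_apply(d, U, lam, g) - P @ g)))
print(np.max(np.abs(emp_cov - P)))
```

Because \(z_1\) and \(z_2\) are independent, the two covariance contributions add, which is why the sum of the two parts has covariance \(P_t\).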

LRD subspace update modes

Configured under:

inference.sampler.preconditioning.lrd.mode

Modes:

svd: maintain a gradient buffer and periodically update \(U,\lambda\) from batched SVD.
oja: online Oja-style subspace updates with learning rate eta.
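
For intuition, the two modes can be sketched as below. This is an illustration of the general techniques, not SPIDER's implementation: the function names are hypothetical, the eigenvalue scaling in the SVD variant (squared singular values divided by the buffer length) is an assumption, and the Oja variant uses a plain rank-one update followed by QR re-orthonormalization.

```python
import numpy as np

def svd_subspace(grad_buffer, rank):
    """Top-`rank` directions of an (m, n) gradient buffer via thin SVD."""
    _, s, vt = np.linalg.svd(grad_buffer, full_matrices=False)
    U = vt[:rank].T                            # (n, rank), orthonormal columns
    lam = s[:rank]**2 / grad_buffer.shape[0]   # assumed scaling: second-moment eigenvalues
    return U, lam

def oja_step(U, g, eta):
    """One Oja-style subspace update, then QR to restore orthonormal columns."""
    U = U + eta * np.outer(g, g @ U)
    Q, _ = np.linalg.qr(U)
    return Q

rng = np.random.default_rng(0)
n = 5

def sample_grad():
    # Synthetic gradients whose variance is dominated by coordinate 0.
    g = 0.1 * rng.standard_normal(n)
    g[0] += 3.0 * rng.standard_normal()
    return g

buffer = np.stack([sample_grad() for _ in range(200)])
U_svd, lam_svd = svd_subspace(buffer, rank=1)

U_oja, _ = np.linalg.qr(rng.standard_normal((n, 1)))
for _ in range(500):
    U_oja = oja_step(U_oja, sample_grad(), eta=0.05)

print(abs(U_svd[0, 0]), abs(U_oja[0, 0]))
```

Both estimates should align with the dominant gradient direction up to sign. The SVD variant needs a stored buffer and a periodic decomposition, while the Oja variant does constant work per step with no buffer.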

Useful LRD knobs:

rank
mode (svd or oja)
update_every (svd mode)
buffer_size (svd mode)
eta/oja_eta (oja mode)
diag_floor

What is currently exposed in config

Production path (via backend factory) currently exposes:

psgld
sghmc
rmsprop or lrd preconditioning

There is an additional optimizer class in code (AdaptiveDriftSGLDAdam), but it is not currently selected by inference.sampler.backend.

Practical tuning interpretation

Increase eps to reduce extreme preconditioner amplification: in RMSprop mode \(G_t\) can never exceed \(1/\epsilon\).
Increase beta for smoother/slower preconditioner adaptation.
In SGHMC, lower sghmc_alpha to reduce damping; raise it to damp oscillations.
Use freeze_preconditioner_sampling=true for time-homogeneous Phase 4 kernels after adaptation is mature.
Use grad_clip_norm as a safety guardrail when exploring higher learning rates.
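
The effect of eps, beta, and gradient clipping can be made concrete with a few lines of arithmetic. The helpers below are hypothetical and only illustrate the formulas on this page. In particular, treating grad_clip_norm as a cap on the gradient's L2 norm is an assumption about its semantics.

```python
import math

def rmsprop_gain(v, eps):
    """Scalar preconditioner G = 1 / (eps + sqrt(v)); bounded above by 1 / eps."""
    return 1.0 / (eps + math.sqrt(v))

def ema_half_life(beta):
    """Number of steps for a past gradient's weight in v_t to halve."""
    return math.log(0.5) / math.log(beta)

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm (assumed clipping rule)."""
    norm = math.sqrt(sum(x * x for x in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in grad]

# With a vanishing second moment, eps alone sets the amplification.
print(rmsprop_gain(0.0, 1e-8), rmsprop_gain(0.0, 1e-3))
# Larger beta means a longer memory, so the metric adapts more slowly.
print(ema_half_life(0.9), ema_half_life(0.99))
# Clipping preserves direction and only shrinks over-long gradients.
print(clip_by_norm([3.0, 4.0], 1.0), clip_by_norm([0.3, 0.4], 1.0))
```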