Sampler Options and Mathematical Updates

This page documents the sampler backends available in SPIDER and the update equations each backend implements.

Available sampler backends

Configured at:

  • inference.sampler.backend

Supported values:

  • psgld

  • sghmc

Backend selection is handled in spider.optim.backends.create_sampler_backend.

Common runtime conventions

For both backends in SPIDER:

  • The drift term is the minibatch mean gradient \(\bar g\) scaled by the total number of observations \(N\):

    • \(g_{\text{drift}} = N \,\bar g\)

  • The user-configured sampler learning rate is divided by the number of observations \(N\) when the backend is created:

    • \(\lambda_{\text{effective}} = \lambda_{\text{config}} / N\)

  • Noise is off in Phase 2, ramped in Phase 3, and active in Phase 4.
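
The two scalings above interact in a simple way: in the drift term the factors of \(N\) cancel, since \(\lambda_{\text{effective}}\, g_{\text{drift}} = \lambda_{\text{config}}\, \bar g\), while a noise scale of the form \(\sqrt{2\,\lambda_{\text{effective}}\,T}\) shrinks like \(1/\sqrt{N}\). The following is a minimal sketch of that arithmetic in plain Python. The function names and the numeric values are illustrative assumptions, not SPIDER code or SPIDER defaults.

```python
import math


def scaled_drift(mean_grad, n_obs):
    """g_drift = N * g_bar, applied elementwise."""
    return [n_obs * g for g in mean_grad]


def effective_lr(lr_config, n_obs):
    """lambda_effective = lambda_config / N."""
    return lr_config / n_obs


# Illustrative values only (assumptions, not SPIDER defaults).
n_obs = 1000
lr_config = 0.5
temperature = 1.0
mean_grad = [0.2, -0.4]

g_drift = scaled_drift(mean_grad, n_obs)
lr_eff = effective_lr(lr_config, n_obs)

# The drift step lr_eff * g_drift equals lr_config * g_bar: N cancels.
drift_step = [lr_eff * g for g in g_drift]

# A noise scale sqrt(2 * lr_eff * T) shrinks like 1 / sqrt(N).
noise_scale = math.sqrt(2.0 * lr_eff * temperature)
```

Under these assumptions, changing \(N\) at a fixed configured learning rate leaves the drift step unchanged and only rescales the injected noise.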

Common controls:

  • temperature

  • beta, eps

  • freeze_preconditioner_sampling

  • grad_clip_norm

Preconditioner options

Configured at:

  • inference.sampler.preconditioning.enabled

  • inference.sampler.preconditioning.type

Supported preconditioner types:

  • rmsprop (diagonal)

  • lrd (low-rank plus diagonal)

When preconditioning is enabled, the drift and the injected noise use the same metric: the drift is multiplied by the metric, and the noise is drawn with covariance proportional to it.

pSGLD backend

Class:

  • spider.optim.sgld.pSGLD

With diagonal preconditioner \(G_t\), learning rate \(\lambda\), and temperature \(T\), the implemented step is:

\[ \theta_{t+1} = \theta_t - \Big(\lambda\, G_t\, g_{\text{drift}} + \lambda\,\Gamma_t\Big) + \sqrt{2\,\lambda\,T}\,\sqrt{G_t}\,\xi_t \]

where:

  • \(\xi_t \sim \mathcal N(0, I)\)

  • \(\Gamma_t\) is an optional diagonal approximation to the pSGLD correction term (enabled by include_gamma)

  • \(G_t = (\epsilon + \sqrt{v_t})^{-1}\) for RMSprop mode

RMSprop second-moment update:

\[ v_t = \beta v_{t-1} + (1-\beta)\,\bar g_t^{\,2} \]
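
The pSGLD step and the RMSprop update above can be sketched in a few lines of plain Python. This is an illustration of the documented equations, not the code in spider.optim.sgld.pSGLD: the function names and argument order are hypothetical, the \(\Gamma_t\) correction is omitted (as if include_gamma were off), and lr is assumed to be the \(\lambda\) that appears in the step equation.

```python
import math
import random


def rmsprop_update(v, mean_grad, beta):
    """v_t = beta * v_{t-1} + (1 - beta) * g_bar^2, elementwise."""
    return [beta * vi + (1.0 - beta) * g * g for vi, g in zip(v, mean_grad)]


def psgld_step(theta, v, mean_grad, n_obs, lr, temperature, beta, eps, rng):
    """One diagonal pSGLD step with the Gamma correction omitted.

    Hypothetical helper: mirrors the documented equations only.
    """
    v = rmsprop_update(v, mean_grad, beta)
    new_theta = []
    for th, vi, g in zip(theta, v, mean_grad):
        G = 1.0 / (eps + math.sqrt(vi))   # RMSprop metric G_t
        g_drift = n_obs * g               # drift gradient N * g_bar
        xi = rng.gauss(0.0, 1.0)          # standard-normal noise
        noise = math.sqrt(2.0 * lr * temperature) * math.sqrt(G) * xi
        new_theta.append(th - lr * G * g_drift + noise)
    return new_theta, v


# Usage with illustrative values (assumptions, not SPIDER defaults).
rng = random.Random(0)
theta, v = [1.0, -1.0], [0.0, 0.0]
theta, v = psgld_step(theta, v, mean_grad=[2.0, -0.5], n_obs=10,
                      lr=0.01, temperature=1.0, beta=0.9, eps=1e-8, rng=rng)
```

Setting temperature to 0 removes the noise term, which makes the drift part of the step easy to check by hand.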

SGHMC backend

Class:

  • spider.optim.sghmc.SGHMC

With momentum \(p\), friction \(\alpha\), and diagonal \(G\):

\[ p_{t+1} = (1-\alpha)\,p_t - \lambda\,G_t\,g_{\text{drift}} + \sqrt{2\,\alpha\,\lambda\,T}\,\sqrt{G_t}\,\xi_t \]
\[ \theta_{t+1} = \theta_t + p_{t+1} \]

RMSprop metric in SGHMC uses bias-corrected second moment:

\[ v_t = \beta v_{t-1} + (1-\beta)\,\bar g_t^{\,2},\qquad \hat v_t = \frac{v_t}{1-\beta^t},\qquad G_t = (\epsilon + \sqrt{\hat v_t})^{-1} \]
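
The SGHMC update with the bias-corrected RMSprop metric can be sketched the same way. This is a hedged illustration of the equations above rather than the code in spider.optim.sghmc.SGHMC: sghmc_step and its arguments are hypothetical names, step is assumed to be the 1-based iteration count \(t\), and lr is assumed to be the \(\lambda\) of the momentum equation.

```python
import math
import random


def sghmc_step(theta, p, v, mean_grad, step, n_obs, lr, alpha,
               temperature, beta, eps, rng):
    """One diagonal SGHMC step with bias-corrected RMSprop metric.

    Hypothetical helper: mirrors the documented equations only.
    """
    new_theta, new_p, new_v = [], [], []
    for th, pi, vi, g in zip(theta, p, v, mean_grad):
        vi = beta * vi + (1.0 - beta) * g * g      # second moment v_t
        v_hat = vi / (1.0 - beta ** step)          # bias correction
        G = 1.0 / (eps + math.sqrt(v_hat))         # metric G_t
        xi = rng.gauss(0.0, 1.0)
        noise = math.sqrt(2.0 * alpha * lr * temperature) * math.sqrt(G) * xi
        pi = (1.0 - alpha) * pi - lr * G * (n_obs * g) + noise
        new_p.append(pi)
        new_theta.append(th + pi)                  # theta_{t+1} = theta_t + p_{t+1}
        new_v.append(vi)
    return new_theta, new_p, new_v


# Usage with illustrative values (assumptions, not SPIDER defaults).
rng = random.Random(0)
theta, p, v = [0.0], [0.0], [0.0]
for t in range(1, 4):
    theta, p, v = sghmc_step(theta, p, v, [2.0], t, n_obs=10, lr=0.01,
                             alpha=0.1, temperature=1.0, beta=0.9,
                             eps=1e-8, rng=rng)
```

One consequence of the bias correction is visible at \(t = 1\) with \(v_0 = 0\): \(\hat v_1 = \bar g_1^{\,2}\), so the metric is not inflated by the zero initialization of \(v\).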

LRD preconditioner math

Implemented in both samplers via _build_lrd_metric(...).

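Metric form, where \(d_t\) is the diagonal, \(U_t\) the low-rank basis, and \(\lambda_t\) the vector of low-rank weights (not the learning rate \(\lambda\)):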

\[ P_t = \operatorname{diag}(d_t) + U_t\,\operatorname{diag}(\lambda_t)\,U_t^\top \]

Drift preconditioning:

\[ P_t g = d_t \odot g + U_t\Big(\lambda_t \odot (U_t^\top g)\Big) \]

Noise is drawn with covariance proportional to \(P_t\):

  • diagonal part via \(\sqrt{d_t}\odot z_1\)

  • low-rank part via \(U_t(\sqrt{\lambda_t}\odot z_2)\)

with independent standard-normal \(z_1, z_2\).
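
Both operations avoid forming the dense matrix \(P_t\). The sketch below illustrates them in plain Python and is not the body of _build_lrd_metric(...): lrd_apply and lrd_noise are hypothetical names, and \(U\) is assumed to be stored as a list of rows with one column per rank. The noise has covariance \(\operatorname{diag}(d) + U\operatorname{diag}(\lambda)U^\top = P\) because \(z_1\) and \(z_2\) are independent.

```python
import math
import random


def lrd_apply(d, U, lam, g):
    """Compute P g = d * g + U (lam * (U^T g)) without forming P."""
    n, r = len(d), len(lam)
    proj = [sum(U[i][k] * g[i] for i in range(n)) for k in range(r)]  # U^T g
    return [d[i] * g[i] + sum(U[i][k] * lam[k] * proj[k] for k in range(r))
            for i in range(n)]


def lrd_noise(d, U, lam, rng):
    """Draw one sample with covariance diag(d) + U diag(lam) U^T."""
    n, r = len(d), len(lam)
    z1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(r)]
    return [math.sqrt(d[i]) * z1[i]
            + sum(U[i][k] * math.sqrt(lam[k]) * z2[k] for k in range(r))
            for i in range(n)]


# Rank-1 example in two dimensions (illustrative values).
d = [1.0, 2.0]
U = [[1.0], [0.0]]
lam = [3.0]

# Dense check: P = [[4, 0], [0, 2]], so P g = [4, 2] for g = [1, 1].
pg = lrd_apply(d, U, lam, [1.0, 1.0])

sample = lrd_noise(d, U, lam, random.Random(0))
```

The sample still has to be multiplied by the sampler's scalar noise factor before it is added to the parameters.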

LRD subspace update modes

Configured under:

  • inference.sampler.preconditioning.lrd.mode

Modes:

  • svd: maintain a gradient buffer and periodically update \(U,\lambda\) from batched SVD.

  • oja: online Oja-style subspace updates with learning rate eta.
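
For intuition about the oja mode, the classic Oja rule for a single direction can be sketched as follows. This page does not spell out SPIDER's exact update, so the code is the textbook rank-1 rule with renormalization, under a hypothetical function name; how the weights \(\lambda\) are estimated and how higher ranks are handled are not shown.

```python
import math
import random


def oja_rank1_update(u, g, eta):
    """Textbook Oja step: u <- normalize(u + eta * y * (g - y * u)), y = u.g"""
    y = sum(ui * gi for ui, gi in zip(u, g))
    u = [ui + eta * y * (gi - y * ui) for ui, gi in zip(u, g)]
    norm = math.sqrt(sum(ui * ui for ui in u))
    return [ui / norm for ui in u]


# Synthetic gradients whose variance is dominated by the first coordinate.
rng = random.Random(0)
u = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]
for _ in range(2000):
    g = [3.0 * rng.gauss(0.0, 1.0), 0.3 * rng.gauss(0.0, 1.0)]
    u = oja_rank1_update(u, g, eta=0.01)

# u now points (up to sign) close to the dominant direction [1, 0].
```

A larger eta tracks changes in the gradient subspace faster but makes the estimated direction noisier.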

Useful LRD knobs:

  • rank

  • mode (svd or oja)

  • update_every (svd mode)

  • buffer_size (svd mode)

  • eta/oja_eta (oja mode)

  • diag_floor

What is currently exposed in config

Production path (via backend factory) currently exposes:

  • psgld

  • sghmc

  • rmsprop or lrd preconditioning

There is an additional optimizer class in code (AdaptiveDriftSGLDAdam), but it is not currently selected by inference.sampler.backend.
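
The exposed options can be summarized as a small validation sketch. It is not part of SPIDER: check_sampler_config is a hypothetical helper, the nested-dict layout is assumed to mirror the dotted config keys on this page, and treating a missing enabled flag as false is also an assumption.

```python
SUPPORTED_BACKENDS = ("psgld", "sghmc")
SUPPORTED_PRECONDITIONERS = ("rmsprop", "lrd")
SUPPORTED_LRD_MODES = ("svd", "oja")


def check_sampler_config(config):
    """Validate the documented sampler options; return the backend name."""
    sampler = config["inference"]["sampler"]
    backend = sampler["backend"]
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported sampler backend: {backend!r}")

    pre = sampler.get("preconditioning", {})
    if pre.get("enabled", False):  # default of False is an assumption
        ptype = pre.get("type")
        if ptype not in SUPPORTED_PRECONDITIONERS:
            raise ValueError(f"unsupported preconditioner type: {ptype!r}")
        mode = pre.get("lrd", {}).get("mode")
        if ptype == "lrd" and mode is not None and mode not in SUPPORTED_LRD_MODES:
            raise ValueError(f"unsupported lrd mode: {mode!r}")
    return backend


# Example config using only keys documented on this page.
config = {
    "inference": {
        "sampler": {
            "backend": "sghmc",
            "preconditioning": {
                "enabled": True,
                "type": "lrd",
                "lrd": {"mode": "oja"},
            },
        }
    }
}
backend = check_sampler_config(config)
```

A value such as AdaptiveDriftSGLDAdam would be rejected by this check, matching the note above that it is not selectable through inference.sampler.backend.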

Practical tuning interpretation

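  • Increase eps to cap preconditioner amplification: in RMSprop mode \(G_t \le 1/\epsilon\), so a larger eps limits the step size in coordinates whose second moment is near zero.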

  • Increase beta for smoother/slower preconditioner adaptation.

  • In SGHMC, lower sghmc_alpha to reduce damping; raise it to damp oscillations.

  • Use freeze_preconditioner_sampling=true for time-homogeneous Phase 4 kernels after adaptation is mature.

  • Use grad_clip_norm as a safety guardrail when exploring higher learning rates.
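
The first two points follow directly from the RMSprop formulas on this page and can be checked numerically. The sketch below uses hypothetical helper names and illustrative values; the averaging window \(1/(1-\beta)\) is the usual rule of thumb for an exponential moving average, not a SPIDER setting.

```python
import math


def rmsprop_metric(v, eps):
    """G = 1 / (eps + sqrt(v)); bounded above by 1 / eps."""
    return 1.0 / (eps + math.sqrt(v))


def ema_window(beta):
    """Approximate number of steps averaged by an EMA with decay beta."""
    return 1.0 / (1.0 - beta)


# A coordinate with an almost-zero second moment (illustrative value).
tiny_v = 1e-12

g_small_eps = rmsprop_metric(tiny_v, eps=1e-8)  # amplification near 1e6
g_large_eps = rmsprop_metric(tiny_v, eps=1e-3)  # amplification below 1e3

# Raising beta from 0.9 to 0.99 lengthens the averaging window tenfold.
window_fast = ema_window(0.9)
window_slow = ema_window(0.99)
```

In this example, raising eps from 1e-8 to 1e-3 cuts the amplification on the near-zero coordinate by roughly three orders of magnitude.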

See also: