Strict Online Variational Bayesian Filtering: What Survived The Stress Tests

A synthesis of strict online nonlinear filtering experiments: the K2 FIVO bridge survived, while trajectory objectives, couplings, flows, and direct predictive pressure exposed calibration failures.

Series: VBF Experiments, May 2026


The last few experiment branches all asked the same question from different angles:

Can a learned Bayesian filter stay strictly online and still carry a useful nonlinear filtering density under stress?

The answer is now sharp enough to write down. We can train robust strict online filters, but the winning mechanism is not simply more posterior capacity, more trajectory pressure, learned couplings, or scalar normalizing flows. The current robust anchor is the K2 FIVO bridge row. The next scientific step is narrower: calibrate the local Bayes projection and the predictive normalizer together.

The source reports are in the public dwrtz/ml-examples repo. This post is the narrative version for readers who know Kalman filters but have not followed this particular branch of work.

Why Strict Online Filtering Is Harder Than Sequence Inference#

A Kalman filter works because the filtering family is closed. Linear-Gaussian dynamics and observations imply:

\[ p(z_t \mid y_{1:t}) = \mathcal{N}(m_t, P_t). \]

After each observation, the filter only needs a mean and covariance. The posterior is exact and online.

This project keeps the online part and removes the closure. The learned filter must obey:

\[ q^F_t = \operatorname{update}_\theta(q^F_{t-1}, x_t, y_t). \]

That is the strict contract. The filter can carry a Gaussian mixture or another tractable family, but the headline row cannot peek at future observations, keep a hidden smoother state, or train against a grid posterior as the main scientific claim.
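The contract can be made concrete with a small sketch. Everything below is illustrative: the `MixtureBelief` container, the AR(1) dynamics `(a, q)`, the noise variance `r`, and the sine likelihood are assumptions standing in for the repo's learned update, and the reweight-only rule is deliberately minimal.

```python
import numpy as np

class MixtureBelief:
    """Scalar Gaussian mixture q^F_t: weights w_k, means m_k, variances v_k."""
    def __init__(self, w, m, v):
        self.w = np.asarray(w, float)   # (K,), nonnegative, sums to 1
        self.m = np.asarray(m, float)   # (K,)
        self.v = np.asarray(v, float)   # (K,), positive

def update(belief, x_t, y_t):
    """One strict online step: the only inputs are the previous belief and
    the current (x_t, y_t) -- no future observations, no hidden smoother state."""
    a, q = 0.9, 0.1                      # assumed AR(1) latent dynamics
    m_pred = a * belief.m                # component-wise Gaussian predict
    v_pred = a**2 * belief.v + q
    r = 0.5                              # assumed observation noise variance
    g = np.sin(m_pred)                   # e.g. sine observation model h(z) = sin(z)
    loglik = -0.5 * (y_t - g)**2 / r     # crude per-component likelihood score
    w = belief.w * np.exp(loglik - loglik.max())
    return MixtureBelief(w / w.sum(), m_pred, v_pred)
```

A real filter would also move the component means and variances in response to `y_t`; this sketch only shows the interface the strict contract permits.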

The benchmark is scalar, but not easy. The latent dynamics are local and Gaussian, while observations include sine, cubic, tanh, Student-t, heteroskedastic, weak, intermittent, zero, and random-normal amplitude variants. The sine case creates aliasing: states separated by roughly \(2\pi\) can explain similar observations. A single Gaussian can become confidently wrong; a mixture can carry aliases, but only if the update rule allocates mass correctly.
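A two-line numeric check makes the aliasing concrete. The noise variance `r` is an arbitrary illustrative value; the point is that two latent states a full period apart score identically under the sine likelihood.

```python
import numpy as np

def obs_loglik(z, y, r):
    """Gaussian observation log-likelihood under y = sin(z) + noise."""
    return -0.5 * (y - np.sin(z))**2 / r

z_a, z_b = 0.7, 0.7 + 2 * np.pi   # two aliases, one period apart
y = np.sin(z_a)                   # noiseless observation from alias A
r = 0.25

# Both aliases explain y equally well, so a unimodal belief must discard one.
print(obs_loglik(z_a, y, r), obs_loglik(z_b, y, r))
```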

The Guardrails#

The main metrics were:

| Metric | What it checks |
| --- | --- |
| state NLL | Density assigned to the true latent state by the filtering belief. |
| predictive-y NLL | Pre-update observation predictive likelihood. |
| cov90 | Empirical coverage of nominal 90 percent state intervals. |
| variance ratio | Learned filtering variance relative to the reference. |
| ESS and log-weight variance | Whether a particle objective has usable sequential weights. |
| coupling marginal error | Whether a Markov coupling preserves its requested marginals. |
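Two of these metrics are easy to pin down exactly. The sketch below uses hypothetical inputs (a perfectly calibrated Gaussian belief); `cov90` and `variance_ratio` are illustrative implementations, not the repo's evaluation code.

```python
import numpy as np

def cov90(true_z, mean, std):
    """Fraction of true states inside the nominal central 90% interval."""
    zcrit = 1.6448536269514722            # Phi^{-1}(0.95)
    return np.mean(np.abs(true_z - mean) <= zcrit * std)

def variance_ratio(learned_var, reference_var):
    """Mean ratio of learned filtering variance to the reference variance."""
    return float(np.mean(np.asarray(learned_var) / np.asarray(reference_var)))

rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=100_000)    # truth drawn from the belief itself
c = cov90(z, np.zeros_like(z), np.ones_like(z))  # near 0.90 when calibrated
```

An overconfident filter shows up as `cov90` well below 0.90 together with a variance ratio below 1; that joint signature is what the random-normal guardrail catches.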

The important guardrail was:

Do not promote a row that wins cubic or predictive-y by collapsing random-normal filtering density.

Random-normal is the stressor that catches overconfident filters. A row can look good on cubic because it commits hard to one local explanation. If that same row loses random-normal state density and underestimates variance, it has not solved filtering. It has moved along a tradeoff.

What Survived#

The K2/K3 mixture frontier survived as the basic learned-posterior story. K2 mixture IWAE is the clean baseline. K3 mixture IWAE improves state density, so posterior shape still matters. But K growth alone did not solve robustness, and an explicitly exchangeable component cell was a negative result.

The FIVO bridge survived as the robust sequential objective. FIVO is the Filtering Variational Objective, a particle objective that trains the proposal and filtering update through sequential importance weights. In this codebase, the bridge variant uses a learned proposal bridge rather than relying on the plain filtering proposal. That matters because plain FIVO improved predictive-y while collapsing state density. The bridge row kept effective sample size high and gave the best stressor robustness story.
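The weight diagnostics behind that claim can be sketched in a few lines. Each step of a FIVO-style bound contributes a log-mean of importance weights, and ESS flags whether those weights are usable. The toy proposal and likelihood below are assumptions for illustration, not the repo's bridge proposal.

```python
import numpy as np

def log_mean_exp(logw):
    """Numerically stable log of the mean of exp(logw)."""
    m = logw.max()
    return m + np.log(np.mean(np.exp(logw - m)))

def ess(logw):
    """Effective sample size of self-normalized importance weights."""
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return 1.0 / np.sum(w**2)

rng = np.random.default_rng(1)
n = 32                                      # particle count, as in the n32 row
z = rng.normal(0.0, 1.0, size=n)            # samples from a N(0,1) proposal
logw = -0.5 * (0.3 - np.sin(z))**2 / 0.5    # toy observation log-likelihood
step_bound = log_mean_exp(logw)             # this step's FIVO contribution
```

A healthy row keeps `ess(logw)` close to `n` step after step; the bridge row's appeal in the reports is precisely that its ESS stayed high while plain FIVO's state density collapsed.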

The current decision table is:

| Role | Row | Reason |
| --- | --- | --- |
| Robust strict-online anchor | `direct_mixture_k2_fivo_bridge_n32` | Best current random-normal and stressor robustness story. |
| State-density comparator | `direct_mixture_k3_joint_iwae_h4_k32` | K3 improves state density but is not robust enough to replace the anchor. |
| Cubic/local-projection comparator | `direct_mixture_k3_hybrid_iwae_projection_h4_k16_w03` | Useful cubic and local projection comparator. |
| Predictive-y / coverage comparator | `direct_mixture_k2_power_ep_alpha_0p5` | Useful predictive calibration comparator, not the state-density default. |
| Couplings | diagnostic only | Markov coupling helps over product-marginal paths but does not clear robustness guardrails. |
| Scalar flow | diagnostic only | Extra capacity helps some metrics but creates variance and predictive mismatch. |

What Failed#

The failures are more useful than they first look because they agree with each other.

Direct predictive-y pressure did not fix the frontier. Pre-update predictive scoring tied the K2 baseline on the family grid but did not improve predictive-y, and slightly regressed stressor state NLL. Detached and late predictive-y variants mostly moved the model along the same tradeoff.

Trajectory objectives were stable but wrong for the current goal. A backward trajectory ELBO samples full latent paths while preserving a causal online filter at test time. That produced better observation prediction in several rows, but the filtering density collapsed. Backward marginal consistency was not the missing piece: larger consistency weights worsened random-normal state NLL and deepened the variance collapse.

Couplings were meaningful but not promotable. Product-marginal trajectory objectives were very bad because they break useful temporal dependence between neighboring latent states. Sinkhorn-coupled trajectories improved on that negative control, which says coupling structure is not empty. But learned and derived Sinkhorn rows still failed sine and random-normal robustness.

Coupled FIVO was the final test of that idea. After a log-space sampling and scoring fix, K2/K3 coupled FIVO rows became numerically trainable. They still failed: no coupled row improved cubic state density relative to the K2 FIVO bridge, all worsened random-normal state NLL, variance ratios were weak, and coupling marginal residuals stayed above the \(10^{-3}\) trust threshold even with 1000 Sinkhorn iterations. That branch is now closed as a promotion path.
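The marginal-residual guardrail itself is simple to state in code. The sketch below runs plain Sinkhorn between two requested marginals and measures how far the coupling's row and column sums drift from them; the 3-point marginals, quadratic cost, and regularization `eps` are illustrative, and the coupled rows in the reports failed this check at the \(10^{-3}\) level on the real problem.

```python
import numpy as np

def sinkhorn(mu, nu, cost, eps=1.0, iters=1000):
    """Plain Sinkhorn scaling: returns a coupling P with target marginals mu, nu."""
    K = np.exp(-cost / eps)
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)               # enforce column marginal
        u = mu / (K @ v)                 # enforce row marginal
    return u[:, None] * K * v[None, :]

mu = np.array([0.2, 0.3, 0.5])           # requested marginal at t-1
nu = np.array([0.4, 0.4, 0.2])           # requested marginal at t
cost = (np.arange(3)[:, None] - np.arange(3)[None, :]) ** 2.0
P = sinkhorn(mu, nu, cost)
row_err = np.abs(P.sum(1) - mu).max()    # marginal residuals; the reports
col_err = np.abs(P.sum(0) - nu).max()    # required these below 1e-3
```

On this toy problem the residuals converge far below \(10^{-3}\); the point of the guardrail is that the learned and derived coupled rows did not.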

Scalar flows were also not a free win. A strict online scalar monotone-spline flow with FIVO bridge improved average state NLL slightly and improved coverage, but it over-inflated variance and badly damaged cubic predictive-y. Flow+IWAE was unstable. A richer marginal family is useful only if the update objective keeps variance and predictive normalizers calibrated.

The Local Bayes Projection Clue#

The recurring object underneath these results is the one-step tilted distribution:

\[ \tilde p_\alpha(z_t) \propto \left[ \int p(z_t \mid z_{t-1})q^F_{t-1}(z_{t-1})\,dz_{t-1} \right] p(y_t \mid z_t, x_t)^\alpha. \]

For \(\alpha = 1\), this is the ordinary local Bayesian update before projection. For \(\alpha < 1\), it resembles a Power-EP style damped update. The filter then needs a projection:

\[ q^F_t = \Pi_{\mathcal{Q}}[\tilde p_\alpha(z_t)]. \]

In a Kalman filter, this projection is exact because the family is closed. In our nonlinear mixture filters, the projection is the thing being learned or approximated. The experiments suggest that the projection operator, not raw capacity, is the bottleneck.
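For a scalar state, the tilted update and its projection can be written down directly on a quadrature grid, which is exactly the kind of reference-free tool the next branch would use. In this sketch the dynamics `(a, q)`, noise `r`, and sine likelihood are illustrative assumptions, and the projection target is a single moment-matched Gaussian rather than the repo's mixture family.

```python
import numpy as np

def project_tilted(m_prev, v_prev, y, alpha=1.0, a=0.9, q=0.1, r=0.5):
    """Form the tempered tilted density on a grid, then moment-match a Gaussian."""
    zs = np.linspace(-10.0, 10.0, 4001)
    dz = zs[1] - zs[0]
    mp, vp = a * m_prev, a**2 * v_prev + q            # Gaussian predict step
    log_prior = -0.5 * (zs - mp)**2 / vp
    log_lik = -0.5 * (y - np.sin(zs))**2 / r          # sine observation model
    logp = log_prior + alpha * log_lik                # tilted density, tempered
    p = np.exp(logp - logp.max())
    p /= p.sum() * dz                                 # normalize on the grid
    m = (zs * p).sum() * dz                           # moment-matching projection
    v = ((zs - m)**2 * p).sum() * dz
    return m, v

m, v = project_tilted(0.0, 1.0, y=0.5, alpha=0.5)     # damped Power-EP-style step
```

At `alpha=0` the update reduces to the predict step, and increasing `alpha` toward 1 recovers the full local Bayesian update; sweeping `alpha` and the projection rule is one axis of the deterministic frontier proposed below.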

Power-EP and local ADF-style rows were not final answers, but they repeatedly exposed the right tradeoff: state density, coverage, and predictive-y move together only when the local projection is calibrated. FIVO bridge is robust because it respects the sequential filtering structure, but it still leaves cubic and predictive-normalizer gaps. K3 and flows show that posterior shape helps, but unsafe projection can spend that extra shape in the wrong place.

Next Branch#

The next branch should be:

Calibrated Local Bayes Projection for Strict Online Filtering

The first test should not be a large learned controller. It should map a reference-free deterministic projection frontier using the existing quadrature tools. The question is whether a local projection rule can improve cubic and predictive-y behavior while preserving K2 FIVO bridge random-normal robustness.

Only after that deterministic frontier looks real should we amortize it into a learned strict online filter. The teacher should be the reference-free projection operator, not a grid posterior. That keeps the scientific claim focused on learning a Bayes update rule rather than distilling an offline reference.

The concise lesson from this branch is:

More posterior/path capacity is not enough.
Strict online filtering needs calibrated local projection.

That is a narrower target than “try a bigger model”, and it is the target the evidence now supports.

Artifacts#

Primary research context:

Supporting reports: