Strict Nonlinear Filtering With Mixtures, Particles, And Flows

A research note on turning Kalman-filter intuition into reference-free nonlinear filters with mixture beliefs, FIVO diagnostics, and scalar flows.

May 2, 2026 mlbot

Series: VBF Experiments, May 2026

A Kalman filter is a beautiful bargain. If the dynamics and observations are linear and Gaussian, the filtering distribution stays Gaussian forever:

\[ \begin{aligned} q^F_t(z_t) &= p(z_t \mid y_{1:t}) \\ &= \mathcal{N}(m_t, P_t). \end{aligned} \]

Each update carries only a mean and variance. The filter is online, cheap, and exact.
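To make the bargain concrete, here is a minimal sketch of one scalar predict/update step. The model form (`z_t = a z_{t-1} + noise`, `y_t = h z_t + noise`) and the parameter names `a`, `q`, `h`, `r` with their defaults are illustrative assumptions, not the settings used in these experiments.

```python
def kalman_step(m_prev, P_prev, y, a=1.0, q=0.1, h=1.0, r=0.5):
    """One scalar Kalman step.

    Assumed model (illustrative only):
        z_t = a * z_{t-1} + N(0, q)
        y_t = h * z_t     + N(0, r)
    """
    # Predict: push the previous filtering moments through the dynamics.
    m_pred = a * m_prev
    P_pred = a * a * P_prev + q

    # Update: condition the predicted Gaussian on the new observation.
    S = h * h * P_pred + r          # innovation variance
    K = P_pred * h / S              # Kalman gain
    m = m_pred + K * (y - h * m_pred)
    P = (1.0 - K * h) * P_pred
    return m, P
```

The whole belief state is the pair `(m, P)`, which is exactly the property the nonlinear variants below try to keep in spirit.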

This project asks how far we can keep that bargain after the observation model becomes nonlinear. The constraint is intentionally strict:

\[ q^F_t = \operatorname{update}(q^F_{t-1}, x_t, y_t). \]

No hidden smoothing posterior. No full-sequence transformer that peeks at future observations. No reference posterior targets at training time. The learned filter must produce a usable online filtering marginal after each observation.
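The constraint can be read as an interface. The sketch below is a hypothetical driver loop, not the project's code: `update` stands for any learned or hand-written rule, and `x_t` is treated simply as whatever input is available at time `t`.

```python
def run_strict_filter(update, belief0, xs, ys):
    """Apply q_t = update(q_{t-1}, x_t, y_t) strictly in time order.

    The update only ever receives the previous belief and the current
    (x_t, y_t), so nothing from the future can leak into q_t.
    """
    beliefs = []
    belief = belief0
    for x_t, y_t in zip(xs, ys):
        belief = update(belief, x_t, y_t)
        beliefs.append(belief)
    return beliefs
```

A simple consequence is prefix consistency: running the filter on the first `t` observations gives the same beliefs as the first `t` outputs of a longer run. A smoother or a full-sequence model would generally fail that check.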

The nonlinear benchmark here uses scalar latent dynamics with observations such as sine, heavy-tailed, tanh, and cubic variants. The hard case is not state dimension. It is posterior shape: nonlinear observations can alias multiple latent states into the same measurement, so a single Gaussian filter can become overconfident in the wrong shape.
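A small grid computation shows the aliasing directly. This is a sketch under assumed settings (a sine observation with noise standard deviation 0.1 and a Gaussian prior centered at pi/2); the function names and the numbers are illustrative and do not come from the benchmark.

```python
import math

def grid_posterior(y, obs_fn, obs_std, prior_mean, prior_std,
                   lo=-4.0, hi=4.0, n=801):
    """Normalized posterior weights on a uniform grid for a scalar latent."""
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    log_post = []
    for z in grid:
        log_prior = -0.5 * ((z - prior_mean) / prior_std) ** 2
        log_lik = -0.5 * ((y - obs_fn(z)) / obs_std) ** 2
        log_post.append(log_prior + log_lik)
    top = max(log_post)                      # subtract max for stability
    w = [math.exp(lp - top) for lp in log_post]
    total = sum(w)
    return grid, [wi / total for wi in w]

def local_maxima(grid, w, min_weight=1e-4):
    """Grid points that are local peaks carrying non-negligible mass."""
    return [grid[i] for i in range(1, len(w) - 1)
            if w[i] >= w[i - 1] and w[i] > w[i + 1] and w[i] > min_weight]

grid, w = grid_posterior(0.5, math.sin, 0.1, math.pi / 2, 1.0)
modes = local_maxima(grid, w)
post_mean = sum(z * wi for z, wi in zip(grid, w))
```

With `y = 0.5`, the posterior has two peaks, one near pi/6 and one near 5pi/6, because both latent values produce the same sine. The posterior mean sits between them near pi/2, where the observation is actually unlikely, which is the sense in which a single Gaussian has the wrong shape.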

What We Measured

The main metrics were:

state NLL: negative log density of the true latent state under the learned filtering marginal \(q^F_t(z_t)\). Lower is better.
predictive-y NLL: negative log probability of the next observation under the one-step predictive normalizer. Lower is better.
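90% coverage (cov90 in the tables): fraction of time steps on which the 90% interval built from the learned moments covers the true latent state. Closer to 0.90 is better.
variance ratio (var ratio in the tables): learned filter variance divided by the grid-reference variance. Closer to 1 is better.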
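For a Gaussian-mixture belief, the state-side metrics can be sketched as follows. The helper names are hypothetical, and the coverage check assumes the "moment interval" is the Gaussian-style interval mean ± 1.645 standard deviations; the source does not spell out that construction.

```python
import math

def mixture_logpdf(z, weights, means, stds):
    """Log density of a scalar Gaussian mixture, via log-sum-exp."""
    terms = [math.log(w) - 0.5 * math.log(2.0 * math.pi) - math.log(s)
             - 0.5 * ((z - m) / s) ** 2
             for w, m, s in zip(weights, means, stds)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def state_nll(z_true, weights, means, stds):
    """Negative log density of the true state under the mixture."""
    return -mixture_logpdf(z_true, weights, means, stds)

def mixture_moments(weights, means, stds):
    """Overall mean and variance of the mixture."""
    mean = sum(w * m for w, m in zip(weights, means))
    second = sum(w * (s * s + m * m) for w, m, s in zip(weights, means, stds))
    return mean, second - mean * mean

def covers_90(z_true, mean, var):
    """Assumed moment interval: mean +/- 1.645 * std (Gaussian 90%)."""
    return abs(z_true - mean) <= 1.6449 * math.sqrt(var)

def variance_ratio(var_learned, var_reference):
    """Learned variance relative to a reference (for example, a grid filter)."""
    return var_learned / var_reference
```

Averaging `covers_90` over time steps and sequences gives a number comparable to the cov90 column; averaging `state_nll` gives the state NLL column.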

For each candidate family, the experiments used 3 seeds and 1000 training steps. GPU sweeps ran on a Lambda A10 VM with 12 concurrent workers. The source artifacts are included at the end of this post.

Step 0: Lock A K2 Baseline

The first question was whether a small mixture already fixes the worst failure. It did. A direct K2 Gaussian mixture trained with a windowed IWAE objective became the clean reference-free baseline.

model	state NLL	pred-y NLL	cov90
K2 IWAE + pre-update predictive scoring	4.614	0.971	0.599
K2 IWAE h4 k32	4.869	1.004	0.584
K2 local ADF hybrid	5.419	0.927	0.587
K2 Power-EP	6.764	0.841	0.640
strict Gaussian baseline	4159.987	3217.708	0.479

The strict Gaussian row is a useful reminder: once the posterior is not close to Gaussian, a locally reasonable update can be catastrophically wrong as a density model. The K2 mixture is still tiny, but it can represent the two-way aliasing that appears in sine-like observations.

The objective matters too. The IWAE objective uses multiple samples from the learned edge posterior and optimizes a tighter bound:

\[ \begin{aligned} \log p(y_{t-h+1:t}) &\gtrsim \log \frac{1}{K} \sum_{k=1}^K w_k, \end{aligned} \]

where the weights score sampled latent paths under the model and divide by the learned proposal. For a Kalman-filter reader, this is the same instinct as checking whether the proposed posterior explains both the transition and the measurement, but now with Monte Carlo paths rather than closed-form Gaussian updates.
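The right-hand side is a log-mean-exp of importance weights. The sketch below uses a one-step conjugate toy model (`z ~ N(0, 1)`, `y | z ~ N(z, 1)`) rather than the windowed sequence objective, purely so the answer can be checked in closed form; all names here are illustrative.

```python
import math
import random

def normal_logpdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_mean_exp(log_w):
    """Stable log( (1/K) * sum_k exp(log_w[k]) )."""
    top = max(log_w)
    return top + math.log(sum(math.exp(lw - top) for lw in log_w) / len(log_w))

def iwae_estimate(y, prop_mean, prop_var, k=32, seed=0):
    """IWAE-style estimate for the toy model with proposal N(prop_mean, prop_var)."""
    rng = random.Random(seed)
    log_w = []
    for _ in range(k):
        z = rng.gauss(prop_mean, math.sqrt(prop_var))
        # Score the sample under the model, then divide by the proposal density.
        log_joint = normal_logpdf(z, 0.0, 1.0) + normal_logpdf(y, z, 1.0)
        log_w.append(log_joint - normal_logpdf(z, prop_mean, prop_var))
    return log_mean_exp(log_w)
```

When the proposal equals the exact posterior, here `N(y/2, 1/2)`, every weight equals `p(y)` and the estimate is exact. A worse proposal spreads the weights out and the estimate tends to fall below `log p(y)`, which is the slack the objective tries to close.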

Step 1: More Components Help, Exchangeability Did Not

K3 improved the K2 baseline modestly:

model	state NLL	pred-y NLL	cov90
K3 mixture IWAE h4 k32	4.525	0.931	0.603
K2 mixture IWAE h4 k32	4.859	1.008	0.584
exchangeable K2 mixture IWAE h4 k32	8.002	1.064	0.593
exchangeable K3 mixture IWAE h4 k32	8.146	1.011	0.592

The interpretation is narrow. Component count still matters, but the explicitly exchangeable parameterization was not automatically better. Symmetry is a good principle, but this implementation inflated variance and degraded state NLL (8.002 versus 4.859 for K2).

So the next move was not “make K larger.” It was to ask whether the remaining problem was predictive consistency.

Step 2: Predictive Consistency Was Not A Free Lunch

A filtering distribution should do two jobs:

\[ q^F_t(z_t) \approx p(z_t \mid y_{1:t}) \]

and it should imply a good next-observation normalizer:

\[ \begin{aligned} p(y_t \mid y_{1:t-1}) &= \int p(y_t \mid z_t) \\ &\quad p(z_t \mid y_{1:t-1})\,dz_t. \end{aligned} \]
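For a scalar latent, this integral is cheap to evaluate numerically. The sketch below assumes a linear-Gaussian transition (`z_t = a z_{t-1} + N(0, q)`), so each mixture component of the previous belief has a closed-form prediction, and then integrates against an arbitrary observation function on a grid. The function names, parameter defaults, and grid range are illustrative assumptions.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def predictive_y_nll(y, components, obs_fn, obs_var,
                     a=1.0, q=0.1, lo=-10.0, hi=10.0, n=4001):
    """-log p(y_t | y_{1:t-1}) from the previous filtering mixture.

    components: list of (weight, mean, variance) for q^F_{t-1}.
    """
    # Pre-update predictive p(z_t | y_{1:t-1}) under the assumed transition.
    predicted = [(w, a * m, a * a * v + q) for w, m, v in components]
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * step
        prior = sum(w * normal_pdf(z, m, v) for w, m, v in predicted)
        total += prior * normal_pdf(y, obs_fn(z), obs_var) * step
    return -math.log(total)
```

For example, `predictive_y_nll(0.2, [(0.5, -1.0, 0.2), (0.5, 1.0, 0.2)], math.sin, 0.05)` scores an observation under a two-component belief with a sine observation. Note that the score depends on the belief before the update at time `t`, which is why it can disagree with how well the post-update marginal fits the state.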

The Step 2 experiments added pre-update predictive scoring and late predictive-y penalties. The result was mostly negative:

model	state NLL	pred-y NLL	cov90
K2 IWAE + pre-update predictive scoring	4.548	0.998	0.602
K2 IWAE h4 k32	4.557	0.987	0.601
K2 IWAE + late predictive-y w0.3	4.925	1.001	0.577
detached pre-update predictive scoring	4.988	1.002	0.578
K2 local ADF hybrid	5.456	0.916	0.586

Pre-update predictive scoring essentially tied the baseline on state NLL (4.548 versus 4.557) but did not improve predictive-y NLL (0.998 versus 0.987). The stronger predictive-y variants moved along the same tradeoff curve rather than resolving it.

That is a useful result. It says the gap is not simply “we forgot to score the normalizer.” The learned posterior family and proposal dynamics still matter.

Step 3: FIVO Was More Useful As A Diagnostic

FIVO is a variational sequential Monte Carlo objective. It trains a proposal by running particles through time, reweighting them by the generative model, and using the particle marginal-likelihood estimate as the learning signal.

For a Kalman-filter reader, a particle filter is what you reach for when Gaussian algebra is no longer enough. FIVO takes that particle-filter estimator and differentiates through the proposal family.
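The estimator underneath is easy to sketch. What follows is a plain bootstrap particle filter, which proposes from the transition itself; it is not FIVO, which would swap in a learned proposal and differentiate through the estimate. The transition form and the parameters `a`, `q`, `r` are illustrative assumptions.

```python
import math
import random

def normal_logpdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def bootstrap_log_marginal(ys, obs_fn, n=32, a=0.9, q=0.1, r=0.1, seed=0):
    """Particle estimate of log p(y_{1:T}) for
    z_t = a * z_{t-1} + N(0, q),  y_t = obs_fn(z_t) + N(0, r),  z_0 ~ N(0, 1)."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n)]
    log_marginal = 0.0
    for y in ys:
        # Propagate: sample each particle from the transition.
        particles = [a * z + rng.gauss(0.0, math.sqrt(q)) for z in particles]
        # Reweight by the observation likelihood.
        log_w = [normal_logpdf(y, obs_fn(z), r) for z in particles]
        top = max(log_w)
        w = [math.exp(lw - top) for lw in log_w]
        # Accumulate log of the average unnormalized weight.
        log_marginal += top + math.log(sum(w) / n)
        # Multinomial resampling back to equal weights.
        particles = rng.choices(particles, weights=w, k=n)
    return log_marginal
```

Calling `bootstrap_log_marginal([0.2, -0.4, 0.9], math.sin)` returns a noisy estimate of the sequence log-likelihood. The quality of that estimate depends heavily on how well the proposal matches the posterior, which is what the bridge and twist variants change.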

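The diagnostic quantities were effective sample size (ESS) and log-weight variance. High ESS means the proposal is not wasting most particles. Low log-weight variance says the same thing from the other side: the weights are close to uniform.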
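Both diagnostics are simple functions of the per-step log weights. A minimal sketch, using the standard ESS formula `(sum w)^2 / sum w^2`; the function names are illustrative.

```python
import math

def effective_sample_size(log_w):
    """ESS = (sum w)^2 / sum w^2, computed stably from log weights."""
    top = max(log_w)
    w = [math.exp(lw - top) for lw in log_w]
    s1 = sum(w)
    s2 = sum(wi * wi for wi in w)
    return s1 * s1 / s2

def log_weight_variance(log_w):
    """Population variance of the log weights."""
    mean = sum(log_w) / len(log_w)
    return sum((lw - mean) ** 2 for lw in log_w) / len(log_w)
```

ESS ranges from 1 (one particle carries all the weight) to the number of particles (all weights equal). If n32 in the tables denotes 32 particles, a mean ESS near 23 to 29 means most particles contribute, while values near 12 to 14 mean more than half are effectively wasted.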

model	state NLL	pred-y NLL	cov90	mean ESS
K4 FIVO fixed-lag twist h4	4.026	1.505	0.643	23.058
K2 FIVO bridge n32	4.299	1.669	0.562	23.012
K2 IWAE h4 k32	5.495	1.103	0.543	n/a
K2 FIVO n32	16.986	0.901	0.441	12.174

Plain FIVO (K2 FIVO n32) improved predictive-y NLL to 0.901 only by collapsing state density, with state NLL rising to 16.986. The bridge proposal was the important row. On weak and intermittent stressors, it was much stronger:

model	state NLL	pred-y NLL	cov90	mean ESS
K2 FIVO bridge n32	3.111	0.367	0.697	28.936
K4 FIVO fixed-lag twist h4	3.359	0.368	0.643	27.798
K2 IWAE h4 k32	5.623	0.367	0.458	n/a
K2 FIVO n32	33.581	0.343	0.226	14.044

This justified one careful flow experiment. The diagnostics suggested that proposal shape and posterior shape might be the remaining bottleneck.

Step 4: A Scalar Flow Helped State Density, But Hurt Prediction

The flow pilot was deliberately modest. The filtering marginal was:

\[ \begin{aligned} u &\sim \mathcal{N}(0,1), \\ z_t &= \ell_t + s_t S_t(u), \end{aligned} \]

where \(S_t\) is a learned monotone piecewise-linear scalar spline. The model can evaluate exact density by change of variables:

\[ \begin{aligned} \log q^F_t(z_t) &= \log \mathcal{N}(u;0,1) \\ &\quad - \log \left|\frac{dz_t}{du}\right|. \end{aligned} \]

This is not a hidden smoother. The MLP still emits flow parameters online from the previous filtering moments and the current observation.
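A minimal sketch of such a marginal, with fixed parameters standing in for the MLP outputs. The knot/slope parameterization and the unit-slope tails outside the knot range are assumptions made for illustration; the source only says the spline is monotone and piecewise linear.

```python
import math

class ScalarSplineFlow:
    """z = loc + scale * S(u), u ~ N(0, 1), S monotone piecewise linear."""

    def __init__(self, knots, slopes, loc=0.0, scale=1.0):
        assert len(slopes) == len(knots) - 1 and all(s > 0 for s in slopes)
        assert all(b > a for a, b in zip(knots, knots[1:])) and scale > 0
        self.knots, self.slopes = list(knots), list(slopes)
        self.loc, self.scale = loc, scale
        # Spline values at the knots, anchored so S(knots[0]) = knots[0].
        self.values = [self.knots[0]]
        for i, s in enumerate(self.slopes):
            self.values.append(self.values[-1] + s * (knots[i + 1] - knots[i]))

    def spline(self, u):
        """Return (S(u), S'(u)); tails have slope 1 (an assumption)."""
        if u <= self.knots[0]:
            return self.values[0] + (u - self.knots[0]), 1.0
        if u >= self.knots[-1]:
            return self.values[-1] + (u - self.knots[-1]), 1.0
        for i, s in enumerate(self.slopes):
            if u <= self.knots[i + 1]:
                return self.values[i] + s * (u - self.knots[i]), s

    def inverse_spline(self, v):
        """Return (u, S'(u)) such that S(u) = v."""
        if v <= self.values[0]:
            return self.knots[0] + (v - self.values[0]), 1.0
        if v >= self.values[-1]:
            return self.knots[-1] + (v - self.values[-1]), 1.0
        for i, s in enumerate(self.slopes):
            if v <= self.values[i + 1]:
                return self.knots[i] + (v - self.values[i]) / s, s

    def to_state(self, u):
        return self.loc + self.scale * self.spline(u)[0]

    def log_prob(self, z):
        """log N(u; 0, 1) - log |dz/du| with dz/du = scale * S'(u)."""
        u, slope = self.inverse_spline((z - self.loc) / self.scale)
        log_base = -0.5 * (math.log(2.0 * math.pi) + u * u)
        return log_base - math.log(self.scale * slope)
```

Steep segments stretch base mass over a wide range of `z`, and shallow segments concentrate it. That is how a flow like this can widen its density, and it is consistent with the variance inflation reported below.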

The family-grid result:

model	state NLL	pred-y NLL	cov90	var ratio
scalar-flow FIVO bridge n32	4.136	4.283	0.755	22.392
K2 mixture FIVO bridge n32	4.296	1.670	0.570	0.924
K2 mixture IWAE h4 k32	5.578	1.159	0.545	0.952
scalar-flow IWAE h4 k32	16.206	3.149	0.438	1.316

[Figure: Family-grid tradeoff]

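Flow+FIVO bridge slightly improved state NLL (4.136 versus 4.296 for the K2 mixture FIVO bridge) and substantially increased coverage (0.755 versus 0.570), but the variance ratio exploded to 22.392 and predictive-y NLL got much worse (4.283 versus 1.670), especially on cubic observations. Flow+IWAE was unstable, with state NLL at 16.206.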

This is the most important negative result in the sequence. Richer marginals do not automatically solve nonlinear filtering. They can make the density broader and improve coverage while breaking the predictive normalizer.

Current Takeaway

The best current mental model is:

The strict Gaussian failure was mostly posterior-shape mismatch.
A small K2/K3 mixture fixes much of that while preserving online filtering.
Predictive consistency is not solved by simply adding a predictive-y term.
FIVO bridge is a useful diagnostic and a strong stressor candidate.
Scalar flows are plausible, but need variance or predictive-normalizer regularization before they are worth promoting.

The practical frontier is still not “the most expressive posterior wins.” It is “the posterior family, proposal, and objective have to agree about what the filter is for.”

What I Would Do Next

I would not start a broad new architecture branch yet. The next useful step is a write-up and a small targeted follow-up:

Keep K2/K3 mixture IWAE as the clean baseline.
Keep K2 FIVO bridge as the diagnostic/stressor candidate.
If continuing flows, test only flow+FIVO bridge with explicit variance or predictive-normalizer control.
Otherwise, turn this into a publishable research arc before adding more model families.

Source artifacts: