Amortizing Quadrature Filters Without Losing Calibration

A strict mixture filter learned from deterministic Power-EP teachers improved some nonlinear alias cases, but exposed a hard alias-mass calibration tradeoff.

Series: VBF Experiments, April 2026

The previous post in this sequence, Reference-Free Quadrature Filters For The Sine Benchmark, found a useful but awkward result. Deterministic quadrature ADF and Power-EP updates were strong nonlinear filters even without grid-reference targets. In particular, prior-weighted alias-indexed Power-EP could recover state density in the random-normal stressor where simpler K4 spread filters failed.

The awkward part was calibration. The best alias-heavy deterministic filter often carried too much mass across aliases. That helped state NLL in some regimes and hurt predictive calibration in others.

This session asked the next question:

Can a strict learned mixture filter amortize the deterministic quadrature / alias Power-EP update, including its predictive normalizer, without inheriting bad alias-mass calibration?

The answer from this pass is: partially. The new teacher-student objective works and gives targeted weak/intermittent improvements, but the aggregate row still loses to the conservative K4 component baseline. That negative result is useful because it points at the actual bottleneck: not another generic neural head, but how alias mass is represented and calibrated.

Figure: student tradeoff.

The Filtering Setup#

The nonlinear benchmark is still a scalar state-space model:

\[ z_t = z_{t-1} + w_t,\quad w_t \sim \mathcal{N}(0,Q) \]
\[ y_t = x_t \sin(z_t) + v_t,\quad v_t \sim \mathcal{N}(0,R) \]

A Kalman filter works because the posterior stays Gaussian under linear-Gaussian dynamics. Here the sine likelihood breaks that closure. Because only \(x_t \sin(z_t)\) is observed (plus noise), states separated by multiples of \(2\pi\) explain similar observations, so the posterior can become multi-modal.
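For concreteness, here is a minimal simulation sketch of that generative model in plain NumPy. The function name, the amplitude handling, and the default noise levels are mine, not the benchmark's exact configuration.

```python
import numpy as np

def simulate_sine_ssm(T, Q=0.1, R=0.1, x=None, seed=0):
    """Sample a scalar random-walk state and sine observations y_t = x_t sin(z_t) + noise."""
    rng = np.random.default_rng(seed)
    x = np.ones(T) if x is None else np.asarray(x)   # observation amplitude x_t
    z = np.zeros(T)
    y = np.zeros(T)
    z_prev = 0.0
    for t in range(T):
        z[t] = z_prev + rng.normal(0.0, np.sqrt(Q))              # random-walk transition
        y[t] = x[t] * np.sin(z[t]) + rng.normal(0.0, np.sqrt(R))  # sine observation
        z_prev = z[t]
    return z, y, x
```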

The strict learned filter is allowed to carry a Gaussian mixture:

\[ q^F_t(z_t) = \sum_{k=1}^K \pi_{t,k}\, \mathcal{N}(z_t; \mu_{t,k}, \sigma^2_{t,k}) \]

but it must remain an online filter:

\[ q^F_t = \operatorname{update}_\theta(q^F_{t-1}, x_t, y_t) \]

There is no hidden recurrent state in the headline rows. The belief itself has to carry the uncertainty.
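A minimal sketch of that interface, with illustrative names: the belief is just the K weights, means, and variances, the transition-predict step is known in closed form, and the learned measurement update sees only the previous belief plus \((x_t, y_t)\).

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MixtureBelief:
    pi: np.ndarray    # (K,) mixture weights, summing to 1
    mu: np.ndarray    # (K,) component means
    var: np.ndarray   # (K,) component variances

def predict(belief: MixtureBelief, Q: float) -> MixtureBelief:
    # Random-walk transition: each component keeps its mean, variance grows by Q.
    return MixtureBelief(belief.pi, belief.mu, belief.var + Q)

def update_theta(belief: MixtureBelief, x_t: float, y_t: float) -> MixtureBelief:
    # The learned measurement update. It carries no hidden recurrent state,
    # so the mixture parameters themselves must encode all the uncertainty.
    raise NotImplementedError
```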

ADF, Power-EP, And The Local Tilt#

Assumed-density filtering is the old idea hiding underneath the experiment. Start with a carried filtering belief \(q^F_{t-1}\), push it through the transition, multiply by the new likelihood, then project back into a tractable family.

The transition-predictive belief is:

\[ q^-_t(z_t) = \int p(z_t \mid z_{t-1}) q^F_{t-1}(z_{t-1})\,dz_{t-1} \]

The exact local Bayesian update would be:

\[ p(z_t \mid y_{\le t}) \propto p(y_t \mid z_t, x_t) q^-_t(z_t) \]

ADF replaces that exact posterior with a projection:

\[ q^F_t = \Pi_{\mathcal{Q}}\left[ p(y_t \mid z_t, x_t)q^-_t(z_t) \right] \]

Power-EP changes the local geometry by raising the likelihood to a power:

\[ \tilde p_\alpha(z_t) \propto p(y_t \mid z_t, x_t)^\alpha q^-_t(z_t) \]

In this codebase, the deterministic teacher computes that local tilted distribution with Gauss-Hermite quadrature and projects it back to a Gaussian mixture. That teacher is reference-free in the important sense: it uses the known model and observations, not latent states or grid posterior moments.
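As a sketch of what that per-component computation looks like, the tilted moments can be taken with Gauss-Hermite quadrature. The helper below handles a single Gaussian predictive component of the mixture; the actual teacher projects a full alias-indexed mixture, and all names here are illustrative.

```python
import numpy as np

def tilted_moments(m, s2, x_t, y_t, R, alpha=0.5, n_quad=20):
    """Moment-match the alpha-tilted distribution lik(y_t|z)^alpha * N(z; m, s2)."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    z = m + np.sqrt(2.0 * s2) * nodes                       # quadrature sites under N(m, s2)
    # log p(y_t | z, x_t) = log N(y_t; x_t sin z, R)
    log_lik = -0.5 * (y_t - x_t * np.sin(z)) ** 2 / R - 0.5 * np.log(2 * np.pi * R)
    w = weights * np.exp(alpha * log_lik)                   # unnormalized quadrature weights
    Z_alpha = w.sum() / np.sqrt(np.pi)                      # alpha-tilted local normalizer
    p = w / w.sum()
    mu = np.sum(p * z)                                      # projected mean
    var = np.sum(p * (z - mu) ** 2)                         # projected variance
    return mu, var, Z_alpha
```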

Why The Predictive Normalizer Matters#

The pre-update predictive likelihood is:

\[ Z_t = p(y_t \mid y_{<t}) = \int p(y_t \mid z_t, x_t)\, q^-_t(z_t)\, dz_t \]

For a mixture belief, the implementation computes this as a mixture of quadrature estimates:

\[ \log Z_t = \log \sum_k \pi^-_{t,k} \int p(y_t \mid z_t,x_t) \mathcal{N}(z_t;\mu^-_{t,k},\sigma^{2,-}_{t,k})\,dz_t \]

This quantity matters because it scores the belief before the observation is used to update it. A filter can look good after assimilation while still carrying the wrong pre-update mass. The notes for this branch specifically recommended targeting this predictive normalizer directly, rather than adding another scalar predictive-y auxiliary.
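A hedged sketch of that computation for one time step, with illustrative names and the exact likelihood (no Power-EP tilt) in the normalizer:

```python
import numpy as np
from scipy.special import logsumexp

def log_Z(log_pi_pred, mu_pred, var_pred, x_t, y_t, R, n_quad=20):
    """log p(y_t | y_<t) for a Gaussian-mixture predictive belief, via Gauss-Hermite."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    per_comp = []
    for lp, m, s2 in zip(log_pi_pred, mu_pred, var_pred):
        z = m + np.sqrt(2.0 * s2) * nodes
        log_lik = -0.5 * (y_t - x_t * np.sin(z)) ** 2 / R \
                  - 0.5 * np.log(2 * np.pi * R)
        # log of (1/sqrt(pi)) * sum_i w_i * p(y_t | z_i, x_t)
        log_int = logsumexp(log_lik, b=weights) - 0.5 * np.log(np.pi)
        per_comp.append(lp + log_int)
    return logsumexp(per_comp)   # log-sum-exp over mixture components
```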

What Changed In This Session#

The quadrature distillation trainer now supports three pieces:

| Piece | Purpose |
| --- | --- |
| alias Power-EP teacher options | expose mode-preserving alias projection, prior-alias weighting, shrink, entropy, and top-k knobs |
| predictive-normalizer target | match the teacher \(\log Z_t\) with a squared error term |
| short rollout distillation | initialize short windows from teacher beliefs, then train the student through self-fed rollouts |
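A sketch of how the rollout piece can be wired up, assuming hypothetical `student_update` and `step_loss` callables: the window is initialized from a teacher belief, then the student is fed its own beliefs, with the per-step loss combining the state-density distillation term and the squared-error \(\log Z_t\) match.

```python
def rollout_loss(student_update, step_loss, teacher_belief_0, window, horizon=4):
    """Short self-fed rollout: start from a teacher belief, then trust the student."""
    belief = teacher_belief_0
    total = 0.0
    for x_t, y_t, teacher_target in window[:horizon]:
        belief = student_update(belief, x_t, y_t)   # self-fed: teacher belief is not re-injected
        total += step_loss(belief, teacher_target, x_t, y_t)
    return total / horizon
```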

The main config was experiments/nonlinear/14_quadrature_alias_power_ep_rollout_normalizer_k5.yaml in the public dwrtz/ml-examples repository.

It used:

| Setting | Value |
| --- | --- |
| components | K5 |
| teacher | prior-weighted alias-indexed Power-EP |
| likelihood power | 0.5 |
| alias spacing | \(2\pi\) |
| rollout horizon | 4 |
| predictive-normalizer weight | 1.0 |
| seed | 321 |

The follow-up experiments then tested whether the bad calibration came from too much alias mass:

  • full K5 alias rollout-normalizer;
  • K5 with alias mean shrink 0.85, 0.90, 0.95;
  • K5 with a generic component-stability regularizer;
  • K5 shrink 0.90 plus stability.

The stability regularizer was deliberately not sine-specific. It penalizes component churn by keeping component \(k\) near its own transition-predicted component \(k\), and by discouraging abrupt weight changes:

\[ \begin{aligned} \mathcal{L}_{\text{stable}} &= \sum_{t,k} \bar \pi_{t,k} \frac{(\mu_{t,k} - \mu^-_{t,k})^2}{\sigma^{2,-}_{t,k}} \\ &\quad + \lambda_v \sum_{t,k} \bar \pi_{t,k} \left(\log \sigma^2_{t,k} - \log \sigma^{2,-}_{t,k}\right)^2 \\ &\quad + \lambda_\pi \sum_{t,k} \pi^-_{t,k}\log\frac{\pi^-_{t,k}}{\pi_{t,k}} \end{aligned} \]

That tests a more general idea than “sine aliases are \(2\pi\)-spaced”: mixture components should have persistent identities during filtering.
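For one time step, the penalty is a direct transcription of the expression above. The only real assumption in the sketch is what \(\bar\pi_{t,k}\) denotes; I read it here as the student's posterior weights (possibly stop-gradient).

```python
import numpy as np

def stability_penalty(pi, pi_bar, mu, log_var, pi_pred, mu_pred, log_var_pred,
                      lambda_v=1.0, lambda_pi=1.0):
    """Component-stability regularizer for one time step, matching the equation above."""
    # keep component k near its own transition-predicted component k
    mean_term = np.sum(pi_bar * (mu - mu_pred) ** 2 / np.exp(log_var_pred))
    # discourage abrupt log-variance changes
    var_term = lambda_v * np.sum(pi_bar * (log_var - log_var_pred) ** 2)
    # discourage abrupt weight changes (KL from predicted to posterior weights)
    weight_term = lambda_pi * np.sum(pi_pred * (np.log(pi_pred) - np.log(pi)))
    return mean_term + var_term + weight_term
```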

Results#

The aggregate comparison used weak, intermittent, and random-normal stressors. Those are the important cases because they expose, respectively, low observation information, missing observations, and irregular observation amplitudes.

Figure: aggregate tradeoff.

| variant | mean state NLL | mean pred-y NLL | mean cov90 | min cov90 | mean var ratio |
| --- | --- | --- | --- | --- | --- |
| K4 component baseline | 2.787505 | 0.522798 | 0.906386 | 0.886149 | 1.057096 |
| K5 shrink 0.90 + stability 0.03 | 2.793952 | 0.515567 | 0.864692 | 0.765137 | 0.884604 |
| K5 shrink 0.90 | 2.794042 | 0.515595 | 0.864610 | 0.764893 | 0.884503 |
| K5 full alias mass | 2.803440 | 0.524354 | 0.960802 | 0.951091 | 1.707011 |
| K5 stability 0.10 | 2.798732 | 0.523812 | 0.961860 | 0.951172 | 1.729154 |

The K4 component baseline is still the best aggregate row. It is boring in the right way: decent state NLL, coverage near the target, and variance ratio near one.
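For reference, a sketch of one common reading of the calibration columns, which is an assumption about metric conventions rather than the repository's exact metric code: cov90 as the fraction of true states inside each belief's central 90% interval (using a moment-matched Gaussian approximation of the mixture), and the variance ratio as mean predicted variance over the mean squared error of the predicted mean.

```python
import numpy as np
from scipy.stats import norm

def mixture_moments(pi, mu, var):
    """Mean and variance of a scalar Gaussian mixture."""
    mean = np.sum(pi * mu)
    second = np.sum(pi * (var + mu ** 2))
    return mean, second - mean ** 2

def calibration_metrics(beliefs, z_true):
    """beliefs: iterable of (pi, mu, var) per step; z_true: true states."""
    means, variances, hits = [], [], []
    for (pi, mu, var), z in zip(beliefs, z_true):
        m, v = mixture_moments(pi, mu, var)
        lo, hi = norm.ppf([0.05, 0.95], loc=m, scale=np.sqrt(v))  # Gaussian approx
        hits.append(lo <= z <= hi)
        means.append(m)
        variances.append(v)
    cov90 = np.mean(hits)
    var_ratio = np.mean(variances) / np.mean((np.array(means) - np.array(z_true)) ** 2)
    return cov90, var_ratio
```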

The K5 shrink row is more interesting but less robust. It improves state NLL on weak and intermittent:

| pattern | K4 state NLL | K5 shrink 0.90 state NLL | K4 var ratio | K5 shrink 0.90 var ratio |
| --- | --- | --- | --- | --- |
| weak | 2.795133 | 2.768477 | 1.153190 | 1.054112 |
| intermittent | 2.786885 | 2.758564 | 1.121537 | 1.016610 |
| random-normal | 2.780498 | 2.855084 | 0.896560 | 0.582786 |

That last row is the problem. Shrinking alias means fixes overdispersion in weak and intermittent cases, but makes random-normal underdispersed. The student is no longer carrying enough uncertainty when the observation amplitude is irregular.
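One plausible reading of "alias mean shrink" here, and it is only an assumption about the operation, is that each alias component's mean is pulled toward the dominant component by the shrink factor. Under that reading the tradeoff is unsurprising: shrink removes excess spread when the dominant mode is right, and removes needed spread when it is not.

```python
import numpy as np

def shrink_alias_means(pi, mu, shrink=0.90):
    """Pull alias component means toward the dominant component (assumed reading)."""
    k_star = np.argmax(pi)                # dominant component
    offsets = mu - mu[k_star]             # alias offsets, roughly multiples of 2*pi
    return mu[k_star] + shrink * offsets  # shrunk means; dominant mode unchanged
```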

The generic component-stability regularizer did not materially change that tradeoff. With shrink 0.90, stability moved metrics only in the fourth decimal place. Without shrink, stability left the K5 row overdispersed.

Interpretation#

The experiment rejects a simple story.

It is not enough to say “use the alias teacher.” Full K5 alias distillation carries too much mass:

mean variance ratio: 1.707
min coverage:        0.951

It is also not enough to say “shrink the aliases.” K5 shrink 0.90 gets the weak/intermittent rows right, but random-normal collapses too much mass:

random-normal variance ratio: 0.583
random-normal coverage:       0.765

And it is not enough to add a generic persistence penalty. Component stability is a reasonable regularizer, but the observed failure is not mostly component churn. It is a teacher/student calibration mismatch: the teacher’s alias mass is useful in one regime and too broad or too narrow in another.

That is a good research result. It says the next useful work is diagnostic:

  • effective number of mixture components over time;
  • weight entropy and alias-mass concentration;
  • teacher-student \(\log Z_t\) error by pattern and time;
  • variance ratio by time, not just globally;
  • whether predictive-normalizer matching conflicts with state-density distillation.

Those diagnostics should be run before another objective sweep. The scalar weights are no longer the most informative knob.
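The first weight diagnostics in that list are cheap to compute from the mixture weights alone. A sketch with illustrative names, where alias-mass concentration is read, as an assumption, as the total weight outside the dominant component:

```python
import numpy as np

def weight_diagnostics(pi, eps=1e-12):
    """Effective component count, weight entropy, and off-dominant alias mass."""
    pi = np.asarray(pi, dtype=float) + eps
    pi = pi / pi.sum()
    entropy = -np.sum(pi * np.log(pi))   # weight entropy in nats
    k_eff = np.exp(entropy)              # effective number of mixture components
    alias_mass = 1.0 - pi.max()          # mass carried outside the dominant mode
    return k_eff, entropy, alias_mass
```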

Reproducibility Notes#

The runs in this post used seed 321. The raw run directories are local, git-ignored outputs from ml-examples:

outputs/nonlinear_quadrature_alias_power_ep_rollout_normalizer_k5_suite_2026_04_30
outputs/nonlinear_alias_shrink_followup_2026_04_30
outputs/nonlinear_component_stability_followup_2026_04_30

The most relevant source files in dwrtz/ml-examples are:

The copied review artifacts for this post are:

The short version is:

Amortizing deterministic quadrature updates is viable, but alias mass is the calibration bottleneck. K4 remains the safer aggregate learned filter; K5 alias distillation is the right diagnostic branch, not yet the promotion row.