| theme | layout | class | title | mdc | coverDate |
|---|---|---|---|---|---|
| academic | cover | text-center | BONG | true | 2025-03-12 |
Kevin Murphy (Google DeepMind)
Special thanks to Gerardo Duran-Martin (Oxford) for many of the figures.
Observe sequence of past states
$$ {\cal D}_{1:t-1} = \{(s_1, a_1, r_1), \ldots, (s_{t-1}, a_{t-1}, r_{t-1})\}. $$
Given new state $s_t$, choose action
$$ \hat{a}_{t} = \pi_{t}(s_{t}, {\cal D}_{1:t-1}). $$
Receive reward $$ r_t = R(s_t, \hat{a}_{t}) $$
Repeat
Goal: (efficiently) update the policy $\pi_t$
Observe sequence of features (inputs)
$$ {\cal D}_{1:t-1} = \{(x_1, y_1), \ldots, (x_{t-1}, y_{t-1})\}. $$
Given new input $x_t$, predict output
$$ \hat{y}_{t} = \pi_{t}(x_{t}, {\cal D}_{1:t-1}). $$
Incur loss $$ \ell_t = \ell(y_t, \hat{y}_{t}) $$
Goal: (efficiently) update the predictor $\pi_t$
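To make the protocol concrete, here is a minimal sketch of the online prediction loop; the `predict_fn`, `update_fn`, and `loss_fn` callables are hypothetical placeholders, not part of any specific library.

```python
# Minimal sketch of the online (prequential) prediction loop.
# `predict_fn`, `update_fn`, and `loss_fn` are hypothetical placeholders.

def online_learning_loop(stream, belief, predict_fn, update_fn, loss_fn):
    """Run predict -> incur loss -> update on a stream of (x_t, y_t) pairs."""
    total_loss = 0.0
    for x_t, y_t in stream:
        y_hat = predict_fn(belief, x_t)       # prediction from current belief
        total_loss += loss_fn(y_t, y_hat)     # prequential (online) loss
        belief = update_fn(belief, x_t, y_t)  # recursively update the predictor
    return belief, total_loss
```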
- Background
- Sequential decision making
- Online Bayesian inference
- Online Variational Bayes
- Algorithms 1: low-rank filtering using variational Bayes
- Application 1: Bandits
- Application 2: Bayesian optimization
- Algorithms 2: robust filtering using generalized Bayes
At each step
Suppose we use 0-1 loss, where we assume cost of false positives = cost of false negatives = 1 $$ \begin{array}{c|cc} & \hat{y}=0 & \hat{y}=1 \\ \hline y=0 & 0 & \ell_{FP} \\ y=1 & \ell_{FN} & 0 \end{array} $$
In this case, the optimal estimator uses posterior mode (MAP estimate) $$ \begin{aligned} \hat{y}_t &= \arg \min_a E[\ell_{01}(y_t, a) | x_t, D_{1:t-1}] \\ &= \arg \min_{a} \sum_{y_t} p(y_t|x_t, D_{1:t-1}) {\cal I}(y_t \neq a) \\ &= \arg \min_a 1-p(y_t=a|x_t,D_{1:t-1}) \\ &= \arg \max_{y_t} p(y_t|x_t,D_{1:t-1}) \end{aligned} $$
For
State Space Model (SSM)
{style="max-width: 50%" .horizontal-center}
Input: prior belief state $b_{t|t-1} = p(\theta_t|D_{1:t-1})$
Goal 1: Predict next output $p(y_t|x_t, D_{1:t-1})$
Goal 2: Recursively update belief state given output $y_t$: $b_t = p(\theta_t|D_{1:t})$
One step ahead predictive distribution (for unknown output $y_t$)
New posterior over latent (after seeing $y_t$ ):
$$
\begin{aligned}
\overbrace{p(\theta_t|D_{1:t})}^\text{posterior}
&= b_t =
p(\theta_t|x_t,y_t,D_{1:t-1})
= \text{UpdateBel}(b_{t|t-1}, x_t, y_t) \\
&=
\frac{
\overbrace{p(y_t|\theta_t,x_t)}^\text{likelihood}
\overbrace{p(\theta_t|D_{1:t-1})}^\text{prior}}
{
\underbrace{p(y_t|x_t,D_{1:t-1})}_{\text{marg. lik.}}
} \\
&=
\frac{ p(y_t|\theta_t,x_t) p(\theta_t|D_{1:t-1})} {\int p(y_t|\theta_t,x_t) p(\theta_t|D_{1:t-1}) d\theta_t} \end{aligned} $$
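As a concrete (if toy) illustration, the same recursive update can be carried out numerically on a parameter grid. This sketch assumes a user-supplied `likelihood(theta, x, y)` function; it is not part of any library.

```python
import numpy as np

def update_belief_grid(prior, thetas, likelihood, x_t, y_t):
    """One step of Bayes' rule on a discretized parameter grid.

    prior:      array of prior probabilities p(theta_t | D_{1:t-1}) on the grid
    thetas:     array of grid points for theta
    likelihood: function (theta, x, y) -> p(y | theta, x)
    """
    lik = np.array([likelihood(th, x_t, y_t) for th in thetas])
    unnorm = lik * prior                 # likelihood x prior
    marg_lik = unnorm.sum()              # p(y_t | x_t, D_{1:t-1})
    posterior = unnorm / marg_lik        # normalize
    return posterior, marg_lik
```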
{style="max-width: 50%" .horizontal-center}
Measurements

- ${\bf Q}_t$ the state covariance,
- ${\bf R}_t$ the measurement covariance,
- $f_t: \mathbb{R}^p \to \mathbb{R}^p$ the state-transition function, and
- $h_t: \mathbb{R}^p \to \mathbb{R}^d$ the measurement function.
Linear Gaussian likelihood (with measurement noise cov. $R_t$) $$ p(y_t|\theta_t, x_t) = N(y_t|H_t \theta_t, R_t) $$
Nonlinear Gaussian likelihood $$ p(y_t|\theta_t, x_t) = N(y_t|h(\theta_t, x_t), R_t) $$
Binary logistic Regression $$ p(y_t|\theta_t, x_t) = {\rm Bern}(y_t|\sigma(x_t^\intercal \theta_t)) $$
Multinomial logistic Regression $$ p(y_t|\theta_t, x_t) = {\rm Cat}(y_t|{\cal S}(\theta_t x_t)) $$
MLP classifier $$ p(y_t|\theta_t, x_t) = {\rm Cat}(y_t|{\cal S} (\theta_t^{(2)} \text{relu}(\theta_t^{(1)} x_t))) = {\rm Cat}(y_t|h(\theta_t,x_t)) $$
Linear Gaussian dynamics (with system / process noise cov. $Q_t$) $$ p(\theta_t|\theta_{t-1}) = N(\theta_t|F_t \theta_{t-1} + b_t, Q_t) $$
Special case of LG: Ornstein-Uhlenbeck process $$ F_t = \gamma_t I, b_t = (1-\gamma_t) \mu_0, Q_t =(1-\gamma_t^2) \Sigma_0 $$
Special case of OU (with $\mu_0 = 0$):
Shrink and Perturb (Ash and Adams, 2020):
$$
p(\theta_t | \theta_{t-1})
= N(\theta_t|\lambda \theta_{t-1}, \sigma I)
$$
where $0 < \lambda < 1$ is the shrinkage factor and $\sigma$ controls the size of the perturbation.
Gaussian ansatz for the previous posterior $$ p(\theta_{t-1}|D_{1:t-1}) = N(\theta_{t-1}|\mu_{t-1},\Sigma_{t-1}) $$
Previous posterior becomes new prior $$ \begin{aligned} p(\theta_t|D_{1:t-1}) &= \int p(\theta_t|\theta_{t-1}) p(\theta_{t-1}|D_{1:t-1}) d\theta_{t-1} \\ &= \int N(\theta_t | F_t \theta_{t-1} + b_t , Q_t) N(\theta_{t-1}|\mu_{t-1},\Sigma_{t-1}) d\theta_{t-1} \\ &= N(\theta_t|\mu_{t|t-1}, \Sigma_{t|t-1}) \\ \mu_{t|t-1} &= F_t \mu_{t-1} + b_t \\ \Sigma_{t|t-1} &= F_t \Sigma_{t-1} F_t^\intercal + Q_t \end{aligned} $$
Special case for constant parameter ($F_t=I$, $b_t=0$, $Q_t=0$): $\mu_{t|t-1} = \mu_{t-1}$, $\Sigma_{t|t-1} = \Sigma_{t-1}$.
Prior predictive $$ \begin{aligned} p(y_t|x_t,D_{1:t-1}) &= \int p(y_t|x_t, \theta_t) p(\theta_t|D_{1:t-1}) d\theta_{t} \end{aligned} $$ For nonlinear likelihoods, often approximated by Monte Carlo: $$ \begin{aligned} p(y_t|x_t,D_{1:t-1}) &\approx \int p(y_t|x_t,\theta_t) \left[ \frac{1}{K} \sum_{k=1}^K \delta(\theta_t-\theta_t^k) \right] d\theta_t \\ &\approx \frac{1}{K} \sum_{k=1}^K p(y_t|x_t, \theta_t^k), \;\; \theta_t^k \sim p(\theta_t|D_{1:t-1}) \end{aligned} $$ For linear likelihoods, can derive predictive in closed form: $$ \begin{aligned} p(y_t|x_t,D_{1:t-1}) &=N(y_t|H_t \mu_{t|t-1}, H_t \Sigma_{t|t-1} H_t^\intercal + R_t) \end{aligned} $$
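A minimal Monte Carlo version of this predictive might look as follows; the `sample_theta` and `likelihood` callables are illustrative placeholders for whatever posterior approximation and observation model are in use.

```python
import numpy as np

def predictive_mc(sample_theta, likelihood, x_t, y_t, num_samples=100):
    """Monte Carlo estimate of p(y_t | x_t, D_{1:t-1}).

    sample_theta: function () -> one draw from p(theta_t | D_{1:t-1})
    likelihood:   function (theta, x, y) -> p(y | theta, x)
    """
    samples = [sample_theta() for _ in range(num_samples)]
    return np.mean([likelihood(th, x_t, y_t) for th in samples])
```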
New posterior (after seeing $y_t$): $p(\theta_t|D_{1:t}) \propto p(y_t|x_t,\theta_t) \, p(\theta_t|D_{1:t-1})$
Focus of this talk: how to compute this posterior efficiently
If we have LG dynamics and LG observations, we get a closed-form solution (the Kalman filter)!
Predict step $$ \begin{aligned} \bm\Sigma_{t|t-1} &= {\bf F}_t \bm\Sigma_{t-1} {\bf F}_t^\intercal + {\bf Q}_t \\ \bm\mu_{t|t-1} &= {\bf F}_t\bm\mu_{t-1} \\ \hat{\bm y}_t &= {\bf H}_t\bm\mu_{t|t-1} \end{aligned} $$
Update step $$ \begin{aligned}
{\bf S}_t &= {\bf H}_t\bm\Sigma_{t|t-1}{\bf H}_t^\intercal + {\bf R}_t \\
{\bf K}_t &= \bm\Sigma_{t|t-1}{\bf H}_t^\intercal{\bf S}_t^{-1}\\
\bm\mu_t &= \bm\mu_{t|t-1} + {\bf K}_t({\bm y}_t - \hat{\bm y}_t)\\
\bm\Sigma_t &= \bm\Sigma_{t|t-1} - {\bf K}_t{\bf S}_t{\bf K}_t^\intercal \end{aligned} $$
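The two steps above translate almost line-for-line into code. Below is a minimal numpy sketch (no Joseph-form stabilization or other numerical safeguards):

```python
import numpy as np

def kf_predict(mu, Sigma, F, Q):
    """Kalman filter predict step."""
    Sigma_pred = F @ Sigma @ F.T + Q
    mu_pred = F @ mu
    return mu_pred, Sigma_pred

def kf_update(mu_pred, Sigma_pred, H, R, y):
    """Kalman filter update step."""
    y_hat = H @ mu_pred
    S = H @ Sigma_pred @ H.T + R              # innovation covariance
    K = Sigma_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    mu = mu_pred + K @ (y - y_hat)
    Sigma = Sigma_pred - K @ S @ K.T
    return mu, Sigma
```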
::right::
Left: Plot
{style="max-width: 70%" .horizontal-center}
For non-linear state and/or measurement functions, replace the functions with first-order approximations centred around the previous mean. $$ \begin{aligned} p(\bm\theta_t \vert \bm\theta_{t-1}) &= {\cal N}(\bm\theta_t \vert \bar f_t(\bm\theta_{t-1}), {\bf Q}_t) \\ p(\bm y_t \vert \bm\theta_t) &= {\cal N}(\bm y_t \vert \bar h_t(\bm\theta_t), {\bf R}_t) \end{aligned} $$
with
- $\bar f_t(\bm\theta_{t-1}) = f_t(\bm\mu_{t-1}) + {\bf F}_t(\bm\theta_{t-1} - \bm\mu_{t-1})$,
- $\bar h_t(\bm\theta_t) = h_t(\bm\mu_{t|t-1}) + {\bf H}_t(\bm\theta_t - \bm\mu_{t|t-1})$,
- ${\bf F}_t = \text{Jac}(f_t)(\bm\mu_{t-1})$, ${\bf H}_t = \text{Jac}(h_t)(\bm\mu_{t|t-1})$
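A sketch of the corresponding EKF step, using finite differences for the Jacobians; in practice an autodiff Jacobian (e.g. `jax.jacfwd`) would be used instead, and the function names here are purely illustrative.

```python
import numpy as np

def numerical_jacobian(fn, x, eps=1e-6):
    """Finite-difference Jacobian of fn at x (illustrative; autodiff is preferable)."""
    fx = np.atleast_1d(fn(x))
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (np.atleast_1d(fn(x + dx)) - fx) / eps
    return J

def ekf_step(mu, Sigma, f, h, Q, R, y):
    """One EKF predict-update step for nonlinear dynamics f and measurement h."""
    F = numerical_jacobian(f, mu)
    mu_pred = f(mu)
    Sigma_pred = F @ Sigma @ F.T + Q
    H = numerical_jacobian(h, mu_pred)
    y_hat = np.atleast_1d(h(mu_pred))
    S = H @ Sigma_pred @ H.T + R
    K = Sigma_pred @ H.T @ np.linalg.inv(S)
    mu_new = mu_pred + K @ (np.atleast_1d(y) - y_hat)
    Sigma_new = Sigma_pred - K @ S @ K.T
    return mu_new, Sigma_new
```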
Exact posterior $$ \begin{aligned} p(\theta|D) = \frac{1}{p(D)} p_0(\theta) p(D|\theta) \end{aligned} $$
Approximate posterior: $$ \begin{aligned} q_{\psi^*}(\theta) &\approx p(\theta|D) \\ \psi^* &= \arg \min_{\psi} KL(q_{\psi}(\theta) | p(\theta|D)) \\ &= \arg\min_{\psi} E_{q_{\psi}(\theta)}\left[ \log q_{\psi}(\theta) - \log \left( \frac{p(\data|\theta) p_0(\theta)}{\cancel{p(\data)}} \right) \right] \\ &= \arg\min_{\psi} E_{q_{\psi}(\theta)}\left[ -\log p(\data|\theta) - \log p_0(\theta) + \log q_{\psi}(\theta) \right] \\ &= \arg\min_{\psi} \underbrace{E_{q_{\psi}(\theta)}\left[ -\log p(\data|\theta) \right]}_{\text{ENLL}} + \underbrace{KL(q_{\psi}(\theta) | p_0(\theta))}_{\text{regularizer}} \end{aligned} $$
$$ \begin{aligned} q_{\psi^*}(\theta) &\approx p(\theta|D) \\ \psi^* &= \arg\max_{\psi} \underbrace{E_{q_{\psi}(\theta)}\left[ \log p(\data|\theta) \right] - KL(q_{\psi}(\theta) | p_0(\theta))}_{\text{ELBO} \leq \log p(D)} \end{aligned} $$
{style="max-width: 50%" .horizontal-center}
Online version $$ \begin{aligned} \psi_t &= \arg \min_{\psi} L_t^\text{NELBO}(\psi) \\ L_t^\text{NELBO}(\psi) &= \underbrace{E_{\theta_t \sim q_{\psi}} [-\log p(y_t|h_t(\theta_t))]}_\text{incremental E[NLL]} + \underbrace{KL(q_{\psi} | q_{\psi_{t|t-1}})}_\text{incremental regularizer} \end{aligned} $$
- "Efficient Online Bayesian Inference for Neural Bandits". Gerardo Duran-Martin, Aleyna Kara, Kevin Murphy. AISTATS 2022.
- "Low-rank EKF for online learning of neural networks from streaming data" (LOFI). Peter Chang, Gerardo Duran-Martin, Alex Shestopaloff, Matt Jones, Kevin Murphy. COLLAS 2023.
- "Bayesian online natural gradient" (BONG). Matt Jones, Peter Chang, Kevin Murphy. NeurIPS 2024.
- "Low-rank Kalman filtering" (LRKF). Gerardo Duran-Martin, Leandro Sánchez-Betancourt, Kevin Murphy. (WIP).
BLR (the Bayesian Learning Rule) uses multiple iterations of natural gradient ascent (NGD) on the ELBO:
$$
\begin{aligned}
\psi_{i} &=
\psi_{i-1} + \alpha F_{\psi_0}^{-1}
\nabla_{\psi_{i-1}} L^\text{ELBO}(\psi_{i-1}) \\
&= \psi_{i-1} + \alpha
\nabla_{\rho_{i-1}} L^\text{ELBO}(\psi_{i-1}) \\
L^\text{ELBO}(\psi_{i}) &=
E_{q_{\psi_{i}}}[
\log p(\data \vert \theta)]
-KL(q_{\psi_{i}} | p_{0})
\end{aligned}
$$
where we use the natural-gradient identity
$$
F_{\psi}^{-1} \nabla_{\psi} L(\psi) = \nabla_{\rho} L(\rho)
$$
with $F_{\psi}$ the Fisher information matrix of $q_\psi$ and $\rho$ the dual (expectation) parameters of the variational distribution.
Online BLR: $$ \begin{aligned} \psi_{t,i} &= \psi_{t,i-1} + \alpha F_{\psi_{t|t-1}}^{-1} \nabla_{\psi_{t,i-1}} L_t^\text{ELBO}(\psi_{t,i-1}) \\ &= \psi_{t,i-1} + \alpha \nabla_{\rho_{t,i-1}} L_t^\text{ELBO}(\psi_{t,i-1}) \\ L_t^\text{ELBO}(\psi_{t,i}) &= E_{q_{\psi_{t,i}}}[ \log p(y_{t} \vert h_{t}(\theta_{t}))] -KL(q_{\psi_{t,i}} | q_{\psi_{t \vert t-1}}) \end{aligned} $$
In BONG,
we initialize with $\psi_{t,0}=\psi_{t|t-1}$,
and just do one step of natural gradient ascent (starting at $i=0$).
Thus the KL term vanishes,
but we still have implicit regularization due to initialization
at prior:
$$
\begin{aligned}
L_t^\text{ELBO}(\psi_{t,0}) &=
E_{q_{\psi_{t,0}}(\theta_t)}[
\log p(y_{t} \vert h_{t}(\theta_{t}))]
-\cancel{KL(q_{\psi_{t,0}} | q_{\psi_{t \vert t-1}})}
= L_t^\text{ELL}(\psi_{t,0}) \end{aligned} $$ Hence the BONG update becomes $$ \begin{aligned} \psi_{t} &= \psi_{t|t-1} + \alpha F_{\psi_{t|t-1}}^{-1} \nabla_{\psi_{t|t-1}} L_t^\text{ELL}(\psi_{t|t-1}) = \psi_{t|t-1} + \alpha \nabla_{\rho_{t|t-1}} L_t^\text{ELL}(\psi_{t|t-1}) \end{aligned} $$
Theorem: This is exact in the conjugate case (e.g., Gaussian prior, linear Gaussian likelihood).
"Bayesian online natural gradient". Matt Jones, Peter Chang, Kevin Murphy. NeurIPS 2024.{style="max-width: 50%" .centered}
OGD performs a single update step following the gradient of the log likelihood: $$ \begin{aligned} \theta_{t} &= \theta_{t-1} + \alpha \nabla_{\theta_{t-1}} \log p(y_t|h(\theta_{t-1},x_t)) \end{aligned} $$
- No model of uncertainty (needed for decision making).
- Statistically inefficient (needs lots of data).
By contrast, BONG updates the variational parameters $\psi_t$:
- Posterior $q_{\psi_t}(\theta_t)$ captures uncertainty.
- Second-order updates enable more rapid adaptation in the small-data setting.
Recall $$ \begin{aligned} \psi_{t} &= \psi_{t|t-1} + \alpha F_{\psi_{t|t-1}}^{-1} \nabla_{\psi_{t|t-1}} L_t^\text{ELL}(\psi_{t|t-1}) \\ &= \psi_{t|t-1} + \alpha \nabla_{\rho_{t|t-1}} L_t^\text{ELL}(\psi_{t|t-1}) \\ L_t^\text{ELL}(\psi) &= E_{q_{\psi}(\theta_t)}[ \log p(y_{t} \vert h(x_t, \theta_{t}))] \end{aligned} $$ For Gaussian variational family, this becomes $$ \begin{aligned} \mu_t &= \mu_{t|t-1} + \Sigma_t \underbrace{E_{\theta_t \sim q_{\psi_{t|t-1}}} [\nabla_{\theta_t} \log p(y_t|h(x_t,\theta_t))]}_{g_t} \\ \Sigma_t^{-1} &= \Sigma_{t|t-1}^{-1} - \underbrace{E_{\theta_t \sim q_{\psi_{t|t-1}}}[ \nabla^2_{\theta_t} \log p(y_t|h(x_t,\theta_t))]}_{G_t} \end{aligned} $$
Key question: how to compute the expected gradient $g_t$ (and expected Hessian $G_t$)?
Exact expected gradient $$ \begin{aligned} g_t = E_{\theta_t \sim q_{\psi_{t|t-1}}} [\nabla_{\theta_t} \log p(y_t|h(x_t,\theta_t))] \end{aligned} $$
Standard approach: Monte Carlo approximation $$ \begin{aligned} g_t^{MC} = \frac{1}{K} \sum_{k=1}^K \nabla_{\theta_t} \log p(y_t|h(x_t,\theta_t^k)), \theta_t^k \sim q_{\psi_{t|t-1}} \end{aligned} $$
Our approach: linearize the likelihood and compute expectation deterministically (c.f., EKF) $$ \begin{aligned} g_t^{LIN} &= H_t^\intercal R_t^{-1} (y_t-\hat{y}_t) \\ \hat{y}_t &= h(\mu_{t|t-1},x_t) \\ H_t &= \frac{\partial h_t}{\partial \theta_t}\Big|_{\theta_t=\mu_{t|t-1}} \\ R_t &= \text{Var}(y_t|\theta_t=\mu_{t|t-1}) \\ &= \hat{y}_t (1-\hat{y}_t) \quad \text{(Bernoulli)} \end{aligned} $$
Exact expected Hessian $$ \begin{aligned} G_t &= E_{\theta_t \sim q_{\psi_{t|t-1}}}[ \nabla^2_{\theta_t} \log p(y_t|h(x_t,\theta_t))] \end{aligned} $$
- MC-Hess: Sample $\theta_t^k$ and plug into Hessian $$ G_t^{MC-HESS} = \frac{1}{K} \sum_{k=1}^K \nabla^2_{\theta_t} \log p(y_t|h(x_t,\theta_t^k)) $$
- Lin-Hess: Linearize and compute Jacobian $$ G_t^{LIN-HESS} = -H_t^\intercal R_t^{-1} H_t $$
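Putting the linearized gradient and Hessian together gives a simple deterministic update for the full-covariance Gaussian case. The sketch below does this for online binary logistic regression, where $h(\theta, x) = \sigma(x^\intercal \theta)$, so $H_t = \hat{y}_t(1-\hat{y}_t)\, x_t^\intercal$ and $R_t = \hat{y}_t(1-\hat{y}_t)$; it is an illustration of the equations above, not the authors' reference code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bong_lin_update(mu_pred, Sigma_pred, x, y):
    """One BONG step with linearized (LIN) gradient/Hessian for online
    binary logistic regression, i.e. h(theta, x) = sigmoid(x @ theta)."""
    y_hat = sigmoid(x @ mu_pred)              # \hat{y}_t = h(mu_{t|t-1}, x_t)
    R = max(y_hat * (1 - y_hat), 1e-6)        # R_t = Bernoulli variance (floored)
    H = R * x                                 # H_t = dh/dtheta at mu_{t|t-1}
    g = H * (y - y_hat) / R                   # g_t^LIN = H_t^T R_t^{-1} (y_t - y_hat)
    G = -np.outer(H, H) / R                   # G_t^LIN-HESS = -H_t^T R_t^{-1} H_t
    prec = np.linalg.inv(Sigma_pred) - G      # Sigma_t^{-1} = Sigma_{t|t-1}^{-1} - G_t
    Sigma = np.linalg.inv(prec)
    mu = mu_pred + Sigma @ g                  # mu_t = mu_{t|t-1} + Sigma_t g_t
    return mu, Sigma
```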
Exact expected Hessian $$ \begin{aligned} G_t &= E_{\theta_t \sim q_{\psi_{t|t-1}}}[ \nabla^2_{\theta_t} \log p(y_t|h(x_t,\theta_t))] \end{aligned} $$
- EF (empirical Fisher) with MC gradients (BLR): $$ G_t^{MC-EF} = -g_t^{MC} (g_t^{MC})^\intercal $$
- EF with linearized gradients (BONG): $$ G_t^{LIN-EF} = -g_t^{LIN} (g_t^{LIN})^\intercal $$
For the posterior over parameters $\theta_t \in \mathbb{R}^p$, we consider several Gaussian variational families:
| Name | Form | Complexity |
|---|---|---|
| Full rank | | |
| Diagonal (mean field) | | |
| LOFI¹: Diag + low rank | | |
| LRKF²: Low rank | | |
EKF Predict-Update, then SVD projection. (c.f. Assumed Density Filtering)
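A minimal sketch of the projection step this describes: run a full predict-update, then keep only the top directions of the posterior covariance via an SVD. This is illustrative only; the LRKF paper's actual projection may differ in detail.

```python
import numpy as np

def project_low_rank(Sigma, rank):
    """Project a full covariance onto its best rank-`rank` approximation via SVD."""
    U, s, _ = np.linalg.svd(Sigma, hermitian=True)
    W = U[:, :rank] * np.sqrt(s[:rank])   # low-rank factor, Sigma ~= W @ W.T
    return W @ W.T, W
```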
{style="max-width: 50%" .horizontal-center}
::right::
- BONG (full-rank) for learning 1 layer MLP.
{style="max-width: 50%" .horizontal-center}
Expected calibration error
{style="max-width: 75%" .horizontal-center}
{style="max-width: 30%" .float-left}
Now the agent observes a state vector $s_t$ (the context) before choosing an action.
Need to try new actions (explore) to learn about their effects before exploiting the best action (dual control problem).
{style="max-width: 30%" .horizontal-center}
Action at step $t$ is chosen using the upper confidence bound (UCB) rule:
$$
\begin{aligned}
\pi_t(a^* | s_t) &= {\cal I}
(a^* = \arg \max_{a} \mu_t(a) + c \sigma_t(a) ) \\
\mu_t(a) &= E[R_{\theta}(s_t, a) | D_{1:t-1}] \\
\sigma_t(a) &= \sqrt{ Var(R_{\theta}(s_t, a) | D_{1:t-1} ) }
\end{aligned}
$$
Need to recursively update the belief state $p(\theta|D_{1:t-1})$ over the reward model parameters.
{style="max-width: 50%" .horizontal-center}
$$ \begin{aligned} \pi_t(a^*|s_t) &= p(a^* = \arg \max_{a} R_{\theta}(a, s_t) |D_{1:t-1}) \\ &= \int {\cal I}(a^* = \arg \max_{a} R_{\theta}(a, s_t) ) p(\theta|D_{1:t-1}) d\theta \\ &\approx {\cal I}(a^* = \arg \max_{a} R_{\tilde{\theta}_t}(a, s_t) ), \;\; \tilde{\theta}_t \sim p(\theta|D_{1:t-1}) \end{aligned} $$
Algorithm: $$ \begin{aligned} \tilde{\theta}_t &\sim p(\theta|D_{1:t-1}) \\ a_t &= \arg \max_{a} R_{\tilde{\theta}_t}(a, s_t) \end{aligned} $$
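A minimal sketch of this loop for a contextual bandit; the `sample_posterior` and `reward_model` callables are placeholders for whichever posterior approximation (e.g. BONG/LOFI) and reward network are being used.

```python
import numpy as np

def thompson_step(belief, s_t, actions, sample_posterior, reward_model):
    """Pick an action by Thompson sampling: sample parameters, then act greedily."""
    theta_tilde = sample_posterior(belief)                        # theta ~ p(theta | D_{1:t-1})
    rewards = [reward_model(theta_tilde, s_t, a) for a in actions]
    return actions[int(np.argmax(rewards))]
```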
{style="max-width: 70%" .horizontal-center}
Global, gradient-free optimization of an expensive, black-box function $$ \begin{aligned} x^* = \arg \max_{x \in \mathcal{X}} f^*(x) \end{aligned} $$ where $f^*$ is the unknown reward / objective function.
Equivalent to a MAB with one arm for each point in $\mathcal{X}$.
Examples:
- $f^*(x)$ is validation accuracy of hyper-parameters, $\mathcal{X}=\mathbb{R}^N$
- $f^*(x)$ is binding affinity of a DNA sequence, $\mathcal{X}=\{A,C,G,T\}^N$.
We approximate the posterior over functions $p(f|D_{1:t})$, typically using a Gaussian process (GP).
Posterior predictive at step $t$: $p(f(x)|D_{1:t}) = {\cal N}(f(x)|\mu_t(x), \sigma_t^2(x))$.
UCB rule: $$ x_{t+1} = \arg \max_{x} \mu_t(x) + c \sigma_t(x) $$
{style="max-width: 50%" .horizontal-center}
Instead of using GPs, we would like to use neural networks.
This means we approximate the posterior over the network parameters, $p(\theta|D_{1:t})$.
Once we have the posterior, we sample parameters from it, plug them into the function, and then find its maximum (using gradient-based methods): $$ \begin{aligned} \tilde{\theta}_t &\sim p(\theta|D_{1:t}) \\ x_{t+1} &= \arg \max_{x} f_{\tilde{\theta}_t}(x) \\ D_{t+1} &= (x_{t+1}, f^*(x_{t+1})) \\ p(\theta|D_{1:t+1}) &= \text{UpdateBel}(p(\theta|D_{1:t}), D_{t+1}) \end{aligned} $$
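A sketch of one round of this procedure, using a simple search over a candidate set for readability (a gradient-based inner optimizer over $x$ would be used in practice); `sample_posterior`, `surrogate`, and `update_belief` are placeholder callables.

```python
import numpy as np

def bayes_opt_step(belief, objective, sample_posterior, surrogate, update_belief,
                   candidates):
    """One Thompson-sampling Bayesian-optimization round with a sampled surrogate."""
    theta_tilde = sample_posterior(belief)                     # theta ~ p(theta | D_{1:t})
    values = [surrogate(theta_tilde, x) for x in candidates]   # f_theta(x) on candidates
    x_next = candidates[int(np.argmax(values))]                # maximize sampled surrogate
    y_next = objective(x_next)                                 # query the true black-box f*
    belief = update_belief(belief, x_next, y_next)             # recursive posterior update
    return belief, x_next, y_next
```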
{style="max-width: 70%" .horizontal-center}
Standard (variational) Bayes is sensitive to outliers and misspecified measurement models
Measurements sampled from a Student-t distribution with 2.01 degrees of freedom.
Any measurement has 15% probability of taking value between -50 and 50.
VI $$ \begin{aligned} \psi^* &= \arg\min_{\psi} \underbrace{E_{q_{\psi}(\theta)}\left[ -\log p(\data|\theta) \right]}_{\text{ENLL}} +\underbrace{KL(q_{\psi}(\theta) | p_0(\theta))}_{\text{regularizer}} \end{aligned} $$
GVI $$ \begin{aligned} \psi^* &= \arg\min_{\psi} \underbrace{E_{q_{\psi}(\theta)}\left[ \ell(\theta|\data) \right]}_{\text{loss}} +\underbrace{D(q_{\psi}(\theta) | p_0(\theta))}_{\text{divergence}} \end{aligned} $$
"An Optimization-centric View on Bayes' Rule: Reviewing and Generalizing Variational Inference". Jeremias Knoblauch et al. JMLR 2021.Replace the log likelihood $\log {\cal N}(\bm y_t \vert h_t(\bm\theta_t), {\bf R}t)$ with loss function of the form $$ \begin{aligned} \ell_t(\theta_t) &= -W_t(\bm y{1:t}), \log{\cal N}(\bm y_t \vert h_t(\bm\theta_t), {\bf R}_t). \end{aligned} $$ Easy to modify KF-like methods, as we will show.
"Outlier-robust Kalman filtering through generalised Bayes". Duran-Martin, Altamirano, Shestopaloff, Betancourt, Knoblauch, Jones, Briol, Murphy. ICML 2024.For an LG-SSM, WoLF updates step is a simple modification of the KF update step.
$$ \begin{aligned} \hat{\bm y}_t &= {\bf H}_t\bm\mu_{t|t-1} \\
{\bf S}_t &=
{\bf H}_t\bm\Sigma_{t|t-1}{\bf H}_t^\intercal + {\bf R}_t {\color{red} / W_t(\bm y_{1:t})} \\
{\bf K}_t &= \bm\Sigma_{t|t-1}{\bf H}_t^\intercal{\bf S}_t^{-1}\\
\bm\mu_t &= \bm\mu_{t|t-1} + {\bf K}_t({\bm y}_t - \hat{\bm y}_t)\\
\bm\Sigma_t &= \bm\Sigma_{t|t-1} - {\bf K}_t{\bf S}_t{\bf K}_t^\intercal \end{aligned} $$
Can easily be generalized to EKF, BONG, etc.
Inverse multi-quadratic (IMQ) weighting function
$$
W_t({\bm y}_{1:t}) = \left(1 + \frac{\Vert\bm y_t - \hat{\bm y}_t\Vert_2^2}{c^2}\right)^{-1/2}
$$
with $c > 0$ a hyper-parameter.
Downweight "unlikely" observations, to avoid changing posterior too much.
Provably robust --- can bound the posterior influence function (see paper).
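A minimal sketch of the weighted update, reusing the standard KF update but inflating ${\bf R}_t$ by the IMQ weight as in the equations above (illustrative only; see the WoLF paper and post for the reference implementation).

```python
import numpy as np

def imq_weight(y, y_hat, c=1.0):
    """Inverse multi-quadratic weight: downweights large residuals."""
    return (1.0 + np.sum((y - y_hat) ** 2) / c ** 2) ** -0.5

def wolf_kf_update(mu_pred, Sigma_pred, H, R, y, c=1.0):
    """WoLF-style robust KF update: standard KF update with R_t / W_t."""
    y_hat = H @ mu_pred
    w = imq_weight(y, y_hat, c)
    S = H @ Sigma_pred @ H.T + R / w              # weighted innovation covariance
    K = Sigma_pred @ H.T @ np.linalg.inv(S)
    mu = mu_pred + K @ (y - y_hat)
    Sigma = Sigma_pred - K @ S @ K.T
    return mu, Sigma
```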
::right::
https://grdm.io/posts/wolf-ewma, Gerardo Duran-Martin, 2024.

Online training of neural networks on corrupted UCI datasets.
Results are shown relative to online gradient descent (OGD) with multiple inner iterations.

- Sequential Bayesian inference has many applications, e.g., online learning, bandits, Bayes Opt.
- We propose new efficient (and deterministic) algorithms based on recursive (generalized) variational inference and (low rank) Gaussian approximations.
- Future work: scale up to large neural networks, and applications to RL.