Posterior Contraction in Variational Inference
Let \(\Dc = \{(\xv_i, y_i)\} _ {i=1}^{n}\) be a dataset. Assume the data are generated i.i.d. given \(\wv\), so that \[ p(\Dc \mid \wv) = \prod_{i=1}^{n} p((\xv_i, y_i) \mid \wv), \] where \(\wv\) is the weight vector of the model. For simplicity, we assume the weight vector \(\wv\) has a standard normal prior \(p(\wv) = \Nc(\zero, \Iv)\).
Variational inference aims to find a variational distribution \(q \in \Qc\) that approximates the exact posterior \( p(\wv \mid \Dc) \) by minimizing the Kullback–Leibler (KL) divergence \[ \Ds_\KL\big(q(\wv), p(\wv \mid \Dc)\big) \] between the variational approximation and the posterior. This is equivalent to maximizing the evidence lower bound (ELBO): \[ \maxi_{q \in \Qc} \; \sum_{i=1}^{n} \Eb _ {q(\wv)} \log p((\xv_i, y_i) \mid \wv) - \Ds_\KL(q(\wv), p(\wv)). \tag{ELBO} \] In practice, it is common to use multivariate normal distributions as the variational family \(\Qc\).
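To ground the objective, here is a minimal numerical sketch (in NumPy) of the ELBO for a full-covariance Gaussian variational family \(q(\wv) = \Nc(\muv, \Sigmav)\), parameterized through a Cholesky-style factor \(L\) with \(\Sigmav = L L^\top\), against the standard normal prior. The function name `elbo` and the per-datum likelihood `log_lik(w, x, y)` are placeholders of mine, not part of the setup above; the expected log-likelihood is estimated with reparameterized Monte Carlo samples, while the KL term uses its Gaussian closed form.

```python
import numpy as np

def elbo(mu, L, data, log_lik, n_samples=64, rng=None):
    """Monte Carlo estimate of the ELBO for q(w) = N(mu, Sigma), Sigma = L @ L.T,
    with a standard normal prior p(w) = N(0, I).

    `data` is a list of (x, y) pairs; `log_lik(w, x, y)` is assumed to return
    log p((x, y) | w) for a single datum.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = mu.shape[0]

    # Expected log-likelihood, estimated with reparameterized samples w = mu + L @ eps.
    expected_ll = 0.0
    for _ in range(n_samples):
        w = mu + L @ rng.standard_normal(d)
        expected_ll += sum(log_lik(w, x, y) for x, y in data)
    expected_ll /= n_samples

    # KL(N(mu, Sigma) || N(0, I)) in closed form:
    # 0.5 * (tr(Sigma) + mu^T mu - d - log det Sigma).
    Sigma = L @ L.T
    kl = 0.5 * (np.trace(Sigma) + mu @ mu - d - np.linalg.slogdet(Sigma)[1])

    return expected_ll - kl
```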
Our goal is to understand how the uncertainty of the variational posterior shrinks as the number of data points grows. Intuitively, observing data should reduce the uncertainty in the prior. We can formalize this intuition, at least in a special case, as follows.
A Special Case
Let the multivariate normal distribution \(\Nc(\muv, \Sigmav)\) be the optimal variational distribution obtained by maximizing the ELBO. If the likelihood \( p((\xv_i, y_i) \mid \wv) \) is log-concave in \( \wv \) for every \(i \in [n]\), then the variational posterior has no more uncertainty than the prior, in the sense that \(\Sigmav \preceq \Iv\).
By the first-order optimality condition, the derivative of the ELBO with respect to the covariance \(\Sigmav\) must vanish. The gradient of the first term can be computed by Price’s theorem: \[ \frac{\partial}{\partial \Sigmav} \sum_{i=1}^{n} \Eb _ {q(\wv)} \log p((\xv_i, y_i) \mid \wv) = \frac12 \sum_{i=1}^{n} \Eb _ {q(\wv)} \Big[\nabla_{\wv}^2 \log p((\xv_i, y_i) \mid \wv)\Big], \] where the right-hand side involves the Hessian of the log-likelihood with respect to \(\wv\). The derivative of the second term is easy to compute since the KL divergence between two multivariate normal distributions has a closed form: \[ \frac{\partial}{\partial \Sigmav} \Ds_\KL(q(\wv), p(\wv)) = -\frac12 \Sigmav\inv + \frac12 \Iv. \] Setting the derivative of the ELBO to zero, i.e., the first derivative minus the second, and rearranging terms gives \[ \Sigmav\inv - \Iv = -\sum_{i=1}^{n} \Eb _ {q(\wv)} \Big[\nabla_{\wv}^2 \log p((\xv_i, y_i) \mid \wv)\Big]. \] The right-hand side is positive semi-definite: log-concavity of each likelihood means \(\log p((\xv_i, y_i) \mid \wv)\) is concave in \(\wv\), so each Hessian is negative semi-definite, and so is its expectation under \(q\). Hence, \(\Sigmav\inv \succeq \Iv\), which immediately implies \(\Sigmav \preceq \Iv\).
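As a concrete instance of the fixed-point equation, consider Bayesian linear regression with Gaussian noise of variance \(\sigma^2\) (a symbol not used above): \(p(y_i \mid \xv_i, \wv) = \Nc(y_i; \wv^\top \xv_i, \sigma^2)\). The Hessian of the log-likelihood is \(\nabla_{\wv}^2 \log p((\xv_i, y_i) \mid \wv) = -\xv_i \xv_i^\top / \sigma^2\), which does not depend on \(\wv\), so the expectation is trivial and \[ \Sigmav\inv = \Iv + \frac{1}{\sigma^2} \sum_{i=1}^{n} \xv_i \xv_i^\top \succeq \Iv. \] In this conjugate case the fixed point coincides with the exact posterior covariance, and each additional data point adds a positive semi-definite term to \(\Sigmav\inv\).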
Note that Bayesian linear regression and Bayesian logistic regression both have log-concave likelihoods, and thus the above simple result applies.
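For the logistic case the optimal covariance has no closed form, but the claim can be checked numerically. The sketch below is purely illustrative: it fits a full-covariance Gaussian to a small synthetic Bayesian logistic regression problem by directly minimizing a Monte Carlo estimate of the negative ELBO (with a derivative-free Nelder–Mead optimizer, chosen only to keep the sketch short), and then inspects the eigenvalues of the fitted \(\Sigmav\), which should come out at most one up to optimization and Monte Carlo error.

```python
import numpy as np
from scipy.optimize import minimize

# Toy synthetic data for Bayesian logistic regression (purely illustrative).
rng = np.random.default_rng(0)
d, n = 2, 40
X = rng.standard_normal((n, d))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ np.array([1.0, -2.0])))).astype(float)

def neg_elbo(params, n_samples=128):
    """Negative Monte Carlo ELBO for q(w) = N(mu, L L^T) with prior N(0, I)."""
    mu, L = params[:d], np.tril(params[d:].reshape(d, d))
    # Common random numbers: a fixed seed keeps the objective deterministic across calls.
    eps = np.random.default_rng(1).standard_normal((n_samples, d))
    W = mu + eps @ L.T                          # reparameterized samples of w
    Z = W @ X.T                                 # logits for every sample and datum
    log_lik = y * Z - np.logaddexp(0.0, Z)      # Bernoulli log-likelihoods
    expected_ll = log_lik.sum(axis=1).mean()    # E_q[sum_i log p(y_i | x_i, w)]
    Sigma = L @ L.T
    kl = 0.5 * (np.trace(Sigma) + mu @ mu - d - np.linalg.slogdet(Sigma)[1])
    return -(expected_ll - kl)

init = np.concatenate([np.zeros(d), np.eye(d).ravel()])
res = minimize(neg_elbo, init, method="Nelder-Mead", options={"maxiter": 4000})
L_opt = np.tril(res.x[d:].reshape(d, d))
Sigma_opt = L_opt @ L_opt.T
print(np.linalg.eigvalsh(Sigma_opt))  # eigenvalues should all be (approximately) <= 1
```

In practice one would use reparameterized gradients and a stochastic optimizer instead of Nelder–Mead; the point here is only to make the \(\Sigmav \preceq \Iv\) claim easy to poke at numerically.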
Remark. In a previous attempt, I tried to prove a stronger version: that the optimal variational covariance is monotonically decreasing in the number of data points. The proof technique above does not carry over, and I am not sure whether the stronger version holds at all. On the flip side, I am also curious whether the log-concavity assumption is necessary. Does there exist a likelihood for which the variational posterior has even larger uncertainty than the prior?