<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://kayween.github.io/blogs/feed.xml" rel="self" type="application/atom+xml" /><link href="https://kayween.github.io/blogs/" rel="alternate" type="text/html" /><updated>2026-05-30T04:16:30+00:00</updated><id>https://kayween.github.io/blogs/feed.xml</id><title type="html">The Really Good Blog</title><subtitle>This is a really good blog, as shown by the title.</subtitle><entry><title type="html">Posterior Contraction in Variational Inference</title><link href="https://kayween.github.io/blogs/2025/12/26/vi-posterior-contraction.html" rel="alternate" type="text/html" title="Posterior Contraction in Variational Inference" /><published>2025-12-26T00:00:00+00:00</published><updated>2025-12-26T00:00:00+00:00</updated><id>https://kayween.github.io/blogs/2025/12/26/vi-posterior-contraction</id><content type="html" xml:base="https://kayween.github.io/blogs/2025/12/26/vi-posterior-contraction.html"><![CDATA[<h1 id="posterior-contraction-in-variational-inference">Posterior Contraction in Variational Inference</h1>

<p>Let \(\Dc = \{(\xv_i, y_i)\} _ {i=1}^{n}\) be a dataset.
Assume that the data points are generated independently from the distribution
\[
    p(\Dc \mid \wv) = \prod_{i=1}^{n} p((\xv_i, y_i) \mid \wv),
\]
where \(\wv\) is the weight vector of the model.
For simplicity, we assume the weight vector \(\wv\) has a standard normal prior \(p(\wv) = \Nc(\zero, \Iv)\).</p>

<p>Variational inference aims to find a variational distribution \(q \in \Qc\) that approximates the exact posterior \( p(\wv \mid \Dc) \) by minimizing the Kullback–Leibler divergence
\[
    \Ds_\KL\big(q(\wv), p(\wv \mid \Dc)\big),
\]
which measures the difference between the variational approximation and the posterior.
This is equivalent to maximizing the evidence lower bound (ELBO):
\[
\begin{equation}
\tag{ELBO}
    \maxi_{q \in \Qc} \sum_{i=1}^{n} \Eb _ {q(\wv)} \log p((\xv_i, y_i) \mid \wv) - \Ds_\KL(q(\wv), p(\wv)).
\end{equation}
\]
In practice, it is common to use multivariate normal distributions as the variational family \(\Qc\).</p>

<p>Our goal is to understand how the uncertainty of the variational posterior shrinks as the number of data points increases.
Intuitively, the data should reduce the uncertainty in the prior.
We can characterize this intuition to some extent as follows.</p>

<h2 id="a-special-case">A Special Case</h2>
<p>Let the multivariate normal distribution \(\Nc(\muv, \Sigmav)\) be the optimal variational distribution by maximizing the ELBO.
Then, the variational posterior has smaller uncertainty than the prior, i.e., \(\Sigmav \preceq \Iv\),
if the likelihood \( p((\xv_i, y_i) \mid \wv) \) is log-concave in \( \wv \) for all \(i \in [n]\).</p>

<p>By the first-order optimality condition, the derivative of the ELBO with respect to the covariance \(\Sigmav\) has to be zero.
The gradient of the first term can be computed by Price’s theorem:
\[
\frac{\partial}{\partial \Sigmav} \sum_{i=1}^{n} \Eb _ {q(\wv)} \log p((\xv_i, y_i) \mid \wv)
=
\frac12 \sum_{i=1}^{n} \Eb _ {q(\wv)} \Big[\nabla_{\wv}^2 \log p((\xv_i, y_i) \mid \wv)\Big],
\]
where the right-hand side involves the Hessian of the log-likelihood with respect to \(\wv\).
The derivative of the second term is easy to compute since KL divergence between two multivariate normal distributions has a closed form:
\[
\frac{\partial}{\partial \Sigmav} \Ds_\KL(q(\wv), p(\wv))
=
-\frac12 \Sigmav\inv + \frac12 \Iv.
\]
The ELBO is the first term minus the KL term, so at the optimum the two derivatives above are equal; rearranging terms gives
\[
\Sigmav\inv - \Iv
=
-\sum_{i=1}^{n} \Eb _ {q(\wv)} \Big[\nabla_{\wv}^2 \log p((\xv_i, y_i) \mid \wv)\Big].
\]
The right-hand side is positive semi-definite, since each log-likelihood term \(\log p((\xv_i, y_i) \mid \wv)\) is concave in \(\wv\).
Hence, we obtain \(\Sigmav\inv \succeq \Iv\), which immediately implies \(\Sigmav \preceq \Iv\).</p>
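<p>For reference, the closed form behind the second derivative can be written out explicitly. With \(d\) denoting the dimension of \(\wv\) (a symbol introduced here for illustration),
\[
\Ds_\KL\big(\Nc(\muv, \Sigmav), \Nc(\zero, \Iv)\big)
=
\frac12 \Big( \operatorname{tr}(\Sigmav) + \muv^\top \muv - d - \log \det \Sigmav \Big).
\]
Differentiating with respect to \(\Sigmav\), using \(\frac{\partial}{\partial \Sigmav} \operatorname{tr}(\Sigmav) = \Iv\) and \(\frac{\partial}{\partial \Sigmav} \log \det \Sigmav = \Sigmav\inv\) for symmetric \(\Sigmav\), recovers \(-\frac12 \Sigmav\inv + \frac12 \Iv\).</p>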

<p>Note that Bayesian linear regression and Bayesian logistic regression both have log-concave likelihoods, and thus the above simple result applies.</p>
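<p>As a quick numerical sanity check, the sketch below looks at Bayesian linear regression with Gaussian noise. In this conjugate case the exact posterior is Gaussian, so it is also the ELBO maximizer over Gaussians, and the stationarity condition above reads \(\Sigmav\inv = \Iv + \Xv^\top \Xv / \sigma^2\). The noise variance <code>noise_var</code>, the sizes, and the random data are assumptions made purely for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
noise_var = 0.5  # assumed observation noise variance (illustrative)

X = rng.standard_normal((n, d))

# Stationarity condition: Sigma^{-1} - I = -sum_i E_q[Hessian_i].
# For a Gaussian likelihood the Hessian of each log-likelihood term is
# -x_i x_i^T / noise_var, independent of w, so the expectation is trivial.
precision = np.eye(d) + X.T @ X / noise_var
Sigma = np.linalg.inv(precision)

# Sigma is below the prior covariance I in the positive semi-definite order
# exactly when all eigenvalues of I - Sigma are nonnegative.
gap_eigs = np.linalg.eigvalsh(np.eye(d) - Sigma)
print("smallest eigenvalue of I - Sigma:", gap_eigs.min())
```

<p>The smallest eigenvalue of \(\Iv - \Sigmav\) should come out nonnegative, in line with \(\Sigmav \preceq \Iv\).</p>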

<p><strong>Remark.</strong>
In a previous attempt, I tried to prove a stronger statement: that the optimal variational covariance decreases monotonically as the number of data points grows.
But the above proof technique no longer works.
I am also not sure if the stronger version holds at all.
On the flip side, I am also interested to know if the log-concave assumption is necessary.
Does there exist a likelihood such that the variational posterior uncertainty is even larger than the prior?</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Posterior Contraction in Variational Inference]]></summary></entry><entry><title type="html">Matrix Calculus</title><link href="https://kayween.github.io/blogs/2025/11/26/matrix-calculus.html" rel="alternate" type="text/html" title="Matrix Calculus" /><published>2025-11-26T00:00:00+00:00</published><updated>2025-11-26T00:00:00+00:00</updated><id>https://kayween.github.io/blogs/2025/11/26/matrix-calculus</id><content type="html" xml:base="https://kayween.github.io/blogs/2025/11/26/matrix-calculus.html"><![CDATA[<h1 id="matrix-calculus">Matrix Calculus</h1>

<p>This post is concerned with computing the derivatives of matrix functions with a particular focus on reverse-mode automatic differentiation (AD).</p>

<p>To set the stage, let us consider the following computational graph
\[
    x \to y \leadsto \cdots \leadsto f,
\]
which starts from \(x\), computes \(y\), and then eventually arrives at the final output \(f(x)\).
We assume the output \(f(x)\) is a scalar (which is ubiquitous in machine learning, e.g., the training loss of deep neural networks).
We are interested in computing the derivative
\[
    \bar x = \frac{\partial}{\partial x} f
\]
by reverse mode AD (a.k.a. the backward pass in deep learning).
That is, we wish to construct the above derivative using \(\bar y = \frac{\partial}{\partial y} f\).</p>

<p>For the time being, let us assume both \(x\) and \(y\) are vectors.
Let \(J(x)\) be the Jacobian matrix of \(y\) with respect to \(x\).
Then, we can write
\[
    \diff y = J(x) \diff x.
\]
It is not hard to see that \(\bar x\) and \(\bar y\) are related via the Jacobian:
\[
    \bar x = J(x)^\top \bar y.
\]
Despite its simplicity, the above process reveals two key steps for reverse mode AD:</p>
<ol>
  <li>Calculate the Jacobian (derivative) of \(y\) with respect to \(x\);</li>
  <li>Take the transpose (adjoint) of the Jacobian and apply it to \(\bar y\).</li>
</ol>

<p>In what follows, we will use the same two steps to derive reverse mode AD for matrix operations, where both \(x\) and \(y\) become matrices.
What makes it challenging is that the derivative of \(y\) with respect to \(x\) cannot be easily written as a Jacobian matrix anymore; this derivative becomes a linear operator.
As a result, we will have to take the adjoint of the linear operator, which is often more complicated than simply taking the transpose of a Jacobian.</p>
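<p>The two steps in the vector case can be sketched in a few lines of NumPy. The function \(y = \tanh(W x)\) and the scalar output \(f = \sum_i y_i^2\) are hypothetical choices made only for this example; the result is compared against central finite differences.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))

def forward(x):
    """Compute y = tanh(W x) and the scalar output f = sum(y^2)."""
    y = np.tanh(W @ x)
    return y, np.sum(y ** 2)

x = rng.standard_normal(4)
y, f = forward(x)

# Incoming gradient from the rest of the graph: y_bar = df/dy.
y_bar = 2.0 * y
# Step 1: Jacobian of y with respect to x, J = diag(1 - y^2) W.
J = (1.0 - y ** 2)[:, None] * W
# Step 2: apply the transpose of the Jacobian to y_bar.
x_bar = J.T @ y_bar

# Compare with central finite differences of f.
eps = 1e-6
x_fd = np.array([
    (forward(x + eps * e)[1] - forward(x - eps * e)[1]) / (2 * eps)
    for e in np.eye(4)
])
print(np.max(np.abs(x_bar - x_fd)))
```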

<h2 id="warm-up-matrix-multiplication">Warm Up: Matrix Multiplication</h2>

<p>Let \(Y = A X B\), where \(A, X, B\) are matrices with appropriate sizes.
Differentiating both sides of the equation gives the derivative:
\[
    \diff Y = A \diff X B.
\]
Unlike the vector setting, we cannot write the derivative in a simple matrix form.
(Technically, we still can using Kronecker products. But it’s strongly discouraged.)
This derivative is best viewed as a linear operator applied to \(\diff X\).</p>

<p>Next, we need to take the adjoint (transpose) of the derivative.
Some simple calculation yields
\[
\diff f
=
\langle \bar Y, \diff Y \rangle
=
\langle \bar Y, A \diff X B \rangle
=
\langle A^\top \bar Y B^\top, \diff X \rangle,
\]
which implies that \(\bar X = A^\top \bar Y B^\top\).</p>
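<p>A finite-difference check of this backward rule might look as follows. The scalar output \(f = \langle \bar Y, A X B \rangle\) is an assumption chosen so that \(\partial f / \partial Y\) equals a fixed matrix <code>Y_bar</code>; the shapes are arbitrary.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))
Y_bar = rng.standard_normal((3, 2))  # plays the role of df/dY

def f(X):
    # A scalar output whose gradient with respect to Y = A X B is Y_bar.
    return np.sum(Y_bar * (A @ X @ B))

# Backward rule derived above.
X_bar = A.T @ Y_bar @ B.T

# Central finite differences, one entry of X at a time.
eps = 1e-6
X_fd = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        X_fd[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.max(np.abs(X_bar - X_fd)))
```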

<h2 id="inverse-matrix-multiplication">Inverse Matrix Multiplication</h2>

<p>Let \(Y = A X\inv B\).
Differentiating both sides of the equation gives
\[
\diff Y
=
A \diff \big(X\inv\big) B
= - A X\inv \diff X X\inv B,
\]
where the second equality is from the matrix cookbook (Petersen &amp; Pedersen 2008; Eq (59)).
Next, taking the adjoint of the derivative yields
\[
\diff f
=
\langle \bar{Y}, \diff Y \rangle
=
-\langle \bar{Y}, A X\inv \diff X X\inv B \rangle
=
-\langle X^{-\top} A^\top \bar{Y} B^\top X^{-\top}, \diff X \rangle.
\]
The last expression implies that the backward pass is
\[
    \bar{X} = - X^{-\top} A^\top \bar{Y} B^\top X^{-\top}.
\]</p>
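<p>The same kind of check applies to the inverse rule. In this sketch the explicit inverse is replaced by linear solves, which is a common numerical preference rather than something the derivation requires; the test matrix is shifted towards a multiple of the identity so that it is comfortably invertible, and all sizes are illustrative.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((3, n))
B = rng.standard_normal((n, 2))
X = 2.0 * np.eye(n) + 0.3 * rng.standard_normal((n, n))  # kept away from singularity
Y_bar = rng.standard_normal((3, 2))

def f(X):
    # Scalar output with df/dY = Y_bar, where Y = A X^{-1} B.
    return np.sum(Y_bar * (A @ np.linalg.solve(X, B)))

# X_bar = -X^{-T} A^T Y_bar B^T X^{-T}
G = A.T @ Y_bar @ B.T                    # A^T Y_bar B^T
left = np.linalg.solve(X.T, G)           # X^{-T} G
X_bar = -np.linalg.solve(X, left.T).T    # -(X^{-1} left^T)^T = -left X^{-T}

# Central finite differences, one entry of X at a time.
eps = 1e-6
X_fd = np.zeros_like(X)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(X)
        E[i, j] = eps
        X_fd[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.max(np.abs(X_bar - X_fd)))
```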

<h2 id="cholesky-decomposition">Cholesky Decomposition</h2>

<p>Let \(L L^\top = \Sigma\) be the Cholesky decomposition of a symmetric positive definite matrix \(\Sigma\), so that \(L\) is invertible.
Now we compute the derivative
\(
    \bar \Sigma = \frac{\partial}{\partial \Sigma} f
\)
by reverse mode AD.</p>

<p>Differentiating both sides of the equation \(\Sigma = L L^\top\) gives
\[
    \diff \Sigma = \diff L L^\top + L \diff L^\top,
\]
where we emphasize that \(\diff \Sigma\) is symmetric and \(\diff L\) is lower triangular.
To get the derivative, we need to express \(\diff L\) in terms of \(\diff \Sigma\), which requires some algebra.
We skip the step-by-step derivation here and directly give the final identity by Murray (2016, Section 3):
\[
\begin{equation}
\label{eq:cholesky_derivative}
    \diff L = L \Phi\big(L\inv \diff \Sigma L^{-\top}\big),
\end{equation}
\]
where \(\Phi\) is an element-wise function that zeros out the strictly upper triangular part and halves the diagonal entries, i.e., \(\Phi(Z) = \operatorname{tril}(Z) - \frac12 \operatorname{diag}(Z)\) for any symmetric \(Z\).</p>

<p>Next, we need to take the adjoint (i.e., transpose) of the derivative in \eqref{eq:cholesky_derivative}.
Recall that the derivative \eqref{eq:cholesky_derivative} is a linear operator that maps from the space of symmetric matrices to the space of lower triangular matrices.
Hence, its adjoint is a linear operator that takes a lower triangular matrix and returns a symmetric matrix.</p>

<p>Some algebra gives
\[
\diff f
=
\langle \bar{L}, \diff L \rangle
=
\Big\langle \bar{L}, L \Phi\big(L\inv \diff \Sigma L^{-\top}\big) \Big\rangle
=
\Big\langle \operatorname{tril}\big(L^\top \bar{L}\big), \Phi\big(L\inv \diff \Sigma L^{-\top}\big) \Big\rangle,
\]
where we stress that the linear function \(\operatorname{tril}(\cdot)\) taking the lower triangular part (including the diagonal) cannot be omitted because the above inner product is defined in the space of lower triangular matrices.</p>

<p>Let \(\Phi^*\) be the adjoint of the linear function \(\Phi\).
It is not hard to see that
\[
    \Phi^ *(Z) = \frac12\big(Z + Z^\top - \operatorname{diag}(Z)\big)
\]
for any lower triangular \(Z\).
The rest is a straightforward calculation:
\[
\diff f
=
\Big\langle \Phi^ *\big(\operatorname{tril}\big(L^\top \bar{L}\big)\big), L\inv \diff \Sigma L^{-\top} \Big\rangle
=
\Big\langle L^{-\top} \Phi^ *\big(\operatorname{tril}\big(L^\top \bar{L}\big)\big) L\inv, \diff \Sigma \Big\rangle,
\]
where we emphasize that both inner products are defined in the space of symmetric matrices.
From the last expression, we obtain \(\bar\Sigma = L^{-\top} \Phi^ *\big(\operatorname{tril}\big(L^\top \bar{L}\big)\big) L\inv\).
This expression is <a href="https://github.com/pytorch/pytorch/blob/4a0693682a8574bdc36e1ca2ea7bd2ddf5c19340/torch/csrc/autograd/FunctionsManual.cpp#L1984-L2010">what’s implemented in PyTorch</a>, different from the one in Murray (2016, Section 3).</p>
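<p>A minimal NumPy sketch of this backward rule is given below. It is not the PyTorch code; the helper names <code>phi_adjoint_of_tril</code> and <code>cholesky_backward</code> are hypothetical, and general solves are used for brevity where triangular solves would be the efficient choice. The check compares a directional derivative along a random symmetric direction with the inner product \(\langle \bar\Sigma, \diff \Sigma \rangle\).</p>

```python
import numpy as np

def phi_adjoint_of_tril(M):
    """Return Phi^*(tril(M)) = (tril(M) + tril(M)^T - diag(M)) / 2."""
    T = np.tril(M)
    return 0.5 * (T + T.T - np.diag(np.diag(M)))

def cholesky_backward(L, L_bar):
    """Map L_bar = df/dL to Sigma_bar = df/dSigma for Sigma = L L^T."""
    P = phi_adjoint_of_tril(L.T @ L_bar)
    left = np.linalg.solve(L.T, P)          # L^{-T} P
    return np.linalg.solve(L.T, left.T).T   # (L^{-T} left^T)^T = left L^{-1}

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
Sigma = M @ M.T + n * np.eye(n)             # symmetric positive definite
W = np.tril(rng.standard_normal((n, n)))    # fixed lower triangular weights

def f(Sigma):
    # Scalar output with df/dL = W, where L is the Cholesky factor of Sigma.
    return np.sum(W * np.linalg.cholesky(Sigma))

L = np.linalg.cholesky(Sigma)
Sigma_bar = cholesky_backward(L, W)

# Directional derivative along a random symmetric direction D.
D = rng.standard_normal((n, n))
D = D + D.T
eps = 1e-6
fd = (f(Sigma + eps * D) - f(Sigma - eps * D)) / (2 * eps)
analytic = np.sum(Sigma_bar * D)
print(fd, analytic)
```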

<h2 id="final-remarks">Final Remarks</h2>

<p>The matrix cookbook by Petersen &amp; Pedersen (2008) is always a good reference when it comes to linear algebra.
But it does not cover enough material for matrix calculus.
The MIT IAP course (Matrix Calculus for Machine Learning and Beyond) by Edelman and Johnson is an excellent resource for matrix calculus and automatic differentiation.</p>

<p>I have always found reverse mode AD more complicated to implement than forward mode AD.
One of the reasons is that we need to take the adjoint of the derivative operator in reverse mode, which is not always straightforward when the derivative is complicated.</p>

<p>Throughout this post, we have used \(\bar X\) to denote the derivative of the final output w.r.t. \(X\).
The usual derivative notation \(\dot X\) is typically reserved for the derivative of \(X\) w.r.t. the very first input, which is commonly used in forward mode AD.</p>

<p>These two identities are worth remembering:
\(
    \langle \Av, \Bv \Cv \rangle = \langle \Av \Cv^\top, \Bv \rangle
\)
and
\(
    \langle \Av, \Bv \Cv \rangle = \langle \Bv^\top \Av, \Cv \rangle
\),
which arise frequently when doing matrix calculus.</p>
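<p>Both identities follow from the cyclic property of the trace, and they are easy to confirm numerically. The snippet below is a small check with arbitrary shapes; <code>inner</code> is a helper defined here for the Frobenius inner product.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))
A = rng.standard_normal((3, 5))

def inner(P, Q):
    """Frobenius inner product, i.e. trace(P^T Q)."""
    return np.sum(P * Q)

lhs = inner(A, B @ C)
move_right_factor = inner(A @ C.T, B)   # C moves to the other side, transposed
move_left_factor = inner(B.T @ A, C)    # B moves to the other side, transposed
print(lhs, move_right_factor, move_left_factor)
```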

<h2 id="references">References</h2>

<ol>
  <li>Murray, I. (2016). Differentiation of the Cholesky decomposition. arXiv preprint arXiv:1602.07527.</li>
</ol>]]></content><author><name></name></author><summary type="html"><![CDATA[Matrix Calculus]]></summary></entry><entry><title type="html">Projection Matrix</title><link href="https://kayween.github.io/blogs/2024/08/31/projection-matrix.html" rel="alternate" type="text/html" title="Projection Matrix" /><published>2024-08-31T00:00:00+00:00</published><updated>2024-08-31T00:00:00+00:00</updated><id>https://kayween.github.io/blogs/2024/08/31/projection-matrix</id><content type="html" xml:base="https://kayween.github.io/blogs/2024/08/31/projection-matrix.html"><![CDATA[<h1 id="projection-matrix">Projection Matrix</h1>

<p>Consider a linear regression problem with a given dataset \((\Xv, \yv)\), where \(\Xv \in \Rb^{n \times d}\) and \(\yv \in \Rb^n\).
The optimal weight of linear regression is \(\wv^* = \big(\Xv^\top \Xv\big)\inv \Xv^\top \yv\).
It is easy to see that the prediction on the training set is a linear transformation of the training labels:
\[
    \hat\yv = \Xv \wv^* = \Xv \big(\Xv^\top \Xv\big)\inv \Xv^\top \yv,
\]
where we call \(\Pv = \Xv \big(\Xv^\top \Xv\big)\inv \Xv^\top \in \Rb^{n \times n}\) the projection matrix.</p>

<p>The projection matrix \(\Pv \in \Rb^{n \times n}\) has the following properties (taken directly from <a href="https://en.wikipedia.org/wiki/Projection_matrix#Properties">Wikipedia</a>).</p>
<ol>
  <li>The residual \(\rv = \yv - \hat\yv\) is orthogonal to the column space of \(\Xv\).</li>
  <li>\(\Pv\) is idempotent, i.e., \(\Pv^2 = \Pv\).</li>
  <li>The eigenvalues of \(\Pv\) are either zeros or ones.</li>
  <li>\(\Xv\) is invariant under \(\Pv\), i.e., \(\Pv \Xv = \Xv\).</li>
</ol>

<p>All of the above can be proved by brute-force calculation.
But I prefer the following argument, which explains where the name “projection” comes from.</p>

<p>It’s not hard to see that \(\hat \yv\) is a Euclidean projection of \(\yv\) onto the column space of \(\Xv\):
\[
    \hat \yv = \argmin_{\zv \in \operatorname{range}(\Xv)} \lVert \zv - \yv \rVert,
\]
which implies that \(\Pv\) implements this projection operator.
By the first-order optimality condition, we immediately have \((\hat \yv - \yv) \perp \operatorname{range}(\Xv)\), which proves (1).
For an arbitrary \(\yv \in \Rb^n\), applying the projection more than once is the same as applying it once, which proves (2).
If \(\Pv \uv = \lambda \uv\) for some nonzero \(\uv\), idempotence gives \(\lambda^2 = \lambda\), so every eigenvalue is either zero or one, which proves (3).
The last statement (4) is immediate because each column of \(\Xv\) is trivially inside \(\operatorname{range}(\Xv)\) and thus is invariant under \(\Pv\).</p>
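<p>For readers who do prefer a brute-force check, the four properties can be verified numerically on a random full-column-rank design matrix. The sizes below are arbitrary, and \(\Pv\) is formed with a linear solve instead of an explicit inverse.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# P = X (X^T X)^{-1} X^T, formed via a linear solve.
P = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = P @ y

residual_orthogonal = np.allclose(X.T @ (y - y_hat), 0.0)   # property 1
idempotent = np.allclose(P @ P, P)                          # property 2
eigs = np.linalg.eigvalsh((P + P.T) / 2)                    # property 3 (ascending order)
x_invariant = np.allclose(P @ X, X)                         # property 4
print(residual_orthogonal, idempotent, np.round(eigs, 8), x_invariant)
```

<p>The eigenvalues should consist of \(n - d\) zeros and \(d\) ones, matching the dimension of the column space of \(\Xv\).</p>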

<!-- connections with projection matrices and pseudo inverse -->

<h3 id="exercises"><strong>Exercises</strong></h3>
<ol>
  <li>Let \(V \subseteq \Hc\) be a closed linear subspace of a Hilbert space.
Show that the projection operator onto \(V\) is a linear function, where the projection operator is defined as
\[
 \xv \mapsto \argmin_{\zv \in V} \; \lVert \zv - \xv \rVert_\Hc.
\]</li>
</ol>]]></content><author><name></name></author><summary type="html"><![CDATA[Projection Matrix]]></summary></entry><entry><title type="html">Multivariate Normal Probability</title><link href="https://kayween.github.io/blogs/2024/06/28/multivariate-gaussian-probability.html" rel="alternate" type="text/html" title="Multivariate Normal Probability" /><published>2024-06-28T00:00:00+00:00</published><updated>2024-06-28T00:00:00+00:00</updated><id>https://kayween.github.io/blogs/2024/06/28/multivariate-gaussian-probability</id><content type="html" xml:base="https://kayween.github.io/blogs/2024/06/28/multivariate-gaussian-probability.html"><![CDATA[<h1 id="multivariate-normal-probability">Multivariate Normal Probability</h1>

<p>Given a Gaussian random variable \(\xv \sim \Nc(\muv, \Sigmav)\) and a constant vector \(\uv \in \Rb^d\), we are interested in computing the probability
\[
    \Pr(\xv \leq \uv),
\]
where the inequality is element-wise.</p>

<p>This probability is exactly the CDF of multivariate normal distributions—a fundamental quantity that characterizes the behavior of the distribution.
Yet, perhaps surprisingly, it has no closed-form expression except in a few special cases.
Thus, one has to resort to numerical methods, such as Monte Carlo, to estimate the probability.</p>

<h2 id="a-slightly-more-general-problem">A Slightly More General Problem</h2>

<p>We introduce a slightly more general formulation with the following two modifications:</p>
<ol>
  <li>We restrict \(\xv \sim \Nc(\zero, \Iv)\) to be a standard normal random variable;</li>
  <li>But we allow a more general linear inequality \(\vv \leq \Av \xv \leq \uv\).</li>
</ol>

<p>Namely, we arrive at the integral
\[
\begin{equation}
\label{eq:normal-prob}
    \Pr(\vv \leq \Av \xv \leq \uv) = \int_{\vv \leq \Av \xv \leq \uv} \phi(\xv) \diff \xv,
\end{equation}
\]
where \(\phi(\xv) \propto \exp(-\frac12 \xv^\top \xv)\) is the standard normal density.</p>

<p>For the sake of presentation, we assume \(\Av \in \Rb^{d \times d}\) is a full rank lower triangular matrix.
The seemingly more general case where \(\Av \in \Rb^{m \times d}\) is non-square (but still full row rank) and \(\xv\) is non-standard normal can be reduced to \eqref{eq:normal-prob} by a change of variables.</p>

<h4 id="non-triangular-square-matrices"><strong>Non-Triangular Square Matrices</strong></h4>
<p>Suppose \(\Av\) is square and full rank but not triangular.
Let \(\Av = \Lv \Qv\) be its LQ decomposition.
Then, a change of variables \(\zv = \Qv \xv\) reduces the problem to \(\Pr(\vv \leq \Lv \zv \leq \uv)\), where \(\zv\) is standard normal and \(\Lv\) is lower triangular.</p>

<h4 id="non-square-wide-matrices"><strong>Non-Square Wide Matrices</strong></h4>
<p>Suppose \(\Av \in \Rb^{m \times d}\) is rectangular with \(m &lt; d\) and full row rank.
As in the previous argument, applying an LQ decomposition \(\Av = \Lv \Qv\) and a change of variables reduces the problem to
\(\Pr(\vv \leq \Lv \zv \leq \uv)\),
where \(\Lv \in \Rb^{m \times d}\) and \(\Qv \in \Rb^{d \times d}\).
Note that the last \((d - m)\) columns of \(\Lv\) are zeros, which implies that the last \((d - m)\) coordinates of \(\zv\) are irrelevant.
Thus, marginalizing out those coordinates reduces the problem to \eqref{eq:normal-prob}.</p>

<h4 id="non-standard-normal"><strong>Non-Standard Normal</strong></h4>
<p>Suppose \(\xv \sim \Nc(\muv, \Sigmav)\) is a non-standard normal random variable.
Let \(\Lv \Lv^\top = \Sigmav\) be the Cholesky decomposition.
A change of variables \(\xv = \Lv \zv + \muv\) yields \(\Pr(\vv \leq \Av(\Lv \zv + \muv) \leq \uv)\), where \(\zv\) is standard normal.
Reusing the previous arguments completes the reduction.</p>
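<p>The reductions above can be sketched with NumPy. NumPy has no dedicated LQ routine, so this sketch obtains \(\Av = \Lv \Qv\) from the QR decomposition of \(\Av^\top\); the helper name <code>lq_positive_diag</code> is hypothetical, and it additionally flips signs so that the diagonal of \(\Lv\) is positive, which assumes a full rank input. Note that the change of variables \(\xv = \Lv \zv + \muv\) also shifts the bounds to \(\vv - \Av \muv\) and \(\uv - \Av \muv\).</p>

```python
import numpy as np

def lq_positive_diag(A):
    """LQ decomposition A = L Q with a positive diagonal in L (A full rank)."""
    Q_t, R = np.linalg.qr(A.T)        # A^T = Q_t R, hence A = R^T Q_t^T
    L, Q = R.T, Q_t.T
    signs = np.sign(np.diag(L))       # nonzero because A has full rank
    # Flip column j of L and row j of Q together; the product L Q is unchanged.
    return L * signs, signs[:, None] * Q

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))
mu = rng.standard_normal(d)
M = rng.standard_normal((d, d))
Sigma = M @ M.T + d * np.eye(d)
v, u = -np.ones(d), np.ones(d)

# Non-standard normal: x = C z + mu with C C^T = Sigma and z standard normal,
# so v <= A x <= u becomes (v - A mu) <= (A C) z <= (u - A mu).
C = np.linalg.cholesky(Sigma)
A_std = A @ C
v_std, u_std = v - A @ mu, u - A @ mu

# Non-triangular matrix: A_std = L Q with Q orthogonal, so Q z is again
# standard normal and the constraint matrix L is lower triangular.
L, Q = lq_positive_diag(A_std)
print(np.max(np.abs(L @ Q - A_std)))
```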

<h2 id="monte-carlo-estimate-by-separation-of-variables">Monte Carlo Estimate by Separation of Variables</h2>
<p>We will estimate the integral \eqref{eq:normal-prob} by importance sampling.
The basic idea is to construct a different distribution that is easy to sample from and whose support is exactly the domain defined by the inequality constraints.</p>

<p>Since \(\Av\) is lower triangular, the linear inequality constraints \(\vv \leq \Av \xv \leq \uv\) can be rewritten as
\[
    \tilde v_i \leq x_i \leq \tilde u_i, \quad i = 1, 2, \cdots, d,
\]
where \(\tilde v_i = \frac{1}{a_{ii}} \Big(v_i - \sum_{j=1}^{i-1} a_{ij} x_j \Big)\)  and \(\tilde u_i = \frac{1}{a_{ii}} \Big(u_i - \sum_{j=1}^{i-1} a_{ij} x_j \Big)\).
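Note that \(\tilde v_i\) and \(\tilde u_i\) are functions of the first \(i-1\) coordinates only.
Here we assume the diagonal entries \(a_{ii}\) are positive; if some \(a_{ii}\) is negative, the two bounds swap.</p>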

<p>Let \(x_{1:i-1}\) denote the first \(i-1\) coordinates in \(\xv\).
Consider the probability distribution of the form
\[
\begin{equation}
\label{eq:importance-dist}
    p(\xv) = \prod_{i=1}^{d} p_i(x_i \mid x_{1:i-1}),
\end{equation}
\]
where
\[
\begin{equation}
\label{eq:conditional}
    p_i(x_i \mid x_{1:i-1}) = \frac{\phi(x_i) \cdot \mathbf{1}[\tilde v_i \leq x_i \leq \tilde u_i]}{\Phi(\tilde u_i) - \Phi(\tilde v_i)}.
\end{equation}
\]</p>

<p>Sampling from \eqref{eq:importance-dist} is easy, because each \(p_i\) is a univariate truncated normal distribution.
Using \(p(\xv)\) as the proposal distribution in importance sampling to estimate \eqref{eq:normal-prob} yields
\[
\eqref{eq:normal-prob}
=
\int \frac{\phi(\xv)}{p(\xv)} \cdot p(\xv) \diff \xv
=
\Eb_{\xv \sim p(\xv)}\left[\prod_{i=1}^{d} \Big(\Phi(\tilde u_i) - \Phi(\tilde v_i)\Big) \right],
\]
where the right hand side is readily estimated by Monte Carlo samples.
We have removed the inequality constraints because samples from \(p(\xv)\) automatically satisfy the constraints.</p>
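<p>A minimal sketch of this estimator, using only the Python standard library, might look as follows. It assumes \(\Av\) is lower triangular with positive diagonal entries and samples each truncated normal by inverting its CDF; the function names, the sample size, and the seed are illustrative choices. The example estimates \(\Pr(x_1 \leq 0, \, x_1 + x_2 \leq 0)\), an orthant probability of two standard normal variables with correlation \(1/\sqrt{2}\), whose exact value is \(\frac14 + \frac{1}{2\pi} \arcsin\big(\tfrac{1}{\sqrt 2}\big) = \frac38\).</p>

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def std_normal_cdf(t):
    """Standard normal CDF; also handles t = +/- infinity."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def estimate_probability(A, v, u, num_samples=20000, seed=0):
    """Estimate Pr(v <= A x <= u) for x ~ N(0, I) by separation of variables.

    A is a lower triangular list of lists with positive diagonal entries.
    """
    rng = random.Random(seed)
    d = len(A)
    total = 0.0
    for _ in range(num_samples):
        x = []
        weight = 1.0
        for i in range(d):
            shift = sum(A[i][j] * x[j] for j in range(i))
            lo = std_normal_cdf((v[i] - shift) / A[i][i])  # Phi(v_tilde_i)
            hi = std_normal_cdf((u[i] - shift) / A[i][i])  # Phi(u_tilde_i)
            weight *= hi - lo
            # Inverse-CDF sampling from the truncated normal p_i; the clamp
            # keeps the argument of inv_cdf strictly inside (0, 1).
            p = lo + rng.random() * (hi - lo)
            p = min(max(p, 1e-15), 1.0 - 1e-15)
            x.append(STD_NORMAL.inv_cdf(p))
        total += weight
    return total / num_samples

# Pr(x_1 <= 0, x_1 + x_2 <= 0) for independent standard normals x_1, x_2.
A = [[1.0, 0.0], [1.0, 1.0]]
v = [-math.inf, -math.inf]
u = [0.0, 0.0]
estimate = estimate_probability(A, v, u)
print(estimate)
```

<p>The infinite lower bounds need no special handling here, because the standard normal CDF maps \(-\infty\) to zero.</p>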

<h2 id="discussion">Discussion</h2>
<p>Perhaps surprisingly, this fundamental problem of estimating high dimensional normal probability is still under active research.
The method of separation of variables in this post was developed in the 1990s (Genz, 1992).
Recent developments are still based on this idea, which we briefly mention below.</p>

<p>The method of separation of variables we presented above is based on univariate conditioning: The conditional distribution \eqref{eq:conditional} is a univariate distribution.
Bivariate conditioning (Genz and Trinh, 2016) instead uses bivariate conditional distributions \(p(x_{2i + 1}, x_{2i + 2} \mid x_{1:2i})\) for \(i = 0, 1, \cdots, \frac12 (d - 2)\) (assuming the dimension \(d\) is even).</p>

<p>A natural idea is tweaking the order of the variables in the conditional distributions \eqref{eq:conditional}.
For instance, Genz and Bretz (2009) propose a heuristic to find a permutation of the coordinates that reduces the Monte Carlo error.</p>

<p>Minimax tilting modifies the conditional distribution \eqref{eq:conditional} by an exponential tilting (Botev, 2017).
Very roughly speaking, they replace the standard normal density \(\phi(x)\) in the conditional distribution \eqref{eq:conditional} with a shifted normal density \(\phi(x; \mu, 1)\), where the shift parameter \(\mu\) is chosen carefully.</p>

<p>A careful reader will notice that we have ignored the case when the matrix \(\Av\) in \eqref{eq:normal-prob} has more rows than columns, i.e., more constraints than dimensions.
In fact, this is a hard one, which has to rely on Markov chain Monte Carlo methods combined with sophisticated numerical integration methods.
We have intentionally skipped this case.</p>

<h3 id="exercises"><strong>Exercises</strong></h3>
<ol>
  <li>
    <p>Show that separation of variables still works when some entries of \(\vv\) and \(\uv\) in  \eqref{eq:normal-prob} are infinite.
Hence, the method of separation of variables indeed applies to the normal CDF as well.</p>
  </li>
  <li>
    <p>Show the distribution \eqref{eq:importance-dist} is <strong><em>not</em></strong> the truncated normal distribution \(\phi(\xv \mid \vv \leq \Av \xv \leq \uv)\).</p>
  </li>
  <li>
    <p>What are the advantages of the importance distribution \eqref{eq:importance-dist} compared to the uniform distribution over the polytope \(\{\xv \in \Rb^d: \vv \leq \Av \xv \leq \uv\}\)?</p>
  </li>
</ol>

<h3 id="references"><strong>References</strong></h3>
<p>Botev, Z. I. (2017). The normal law under linear restrictions: simulation and estimation via minimax tilting. Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(1), 125-148.</p>

<p>Genz, A. (1992). Numerical computation of multivariate normal probabilities. Journal of computational and graphical statistics, 1(2), 141-149.</p>

<p>Genz, A., &amp; Bretz, F. (2009). Computation of multivariate normal and t probabilities (Vol. 195). Springer Science &amp; Business Media.</p>

<p>Genz, A., &amp; Trinh, G. (2016). Numerical computation of multivariate normal probabilities using bivariate conditioning. In Monte Carlo and Quasi-Monte Carlo Methods: MCQMC, Leuven, Belgium, April 2014 (pp. 289-302). Springer International Publishing.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Multivariate Normal Probability]]></summary></entry><entry><title type="html">Matrix Quadratic Equation</title><link href="https://kayween.github.io/blogs/2024/06/08/matrix-quadratic-equation.html" rel="alternate" type="text/html" title="Matrix Quadratic Equation" /><published>2024-06-08T00:00:00+00:00</published><updated>2024-06-08T00:00:00+00:00</updated><id>https://kayween.github.io/blogs/2024/06/08/matrix-quadratic-equation</id><content type="html" xml:base="https://kayween.github.io/blogs/2024/06/08/matrix-quadratic-equation.html"><![CDATA[<h1 id="matrix-quadratic-equation">Matrix Quadratic Equation</h1>

<p>Everyone knows how to solve a univariate quadratic equation \(a x^2 + b x + c = 0\).
But it is less obvious how to solve for \( \Xv \) in the matrix quadratic equation
\[
\begin{equation}
\label{eq:symmetric-equation}
    \Xv^\top \Av \Xv + \frac12 \Bv^\top \Xv + \frac12 \Xv^\top \Bv + \Cv = 0,
\end{equation}
\]
where \( \Xv, \Bv \in \Rb^{m \times n} \), \( \Cv \in \Rb^{n \times n} \) is symmetric, and \( \Av \in \Rb^{m \times m} \) is symmetric positive definite.
All matrices are assumed to be real (we earned this privilege as computer scientists).
In what follows, we present the solution of this equation developed by Crone (1981).</p>

<h2 id="a-special-case">A Special Case</h2>
<p>Before dealing with the general case \eqref{eq:symmetric-equation}, we must solve the special case
\[
\begin{equation}
\label{eq:special-case}
    \Xv^\top \Xv = \Sv,
\end{equation}
\]
where \(\Sv\) is symmetric positive semi-definite.
This special case is simpler due to the absence of the linear term, and is covered by the following lemma (Crone, 1981, Proposition 3).</p>

<p><strong>Lemma 1.</strong>
<i>
Let \(\Xv \in \Rb^{m \times n}\) and let \(\Sv\) be an \(n \times n\) symmetric positive semi-definite matrix with rank \(r\).
Then, the matrix equation \eqref{eq:special-case} has a root if and only if \(m \geq r\).
The roots, if they exist, are of the form
\[
    \Xv = \Vv \Lambdav^{\frac12} \Uv^\top = \Vv \Uv^\top \Sv^{\frac12},
\]
where \(\Lambdav \in \Rb^{r \times r}\) and \(\Uv \in \Rb^{n \times r}\) are matrices in the (compact) spectral decomposition \(\Sv = \Uv \Lambdav \Uv^\top\), and \(\Vv \in \Rb^{m \times r}\) is an arbitrary column orthonormal matrix.
</i></p>
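<p>The root formula in Lemma 1 is easy to test numerically. The sketch below builds a random rank-\(r\) matrix \(\Sv\), takes its compact spectral decomposition, draws an arbitrary column orthonormal \(\Vv\) from a reduced QR decomposition, and checks that \(\Xv^\top \Xv = \Sv\). The sizes and the random construction are assumptions for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 5, 4, 2

# A rank-r symmetric positive semi-definite S (illustrative random instance).
F = rng.standard_normal((n, r))
S = F @ F.T

# Compact spectral decomposition: eigh sorts eigenvalues in ascending order,
# so the last r eigenpairs are the nonzero ones.
eigvals, eigvecs = np.linalg.eigh(S)
Lam = eigvals[-r:]
U = eigvecs[:, -r:]

# An arbitrary m x r matrix with orthonormal columns, from a reduced QR.
V, _ = np.linalg.qr(rng.standard_normal((m, r)))

X = V @ np.diag(np.sqrt(Lam)) @ U.T
print(np.max(np.abs(X.T @ X - S)))
```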

<!-- <i>Proof.</i>
If \\(m < r\\), the equation's two sides do not have the same rank, and thus cannot be equal.
When \\(m > r\\), it is trivial to verify \\(\Vv \Lambdav^{\frac12} \Uv^\top\\) is indeed a root as long as \\(\Vv\\) is column orthonormal.
It remains to prove all roots can be written in this form.
Let \\( \Xv \\) be an arbitrary root and define \\(\Vv = \Xv \Uv \Lambdav^{-\frac12}\\).
It is easy to verify that \\( \Vv \\) indeed has orthonormal columns.
Then, it follows that
\\[
\Vv \Lambdav^{\frac12} \Uv^\top
=
\big(\Xv \Uv \Lambdav^{-\frac12}\big) \Lambdav^{\frac12} \Uv^\top
=
\Xv \Uv \Uv^\top
=
\Xv.
\\]
It's worth noting that the last equality does not assume \\( \Uv \Uv^\top \\) is identity---it is not when \\( \Sv \\) is rank deficient.
Instead, the last equality uses
\\(
\Xv \Uv_{\perp} = 0
\\)
since
\\(
\big(\Xv \Uv_{\perp}\big)^\top
\big(\Xv \Uv_{\perp}\big)
= 0
\\).
**Q.E.D.** -->

<h2 id="the-general-case">The General Case</h2>

<p>Now we are ready to tackle the general form of the matrix quadratic equation \eqref{eq:symmetric-equation}.
The idea is based on completing the square.
Observe that the equation \eqref{eq:symmetric-equation} can be rewritten as
\[
\big(\Av^{\frac12} \Xv + \tfrac12 \Av^{-\frac12} \Bv\big)^\top
\big(\Av^{\frac12} \Xv + \tfrac12 \Av^{-\frac12} \Bv\big)
=
\frac14 \big(\Bv^\top \Av\inv \Bv - 4 \Cv\big).
\]
Let \(\Gv = \Bv^\top \Av\inv \Bv - 4 \Cv\) and call it the discriminant.
Clearly, \(\Gv\) has to be symmetric positive semi-definite for the solution to exist, since the left hand side is symmetric positive semi-definite.
Assuming it is, invoking Lemma 1 gives a general root formula:
\[
    \Xv = -\frac12 \Av\inv \Bv + \frac12 \Av^{-\frac12} \Vv \Uv^\top \Gv^{\frac12},
\]
where \( \Uv \) is a column orthonormal matrix consisting of the eigenvectors of \( \Gv \) associated with its nonzero eigenvalues, and \( \Vv \) is an arbitrary column orthonormal matrix of matching size (by Lemma 1, such a \( \Vv \) exists only when \( m \geq \operatorname{rank}(\Gv) \)).
One can verify that this formula reduces to the usual univariate quadratic formula when all matrices are \( 1 \times 1 \).
The matrix quadratic equation has infinitely many roots in most cases, e.g., when the discriminant is positive definite, because there are infinitely many choices for the column orthonormal matrix \(\Vv\).
In the univariate case, however, \(\Vv\) is either \(1\) or \( -1 \), which yields at most two roots.</p>
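<p>As a sanity check of the general root formula, the sketch below constructs a random instance with a positive definite discriminant, builds one root from an arbitrary column orthonormal \(\Vv\), and evaluates the residual of the symmetric equation. The helper <code>psd_sqrt</code>, the choice \(\Cv = -\Iv\), and the sizes are assumptions made for this example.</p>

```python
import numpy as np

def psd_sqrt(M):
    """Symmetric square root of a symmetric positive semi-definite matrix."""
    w, Q = np.linalg.eigh(M)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

rng = np.random.default_rng(0)
m, n = 5, 3

R = rng.standard_normal((m, m))
A = R @ R.T + m * np.eye(m)            # symmetric positive definite
B = rng.standard_normal((m, n))
C = -np.eye(n)                         # makes the discriminant positive definite

G = B.T @ np.linalg.solve(A, B) - 4.0 * C          # discriminant
w, U = np.linalg.eigh(G)                           # G has full rank, so U is n x n
V, _ = np.linalg.qr(rng.standard_normal((m, n)))   # arbitrary orthonormal columns

A_inv_sqrt = np.linalg.inv(psd_sqrt(A))
X = -0.5 * np.linalg.solve(A, B) + 0.5 * A_inv_sqrt @ V @ U.T @ psd_sqrt(G)

residual = X.T @ A @ X + 0.5 * B.T @ X + 0.5 * X.T @ B + C
print(np.max(np.abs(residual)))
```

<p>Redrawing \(\Vv\) with a different seed gives a different root with the same vanishing residual, which illustrates why the equation typically has infinitely many roots.</p>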

<h2 id="an-asymmetric-variant">An Asymmetric Variant</h2>

<p>Now consider the following asymmetric variant
\[
\begin{equation}
\label{eq:asymmetric-equation}
    \Xv^\top \Av \Xv + \Bv^\top \Xv + \Cv = 0,
\end{equation}
\]
where \( \Xv \) is hit by \( \Bv \) from the left side only.
Again, we assume all matrices are real, \( \Av \) is symmetric positive definite, and \( \Cv \) is symmetric.</p>

<p>If \( \Xv \) is a root of the asymmetric equation \eqref{eq:asymmetric-equation}, then \(\Bv^\top \Xv\) has to be symmetric.
This is because the right hand side of the identity
\[
    \Bv^\top \Xv = -\big(\Xv^\top \Av \Xv + \Cv\big)
\]
is symmetric.
Then, splitting \( \Bv^\top \Xv \) into two parts \( \tfrac12 \Bv^\top \Xv + \tfrac12 \Xv^\top \Bv \) suggests that the asymmetric equation reduces to the symmetric one and we are ready to apply the quadratic formula we just derived!
However, as tempting as it is, this argument has a bug.
Finding the logical bug is left as an exercise to the readers.</p>

<p>It turns out that the asymmetric matrix quadratic equation is much harder to solve.
Crone (1981) did not obtain a complete solution to the asymmetric equation when the discriminant is singular,
and only gave a solution when the discriminant is positive definite.
The result, however, is much messier and hence is omitted here.
It wasn’t until recently that a general solution was discovered (Yuan et al., 2021).
I have not yet come across the asymmetric variant in applications and thus I am less motivated to read these results.
Interested readers may refer to the original papers.</p>

<h3 id="exercises"><strong>Exercises</strong></h3>
<ol>
  <li>Solve the matrix quadratic equation \( \Xv^\top \Av \Xv + 2 \Xv + \Cv = 0 \) with the following two conditions:
all matrices \( \Av, \Cv, \Xv \) are symmetric;
\( \Av \) is positive definite and satisfies \( \Av\inv \succeq \Cv \).
Note that we need to solve the equation with the symmetric constraint \( \Xv = \Xv^\top \).
How many roots are there?</li>
</ol>

<h3 id="references"><strong>References</strong></h3>
<p>Crone, L. (1981). Second order adjoint matrix equations. Linear Algebra and Its Applications, 39, 61-71.</p>

<p>Yuan, Y., Liu, L., Zhang, H., &amp; Liu, H. (2021). The solutions to the quadratic matrix equation X* AX + B* X + D = 0. Applied Mathematics and Computation, 410, 126463.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Matrix Quadratic Equation]]></summary></entry><entry><title type="html">Welcome to Jekyll!</title><link href="https://kayween.github.io/blogs/jekyll/update/2024/05/22/welcome-to-jekyll.html" rel="alternate" type="text/html" title="Welcome to Jekyll!" /><published>2024-05-22T21:21:02+00:00</published><updated>2024-05-22T21:21:02+00:00</updated><id>https://kayween.github.io/blogs/jekyll/update/2024/05/22/welcome-to-jekyll</id><content type="html" xml:base="https://kayween.github.io/blogs/jekyll/update/2024/05/22/welcome-to-jekyll.html"><![CDATA[<p>You’ll find this post in your <code class="language-plaintext highlighter-rouge">_posts</code> directory. Go ahead and edit it and re-build the site to see your changes. You can rebuild the site in many different ways, but the most common way is to run <code class="language-plaintext highlighter-rouge">jekyll serve</code>, which launches a web server and auto-regenerates your site when a file is updated.</p>

<p>Jekyll requires blog post files to be named according to the following format:</p>

<p><code class="language-plaintext highlighter-rouge">YEAR-MONTH-DAY-title.MARKUP</code></p>

<p>Where <code class="language-plaintext highlighter-rouge">YEAR</code> is a four-digit number, <code class="language-plaintext highlighter-rouge">MONTH</code> and <code class="language-plaintext highlighter-rouge">DAY</code> are both two-digit numbers, and <code class="language-plaintext highlighter-rouge">MARKUP</code> is the file extension representing the format used in the file. After that, include the necessary front matter. Take a look at the source for this post to get an idea about how it works.</p>

<p>Jekyll also offers powerful support for code snippets:</p>

<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="k">def</span> <span class="nf">print_hi</span><span class="p">(</span><span class="nb">name</span><span class="p">)</span>
  <span class="nb">puts</span> <span class="s2">"Hi, </span><span class="si">#{</span><span class="nb">name</span><span class="si">}</span><span class="s2">"</span>
<span class="k">end</span>
<span class="n">print_hi</span><span class="p">(</span><span class="s1">'Tom'</span><span class="p">)</span>
<span class="c1">#=&gt; prints 'Hi, Tom' to STDOUT.</span></code></pre></figure>

<p>Check out the <a href="https://jekyllrb.com/docs/home">Jekyll docs</a> for more info on how to get the most out of Jekyll. File all bugs/feature requests at <a href="https://github.com/jekyll/jekyll">Jekyll’s GitHub repo</a>. If you have questions, you can ask them on <a href="https://talk.jekyllrb.com/">Jekyll Talk</a>.</p>]]></content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html"><![CDATA[You’ll find this post in your _posts directory. Go ahead and edit it and re-build the site to see your changes. You can rebuild the site in many different ways, but the most common way is to run jekyll serve, which launches a web server and auto-regenerates your site when a file is updated.]]></summary></entry></feed>