8 August 2023

An Interactive Guide to the Bivariate Gaussian Distribution

The normal distribution moves into 3D!

Introduction

It’s not long after working with the normal distribution in one dimension that you realize you need to start working with vectors and matrices. Data rarely comes in isolation and we need more statistical tools to work with multiple variables at a time. The bivariate Gaussian distribution is normally the first step into adding more dimensions, and the intuition that one builds from the 2D case is useful in higher dimensions with the multivariate Gaussian.

I use “bivariate Gaussian” and “bivariate Normal” distribution interchangeably, sometimes “2d Normal”. These all mean the same thing.

Density of the Bivariate Normal

It’s a magical mathematical bump. It was a bump in 1 dimension and it’s still just a symmetric bump in 2 dimensions. The difference now is that there are more shapes this hill can take because there’s more room to breathe in 3 dimensions! At first glance, there’s nothing glamorous about this function, but it hides a lot of mathematical beauty.

Recall that the 1d Gaussian distribution has pdf:

$$f(x;\mu_x, \sigma_x) = \frac{1}{\sigma_x\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu_x}{\sigma_x}\right)^2\right]$$

where $\sigma_x$ controls the width of the distribution and $\mu_x$ controls the location.

The story is similar for the bivariate normal distribution, which is defined by 5 parameters: $\mu_x, \mu_y$ set the mean of the distribution, while $\sigma_x, \sigma_y, \rho$ determine its covariance. We say that the pair of random variables $X, Y$ is jointly distributed as a bivariate normal distribution.
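If you want to poke at the density in code as well, here is a minimal sketch (assuming NumPy/SciPy, with example parameter values of my own choosing) of how the 5 parameters map to a mean vector and a covariance matrix, and how to evaluate the joint density:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed example values for the 5 parameters (not from the post).
mu_x, mu_y = 0.0, 0.0
sigma_x, sigma_y, rho = 1.0, 1.5, 0.5

# The covariance matrix is built from sigma_x, sigma_y, and rho.
Sigma = np.array([[sigma_x**2,              rho * sigma_x * sigma_y],
                  [rho * sigma_x * sigma_y, sigma_y**2]])

density = multivariate_normal(mean=[mu_x, mu_y], cov=Sigma)
print(density.pdf([0.5, -0.25]))   # joint density at a single (x, y) point
```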

If you examine the density function, you can pick out the parts that it shares with the 1d Gaussian. The interesting part of this distribution lies in the new parameter $\rho$, which is the correlation between the two variables and controls how much one variable moves with the other.

Here are some things to notice as you play with the function above:

  • The maximum and center of mass of the density are at $(\mu_x, \mu_y)$.
  • The mean parameters $\mu_x, \mu_y$ can move around independently of the overall shape of the density.
  • $\sigma_x, \sigma_y > 0$, and $\rho \in (-1,1)$. The density function becomes degenerate when you get to the edges of this range. (I limit the range in the applet, but hopefully just enough that you can still get a sense of what happens if you push the parameters to the extremes. Don’t break my applet!)
  • The contours of this function, no matter the values of the parameters, are elliptical.
  • When $\rho = 0$, the density is symmetric about lines through the mean parallel to the $x$ and $y$ axes.
  • When $\rho > 0$, the density seems to form a ridge along a positively sloped line, and when $\rho < 0$ it’s along a line of negative slope. The surface looks symmetric across this line. This ridge line is the major axis of the contour ellipses.
  • The ridge line is closer to the $x$-axis when $\sigma_x > \sigma_y$, and closer to the $y$-axis when $\sigma_y > \sigma_x$.
  • When $\sigma_x = \sigma_y$, as $\rho$ ranges from $-1$ to $1$, the ridge line can range from $-45^\circ$ to $45^\circ$ (from the $y$-axis). When the ratio of $\sigma_x$ to $\sigma_y$ is more extreme, the power of $\rho$ to change the slope of the ridge line is much more limited.

You should get the sense that the density of the bivariate normal is fundamentally about ellipses. It’s ellipses all the way down! This is not a coincidence — the shape of that ellipse is entirely controlled by the covariance matrix.

Bivariate Ellipses
Elliptical contours of a bivariate normal distribution.

If we rewrite the density function in vector/matrix terms so that the covariance matrix $\operatorname{Var}\left[\begin{smallmatrix}X \\ Y\end{smallmatrix}\right] = \Sigma$ shows up explicitly, we get:

$$f(\mathbf{z};\bm{\mu}_\mathbf{z}, \Sigma) = \frac{1}{2\pi}|\Sigma|^{-1/2}\exp\left(-\frac{(\mathbf{z} - \bm{\mu}_{\mathbf{z}})'\Sigma^{-1}(\mathbf{z} - \bm{\mu}_{\mathbf{z}})}{2}\right)$$

where $\mathbf{z} = \left[\begin{smallmatrix}x \\ y\end{smallmatrix}\right]$ and $\bm{\mu}_\mathbf{z} = \left[\begin{smallmatrix}\mu_x \\ \mu_y\end{smallmatrix}\right]$.

The kernel of the function, $(\mathbf{z} - \bm{\mu}_{\mathbf{z}})'\Sigma^{-1}(\mathbf{z} - \bm{\mu}_{\mathbf{z}})$, has a special name: it is the (squared) Mahalanobis distance between $\mathbf{z}$ and the bivariate normal distribution, and it is the function responsible for the elliptical contours.
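As a small sketch of that kernel (the helper and parameter values are my own, assuming NumPy), you can evaluate the quadratic form directly; every $\mathbf{z}$ with the same value sits on the same elliptical contour of the density:

```python
import numpy as np

# Assumed example parameters (not from the post).
mu = np.array([1.0, -0.5])
Sigma = np.array([[1.0, 0.75],
                  [0.75, 2.25]])        # sigma_x = 1, sigma_y = 1.5, rho = 0.5
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis_sq(z, mu, Sigma_inv):
    """Squared Mahalanobis distance (z - mu)' Sigma^{-1} (z - mu)."""
    d = z - mu
    return d @ Sigma_inv @ d

# The density depends on z only through this quadratic form, so points with
# equal values lie on the same elliptical contour.
print(mahalanobis_sq(np.array([2.0, 0.0]), mu, Sigma_inv))
```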

The Covariance Ellipse

The primary components for drawing an ellipse are the center and two axes: one major and one minor. The major axis goes through the center and is the longest line you can draw inside the ellipse; the minor axis is perpendicular to it, also through the center. The covariance matrix $\Sigma$ has all the information you need to draw the ellipse if you place its center at $\bm{\mu}_\mathbf{z}$: the radii (semi-axis lengths) are the square roots of the eigenvalues of $\Sigma$, and the eigenvectors of $\Sigma$ give the directions of the major and minor axes.
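That recipe translates directly into code; here is a sketch assuming NumPy (the helper name is mine):

```python
import numpy as np

def covariance_ellipse(Sigma, mu):
    """Center, semi-axis lengths, and axis directions of the covariance ellipse."""
    # eigh returns eigenvalues of a symmetric matrix in ascending order.
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    radii = np.sqrt(eigvals)            # semi-axis lengths: minor first, major last
    major_dir = eigvecs[:, 1]           # eigenvector of the largest eigenvalue
    minor_dir = eigvecs[:, 0]
    return mu, radii, major_dir, minor_dir

Sigma = np.array([[1.0, 0.75],
                  [0.75, 2.25]])        # sigma_x = 1, sigma_y = 1.5, rho = 0.5
center, radii, major_dir, minor_dir = covariance_ellipse(Sigma, np.array([0.0, 0.0]))
print(radii, major_dir)
```

The square-root-of-eigenvalue radii correspond to the 1-standard-deviation ellipse; larger coverage ellipses simply scale these radii, while the directions stay the same.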

Eigen- details

The details for this are no more than a standard 2x2 eigenproblem, and there are some simple formulas that we can use to get the eigenvalues and eigenvectors in this context.

$$\begin{aligned} \Sigma \mathbf{v} &= \lambda \mathbf{v} \\ (\Sigma - \lambda I)\mathbf{v} &= 0 \\ \implies \det(\Sigma - \lambda I) &= 0 \end{aligned}$$

If we label $\Sigma = \left[\begin{smallmatrix}a & b \\ c & d\end{smallmatrix}\right]$, with $b = c$, then the characteristic polynomial is

$$\begin{aligned} (a - \lambda)(d - \lambda) - bc &= 0 \\ \lambda^2 - (a + d)\lambda + (ad - bc) &= 0 \\ \implies \lambda &= \frac{(a + d) \pm \sqrt{(a+d)^2 - 4(ad - bc)}}{2} \end{aligned}$$

Since $T \colonequals a + d$ is the trace of $\Sigma$ (or $M \colonequals \frac{a + d}{2}$, the mean of the variances) and $D \colonequals ad - bc$ is the determinant of $\Sigma$, we can write this as

$$\lambda = \frac{T}{2} \pm \sqrt{\frac{T^2}{4} - D} = M \pm \sqrt{M^2 - D}$$

For a single eigenvector $\vec{v}_1$, we examine the equation

$$\begin{aligned} (\Sigma - \lambda I)\vec{v}_1 &= 0 \\ \begin{bmatrix}a - \lambda & b \\ c & d - \lambda\end{bmatrix}\begin{bmatrix}v_{11} \\ v_{12}\end{bmatrix} &= \begin{bmatrix}0 \\ 0\end{bmatrix} \end{aligned}$$

which implies that both $(a - \lambda)v_{11} + bv_{12} = 0$ and $cv_{11} + (d - \lambda)v_{12} = 0$. Since the matrix $\Sigma - \lambda I$ is singular, we only need to solve one of these equations; the other is a constant multiple of the first. A simple solution to the first equation is to let $v_{11} = b$ and $v_{12} = -(a - \lambda)$ (since $\lambda$ represents two eigenvalues, the other eigenvector uses the other eigenvalue). This works unless $b = 0$. In that case, we can solve the second equation by letting $v_{11} = d - \lambda$ and $v_{12} = -c$. If both $b$ and $c$ are 0, then the matrix is diagonal and our eigenvectors are $[1\; 0]'$ and $[0\; 1]'$.

Since $b = c$ in our covariance matrix, we have just 2 cases. That is, our eigenvectors are

$$\begin{aligned} \vec{v}_1 &= \begin{bmatrix}b \\ \lambda_1 - a\end{bmatrix}, & \vec{v}_2 &= \begin{bmatrix}b \\ \lambda_2 - a\end{bmatrix} && (b \neq 0) \\[1.5em] \vec{v}_1 &= \begin{bmatrix}1 \\ 0\end{bmatrix}, & \vec{v}_2 &= \begin{bmatrix}0 \\ 1\end{bmatrix} && (b = c = 0) \end{aligned}$$

In the latter case, $\vec{v}_1$ and $\vec{v}_2$ may switch depending on whether $d > a$, so that each eigenvector matches its respective eigenvalue.

The angle of the major axis is $\theta = \operatorname{atan2}(v_{12}, v_{11})$, where $\operatorname{atan2}$ is a modified $\arctan(\cdot)$ function that ensures the angle is always measured from the positive $x$-axis.
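Putting the closed-form pieces together, here is a small sketch (my own helper, assuming NumPy) that computes the eigenvalues via $M \pm \sqrt{M^2 - D}$, builds the eigenvectors from the cases above, and returns the major-axis angle with atan2:

```python
import numpy as np

def ellipse_from_cov(Sigma):
    """Closed-form eigenvalues, eigenvectors, and major-axis angle for a 2x2 covariance."""
    a, b = Sigma[0, 0], Sigma[0, 1]
    d = Sigma[1, 1]                      # b == c for a covariance matrix
    M = (a + d) / 2                      # mean of the variances (half the trace)
    D = a * d - b * b                    # determinant
    lam1 = M + np.sqrt(M**2 - D)         # larger eigenvalue -> major axis
    lam2 = M - np.sqrt(M**2 - D)
    if b != 0:
        v1 = np.array([b, lam1 - a])     # eigenvector for lam1
        v2 = np.array([b, lam2 - a])
    elif a >= d:                         # diagonal covariance, x has the larger variance
        v1, v2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    else:
        v1, v2 = np.array([0.0, 1.0]), np.array([1.0, 0.0])
    theta = np.arctan2(v1[1], v1[0])     # major-axis angle from the positive x-axis
    return (lam1, lam2), (v1, v2), theta

Sigma = np.array([[1.0, 0.75], [0.75, 2.25]])
(lam1, lam2), (v1, v2), theta = ellipse_from_cov(Sigma)
print(np.sqrt([lam1, lam2]), np.degrees(theta))   # radii and angle in degrees
```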

Conditional Distribution

The bivariate distribution has many beautiful symmetries, and I think they are highlighted well by the conditional distribution. The formula for the conditional distribution is a little more complicated at first, but I think the picture and mental model are much more intuitive. Given that $(X, Y)$ are jointly bivariate normal, the conditional distribution $Y \mid X=x$ is interpreted as: “I know the value of $X$ to be $x$; what variability and probability distribution are left in the random variable $Y$?” Since these random variables have a joint distribution, knowing the value of one of them should influence what I know about the other.

If I know that $X=2$, I’m thinking about slicing my density function along the plane $x=2$. Once I take the slice, we need a scaling factor to make sure the slice integrates to 1 and gives a valid probability distribution.

Here are some things to notice:

  • The conditional distribution is a 1d Gaussian distribution.
  • When $\rho = 0$, there are no terms that depend on $x$ or carry a subscript of $x$. In fact, the conditional reduces to the marginal distribution of $Y$, which implies the random variables are statistically independent!
  • The conditional mean is a function of $x$ only when $\rho \neq 0$. Since $\rho$ controls the “shearness” of the ellipse, the conditional mean shifts more when $|\rho|$ is larger.
  • Even when $\rho \neq 0$, the conditional variance is not affected by $\sigma_x$.

Conditional Formulas

The most straightforward way to derive the conditional distribution formulas for a bivariate normal distribution is to use the definition of conditional probability (and some patient algebra):

$$\begin{aligned} f(y \mid x) &= \frac{f(x, y)}{f(x)} \\ &= \frac{ \frac{1}{2\pi \sigma_x\sigma_y\sqrt{1 - \rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2 + \left(\frac{y-\mu_y}{\sigma_y}\right)^2 - \frac{2\rho(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}\right]\right) }{ \frac{1}{\sigma_x\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu_x}{\sigma_x}\right)^2\right) } \end{aligned}$$

The marginal distribution of $X$ is a 1d Gaussian with mean $\mu_x$ and variance $\sigma_x^2$. This result is presented later in this blog post, but it also follows from the property some textbooks use to define the bivariate Gaussian: any linear combination of the variables is normally distributed.

We’ll work with the constant factor and the terms inside the exponential separately. The constant simplifies to:

$$\frac{\;\frac{1}{2\pi \sigma_x\sigma_y\sqrt{1 - \rho^2}}\;}{\;\frac{1}{\sigma_x\sqrt{2\pi}}\;} = \frac{1}{\sqrt{2\pi \sigma_y^2 (1 - \rho^2)}}$$

Since the variance shows up in the constant term of a normal distribution, we get a hint of what the conditional variance will be: $\sigma_y^2(1 - \rho^2)$.

The exponential term simplifies to:

$$\begin{aligned}
&\exp\left(-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2 + \left(\frac{y-\mu_y}{\sigma_y}\right)^2 - \frac{2\rho(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}\right] - \left[-\frac{1}{2}\left(\frac{x-\mu_x}{\sigma_x}\right)^2\right]\right) \\[1em]
&= \exp\left(-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2 - (1-\rho^2)\left(\frac{x-\mu_x}{\sigma_x}\right)^2 + \left(\frac{y-\mu_y}{\sigma_y}\right)^2 - \frac{2\rho(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}\right]\right) && \text{(give the last term the common denominator)} \\[1em]
&= \exp\left(-\frac{1}{2(1-\rho^2)}\left[\rho^2\left(\frac{x-\mu_x}{\sigma_x}\right)^2 + \left(\frac{y-\mu_y}{\sigma_y}\right)^2 - \frac{2\rho(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}\right]\right) && \text{(cancel terms)} \\[1em]
&= \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{y-\mu_y}{\sigma_y} - \rho\frac{x-\mu_x}{\sigma_x}\right]^2\right) && \text{(factor the perfect square)} \\[1em]
&= \exp\left(-\frac{1}{2\sigma_y^2(1-\rho^2)}\left[(y-\mu_y) - \rho\frac{\sigma_y}{\sigma_x}(x-\mu_x)\right]^2\right) && \text{(pull $\sigma_y$ out of the squared term)} \\[1em]
&= \exp\left(-\frac{1}{2\underbrace{\sigma_y^2(1-\rho^2)}_{\text{conditional variance}}}\left[y - \underbrace{\left(\mu_y + \rho\frac{\sigma_y}{\sigma_x}(x-\mu_x)\right)}_{\text{conditional mean}}\right]^2\right) && \text{(rearrange to show the conditional mean and variance)}
\end{aligned}$$

In total, if we combine the constant term and the exponential term,

$$f(y \mid x) = \frac{1}{\sqrt{2\pi \sigma_y^2 (1 - \rho^2)}} \exp\left(-\frac{\left[y - \left(\mu_y + \rho\frac{\sigma_y}{\sigma_x}(x-\mu_x)\right)\right]^2}{2\sigma_y^2(1-\rho^2)}\right)$$

we get the pdf of the conditional distribution, with mean $\operatorname{E}[Y \mid X=x] = \mu_y + \rho\frac{\sigma_y}{\sigma_x}(x-\mu_x)$ and variance $\operatorname{Var}[Y \mid X=x] = \sigma_y^2(1-\rho^2)$.
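To connect this back to the slice-and-rescale picture from earlier, here is a quick numerical check (my own parameter choices, assuming NumPy/SciPy): slice the joint density at a fixed $x$, rescale the slice so it integrates to 1, and compare it with the closed-form conditional density:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Assumed example parameters (not from the post).
mu_x, mu_y = 0.0, 0.0
sigma_x, sigma_y, rho = 1.0, 1.5, 0.5
Sigma = np.array([[sigma_x**2,              rho * sigma_x * sigma_y],
                  [rho * sigma_x * sigma_y, sigma_y**2]])
joint = multivariate_normal(mean=[mu_x, mu_y], cov=Sigma)

x0 = 2.0                                    # condition on X = 2
ys = np.linspace(-8, 8, 2001)
dy = ys[1] - ys[0]

# Slice the joint density along x = x0, then rescale so the slice integrates to 1.
slice_vals = joint.pdf(np.column_stack([np.full_like(ys, x0), ys]))
slice_vals /= slice_vals.sum() * dy

# Closed-form conditional mean and standard deviation derived above.
cond_mean = mu_y + rho * (sigma_y / sigma_x) * (x0 - mu_x)
cond_sd = sigma_y * np.sqrt(1 - rho**2)
closed_form = norm(cond_mean, cond_sd).pdf(ys)

print(np.max(np.abs(slice_vals - closed_form)))   # should be near 0
```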

Regression Perspective

There is an intimate relationship between conditional distributions and regression. Suppose we are in a simple linear regression scenario, regressing $Y$ on $X$. We can interpret $\rho\frac{\sigma_y}{\sigma_x}$ as the population slope of the best-fit line. Remember that $\rho, \sigma_x, \sigma_y$ together change the major axis of the ellipse, and thus the main direction of the relationship between these two variables. To see the resemblance, notice:

$$\begin{aligned} \rho\frac{\sigma_y}{\sigma_x} &= \frac{\rho \sigma_x \sigma_y}{\sigma_x^2} \\ &= \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)} \end{aligned}$$

The regression slope can be read off the covariance matrix directly: if we label $\Sigma = \left[\begin{smallmatrix}a & b \\ c & d\end{smallmatrix}\right]$, then $\rho\frac{\sigma_y}{\sigma_x} = \frac{b}{a}$. If we substitute the sample versions, we get

$$\begin{aligned} \frac{\widehat{\operatorname{Cov}}(X,Y)}{\widehat{\operatorname{Var}}(X)} &= \frac{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2} \\ &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \\ &= \hat{\beta}_1 \end{aligned}$$

This is the formula for the slope estimate in the simple linear regression $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, assuming normal errors. Written in this manner,

$$\begin{aligned} \operatorname{E}[Y \mid X = x] &= \mu_y + \rho\frac{\sigma_y}{\sigma_x}(x-\mu_x) \\ &= \mu_y + \beta_1 (x - \mu_x) \end{aligned}$$

we have a very “intercept plus slope” interpretation: the conditional mean shifts along the best-fit regression line, depending on how far $x$ is from the center of the bivariate distribution.
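As a quick empirical illustration (a sketch with parameters of my own choosing), the least-squares slope fit to draws from a bivariate normal should land near the population slope $\rho\frac{\sigma_y}{\sigma_x} = b/a$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example parameters (not from the post).
mu = np.array([1.0, 2.0])
sigma_x, sigma_y, rho = 1.0, 1.5, 0.5
Sigma = np.array([[sigma_x**2,              rho * sigma_x * sigma_y],
                  [rho * sigma_x * sigma_y, sigma_y**2]])

xy = rng.multivariate_normal(mu, Sigma, size=100_000)
x, y = xy[:, 0], xy[:, 1]

# Sample slope: Cov_hat(X, Y) / Var_hat(X), the simple linear regression estimate.
beta1_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
beta1_pop = rho * sigma_y / sigma_x          # equivalently Sigma[0, 1] / Sigma[0, 0]

print(beta1_hat, beta1_pop)                  # the two should be close
```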

Note that the slope of this regression line is not the same as the slope of the major-axis line of the covariance ellipse that runs through the bulk of the bivariate distribution. Even though they trend in roughly the same direction, they represent different concepts.

Major Axis vs Regression
This plot shows a bivariate normal scatterplot with $\sigma_x=1, \sigma_y=1.5, \rho=.5$, along with the covariance ellipse axes and the related regression lines. Population slopes are shown; the lines are not subject to sampling variation.

The short explanation of why these differ is that the major-axis line follows the direction of an eigenvector, which deals with perpendicular distances from the points directly to the line, whereas the regression lines deal strictly in vertical or horizontal distances between the points and the line, depending on whether we regress y ~ x or x ~ y.
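Here is a small numerical illustration of that gap, using the population $\Sigma$ from the caption above (the variable names are mine): the major-axis slope comes from the leading eigenvector, while the regression slope of $y$ on $x$ is $b/a$.

```python
import numpy as np

sigma_x, sigma_y, rho = 1.0, 1.5, 0.5
Sigma = np.array([[sigma_x**2,              rho * sigma_x * sigma_y],
                  [rho * sigma_x * sigma_y, sigma_y**2]])

# Major-axis slope: direction of the eigenvector with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(Sigma)
v_major = eigvecs[:, np.argmax(eigvals)]
major_axis_slope = v_major[1] / v_major[0]

# Regression slope of y on x: Cov(X, Y) / Var(X) = b / a.
regression_slope = Sigma[0, 1] / Sigma[0, 0]

print(major_axis_slope, regression_slope)   # ~2.14 vs 0.75: same general direction, different lines
```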

Marginal Distribution

The marginal distribution is trying to disregard one of the variables altogether. If I average out all the variability across the dimension I don’t care about, what probability distribution is left? The mental model I have for marginals is to integrate out one of the dimensions so that the density is all “squished” into the dimension I care about.

Some things to notice:

  • Only 2 of the 5 parameters determine each marginal distribution. $\rho$ is left out of the marginal distributions in both the $x$ and $y$ directions.
  • In the bivariate normal, you can either think about the “squishing into 1 dimension” as a projection of the density onto that dimension, which is then rescaled so that everything adds up to 1, or you can think about it as truly summing up everything in the excess dimension and integrating (there’s a numerical sketch of this view after the list). I’ve added a button so that you can see these different perspectives.
  • The location of the 1d grid is somewhat arbitrary. It doesn’t need to be at $y=0$; we’re just integrating out the entire dimension.
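A minimal numerical sketch of the integration view (my own parameter choices, assuming NumPy/SciPy): summing the joint density over a fine $y$ grid recovers the 1d Gaussian with mean $\mu_x$ and standard deviation $\sigma_x$, with no trace of $\rho$.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Assumed example parameters (not from the post); rho is deliberately far from 0.
mu_x, mu_y = 0.0, 0.0
sigma_x, sigma_y, rho = 1.0, 1.5, 0.9
Sigma = np.array([[sigma_x**2,              rho * sigma_x * sigma_y],
                  [rho * sigma_x * sigma_y, sigma_y**2]])
joint = multivariate_normal(mean=[mu_x, mu_y], cov=Sigma)

xs = np.linspace(-6, 6, 241)
ys = np.linspace(-12, 12, 961)
dy = ys[1] - ys[0]

# For each x, sum the joint density over the y grid ("squish" the y dimension away).
X, Y = np.meshgrid(xs, ys, indexing="ij")
marginal_numeric = joint.pdf(np.dstack([X, Y])).sum(axis=1) * dy

# The marginal of X is N(mu_x, sigma_x^2), with no dependence on rho or sigma_y.
marginal_exact = norm(mu_x, sigma_x).pdf(xs)
print(np.max(np.abs(marginal_numeric - marginal_exact)))   # close to 0
```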

Integrating in 2d vs 1d

The 2d normal distribution (red hill) can look much larger than the 1d normal distribution (black curve), even though both integrate to 1.

Plotting the 2d Gaussian next to the 1d Gaussian makes me appreciate some subtleties of measure theory. If you have low variance and an extreme value of $\rho$, you can make the 2d density appear MUCH larger than its 1d marginal distribution, and yet the total probability under both of them is ALWAYS 1. This was quite shocking for me to see at first! After integrating, the density will actually appear to shrink. The half-rationale for this is that the integration happens over a 2-dimensional space vs a 1-dimensional space (against Lebesgue measure in $\mathbb{R}^2$ vs $\mathbb{R}$), which are totally different. A point in 2d space technically holds less “weight” than its 1d counterpart; a “point” in 1d carries the weight of the entire real line perpendicular to it when calculating the integral. Before seeing these animations, I always thought the marginal distribution of the bivariate normal was straightforward and “made sense”, but now I’m more impressed that terms like $\rho$ don’t appear in the marginal distributions. Although measure theory was designed as an extension of geometric concepts like area and volume, I think of it as more of an algebraic discipline, since physical intuitions can often fall apart, e.g. the Banach–Tarski paradox.
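As a small worked example of the “looks much larger” effect (using the standard peak-height formulas; the specific numbers are my own), compare the maximum of the joint density to the maximum of the marginal of $X$ when $\sigma_x = \sigma_y = 0.5$ and $\rho = 0.9$:

$$\begin{aligned} \max_{x,y} f(x,y) &= \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} = \frac{1}{2\pi(0.5)(0.5)\sqrt{1-0.81}} \approx 1.46 \\ \max_x f_X(x) &= \frac{1}{\sigma_x\sqrt{2\pi}} = \frac{1}{0.5\sqrt{2\pi}} \approx 0.80 \end{aligned}$$

The joint density’s peak is nearly twice as tall, yet both densities still integrate to exactly 1, over $\mathbb{R}^2$ and $\mathbb{R}$ respectively.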