Probability for Data Science
eBook  ›  Chapter 5 · Joint Distributions
Section 5.2

Joint Expectation

5.2.1Definition and interpretation

When we have a single random variable, the expectation is defined as

$$\E[X] = \int_{\Omega} x f_X(x) \;dx.$$

For a pair of random variables, what would be a good way of defining the expectation? Certainly, we cannot just replace \(f_X(x)\) by \(f_{X,Y}(x,y)\) because the integration has to become a double integration. However, if it is a double integration, where should we put the variable \(y\)? It turns out that a useful way of defining the expectation for \(X\) and \(Y\) is as follows.

Definition 5.10

Let \(X\) and \(Y\) be two random variables. The joint expectation is

$$\E[XY] = \sum_{y \in \Omega_Y} \sum_{x \in \Omega_X} xy \;\cdot \; p_{X,Y}(x,y)$$

if \(X\) and \(Y\) are discrete, or

$$\E[XY] = \int_{y \in \Omega_Y} \int_{x \in \Omega_X} xy \;\cdot \; f_{X,Y}(x,y) \;dx\;dy$$

if \(X\) and \(Y\) are continuous. Joint expectation is also called correlation.

The double summation and integration on the right-hand side of the equation is nothing but the state times the probability. Here, the state is the product \(xy\), and the probability is the joint PMF \(p_{X,Y}(x,y)\) (or PDF). Therefore, as long as you agree that joint expectation should be defined as \(\E[XY]\), the double summation and the double integration make sense.

The biggest mystery here is \(\E[XY]\). You may wonder why the joint expectation should be defined as the expectation of the product \(\E[XY]\). Why not the sum \(\E[X+Y]\), or the difference \(\E[X-Y]\), or the quotient \(\E[X/Y]\)? Why are we so deeply interested in \(X\) times \(Y\)? These are excellent questions. That the joint expectation is defined as the product has to do with the correlation between two random variables. We will take a small detour into linear algebra.

Let us consider two discrete random variables \(X\) and \(Y\), both with \(N\) states. So \(X\) will take the states \(\{x_1,x_2,\ldots,x_N\}\) and \(Y\) will take the states \(\{y_1,y_2,\ldots,y_N\}\). Let's define them as two vectors: \(\vx \bydef [x_1,\ldots,x_N]^T\) and \(\vy \bydef [y_1,\ldots,y_N]^T\). Since \(X\) and \(Y\) are random variables, they have a joint PMF \(p_{X,Y}(x,y)\). The array of the PMF values can be written as a matrix:

$$\text{PMF as a matrix} = \mP \bydef \begin{bmatrix} p_{X,Y}(x_1,y_1) & p_{X,Y}(x_1,y_2) & \cdots & p_{X,Y}(x_1,y_N)\\ p_{X,Y}(x_2,y_1) & p_{X,Y}(x_2,y_2) & \cdots & p_{X,Y}(x_2,y_N)\\ \vdots & \vdots & \ddots & \vdots \\ p_{X,Y}(x_N,y_1) & p_{X,Y}(x_N,y_2) & \cdots & p_{X,Y}(x_N,y_N) \end{bmatrix}.$$

Let's try to write the joint expectation in terms of matrices and vectors. The definition of a joint expectation tells us that

$$\begin{aligned} \E[XY] &= \sum_{i = 1}^N \sum_{j = 1}^N x_i y_j \,\cdot \, p_{X,Y}(x_i,y_j), \end{aligned}$$

which can be written as

$$\begin{aligned} \E[XY] &= \underset{\vx^T}{\underbrace{\begin{bmatrix} x_1 & \cdots & x_N \end{bmatrix}}} \underset{\mP}{\underbrace{\begin{bmatrix} p_{X,Y}(x_1,y_1) & \cdots & p_{X,Y}(x_1,y_N)\\ \vdots & \ddots & \vdots \\ p_{X,Y}(x_N,y_1) & \cdots & p_{X,Y}(x_N,y_N) \end{bmatrix}}} \underset{\vy}{\underbrace{\begin{bmatrix} y_1\\ \vdots\\ y_N \end{bmatrix}}} = \vx^T\mP\vy. \end{aligned}$$

This is a weighted inner product between \(\vx\) and \(\vy\) using the weight matrix \(\mP\).

Why correlation is defined as \(\E[XY]\)
  • sep0em
  • \(\E[XY]\) is a weighted inner product between the states:

    $$\E[XY] = \vx^T\mP\vy.$$
  • \(\vx\) and \(\vy\) are the states of the random variables \(X\) and \(Y\).
  • The inner product measures the similarity between two vectors.
Example 5.15

Let \(X\) be a discrete random variable with \(N\) states, where each state has an equal probability. Thus, \(p_X(x) = 1/N\) for all \(x\). Let \(Y = X\) be another variable. Then the joint PMF of \((X,Y)\) is

$$p_{X,Y}(x,y) = \begin{cases} \frac{1}{N}, &\qquad x = y,\\ 0, &\qquad x\not= y. \end{cases}$$

It follows that the joint expectation is

$$\E[XY] = \sum_{i=1}^N \sum_{j=1}^N x_iy_j \cdot p_{X,Y}(x_i,y_j) = \frac{1}{N} \sum_{i=1}^N x_iy_i.$$

Equivalently, we can obtain the result via the inner product by defining

$$\mP = \begin{bmatrix} \frac{1}{N} & 0 & \cdots & 0\\ 0 & \frac{1}{N} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & \cdots & \frac{1}{N} \end{bmatrix} = \frac{1}{N}\mI.$$

In this case, the weighted inner product is $$\vx^T\mP\vy = \frac{\vx^T\vy}{N} = \frac{1}{N} \sum_{i=1}^N x_iy_i = \E[XY].$$

How do we understand the inner product? Ignoring the matrix \(\mP\) for a moment, we recall an elementary result in linear algebra.

Definition 5.11

Let \(\vx \in \R^N\) and \(\vy \in \R^N\) be two vectors. Define the cosine angle \(\cos\theta\) as

$$\cos\theta = \frac{\vx^T\vy}{\|\vx\|\|\vy\|},$$

where \(\|\vx\| = \sqrt{\sum_{i=1}^Nx_i^2}\) is the norm of the vector \(\vx\), and \(\|\vy\| = \sqrt{\sum_{i=1}^N y_i^2}\) is the norm of the vector \(\vy\).

This definition can be understood as the geometry between two vectors, as illustrated in Figure 5.7. If the two vectors \(\vx\) and \(\vy\) are parallel so that \(\vx = \alpha\vy\) for some \(\alpha\), then the angle \(\theta = 0\). If \(\vx\) and \(\vy\) are orthogonal so that \(\vx^T\vy = 0\), then \(\theta = \pi/2\). Therefore, the inner product \(\vx^T\vy\) tells us the degree of correlation between the vectors \(\vx\) and \(\vy\).

Figure 5.7
Figure 5.7. The geometry of joint expectation. \(\E[XY]\) gives us the cosine angle between the two random variables. This, in turn, tells us the correlation between the two random variables.

Now let's come back to our discussion about the joint expectation. The cosine angle definition tells us that if \(\E[XY] = \vx^T\mP\vy\), the following form would make sense:

$$\cos\theta = \frac{\vx^T\mP\vy}{\|\vx\|\|\vy\|} = \frac{\E[XY]}{\|\vx\|\|\vy\|}.$$

That is, as long as we can find out the norms \(\|\vx\|\) and \(\|\vy\|\), we will be able to interpret \(\E[XY]\) from the cosine angle perspective. But what would be a reasonable definition of \(\|\vx\|\) and \(\|\vy\|\)? We define the norm by first considering the variance of the random variable \(X\) and \(Y\):

$$\begin{aligned} \E[X^2] &= \sum_{i = 1}^N x_i x_i \,\cdot \, p_{X}(x_i)\\ &= \underset{\vx^T}{\underbrace{\begin{bmatrix} x_1 & \cdots & x_N \end{bmatrix}}} \underset{\mP_X}{\underbrace{\begin{bmatrix} p_{X}(x_1) & \cdots & 0\\ \vdots & \ddots & \vdots \\ 0 & \cdots & p_{X}(x_N) \end{bmatrix}}} \underset{\vx}{\underbrace{\begin{bmatrix} x_1\\ \vdots\\ x_N \end{bmatrix}}} \\ &= \vx^T\mP_{X}\vx = \|\vx\|_{\mP_X}^2, \end{aligned}$$

where \(\mP_X\) is the diagonal matrix storing the probability masses of the random variable \(X\). It is not difficult to show that \(\mP_X = \text{diag}(\mP\vone)\) by following the definition of the marginal distributions (which are the column and row sums of the joint PMF). Similarly we can define

$$\begin{aligned} \E[Y^2] &= \sum_{j = 1}^N y_j y_j \,\cdot \, p_{Y}(y_j)\\ &= \underset{\vy^T}{\underbrace{\begin{bmatrix} y_1 & \cdots & y_N \end{bmatrix}}} \underset{\mP_Y}{\underbrace{\begin{bmatrix} p_{Y}(y_1) & \cdots & 0\\ \vdots & \ddots & \vdots \\ 0 & \cdots & p_{Y}(y_N) \end{bmatrix}}} \underset{\vy}{\underbrace{\begin{bmatrix} y_1\\ \vdots\\ y_N \end{bmatrix}}} \\ &= \vy^T\mP_{Y}\vy = \|\vy\|_{\mP_Y}^2. \end{aligned}$$

Therefore, one way to define the cosine angle is to start with

$$\cos\theta = \frac{\vx^T\mP_{XY}\vy}{\|\vx\|_{\mP_X}\|\vy\|_{\mP_Y}},$$

where \(\mP_{XY} = \mP\), \(\|\vx\|_{\mP_X} = \sqrt{\vx^T\mP_X\vx}\) and \(\|\vy\|_{\mP_Y} = \sqrt{\vy^T\mP_Y\vy}\). But writing it in terms of the expectation, we observe that this cosine angle is exactly

$$\begin{aligned} \cos\theta &= \frac{\vx^T\mP_{XY}\vy}{\|\vx\|_{\mP_X}\|\vy\|_{\mP_Y}} \\ &= \frac{\E[XY]}{\sqrt{\E[X^2]}\sqrt{\E[Y^2]}}. \end{aligned}$$

Therefore, \(\E[XY]\) defines the cosine angle between the two random variables, which, in turn, defines the correlation between the two. A large \(|\E[XY]|\) means that \(X\) and \(Y\) are highly correlated, and a small \(|\E[XY]|\) means that \(X\) and \(Y\) are not very correlated. If \(\E[XY] = 0\), then the two random variables are uncorrelated. Therefore, \(\E[XY]\) tells us how the two random variables are related to each other.

To further convince you that \( [XY]{[X^2][Y^2]} \) can be interpreted as a cosine angle, we show that

$$-1 \le \frac{\E[XY]}{\sqrt{\E[X^2]}\sqrt{\E[Y^2]}} \le 1,$$

because if this ratio can go beyond \(+1\) and \(-1\), it makes no sense to call it a cosine angle. The argument follows from a very well-known inequality in probability, called the Cauchy-Schwarz inequality (for expectation), which states that \( -1 [XY]{[X^2][Y^2]} 1 \):

Theorem 5.2 (Cauchy-Schwarz inequality)

For any random variables \(X\) and \(Y\),

$$(\E[XY])^2 \le \E[X^2]\E[Y^2].$$

Proof. Let \(t\in \R\) be a constant. Consider

$$\begin{aligned} \E[(X+tY)^2] = \E[X^2 + 2tXY + t^2Y^2]. \end{aligned}$$

Since \(\E[(X+tY)^2] \ge 0\) for any \(t\), it follows that

$$\begin{aligned} \E[X^2 + 2tXY + t^2Y^2] \ge 0. \end{aligned}$$

Expanding the left-hand side yields

$$\begin{aligned} t^2\E[Y^2] + 2t \E[XY] + \E[X^2] \ge 0. \end{aligned}$$

This is a quadratic equation in \(t\), and we know that for any quadratic equation \(a t^2 + bt + c \ge 0\) we must have \(b^2 - 4ac \le 0\). Therefore, in our case, we have that

$$(2\E[XY])^2 - 4\E[Y^2]\E[X^2] \le 0,$$

which means \((\E[XY])^2 \le \E[X^2]\E[Y^2]\). The equality holds when \(\E[(X+tY)^2] = 0\). In this case, \(X = -tY\) for some \(t\), i.e., the random variable \(X\) is a scaled version of \(Y\) so that the vector formed by the states of \(X\) is parallel to that of \(Y\).

5.2.2Covariance and correlation coefficient

In many practical problems, we prefer to work with central moments, i.e., \(\E[(X - \mu_X)^2]\) instead of \(\E[X^2]\). This essentially means that we subtract the mean from the random variable. If we adopt such a centralized random variable, we can define the covariance as follows.

Definition 5.12

Let \(X\) and \(Y\) be two random variables. Then the covariance of \(X\) and \(Y\) is

$$\Cov(X,Y) = \E[(X-\mu_X)(Y-\mu_Y)],$$

where \(\mu_X = \E[X]\) and \(\mu_Y = \E[Y]\).

It is easy to show that if \(X = Y\), then the covariance simplifies to the variance:

$$\begin{aligned} \Cov(X,X) &= \E[(X-\mu_X)(X-\mu_X)] = \Var[X]. \end{aligned}$$

Thus, covariance is a generalization of variance. The former can handle a pair of variables, whereas the latter is only for a single variable. We can also demonstrate the following result.

Theorem 5.3

Let \(X\) and \(Y\) be two random variables. Then

$$\Cov(X,Y) = \E[XY] - \E[X]\E[Y].$$

Proof. Just apply the definition of covariance:

$$\begin{aligned} \Cov(X,Y) &= \E[(X-\mu_X)(Y-\mu_Y)] \\ &= \E[XY - X\mu_Y - Y \mu_X + \mu_X\mu_Y] = \E[XY] - \mu_X\mu_Y. \end{aligned}$$

The next theorem concerns the sum of two random variables.

Theorem 5.4

For any \(X\) and \(Y\),

  1. [a.] \(\E[X+Y]=\E[X]+\E[Y].\)
  2. [b.] \(\Var[X+Y] = \Var[X] + 2\Cov(X,Y) + \Var[Y].\)

Proof. Recall the definition of joint expectation:

$$\begin{aligned} \E[X+Y] &= \sum_{y}\sum_{x} (x+y)p_{X,Y}(x,y) \\ &= \sum_{y} \sum_{x} xp_{X,Y}(x,y) + \sum_{y} \sum_{x} y p_{X,Y}(x,y)\\ &= \sum_{x} x \left(\sum_{y} p_{X,Y}(x,y)\right) + \sum_{y} y \left(\sum_{x} p_{X,Y}(x,y)\right)\\ &= \sum_x x p_X(x) + \sum_y yp_Y(y) \\ &= \E[X] + \E[Y]. \end{aligned}$$

Similarly,

$$\begin{aligned} \Var[X+Y] &= \E[(X+Y)^2] - \E[X+Y]^2 \\ &= \E[(X+Y)^2] - (\mu_X+\mu_Y)^2 \\ &= \E[X^2 + 2XY + Y^2] - (\mu_X^2+ 2\mu_X\mu_Y + \mu_Y^2) \\ &= \E[X^2]-\mu_X^2 + \E[Y^2] - \mu_Y^2 + 2(\E[XY]-\mu_X\mu_Y)\\ &= \Var[X] + 2\Cov(X,Y) + \Var[Y]. \end{aligned}$$

With covariance defined, we can now define the correlation coefficient \(\rho\), which is the cosine angle of the centralized variables. That is,

$$\begin{aligned} \rho &= \cos\theta \\ &= \frac{\E[(X-\mu_X)(Y-\mu_Y)]}{\sqrt{\E[(X-\mu_X)^2]\E[(Y-\mu_Y)^2]}}. \end{aligned}$$

Recognizing that the denominator of this expression is just the variance of \(X\) and \(Y\), we define the correlation coefficient as follows.

Definition 5.13

Let \(X\) and \(Y\) be two random variables. The correlation coefficient is

$$\rho = \frac{\Cov(X,Y)}{\sqrt{\Var[X]\Var[Y]}}.$$

Since \(-1\le \cos\theta \le 1\), \(\rho\) is also between \(-1\) and \(1\). The difference between \(\rho\) and \(\E[XY]\) is that \(\rho\) is normalized with respect to the variance of \(X\) and \(Y\), whereas \(\E[XY]\) is not normalized. The correlation coefficient has the following properties:

5.2.3Independence and correlation

If two random variables \(X\) and \(Y\) are independent, the joint expectation can be written as a product of two individual expectations.

Theorem 5.5

If \(X\) and \(Y\) are independent, then

$$\E[XY] = \E[X]\E[Y].$$

Proof. We only prove the discrete case because the continuous can be proved similarly. If \(X\) and \(Y\) are independent, we have \(p_{X,Y}(x,y) = p_X(x)\;p_Y(y)\). Therefore,

$$\begin{aligned} \E[XY] = \sum_{y}\sum_{x} xy p_{X,Y}(x,y) &= \sum_{y} \sum_x xy p_{X}(x)p_Y(y) \\ &= \left(\sum_x x p_{X}(x)\right) \left(\sum_y y p_{Y}(y)\right) = \E[X]\E[Y]. \end{aligned}$$

In general, for any two independent random variables and two functions \(f\) and \(g\),

$$\E[f(X)g(Y)] = \E[f(X)]\E[g(Y)].$$

The following theorem illustrates a few important relationships between independence and correlation.

Theorem 5.6

Consider the following two statements:

  1. [a.] \(X\) and \(Y\) are independent;
  2. [b.] \(\Cov(X,Y) = 0\).

Statement (a) implies statement (b), but (b) does not imply (a). Thus, independence is a stronger condition than correlation.

Proof. We first prove that (a) implies (b). If \(X\) and \(Y\) are independent, then \(\E[XY] = \E[X]\E[Y]\). In this case,

$$\begin{aligned} \Cov(X,Y) = \E[XY]-\E[X]\E[Y] = \E[X]\E[Y] - \E[X]\E[Y] = 0. \end{aligned}$$

To prove that (b) does not imply (a), we show a counterexample. Consider a discrete random variable \(Z\) with PMF

$$p_Z(z) = \begin{bmatrix} \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \end{bmatrix}.$$

Let \(X\) and \(Y\) be

$$X = \cos \frac{\pi}{2}Z \quad\mbox{and}\quad Y = \sin \frac{\pi}{2}Z.$$

Then we can show that \(\E[X] = 0\) and \(\E[Y] = 0\). The covariance is

$$\begin{aligned} \Cov(X,Y) &= \E[(X-0)(Y-0)] \\ &= \E\left[ \cos \frac{\pi}{2}Z \sin \frac{\pi}{2}Z\right] \\ &= \E\left[\frac{1}{2}\sin \pi Z\right] \\ &= \frac{1}{2}\left[ (\sin \pi 0) \frac{1}{4} + (\sin \pi 1) \frac{1}{4} + (\sin \pi 2) \frac{1}{4} + (\sin \pi 3) \frac{1}{4}\right] = 0. \end{aligned}$$

The next step is to show that \(X\) and \(Y\) are dependent. To this end, we only need to show that \(p_{X,Y}(x,y) \not= p_X(x)p_Y(y)\). The joint PMF \(p_{X,Y}(x,y)\) can be found by noting that

$$\begin{aligned} Z = 0 &\Rightarrow X = 1,\ Y = 0, \\ Z = 1 &\Rightarrow X = 0,\ Y = 1, \\ Z = 2 &\Rightarrow X = -1,\ Y = 0, \\ Z = 3 &\Rightarrow X = 0,\ Y = -1. \end{aligned}$$

Thus, the PMF is

$$p_{X,Y}(x,y) = \begin{bmatrix} 0 & \frac{1}{4} & 0 \\ \frac{1}{4} & 0 & \frac{1}{4}\\ 0 & \frac{1}{4} & 0 \end{bmatrix}.$$

The marginal PMFs are

$$\begin{aligned} p_X(x) = \begin{bmatrix} \frac{1}{4} & \frac{1}{2} & \frac{1}{4} \end{bmatrix}, \quad p_Y(y) = \begin{bmatrix} \frac{1}{4} & \frac{1}{2} & \frac{1}{4} \end{bmatrix}. \end{aligned}$$

The product \(p_X(x)\;p_Y(y)\) is

$$p_X(x)p_Y(y) = \begin{bmatrix} \frac{1}{16} & \tfrac{1}{8} & \tfrac{1}{16} \\[0.5em] \frac{1}{8} & \tfrac{1}{4} & \tfrac{1}{8}\\[0.5em] \frac{1}{16} & \tfrac{1}{8} & \tfrac{1}{16} \end{bmatrix}.$$

Therefore, \(p_{X,Y}(x,y) \not= p_X(x)p_Y(y)\), although \(\E[XY] = \E[X]\E[Y]\).

What is the relationship between independent and uncorrelated?
  • sep0em
  • Independent \(\Rightarrow\) uncorrelated.
  • Independent \(\nLeftarrow\) uncorrelated.

5.2.4Computing correlation from data

We close this section by discussing a very practical problem: Given a dataset containing two columns of data points, how do we determine whether the two columns are correlated?

Recall that the correlation coefficient is defined as

$$\rho = \frac{\E[XY]-\mu_X\mu_Y}{\sigma_X\sigma_Y}.$$

If we have a dataset containing \((x_n,y_n)_{n=1}^N\), then the correlation coefficient can be approximated by

$$\widehat{\rho} = \frac{\frac{1}{N}\sum_{n=1}^N x_ny_n - \overline{x} \,\overline{y}}{ \sqrt{\frac{1}{N}\sum_{n=1}^N(x_n-\overline{x})^2} \sqrt{\frac{1}{N}\sum_{n=1}^N(y_n-\overline{y})^2}},$$

where \(\overline{x} = \frac{1}{N}\sum_{n=1}^N x_n\) and \(\overline{y} = \frac{1}{N}\sum_{n=1}^N y_n\) are the means. This equation should not be a surprise because essentially all terms are the empirical estimates. Thus, \(\widehat{\rho}\) is the empirical correlation coefficient determined from the dataset. As \(N \rightarrow \infty\), we expect \(\widehat{\rho} \rightarrow \rho\). Figure 5.8 shows three example datasets. We plot the \((x_n,y_n)\) pairs as coordinates in the 2D plane. The first dataset contains samples that are almost uncorrelated. We can see that \(x_n\) does not tell us anything about \(y_n\). The second dataset is moderately correlated. The third dataset is highly correlated: If we know \(x_n\), we are almost certain to know the corresponding \(y_n\), with a small number of perturbations.

Figure 5.8. Visualization of correlated variables. Each of these figures represent a scattered plot of a dataset containing \((x_n,y_n)_{n=1}^N\). (a) is uncorrelated. (b) is somewhat correlated. (c) is strongly correlated.

On a computer, computing the correlation coefficient can be done using built-in commands such as corrcoef in MATLAB and stats.pearsonr in Python. The codes to generate the results in Figure 5.8(b) are shown below.

MATLAB
% MATLAB code to compute the correlation coefficient
    x = mvnrnd([0,0],[3 1; 1 1],1000);
    figure(1); scatter(x(:,1),x(:,2));
    rho = corrcoef(x)
Python
# Python code to compute the correlation coefficient
    import numpy as np
    import scipy.stats as stats
    import matplotlib.pyplot as plt
    x = stats.multivariate_normal.rvs([0,0], [[3,1],[1,1]], 10000)
    plt.figure(); plt.scatter(x[:,0],x[:,1])
    rho,_ = stats.pearsonr(x[:,0],x[:,1])
    print(rho)