Probability for Data Science
Section 4.4

Median, Mode, and Mean

There are three statistical quantities that we are frequently interested in: mean, mode, and median. We all know how to compute these from a dataset. For example, to compute the median of a dataset, we sort the data and pick the number that sits in the 50th percentile. However, the median computed in this way is the empirical median, i.e., it is a value computed from a particular dataset. If the data is generated from a random variable (with a given PDF), how do we compute the mean, median, and mode?

4.4.1 Median

Imagine you have a sequence of numbers as shown below.

$$\begin{array}{c|ccccccccccc} n & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & \cdots & 100 \\ \hline x_n & 1.5 & 2.5 & 3.1 & 1.1 & -0.4 & -4.1 & 0.5 & 2.2 & -3.4 & \cdots & -1.4 \end{array}$$

How do we compute the median? We first sort the sequence (either in ascending order or descending order), and then pick the middle one. On a computer, we permute the samples

$$\{x_{1'}, x_{2'}, \ldots, x_{N'}\} = \text{sort}\{x_{1}, x_{2}, \ldots, x_{N}\},$$

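such that \(x_{1'} \le x_{2'} \le \ldots \le x_{N'}\). The median is the value positioned in the middle: when \(N\) is odd it is the single middle element, and when \(N\) is even (as with the \(N = 100\) samples above) the usual convention is to average the two middle elements. There are, of course, built-in commands such as median in MATLAB and np.median in Python to perform the median operation.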
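The sort-and-pick procedure can be sketched in a few lines of Python. This is only an illustration: the helper name `empirical_median` is hypothetical, the input uses just the first nine values of the sequence above (the remaining entries are elided there), and the standard-library `statistics.median` is used as a cross-check.

```python
import statistics

def empirical_median(samples):
    """Sort the samples and return the middle value.

    For an even number of samples, the two middle values are averaged.
    """
    ordered = sorted(samples)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return 0.5 * (ordered[mid - 1] + ordered[mid])

# First nine values of x_n from the sequence above (the rest are elided there).
x = [1.5, 2.5, 3.1, 1.1, -0.4, -4.1, 0.5, 2.2, -3.4]

print(empirical_median(x))    # hand-written version
print(statistics.median(x))   # built-in cross-check
```

Both calls should agree, since the built-in follows the same convention of averaging the two middle values when the sample count is even.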

Now, how do we compute the median if we are given a random variable \(X\) with a PDF \(f_X(x)\)? The answer is by integrating the PDF.

Definition 4.9

Let \(X\) be a continuous random variable with PDF \(f_X\). The median of \(X\) is a point \(c \in \R\) such that

$$\int_{-\infty}^{c} f_X(x) \;dx = \int_{c}^{\infty} f_X(x) \;dx.$$

Why is the median defined in this way? This is because \(\int_{-\infty}^{c} f_X(x) \;dx\) is the area under the curve on the left of \(c\), and \(\int_{c}^{\infty} f_X(x) \;dx\) is the area under the curve on the right of \(c\). The left area is the probability that \(X\) falls below \(c\), and the right area is the probability that \(X\) falls above \(c\). Therefore, if the left area equals the right area, each must be \(\frac{1}{2}\), so \(c\) splits the distribution in half and must be the median.

How to find the median from the PDF
  • Find a point \(c\) that separates the PDF into two equal areas.

Figure 4.14. [Left] The median is computed as the point such that the two areas under the curve are equal. [Right] The median is computed as the point such that \(F_X\) hits 0.5.
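The equal-area idea can be tried numerically. The sketch below is a rough illustration rather than a recommended method: it accumulates the area under the PDF with the midpoint rule and stops once the left area reaches one half. The function name `median_from_pdf` and the example PDF \(f_X(x) = 2x\) on \([0,1]\) are assumptions made for this illustration; for that PDF the left area is \(c^2\), so the exact median is \(1/\sqrt{2}\).

```python
import math

def median_from_pdf(pdf, lo, hi, steps=100_000):
    """Approximate the median of a PDF supported on [lo, hi].

    Accumulates the area under the curve with the midpoint rule and
    returns the first grid point where the left area reaches one half.
    """
    dx = (hi - lo) / steps
    left_area = 0.0
    for i in range(steps):
        left_area += pdf(lo + (i + 0.5) * dx) * dx
        if left_area >= 0.5:
            return lo + (i + 1) * dx
    return hi

# Assumed example PDF (not from the text): f(x) = 2x on [0, 1].
c = median_from_pdf(lambda x: 2.0 * x, 0.0, 1.0)
print(c, 1 / math.sqrt(2))
```

The accuracy is limited by the grid spacing, so the two printed values should agree only to about four or five decimal places.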

The median can also be evaluated from the CDF as follows.

Theorem 4.6

The median of a random variable \(X\) is the point \(c\) such that

$$F_X(c) = \frac{1}{2}.$$

Proof. Since \(F_X(x) = \int_{-\infty}^{x} f_X(x') \;dx'\), we have

$$\begin{aligned} F_X(c) &= \int_{-\infty}^{c} f_X(x) \;dx = \int_{c}^{\infty} f_X(x) \;dx = 1-F_X(c). \end{aligned}$$

Rearranging the terms shows that \(F_X(c) = \frac{1}{2}\).

How to find the median from the CDF
  • Find a point \(c\) such that \(F_X(c) = 0.5\).

Example 4.19

(Uniform random variable) Let \(X\) be a continuous random variable with PDF \(f_X(x) = \frac{1}{b-a}\) for \(a \le x \le b\), and 0 otherwise. We know that the CDF of \(X\) is \(F_X(x) = \frac{x-a}{b-a}\) for \(a \le x \le b\). The median of \(X\) is the number \(c \in \R\) such that \(F_X(c) = \frac{1}{2}\). Substituting into the CDF yields \(\frac{c-a}{b-a} = \frac{1}{2}\), which gives \(c = \frac{a+b}{2}\).

Example 4.20

(Exponential random variable) Let \(X\) be a continuous random variable with PDF \(f_X(x) = \lambda e^{-\lambda x}\) for \(x \ge 0\). We know that the CDF of \(X\) is \(F_X(x) = 1-e^{-\lambda x}\) for \(x \ge 0\). The median of \(X\) is the point \(c\) such that \(F_X(c) = \frac{1}{2}\). This gives \(1-e^{-\lambda c} = \frac{1}{2}\), i.e., \(e^{-\lambda c} = \frac{1}{2}\). Taking the natural logarithm of both sides gives \(c = \frac{\log 2}{\lambda}\).
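The two closed-form medians above can be checked numerically by solving \(F_X(c) = \frac{1}{2}\) with bisection. This is a minimal sketch: the helper `median_from_cdf`, the endpoints \(a = 2\), \(b = 6\), the rate \(\lambda = 3\), and the search interval \([0, 50]\) for the exponential case are all assumptions chosen for illustration.

```python
import math

def median_from_cdf(cdf, lo, hi, tol=1e-10):
    """Solve F_X(c) = 0.5 by bisection.

    Assumes cdf is nondecreasing with cdf(lo) <= 0.5 <= cdf(hi).
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

a, b = 2.0, 6.0   # assumed uniform endpoints
lam = 3.0         # assumed exponential rate

def uniform_cdf(x):
    return (x - a) / (b - a)

def exponential_cdf(x):
    return 1.0 - math.exp(-lam * x)

c_uniform = median_from_cdf(uniform_cdf, a, b)
c_exponential = median_from_cdf(exponential_cdf, 0.0, 50.0)

print(c_uniform, (a + b) / 2)             # numerical vs. (a + b) / 2
print(c_exponential, math.log(2) / lam)   # numerical vs. log(2) / lambda
```

Bisection only needs the CDF to be monotone, which makes it a convenient fallback when \(F_X(c) = \frac{1}{2}\) has no closed-form solution.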

4.4.2 Mode

The mode is the peak of the PDF. We can see this from the definition below.

Definition 4.10

Let \(X\) be a continuous random variable. The mode is the point \(c\) at which \(f_X(x)\) attains its maximum:

$$c = \argmax{x \in \Omega} \;\; f_X(x) = \argmax{x \in \Omega} \;\;\frac{d}{dx}F_X(x).$$

The second equality holds because \(f_X(x) = F_X'(x) = \frac{d}{dx}\int_{-\infty}^x f_X(t)\;dt\). A pictorial illustration of the mode is given in Figure 4.15. Note that the mode of a random variable is not necessarily unique, e.g., an equal-weight mixture of two Gaussians with the same variance and sufficiently separated means has two modes.
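To see when a Gaussian mixture has one or two modes, the sketch below counts the local maxima of the mixture PDF on a grid. The function names, the unit variance, the equal weights, and the two choices of means (\(\pm 3\) and \(\pm 0.5\)) are assumptions made for illustration only.

```python
import math

def mixture_pdf(x, mu1, mu2, sigma=1.0):
    """Equal-weight mixture of N(mu1, sigma^2) and N(mu2, sigma^2)."""
    def normal_pdf(value, mu):
        z = (value - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return 0.5 * normal_pdf(x, mu1) + 0.5 * normal_pdf(x, mu2)

def count_local_maxima(pdf, lo, hi, steps=20_000):
    """Count strict interior local maxima of pdf on a uniform grid."""
    xs = [lo + i * (hi - lo) / steps for i in range(steps + 1)]
    ys = [pdf(x) for x in xs]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])

# Means far apart relative to sigma: two peaks.
far = count_local_maxima(lambda x: mixture_pdf(x, -3.0, 3.0), -10.0, 10.0)
# Means close together relative to sigma: the two bumps merge into one peak.
near = count_local_maxima(lambda x: mixture_pdf(x, -0.5, 0.5), -10.0, 10.0)

print(far, near)
```

With these assumed parameters, the well-separated mixture shows two peaks while the closely spaced one shows a single peak, so whether the mode is unique depends on how far apart the means are relative to the spread.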

Figure 4.15. [Left] The mode appears at the peak of the PDF. [Right] The mode appears at the steepest slope of the CDF.

How to find the mode from the PDF
  • Find a point \(c\) such that \(f_X(c)\) is maximized.

How to find the mode from the CDF

  • Continuous: Find a point \(c\) at which \(F_X\) has the steepest slope.
  • Discrete: Find a point \(c\) at which \(F_X\) has the largest jump.

Example 4.21

Let \(X\) be a continuous random variable with PDF \(f_X(x) = 6x(1-x)\) for \(0 \le x \le 1\). The mode of \(X\) happens at \(\argmax{x} \; f_X(x)\). To find this maximum, we take the derivative of \(f_X\) and set it to zero:

$$\begin{aligned} 0 = \frac{d}{dx} f_X(x) = \frac{d}{dx} 6x(1-x) = 6(1-2x). \end{aligned}$$

Solving for \(x\) yields \(x = \frac12\).

To ensure that this point is a maximum, we take the second-order derivative:

$$\begin{aligned} \frac{d^2}{dx^2} f_X(x) = \frac{d}{dx} 6(1-2x) = -12 < 0. \end{aligned}$$

Therefore, we conclude that \(x = \frac12\) is a maximum point. Hence, the mode of \(X\) is \(x = \frac12\).
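The same mode can be located numerically, both as the peak of the PDF and as the steepest slope of the CDF. The sketch below assumes a grid of 1000 intervals and uses the CDF \(F_X(x) = 3x^2 - 2x^3\), obtained by integrating \(6t(1-t)\) from 0 to \(x\); the variable names are illustrative.

```python
def pdf(x):
    return 6.0 * x * (1.0 - x)

def cdf(x):
    # Integral of 6t(1 - t) from 0 to x.
    return 3.0 * x**2 - 2.0 * x**3

steps = 1000
grid = [i / steps for i in range(steps + 1)]

# Mode from the PDF: the grid point where f_X is largest.
mode_from_pdf = max(grid, key=pdf)

# Mode from the CDF: midpoint of the grid interval with the steepest
# finite-difference slope.
slopes = [(cdf(grid[i + 1]) - cdf(grid[i])) * steps for i in range(steps)]
k = max(range(steps), key=lambda i: slopes[i])
mode_from_cdf = 0.5 * (grid[k] + grid[k + 1])

print(mode_from_pdf, mode_from_cdf)
```

The CDF-based estimate is accurate only up to the grid spacing, because the two intervals adjacent to \(x = \frac12\) have essentially the same slope.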

4.4.3 Mean

We have defined the mean as the expectation of \(X\). Here, we show how to compute the expectation from the CDF. To simplify the demonstration, let us first assume that \(X > 0\).

Lemma 4.1

Let \(X> 0\). Then \(\E[X]\) can be computed from \(F_X\) as

$$\E[X] = \int_{0}^{\infty} \left(1-F_X(t)\right) \;dt.$$

Proof. The trick is to change the integration order:

$$\begin{aligned} \int_{0}^{\infty} \left(1-F_X(t)\right) \;dt &= \int_{0}^{\infty} \left[1-\Pb[X \le t]\right] \;dt = \int_{0}^{\infty} \Pb[X > t] \;dt\\ &= \int_0^{\infty} \int_{t}^{\infty} f_X(x) \;dx \;dt \overset{(a)}{=} \int_0^{\infty} \int_{0}^{x} f_X(x) \;dt \;dx \\ &= \int_0^{\infty} \int_{0}^{x} \;dt f_X(x) \;dx = \int_0^{\infty} x f_X(x) \;dx = \E[X]. \end{aligned}$$

Here, step \((a)\) is due to the change of integration order. See Figure 4.16 for an illustration.

Figure 4.16. The double integration can be evaluated by \(x\) then \(t\), or \(t\) then \(x\).

We draw a picture to illustrate the above lemma. As shown in Figure 4.17, the mean of a positive random variable \(X>0\) is equal to the area above the CDF, i.e., the area between the curve \(F_X(t)\) and the horizontal line at 1 for \(t \ge 0\).

Figure 4.17. The mean of a positive random variable \(X>0\) can be calculated by integrating the CDF's complement.
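Lemma 4.1 can be sanity-checked numerically. The sketch below integrates \(1 - F_X(t)\) for an exponential random variable, whose mean is known to be \(1/\lambda\). The helper name `mean_from_cdf_positive`, the rate \(\lambda = 2\), and the truncation of the integral at \(t = 40\) are assumptions made for this illustration.

```python
import math

def mean_from_cdf_positive(cdf, upper, steps=200_000):
    """Approximate the integral of 1 - F_X(t) over [0, upper].

    Uses the midpoint rule. `upper` truncates the infinite range and
    should be large enough that 1 - F_X(upper) is negligible.
    """
    dt = upper / steps
    return sum((1.0 - cdf((i + 0.5) * dt)) * dt for i in range(steps))

lam = 2.0  # assumed rate of the exponential random variable

def exponential_cdf(t):
    return 1.0 - math.exp(-lam * t)

estimate = mean_from_cdf_positive(exponential_cdf, upper=40.0)
print(estimate, 1.0 / lam)   # area above the CDF vs. the known mean 1 / lambda
```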

Lemma 4.2

Let \(X < 0\). Then \(\E[X]\) can be computed from \(F_X\) as

$$\E[X] = -\int_{-\infty}^{0} F_X(t) \;dt.$$

Proof. The idea here is also to change the integration order.

$$\begin{aligned} \int_{-\infty}^{0} F_X(t) \;dt = \int_{-\infty}^0 \Pb[X \le t] \;dt &= \int_{-\infty}^0 \int_{-\infty}^{t} f_X(x) \;dx \;dt\\ &= \int_{-\infty}^0 \int_{x}^{0} f_X(x) \;dt \;dx = \int_{-\infty}^0 (-x) f_X(x) \;dx = -\E[X]. \end{aligned}$$

Here, the inner integral is \(\int_{x}^{0} dt = -x\). Multiplying both sides by \(-1\) gives the result.

Theorem 4.7

The mean of a random variable \(X\) can be computed from the CDF as

$$\E[X] = \int_{0}^{\infty} \left(1-F_X(t)\right) \;dt - \int_{-\infty}^{0} F_X(t)\;dt.$$

Proof. For any random variable \(X\), we can partition \(X = X^+ - X^-\) where \(X^+ = \max(X, 0)\) and \(X^- = \max(-X, 0)\) are the positive and negative parts, respectively (both are nonnegative). Then, the above two lemmas will give us

$$\begin{aligned} \E[X]= \E[X^+ - X^-] &= \E[X^+] - \E[X^-] \\ &= \int_{0}^{\infty} \left(1-F_X(t)\right) \;dt - \int_{-\infty}^{0} F_X(t)\;dt. \end{aligned}$$

As illustrated in Figure 4.18, this equation is equivalent to computing the area above the CDF for \(t > 0\) and the area below the CDF for \(t < 0\), and taking the difference.

Figure 4.18. The mean of a random variable \(X\) can be calculated from the area above the CDF for \(t > 0\) and the area below the CDF for \(t < 0\).

How to find the mean from the CDF

  • A formula is given by Eq. (4.20):

    $$\E[X] = \int_{0}^{\infty} \left(1-F_X(t)\right) \;dt - \int_{-\infty}^{0} F_X(t)\;dt.$$
  • This result is not commonly used, but the proof technique of switching the integration order is important.
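Although the formula is rarely used in practice, it is easy to check numerically. The sketch below evaluates the two integrals for a Gaussian random variable and compares their difference with the known mean. The parameters \(\mu = -1.5\), \(\sigma = 2\), the truncation of the integrals to \([-40, 40]\), and the helper names are assumptions made for illustration; the Gaussian CDF is written with the standard-library `math.erf`.

```python
import math

def normal_cdf(t, mu, sigma):
    """CDF of a Gaussian with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))

def mean_from_cdf(cdf, lower, upper, steps=200_000):
    """Approximate the mean from the CDF on the truncated range [lower, upper].

    Computes (area above the CDF on [0, upper]) minus (area below the
    CDF on [lower, 0]) with the midpoint rule. Requires lower < 0 < upper.
    """
    dt_pos = upper / steps
    dt_neg = -lower / steps
    area_above = sum((1.0 - cdf((i + 0.5) * dt_pos)) * dt_pos for i in range(steps))
    area_below = sum(cdf(lower + (i + 0.5) * dt_neg) * dt_neg for i in range(steps))
    return area_above - area_below

mu, sigma = -1.5, 2.0   # assumed Gaussian parameters
estimate = mean_from_cdf(lambda t: normal_cdf(t, mu, sigma), -40.0, 40.0)
print(estimate, mu)
```

Because the assumed mean is negative, the area below the CDF on the negative axis exceeds the area above it on the positive axis, and the difference comes out negative.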