Probability for Data Science
eBook  ›  Chapter 9 · Confidence and Hypothesis
Section 9.1

Confidence Interval

The first topic we discuss in this chapter is the confidence interval. At a high level, the confidence interval tells us the quality of our estimator with respect to the number of samples. We begin this section by reviewing the randomness of an estimator. Then we develop the concept of the confidence interval. We discuss several methods for constructing and interpreting these confidence intervals.

9.1.1 The randomness of an estimator

Imagine that we have a dataset \(\calX = \{X_1,\ldots,X_N\}\), where we assume that \(X_n\) are i.i.d. copies drawn from a distribution \(f_{X}(x; \theta)\). We want to construct an estimator \(\Thetahat\) of \(\theta\) from the dataset \(\calX\). For example, if \(f_X\) is a Gaussian distribution with an unknown mean \(\theta\), we would like to estimate \(\theta\) using the sample average \(\Thetahat\). In statistics, an estimator \(\Thetahat\) is also known as a statistic, which is constructed from the samples. In this book we use the terms “estimator” and “statistic” interchangeably. Written as equations, an estimator is a function of the samples:

$$\underset{\text{estimator}}{\underbrace{\Thetahat}} = \;\; \underset{\text{function of $\calX$}}{\underbrace{g(X_1,\ldots,X_N)}},$$

where \(g\) is a function that takes the samples \(X_1,\ldots,X_N\) and returns a random variable \(\Thetahat\). For example, the sample average

$$\Thetahat= \underset{g(X_1,\ldots,X_N)}{\underbrace{\frac{1}{N}\sum_{n=1}^N X_n}}$$

is an estimator because it is computed by summing the samples \(X_1,\ldots,X_N\) and dividing it by \(N\).

What is an estimator?
  • An estimator \(\Thetahat\) is a function of the samples \(X_1,\ldots,X_N\):

    $$\Thetahat = g(X_1,\ldots,X_N).$$
  • \(\Thetahat\) is a random variable. It has a PDF, CDF, mean, variance, etc.

By construction, \(\Thetahat\) is a random variable because it is a function of the random samples. Therefore, \(\Thetahat\) has its own PDF, CDF, mean, variance, etc. Since \(\widehat{\Theta}\) is a random variable, we should report both the estimator's value and the estimator's confidence when reporting its performance. The confidence measures the quality of \(\widehat{\Theta}\) when compared to the true parameter \(\theta\). It provides a measure of the reliability of the estimator \(\widehat{\Theta}\). If \(\widehat{\Theta}\) fluctuates a great deal we may not be confident of our estimates. Let's consider the following example.

Example 9.1

A class of 1000 students took a test. The distribution of the score is roughly a Gaussian with mean 50 and standard deviation 20. A teaching assistant was too lazy to calculate the true population mean. Instead, he sampled a subset of 5 scores listed as follows:

Student ID    1    2    3    4    5
Scores       11   97    1   78   82

He calculated the average, which is 53.8. This is a very good estimate of the class average (which is 50). What is wrong with his procedure?

Solution

He was just lucky. It is quite possible that if he sampled another 5 scores, he would get something very different. For example, if he looks at the scores of students 11 to 15, he could get:

Student ID   11   12   13   14   15
Scores       44   29   19   27   15

In this case the average is 26.8.

Both 53.8 and 26.8 are legitimate estimates, but they are random realizations of a random variable \(\Thetahat\). This \(\Thetahat\) has a PDF, CDF, mean, variance, etc. It may be misleading to simply report the estimated value from a particular instance, so the confidence of the estimator must be specified.
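The effect in Example 9.1 can be reproduced with a short simulation. The sketch below assumes that the scores are exact draws from a Gaussian with mean 50 and standard deviation 20 (ignoring the fact that real scores are bounded), and it uses an arbitrary seed. The names `rng`, `scores`, and `averages` are illustrative and do not come from the example.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility
mu, sigma, N = 50, 20, 5                # class mean, class std, subset size

# Repeat the teaching assistant's procedure 10000 times:
# draw 5 scores and record their average.
scores   = rng.normal(mu, sigma, size=(10000, N))
averages = scores.mean(axis=1)

print(averages[:5])                     # five different "estimates"
print(averages.mean())                  # close to the class mean 50
print(averages.std())                   # close to sigma/sqrt(N), about 8.94
```

Each entry of `averages` is one realization of \(\Thetahat\). With only 5 scores per subset the realizations scatter widely around 50, which is why a single reported value says little without a statement of confidence.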

Distributions of \(\Thetahat\). We next discuss the distribution of \(\Thetahat\). Figure 9.1 illustrates several key ideas. Suppose that the population distribution \(f_X(x)\) is a mixture of two Gaussians. Let \(\theta\) be the mean of this distribution (somewhere between the two peak locations). We sample \(N = 50\) data points \(X_1,\ldots,X_N\) from this distribution. However, the 50 data points we sample today could differ from the 50 data points we sample tomorrow. If we compute the sample average from each of these finite-sample distributions, we will obtain a set of sample averages \(\Thetahat\). Notably, we have a set of \(\Thetahat\) because today we have one \(\Thetahat\) and tomorrow we have another \(\Thetahat\). By plotting the histogram of the sample averages \(\Thetahat\), we will have a distribution.

Figure 9.1
Figure 9.1. Pictorial illustration of the randomness of the estimator \(\Thetahat\). Given a population, our datasets are usually a subset of the population. Computing the sample average from these finite-sample distributions introduces the randomness to \(\Thetahat\). If we plot the histogram of the sample averages, we will obtain a distribution. The mean of this distribution is the population mean, but there is a nontrivial amount of fluctuation. The purpose of the concept of confidence interval is to quantify this fluctuation.

The histogram of \(\Thetahat\) depends on several factors. According to the Central Limit Theorem, the shape of \(f_{\Thetahat}(\theta)\) is approximately a Gaussian when \(N\) is large, because \(\Thetahat\) is the average of \(N\) i.i.d. random variables. If \(\Thetahat\) is not the average of i.i.d. random variables, the shape is not necessarily a Gaussian. This results in additional complications, so we will discuss some tools for dealing with this problem. The spread of the distribution of \(\Thetahat\) is mainly driven by the number of samples we have in each sub-dataset. The more samples we have in a sub-dataset, the more closely the sub-dataset represents the population, so the sample average is more accurate and its fluctuation is smaller.

Before we continue, let's summarize the randomness of \(\Thetahat\):

What is the randomness of \(\Thetahat\)?
  • \(\Thetahat\) is generated from a finite-sample dataset. Each time we draw a finite-sample dataset, we introduce randomness.
  • If \(\Thetahat\) is the sample average, the PDF is (roughly) a Gaussian. If \(\Thetahat\) is not a sample average, the PDF is not necessarily a Gaussian.
  • The spread of the fluctuation depends on the number of samples in each sub-dataset.
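The dependence of the spread on the number of samples can be checked numerically. The following is a minimal sketch, assuming a specific population that the text does not specify: an equal-weight mixture of Gaussian(-2, 1) and Gaussian(3, 1). The helper `sample_mixture`, the seed, and the list of sub-dataset sizes are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed

def sample_mixture(size):
    """Draw from an equal-weight mixture of Gaussian(-2,1) and Gaussian(3,1)."""
    pick = rng.random(size) < 0.5
    return np.where(pick, rng.normal(-2, 1, size), rng.normal(3, 1, size))

# Population mean and standard deviation of this assumed mixture
theta = 0.5*(-2) + 0.5*3                     # = 0.5
sigma = np.sqrt(1 + 0.25*(3 - (-2))**2)      # = sqrt(7.25)

spread = {}
for N in [10, 50, 250]:
    # 5000 sub-datasets of N samples each; one sample average per sub-dataset
    theta_hat = sample_mixture((5000, N)).mean(axis=1)
    spread[N] = theta_hat.std()
    print(N, theta_hat.mean(), spread[N], sigma/np.sqrt(N))
```

In each printed row the empirical standard deviation of \(\Thetahat\) should track \(\sigma/\sqrt{N}\), while the mean of \(\Thetahat\) stays near the population mean regardless of \(N\).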

9.1.2 Understanding confidence intervals

The confidence interval is a probabilistic statement about \(\Thetahat\). Instead of studying \(\Thetahat\) as a point, we construct an interval

$$\calI = \left[\Thetahat - \epsilon, \;\; \Thetahat + \epsilon\right],$$

for some \(\epsilon\) to be determined. Note that this interval is a random interval: If we have a different realization of \(\Thetahat\), we will have a different \(\calI\). We call \(\calI\) the confidence interval for the estimator \(\Thetahat\).

Given this random interval, we ask: What is the probability that \(\calI\) includes \(\theta\)? That means that we want to evaluate the probability

$$\Pb[\theta \in \calI] = \Pb\left[\widehat{\Theta} - \epsilon \le \theta \le \widehat{\Theta} + \epsilon\right].$$

We emphasize that the randomness in this probability is caused by \(\Thetahat\), not \(\theta\). This is because the interval \(\calI\) changes when we conduct a different experiment to obtain a different \(\Thetahat\). The situation is similar to that illustrated on the left-hand side of Figure 9.2. The confidence interval \(\calI\) changes but the true parameter \(\theta\) is fixed.

Figure 9.2
Figure 9.2. Confidence interval is the random interval \(\calI = [\Thetahat-\epsilon, \Thetahat+\epsilon]\), not the deterministic interval \([\theta-\epsilon,\theta+\epsilon]\). The random interval in the former case does not require any knowledge about the true parameter \(\theta\), whereas the latter requires \(\theta\). By claiming a 95% confidence interval, we say that there is 95% chance that the random interval will include the true parameter. So if you have 100 random realizations of the confidence intervals, then 95 on average will include the true parameter.

Confidence intervals can be confusing. Often the confusion arises because of the following identity:

$$\begin{aligned} \Pb\left[\widehat{\Theta} - \epsilon \le \theta \le \widehat{\Theta} + \epsilon\right] &= \Pb\left[- \epsilon \le \theta - \widehat{\Theta} \le \epsilon\right] \\ &= \Pb\left[- \epsilon - \theta \le - \widehat{\Theta} \le \epsilon - \theta\right] \\ &= \Pb\left[\theta - \epsilon \le \Thetahat \le \theta + \epsilon\right]. \end{aligned}$$

Although the values of the two probabilities are the same, the two events are interpreted differently. The right-hand side of Figure 9.2 illustrates \(\Pb[\theta - \epsilon \le \Thetahat \le \theta + \epsilon]\). The interval \([\theta-\epsilon,\theta+\epsilon]\) is fixed. What is the probability that the estimator \(\Thetahat\) lies within this deterministic interval? To find this probability, we need to know the true parameter \(\theta\), which is not available. By contrast, the other probability \(\Pb[\Thetahat-\epsilon \le \theta \le \Thetahat+\epsilon]\) does not require any knowledge about the true parameter \(\theta\). What is the probability that the true parameter is included inside the random interval? If the probability is high, we say that there is a good chance that our confidence interval will contain the true parameter. This is observed in the left-hand side of Figure 9.2.
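To see that the two events really are the same event, we can compare them realization by realization. The sketch below assumes Gaussian samples and picks illustrative values for \(\theta\), \(\sigma\), \(N\), and \(\epsilon\); none of these numbers come from the text.

```python
import numpy as np

rng = np.random.default_rng(2)             # arbitrary seed
theta, sigma, N, eps = 3.0, 1.0, 25, 0.3   # illustrative values

# 10000 realizations of the sample average
theta_hat = rng.normal(theta, sigma, size=(10000, N)).mean(axis=1)

# Event A: the random interval [theta_hat - eps, theta_hat + eps] covers theta
A = (theta_hat - eps <= theta) & (theta <= theta_hat + eps)
# Event B: theta_hat falls in the fixed interval [theta - eps, theta + eps]
B = (theta - eps <= theta_hat) & (theta_hat <= theta + eps)

print(np.array_equal(A, B))   # do the two events agree in every realization?
print(A.mean())               # empirical probability of the event
```

The simulation can evaluate both events because it knows \(\theta\). In practice only the first form is useful, since the interval \([\Thetahat-\epsilon, \Thetahat+\epsilon]\) can be written down without \(\theta\).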

In practice we often set \(\Pb[\Thetahat-\epsilon \le \theta \le \Thetahat+\epsilon]\) to be greater than a certain confidence level, say 95%, and then we determine \(\epsilon\). Once we have determined \(\epsilon\), we can claim that with 95% probability the interval \([\Thetahat - \epsilon, \; \Thetahat + \epsilon]\) will include the unknown parameter \(\theta\). We do not need to know \(\theta\) at any point in this process.

To make this more general, we define \(1-\alpha\) as the confidence level for some parameter \(\alpha\). For example, if we would like to have a 95% confidence level, we set \(\alpha = 0.05\). Then the probability inequality

$$\Pb\left[\widehat{\Theta} - \epsilon \le \theta \le \widehat{\Theta} + \epsilon\right] \ge 1-\alpha$$

tells us that there is at least a 95% chance that the random interval \(\calI = [\Thetahat - \epsilon, \;\; \Thetahat + \epsilon]\) will include the true parameter \(\theta\). In this case we say that \(\calI\) is a “95% confidence interval”.

What is a 95% confidence interval?
  • It is a random interval \([\Thetahat-\epsilon, \Thetahat+\epsilon]\) such that there is a 95% probability for it to include the true parameter \(\theta\).
  • It is not the deterministic interval \([\theta-\epsilon,\theta+\epsilon]\), because we never know \(\theta\).

Example 9.2

After analyzing the life expectancy of people in the United States, it was concluded that the 95% confidence interval is (77.8, 79.1) years old. Is the following claim valid?

About 95% of the people in the United States have a life expectancy between 77.8 years old and 79.1 years old.

Solution

No. The confidence interval tells us that with 95% probability the random interval \((77.8, 79.1)\) will include the true average. We emphasize that \((77.8, 79.1)\) is random because it is constructed from a small set of data points. If we survey another set of people we will have another interval.

Since we do not know the true average, we do not know the percentage of people whose life expectancy is between 77.8 years old and 79.1 years old. It could be that the true average is 80 years old, which is out of the range. It could also be that the true average is 77.9 years old, which is within the range, but only \(10\%\) of the population may have life expectancy in \((77.8, 79.1)\).

Example 9.3

After studying the SAT scores of 1000 high school students, it was concluded that the 95% confidence interval is (1134, 1250) points. Is the following claim valid?

There is a 95% probability that the average SAT score in the population is in the range 1134 and 1250.

Solution

Yes, but it can be made clearer. The average SAT score in the population remains unknown. It is a constant and it is deterministic, so there is no probability associated with it. A better way to say this is: “There is a 95% probability that the random interval (1134, 1250) will include the average SAT score.” We emphasize that the 95% probability is about the random interval, not the unknown parameter.

9.1.3 Constructing a confidence interval

Let's consider an example. Suppose that we have a set of i.i.d. observations \(X_1,\ldots,X_N\) that are Gaussians with an unknown mean \(\theta\) and a known variance \(\sigma^2\). We consider the maximum-likelihood estimator, which is the sample average:

$$\Thetahat = \frac{1}{N}\sum_{n=1}^N X_n.$$

Our goal is to construct a confidence interval.

Before we consider the equations, let's look at a graph illustrating what we want to achieve. Figure 9.3 shows a population distribution, which is a Gaussian in this example.

Figure 9.3
Figure 9.3. Conceptual illustration of how to construct a confidence interval. Starting with the population, we draw random subsets. Each random subset gives us an estimator, and correspondingly an interval.

We draw \(N\) samples from the Gaussian to construct a random subset. Based on this random subset we construct the estimator \(\Thetahat\). Since this estimator is based on the particular random subset we have, we can follow the same approach by drawing another random subset. To differentiate the estimators constructed by the different random subsets, let's call the estimators \(\Thetahat^{(1)}\) and \(\Thetahat^{(2)}\), respectively. For each estimator we construct an interval \([\Thetahat-\epsilon, \; \Thetahat+\epsilon]\) to obtain two different intervals:

$$\begin{aligned} \calI^1 = [\Thetahat^{(1)}-\epsilon, \; \Thetahat^{(1)}+\epsilon] \qquad\mbox{and}\qquad \calI^2 = [\Thetahat^{(2)}-\epsilon, \; \Thetahat^{(2)}+\epsilon]. \end{aligned}$$

If we can determine \(\epsilon\), we have found the confidence interval.

We can determine \(\epsilon\) by inspecting the histogram of \(\Thetahat\), which in our case is the histogram of the sample average. This histogram is a Gaussian because the average of \(N\) i.i.d. Gaussian random variables is Gaussian. Therefore, the width of the confidence interval is determined by the answer to this question:

For what \(\epsilon\) can we cover 95% of the histogram of \(\Thetahat\)?

To find the answer, we set up the following probability inequality:

$$\Pb\left[\frac{|\Thetahat - \E[\Thetahat]|}{\sqrt{\Var[\Thetahat]}} \le \epsilon \right] \ge 1-\alpha.$$

This probability says that we want to find an \(\epsilon\) such that the majority of \(\Thetahat\) is living close to its mean. The level \(1-\alpha\) is our confidence level, which is typically 95%. Equivalently, we let \(\alpha = 0.05\).

In the above equation, we can define the quotient as

$$\widehat{Z} \bydef \frac{\Thetahat - \E[\Thetahat]}{\sqrt{\Var[\Thetahat]}}.$$

We know that \(\widehat{Z}\) is a zero-mean unit-variance Gaussian because it is the standardized variable. [Note: Not all normalized variables are Gaussian, but if \(\Thetahat\) is a Gaussian the normalized variable will remain a Gaussian.] Thus, the probability inequality we are looking at is

$$\underset{\text{two tails of a standard Gaussian}}{\underbrace{\Pb\left[|\widehat{Z}| \le \epsilon \right]}} \ge \qquad \quad 1-\alpha.$$

The PDF of \(\widehat{Z}\) is shown in Figure 9.4. As you can see, to achieve 95% confidence we need to pick an appropriate \(\epsilon\) such that the shaded area is less than 5%.

Figure 9.4
Figure 9.4. PDF of the random variable \(\widehat{Z} = (\Thetahat - \E[\Thetahat])/\sqrt{\Var[\Thetahat]}\). The shaded area denotes the tail probability \(\alpha = 0.05\), which corresponds to a confidence level of \(1-\alpha = 0.95\).

Since \(\Pb[\widehat{Z} \le \epsilon]\) is the CDF of a Gaussian, it follows that

$$\begin{aligned} \Pb[|\widehat{Z}| \le \epsilon] &= \Pb[-\epsilon \le \widehat{Z} \le \epsilon] \\ &= \Pb[\widehat{Z} \le \epsilon] - \Pb[\widehat{Z} \le -\epsilon]\\ &= \Phi\left(\epsilon \right) - \Phi\left(-\epsilon \right). \end{aligned}$$

Using the symmetry of the Gaussian, it follows that \(\Phi\left(-\epsilon\right) = 1-\Phi\left(\epsilon \right)\) and hence

$$\Pb[|\widehat{Z}| \le \epsilon] = 2\Phi\left(\epsilon \right)-1.$$

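Substituting this result into the probability inequality \(\Pb[|\widehat{Z}| \le \epsilon] \ge 1-\alpha\) gives \(2\Phi(\epsilon) - 1 \ge 1-\alpha\), i.e., \(\Phi(\epsilon) \ge 1-\frac{\alpha}{2}\). Since \(\Phi\) is increasing, we have that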

$$\begin{aligned} \epsilon \ge \Phi^{-1}\left(1-\frac{\alpha}{2}\right). \end{aligned}$$

The remainder of this problem is solvable on a computer. In MATLAB, we can call icdf to compute the inverse CDF of a standard Gaussian. In Python, the command is stats.norm.ppf. The commands are as shown below.

MATLAB
% MATLAB code to compute the width of the confidence interval
    alpha = 0.05;
    mu = 0; sigma = 1; % Standard Gaussian
    epsilon = icdf('norm',1-alpha/2,mu,sigma)
Python
# Python code to compute the width of the confidence interval
import scipy.stats as stats
alph = 0.05
mu, sigma = 0, 1  # Standard Gaussian
epsilon = stats.norm.ppf(1-alph/2, mu, sigma)
print(epsilon)

If everything is done properly, we see that for a 95% confidence level (\(\alpha = 0.05\)) the corresponding \(\epsilon\) is \(\epsilon = 1.96\).

After determining \(\epsilon\), it remains to determine \(\E[\Thetahat]\) and \(\Var[\Thetahat]\) in order to complete the probability inequality. To this end, we note that

$$\begin{aligned} \E[\Thetahat] &= \E\left[ \frac{1}{N}\sum_{n=1}^N X_n \right] = \theta,\\ \Var[\Thetahat] &= \Var\left[ \frac{1}{N}\sum_{n=1}^N X_n \right] = \frac{\sigma^2}{N}, \end{aligned}$$

if we assume that the population distribution is \(\text{Gaussian}(\theta,\sigma^2)\), where \(\theta\) is unknown but \(\sigma\) is known. Substituting these into the probability inequality, we have that

$$\begin{aligned} \Pb\left[\frac{|\Thetahat - \E[\Thetahat]|}{\sqrt{\Var[\Thetahat]}} \le \epsilon \right] &= \Pb\left[\Thetahat - \epsilon \frac{\sigma}{\sqrt{N}} \le \theta \le \Thetahat + \epsilon \frac{\sigma}{\sqrt{N}} \right]\\ &= \Pb\left[\Thetahat -1.96 \frac{\sigma}{\sqrt{N}} \le \theta \le \Thetahat + 1.96 \frac{\sigma}{\sqrt{N}} \right], \end{aligned}$$

where we let \(\epsilon = 1.96\) for a 95% confidence level. Therefore, the 95% confidence interval is

$$\left[\Thetahat - 1.96 \frac{\sigma}{\sqrt{N}}, \quad \Thetahat + 1.96 \frac{\sigma}{\sqrt{N}}\right].$$

As you can see, we do not need to know the value of \(\theta\) at any point of the derivation because the confidence interval in Eq. (9.5) does not involve \(\theta\). This is an important difference with the other probability \(\Pb[\theta-\epsilon \le \Thetahat \le \theta+\epsilon]\), which requires \(\theta\).
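The claim that the interval covers \(\theta\) with 95% probability can be checked by simulation. The sketch below is one way to do it under the same assumptions as the derivation (i.i.d. Gaussian samples, known \(\sigma\)). The values of \(\theta\), \(\sigma\), \(N\), the number of trials, and the seed are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)             # arbitrary seed
theta, sigma, N = 1.0, 2.0, 30             # theta is "unknown" to the procedure
alpha = 0.05
z = stats.norm.ppf(1 - alpha/2)            # about 1.96 for a 95% level

trials = 20000
x = rng.normal(theta, sigma, size=(trials, N))
theta_hat = x.mean(axis=1)
half = z * sigma / np.sqrt(N)              # half-width uses sigma and N only

# theta is used only here, to score whether each interval covered it
covered = (theta_hat - half <= theta) & (theta <= theta_hat + half)
coverage = covered.mean()
print(coverage)                            # should be near 0.95
```

Note that `theta` never enters the construction of the interval. It appears only in the last step, where we count how many of the random intervals happened to include it.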

How to construct a confidence interval
  • Compute the estimator \(\widehat{\Theta}\).
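  • Determine the critical value \(\epsilon\) from the confidence level \(1-\alpha\). If \(\widehat{\Theta}\) is Gaussian, then \(\epsilon = \Phi^{-1}(1-\frac{\alpha}{2})\).
  • If \(\Thetahat\) is not a Gaussian, replace the Gaussian CDF by the CDF of the standardized variable \(\widehat{Z} = (\Thetahat - \E[\Thetahat])/\sqrt{\Var[\Thetahat]}\).
  • The confidence interval is \([\Thetahat -\epsilon\sqrt{\Var[\Thetahat]}, \; \Thetahat + \epsilon\sqrt{\Var[\Thetahat]}]\). For the sample average with known \(\sigma\), \(\sqrt{\Var[\Thetahat]} = \sigma/\sqrt{N}\).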

9.1.4 Properties of the confidence interval

Some important properties of the confidence interval are listed below.

Practice Exercise 9.1

Suppose that the number of photos a Facebook user uploads per day is a random variable with \(\sigma = 2\). In a set of 341 users, the sample average is 2.9. Find the 90% confidence interval of the population mean.

Solution

We set \(\alpha = 0.1\). The \(z_\alpha\)-value is

$$\begin{aligned} z_\alpha = \Phi^{-1}\left(1-\frac{\alpha}{2}\right) = 1.6449. \end{aligned}$$

The 90% confidence interval is then

$$\begin{aligned} \left[\Thetahat - 1.64 \frac{2}{\sqrt{341}}, \quad \Thetahat + 1.64 \frac{2}{\sqrt{341}} \right] = [2.72, 3.08]. \end{aligned}$$

Therefore, with 90% probability, the interval \([2.72, 3.08]\) includes the population mean.
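The arithmetic of Practice Exercise 9.1 can be reproduced in a few lines. The helper `z_interval` below is a hypothetical name introduced for this sketch, and it assumes that the sample average is (approximately) Gaussian with known \(\sigma\), as in the derivation of Section 9.1.3.

```python
import numpy as np
from scipy import stats

def z_interval(theta_hat, sigma, N, alpha):
    """Two-sided (1-alpha) interval for the mean, assuming sigma is known."""
    z = stats.norm.ppf(1 - alpha/2)        # critical value z_alpha
    margin = z * sigma / np.sqrt(N)        # z_alpha times the standard error
    return theta_hat - margin, theta_hat + margin

# Numbers from Practice Exercise 9.1: 341 users, average 2.9, sigma = 2, 90% level
lo, hi = z_interval(theta_hat=2.9, sigma=2, N=341, alpha=0.1)
print(round(lo, 2), round(hi, 2))
```

Rounded to two decimals, the endpoints agree with the interval \([2.72, 3.08]\) obtained above.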

Example 9.4

Professional cyber-athletes have a standard deviation of \(\sigma = 73.4\) actions per minute. If we want to estimate the average actions per minute of the population, how many samples are needed to obtain a margin of error \(<20\) at 90% confidence?

Solution

With a 90% confidence level, the \(z_\alpha\)-value is

$$\begin{aligned} z_\alpha &= \Phi^{-1}\left(1-\frac{\alpha}{2}\right) = \Phi^{-1}(0.95) = 1.645. \end{aligned}$$

The margin of error must not exceed 20, so we require \(z_\alpha \frac{\sigma}{\sqrt{N}} \le 20\). Rearranging the terms gives us

$$N \ge \left(z_\alpha \frac{\sigma}{20}\right)^2 = 36.45.$$

Therefore, we need at least \(N = 37\) samples to ensure a margin of error of \(< 20\) at a 90% confidence level.
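The sample-size calculation in Example 9.4 can be verified numerically. The sketch below follows the same formula; the variable names (`target`, `N_min`) are illustrative.

```python
import math
from scipy import stats

sigma, target, alpha = 73.4, 20, 0.10
z = stats.norm.ppf(1 - alpha/2)          # about 1.645
N_min = (z * sigma / target)**2          # about 36.45
N = math.ceil(N_min)                     # round up to a whole number of samples
print(N_min, N)

# Check: margin of error with N samples, and with one sample fewer
print(z * sigma / math.sqrt(N), z * sigma / math.sqrt(N - 1))
```

The last line shows why rounding up is necessary: with \(N = 37\) the margin of error falls below 20, whereas with \(N = 36\) it is still slightly above 20.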

The concepts of standard error se, the \(z_\alpha\) value, and the margin of error are summarized in Figure 9.5.

Figure 9.5
Figure 9.5. Relationships between the standard error se, the \(z_\alpha\) value, and the margin of error. The parameter \(\alpha\) is the area under the curve for the tails of each PDF; the confidence level is \(1-\alpha\).

The left-hand side is the PDF of \(\widehat{Z}\). It is the normalized random variable, which is also the standard Gaussian. The right-hand side is the PDF of \(\Thetahat\), the unnormalized random variable. The \(z_\alpha\) value is located in the \(\widehat{Z}\)-space. It defines the range of \(\widehat{Z}\) in the PDF within which we are confident about the true parameter. The corresponding value in the \(\Thetahat\)-space is the margin of error. This is found by multiplying \(z_\alpha\) with the standard deviation of \(\Thetahat\), known as the standard error. Correspondingly, in the \(\widehat{Z}\)-space the standard deviation is unity.

Two further points about the confidence interval should be mentioned:

Figure 9.6
Figure 9.6. The PDF of \(\Thetahat\) as the number of samples \(N\) grows. Here, we assume that \(X_n\) are i.i.d. Gaussian random variables with mean \(\theta = 0\) and variance \(\sigma^2 = 1\).

9.1.5 Student's \(t\)-distribution

In the discussions above, we estimate the population mean \(\theta\) using the estimator \(\Thetahat\). The assumption was that the variance \(\sigma^2\) was known a priori and hence is fixed. In practice, however, there are many situations where \(\sigma^2\) is not known. Thus we need to use not only the mean estimator \(\Thetahat\) but also the variance estimator \(\widehat{S}^2\), which can be defined as

$$\widehat{S}^2 \bydef \frac{1}{N-1}\sum_{n=1}^{N} (X_n-\Thetahat)^2,$$

where \(\Thetahat\) is the estimator of the mean. What is the confidence interval for \(\Thetahat\)?

For a confidence interval to be valid, we expect it to take the form of

$$\calI = \left[\Thetahat - z_\alpha \frac{\widehat{S}}{\sqrt{N}}, \;\; \Thetahat + z_\alpha \frac{\widehat{S}}{\sqrt{N}}\right],$$

which is essentially the confidence interval we have just derived but with \(\sigma\) replaced by \(\widehat{S}\). However, there is a problem with this. When we derive the confidence interval assuming a known \(\sigma\), the \(z_\alpha\) value is determined by checking the standard Gaussian

$$\widehat{Z} = \frac{\Thetahat - \theta}{\sigma/\sqrt{N}},$$

which gives us \(z_\alpha = \Phi^{-1}(1-\alpha/2)\). The whole derivation is based on the fact that \(\widehat{Z}\) is a standard Gaussian. Now that we have replaced \(\sigma\) by \(\widehat{S}\), the new random variable

$$T \bydef \frac{\Thetahat - \theta}{\widehat{S}/\sqrt{N}}$$

is not a standard Gaussian.

It turns out that the distribution of \(T\) is Student's \(t\)-distribution with \(N-1\) degrees of freedom. The PDF of Student's \(t\)-distribution is given as follows.

Definition 9.1

If \(X\) is a random variable following Student's \(t\)-distribution of \(\nu\) degrees of freedom, then the PDF of \(X\) is

$$f_X(x) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu \pi}\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}.$$

We may compare Student's \(t\)-distribution with the Gaussian distribution. Figure 9.7 shows the standard Gaussian and several \(t\) distributions with \(\nu = N-1\) degrees of freedom. Note that Student's \(t\)-distribution has a similar shape to the Gaussian but it has a heavier tail.

Figure 9.7
Figure 9.7. The PDF of Student's \(t\)-distribution with \(\nu = N-1\) degrees of freedom.
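The heavier tail can be quantified by comparing tail probabilities at a fixed threshold. The sketch below evaluates \(\Pb[|T| > 1.96]\) for several degrees of freedom and compares it with the Gaussian value. The particular list of \(\nu\) values is an arbitrary choice for illustration.

```python
from scipy import stats

z = 1.96                                   # Gaussian 95% threshold
gauss_tail = 2 * stats.norm.sf(z)          # P[|Z| > 1.96], about 0.05
print("Gaussian", gauss_tail)

t_tail = {}
for nu in [2, 5, 13, 30, 100]:
    t_tail[nu] = 2 * stats.t.sf(z, nu)     # P[|T| > 1.96] with nu degrees of freedom
    print(nu, t_tail[nu])
```

For every \(\nu\) the \(t\) tail probability exceeds the Gaussian one, and the gap shrinks as \(\nu\) grows. This is why reusing the Gaussian threshold with a small sample would produce intervals that are too narrow.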

Since \(T = \frac{\Thetahat - \theta}{\widehat{S}/\sqrt{N}}\) is a \(t\)-random variable, to determine the \(z_\alpha\) value we can follow the same procedure by considering the CDF of \(T\). Let the CDF of the Student's \(t\)-distribution with \(\nu\) degrees of freedom be

$$\Psi_{\nu}(z) = \text{CDF of $X$ at $z$.}$$

If we want \(\Pb[|T| \le z_\alpha] = 1-\alpha\), it follows that

$$z_\alpha = \Psi_\nu^{-1}\left(1-\frac{\alpha}{2}\right).$$

Therefore, the new confidence interval, assuming an unknown \(\sigma\), is

$$\calI = \left[\Thetahat - z_\alpha \frac{\widehat{S}}{\sqrt{N}}, \;\; \Thetahat + z_\alpha \frac{\widehat{S}}{\sqrt{N}}\right],$$

with \(z_\alpha\) defined in Eq. (9.12), using \(\nu = N-1\).
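The practical consequence of using the \(t\) critical value instead of the Gaussian one can be seen in a coverage simulation. This is a sketch under assumed settings: Gaussian samples, a deliberately small \(N = 5\), and illustrative values for \(\theta\), \(\sigma\), the number of trials, and the seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)              # arbitrary seed
theta, sigma, N, alpha = 0.0, 1.0, 5, 0.05  # small N makes the gap visible
trials = 20000

x = rng.normal(theta, sigma, size=(trials, N))
theta_hat = x.mean(axis=1)
S_hat = x.std(axis=1, ddof=1)               # sample std with the 1/(N-1) factor
T = (theta_hat - theta) / (S_hat / np.sqrt(N))

z_gauss = stats.norm.ppf(1 - alpha/2)       # about 1.96
z_t     = stats.t.ppf(1 - alpha/2, N - 1)   # about 2.78 for nu = 4

cover_gauss = np.mean(np.abs(T) <= z_gauss) # falls short of 0.95
cover_t     = np.mean(np.abs(T) <= z_t)     # close to 0.95
print(cover_gauss, cover_t)
```

With the Gaussian critical value the interval covers \(\theta\) noticeably less often than 95% of the time, whereas the \(t\) critical value restores the intended coverage.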

Practice Exercise 9.2

A survey asked \(N = 14\) people for their rating of a movie. Assume that the mean estimator is \(\Thetahat\) and the estimator of the standard deviation is \(\widehat{S}\). Find the 95% confidence interval.

Solution

If we use Student's \(t\)-distribution, it follows that

$$z_\alpha = \Psi_{13}^{-1}\left(1-\frac{\alpha}{2}\right) = 2.16,$$

where the degrees of freedom are \(\nu = 14-1 = 13\). Thus the confidence interval is

$$\calI = \left[\Thetahat - 2.16 \frac{\widehat{S}}{\sqrt{N}}, \;\; \Thetahat + 2.16 \frac{\widehat{S}}{\sqrt{N}}\right].$$

The MATLAB and Python codes to report the \(z_\alpha\) value of a Student's \(t\)-distribution are shown below. They are both called through the inverse CDF function. In MATLAB it is icdf, and in Python it is stats.t.ppf.

MATLAB
% MATLAB code to compute the z_alpha value of t distribution
    alpha = 0.05;
    nu = 13;
    z = icdf('t',1-alpha/2,nu)
Python
# Python code to compute the z_alpha value of t distribution
import scipy.stats as stats
alph = 0.05
nu   = 13
z = stats.t.ppf(1-alph/2, nu)
print(z)

Example 9.5

A class of 10 students took a midterm exam. Their scores are given in the following table.

Student    1    2    3    4    5    6    7    8    9   10
Score     72   69   75   58   67   70   60   71   59   65

Find the 95% confidence interval.

Solution

The mean and standard deviation of the dataset are respectively \(\Thetahat = 66.6\) and \(\widehat{S} = 5.91\). The critical \(z_\alpha\) value is determined by Student's \(t\)-distribution:

$$z_\alpha = \Psi_{9}^{-1}\left(1-\frac{\alpha}{2}\right) = 2.26.$$

The confidence interval is

$$\left[\Thetahat - z_\alpha \frac{\widehat{S}}{\sqrt{N}}, \quad \Thetahat + z_\alpha \frac{\widehat{S}}{\sqrt{N}}\right] = \left[62.37, 70.83\right].$$

Therefore, with 95% probability, the interval \(\left[62.37, 70.83\right]\) will include the true population mean.

Remark 1. Make sure you understand the meaning of “population mean” in this example. Since we have ten students, isn't the population mean just the average of the ten scores? This is incorrect. In statistics, we assume that these ten students are the realizations of some underlying (unknown) random variable \(X\) with some PDF \(f_X(x)\). The population mean \(\theta\) is therefore the expectation \(\E[X]\), where the expectation is taken w.r.t. \(f_X\). The sample average \(\Thetahat\), which is the average of the ten numbers, is an estimator of the population mean \(\theta\).

Remark 2. You may be wondering why we are using Student's \(t\)-distribution here when we do not even know the PDF of \(X\). The answer is that it is an approximation. When \(X\) is Gaussian, the normalized variable \(T = (\Thetahat - \theta)/(\widehat{S}/\sqrt{N})\) follows Student's \(t\)-distribution, where the unknown standard deviation is replaced by the sample standard deviation \(\widehat{S}\). This result is attributed to the original paper of William Gosset, who developed Student's \(t\)-distribution.

The above example can be solved computationally. A Python implementation is given below, and the MATLAB implementation is straightforward.

Python
# Python code to generate a confidence interval
import numpy as np
import scipy.stats as stats
x = np.array([72, 69, 75, 58, 67, 70, 60, 71, 59, 65])
N         = x.size
Theta_hat = np.mean(x)         # Sample mean
S_hat     = np.std(x, ddof=1)  # Sample standard deviation with the 1/(N-1) factor
nu        = x.size-1           # degrees of freedom
alpha     = 0.05               # 1-alpha is the confidence level
z    = stats.t.ppf(1-alpha/2, nu)
CI_L = Theta_hat-z*S_hat/np.sqrt(N)
CI_U = Theta_hat+z*S_hat/np.sqrt(N)
print(CI_L, CI_U)

What is Student's \(t\)-distribution?
  • It was developed by William Gosset in 1908. When he published the paper he used the pseudonym Student.
  • We use Student's \(t\)-distribution to model the estimator \(\Thetahat\)'s PDF when the variance \(\sigma^2\) is replaced by the sample variance \(\widehat{S}^2\).
  • Student's \(t\)-distribution has a heavier tail than a Gaussian.
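As a cross-check of Example 9.5, SciPy can also produce the interval directly. The sketch below uses `stats.sem` for the standard error \(\widehat{S}/\sqrt{N}\) and `stats.t.interval` for the endpoints. This is an alternative route to the same numbers, not the procedure used in the text.

```python
import numpy as np
from scipy import stats

x = np.array([72, 69, 75, 58, 67, 70, 60, 71, 59, 65])
N = x.size
theta_hat = x.mean()
se = stats.sem(x)                       # S_hat / sqrt(N), with the 1/(N-1) variance

# First argument is the confidence level 1 - alpha, second is nu = N - 1
lo, hi = stats.t.interval(0.95, N - 1, loc=theta_hat, scale=se)
print(round(lo, 2), round(hi, 2))
```

Rounded to two decimals, the endpoints match the interval \([62.37, 70.83]\) reported in Example 9.5. The agreement depends on using the \(1/(N-1)\) normalization for the sample variance, which is the default in `stats.sem`.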

9.1.6 Comparing Student's \(t\)-distribution and Gaussian

We now discuss an important theoretical result regarding the relationship between a Student's \(t\)-distribution and Gaussian distribution. The main result is that the standard Gaussian is a limiting distribution of the \(t\) distribution as the degrees of freedom \(\nu \rightarrow \infty\).

Theorem 9.1

As \(\nu \rightarrow \infty\), the Student's \(t\)-distribution approaches the standard Gaussian distribution:

$$\lim_{\nu \rightarrow \infty} \left\{\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu \pi}\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{y^2}{\nu}\right)^{-\frac{\nu+1}{2}}\right\} = \frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}}.$$

Proof. There are two results we need to use:

  • Stirling's approximation: (K. G. Binmore, Mathematical analysis: A straightforward approach. Cambridge University Press, 1977. Section 17.7.2.) \(\Gamma(z) \approx \sqrt{\frac{2\pi}{z}}\left(\frac{z}{e}\right)^z\).
  • Exponential approximation: \((1+\frac{x}{k})^{-k} \rightarrow e^{-x}\), as \(k \rightarrow \infty\).

We have that

$$\begin{aligned} \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu \pi}\Gamma\left(\frac{\nu}{2}\right)} &\approx \frac{\sqrt{\frac{2\pi}{\frac{\nu+1}{2}}}\left(\frac{\nu+1}{2e}\right)^{\frac{\nu+1}{2}}}{\sqrt{\nu \pi}\sqrt{\frac{2\pi}{\frac{\nu}{2}}}\left(\frac{\nu}{2e}\right)^{\frac{\nu}{2}}}\\ &= \frac{1}{\sqrt{\nu \pi}}\sqrt{\frac{\nu}{\nu+1}} \frac{1}{\sqrt{e}} \left(\frac{\nu+1}{\nu}\right)^{\frac{\nu}{2}} \frac{\sqrt{\nu+1}}{\sqrt{2}}\\ &= \frac{1}{\sqrt{\nu \pi}}\frac{\sqrt{\nu}}{\sqrt{2e}} \left(\frac{\nu+1}{\nu}\right)^{\frac{\nu}{2}}\\ &= \frac{1}{\sqrt{2\pi e}} \left(1+\frac{1}{\nu}\right)^{\frac{\nu}{2}}. \end{aligned}$$

Taking the limit \(\nu \rightarrow \infty\), we have that

$$\begin{aligned} \lim_{\nu \rightarrow \infty} \frac{1}{\sqrt{2\pi e}} \left(1+\frac{1}{\nu}\right)^{\frac{\nu}{2}} = \frac{1}{\sqrt{2\pi e}} e^{\frac{1}{2}} = \frac{1}{\sqrt{2\pi}}. \end{aligned}$$

The other limit follows from the fact that

$$\begin{aligned} \lim_{\nu \rightarrow \infty} \left(1+\frac{y^2}{\nu}\right)^{-\frac{\nu+1}{2}} = e^{-\frac{y^2}{2}}. \end{aligned}$$

Combining the two limits proves the theorem.
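Theorem 9.1 can also be checked numerically. The following sketch measures the largest pointwise gap between the \(t\) PDF and the standard Gaussian PDF on a grid, and evaluates the normalizing constant through the log-gamma function to avoid overflow. The grid range and the list of \(\nu\) values are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

y = np.linspace(-5, 5, 1001)
gauss = stats.norm.pdf(y)

max_gap = {}
for nu in [1, 10, 100, 1000, 10000]:
    # Largest pointwise difference between the t PDF and the Gaussian PDF
    max_gap[nu] = np.max(np.abs(stats.t.pdf(y, nu) - gauss))
    # Normalizing constant Gamma((nu+1)/2) / (sqrt(nu*pi) Gamma(nu/2)), via log-gamma
    const = np.exp(gammaln((nu + 1)/2) - gammaln(nu/2)) / np.sqrt(nu*np.pi)
    print(nu, max_gap[nu], const)

print(1/np.sqrt(2*np.pi))               # limit of the normalizing constant
```

As \(\nu\) increases, the gap shrinks toward zero and the constant approaches \(1/\sqrt{2\pi}\), consistent with the two limits used in the proof.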

This theorem has several implications: