Summary

In this chapter, we have discussed five principles for quantifying the confidence of an estimator and making statistical decisions. To summarize the chapter, we clarify a few common misconceptions about these topics.

Confidence interval. Students frequently become confused about the meaning of a confidence interval. A 95% confidence interval is not the interval that 95% of the samples will fall inside. It is also not the interval within which the estimator has a 95% chance to show up. It is a random interval that has a 95% chance of including the population parameter. A better way to think about a confidence interval is as an alternative to a point estimate. A point estimate only gives a point, whereas a confidence interval extends the point to an interval. All the randomness of the point estimate is also present in the confidence interval. However, if the confidence interval is narrow, there is a good chance that the point estimate is accurate.
Bootstrapping. The most common misconception about bootstrapping is that it can create something from nothing. Another misconception is that bootstrapping can make your estimates better. Both beliefs are wrong. Bootstrapping is a technique for estimating the estimator's variance, and consequently it provides a confidence interval. Bootstrapping does not improve the point estimate, no matter how many bootstrapping samples you synthesize. Bootstrapping works because the sampling-with-replacement step is equivalent to drawing samples from the empirical distribution. The whole process relies on the proximity between the empirical distribution and the true population. If you do not have enough samples and the empirical distribution does not approximate the population, bootstrapping will not work. Therefore, bootstrapping does not create something from nothing; it uses whatever you have and tells you how reliable the estimate is.
Hypothesis testing. Students are often overwhelmed at first by the great number of tools one can use for hypothesis testing, e.g., the \(p\)-value, the critical value, the \(Z\)-test, the \(T\)-test, the \(\chi^2\)-test, the \(F\)-test, etc. Our advice is to forget about them and remember that hypothesis testing is a court trial. Your job is to decide whether you have enough evidence to declare that the defendant is guilty. To reach a guilty verdict, you need to make sure that the observed value of the test statistic would be unlikely if the null hypothesis were true. Therefore, the best practice is to draw the distribution of the test statistic under the null hypothesis and ask yourself how likely it is that the test statistic takes such a value. When you draw the pictures of the distributions, you will know whether you should use a Gaussian \(Z\), a Student's \(t\), a \(\chi^2\), an \(F\)-statistic, etc. When you examine the likelihood of the test statistic, you will know whether you want to use the \(p\)-value or the critical value. If you follow this principle, you will never be confused by the oceans of tests you find in the textbooks.
Neyman-Pearson. Beginners often find Neyman-Pearson abstract and do not understand why it is useful. In this chapter, we have explained why it is worth understanding: it is a very general framework for many kinds of hypothesis testing problems. All it says is that if we want to maximize the detection rate while keeping the false alarm rate at a prescribed level, then the optimal testing procedure boils down to the critical-value test and the \(p\)-value test. This gives us a certificate that our usual hypothesis testing is optimal according to the Neyman-Pearson framework.
ROC and PR curves. There is a huge quantity of online articles, blogs, and tutorials about how to plot the ROC curve and the PR curve. Often these curves are explained through programming examples in Python, R, or MATLAB. Our advice for studying the ROC curve and the PR curve is to go back to the Neyman-Pearson framework. These two curves do not come out of the blue. The ROC curve is the natural figure explaining the objective (the detection rate) and the constraint (the false alarm rate) in the Neyman-Pearson framework. By changing the coordinates, we obtain the PR curve. Therefore, the two curves are the same in terms of the amount of information, but they offer different interpretations.
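To make the confidence-interval point concrete, the following is a minimal simulation sketch. It assumes a Gaussian population with known standard deviation and uses the textbook interval "sample mean plus or minus 1.96 standard errors"; the function name `coverage` and all parameter values are illustrative choices, not taken from the chapter. It repeats the experiment many times and counts how often the random interval contains the fixed population mean.

```python
import math
import random
import statistics

def coverage(mu=0.0, sigma=1.0, n=50, trials=2000, z=1.96, seed=0):
    """Fraction of repeated experiments whose interval contains the true mean mu."""
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)  # known-sigma interval (an assumption)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        center = statistics.fmean(sample)  # the point estimate is random
        if center - half_width <= mu <= center + half_width:
            hits += 1                      # this realization of the interval covers mu
    return hits / trials

rate = coverage()
print(f"empirical coverage: {rate:.3f}")
```

The printed fraction should be close to 0.95. Note what is random here: the interval moves from trial to trial, while the population mean stays fixed.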
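The bootstrapping remarks can be illustrated with a short sketch. It assumes the estimator is the sample mean and the data are synthetic Gaussian draws; `bootstrap_std_error` and the chosen sample sizes are hypothetical names and values used only for illustration. The point estimate is computed once from the original data, and the resampling step only estimates how much that estimate fluctuates.

```python
import math
import random
import statistics

def bootstrap_std_error(data, estimator, num_resamples=1000, seed=0):
    """Estimate the standard error of `estimator` by resampling `data`."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(num_resamples):
        # Sampling with replacement = drawing from the empirical distribution.
        resample = rng.choices(data, k=n)
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)

# Synthetic data (an assumption made for this example).
data_rng = random.Random(1)
data = [data_rng.gauss(5.0, 2.0) for _ in range(100)]

point_estimate = statistics.fmean(data)              # unchanged by bootstrapping
se = bootstrap_std_error(data, statistics.fmean)     # reliability of the estimate
analytic_se = statistics.stdev(data) / math.sqrt(len(data))
print(f"estimate {point_estimate:.3f}, bootstrap SE {se:.3f}, analytic SE {analytic_se:.3f}")
```

For the sample mean, the bootstrap standard error should roughly agree with the usual formula, which is a useful sanity check. Increasing `num_resamples` stabilizes the standard-error estimate but leaves `point_estimate` untouched.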
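As a small illustration of the hypothesis-testing advice, the sketch below runs a one-sided \(Z\)-test both ways: by comparing the test statistic with a critical value and by computing the \(p\)-value. The numbers (sample mean 10.5, null mean 10, known standard deviation 2, 64 samples) are made up for the example, and `z_test_upper` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def z_test_upper(sample_mean, mu0, sigma, n, alpha=0.05):
    """One-sided (upper-tail) Z-test with known sigma."""
    std_normal = NormalDist()                        # distribution of Z under H0
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_value = 1.0 - std_normal.cdf(z)                # tail area beyond the observed z
    critical = std_normal.inv_cdf(1.0 - alpha)       # cutoff with tail area alpha
    return z, p_value, critical

alpha = 0.05
z, p_value, critical = z_test_upper(10.5, 10.0, 2.0, 64, alpha)
reject_by_critical_value = z > critical
reject_by_p_value = p_value < alpha
print(z, p_value, critical)  # z = 2.0, p-value about 0.0228, critical value about 1.645
```

Both decisions agree, as they must: the two tests ask the same question about the same tail of the same distribution, once in units of the statistic and once in units of probability.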
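The relationship between the two curves can be sketched in a few lines. The example below assumes synthetic Gaussian scores for the two hypotheses and a simple "declare positive if the score is at least the threshold" rule; `roc_points` and `roc_to_pr` are hypothetical helper names. It first sweeps the threshold to obtain the ROC points (false alarm rate, detection rate), then obtains the PR points purely by a change of coordinates. The coordinate change needs the class counts in addition to the ROC points.

```python
import random

def roc_points(scores, labels):
    """Sweep a threshold over the scores; return (false alarm rate, detection rate) pairs."""
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    points = []
    for tau in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
        points.append((fp / num_neg, tp / num_pos))
    return points

def roc_to_pr(roc, num_pos, num_neg):
    """Map each ROC point to a (recall, precision) point."""
    pr = []
    for p_false, p_detect in roc:
        tp = p_detect * num_pos
        fp = p_false * num_neg
        # tp + fp > 0 because every threshold is an observed score.
        pr.append((p_detect, tp / (tp + fp)))
    return pr

# Synthetic scores (an assumption): H0 ~ N(0, 1), H1 ~ N(2, 1), 200 samples each.
rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(2.0, 1.0) for _ in range(200)]
labels = [0] * 200 + [1] * 200

roc = roc_points(scores, labels)
pr = roc_to_pr(roc, num_pos=200, num_neg=200)
```

Recall is the detection rate, so the two curves share one coordinate. Only the false alarm rate is replaced by the precision.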