Summary

Why do we need to study the sample average? Because it is the summary of the dataset. In machine learning, one of the most frequently asked questions is about the number of training samples required to train a model. The answer can be found by analyzing the average number of successes and failures as the number of training samples grows. For example, if we define \(f\) as the classifier that takes a data point \(\vx_n\) and predicts a label \(f(\vx_n)\), we hope that it will match with the true label \(y_n\). If we define an error

E_n = \begin{cases} 1, &\qquad f(\vx_n) \not= y_n \qquad\text{incorrect classification},\\ 0, &\qquad f(\vx_n) = y_n \qquad\text{correct classification}, \end{cases}

then \(E_n\) is a Bernoulli random variable, and the total loss \(\calE = \frac{1}{N}\sum_{n=1}^N E_n\) will be the training loss. But what is \(\frac{1}{N}\sum_{n=1}^N E_n\)? It is exactly the sample average of \(E_n\). Therefore, by analyzing the sample average \(\calE\) we will learn something about the generalization capability of our model.

How should we study the sample average? By understanding the law of large numbers and the Central Limit Theorem, as we have seen in this chapter.

sep0ex
Law of large numbers: \(\overline{X}\) converges to the true mean \(\mu\) as \(N\) grows.
Central Limit Theorem: The CDF of \(\overline{X}\) can be approximated by the CDF of a Gaussian, as \(N\) grows.

Performance guarantee? The other topic we discussed in this chapter is the concept of convergence type. There are essentially four types of convergence, ranked in the order of restrictions.

sep0ex
Deterministic convergence: A sequence of deterministic numbers converges to another deterministic number. For example, the sequence \(1,\frac{1}{2},\frac{1}{3},\frac{1}{4},\ldots\) converges to 0 deterministically. There is nothing random about it.
Almost sure convergence: Randomness exists, and there is a probabilistic convergence. Almost sure convergence means that there is zero probability of failure after a finite number of failures.
Convergence in probability: The sequence of probability values converges, i.e., the chance of failure is going to zero. However, you can still fail even if your \(N\) is large.
Convergence in distribution: The probability values can be approximated by the CDF of a Gaussian.