Summary

In this chapter, we have discussed the basic principles of parameter estimation. The three building blocks are:

Likelihood \(f_{\mX|\mTheta}(\vx|\vtheta)\): the PDF of the observed samples \(\mX\), conditioned on the unknown parameter \(\mTheta\). In the frequentist world, \(\mTheta\) is a deterministic quantity. In the Bayesian world, \(\mTheta\) is random and so it has a PDF.
Prior \(f_{\mTheta}(\vtheta)\): the PDF of \(\mTheta\). The prior \(f_{\mTheta}(\vtheta)\) is used in all Bayesian computations.
Posterior \(f_{\mTheta|\mX}(\vtheta|\vx)\): the PDF of the underlying parameter \(\mTheta\), evaluated at \(\mTheta = \vtheta\), given that we have observed \(\mX = \vx\).
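
For reference, the three building blocks are tied together by Bayes' theorem. The statement below uses the notation of the list above; the only additional symbol is \(f_{\mX}(\vx)\), the marginal PDF of the observation:

\[
f_{\mTheta|\mX}(\vtheta|\vx) = \frac{f_{\mX|\mTheta}(\vx|\vtheta)\, f_{\mTheta}(\vtheta)}{f_{\mX}(\vx)}.
\]

Since \(f_{\mX}(\vx)\) does not depend on \(\vtheta\), maximizing the posterior over \(\vtheta\) is the same as maximizing the product of the likelihood and the prior.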

The three building blocks give us several strategies to estimate the parameters:

Maximum likelihood (ML) estimation: Maximize \(f_{\mX|\mTheta}(\vx|\vtheta)\).
Maximum a posteriori (MAP) estimation: Maximize \(f_{\mTheta|\mX}(\vtheta|\vx)\).
Minimum mean-square error (MMSE) estimation: Minimize the mean squared error, which is equivalent to finding the mean of \(f_{\mTheta|\mX}(\vtheta|\vx)\).
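
To make the three strategies concrete, here is a minimal sketch for one simple model. It assumes i.i.d. Bernoulli samples with a Beta(a, b) prior, a case in which all three estimates have closed forms. The function name `bernoulli_estimates`, the prior parameters `a` and `b`, and the coin-flip data are hypothetical choices made for illustration only.

```python
import numpy as np

def bernoulli_estimates(x, a=2.0, b=2.0):
    """ML, MAP and MMSE estimates of a Bernoulli parameter theta.

    Assumes i.i.d. samples x[n] in {0, 1} and a Beta(a, b) prior, so the
    posterior is Beta(a + k, b + N - k), where k is the number of ones.
    """
    x = np.asarray(x)
    N, k = x.size, int(x.sum())
    theta_ml = k / N                           # maximizer of the likelihood
    # Mode of the posterior; valid when a + k > 1 and b + N - k > 1.
    theta_map = (k + a - 1) / (N + a + b - 2)
    theta_mmse = (k + a) / (N + a + b)         # mean of the posterior
    return theta_ml, theta_map, theta_mmse

x = np.array([1, 1, 1, 0, 1, 1, 0, 1])         # hypothetical coin flips
theta_ml, theta_map, theta_mmse = bernoulli_estimates(x)
print("ML:  ", theta_ml)
print("MAP: ", theta_map)
print("MMSE:", theta_mmse)
```

With only eight samples the three estimates differ noticeably. As the number of samples grows, the influence of the prior shrinks and the MAP and MMSE estimates approach the ML estimate.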

As discussed in this chapter, no single estimation strategy is universally “better” because one needs to specify the optimality criterion. If the goal is to minimize the mean squared error, then the MMSE estimator is the optimal strategy. If the goal is to maximize the likelihood without assuming any prior knowledge, the ML estimator would be the optimal strategy. It may appear that if we knew the ground truth parameter \(\vtheta^*\), we could simply minimize the distance between the estimated parameter \(\vtheta\) and the true value \(\vtheta^*\). If the parameter is a scalar, this will work. However, if the parameter is a vector, the choice of distance becomes an issue, because different distances lead to different optimal estimators. For example, if one cares about the mean absolute error (MAE), the optimal estimator is the median of the posterior distribution, instead of the mean of the posterior as in the MMSE case. Therefore, it is the end user's responsibility to specify the optimality criterion.
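
The dependence on the criterion can be checked numerically. The sketch below assumes a skewed posterior represented by samples (a Gamma distribution, chosen only for illustration) and searches a grid of candidate estimates for the minimizer of the empirical mean squared error and of the empirical mean absolute error. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical skewed posterior, represented by samples.
samples = rng.gamma(shape=2.0, scale=1.0, size=20_000)

# Candidate estimates on a grid with spacing 0.01.
candidates = np.linspace(0.0, 6.0, 601)

# Empirical risk of each candidate under the two criteria.
mse = np.array([np.mean((samples - c) ** 2) for c in candidates])
mae = np.array([np.mean(np.abs(samples - c)) for c in candidates])

best_mse = candidates[np.argmin(mse)]
best_mae = candidates[np.argmin(mae)]
print("MSE minimizer:", best_mse, " posterior mean:  ", samples.mean())
print("MAE minimizer:", best_mae, " posterior median:", np.median(samples))
```

Up to the grid resolution, the MSE minimizer lands on the sample mean and the MAE minimizer lands on the sample median. Because the assumed posterior is skewed, the two estimates differ.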

Whenever we consider parameter estimation, we tend to think that it is about estimating the model parameters, such as the mean of a Gaussian PDF. While in many statistics problems this is indeed the case, parameter estimation can be much broader if we link it with regression. Specifically, a regularized linear regression problem can be formulated as a MAP estimation problem

\[
\vtheta^* = \underset{\vtheta}{\mathrm{argmin}} \;\; \underset{\textcolor{blue}{-\log f_{\mX|\mTheta}(\vx|\vtheta)}}{\underbrace{\|\mX\vtheta - \vy\|^2}} + \underset{\textcolor{blue}{-\log f_{\mTheta}(\vtheta)}}{\underbrace{ \lambda R(\vtheta) }},
\]

for some regularization function \(R(\vtheta)\), where \(\lambda R(\vtheta)\) plays the role of the negative log of the prior. Expressed in this way, we recognize that MAP estimation can be used to recover signals. For example, we can model \(\mX\) as the linear degradation process of an imaging system. Then solving the MAP estimation problem is equivalent to finding the best signal explaining the degraded observation, using the posterior as the criterion. There is a rich literature on solving MAP estimation problems of this kind in subjects such as computational imaging, communication systems, remote sensing, radar engineering, and recommendation systems, to name a few.
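
As a minimal sketch of this link, consider the special case \(R(\vtheta) = \|\vtheta\|^2\) (ridge regression), which corresponds to assuming Gaussian noise and a zero-mean Gaussian prior. This choice of \(R\) is an assumption made for illustration; other priors lead to other regularizers and generally require iterative solvers. In the code, `X` and `y` play the roles of \(\mX\) and \(\vy\), and the function names, the synthetic data, and the value of `lam` are hypothetical.

```python
import numpy as np

def map_ridge(X, y, lam):
    """Minimize ||X @ theta - y||^2 + lam * ||theta||^2 in closed form."""
    d = X.shape[1]
    # Setting the gradient to zero gives (X^T X + lam I) theta = X^T y.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def objective(X, y, theta, lam):
    """Data-fit term plus regularizer, i.e. the negative log posterior
    up to constants under the Gaussian assumptions stated above."""
    return np.sum((X @ theta - y) ** 2) + lam * np.sum(theta ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))        # hypothetical linear model
theta_true = rng.standard_normal(5)
y = X @ theta_true + 0.1 * rng.standard_normal(50)

lam = 1.0
theta_map = map_ridge(X, y, lam)
# With no regularizer the problem reduces to least squares (the ML estimate).
theta_ml = np.linalg.lstsq(X, y, rcond=None)[0]
print("MAP:", theta_map)
print("ML: ", theta_ml)
```

The regularizer pulls the MAP estimate toward zero relative to the ML estimate, which is the effect of the prior expressed in optimization terms.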