Summary

Regression is one of the most widely used techniques in data science. The formulation of the regression problem is as simple as setting up a system of linear equations:

\minimize{\vtheta \in \R^d} \;\; \|\mX\vtheta-\vy\|^2,

which has a closed-form solution. The biggest problems in practice are outliers, lack of training samples, and poor choice of the regression model.

sep0ex
Outliers: We always recommend plotting the data whenever possible to check if there are obvious outliers. There are also statistical tests in which you can evaluate the validity of your samples. One simple way to debug outliers is to run the regression and check the prediction error against each training sample. If you have an outlier, and if your model is of reasonably low complexity, then a sample with an excessively large prediction error is an outlier. For example, if most of the training samples are within one standard deviation from your prediction but a few are substantially off, you will know which ones are the outliers. Robust linear regression is one technique for countering outliers, but an experienced data scientist can often reject outliers before running any regression algorithms. Domain knowledge is of great value for this purpose.
Lack of training samples: As we have discussed in the overfitting section, it is extremely important to ensure that your model complexity is appropriate for the number of training samples. If the training set is small, do not use a complex model. Regularization techniques are valuable tools to mitigate overfitting. However, choosing a good regularization requires domain knowledge. For example, if you know that some features are not important, you need to scale them properly so as not to over-influence the regression solution.
Wrong model: We have mentioned several times that regression can always return you a result because regression is an optimization problem. However, whether that result is meaningful depends on how meaningful your regression problem is. For example, if the noise is i.i.d. Gaussian, a data fidelity term with \(\|\cdot\|^2\) would be a good choice; however, if the noise is i.i.d. Poisson, \(\|\cdot\|^2\) would become a very bad model. We need a tighter connection with the statistics of the underlying data-generation model for problems like these. This is the subject of our next chapter, on parameter estimation.