Suppose we have a long list of pairs of real numbers. We can imagine these pairs of numbers as points in a plane. Then we can try to draw a line that is as close as possible to all the points at once. This is regression. The line we draw can be any sort of curve, but a straight line is often the most useful choice. This is linear regression. For now, we are only going to consider linear regression.
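To make "as close as possible" concrete, here is a minimal sketch. It assumes that closeness means the smallest sum of squared vertical distances between the points and the line (ordinary least squares), which is the usual choice but is not spelled out above. The function name fit_line and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points.

    Assumption: "closest" means minimising the sum of squared vertical
    distances from the points to the line.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and variation of x, about their means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up points lying roughly along y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

intercept, slope = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} x")
```

The fitted line comes out close to the y = 1 + 2x that the points were scattered around.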
Usually we think of one of the values as causing the changes in the other. The value that is causing this change is called the predictor, and the other value is called the response.
Why would we want to do this? First, we might want to investigate whether there is a relationship between our two variables and, if there is, to measure how strong it is. We can answer these questions by looking at how easy it is to draw a line that fits our points, and at the slope of that line. The first gives us an indication of how well our model has fit our data, and the second tells us the strength of the relationship. However, it is possible to have a line that fits very well but whose slope is approximately zero. This tells us that even though we can fit the data, we haven’t found any evidence of a relationship.
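The difference between "fits well" and "has a meaningful slope" can be sketched with two made-up data sets. As an assumption for this example, the quality of the fit is measured by the root-mean-square of the residuals (the vertical gaps between the points and the line); other measures exist. The names fit_line and rmse are illustrative.

```python
import math


def fit_line(xs, ys):
    """Least-squares (intercept, slope) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(xs, ys, intercept, slope):
    """Root-mean-square residual: typical vertical distance from the line."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
datasets = {
    "rising": [2.1, 3.9, 6.0, 8.1, 9.9],  # roughly y = 2x
    "flat": [5.1, 4.9, 5.0, 5.1, 4.9],    # roughly y = 5, whatever x is
}

results = {}
for name, ys in datasets.items():
    intercept, slope = fit_line(xs, ys)
    results[name] = (slope, rmse(xs, ys, intercept, slope))
    print(f"{name}: slope = {slope:.2f}, rmse = {results[name][1]:.3f}")
```

Both lines sit about equally close to their points, but only the "rising" data has a slope far from zero. The "flat" line fits well and still gives no evidence that the response changes with the predictor.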
The second reason we might do regression is that we want to forecast values of our response. Suppose that we have fit our data well and we have convinced ourselves that we are modelling some underlying process. Say our matched pairs of data are the price of oil at time t – 1, and the stock price of an airline at time t. We can perform a regression on the data. Then, whenever we have a new data point for the price of oil, we can forecast a new price for the stock.
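The oil and airline example might look like the following sketch. All the prices are invented for illustration, and the fit again assumes ordinary least squares. The key step is the pairing: each oil price is matched with the stock price one period later.

```python
def fit_line(xs, ys):
    """Least-squares (intercept, slope) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical series, one value per time period (not real market data)
oil = [70.0, 72.0, 75.0, 71.0, 68.0, 74.0]
stock = [31.0, 30.0, 29.2, 27.9, 29.6, 30.9]

# Pair oil at time t - 1 with stock at time t.
# The first stock value has no earlier oil price, so it is not used.
xs = oil[:-1]
ys = stock[1:]

intercept, slope = fit_line(xs, ys)

# The latest oil price has no matching stock price yet, so we forecast one
forecast = intercept + slope * oil[-1]
print(f"forecast stock price for the next period: {forecast:.2f}")
```

In this invented data the slope is negative, so a higher oil price leads to a lower forecast for the stock.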
Before we drew our line, we didn’t make any particular assumptions. We didn’t say anything about things being independent, or drawn from the same distribution, or normal, or in fact anything about distributions at all. There is a lot of theory that explains regression in mathematical terms, and talks about various assumptions required, and facts you can assume. But it only really exists to provide intellectual justification for what we have talked about.
Something important to remember is that doing a regression and finding a nice line that fits our data does not mean that there really is a relationship between the values. In particular, doing a linear regression does not necessarily mean that there is a linear relationship between them. In the other direction, if we fail to find a relationship via regression, that does not mean one does not exist. In particular, if we don’t find anything with linear regression, it is possible that there is a non-linear relationship that we are not capturing. With all that scepticism in mind, we can say that linear regression is really useful, because a lot of processes are linear, or at least well approximated linearly, at least locally.
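A small sketch of the "other direction": here the response is completely determined by the predictor (y is x squared), yet a straight-line fit finds a slope of zero. The data are made up, and the fit assumes ordinary least squares.

```python
def fit_line(xs, ys):
    """Least-squares (intercept, slope) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# A perfect but non-linear relationship, sampled symmetrically around zero
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The best straight line is flat at the average of the y values, so the linear regression reports no relationship even though one plainly exists.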
Something that I haven’t mentioned is that we are not limited to a single predictor for each response; we can have multiple predictors. Geometrically speaking, it gets a bit more complicated. For example, if we have two predictors, we are now drawing points in three-dimensional space, and instead of drawing a line we are drawing a plane.
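With two predictors, fitting a plane can be sketched as below. This assumes NumPy is available and uses its least-squares solver, numpy.linalg.lstsq. The data are made up and deliberately lie exactly on a known plane, so we can see the fit recover it.

```python
import numpy as np

# Two made-up predictors and a response lying exactly on the plane
# y = 1 + 2*x1 - 3*x2 (no noise, so the fit should recover it)
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0, 2.0])
y = 1.0 + 2.0 * x1 - 3.0 * x2

# One row per data point; the column of ones lets the plane have an intercept
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution: the coefficients of the best-fitting plane
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope_x1, slope_x2 = coef
print(f"y = {intercept:.2f} + {slope_x1:.2f} x1 + {slope_x2:.2f} x2")
```

Each predictor gets its own slope, which plays the same role as the single slope of the line in the one-predictor case.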