A linear regression is a statistical method that helps one understand the relationship between two (or more) variables. It does this in three ways:
- It uses one variable to predict the value of another variable
- It tests hypotheses concerning the relationship between two variables
- It quantifies the strength of the relationship between two variables
As we did in our discussion of linear correlation, we will denote two variables as X and Y; X is the independent variable, Y the dependent one. A linear regression assumes that there is a linear relationship between X and Y, which is described by the following formula:
Yi = b0 + b1Xi + εi for i = 1, …, n
where:
Yi is the ith value of the dependent variable
b0 is the y-intercept
b1 is the slope coefficient
Xi is the ith value of the independent variable
εi is the ith value of an error term
i is the index of a particular observation
n is the number of observations
In English, the value of the dependent variable Yi is equal to {the value of the dependent variable when the independent variable’s value is zero (b0)} plus {the product of the slope b1 and the independent variable Xi} plus {some error term εi}. The error term is that part of Yi that is not explained by Xi. We call b0 and b1 the regression coefficients.
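To make the formula concrete, here is a minimal sketch that builds each Yi from its three pieces. The coefficient values, the X values, and the choice of normally distributed errors are all assumptions made for illustration; the formula itself says nothing about where the errors come from.

```python
import random

def simulate(b0, b1, xs, noise_sd, seed=0):
    """Return (ys, errors) where each Y_i = b0 + b1 * X_i + e_i."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Assumption: errors are drawn from a normal distribution centred on zero.
    errors = [rng.gauss(0.0, noise_sd) for _ in xs]
    ys = [b0 + b1 * x + e for x, e in zip(xs, errors)]
    return ys, errors

# Hypothetical values, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys, errors = simulate(b0=2.0, b1=0.5, xs=xs, noise_sd=0.1)

for i, (x, y, e) in enumerate(zip(xs, ys, errors), start=1):
    explained = 2.0 + 0.5 * x  # the part of Y_i that X_i explains
    print(f"i={i}  X={x:.1f}  explained={explained:.2f}  error={e:+.3f}  Y={y:.3f}")
```

Each printed row splits Yi into the part explained by Xi and the leftover error term.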
When we speak about the relationship between two variables, we think in terms of many contemporaneous observations (a cross-sectional series) or observations over a period of time (a time series). Observations are indexed by values 1 to n. For example, you may be interested in how money supply (Xi, where i refers to a particular country) affects the inflation rate (Yi) across various countries; that would be a cross-sectional analysis. By contrast, you would use a time-series analysis to test the money supply/inflation rate relationship in one country over a period of time, with i referring to a particular time period.
A perfect linear regression would be one where all of the error terms equaled zero. This would indicate that all changes to Y were accounted for by changes to X. For instance, if I eat every cookie handed to me, then there would be no error values when I plot cookies offered versus cookies consumed. In this case, the regression line’s y-intercept would be zero and the slope would equal 1; all actual data values would be points directly on the regression line. Thus, if you offered me 3 cookies, I’d eat 3 cookies. Obviously this example becomes unrealistic when the number of cookies offered rises above some critical value, say three dozen in my case.
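The cookie example can be checked in a few lines. This is only a sketch with made-up numbers of cookies offered; the point is that every error term comes out as exactly zero when the intercept is 0 and the slope is 1.

```python
# Hypothetical data: cookies offered, and cookies eaten (always all of them).
offered = [0, 1, 2, 3, 6, 12]
eaten = list(offered)

b0, b1 = 0, 1  # y-intercept of zero, slope of one

predicted = [b0 + b1 * x for x in offered]
errors = [y - p for y, p in zip(eaten, predicted)]

print(errors)  # every error term is zero, so each point sits on the line
```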
A more realistic case is one in which the data do not fall exactly on a straight line, so some error is unavoidable. In real life, we are interested in these imperfect relationships, so we need a method to achieve the best fit, which we define as the regression line that minimizes the sum of the squared vertical distances (deviations) between the observations and the regression line. This method is called the least-squares method. Nifty, but how do we calculate the best fit?
To achieve the best-fitting regression line, we need to find the slope b1 and y-intercept b0 that produce the minimum sum of the squared errors. (We square the errors, which are simply the vertical deviations from the regression line, because we don’t want positive and negative values to cancel each other out.) How do we find these magic regression coefficients? We need to make estimates, which we call the fitted parameters, by minimizing the following sum:
Σ (Yi − b̂0 − b̂1Xi)² summed over i = 1, …, n
The funny little hat (^) above b0 and b1 designates that the regression coefficients are estimated. We are summing, for all index values of i, the squares of the following difference: the actual value of the dependent variable minus the predicted value of the dependent variable. When this sum (the sum of the squared error terms) is minimized, we have a best-fit regression line. The actual method of calculating this minimum is complicated, and we leave it to a computer spreadsheet or math package to do the nitty-gritty work.
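To see what "minimizing the sum of the squared errors" means in practice, here is a rough sketch that simply tries many candidate lines and keeps the one with the smallest sum. The brute-force grid search is a stand-in chosen for clarity, not the method a spreadsheet or math package would use, and the data, the candidate ranges, and the function names `sse` and `grid_search` are all made up for this illustration.

```python
def sse(b0, b1, xs, ys):
    """Sum of squared errors: actual Y minus predicted Y, squared, summed over i."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def grid_search(xs, ys, b0_candidates, b1_candidates):
    """Try every (b0, b1) pair and return (smallest SSE, b0, b1)."""
    best = None
    for b0 in b0_candidates:
        for b1 in b1_candidates:
            score = sse(b0, b1, xs, ys)
            if best is None or score < best[0]:
                best = (score, b0, b1)
    return best

# Hypothetical observations that lie close to, but not on, a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Candidate coefficients in steps of 0.01 (ranges are an arbitrary choice).
b0_candidates = [k / 100 for k in range(-100, 101)]  # -1.00 to 1.00
b1_candidates = [k / 100 for k in range(100, 301)]   #  1.00 to 3.00

best_sse, b0_hat, b1_hat = grid_search(xs, ys, b0_candidates, b1_candidates)
print(f"b0_hat={b0_hat:.2f}  b1_hat={b1_hat:.2f}  SSE={best_sse:.4f}")

# Any other line does worse, for example the tempting guess Y = 2X.
print(f"SSE for b0=0, b1=2: {sse(0.0, 2.0, xs, ys):.4f}")
```

A grid search only finds the best line among the candidates it is given, which is one reason real software solves for the minimum directly instead.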
A note about the slope coefficient b1: when a linear regression contains a single independent variable, the slope coefficient is equal to the following:
b1 = Cov(Y, X) / Var(X) = Cov(Y, X) / (sx · sx), where sx is the standard deviation of X
That is, the slope is the covariance of Y and X divided by the variance of X. Alert readers will recall from the previous post that this formula is very similar to the one for the correlation coefficient (r). The difference is in the denominator: here it is the variance of X, which is the square of the standard deviation of X (sx · sx). For the correlation coefficient, the denominator is the product of the standard deviations of X and Y:
r = Cov(Y, X) / (sx · sy)
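Conceptually, one can see that the two coefficients are very similar: both rescale the covariance of the two variables, the slope by the variance of X and the correlation coefficient by the product of the two standard deviations.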
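The kinship between the slope and the correlation coefficient is easy to verify numerically. The sketch below uses made-up data and assumes sample statistics (dividing by n − 1); the slope and r come out the same if you divide by n throughout, because the divisor cancels in both ratios. The helper `covariance` is written out by hand for illustration.

```python
from statistics import mean, stdev, variance

def covariance(xs, ys):
    """Sample covariance of two equally long lists (divides by n - 1)."""
    mean_x, mean_y = mean(xs), mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical observations, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

cov_yx = covariance(xs, ys)
s_x, s_y = stdev(xs), stdev(ys)

b1 = cov_yx / variance(xs)   # slope: covariance scaled by sx * sx
r = cov_yx / (s_x * s_y)     # correlation: covariance scaled by sx * sy
b0 = mean(ys) - b1 * mean(xs)  # the fitted line passes through the two means

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}, r = {r:.4f}")
# Same numerator, different denominator, so r equals b1 * sx / sy.
print(f"b1 * sx / sy = {b1 * s_x / s_y:.4f}")
```

Because the two formulas share a numerator, the slope and r always have the same sign, and they are equal whenever X and Y have the same standard deviation.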
Next time, we will address the assumptions one makes in order to calculate a proper linear regression.



