Analysis of variance (ANOVA) is used to determine how useful an independent variable X is in explaining variation in the dependent variable Y. For this article, we’ll confine our discussion to linear regressions with a single independent variable, although ANOVA also applies to multi-variable regressions. Recall that we have examined linear regressions and the meaning of the slope coefficient. Within ANOVA, the F-test is used to decide whether the slope coefficient b1 is equal to zero. That is, we test the null hypothesis H0: b1 = 0 against the alternative hypothesis H1: b1 ≠ 0. If H0 is true, then X is not a good predictor of Y.
To calculate the F-statistic, we need the following items of data:
- the number of observations n
- the number of parameters (intercept and slope coefficient) = 2
- the sum of squared errors SSE = Σ(Yᵢ − Ŷᵢ)², where Ŷᵢ is the value of Y the regression predicts for observation i
- the regression sum of squares RSS = Σ(Ŷᵢ − Ȳ)², the total variation in Y that is explained by the regression, where Ȳ is the mean of Y
- the total variation TSS = Σ(Yᵢ − Ȳ)² = SSE + RSS
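As a concrete illustration, here is a minimal Python sketch that computes these three quantities with NumPy. The dataset and variable names are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical data, chosen only for illustration.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])
n = len(Y)

# Fit Y = b0 + b1*X by ordinary least squares.
b1, b0 = np.polyfit(X, Y, deg=1)   # polyfit returns highest degree first
Y_hat = b0 + b1 * X

SSE = np.sum((Y - Y_hat) ** 2)          # unexplained variation
RSS = np.sum((Y_hat - Y.mean()) ** 2)   # variation explained by the regression
TSS = SSE + RSS                         # total variation in Y

print(f"SSE = {SSE:.4f}, RSS = {RSS:.4f}, TSS = {TSS:.4f}")
```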
The F-statistic is the ratio of the RSS to the average SSE (the mean squared error). This average is calculated by dividing SSE by its degrees of freedom, n − 2 (observations less parameters). Because RSS carries only one degree of freedom in a single-variable regression, the F-statistic equals RSS / (SSE / (n − 2)).
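Continuing the sketch above, the F-statistic then follows in a single line:

```python
# F-statistic for H0: b1 = 0, with 1 and n - 2 degrees of freedom.
F = RSS / (SSE / (n - 2))
print(f"F = {F:.4f}")
```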
To clarify, suppose H0 were true and X did nothing to predict the value of Y. In that case, the best prediction of Y for any Xᵢ would be Ȳ, the mean of the dependent variable. But then every Ŷᵢ = Ȳ, so each term of RSS = Σ(Ŷᵢ − Ȳ)² vanishes, RSS = 0, and the F-statistic = 0. The higher the value of F, the more of the variation in Y the regression explains, and the more predictive X is.
In a previous blog, we examined the t-statistic. Note that F is equal to the square of t for a single-variable linear regression. In the next blog, we will look at prediction intervals.
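As a quick check of that identity, the sketch below extends the earlier one: it computes the slope’s t-statistic by hand (assuming the usual OLS standard-error formula for b1) and compares its square with F.

```python
# t-statistic for the slope: t = b1 / se(b1), using the usual OLS
# standard error se(b1) = sqrt(s2 / sum((X - mean(X))^2)), s2 = SSE / (n - 2).
s2 = SSE / (n - 2)
se_b1 = np.sqrt(s2 / np.sum((X - X.mean()) ** 2))
t = b1 / se_b1
print(f"t^2 = {t**2:.4f}, F = {F:.4f}")   # identical for a single regressor
```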