Posts tagged “variables”

Financial Statistics (5) – Standard Error of Estimate

September 22nd, 2010

Plot A has a smaller standard error of estimate than does Plot B

- Eric Bank

We left off last time having concluded a discussion of the t-test for evaluating correlations. Next, using the standard error of estimate, we’ll examine how to assess the strength of a relationship between an independent and a dependent variable as determined by a linear regression.

Recall that the equation for a linear regression is:

$$Y_i = b_0 + b_1 X_i + \varepsilon_i \quad \text{for } i = 1, \ldots, n$$

where the residual error term, ε, gives an indication of how certain we are about a particular predicted Y value via a linear regression.   The standard error of estimate tells us how spread out actual values of Y are with respect to their predicted values. The bigger the standard deviation of the error term, the less precise is the relationship between the two variables.

The standard error of estimate (SEE) measures the variability of the error term:

$$\mathrm{SEE} = \sqrt{\frac{\sum_{i=1}^{n} \varepsilon_i^2}{n - 2}}$$

Don’t panic: the equation just adds up the squares of the error terms, divides the sum by the number of degrees of freedom, and takes the square root of the whole thing. Another way of saying this is that the SEE summarizes, in a single number, the typical size of the difference between the dependent variable’s actual value and its predicted value across all observations.

SEE and standard deviation are almost identical, except that SEE has n-2 degrees of freedom (to account for the two estimated parameters, b0 and b1) and standard deviation has n-1 degrees of freedom. This little difference in the denominator keeps the squared SEE from being a biased estimate of the variance of the error term. Whereas the standard deviation is the square root of the average squared deviation from the mean, the standard error of the estimate is the square root of the average squared deviation from the regression line.
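The comparison above can be sketched in a few lines of Python. This is only an illustration: the five data points are invented, and the slope and intercept come from the standard closed-form least-squares formulas for a single independent variable.

```python
import math

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept for a single independent variable.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Residuals: actual Y minus predicted Y.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# SEE: squared deviations from the regression line, n - 2 degrees of freedom.
see = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Sample standard deviation of Y: squared deviations from the mean, n - 1.
std_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"SEE = {see:.4f}, std dev of Y = {std_y:.4f}")
```

Because these invented points lie close to a straight line, the SEE comes out much smaller than the standard deviation of Y: the regression line tracks the data far better than the mean of Y alone.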

To get a general feel for the meaning of a particular SEE value, know that if the residuals (ε = Yactual – Ypredicted) are normally distributed around the prediction line, about 68% of actual values will fall within ±1 SEE of their predicted values.
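One way to see the 68% rule of thumb in action is a small simulation. The sketch below is not based on real data: it assumes a made-up relationship Y = 2 + 0.5 X with normally distributed noise, fits a line with the standard closed-form least-squares formulas, and counts how many residuals fall within one SEE.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Simulated data (not real market data): Y = 2 + 0.5 X + normal noise.
n = 5000
x = [random.uniform(0.0, 10.0) for _ in range(n)]
y = [2.0 + 0.5 * xi + random.gauss(0.0, 1.0) for xi in x]

mean_x = sum(x) / n
mean_y = sum(y) / n
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x

# Residuals and SEE with n - 2 degrees of freedom.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
see = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Share of observations whose actual value is within one SEE of the prediction.
within = sum(1 for e in residuals if abs(e) <= see) / n
print(f"SEE = {see:.3f}; share within ±1 SEE = {within:.1%}")
```

The share should land near 68%, though the exact figure varies with the seed and the sample size.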

While we can say that smaller SEE values result in better predictions, it would be nice to know how well the independent variable X explains variation in the dependent variable Y. To calculate the fraction of the total variation in the dependent variable that is explained by the independent variable, one uses the coefficient of determination, which will be our next topic.

Financial Statistics (4) – Testing Correlations for Significance: the t-Test

September 20th, 2010

Tea test

- Eric Bank

Having examined correlation and linear regression, we now need to understand whether a correlation describes a real relationship or is just the result of chance. Only real relationships are predictive. Another way of saying this is that we want to test the null hypothesis (H0) that a correlation coefficient ϱ in the population is equal to zero (ϱ = 0), versus the alternative hypothesis (H1) that it is different from zero (ϱ ≠ 0).

Since we are testing whether the correlation is not zero (i.e. significantly bigger or smaller than zero), we need to perform a two-tailed test. We assume that the variables (X and Y) are normally distributed, which permits us to perform a t-test:

$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$

where the sample correlation r is an estimate of the population correlation ϱ, and n is the sample size. If H0 is true, the test statistic has a t-distribution with (n – 2) degrees of freedom. By using n – 2 instead of n for the degrees of freedom, we avoid introducing a bias into the calculation. If the absolute value of the calculated t-value exceeds the critical t-value for the degrees of freedom, then H0 can be rejected. By the way, you can look up the critical t-value in a table at the back of any statistics book. Note that as n increases, the absolute value of the critical t-value decreases: it’s easier to reject the null hypothesis with a larger sample size. Also note that the numerator of the t-test increases with increasing n, meaning that for a given r you get larger values of t for larger samples. The bottom line is that the likelihood of failing to reject a false H0 decreases with sample size.

When we perform a t-test, we need to specify a level of statistical significance. For example, if we choose the 0.05 level of significance, we accept a 5% chance of rejecting H0 when it is actually true. All things being equal, a lower level of significance (say 0.01 instead of 0.05) produces a higher critical t-value: it becomes harder to reject H0, but when you do reject it, you can be more confident that the correlation is not just the result of chance.

Let’s work a numerical example[1]. We determine that the sample correlation r between monthly returns on long-term U.S. government bonds and 30-day T-bills was 0.1119 over 924 months of observations. Is this value of r high enough to reject the hypothesis that returns on the bonds were uncorrelated with returns on the T-bills? For the 0.05 level of significance, the critical t-value is 1.96, and we can plug the values into the t-test:

$$t_{\text{actual}} = \frac{0.1119\,(924 - 2)^{0.5}}{(1 - 0.1119^2)^{0.5}} = 3.4193 > t_{\text{critical}} = 1.96$$

Thus, in this example we are able to reject the null hypothesis, and conclude that the correlation between returns on long-term government bonds and returns on T-bills is significantly different from zero.
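The arithmetic in this example is easy to check with a few lines of Python. The function name correlation_t_stat is made up for this sketch; the inputs (r = 0.1119, n = 924, critical value 1.96) are the ones quoted in the example.

```python
import math

def correlation_t_stat(r: float, n: int) -> float:
    """t-statistic for H0: population correlation is zero (n - 2 d.f.)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)

r = 0.1119          # sample correlation from the example
n = 924             # months of observations
t_critical = 1.96   # two-tailed critical value at 0.05, as given in the text

t_actual = correlation_t_stat(r, n)
reject_h0 = abs(t_actual) > t_critical
print(f"t = {t_actual:.4f}, reject H0: {reject_h0}")
```

To see the effect of sample size, try the same correlation with only 100 observations: correlation_t_stat(0.1119, 100) comes out at roughly 1.11, which would not be enough to reject H0.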

We want next to assess the strength of a relationship between an independent and a dependent variable as determined by a linear regression. We will examine this test in our next blog using a statistic called the standard error of estimate.


[1] Quantitative Methods for Investment Analysis, Second Edition, by Richard A. DeFusco, CFA, Dennis W. McLeavey, CFA, Jerald E. Pinto, CFA, and David E. Runkle, 294-295.

Financial Statistics (2) – Linear Regression: Definition

September 9th, 2010

A linear regression is a statistical method that helps one understand the relationship between two (or more) variables.  It does this in three ways:

  1. It uses one variable to predict the value of another variable
  2. It tests hypotheses concerning the relationship between two variables
  3. It quantifies the strength of the relationship between two variables

As we did in our discussion of linear correlation, we will denote two variables as X and Y; X is the independent variable, Y the dependent one.  A linear regression assumes that there is a linear relationship between X and Y, and is given by the following formula:

$$Y_i = b_0 + b_1 X_i + \varepsilon_i \quad \text{for } i = 1, \ldots, n$$

where:

Yi is the ith value of the dependent variable

b0 is the y-intercept

b1 is the slope coefficient

Xi is the ith value of the independent variable

εi is the ith value of an error term

i is the index of a particular observation

n is the number of observations

In English, the value of the dependent variable Yi is equal to {the value of the dependent variable when the independent variable’s value is zero (b0)} plus {the product of the slope b1 and the value of the independent variable Xi} plus {some error term εi}. The error term is that part of Yi that is not explained by Xi. We call b0 and b1 the regression coefficients.

When we speak about the relationship between two variables, we think in terms of many contemporaneous observations (a cross-sectional series) or observations over a period of time (a time-series). Observations are indexed by values 1 to n.  For example, you may be interested in the effect in various countries of money supply (Xi where i refers to a particular country) on the country’s inflation rate (Yi) – that would be a cross-sectional analysis.  Conversely, you would use a time-series analysis to test the money supply/inflation rate relationship in one country over a period of time.

A perfect linear regression would be one where all of the error terms equaled zero. This would indicate that all changes to Y were accounted for by changes to X. For instance, if I eat every cookie handed to me, then there would be no error values when I plot cookies offered versus cookies consumed. In this case, the regression line’s y-intercept would be zero and the slope would equal 1; all actual data values would be points directly on the regression line. Thus, if you offered me 3 cookies, I’d eat 3 cookies. Obviously this example is unrealistic when the number of cookies offered rises above some critical value, say 3-dozen in my case.

A more realistic case is one that plots a straight regression line through the data in which the errors are minimized: the best fit. In real life, we are interested in imperfect correlations, so we need a method to achieve the best fit, which we define as the regression line that minimizes the sum of the squared vertical distances (deviations) between observations and the regression line. This method is called the linear least-squares method. Nifty, but how do we calculate the best fit?

To achieve the best fitting regression line, we need to find the slope b1 and y-intercept b0 that produce the minimum sum of the squared errors. (We square the errors, which are simply the vertical deviations from the regression line, because we don’t want positive and negative values to cancel each other out.) How do we find these magic regression coefficients? We need to make estimates, which we call the fitted parameters, according to the following formula:

$$\min_{\hat{b}_0,\, \hat{b}_1} \sum_{i=1}^{n} \left( Y_i - \hat{b}_0 - \hat{b}_1 X_i \right)^2$$

The funny little hat (^) above b0 and b1 designates that the regression coefficients are estimated. We are summing, for all index values of i, the squares of the following difference: the actual value of the dependent variable minus the predicted value of the dependent variable. When this sum (the sum of the squared error terms) is minimized, we have a best-fit regression line. The actual method of calculating this minimum is complicated, and we leave it to a computer spreadsheet or math package to do the nitty-gritty work.
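For readers who would like to see the nitty-gritty anyway, here is a minimal Python sketch. It assumes a single independent variable, for which the minimizing coefficients have a well-known closed form; the function names and the data are invented for illustration and are not taken from any particular package.

```python
def fit_line(x, y):
    """Return (b0_hat, b1_hat) minimizing the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b1_hat = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
              / sum((xi - mean_x) ** 2 for xi in x))
    b0_hat = mean_y - b1_hat * mean_x
    return b0_hat, b1_hat

def sum_squared_errors(x, y, b0, b1):
    """Sum over i of (actual Y minus predicted Y) squared."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical observations, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]

b0_hat, b1_hat = fit_line(x, y)
best = sum_squared_errors(x, y, b0_hat, b1_hat)

# Nudging either coefficient away from the fitted value increases the sum.
worse_slope = sum_squared_errors(x, y, b0_hat, b1_hat + 0.1)
worse_intercept = sum_squared_errors(x, y, b0_hat + 0.1, b1_hat)
print(b0_hat, b1_hat, best, worse_slope, worse_intercept)
```

The last few lines are a sanity check rather than a proof: moving the slope or the intercept slightly off the fitted values makes the sum of squared errors larger, which is what "best fit" means here.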

A note about the slope coefficient b1: when a linear regression contains a single independent variable, the slope coefficient is equal to the following:

$$b_1 = \frac{\mathrm{Cov}(Y, X)}{\mathrm{Var}(X)} = \frac{\mathrm{Cov}(Y, X)}{s_x s_x} \quad \text{where } s = \text{standard deviation}$$

which is the covariance of Y and X divided by the variance of X. Alert readers will recall from the previous blog that this formula is very similar to that for the correlation coefficient (r). The difference here is that the denominator, the variance of X, is equivalent to the square of the standard deviation of X (sx). For the correlation coefficient, the denominator is the product of the standard deviations for X and Y:

$$r = \frac{\mathrm{Cov}(Y, X)}{s_x s_y}$$

Conceptually, one can see that the coefficients are very similar – they both give a scale to the covariance of the two variables.
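The kinship between the two coefficients is easy to check numerically. The sketch below uses invented data and computes both the slope and the correlation coefficient from the same covariance; it also confirms that the slope equals r multiplied by the ratio of the two standard deviations, which follows directly from the two formulas.

```python
import math

# Hypothetical paired observations, invented for illustration.
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.0, 3.0, 4.0, 8.0, 9.0]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample covariance and standard deviations (n - 1 in the denominator).
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))

b1 = cov_xy / (s_x * s_x)   # slope: covariance scaled by the variance of X
r = cov_xy / (s_x * s_y)    # correlation: covariance scaled by both std devs

print(f"b1 = {b1:.4f}, r = {r:.4f}, r * s_y / s_x = {r * s_y / s_x:.4f}")
```

Note that r is always between -1 and 1, while the slope can take any value because it carries the units of Y per unit of X.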

Next time, we will address the assumptions one makes in order to calculate a proper linear regression.
