Posts tagged “correlation coefficient r”

Financial Statistics (6) – The Coefficient of Determination

September 27th, 2010

- Eric Bank


As we pointed out in our discussion of the standard error of estimate, it would be nice to know how well the independent variable X explains variation in the dependent variable Y. The coefficient of determination (R²) gives the fraction of the total variation in the dependent variable that is explained by the independent variable.

There are two ways to calculate R². The easier method, which applies to a linear regression with a single independent variable, is to square the correlation coefficient. Recall from a previous blog that the correlation coefficient, r, is equal to the covariance of the two variables divided by the product of their standard deviations (s_x s_y). (We pointed out that covariance measures the extent to which two variables (X, Y) change together.) The formula for the correlation coefficient is:

r = Cov(X, Y) / (s_x s_y)

Squaring r gives us R², the coefficient of determination. However, this shortcut doesn’t work when we are dealing with more than one independent variable (X).
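As a rough illustration of the single-variable shortcut, the sketch below computes r from its definition, squares it, and cross-checks the result against the variation-based definition of R² (1 – unexplained variation / total variation) given later in this post. The observations are made up for illustration and are not taken from the series.

```python
from math import sqrt

# Hypothetical observations (illustrative only):
# x = money supply growth, y = inflation rate
x = [0.02, 0.04, 0.05, 0.07, 0.09]
y = [0.01, 0.03, 0.03, 0.06, 0.07]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariance and standard deviations (n - 1 in the denominator)
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# Correlation coefficient and its square
r = cov_xy / (s_x * s_y)
r_squared = r ** 2

# Cross-check: fit Y = b0 + b1 * X by least squares, then use 1 - SSE/SST
b1 = cov_xy / s_x ** 2
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
sst = sum((yi - y_bar) ** 2 for yi in y)                        # total

print(f"r           = {r:.4f}")
print(f"r squared   = {r_squared:.4f}")
print(f"1 - SSE/SST = {1 - sse / sst:.4f}")
```

With one independent variable and an intercept in the regression, the last two printed values should agree up to floating-point rounding.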

The alternate calculation of R² for multiple independent variables is to use the following definition:

Total variation = Unexplained variation + Explained variation

Since R² stands for the fraction of the total variation that is explained by a linear regression, we get this solution:

R² = Explained Variation / Total Variation = 1 – (Unexplained Variation / Total Variation)

There is one more alternative for calculating R². Linear regression packages typically report a statistic called multiple R, which is the correlation between actual Y values and predicted Y values. R² is the square of multiple R.
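To make the multiple-variable case concrete, here is a minimal sketch that fits a regression with two independent variables and computes R² both ways: from the variation decomposition and as the square of multiple R. The data, the variable names, and the hand-rolled 2x2 normal-equation solve are all assumptions chosen for illustration; a regression package would normally do the fitting.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def correlation(a, b):
    """Sample correlation; the (n - 1) factors cancel, so raw sums suffice."""
    a_bar, b_bar = mean(a), mean(b)
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    s_aa = sum((ai - a_bar) ** 2 for ai in a)
    s_bb = sum((bi - b_bar) ** 2 for bi in b)
    return s_ab / sqrt(s_aa * s_bb)

# Hypothetical data with two independent variables (illustrative only)
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 11.9]

# Least-squares fit of Y = b0 + b1*X1 + b2*X2 using deviations from the means
x1_bar, x2_bar, y_bar = mean(x1), mean(x2), mean(y)
d1 = [v - x1_bar for v in x1]
d2 = [v - x2_bar for v in x2]
dy = [v - y_bar for v in y]
s11 = sum(a * a for a in d1)
s22 = sum(b * b for b in d2)
s12 = sum(a * b for a, b in zip(d1, d2))
s1y = sum(a * c for a, c in zip(d1, dy))
s2y = sum(b * c for b, c in zip(d2, dy))
det = s11 * s22 - s12 ** 2          # non-zero unless X1 and X2 are collinear
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = y_bar - b1 * x1_bar - b2 * x2_bar

y_hat = [b0 + b1 * a + b2 * b for a, b in zip(x1, x2)]

# Variation decomposition
total = sum(c * c for c in dy)
unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
explained = sum((yh - y_bar) ** 2 for yh in y_hat)

r2_from_variation = 1 - unexplained / total
multiple_r = correlation(y, y_hat)  # correlation of actual vs. predicted Y

print(f"R squared from variation: {r2_from_variation:.4f}")
print(f"multiple R squared:       {multiple_r ** 2:.4f}")
```

For a least-squares fit that includes an intercept, the two printed values should match, and the explained and unexplained variation should add up to the total.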

As an example, let’s take the results from a hypothetical linear regression that regresses inflation rate on money supply growth rate for several different countries over a particular period of time. We calculate the following results:

Given that

  • total variation is the sum of the squared deviations, Σ(Y_i – Y_avg)² = 0.001598
  • the unexplained variation is 0.000386

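the value for R² is (0.001598 – 0.000386) / 0.001598 = 0.7584. In other words, money supply growth explains roughly 76% of the variation in inflation in this example.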
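The arithmetic can be checked in a few lines. This sketch uses only the two variation figures stated above; the variable names are ours.

```python
# Figures from the example above
total_variation = 0.001598
unexplained_variation = 0.000386

# Explained variation is whatever the regression accounts for
explained_variation = total_variation - unexplained_variation

# Both forms of the definition give the same answer
r_squared = explained_variation / total_variation
r_squared_alt = 1 - unexplained_variation / total_variation

print(f"R squared = {r_squared:.4f}")
```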

Now when you inspect the generated results from a linear regression, you’ll have an understanding of the reported R2 statistic, and can judge the meaningfulness of the predicted Y values.

We are making great progress with our review of elementary financial statistics. Next time, we’ll look at analysis of variance (ANOVA) and the F-test.

Financial Statistics (1) – Correlation

September 7th, 2010


Many people who work at financial institutions, such as prime brokerages and hedge funds, have had formal financial training, including the use of statistics and other quantitative methods.  Today we are launching a series of blogs that cover these important topics at a straightforward, accessible level. We’ll assume you have had some exposure to the subject matter (for instance, you are familiar with terms like population and sample) and that you can handle simple algebra.

Statistics play a key role in financial modeling, so we’ll begin by looking at linear correlations and linear regressions.

Data analysis and prediction are the reasons for employing statistical methods. Data can be organized and presented in many ways. One of the most popular presentations is a scatter plot, in which two series of observations are plotted on an x-y coordinate graph. For each data pair (that is, two simultaneous observations), the appropriate point is shown on the graph as the intersection of the x and y values. For instance, if we place money-supply growth on the x-axis and inflation rate on the y-axis, we can plot a series of unconnected points that indicate some kind of relationship between the two data series.

To indicate how closely two data series are related, we use a measure of their linear association, the correlation coefficient (r). The values that r can have range from -1 (perfect negative correlation) through zero (no linear correlation) to +1 (perfect positive correlation).  To calculate the r of a data sample, we must first understand another statistic: sample covariance.

Covariance measures the extent to which two variables (X, Y) change together. It is given by the following equation:

$$\operatorname{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

where

n is the number of data pairs

i is a particular value from 1 to n

$X_i$ is the ith X variable, $Y_i$ is the ith Y variable

$\bar{X}$ and $\bar{Y}$ are the mean X and Y values, respectively

In English, this states that the sample covariance is the average value of the product of the deviations of observations on two random variables from their sample means. Dividing by (n – 1) instead of n ensures that the sample covariance is an unbiased estimate of the population covariance.
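As a minimal sketch of this definition, the function below computes the sample covariance directly from the deviations. The function name and the sample data are our own, chosen for illustration.

```python
def sample_covariance(xs, ys):
    """Sum of products of deviations from the sample means, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("both series need the same number of observations")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two data pairs are required")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data pairs (illustrative only)
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]
print(sample_covariance(x, y))
```

A positive result means the two series tend to move in the same direction; a negative result means they tend to move in opposite directions. The magnitude depends on the units of X and Y, which is why the correlation coefficient is usually easier to interpret.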

To show the relationship between covariance and r, we note that if we take the covariance of X with itself, we have calculated the variance of X. Variance (denoted by the symbol s²) is a measure of how far values deviate from their mean, and is given by the following equation:

$$s_x^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

This is the variance of X, a measure of X’s dispersion around its mean. Standard deviation (s_x) is the positive square root of variance:

$$s_x = \sqrt{s_x^2}$$

Now we have all of the elements in place to calculate the sample correlation coefficient:

$$r = \frac{\operatorname{Cov}(X, Y)}{s_x s_y}$$

Thus, the correlation coefficient, r, is equal to the covariance of the two variables divided by the product of their standard deviations.  Think of it as the covariance normalized for the dispersion of each variable.
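The pieces fit together as in the following sketch, which relies on Python's standard `statistics` module for the sample variance and standard deviation (both use n – 1). The helper `sample_covariance` and the data are illustrative assumptions, not part of the original series.

```python
from statistics import stdev, variance

def sample_covariance(xs, ys):
    """Sample covariance with n - 1 in the denominator."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.5, 1.9, 3.2, 3.8, 5.1]

# The covariance of X with itself is the variance of X
cov_xx = sample_covariance(x, x)
var_x = variance(x)

# Normalize the covariance by the product of the standard deviations
s_x = stdev(x)
s_y = stdev(y)
r = sample_covariance(x, y) / (s_x * s_y)

# A series that is an exact negative multiple of X gives r = -1
y_neg = [-2.0 * v for v in x]
r_neg = sample_covariance(x, y_neg) / (s_x * stdev(y_neg))

print(f"Cov(X, X) = {cov_xx:.4f}, variance of X = {var_x:.4f}")
print(f"r = {r:.4f}")
print(f"r for a perfectly negative relationship = {r_neg:.4f}")
```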

It is assumed that, for the correlation coefficient, the means and variances of X and Y, as well as Cov(X, Y), are finite and constant. Note that r refers solely to linear associations between X and Y, that is, no exponents greater than 1.

A value of r equal to, say, 0.9, would indicate a strong linear relationship between X and Y, but not necessarily any causal relationship between the two variables. A classic example of spurious correlation is the one between vocabulary and height. One may infer that the real relationship has something to do with age: as children get older, both their height and their vocabulary increase.

Forecasters use correlations to analyze trends and changes in trends. For instance, a change in the consumer price index (CPI) is correlated with a change in the inflation rate. So whenever a new CPI figure is released, economists revise their forecasts for inflation, which in turn affect interest rates and bond prices. When dealing with more than two variables, a correlation matrix is used to sort out the various linear relationships among the variables.
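A correlation matrix is simply the table of r values for every pair of variables. Here is a small sketch; the three series and their names are invented for illustration, and the helper functions are our own rather than any package's API.

```python
from math import sqrt

def correlation(a, b):
    """Sample correlation coefficient of two equal-length series."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    s_aa = sum((ai - a_bar) ** 2 for ai in a)
    s_bb = sum((bi - b_bar) ** 2 for bi in b)
    return s_ab / sqrt(s_aa * s_bb)

def correlation_matrix(series):
    """Map each pair of series names to their correlation."""
    names = list(series)
    return {
        row: {col: correlation(series[row], series[col]) for col in names}
        for row in names
    }

# Hypothetical series (illustrative only)
data = {
    "money_growth": [0.02, 0.04, 0.05, 0.07, 0.09],
    "inflation":    [0.01, 0.03, 0.03, 0.06, 0.07],
    "bond_price":   [101.0, 99.5, 99.0, 97.2, 96.0],
}

matrix = correlation_matrix(data)
for row in data:
    cells = "  ".join(f"{matrix[row][col]:+.3f}" for col in data)
    print(f"{row:<13} {cells}")
```

The diagonal is always 1 (each series is perfectly correlated with itself), and the matrix is symmetric, so only the entries above or below the diagonal carry information.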

Next time out, we’ll tackle linear regressions.
