Posts tagged “linear correlation”

Financial Statistics (2) – Linear Regression: Definition

September 9th, 2010

A linear regression is a statistical method that helps one understand the relationship between two (or more) variables.  It does this in three ways:

  1. It uses one variable to predict the value of another variable
  2. It tests hypotheses concerning the relationship between two variables
  3. It quantifies the strength of the relationship between two variables

As we did in our discussion of linear correlation, we will denote two variables as X and Y; X is the independent variable, Y the dependent one.  A linear regression assumes that there is a linear relationship between X and Y, and is given by the following formula:

$$Y_i = b_0 + b_1 X_i + \varepsilon_i \quad \text{for } i = 1, \dots, n$$

where:

$Y_i$ is the ith value of the dependent variable

$b_0$ is the y-intercept

$b_1$ is the slope coefficient

$X_i$ is the ith value of the independent variable

$\varepsilon_i$ is the ith value of the error term

$i$ is the index of a particular observation

$n$ is the number of observations

In English, the value of the dependent variable Yi is equal to {the value of the dependent variable when the independent variable’s value is zero (b0)} plus {the product of the slope b1 and the independent variable Xi} plus {some error term εi}. The error term is that part of Yi that is not explained by Xi. We call b0 and b1 the regression coefficients.
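To make the decomposition concrete, here is a minimal Python sketch. The coefficients and observations are made up purely for illustration (they are not from any real data set); the point is only to show each observed value splitting into an explained part and an error term.

```python
# Hypothetical coefficients and observations, chosen only for illustration.
b0 = 1.0  # y-intercept
b1 = 2.0  # slope coefficient

x = [0.0, 1.0, 2.0, 3.0]  # independent variable
y = [1.5, 2.5, 5.5, 6.5]  # dependent variable

# The part of each Y value explained by X: b0 + b1 * X
explained = [b0 + b1 * xi for xi in x]

# The error term: the part of each Y value not explained by X
errors = [yi - ei for yi, ei in zip(y, explained)]

print(explained)  # [1.0, 3.0, 5.0, 7.0]
print(errors)     # [0.5, -0.5, 0.5, -0.5]
```

Adding each error back to its explained part recovers the original observation exactly.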

When we speak about the relationship between two variables, we think in terms of many contemporaneous observations (a cross-sectional series) or observations over a period of time (a time-series). Observations are indexed by values 1 to n.  For example, you may be interested in the effect in various countries of money supply (Xi where i refers to a particular country) on the country’s inflation rate (Yi) – that would be a cross-sectional analysis.  Conversely, you would use a time-series analysis to test the money supply/inflation rate relationship in one country over a period of time.

A perfect linear regression would be one where all of the error terms equaled zero.  This would indicate that all changes to Y were accounted for by changes to X.  For instance, if I eat every cookie handed to me, then there would be no error values when I plot cookies offered versus cookies consumed. In this case, the regression line’s y-intercept would be zero and the slope would equal 1; all actual data values would be points directly on the regression line. Thus, if you offered me 3 cookies, I’d eat 3 cookies. Obviously this example is unrealistic when the number of cookies offered rises above some critical value, say 3-dozen in my case.

A more realistic case is one that plots a straight regression line through the data in which the errors are minimized – the best fit.  In real life, we are interested in imperfect correlations, so we need a method to achieve the best fit, which we define as the regression line that minimizes the sum of the squared vertical distances (deviations) between observations and the regression line.  This method is called the linear least-squares method.  Nifty, but how do we calculate the best fit?

To achieve the best-fitting regression line, we need to find the slope b1 and y-intercept b0 that produce the minimum sum of the squared errors. (We square the errors, which are simply the vertical deviations from the regression line, because we don’t want positive and negative values to cancel each other out.)  How do we find these magic regression coefficients? We need to make estimates, which we call the fitted parameters, according to the following formula:

$$\min_{\hat{b}_0,\, \hat{b}_1} \; \sum_{i=1}^{n} \left( Y_i - \hat{b}_0 - \hat{b}_1 X_i \right)^2$$

The funny little hat (^) above b0 and b1 designates that the regression coefficients are estimated. We are summing, for all index values of i, the squares of the following difference: the actual value of the dependent variable minus the predicted value of the dependent variable. When this sum (the sum of the squared error terms) is minimized, we have a best-fit regression line. The actual method of calculating this minimum is complicated, and we leave it to a computer spreadsheet or math package to do the nitty-gritty work.
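For the special case of a single independent variable, the minimum has a well-known closed-form solution, so the idea can be sketched in a few lines of Python. This is an illustration only, using made-up data; `fit_line` and `sum_squared_errors` are hypothetical helper names, not functions from any particular package.

```python
from statistics import mean


def fit_line(x, y):
    """Return (b0_hat, b1_hat) for a one-variable least-squares fit."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of products of deviations, and sum of squared deviations of x
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1_hat = sxy / sxx
    b0_hat = y_bar - b1_hat * x_bar
    return b0_hat, b1_hat


def sum_squared_errors(x, y, b0, b1):
    """Sum over i of (actual Y minus predicted Y) squared."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))


# Made-up observations for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0_hat, b1_hat = fit_line(x, y)
best = sum_squared_errors(x, y, b0_hat, b1_hat)

# Nudging either fitted coefficient away from its estimate raises the sum.
worse_intercept = sum_squared_errors(x, y, b0_hat + 0.1, b1_hat)
worse_slope = sum_squared_errors(x, y, b0_hat, b1_hat + 0.1)

print(round(b0_hat, 4), round(b1_hat, 4), round(best, 4))
```

With more than one independent variable, the calculation is usually handed to a spreadsheet or math package, as noted above.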

A note about the slope coefficient b1: when a linear regression contains a single independent variable, the slope coefficient is equal to the following:

$$b_1 = \frac{\mathrm{Cov}(Y, X)}{\mathrm{Var}(X)} = \frac{\mathrm{Cov}(Y, X)}{s_x^2}$$

where $s_x$ is the standard deviation of X. In words, the slope is the covariance of Y and X divided by the variance of X.  Alert readers will recall from the previous post that this formula is very similar to that for the correlation coefficient (r). The difference is that the denominator here, the variance of X, is the square of the standard deviation of X (sx). For the correlation coefficient, the denominator is the product of the standard deviations of X and Y:

$$r = \frac{\mathrm{Cov}(Y, X)}{s_x s_y}$$

Conceptually, one can see that the coefficients are very similar – they both give a scale to the covariance of the two variables.
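The similarity can be checked numerically. The sketch below uses made-up data and only the Python standard library; it computes the slope and the correlation coefficient from the same covariance, and confirms that the two differ only by the ratio of the standard deviations.

```python
from statistics import mean, stdev, variance

# Made-up observations for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar, y_bar = mean(x), mean(y)

# Sample covariance of Y and X (divides by n - 1)
cov_yx = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Slope coefficient: covariance scaled by the variance of X
b1 = cov_yx / variance(x)

# Correlation coefficient: covariance scaled by both standard deviations
r = cov_yx / (stdev(x) * stdev(y))

# Because both share the same numerator, b1 equals r * (s_y / s_x)
b1_from_r = r * stdev(y) / stdev(x)

print(round(b1, 4), round(r, 4), round(b1_from_r, 4))
```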

Next time, we will address the assumptions one makes in order to calculate a proper linear regression.

Financial Statistics (1) – Correlation

September 7th, 2010

[Figure: scatter plot]

Many people who work at financial institutions, such as prime brokerages and hedge funds, have had formal financial training, including the use of statistics and other quantitative methods.  Today we are launching a series of blogs that cover these important topics at a straightforward, accessible level. We’ll assume you have had some exposure to the subject matter (for instance, you are familiar with terms like population and sample) and that you can handle simple algebra.

Statistics play a key role in financial modeling, so we’ll begin by looking at linear correlations and linear regressions.

Data analysis and prediction are the reasons for employing statistical methods.  Data can be organized and presented in many ways.  One of the most popular presentations is a scatter plot, in which two series of observations are plotted on an x-y coordinate graph.  For each data pair (that is, two simultaneous observations), the appropriate point is shown on the graph as the intersection of the x and y values.  For instance, if we place money-supply growth on the x-axis and inflation rate on the y-axis, we can plot a series of unconnected points that indicate some kind of relationship between the two data series.

To indicate how closely two data series are related, we use a measure of their linear association, the correlation coefficient (r). Values of r range from -1 (perfect negative correlation) through zero (no linear correlation) to +1 (perfect positive correlation).  To calculate the r of a data sample, we must first understand another statistic: sample covariance.

Covariance measures the extent to which two variables (X, Y) change together. It is given by the following equation:

$$\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

where

n is the number of data pairs

i is a particular value from 1 to n

$X_i$ is the ith value of X, $Y_i$ is the ith value of Y

$\bar{X}$ and $\bar{Y}$ are the mean X and Y values, respectively

In English, this states that the sample covariance is the average value of the product of the deviations of observations on two random variables from their sample means. Dividing by (n – 1) instead of n ensures that the sample covariance is an unbiased estimate of the population covariance.
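As a rough illustration, the sample covariance can be computed directly from that description. The figures below are invented (hypothetical money-supply growth and inflation rates, in percent), and `sample_cov` is a helper name made up for this sketch.

```python
from statistics import mean, variance


def sample_cov(a, b):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    a_bar, b_bar = mean(a), mean(b)
    products = [(ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)]
    return sum(products) / (len(a) - 1)


# Invented data pairs, for illustration only
money_growth = [2.0, 4.0, 6.0, 8.0]
inflation = [1.0, 3.0, 2.0, 6.0]

cov_xy = sample_cov(money_growth, inflation)

# The covariance of a variable with itself is its sample variance.
cov_xx = sample_cov(money_growth, money_growth)

print(round(cov_xy, 4), round(cov_xx, 4), round(variance(money_growth), 4))
```

A positive result says the two series tend to be above (or below) their means at the same time; the size of the number depends on the units, which is why the correlation coefficient is usually easier to interpret.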

To show the relationship between covariance and r, we note that if we take the covariance of X with itself, we have calculated the variance of X. Variance (denoted by the symbol s²) is a measure of how far values deviate from their mean, and is given by the following equation:

$$s_x^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

This is the variance of X, a measure of X’s dispersion around its mean.  Standard deviation (sx) is the positive square root of variance:

$$s_x = \sqrt{s_x^2}$$

Now we have all of the elements in place to calculate the sample correlation coefficient:

$$r = \frac{\mathrm{Cov}(X, Y)}{s_x s_y}$$

Thus, the correlation coefficient, r, is equal to the covariance of the two variables divided by the product of their standard deviations.  Think of it as the covariance normalized for the dispersion of each variable.
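The normalization is easy to see in a small Python sketch. The series below are made up, and `sample_cov` and `sample_corr` are hypothetical helper names for this illustration; the two exactly linear series land at the ends of the range, and the noisier one falls in between.

```python
from statistics import mean, stdev


def sample_cov(a, b):
    """Sample covariance (divides by n - 1)."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (len(a) - 1)


def sample_corr(a, b):
    """Covariance divided by the product of the two standard deviations."""
    return sample_cov(a, b) / (stdev(a) * stdev(b))


# Made-up series for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [3.0, 5.0, 7.0, 9.0, 11.0]       # exactly 2x + 1
falling = [-1.0, -2.0, -3.0, -4.0, -5.0]  # exactly -x
noisy = [2.0, 4.0, 5.0, 4.0, 5.0]         # loosely rising

r_rising = sample_corr(x, rising)
r_falling = sample_corr(x, falling)
r_noisy = sample_corr(x, noisy)

print(round(r_rising, 4), round(r_falling, 4), round(r_noisy, 4))
```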

The correlation coefficient assumes that the means and variances of X and Y, as well as Cov(X,Y), are finite and constant. Note that r measures solely linear associations between X and Y, that is, relationships with no exponents greater than 1.
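A quick way to see the "linear only" caveat is to feed r a relationship that is exact but not linear. In the sketch below (made-up data; `sample_corr` is a hypothetical helper), Y is completely determined by X through squaring, yet the correlation coefficient comes out at zero.

```python
from statistics import mean, stdev


def sample_corr(a, b):
    """Sample correlation: covariance over the product of standard deviations."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (len(a) - 1)
    return cov / (stdev(a) * stdev(b))


# X is symmetric around zero; Y depends on X exactly, but not linearly.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [xi ** 2 for xi in x]

r = sample_corr(x, y)
print(r)
```

The positive and negative deviations of X cancel when multiplied against Y, so a value of r near zero rules out a linear association, not every association.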

A value of r equal to, say, 0.9, would indicate a strong linear relationship between X and Y, but not necessarily any causal relationship between the two variables.  A classic example of spurious correlation is the one between vocabulary and height among children: both increase with age, so the real relationship has to do with age rather than with any effect of height on vocabulary.

Forecasters use correlations to analyze trends and changes in trends. For instance, a change in the consumer price index (CPI) is correlated with a change to the inflation rate. So whenever a new CPI figure is released, economists revise their forecasts for inflation, which in turn affect interest rates and bond prices. When dealing with more than two variables, a correlation matrix is used to sort out the various linear relationships among the variables.
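A correlation matrix is simply the table of r values for every pair of variables. Here is a minimal sketch in plain Python; the three series and their labels are invented for illustration and do not represent real economic data, and `sample_corr` is a hypothetical helper.

```python
from statistics import mean, stdev


def sample_corr(a, b):
    """Sample correlation: covariance over the product of standard deviations."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (len(a) - 1)
    return cov / (stdev(a) * stdev(b))


# Invented series; the labels are placeholders, not real data.
series = {
    "series_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "series_b": [2.0, 4.0, 5.0, 4.0, 5.0],
    "series_c": [5.0, 4.0, 3.0, 2.0, 1.0],
}

names = list(series)

# Entry [p][q] holds the correlation between the p-th and q-th series.
matrix = [[sample_corr(series[p], series[q]) for q in names] for p in names]

for name, row in zip(names, matrix):
    print(name, [round(value, 3) for value in row])
```

The diagonal is always 1 (each series is perfectly correlated with itself) and the matrix is symmetric, so only the entries on one side of the diagonal carry new information.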

Next time out, we’ll tackle linear regressions.
