DEPARTMENT OF POLITICAL SCIENCE
AND
INTERNATIONAL AFFAIRS
POSC/URAF 815
CORRELATION AND REGRESSION
- AGENDA:
- The linear regression model.
- Properties of the regression parameters
- Parameter estimators
- Calculating estimates
- Examples of the interpretation of regression and correlation coefficients.
- Reading:
- Agresti and Finlay, Statistical Methods, Chapter 9, pages 307 to 318 and 325 to 326.
- THE TWO VARIABLE LINEAR REGRESSION MODEL:
- A simple equation specifies the two variable linear regression model:

    Y_i = a + bX_i + e_i
- As noted last time, this model consists of two components: a systematic part and a random error term.
- Systematic (or deterministic) component (see Agresti and Finlay, page 314):
- The systematic component, a + bX_i, shows how changes in the independent variable, X, are related to changes in the dependent variable, Y.
- The nature and magnitude of the connection are given by the regression parameter, b: it indicates how much Y changes for a one unit change in X.
- The regression constant, a,
is interpreted as the value of Y when X equals
0.
- Random error term:
- The presence of an error term, e_i, suggests that Y is not a perfect or exact function of X. If it were, then there would be no errors and the error term would be unnecessary.
- One interpretation of error is that variation in Y is not fully understood;
one can attribute a portion, whether large or small, to the "effects" of X,
but a portion remains unexplained even after X has been introduced into
the analysis.
- Another way is to think of each observation's score as
consisting of a
systematic part that is a function of X and an error. Thus, two units might
have the same X scores, say the same incomes, but differ with respect to Y
purely by chance.
- Of course their different Y scores might be (partly) explainable by
another variable, Z, that could possibly be measured and included in
the equation. (See multiple regression.)
- When applying regression analysis one makes important assumptions about
this error component:
- The errors are random:
- Think of the errors this way: for each value of X, there are many observations, say N_i. Each of these observations' Y scores consists, as we have seen above, of a systematic part and an error. Thus, for each value of X there is an error term, e_i.
- And since there are many Xs, there are distributions of errors around the X values, as the following figure suggests.

    [Figure: for each value of X, a distribution of observed Y scores scattered around the regression line]
- The points represent observations or cases. For each value of X there are several Y scores. Although the cases may have the same Xs, their Y scores vary among themselves. What is the source or "cause" of this variation? It is assumed to be due to random error.
- The expected value of the e_i is 0. That is, the errors "cancel out."
- More formally, for each X there is a distribution of errors, e_i. Each error plus the systematic part, a + bX, constitutes a Y score.
- The expected value or mean of these errors is assumed to be 0; that is, E(e_i) = 0.
- The variance of the errors at one level of X is assumed to be the same as the variance of the errors at another level, for all pairs of levels:

    Var(e | X = j) = Var(e | X = k) = σ²   for all levels j and k

- This means the variation of the errors at level k equals the variation at level j, which in turn equals a constant.
- This is the (in)famous assumption of homoscedasticity.
- The errors are assumed to be independent of X. That is,

    cov(e_i, X_i) = 0
- Assumptions about errors are important for making inferences and estimating
parameters.
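- To make these assumptions concrete, here is a minimal Python sketch (all numbers are invented for illustration, not from the text): the errors are random draws with mean 0, they have the same variance at every level of X, and they are drawn independently of X.

    import numpy as np

    rng = np.random.default_rng(0)

    a, b = 2.0, 0.5                        # hypothetical regression constant and coefficient
    x = np.repeat(np.arange(1, 6), 40)     # many observations at each of five levels of X
    e = rng.normal(0.0, 1.0, size=x.size)  # random errors: mean 0, constant variance, independent of X
    y = a + b * x + e                      # each Y score is a systematic part plus an error

    for level in np.unique(x):             # check: mean error near 0 and similar spread at every X
        errs = e[x == level]
        print(level, round(errs.mean(), 2), round(errs.var(), 2))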
- PROPERTIES OF THE REGRESSION COEFFICIENT:
- As noted last time, the regression coefficient is measured in units of the dependent
variable.
- Its numerical magnitude thus depends on the measurement scale used in the
analysis.
- If, for instance, one regressed income measured in dollars on education
measured in years, b will probably have a large value.
- b = 3,500.45 means that as education goes up by one year income
will increase $3,500.45.
- On the other hand, if income is measured in thousands of dollars, the corresponding regression coefficient would be about 3.50045, meaning that as education moves up by one year income moves up by 3.50045 units, or 3.50045 thousands of dollars, or $3,500.45 (= 3.50045 × 1,000).
- For this reason the regression coefficient is called unstandardized. Think of its
value as being affected by the measurement scales. Even innocuous changes will
alter the numerical value.
- Consequently, a b of 10 may or may not be comparable to a value of 100. The validity of the comparison depends on several factors, especially the scales.
- Suppose b_1 equals 10 and b_2 equals 100. One is not justified in saying, on the basis of this evidence alone, that the second variable is 10 times as important or even 10 times as large as the first.
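- A quick way to see the scale dependence is to rescale the dependent variable and refit. A minimal Python sketch (the income and education figures are invented; np.polyfit serves here only as a convenient least squares routine):

    import numpy as np

    rng = np.random.default_rng(1)
    educ = rng.uniform(8, 20, 200)                       # years of education (invented)
    income = 3500.45 * educ + rng.normal(0, 5000, 200)   # income in dollars (invented)

    b_dollars = np.polyfit(educ, income, 1)[0]           # slope with income in dollars
    b_thousands = np.polyfit(educ, income / 1000, 1)[0]  # slope with income in thousands of dollars

    print(b_dollars, b_thousands)   # the second slope is exactly the first divided by 1,000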
- The regression coefficient is asymmetric:
its numerical value depends on which
variable is considered independent.
- When someone says "Y is regressed on X," it means Y is considered the
dependent variable.
- With two variables two equations are possible:

    Y = a_YX + b_YX X + e
    X = a_XY + b_XY Y + e′
- As we'll see shortly, the a's and b's in these equations are usually not equal.
- Thus it is important to specify or indicate or know which variable is
independent and which is dependent.
- Sometimes the choice is arbitrary in the sense that the investigator
may not have reason to think one is dependent on another. Still, to
use regression programs and interpret their results it is necessary to
choose one variable as independent, the other as dependent.
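- The asymmetry is easy to verify numerically. In this minimal Python sketch (invented data), the slope from regressing y on x differs from the slope from regressing x on y, while the correlation r, previewed in the next section, is the same either way:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=100)
    y = 2.0 * x + rng.normal(size=100)   # invented data with a positive relationship

    b_yx = np.polyfit(x, y, 1)[0]        # slope when y is regressed on x
    b_xy = np.polyfit(y, x, 1)[0]        # slope when x is regressed on y
    r = np.corrcoef(x, y)[0, 1]          # correlation: identical either way

    print(b_yx, b_xy)                    # two different slopes
    print(b_yx * b_xy, r ** 2)           # their product always equals r squared

- (The product of the two slopes equals r², a handy check on any pair of such regressions.)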
- The sign of the regression
coefficient indicates the direction or form (positive or
negative) of the linear relationship.
- Hence if b = -2.5, then a one unit increase in X is associated with a 2.5 unit decrease in Y, which is just a case of negative correlation.
- The regression coefficient differs from the
correlation coefficient in these ways:
- r is standardized: its value always lies between -1.0 and +1.0, no matter what the measurement scales.
- Thus, some feel that the way to compare two independent
variables' effects on or correlation with a dependent variable is to
compare r's, not b's.
- This practice can be very misleading, however.
- r is symmetric: its value does not depend on which variable is treated as
independent.
- The r between Y and X is the same as the r between X and Y.
- Here's an example. (We'll see others.) Suppose we want to know if SAT tests are a valid indicator of college performance. We collect a random sample of 200 transcripts from the registration office and perform some regression and correlation analyses.
- We would expect college grade point average (GPA) to depend on or be a function of aptitude as measured by the SAT verbal score. So we regress GPA on verbal scores.
- Note: this example is based on the file "grades," which is part of the Student Version of MINITAB for Windows. It's also available in the example files of the full version.
- Here are the results:

    Dependent variable | Regression constant | Regression coefficient | Correlation coefficient
    GPA                | 1.1075              | .0025560               | .322
    Verbal             | 488.67              | 8.486                  | .322
- When GPA (the dependent variable) is regressed on verbal scores, the estimated regression parameter is .0025560, or about .003. This means that a one point increase in the verbal aptitude score is associated with a .003 point increase in grade point average. (So, for example, a 100 point increase in verbal score corresponds to about a quarter of a grade point.)
- As an aside, an SAT verbal score of 0 would be associated
with a GPA of 1.1, not bad for someone with no aptitude at
all.
- If we mistakenly treated verbal scores as the independent variable and tried to interpret the output as before, we would erroneously conclude that a one unit increase in X leads to an 8.486 point increase in grade point average, an obviously nonsensical result.
- Note, finally, that the correlation between scores and GPA
is .322 no matter which variable is considered dependent.
- ESTIMATING REGRESSION PARAMETERS:
- Let's look at a plot with a regression line drawn through the points.
- This example comes from the Data and Story Library. Here's the
background:
- "Government statisticians in England conducted a study of the
relationship between smoking and lung cancer. The data concern
25 occupational groups and are condensed from data on thousands
of individual men. The explanatory variable is the number of
cigarettes smoked per day by men in each occupation relative to the
number smoked by all men of the same age. This smoking ratio is
100 if men in an occupation are exactly average in their smoking, it
is below 100 if they smoke less than average, and above 100 if they
smoke more than average. The response variable is the standardized
mortality ratio for deaths from lung cancer. It is also measured
relative to the entire population of men of the same ages as those
studied, and is greater or less than 100 when there are more or
fewer deaths from lung cancer than would be expected based on
the experience of all English men."
- We first plot the variables as in the next figure, which also shows the regression line passing through the points.

    [Figure: scatterplot of smoking ratio versus lung cancer mortality ratio, with the fitted regression line]
- First note that the "units" are occupations such as clerical, professional, and service.
- Second, and more important, note that the line passes through the
"center" of the points.
- So obviously some observations lie above, some below the
line.
- Now ask: where did this line come from? Why is this the regression
line? Why not some other line?
- Well, to answer that question consider the next figure.
- It shows the same plot with a different "regression" line. This one, however, seems to fit the data less well in that most of the points are farther away from the line than in the first graph.

    [Figure: the same scatterplot with a poorly fitting alternative line]
- We can be a bit more precise when saying that one line fits the data
better than another.
- Least Squares:
- An observed data point can be thought of as a Y score produced by a linear regression function plus an error. In symbols:

    Y_i = a + bX_i + e_i
- If we ignore the error we obtain an equation for a predicted or estimated value of Y:

    Ŷ_i = a + bX_i

- Normally, the predicted value will not equal the observed score. There will be an error or residual left over, namely

    e_i = Y_i − Ŷ_i
- The hat over the Y indicates that the systematic part of the model gives a predicted value of Y.
- The residual is just the observed value of Y minus the predicted value, the prediction being based on a linear equation. That is,

    e_i = Y_i − Ŷ_i = Y_i − (a + bX_i)
- There will be an error, e_i, for each observation. These errors are called residuals.
- The method of least squares.
- The linear regression equation is constructed or computed in such a way as
to minimize the sum of the squared errors.
- The diagram below shows the regression line, the predicted Y, which lies on the regression line, and the observed Y. The gap between the two points (predicted and observed) is the residual, e_i.

    [Figure: regression line with an observed point off the line; the vertical gap between the observed and predicted Y is the residual e_i]
- The regression parameters are estimated in such a way that the sum of all of these residuals or errors squared is a minimum.
- For this reason the algorithm for calculating the regression
parameters is called least squares and the technique is often called
ordinary least squares (OLS).
- Stated slightly differently, the calculation procedure selects the regression parameters, a and b, so that the predicted Y's lead to the smallest sum of squares possible:

    minimize Σ e_i² = Σ (Y_i − Ŷ_i)²
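- A brute-force illustration of the least squares principle, as a minimal Python sketch (invented data): the OLS estimates yield a smaller sum of squared residuals than any alternative line we try.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 10, 50)
    y = 1.0 + 0.8 * x + rng.normal(0, 1.0, 50)   # invented data

    def sse(a, b):
        """Sum of squared residuals around the line a + b*x."""
        return np.sum((y - (a + b * x)) ** 2)

    b_ols, a_ols = np.polyfit(x, y, 1)           # least squares slope and intercept
    print("least squares SSE:", sse(a_ols, b_ols))

    # Nudge the parameters in any direction: the fit always gets worse.
    for da, db in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.1), (0.0, -0.1)]:
        print("alternative SSE:", sse(a_ols + da, b_ols + db))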
- Computing formulas:
- We can't go into the derivation of the computing formulas from the least squares principle. Instead we will simply state that they are calculated from raw data as follows:

    b = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)²

    a = Ȳ − bX̄

- The formula for the correlation coefficient is

    r = Σ(X_i − X̄)(Y_i − Ȳ) / √[ Σ(X_i − X̄)² Σ(Y_i − Ȳ)² ]
- To find R², sometimes called the coefficient of determination, just square r.
- Note: MINITAB presents R² as a percent. But you should report it as a decimal. After all, its range is from 0 to 1.0. Why?
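- The computing formulas above translate directly into code. A minimal Python sketch (the raw data are invented) that computes b, a, r, and R² exactly as stated:

    import numpy as np

    # Invented raw data
    x = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
    y = np.array([8.0, 11.0, 14.0, 19.0, 22.0])

    xbar, ybar = x.mean(), y.mean()
    sxy = np.sum((x - xbar) * (y - ybar))   # sum of cross products
    sxx = np.sum((x - xbar) ** 2)           # sum of squares for X
    syy = np.sum((y - ybar) ** 2)           # sum of squares for Y

    b = sxy / sxx                  # regression coefficient
    a = ybar - b * xbar            # regression constant
    r = sxy / np.sqrt(sxx * syy)   # correlation coefficient

    print(a, b, r, r ** 2)         # r squared is the coefficient of determination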
- Regression analysis with MINITAB and SPSS.
- In MINITAB go to the Statistics menu, then pick Regression. In the
dialogue box just select the dependent (response) and independent
(predictor) variables. You won't need any of the options for now.
- Do the same in SPSS: go to Stat, then select Regression and Linear.
- INTERPRETING COMPUTER OUTPUT:
- Although most regression programs display lots of statistics, we are for now
interested mainly in the coefficient estimates and R2. Later we will use more of the
information.
- MINITAB regression results for the smoking and cancer example.
- The most important display is the first, namely the estimated or calculated least squares equation:

    mortality = -2.9 + 1.09 (smoking index)
- This simply shows the estimates of a and b. In this example, a equals -2.9
and b is 1.09.
- You know how to interpret them. For each 1 unit increase in the
smoking index, cancer increases by 1.09 units. If no one smoked (X
= 0), then the predicted cancer rate would be -2.9, not a very
sensible result.
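- As a quick arithmetic check of this interpretation, a sketch using the rounded estimates from the output (recall that a smoking ratio of 100 marks an exactly average occupation):

    a, b = -2.9, 1.09       # rounded estimates from the MINITAB output above
    smoking = 100           # an exactly average occupation
    print(a + b * smoking)  # predicted mortality ratio of about 106.1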
- The remainder of the printout shows additional information that will be useful later:

    [MINITAB output: table of coefficient estimates with their standard errors, t-ratios, and p-values]
- The coefficient estimates are shown along with statistics used in
determining their "statistical significance."
- Next, find R² and an adjusted R². Both are interpreted as before. One might say, as an example, that smoking accounts for or explains about 50 percent of the variation in lung cancer rates.
- Although it is very important, ignore the S statistic for now.
- After this information appears, MINITAB produces an analysis of variance box:

    Analysis of Variance
    Source        SS
    Regression    8,395.7
    Error         7,970.3
    Total        16,366.0
- The total sum of squares is displayed on the bottom. (Here the sum is 16,366.)
- That portion due to regression (that is, to X or the explanatory variable) is 8,395.7. Note that this is 51.3 percent of the total.
- The unexplained or error sum of squares is 7,970.3.
- Please keep in mind the remarks I made in reference to the meaning
of explanation.
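- The decomposition itself is easy to verify in code. A minimal Python sketch (invented numbers standing in for the real data) showing that the total sum of squares splits into a regression part and an error part, and that the regression share equals R²:

    import numpy as np

    # Invented data standing in for the smoking example
    x = np.array([60.0, 80.0, 100.0, 120.0, 140.0])
    y = np.array([70.0, 90.0, 105.0, 130.0, 135.0])

    b, a = np.polyfit(x, y, 1)
    yhat = a + b * x                      # predicted values

    tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
    rss = np.sum((yhat - y.mean()) ** 2)  # regression (explained) sum of squares
    ess = np.sum((y - yhat) ** 2)         # error (residual) sum of squares

    print(tss, rss + ess)   # equal: TSS = RSS + ESS
    print(rss / tss)        # the regression share of the total is R squared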
- SPSS output looks pretty much the same.
- Make sure that the dependent and independent (predictor) variables are properly selected. Otherwise, as noted before, your results will be incorrect or misleading.
- Here is the parameter section of the output for the smoking example:

    [SPSS output: parameter estimates (constant and smoking coefficient) with standard errors and significance tests]
- Like MINITAB, SPSS provides an analysis of variance table:

    [SPSS output: analysis of variance table]
- It also supplies much additional information, some of which we cover later.
- Note that the two programs report slightly different values for b. We will not be
bothered by these discrepancies, but if you use SPSS on your assignments be sure
to tell me. Nevertheless, statisticians spend a lot of time trying to decide which
program produces the most accurate results. (I have no idea in this example which
is best.)
- EXAMPLE - SURGICAL PROCEDURES:
- Now for another and perhaps more realistic example. This data set pertains to the
general problem we considered previously, namely what explains the soaring cost
of health care.
- We investigated one hypothesis which states that, contrary to expectations, the wide availability of health care services does not reduce the cost of providing them. One explanation is that the availability of health resources is positively correlated with their use. So the more facilities, the more doctors and patients take advantage of them. But this availability does not reduce costs.
- Now, we can't
completely "test" this proposition, but we can throw some
light on it.
- The variable definitions for the data set, which is called Surgical Procedures, are:
- c1 Number of surgical procedures
- c2 Number of surgical specialists/100,000
- c3 Number of general surgeons/100,000
- c4 Population in 1990
- It is important to obtain a per capita or indexed measure of surgical procedures by dividing the number of procedures, c1, by the population, c4.
- So put the result somewhere (e.g., let c5 = c1/c4), as in the sketch below.
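- In MINITAB this is just a column operation; here is an equivalent minimal Python sketch (the figures are invented placeholders, not the actual data):

    # c1: number of surgical procedures; c4: 1990 population (invented figures)
    c1 = [120_000.0, 95_000.0, 300_000.0]
    c4 = [1_500_000.0, 800_000.0, 2_600_000.0]

    c5 = [procedures / pop for procedures, pop in zip(c1, c4)]  # c5 = c1/c4, per capita procedures
    print(c5)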
- First let's plot the dependent versus the independent variable in order to take a first
look at
- the variation in X and Y
- the pattern (if any) of the relationship between them.
- We can then find the correlation between per capita procedures and surgical
specialists and between per capita procedures and the number of general surgeons.
- In MINITAB go to Descriptive and then Correlation. Enter the list of
variables.
- We are interested in the correlation of the dependent variable, surgical procedures per capita (c5), with the two independent variables, the number of surgical specialists (c2) and the number of general surgeons (c3).
- Below is a correlation matrix: it displays the correlation coefficients among all pairs of variables.

    Variables | c5   | c2   | c3
    c5        | -    |      |
    c2        | .637 | -    |
    c3        | .737 | .968 | -
- As discussed above, the correlation coefficient, r, is symmetric: its value is the same no matter whether X or Y is considered dependent. So r_YX = r_XY. Since the entries above the main diagonal of the matrix simply mirror the ones below, they can be and usually are omitted.
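- A correlation matrix like this can be computed in one call. A minimal Python sketch with invented stand-ins for c5, c2, and c3 (numpy's corrcoef treats each row as a variable):

    import numpy as np

    rng = np.random.default_rng(4)
    c2 = rng.normal(30.0, 5.0, 48)                        # specialists per 100,000 (invented)
    c3 = c2 + rng.normal(0.0, 1.0, 48)                    # general surgeons, strongly correlated with c2
    c5 = 0.05 + 0.003 * c3 + rng.normal(0.0, 0.01, 48)    # per capita procedures (invented)

    # The result is a symmetric 3x3 matrix with 1's on the diagonal,
    # so only the lower triangle needs to be reported.
    print(np.corrcoef([c5, c2, c3]))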
- We see that surgical procedures and availability of surgeons are
highly, positively correlated.
- The greater the number of surgeons (per 100,000
population), the greater the number of surgical procedures
per capita.
- The estimated regression equations are:

    Estimated per capita procedures = 0.0527 + 0.00301 (number of surgeons)
    Estimated per capita procedures = 0.0519 + 0.000873 (number of specialists)
- The data "fit" a linear model reasonably well:
- r²(procedures, surgical specialists) = .4058. If we are careful and mindful of the limitations of the statement, we could assert that the number of surgical specialists explains about 40 percent of the variance in the number of surgical procedures.
- Similarly, the number of general surgeons "explains" about 50 percent of the variance in Y; that is, the squared correlation is r²(procedures, general surgeons) = .5432.
- Note that these latter statements apply only to statistical variation.
We may or may not have improved our understanding of health
care. In particular, it seems likely that we need to examine another
variable, Z = per capita income, because it may be the case that
wealthier states have more surgeons and also better or more
complete insurance, thus providing more opportunities to take
advantage of these specialists.
- NEXT TIME:
- More regression examples
- Time series and intervention analysis
Copyright © 1997 H. T. Reynolds