DEPARTMENT OF POLITICAL SCIENCE

AND

INTERNATIONAL AFFAIRS

Posc/Uapp 815

VARIOUS REGRESSION MODELS

  1. AGENDA:
    1. Two variable regression example
    2. Multiple regression
      1. Multiple regression model
      2. Partial regression coefficients
    3. Reading:
      1. Agresti and Finlay, Statistical Methods, Chapter 10, pages 356 to 371
        1. Read for understanding. There is a great deal of useful and important information in these pages.


  2. TWO VARIABLE REGRESSION:
    1. Let's explore the "State Expenditure" data in a bit more detail.
    2. Recall that the variables are:
      1. Per capita state and local public expenditures ($)
      2. Economic ability index, in which income, retail sales, and the value of output (manufactures, mineral, and agricultural) per capita are equally weighted.
      3. Percentage of population living in standard metropolitan areas
      4. Percent change in population, 1950-1960
      5. Percent of population aged 5-19 years
      6. Percent of population over 65 years of age
      7. "WEST"
        1. West is a "dummy" variable coded as: Western state (1) or not (0).
    3. First let's try to explain variation in states' ability index. We might hypothesize, for example, that the more "metropolitan" or urban the state, the higher its index. The presumption is that urbanization is related to retail sales, income, and the like.
    4. We'll try to do the regression in class, but in case we can't or the screen is not clear, here is the estimated regression equation.

      1. The interpretation is straightforward. The estimated regression coefficient indicates that a one percent increase in urbanization (the percent living in metropolitan areas) is associated with about a third of a point increase in the ability to pay index.
        1. There is, in other words, some evidence of a positive correlation.
      2. How much? The measure of goodness of fit is R2 = .158, suggesting that only a small part of the variation in the ability index is accounted for or explained by the independent variable.
        1. This is not an especially large value.
        2. Similarly, the simple correlation coefficient, r = .398, is relatively small.
      3. This finding raises several possibilities:
        1. There simply isn't much of a relationship.
        2. The modest or small correlation is due to or caused by some unmeasured or excluded factor.
        3. One or both variables are not properly measured.
        4. A combination of these.
    5. We will try to improve the fit by adding another variable to the analysis.
      1. See the section on multiple regression
    6. Additional examination of the data: analysis of residuals
      1. Recall that errors in the regression model are supposed to be random.
      2. One way to check this assumption and to also find ways to improve the fit is to examine the residuals.
    7. The residuals are the differences between observed and predicted values, ei = Yi - Ŷi:
      1. There is a residual for each data point. They are rough estimators of the error terms and as such should be randomly distributed above and below zero.
      2. A common technique is the residual plot: the observed residuals plotted against the "fitted" or predicted values.
        1. That is, one can plot the pair ei versus Ŷi, the residual against the fitted value.
        2. See the attached figure for an example.
        3. In the full version of MINITAB's regression procedure, choose plot residuals in the options box.
        4. In the Student version, check both the "fits" and "residuals" boxes in the Storage list. Then go to plot and plot the residuals as the dependent variable and the "fits" as the independent factor. (A Python sketch of this procedure appears at the end of this section.)
        5. The points should be scattered more or less at random.
      3. In this example we see that for the most part the residuals are scattered above and below zero, as they should be, but also note that two residuals are quite large.
        1. Hence, we should investigate these cases more thoroughly.
          1. The two cases are numbers 42 and 47.
        2. For example, if one or both of the cases are removed the fit improves.
        3. The new estimated model is Ŷ = 79.3 + 0.308X, where X is the percent living in metropolitan areas.
        4. The R2 is now .283.
        5. And the residuals now appear more randomly scattered about the zero point.
          1. See the attached graph.
    8. Now suppose we want to add additional factors to see if we can better understand variation in the dependent variable.
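
The fitting and residual-plot steps described above can also be reproduced outside MINITAB. The following is a minimal Python sketch using statsmodels and matplotlib; the file name "state_expenditure.csv" and the column names "ability" and "metro" are hypothetical stand-ins, since the data file itself is not reproduced in these notes.

    # Minimal sketch: two-variable regression of the ability index on
    # percent metropolitan, followed by a residuals-versus-fits plot.
    # "state_expenditure.csv", "ability", and "metro" are hypothetical names.
    import pandas as pd
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    data = pd.read_csv("state_expenditure.csv")

    X = sm.add_constant(data["metro"])      # intercept plus percent metropolitan
    model = sm.OLS(data["ability"], X).fit()
    print(model.params)                     # intercept and slope (the notes report a slope of about .338)
    print(model.rsquared)                   # R-squared (the notes report .158)

    # Residuals versus fitted values; the points should scatter at random about zero.
    plt.scatter(model.fittedvalues, model.resid)
    plt.axhline(0)
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()

    # The largest residuals flag cases worth investigating (cases 42 and 47 in the notes).
    print(model.resid.abs().sort_values(ascending=False).head())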


  3. MULTIPLE REGRESSION:
    1. We'll return to the state expenditure data but first here are some additional data.
    2. This information pertains to birth rates and population growth in several countries.

PROJECTED POPULATION INCREASE

Nation          Birth rate   Death rate   Life expectancy   GNP per capita   Projected population
                (X1)         (X2)         (X3)              (X4)             increase, % (Y)
Bolivia             42           16             51               510               53.2
Cuba                17            6             73              1050               14.9
Cyprus              29            9             74              3720               14.3
Egypt               37           10             57               700               39.3
Ghana               47           15             52               320               60.1
Jamaica             28            6             70              1300               21.7
Nigeria             48           17             50               760               71.6
South Africa        35           14             54              2450               40.1
South Korea         23            6             66              2010               21.1
Turkey              35           10             63              1230               36.9


    3. Questions:
      1. What explains variation in Y, projected population increase?
      2. What are the "individual" effects of the independent variables?
      3. Are any of them redundant?
      4. How well does a linear model as a whole fit the data?
      5. What policy implications, if any, does the model contain?
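
Because the table above contains the complete data set, the model discussed in the next section can be estimated directly. Here is a minimal Python sketch using statsmodels; the short column names are my own labels for X1 through X4 and Y.

    # Sketch: multiple regression of projected population increase (Y) on
    # birth rate (X1), death rate (X2), life expectancy (X3), and GNP per capita (X4).
    import pandas as pd
    import statsmodels.api as sm

    rows = [
        ("Bolivia",      42, 16, 51,  510, 53.2),
        ("Cuba",         17,  6, 73, 1050, 14.9),
        ("Cyprus",       29,  9, 74, 3720, 14.3),
        ("Egypt",        37, 10, 57,  700, 39.3),
        ("Ghana",        47, 15, 52,  320, 60.1),
        ("Jamaica",      28,  6, 70, 1300, 21.7),
        ("Nigeria",      48, 17, 50,  760, 71.6),
        ("South Africa", 35, 14, 54, 2450, 40.1),
        ("South Korea",  23,  6, 66, 2010, 21.1),
        ("Turkey",       35, 10, 63, 1230, 36.9),
    ]
    df = pd.DataFrame(rows, columns=["nation", "birth", "death", "life_exp", "gnp", "increase"])

    X = sm.add_constant(df[["birth", "death", "life_exp", "gnp"]])
    model = sm.OLS(df["increase"], X).fit()

    print(model.params)      # the constant plus the four partial regression coefficients
    print(model.rsquared)    # how well the linear model as a whole fits the data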


  4. MULTIPLE REGRESSION MODEL:
    1. There is a single dependent variable, Y, which is believed to be a linear function of K independent variables.
      1. In the example, K = 4 because there are four independent variables, X1, X2, X3, and X4.
      2. The general model is written as:

         Y = β0 + β1X1 + β2X2 + . . . + βKXK + e
      3. Sometimes the model is written equivalently, in summation form, as:

         Y = β0 + Σ βkXk + e,   with the sum running from k = 1 to K
      4. The particular model for the comparative population data is:

         Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + e
      5. Interpretation:
        1. Systematic part:
          1. Note first that I am now writing the constant term as beta instead of alpha. This is just a common convention.
          2. The regression parameters, b's, represent the effects of each independent variable on Y when the other variables in the model have been controlled.
          3. Thus, b1 is the effect of X1 when the other X's have been controlled.
          4. The reason the word "controlled" appears is that the independent variables themselves are interrelated. Changing the value of, say, X1, not only changes Y but might also affect X2 which in turn impacts on Y. To see the "pure" or "uncontaminated" effect of X1 on Y we need to hold the other X's constant.
        2. A path diagram may help explain. Consider the models in the attached figures.
      6. Note that multiple regression coefficients are often written with the dependent variable, Y, first, an independent variable (X2, for example) second, and any variables that are being controlled after the dot. Thus, βY2.1 means the coefficient between Y and X2 when X1 has been (statistically) held constant.
      7. In the first diagram (a), Y depends on both X1 and X2. Changing X1 will affect the value of Y, even if we hold the other independent variable (X2) constant. Similarly, if we change X2, Y changes also. The "arrow" indicates that the beta "connecting" Y and X2 is non-zero.
      8. Thus, the regression procedure produces partial or controlled coefficients, which means that Y changes β1 units for a one-unit change in X1 when X2 has been held constant.
        1. Note that direct linkages are indicated by arrows; an arrow represents the presence of a non-zero beta coefficient.
      9. Now look at the second figure (b). Here X2 is not connected directly to Y. But there is an indirect relationship: as X1 varies, so do both X2 and Y. If we measured only the X2-Y relationship, we might be tempted to conclude that the two variables are related. But when X1 is added to the model, this relationship disappears. Why? Because βY2.1 gives the partial or controlled effect: when X1 is controlled there is no effect of X2 on Y.
        1. This latter case is an example of spurious correlation, examples of which we have discussed several times during the semester.
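
A small simulation can make the logic of figure (b) concrete. In the sketch below (made-up numbers, purely for illustration), X2 has no direct effect on Y, but both depend on X1. Regressing Y on X2 alone produces a sizeable coefficient; adding X1 to the model drives the partial coefficient on X2 toward zero, which is the signature of a spurious relationship.

    # Sketch of spurious correlation, as in path diagram (b):
    # X1 drives both X2 and Y; X2 has no direct effect on Y.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 500
    x1 = rng.normal(size=n)
    x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)   # X2 depends only on X1
    y = 2.0 * x1 + rng.normal(scale=0.5, size=n)    # Y depends only on X1

    # Two-variable regression: Y on X2 alone looks like a real relationship.
    print(sm.OLS(y, sm.add_constant(x2)).fit().params)

    # Multiple regression: with X1 controlled, the partial coefficient on X2 is near zero.
    print(sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit().params)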
    2. To return to the population data, the estimated coefficients from the data set are:

        1. The estimated model is thus:

      1. The first parameter is the constant: it is the value of Y when all X's are zero.
        1. Note that, as before, each regression coefficient is measured in the units of the dependent variable per unit of its own independent variable.
        2. Hence, they cannot be directly compared with one another.
          1. That is, the coefficient for X2 is numerically twice the size of the one for X1, but this does not mean it is twice as important or twice as strongly related.
        3. The first regression parameter, b1 = .738, means that Y increases .738 units for a one-unit change in X1 when X2, X3, and X4 have been held constant.
        4. The second parameter is interpreted in a similar way: Y changes by 1.46 percent for every one-unit change in X2, assuming that X1, X3, and X4 have been held constant.
        5. Note that partial regression coefficients are a statistical substitute for physically holding variables constant. In other words, observational analysis limits our ability to manipulate variables, so we compensate by making statistical adjustments.
    3. Random error part:
      1. The ei in the model once again represents random error--that is, random measurement error in Y (but not X's) and the idiosyncratic factors that affect the dependent variable.
      2. The observed Y scores are thus composed of the effects of the X's plus a random error. The random error is not observed directly; it is estimated from the residuals.
      3. Ideally, these errors really are random: they have an expected value of zero, a constant variance (their variation does not change with changes in X's), they are independent of the X's, and they are serially uncorrelated.


  5. OLS ESTIMATION:
    1. As before, assume that estimates of the parameters have somehow been obtained. With these estimates we can obtain predicted values of the Y's, as in, for example:

       Ŷ = b0 + b1X1 + b2X2 + b3X3 + b4X4
    2. Since there will usually be a difference between predicted and observed Y's, we can take the difference to get residuals, or estimates of the errors: ei = Yi - Ŷi.
    3. The mathematics of OLS leads to estimators of the β's that minimize the sum of squared residuals. In other words, the parameters are chosen such that Σei² = Σ(Yi - Ŷi)² is a minimum.

    4. Note that OLS requires that the assumptions about errors mentioned in the Class 19 notes hold.
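
As a small numerical illustration of the least-squares criterion (a sketch with made-up data, not from the notes), the code below obtains the OLS estimates for a tiny data set and verifies that any other coefficient vector yields a larger sum of squared residuals.

    # Sketch: OLS picks the coefficients that minimize the sum of squared residuals.
    import numpy as np

    # Made-up data: one predictor plus a column of ones for the constant term.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
    X = np.column_stack([np.ones_like(x), x])

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS estimates
    sse_ols = np.sum((y - X @ beta_hat) ** 2)          # minimized sum of squared residuals

    # Perturbing the estimates always increases the sum of squared residuals.
    beta_other = beta_hat + np.array([0.1, -0.05])
    sse_other = np.sum((y - X @ beta_other) ** 2)

    print(beta_hat, sse_ols, sse_other)                # sse_ols < sse_other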


  6. MULTIPLE REGRESSION COEFFICIENT, R2:
    1. As in two-variable regression, TSS measures the total variation in Y, the dependent variable.
    2. This total variation can be partitioned into two main parts, as before:

       TSS = regression (explained) sum of squares + error (residual) sum of squares,

       that is, Σ(Yi - Ȳ)² = Σ(Ŷi - Ȳ)² + Σ(Yi - Ŷi)²
    3. These quantities can be obtained from the ANOVA table part of the regression results.
    4. The measure of fit is R², also called the coefficient of determination, defined as:

       R² = (explained sum of squares) / TSS = 1 - (error sum of squares) / TSS
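
In code, the same definition can be written as a short helper (a sketch; y is the observed dependent variable and y_hat the fitted values from whichever regression is being assessed):

    # Sketch: R-squared from the partition of the total sum of squares.
    import numpy as np

    def r_squared(y, y_hat):
        """Coefficient of determination, 1 - (error sum of squares)/TSS."""
        y = np.asarray(y, dtype=float)
        y_hat = np.asarray(y_hat, dtype=float)
        tss = np.sum((y - y.mean()) ** 2)    # total variation in Y
        sse = np.sum((y - y_hat) ** 2)       # unexplained (residual) variation
        return 1.0 - sse / tss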

  7. ANOTHER EXAMPLE:
    1. Let's return once more to the state expenditures data.
      1. We'll continue to use the data with two cases (numbers 42 and 47) removed.
    2. Suppose in addition to percent of state population living in metropolitan areas we add another variable, rate of population growth.
    3. Now we have two explanatory factors, and the regression model is:

       Y = β0 + β1X1 + β2X2 + e
      1. The betas are the "partial" regression coefficients: they tell how much Y changes for a one unit increase in an X when the other X has been held constant.
    4. The estimated regression equation is Ŷ = 79.2 + 0.307X1 + 0.006X2, where X1 is the percent living in metropolitan areas and X2 is the rate of population growth.
      1. We have a regression constant, 79.2, and two partial regression coefficients.
      2. A one percent change in "metropolitan" is associated with a .307 point increase in expenditure capacity when growth has been held constant.
        1. One might think of the coefficient this way: suppose we looked only at those states having the same or a common growth rate. Then b = .307 would measure the effect of percent urbanization on expenditure capacity within that group.
        2. Note that this value is only slightly different from the one obtained with the two-variable regression: .338 when no cases have been deleted and .308 when the two states are removed.
        3. So adding another variable, growth, does not change the previous results.
        4. But in many instances adding or subtracting a variable from a model will alter the sizes of coefficients. Doing so can even change their signs.
      3. The other coefficient is interpreted in the same way. It measures the relationship between growth and capacity after urbanization (X1) has been held constant.
    5. Fit.
      1. Adding a second variable does not really improve the fit of the model since R2 is .283, which is exactly what the previous result was.
        1. Note: this is an unusual result, which makes it interesting. Normally, adding a variable will increase the multiple R2, however slightly.


  8. PROGRAM PACKAGES:
    1. Multiple regression analysis is performed with the same software procedures used in the two variable case.
      1. Add independent variables to the list of factors or predictors.
    2. One can and should obtain residual plots. Do so in the same way as before: use plotting options or store residuals and fitted values and then plot them.
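
For readers not using MINITAB, the same two steps look like this in the Python sketches used earlier in these notes (again with the hypothetical file and column names "state_expenditure.csv", "ability", "metro", and "growth"):

    # Sketch: multiple regression is the two-variable procedure with predictors added,
    # followed by the usual residuals-versus-fits plot.
    import pandas as pd
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    data = pd.read_csv("state_expenditure.csv")       # hypothetical file name
    X = sm.add_constant(data[["metro", "growth"]])     # add predictors to the list
    model = sm.OLS(data["ability"], X).fit()
    print(model.params, model.rsquared)

    plt.scatter(model.fittedvalues, model.resid)       # residual plot, as before
    plt.axhline(0)
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()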


  9. NEXT TIME:
    1. Intervention (time series)
    2. Dummy variables.
    3. Statistical inference

Figures

Residual Plots

Multivariate Models




Copyright © 1997 H. T. Reynolds