Quick and Dirty Regression Tutorial

The statistical procedure known as linear regression basically involves drawing and analyzing trend-lines through data points. Economists use regression analysis to test hypotheses, derived from economic theory, against real-world data.

In your first microeconomics class you saw theoretical demand schedules (Figure 1) showing that if price increases, the quantity demanded ought to decrease. But when we collect market data to actually test this theory, the data may exhibit a trend, but they are "noisy" (Figure 2).

Drawing a trendline through datapoints

To analyze the empirical relationship between price and quantity, download and open the Excel spreadsheet with the data.  Right-click on the spreadsheet chart to open a chart window, and print off a full-page copy of the chart (same as the one shown in Figure 2). Using a pencil and straightedge, eyeball and then draw a straight line through the cloud of points that best fits the overall trend. Extend this line to both axes. Now calculate the values of intercept A and slope B of the linear equation that represents the trend-line   Price = A + B*Quantity

Although it is standard practice to graph supply and demand with Price on the Y-axis and Quantity on the X-axis, economists more often consider demand Quantity to be the "dependent" variable influenced by the "independent" variable Price. To obtain a more conventional demand equation, invert your equation, solving for intercept and slope coefficients a and b, where   Quantity = a + b*Price.   Technically, since this "empirical" (i.e., data-derived) demand model doesn't fit through the data points exactly, it ought to be written as   Quantity = a + b*Price + e   where e is the residual "unexplained" variation in the Quantity variable (the deviations of the actual Quantity data points from the estimated regession line that you drew through them).

That's basically what linear regression is about: fitting trend lines through data to analyze relationships between variables. Since doing it by hand is imprecise and tedious, most economists and statisticians prefer to...

Fitting a trendline in an XY-scatterplot

MS-Excel provides two methods for fitting the best-fitting trend-line through data points, and calculating that line's slope and intercept coefficients. The standard criterion for "best fit" is the trend line that minimizes the sum of the squared vertical deviations of the data points from the fitted line. This is called the ordinary least-squares (OLS) regression line. (If you got a bunch of people to fit regression lines by hand and averaged their results, you would get something very close to the OLS line.)

The easiest way to plot a trend line and calculate a single-variable regression equation is to right-click on the data points in an Excel XY plot and select "Add Trendline." Under the "Options" tab check "Display equation on chart" and click "OK." How well do this trend line and calculated slope and intercept coefficients match the line you drew and the slope and intercept that you calculated?

Using Excel's Regression utility (Data Analysis tools)

Excel also includes a formal regression utility in its Analysis ToolPak that provides statistics indicating goodness-of-fit and confidence intervals for slope and intercept coefficients. This utility lets you regress one dependent "left-hand-side" (of the equal sign) variable against one or several independent "right-hand side" variables, and it provides useful indicators about the statistical reliability of your model.

Excel's Regression procedure is one of the Data Analysis tools. If you don't see it, you need to activate the Analysis ToolPak. Click the Windows symbol or the File menu, choose Options--Add-Ins, select Analysis ToolPak (not Analysis ToolPak VBA) and click "Go..."  Check the Analysis TookPak checkbox and "OK." You will find "Data Analysis" on the right end of the "Data" menu.

The only things you are required to specify are...
(a) one column of numbers as the Y Range, aka the dependent variable, "left-hand-side" variable or endogenous variable whose variation is to be "explained" by the regression model;
(b) one or several adjacent columns of numbers as the X Range, aka the independent variables, right-hand side (of the equals sign) variables, exogenous variables or "explanatory" variables;
(c) the upper-left corner of a blank range of cells in your spreadsheet where the results will be printed.
The X and Y ranges must contain the same number of rows, all numeric data, no missing values.

Here is output from Excel's regression utility replicating the regression of Price (Y range) against Quantity (X range). At the bottom of the output you can see the same Intercept and Quantity slope coefficients that are shown for the trend line in the XY plot above. This empirical inverse demand model, written out in equation form, is P = 13.675 - 0.1664*Q + e.   Other parts of the output are explained below.)

Try specifing Quantity as the dependent variable and Price as the independent variable, and estimating the conventional demand regression model Quantity = a + b*Price .   Note that you obtain an approximate rather than exact mathematical inverse of the price equation! This is because OLS minimizes the sum of the squared vertical deviations from the regression line, not the sum of squared perpendicular deviations:

Multivariate models

Now try regressing Quantity (Y range) against both Price and Income (the X range is both the Price and Income columns). This will yield coefficient estimates for the multivariate demand model   Quantity = a + b*Price + c*Income + e.   You should get something like this:

Written out in equation form, this empirical demand model is Q = 49.18 - 3.118*P + 0.510*I + e. Multivariate models such as this don't lend themselves to easy graphing, but they are much more interesting.   In this example an increase in Income shifts the conventional Q vs. P demand schedule to the right, while an increase in Price shifts the Q vs. Income curve (aka Engel curve) to the left.

Model diagnostics

When analyzing your regression output, first check the signs of the model coefficients: are they consistent with your hypotheses? Is the Price coefficient negative as theory predicts? Does the Income coefficient indicate this is a normal good, or an inferior good?   Try calculating the price and income elasticities using these slope coefficients and the average values of Price and Quantity.

The next thing you should check is the statistical significance of your model coefficients. Because the data are noisy and the regression line doesnt fit the data points exactly, each reported coefficient is really a point estimate, a mean value from a distribution of possible coefficient estimates. So the residuals e (the remaining noise in the data) are used to analyze the statistical reliability of the regression coefficients. The columns to the right of the coefficients column at the bottom of the Excel output report the standard errors, t-statistics, P-values, and lower and upper 95% confidence bounds for each coefficient.

The standard error is the square root of the variance of the regression coefficient. The t-statistic is the coefficient estimate divided by the standard error. If your regression is based on what statisticians call a "large" sample (30 or more observations), a t-statistic greater than 2 (or less than -2) indicates the coefficient is significant with >95% confidence. A t-statistic greater than 1.68 (or less than -1.68) indicates the coefficient is significant with >90% confidence. The confidence thresholds for t-statistics are higher for small sample sizes. This example uses only 21 observations to estimate 1 intercept and 2 slope coefficients, which leaves 21 - 3 = 18 "degrees of freedom" (df) for calculating significance levels. In this example, the t-statistic on the Income coefficient is 2.037, which would exceed the 95% confidence threshold for a "large" (N > 30 observations) dataset, but does not quite meet the 95% confidence threshold when N = 21 observations.

If that last paragraph is just statistical gibberish for you, don't worry--most people just check the P-values. These are the probabilities that the coefficients are not statistically significant. The P-value of 0.056 for the Income coefficient implies 1 - 0.056 = 94.4% confidence that the "true" coefficient is between 0 and about 1.02. The last two columns report the exact lower and upper 95% confidence thresholds for the Income coefficient: -0.0159 and +1.038 respectively. The very low P-values for the Intercept and Price coefficients indicate they are very strongly significant, so their 95% confidence intervals are relatively narrower.

The R-Square statistic near the top of the output represents the percent of the total variation in the dependent variable that is explained by the independent variables, i.e., the model's overall goodness of fit." But whether a model is really a "good" fit or not depends on context. R-squares for cross-sectional models are typically much lower than R-squares for time-series models. You can always increase R-square by throwing another independent variable (any variable!) into your model. Remember that your real objective is to test your hypotheses, not to maximize R-square by including irrelevant variables in your model and then making up some "hypothesis" after the fact to "explain" the results you got.

Those are all the diagnostics you really need to worry about.

Final comments

The classical OLS model assumes that the residuals e are independent of each other and randomly distributed with a mean of zero. It is sometimes helpful to examine plots of residuals to check for non-random pattens that indicate problems with your model. If you take an econometrics class, you will learn how to identify violations of these assumptions and how to adapt the OLS model to deal with these situations.

Keep in mind that a regression actually analyzes the statistical correlation between one variable and a set of other variables. It doesn't actually prove causality. It is only the context of your analysis that lets you infer that the "independent" variabes "cause" the variation in the "dependent" variable. Somebody else out there is probably using the same data to prove that your dependent variable is "causing" one of your independent variables!

You should never force the regression line through the origin (the "Constant is zero" check-box in the Excel utility) without a clear theoretical justification for doing so. It makes your model diagnostics unreliable.