# EXPLAINING VARIATION AND ANALYSIS OF VARIANCE

1. AGENDA:
1. Touring the net and obtaining data
2. The idea of "explained" variation
3. Analysis of variance
1. Using box plots to compare measures of central tendency and dispersion among different populations.
4. Examples of explanation with box plots.

2. SOURCES OF STATISTICAL INFORMATION AND DATA:
1. The course internet site contains an information page that leads to some interesting data sources, statistical resources, and learning tools. We'll briefly visit some of them if we can connect.
1. Carnegie Mellon Statistics Library
2. "Math 438": a set of materials on statistical graphics

3. STATISTICAL EXPLANATION:
1. As noted before (see most recently the notes for Class 14), the variation of a quantitative variable is usefully summarized by the total sum of squares:

$$\text{TSS} = \sum_{i=1}^{N} (Y_i - \bar{Y})^2$$

1. Note that this concept of variation measures differences around a mean.
2. Statistical explanation can be viewed from two perspectives:
1. The numerical quantity, TSS, is partitioned into two parts: a portion explained by the independent variables and an unexplained, or error, portion.
2. But explanation also involves comparisons among measures of central tendency such as the mean or median.
1. As an example, some of the total variation in Y may be "due to" the fact that some units belong to a group that has a different average value on Y than members of other groups.
2. This idea is best illustrated with an example.
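
Before turning to the example, here is a minimal sketch of how the total sum of squares is computed. The scores are hypothetical, chosen only to keep the arithmetic easy to follow, and the function name `total_sum_of_squares` is an illustrative choice, not something from the course materials.

```python
from statistics import mean


def total_sum_of_squares(values):
    """Return the sum of squared deviations of each value from the mean."""
    y_bar = mean(values)
    return sum((y - y_bar) ** 2 for y in values)


# Hypothetical scores (not course data).
scores = [2, 4, 6, 8]
tss = total_sum_of_squares(scores)

# The mean is 5, so TSS = 9 + 1 + 1 + 9 = 20.
print(tss)

# If every unit had the same score there would be no variation to explain.
print(total_sum_of_squares([7, 7, 7]))
```

Note that TSS is zero only when every observation equals the mean, which is the sense in which it measures variation.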

1. EXAMPLE OF "EXPLAINING VARIATION":
1. Consider these data pertaining to median weekly earnings of various occupational groups. The analysis of these numbers has several purposes.
1. First, it shows how tabular data can be summarized.
2. More to the point, it illustrates how variation in a variable, here wages, can be partially explained by introducing another variable, here gender.
2. The following table shows a portion of the data.
| OCCUPATION | MALE | FEMALE |
|---|---|---|
| Administrators | 617 | 476 |
| Financial | 788 | 487 |
| Personnel | 785 | 563 |
| Purchasing | 709 | - |
| Marketing | 814 | 503 |
| Education | 757 | 449 |
| Health | 743 | 535 |
| Real Estate | 516 | 355 |
| Other | 610 | 426 |
| Engineers | 727 | 624 |
| Math & Computer | 733 | 575 |
| Natural Scientists | 677 | 540 |
| Health | 807 | 553 |
| Health Treatment | 591 | 511 |
| College Teachers | 752 | 555 |
| Teachers | 560 | 463 |
| Counselors | 599 | 522 |
| Library | - | 463 |
| Social Scientists | 676 | 531 |
| Social Workers | 414 | 395 |
| Lawyers | 930 | 765 |
| Writers & Entertainers | 559 | 388 |

1. The summary statistics for these data as a whole are:
1. That is, these numbers represent the mean, median, and so forth for all of the categories regardless of gender. Figure 1 shows the distribution, using the familiar box plot.

1. Some points to note:
1. Analysis of variance works with the TSS. Think of this number as a description of the data, just as, say, a number describes a person's height or a county's percent vote for Perot.
2. The exact number, although it does not have an intuitive interpretation, simply describes the variation in Y (e.g., weekly wages).
3. The social scientist's job is to explain this variation: if every occupational category had the same weekly median earnings, TSS would equal 0. But since it doesn't, one wonders what factors produce the variation, just as one can wonder why one person is tall, another short, or why one county gave Perot a substantial percent of its vote at the same time that others did not.
4. Explanations of the variation are obvious: different occupations command different salaries and wages because they are perhaps worth more to society or their workers are organized or...you can think of other possibilities.
5. Columns two and three, however, suggest another source of variation, gender differences in pay: in every occupational category for which both figures are reported, women earn less than men.
6. Figure 2 shows the point:

1. Clearly, average weekly earnings are lower for women than for men. Here we use the median to compare; traditional analysis of variance relies on the mean, but, as we have noted, the mean is "sensitive" to extreme scores, and so we will use the median and pictures to help "explain" variation.
1. In this instance, we would conclude that a substantial, but not total, part of the variation in wages is due to gender differences in pay.
2. As an aside, think about how you would interpret Figure 2.
1. Is there evidence of discrimination? Suppose that is your hypothesis. The statistical equivalent is that the population measures of central tendency differ. That is, think of males as comprising one subpopulation and females another. In the analysis of variance case, the research hypothesis is that the population means differ:

$$H_A: \mu_{\text{male}} \neq \mu_{\text{female}}$$

1. Note: remember that a box plot displays medians so the above is just a way to think about the two plots.
1. It is clear that the data are consistent with this hypothesis or one cast in terms of medians (as in Figure 2). A more important question is why the averages differ. That is, we have taken a first step toward explaining variation in wages, but it's not a very long first step. Why do gender differences exist? Here are a couple of possibilities that these data cannot address (but we could find other data to sort them out):
1. Sex discrimination
2. Different work histories: men in each category have been employed longer, have more experience, more education, and so forth.
3. Can you think of other explanations?
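
The comparison of medians described in this example can be checked directly from the table above. The following is a minimal sketch: the values are copied from the table, `None` stands in for the entries shown as "-", and the variable names are illustrative only.

```python
from statistics import median

# Median weekly earnings by occupation, in table order.
# None marks an entry reported as "-" in the table.
male = [617, 788, 785, 709, 814, 757, 743, 516, 610, 727, 733,
        677, 807, 591, 752, 560, 599, None, 676, 414, 930, 559]
female = [476, 487, 563, None, 503, 449, 535, 355, 426, 624, 575,
          540, 553, 511, 555, 463, 522, 463, 531, 395, 765, 388]


def observed(values):
    """Drop missing entries before summarizing."""
    return [v for v in values if v is not None]


male_median = median(observed(male))      # 709
female_median = median(observed(female))  # 511
gap = male_median - female_median         # 198

# Occupations with both figures reported, and how many favor men.
pairs = [(m, f) for m, f in zip(male, female) if m is not None and f is not None]
men_earn_more = sum(1 for m, f in pairs if m > f)

print(male_median, female_median, gap)
print(men_earn_more, "of", len(pairs), "occupations")
```

The gap between the two medians is what the side-by-side box plots display; the code says nothing about why the gap exists.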

1. ANOTHER EXAMPLE OF STATISTICAL EXPLANATION:
1. To see if sub-population or subgroup averages differ, we can use multiple box plots. That is, we draw a box plot for the units in each category of the independent variable.
2. Here's another example:
1. Figure 3 on the next page perhaps shows how opinions of Jesse Jackson, an American political and religious leader, differ by "ideology." In the past Jackson has been very controversial in that some people respect him greatly whereas others can't stand him.
1. What explains the variation in opinions?
2. Political ideology seems to be related as the Figure shows.
3. We see that the more liberal a person is, the more highly he or she rates Jackson.

1. STILL ANOTHER EXAMPLE:
1. I obtained this example and data from the Data and Story Library at Carnegie Mellon University's "StatLib" web site.
1. It's a great place to visit and can be reached by clicking on sources of information and links on the class internet site.
2. N.M. Meltz, in "Interstate and Interprovincial Differences in Union Density," Industrial Relations 28:2 (Spring 1989), 142-158, wanted to explain variation in the percentage of state employees belonging to labor unions.
1. One variable he considered was "right to work" laws: some states make unionization more difficult than others by enacting rules and regulations that prevent people from being forced to join a union in order to work.
1. Apparently Delaware does not have such a law.
2. Common sense suggests that the presence of such laws, which perhaps reflect an "anti-union" attitude among citizens, would be associated with lower public union membership.
3. This idea is easy to test as in the box plot in Figure 4.

1. This figure suggests that the presence of right to work laws affects the rate or percentage of union membership.
1. We'll discuss this sort of conclusion and its supporting evidence in much more detail later.
1. The data are available on the web site.

1. ANALYSIS OF VARIANCE:
1. The sort of analysis conducted above represents an "analogue" of an important statistical technique known as analysis of variance (ANOVA).
1. The objective is to "explain," by reference to one or more independent variables, the statistical variation in a dependent variable.
2. There is, for instance, variation in union membership among the states.
1. An investigator might suggest that this variation is due to differences in public attitudes as reflected in laws.
1. Of course, such a hypothesis assumes that laws reflect the will of the people, a very controversial assumption to say the least.
3. In any event, ANOVA partitions the total variation (see the section on statistical explanation, above) into a part that is "explained by" or attributable to the independent variable (e.g., presence or absence of right-to-work laws) and a part attributable to random error.
2. We take up the statistical method in detail in the second semester of applied statistics, but for now look at this equation:

$$\text{TSS} = \text{SS explained by } X + \text{error SS}$$

1. These sums of squares are just numbers. The total sum of squares represents total variation in Y (see above); the "explained by..." sum of squares is that part attributable to X, an independent variable; and the "error SS" represents what is unknown or unaccounted for or "left over" after X has "done its work."
2. For instance, suppose TSS = 100 and we find that

$$\text{SS explained by } X = 50, \qquad \text{error SS} = 50$$

1. In this instance we would say that X explains 50 percent of the variation in Y and that the remaining 50 percent remains, for now, unexplained.
1. Someone might add another variable, Z, in an attempt to reduce the unexplained or error SS further.
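
The partition can be sketched with the earnings table from the earlier example, treating gender as X. This is a minimal sketch of the traditional mean-based decomposition, not the median-based comparison used in the box plots; the missing "-" entries are simply omitted, and all names in the code are illustrative.

```python
from statistics import mean

# Median weekly earnings from the occupation table, missing entries omitted.
groups = {
    "male": [617, 788, 785, 709, 814, 757, 743, 516, 610, 727, 733,
             677, 807, 591, 752, 560, 599, 676, 414, 930, 559],
    "female": [476, 487, 563, 503, 449, 535, 355, 426, 624, 575,
               540, 553, 511, 555, 463, 522, 463, 531, 395, 765, 388],
}


def sum_sq(values, center):
    """Sum of squared deviations of values from a given center."""
    return sum((v - center) ** 2 for v in values)


all_values = [v for g in groups.values() for v in g]
grand_mean = mean(all_values)

# Total variation around the grand mean.
tss = sum_sq(all_values, grand_mean)

# Explained SS: how far each group mean sits from the grand mean,
# weighted by the number of observations in the group.
explained_ss = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())

# Error SS: variation left over within each group around its own mean.
error_ss = sum(sum_sq(g, mean(g)) for g in groups.values())

share_explained = explained_ss / tss

print(round(tss, 1), round(explained_ss, 1), round(error_ss, 1))
print(f"share of variation explained by gender: {share_explained:.2f}")
```

Whatever the share turns out to be, the two parts always add up to TSS; adding a second variable, Z, can only move variation from the error part to the explained part.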

1. CREATING PLOTS WITH MINITAB:
1. You can create multiple box plots with MINITAB.
1. But the independent variable must consist of a relatively small number of numerical categories.
1. In the example above states having no right-to-work laws were coded or represented by "0" and those with such laws by "1."
2. Hence, the independent variable here has just two categories.
2. A common mistake is to attempt to create multiple plots with an independent variable that has a large number of categories or levels (say, more than 20).
2. See the attached figures for information about annotating the plots.
1. They will be demonstrated in class (I hope).
3. For further examples and discussion go to:
4. Data and Story Library
1. This site, maintained by the Carnegie Mellon University Statistics Department, is huge and contains numerous examples and explanations, not simply of box plots but of all sorts of statistical methods. In all likelihood we will be going back to it.
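
For readers working outside MINITAB, the grouping step behind multiple box plots can be sketched in Python. This is only an illustration under stated assumptions: the percentages are hypothetical (they are NOT the Meltz data), the 0/1 coding follows the right-to-work example above, the limit of 20 categories comes from the note on common mistakes, and the quartile convention used here may differ from MINITAB's.

```python
from collections import defaultdict
from statistics import quantiles

MAX_CATEGORIES = 20  # more levels than this makes side-by-side plots unreadable


def five_number_summaries(codes, values):
    """Group values by category code; return (min, Q1, median, Q3, max) per group.

    Each group needs at least two observations for quartiles to be computed.
    """
    grouped = defaultdict(list)
    for code, value in zip(codes, values):
        grouped[code].append(value)
    if len(grouped) > MAX_CATEGORIES:
        raise ValueError("too many categories for side-by-side box plots")
    summaries = {}
    for code, group in sorted(grouped.items()):
        q1, q2, q3 = quantiles(group, n=4, method="inclusive")
        summaries[code] = (min(group), q1, q2, q3, max(group))
    return summaries


# Hypothetical data: 0 = no right-to-work law, 1 = law present.
law = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
union_pct = [30, 35, 40, 45, 50, 10, 15, 20, 25, 30]

summaries = five_number_summaries(law, union_pct)
for code, summary in summaries.items():
    print(code, summary)
```

Each tuple holds the numbers a box plot draws for one category, so comparing the tuples is the numerical counterpart of comparing the plots.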


1. NEXT TIME:
1. Correlation and causation.