DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS

Posc/Uapp 815

Descriptive Statistics and Numerical Summaries

CLASS 8 AGENDA:

Variance and standard deviation
Box-and-whiskers plot.
Measuring change
Populations and samples
Reading:

Agresti and Finlay, Statistical Methods, pages 45 to 67 as needed.

Note the box plot discussed in Agresti and Finlay is not constructed exactly the same way as the one described below.

Lewis-Beck, Data Analysis, pages 11 to 18.
Try some of the early tutorials in the Student Edition of MINITAB for Windows, such as the material starting on T-35.

MEASURES OF DISPERSION:

Total variation: the total variation in a batch of numbers equals the sum of the squared deviations about the mean.

Recall from the last class that the total variation, also called the total sum of squares (TSS), is defined by:

$$\text{TSS} = \sum_{i=1}^{N} (Y_i - \bar{Y})^2$$

Computing formula:

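When you have lots of data, the total sum of squares can be calculated easily with a good calculator by finding the sum of the Ys and the sum of the squared Ys; that is:

$$\sum_{i=1}^{N} Y_i \qquad \text{and} \qquad \sum_{i=1}^{N} Y_i^2$$

Then put these quantities in this formula, called a computing formula:

$$\text{TSS} = \sum_{i=1}^{N} Y_i^2 - \frac{\left(\sum_{i=1}^{N} Y_i\right)^2}{N}$$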

Example: Recall the Perot voting data

Table 1

Percent for Perot, 1992

New Jersey, by County

No	Percent for Perot	No	Percent for Perot
1	17.6	12	15.8
2	12.9	13	17.1
3	20.4	14	15.5
4	17.6	15	19.3
5	20.1	16	13.0
6	19.0	17	26.0
7	9.7	18	17.4
8	23.1	19	22.0
9	7.9	20	11.4
10	23.6	21	23.8
11	15.5

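For these data (N = 21) we obtain

$$\sum Y_i = 368.7 \qquad \sum Y_i^2 = 6937.37$$

Putting these quantities in the computing formula gives:

$$\text{TSS} = 6937.37 - \frac{(368.7)^2}{21} = 6937.37 - 6473.32 = 464.05$$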
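As a cross-check on the hand calculation, here is a minimal Python sketch that computes the total sum of squares for the Table 1 data two ways: from the definition and from the computing formula. Python is used only for illustration (the course software is MINITAB), and the names perot, tss_definition, and tss_computing are our own, not part of any package.

```python
# Percent for Perot, 1992, New Jersey counties (Table 1).
perot = [17.6, 12.9, 20.4, 17.6, 20.1, 19.0, 9.7, 23.1, 7.9, 23.6, 15.5,
         15.8, 17.1, 15.5, 19.3, 13.0, 26.0, 17.4, 22.0, 11.4, 23.8]


def tss_definition(values):
    """Total sum of squares: squared deviations about the mean, summed."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)


def tss_computing(values):
    """Computing formula: sum of squared Ys minus (sum of Ys) squared over N."""
    n = len(values)
    total = sum(values)
    total_sq = sum(y ** 2 for y in values)
    return total_sq - total ** 2 / n


print("Sum of Y:        ", round(sum(perot), 1))
print("Sum of Y squared:", round(sum(y ** 2 for y in perot), 2))
print("TSS (definition):", round(tss_definition(perot), 2))
print("TSS (computing): ", round(tss_computing(perot), 2))
```

The two functions should agree up to floating-point rounding (about 464.05 for these data). The computing formula needs only one pass through the data, which is why it suits a calculator.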

Software and calculators.

Nearly every calculator above those handed out by banks for opening an account has an "accumulate" key (its label varies by model). Key in a number such as 17.6, then press the accumulate key; enter the next number (e.g., 12.9) and press the accumulate key again. You should see "2" in the display, which indicates that two numbers have been entered. The sum of these two numbers and the sum of their squares have been automatically calculated. Proceed in this manner for all N numbers. Then press the "statistics" or "sum" keys (or whatever the instruction manual says). You should be able to read the totals and then perform simple subtraction and division to get the total sum of squares.
You can, of course, just key the raw data into MINITAB or SPSS and request various summations.

In the Student Version of MINITAB for Windows, open the Calc menu and select Column Statistics. In the dialog box just check or click the totals and variables you want.

The full version of MINITAB works essentially the same way.
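For readers without such a calculator, the accumulate-key procedure can be imitated in a few lines of Python. This is only a sketch of the idea: the class name StatAccumulator and its methods are hypothetical names of ours, and a real calculator may store its totals differently.

```python
class StatAccumulator:
    """Imitates a calculator's accumulate key: keeps N, sum, and sum of squares."""

    def __init__(self):
        self.n = 0           # how many numbers have been entered
        self.total = 0.0     # running sum of Y
        self.total_sq = 0.0  # running sum of Y squared

    def accumulate(self, y):
        """Enter one number and return the count, as the display would show."""
        self.n += 1
        self.total += y
        self.total_sq += y * y
        return self.n

    def total_sum_of_squares(self):
        """Apply the computing formula to the running totals."""
        return self.total_sq - self.total ** 2 / self.n


acc = StatAccumulator()
acc.accumulate(17.6)
count = acc.accumulate(12.9)
print("Display after two entries:", count)

# Enter the remaining Table 1 values in the same way.
rest = [20.4, 17.6, 20.1, 19.0, 9.7, 23.1, 7.9, 23.6, 15.5, 15.8,
        17.1, 15.5, 19.3, 13.0, 26.0, 17.4, 22.0, 11.4, 23.8]
for y in rest:
    acc.accumulate(y)

print("N:", acc.n)
print("Sum:", round(acc.total, 1))
print("Sum of squares:", round(acc.total_sq, 2))
print("TSS:", round(acc.total_sum_of_squares(), 2))
```

Only three running totals are stored, so the raw data never have to be kept, which is exactly what makes the calculator method practical for long lists.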

Variance:

The variance of a batch of numbers or sample, denoted $\hat{\sigma}^2$, is the total variation divided by N minus 1:

$$\hat{\sigma}^2 = \frac{\text{TSS}}{N - 1} = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N - 1}$$

Example: consider once again the Perot data.

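For these data the sum of the percentages is 368.7, the sum of their squares is 6937.37, and N = 21, so the computing formula gives a TSS of 464.05. Therefore the variance of this batch of numbers is:

$$\hat{\sigma}^2 = \frac{464.05}{21 - 1} = 23.20$$

The larger the variance, the more variation or dispersion in the data. But at this point it is difficult to give an intuitive interpretation to any particular value such as 23.2.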

We'll get an intuitive feel for the dispersion or variation in a moment and soon provide a more formal treatment.

The Standard deviation:

The standard deviation of a batch of numbers or sample, denoted $\hat{\sigma}$, is (loosely speaking) the average size of the deviations from the mean. Since deviations indicate how much variation exists in the data, having an average of these differences tells one about overall variation.
More precisely, it is the square root of the variance.

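To calculate the standard deviation, therefore, first obtain the total sum of squares by summing the squared deviations from the mean as before:

$$\text{TSS} = \sum_{i=1}^{N} (Y_i - \bar{Y})^2$$

Then divide this total by N - 1, where N is the number of cases in the batch.
Finally, take the square root:

$$\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N - 1}}$$

Example: for the Perot data in Table 1 (TSS = 464.05, N = 21) we have simply

$$\hat{\sigma} = \sqrt{\frac{464.05}{20}} = \sqrt{23.20} = 4.82$$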
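The three steps (sum the squared deviations, divide by N - 1, take the square root) can be checked with a short Python sketch. The variable names are ours; the cross-check uses the variance and stdev functions from Python's standard statistics module, which also divide by N - 1.

```python
import math
import statistics

# Percent for Perot, 1992, New Jersey counties (Table 1).
perot = [17.6, 12.9, 20.4, 17.6, 20.1, 19.0, 9.7, 23.1, 7.9, 23.6, 15.5,
         15.8, 17.1, 15.5, 19.3, 13.0, 26.0, 17.4, 22.0, 11.4, 23.8]

n = len(perot)
mean = sum(perot) / n

# Step 1: total sum of squares (squared deviations about the mean).
tss = sum((y - mean) ** 2 for y in perot)

# Step 2: divide by N - 1 to get the variance.
variance = tss / (n - 1)

# Step 3: take the square root to get the standard deviation.
std_dev = math.sqrt(variance)

print("Variance:", round(variance, 2))
print("Standard deviation:", round(std_dev, 2))

# Cross-check against the standard library (both use the N - 1 divisor).
print(math.isclose(variance, statistics.variance(perot)))
print(math.isclose(std_dev, statistics.stdev(perot)))
```

Unlike the variance, the standard deviation is in the original units: the county percentages typically sit a little under 5 percentage points from their mean of about 17.6.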

A SIMPLE GRAPH TO SHOW VARIATION:

For the moment put aside the standard deviation and variance and return to our stem-and-leaf displays and order statistics: the median, the hinges, and the maximum and minimum.

We can combine these summaries in a graph called a box-and-whiskers plot or more simply a box plot.

Drawing the box plot.
See Figure.
1. First, draw a horizontal line to indicate the scale of the variable.
2. Above the line, say about half an inch, draw a small vertical line to indicate the median.
3. Next draw short vertical lines above the scale to indicate the hinges.
4. Make a rectangle with the hinges at the ends. The median will be in the box somewhere.
5. Next, place points or marks of some kind to represent the maximum and minimum values.
6. Connect these points to the hinges with horizontal lines.
7. Figure 2 shows an example.
8. Here's a more detailed view.
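The drawing steps above can be sketched in Python: find the five order statistics, then mark them along a scaled line. Two assumptions are ours. First, the hinges are taken as the medians of the lower and upper halves of the ordered data, with the overall median counted in both halves when N is odd; the rule used in the earlier class may differ slightly. Second, the function names and the plotting characters are purely illustrative.

```python
import statistics

# Percent for Perot, 1992, New Jersey counties (Table 1).
perot = [17.6, 12.9, 20.4, 17.6, 20.1, 19.0, 9.7, 23.1, 7.9, 23.6, 15.5,
         15.8, 17.1, 15.5, 19.3, 13.0, 26.0, 17.4, 22.0, 11.4, 23.8]


def five_number_summary(values):
    """Return (minimum, lower hinge, median, upper hinge, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: (n + 1) // 2]  # includes the median when n is odd
    upper_half = ordered[n // 2:]         # includes the median when n is odd
    return (ordered[0],
            statistics.median(lower_half),
            statistics.median(ordered),
            statistics.median(upper_half),
            ordered[-1])


def ascii_box_plot(values, width=60):
    """Draw a one-line box plot: * extremes, [ ] hinges, | median."""
    lo, h1, med, h2, hi = five_number_summary(values)
    if hi == lo:
        raise ValueError("all values are equal; nothing to plot")

    def col(x):
        # Map a data value to a column on the horizontal scale.
        return round((x - lo) / (hi - lo) * (width - 1))

    line = [" "] * width
    for c in range(col(lo), col(h1)):          # left whisker
        line[c] = "-"
    for c in range(col(h1), col(h2) + 1):      # the box
        line[c] = "="
    for c in range(col(h2) + 1, col(hi) + 1):  # right whisker
        line[c] = "-"
    line[col(lo)] = "*"
    line[col(hi)] = "*"
    line[col(h1)] = "["
    line[col(h2)] = "]"
    line[col(med)] = "|"
    return "".join(line)


summary = five_number_summary(perot)
plot = ascii_box_plot(perot)
print(summary)
print(plot)
```

For the Perot data the summary is a minimum of 7.9, hinges of 15.5 and 20.4, a median of 17.6, and a maximum of 26.0. The box therefore covers the middle half of the counties, and the whiskers show how far the extremes reach.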

We can use the plot to visualize the variation in a batch of numbers. It is especially helpful for comparing the shape, central tendency, and variation in several groups or data sets.

We'll see some examples in class.
Percent Voting Example
Energy Consumption Example
- These two examples show how box plots can be used to compare distributions.

For now here's the Student Version of MINITAB's box plot of the Perot data.

MEASURING RELATIVE CHANGE OR DIFFERENCES:

Here are some data:

Here is another comparison:

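Question: how should relative change or difference be measured? The standard approach is to calculate the percent difference. Most calculators do it this way:

$$\text{Percent difference} = 100 \times \frac{Y_2 - Y_1}{Y_1}$$

where $Y_1$ is the value taken as the base and $Y_2$ is the value compared with it.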

This usual measure of relative change or difference has at least two problems.

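First, it lacks "symmetry" in that the numerical value depends on which number is taken as the base. Thus, for two values $Y_1$ and $Y_2$ we have

$$100 \times \frac{Y_2 - Y_1}{Y_1} \quad \text{with } Y_1 \text{ as the base, but} \quad 100 \times \frac{Y_1 - Y_2}{Y_2} \quad \text{with } Y_2 \text{ as the base,}$$

and the two results differ in size as well as in sign.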

In the case of female employment, we can say either that it is 24.3 percent higher in 1990 than in 1980 or that it was 19.6 percent lower in 1980 than in 1990. Put the data in the formulas to see this for yourself. The point is better illustrated with the homicide rate. We can say either that the rate is 147.6 percent higher for blacks than for whites or that it is about 59.6 percent lower for whites than for blacks. Again, do the calculations. Example:

Second, the usual indicator of relative change is not additive over successive time periods. Notice that in Figure 4 the percent increase in women in the labor force in the 1980 to 1985 period is 12.2% and the increase in the period from 1985 to 1990 is 10.8%. These add to 23.0 percent. But, as we saw before, the overall increase from 1980 to 1990 is 24.3 percent. The two do not match because the 10.8% is figured on the larger 1985 base.
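Both problems can be seen in a small Python sketch. The labor force figures below are hypothetical index values of our own (1980 set to 100), chosen only so that they reproduce the 12.2% and 10.8% increases quoted above; they are not the actual data from the figure. The function name percent_change is ours as well.

```python
def percent_change(base, other):
    """Usual percent difference, with the first argument taken as the base."""
    return 100 * (other - base) / base


# Hypothetical index values (assumption): 1980 = 100, then the two
# period-to-period increases quoted in the text are applied in turn.
y1980 = 100.0
y1985 = y1980 * 1.122
y1990 = y1985 * 1.108

# Problem 1: no symmetry. The answer depends on the base.
up = percent_change(y1980, y1990)    # 1990 compared with 1980
down = percent_change(y1990, y1980)  # 1980 compared with 1990
print("1990 vs 1980:", round(up, 1))
print("1980 vs 1990:", round(down, 1))

# Problem 2: no additivity. Period changes do not sum to the overall change.
first = percent_change(y1980, y1985)
second = percent_change(y1985, y1990)
print("Sum of the two periods:", round(first + second, 1))
print("Overall change:        ", round(up, 1))
```

With these values the two directions give roughly +24.3 and -19.6, and the two period changes sum to 23.0 rather than 24.3.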

For these reasons, some statisticians have proposed alternative measures of relative change or difference. One such proposal has been presented by Leo Törnqvist and his colleagues. They call it the "log percentage."

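The formula is

$$L\% = 100 \times \ln\left(\frac{Y_2}{Y_1}\right)$$

where $Y_1$ and $Y_2$ are the two values being compared.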

We use the natural logarithm for this calculation. Most good calculators have it as one of their functions. It is usually denoted "ln" in contrast to "log," which means log to the base 10. Needless to say, MINITAB has natural logs (the function is loge).

This statistic, denoted L%, avoids the problems mentioned above, even though it is not at first as intuitively obvious as the standard measure. For example, the relative difference between the black and white homicide rates is

$$L\% = 100 \times \ln\left(\frac{Y_{\text{black}}}{Y_{\text{white}}}\right) = 90.67$$

That is, the rate for blacks is log 90.67 percent higher than for whites. If we wanted to find out how much lower the white rate is, we would simply calculate:

$$L\% = 100 \times \ln\left(\frac{Y_{\text{white}}}{Y_{\text{black}}}\right) = -90.67$$

Hence, the rate for whites is log 90.67 percent less than for blacks. Note that the two log percentages are symmetrical; they differ only in sign.
The log percentage increase in women in the labor force from 1980 to 1990 is 21.8%. This is the sum of the log percentage increases over the time periods 1980 to 1985 and 1985 to 1990 (log 11.54% plus log 10.24%). This is illustrated in the following figure.
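The symmetry and additivity of the log percentage follow from two properties of logarithms: ln(a/b) = -ln(b/a), and ln(c/a) = ln(b/a) + ln(c/b). A minimal Python sketch makes the point numerically. As an assumption of ours, it uses hypothetical index values (1980 set to 100) built from the 12.2% and 10.8% ordinary percent increases quoted earlier, so its period-by-period results only approximate the figures in the text, which come from the actual data. The function name log_percent is also ours.

```python
import math


def log_percent(base, other):
    """Log percentage: 100 times the natural log of the ratio other/base."""
    return 100 * math.log(other / base)


# Hypothetical index values (assumption): 1980 = 100, then the ordinary
# percent increases of 12.2% and 10.8% are applied in turn.
y1980 = 100.0
y1985 = y1980 * 1.122
y1990 = y1985 * 1.108

# Symmetry: swapping the base changes only the sign.
forward = log_percent(y1980, y1990)
backward = log_percent(y1990, y1980)
print("1990 vs 1980:", round(forward, 1))
print("1980 vs 1990:", round(backward, 1))

# Additivity: the period changes sum to the overall change.
first = log_percent(y1980, y1985)
second = log_percent(y1985, y1990)
print("Sum of the two periods:", round(first + second, 1))
print("Overall change:        ", round(forward, 1))
```

In Python, math.log with one argument is the natural logarithm, the "ln" key on a calculator.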

POPULATIONS AND SAMPLES:

Population: a well-defined collection of units of analysis such as the American states, the people living in Delaware, the countries in Asia, the students taking this course.
Sample: a subset of the population drawn in some fashion. We can have, for example, a sample of the states, a sample of Delawareans, or a sample of students taking this class.
Simple Random Sample (SRS): a sample drawn from the population in such a way that each member has an equal chance of being included.
Sample sizes, N.

We have briefly discussed sample sizes and will do so in more detail when covering statistical inference.
For now note that a large sample is not by itself what makes results "valid." In fact, N probably has more to do with "reliability," that is, with how much the results would vary from one sample to the next.

Parameter: a statistical characteristic (e.g., mean, standard deviation) of a population. Two major goals of statistics are to estimate population parameters and to test hypotheses about them.

Parameters are usually denoted with Greek letters.

Example: the population mean, usually denoted by $\mu$, is a parameter.

Sample statistic: a characteristic of a sample or batch of values that is generally used to make inferences about the corresponding population parameter. (Often called an estimator.)

Sample statistics are frequently denoted by a Greek letter with a hat over it (e.g., $\hat{\sigma}$), but not always, as in the case of the sample mean, $\bar{Y}$.
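The definitions in this section can be tied together with a small simulation. As an assumption made purely for illustration, the sketch below treats the 21 New Jersey counties in Table 1 as the entire population, so their mean plays the role of the parameter. It then draws simple random samples with Python's random module and uses each sample mean as an estimator. The seed, the number of draws, and all names are our own choices.

```python
import random
import statistics

# Treat the 21 counties in Table 1 as the whole population (an assumption).
population = [17.6, 12.9, 20.4, 17.6, 20.1, 19.0, 9.7, 23.1, 7.9, 23.6, 15.5,
              15.8, 17.1, 15.5, 19.3, 13.0, 26.0, 17.4, 22.0, 11.4, 23.8]

mu = statistics.mean(population)  # the parameter: the population mean
rng = random.Random(815)          # fixed seed so the run can be repeated


def draw_srs(units, n):
    """Simple random sample without replacement: every unit equally likely."""
    return rng.sample(units, n)


def sample_means(units, n, draws=2000):
    """Means of many independent simple random samples of size n."""
    return [statistics.mean(draw_srs(units, n)) for _ in range(draws)]


one_sample = draw_srs(population, 5)
print("Population mean (parameter):", round(mu, 2))
print("One sample of 5:", one_sample)
print("Its mean (sample statistic):", round(statistics.mean(one_sample), 2))

# How much do the estimates vary from sample to sample at each N?
spread = {}
center = {}
for n in (5, 15):
    means = sample_means(population, n)
    center[n] = statistics.mean(means)
    spread[n] = statistics.stdev(means)
    print("N =", n, "average estimate:", round(center[n], 2),
          "spread of estimates:", round(spread[n], 2))
```

At both sample sizes the estimates should center near the population mean, but they should scatter much less at N = 15 than at N = 5. That is the sense in which N bears on "reliability" rather than "validity."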

NEXT TIME:

Distributions

Shapes and properties

Operations and rules of the summation sign.

Go to H. T. Reynolds page.