DEPARTMENT OF POLITICAL SCIENCE

AND

INTERNATIONAL RELATIONS

Posc/Uapp 815



Descriptive Statistics

(Continued)



  1. CLASS 5 AGENDA:
    1. Additional examples of stem-and-leaf displays
    2. Depths
    3. Descriptive statistics: the median, hinges, and the mean
    4. Reading:
      1. Agresti and Finlay, Statistical Methods, pages 40-42, 45-52.
        1. Note properties of the mean and median.


  2. ADDITIONAL EXAMPLES OF STEM-AND-LEAF DISPLAYS:
    1. Here are more examples that show common problems and solutions in constructing stem-and-leaf displays.
    2. General rules:
      1. Use 4 to 10 stems. Generally speaking, a display should not have more than 10 rows or fewer than 5.
      2. Your displays may differ from MINITAB's, since there is no single or "correct" way to draw them.
      3. For listing very low or high values use "LO" and "HI" labels at each end of the scale. Write out the exact values.
      4. What to do when there are too few stems.
        1. Try this: let . = digits 0,1,2,3,4 and * = digits 5,6,7,8,9 (See previous examples)
        2. Or try this: let . = 0,1; T = 2,3; F = 4,5; S = 6,7; and * = 8,9, as in a previous example.
    3. Software:
      1. MINITAB can produce stem-and-leaf displays.
        1. For the Student edition go to Graphs, then Character Graphs, and finally Stem-and-leaf display.
        2. In the full version, click on Statistics, then EDA, and then Stem-and-leaf display.
      2. The SPSS sequence is a little trickier. Go to Statistics, then Basic, then Explore. Pick the variable(s), then go to Plots and make sure stem-and-leaf is checked. You might as well "deselect" the other plots.
    4. Example: Percent voting for Perot in 21 New Jersey counties: Here are the raw data:
      1. The counties are numbered consecutively.






Table 1
Percent for Perot, 1992
New Jersey, by County

 No   Percent for Perot       No   Percent for Perot
  1         17.6              12         15.8
  2         12.9              13         17.1
  3         20.4              14         15.5
  4         17.6              15         19.3
  5         20.1              16         13.0
  6         19.0              17         26.0
  7          9.7              18         17.4
  8         23.1              19         22.0
  9          7.9              20         11.4
 10         23.6              21         23.8
 11         15.5


      1. Figure 1 shows a stem-and-leaf display using this type of labeling.
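
A display of this general kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reproduction of Figure 1 or of MINITAB's output: the values are cut to whole percents, the stem is the tens digit, and each stem is split into a "." row (leaves 0-4) and a "*" row (leaves 5-9), following the rule for too few stems. The function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

# Percent for Perot, 21 New Jersey counties (Table 1)
perot = [17.6, 12.9, 20.4, 17.6, 20.1, 19.0, 9.7, 23.1, 7.9, 23.6, 15.5,
         15.8, 17.1, 15.5, 19.3, 13.0, 26.0, 17.4, 22.0, 11.4, 23.8]

def stem_and_leaf(values, split=True):
    """Return (label, leaves) rows, lowest stem first, leaves in order."""
    rows = defaultdict(list)
    for v in sorted(values):
        whole = int(v)                    # assumption: cut to whole percent
        stem, leaf = divmod(whole, 10)    # tens digit is the stem
        half = 1 if (split and leaf >= 5) else 0
        rows[(stem, half)].append(leaf)
    display = []
    for stem, half in sorted(rows):       # rows with no leaves are skipped
        mark = ("*" if half else ".") if split else ""
        leaves = "".join(str(leaf) for leaf in rows[(stem, half)])
        display.append((f"{stem}{mark}", leaves))
    return display

for label, leaves in stem_and_leaf(perot):
    print(f"{label:>3} | {leaves}")
```

Without splitting, these data would fall on only three stems (0, 1, 2); splitting each stem in two gives five rows, which is within the recommended range.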


    1. Another example:
      1. Per capita energy consumption in 50 states in BTUs in 1993.
        1. The raw data are in the data section of the course page.
      2. We can use this example to demonstrate other topics.




      1. Interpretation: Note that the shape of the distribution is "symmetric" (that is, about half of the observations are above average and half below; see later) and "bell-shaped."
      2. There are 5 "outlying" or large values. Since they are considerably larger than the rest of the data, we give them their own stem called "High."


  1. DEPTHS: COUNTING UP AND DOWN A BATCH
    1. Overview: in order to calculate some summary statistics we need to order the numbers in a batch (that is, arrange them from lowest to highest) and then count relative positions.
    2. A depth is a number that indicates a position from the bottom or top of an ordered batch.
    3. To see how depths can be useful, it is sometimes easiest to rank all of the leaves on each row (stem) from lowest to highest.
    4. Example: the Perot data once again.
      1. This is the same display as Figure 1, but the leaves have been arranged in order starting with the lowest at one end and the highest at the other.
        1. That is, the first "row" is written 7.9 and 9.7 instead of the reverse (9.7 and 7.9).
        2. The leaves on the second and third rows have also been ordered from lowest to highest.
        3. At the top, the leaves descend in order.
    5. Ordering the numbers this way is not necessary but does help us calculate order statistics or statistics based on the order of the data points.


    1. Depth: a data point's position or place in the batch, counting from one end of the scale or another.
      1. Example: consider the batch of ordered data 10 12 45 67 89 107
      2. The "depth" of 12, say, is 2 (because it is the 2nd score from the bottom); 3 is the depth of 45; and so forth.
      3. In Figure 3, the "depth" of 7.9 percent is 1 because it is the first (smallest) number from the bottom.
      4. At the other end, 26 is the maximum value: its depth is thus also 1 because it is the first number counting from the top.
    2. Thus the depth of a number is how far it is from the low or high end of the batch, whichever is closer.
    3. To locate the depth of a value:
      1. Rank order the data from lowest to highest or put them in a stem-and-leaf display in which the leaves are sorted from lowest to highest as in the above example.
      2. Count up from the bottom or low end of the scale if the number is closest to that end, or count down from the top or high end if that is the end the number is closest to.
      3. The number's depth is its position.
    4. One way to annotate a stem-and-leaf display is to write the depth of the leaves on each stem row. In the previous example:




      1. Notice that we count from both ends of the scale.
      2. Note too the brackets in the "middle" stem. This row contains the observation that lies in the middle of the batch. Hence we mark it for special treatment.
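
The counting rule for depths can be sketched as follows, using the small ordered batch 10 12 45 67 89 107 from the example above. The function name `depth` is hypothetical, and the sketch assumes the value actually occurs in the batch.

```python
def depth(batch, value):
    """Depth of `value`: its position counted from the nearer end of the ordered batch."""
    ordered = sorted(batch)
    from_bottom = ordered.index(value) + 1       # 1 = smallest value
    from_top = ordered[::-1].index(value) + 1    # 1 = largest value
    return min(from_bottom, from_top)            # whichever end is closer

batch = [10, 12, 45, 67, 89, 107]
for x in batch:
    print(x, depth(batch, x))
```

Here 12 has depth 2 (counting up from the bottom), while 89 also has depth 2 (counting down from the top).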


  1. STATISTICS BASED ON POSITION:
    1. We can calculate statistics based on depths, that is, on the positions of the data points in the ordered batch.
    2. The statistics are the median (the "middle" value); the upper and lower hinges (the "middle" value of the upper and lower halves of the batch); the upper and lower eighths (the "middle" values of the upper and lower quarters of the batch); and so forth.


  2. MEDIAN:
    1. Definition: the sample median is the numeric value of a variable that marks the middle of a batch of numbers in the sense that half of the observed values are below it and half above. In other words, the median is the number that divides a batch of numbers in half.
    2. Example: 10 20 30 40 50
      1. The median is 30 because half of the numbers are below, half are above 30.
    3. Calculation of the median:
      1. We use depths to find the median.
      2. If N is odd, the depth of the median is $d(M) = \frac{N + 1}{2}$.

        1. Example: if we have 21 cases, the median has a depth of $d(M) = \frac{21 + 1}{2} = 11$.

          1. That is, the 11th value in the ordered batch is the median.
          2. Count down (or up) 11 cases and find the numerical value of the median.
          3. Example: the depth of the median based on the 21 cases in Figure 4 is 11. That is, the median is the 11th number counting from either the top or the bottom.
      1. If N is even, the median is the average of the two middle values:
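          1. First, find the depths of the middle two values: $d(M)_l = \frac{N}{2}$ and $d(M)_u = \frac{N}{2} + 1$. Then average the values found at those two depths.
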
    1. Examples:
      1. N = 5: 103 120 135 194 1007. The depth of the median is $(5 + 1)/2 = 3$.

        1. Thus, the median is the value of the third observation: M = 135
      1. N = 6: 103 120 135 194 1007 1200. The depths of the two middle values are $6/2 = 3$ and $6/2 + 1 = 4$.

        1. Thus, the median is the average of the 3rd and 4th values, namely the average of 135 and 194: $M = (135 + 194)/2 = 164.5$.

      1. This suggests that a median will not always be an actual or observed data point.
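
The two cases can be sketched in code. The sketch takes the depth of the median to be (N + 1)/2 when N is odd, which matches the 21-case example (depth 11), and uses depths N/2 and N/2 + 1 when N is even. The function names `median_depths` and `median_from_depths` are hypothetical.

```python
def median_depths(n):
    """Depth(s) of the median, counted from the bottom of the ordered batch."""
    if n % 2 == 1:
        return ((n + 1) // 2,)         # odd N: one middle position
    return (n // 2, n // 2 + 1)        # even N: two middle positions

def median_from_depths(batch):
    """Average the value(s) found at the median depth(s)."""
    ordered = sorted(batch)
    depths = median_depths(len(ordered))
    middle = [ordered[d - 1] for d in depths]   # depths start at 1, indexes at 0
    return sum(middle) / len(middle)

print(median_depths(5), median_from_depths([103, 120, 135, 194, 1007]))
print(median_depths(6), median_from_depths([103, 120, 135, 194, 1007, 1200]))
```

For the six-value batch the result, 164.5, is not one of the observed data points.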
    1. Median Calculated From Stem-and-Leaf Displays
      1. To find the median, order the values of the stem-and-leaf display and find the depths.
      2. If N is odd the median can be read directly from the stem and leaf display: simply find the value associated with depth d(M).
        1. Consider the Perot data: since N = 21 is odd, the depth of the median is $d(M) = \frac{21 + 1}{2} = 11$.

        1. Here is the stem-and-leaf display with the depths and median, 17, indicated between the parentheses.


      1. If N is even, the median will be the average of the values associated with depths $d(M)_l$ and $d(M)_u$.
      2. Example: energy consumption data


        1. The depths for the median, since N = 50 is even, are $d(M)_l = 50/2 = 25$ and $d(M)_u = 50/2 + 1 = 26$.

        1. Thus the median is the average of the 25th and 26th observations: $M = (300 + 300)/2 = 300$.

        1. Notice that the median is just the average of 300 and 300, or 300.
    1. Properties of the Median
      1. Appropriate for numeric data and ordered grouped data.
      2. Splits the batch (sample values) in half.
      3. The median is generally not equal to the "average" or mean (see below).
      4. Median is a RESISTANT measure of central tendency: that is, it is not "influenced" by "outliers" or extreme values.
        1. Example:

10 20 30 40 50      M = 30    Average (mean) = 30

10 20 30 40 500     M = 30    Average (mean) = 120

      1. Calculating the median requires only the order of the data, not every exact value. So, if some quantities are not known exactly, the median can still be calculated. Suppose, for instance, that the highest per capita consumption of energy was known only to be above 1,000. The median can still be found, since we only need the value at the middle position of the ordered batch and do not have to know exactly what the lowest and highest values are.
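
A short check of these two properties, using Python's standard `statistics` module. The third batch is an assumption made for illustration: the largest value is treated as known only to be "very large," and infinity stands in for it.

```python
import math
from statistics import mean, median

clean = [10, 20, 30, 40, 50]
with_outlier = [10, 20, 30, 40, 500]
# Assumption for illustration: the top value is known only to be very large,
# so infinity is used as a stand-in for its unknown exact value.
top_unknown = [10, 20, 30, 40, math.inf]

print(median(clean), mean(clean))                # median and mean agree
print(median(with_outlier), mean(with_outlier))  # only the mean moves
print(median(top_unknown))                       # median is still computable
```

The median stays at 30 in all three batches, because only the position of the extreme value matters, not its size.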


  1. HINGES:
    1. The median divides the data into two batches with equal numbers. We can in turn divide each of these sets in half.
      1. That is, we can find the "median" of the "lower" batch and the "median" of the "upper" batch.
      2. These order statistics are called hinges.
    2. Example: consider this set of 9 numbers

6 11 17 19 22 30 34 49 67

      1. The depth of the median is just $d(M) = \frac{9 + 1}{2} = 5$.

      1. And the median is the 5th number from the top or bottom, or 22.
      2. Hence 22 divides the batch in half.
    1. Each of these batches can be divided in half.
    2. The depth of the hinge is (formally): $d(H) = \frac{(\text{the integer part of } d(M)) + 1}{2}$

      1. "The integer part" means ignore any fraction if there is one in the depth.
      2. That is, find the number or point that divides the lower and upper batches in half. Example: for this case, where d(M) is 5, the depth of the hinge is $d(H) = \frac{5 + 1}{2} = 3$.

      1. So find the third number from the bottom in the lower batch and the third number from the top in the upper batch. These are the lower and upper hinges.
        1. 6 11 17 19 22 30 34 49 67
      2. The hinges are thus 17 and 34.
    1. Energy consumption example.
      1. The median depth is 25.5, which when used in the formula for the depth of the hinge gives (don't forget to drop the .5 part) $d(H) = \frac{25 + 1}{2} = 13$.


      1. The lower hinge is the 13th number or rate from the bottom, or 280 BTUs per capita, and the upper hinge is the 13th point from the top, or 380 BTUs.
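
The hinge calculation can be sketched as below. The sketch takes the hinge depth to be (integer part of d(M) + 1)/2, which reproduces the depths 3 and 13 in the two examples. One step is an assumption that these notes do not state: when the hinge depth itself ends in .5, the two neighboring values are averaged, as is done for the median. All function names are hypothetical.

```python
import math

def hinge_depth(n):
    """Depth of the hinges for a batch of n values."""
    d_median = (n + 1) / 2                    # may end in .5 when n is even
    return (math.floor(d_median) + 1) / 2     # drop any fraction, add 1, halve

def value_at_depth(ordered, d, from_top=False):
    """Value at depth d; a depth ending in .5 averages the two neighbors (assumption)."""
    seq = ordered[::-1] if from_top else ordered
    lo, hi = math.floor(d), math.ceil(d)
    return (seq[lo - 1] + seq[hi - 1]) / 2

def hinges(batch):
    """Return (lower hinge, upper hinge)."""
    ordered = sorted(batch)
    d = hinge_depth(len(ordered))
    return value_at_depth(ordered, d), value_at_depth(ordered, d, from_top=True)

print(hinge_depth(9), hinges([6, 11, 17, 19, 22, 30, 34, 49, 67]))
print(hinge_depth(50))
```

For the nine-number batch this gives a hinge depth of 3 and hinges of 17 and 34; for N = 50 the hinge depth is 13, as in the energy consumption example.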


  1. THE MEAN:
    1. As is well known, the mean is the "average" value in a batch of data. It is found by summing the scores (summation is denoted by the Greek capital letter sigma, $\Sigma$) and dividing by N, the number of data points or cases in the batch: $\bar{Y} = \frac{\sum Y_i}{N}$

      1. The sample mean is usually denoted with a bar over a letter.
    1. Do not be intimidated by the sigma sign. It simply means "add." The context in which it appears will usually make it clear what is to be added. For example, the notation $\sum Y_i$ means: $Y_1 + Y_2 + \cdots + Y_N$

    1. Note that the mean is not a very "robust" or "resistant" indicator of central tendency, since a few "extreme" cases can affect its numerical value. (See the example above.)
      1. This is one reason income is often reported as a median, rather than mean, value. If everyone, except one or two millionaires, earns about $20,000, the mean may suggest that average income is higher than most of us would think were we to see the total distribution.
      2. A large discrepancy between the mean and the median suggests that the data batch might be "skewed" in some fashion or that it contains a few "outlying" values.
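
To make the summation concrete, here is a minimal sketch that computes the mean by adding the scores one at a time and dividing by N. The income figures are invented purely for illustration (nine people at $20,000 and one at $1,000,000); they are not data from the course.

```python
from statistics import median

def mean(values):
    """Sum the scores (the sigma step), then divide by N."""
    total = 0
    for y in values:          # Y1 + Y2 + ... + YN
        total += y
    return total / len(values)

# Hypothetical incomes, made up for this example only
incomes = [20_000] * 9 + [1_000_000]

print(mean(incomes))      # pulled far upward by the single large income
print(median(incomes))    # stays at the typical income
```

The gap between the two numbers is the kind of discrepancy that signals a skewed batch or a few outlying values.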


  1. NEXT TIME:
    1. Using the median and hinge to draw graphs of the distributions of variables.
      1. Box plots.
    2. Multiple stem-and-leaf displays and boxplots.


Copyright © 1997 H. T. Reynolds