Public Management Statistics Class 14 Notes

DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS

Posc/Uapp 815

MORE ON DISTRIBUTIONS

AGENDA:

The binomial distribution

Bernoulli trials

Sampling distributions

Reading:

Agresti and Finlay, Statistical Methods, Chapter 4, pages 94 to 99; pages 187 to 191.

PROBLEM:

Here's a hypothetical policy issue that Agresti and Finlay suggest on page 191.

I have embellished the example a little bit.

Suppose you are an Equal Employment Office investigator. A group representing women complains to you that a local construction firm refuses to hire women, even for positions requiring no particular gender traits. As a matter of fact, they argue, the company recently expanded and took on 10 new employees, only two of whom were female. They take this result as evidence of discrimination and want you to take the matter to the judicial system.
So, you call the company president who says of the complaint, "Look, we don't discriminate. We hired the people more or less randomly, since the positions don't require any specific skills. It just so happens that only 2 women were selected. That's just a chance result that provides no evidence of discrimination."
Assuming that women constitute 50 percent of the members of the potential labor force in that area, would you accept the president's explanation? If not, why not?

NOTATION AND TERMS:

First some terms, notation, and background.
Think of hiring 10 employees as flipping a coin 10 times. Each time the "coin" lands, one of two results can occur: a female or a male.

Such a random or experimental process, which can produce only one of two possible outcomes (male or female) on each trial, is called a Bernoulli process.

Suppose the process is repeated N times and you are concerned with Y, the number of occurrences of a particular type, such as the number of females in N draws or "flips."
Moreover, suppose the probability of getting one of the two outcomes is p.

According to the axioms and rules of probability, if an experiment can result in just two outcomes, M or F say, and the probability of getting M is p, then the probability of getting F, the other type, is 1 - p = q, because p + q must equal one.

Example: if the probability of getting "heads" is .5, then the probability of getting "tails" is 1 - .5 = .5.

In this instance if men and women are equally represented in the labor force, then the probability of selecting a woman by chance (p) presumably equals .5, as does the probability of drawing a man (q).
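To make the coin-flipping analogy concrete, here is a minimal sketch that simulates 10 hires as independent Bernoulli trials with p = .5. The function name simulate_hires and the seed value are illustrative choices, not part of the original example, and a different seed will give a different mix of men and women.

```python
import random


def simulate_hires(n_hires, p_female, rng):
    """Return 'F'/'M' outcomes from n_hires independent Bernoulli trials."""
    # Each draw is "F" with probability p_female and "M" otherwise.
    return ["F" if rng.random() < p_female else "M" for _ in range(n_hires)]


rng = random.Random(815)  # fixed seed (arbitrary) so the run is repeatable
hires = simulate_hires(10, 0.5, rng)
n_female = hires.count("F")
print(hires, "females:", n_female)
```

Running the simulation many times with different seeds would show how often 2 or fewer women turn up by chance alone.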

Now, suppose Census Bureau data indicate that in fact men and women are equally represented in the potential population of workers from which the company draws its employees.
Given this information can you make a reasonable judgment about the company's hiring practices?

THE BINOMIAL DISTRIBUTION:

Recall the discussion of probability distributions: in this context a distribution pairs scores or values of a variable with the probability of their occurrence.

We drew a simple diagram to illustrate the idea.

A distribution, which a mathematical expression (equation) "creates," associates values of a variable, such as Y or the number of occurrences of something, with the corresponding relative frequency, proportion or probability of those values.

We saw an example in previous notes.
For a variable that is normally distributed the normal probability distribution function pairs values of Y with probabilities.

This distribution has the properties we discussed ad nauseam before.

Example: values relatively far from the mean, say 2 standard deviations, occur with less probability than those that are only, say, one standard deviation from the mean.
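The point about distance from the mean can be checked numerically. The sketch below uses Python's standard-library NormalDist with a standard normal (mean 0, standard deviation 1), which is an assumption made only for illustration; any normal distribution gives the same answers once distances are measured in standard deviations.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1 (chosen for illustration).
z = NormalDist(mu=0.0, sigma=1.0)

# Height of the curve 1 and 2 standard deviations above the mean.
density_at_1sd = z.pdf(1.0)
density_at_2sd = z.pdf(2.0)

# Probability of landing MORE than 1 (or 2) standard deviations from
# the mean in either direction.
beyond_1sd = 2 * (1 - z.cdf(1.0))
beyond_2sd = 2 * (1 - z.cdf(2.0))

print(f"density at 1 SD: {density_at_1sd:.4f}, at 2 SD: {density_at_2sd:.4f}")
print(f"P(more than 1 SD from mean): {beyond_1sd:.4f}")
print(f"P(more than 2 SD from mean): {beyond_2sd:.4f}")
```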

Perhaps we can use these ideas to address the discrimination charge raised earlier.

Binomial distribution:

Given N "trials" or experiments, each of which can result in just one of two possible outcomes, the probabilities of which are p for the first type of outcome and q for the second, the probability of getting exactly Y outcomes of the first type and N - Y outcomes of the second is given by the binomial distribution:

$$P(Y) = \frac{N!}{Y!\,(N-Y)!}\; p^{Y} q^{N-Y}$$

The term N! means "N factorial" and is defined as

$$N! = N \times (N-1) \times (N-2) \times \cdots \times 2 \times 1$$

Hence, Y! means Y times Y-1 times Y-2...times 1.
At this point it is not essential to understand all the details of this function, although they are relatively straightforward.

In words, the function says that given N "trials" or samples from a population in which the probability of getting an outcome of a particular type is p and the probability of the other type is q = 1 - p, the probability of obtaining exactly Y outcomes of the first type is P(Y).
In the present case, we have N = 10 (10 people were picked); the probability that any one of them is female is supposedly p = .5; the probability of getting a male is q = .5.
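The verbal description of the binomial function translates directly into a few lines of code. This is only a sketch: the function name binomial_pmf is illustrative, and math.comb(n, y) is used for the factorial term N!/(Y!(N-Y)!).

```python
from math import comb


def binomial_pmf(y, n, p):
    """P(Y = y): exactly y outcomes of the first type in n independent trials."""
    q = 1 - p  # probability of the other outcome
    # comb(n, y) equals n! / (y! * (n - y)!)
    return comb(n, y) * p**y * q ** (n - y)


# The case at hand: N = 10 hires, p = .5, Y = 2 females.
print(round(binomial_pmf(2, 10, 0.5), 4))
```

Summing binomial_pmf(y, 10, 0.5) over y = 0, 1, ..., 10 gives 1, as any probability distribution must.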

We are assuming that each draw from this population is independent of the other selections.
That is, the chance of drawing a female stays the same for each selection.

Furthermore, the possible results or outcomes of these 10 draws are 10 males; 9 males and 1 female; 8 males and 2 females; and so on up to 10 females.

The distribution provides the probability of getting any of these results, assuming that p = .5 and that the selections are independently drawn at random.
That is, successively substitute Y = 0, Y = 1, Y = 2 and so on into the formula:

$$P(Y) = \frac{10!}{Y!\,(10-Y)!}\; (.5)^{Y} (.5)^{10-Y}$$

If we successively insert the possible values of Y into the equation, we can find the probability of getting that many females in 10 draws or selections, assuming that the draws are made randomly and independently from a population in which the probability of selecting a female is .5.
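Here are those probabilities:

Y (females)    P(Y)
 0             .0010
 1             .0098
 2             .0439
 3             .1172
 4             .2051
 5             .2461
 6             .2051
 7             .1172
 8             .0439
 9             .0098
10             .0010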

That is, the probability of getting 0 females in 10 random selections, when the probability of drawing a female on any one draw is .5, is .0010; the chance of getting just one female is .0098; the probability of 2 females out of 10 (given that the proportion in the population is .5) is .0439.
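As a check on the arithmetic, the value for Y = 2 can be worked out by hand from the binomial function with N = 10 and p = q = .5:

```latex
P(2) = \frac{10!}{2!\,8!}\,(.5)^{2}(.5)^{8}
     = \frac{10 \times 9}{2 \times 1}\,(.5)^{10}
     = 45 \times \frac{1}{1024}
     \approx .0439
```

The factor 45 counts the different ways of placing the 2 women among the 10 hires; each particular arrangement has probability 1/1024.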

What these results show is that if the company really selected employees at random from a labor force consisting of half men and half women and hired only 2 females, then it accomplished something rather unusual, for the chance of exactly this result is less than 5 in 100.
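For readers who want to reproduce the numbers, the sketch below computes the whole distribution for N = 10 and p = .5. It also adds up the chances of 0, 1, or 2 females, a cumulative figure that the notes do not use but that is often of interest; the variable names are illustrative.

```python
from math import comb

N, P_FEMALE = 10, 0.5

# P(Y = y) for every possible number of females, y = 0, 1, ..., 10.
probs = [comb(N, y) * P_FEMALE**y * (1 - P_FEMALE) ** (N - y) for y in range(N + 1)]

for y, pr in enumerate(probs):
    print(f"{y:2d}  {pr:.4f}")

# Chance of 2 OR FEWER females (Y = 0, 1, or 2) under random hiring.
p_two_or_fewer = sum(probs[:3])
print(f"P(Y <= 2) = {p_two_or_fewer:.4f}")
```

Even counting the more extreme outcomes (0 or 1 female), the result remains uncommon under purely random selection, at roughly 5 or 6 chances in 100.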

AN ALTERNATIVE EXPLANATION:

Suppose that in fact the company's selection process purposely or inadvertently favored men over women so that the chance of drawing a male was .7 and a female .3.

Note again: p + q must equal 1, since we are dealing with a Bernoulli process. (That is, only one of two things can happen on any trial or attempt or drawing.)

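In these circumstances the formula becomes:

$$P(Y) = \frac{10!}{Y!\,(10-Y)!}\; (.3)^{Y} (.7)^{10-Y}$$

If we substitute all the possible values of Y into this formula, we get

Y (females)    P(Y)
 0             .0282
 1             .1211
 2             .2335
 3             .2668
 4             .2001
 5             .1029
 6             .0368
 7             .0090
 8             .0014
 9             .0001
10             .0000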

Now we see that Y = 2 occurs fairly frequently; that is, the probability that Y = 2 is fairly large, namely .2335.

We would expect a random drawing from a population in which the proportion of females (or the chance of getting a female) is .3 to yield exactly 2 women about a quarter of the time.
That's a fairly probable result given all the conditions specified. It is certainly more probable than if the conditions specified originally held.
So we might conclude that the company is in fact discriminating. After all, it obtained an "unusual" result under the hypothesis of fairness but an understandable one under the hypothesis of discrimination.
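The comparison between the two explanations can be put side by side in code. This is a rough sketch, not a formal test: it simply evaluates the binomial function at Y = 2 under p = .5 (fair hiring) and p = .3 (hiring that favors men), then takes the ratio. The names binomial_pmf, p_fair, and p_biased are illustrative.

```python
from math import comb


def binomial_pmf(y, n, p):
    """P(Y = y) for n independent trials with probability p of a female."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)


N_HIRED, N_FEMALE = 10, 2

p_fair = binomial_pmf(N_FEMALE, N_HIRED, 0.5)    # president's account
p_biased = binomial_pmf(N_FEMALE, N_HIRED, 0.3)  # alternative explanation

# How many times more probable is the observed result under p = .3?
ratio = p_biased / p_fair
print(f"fair: {p_fair:.4f}  biased: {p_biased:.4f}  ratio: {ratio:.1f}")
```

The observed result is several times more probable under the second hypothesis, which is the informal reasoning used in the paragraph above.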

This line of thinking underlies hypothesis testing, which will be covered soon.

SAMPLING DISTRIBUTIONS:

So far we have dealt with two types of distributions:

Empirical or frequency: it shows the number of cases or observations actually observed for each value (or interval of values) of some variable.

We noted that the shape of these distributions can take many forms such as bell shaped or skewed to the right or left.

Probability: it gives the expected probability of observing various values or intervals of values, given that some set of conditions is true or holds.

So if a random variable has a probability of .3 of being between, say, 100 and 105, then an empirical frequency distribution of 1,000 cases should have about (1,000)(.3) = 300 cases in that interval.

It won't contain exactly 300 observations because of chance.

The binomial and normal distributions are examples: they show the probabilities of various scores.
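The link between a probability and an empirical frequency can be illustrated with a small simulation. In this sketch each of 1,000 cases independently falls in the interval with probability .3; the seed and variable names are arbitrary choices made for illustration.

```python
import random

rng = random.Random(42)  # fixed seed (arbitrary) so the run is repeatable

n_cases, p_interval = 1000, 0.3

# What the probability distribution leads us to expect.
expected = n_cases * p_interval

# What one simulated "empirical" sample actually produces.
observed = sum(1 for _ in range(n_cases) if rng.random() < p_interval)

print("expected:", expected, "observed:", observed)
```

The observed count should land near 300 but will rarely equal it exactly, which is the chance variation mentioned above.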

Sampling distribution.

A sampling distribution is a particular kind of probability distribution. Or more exactly, it's a particular application of a probability distribution.

Example:

Before going into any more detail consider this case: I asked each of you to draw a random sample of 10 counties from the "population" of American counties and to obtain a sample mean for these 10 cases.
Consequently all 33 of you collected sample means that were estimators of the true or population mean, which by the way was 7.33.
The table below shows a stem-and-leaf plot of these 33 estimates.

Thus, someone's estimate based on N = 10 counties was 10.3; another person's estimate was 10.1 and so forth. Note that people's estimates ran as high as 10.3 and as low as 4.6, even though the true value is 7.33.
Hence there was variation in the estimates: most were somewhere in the middle, in the range of, say, 6.5 to 7.5. But many estimates fell above or below those values.
We conclude that if one person (or a group of people) takes repeated, independent samples from some population having some parameter, theta, of interest, the estimates of that parameter can vary widely.
The estimates follow a distribution that depends on the size of the samples, N, the form of the population distribution, and the value of the parameter.

Another example, the distribution of sample means when N = 100 was

Note that again the estimates vary: some are above 7.33; some are below.
So once again they are distributed. This time, however, the distribution is based on samples of size N = 100, so the estimates don't vary as much as in the previous instance where N = 10.
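The class exercise can be imitated on a computer. The sketch below is built on assumptions: the county data are not reproduced in these notes, so it invents a skewed "population" of 3,000 values whose mean is near 7.33, and it draws 1,000 samples of each size rather than the 33 the class collected. Only the qualitative pattern (less spread at N = 100 than at N = 10) should be taken from it.

```python
import random
import statistics

rng = random.Random(2024)  # fixed seed (arbitrary) so the run is repeatable

# HYPOTHETICAL population: 3,000 skewed values with mean near 7.33.
# These are NOT the actual county data used in class.
population = [rng.expovariate(1 / 7.33) for _ in range(3000)]
pop_mean = statistics.mean(population)


def sample_means(n, repeats):
    """Draw `repeats` random samples of size n; return each sample's mean."""
    return [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]


means_10 = sample_means(10, 1000)
means_100 = sample_means(100, 1000)

# The standard deviation of the estimates measures how much they vary.
spread_10 = statistics.stdev(means_10)
spread_100 = statistics.stdev(means_100)

print(f"population mean:       {pop_mean:.2f}")
print(f"spread of means, N=10:  {spread_10:.2f}")
print(f"spread of means, N=100: {spread_100:.2f}")
```

In both cases the estimates center on the population mean, but those based on the larger samples cluster much more tightly around it.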

Definition: a sampling distribution is a probability distribution that shows the relationship between possible values of a sample statistic, such as the sample mean based on a sample of size N, and the probability of those values given that various conditions hold.

Example: sample means based on N cases drawn from a population with mean = mu will be distributed in a particular way. That distribution is called the sampling distribution of the mean.
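For reference, a standard result about this distribution can be written compactly. It is not derived in these notes, and it assumes independent random samples of size N from a population with mean mu and standard deviation sigma; the symbols \bar{Y} (the sample mean) and \sigma_{\bar{Y}} (its standard deviation, often called the standard error) are notation introduced here.

```latex
E(\bar{Y}) = \mu,
\qquad
\sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{N}}
```

Under these assumptions, moving from N = 10 to N = 100 shrinks the spread of the sample means by a factor of \sqrt{10}, or about 3.2, which is in line with the tighter clustering of the N = 100 estimates noted earlier.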

See Agresti and Finlay, pages 99 to 100.

Note that Tables 3 and 4 are not sampling distributions: they are empirical distributions that I created to illustrate a point. A sampling distribution is a mathematical equation.

NEXT TIME:

Some final remarks on distributions
Explaining variation: relationships among variables

Go to H. T. Reynolds page.