GEOG 2P12 Lecture 33-36
Jeff Boggs
Brock University
TESTING HYPOTHESES ABOUT
SAMPLE MEANS:
t-TESTS
Readings:
• De Veaux et al 2006, Chapters 23-25
I. What is a t-test?
A. Parametric test
B. t-sampling distribution model
C. Used to construct confidence intervals and
test hypotheses about sample means.
D. Has degrees of freedom
II. When can we use a t-test?
III. Testing hypotheses about sample means...
A. One sample mean to a population mean
B. Two independent sample means
C. Two paired sample means
WHAT IS A t-TEST?
T-tests are hypothesis tests of sample means.
Like the Z-test, this is a parametric test.
• Parametric tests make assumptions about the
shape of the underlying population from
which the sample is drawn.
• Shape of Z-distribution is always Normal.
• As n increases, shape of t-distribution
becomes Normal.
t-test based on a t-sampling distribution model.
• t-distribution also used to construct
confidence intervals around sample means
(see De Veaux et al 523, 548 and 579 for
formulas).
CONFIDENCE INTERVALS USING t-DISTRIBUTIONS
One-sample t-interval (De Veaux et al 2005: 523)
X-bar ± t*(n-1) × SE(X-bar)
where the standard error of the mean is SE(X-bar) = s / √n.
The critical value t*(n-1) depends on the specified α-level and
degrees of freedom.
Two-sample t-interval (De Veaux et al 2005: 548)
(X-bar1 - X-bar2) ± t*(df) × SE(X-bar1 - X-bar2)
where the standard error of the difference of the means is
SE(X-bar1 - X-bar2) = √(s1²/n1 + s2²/n2).
The critical value t*(df) depends on the specified α-level and
degrees of freedom.
Paired t-interval (De Veaux et al 2005: 579)
d-bar ± t*(n-1) × SE(d-bar)
where the standard error of the mean difference is SE(d-bar) = s_d / √n,
with s_d the standard deviation of the paired differences.
The critical value t*(n-1) depends on the specified α-level and
degrees of freedom.
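A minimal Python sketch of the one-sample t-interval, for readers who want to check the arithmetic. The function name `t_interval` is hypothetical. The sample numbers are borrowed from the Twelve Mile Creek example later in these notes, and the choice of a 98% interval (0.01 in each tail, so t* = 2.896 at d.f. = 8) is an assumption made only for illustration.

```python
import math

def t_interval(mean, s, n, t_star):
    """Return (lower, upper) for mean ± t* × SE, where SE = s / sqrt(n)."""
    se = s / math.sqrt(n)      # standard error of the mean
    margin = t_star * se       # half-width of the interval
    return mean - margin, mean + margin

# Creek sample: n = 9, X-bar = 137 ppm, s = 42 ppm.
# t* = 2.896 is the t-table value for d.f. = 8 with 0.01 in each tail,
# so this is a 98% interval (an illustrative choice, not from the slides).
lower, upper = t_interval(mean=137, s=42, n=9, t_star=2.896)
print(f"98% t-interval: {lower:.1f} to {upper:.1f} ppm")
```

The critical value is passed in by hand because these notes read it from a printed t-table rather than computing it.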
t-tests differ from Z-tests in three ways
1. t-tests are based on a t-distribution; Z-tests on the Z-distribution.
• When n is small, t-distributions have fatter tails –> the fewer
the degrees of freedom, the fatter the tails of
the t-distribution.
2. t-tests are used when you do not know or cannot
infer the population standard deviation, σ.
• NB: Z-tests used w/ proportions because we
can calculate the SD(p) from p.
• t-tests used to test hypotheses about sample
means. We cannot calculate σ from X-bar.
3. t-tests and t-distributions are characterized by
degrees of freedom a.k.a. d.f.
DEGREES OF FREEDOM?
If we do not know σ, we use the standard
error of the mean, a.k.a. SE(X-bar), and d.f. to
[1] calculate appropriate critical bounds for
failing to reject & rejecting null hypotheses
(i.e., testing a hypothesis), and [2]
construct confidence intervals around
sample means.
If we do not use d.f., we underestimate the standard
error of the mean, SE(X-bar). An
underestimate of SE(X-bar) makes it easier to
reject H0. And that is bad.
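One way to see why ignoring d.f. makes it easier to reject H0 is to compare t-table critical values with the Z critical value at the same alpha-level. The sketch below is illustrative only. The values for d.f. = 3 and d.f. = 4 are the ones used in the examples in these notes; the value for d.f. = 8 is the standard t-table entry. The Z value comes from Python's standard library.

```python
from statistics import NormalDist

# One-tailed critical values at alpha = 0.05 from a standard t-table.
# d.f. = 3 and d.f. = 4 appear in the examples in these notes;
# d.f. = 8 is the usual table entry.
T_CRIT_05 = {3: 2.353, 4: 2.132, 8: 1.860}

# Z critical value for the same one-tailed alpha (about 1.645).
z_crit = NormalDist().inv_cdf(1 - 0.05)

for df, t_crit in sorted(T_CRIT_05.items()):
    # The gap shrinks as d.f. grows: the t-distribution approaches the Normal.
    print(f"d.f. = {df}: t-crit = {t_crit:.3f}, exceeds Z-crit by {t_crit - z_crit:.3f}")
```

With few degrees of freedom the t-table sets a higher bar than Z does, so using Z by mistake places the critical region too close to the centre.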
WHEN CAN WE USE A T-TEST?
Must meet four conditions:
1. Random sampling condition:
2. Independence Assumption:
3. 10% Condition:
4. Nearly Normal Condition: The population
from which we sample is approximately
normally distributed. –> check histogram.
• If n < ...
• two independent sample means –> compare X-bar1 to
X-bar2 to see if both could come from same
population.
• two paired sample means –> compare
changes in a sample before and after a
treatment.
• more than two sample means requires an F-test, something your textbook discusses in
Chapter 28 (on the CD only).
ONE SAMPLE MEAN T-TEST
One sample t-test for the mean (De Veaux et
al 2005: 527)
t(n-1) = (X-bar - μ0) / SE(X-bar)
where the standard error of the sample mean
is SE(X-bar) = s / √n.
Associated Hypothesis Test:
H0: μ = μ0
HA: μ ≠ μ0 if a two-tailed test
HA: μ > μ0 or HA: μ < μ0 if a one-tailed test.
EXAMPLE: After 4th year internship with the
Ministry of the Environment, you get a job
monitoring pollutants discharged into 12 Mile
Creek, home of the magnificent Canada Goose.
If the creek is too polluted (i.e., has 100 parts
per million of DDT or more) Geese lay eggs
with weak shells. If that happens, you must
locate eggs and warm them artificially.
• Thus, if the pollutants reach 100 parts
per million of DDT or more, then eggs are weak-shelled.
Your boss wants to know if she needs to start
interviewing anyone to take care of weak-shelled goose eggs. –> wants to know if Twelve
Mile Creek is too polluted or acceptable.
So you go and collect 9 random samples along
12 Mile Creek. X-bar = 137 ppm, s = 42 ppm.
What are hypotheses?
H0: μ = 100 ppm diesel contamination
HA: μ < 100 ppm diesel contamination
What is the alpha level?
Your boss will fire you if you screw this up.
You love your job, a point evident in your HA.
To be safe, set α = 0.01. This is a one-tailed t-test, with critical region in left-tail.
DRAW t-DIAGRAM
What is the corresponding critical t-score?
We look at the t-table on page A-58, and
realize we need to calculate degrees of
freedom. Since n= 9, our d.f.= 8. The
corresponding t-crit = -2.896.
What is our t-test statistic?
SE(X-bar) = s / √n = 42 / √9 = 14
t = (X-bar - μ0) / SE(X-bar) = (137 - 100) / 14 = 2.64
Reject or fail to reject null hypothesis?
We compare t-critical of -2.896 w/ calculated t-score of 2.64, and realize our sample mean is
nowhere near critical region. Thus, we fail to
reject the null hypothesis.
What does this mean to your boss?
You tell your boss that, at α = 0.01, your samples give
no evidence that the diesel levels are below 100 parts per
million, so she better start interviewing people
to collect and take care of goose eggs.
You also tell her that a test at this α-level wrongly rejects
a true null hypothesis one time out of 100.
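The creek example can be re-run as a short Python sketch. The function name `one_sample_t` is hypothetical; the inputs and the critical value of -2.896 are the ones given in the example above.

```python
import math

def one_sample_t(x_bar, mu_0, s, n):
    """t statistic for a one-sample mean test: (x_bar - mu_0) / (s / sqrt(n))."""
    se = s / math.sqrt(n)          # standard error of the sample mean
    return (x_bar - mu_0) / se

t_stat = one_sample_t(x_bar=137, mu_0=100, s=42, n=9)
t_crit = -2.896  # left-tail critical value from the t-table, d.f. = 8, alpha = 0.01

# Left-tailed test: reject H0 only if the statistic falls below the critical value.
reject = t_stat < t_crit
print(f"t = {t_stat:.2f}, reject H0: {reject}")
```

Because the critical region sits in the left tail, a positive t-score can never land in it, whatever its size.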
TWO SAMPLE MEAN T-TEST
Two independent sample means t-test (De
Veaux et al 2005: 552)
t(df) = ((X-bar1 - X-bar2) - Δ0) / SE(X-bar1 - X-bar2)
X-bar1 = mean from first sample
X-bar2 = mean from second sample
Δ0 = hypothesized difference between the two
population means. This is often zero.
The standard error of the difference between
the sample means:
SE(X-bar1 - X-bar2) = √(s1²/n1 + s2²/n2)
Associated Hypothesis Test:
H0: μ1 - μ2 = Δ0
HA: μ1 - μ2 ≠ Δ0 if two-tailed test
HA: μ1 - μ2 > Δ0 OR
HA: μ1 - μ2 < Δ0 if one-tailed test
Degrees of freedom for two independent
sample means (De Veaux et al 2005: 547,
547f)
Simple way: the smaller of n1 - 1 and n2 - 1.
Precise way:
d.f. = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 - 1) + (s2²/n2)²/(n2 - 1) ]
The simple way usually underestimates the
number of degrees of freedom you would get
from the more complicated formula.
Minitab uses the precise way.
Thus, the simple way usually makes it harder to
reject the null hypothesis.
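The two ways of finding degrees of freedom can be compared in a few lines of Python. This sketch assumes that the "precise way" is the Welch-Satterthwaite approximation, which is the usual formula when the two sample variances are not pooled. The function names are hypothetical, and the sample values are borrowed from the two-creek example later in these notes.

```python
def simple_df(n1, n2):
    """Conservative d.f.: the smaller sample size minus one."""
    return min(n1, n2) - 1

def precise_df(s1, n1, s2, n2):
    """Welch-Satterthwaite approximation (assumed to be the 'precise way')."""
    v1 = s1 ** 2 / n1   # squared standard error contributed by sample 1
    v2 = s2 ** 2 / n2   # squared standard error contributed by sample 2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# West Creek: n = 8, s = 12 ppm. East Creek: n = 4, s = 24 ppm.
df_simple = simple_df(8, 4)
df_precise = precise_df(12, 8, 24, 4)
print(f"simple d.f. = {df_simple}, precise d.f. = {df_precise:.2f}")
```

The precise value is generally not a whole number; statistical software works with it directly, while a printed t-table needs it rounded down.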
EXAMPLE: Your boss promotes you to
investigate surface water pollution throughout
Niagara Region.
• Abandoned refinery is located on level
ground halfway between two creeks. Its
storage tanks leak diesel fuel, contaminating
at least one of these creeks. You wonder if
only one or both creeks are contaminated.
• Ground is level; geology is uniform; thus,
each creek should be equally contaminated.
• If not equally contaminated, might mean
only some tanks leak, which might mean a
less expensive clean-up.
• Random samples along each of these creeks
within one kilometer downstream of the
abandoned tank park –> West Creek: n=8, w/
X-bar = 45 ppm, s = 12 ppm. East Creek:
n=4, w/ X-bar = 53 ppm, s = 24
ppm.
What are the hypotheses?
H0: μ1 = μ2 –> diesel contamination same in
each creek
HA: μ1 ≠ μ2 –> diesel contamination differs
between creeks
What is the alpha-level?
You are sure both creeks are dreadfully
contaminated; not sure if much can be done to
remove contamination. You are not really sure
of the consequences of being wrong. You set
α = 0.1. Two-tailed test, thus each tail = α/2 = 0.05.
What is the corresponding critical t-value?
See t-table on page A-58 –> need to calculate
degrees of freedom. Can calculate d.f.
precisely or simply. We choose simple; means
we are even more likely to fail to reject H0. n2
is smaller sample size; thus, d.f. = n2 - 1 = 4 - 1 = 3.
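What is the resulting t-score?
First, calculate the standard error of the
difference of the two means:
SE(X-bar1 - X-bar2) = √(s1²/n1 + s2²/n2) = √(12²/8 + 24²/4) = √(18 + 144) = √162 = 12.73
Then, calculate the t-score:
t = ((X-bar1 - X-bar2) - Δ0) / SE(X-bar1 - X-bar2) = ((45 - 53) - 0) / 12.73 = -0.63
Reject or fail to reject null hypothesis?
Fail to reject the null since -0.63 falls well
within t-crit of ± 2.353.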
What does this mean to your boss?
We have no evidence that the two creeks differ in contamination. The test was run
at α = 0.1, so it wrongly rejects a true null one time in ten.
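The two-creek calculation can be checked with a short Python sketch. The function name `two_sample_t` is hypothetical; the sample values and the critical value of ±2.353 (d.f. = 3, α = 0.1, two-tailed) come from the example above.

```python
import math

def two_sample_t(x_bar1, s1, n1, x_bar2, s2, n2, delta_0=0.0):
    """Return (t, SE) for two independent sample means, variances not pooled."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)   # SE of the difference of means
    t = ((x_bar1 - x_bar2) - delta_0) / se
    return t, se

# West Creek: n = 8, X-bar = 45, s = 12. East Creek: n = 4, X-bar = 53, s = 24.
t_stat, se = two_sample_t(45, 12, 8, 53, 24, 4)
t_crit = 2.353  # from the t-table, d.f. = 3, 0.05 in each tail

# Two-tailed test: reject H0 only if |t| exceeds the critical value.
reject = abs(t_stat) > t_crit
print(f"SE = {se:.2f}, t = {t_stat:.2f}, reject H0: {reject}")
```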
T-TEST FOR TWO PAIRED SAMPLE
MEANS
Sometimes we have two samples that are not
independent, but that we wish to compare.
• For instance, we evaluate a weight loss
program by weighing clients before and
after ten weeks of exercise. These
observations are not independent: one’s
prior weight affects one’s later weight.
• Such sampling designs are called matched
pair sampling designs.
• The random sampling happens when we first
select elements to measure.
If the sample sizes are not equal, then the
observations cannot be matched. Matching
implies that the number of pairs is constant.
The formula for degrees of freedom is n–1.
Paired sample means t-test (De Veaux et al
2005: 576)
t(n-1) = (d-bar - Δ0) / SE(d-bar)
d-bar = the mean of the pairwise differences
Δ0 = hypothesized mean of the pairwise
difference. This is usually zero.
where the standard error of the mean
difference is SE(d-bar) = s_d / √n, with s_d the
standard deviation of the pairwise differences.
Associated Hypothesis Test:
H0: μd = Δ0, where Δ0 is almost always zero.
HA: μd ≠ Δ0 if a two-tailed test
HA: μd > Δ0 OR
HA: μd < Δ0 if a one-tailed test.
EXAMPLE: At the Ministry of the
Environment, you become waterfowl specialist
of Region Niagara. One day, a tanker truck
filled with diesel fuel flips and spills contents
into a watershed. Shortly afterwards, Ducks
Unlimited Canada contacts your office, and
complains that the quacking of the ducks in that
watershed can scarcely be heard.
• You call a biogeographer from McMaster who
researches and records duck vocalizations.
• This biogeographer has tagged all the
watershed’s ducks with a unique ID. She has a
device that simultaneously records each
duck’s quack and ID number.
She notes that all the ducks in her random
sample seem to be there. So the population
has not changed. However, their collective
quacks have become mysteriously quieter
since the diesel fuel spilled into the
watershed. She shares her data on ducks’
quacks from before and after the spill:
Duck ID    Loudest Quack in decibels (db)    Paired Differences
           Before Spill    After Spill
0032A2     35              23                35-23 = 12
0036X3     46              35                46-35 = 11
0931GS     65              70                65-70 = -5
Qckr037    55              42                55-42 = 13
0012B2     65              36                65-36 = 29
Mean of the paired differences –> d-bar =
(12+11-5+13+29)/5 = 60/5 = 12 decibels
Let’s calculate the standard deviation of the
paired differences:
Duck ID    Paired differences (di)    (di - d-bar)     (di - d-bar)²
0032A2     12                         12-12 = 0        0² = 0
0036X3     11                         11-12 = -1       (-1)² = 1
0931GS     -5                         -5-12 = -17      (-17)² = 289
Qckr037    13                         13-12 = 1        1² = 1
0012B2     29                         29-12 = 17       17² = 289
Σ(di - d-bar)² = 580
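The table can be reproduced in Python as a check on the hand arithmetic. This is only a sketch; the variable names are hypothetical, and the decibel readings are the ones in the duck table above. Note that `statistics.stdev` divides by n - 1, which is the sample standard deviation needed here.

```python
from statistics import mean, stdev

# Loudest quack in decibels, listed in the same duck order as the table.
before = [35, 46, 65, 55, 65]
after = [23, 35, 70, 42, 36]

diffs = [b - a for b, a in zip(before, after)]   # paired differences d_i
d_bar = mean(diffs)                              # mean of the differences
sum_sq = sum((d - d_bar) ** 2 for d in diffs)    # sum of squared deviations
s_d = stdev(diffs)                               # sqrt(sum_sq / (n - 1))

print(f"d-bar = {d_bar}, sum of squares = {sum_sq}, s_d = {s_d:.2f}")
```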
• H0 –> diesel spill had no effect on ducks’
loudness of quacks measured in decibels.
Thus, H0: μd = 0.
• HA –> diesel spill had adverse effect on
ducks, specifically, that their loudness has
decreased. Thus, HA: μd > 0.
• Remember: the mean difference will be
positive if the initial values were higher, but
are now lower.
• Conversely, the mean difference will be
negative if the initial values were lower, but
are now higher.
So let’s conduct the hypothesis test:
What are hypotheses?
H0: μd = 0 –> ducks’ loudness unchanged
since spill
HA: μd > 0 –> ducks’ loudness diminished
since spill
What is alpha-level?
Ducks Unlimited Canada people pretty upset,
since they fear they might not have enough
diesel-free ducks for their needs. You also like
ducks, so you set α = 0.05. This is a one-tailed
test, with critical region in the right-hand tail.
What is corresponding critical t-value?
We look at the t-table on page A-58, and
realize we need to calculate degrees of
freedom. n = 5, so d.f. = n–1 = 5-1 = 4. The
corresponding t-crit is 2.132.
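What is the resulting t-score?
First, calculate the standard error of the
difference:
s_d = √(Σ(di - d-bar)² / (n - 1)) = √(580 / 4) = 12.04
SE(d-bar) = s_d / √n = 12.04 / √5 = 5.39
Then, calculate the t-score:
t = (d-bar - Δ0) / SE(d-bar) = (12 - 0) / 5.39 = 2.23
Reject or fail to reject null hypothesis?
Reject the null since 2.23 falls into the critical
region which starts at t-crit of 2.132.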
What does this mean?
The ducks are significantly quieter, and you
are wrong one time in twenty.
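The paired test can be checked from the summary numbers alone. In this sketch the function name `paired_t` is hypothetical; d-bar = 12, the sum of squared deviations of 580, n = 5 and t-crit = 2.132 are all taken from the example above.

```python
import math

def paired_t(d_bar, s_d, n, delta_0=0.0):
    """Return (t, SE) for a paired-sample test on the mean difference."""
    se = s_d / math.sqrt(n)            # standard error of the mean difference
    return (d_bar - delta_0) / se, se

n = 5
s_d = math.sqrt(580 / (n - 1))         # standard deviation of the paired differences
t_stat, se = paired_t(d_bar=12, s_d=s_d, n=n)
t_crit = 2.132                         # from the t-table, d.f. = 4, alpha = 0.05, right tail

# Right-tailed test: reject H0 only if t exceeds the critical value.
reject = t_stat > t_crit
print(f"s_d = {s_d:.2f}, SE = {se:.2f}, t = {t_stat:.2f}, reject H0: {reject}")
```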
An Overview of Hypothesis Testing
Jeff Boggs
Assistant Professor
Department of Geography
Brock University
St. Catharines, Ontario
15 October 2007
Revised 1 January 2009
Most introductory statistics books discuss one or two approaches to hypothesis testing: the
classical approach and the p-value approach. In turn, both of these approaches combine elements
of two earlier approaches to hypothesis testing: the Neyman-Pearson approach, and Fisher’s
approach (Armstrong 2005).
The classical approach to hypothesis testing was developed before computers were in
widespread use. As a result, all you need to conduct a hypothesis test are the tables that
correspond to the underlying population model, be it Z, t, and so on. The p-value approach
requires that you have access to either a sophisticated scientific calculator or a computer and
statistical software.
I teach the classical approach. The textbook teaches the p-value approach.
The Classical Approach to Hypothesis Testing
There are a few steps here, and different sources will sort them slightly differently.
STEP 0: Become curious about something.
Ultimately, curiosity motivates us to bother to test hypotheses. This curiosity eventually
manifests as a claim. Most textbooks leave out this step.
STEP 1: Make a claim.
Normally this claim is supported by some larger body of scholarly literature as well as knowledge
of the topic at hand. For instance, as an economic geographer I know there is a large body of
scholarly literature that examines the relationship between regional development, worker
productivity, and wage rates. In general, this literature claims that places with more productive
workers tend to also be places with higher wage rates. I also know something about the Niagara
Region as a result of living here (and looking at Statistics Canada data). For the sake of
argument, let us say that St. Catharines has a sizeable number of manufacturing employees who
are relatively well paid. Let us also say that Niagara Falls has many employees in the tourism
sector (e.g., casino workers, chefs, wait staff, and hotel staff) who tend to receive lower wages.
However, I don’t really know if these differences in wages are all that pronounced between the
two locations. I claim there is probably a difference in wages.
STEP 2: Convert this claim into a null and alternate hypothesis.
I claim that there is probably a difference in wages between St. Catharines and Niagara Falls.
This claim is called the alternate hypothesis. The claim that there is no difference in wages
between these two locations is the null hypothesis.
STEP 3: Express the null and alternate hypotheses in numerical terms.
There are many claims that can be made that are not measurable, and this includes claims
informed by scholarly work. If you want to make systematic, rigorous claims about things
relating to cultural geography, social geography and the like, then you best pay attention in
GEOG 2P10.
You can express the null and alternate hypotheses either as proportions, or as means and
variances. De Veaux et al adopts the position that Z-tests are used only on proportions, whereas
t-tests, chi-squared tests and ANOVA are used on means and variances. To avoid confusing you,
I abide by their decision, and further restrict our focus to Z- and t-tests.
For the sake of illustration, I will focus on proportions, and in turn, Z-tests. I set my null
hypothesis to be that the proportion of employees whose wages exceed 100,000 CAD is the same
in both St. Catharines and Niagara Falls. My alternate hypothesis is that there is a difference in
the proportion of employees whose wages exceed 100,000 CAD between St. Catharines and Niagara
Falls. I am not specifying a direction, and hence this is a two-tailed test. If we decided to state
the alternate hypothesis as “proportion is higher in St. Catharines than in Niagara Falls...” then
we would have a directional hypothesis test, which always is a one-tailed test.
Here, H0: p1 = p2 and HA: p1 ≠ p2. You might ask: “Where are the measurable bits?” They are
implied in this instance. This null hypothesis implies a difference of zero, whereas this alternate
hypothesis implies that the difference is not zero. Thus, these can be re-expressed as
H0: p1 - p2 = 0 and HA: p1 - p2 ≠ 0, a convention that your textbook does not follow.
STEP 4: Determine the critical regions.
The critical regions help us decide if we reject the null hypothesis, or if we fail to reject the
null hypothesis. This follows on the logic of critical rationalism. Texts that are still steeped in
logical positivism will state that you need critical regions to decide if you reject or accept the null
hypothesis.
Customarily, we set the critical regions by establishing a boundary called the significance- or
alpha-level. If our calculated test statistic (a Z-score in this example) falls into the critical region,
then we reject the null hypothesis. If our test statistic does not fall into the critical region, then
we fail to reject the null hypothesis.
The significance level or alpha level (sometimes written as α-level, or just α =) can be thought of
as the border between the land of rejection and failing to reject. It is useful to remember that the
lands of rejection are always in the tail or tails of the distribution. Our example here sets both
tails to be the critical region, and hence is called a two-tailed test. If our alternate hypothesis
stated that one proportion was greater than another, then we would use a one-tailed test, and only
one tail would be the critical region.
We select our significance level or α-level based on the cost of making what is called a Type I
error. This cost is the cost of rejecting the null hypothesis when it is actually true. Customarily,
these are set at α = 0.1, 0.05 or 0.01. These correspond to the probabilities of rejecting the null
hypothesis when it is true. Thus, if the result of rejecting the null when it is true is not very costly
(i.e., little or no money will be lost, nobody will be harmed), then we might set α = 0.1 or 0.05.
Medical research, however, generally sets α = 0.05 or 0.01, as the cost of wrongly rejecting the
null could be much higher.
These significance levels correspond to critical values. Critical values are expressed in terms of
your test-statistic. In our example, they would be expressed in terms of Z. You might think of
them as the gate or gates (depending on the number of tails) into the land of rejection.
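For a Z-test, the critical values that match the customary α-levels can be computed rather than looked up. The sketch below uses only Python's standard library; the function name `z_critical` is hypothetical.

```python
from statistics import NormalDist

def z_critical(alpha, tails=2):
    """Positive critical Z value for a given alpha and number of tails (1 or 2)."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # A two-tailed test splits alpha evenly between the two tails.
    return NormalDist().inv_cdf(1 - alpha / tails)

for alpha in (0.1, 0.05, 0.01):
    print(f"alpha = {alpha}: two-tailed ±{z_critical(alpha):.3f}, "
          f"one-tailed {z_critical(alpha, tails=1):.3f}")
```

For a left-tailed test the gate is the negative of the one-tailed value.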
STEP 5: Calculate your test statistic.
Here is the plug-and-chug portion of your evening. Substitute numbers for variables. Be sure to
check your work.
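For the wage example, the plug-and-chug step might look like the sketch below. It assumes the usual two-proportion Z statistic with a pooled standard error, a formula this handout does not spell out. The function name is hypothetical, and the counts are invented purely for illustration; they are not real wage data for either city.

```python
import math

def two_proportion_z(successes1, n1, successes2, n2):
    """Z statistic for H0: p1 = p2, using a pooled standard error (assumed formula)."""
    p1 = successes1 / n1
    p2 = successes2 / n2
    pooled = (successes1 + successes2) / (n1 + n2)   # common proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical counts: 30 of 200 sampled employees in one city and
# 18 of 200 in the other earn more than 100,000 CAD.
z = two_proportion_z(30, 200, 18, 200)
print(f"Z = {z:.2f}")
```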
STEP 6: Compare your test statistic to your critical value.
Is your test statistic more extreme than your critical value? In other words, does your test statistic fall
into the critical region?
If your test statistic falls into the critical region, then you reject your null hypothesis. If your test
statistic does not fall into the critical region, then you fail to reject your null hypothesis.
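The comparison in this step can be written as a small decision rule. This is a sketch with hypothetical function names; the three checks at the bottom reuse the test statistics and critical values from the t-test lecture examples (creek, two creeks, ducks).

```python
def in_critical_region(test_stat, critical_value, tail):
    """True if the test statistic falls in the critical region.

    tail is 'left', 'right' or 'two'; the sign of critical_value is ignored.
    """
    crit = abs(critical_value)
    if tail == "left":
        return test_stat < -crit
    if tail == "right":
        return test_stat > crit
    if tail == "two":
        return abs(test_stat) > crit
    raise ValueError("tail must be 'left', 'right' or 'two'")

def decide(test_stat, critical_value, tail):
    """Return the wording used in these notes for the outcome of the test."""
    if in_critical_region(test_stat, critical_value, tail):
        return "reject H0"
    return "fail to reject H0"

print(decide(2.64, 2.896, "left"))    # one-sample creek example
print(decide(-0.63, 2.353, "two"))    # two-creek example
print(decide(2.23, 2.132, "right"))   # duck example
```

The rule never returns "accept H0", in keeping with the wording in Step 4.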
STEP 7: Interpret your findings.
The last step involves telling your reader two things. First, tell the reader whether you rejected or
failed to reject the null hypothesis. Then tell the reader what this means in plain English.
Bibliography
Armstrong, J. Scott (2005) “Why We Really Don’t Know What ‘Statistical Significance’ Means:
A Major Educational Failure.” Unpublished paper. The Wharton School of Business, University
of Pennsylvania. Accessed on 3 November 2007 at the following URL:
http://marketing.wharton.upenn.edu/ideas/pdf/Armstrong/StatisticalSignificance.pdf
For an extended discussion of the procedures for the classical approach, see this description:
http://wind.cc.whecn.edu/~pwildman/statnew/section_2_and_3_-_hypothesis_testing_about_the_mean_-_large_samples.htm
1. Identify the steps of hypothesis testing.
2. Explain the difference between the null and alternate hypothesis.
3. Demonstrate the semantic difference between failing to reject and accepting a null hypothesis.
4. Explain how the p-value and a test statistic are related to each other.
5. Explain how the p-value or test statistic is used to evaluate a hypothesis.
6. Explain when one uses a two- instead of a one-tailed hypothesis test.
Instructions
Scenario: Your friend told two other friends that you provide helpful explanations about statistical concepts.
These three friends are confused about the steps involved in hypothesis testing.
Prompt: Your friends have five questions:
1. What are the steps involved in hypothesis testing?
2. What is the key difference between the null hypothesis and the alternate hypothesis?
3. Why do we say we reject or fail to reject the null hypothesis instead of saying we reject or accept the null
hypothesis?
4. How is a p-value related to a test statistic?
5. What is the key difference between a one- and two-tailed hypothesis test?
To keep things simple, focus on the one-sample mean t-test if you need to talk about a specific procedure,
though you suspect that just dealing with the ideas more generally will be fine. Although your friends have
read Rogerson Statistical Methods for Geography, Chapter 5, they are still confused. If you can't find your
copy of Rogerson, the Sakai site has a link to an electronic copy of the first edition.
Rubric
Levels: Level 1 (Worst) = Novice; Level 2 = Intermediate; Level 3 = Proficient; Level 4 = Advanced.
Organization
• Level 1: No clear sequence
• Level 2: Material explained/prese... in a confusing manner
• Level 3: Material explained in a logical manner that relates each of questions two through five back to the steps identified in answer to question 1
• Level 4: Material explained in a logical, sophisticated, and engaging manner
Knowledge of Content
• Level 1: Displays little/no understanding of the concept
• Level 2: Displays some understanding of the concepts. Leaves important information out
• Level 3: Displays sufficient understanding of the concept
• Level 4: Displays thorough understanding of content. Extra research is evident
Use of examples
• Level 1: No useful examples provided.
• Level 2: Examples are usually not helpful in clarifying answers.
• Level 3: Most but not all examples are helpful in clarifying answers.
• Level 4: As LEVEL 3, plus examples include figures, tables or scenario that makes concepts easier to understand.
Number of questions answered
• Level 1: Fewer than two questions answered
• Level 2: Three questions answered correctly.
• Level 3: Four questions answered correctly.
• Level 4: All questions answered correctly.