Lecture 5: t tests
Probability and
Statistics
Part 2: Statistics
Tests of Significance
Reasoning of significance tests
Stating hypotheses
Test statistics
P-values
Statistical Significance
Test for a population mean
Two-sided significance tests and
confidence intervals
P-values vs. fixed α
Tests of Significance
The 2 most common forms of statistical inference:
Confidence interval
Appropriate when the goal is to estimate a population parameter
Significance test
Appropriate when the goal is to assess the evidence provided by the data in favor of some claim about the population
Reasoning of significance
tests
Significance test
Formal procedure for comparing observed
data with a hypothesis whose truth we want
to assess.
Hypothesis: statement about the parameters in a
population or model
Results of the test are expressed in
terms of a probability that measures how
well the data and the hypothesis agree.
Stating hypotheses
Null hypothesis
The statement being tested in a test of
significance
The test of significance is designed to
assess the strength of the evidence
against it.
Statement of “no effect” or “no
difference”
Abbreviated as H₀
Stating hypotheses, cont.
Alternative hypothesis
The statement we hope or suspect is true instead of H₀
Abbreviated as Hₐ
Hypotheses always refer to some population
or model, not to a particular outcome.
The hypotheses will be stated in terms of
population parameters.
Can have one-sided or two-sided alternative
hypotheses.
Test Statistics
The test is based on a statistic that estimates the parameter that appears in the hypotheses. Usually this is the same estimate we would use in a confidence interval for the parameter. When H₀ is true, we expect the estimate to take a value near the parameter value specified by H₀.
Values of the estimate far from the parameter value specified by H₀ give evidence against H₀. The alternative hypothesis determines which directions count against H₀.
Test Statistics, cont.
Test statistic
Measures compatibility between the
null hypothesis and the data.
Used for the probability calculation
that we need for our test of
significance.
Random variable with a distribution
that we know.
P-values
A test of significance finds the probability of getting an outcome as extreme or more extreme than the actually observed outcome.
Extreme: far from what we would
expect if the null hypothesis were
true
P-values, cont.
The probability, computed assuming that H₀ is true, that the test statistic would take a value as extreme or more extreme than that actually observed is called the P-value of the test.
The smaller the P-value, the stronger the evidence against H₀ provided by the data.
Calculate it using the sampling distribution of
the test statistic (for example the standard
normal distribution for the test statistic z).
Statistical Significance
If the P-value is as small or smaller than
α, we say that the data are statistically
significant at level α.
Example: an alpha level of 0.01 means that we are insisting that we obtain evidence so strong that it would appear only 1% of the time if the null hypothesis is in fact true.
What does a p-value of 0.03 tell you?
Random samples from N(0,1): two groups of 16 individuals
Collect the p-values obtained from a statistical test
Number of experiments N = 25,000
Simulation study
[Figure: histogram of the simulated p-values; N = 1262]
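A minimal sketch of this kind of simulation in Python (assuming a two-sample t test is applied to the two groups; scipy is used for the test):

```python
# Simulate p-values under the null: both groups come from the same N(0,1)
# distribution, so any "significant" result is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 25_000
pvalues = np.empty(n_experiments)
for i in range(n_experiments):
    group1 = rng.normal(0, 1, size=16)
    group2 = rng.normal(0, 1, size=16)
    pvalues[i] = stats.ttest_ind(group1, group2).pvalue

# Under H0 the p-values are roughly uniform on (0, 1): about 5% fall below
# 0.05 and about 3% fall below 0.03, even though there is no real effect.
print((pvalues <= 0.05).mean(), (pvalues <= 0.03).mean())
```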
Test of Significance Steps
You can assess the significance of the
evidence provided by data against a null
hypothesis by performing these steps:
1. State the null hypothesis H₀ and the alternative hypothesis Hₐ. The test is designed to assess the strength of the evidence against H₀. Hₐ is the statement that we will accept if the evidence enables us to reject H₀.
2. Calculate the value of the test statistic on which the test will be based. This statistic usually measures how far the data are from H₀.
Test of Significance Steps
3. Find the P-value for the observed data. This is the probability (calculated assuming that the null is true) that the statistic will weigh against the null at least as strongly as it does for these data.
4. State a conclusion. Choose a significance level α (how much evidence against the null you regard as decisive). If the P-value is less than or equal to α, you conclude that the alternative hypothesis is true; if it is greater than α, you conclude that the data do not provide sufficient evidence to reject the null hypothesis. Your conclusion is a sentence that summarizes what you have found by using a test of significance.
Tests for a population
mean
We have an SRS of size n drawn from a normal population with unknown mean μ but known variance σ². We want to test the hypothesis that μ has a specified value, call it μ₀.
Null hypothesis: H₀: μ = μ₀
Test statistic:
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
has the standard normal distribution when the null hypothesis is true.
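A minimal sketch of this z test in Python (the data, μ₀, and σ below are made-up for illustration, not from the lecture):

```python
# One-sample z test with known sigma: H0: mu = mu0.
import numpy as np
from scipy import stats

x = np.array([9.8, 10.4, 10.1, 9.6, 10.7, 10.2])   # hypothetical sample
mu0, sigma = 10.0, 0.5                              # hypothesized mean, known sd

z = (x.mean() - mu0) / (sigma / np.sqrt(len(x)))
p_upper = stats.norm.sf(z)               # one-sided P-value for Ha: mu > mu0
p_two_sided = 2 * stats.norm.sf(abs(z))  # two-sided P-value for Ha: mu != mu0
print(z, p_upper, p_two_sided)
```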
Tests for a population
mean
Alternative hypothesis: if it is one-sided on the high side, Hₐ: μ > μ₀
P-value: probability that a standard
normal random variable Z takes a
value at least as large as the
observed z.
We can reason similarly for other alternative hypotheses.
Two-sided significance
tests and confidence
intervals
A level α two-sided significance test rejects the hypothesis H₀: μ = μ₀ exactly when the value μ₀ falls outside a level 1 − α confidence interval for μ.
P-values vs. fixed α
The P-value is the smallest level α at
which the data are significant.
Knowing the P-value allows us to assess
significance at any level.
It is more informative than a reject-or-not
finding at a fixed significance level.
A value z* with a specified area to its right under the standard normal curve is called a critical value of the standard normal distribution.
Use and Abuse of Tests
Choosing a level of significance
Choose a level α in advance if you must
make a decision.
Does not make sense if you wish only to
describe the strength of your evidence.
If you do use a fixed α significance test to make a decision, choose α by asking how much evidence is required to reject H₀. This depends on how plausible the null hypothesis is.
Choosing a level of significance
If H₀ represents an assumption that everyone in your field has believed for years, strong evidence (small α) will be needed to reject it.
The level of evidence required to reject H₀ also depends on the consequences of such a decision: if the decision is expensive, require strong evidence.
Choosing a level of significance
Better to report the P-value, which
allows each of us to decide
individually if the evidence is
sufficiently strong.
There is no sharp border between
“significant” and “insignificant,”
only increasingly strong evidence
as the P-value decreases.
What statistical
significance doesn’t mean
Statistical significance is not the same as
practical significance.
Example: hypothesis of no correlation is
rejected
Does not mean strong association, only that there is
strong evidence of some association
A few outliers can produce highly
significant results if you blindly apply
common tests of significance.
Outliers can also destroy the significance
of otherwise convincing data.
Don’t ignore lack of
significance
If a researcher has good reason to
suspect that an effect is present and then
fails to find significant evidence of it, that
may be interesting news, perhaps more
interesting than if evidence in favor of the
effect at the 5% level had been found.
Keeping silent about negative results may
condemn other researchers to repeat the
attempt to find an effect that isn’t there.
Statistical inference is not
valid for all sets of data
Formal statistical inference cannot
correct basic flaws in the design.
Randomization in sampling or
experimentation ensures that the
laws of probability apply to our
tests of significance and
confidence intervals.
Beware of searching for
significance
The reasoning behind statistical
significance works well if you
decide what effect you are
seeking, design an experiment or
sample to search for it, and use a
test of significance to weigh the
evidence you get.
Beware of searching for
significance
Because a successful search for a
new scientific phenomenon often
ends with statistical significance, it
is all too tempting to make
significance itself the object of the
search.
Beware of searching for
significance
Once you have a hypothesis,
design a study to search
specifically for the effect you now
think is there. If the result of this
study is statistically significant,
you have real evidence.
Power and Inference as a
Decision
Examining the usefulness of a
confidence interval
Level of confidence: tells us how
reliable the method is in repeated use
Margin of error: tells us how sensitive
the method is, or how closely the
interval pins down the parameter
being estimated
Power and Inference as a
Decision
Examining the usefulness of fixed
level α significance tests
Significance level: tells us how reliable
the method is in repeated use
Power: tells us the ability of the test to detect that the null hypothesis is false
Measured by the probability that the test
will reject the null hypothesis when an
alternative is true.
The higher this
probability is, the more sensitive the test is.
Power
The probability that a fixed level α significance test will reject H₀ when a particular alternative value of the parameter is true is called the power of the test to detect that alternative.
Calculating Power
State H₀, Hₐ, the particular alternative we want to detect, and the significance level α.
Find the values of x̄ that will lead us to reject H₀.
Calculate the probability of observing these values of x̄ when the alternative is true.
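A minimal sketch of these three steps for a one-sided, one-sample z test (the values of μ₀, the alternative, σ, n, and α are illustrative assumptions):

```python
# Power of a fixed-level z test of H0: mu = mu0 against Ha: mu > mu0.
import numpy as np
from scipy import stats

mu0, mu_alt, sigma, n, alpha = 0.0, 0.5, 1.0, 25, 0.05
se = sigma / np.sqrt(n)

# Step 2: values of x-bar that lead us to reject H0 (cutoff computed under H0).
cutoff = mu0 + stats.norm.ppf(1 - alpha) * se

# Step 3: probability of observing such values when the alternative is true.
power = stats.norm.sf(cutoff, loc=mu_alt, scale=se)
print(power)   # roughly 0.80 for these illustrative numbers
```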
Increasing the Power
Increase α. A 5% test of significance will have a greater chance of rejecting H₀ than a 1% test because the strength of evidence required for rejection is less.
Consider a particular alternative that is farther away from μ₀. Values of μ that are in Hₐ but lie close to the hypothesized value μ₀ are harder to detect (lower power) than values of μ that are far from μ₀.
Increasing the Power,
cont.
Increase the sample size. More data will provide more information about x̄, so we have a better chance of distinguishing values of μ.
Decrease σ. This has the same effect as increasing the sample size: more information about μ. Improving the measurement process and restricting attention to a subpopulation are two common ways to decrease σ.
Two types of error
In significance testing, we must accept one
hypothesis and reject the other.
We hope that our decision will be correct,
but sometimes it will be wrong.
2 types of incorrect decisions
If we reject H₀ (accept Hₐ) when in fact H₀ is true, this is a Type I error (related to the P-value).
If we accept H₀ (reject Hₐ) when in fact Hₐ is true, this is a Type II error (related to the power).
The common practice of
testing hypotheses
State H₀ and Hₐ just as in a test of significance.
Think of the problem as a decision
problem, so that the probabilities of
Type I and Type II errors are relevant.
Type I errors are more serious. So
choose an α (significance level) and
consider only tests with probability of
Type I error no greater than α.
The common practice of
testing hypotheses
Among these tests, select one that
makes the probability of a Type II error
as small as possible (that is, power as
large as possible). If this probability is
too large, you will have to take a
larger sample to reduce the chance of
an error.
The one-sample one-tail t
test
Suppose that an SRS of size n is drawn from a population having unknown mean μ. To test the hypothesis H₀: μ = μ₀ based on an SRS of size n, compute the one-sample t statistic
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
In terms of a random variable T having the t(n − 1) distribution, the P-value for a test of H₀ against Hₐ: μ > μ₀ is P(T ≥ t), and against Hₐ: μ < μ₀ it is P(T ≤ t).
The one-sample one-tail t
test – example 1
Suppose that an SRS of size n is drawn
from a population having unknown mean
μ.
[114, 123.3, 116.7, 129.0, 118, 124.6, 123.1, 117.4, 111,
121.7, 124.5, 130.5]
sample mean = 121.15
sample standard deviation = 5.89
To test the hypothesis H₀: μ = μ₀ = 120, compute the one-sample t statistic
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{121.15 - 120}{5.89 / \sqrt{12}} = 0.6764
The one-sample one-tail t
test – example 1
In terms of a random variable T having the t(n − 1) distribution, the P-value for a test of H₀ against Hₐ: μ > μ₀ is P(T ≥ t).
[Table: critical values of the t distribution by degrees of freedom]
If t₁ = 0.6 then cdf(t₁) = 0.71967, so p₁ = 1 − 0.71967 = 0.28033
If t₂ = 0.7 then cdf(t₂) = 0.75077, so p₂ = 1 − 0.75077 = 0.24923
so p₂ < p < p₁
The one-sample one-tail t
test – example 1
From the exact t distribution we get
tcdf(0.6764, 11) = 0.743621
p = 1 − 0.743621 = 0.256379
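A minimal sketch reproducing this computation in Python with SciPy (same data as in the example):

```python
# One-sample, one-tailed t test of H0: mu = 120 against Ha: mu > 120.
import numpy as np
from scipy import stats

x = np.array([114, 123.3, 116.7, 129.0, 118, 124.6,
              123.1, 117.4, 111, 121.7, 124.5, 130.5])
mu0 = 120

t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))   # 0.6764
p = stats.t.sf(t, df=len(x) - 1)                           # 1 - tcdf(t, 11) = 0.2564
print(t, p)

# Equivalent one-liner (SciPy >= 1.6):
# stats.ttest_1samp(x, 120, alternative='greater')
```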
The one-sample one-tail t test – example 1
[Figure: t(11) density with the area to the right of t = 0.6764 shaded; cdf = 0.7436, p = 0.2564]
The one-sample one-tail t
test – example 1
Final conclusion: we cannot reject H₀ that the population mean μ = 120 in favor of Hₐ (p = 0.2564).
Another example of the one-sample one-tail t test
To test the hypothesis H₀: μ = 135, compute the one-sample t statistic
t = \frac{\bar{x} - 135}{s / \sqrt{n}}
In terms of a random variable T having the t(n − 1) distribution, the P-value for a test of H₀ against Hₐ: μ < μ₀ is P(T ≤ t).
The one-sample one-tail t
test – example 2
[114, 123.3, 116.7, 129.0, 118, 124.6,
123.1, 117.4, 111, 121.7, 124.5, 130.5]
sample mean = 121.15
sample standard deviation = 5.89
To test the hypothesis H₀: μ = μ₀ = 135, compute the one-sample t statistic
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{121.15 - 135}{5.89 / \sqrt{12}} = -8.1456
From the exact t distribution we get
P(T ≤ t) = tcdf(−8.1456, 11) = 2.7e-6
p = 0.0000027 < 0.000005
The one-sample one-tail t test – example 2
[Figure: t(11) density with the area to the left of t = −8.1456 shaded; p = 2.7e-6]
The one-sample one-tail t
test – example 2
Final conclusion: we reject H₀ that the population mean μ = 135 and accept Hₐ that μ < μ₀ at p < 0.000005.
The one-sample two-tails t
test
To test the hypothesis H₀: μ = 115, compute the one-sample t statistic
t = \frac{\bar{x} - 115}{s / \sqrt{n}}
In terms of a random variable T having the t(n − 1) distribution, the P-value for a test of H₀ against Hₐ: μ ≠ μ₀ is P(|T| ≥ |t|).
The one-sample two-tails
t test – example 3
[114, 123.3, 116.7, 129.0, 118, 124.6,
123.1, 117.4, 111, 121.7, 124.5, 130.5]
sample mean = 121.15
sample standard deviation = 5.89
To test the hypothesis H₀: μ = μ₀ = 115, compute the one-sample t statistic
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{121.15 - 115}{5.89 / \sqrt{12}} = 3.6170
• Since the critical value for a two-sided test at α = 0.05 with 11 degrees of freedom is t* = 2.2010, and our observed t is larger, we reject the null hypothesis at α = 0.05.
• From the t distribution we get
P(|T| ≥ |t|) = 2·(1 − tcdf(3.6170, 11)) = 0.0040
so the exact p = 0.0040.
The one-sample two-tails t test – example 3
[Figure: t(11) density with both tails beyond ±3.6170 shaded; total p = 0.0040]
The one-sample two-tails
t test – example 3
Final conclusion: we reject H₀ that the population mean μ = 115 and accept Hₐ that μ ≠ μ₀ at p = 0.0040.
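A minimal sketch reproducing examples 2 and 3 with SciPy (same data; the `alternative` keyword needs SciPy 1.6 or later):

```python
# Example 2: lower-tailed test of H0: mu = 135 against Ha: mu < 135.
# Example 3: two-sided test of H0: mu = 115 against Ha: mu != 115.
import numpy as np
from scipy import stats

x = np.array([114, 123.3, 116.7, 129.0, 118, 124.6,
              123.1, 117.4, 111, 121.7, 124.5, 130.5])

t2, p2 = stats.ttest_1samp(x, 135, alternative='less')   # t = -8.1456, p ~ 2.7e-6
t3, p3 = stats.ttest_1samp(x, 115)                        # two-sided: t = 3.6170, p ~ 0.0040
print(t2, p2)
print(t3, p3)
```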
Matched pairs t
procedures
In a matched pairs study, subjects
are matched in pairs and the
outcomes are compared within
each matched pair.
Example: before and after studies
Key points to remember
concerning matched pairs
•
A matched pair analysis is
needed when there are two
measurements or observations on
each individual and we want to
examine the change from the first
to the second. Typically the
observations are “before” and
“after” measures in some sense.
Key points to remember
concerning matched pairs
•
For each individual, subtract the
“before” measure from the “after”
measure.
•
Analyze the difference using the
one-sample confidence interval
and significance-testing
procedures.
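A minimal sketch of a matched pairs analysis (the before/after values are made-up for illustration):

```python
# Matched pairs: test whether the mean within-pair change differs from zero.
import numpy as np
from scipy import stats

before = np.array([12.1, 11.4, 13.0, 12.7, 11.9, 12.5])   # hypothetical "before" measures
after = np.array([12.8, 11.9, 13.4, 12.9, 12.6, 12.9])    # hypothetical "after" measures

diff = after - before                  # analyze the differences...
t, p = stats.ttest_1samp(diff, 0)      # ...with a one-sample t test (two-sided)
print(t, p)

# Equivalently: stats.ttest_rel(after, before)
```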
Robustness of the t
procedures
A statistical inference procedure is
called robust if the probability
calculations required are insensitive to
violations of the assumptions made.
The t procedures are quite robust
against nonnormality of the population
except in the case of outliers or strong
skewness.
Robustness of the t
procedures
Larger samples improve the
accuracy of the P-values and critical
values from the t distributions when
the population is not normal.
A normal quantile plot or boxplot to
check for skewness and outliers is
an important preliminary to the use
of t procedures for small samples.
Practical Guidelines for
inference on a single
mean
Sample size less than 15: Use t procedures
if the data are close to normal. If the data
are clearly nonnormal or if outliers are
present, do not use t.
Sample size at least 15: The t procedures
can be used except in the presence of
outliers or strong skewness.
Large samples: The t procedures can be
used even for clearly skewed distributions
when the sample size is large, roughly n >
40.
Comparing Two Means
Two-sample problems
The goal of inference is to compare
the responses in two groups.
Each group is considered to be a
sample from a distinct population.
The responses in each group are
independent of those in the other
group.
Notation
For population i (i = 1, 2): variable xᵢ, population mean μᵢ, population standard deviation σᵢ
For the sample from population i: sample size nᵢ, sample mean x̄ᵢ, sample standard deviation sᵢ
The two-sample z statistic
The natural estimator of the difference μ₁ − μ₂ is the difference between the sample means, x̄₁ − x̄₂.
To do inference on this statistic, we must know its sampling distribution.
If the two population distributions are both normal, then the distribution of x̄₁ − x̄₂ is also normal:
\bar{x}_1 - \bar{x}_2 \sim N\left( \mu_1 - \mu_2,\ \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \right)
The two-sample z statistic,
cont.
Suppose that x̄₁ is the mean of an SRS of size n₁ drawn from a N(μ₁, σ₁) population and that x̄₂ is the mean of an SRS of size n₂ drawn from a N(μ₂, σ₂) population. Then the two-sample z statistic
z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}
has the standard N(0, 1) sampling distribution.
The two-sample t
procedure
This test assumes that the variances in
the populations from which the two
samples were taken are identical
The two-sample t
procedures
If the population standard
deviations are not known, then we
estimate them by the sample
standard deviations.
Use the proper formula depending on the sample sizes n₁ and n₂.
The two-sample t
procedures
When the sample sizes are unequal and n₁ or n₂ or both are small (< 30), use:
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \cdot \frac{n_1 + n_2}{n_1 n_2}}}
This statistic has a t distribution with k = n₁ + n₂ − 2 degrees of freedom.
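A minimal sketch of this pooled statistic (SciPy's `stats.ttest_ind` with `equal_var=True` performs the same pooled test):

```python
# Pooled two-sample t statistic, k = n1 + n2 - 2 degrees of freedom.
import numpy as np
from scipy import stats

def pooled_t(x1, x2):
    n1, n2 = len(x1), len(x2)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    t = (x1.mean() - x2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
    df = n1 + n2 - 2
    p = 2 * stats.t.sf(abs(t), df)     # two-sided P-value
    return t, df, p

# usage: t, df, p = pooled_t(np.array([...]), np.array([...]))
```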
The two-sample t
procedures
When n₁ and n₂ are equal (regardless of size), use:
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2 + s_2^2}{n}}}
This statistic has a t distribution with k = 2(n − 1) degrees of freedom.
The two-sample t
procedures
When n₁ and n₂ are unequal but large (≥ 30), use:
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
This statistic has a t distribution with k = n₁ + n₂ − 2 degrees of freedom.
Two-sample t significance
test
Suppose that an SRS of size n₁ is drawn from a normal population with unknown mean μ₁ and that an independent SRS of size n₂ is drawn from another normal population with unknown mean μ₂.
We assume the standard deviations for both populations are identical.
Two-sample t significance
test
To test the hypothesis H₀: μ₁ = μ₂ versus Hₐ: μ₁ ≠ μ₂, compute the two-sample t statistic and use P-values or critical values for the t(k) distribution, with the proper degrees of freedom.
Two-sample t
significance test -
example
Suppose that an SRS of size n₁ is drawn from a population having unknown mean μ₁.
[114, 123.3, 116.7, 129.0, 118, 124.6, 123.1, 117.4, 111, 121.7, 124.5, 130.5]
sample mean = 121.15
sample standard deviation = 5.89
Another SRS of size n₂ is drawn from a population having unknown mean μ₂.
[120, 125.5, 126, 125.5, 128.5, 125, 128, 116, 122, 121, 117, 125]
sample mean = 123.29
sample standard deviation = 4.08
Two-sample t
significance test -
example
Since n₁ = n₂ = 12, use the equal-sample-size statistic:
t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2 + s_2^2}{n}}} = -1.0383
where μ₁ − μ₂ = 0 by assumption under H₀.
p = 0.3104
Conclusion: we cannot reject the hypothesis that the mean values of both populations are the same at α = 0.05.
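A minimal sketch reproducing this result with SciPy (pooled, equal-variance two-sample t test on the data above):

```python
# Two-sample pooled t test of H0: mu1 = mu2 against Ha: mu1 != mu2.
import numpy as np
from scipy import stats

x1 = np.array([114, 123.3, 116.7, 129.0, 118, 124.6,
               123.1, 117.4, 111, 121.7, 124.5, 130.5])
x2 = np.array([120, 125.5, 126, 125.5, 128.5, 125,
               128, 116, 122, 121, 117, 125])

t, p = stats.ttest_ind(x1, x2, equal_var=True)   # t ~ -1.0383, p ~ 0.3104
print(t, p)
```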
Two-sample Confidence
Interval
Suppose that an SRS of size n₁ is drawn from a normal population with unknown mean μ₁ and that an independent SRS of size n₂ is drawn from another normal population with unknown mean μ₂. For large n₁ and n₂, the confidence interval for μ₁ − μ₂ is given by
\left( (\bar{x}_1 - \bar{x}_2) - t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\ ,\ \ (\bar{x}_1 - \bar{x}_2) + t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \right)
where the critical value t* comes from the t distribution with n₁ + n₂ − 2 degrees of freedom.
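A minimal sketch computing this interval for the two example samples above (95% confidence is an assumed level):

```python
# Approximate confidence interval for mu1 - mu2 using a t* critical value.
import numpy as np
from scipy import stats

x1 = np.array([114, 123.3, 116.7, 129.0, 118, 124.6,
               123.1, 117.4, 111, 121.7, 124.5, 130.5])
x2 = np.array([120, 125.5, 126, 125.5, 128.5, 125,
               128, 116, 122, 121, 117, 125])

diff = x1.mean() - x2.mean()
se = np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))
t_star = stats.t.ppf(0.975, df=len(x1) + len(x2) - 2)   # two-sided 95% critical value
print(diff - t_star * se, diff + t_star * se)
```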
Robustness of the two-
sample procedures
The two-sample t procedures are more
robust than the one-sample t methods.
When the sizes of the two samples are equal
and the distributions of the two populations
being compared have similar shapes, we get
good accuracy.
When the two population distributions have
different shapes, larger samples are needed.
In planning a two-sample study, you should
usually choose equal sample sizes.
Bartlett’s test for
homogeneity of variances
A test of the null hypothesis that two normal populations, represented by two samples, have the same variance.
The alternative hypothesis is that
the two variances are unequal.
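A minimal sketch using SciPy's implementation of Bartlett's test on the two example samples above:

```python
# Bartlett's test: H0: the two population variances are equal.
import numpy as np
from scipy import stats

x1 = np.array([114, 123.3, 116.7, 129.0, 118, 124.6,
               123.1, 117.4, 111, 121.7, 124.5, 130.5])
x2 = np.array([120, 125.5, 126, 125.5, 128.5, 125,
               128, 116, 122, 121, 117, 125])

stat, p = stats.bartlett(x1, x2)
print(stat, p)   # a small p-value argues against equal variances
```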
Welch’s approximate t-test of
equality of the means of two
samples whose variances are
assumed to be unequal
Welch’s approximate t-test
This test calculates an approximate t value, t′, for which the critical value is calculated as a weighted average of the critical values of t based on the corresponding degrees of freedom of the two samples.
Welch’s approximate t-test
The two-sample t′ statistic is
t' = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
Welch’s approximate t-test
The critical value t′_α for a Type I error of α is computed as:
t'_{\alpha} = \frac{t_{\alpha[1]} \frac{s_1^2}{n_1} + t_{\alpha[2]} \frac{s_2^2}{n_2}}{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
where t_{α[i]} is the critical value of the t distribution with nᵢ − 1 degrees of freedom.
Welch’s approximate t-test
The statistic does NOT have a t
distribution but we can approximate
the distribution of the two-sample t’
statistic by using the t(k)
distribution with an approximation
for the degrees of freedom k.
To calculate the P-value, take k equal to the smaller of n₁ − 1 and n₂ − 1.
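A minimal sketch of Welch's test in SciPy; note that SciPy approximates the degrees of freedom with the Welch–Satterthwaite formula rather than the conservative smaller-of(n₁ − 1, n₂ − 1) rule stated above:

```python
# Welch's approximate t test: does not assume equal population variances.
import numpy as np
from scipy import stats

x1 = np.array([114, 123.3, 116.7, 129.0, 118, 124.6,
               123.1, 117.4, 111, 121.7, 124.5, 130.5])
x2 = np.array([120, 125.5, 126, 125.5, 128.5, 125,
               128, 116, 122, 121, 117, 125])

t_prime, p = stats.ttest_ind(x1, x2, equal_var=False)   # Welch's t'
print(t_prime, p)
```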
Homework