Statistics: Module 2 - Groep Biomedische Wetenschappen KU … · 2017. 6. 23. · PhD Biomedical...

Statistics: Module 2

Geert Verbeke

I-BioStat: Interuniversity Institute for Biostatistics and statistical Bioinformatics

K.U.Leuven & Hasselt University, Belgium


http://perswww.kuleuven.be/geert verbeke

PhD Biomedical Sciences

Contents

1 The comparison of two means: Unpaired data

2 The comparison of two proportions: Unpaired data

3 The comparison of two means: Paired data

4 The comparison of two proportions: Paired data

5 Errors in statistics: Basic concepts

6 Errors in statistics: Practical implications

7 One-sided versus two-sided tests

8 Describing associations

9 Non-parametric statistics

Bibliography

Chapter 1

The comparison of two means: Unpaired data

. Example

. Confidence interval for the difference of two means

. The unpaired t-test

. Assumptions

. Example: Survival times of cancer patients

. Example from the biomedical literature


1.1 Example

• Consider an experiment in which weight gain in rats on a high protein level diet is compared with weight gain in rats on a low protein level diet.

• Group-specific histograms:


• Group-specific summary statistics:

• On average, there is an observed difference of 19g between the rats on a high protein diet and those on a low protein diet.

• Is this observed difference sufficient evidence to conclude that diet indeed has an effect on weight gain?

• It would be of interest to know how likely such a difference of 19g is to occur if weight gain were completely unrelated to the protein level of the diet.


• Note that, strictly speaking, we have two populations, with a sample randomly drawn from each:

. High protein rats: The hypothetical population of all rats that are given a high protein diet

. Low protein rats: The hypothetical population of all rats that are given a low protein diet

• From the first population, a random sample of n1 = 12 rats was taken. From the second one, a random sample of n2 = 7 rats was drawn.

• The corresponding observed means are x̄1 = 120 and x̄2 = 101, respectively.

• Because there is no relation between the observations taken from the first population and those taken from the second, we have unpaired data.


1.2 Confidence interval for the difference of two means

• Let µ1 and µ2 be the (unknown) mean weight gain in the high and low protein population, respectively:

[Figure: distributions of weight gain in the low protein and high protein populations, with their means µ2 and µ1 marked on the horizontal axis]

• Of interest is to draw inferences about µ1 − µ2


• As always, our estimate of µ1 − µ2 is

$\hat{\mu}_1 - \hat{\mu}_2 = \bar{x}_1 - \bar{x}_2 = 19$

• Based on the observed data, C.I.’s can be constructed for µ1 − µ2

• For example, a 95% C.I. for µ1 − µ2 is given by

[−2.19; 40.19]

• The true difference µ1 − µ2 may or may not be in the interval [−2.19; 40.19]. However, if 100 similar experiments were conducted, then about 95 out of the 100 corresponding C.I.’s would be expected to contain µ1 − µ2.

• Hence, with 95% confidence, we believe µ1 − µ2 to be within the interval [−2.19; 40.19].


• This C.I. shows that:

. the estimate (19g) of µ1 − µ2 is a very imprecise estimate:

∗ the C.I. is very wide

∗ with 95% confidence, the estimate is within 21.19g of the true difference

. based on our data, it cannot be ruled out that µ1 − µ2 would be zero, i.e., that there would be no difference between both populations.
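The interval above is consistent with the usual pooled-variance formula for a two-sample confidence interval. The following is a sketch under that assumption: the source reports only the resulting interval, and the symbols $s_p$ (pooled standard deviation) and SE (standard error) are introduced here for illustration.

```latex
\begin{align*}
(\bar{x}_1 - \bar{x}_2) \pm t_{n_1+n_2-2;\,0.975} \cdot \mathrm{SE},
\qquad \mathrm{SE} = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \\
19 \pm 21.19 = [-2.19;\ 40.19],
\qquad t_{17;\,0.975} \approx 2.11
\;\Rightarrow\; \mathrm{SE} \approx \frac{21.19}{2.11} \approx 10.04
\end{align*}
```

Read this way, the half-width of 21.19g is simply the standard error of the difference multiplied by the t quantile with n1 + n2 − 2 = 17 degrees of freedom.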


1.3 The unpaired t-test

• Often, it is of interest to test whether two populations have the same mean.

• This is translated into a set of hypotheses of the form:

$H_0: \mu_1 = \mu_2$ versus $H_A: \mu_1 \neq \mu_2$

• We will reject the null hypothesis if the observed data deviate too much from what we would expect to see if the null hypothesis were correct

• Hence, we will reject H0 if x̄1 is much larger than x̄2, or vice versa

• This is equivalent to rejecting H0 if |x̄1 − x̄2| is too large


• Question:

How large is too large?

• Answer:

If the observed difference |x̄1 − x̄2| is very unlikely to happen by pure chance

• We therefore calculate the probability p of observing, in a similar experiment, a mean difference between the groups of at least 19g, if µ1 = µ2.

• In our example, this probability equals p = 0.0757.

• So, even if there is no relation at all between the protein content of the diet and weight gain, one can still expect to observe a difference of at least 19g in 7.6% of future similar experiments.

• Since p = 0.0757 > 0.05 = α, we consider this insufficient evidence to conclude that the protein level indeed affects the weight gain


• Conclusion:

There is no significant difference (p = 0.0757) in weight gain

between rats on a high protein level diet,

and rats on a low protein level diet

• The above testing procedure is called the unpaired t-test since unpaired data are analysed, and since the calculation of the p-value is based on the t-distribution.
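As a rough check of the reported p-value, the sketch below recomputes it from the summary numbers quoted in this chapter. It rests on two assumptions that are not stated in the source: the standard error is backed out of the half-width of the 95% C.I. (21.19g) using the tabulated quantile 2.1098 of the t-distribution with 17 degrees of freedom, and the tail probability is obtained by numerical integration. The helper names `t_pdf` and `two_sided_p` are illustrative only.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """P(|T| >= |t|): integrate the density over [0, |t|] with Simpson's rule."""
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)

n1, n2 = 12, 7
diff = 120 - 101        # observed difference in mean weight gain (g)
df = n1 + n2 - 2        # 17 degrees of freedom
t_crit = 2.1098         # 97.5% quantile of t with 17 df (table value)
se = 21.19 / t_crit     # assumption: SE recovered from the C.I. half-width
t_stat = diff / se
p_value = two_sided_p(t_stat, df)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

The result should land close to the p = 0.0757 reported above; small deviations are expected because the inputs are rounded.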


1.4 Assumptions

• The calculation of the C.I., as well as the computation of the p-value, is based on the sampling distribution of X̄1 − X̄2, which describes what values for x̄1 − x̄2 can be expected if the experiment were repeated many times.

• The sampling distribution of X̄1 − X̄2 is completely determined by the sampling distributions of X̄1 and X̄2

• In case of large samples, those distributions are known to be approximately normal (CLT)

• In small samples, this normality of X̄1 and X̄2 is only valid in cases where the original data are (approximately) normally distributed.

• Therefore, in case of small samples, one assumes the outcome to be normally distributed in each group separately:

[Figure: distributions of the outcome in the two groups, with means µ2 and µ1]


• Conclusion:

Large samples: no assumptions

[Figure: distributions of the outcome in the two populations, with means µ2 and µ1]

Small samples: Normality in both groups

[Figure: distributions of the outcome in the two populations, with means µ2 and µ1]

• Note that the samples in our example were small (n1 = 12 and n2 = 7). Hence the histograms should be explored for any evidence against symmetry


• The group-specific histograms are:

• Note that, given the small sample sizes, assessment of symmetry is difficult

• This illustrates another drawback of small samples: Assumptions are often needed, which are very hard to check based on the observed data.

• Subject-matter knowledge can often help in deciding whether the underlying assumptions are realistic

• The unpaired t-test also implicitly assumes that both populations have the same variance

• This can be checked with a test for equality of variances, in which the following hypotheses are tested:

$H_0: \sigma_1^2 = \sigma_2^2$ versus $H_A: \sigma_1^2 \neq \sigma_2^2$

• Most software packages automatically report the results from such a test, and even provide a corrected unpaired t-test, which corrects for the unequal variances:

• The variances are not significantly different from each other (p = 0.9788), such that our original result remains valid.

• Note that, since the variances are so similar, the corrected and uncorrected t-tests yield very similar results (p-values).

• Often, non-equality of the variances is associated with non-normality of the data


1.5 Example: Survival times of cancer patients

• Based on data on survival times of cancer patients, we want to compare the survival times of stomach cancer patients with the survival times of colon cancer patients

• Summary statistics:

• We observe a large difference of 457.4 − 286 = 171.4 days in average survival time between both groups (colon: 457.4 days; stomach: 286 days).

• On the other hand, there is a lot of variability between the subjects in both groups.

• Hence, it is not clear whether the observed difference of 171 days is sufficient evidence to conclude that survival times are indeed different for colon cancer patients and stomach cancer patients

• Results of the unpaired t-test:

• We do not find a significant difference between both groups with respect to the survival time (p = 0.2483).

• However, the histograms suggest skewness in the data, such that the underlying assumption of normality becomes questionable:

• The skewness in the direction of the large values suggests that a logarithmic (or similar) transformation might be useful:

X = survival time ⟶ Y = ln(X) = ln(survival time)


[Figure: histogram of the survival times, shown alongside possible transformations]

Survival times X (days) and log-transformed values Y = ln(X):

       Stomach                 Colon
     X     Y = ln(X)         X     Y = ln(X)
   124       4.82          248       5.51
    42       3.74          377       5.93
    25       3.22          189       5.24
    45       3.81         1843       7.52
   412       6.02          180       5.19
    51       3.93          537       6.29
  1112       7.01          519       6.25
    46       3.83          455       6.12
   103       4.63          406       6.01
   876       6.78          365       5.90
   146       4.98          942       6.85
   340       5.83          776       6.65
   396       5.98          372       5.92
                           163       5.09
                           101       4.62
                            20       3.00
                           283       5.65


• As before, assessing symmetry is difficult due to the small number of observations in both groups. However, the evidence against symmetry is much weaker now.

• Results of unpaired t-test based on transformed data:

• The observed difference between both groups is still not significant (p = 0.0671), but the p-value is very different from what we obtained before the transformation (p = 0.2483).

• This illustrates that:

. assumptions need to be checked

. violation of assumptions can lead to serious errors
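The effect of the transformation on the test can be retraced from the survival times tabulated above. The sketch below is an illustration, not the software output used in the course: it assumes the equal-variance (pooled) unpaired t-test and evaluates the t-distribution tail by numerical integration; the helpers `t_pdf`, `two_sided_p` and `pooled_t_test` are hypothetical names.

```python
import math
from statistics import mean, variance

stomach = [124, 42, 25, 45, 412, 51, 1112, 46, 103, 876, 146, 340, 396]
colon = [248, 377, 189, 1843, 180, 537, 519, 455, 406, 365, 942, 776, 372,
         163, 101, 20, 283]

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """P(|T| >= |t|) via Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)

def pooled_t_test(a, b):
    """Unpaired t-test assuming equal variances; returns (t, df, p)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df  # pooled variance
    se = math.sqrt(sp2 * (1 / na + 1 / nb))
    t = (mean(a) - mean(b)) / se
    return t, df, two_sided_p(t, df)

t_raw, df, p_raw = pooled_t_test(colon, stomach)
t_log, _, p_log = pooled_t_test([math.log(x) for x in colon],
                                [math.log(x) for x in stomach])
print(f"original scale: t = {t_raw:.2f}, df = {df}, p = {p_raw:.4f}")
print(f"log scale:      t = {t_log:.2f}, df = {df}, p = {p_log:.4f}")
```

Both p-values should be close to the ones quoted above (0.2483 before and 0.0671 after the transformation), which shows how strongly a few very large survival times influence the test on the original scale.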


• Note that this is another example where geometric means and standard deviations would be useful to describe the location and spread in survival times in the two cancer groups separately:

                        Stomach cancer                Colon cancer
Outcome                 mean∗         (stand.dev.)∗   mean∗         (stand.dev.)∗
Survival time (days)    144.03        (3.49)          314.19        (2.72)
                        = exp(4.97)   = exp(1.25)     = exp(5.75)   = exp(1.00)

∗ geometric means and standard deviations

which is very different from the arithmetic means that were reported before (286 days for stomach cancer and 457.4 days for colon cancer).


• The fact that the formal test has been performed on the log-transformed survival times does not change the interpretation of the result

• If the log-transformed survival times are different for the two groups, then so are the untransformed survival times

• Hence, although the conclusion, strictly speaking, should be that

“there is no significant difference in log survival times,”

it will often be formulated as

“there is no significant difference in survival times.”
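The geometric means and standard deviations quoted in this section can be retraced from the tabulated survival times: both are obtained by exponentiating the mean and the standard deviation of the log-transformed values. A minimal sketch, in which `geometric_summary` is an illustrative helper name:

```python
import math
from statistics import mean, stdev

stomach = [124, 42, 25, 45, 412, 51, 1112, 46, 103, 876, 146, 340, 396]
colon = [248, 377, 189, 1843, 180, 537, 519, 455, 406, 365, 942, 776, 372,
         163, 101, 20, 283]

def geometric_summary(times):
    """Return (geometric mean, geometric standard deviation) of positive values."""
    logs = [math.log(x) for x in times]
    return math.exp(mean(logs)), math.exp(stdev(logs))

for name, times in (("Stomach", stomach), ("Colon", colon)):
    gm, gsd = geometric_summary(times)
    print(f"{name}: arithmetic mean {mean(times):.1f}, "
          f"geometric mean {gm:.1f}, geometric SD {gsd:.2f}")
```

Small differences with the values in the table are to be expected, since the table exponentiates log-scale summaries that were first rounded to two decimals.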


1.6 Example from the biomedical literature

• Nissen et al. [1], Table 1:

. Large samples

. Similar variability in both groups

. p < 0.001 rather than p = 0.000


• Kellett, Kellett, and Nordholm [2], Table 2:

. Relatively small samples

. Normality assumption NOT satisfied

. Variances NOT equal

. No reporting of the p-values


Chapter 2

The comparison of two proportions: Unpaired data

. Example

. The chi-squared test

. Assumptions – The Fisher Exact test

. Rows versus columns

. Example: Case-control data



2.1 Example

• Consider data on sickness absence, collected on 585 employees with a similar job:

                 Sickness absence
Gender           No      Yes     Total
female           245     184     429
male              98      58     156
Total            343     242     585


• Research question:

Is there a relation between absence and gender?

• 184/429 = 42.9% of the females, and 58/156 = 37.2% of the males have been absent

• This suggests that females are more often absent than males

• However, even if absence due to sickness is equally frequent amongst males and females, the above results could have occurred by pure chance.

• It therefore would be of interest to calculate how likely it would be to observe such differences by pure chance


• Note that we have again two populations, with a sample randomly drawn from each:

. Males: The hypothetical population of all male employees with similar job conditions

. Females: The hypothetical population of all female employees with similar job conditions

• From the first population, a random sample of n1 = 156 males was taken. From the second one, a random sample of n2 = 429 females was drawn.

• Let π1 and π2 denote the proportions of males and females, respectively, with sickness absence in the total populations

• Then π1 and π2 can be estimated by their sample versions π̂1 = 0.372 and π̂2 = 0.429

• Because there is no relation between the observations taken from the first population and those taken from the second, we have unpaired data.


2.2 The chi-squared test

• Often, it is of interest to test whether two populations have the same percentage of people with absence due to sickness.

• This is translated into a set of hypotheses of the form:

$H_0: \pi_1 = \pi_2$ versus $H_A: \pi_1 \neq \pi_2$

• Hence, we will reject H0 if π̂1 is much larger than π̂2, or vice versa

• This is equivalent to rejecting H0 if |π̂1 − π̂2| is too large


• Question:

How large is too large?

• Answer:

If the observed difference |π̂1 − π̂2| is very unlikely to happen by pure chance

• We therefore calculate the probability p of observing, in a similar experiment, a difference between the groups at least equal to |π̂1 − π̂2| = 0.429 − 0.372 = 0.057, if π1 = π2

• In our example, this probability equals p = 0.215.

• So, even if there is no relation at all between gender and absence, one can still expect to observe a difference of at least 5.7 percentage points in 21.5% of future similar experiments.

• Since p = 0.215 > 0.05 = α, we consider this insufficient evidence to conclude that the occurrence of sickness absence is related to gender


• Conclusion:

There is no significant difference (p = 0.215) in prevalence

of sickness absence

between males and females

• The testing procedure needed for the comparison of proportions in unpaired data is called the chi-squared test since the calculation of the p-value is based on the chi-squared (χ2) distribution.
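For a 2 × 2 table, the chi-squared test can be written out in a few lines. The sketch below assumes the Pearson statistic without continuity correction, which is an assumption about how the reported p-value was obtained; `chi2_test_2x2` is an illustrative helper name. For one degree of freedom, the upper tail of the chi-squared distribution can be expressed through the complementary error function.

```python
import math

def chi2_test_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom: P(chi2_1 >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Rows: female, male; columns: no absence, absence
chi2, p = chi2_test_2x2(245, 184, 98, 58)
print(f"chi-squared = {chi2:.3f}, p = {p:.3f}")
```

Under these assumptions the result agrees with the p = 0.215 reported for this table.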


2.3 Assumptions – The Fisher Exact test

• The calculation of the p-value is based on the sampling distribution of Π̂1 − Π̂2, which describes what values for π̂1 − π̂2 can be expected if the experiment were repeated many times.

• Note that Π̂1 and Π̂2 are the sample averages X̄1 and X̄2 of the binary variable ‘sickness absence’.

• Hence, for large samples, the sampling distribution of Π̂1 − Π̂2 directly follows from the CLT

• In small samples, the normality of Π̂1 and Π̂2 can be problematic, and an alternative calculation of the p-value is needed.

• The Fisher Exact test provides an alternative way to calculate the p-value, without relying on the CLT or on the assumption of large samples.

• As an example, we consider again data on sickness absence, but from a second, much smaller, company:


                 Sickness absence
Gender           No      Yes     Total
female             1       1       2
male              10       2      12
Total             11       3      14

• The results based on the chi-squared as well as on the Fisher Exact test are:


• We observe considerable differences due to the (extremely) small sample sizes in both groups

• In larger samples, chi-squared and Fisher Exact produce much more similar p-values:

           Sickness absence         p-value
Company    Males       Females      χ2       Fisher Exact
1          58/156      184/429      0.215    0.219
2          2/12        1/2          0.287    0.396
3          107/330     405/1079     0.091    0.102
4          37/97       40/122       0.409    0.477
5          3/10        48/150       0.895    1.000
6          56/156      1/11         0.070    0.100
7          1/12        0/1          0.764    1.000
8          53/170      0/1          0.501    1.000
9          378/1089    117/269      0.007    0.009


• The Fisher Exact test is very time-consuming, and cannot be calculated for large samples, except with special software.

• However, note that, for large samples, the chi-squared test remains possible, and yields results very similar to the ones that would have been obtained with the Fisher Exact test

• In practice, it is often standard to use Fisher Exact, unless computational restrictions require the use of chi-squared.

• Conclusion:

Large samples: Chi-squared test

Small samples: Fisher Exact test
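To make the Fisher Exact calculation concrete, the sketch below enumerates all 2 × 2 tables with the same margins and adds up the probabilities of those that are at most as likely as the observed table. This is one common definition of the two-sided p-value and is assumed here, since the source does not specify the variant; `fisher_exact_2x2` is an illustrative helper name.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher Exact p-value for the table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c      # row totals and first column total
    denom = comb(r1 + r2, c1)

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k
        return comb(r1, k) * comb(r2, c1 - k) / denom

    lo, hi = max(0, c1 - r2), min(r1, c1)  # feasible values of the top-left cell
    p_obs = prob(a)
    # Sum over all tables that are no more likely than the observed one
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))

# Rows: female, male; columns: no absence, absence
p_small = fisher_exact_2x2(1, 1, 10, 2)        # company 2
p_large = fisher_exact_2x2(245, 184, 98, 58)   # company 1
print(f"company 2: p = {p_small:.3f}")
print(f"company 1: p = {p_large:.3f}")
```

For company 2 only three tables are possible, with probabilities 12/364, 132/364 and 220/364, so the p-value is 144/364 ≈ 0.396, as in the table above. The enumeration also hints at why the exact test becomes more demanding as the margins grow.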


2.4 Rows versus columns

• When comparing two unpaired proportions, the data can always be summarized by a 2 × 2 table:

                 Sickness absence
Gender           No        Yes       Total
female           A         B         A + B
male             C         D         C + D
Total            A + C     B + D     A + B + C + D

in which A, B, C, and D represent the number of observations in each cell.

• The hypothesis of interest was to compare the prevalence of sickness absence between males and females.

• One can show that this is equivalent to comparing the percentage of males (females) between the employees with and without sickness absence:

$\frac{B}{A+B} = \frac{D}{C+D} \iff \frac{C}{A+C} = \frac{D}{B+D}$

Proof:

$\frac{B}{A+B} = \frac{D}{C+D} \iff B(C+D) = D(A+B)$

$\iff BC = AD$

$\iff C(B+D) = D(A+C)$

$\iff \frac{C}{A+C} = \frac{D}{B+D}$

• This implies that, for the analysis of a 2 × 2 table, rows and columns can be interchanged.

• This is of interest for the analysis of case-control data
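The interchangeability of rows and columns can also be checked numerically. A minimal sketch, using the sickness absence counts from Section 2.1 and a hypothetical helper `chi2_stat` (the standard Pearson statistic for a 2 × 2 table, without continuity correction):

```python
def chi2_stat(a, b, c, d):
    """Pearson chi-squared statistic for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

A, B, C, D = 245, 184, 98, 58          # female/male by no absence/absence

original = chi2_stat(A, B, C, D)       # rows = gender, columns = absence
transposed = chi2_stat(A, C, B, D)     # rows = absence, columns = gender

# Both comparisons hinge on the same cross-product B*C versus A*D
absent_diff = B / (A + B) - D / (C + D)   # absence: females minus males
male_diff = C / (A + C) - D / (B + D)     # males: non-absent minus absent

print(original, transposed)
print(absent_diff, male_diff, B * C - A * D)
```

The statistic is the same for the table and its transpose because it depends on the cells only through (AD − BC)² and the product of the four margins, and both differences of proportions vanish exactly when BC = AD.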


2.5 Case-control data

• We consider the data on cervical cancer, where the relationship between the occurrence of cervical cancer and the age at first pregnancy is studied.

• Data were collected on 49 cancer cases and 317 non-cancer cases (controls). All women were asked about their age at first pregnancy, and the data are summarized as:

                 Disease status
Age              Cervical cancer    Control    Total
≤ 25             42                 203        245
> 25              7                 114        121
Total            49                 317        366

• Research question:

Is there a relation between cancer and age?

• Of interest is to compare the prevalence of cancer between women with first pregnancy before the age of 25, and those with first pregnancy later.

• However, correct estimation of these percentages would have required a sample of women with first pregnancy before the age of 25, and a sample of women with first pregnancy later

• This was not the setup of the present experiment, where a number of cases and a number of controls are randomly selected, and where all women are then questioned about their age at first pregnancy.

• Such a design only allows correct estimation of the percentage of women with first pregnancy before the age of 25, for cases and controls separately.

• However, since rows and columns can be interchanged, this is sufficient to answer our research question of interest:

• For testing purposes, rows and columns can be interchanged, implying that the analysis of case-control data still answers the research question of interest

• For descriptive purposes, however, the choice between row and column percentages entirely depends on the design of the study.

• In the above example on cervical cancer, the row-percentages (i.e., percentage of women with first pregnancy before the age of 25), for cancer cases and controls separately, are the only ones that reflect the case-control nature of the experiment.
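The distinction between the two kinds of percentages can be illustrated with the counts of the cervical cancer table. The variable names below are illustrative only:

```python
# Counts from the case-control table on cervical cancer
cases_early, cases_late = 42, 7           # cancer cases: first pregnancy <= 25, > 25
controls_early, controls_late = 203, 114  # controls: first pregnancy <= 25, > 25

# Percentages that the design supports: early first pregnancy within each group
pct_early_cases = 100 * cases_early / (cases_early + cases_late)
pct_early_controls = 100 * controls_early / (controls_early + controls_late)

# Percentage that the design does NOT support: "cancer prevalence" among women
# with early first pregnancy. It depends on how many cases and controls were
# sampled (49 versus 317), not on how common the disease is.
naive_pct_cancer_early = 100 * cases_early / (cases_early + controls_early)

print(f"early first pregnancy among cases:    {pct_early_cases:.1f}%")
print(f"early first pregnancy among controls: {pct_early_controls:.1f}%")
print(f"not interpretable as prevalence:      {naive_pct_cancer_early:.1f}%")
```

Doubling the number of sampled controls would roughly halve the last percentage while leaving the first two essentially unchanged, which is why only the percentages within cases and within controls are meaningful descriptively.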



• Zuskin et al. [3], p.173 and Table 1:

. It is not clear when chi-squared is used, and when Fisher Exact is used


Chapter 3

The comparison of two means: Paired data

. Example

. Confidence interval for the difference of two means

. The paired t-test

. The paired versus unpaired t-test

. Example

. Assumptions



3.1 Example

• We consider the Captopril example, where blood pressure was taken in 15 hypertensive patients, before and after administration of the drug Captopril:

• Dataset ‘Captopril’

              Before          After
Patient     SBP    DBP      SBP    DBP
   1        210    130      201    125
   2        169    122      165    121
   3        187    124      166    121
   4        160    104      157    106
   5        167    112      147    101
   6        176    101      145     85
   7        185    121      168     98
   8        206    124      180    105
   9        173    115      147    103
  10        146    102      136     98
  11        174     98      151     90
  12        201    119      168     98
  13        198    106      179    110
  14        148    107      129    103
  15        154    100      131     82

Average (mm Hg)

Diastolic before: 112.3
Diastolic after: 103.1
Systolic before: 176.9
Systolic after: 158.0

• Research question:

How does treatment affect BP?

• As in the unpaired t-test, we might consider this a two-sample case, where a sample is taken from each of two populations:

. Population 1: Patients without treatment

. Population 2: Patients after treatment with Captopril

• Let µ1 be the population average BP if no treatment is given, and let µ2 denote the population average BP after treatment.

[Figure: distributions of BP after treatment and without treatment, with their means µ2 and µ1 marked on the horizontal axis]

• Interest is in inference for the difference µ = µ1 − µ2.

• The main difference when compared to the unpaired t-test is that each observation from the first sample now uniquely corresponds to one observation from the second sample, and vice versa.

• Hence, we have paired data

• In the case of unpaired data, µ would be estimated by the difference between the two sample averages:

$\hat{\mu} = \hat{\mu}_1 - \hat{\mu}_2 = \bar{x}_1 - \bar{x}_2$

• In the case of paired data, µ is estimated by the average of all subject-specific differences between BP’s before and after treatment. More specifically, the variable of interest becomes the difference X in BP before and after treatment:

$X = \mathrm{BP}_{\mathrm{before}} - \mathrm{BP}_{\mathrm{after}}$


• The observed values xi for X can be calculated from the observed values of the BP in our sample:

            Before    After    Change
Patient     DBP       DBP      xi
   1        130       125        5
   2        122       121        1
   3        124       121        3
   4        104       106       −2
   5        112       101       11
   6        101        85       16
   7        121        98       23
   8        124       105       19
   9        115       103       12
  10        102        98        4
  11         98        90        8
  12        119        98       21
  13        106       110       −4
  14        107       103        4
  15        100        82       18

• µ is the population mean of the variable X, and inference for µ can be based on the within-subject differences xi, rather than on the original BP measurements.


3.2 Confidence interval for the difference of two means

• For example, a 95% confidence interval for µ is given by

[4.91; 13.63].

• Other confidence levels (99%, 90%, . . . ) are possible as well

• The true average effect µ may or may not be in the interval [4.91; 13.63]. However, if 100 similar experiments were conducted, then 95 out of the 100 corresponding C.I.s are expected to contain µ.

• Hence, with 95% certainty, we believe µ to be within the interval [4.91; 13.63].


• This C.I. shows that:

. the estimate (9.27 mmHg) of µ is a very imprecise estimate:

∗ the C.I. is very wide

∗ with 95% certainty, the estimate is precise only up to 4.36 mmHg

. based on our data and with 95% certainty, it can be ruled out that µ would be zero, i.e., that there would be no treatment effect at all.
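The interval can be recomputed from the 15 differences as a check. The sketch below is an illustration under one assumption: the reported limits are reproduced when the normal quantile (about 1.96) is used, which is inferred here from the numbers and not stated in the text. With the t quantile for 14 degrees of freedom (about 2.145), the interval would be somewhat wider.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Observed differences x_i = BP_before - BP_after (mmHg)
x = [5, 1, 3, -2, 11, 16, 23, 19, 12, 4, 8, 21, -4, 4, 18]
n = len(x)

xbar = mean(x)                    # point estimate of mu
se = stdev(x) / sqrt(n)           # standard error of the average difference

# Assumption: normal quantile, as the reported interval suggests
z = NormalDist().inv_cdf(0.975)
half_width = z * se
lower, upper = xbar - half_width, xbar + half_width

print(f"[{lower:.2f}; {upper:.2f}], half-width {half_width:.2f}")
# [4.91; 13.63], half-width 4.36
```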


3.3 The paired t-test

• The hypothesis of interest is

H0 : µ1 = µ2 versus HA : µ1 ≠ µ2

• This is equivalent to the following test about the mean of the difference X in blood pressure:

H0 : µ = 0 versus HA : µ ≠ 0


• Hence, we will reject H0 if x̄ is much larger or smaller than 0.


• This is equivalent to rejecting H0 if |x̄ − 0| is too large

• Question:

When is |x̄ − 0| too large ?


• Answer:

If the observed difference |x̄ − 0| is very unlikely to happen by pure chance

• We therefore calculate the probability p of observing, in a similar experiment, an average effect of at least 9.27 mmHg (in absolute value), if µ = 0.

• In our example, this probability equals p = 0.001


• So, if there were no treatment effect at all, then one can expect to observe a difference of at least 9.27 mmHg in only 0.1% of the future similar experiments.

• Since p = 0.001 < 0.05 = α, we consider this sufficient evidence to conclude that Captopril affects the diastolic BP

• Conclusion:

There is a significant difference (p = 0.001) in diastolic BP before and after treatment with Captopril

• The testing procedure is called the paired t-test since paired data are analysed, and since the calculation of the p-value is based on the t-distribution.
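The calculation behind this p-value can be sketched as follows. To stay within the Python standard library, the tail area of the t-distribution is obtained by numerical integration of its density. The helper name `t_two_sided_p` is hypothetical, and a statistical package would normally be used instead.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev


def t_two_sided_p(t, df, steps=20000):
    """Two-sided p-value for a t statistic (Simpson's rule on the t density)."""
    const = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))

    def pdf(u):
        return const * (1 + u * u / df) ** (-(df + 1) / 2)

    b = abs(t)
    h = b / steps                      # steps must be even for Simpson's rule
    total = pdf(0) + pdf(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3               # P(0 < T < |t|)
    return 2 * (0.5 - area)


# Observed differences x_i = BP_before - BP_after (mmHg)
x = [5, 1, 3, -2, 11, 16, 23, 19, 12, 4, 8, 21, -4, 4, 18]
n = len(x)

# Paired t-test = one-sample t-test of H0: mu = 0 on the differences
t = (mean(x) - 0) / (stdev(x) / sqrt(n))
p = t_two_sided_p(t, df=n - 1)
print(f"t = {t:.2f}, p = {p:.3f}")  # t = 4.17, p = 0.001
```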


3.4 The paired versus unpaired t-test

• What if the Captopril data were analysed using an unpaired t-test ?


• Results from unpaired and paired t-tests, respectively:

. Unpaired:

. Paired:

• Although both tests lead to a significant result, there is a serious difference in p-values, showing that ignoring the paired nature of the data can lead to wrong conclusions.


• Conclusion:

15 × 2 measurements ≠ 30 × 1 measurement

• In general, the analysis of an outcome, measured multiple times per subject (repeated measures), requires different statistical procedures than when the outcome is measured only once for each subject.
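The contrast between both analyses can be recomputed from the Captopril table. The sketch below assumes the equal-variance version of the unpaired t-test and reuses a hypothetical helper for the t-distribution tail area. It is an illustration, not the software output shown in the course.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev, variance


def t_two_sided_p(t, df, steps=20000):
    """Two-sided p-value for a t statistic (Simpson's rule on the t density)."""
    const = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    pdf = lambda u: const * (1 + u * u / df) ** (-(df + 1) / 2)
    b = abs(t)
    h = b / steps
    total = pdf(0) + pdf(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 2 * (0.5 - total * h / 3)


before = [130, 122, 124, 104, 112, 101, 121, 124, 115, 102, 98, 119, 106, 107, 100]
after = [125, 121, 121, 106, 101, 85, 98, 105, 103, 98, 90, 98, 110, 103, 82]
n = len(before)

# Unpaired analysis: pretends there are 2 independent samples of 15 subjects
pooled_var = ((n - 1) * variance(before) + (n - 1) * variance(after)) / (2 * n - 2)
t_unpaired = (mean(before) - mean(after)) / sqrt(pooled_var * (1 / n + 1 / n))
p_unpaired = t_two_sided_p(t_unpaired, df=2 * n - 2)

# Paired analysis: one sample of 15 within-subject differences
x = [b - a for b, a in zip(before, after)]
t_paired = mean(x) / (stdev(x) / sqrt(n))
p_paired = t_two_sided_p(t_paired, df=n - 1)

print(f"unpaired: t = {t_unpaired:.2f}, p = {p_unpaired:.4f}")
print(f"paired:   t = {t_paired:.2f}, p = {p_paired:.4f}")
```

On these data the unpaired statistic is about 2.2 (28 degrees of freedom) and the paired statistic about 4.2 (14 degrees of freedom). The numerator is the same in both cases; only the standard error differs, because the paired analysis removes the between-patient variability.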


3.5 Example

• Obviously, it is important to correctly account for the paired nature of the data

• In practice, this requires knowledge about the design of the study and the way data have been collected

• As an example, suppose interest is in testing for differences in BMI between males and females

• Suppose that BMI measurements are available for 100 males and 100 females.

• The unpaired t-test is the obvious choice for the analysis, provided all assumptions are satisfied.

• Suppose now that the 100 males and 100 females are taken from 100 married couples. Would this change the preferred method of analysis ? YES ! Each male then corresponds to exactly one female, so the data are paired.


3.6 Assumptions

• The calculation of the C.I. as well as the computation of the p-value is based on the sampling distribution of X̄, which describes what values for x̄ can be expected in case the experiment would be repeated many times.

• In large samples, this sampling distribution is normal (CLT)

• In small samples, this normality is only valid in cases where the difference in BP is (approximately) normally distributed.

• Therefore, in case of small samples, one assumes the difference X to be normally distributed.

• Note that, in this context, the sample size refers to the number of pairs, not the number of observations in the data set


• Conclusion:

Large samples: no assumptions

[Figure: distribution of the difference X; is µ = 0 ?]

Small samples: Normality for difference X

[Figure: distribution of the difference X; is µ = 0 ?]

• In our Captopril example, the sample size was small (n = 15). Hence the histogram of the observed differences should be explored for any evidence against symmetry


• Histogram of observed differences:

• Assessment of symmetry is again difficult due to the small sample size, but there is no strong evidence for severe skewness.

• Note that the normality assumption is with respect to the difference X, not the original measurements.


• In our example, the original BP measurements (before and after treatment) are allowed to be skewed, as long as their differences are symmetrically distributed:

[Figure: distributions of BP after treatment (mean µ2) and before treatment (mean µ1) → distribution of the difference X; is µ = 0 ?]

• Hence, it is useless to check symmetry of the original observations.


• Note that, in case of skewness, it is often difficult and/or not helpful to transform the observed differences xi:

. Since negative differences are often observed, several standard transformations such as ln(·) or √· are not possible

. Even if a transformation such as, e.g., yi = ln(xi + 10) would yield symmetric observations yi, it is not clear what null hypothesis should be tested.

. Obviously, one can no longer test whether the mean of Y is equal to zero.

• In case of skewness, one therefore usually transforms the original data in such a way that the differences become symmetric. This has the advantage that:

. Simple, standard, transformations can often be used

. One can still test for mean zero.


• For example, a potential transformation for the Captopril data would be:

BPbefore, BPafter ⇒ ln(BPbefore), ln(BPafter) ⇒ X = ln(BPbefore) − ln(BPafter)

instead of:

BPbefore, BPafter ⇒ X = BPbefore − BPafter ⇒ Y = ln(X + 5)
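The two routes can be compared on the Captopril data. The following Python sketch is only an illustration of the mechanics (variable names are chosen here). It shows that the first route never needs a shift constant, and that a mean of zero for X still corresponds to "no change", since ln(BPbefore) − ln(BPafter) = 0 exactly when both measurements are equal.

```python
from math import log
from statistics import mean

before = [130, 122, 124, 104, 112, 101, 121, 124, 115, 102, 98, 119, 106, 107, 100]
after = [125, 121, 121, 106, 101, 85, 98, 105, 103, 98, 90, 98, 110, 103, 82]

# Route 1: transform the original data, then take differences.
# BP values are positive, so ln() is always defined.
x_log = [log(b) - log(a) for b, a in zip(before, after)]
print(round(mean(x_log), 4))        # H0: the mean of these differences is zero

# Route 2: take differences first, then transform.
x = [b - a for b, a in zip(before, after)]
non_positive = [d for d in x if d <= 0]
print(non_positive)                 # [-2, -4]: ln() is not defined for these

# A shift (here 5) is needed before ln() can be applied, and "mean of Y
# equal to zero" no longer corresponds to "no treatment effect".
y = [log(d + 5) for d in x]
```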



Chen et al. [4], p. 76 and Tables 1 and 2:


Paired t-test to test for time trends (IAC versus AOD)


Unpaired t-test to test for group differences (SARS versus Control)


Chapter 4

The comparison of two proportions: Paired data

. Example

. Mc Nemar test

. Assumptions

. Remark

. Mc Nemar versus chi-squared

. Example from biomedical literature


4.1 Example

• Consider the data on the prevalence of severe colds in 1319 children, measured at the ages of 12 and 14.

• The response of interest is whether the child had severe colds during the last 12 months

                          Severe colds at 14 yrs.
                          Yes      No      Total
Severe colds    Yes       212      144      356
at 12 yrs.      No        256      707      963
                Total     468      851     1319



Is the prevalence of severe colds different at the two ages ?

• At age 12, 356/1319 = 27% of the children reported severe colds.

• At age 14, this percentage equals 468/1319 = 35%

• These data suggest that the prevalence of severe colds increases with age.

• It would be of interest to know how likely the observed change in prevalence is to occur by pure chance.

• If this is very unlikely, the above data provide evidence that the prevalence indeed changes with age. Otherwise, the above data do not provide evidence for such a change.


• Note that the data structure is similar to the one in the Captopril data, in the sense that subjects are measured twice at different time points:

• Hence, we have again paired data.


4.2 Mc Nemar test

• Let π1 and π2 be the percentages of children in the total population with a severe cold at the ages of 12 and 14, respectively.

• Interest is in testing whether π1 and π2 are equal, which would reflect no change over time in the percentage of children with a severe cold.

• The hypothesis of interest is

H0 : π1 = π2 versus HA : π1 ≠ π2

• Note that a change over time in the percentage of severe colds can only occur if children change their status:

. No severe cold at 12 yrs → severe cold at 14 yrs

. Severe cold at 12 yrs → no severe cold at 14 yrs


• Moreover, in order to have a change over time, more children should change in one direction than in the other

• Our test will therefore reject H0 if the number of changers in one direction is much larger than the number of changers in the other direction.

• In our example, we will reject H0 if |256 − 144| is too large

• Question:

When is |256 − 144| too large ?


• Answer:

If the observed difference |256 − 144| is very unlikely to happen by pure chance


• We therefore calculate the probability p of observing, in a similar experiment, a difference between the numbers of changers at least equal to |256 − 144| = 112, if there were no change over time in the total population.

• In our example, this probability equals p < 0.0001.

• So, if severe colds occurred equally frequently at both ages, it would be very unlikely to observe what has been observed in this particular experiment

• We therefore conclude that our data provide evidence that the probability of having a severe cold at the age of 12 is not the same as the probability of having a severe cold at the age of 14.


• Conclusion:

There is a significant difference (p < 0.0001) in the occurrence of severe colds between the ages 12 and 14

• The testing procedure needed for the comparison of proportions in paired data is called the Mc Nemar test.
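A common large-sample form of the Mc Nemar statistic is the squared difference between the two numbers of changers, divided by their sum, compared with a chi-squared distribution with 1 degree of freedom. The sketch below applies that form to the severe colds data. Whether the software used for the result above applies a continuity correction is not stated; for these data both variants lead to p < 0.0001.

```python
from math import erfc, sqrt


def chi2_1df_p(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return erfc(sqrt(stat / 2))


# Changers between the two ages (off-diagonal cells of the table)
no_to_yes = 256   # no severe cold at 12 yrs, severe cold at 14 yrs
yes_to_no = 144   # severe cold at 12 yrs, no severe cold at 14 yrs

stat = (no_to_yes - yes_to_no) ** 2 / (no_to_yes + yes_to_no)
p = chi2_1df_p(stat)
print(stat, p < 0.0001)  # 31.36 True
```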


4.3 Assumptions

• Similarly to the chi-squared test, the calculation of the p-value is based on the assumption of a large sample

• In case of small samples, the p-value can be calculated without approximations based on the CLT

• The exact calculation is similar to the Fisher Exact test for unpaired data.

• Many statistical packages only support the large-sample calculations.


4.4 Remark

• As discussed before, the Mc Nemar test rejects H0 if the off-diagonal elements are too different from each other, i.e., if there are many more changes in one direction than in the other direction.

• This implies that the testing procedure is independent of the observed diagonal elements

• Examples:

Table:                   20   20               200   20
                         40   50                40  500

McNemar: comparison:     60/130 vs. 40/130     240/760 vs. 220/760
         result:         p = 0.0142            p = 0.0142
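That the diagonal cells play no role can be verified numerically. In the sketch below the reported p = 0.0142 is reproduced when a continuity correction is applied to the large-sample statistic. That the correction was used for the examples is an inference from the numbers, not something stated in the text, and the function name `mcnemar_p` is hypothetical.

```python
from math import erfc, sqrt


def mcnemar_p(table, correction=True):
    """Large-sample Mc Nemar p-value for a 2x2 table [[a, b], [c, d]].

    Only the off-diagonal cells b and c enter the calculation.
    """
    (a, b), (c, d) = table
    diff = abs(b - c) - (1 if correction else 0)
    stat = max(diff, 0) ** 2 / (b + c)
    return erfc(sqrt(stat / 2))      # chi-squared tail area, 1 degree of freedom


small = [[20, 20], [40, 50]]
large = [[200, 20], [40, 500]]       # same off-diagonal cells, other diagonal

print(round(mcnemar_p(small), 4), round(mcnemar_p(large), 4))  # 0.0142 0.0142
```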


4.5 Mc Nemar versus chi-squared

• There seems to be a lot of confusion about when the Mc Nemar test and when the chi-squared test should be used.

• As an example, consider the results from a survey in which 75 people were questioned about their intended vote in the US presidential elections, before and after a debate on national television:

                         After TV debate
                         Reagan   Carter   Total
Before       Reagan        27        7       34
TV debate    Carter        13       28       41
             Total         40       35       75


• Depending on the research question, this table can be analysed in two different ways:

. Chi-squared: test for relation between vote before and after debate

. Mc Nemar: test for equal proportion Reagan voters before and after debate

• Hence, even when data are paired, the chi-squared test can be used

• Note that, in case of continuous data, there is no such choice:

. Unpaired data ⇒ Unpaired t-test

. Paired data ⇒ Paired t-test


4.5.1 Mc Nemar test


                         After TV debate
                         Reagan   Carter   Total
Before       Reagan        27        7       34
TV debate    Carter        13       28       41
             Total         40       35       75


Is the proportion of Reagan voters the same before and after the debate ?

• The observed proportions are 34/75 = 45.3% and 40/75 = 53.3%


• The p-value obtained from the Mc Nemar test equals p = 0.2636.

• Hence a difference at least as large as the observed 45.3% versus 53.3% would happen in 26.36% of the cases, even if the percentage of voters for Reagan is the same before and after the debate.

• Conclusion:

The debate has not significantly changed the voting behaviour (p = 0.2636).


4.5.2 Chi-squared test


                         After TV debate
                         Reagan   Carter   Total
Before       Reagan        27        7       34
TV debate    Carter        13       28       41
             Total         40       35       75


Is there a relation between voting behaviour before and after the debate ?

• Or equivalently:

Is the proportion of Reagan voters after the debate the same amongst those who were in favour of Reagan before the debate as amongst those who were in favour of Carter before the debate ?


• The observed proportions are 27/34 = 79.4% and 13/41 = 31.7%

• Note that this comes down to comparing the proportion of Reagan voters after the debate, between two separate groups: Those who were in favour of Reagan before the debate, and those who were not in favour of Reagan before the debate.

• Hence, we now compare unpaired proportions.

• The p-value obtained from the Chi-squared test equals p < 0.0001.

• The observed difference of 79.4% versus 31.7% is very unlikely to happen if there were no relation between the voting behaviour before and after the debate.


• Conclusion:

There is a significant relation between the voting behaviour before and after the debate (p < 0.0001).
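Both analyses of the debate table can be reproduced with a few lines of Python. The sketch assumes the large-sample Mc Nemar statistic with continuity correction (which reproduces p = 0.2636) and the Pearson chi-squared statistic without continuity correction. These choices are inferred from the reported p-values, not stated in the text.

```python
from math import erfc, sqrt


def chi2_1df_p(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return erfc(sqrt(stat / 2))


# Rows: vote before the debate (Reagan, Carter)
# Columns: vote after the debate (Reagan, Carter)
(a, b), (c, d) = [[27, 7], [13, 28]]
n = a + b + c + d

# Mc Nemar: are the marginal proportions 34/75 and 40/75 equal?
# Only the changers (b = 7 and c = 13) are used.
mcnemar_stat = (abs(b - c) - 1) ** 2 / (b + c)
p_mcnemar = chi2_1df_p(mcnemar_stat)

# Chi-squared: is the vote after the debate related to the vote before?
# All four cells are used (27/34 versus 13/41).
chi2_stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
p_chi2 = chi2_1df_p(chi2_stat)

print(round(p_mcnemar, 4), p_chi2 < 0.0001)  # 0.2636 True
```

The same table thus answers two different questions, and the two p-values need not agree.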


4.5.3 General conclusion

• The survey results can be analysed in two different ways, leading to two different conclusions:

. Mc Nemar: There is no evidence that a TV debate would change the results of an election (p = 0.2636)

. Chi-squared: There is a strong relation between voting behaviour before and after the debate (p < 0.0001).

• Note that the proportion of Reagan voters before and after a TV debate could also be compared based on unpaired data.

• One then would question 75 people before the debate, and one would question 75 other people after the debate.


• The resulting 2 × 2 table would then contain 150 subjects:

                         Preference
                         Reagan   Carter   Total
TV debate    Before        34       41       75
             After         40       35       75
             Total         74       76      150

• The chi-squared test would compare the observed proportions 34/75 = 45.3% and 40/75 = 53.3%, which are the same ones as those compared before with the Mc Nemar test for the experiment with paired observations


4.5.4 Some further examples

• There is no relation between (non-)significance of the chi-squared test and (non-)significance of the Mc Nemar test

• Examples:

Table:                  25  25              10  10              40  10               5  20
                        25  25              40  40              10  40              45  30

χ2:      comparison:    25/50 vs. 25/50     10/50 vs. 10/50     40/50 vs. 10/50     5/50 vs. 20/50
         result:        p = 1.0000          p = 1.0000          p < 0.0001          p = 0.0291

McNemar: comparison:    50/100 vs. 50/100   50/100 vs. 20/100   50/100 vs. 50/100   50/100 vs. 25/100
         result:        p = 1.0000          p < 0.0001          p = 1.0000          p = 0.0098


4.6 Example from biomedical literature

De Clercq et al. [5], Abstract:

Mc Nemar test to compare the presence of symptoms before and after surgery.


Chapter 5

Errors in statistics: Basic concepts

. Introduction

. Two types of errors

. Power

. Sample size calculation

. Examples

. Remarks



5.1 Introduction

• Re-consider the example on the weight gain in rats, where interest is in the comparison between rats fed on a high or low protein diet

• Group-specific histograms:


• Group-specific summary statistics:

• On average, there is an observed difference of 19 g between the rats on a high protein diet and those on a low protein diet.

• Based on the unpaired t-test, we obtained before that this observed difference is not sufficient evidence to believe that the weight gain is really different for the two diets (p = 0.0757)


• Conclusion:

There is no significant difference (p = 0.0757) in weight gain between rats on a high protein level diet, and rats on a low protein level diet

• As indicated before, the result of a statistical test should be interpreted as evidence in favour of or against the null hypothesis, and should not be interpreted as formal proof.

• In our example, the difference in weight gain between a population treated with one diet and a population treated with the other diet may be too small to be detected based on 12 and 7 animals, respectively.


• Alternatively, if the t-test had led to p = 0.001, this would still not formally prove that there is a difference between both populations.

• After all, p = 0.001 would only indicate that a difference at least as large as the observed 19 g occurs once every 1000 times, even if there is no difference at all between both populations.

• Maybe, our sample was indeed the extreme one that happens once every thousand experiments.

• Hence, whenever statistical tests are used, one has to be aware that errors in the conclusions can occur.

• It is therefore important to quantify the errors, and to keep them under control


5.2 Two types of errors

                            Reality
                            H0 correct      H0 not correct
Test result   Accept H0     No error        Type II error
              Reject H0     Type I error    No error

• Type I error: H0 is incorrectly rejected

• Type II error: H0 is incorrectly accepted


5.3 Type I error

• A type I error occurs if H0 is correct but the test leads to a significant result.

• Question:

How likely is such an error to occur ?

• Suppose the test is performed at the α = 5% level of significance

• If H0 is correct, then one will observe a significant result in 5% of the cases

• Hence, in 5% of the cases, H0 would be incorrectly rejected


• The probability of making a type I error is therefore equal to the chosen level α of significance.

• In practice, the probability of making a type I error is kept under control by choosing α sufficiently small

• In biomedical sciences α = 5% is often used, thereby accepting a type I error in 5% of the cases in which H0 is correct.


                            H0 correct
Test result   Accept H0     1 − α
              Reject H0     α
                            1

• If H0 is correct, then the probability of making a type I error is α, while the probability of correctly accepting H0 is 1 − α.
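The statement that H0 is incorrectly rejected in a fraction α of the cases can be checked by simulation. The sketch below is a simplified illustration with hypothetical settings: the data are generated with mean 0 (so H0 is correct), the standard deviation is taken as known, and a z-test is used instead of a t-test to keep the code within the Python standard library.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)                        # fixed seed, so that the run is repeatable
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

n = 15                                # hypothetical sample size per experiment
n_experiments = 4000                  # hypothetical number of repeated experiments

rejections = 0
for _ in range(n_experiments):
    # H0 is correct: observations come from a population with mean 0, sd 1
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(sample) / n) * sqrt(n)   # standard normal when H0 is correct
    if abs(z) > z_crit:
        rejections += 1               # significant result: a type I error

rate = rejections / n_experiments
print(rate)                           # close to alpha = 0.05
```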


5.4 Type II error

• A type II error occurs if H0 is incorrect but the test has not detected this, i.e., a non-significant result is obtained

• Question:

How likely is such an error to occur ?

• In contrast to the type I error, the probability of making a type II error is not easily controlled, and depends on various aspects of the sample(s) and population(s)


• In analogy to the type I error, the type II error rate is denoted by β


                            Reality
                            H0 correct      H0 not correct
Test result   Accept H0     1 − α           β
              Reject H0     α               1 − β
                            1               1

• The power of a statistical test is 1− β, the probability of correctly rejecting H0


5.5 Power

• In general, a specific testing procedure is acceptable, only if:

. the chance of making a type I error is sufficiently small

. the power to detect deviations from H0 is sufficiently large

• The first condition can be met by specifying α sufficiently small.

• The second condition is more difficult to meet, as the power depends on various aspects of the sample(s) and population(s)

• This will be illustrated in the context of the comparison of two groups (such as the weight gain experiment)


• As before, let µ1 and µ2 represent the average weight gain in the total population, under high and low protein diets, respectively.

• The null and alternative hypotheses are given by

H0 : µ1 = µ2 versus HA : µ1 ≠ µ2

• The power is the probability of correctly rejecting H0.

• In that case, µ1 ≠ µ2, and we denote the true difference between both populations by ∆ = µ1 − µ2

• The unpaired t-test assumes the data to be normally distributed in both populations, with equal variability σ²


• Graphically:

[Figure: two normal distributions with means µ2 and µ1, separated by the difference ∆, each with variability σ²]


5.5.1 Power as a function of α

The smaller α, the smaller the power

• Intuitively: Type I errors are less likely if the null hypothesis is rejected less often. However, in cases where H0 is truly wrong, it will then also be rejected less often.

• An extreme case is obtained for α = 0:

. α = 0 implies that the null hypothesis is always accepted

. So, in case the null hypothesis is wrong, it is still accepted, leading to power 0


5.5.2 Power as a function of true difference ∆

The smaller ∆, the smaller the power

• Intuitively: Large deviations from the null hypothesis are easier to detect

[Figure: two pairs of distributions with means µ2 and µ1, one pair far apart (large ∆) and one pair close together (small ∆)]


5.5.3 Power as a function of variability σ²

The smaller σ², the larger the power

• Intuitively: Homogeneous groups are more easily distinguished than heterogeneous groups

[Figure: two panels, each showing two populations with means µ1 and µ2 a distance ∆ apart, for different amounts of within-group variability]


5.5.4 Power as a function of sample size(s)

The more observations, the larger the power

• Intuitively: More observations yield more information about the population(s), and therefore more precision in the conclusions


5.5.5 Conclusion

• The power depends on various aspects:

. Level of significance α

. True difference ∆ between the populations

. Within-group variance σ²

. Sample size(s)

• Note that the sample size is the only aspect under control of the investigator.

• In practice, one can calculate the sample size needed to reach a sufficiently high power.
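• The four dependencies listed above can be illustrated numerically. The following is a minimal sketch, not the calculation used for the examples in these notes: it assumes a normal approximation with known σ, and the function name `approx_power` and the baseline scenario are hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1


def approx_power(delta, sigma, n1, n2, alpha=0.05):
    """Approximate power of a two-sided comparison of two means.

    Normal approximation (an assumption of this sketch): sigma is treated
    as known, so the result differs slightly from exact t-test power.
    """
    se = sigma * sqrt(1.0 / n1 + 1.0 / n2)       # standard error of the difference
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)   # two-sided critical value
    shift = delta / se                           # true difference in standard errors
    # Probability that the test statistic lands in either rejection region.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


# Hypothetical baseline scenario, then one aspect changed at a time.
base = {"delta": 10.0, "sigma": 20.0, "n1": 20, "n2": 20, "alpha": 0.05}
scenarios = {
    "baseline": base,
    "larger alpha": {**base, "alpha": 0.10},
    "larger delta": {**base, "delta": 15.0},
    "smaller sigma": {**base, "sigma": 15.0},
    "larger samples": {**base, "n1": 40, "n2": 40},
}
for label, args in scenarios.items():
    print(f"{label:15s} power = {approx_power(**args):.3f}")
```

Each change relative to the baseline increases the approximate power, in line with sections 5.5.1 to 5.5.4. With ∆ = 0 the function returns α.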


5.6 Sample size calculation

• As indicated before, a testing procedure is only acceptable if it has sufficient power, i.e., if the probability of making a type II error is sufficiently small.

• Since the sample size is the only aspect influencing the power that is under the control of the investigator, it is important that experiments are large enough for the power to be sufficiently large as well

• The level α of significance is chosen such that the probability of making a type I error is sufficiently small

• The within-group variance σ² is pre-specified based on earlier, similar experiments, relevant literature, or a pilot study


• To be on the safe side, usually an upper bound for σ² is used: If the variability turns out to be smaller, the power will be higher, hence still sufficiently high

• In practice, ∆ is not known. Instead, one specifies the smallest ∆ that would still be clinically relevant to detect.

• If sufficient power is attained for the smallest meaningful ∆, we have that:

. Any larger difference will be detected with even larger power

. We are not concerned about small powers for detecting smaller differences, as such differences are not relevant anyway.

• One can then calculate the number(s) of observations needed to reach a desired level of power.
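• As a rough illustration of such a calculation, the sketch below uses the familiar normal-approximation formula n = 2 σ² (z₁₋α/₂ + z_power)² / ∆² for two equally sized groups. This formula and the function name `n_per_group` are assumptions of this sketch; software based on the t-distribution typically returns slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sigma, power=0.90, alpha=0.05):
    """Approximate number of observations per group (two equal groups).

    delta : smallest clinically relevant difference
    sigma : (upper bound for the) within-group standard deviation
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_power = z.inv_cdf(power)           # desired power
    n = 2.0 * (sigma * (z_alpha + z_power) / delta) ** 2
    return ceil(n)                       # round up to whole observations


# Inputs taken from the weight gain example: sigma = 21, delta = 20, power 90%.
print(n_per_group(delta=20.0, sigma=21.0, power=0.90))
```

Halving ∆ roughly quadruples the required sample size, which is why the choice of the smallest relevant ∆ matters so much.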


5.7 Example: Weight gain data

• In the weight gain data, the observed difference of 19g was found not to be significant (p = 0.0757)

• We can calculate the probability (power) that a real difference of 19g would be found significant if a new experiment were to be conducted, again with 12 and 7 observations in the high and low protein diet groups, respectively.

• Group-specific summary statistics, from the current experiment:


• Power calculations will be based on σ = 21, and α = 0.05

• The power to detect a difference ∆ of 19g equals 43.45%

• Hence, with 12 and 7 observations respectively, there is only a 43.45% chance that a true difference of 19g would be detected.

• If a difference of 19g is considered clinically relevant, then the weight gain experiment was clearly too small, since it is very likely that such a difference would remain undetected.

• We can also calculate the power for other values of ∆


• Summary:

∆       Power to detect a difference ∆
0g      5.00%∗
10g     15.70%
19g     43.45%
30g     80.80%
40g     96.49%

∗: equal to α

• For example, 12 and 7 observations would be sufficient to show a true difference of 40g with more than 96% chance.

• Alternatively, one can also calculate how large the samples should be to detect a difference of, e.g., 20g with sufficiently high power.


• If a power of 90% is required to detect true effects as small as ∆ = 20g, at least 25 observations are needed in each group.

• With 30 observations in each group, the probability of making a type II error, when the true effect is not smaller than 20g, is at most approximately 5%.
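• The power reported for ∆ = 19g can be checked approximately by simulation. The sketch below assumes normally distributed weight gains with common σ = 21 and a pooled-variance two-sample t-test with 17 degrees of freedom; the critical value 2.110 is taken from a t-table. These modelling choices and all names are assumptions of this sketch, not a description of how the figures in these notes were obtained.

```python
import random
from math import sqrt

T_CRIT_DF17 = 2.110  # two-sided 5% critical value, t-distribution with 17 df (from tables)


def mean_and_variance(values):
    """Sample mean and sample variance (denominator n - 1)."""
    n = len(values)
    m = sum(values) / n
    return m, sum((v - m) ** 2 for v in values) / (n - 1)


def simulated_power(delta, sigma=21.0, n1=12, n2=7,
                    t_crit=T_CRIT_DF17, n_sim=20000, seed=1):
    """Fraction of simulated experiments in which |t| exceeds t_crit."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        group1 = [rng.gauss(delta, sigma) for _ in range(n1)]
        group2 = [rng.gauss(0.0, sigma) for _ in range(n2)]
        m1, v1 = mean_and_variance(group1)
        m2, v2 = mean_and_variance(group2)
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        t = (m1 - m2) / sqrt(pooled * (1.0 / n1 + 1.0 / n2))
        if abs(t) > t_crit:
            rejections += 1
    return rejections / n_sim


print("simulated power for a true difference of 19g:", simulated_power(19.0))
print("simulated rejection rate when there is no difference:", simulated_power(0.0))
```

With enough simulated experiments, the first value should be close to the 43.45% in the summary table, and the second close to α = 5%.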


5.8 Example: Sickness absence

• We re-consider the data on sickness absence, collected on 585 employees with a similar job:



          No absence   Absence   Total
female       245         184      429
male          98          58      156
Total        343         242      585

• The observed difference between the absence rates of 42.9% in females and 37.2% in males was found not to be significant (chi-squared test, p = 0.215).


• If the percentages of sickness absence were 42% in the total female population and 37% in the total male population, and if a random sample of 429 females and 156 males were taken, there would be a 19.01% chance of reaching a significant effect.

• So, if the population proportions are indeed 42% and 37%, experiments with 429 females and 156 males would detect this difference only about 19 times out of 100.

• If a difference of 5 percentage points is considered clinically relevant, then the current experiment was clearly too small, since it is very likely that such a difference would remain undetected.

• We can calculate how large the samples should be in order to detect a difference between 42% and 37%, with sufficiently high power


• For example, two samples of approximately 2500 observations each are needed in order to show a difference between 37% and 42%, with 95% probability
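• Both numbers in this example can be approximated with a normal approximation for the difference of two proportions. The sketch below is one common textbook version, assumed here for illustration; it is not necessarily the method used to produce the 19.01% and the 2500 above, and the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power_two_proportions(p1, p2, n1, n2, alpha=0.05):
    """Approximate power of a two-sided comparison of two proportions."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # standard error of p1_hat - p2_hat
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = abs(p1 - p2) / se
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)


def n_per_group_proportions(p1, p2, power=0.95, alpha=0.05):
    """Approximate observations per group, for two groups of equal size."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_power = Z.inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / (p1 - p2) ** 2)


print("approximate power with 429 and 156 observations:",
      round(power_two_proportions(0.42, 0.37, 429, 156), 3))
print("approximate n per group for 95% power:",
      n_per_group_proportions(0.42, 0.37, power=0.95))
```

The approximation lands close to the values quoted above: a power of roughly 19 to 20%, and a little under 2500 observations per group.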


5.9 Remarks

• The earlier examples of power and/or sample size calculations were in the context of the unpaired t-test and chi-squared test.

• Similar calculations can be done in any other statistical testing situation, e.g., Fisher Exact test, paired t-test, McNemar test, . . .

• Strictly speaking, all experiments should be preceded by a realistic sample size calculation to avoid experiments with unacceptably high type II error rates, i.e., with almost no chance at all to show clinically meaningful effects.



Wong et al. [6]

• Methodology section, p.658:


• Table 2 with results:

• Discussion, p.664:


• The difference on which the sample size calculation was based was much larger than what actually was observed in the experiment

• Therefore, the power to reject equality of the groups was (much) lower than the expected 80%

• The current study cannot tell the difference between a 9% increase and a 3% decrease.

• If such differences are considered clinically important, then the current study was under-powered, due to the fact that the difference was overestimated at the time of the sample size calculation.


Chapter 6

Errors in statistics: Practical implications

. Multiple testing

. Bonferroni correction

. Tests for baseline differences

. Equivalence tests

. Significance versus relevance

. Examples from biomedical literature


6.1 Multiple testing

• Each time a test is performed, there is a probability α of making a type I error

• For example, if α = 0.05, we can expect to incorrectly reject a true null hypothesis in about 5 out of 100 tests.

• Implication:

“The more tests one performs, the higher the probability that something is detected by pure chance”

• This problem of multiple testing occurs very frequently in biomedical sciences, in various settings


6.1.1 Example: A classroom experiment

• On entering the classroom, each student is assigned at random to be seated at the left or at the right side of the classroom

• Compare both sides with respect to 100 aspects including weight, height, age, gender, color of hair, color of eyes, . . .

• It is to be expected that for about 5 of these outcomes (100 × 0.05), a significant difference is obtained at the 5% level of significance, by pure chance.
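• The arithmetic behind the classroom experiment can be sketched as follows. The expected number of false positives, k × α, holds in general; the probability of at least one false positive is computed here under the additional assumption that the k tests are independent, which is only an approximation for related characteristics such as weight and height. Function names are hypothetical.

```python
def expected_false_positives(k, alpha=0.05):
    """Expected number of significant results when all k null hypotheses are true."""
    return k * alpha


def prob_at_least_one_false_positive(k, alpha=0.05):
    """Chance of one or more type I errors, assuming k independent tests."""
    return 1.0 - (1.0 - alpha) ** k


for k in (1, 5, 18, 63, 100):
    print(f"k = {k:3d}: expected false positives = {expected_false_positives(k):.2f}, "
          f"P(at least one) = {prob_at_least_one_false_positive(k):.3f}")
```

The values k = 18 and k = 63 correspond to the numbers of tests in the two literature examples that follow.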


6.1.2 Example: Testing many relations

• Amin et al. [7], Table 2:

. 18 tests performed

. only 2 significant results


6.1.3 Example: Subgroup analyses

• Kaplan et al. [8], Table 5:

. Tests based on C.I.’s for odds ratios

. C.I. containing 1 is equivalent to a non-significant test result

. 21 × 3 = 63 tests performed

. only 5 significant results


6.1.4 Example: Searching for the most significant results

• This ‘scientific finding’ was printed in the Belgian newspapers:

• It was even stated that those who wake up before 7.21am have a statistically significantly higher stress level during the day than those who wake up after 7.21am.


6.1.5 Conclusion

• Significant results obtained by multiple testing are often overinterpreted

• If the number of tests is reported, the reader knows that such results need to be interpreted with extreme care

• The problem arises when only the significant results are reported, and one does not know how many tests were performed in total

• This leads to reporting results which turn out not to be reproducible

• For example, a new study would not find that students seated on the left are taller than those on the right. Instead, students seated on the left may weigh more than those seated on the right.


• For example, a new experiment might show no difference in stress levels between subjects waking up early and those waking up late. Or maybe a difference would be found only when waking up is later than 8.12am.


6.2 Bonferroni correction

• Suppose two tests are performed, both at the 5% level of significance.

• The probability that at least one type I error will be made can be shown not to exceed 2 × 0.05 = 0.10:

P(at least 1 type I error) ≤ 2 × 5% = 10%

• In general, if k tests are performed, all at the 5% level of significance, the probability of making at least one type I error can only be shown not to exceed k × 5%

• Hence, the overall type I error rate can be controlled by performing each separate test at the α/k level of significance.
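• The bound used above is an instance of a standard probability inequality (the union bound). As a sketch, write E_i for the event that test i yields a type I error, with P(E_i) = α for each test; this notation is introduced here for illustration only:

```latex
P(\text{at least one type I error})
  = P(E_1 \cup E_2 \cup \dots \cup E_k)
  \le \sum_{i=1}^{k} P(E_i)
  = k \times \alpha
```

The inequality holds whether or not the tests are independent. Replacing α by α/k in each test makes the right-hand side equal to α.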


• For example, performing 2 tests at the 2.5% level of significance each implies that the probability of making at least one type I error will not exceed 5%.

• In general, when k tests are performed at the α/k level of significance, one is sure that the overall probability of making at least one type I error will not exceed α.

• This correction of the significance level is called the Bonferroni correction.

• When confidence intervals are used instead of p-values, the confidence levels can be corrected in a similar way


• Some examples:

Number of tests   Significance level α   Confidence level
1                 0.05                   95%
2                 0.025                  97.5%
5                 0.01                   99%
k                 0.05/k                 (1 − 0.05/k) × 100%

• For example, if CI1, CI2, . . . , CI5 are 5 intervals with 99% confidence, for 5 unknown parameters θ1, θ2, . . . , θ5, then there is at least 95% probability that all 5 C.I.’s simultaneously contain their respective parameters:

P (CI1 contains θ1 and . . . and CI5 contains θ5) ≥ 95%


• Note that, strictly speaking, the Bonferroni correction is an overcorrection, since the overall type I error rate can only be shown not to exceed 5%, and usually will be smaller than the required 5%.

• In some specific testing situations (e.g., ANOVA), more accurate corrections are available.
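• The table of corrected levels, and the sense in which Bonferroni overcorrects, can be sketched in a few lines. The comparison value assumes k independent tests, which is an assumption of this sketch; function names are hypothetical.

```python
def bonferroni(k, alpha=0.05):
    """Per-test significance level and matching confidence level (in %) for k tests."""
    per_test_alpha = alpha / k
    confidence_level = 100.0 * (1.0 - per_test_alpha)
    return per_test_alpha, confidence_level


def overall_error_if_independent(k, alpha=0.05):
    """Actual chance of at least one type I error after correction, for independent tests."""
    return 1.0 - (1.0 - alpha / k) ** k


for k in (1, 2, 5, 20):
    level, confidence = bonferroni(k)
    print(f"k = {k:2d}: per-test level = {level:.4f}, confidence level = {confidence:.2f}%, "
          f"overall error (independent tests) = {overall_error_if_independent(k):.4f}")
```

For independent tests the overall error stays just below 5%. For strongly dependent tests it can be considerably smaller, which is when the correction becomes most conservative.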


6.3 Examples from the biomedical literature

• Baba et al. [9], p.1202 and p.1203:


• Kellett et al. [2], Table 2 (for example):


In the discussion, R. Roy writes:

Note that the reader cannot perform the Bonferroni correction as the exact p-values have not been reported.


6.4 Tests for baseline differences

• In order to show causal effects, patients are often randomized into 2 or more groups

• This ensures (at least in large studies) that all treatment groups are identical, except for the treatment the patients receive

• In (relatively) small studies, imbalances can still occur by pure chance

• Therefore, one often compares the various groups with respect to important factors which are believed to be strongly related to the outcome of interest.

• This is called testing for baseline differences, as one compares the characteristics of the patients at the start of the study.


• As an example, suppose interest is in comparing two oral treatments, A and B, for the treatment of hypertension.

• Suppose the change in diastolic BP is the outcome of interest

• Age is one of the factors believed to be strongly related to BP. Therefore, it is important that both treatment groups have the same age distribution

• Therefore, one often tests for age differences between A and B, e.g., based on the two-sample t-test.

• The hypothesis tested is

H0 : µA = µB versus HA : µA ≠ µB

• Note that H0 and HA express properties of the populations, not the samples


• In the populations (infinitely large), we know that, due to the randomization, µA and µB are identical

• Conclusion:

It makes no sense at all to perform baseline tests in randomized studies

• No matter how small the resulting p-value would be (e.g., < 10⁻⁸), we know that the observed difference in age between groups A and B has occurred purely by chance.

• A meaningful alternative is to calculate a C.I. for the average age difference between both groups, to check whether the observed difference is small enough to conclude that it cannot (completely) explain the observed differences in the outcome of interest.


• In our example, suppose that a 95% confidence interval for the average difference in age (years) is given by [0.1; 0.3]. We would then believe that this difference is too small to explain why patients in group A show a larger decrease in BP than patients in group B.

• Note also that testing for baseline differences cannot be used to check whether the randomization was done properly.



Nissen et al. [1], abstract and table 1:

A two-arm randomized study


formal tests at baseline


6.6 Equivalence tests

• Suppose two groups A and B are to be compared, and a two-sample t-test is used to test

H0 : µA = µB versus HA : µA ≠ µB

• In case of a non-significant test result, one often concludes that both groups are identical or equivalent

• An alternative interpretation is that the experiment did not have sufficient power to show an effect which is present.

• Conclusion:

Non-significance should not be interpreted as equivalence


• This can also be seen from the fact that, if the two-sample t-test could be used to show equivalence, it would be best to collect data on (extremely) small samples, as this would increase the chance of obtaining a non-significant result, due to lack of power.

• Instead, one should reverse H0 and HA:

H0 : |µA − µB| > ∆ versus HA : |µA − µB| ≤ ∆

where ∆ is a pre-specified constant, defining ‘equivalence’

• Note that HA is equivalent to −∆ ≤ µA − µB ≤ ∆

• Hence, in order to reject H0, one needs to show evidence that µA and µB are no more than ∆ away from each other

• One way to proceed is to construct a C.I. for µA − µB and to check whether it is entirely within the interval [−∆; ∆].


• Graphically, H0 would be rejected if:

[Figure: axis for µA − µB, marked at −∆, 0 and +∆; the 95% C.I. lies entirely within [−∆; +∆]]

• Graphically, H0 would not be rejected if:

[Figure: axis for µA − µB, marked at −∆, 0 and +∆; the 95% C.I. is not entirely within [−∆; +∆]]


• Obviously, the result of the equivalence test entirely depends on the choice of ∆

• Therefore, ∆ needs to be specified prior to the data collection
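• The confidence-interval check described in this section amounts to a simple comparison. A minimal sketch, with a hypothetical function name and made-up interval limits:

```python
def shows_equivalence(ci_lower, ci_upper, delta):
    """True if the C.I. for muA - muB lies entirely within [-delta, +delta].

    Only in that case is H0 (the groups differ by more than delta) rejected.
    """
    if delta <= 0:
        raise ValueError("delta must be a positive, pre-specified constant")
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    return -delta <= ci_lower and ci_upper <= delta


# Made-up intervals, with delta = 3 specified in advance.
print(shows_equivalence(-1.0, 2.0, delta=3.0))    # narrow interval inside [-3, 3]
print(shows_equivalence(-1.0, 4.0, delta=3.0))    # interval crosses +3
print(shows_equivalence(-10.0, 10.0, delta=3.0))  # wide interval around 0
```

The last call illustrates the main message of this section: a wide interval around 0 corresponds to a non-significant classical test, yet it does not show equivalence.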



• Shatari et al. [10]:

. Title:


. Table 1:

No significant differences!


. Results and conclusions (abstract):


• Sripalakit et al. [11], abstract, Table 3, and p.1038

. Title:


. Study design:

∗ Aim: equivalence of 2 treatments

∗ Cross-over: all subjects receive both treatments

∗ ‘Washout’ period of 1 week between both treatments

∗ Treatments given in random order


. Definition of equivalence:

∗ Paired data, with skewed distribution for differences

∗ Log transformation of original outcomes: ln(Yi) − ln(Xi) = ln(Yi/Xi)

∗ Equivalence defined as: ∆ = 0.22 ⇒ [−∆; +∆] = [−0.22; +0.22]

∗ Back-transformed: [exp(−0.22); exp(+0.22)] = [0.80; 1.25]
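• The back-transformation from the log scale to the ratio scale can be verified directly. A small sketch, with a hypothetical function name:

```python
from math import exp


def ratio_limits(delta):
    """Back-transform the log-scale interval [-delta, +delta] to limits for the ratio Y/X."""
    return exp(-delta), exp(delta)


lower, upper = ratio_limits(0.22)
print(f"[{lower:.2f}; {upper:.2f}]")
```

Because exp(−∆) = 1/exp(∆), the limits are symmetric on the ratio scale in the multiplicative sense: one limit is the reciprocal of the other.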


. Table 3 with results, and conclusion (abstract):


6.8 Significance versus relevance

• We discussed before that the power to detect some effect ∆ increases with the sample size

• This implies that any effect ∆, no matter how small, will, sooner or later, be detected, if the sample is sufficiently large.

• For example, consider the Captopril data, where the observed difference of 9.27 mmHg was found significantly different from zero (p < 0.001), based on data from 15 patients only:


• The 99% confidence interval for the average change µ in BP was found to be [3.02; 15.52].

• Suppose instead that the observed difference had been 0.1 mmHg.

• A p-value as small as 0.001 could still be obtained, provided that the sample were sufficiently large.

• Obviously, an average change in BP as small as 0.1 mmHg is not relevant from a clinical point of view.

• Conclusion:

Statistical significance ≠ Clinical relevance
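• To see how a clinically irrelevant change of 0.1 mmHg can become ‘highly significant’, the sketch below computes a two-sided p-value for increasing sample sizes. It uses a normal approximation and a standard deviation of 8 mmHg for the individual changes; both are hypothetical choices made for illustration, not values taken from the Captopril data.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(mean_change, sd, n):
    """Two-sided p-value for H0: mu = 0, using a normal approximation."""
    z = mean_change / (sd / sqrt(n))          # observed mean in standard errors
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


ASSUMED_SD = 8.0  # hypothetical standard deviation of the changes, in mmHg
for n in (15, 1000, 10000, 100000):
    print(f"n = {n:6d}: p = {two_sided_p(0.1, ASSUMED_SD, n):.5f}")
```

The observed effect is the same in every row; only the sample size changes, and with it the p-value.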


• A highly significant effect can be a large effect:

[Figure: axis for µ, marked at 0; 95% C.I. located far away from 0; p = 0.0001]

• A highly significant effect can also be a very small effect, but estimated with high precision, due to a large sample size:

[Figure: axis for µ, marked at 0; narrow 95% C.I. located close to 0 but not containing it; p = 0.0001]


• The p-value cannot distinguish between both situations

• It is therefore important not to blindly overinterpret significant results without knowing the size of the effect

• This is another reason why confidence intervals are to be preferred over significance testing


Chapter 7

One-sided versus two-sided tests

. Introduction

. One-sided tests

. Example



7.1 Introduction

• Re-consider the Captopril data, where the observed average difference of 9.27 mmHg was found significantly different from zero (p < 0.001):

• The hypothesis tested is

H0 : µ = 0 versus HA : µ ≠ 0

• This hypothesis is two-sided since it is not pre-specified whether, in case H0 is rejected, µ is larger or smaller than 0


• This implies that an observed difference much larger or much smaller than 0 provides evidence against H0

• This is also reflected in the calculation of the p-value:

p is the probability of observing an average difference at least as far away from 0 as 9.27, if µ = 0.

• This is equivalent to

p is the probability of observing an average difference larger than 9.27 or smaller than −9.27, if µ = 0.


• Graphically:

[Figure: sampling distribution of X̄ under H0, centered at 0; the tail areas below −9.27 and above 9.27 each equal p/2]


7.2 One-sided tests

• Sometimes it is of interest to test one-sided hypotheses, e.g.,

H0 : µ ≤ 0 versus HA : µ > 0

• Obviously, observed differences smaller than 0 do not provide any evidence against H0.

• Only differences larger than 0 can be used as evidence in the data against H0

• This has implications for the calculation of the p-value:

p is the probability of observing an average difference at least as large as 9.27, if µ = 0.


• Graphically:

[Figure: sampling distribution of X̄ under µ = 0, with −9.27, 0 and 9.27 marked; the tail area above 9.27 equals p]


• Note that the above distribution is the sampling distribution of X̄ assuming µ = 0.

• Intuitively: If the data provide evidence to reject µ = 0 in favour of µ > 0, then they also provide evidence to reject µ ≤ 0

• Note that, when the observed effect is in the direction of HA, the p-value is now only half the p-value one would obtain when testing the two-sided hypothesis

• As a result, significance is reached more often.

• It is therefore tempting to search for arguments justifying one-sided testing rather than the classical two-sided testing.

• Often, this is done after the data have been collected, and after having seen the direction of the observed effect (positive or negative).


• However, the study objectives should never be influenced by the data that are observed.

One-sided testing is justified only if

. it is known that an effect, if any, can only be in one direction

. only one direction is of scientific interest

. the decision is made prior to the data collection
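• The relation between one-sided and two-sided p-values can be sketched for a test statistic with a standard normal distribution under µ = 0. The normal approximation and the function name are assumptions of this sketch; the same relation holds for the t-distribution.

```python
from statistics import NormalDist


def p_values(z):
    """Return (two-sided p, one-sided p for HA: mu > 0) for a z statistic."""
    upper_tail = 1.0 - NormalDist().cdf(z)            # P(Z >= z) under mu = 0
    two_sided = 2.0 * min(upper_tail, 1.0 - upper_tail)
    return two_sided, upper_tail


for z in (1.8, -1.8):
    two_sided, one_sided = p_values(z)
    print(f"z = {z:+.1f}: two-sided p = {two_sided:.4f}, one-sided p = {one_sided:.4f}")
```

For z = 1.8 the one-sided test is significant at the 5% level while the two-sided test is not, which is exactly the temptation described above. For z = −1.8 the one-sided p-value is large: an effect in the opposite direction provides no evidence against H0 : µ ≤ 0.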


7.3 Example

• In the context of the Captopril data, suppose that one is only interested in treatments which yield an average decrease of at least 5 mmHg in diastolic BP.

• This would lead to testing

H0 : µ ≤ 5 versus HA : µ > 5

• Note that only differences larger than 5 can be used as evidence against H0

• The p-value is calculated as:

p is the probability of observing an average difference at least as large as 9.27, if µ = 5.


• Graphically:

[Figure: sampling distribution of X̄ under µ = 5; the tail area above 9.27 equals p]


• This p-value is now given by p = 0.038

• Conclusion:

The average treatment effect is significantly larger than 5 mmHg (p = 0.038).



Hutchins et al. [12]

• Description of methods, p.8315:

∗ Authors in favour of one-sided tests

∗ Journal required two-sided results


• Results, p.8316:

• Results (abstract):


Chapter 8

Describing associations

. Introduction

. Pearson correlation

. Relative risk

. Odds ratio



8.1 Introduction

• All test procedures discussed so far aim at expressing to what extent an observed relation between two variables can be ascribed to pure chance:

. Unpaired t-test: The relation between a continuous response Y (e.g., weight gain) and a dichotomous variable X (e.g., protein level) which defines the groups to be compared.

. Chi-squared test: The relation between a dichotomous response Y (e.g., sickness absence) and a dichotomous variable X (e.g., gender) which defines the groups to be compared.

• As discussed before, p-values do not express the size of a relation: A highly significant effect does not necessarily mean that the effect is clinically relevant, i.e., the association between the variables is not necessarily very strong.


• A number of association measures are available to describe the strength of association between two variables.

• Association measures frequently used in the biomedical literature are:

. the correlation coefficient

. the relative risk

. the odds ratio


8.2 Pearson correlation

• As an example, we consider surgery data, in which one studies how the time needed, after surgery, for the BP to recover to a ‘normal’ level (systolic BP ≥ 100 mmHg) relates to the BP during the surgery and to the dose of the drug needed to keep the BP sufficiently low during the surgery.

• Data on 53 patients, with 3 types of operation

• Available measurements:

. Time (min.) before the patient’s systolic BP returns to 100 mmHg

. The base-10 logarithm, log(dose), of the drug dose in mg

. The average systolic BP while the drug was being administered


• Let us focus on describing the association between the recovery time and the log(dose), irrespective of the type of operation.

• For each patient, we have two measurements:

. The log(dose): xi for the ith patient

. The recovery time: yi for the ith patient

• Our data are couples (xi, yi), which can be graphically explored using a scatterplot.

• The scatterplot suggests a positive relation between X and Y

• Note that such a relation is an average relation, not a relation at the patient level

• Also, the relation is not expected to be very strong: Knowing the dose, one cannot predict the recovery time very precisely.


• The Pearson correlation is a quantitative measure for the strength of association between two variables X and Y, and is defined as:

$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} $$

where x̄ and ȳ are the sample averages of the observed x-values and y-values, respectively:

$$ \bar{x} = \frac{1}{n}\sum_i x_i, \qquad \bar{y} = \frac{1}{n}\sum_i y_i $$

• Insight in the above expression can best be obtained graphically.


[Figure: scatterplot of the points (xi, yi), divided into four quadrants by the lines x = x̄ and y = ȳ; the quadrants are labelled by the signs of (xi − x̄, yi − ȳ): (+,+), (+,–), (–,+), (–,–)]


• The Pearson correlation coefficient measures to what extent there is a linear relation between X and Y, and has the following properties:

. −1 ≤ r ≤ 1

. r < 0 : negative linear trend between the xi and the yi

. r > 0 : positive linear trend between the xi and the yi

. r = −1 : the data points (xi, yi) are located on a decreasing straight line

. r = 1 : the data points (xi, yi) are located on an increasing straight line

. r = 0 : there is no LINEAR trend between the xi and the yi
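• The definition of r translates directly into code. The sketch below follows the formula term by term; the function name and the small data sets are made up for illustration. Points with both deviations of the same sign contribute positively to the numerator, points with deviations of opposite sign contribute negatively.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of paired observations (x_i, y_i)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 pairs")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # numerator
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    # Undefined (division by zero) if all x-values or all y-values are equal.
    return sxy / (sqrt(sxx) * sqrt(syy))


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))          # points on an increasing line
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))          # points on a decreasing line
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # y = x squared: perfect but non-linear
```

The third data set shows why the last property stresses LINEAR: y is completely determined by x, yet the positive and negative contributions to the numerator cancel.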


• Note that the correlation r is computed from the observed values (xi, yi), and only describes the association that has been observed in the sample.

• However, this sample correlation r can be considered an estimate for the population correlation ρ, i.e., the correlation that would be obtained if the total (infinite) population were studied.

• Usually it is of interest to use the observed sample to test whether ρ can be considered different from zero

• Formally, the following hypothesis is to be tested:

H0 : ρ = 0, versus HA : ρ ≠ 0

• The test procedure assumes X and Y to be jointly normally distributed.

• Alternatively to testing a hypothesis about ρ, C.I.’s can be computed for ρ as well
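The notes do not state how such a C.I. is computed. One commonly used approximate method, which also relies on joint normality of X and Y, is Fisher's z-transformation. The sketch below is a minimal illustration of that method only; the function name `fisher_z_interval` and the values of r and n are hypothetical.

```python
import math
from statistics import NormalDist

def fisher_z_interval(r, n, level=0.95):
    """Approximate confidence interval for rho via Fisher's z-transformation.

    Assumes (X, Y) jointly normal and n > 3.
    """
    z = math.atanh(r)                          # Fisher transformation of r
    se = 1.0 / math.sqrt(n - 3)                # approximate standard error of z
    q = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return math.tanh(z - q * se), math.tanh(z + q * se)

# Hypothetical values, for illustration only.
r, n = 0.30, 53
lower, upper = fisher_z_interval(r, n)
print(f"r = {r}, 95% C.I. for rho: [{lower:.3f}; {upper:.3f}]")

# H0: rho = 0 is rejected at the 5% level when 0 lies outside the interval.
reject_h0 = not (lower <= 0.0 <= upper)
print("reject H0:", reject_h0)
```

The interval is computed on the transformed scale and mapped back with tanh, so it always stays within [−1, 1] and is generally not symmetric around r.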


[Figure: a random SAMPLE is drawn from the POPULATION. The scatterplot of the sampled (xi, yi) yields the estimate r for ρ, which is then used for inference and estimation about the population: H0 : ρ = 0 versus HA : ρ ≠ 0?]


• For our example, the correlation matrix for the three variables in the surgery dataset is:


• The corresponding scatterplot matrix is:


• Note that the normality assumption for the time variable is questionable, implying that the reported p-values may not be correct

• One way to solve this is to transform the variable logarithmically, leading to:

• The conclusions do not change qualitatively


8.3 Relative risk

• We re-consider the sickness absence example, where the following data were observed in one of the companies studied:

                 Sickness absence
                 Yes      No     Total
    Female       117     152      269
    Male         378     711     1089
    Total        495     863     1358

• The observed proportions of 117/269 = 43.49% and 378/1089 = 34.71% of sickness absence in females and males, respectively, were found to be significantly different (chi-squared: p = 0.007, Fisher Exact: p = 0.009).


• The relative risk (RR) quantifies how much more sickness absence occurs in females, compared to males:

$$ RR = \frac{\text{Proportion sickness absence in females}}{\text{Proportion sickness absence in males}} = \frac{117/269}{378/1089} = 1.25 $$

• This implies that sickness absence occurs 1.25 times more in females than in males

• Alternatively, we can conclude that the risk of sickness absence is 25% larger in females than in males

• As for the correlation coefficient, the RR can be considered an estimate, based on our sample, for the theoretical relative risk in the total population.


• Note that a RR equal to 1 would imply that the risk is the same for both genders, i.e., that there is no relation between sickness absence and gender.

• It is therefore often of interest to test whether the relative risk in the population is equal to 1. Alternatively, C.I.’s for the relative risk can be constructed as well.

• For example, a 95% C.I. for the RR in our example, is given by [1.0692; 1.4686].

• Since 1 ∉ [1.0692; 1.4686], we know that the null hypothesis of no relation between gender and sickness absence is rejected.

• Note that formal testing of this hypothesis was done before using the chi-squared and Fisher Exact test.
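The notes do not say which method produced this interval. A common choice is a Wald interval on the log scale, and the sketch below, under that assumption, reproduces the quoted bounds to the reported precision. The function name `relative_risk_ci` is illustrative.

```python
import math
from statistics import NormalDist

def relative_risk_ci(events_1, n_1, events_2, n_2, level=0.95):
    """Relative risk of group 1 versus group 2 with a log-scale Wald interval."""
    rr = (events_1 / n_1) / (events_2 / n_2)
    # Approximate standard error of log(RR) (delta method).
    se_log = math.sqrt(1 / events_1 - 1 / n_1 + 1 / events_2 - 1 / n_2)
    q = NormalDist().inv_cdf(0.5 + level / 2)
    return rr, rr * math.exp(-q * se_log), rr * math.exp(q * se_log)

# Sickness absence data: 117 of 269 females, 378 of 1089 males.
rr, lower, upper = relative_risk_ci(117, 269, 378, 1089)
print(f"RR = {rr:.4f}, 95% C.I. = [{lower:.4f}; {upper:.4f}]")
```

Working on the log scale keeps both bounds positive and makes the interval symmetric around the estimate in a multiplicative sense.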


8.4 Odds ratio

• We re-consider the data on the relation between the occurrence of cervical cancer and the age at first pregnancy:

                      Disease status
                      Cervical cancer   Control   Total
    Age ≤ 25                42            203      245
    Age > 25                 7            114      121
    Total                   49            317      366

• It was shown before that there is a highly significant relation between age at first pregnancy and the occurrence of cervical cancer (p = 0.002, chi-squared and Fisher Exact).


• The relative risk of interest would indicate how much more likely cervical cancer is to occur when the first pregnancy is at or before the age of 25 years, compared to when the first pregnancy is after the age of 25 years.

• Hence, the relative risk of interest is

$$ RR = \frac{\text{Proportion cancer cases when first pregnancy} \le 25 \text{ yrs.}}{\text{Proportion cancer cases when first pregnancy} > 25 \text{ yrs.}} $$

• As discussed before, the case-control nature of this study does not allow estimation of the proportions needed to calculate the above RR.

• This is a direct consequence of the fact that the scientist him-/herself decides how many cancer cases and how many controls will be selected in the sample.


• The effect of that decision can be seen from comparing several situations with different numbers of selected controls:

                 Observed table          Ten times as many controls
                 ≤ 25 yrs   > 25 yrs     ≤ 25 yrs   > 25 yrs
    Case            42          7           42          7
    Control        203        114         2030       1140

$$ RR: \quad \frac{42/(42+203)}{7/(7+114)} = 2.96 \qquad\qquad \frac{42/(42+2030)}{7/(7+1140)} = 3.32 $$

• This means that the RR can be changed simply by selecting more or fewer controls.

• Therefore, the RR cannot be used to describe the strength of association in case-control studies.
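The dependence on the number of controls can be made explicit with a small sketch. The multiplier `k` is a hypothetical device: it scales both control counts while the cases stay fixed, mimicking a researcher who samples k times as many controls.

```python
def relative_risk(cases_1, controls_1, cases_2, controls_2):
    """Naive 'relative risk' computed from the columns of a case-control table."""
    p1 = cases_1 / (cases_1 + controls_1)
    p2 = cases_2 / (cases_2 + controls_2)
    return p1 / p2

# Observed table: 42 and 7 cases, 203 and 114 controls.
# k is a hypothetical multiplier for the number of controls sampled.
rr_by_k = {}
for k in (1, 10, 100):
    rr_by_k[k] = relative_risk(42, 203 * k, 7, 114 * k)
    print(f"controls x {k:>3}: RR = {rr_by_k[k]:.2f}")
```

The computed value changes with k even though the underlying population is the same, which is exactly why it cannot be interpreted as a relative risk here.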


• An alternative to the RR, which can be used for case-control studies, is the odds ratio, defined as the ratio of the odds of cancer in the ≤ 25 group over the odds of cancer in the > 25 group.

• The odds of cancer in the ≤ 25 group is defined as:

$$ \text{Odds}_{\le 25} = \frac{\text{Proportion cancer cases when first pregnancy} \le 25 \text{ yrs.}}{\text{Proportion non-cancer cases when first pregnancy} \le 25 \text{ yrs.}} = \frac{42/(42+203)}{203/(42+203)} = \frac{42}{203} = 0.2069 $$

• Note that this odds is a measure for the risk of cancer in the ≤ 25 group, since it will be large if there are many cancer cases, and small otherwise.


• Similarly, the odds of cancer in the > 25 group is defined as:

$$ \text{Odds}_{> 25} = \frac{\text{Proportion cancer cases when first pregnancy} > 25 \text{ yrs.}}{\text{Proportion non-cancer cases when first pregnancy} > 25 \text{ yrs.}} = \frac{7/(7+114)}{114/(7+114)} = \frac{7}{114} = 0.0614 $$

• This odds is a measure for the risk of cancer in the > 25 group, since it will be large if there are many cancer cases, and small otherwise.

• The odds ratio is now defined as:

$$ OR = \frac{\text{Odds}_{\le 25}}{\text{Odds}_{> 25}} = \frac{0.2069}{0.0614} = 3.37 $$


• Hence, the odds of developing cervical cancer are 3.37 times higher when the first pregnancy is at an age of 25 years or younger.

• The odds ratio is difficult to interpret, but it clearly gives a general indication of how much more ‘risk’ there is in one group, compared to another group.

• Note that the odds ratio also equals:

$$ OR = \frac{42 \times 114}{203 \times 7} = 3.37 $$

• In general, we have, for a general 2 × 2 table:

                 Group 1   Group 2
    Case            A         B
    Control         C         D

$$ OR = \frac{A \times D}{B \times C} $$


• This shows that, in contrast to the RR, the OR does not depend on the numbers of selected cases and controls.
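To see why, suppose (as a hypothetical scenario) that k times as many controls are sampled while the cases stay the same, so that both control counts C and D are multiplied by k. The factor k cancels in the OR but not in the RR:

```latex
\[
OR_k = \frac{A \times (kD)}{B \times (kC)} = \frac{A \times D}{B \times C} = OR,
\qquad
RR_k = \frac{A/(A + kC)}{B/(B + kD)} \neq \frac{A/(A + C)}{B/(B + D)} = RR
\quad \text{in general.}
\]
```

The same cancellation occurs if the number of cases is multiplied by a constant instead, since A and B then pick up a common factor.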

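• This can also be seen in our earlier examples:

                 Observed table          Ten times as many controls
                 ≤ 25 yrs   > 25 yrs     ≤ 25 yrs   > 25 yrs
    Case            42          7           42          7
    Control        203        114         2030       1140

$$ RR: \quad \frac{42/(42+203)}{7/(7+114)} = 2.96 \qquad\qquad \frac{42/(42+2030)}{7/(7+1140)} = 3.32 $$

$$ OR: \quad \frac{42 \times 114}{7 \times 203} = 3.37 \qquad\qquad \frac{42 \times 1140}{7 \times 2030} = 3.37 $$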


• As for the correlation coefficient and the RR, the OR can be considered an estimate, based on our sample, for the theoretical odds ratio in the total population.

• Note that an OR equal to 1 would imply that the risk is the same for both groups, i.e., that there is no relation between cervical cancer and the age at first pregnancy.

• In that case, one would also have RR = 1.

• It is therefore often of interest to test whether the odds ratio in the population is equal to 1. Alternatively, C.I.’s for the odds ratio can be constructed as well.


• For example, a 95% C.I. for the OR in our example, is given by [1.4658; 7.7457].

• Since 1 ∉ [1.4658; 7.7457], we know that the null hypothesis of no relation between cervical cancer and age at first pregnancy is rejected.

• Note that formal testing of this hypothesis was done before using the chi-squared and Fisher Exact test.
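The notes do not say how this interval was obtained. Assuming the usual log-scale (Woolf) interval, the sketch below reproduces the quoted bounds to the reported precision. The function name `odds_ratio_ci` and the argument layout are illustrative.

```python
import math
from statistics import NormalDist

def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio (a*d)/(b*c) for a 2x2 table with a log-scale (Woolf) interval.

    Table layout: cases a, b and controls c, d in groups 1 and 2.
    """
    odds_ratio = (a * d) / (b * c)
    # Approximate standard error of log(OR).
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    q = NormalDist().inv_cdf(0.5 + level / 2)
    return (odds_ratio,
            odds_ratio * math.exp(-q * se_log),
            odds_ratio * math.exp(q * se_log))

# Cervical cancer data: cases 42 (<= 25 yrs) and 7 (> 25 yrs); controls 203 and 114.
or_hat, lower, upper = odds_ratio_ci(42, 7, 203, 114)
print(f"OR = {or_hat:.2f}, 95% C.I. = [{lower:.4f}; {upper:.4f}]")
```

The interval is wide mainly because of the cell with only 7 cases, whose reciprocal dominates the standard error.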



• Giantomaso et al. [13]

. Figure 2, p.398:

Positive association between the actual distance and the distance estimated by the physician


. Table 1, p.398:

. Negative Pearson correlation (r = −0.139)

. Correlation of patient estimate with physician estimate equals r = 0.349, r² = 0.12

. Joint normality of X and Y is questionable (see graph)


• Marlow et al. [14], Table 1:

                    Classmates      Preterm
    Impaired          2 (1.3%)      99 (41%)
    Not impaired    158 (98.7%)    142 (59%)

$$ OR = \frac{158 \times 99}{2 \times 142} \approx 55 \qquad\qquad RR = \frac{99/241}{2/160} \approx 33 $$


Chapter 9

Non-parametric statistics

. Introduction

. The principle of ranks

. Wilcoxon test

. Example: Survival times in cancer patients

. Spearman correlation

. Example: Surgery data

. Remarks



9.1 Introduction

• Most test procedures commonly used in statistics are based on specific assumptions about the way the outcome Y of interest is distributed in the population. Examples are:

. Normality

. Equal variance

• This is why all techniques discussed so far are examples of so-called parametric statistics

• Sometimes, transformations of the data can be used in order to satisfy these assumptions.

• However, this (slightly) complicates the interpretation of the results


• Also, in some cases, it is not possible to find a suitable transformation. For example, consider the case where non-normality is caused by multi-modality:

• If no transformation can be found, or if transformations are not desired, non-parametric methods can be used.


9.2 The principle of ranks

• We re-consider the analysis of the survival times of cancer patients, where a comparison of stomach cancer patients to colon cancer patients was of interest.

• Overlaid histogram of the survival times of both groups:


• This suggests that the colon-cancer cases have longer survival times than the stomach-cancer cases, i.e., that the distribution of the survival times in one group is shifted to the right relative to the distribution in the other group.

• This implies that, if all observations were ranked, we expect to see more observations from the stomach-cancer group in the lower ranks, and more from the colon-cancer group in the higher ranks.

• This suggests that it is sufficient to study the ranks of the observations, i.e., which observations are larger/smaller than others, to decide whether the survival times in both groups can be assumed to be sampled from the same distribution.

• The actual location of the observations is not needed, it is sufficient to know their ranks.

• Most non-parametric tests are based on replacing the observations by their ranks.


• This will now be illustrated with two frequently-used non-parametric procedures:

. The Wilcoxon test

. The Spearman correlation coefficient


9.3 Wilcoxon test

• The Wilcoxon test is the non-parametric version of the unpaired t-test. Hence, it allows comparison of two populations, without having to assume the data to be normally distributed in both populations

• The null and alternative hypotheses are:

[Figure: two panels showing the distributions of the survival times for stomach cancer and colon cancer. H0: one distribution (both groups coincide). HA: shifted distributions (the two groups are shifted relative to each other).]

• Hence, the alternative assumes that one distribution is just shifted from the other.

• As an example of how the Wilcoxon test proceeds, consider the comparison of two populations (A and B), on the basis of the following two samples:

A 7 4 9 17

B 11 6 21 14 18

• The observations are now sorted, while keeping track of the population from which they were sampled (group A or B):

4 6 7 9 11 14 17 18 21

A B A A B B A B B


• The observed values are now replaced by their rank in the complete data set (groups A and B together):

1 2 3 4 5 6 7 8 9

A B A A B B A B B

• The sum of the ranks of all observations from one group is now calculated. For example, for group A, this becomes:

WA = 1 + 3 + 4 + 7 = 15

• Obviously, if WA is exceptionally large, this means that the observations in group A are located more to the right, when compared to the observations in group B


• Alternatively, if WA is exceptionally small, this means that the observations in group A are located more to the left, when compared to the observations in group B

• Hence, H0 will be rejected if WA is ‘too large’ or ‘too small’.

• Question: How large/small is too large/small?

• Answer: If the observed value for WA is very unlikely to happen by pure chance


• We therefore calculate the probability p of observing an experiment with a value for WA at least as extreme as the one observed, if the two populations were identical.


• Hence, even if the two samples were drawn from the same population, there would be a 28.57% chance of observing two samples shifted from each other as much as in the current experiment, by pure chance.

• Hence, what has been observed in the current experiment is perfectly in line with what is to be expected, if the two populations are identical.


• We therefore conclude that there is no significant difference between the groups A and B (p = 0.2857).

• This testing procedure is called the Wilcoxon (rank sum) test or, equivalently, the Mann-Whitney U test.
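For samples this small, the p-value can be obtained by brute force: under H0, every way of assigning 4 of the 9 ranks to group A is equally likely. The sketch below enumerates all such assignments. It is a minimal illustration that assumes no tied observations and takes "at least as extreme" to mean at least as far from the expected rank sum; the function name `rank_sum_exact` is illustrative. Under these assumptions it reproduces p = 0.2857 for the example.

```python
from itertools import combinations

def rank_sum_exact(sample_a, sample_b):
    """Exact two-sided Wilcoxon rank-sum test; assumes no tied observations."""
    pooled = sorted(sample_a + sample_b)
    rank = {value: i + 1 for i, value in enumerate(pooled)}  # rank 1 = smallest
    w_a = sum(rank[v] for v in sample_a)

    n_a, n_total = len(sample_a), len(pooled)
    expected = n_a * (n_total + 1) / 2  # mean of W_A under H0
    # Under H0 every choice of n_a ranks out of n_total is equally likely.
    all_sums = [sum(ranks) for ranks in combinations(range(1, n_total + 1), n_a)]
    extreme = sum(1 for w in all_sums
                  if abs(w - expected) >= abs(w_a - expected))
    return w_a, extreme / len(all_sums)

group_a = [7, 4, 9, 17]
group_b = [11, 6, 21, 14, 18]
w_a, p_value = rank_sum_exact(group_a, group_b)
print(f"W_A = {w_a}, exact two-sided p = {p_value:.4f}")
```

Full enumeration is only practical for small samples; for larger samples, statistical software typically switches to a normal approximation.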

• Note that, alternatively, one can also decide to sum the ranks of the other group (here group B):

WB = 2 + 5 + 6 + 8 + 9 = 30

• This would lead to identical results, since WA is large if WB is small, and vice versa. Indeed, we have that

$$ W_A + W_B = \frac{(n_A + n_B + 1) \times (n_A + n_B)}{2} $$

• Hence, knowing WA is equivalent to knowledge of WB.
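For the example above, with 4 observations in group A and 5 in group B, the identity can be verified directly. It holds because the two rank sums together contain each of the ranks 1, ..., 9 exactly once:

```latex
\[
W_A + W_B = 15 + 30 = 45 = 1 + 2 + \dots + 9
= \frac{(4 + 5 + 1) \times (4 + 5)}{2} = \frac{10 \times 9}{2}.
\]
```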


9.4 Example: Survival times in cancer patients

• The survival times of colon cancer patients were compared before with those of stomach cancer patients, using the unpaired t-test, after logarithmic transformation of the survival times.

• We can now repeat this non-parametrically, for the original as well as log-transformed survival times:

                            t-test        Wilcoxon
    Original data           p = 0.2483    p = 0.0945
    Log-transformed data    p = 0.0671    p = 0.0945


• Note that the Wilcoxon test yields a p-value closer to the one obtained from the t-test based on log-transformed data than to the one obtained from the t-test based on the original data

• Since the Wilcoxon test is based on ranks rather than the original data, transforming the data will not affect the result, as long as monotonic transformations are used.
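This invariance is easy to check on a small data set. The sketch below uses the two small samples A and B from the Wilcoxon example (not the survival data, which are not reproduced in these notes) and compares the rank sum before and after a log transformation. The helper `rank_sum` is illustrative and assumes no ties.

```python
import math

def rank_sum(sample_a, sample_b):
    """Sum of the ranks of sample_a in the pooled data; assumes no ties."""
    pooled = sorted(sample_a + sample_b)
    return sum(pooled.index(v) + 1 for v in sample_a)

# Small illustrative samples (the A/B example of the Wilcoxon test).
group_a = [7, 4, 9, 17]
group_b = [11, 6, 21, 14, 18]

w_original = rank_sum(group_a, group_b)
w_log = rank_sum([math.log(v) for v in group_a],
                 [math.log(v) for v in group_b])
print(w_original, w_log)
```

Because the logarithm is strictly increasing, it preserves the ordering of the observations, so the ranks, the rank sum, and hence the p-value are unchanged.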


9.5 Spearman correlation

• The Pearson correlation coefficient r expresses the strength of linear association between two variables X and Y

• As discussed before, the test for significance of the observed correlation assumes X and Y to be jointly normally distributed.

• In cases where a transformation is not possible or not desired, a non-parametric version can be derived, leading to the so-called Spearman correlation coefficient.

• As for the Wilcoxon test, the Spearman correlation coefficient will be based on replacing the observations by their ranks.


• As an example of how the calculation of the Spearman correlation proceeds, consider the following 8 observations for the variables X and Y:

    xi     yi        xi     yi
     0    0.10       10    6.55
    13    8.17        2    1.05
     6    5.30       12    6.65
     4    4.00        8    5.75

[Figure: scatterplot of yi versus xi for these 8 observations.]

• Each value xi is now replaced by its rank amongst all observed values for X .

• Similarly, each value yi is now replaced by its rank amongst all observed values for Y.


• Graphically:

[Figure: scatterplot of rank(yi) versus rank(xi) for the 8 observations.]

    rank(xi)  rank(yi)      rank(xi)  rank(yi)
       1         1             6         6
       8         8             2         2
       4         4             7         7
       3         3             5         5

• One now calculates a Pearson correlation as a measure of association between the so-obtained ranks.


• In the above example, the ranks show a perfect linear relation, implying that the Spearman correlation will equal 1.

• Note that the original data did not show a perfect linear fit, implying that the Pearson correlation would be less than 1.
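Both statements can be verified on the 8 observations of the example. The sketch below ranks the data and then applies the Pearson formula to the ranks; the helper names are illustrative, and the ranking function assumes no ties.

```python
import math

def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def pearson_r(xs, ys):
    """Pearson correlation computed from its definition."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def spearman_r(xs, ys):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

# The 8 observations from the example.
x = [0, 13, 6, 4, 10, 2, 12, 8]
y = [0.10, 8.17, 5.30, 4.00, 6.55, 1.05, 6.65, 5.75]

print("ranks of x:", ranks(x))
print("ranks of y:", ranks(y))
print("Spearman:", spearman_r(x, y))
print("Pearson: ", round(pearson_r(x, y), 3))
```

With tied observations, the tied values would usually receive the average of the ranks they span; that refinement is left out of this sketch.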

• The Spearman correlation coefficient measures to what extent there is a monotone relation between X and Y, and has the following properties:

. −1 ≤ r ≤ 1

. r < 0 : negative trend between the xi and the yi

. r > 0 : positive trend between the xi and the yi

. r = −1 : there is a perfect negative monotone relation between the xi and yi

. r = 1 : there is a perfect positive monotone relation between the xi and yi

. r = 0 : there is no monotone trend between the xi and the yi


• A statistical test for significance of the Spearman correlation can be constructed as well.

• This test procedure is not based on any distributional assumptions about X or Y.

• Note that, although the Spearman correlation is often interpreted as just the non-parametric version of the Pearson correlation, it is important to realize that, strictly speaking, both correlations measure different types of association:

. Pearson: Linear association

. Spearman: Monotone association


9.6 Example: Surgery data

• As an example, we re-consider the surgery data, in which we study how the time needed, after surgery, for the blood pressure (BP) to recover to a ‘normal’ level relates to the BP during the surgery and to the dose of the drug needed to keep the BP sufficiently low during the surgery.

• Data on 53 patients, with 3 types of operation

• Available measurements:

. Time (min.) before the patient’s systolic BP returns to 100 mmHg

. The base-10 log(dose) of the drug in log(mg)

. The average systolic BP while the drug was being administered


• Before, a Pearson correlation analysis was performed, and the variable Time was log-transformed in order to satisfy the normality assumption.

• We compare the previous results with those from a non-parametric Spearman correlation analysis:

• Note that Spearman correlations are not always larger/smaller than Pearson correlations.

• Since the Spearman correlation is based on ranks rather than the original data, monotone transformations of the data will not affect the result.


9.7 Remarks

• For most simple statistical procedures, non-parametric versions are available.

• Non-parametric procedures are not based on distributional assumptions for the data.

• Since non-parametric procedures are based on ranks, they are not affected by monotone transformations of the data. Hence, transforming the data prior to a non-parametric analysis does not make any sense.

• Since non-parametric procedures are based on ranks, they are not influenced by the magnitude of extreme values (outliers).


• In general, the choice between parametric and non-parametric procedures should be consistent with the summary statistics used to describe the observed data:

. Means and standard deviations + Parametric tests

. Medians and interquartile ranges + Non-parametric tests

• In case the distributional assumptions of a specific test are satisfied, one has the choice between the parametric and non-parametric test.

• In such cases, the parametric techniques are to be preferred, as they are more powerful to detect relevant effects.

• Unfortunately, many research questions will require more complex statistical tools for which no non-parametric alternatives are available.



• Choksy et al. [15]

. Statistical methodology, p.647:

. Power analysis does not specify the test

. Parametric and non-parametric data ?


. Figure 3:


• Chen et al. [4], Table 3:

. Spearman rank correlations

. Many tests, few significant results, multiple testing


• Huang et al. [16], Figure 1:

. Spearman correlation to quantify linear relations

. Spearman correlation not affected by outlier


Bibliography



[1] S.E. Nissen, E.M. Tuzcu, P. Schoenhagen, et al. Statin therapy, LDL cholesterol, C-reactive protein, and coronary artery disease. The New England Journal of Medicine, 352:29–38, 2005.

[2] K.M. Kellett, D.A. Kellett, and L.A. Nordholm. Effects of an exercise program on sick leave due to back pain. Physical Therapy, 71:283–293, 1991.

[3] E. Zuskin, J. Mustajbegovic, N. Schachter, et al. Longitudinal study of respiratory findings in rubber workers. American Journal of Industrial Medicine, 30:171–179, 1996.

[4] N.H. Chen, P.C. Wang, M.J. Hsieh, et al. Impact of severe acute respiratory syndrome care on the general health status of healthcare workers in Taiwan. Infection Control and Hospital Epidemiology, 28:75–79, 2007.

[5] C.A.S. De Clercq, J.S.V. Abeloos, M.Y. Mommaerts, and L.F. Neyt. Temporomandibular joint symptoms in an orthognathic surgery population. Journal of Cranio Maxillo-Facial Surgery, 23:195–199, 1995.

[6] C.A. Wong, B.M. Scavone, A.M. Peaceman, et al. The risk of cesarean delivery with neuraxial analgesia given early versus late in labor. The New England Journal of Medicine, 352:655–665, 2005.

[7] A.I. Amin, O. Hallbook, A.J. Lee, R. Sexton, B.J. Moran, and R.J. Heald. A 5-cm colonic J pouch colo-anal reconstruction following anterior resection for low rectal cancer results in acceptable evacuation and continence in the long term. Colorectal Disease, 5:33–37, 2003.

[8] S. Kaplan, S. Etlin, I. Novikov, and B. Modan. Occupational risks for the development of brain tumours. American Journal of Industrial Medicine, 31:15–20, 1997.

[9] Y. Baba, J.D. Putzke, N.R. Whaley, Z.K. Wszolek, and R.J. Uitti. Gender and the Parkinson’s disease phenotype. Journal of Neurology, 252:1201–1205, 2005.

[10] T. Shatari, M.A. Clark, T. Yamamoto, A. Menon, C. Keh, J. Alexander-Williams, and M. Keighley. Long strictureplasty is as safe and effective as short strictureplasty in small-bowel Crohn’s disease. Colorectal Disease, 6:438–441, 2004.

[11] P. Sripalakit, P. Nermhom, and S. Maphanta. Bioequivalence evaluation of two formulations of Doxazosin tablet in healthy Thai male volunteers. Drug Development and Industrial Pharmacy, 31:1035–1040, 2005.

[12] L.F. Hutchins, S.J. Green, P.M. Ravdin, D. Lew, S. Martino, M. Abeloff, A.P. Lyss, C. Allred, S.E. Rivkin, and C.K. Osborne. Randomized, controlled trial of Cyclophosphamide, Methotrexate, and Fluorouracil versus Cyclophosphamide, Doxorubicin, and Fluorouracil with and without Tamoxifen for high-risk, node-negative breast cancer: Treatment results of intergroup protocol INT-0102. Journal of Clinical Oncology, 23:8313–8321, 2005.

[13] T. Giantomaso, L. Makowsky, N.L. Ashworth, and R. Sankaran. The validity of patient and physician estimates of walking distance. Clinical Rehabilitation, 17:394–401, 2003.

[14] N. Marlow, D. Wolke, M.A. Bracewell, et al. Neurologic and developmental disability at six years of age after extremely preterm birth. The New England Journal of Medicine, 352:9–19, 2005.

[15] S.A. Choksy, P.L. Chong, C. Smith, M. Ireland, and J. Beard. A randomised controlled trial of the use of a tourniquet to reduce blood loss during transtibial amputation for peripheral arterial disease. European Journal of Vascular and Endovascular Surgery, 31:646–650, 2006.

[16] C.-C.J. Huang, C.-M. Li, C.-F. Wu, S.-P. Jao, and K.-Y. Wu. Analysis of urinary N-acetyl-S-(propionamide)-cysteine as a biomarker for the assessment of acrylamide exposure in smokers. Environmental Research, 104:346–351, 2007.

