Bio Stat 600


  • 8/2/2019 Bio Stat 600


    Introduction to Biostatistics

    BIOSTATISTICS 600

    Instructor: T. E. Raghunathan (Raghu)

    E-mail: [email protected]

    About the course

    To provide an overview of basic concepts in the Design and Analysis of Biostatistical Investigations.

    Unify the thought process, as many students take courses under different circumstances at various times.

    Kindle imagination for your course work and refresh your memory.

    Grade: The letter grade will be based on a multiple-choice test on the last day of the lecture.

    What is biostatistics?

    Biostatistics, as a field of science, is concerned with:

    The design and conduct of experiments (or studies) to collect observations (or data).

    Displaying and analyzing the data to infer about the population.

    Duly acknowledging the uncertainty in the stated conclusions while inferring about the population.

    R. A. Fisher defined statistics as

    Study of populations

    Study of variations

    Study of methods to reduce the data

    Key Concepts

    The goal of statistical analysis is to draw inferences or conclusions in an unbiased fashion.

    The target population should be clearly defined (that is, the population for which inference is drawn should be clearly stated).

    It is important to recognize the variations in the population; they should be reflected in the inferences.

    The same experiment conducted on two different populations may yield different results (systematic component of variation).

    Two different conditions or experiments conducted on the same population may yield different results (systematic component of variation).

    The experiment replicated under the same conditions on the same population may yield different results (random component of variation).

    Always find ways to succinctly describe the results using graphical and numerical summaries.


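    Two Displays

    Back-to-back stem-and-leaf display (Control leaves on the left, Treatment leaves on the right):

     Control | Stem | Treatment
             |  1   | 4466
        5431 |  2   | 0446888
         331 |  3   | 012
             |  4   |
           4 |  5   |

    Single stem-and-leaf display:

     Stem | Leaves
       1  | 4466
       2  | 0134456888
       3  | 011233
       4  |
       5  | 4

    Such displays are useful to visually inspect the extent to which distributions overlap as well as the magnitude of the differences.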

    Displays for qualitative variables

    Distribution of Cause of Death

    Circulatory system 137,165

    Neoplasms 69,948

    Respiratory system 33,223

    Injury/Poisoning 6,427

    Digestive system 10,779

    Nervous system 5,990

    Others 30,695

    Total 294,227

    [Figures: dot chart and bar chart of the distribution of cause of death]

    You can create these and other types of plots using PROC GPLOT in SAS or in Excel. These graphs were produced using a freeware package called R, which can be downloaded from www.r-project.org

    We will discuss some more graphical displays after introducing some numerical summaries.

    Numerical Summaries

    Central tendency: Represents a typical or middle value

    Spread: Extent of variability across observations

    Shape: The structure of the distribution

    Central tendency

    Mean: Sum all the observations and divide by the number of observations being summed (arithmetic mean).

    Mean survival time of guinea pigs in the control group:

    (21+23+24+25+31+33+33+54)/8 = 244/8 = 30.5

    Median: The number such that 50% of the observations are less than the number and 50% are greater than the number.

    Median survival time of guinea pigs in the control group:

    21, 23, 24, 25, 31, 33, 33, 54

    Technically, any number between 25 and 31. Sometimes the average (25+31)/2 = 28 is used.
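The two summaries above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of the course material.

```python
from statistics import mean, median

# Survival times of guinea pigs in the control group (from the slide)
control = [21, 23, 24, 25, 31, 33, 33, 54]

# Arithmetic mean: sum of the observations divided by their number
control_mean = mean(control)

# Median: with an even number of observations, statistics.median
# returns the average of the two middle values (25 and 31 here)
control_median = median(control)

print(control_mean, control_median)
```

Note how the single large value 54 pulls the mean above the median, which is a sign of a right-skewed sample.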


    Histogram

    Divide the data into groups by choosing intervals (mutually exclusive, equal width and exhaustive)

    Some rules: n/3 or n1/2log10(n)

    Count the observations in each group

    Draw a bar chart with the area of each bar proportional to the count

    If you fix the width of the bar to 1, then the height of the bar is proportional to the count

    Shape of the distribution:

    Symmetric: Mean, median and mode are the same. Values on either side of the mean are equally likely

    Skewed: Mean is larger or smaller than the median

    Normal distribution

    Characterized by the mean and standard deviation:

    68% of observations lie within 1 SD of the mean

    95% of observations lie within 2 SD of the mean

    99.7% of observations lie within 3 SD of the mean

    Key Concepts

    Population: A collection of units in which we are interested

    All people living in the United States

    In the study of treatment for diabetes, all people with diabetes

    Blood pressure for a person: All possible measurements of blood pressure in that person

    Generally, it is impossible to measure each and every unit in the population (if we could, it would be called a census).

    A practical approach: A sample is usually a very small subset of units in the population. The sample is measured and studied to draw conclusions about the population.

    The method used to draw the sample is the key step in a biostatistical investigation.

    The sample should be representative of the population (probability or random sampling designs assure such unbiased representativeness).

    Due to sampling from the population there is uncertainty in the inferences. Statistical analysis expresses these uncertainties in terms of probabilistic statements.

    Probability

    The meaning of probability is controversial.

    Empirical definition: The probability of an event A is the relative frequency with which the event occurs in a long sequence of trials in which A is one of the outcomes.

    It only makes sense to talk about probability when the event in question can be thought of as the result of an experiment that could be performed repeatedly

    Tossing a coin or throwing a die

    Suppose the median height in the population is 168 cm. Suppose we keep drawing one individual at a time and measuring height. Over the long run, as the sample size gets large, half the people in our sample will have heights below 168 cm.

    A random person chosen from this population will have height below 168 cm with probability 1/2.


    Probability (contd.)

    Subjective interpretation of probability

    It is a degree of belief expressing the certainty with which the event is expected to occur.

    This broader definition allows probabilistic statements without necessarily contemplating a series of trials.

    Anything that is not known to you means that you are uncertain about it. The probability is simply an expression of that uncertainty.

    Statistical inference based on the empirical definition of probability is called frequentist or repeated sampling inference.

    Statistical inference based on the subjective interpretation of probability is called Bayesian inference.

    Fortunately, for large samples the numerical results under both systems of inference are very similar, but the interpretations differ.

    Frequentist inference is the focus of this course.

    Key Concepts

    The collection of all possible outcomes from an experiment is called the sample space

    Tossing a coin: S = {H, T}

    Study on health insurance: Drawing a random sample of n subjects and assessing how many have health insurance: S = {0, 1, 2, ..., n}

    An event is a subset of the sample space

    Tossing a coin: E = {H}

    Study on health insurance: None have insurance (E = {0})

    At least 60% have health insurance: $E = \{X \in S \mid X \ge 0.6\,n\}$

    An experiment may involve measuring a continuous variable on an individual

    Sample space: An interval on the real line

    Assume $(0, \infty)$ or $(-\infty, \infty)$, with almost zero mass outside the appropriate interval (mathematical convenience)

    Example: X = Systolic blood pressure

    An event is a subset of the real (or positive real) line

    E = {X > 140}

    Probability Distribution

    Rule for assigning probability to all possible events

    Probability mass function: Probability assignment to each individual element of the sample space (discrete sample space)

    $\Pr(X = x) = f(x), \quad x \in S$

    $\Pr(X = x) = 0, \quad x \notin S$

    Probability density function: Probability assignment to an arbitrarily small interval around each potential value of a continuous variable (continuous sample space)

    $\Pr(X \in dx) = f(x)\,dx$

    Distribution function

    $\Pr(X \le u) = F(u) = \int_a^u f(x)\,dx$


    Rules of Probability

    $A$ = Event

    $0 \le \Pr(A) \le 1$

    $\Pr(A) = 0$: A will not occur in the entire sequence of experiments

    $\Pr(A) = 1$: Only A will occur in the entire sequence of experiments

    $A^c$ or $\bar{A}$ = Complement of A, or "Not A"

    $\Pr(A^c) = 1 - \Pr(A)$

    Two events A and B are mutually exclusive when the occurrence of A rules out the occurrence of B in a trial:

    Pr(A or B) = Pr(A) + Pr(B)

    Two events A and B are independent when the occurrence of A has no bearing on the occurrence or non-occurrence of B:

    Pr(A and B) = Pr(A) * Pr(B)

    Example 1: The median height of the population is 168 cm. Two individuals are chosen at random, independently. What is the probability that the height of both is less than 168 cm?

    A = First person's height is less than 168 cm

    B = Second person's height is less than 168 cm

    A and B = Both have height less than 168 cm

    Because of independence, Pr(A and B) = Pr(A) * Pr(B) = 1/2 * 1/2 = 1/4

    Example 2: Suppose that 10% of the population has height exceeding 180 cm. What is the probability that exactly one person's height exceeds 180 cm?

    Two possible scenarios (A and B now denote the heights of the first and second person):

    C1: A < 180 and B > 180

    C2: A > 180 and B < 180
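The two examples can be worked through numerically. The sketch below assumes, as in Example 1, that two people are drawn independently; Example 2 does not restate this, so treat it as an assumption. The two scenarios in Example 2 are mutually exclusive, so their probabilities add, and within each scenario independence lets the probabilities multiply.

```python
# Example 1: both heights below the median (independent draws)
p_below_median = 0.5
both_below = p_below_median * p_below_median

# Example 2: exactly one of two people is taller than 180 cm
p_tall = 0.10
scenario_1 = (1 - p_tall) * p_tall   # first not tall, second tall
scenario_2 = p_tall * (1 - p_tall)   # first tall, second not tall
exactly_one_tall = scenario_1 + scenario_2  # mutually exclusive scenarios add

print(both_below, exactly_one_tall)
```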


    Properties of the Binomial distribution

    In a typical sample of size n, you may expect nP subjects to have the disease.

    If you take several samples of size n from this population and note down the number of diseased subjects, the variance among these numbers will be nP(1-P).

    Inferential problem: Given n and x, how do we infer about P?

    Intuitively, the estimate of P is x/n. This estimate turns out to be a really good estimate.

    How do we decide that it is good? We will see later.
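The mean nP and variance nP(1-P) can be confirmed directly from the binomial probabilities. This is a small sketch; the values n = 20 and P = 0.1 are hypothetical and chosen only for illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x diseased subjects in a sample of size n."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, P = 20, 0.1  # hypothetical sample size and disease probability

pmf = [binomial_pmf(x, n, P) for x in range(n + 1)]

# Mean and variance computed from the definition, not from the formulas
mean = sum(x * q for x, q in enumerate(pmf))
variance = sum((x - mean) ** 2 * q for x, q in enumerate(pmf))

print(mean, n * P)                 # both close to 2.0
print(variance, n * P * (1 - P))   # both close to 1.8
```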

    Poisson Distribution

    The Poisson distribution is a close cousin of the Binomial distribution.

    If P is very small (rare disease) and n is very large, then the probability that in a sample of size n you will find x diseased subjects is approximately

    $\Pr(X = x) \approx \frac{(nP)^x e^{-nP}}{x!} = \frac{\lambda^x e^{-\lambda}}{x!}$, where $\lambda = nP$

    $\lambda$ = Expected number of diseased people in a sample of size n

    If you take a large number of very large samples and count the number of diseased people in each sample, the variance among these numbers will be approximately $\lambda$.

    Inferential problem: Given x, how do we draw inference about $\lambda$?
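To see how close the two distributions are for a rare disease, one can compare the probabilities term by term. The sketch below uses hypothetical values n = 10,000 and P = 0.0002, so the Poisson mean is nP = 2; the function names are illustrative.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    return lam**x * exp(-lam) / factorial(x)

n, P = 10_000, 0.0002   # hypothetical: very large n, very small P
lam = n * P             # Poisson mean nP

# Largest absolute difference between the two probabilities for x = 0..10
max_gap = max(abs(binomial_pmf(x, n, P) - poisson_pmf(x, lam))
              for x in range(11))

for x in range(5):
    print(x, round(binomial_pmf(x, n, P), 5), round(poisson_pmf(x, lam), 5))
print("largest gap:", max_gap)
```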

    Normal Distribution

    A popular model for many continuous variables

    It is a symmetric bell-shaped curve characterized by two parameters: Mean and Standard Deviation

    Mean: Center of the distribution (the same as the median and mode)

    90% of observations lie between mean-1.64*SD and mean+1.64*SD

    95% of observations lie between mean-1.96*SD and mean+1.96*SD
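The coverage figures quoted for the normal distribution can be reproduced with `statistics.NormalDist` from the Python standard library. The helper name `coverage` is illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

def coverage(k):
    """Probability that an observation lies within k SDs of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 1.64, 1.96, 2, 3):
    print(k, round(coverage(k), 4))
```

Because any normal distribution is a shifted and rescaled standard normal, the same multipliers apply whatever the mean and SD are.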

    Conditional Probability

    How likely is it that an event A will happen given that the event B has occurred?

    $\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}$

    If A and B are independent then

    $\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \frac{\Pr(A)\Pr(B)}{\Pr(B)} = \Pr(A)$

    Inverse Problem (Bayes Rule)

    $\Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)} = \frac{\Pr(A \mid B)\Pr(B)}{\Pr(A)}$


    Measures used in Diagnostic tests

    Diagnostic test indicates T+ or T-

    True state of the disease: D+ or D-

    Properties of diagnostic tests

    Sensitivity: Pr(T+|D+)

    Specificity: Pr(T-|D-)

    Usefulness or value of diagnostic tests

    Positive Predictive Value (PPV): Pr(D+|T+)

    Negative Predictive Value (NPV): Pr(D-|T-)
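Predictive values follow from sensitivity, specificity and disease prevalence through Bayes rule. The sketch below is illustrative only: the function name and the numbers (90% sensitivity, 95% specificity, 1% prevalence) are hypothetical and do not come from the course.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) via Bayes rule.

    sensitivity = Pr(T+|D+), specificity = Pr(T-|D-), prevalence = Pr(D+)
    """
    # Total probability of a positive test: true positives + false positives
    p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    ppv = sensitivity * prevalence / p_pos               # Pr(D+|T+)
    npv = specificity * (1 - prevalence) / (1 - p_pos)   # Pr(D-|T-)
    return ppv, npv

# Hypothetical test characteristics and prevalence
ppv, npv = predictive_values(sensitivity=0.90, specificity=0.95, prevalence=0.01)
print(round(ppv, 3), round(npv, 4))
```

With a rare disease, even a fairly accurate test has a low PPV, because false positives from the large disease-free group outnumber the true positives.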

    Key Concept in statistical inference: Sampling Distribution

    How do we judge whether the estimator x/n (the sample proportion) is a good estimator of the population proportion P?

    Imagine that you draw several samples, each of size n. Each will give you a different estimate. The variation in the estimates from sample to sample is called the sampling variance. The square root of the sampling variance is called the standard error.

    Two important criteria:

    You would want the estimates to be the same as the estimand, on average. Such estimates are called unbiased.

    The sample-to-sample variation in the estimates should be as small as possible. That is, the standard error should be as small as possible.

    The most desirable estimate: An unbiased estimate that has the smallest sampling variance.

    In this sense the sample proportion is the most desirable estimate of the population proportion.

    Sample Proportion

    The sampling variance of the sample proportion, p, is approximately p(1-p)/n.

    The standard error: $SE = \sqrt{p(1-p)/n}$

    Instead of using a single value to estimate the population proportion, it is sometimes desired to provide a range of plausible values for the unknown population proportion with a reasonable degree of confidence.

    A confidence interval is a summary measure that provides such a set of plausible values. Usual confidence levels are 90%, 95% or even 99%.

    An approximate 95% confidence interval for the unknown population proportion is

    $p \pm 1.96 \times SE = p \pm 1.96\sqrt{p(1-p)/n}$

    Confidence intervals

    90% confidence interval: $p \pm 1.64 \times SE$

    99% confidence interval: $p \pm 2.57 \times SE$

    Example: In a random sample of 2,837 children in the State of Michigan, 118 said they usually coughed first thing in the morning. What can you infer about the prevalence of this condition in the entire state?

    Sample prevalence is 118/2837 = 0.0416, the estimated prevalence rate for the entire state.

    The uncertainty in the estimate is

    $SE = \sqrt{0.0416 \times (1 - 0.0416)/2837} = 0.0037$

    95% confidence interval:

    $0.0416 \pm 1.96 \times 0.0037 = (0.034, 0.049)$

    With reasonable confidence one could conclude that the population prevalence rate is between 3.4% and 4.9%.
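The Michigan cough example can be reproduced as follows. This is a minimal sketch of the large-sample interval described above; the function name `proportion_ci` is illustrative.

```python
from math import sqrt

def proportion_ci(x, n, z=1.96):
    """Sample proportion, its standard error, and an approximate CI."""
    p = x / n
    se = sqrt(p * (1 - p) / n)
    return p, se, (p - z * se, p + z * se)

# 118 of 2,837 children reported coughing first thing in the morning
p, se, (lo, hi) = proportion_ci(118, 2837)
print(round(p, 4), round(se, 4), round(lo, 3), round(hi, 3))
```

Passing `z=1.64` or `z=2.57` gives the 90% and 99% intervals in the same way.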


    Example

    The following data was collected in a study of plasma magnesium in diabetic patients. The diabetic subjects were all insulin-dependent subjects attending a diabetic clinic over a 5-month period. The non-diabetic controls were a mixture of blood donors and people attending day centers for the elderly, to give a wide age distribution. Plasma magnesium follows a Normal distribution very closely.

    The summary data is as follows:

    Number of diabetic subjects=227

    Mean plasma magnesium=0.719

    Standard deviation =0.068

    Number of non-diabetic controls=140

    Mean plasma magnesium=0.810

    Standard deviation=0.057

    Questions of Interest

    Calculate an interval which would include 95% of plasma magnesium measurements from the control population. This is called a reference interval. It gives information about the distribution of plasma magnesium in the population.

    Given that the distribution of plasma magnesium is normal, the mean and standard deviation completely specify the distribution. Thus we would expect 95% of the observations to lie between 0.810-1.96*0.057 and 0.810+1.96*0.057. That is, between 0.698 and 0.922.

    What proportion of diabetic subjects do we expect to lie in the reference interval?

    The plasma magnesium level for diabetic subjects is normal with mean 0.719 and standard deviation 0.068. What is the area under this normal curve between 0.698 and 0.922?

    $\Pr(0.698 \le X \le 0.922) = \Pr\left(\frac{0.698 - 0.719}{0.068} \le Z \le \frac{0.922 - 0.719}{0.068}\right)$

    $= \Pr(-0.31 \le Z \le 2.99)$

    $= \Pr(Z \le 2.99) - \Pr(Z \le -0.31)$

    $= 0.9986 - 0.3783$

    $= 0.6203$

    Only about 62% of diabetic patients will lie in the reference interval.
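The reference interval and the share of diabetic subjects falling inside it can be computed without normal tables. The sketch below uses `statistics.NormalDist` and the summary statistics from the example; because it does not round the z-values to two decimals, the result differs slightly in the third decimal from the table-based 0.6203.

```python
from statistics import NormalDist

# Normal models built from the summary data in the example
control = NormalDist(mu=0.810, sigma=0.057)
diabetic = NormalDist(mu=0.719, sigma=0.068)

# 95% reference interval for the control population
z = 1.96
lower = control.mean - z * control.stdev
upper = control.mean + z * control.stdev

# Area under the diabetic curve between the reference limits
share = diabetic.cdf(upper) - diabetic.cdf(lower)

print(round(lower, 3), round(upper, 3), round(share, 3))
```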

    What are the estimates of the population mean of plasma magnesium for the diabetic and non-diabetic populations?

    The estimate of the population mean for diabetic subjects is 0.719 mmol/liter.

    The estimate of the population mean for non-diabetic subjects is 0.810 mmol/liter.

    What are the standard errors of the population mean estimates?

    The sample-to-sample variation (standard error) in the estimated mean for the diabetic population is 0.0045, and for the control population it is 0.0048.

    Diabetic population: $n_1 = 227$, $s_1 = 0.068$; $SE_1 = s_1/\sqrt{n_1} = 0.068/\sqrt{227} = 0.0045$

    Non-diabetic population: $n_2 = 140$, $s_2 = 0.057$; $SE_2 = s_2/\sqrt{n_2} = 0.057/\sqrt{140} = 0.0048$


    Find the 95% confidence interval for the population mean for the control population.

    $\bar{x}_2 = 0.810$, $SE_2 = 0.0048$

    95% confidence interval:

    $(\bar{x}_2 - 1.96 \times SE_2,\ \bar{x}_2 + 1.96 \times SE_2) = (0.810 - 1.96 \times 0.0048,\ 0.810 + 1.96 \times 0.0048) = (0.801, 0.819)$

    How does the confidence interval differ from the 95% reference interval? Why are they different?

    Find the standard error of the difference in mean plasma magnesium between the diabetic and non-diabetic populations.

    Estimated difference: $\bar{x}_1 - \bar{x}_2 = 0.719 - 0.810 = -0.091$

    $SE(\text{diff}) = \sqrt{SE_1^2 + SE_2^2} = \sqrt{0.0045^2 + 0.0048^2} = 0.0066$

    Find the 95% confidence interval for the difference in the means between the diabetic and non-diabetic populations.

    $(-0.091 - 1.96 \times 0.0066,\ -0.091 + 1.96 \times 0.0066) = (-0.104, -0.078)$

    We are more than 95% confident that the difference in the population means is negative. That is, the mean magnesium level for diabetic subjects is smaller than the mean magnesium level for non-diabetic subjects.
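The calculation for the difference in means can be sketched in Python. The function name is illustrative; the inputs are the summary statistics from the plasma magnesium example, and the large-sample multiplier 1.96 is used as on the slide.

```python
from math import sqrt

def mean_diff_ci(m1, s1, n1, m2, s2, n2, z=1.96):
    """Large-sample CI for the difference of two independent means."""
    se1 = s1 / sqrt(n1)
    se2 = s2 / sqrt(n2)
    se_diff = sqrt(se1**2 + se2**2)   # standard errors combine in quadrature
    diff = m1 - m2
    return diff, se_diff, (diff - z * se_diff, diff + z * se_diff)

# Group 1: diabetic, group 2: non-diabetic controls
diff, se_diff, (lo, hi) = mean_diff_ci(0.719, 0.068, 227, 0.810, 0.057, 140)
print(round(diff, 3), round(se_diff, 4), round(lo, 3), round(hi, 3))
```

The whole interval lies below zero, which is what supports the conclusion that the diabetic mean is lower.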

    Would plasma magnesium be a good diagnostic test for diabetes?

    The method discussed so far can be used to compare two population proportions. Note that the proportion is simply the average of 0s and 1s. The proportion is the mean of the binary variable.

    Example: A study was conducted to determine to what extent children with bronchitis in infancy get more respiratory symptoms in later life than others. 273 children who had bronchitis before age 5 (group 1) were compared to 1046 children who did not (group 2). The outcome was whether or not these children coughed during the day or night at age 14.

    26 of 273 reported coughing in group 1 and 44 of 1046 reported coughing in group 2.

    $p_1 = 26/273 = 0.095$

    $p_2 = 44/1046 = 0.042$

    $p_1 - p_2 = 0.095 - 0.042 = 0.053$

    $SE(p_1 - p_2) = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}} = \sqrt{\frac{0.095 \times (1 - 0.095)}{273} + \frac{0.042 \times (1 - 0.042)}{1046}} = 0.0188$

    95% confidence interval:

    $0.053 \pm 1.96 \times 0.0188 = (0.016, 0.090)$
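The bronchitis example follows the same pattern as a difference in means. This is a minimal sketch; the function name is illustrative and the counts are those given in the example.

```python
from math import sqrt

def proportion_diff_ci(x1, n1, x2, n2, z=1.96):
    """Large-sample CI for the difference of two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, se, (diff - z * se, diff + z * se)

# Group 1: bronchitis before age 5; group 2: no bronchitis
diff, se, (lo, hi) = proportion_diff_ci(26, 273, 44, 1046)
print(round(diff, 3), round(se, 4), round(lo, 3), round(hi, 3))
```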


    Adjustments for small samples

    When the sample size is large, the central limit theorem applies and the sample mean has an approximately normal distribution regardless of the original distribution of the outcome variable.

    When the sample size is small, the distribution of the standardized sample mean (using the estimated standard deviation) is not normal, even if the distribution of the outcome is normal.

    An adjustment is needed when the sample size is small.

    Example: Does increasing the amount of calcium in our diet reduce blood pressure? In a randomized experiment 10 black men were given a calcium supplement for 12 weeks and 11 black men received a placebo that appeared identical. The experiment was double blind. The outcome was the change in blood pressure over a 12-week period.

    Data

    Calcium group: n=10, mean=5 and standard deviation=8.743

    Placebo group: n=11, mean=-0.273 and standard deviation=5.901

    Two situations

    Suppose the population standard deviations in the two populations are the same

    Pooled standard deviation

    $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)} = \frac{9 \times (8.743)^2 + 10 \times (5.901)^2}{9 + 10} = 54.536$

    $s_p = \sqrt{54.536} = 7.385$

    $SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}} = 7.385\sqrt{1/10 + 1/11} = 3.227$

    95% confidence interval: $5.273 \pm t \times 3.227$

    Degrees of freedom = 9 + 10 = 19, so $t = 2.093$

    $(5.273 - 2.093 \times 3.227,\ 5.273 + 2.093 \times 3.227) = (-1.48, 12.027)$

    Given the considerable uncertainty, no change in the population mean difference between the calcium and placebo groups is plausible. Based on this data, we are confident that the mean difference that one would observe is between -1.5 and 12.0.
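The pooled calculation can be sketched as below. The critical value 2.093 (t with 19 degrees of freedom) is taken from the slide rather than computed, since the Python standard library has no t-distribution; the function name is illustrative.

```python
from math import sqrt

def pooled_ci(m1, s1, n1, m2, s2, n2, t_crit):
    """CI for a difference in means assuming equal population SDs."""
    df = (n1 - 1) + (n2 - 1)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    se = sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = m1 - m2
    return df, sqrt(sp2), se, (diff - t_crit * se, diff + t_crit * se)

# Calcium group vs placebo group; t_crit from a t-table with 19 df
df, sp, se, (lo, hi) = pooled_ci(5, 8.743, 10, -0.273, 5.901, 11, t_crit=2.093)
print(df, round(sp, 3), round(se, 3), round(lo, 2), round(hi, 2))
```

The interval includes zero, so these data alone do not rule out "no effect" of calcium.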

    What if the two population standard deviations are not the same?

    $SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{(8.743)^2}{10} + \frac{(5.901)^2}{11}} = \sqrt{7.64 + 3.16} = 3.29$

    $df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2} = \frac{[(8.743)^2/10 + (5.901)^2/11]^2}{(8.743^2/10)^2/9 + (5.901^2/11)^2/10}$

    $= \frac{(7.64 + 3.16)^2}{7.64^2/9 + 3.16^2/10} = \frac{116.64}{6.49 + 1.00} = 15.57$

    Interpolating between the table values for 15 and 16 degrees of freedom:

    $t = 2.131 + (2.120 - 2.131) \times 0.57 = 2.125$

    95% confidence interval:

    $(5.273 - 2.125 \times 3.29,\ 5.273 + 2.125 \times 3.29) = (-1.72, 12.26)$
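The unequal-variance standard error and the approximate degrees of freedom can be computed as sketched below. The function name is illustrative, and the linear interpolation between the table values 2.131 (15 df) and 2.120 (16 df) mirrors the slide. Since the code does not round intermediate values, its results differ from the slide's in the second decimal.

```python
from math import sqrt

def unequal_variance_se_df(s1, n1, s2, n2):
    """Standard error and approximate df without assuming equal SDs."""
    v1 = s1**2 / n1   # squared standard error of the first mean
    v2 = s2**2 / n2   # squared standard error of the second mean
    se = sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return se, df

se, df = unequal_variance_se_df(8.743, 10, 5.901, 11)

# Interpolate the t critical value between 15 df (2.131) and 16 df (2.120)
t_crit = 2.131 + (2.120 - 2.131) * (df - 15)

diff = 5 - (-0.273)
lo, hi = diff - t_crit * se, diff + t_crit * se
print(round(se, 2), round(df, 2), round(t_crit, 3), round(lo, 2), round(hi, 2))
```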


    How can we check whether the population standard deviations are the same?

    We will discuss this in the context of hypothesis testing.

    It can be argued that the equality of population standard deviations can never be empirically verified, especially if the sample size is small. One should therefore always use the procedure which does not assume equality of population standard deviations.

    How to make the decision?

    General principle: Check whether the data is consistent with the null hypothesis.

    We will answer the question by assessing how likely it is that the observed data would have been generated if the null hypothesis were true.

    A simple procedure

    If the null hypothesis were true, then the differences between pronethalol and baseline values would be 0 on average, with roughly half of them positive and the rest negative. That is, under the null hypothesis each difference is negative with probability 0.5. But only one negative value has occurred.

    Probability of observing 1 negative and 11 positives is

    $\binom{12}{1}\, 0.5^{1}\, 0.5^{11} = 0.00293$

    An even more extreme observation would be 0 negatives and 12 positives, which has the probability

    $\binom{12}{0}\, 0.5^{0}\, 0.5^{12} = 0.00024$

    Test statistic: Number of negative values

    P-value: The probability of observing a test statistic which is as extreme as or more extreme than the observed one, if the null hypothesis were true

    P-value = 0.00293 + 0.00024 = 0.00317

    The above test is called the sign test. We performed a one-sided test, because only smaller values of the number of negative values were considered.

    If one were to find a large number of negative values, say 11 or 12, then that is also evidence against the null hypothesis.

    The probability of obtaining 11 or 12 negatives is also 0.00317 (verify!).

    The two-sided p-value is 0.00317+0.00317=0.00634
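The sign-test p-values can be verified with exact binomial probabilities. This is a minimal sketch for the case described above (12 differences, 1 negative); the names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k negative signs among n differences."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n = 12                  # number of paired differences
observed_negatives = 1  # only one negative difference was seen

# One-sided p-value: 1 or fewer negatives under the null (p = 0.5)
one_sided = sum(binom_pmf(k, n) for k in range(observed_negatives + 1))

# Upper tail (11 or 12 negatives): equal to the lower tail by symmetry
upper_tail = sum(binom_pmf(k, n) for k in (11, 12))

two_sided = one_sided + upper_tail
print(round(one_sided, 5), round(upper_tail, 5), round(two_sided, 5))
```

The exact two-sided value is 26/4096; the slide's 0.00634 comes from adding the already rounded tail probabilities.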

    How do we define extreme values that constitute evidence against the null hypothesis?


    A Tale of Two Errors

    Type 1 error: Rejecting the null hypothesis when it is actually true.

    Type 2 error: Failure to reject the null hypothesis when it is actually false. (Equivalently, accepting the null hypothesis when it is actually false.)

    The chance of making a type 1 error is called the significance level and is denoted by $\alpha$.

    The chance of making a type 2 error is denoted by $\beta$.

    $1 - \beta$ is called power. Power is the chance of rejecting the null hypothesis when the alternative is true.

    The objective is to control the chances of making either type of error.

    Strategy: For a fixed significance level, we will define the extreme values.

    Revisiting the pronethalol example

    Suppose we specify that the chance of making a type 1 error is 0.05.

    For two-sided alternatives: The extreme values are determined by choosing values c and d so that the probability that the number of negative values is less than or equal to c or greater than or equal to d is 0.05.

    For one-sided alternatives: The extreme value is determined by choosing a value c so that the probability that the number of negative values is less than or equal to c is 0.05.

    Looking at the binomial table on page T-9 (last column, n=12): if we choose c = 2 and d = 10, each tail has probability 0.0002+0.0029+0.0161=0.0192, so the probability of a type 1 error is 0.0192+0.0192=0.0384. It is not possible to determine c and d to achieve a significance level of exactly 0.05.

    For a one-sided hypothesis, choose c=3. (It gives a significance level slightly larger than the specified 0.05.)

    Usually power is calculated instead of the chance of making a type 2 error.

    We need a specific alternative value which is the truth. Suppose that 1/12 (approximately 8%) is indeed the true value. The chance of rejecting the null hypothesis is: 0.3677+0.3837+0.1835+0.0000+0.0000+0.0000=0.9349

    Try calculating power for alternatives 0.05, 0.10, 0.20 etc.

    The power curve is a plot of power against the alternative values.
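The type 1 error and the power of the two-sided sign test can be computed exactly instead of read from a table. The sketch below assumes the rejection rule "reject when the number of negatives is at most c = 2 or at least d = 10" with n = 12, and sums the full tail probabilities; the function names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def rejection_probability(p, n=12, c=2, d=10):
    """Chance that the number of negatives is <= c or >= d
    when each difference is negative with probability p."""
    return sum(binom_pmf(k, n, p) for k in range(n + 1) if k <= c or k >= d)

# Under the null (p = 0.5) this is the type 1 error of the rule
size = rejection_probability(0.5)

# Under an alternative it is the power; 0.08 approximates 1/12 as in the table
power_at_008 = rejection_probability(0.08)

# A few points on the power curve
power_curve = {p: rejection_probability(p) for p in (0.05, 0.10, 0.20)}

print(round(size, 4), round(power_at_008, 4))
print({p: round(v, 4) for p, v in power_curve.items()})
```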

    The sign test so far has considered only the sign of the difference and not the magnitude of the difference. Let us consider some alternatives.

    Suppose that the differences can be assumed to be normally distributed. The estimate of the mean difference in the population is 7.7 and the standard deviation is 15.1.

    The standard error is 4.4.

    If the null hypothesis is true then the sample mean should be distributed around 0. The extent to which the observed sample mean is different from 0 is the evidence against the null hypothesis.

    One way to measure the distance between the observed sample mean and the null hypothesis value is in standard error units.

    $t = \frac{\bar{x} - 0}{SE(\bar{x})}$


    The calculated value of the t-statistic is 7.7/4.4 = 1.75

    What is the probability that one would observe this extreme or even more extreme values of the test statistic under the null hypothesis?

    If the null hypothesis were true then the statistic has a t-distribution with 11 degrees of freedom.

    [Figure: t-distribution with 11 degrees of freedom, with the tails beyond -1.75 and 1.75 shaded]

    Shaded area: 0.1079 (computed using a computer)

    For a fixed significance level, say 0.05, the value of the test statistic considered to be large is 2.201.

    Two-Sample Tests

    Revisit the plasma magnesium example

    The following data was collected in a study of plasma magnesium in diabetic patients. The diabetic subjects were all insulin-dependent subjects attending a diabetic clinic over a 5-month period. The non-diabetic controls were a mixture of blood donors and people attending day centers for the elderly, to give a wide age distribution. Plasma magnesium follows a Normal distribution very closely.

    The summary data is as follows:

    Number of diabetic subjects=227

    Mean plasma magnesium=0.719

    Standard deviation=0.068

    Number of non-diabetic controls=140

    Mean plasma magnesium=0.810

    Standard deviation=0.057

    Frame the question as testing of a statistical hypothesis: Are the means of plasma magnesium in the two populations (diabetic and non-diabetic) the same?

    Diabetic: $n_1 = 227$, $\bar{x}_1 = 0.719$, $s_1 = 0.068$

    Non-diabetic: $n_2 = 140$, $\bar{x}_2 = 0.810$, $s_2 = 0.057$

    Mean for diabetic population: $\mu_1$

    Mean for non-diabetic population: $\mu_2$

    Null hypothesis: $H_0: \mu_1 = \mu_2$

    Alternative hypothesis: $H_A: \mu_1 \ne \mu_2$

    $\bar{x}_1 - \bar{x}_2$ is an estimate of $\mu_1 - \mu_2$

    If the null hypothesis were true then $\bar{x}_1 - \bar{x}_2$ should be distributed around mean 0. The extent to which it is away from 0 is evidence against the null hypothesis.

    Test statistic:

    $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{SE(\bar{x}_1 - \bar{x}_2)} = \frac{\bar{x}_1 - \bar{x}_2}{SE(\bar{x}_1 - \bar{x}_2)} = \frac{-0.091}{0.0066} = -13.78$

    [Figure: sampling distribution of the test statistic under the null hypothesis, with -13.78 and 13.78 marked]

    The sampling distribution is normal given the large sample sizes from each population. If the null hypothesis were true, 68% of the samples should result in a value of the test statistic between -1 and 1, 90% of the samples between -1.64 and 1.64, and 95% of the samples between -1.96 and 1.96. What we have observed is very unlikely under the null hypothesis. Therefore, the null hypothesis is suspect.

    Small sample example revisited

    Two situations: Population variances are equal or unequal


    Example: Does increasing the amount of calcium in our diet reduce blood pressure? In a randomized experiment 10 black men were given a calcium supplement for 12 weeks and 11 black men received a placebo that appeared identical. The experiment was double blind. The outcome was the change in blood pressure over a 12-week period.

    Data

    Calcium group: sample size=10, mean=5 and standard deviation=8.743

    Placebo group: sample size=11, mean=-0.273 and standard deviation=5.901

    Population mean if everybody in the population were given the calcium supplement: $\mu_1$

    Population mean if everybody in the population were given only placebo: $\mu_2$

    Null hypothesis: $H_0: \mu_1 = \mu_2$

    Alternative hypothesis (one-sided): $H_A: \mu_1 > \mu_2$

    A large positive mean difference $\bar{x}_1 - \bar{x}_2$ is evidence against the null hypothesis in favor of the alternative hypothesis.

    Alternative hypothesis (two-sided): $H_A: \mu_1 \ne \mu_2$

    A large positive or negative mean difference is evidence against the null hypothesis in favor of the alternative hypothesis.

    Variances are equal

    Pooled standard deviation

    $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)} = \frac{9 \times (8.743)^2 + 10 \times (5.901)^2}{9 + 10} = 54.536$

    $s_p = \sqrt{54.536} = 7.385$

    Standard error of the difference in the means

    $SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}} = 7.385\sqrt{1/10 + 1/11} = 3.227$

    Test statistic

    $t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{SE(\bar{x}_1 - \bar{x}_2)} = \frac{5.273}{3.227} = 1.63$

    Degrees of freedom = 9 + 10 = 19

    Sampling distribution: t with 19 degrees of freedom

    P-value

    One-sided alternative: From Table D on page T-11, the shaded area (to the right of 1.63) is between 0.05 and 0.10. Computer software: 0.0598

    Two-sided alternative (area below -1.63 plus area above 1.63): P-value=2*0.0598=0.1196

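    Variances are unequal

    Standard error of the difference in the means

    $$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{(8.743)^2}{10} + \frac{(5.901)^2}{11}} = \sqrt{7.64 + 3.17} = 3.29$$

    Degrees of freedom

    $$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}} = \frac{[(8.743)^2/10 + (5.901)^2/11]^2}{(8.743^2/10)^2/9 + (5.901^2/11)^2/10} = \frac{(7.64 + 3.16)^2}{7.64^2/9 + 3.16^2/10} = \frac{116.64}{6.490 + 1.00} = 15.57$$

    Test statistic

    $$t = \frac{5.273}{3.29} = 1.603$$

    P-value

    One sided: 0.065

    Two sided: 0.13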
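    As a cross-check on the unequal-variance arithmetic, here is a minimal sketch of the same calculation from summary statistics. The function name `welch_t` is illustrative and the calcium group mean of 5.000 is again an assumption inferred from the 5.273 difference. Because it carries full precision, the degrees of freedom come out near 15.6 rather than the hand-rounded 15.57.

```python
import math

def welch_t(n1, mean1, sd1, n2, mean2, sd2):
    """Unequal-variance SE, t statistic and approximate degrees of freedom."""
    v1 = sd1 ** 2 / n1          # s1^2 / n1
    v2 = sd2 ** 2 / n2          # s2^2 / n2
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return se, (mean1 - mean2) / se, df

# Group 1 (calcium) mean of 5.000 is inferred, not quoted from the slides.
se_welch, t_welch, df_welch = welch_t(10, 5.000, 8.743, 11, -0.273, 5.901)
print(f"SE={se_welch:.2f}  t={t_welch:.3f}  df={df_welch:.2f}")
```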


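    Blood Pressure Example

    Power (%) for $n_1 = n_2 = n$ and standard deviation = 7.4, by the difference in means to be detected

              Difference in means
    n         2.5      5        7.5
    30        25.6     74.4     97.5
    40        32.7     85.6     99.4
    50        39.3     92.2     99.9
    60        45.6     95.9     100
    70        51.5     97.9     100
    100       66.6     99.8     100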
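    The entries in this table behave like the power of a two-sided 5% level two-sample test. The sketch below uses a normal approximation, which is an assumption (the slides do not say how the table was produced); it reproduces the tabulated values to within about one percentage point. The function name `power_two_sample` is illustrative.

```python
import math
from statistics import NormalDist

def power_two_sample(n, delta, sigma, alpha=0.05):
    """Approximate power of a two-sided two-sample z test with n per group."""
    z = NormalDist()
    se = sigma * math.sqrt(2 / n)           # SE of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = delta / se
    # Probability of rejecting in either tail when the true difference is delta
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

for n in (30, 40, 50, 60, 70, 100):
    row = [100 * power_two_sample(n, d, 7.4) for d in (2.5, 5, 7.5)]
    print(n, [round(p, 1) for p in row])
```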

    Analysis of Variance

    Suppose now that we want to compare more than 2 populations

    One could do pair-wise comparisons. This is cumbersome and is not easy to summarize when the number of populations compared is large.

    The analysis of variance is used by framing the question in terms of an in-depth investigation of variations in the observed data

    Analysis of variance basically partitions the overall variability into one or more assignable causes.

    What is left unassigned is called residual variability.

    Based on the partition of the variability, the relative merits of the assignable causes are investigated.

    Generally, the variation due to an assignable cause relative to the residual variability is used as a yardstick for judging the importance of the assignable cause.

    The assignable causes can be carefully planned or manipulated through an experimental design

    The assignable causes are based on substantive reasoning in an observational study design

    Example:

    A randomized study was conducted to test the generality of the observation that stimulation of the walking and placing reflexes in the newborn promotes increased walking and placing (Zelazo, Zelazo and Kolb (1972, Science, pages 314-315)). A total of 29 one-week old males were randomized to four groups. 1: Active exercise, 2: Passive exercise, 3: No-exercise and 4: 8-week control group. Age at which the infant walked alone (in months) was the outcome variable of interest.

    The assignable cause is the levels of exercise

    Is the variation caused by this assignable cause substantial?

    Data

    Active Exercise   Passive Exercise   No Exercise   Control
     9.00             11.00              11.50         13.25
     9.50             10.00              12.00         11.50
     9.75             10.00               9.00         12.00
    10.00             11.75              11.50         13.50
    13.00             10.50              13.25         11.50
     9.50             15.00              13.00

    $y_{ij}$ = Observation for subject j in group i

    $j = 1, 2, \ldots, n_i$

    $i = 1, 2, \ldots, k$

    $\bar{y}_{++}$ = Overall mean

    $$\text{Total variation} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{++})^2$$

    Example:

    $k = 4$

    $n_1 = n_2 = n_3 = 6, \; n_4 = 5$

    $\bar{y}_{++} = 261/23 = 11.35$

    Total variation = 58.47


    ANOVA

    $$(y_{ij} - \bar{y}_{++}) = (y_{ij} - \bar{y}_{i+}) + (\bar{y}_{i+} - \bar{y}_{++})$$

    Overall deviation = Within groups (between subjects nested within groups) + Between groups

    $$\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{++})^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i+})^2 + \sum_{i=1}^{k} \sum_{j=1}^{n_i} (\bar{y}_{i+} - \bar{y}_{++})^2$$

    Total SS = Within SS + Between SS

    58.47 = 43.69 + 14.78

    An alternative Expression

    $$y_{ij} = \bar{y}_{++} + (\bar{y}_{i+} - \bar{y}_{++}) + (y_{ij} - \bar{y}_{i+})$$

    $$y_{ij} = \mu + \alpha_i + \epsilon_{ij}$$

    $\mu$ = Overall mean

    $\alpha_i$ = Deviation of the group i mean from the overall mean

    $\epsilon_{ij}$ = Residual = $s_{j(i)}$ = Effect of subject j nested within group i

    Degrees of freedom: Number of independent statistics used to compute the sum of squares

    Df for Total SS = 22 (Every observation is used but the sum of deviations is zero)

    Df for Within SS = 19 (Every observation is used but the sum of deviations within each group is zero)

    Df for Between SS = 3 (Four means are used but the sum of deviations from the overall mean is zero)

    To compare the sums of squares, differences in the degrees of freedom have to be taken into account.

    Mean square = Sum of squares / Degrees of freedom

    $N = \sum_{i=1}^{k} n_i$ = Total sample size

    Df(Total SS) = $N - 1$

    Df(Between SS) = $k - 1$

    Df(Within SS) = $N - k$

    ANOVA Example

    MS(Between) = 14.78 / 3 = 4.93

    MS(Within) = 43.69 / 19 = 2.30

    F = MS(Between) / MS(Within) = 2.14

    Is 2.14 large?

    Use the F-distribution with (numerator df = 3, denominator df = 19) to determine how likely an F of 2.14 or larger is when in actuality there are no differences among the four groups.

    P-value: 0.1228
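    The partition of the sums of squares can be reproduced directly from the 23 observations. This is a minimal sketch; the assignment of values to groups is an assumption based on reading the flattened data table column by column, and the P-value is left to a table or software because the standard library has no F distribution.

```python
# Age at walking alone (months); group membership is inferred from the table layout.
groups = {
    "active":  [9.00, 9.50, 9.75, 10.00, 13.00, 9.50],
    "passive": [11.00, 10.00, 10.00, 11.75, 10.50, 15.00],
    "none":    [11.50, 12.00, 9.00, 11.50, 13.25, 13.00],
    "control": [13.25, 11.50, 12.00, 13.50, 11.50],
}

all_values = [y for ys in groups.values() for y in ys]
N, k = len(all_values), len(groups)
grand_mean = sum(all_values) / N

total_ss = sum((y - grand_mean) ** 2 for y in all_values)
between_ss = sum(len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2
                 for ys in groups.values())
within_ss = sum((y - sum(ys) / len(ys)) ** 2
                for ys in groups.values() for y in ys)

ms_between = between_ss / (k - 1)   # df = k - 1
ms_within = within_ss / (N - k)     # df = N - k
f_stat = ms_between / ms_within
print(f"Total={total_ss:.2f} Within={within_ss:.2f} Between={between_ss:.2f} F={f_stat:.2f}")
```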

    Regression Analysis

    The bulk of scientific investigations are concerned with relationships.

    Causal relationship: If one changes the variable X by a certain amount, how much does the variable Y change?

    Association or correlational relationship: Do subjects with different values of X also tend to have different values of Y?

    What is the nature of these relationships in the population?

    How do you quantify these relationships in the population?

    How do you estimate the quantities describing these population relationships?

    How accurate are those estimates? How much uncertainty is there in assessing these relationships?

    The two-sample tests and ANOVA also fit into this category

    Are the population means related to the treatments assigned or the observed grouping?

    We will later see that the two-sample t-tests and ANOVA F-tests are particular cases of the general regression framework.

    Terminology

    X = Independent variable.

    A variable that an investigator can change in an experiment

    Amenable to intervention in an observational study

    Simply the variable whose impact is to be assessed.

    It is possible that there can be more than one independent variable of interest

    Other names: Predictors, correlates, right-hand-side variables, exogenous variables

    Y = Dependent variable of interest.

    Variable for which you want to assess the effect of X.

    Other names: Outcome, endogenous variables, left-hand-side variables

    The impact of different values of X on differences in Y, expressed in some meaningful terms, is of interest

    Example

    The following table gives data collected by a group of medical students in a physiology class. The objective is to assess the association between height and FEV1.

    Height   FEV1        Height   FEV1        Height   FEV1
    164.0    3.54        172.0    3.78        178.0    2.98
    167.0    3.54        174.0    4.32        180.7    4.80
    170.4    3.19        176.0    3.75        181.0    3.96
    171.2    2.85        177.0    3.09        183.1    4.78
    171.2    3.42        177.0    4.05        183.6    4.56
    171.3    3.20        177.0    5.43        183.7    4.68
    172.0    3.60        177.4    3.60

    Scatter plot

    Scatter plot: A graphical device to assess the type of relationship.

    Each point is a pair (X, Y)

    Dependent variable on the vertical axis

    Independent variable on the horizontal axis

    Inspection of the graph suggests a linear relationship

    Linear relationship

    $$(y_1 - y_2) \propto (x_1 - x_2)$$

    Representation

    $$y = a + b\,x$$

    Clearly, not every observation (perhaps none) will satisfy this equation

    $$y_i = a + b\,x_i + e_i$$

    $a + b\,x_i$ = Line-value or the expected value; $e_i$ = Residual

    How to determine a and b?

    Method of Least Squares:

    Find a and b that minimize the residual sum of squares:

    $$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b\,x_i)^2$$

    $$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

    $$a = \bar{y} - b\,\bar{x}$$

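    Simplified formulas

    Slope

    $$b = \frac{\sum_i x_i y_i - \sum_i x_i \sum_i y_i / n}{\sum_i x_i^2 - \left(\sum_i x_i\right)^2 / n}$$

    Intercept

    $$a = \sum_i y_i / n - b \sum_i x_i / n$$

    Needed quantities

    $$\sum_i x_i, \quad \sum_i y_i, \quad \sum_i x_i y_i, \quad \sum_i x_i^2$$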

    Example y = FEV1, x = Height

    $$\sum_i x_i = 3{,}507.6, \qquad \sum_i y_i = 77.12$$

    $$\sum_i x_i y_i = 13{,}568.18, \qquad \sum_i x_i^2 = 615{,}739.24$$

    $$b = \frac{13{,}568.18 - 3{,}507.6 \times 77.12 / 20}{615{,}739.24 - (3{,}507.6)^2 / 20} = 0.074389$$

    $$a = 77.12 / 20 - 0.074389 \times 3{,}507.6 / 20 = -9.19$$

    Prediction equation

    $$\text{FEV1} = -9.19 + 0.0744 \times \text{Height}$$
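    The slope and intercept can be recomputed from the 20 data pairs with the simplified formulas. This is a minimal sketch using only the four sums listed as needed quantities; the variable names are illustrative.

```python
height = [164.0, 167.0, 170.4, 171.2, 171.2, 171.3, 172.0, 172.0, 174.0, 176.0,
          177.0, 177.0, 177.0, 177.4, 178.0, 180.7, 181.0, 183.1, 183.6, 183.7]
fev1 = [3.54, 3.54, 3.19, 2.85, 3.42, 3.20, 3.60, 3.78, 4.32, 3.75,
        3.09, 4.05, 5.43, 3.60, 2.98, 4.80, 3.96, 4.78, 4.56, 4.68]

n = len(height)
sum_x = sum(height)
sum_y = sum(fev1)
sum_xy = sum(x * y for x, y in zip(height, fev1))
sum_xx = sum(x * x for x in height)

# Simplified (computational) least squares formulas
b = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
a = sum_y / n - b * sum_x / n
print(f"b={b:.6f}  a={a:.2f}")
```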

    Interpretation

    Slope

    b = Expected difference in y for a unit positive difference in x

    Two individuals

    Individual 1: $x = h$

    Individual 2: $x = h + 1$

    Expected or line-value for individual 1: $a + b\,h$

    Expected or line-value for individual 2: $a + b\,(h + 1)$

    Difference = b

    Interpretation (Contd.)

    Intercept

    Expected value of y when x = 0.

    It is not very interpretable in this particular problem: it is the value of FEV1 when Height is 0!

    Modification: Centering

    $$y = c + d\,(x - \bar{x})$$

    $d = b$

    $c$ = Expected value of y for average height

    Residual

    Residuals from the estimated line

    $$e_i = y_i - a - b\,x_i$$

    Residuals represent deviations from the expected value. Large residuals reflect unreliability or uncertainty.

    One way to measure this uncertainty is through the variance of the residuals (or the standard deviation of the residuals).

    $$s_e^2 = \frac{\sum_i e_i^2}{n - 2}$$

    Computational formulas

    $$s_e^2 = \frac{\sum_i e_i^2}{n - 2} = \frac{\sum_i (y_i - \bar{y})^2 - b^2 \sum_i (x_i - \bar{x})^2}{n - 2} = \frac{(n - 1)(s_y^2 - b^2 s_x^2)}{n - 2}$$

    $s_y$ = SD of y's

    $s_x$ = SD of x's

    $$b = \frac{s_{xy}}{s_x^2}$$

    Covariance

    $$s_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

    Example

    $$s_x = 5.51, \qquad s_y = 0.71$$

    $$s_e^2 = \frac{19\,(0.71^2 - 0.0744^2 \times 5.51^2)}{18} = 0.35$$

    Correlation Coefficient: Another measure of the strength of a linear relationship

    Measure of linear association between x and y.

    $$r = \frac{s_{xy}}{s_x s_y} = \frac{2.26}{5.51 \times 0.71} = 0.58$$

    How useful is x in predicting y?

    Total variance = $s_y^2 = 0.71^2 = 0.504$ (Residual variance from a horizontal line)

    Degrees of freedom = $n - 1 = 19$

    Residual variance = $s_e^2 = 0.35$ (Residual variance from the regression line on x)

    Degrees of freedom = $n - 2 = 18$

    $$R^2 = \frac{(n - 1)\,s_y^2 - (n - 2)\,s_e^2}{(n - 1)\,s_y^2} = \frac{19 \times 0.50 - 18 \times 0.35}{19 \times 0.50} = 0.34$$

    or 34% (in percentage)
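    The residual variance, correlation coefficient and R-square can be recomputed from the raw height and FEV1 data. This is a minimal sketch with illustrative variable names; because it carries full precision, the results differ slightly from the hand-rounded figures on the slide.

```python
import math

height = [164.0, 167.0, 170.4, 171.2, 171.2, 171.3, 172.0, 172.0, 174.0, 176.0,
          177.0, 177.0, 177.0, 177.4, 178.0, 180.7, 181.0, 183.1, 183.6, 183.7]
fev1 = [3.54, 3.54, 3.19, 2.85, 3.42, 3.20, 3.60, 3.78, 4.32, 3.75,
        3.09, 4.05, 5.43, 3.60, 2.98, 4.80, 3.96, 4.78, 4.56, 4.68]

n = len(height)
x_bar, y_bar = sum(height) / n, sum(fev1) / n
sxx = sum((x - x_bar) ** 2 for x in height)
syy = sum((y - y_bar) ** 2 for y in fev1)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(height, fev1))

s_x = math.sqrt(sxx / (n - 1))            # SD of x's
s_y = math.sqrt(syy / (n - 1))            # SD of y's
b = sxy / sxx                             # slope
s_e2 = (syy - b ** 2 * sxx) / (n - 2)     # residual variance
r = sxy / math.sqrt(sxx * syy)            # correlation coefficient
r_squared = (syy - (n - 2) * s_e2) / syy  # proportion of variance explained
print(f"s_x={s_x:.2f} s_y={s_y:.2f} s_e^2={s_e2:.2f} r={r:.2f} R^2={r_squared:.2f}")
```

    In simple linear regression R-square equals the square of r, which gives a quick consistency check on the two numbers.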

    Another Form of $R^2$

    $$R^2 = \frac{\text{Variance}(\hat{y}\text{'s})}{\text{Variance}(y\text{'s})} = \frac{s_{\hat{y}}^2}{s_y^2}$$

    R-square is a simple measure to assess how much of the variability in y is explained by the variation in x.

    Large values of R-square indicate that substantial variation in y is due to variation in x. Small values of R-square indicate the opposite

    This measure also has disadvantages and we will discuss those when we consider multiple predictors

    Inference

    How much do the slope and intercept estimates vary from sample to sample?

    Standard error of the estimates

    $$SE(b) = \sqrt{\frac{s_e^2}{(n - 1)\,s_x^2}}$$

    95% confidence interval

    $$b \pm t_{0.025,\,n-2}\; SE(b)$$

    Test the hypothesis $H_0$: b is equal to 0 versus $H_A$: b is not equal to zero

    $$t = \frac{b - b_0}{SE(b)}, \qquad \text{under the null } (b_0 = 0): \quad t = \frac{b}{SE(b)}$$

    Estimated Line Value

    Suppose x = f and it is not one of the observed values in the data set.

    What would one expect y to be on the average?

    $$\hat{y}_f = a + b\,f$$

    $$SE(\hat{y}_f) = \sqrt{s_e^2 \left[\frac{1}{n} + \frac{(f - \bar{x})^2}{(n - 1)\,s_x^2}\right]}$$

    $f = 175$

    $$\hat{y}_f = -9.19 + 0.0744 \times 175 = 3.83$$

    $$SE(\hat{y}_f) = \sqrt{0.35 \left[\frac{1}{20} + \frac{(175 - 175.38)^2}{19 \times 5.51^2}\right]} = 0.133$$

    Prediction Interval

    This refers to a confidence interval for a single observation on the outcome variable for a given value of the independent variable x = f.

    $$\hat{y}_f = a + b\,f$$

    $$\text{Prediction } SE(\hat{y}_f) = \sqrt{s_e^2 \left[1 + \frac{1}{n} + \frac{(f - \bar{x})^2}{(n - 1)\,s_x^2}\right]}$$

    $$\hat{y}_f \pm t_{1-\alpha/2,\,n-2} \times PSE$$
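    The two standard errors differ only by the extra 1 inside the bracket. The sketch below evaluates both at f = 175 from the rounded summary values quoted in the example (n = 20, mean height 175.38, s_x = 5.51, s_e^2 = 0.35). The function name `line_value_se` is illustrative, and the t quantile 2.101 (18 degrees of freedom) is taken from a standard t table rather than computed.

```python
import math

def line_value_se(f, n, x_bar, s_x, s_e2, single_observation=False):
    """SE of the estimated line value at x = f; prediction SE if single_observation."""
    bracket = 1 / n + (f - x_bar) ** 2 / ((n - 1) * s_x ** 2)
    if single_observation:
        bracket += 1          # extra variability of one new observation
    return math.sqrt(s_e2 * bracket)

y_hat = -9.19 + 0.0744 * 175
se_mean = line_value_se(175, 20, 175.38, 5.51, 0.35)
se_pred = line_value_se(175, 20, 175.38, 5.51, 0.35, single_observation=True)

t_crit = 2.101  # t quantile for 0.975 with 18 df, from a t table
pred_interval = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)
print(f"y_hat={y_hat:.2f} SE={se_mean:.3f} PSE={se_pred:.3f}")
print("95% prediction interval:", tuple(round(v, 2) for v in pred_interval))
```

    The prediction interval is much wider than the interval for the line value because it also has to cover the scatter of an individual around the line.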

    Discrete Predictors

    In the example considered so far, the independent variable was a continuous variable.

    Suppose now the independent variable is binary, coded as x = 0 or 1.

    Interpretation of regression coefficients:

    $$E(y \mid x) = a + b\,x$$

    $$E(y \mid x = 0) = a$$

    $$E(y \mid x = 1) = a + b$$

    $$b = E(y \mid x = 1) - E(y \mid x = 0)$$

    a = Mean for the reference group, defined as subjects with x = 0

    b = Difference in the means between the two groups, x = 1 versus x = 0

    The test for significance of b is identical to the two-sample t test
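    The interpretation of a and b for a 0/1 predictor can be verified numerically. The sketch below fits a least squares line to a small made-up data set (the numbers are hypothetical, chosen only for illustration) and confirms that the intercept is the mean of the x = 0 group and the slope is the difference in group means.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: x = 0 is the reference group, x = 1 the comparison group
x = [0, 0, 0, 0, 1, 1, 1]
y = [2.0, 3.0, 2.5, 3.5, 4.0, 5.0, 4.5]

a, b = least_squares(x, y)
mean0 = sum(yi for xi, yi in zip(x, y) if xi == 0) / x.count(0)
mean1 = sum(yi for xi, yi in zip(x, y) if xi == 1) / x.count(1)
print(f"a={a:.2f} (mean of x=0 group {mean0:.2f}), b={b:.2f} (difference {mean1 - mean0:.2f})")
```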

    Multiple Predictors

    Often in practice several variables might influence the dependent variable.

    Some common examples

    [Diagram: C, X and Y. Part of the X, Y relationship is due to a common relationship with C]

    C = Confounding variable

    $$E(Y \mid X) = a + b\,X$$

    b ignores the influence of C on X and Y

    $$E(Y \mid X, C) = c + d\,X + e\,C$$

    $$E(Y \mid X = x + 1, C = f) - E(Y \mid X = x, C = f) = [c + d\,(x + 1) + e f] - [c + d\,x + e f] = d$$

    d = Difference in the expected values of Y associated with one positive unit difference in X, holding C constant.

    b = Unadjusted estimate

    d = Estimate adjusted for C; usually d will be smaller than b (but it can be larger than b)

    [Diagram: X, I and Y, with X acting on Y through I]

    I: Intervening variable

    $$E(Y \mid X) = a + b\,X$$

    b = represents the effect that actually acts through the variable I

    $$E(Y \mid X, I) = c + d\,X + e\,I, \qquad d = 0$$

    [Diagram: X, I and Y. X may act through I as well as act independently on Y]

    The statistical effect of I or C on the regression coefficient of X will be the same. The conceptual understanding has to distinguish between confounding and intervening variables

    [Diagram: X, with separate outcomes Y0 (M=0) and Y1 (M=1)]

    The effect of X depends on M. The effect of X is modified by the presence or absence of M

    $$E(Y \mid X, M) = a + b\,X + c\,M + d\,X M$$

    $$E(Y \mid X, M = 0) = a + b\,X$$

    $$E(Y \mid X, M = 1) = (a + c) + (b + d)\,X$$

    d = The extent to which the effect of X is modified by the presence of M (that is, M=1)

    d=0, c arbitrary: Parallel lines for the two groups M=1 and M=0

    d=0, c=0: Coincident lines

    c=0, d arbitrary: Same intercept, lines for M=0 and M=1 are fanning out

    [Figure: four panels showing the lines for M=0 and M=1: d=0, c arbitrary; d, c arbitrary; d=0, c=0; c=0, d arbitrary]

    Analysis of Cross-classified Data

    So far we have concentrated on analyzing relationships between a continuous dependent variable and continuous or discrete (or categorical) independent variables.

    Regression

    ANOVA

    Many times the dependent and independent variables are both discrete.

    Qualitative categories (such as gender, race/ethnicity, geographical location, type of health insurance)

    Quantitative or ordered categories

    low, medium and high socioeconomic status

    none, very low, low, medium, high doses of environmental exposure

    Other combinations are also possible

    Discrete dependent (Yes/No for a disease), continuous independent (Age, BMI etc.): Logistic regression

    Number of events as dependent (number of seizures among epileptic patients over a fixed or variable period of time): Poisson regression

    Continuous dependent but truncated (or censored). For example, failure time, time to death, time to symptoms. These may be known for some individuals and for others it is only known to exceed some known value: Survival analysis

    The chi-square statistic is one of the distance measures:

    $$T = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

    $r$ = Number of rows

    $c$ = Number of columns

    $O_{ij}$ = Observed frequency in row i and column j

    $E_{ij}$ = Expected frequency in row i and column j

    T has a chi-square distribution with df = (r-1)(c-1) degrees of freedom.

    $$T = \frac{(50 - 61.7)^2}{61.7} + \frac{(849 - 837.3)^2}{837.3} + \frac{(29 - 17.7)^2}{17.7} + \cdots + \frac{(3 - 2.7)^2}{2.7} + \frac{(36 - 36.3)^2}{36.3} = 10.5$$

    r=5, c=2; df=4.

    The critical value at significance level 0.05 is 9.49. Since T = 10.5 exceeds it, the data are not consistent with the hypothesis of no association between housing tenure and time of delivery. That is, there is good evidence of an association between housing tenure and time of delivery

    The chi-square statistic is not a measure of association. If we double the frequencies in each cell, the association will remain unchanged but chi-square will double.

    Chi-square is a large sample test and is questionable if any expected frequency is less than 5. Alternatives are

    Yates' correction

    Fisher's exact test

    Yates' Correction

    $$T_Y = \sum \frac{(|O - E| - 0.5)^2}{E}$$

    Fisher's exact test. It is based on computing the probability of observing a particular contingency table or tables that are more inconsistent with the hypothesis of no association. It is a complicated algorithm and usually is performed using a computer.

    Example: The following table is from a trial investigating the efficacy of streptomycin for the treatment of pulmonary tuberculosis. The data are for the subgroup of patients with an initial temperature of 100-100.9°F. The two variables are radiological assessment of the disease 6 months later and treatment.

    Radiological assessment    Streptomycin   Control
    Improvement                13             5
    Deterioration              2              7
    Death                      0              5

    Pooled table

    Radiological assessment    Streptomycin   Control
    Improvement                13 (a)         5 (b)
    Deterioration or Death     2 (c)          12 (d)

    Odds of improvement in the streptomycin group = 13/2 = a/c

    Odds of improvement in the control group = 5/12 = b/d

    Odds ratio = (13/2)/(5/12) = 15.6 = (ad)/(bc)

    Log(OR) = 2.75

    var(log(OR)) = 1/a + 1/b + 1/c + 1/d = 1/13 + 1/5 + 1/2 + 1/12 = 0.86

    SE = 0.93

    Confidence interval:

    (exp[log(OR) - z*SE(log(OR))], exp[log(OR) + z*SE(log(OR))])

    95% confidence interval for log-odds-ratio:

    (2.75 - 1.96*0.93, 2.75 + 1.96*0.93) = (0.93, 4.57)

    95% confidence interval for odds-ratio:

    (2.53, 96.54)
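    The odds ratio and its confidence interval follow mechanically from the four cell counts. This is a minimal sketch with an illustrative function name, `odds_ratio_ci`; because it does not round the intermediate values, the upper limit comes out near 96.1 rather than the hand-rounded 96.54.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (ad)/(bc) with a large-sample CI computed on the log scale."""
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log)
    upper = math.exp(log_or + z * se_log)
    return odds_ratio, se_log, lower, upper

# Pooled streptomycin table: a=13, b=5 (improvement); c=2, d=12 (deterioration or death)
odds_ratio, se_log, lower, upper = odds_ratio_ci(13, 5, 2, 12)
print(f"OR={odds_ratio:.1f}  SE(log OR)={se_log:.2f}  95% CI=({lower:.2f}, {upper:.2f})")
```

    The interval is very wide because two of the four cells are small, so the estimate of the odds ratio is imprecise even though it is far from 1.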

    The analysis so far involved only two variables. What do we do if we have more than two variables? For example, suppose we want to adjust for age and other confounding variables while assessing the association between treatment and outcome (or home ownership and time of delivery).

    The technique is called logistic regression

    Matched study: Binary Outcome

    A questionnaire was administered to 1,319 school children at ages 12 and 14. One question asked was whether the prevalence of reported symptoms was different at the two ages. The following two by two table gives the result

                               Severe colds at age 14
    Severe colds at age 12     Yes     No
    Yes                        212     144
    No                         256     707

    As in the paired t-test example, we want to exploit the fact that the same children were asked at age 12 and then again when they were 14

    The concordant pairs are (yes, yes) and (no, no).

    The discordant pairs are (yes, no) and (no, yes)

    $H_0$: The prevalence of reported symptoms is the same at the two ages

    Under the null hypothesis, the proportion of subjects answering (yes, no) should be the same as the proportion of subjects answering (no, yes)

    Observed: $f_{yn} = 144, \; f_{ny} = 256$

    Expected: $\frac{f_{yn} + f_{ny}}{2} = 200$ in each discordant cell

    $$\chi^2 = \frac{\left(f_{yn} - \frac{f_{yn} + f_{ny}}{2}\right)^2}{\frac{f_{yn} + f_{ny}}{2}} + \frac{\left(f_{ny} - \frac{f_{yn} + f_{ny}}{2}\right)^2}{\frac{f_{yn} + f_{ny}}{2}} = \frac{(f_{yn} - f_{ny})^2}{f_{yn} + f_{ny}} = 31.4$$

    df = 1

    The chi-square value is highly significant. The proportions at the two ages are not the same.

    Odds of transition from No to Yes is 256/144 = 1.78

    Conditional analysis (conditional on transition)

    This is called McNemar's test
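    The test statistic depends only on the two discordant counts. The sketch below computes it both ways (long form with expected counts, and the simplified form) and obtains a P-value from the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable. Function and variable names are illustrative.

```python
import math
from statistics import NormalDist

def mcnemar_chi2(f_yn, f_ny):
    """McNemar chi-square statistic (1 df) from the discordant pair counts."""
    return (f_yn - f_ny) ** 2 / (f_yn + f_ny)

f_yn, f_ny = 144, 256            # (yes at 12, no at 14) and (no at 12, yes at 14)
expected = (f_yn + f_ny) / 2     # expected count in each discordant cell under H0
chi2_long = (f_yn - expected) ** 2 / expected + (f_ny - expected) ** 2 / expected
chi2 = mcnemar_chi2(f_yn, f_ny)

# Chi-square with 1 df is a squared standard normal, so use the normal tail area
p_value = 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))
print(f"chi-square={chi2:.2f}  P-value={p_value:.2e}")
```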

    Nonparametric Approaches


    Most statistical approaches discussed so far assume some distribution for the population (mostly normal).

    Approaches such as one- and two-sample t-tests, linear regression etc. are valid unless the departure from normality is very severe.

    Nevertheless, it will be useful to have a set of techniques that can be applied without any distributional assumptions.

    The sign test discussed earlier in the course is an example of a nonparametric test. However, this procedure can have low power because it uses only the signs and not the magnitudes.

    An alternative is to use the magnitudes in some way while still maintaining the nonparametric nature of the tests.

    Rank-based procedures are quite popular

    Paired Designs

    Revisit the husband-wife pair example

    Wilcoxon signed rank procedure

    Step 1: Rank the absolute values of the differences

    Step 2: Take the difference in the sums of the ranks of the positive and negative differences

    Pair    Difference (d)    Rank (r)
    1        2.3              7
    2       -1.1              4.5
    3        0.8              3
    4        2.4              8
    5       -3.1              9
    6       -3.2              10
    7       -0.6              1.5
    8        0.6              1.5
    9       -1.1              4.5
    10      -1.5              6

    $$w = (7 + 3 + 8 + 1.5) - (4.5 + 9 + 10 + 1.5 + 4.5 + 6) = 19.5 - 35.5 = -16$$

    $$E(w) = 0; \qquad Var(w) = n(n + 1)(2n + 1)/6 = 10 \times 11 \times 21 / 6 = 385$$

    $$z = \frac{w - E(w)}{\sqrt{Var(w)}} = \frac{-16}{\sqrt{385}} = -0.82$$

    Null hypothesis: The median of the distribution of differences is zero.

    All nonparametric procedures formulate hypotheses in terms of medians rather than means
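    The two steps of the signed rank procedure can be sketched as follows. Tied absolute differences receive the average of the ranks they span (midranks), which is the convention the rank column uses; the helper name `midranks` is illustrative, and no tie correction is applied to the variance.

```python
import math

def midranks(values):
    """Rank values from smallest to largest, averaging ranks within ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1      # positions i..j share this rank
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1
    return ranks

d = [2.3, -1.1, 0.8, 2.4, -3.1, -3.2, -0.6, 0.6, -1.1, -1.5]
ranks = midranks([abs(v) for v in d])                    # Step 1
positive_sum = sum(r for v, r in zip(d, ranks) if v > 0)
negative_sum = sum(r for v, r in zip(d, ranks) if v < 0)
w = positive_sum - negative_sum                          # Step 2

n = len(d)
var_w = n * (n + 1) * (2 * n + 1) / 6
z = w / math.sqrt(var_w)
print(f"ranks={ranks}  w={w}  z={z:.2f}")
```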

    Two sample nonparametric tests

    These are analogs of the two-sample t-tests

    Mann-Whitney-Wilcoxon test

    Sample of size n from population 1

    Sample of size m from population 2

    Rank all (n+m) units regardless of population

    Sum the ranks of subjects in sample 1 and call it T.

    Define U = T - n(n+1)/2

    Alternatively, one can sum the ranks of subjects in sample 2 and then replace n by m

    $$z = \frac{U - mn/2}{\sqrt{mn(m + n + 1)/12}}$$

    Null hypothesis: The distributions in the two populations are the same

    Example: The following table gives biceps skinfold measurements for 20 patients with Crohn's disease and 9 patients with coeliac disease. The objective is to assess whether the distributions of the biceps measurements are the same

    Crohn's disease: 1.8, 2.8, 4.2, 6.2, 2.2, 3.2, 4.4, 6.6, 2.4, 3.6, 4.8, 7.0, 2.5, 3.8, 5.6, 10.0, 2.8, 4.0, 6.0, 10.4

    Coeliac disease: 1.8, 2.0, 2.0, 2.0, 3.0, 3.8, 4.2, 5.4, 7.6

    Rank all 29 observations (values marked * are from Sample 2, the coeliac group; these were circled on the slide)

    Value   Rank      Value   Rank      Value   Rank
    1.8*    1.5       3.0*    11        5.4*    21
    1.8     1.5       3.2     12        5.6     22
    2.0*    4         3.6     13        6.0     23
    2.0*    4         3.8*    14.5      6.2     24
    2.0*    4         3.8     14.5      6.6     25
    2.2     6         4.0     16        7.0     26
    2.4     7         4.2*    17.5      7.6*    27
    2.5     8         4.2     17.5      10.0    28
    2.8     9.5       4.4     19        10.4    29
    2.8     9.5       4.8     20

    Rank sum for Sample 2 = 104.5

    $$U = 104.5 - 9 \times 10 / 2 = 59.5$$

    $$z = \frac{59.5 - 9 \times 20 / 2}{\sqrt{9 \times 20 \times (9 + 20 + 1)/12}} = -1.44$$

    Two-sided p-value = 0.15

    This is very similar to the result one obtains using the two-sample t-test
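    The rank sum, U and z for the skinfold example can be reproduced with a short script. This is a minimal sketch using midranks for ties and the normal approximation without a tie correction; the helper name `midranks` is illustrative, and the last Crohn's value is taken as 10.4, as in the data listing.

```python
import math
from statistics import NormalDist

def midranks(values):
    """Rank values from smallest to largest, averaging ranks within ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

crohns = [1.8, 2.8, 4.2, 6.2, 2.2, 3.2, 4.4, 6.6, 2.4, 3.6,
          4.8, 7.0, 2.5, 3.8, 5.6, 10.0, 2.8, 4.0, 6.0, 10.4]
coeliac = [1.8, 2.0, 2.0, 2.0, 3.0, 3.8, 4.2, 5.4, 7.6]

n, m = len(crohns), len(coeliac)
ranks = midranks(coeliac + crohns)       # rank all 29 observations together
T = sum(ranks[:m])                       # rank sum for the coeliac sample
U = T - m * (m + 1) / 2
z = (U - m * n / 2) / math.sqrt(m * n * (m + n + 1) / 12)
p_two_sided = 2 * NormalDist().cdf(-abs(z))
print(f"T={T}  U={U}  z={z:.2f}  two-sided p={p_two_sided:.2f}")
```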

    Generalizations

    What if you have more than two groups?

    Rank all the observations regardless of group and then perform the one-way analysis of variance on the ranks.

    The null hypothesis: The distributions for the various populations defined by the groups are the same.

    You can get ranks by using PROC RANK in SAS. See the handout for an example