Stat Topics

  • 8/2/2019 Stat Topics

    Mean, Mode, Median, and Standard Deviation

    The Mean and Mode

    The sample mean is the average. It is computed as the sum of all the observed outcomes from the sample divided by the total number of observations. We use \( \bar{x} \) as the symbol for the sample mean. In math terms,

    \[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

    where \( n \) is the sample size and the \( x_i \) are the observed values.

    Example

    Suppose you randomly sampled six acres in the Desolation Wilderness for a non-indigenous weed and came up with the following counts of this weed in this region:

    34, 43, 81, 106, 106 and 115

    We compute the sample mean by adding and dividing by the number of samples, 6:

    \[ \frac{34 + 43 + 81 + 106 + 106 + 115}{6} = \frac{485}{6} \approx 80.83 \]

    We can say that the sample mean count of the non-indigenous weed is 80.83.

    The mode of a set of data is the value with the highest frequency. In the above example 106 is the mode, since it occurs twice and the rest of the outcomes occur only once.

    The population mean is the average of the entire population and is usually impossible to compute. We use the Greek letter \( \mu \) for the population mean.
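
    As a quick check of the weed-count example, the sample mean and mode can be computed with Python's standard `statistics` module. This is only a sketch; the variable names are illustrative and not part of the original notes.

```python
from statistics import mean, multimode

# Weed counts from the six sampled acres in the example.
counts = [34, 43, 81, 106, 106, 115]

sample_mean = mean(counts)   # sum of the observations divided by n
modes = multimode(counts)    # list of the most frequent value(s)

print(f"sample mean = {sample_mean:.2f}")
print(f"mode(s)     = {modes}")
```

    `multimode` returns a list because a data set can have more than one most-frequent value.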

    Median and Trimmed Mean

    One problem with using the mean is that it often does not depict the typical outcome. If there is one outcome that is very far from the rest of the data, then the mean will be strongly affected by this outcome. Such an outcome is called an outlier. An alternative measure is the median. The median is the middle score. If we have an even number of observations we take the average of the two middle values. The median is better for describing the typical value. It is often used for income and home prices.

    Example

    Suppose you randomly selected 10 house prices in the South Lake Tahoe area. You are interested in the typical house price. In units of $100,000 the prices were

    2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8

    If we computed the mean, we would say that the average house price is $744,000. Although this number is correct, it does not reflect the price for available housing in South Lake Tahoe. A closer look at the data shows that the house valued at 40.8 x $100,000 = $4.08 million skews the data. Instead, we use the median. Since there is an even number of outcomes, we take the average of the middle two:

    \[ \frac{3.7 + 4.1}{2} = 3.9 \]


    The median house price is $390,000. This better reflects what house shoppers should expect to spend.
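
    The effect of the single expensive house can be seen by computing both measures side by side. The following is a small sketch using Python's `statistics` module; the variable names are chosen here for illustration.

```python
from statistics import mean, median

# House prices in units of $100,000, from the South Lake Tahoe example.
prices = [2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8]

mean_price = mean(prices) * 100_000      # dragged upward by the 40.8 outlier
median_price = median(prices) * 100_000  # average of the two middle values

print(f"mean   = ${mean_price:,.0f}")
print(f"median = ${median_price:,.0f}")
```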

    There is an alternative value that is also resistant to outliers. This is called the trimmed mean, which is the mean after getting rid of the outliers, or of 5% of the data on the top and 5% on the bottom. We can also use the trimmed mean if we are concerned with outliers skewing the data; however, the median is used more often since more people understand it.

    Example:

    At a ski rental shop data was collected on the number of rentals on each of ten consecutive Saturdays:

    44, 50, 38, 96, 42, 47, 40, 39, 46, 50.

    To find the sample mean, add them and divide by 10:

    \[ \frac{44 + 50 + 38 + 96 + 42 + 47 + 40 + 39 + 46 + 50}{10} = 49.2 \]

    Notice that the mean value is not a value of the sample.

    To find the median, first sort the data:

    38, 39, 40, 42, 44, 46, 47, 50, 50, 96

    Notice that there are two middle numbers, 44 and 46. To find the median we take the average of the two.

    \[ \text{Median} = \frac{44 + 46}{2} = 45 \]

    Notice also that the mean is larger than all but three of the data points. The mean is influenced by outliers while the median is robust.
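
    The ski rental numbers can be used to compare the mean, the median, and a trimmed mean. The sketch below assumes a hypothetical helper, `trimmed_mean`, written for this illustration. With only ten observations a 5% trim removes no whole data point, so the example trims 10% from each end instead. That choice is an assumption and is not taken from the notes.

```python
from statistics import mean, median

def trimmed_mean(data, proportion):
    """Mean after dropping `proportion` of the points from each end.

    Hypothetical helper for illustration; it drops int(n * proportion)
    points from each side of the sorted data.
    """
    values = sorted(data)
    k = int(len(values) * proportion)  # points to drop per side
    kept = values[k:len(values) - k] if k else values
    return mean(kept)

rentals = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

rentals_mean = mean(rentals)                   # pulled upward by the outlier 96
rentals_median = median(rentals)               # average of 44 and 46
rentals_trimmed = trimmed_mean(rentals, 0.10)  # drops 38 and 96

print(rentals_mean, rentals_median, rentals_trimmed)
```

    Both the median and the trimmed mean land in the mid-40s, close to what a typical Saturday looks like, while the mean is pulled toward the single busy day.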

    Variance, Standard Deviation and Coefficient of Variation

    The mean, mode, median, and trimmed mean do a nice job in telling where the center of the data set is, but often we are interested in more. For example, a pharmaceutical engineer develops a new drug that regulates iron in the blood. Suppose she finds out that the average iron content after taking the medication is the optimal level. This does not mean that the drug is effective. There is a possibility that half of the patients have dangerously low iron content while the other half have dangerously high content. Instead of the drug being an effective regulator, it is a deadly poison. What the engineer needs is a measure of how far the data is spread apart. This is what the variance and standard deviation do. First we show the formulas for these measurements. Then we will go through the steps on how to use the formulas.

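    We define the variance to be

    \[ s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

    and the standard deviation to be

    \[ s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]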

    Variance and Standard Deviation: Step by Step

    1. Calculate the mean, \( \bar{x} \).
    2. Write a table that subtracts the mean from each observed value.
    3. Square each of the differences.
    4. Add this column.
    5. Divide by n - 1, where n is the number of items in the sample. This is the variance.
    6. To get the standard deviation we take the square root of the variance.

    Example

    The owner of the Ches Tahoe restaurant is interested in how much people spend at the restaurant. He examines 10 randomly selected receipts for parties of four and writes down the following data.

    44, 50, 38, 96, 42, 47, 40, 39, 46, 50

    He calculated the mean by adding and dividing by 10 to get

    \( \bar{x} = 49.2 \)

    Below is the table for getting the standard deviation:

    x        x - 49.2    (x - 49.2)^2
    44         -5.2          27.04
    50          0.8           0.64
    38        -11.2         125.44
    96         46.8        2190.24
    42         -7.2          51.84
    47         -2.2           4.84
    40         -9.2          84.64
    39        -10.2         104.04
    46         -3.2          10.24
    50          0.8           0.64
    Total                  2599.60

    Now

    \[ \frac{2599.6}{10 - 1} \approx 288.8 \]

    Hence the variance is approximately 289 and the standard deviation is approximately the square root of 289, which is 17.
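
    The step-by-step procedure can be sketched in Python for the restaurant receipts. The intermediate names (`deviations`, `squared`, `total`) are illustrative only. The last lines cross-check the hand computation against `statistics.variance` and `statistics.stdev`, which also divide by n - 1.

```python
from math import sqrt
from statistics import mean, stdev, variance

receipts = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

x_bar = mean(receipts)                      # step 1: the mean
deviations = [x - x_bar for x in receipts]  # step 2: subtract the mean
squared = [d ** 2 for d in deviations]      # step 3: square each difference
total = sum(squared)                        # step 4: add the column
s2 = total / (len(receipts) - 1)            # step 5: divide by n - 1
s = sqrt(s2)                                # step 6: square root

print(f"sum of squares = {total:.2f}")
print(f"variance       = {s2:.2f}")
print(f"std deviation  = {s:.2f}")

# Cross-check against the library's sample variance and standard deviation.
assert abs(s2 - variance(receipts)) < 1e-9
assert abs(s - stdev(receipts)) < 1e-9
```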

    Since the standard deviation can be thought of as measuring how far the data values lie from the mean, we take the mean and move one standard deviation in either direction. The mean for this example was about 49.2 and the standard deviation was 17. We have:

    49.2 - 17 = 32.2

    and


    49.2 + 17 = 66.2

    What this means is that most of the patrons probably spend between $32.20 and $66.20.

    The sample standard deviation will be denoted by \( s \) and the population standard deviation will be denoted by the Greek letter \( \sigma \).

    The sample variance will be denoted by \( s^2 \) and the population variance will be denoted by \( \sigma^2 \).

    The variance and standard deviation describe how spread out the data is. If the data all lies close to the mean, then the standard deviation will be small, while if the data is spread out over a large range of values, \( s \) will be large. Having outliers will increase the standard deviation.

    One of the flaws of the standard deviation is that it depends on the units that are used. One way of handling this difficulty is the coefficient of variation, which is the standard deviation divided by the mean, times 100%:

    \[ CV = \frac{s}{\bar{x}} \times 100\% \]

    In the above example, it is

    \[ \frac{17}{49.2} \times 100\% \approx 34.6\% \]

    This tells us that the standard deviation of the restaurant bills is 34.6% of the mean.
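
    To illustrate why the coefficient of variation does not depend on units, the sketch below computes it for the receipts in dollars and again in cents. The function name `coefficient_of_variation` is a hypothetical helper, not a library call. Because it uses the unrounded standard deviation rather than 17, the result comes out slightly below the rounded figure of 34.6%.

```python
from statistics import mean, stdev

def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the mean (illustrative helper)."""
    return stdev(data) / mean(data) * 100

receipts_dollars = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]
receipts_cents = [x * 100 for x in receipts_dollars]

cv_dollars = coefficient_of_variation(receipts_dollars)
cv_cents = coefficient_of_variation(receipts_cents)

print(f"CV in dollars: {cv_dollars:.2f}%")
print(f"CV in cents:   {cv_cents:.2f}%")  # same value: the units cancel
```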

    Chebyshev's Theorem

    A mathematician named Chebyshev came up with bounds on how much of the data must lie close to the mean. In particular, for any positive k, the proportion of the data that lies within k standard deviations of the mean is at least

    \[ 1 - \frac{1}{k^2} \]

    (The bound is informative only for k > 1, since it is zero or negative otherwise.)

    For example, if k = 2 this number is

    \[ 1 - \frac{1}{2^2} = 0.75 \]

    This tells us that at least 75% of the data lies within 2 standard deviations of the mean. In the above example, we can say that at least 75% of the diners spent between

    49.2 - 2(17) = 15.2

    and

    49.2 + 2(17) = 83.2

    dollars.
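
    Chebyshev's bound can be compared with what actually happens in the receipt data. The sketch below uses a hypothetical helper, `chebyshev_lower_bound`, and counts how many receipts fall within k = 2 standard deviations of the mean. The observed fraction may be well above the bound, which is only a guaranteed minimum.

```python
from statistics import mean, stdev

def chebyshev_lower_bound(k):
    """Minimum fraction of data within k standard deviations of the mean."""
    return 1 - 1 / k ** 2

receipts = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]
x_bar, s = mean(receipts), stdev(receipts)

k = 2
low, high = x_bar - k * s, x_bar + k * s
observed = sum(low <= x <= high for x in receipts) / len(receipts)

print(f"guaranteed at least: {chebyshev_lower_bound(k):.0%}")
print(f"interval: {low:.1f} to {high:.1f}")
print(f"observed fraction inside: {observed:.0%}")
```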

    A normal distribution is a very important statistical data distribution pattern occurring in many natural phenomena, such as height, blood pressure, lengths of objects produced by machines, etc. Certain data, when graphed as a histogram (data on the horizontal axis, amount of data on the vertical axis), creates a bell-shaped curve known as a normal curve, or normal distribution.

    Normal distributions are symmetrical with a single central peak at the mean (average) of the data. The shape of the curve is described as bell-shaped with the graph falling off evenly on either side of the mean. Fifty percent of the distribution lies to the left of the mean and fifty percent lies to the right of the mean.

    The spread of a normal distribution is controlled by the standard deviation, \( \sigma \). The smaller the standard deviation the more concentrated the data.

    The mean and the median are the same in a normal distribution.

    Chart prepared by the NY State Education Department

    Reading from the chart, we see that approximately 19.1% of normally distributed data is located between the mean (the peak) and 0.5 standard deviations to the right (or left) of the mean. (The percentages are represented by the area under the curve.)

    Understand that this chart shows only percentages that correspond to subdivisions up to one-half of one standard deviation. Percentages for other subdivisions require a statistical mathematical table or a graphing calculator.

    (See example 4)

    If you add percentages, you will see that approximately:

    68% of the distribution lies within one standard deviation of the mean.
    95% of the distribution lies within two standard deviations of the mean.
    99.7% of the distribution lies within three standard deviations of the mean.

    These percentages are known as the "empirical rule".

    Note: The sums of the percentages in the chart differ slightly from the empirical rule values because of rounding in the chart.
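
    The empirical rule percentages, and the 19.1% band read from the chart, can be reproduced from the standard normal cumulative distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper `fraction_within` is named here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Area under the standard normal curve between -k and +k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} s.d.: {fraction_within(k):.1%}")

# Area between the mean and 0.5 standard deviations to the right.
half_sd_band = standard_normal.cdf(0.5) - standard_normal.cdf(0)
print(f"mean to +0.5 s.d.: {half_sd_band:.1%}")
```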


    s.d. in callout boxes = standard deviation

    It is also true that: 50% of the distribution lies within 0.67448 standard deviations of the mean.

    If you are asked for the interval about the mean containing 50% of the data, you are actually being asked for the interquartile range, IQR. The IQR (the width of an interval which contains the middle 50% of the data set) is normally computed by subtracting the first quartile from the third quartile. In a normal distribution (with mean 0 and standard deviation 1), the first and third quartiles are located at -0.67448 and +0.67448 respectively. Thus the IQR for a normal distribution is:

    Interquartile range = 1.34896 x standard deviation (this will be the population IQR)
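
    The quartile positions quoted above can be recovered from the inverse of the standard normal cumulative distribution function. The following is a sketch using `statistics.NormalDist.inv_cdf`; the computed values agree with the quoted ones up to the truncation of the last decimal place.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

q1 = standard_normal.inv_cdf(0.25)  # first quartile, in standard deviations
q3 = standard_normal.inv_cdf(0.75)  # third quartile, in standard deviations
iqr_factor = q3 - q1                # IQR as a multiple of the standard deviation

print(f"Q1 = {q1:.5f}, Q3 = {q3:.5f}")
print(f"IQR = {iqr_factor:.5f} x standard deviation")
```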

    Percentiles and the Normal Curve

    The mean (at the center peak of the curve) is the 50th percentile.

    The term "percentile rank" refers to the area (probability) to the left of the value. Adding the given percentages from the chart will let you find certain percentiles along the curve.

    Examples: Look for the words "normally distributed" in a question before referring to the Normal Distribution Standard Deviation chart seen on this page. When using the chart, your information should fall on the increments of one-half of one standard deviation as shown in the chart.

    1. Find the percentage of the normally distributed data that lies within 2 standard deviations of the mean.

    Solution: Read the percentages from the chart at the top of this page from -2 to +2 standard deviations.

    4.4% + 9.2% + 15.0% + 19.1% + 19.1% + 15.0% + 9.2% + 4.4% = 95.4%

    2. At the New Age Information Corporation, the ages of all new employees hired during the last 5 years are normally distributed. Within this curve, 95.4% of the ages, centered about the mean, are between 24.6 and 37.4 years. Find the mean age and the standard deviation of the data.

    Solution: As was seen in Example 1, 95.4% implies a span of 2 standard deviations on either side of the mean. The mean age is symmetrically located between -2 standard deviations (24.6) and +2 standard deviations (37.4).

    The mean age is (24.6 + 37.4)/2 = 31 years of age.

    From 31 to 37.4 (a distance of 6.4 years) is 2 standard deviations. Therefore, 1 standard deviation is (6.4)/2 = 3.2 years.
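
    The reasoning in Example 2 can be written out as a short calculation. The sketch below derives the mean and standard deviation from the two endpoints, then checks with `statistics.NormalDist` that the interval really covers about 95.4% of the distribution. Variable names are illustrative.

```python
from statistics import NormalDist

low, high = 24.6, 37.4  # ages at -2 and +2 standard deviations

mean_age = (low + high) / 2   # midpoint of a symmetric interval
sd_age = (high - mean_age) / 2  # half the distance covers 2 standard deviations

ages = NormalDist(mu=mean_age, sigma=sd_age)
coverage = ages.cdf(high) - ages.cdf(low)

print(f"mean = {mean_age:.1f} years, s.d. = {sd_age:.1f} years")
print(f"share of ages between {low} and {high}: {coverage:.1%}")
```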

    3. The amount of time that Carlos plays video games in any given week is normally distributed. If Carlos plays video games an average of 15 hours per week, with a standard deviation of 3 hours, what is the probability of Carlos playing video games between 15 and 18 hours a week?

    Solution: The average (mean) is 15 hours. If the standard deviation is 3, the interval between 15 and 18 hours is one standard deviation above the mean, which gives a probability of 34.1% or 0.341, as seen in the chart at the top of this page.
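
    The chart value in Example 3 can be confirmed as a difference of two cumulative probabilities. This is a minimal sketch using `statistics.NormalDist`, with names chosen for illustration.

```python
from statistics import NormalDist

hours = NormalDist(mu=15, sigma=3)  # weekly hours of video games

# P(15 <= X <= 18): area between the mean and one standard deviation above it.
p_15_to_18 = hours.cdf(18) - hours.cdf(15)

print(f"P(15 <= hours <= 18) = {p_15_to_18:.3f}")
```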

    4. The lifetime of a battery is normally distributed with a mean life of 40 hours and a standard deviation of 1.2 hours. Find the probability that a randomly selected battery lasts longer than 42 hours.

    The most accurate answer to a problem such as this cannot be obtained by using the chart at the top of this page. One standard deviation above the mean would be located at 41.2 hours, 2 standard deviations would be at 42.4 hours, and one and one-half standard deviations would be at 41.8 hours. None of these locations corresponds exactly to the needed 42 hours. We need more power than we have in the chart to find the most accurate answer. Calculator to the rescue!!

    Solution: Graph the normal curve. We see from the location of 42 on the graph that the answer is going to be quite small.

    Now, determine the probability of a value falling to the right of 42 hours (between 42 hours and infinity). Answer: 4.779%
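
    In place of a graphing calculator, the same tail probability can be computed in software. The sketch below uses `statistics.NormalDist` and takes the area to the right of 42 hours as one minus the cumulative probability at 42. Names are illustrative.

```python
from statistics import NormalDist

battery_life = NormalDist(mu=40, sigma=1.2)  # hours

z = (42 - 40) / 1.2                     # 42 hours expressed in standard deviations
p_longer_than_42 = 1 - battery_life.cdf(42)  # area to the right of 42 hours

print(f"z = {z:.3f}")
print(f"P(life > 42 hours) = {p_longer_than_42:.3%}")
```

    Since 42 hours is about 1.67 standard deviations above the mean, it falls between the 1.5 and 2 standard deviation marks on the chart, which is why the chart alone cannot give the exact answer.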
