© 2002-2008 The Trustees of Indiana University
http://www.indiana.edu/~statmath
Indiana University
University Information Technology Services
Univariate Analysis and Normality Test Using SAS, Stata,
and SPSS*
Hun Myoung Park, Ph.D.
© 2002-2008
Last modified on November 2008
University Information Technology Services
Center for Statistical and Mathematical Computing
Indiana University
410 North Park Avenue Bloomington, IN 47408
(812) 855-4724 (317) 278-4740
http://www.indiana.edu/~statmath
* The citation of this document should read: “Park, Hun Myoung. 2008. Univariate Analysis and Normality Test
Using SAS, Stata, and SPSS. Working Paper. The University Information Technology Services (UITS) Center for
Statistical and Mathematical Computing, Indiana University.”
http://www.indiana.edu/~statmath/stat/all/normality/index.html
This document summarizes graphical and numerical methods for univariate analysis and
normality testing, and illustrates how to carry them out using SAS 9.1, Stata 10 special edition, and SPSS 16.0.
1. Introduction
2. Graphical Methods
3. Numerical Methods
4. Testing Normality Using SAS
5. Testing Normality Using Stata
6. Testing Normality Using SPSS
7. Conclusion
1. Introduction
Descriptive statistics provide important information about the variables to be analyzed. The mean,
median, and mode measure the central tendency of a variable. Measures of dispersion include the
variance, standard deviation, range, and interquartile range (IQR). Researchers may draw a
histogram, stem-and-leaf plot, or box plot to see how a variable is distributed.
Statistical methods are based on various underlying assumptions. One common assumption is
that a random variable is normally distributed. In many statistical analyses, normality is often
conveniently assumed without any empirical evidence or test. But normality is critical in many
statistical methods. When this assumption is violated, interpretation and inference may not be
reliable or valid.
Figure 1. Comparing the Standard Normal and a Bimodal Probability Distribution
[Two density plots over the range −5 to 5 (density 0 to .4): left, the standard normal distribution; right, a bimodal distribution.]
The t-test and ANOVA (Analysis of Variance) compare group means, assuming a variable of
interest follows a normal probability distribution. Otherwise, these methods do not make much
sense. Figure 1 illustrates the standard normal probability distribution and a bimodal
distribution. How can you compare means of these two random variables?
There are two ways of testing normality (Table 1). Graphical methods visualize the
distributions of random variables or differences between an empirical distribution and a
theoretical distribution (e.g., the standard normal distribution). Numerical methods present
summary statistics such as skewness and kurtosis, or conduct statistical tests of normality.
Graphical methods are intuitive and easy to interpret, while numerical methods provide
objective ways of examining normality.
Table 1. Graphical Methods versus Numerical Methods
                 Graphical Methods                        Numerical Methods
--------------------------------------------------------------------------------------------
Descriptive      Stem-and-leaf plot, (skeletal) box       Skewness
                 plot, dot plot, histogram                Kurtosis
Theory-driven    P-P plot                                 Shapiro-Wilk test, Shapiro-Francia test
                 Q-Q plot                                 Kolmogorov-Smirnov test (Lilliefors test)
                                                          Anderson-Darling / Cramer-von Mises tests
                                                          Jarque-Bera test, Skewness-Kurtosis test
Graphical and numerical methods are either descriptive or theory-driven. A dot plot and
histogram, for instance, are descriptive graphical methods, while skewness and kurtosis are
descriptive numerical methods. The P-P and Q-Q plots are theory-driven graphical methods for
testing normality, whereas the Shapiro-Wilk W and Jarque-Bera tests are theory-driven numerical
methods.
Figure 2. Histograms of Normally and Non-normally Distributed Variables
[Left: histogram of a normally distributed variable (N=500), randomly drawn from the standard normal distribution (seed=1,234,567). Right: histogram of per capita gross national income in 2005 ($1,000), a non-normally distributed variable (N=164).]
Three variables are employed here. The first is the 2005 unemployment rate of counties in Illinois,
Indiana, and Ohio. The second consists of 500 observations randomly drawn from the standard
normal distribution; this variable is supposed to be normally distributed with mean 0 and variance
1 (left plot in Figure 2). An example of a non-normal distribution is the 2005 per capita gross
national income (GNI) of 164 countries; GNI is severely skewed to the right and is unlikely to be
normally distributed (right plot in Figure 2). See the Appendix for details.
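As an aside, a comparable normal sample can be simulated in any statistical package. The Python/NumPy sketch below is not from the paper (which uses SAS, Stata, and SPSS); reusing the seed number 1,234,567 here is purely illustrative and will not reproduce the paper's exact draws.

```python
import numpy as np

# Draw 500 observations from the standard normal distribution.
# The seed is illustrative; different generators give different draws.
rng = np.random.default_rng(1234567)
normal_sample = rng.standard_normal(500)

# The sample mean and variance should be close to 0 and 1, respectively.
print(normal_sample.mean(), normal_sample.var(ddof=1))
```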
2. Graphical Methods
Graphical methods visualize the distribution of a random variable and compare the distribution
to a theoretical one using plots. These methods are either descriptive or theory-driven: the
former are based on the empirical data alone, whereas the latter consider both empirical and
theoretical distributions.
2.1 Descriptive Plots
Among frequently used descriptive plots are the stem-and-leaf plot, dot plot, (skeletal) box plot,
and histogram. When N is small, a stem-and-leaf plot and dot plot are useful for summarizing
continuous or event count data. Figures 3 and 4 respectively present a stem-and-leaf plot and a
dot plot of the unemployment rates of the three states.
Figure 3. Stem-and-Leaf Plot of Unemployment Rate of Illinois, Indiana, Ohio
-> state = IL
Stem-and-leaf plot for rate (Rate)
rate rounded to nearest multiple of .1
plot in units of .1
3. | 7889
4* | 011122344
4. | 556666666677778888999
5* | 0011122222333333344444
5. | 5555667777777888999
6* | 000011222333444
6. | 555579
7* | 0033
7. |
8* | 0
8. | 8
-> state = IN
Stem-and-leaf plot for rate (Rate)
rate rounded to nearest multiple of .1
plot in units of .1
3* | 1
3. | 89
4* | 012234
4. | 566666778889999
5* | 00000111222222233344
5. | 555666666777889
6* | 002222233344
6. | 5666677889
7* | 1113344
7. | 67
8* | 14
-> state = OH
Stem-and-leaf plot for rate (Rate)
rate rounded to nearest multiple of .1
plot in units of .1
3* | 8
4* | 014577899
5* | 01223333445556667778888888999
6* | 001111122222233444446678899
7* | 01223335677
8* | 1223338
9* | 99
10* | 1
11* |
12* |
13* | 3
Figure 4. Dot Plot of Unemployment Rate of Illinois, Indiana, Ohio
[Dot plot of the unemployment rate (0 to 15 percent) for Illinois (N=102), Indiana (N=92), Ohio (N=88).
Indiana Business Research Center (http://www.stats.indiana.edu/)
Source: Bureau of Labor Statistics]
A box plot presents the minimum, 25th percentile (1st quartile), 50th percentile (median), 75th
percentile (3rd quartile), and maximum in a box and lines.1 Outliers, if any, appear outside the
(adjacent) minimum and maximum lines. As such, a box plot effectively summarizes these major
percentiles using a box and lines. If a variable is normally distributed, its 25th and 75th
percentiles are symmetric about the median, and its median and mean are located at the same
point, exactly in the center of the box.2
In Figure 5, you should see outliers in Illinois and Ohio that affect the shapes of the
corresponding boxes. By contrast, the Indiana unemployment rate has no outliers, and its
symmetric box implies that the rate is approximately normally distributed.
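The percentiles a box plot summarizes are easy to compute directly. The Python/NumPy sketch below is an illustration only (the paper itself uses SAS, Stata, and SPSS); the data are made-up rates, and the 1.5 × IQR fence is one common outlier rule, not the only one.

```python
import numpy as np

# Toy unemployment rates (made up for illustration).
rate = np.array([3.8, 4.4, 4.9, 5.1, 5.3, 5.6, 5.9, 6.2, 6.8, 13.3])

# The five summaries a box plot draws: quartiles plus min/max.
q1, q2, q3 = np.percentile(rate, [25, 50, 75])
iqr = q3 - q1

# A common convention flags points beyond 1.5 * IQR from the box as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = rate[(rate < lower) | (rate > upper)]
print(q1, q2, q3, outliers)
```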
Figure 5. Box Plots of Unemployment Rates of Illinois, Indiana, and Ohio
[Box plots of Unemployment Rate (%), scale 2 to 14, for Illinois (N=102), Indiana (N=92), Ohio (N=88).
Indiana Business Research Center (http://www.stats.indiana.edu/)
Source: Bureau of Labor Statistics]
A histogram shows the proportion of total observations that falls into each category (interval)
and is appropriate when N is large (Figure 6).
Figure 6. Histograms of Unemployment Rates of Illinois, Indiana and Ohio
[Three histograms of the unemployment rate (0 to 15 percent, density 0 to .5): Illinois (N=102), Indiana (N=92), Ohio (N=88).
Indiana Business Research Center (http://www.stats.indiana.edu/)
Source: Bureau of Labor Statistics]
1 The first quartile cuts off the lowest 25 percent of the data; the second quartile (the median) cuts the data set in
half; and the third quartile cuts off the lowest 75 percent (or highest 25 percent) of the data. See
http://en.wikipedia.org/wiki/Quartile
2 SAS reports the mean as “+” between the (adjacent) minimum and maximum lines.
2.2 Theory-driven Plots
P-P and Q-Q plots are considered here. The probability-probability plot (P-P plot or percent
plot) compares an empirical cumulative distribution function of a variable with a specific
theoretical cumulative distribution function (e.g., the standard normal distribution function). In
Figure 7, Ohio appears to deviate more from the fitted line than Indiana.
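The P-P construction described above can be sketched in a few lines. This Python/SciPy example is not from the paper: the rates are made up, the plotting position P[i] = i/(N+1) matches the figure's axis label, and the normal parameters are estimated from the sample.

```python
import numpy as np
from scipy.stats import norm

# Made-up unemployment rates for illustration.
rate = np.array([3.1, 3.8, 4.2, 4.6, 5.0, 5.1, 5.5, 5.8, 6.2, 7.1, 8.4])

x = np.sort(rate)
n = len(x)
empirical = np.arange(1, n + 1) / (n + 1)               # P[i] = i/(N+1)
theoretical = norm.cdf(x, loc=x.mean(), scale=x.std(ddof=1))

# Plotting (theoretical, empirical): points near the 45-degree line
# suggest the fitted normal models the data well.
print(list(zip(theoretical.round(3), empirical.round(3))))
```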
Figure 7. P-P Plots of Unemployment Rates of Indiana and Ohio (Year 2005)
[Two P-P plots of the empirical P[i] = i/(N+1) against the theoretical normal CDF, both axes 0.00 to 1.00. Left: 2005 Indiana Unemployment Rate (N=92 Counties). Right: 2005 Ohio Unemployment Rate (N=88 Counties). Source: Bureau of Labor Statistics]
Similarly, the quantile-quantile plot (Q-Q plot) compares the ordered values of a variable with the
quantiles of a specific theoretical distribution (e.g., the normal distribution). If the two distributions
match, the points on the plot form a linear pattern passing through the origin with a unit
slope. P-P and Q-Q plots are used to see how well a theoretical distribution models the
empirical data. In Figure 8, Indiana appears to have a smaller variation in its unemployment
rate than Ohio. By contrast, Ohio appears to have a wider range of outliers in the upper extreme.
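The Q-Q coordinates can likewise be computed directly. The Python/SciPy sketch below uses made-up data and one common plotting position; note that statistical packages differ slightly in the plotting positions they use.

```python
import numpy as np
from scipy.stats import norm

# Made-up unemployment rates for illustration.
rate = np.array([4.1, 4.7, 5.0, 5.3, 5.5, 5.8, 6.0, 6.4, 7.2, 8.8])

x = np.sort(rate)                                   # ordered sample values
n = len(x)
p = (np.arange(1, n + 1) - 0.5) / n                 # one common plotting position
q_theoretical = norm.ppf(p, loc=x.mean(), scale=x.std(ddof=1))

# If the sample were normal, the points (q_theoretical, x) would lie
# close to the 45-degree line.
print(list(zip(q_theoretical.round(2), x)))
```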
Figure 8. Q-Q Plots of Unemployment Rates of Indiana and Ohio (Year 2005)
[Two Q-Q plots of Unemployment Rate in 2005 (0 to 15) against the inverse normal. Left: 2005 Indiana Unemployment Rate (N=92 Counties). Right: 2005 Ohio Unemployment Rate (N=88 Counties). Grid lines are the 5, 10, 25, 50, 75, 90, and 95 percentiles. Source: Bureau of Labor Statistics]
Detrended normal P-P and Q-Q plots depict the actual deviations of data points from the
straight horizontal line at zero. No specific pattern in a detrended plot indicates normality of the
variable. SPSS can generate detrended P-P and Q-Q plots.
3. Numerical Methods
Graphical methods, although visually appealing, do not provide objective criteria for determining
the normality of variables; interpretation is thus a matter of judgment. Numerical methods use
descriptive statistics and statistical tests to examine normality.
3.1 Descriptive Statistics
Measures of dispersion such as variance reveal how observations of a random variable deviate
from their mean. The second central moment is

$$s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}$$
Skewness is the third standardized moment, which measures the degree of symmetry of a probability
distribution. If skewness is greater than zero, the distribution is skewed to the right, having
more observations on the left.

$$\text{Skewness} = \frac{E[(x-\mu)^3]}{\sigma^3} = \frac{\sum_i (x_i-\bar{x})^3}{(n-1)s^3} \approx \frac{\frac{1}{n}\sum_i (x_i-\bar{x})^3}{\left[\frac{1}{n}\sum_i (x_i-\bar{x})^2\right]^{3/2}}$$
Kurtosis, based on the fourth central moment, measures the thinness of tails or “peakedness” of
a probability distribution.

$$\text{Kurtosis} = \frac{E[(x-\mu)^4]}{\sigma^4} = \frac{\sum_i (x_i-\bar{x})^4}{(n-1)s^4} \approx \frac{n \sum_i (x_i-\bar{x})^4}{\left[\sum_i (x_i-\bar{x})^2\right]^2}$$
Figure 9. Probability Distributions with Different Kurtosis
[Three density plots over the range −5 to 5 (density 0 to .8): Kurtosis < 3, Kurtosis = 3, Kurtosis > 3.]
If the kurtosis of a random variable is less than three (i.e., if kurtosis − 3 is less than zero), the
distribution has a lower, flatter peak and thinner tails than a normal distribution (first plot in
Figure 9).3 By contrast, kurtosis larger than three indicates a higher peak and heavier tails (last
plot). A normally distributed random variable should have skewness near zero and kurtosis near
three (second plot in Figure 9).
state | N mean median max min variance skewness kurtosis
-------+--------------------------------------------------------------------------------
IL | 102 5.421569 5.35 8.8 3.7 .8541837 .6570033 3.946029
IN | 92 5.641304 5.5 8.4 3.1 1.079374 .3416314 2.785585
OH | 88 6.3625 6.1 13.3 3.8 2.126049 1.665322 8.043097
-------+--------------------------------------------------------------------------------
Total | 282 5.786879 5.65 13.3 3.1 1.473955 1.44809 8.383285
----------------------------------------------------------------------------------------
In short, skewness and kurtosis show how the distribution of a variable deviates from a normal
distribution. These statistics are based on the empirical data.
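For readers outside SAS, Stata, and SPSS, the two kurtosis conventions noted above can be illustrated with Python/SciPy; the data below are made up, and the function names are SciPy's, not the paper's.

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Made-up right-skewed data (one large value pulls the tail right).
x = np.array([3.7, 4.1, 4.5, 5.0, 5.2, 5.5, 5.9, 6.3, 7.0, 8.8])

g1 = skew(x)                        # positive for a right-skewed sample
k = kurtosis(x, fisher=False)       # kurtosis as Stata reports it (normal = 3)
excess = kurtosis(x, fisher=True)   # kurtosis - 3, as SAS and SPSS report (normal = 0)

print(g1, k, excess)
```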
3.2 Theory-driven Statistics
The numerical methods of normality testing include the Kolmogorov-Smirnov (K-S) D test
(Lilliefors test), Shapiro-Wilk test, Anderson-Darling test, and Cramer-von Mises test (SAS
Institute 1995).4 The K-S D test and Shapiro-Wilk W test are commonly used. The K-S,
Anderson-Darling, and Cramer-von Mises tests are based on the empirical distribution
function (EDF), which is computed from a set of n independent observations x1, x2, …, xn with a
common distribution function F(x) (SAS 2004).
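As a sketch of how the EDF-based tests look outside these packages, the example below uses SciPy's implementations (not the paper's commands). Note the caveat in the comments: the plain K-S p-value is valid only when the hypothesized distribution is fully specified in advance.

```python
import numpy as np
from scipy.stats import kstest, anderson

rng = np.random.default_rng(0)
x = rng.standard_normal(200)        # simulated data, truly N(0, 1)

# K-S test against a fully specified N(0, 1). If the mean and standard
# deviation were instead estimated from the data, this p-value would be
# invalid and the Lilliefors correction would be needed.
d_stat, p_value = kstest(x, 'norm')

# Anderson-Darling test; SciPy returns a statistic plus critical values
# rather than a p-value.
ad = anderson(x, dist='norm')
print(d_stat, p_value, ad.statistic)
```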
Table 2. Numerical Methods of Testing Normality
Test statistic          N range          Dist.    SAS    Stata       SPSS
Jarque-Bera                              χ²(2)    -      -           -
Skewness-Kurtosis       N ≥ 9            χ²(2)    -      .sktest     -
Shapiro-Wilk W          7 ≤ N ≤ 2,000    -        YES    .swilk      YES
Shapiro-Francia W'      5 ≤ N ≤ 5,000    -        -      .sfrancia   -
Kolmogorov-Smirnov D                     EDF      YES    *           YES
Cramer-von Mises W²                      EDF      YES    -           -
Anderson-Darling A²                      EDF      YES    -           -
* The Stata .ksmirnov command is not used for testing normality.
The Shapiro-Wilk W is the ratio of the best estimator of the variance to the usual corrected sum
of squares estimator of the variance (Shapiro and Wilk 1965).5 The statistic is positive and less
than or equal to one; a value close to one indicates normality.
3 SAS and SPSS produce (kurtosis − 3), while Stata returns the kurtosis itself. SAS uses its weighted kurtosis formula
with the degrees of freedom adjusted. So, if N is small, SAS, Stata, and SPSS may report different kurtosis values.
4 The UNIVARIATE and CAPABILITY procedures have the NORMAL option to produce all four statistics.
5 The W statistic was constructed by considering the regression of ordered sample values on the corresponding
expected normal order statistics, which for a sample from a normally distributed population is linear (Royston
1982). Shapiro and Wilk’s (1965) original W statistic is valid for sample sizes between 3 and 50, but Royston
extended the test by developing a transformation of the null distribution of W to approximate normality throughout
the range between 7 and 2,000.
The W statistic requires that the sample size be greater than or equal to 7 and less than or equal
to 2,000 (Shapiro and Wilk 1965).6

$$W = \frac{\left(\sum_i a_i x_{(i)}\right)^2}{\sum_i (x_i - \bar{x})^2}$$

where a’ = (a1, a2, …, an) = m’V⁻¹[m’V⁻¹V⁻¹m]^(−1/2), m’ = (m1, m2, …, mn) is the vector of expected
values of the standard normal order statistics, V is the n by n covariance matrix of those order
statistics, x’ = (x1, x2, …, xn) is a random sample, and x(1) < x(2) < … < x(n) are the ordered sample values.
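A quick illustration of W in practice, using SciPy's implementation rather than the paper's SAS/Stata/SPSS commands: a normal sample yields a W close to one, while a right-skewed sample yields a noticeably smaller W.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
normal_x = rng.standard_normal(100)     # sample from N(0, 1)
skewed_x = rng.exponential(size=100)    # right-skewed sample

w_norm, p_norm = shapiro(normal_x)      # W near 1: consistent with normality
w_skew, p_skew = shapiro(skewed_x)      # W farther below 1 for skewed data
print(w_norm, w_skew)
```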