© 2002-2008 The Trustees of Indiana University
http://www.indiana.edu/~statmath
Indiana University
University Information Technology Services
Univariate Analysis and Normality Test Using SAS, Stata,
and SPSS*
Hun Myoung Park, Ph.D.
© 2002-2008
Last modified on November 2008
University Information Technology Services
Center for Statistical and Mathematical Computing
Indiana University
410 North Park Avenue Bloomington, IN 47408
(812) 855-4724 (317) 278-4740
http://www.indiana.edu/~statmath
* The citation of this document should read: “Park, Hun Myoung. 2008. Univariate Analysis and Normality Test
Using SAS, Stata, and SPSS. Working Paper. The University Information Technology Services (UITS) Center for
Statistical and Mathematical Computing, Indiana University.”
http://www.indiana.edu/~statmath/stat/all/normality/index.html
This document summarizes graphical and numerical methods for univariate analysis and
normality testing, and illustrates how to carry them out using SAS 9.1, Stata 10 special edition, and SPSS 16.0.
1. Introduction
2. Graphical Methods
3. Numerical Methods
4. Testing Normality Using SAS
5. Testing Normality Using Stata
6. Testing Normality Using SPSS
7. Conclusion
1. Introduction
Descriptive statistics provide important information about the variables to be analyzed. The mean,
median, and mode measure the central tendency of a variable. Measures of dispersion include the
variance, standard deviation, range, and interquartile range (IQR). Researchers may draw a
histogram, stem-and-leaf plot, or box plot to see how a variable is distributed.
Statistical methods are based on various underlying assumptions. One common assumption is
that a random variable is normally distributed. In many statistical analyses, normality is often
conveniently assumed without any empirical evidence or test. But normality is critical in many
statistical methods. When this assumption is violated, interpretation and inference may not be
reliable or valid.
Figure 1. Comparing the Standard Normal and a Bimodal Probability Distribution
[Two density plots over the range −5 to 5 (density 0 to .4): left, the standard normal distribution; right, a bimodal distribution.]
The t-test and ANOVA (Analysis of Variance) compare group means, assuming a variable of
interest follows a normal probability distribution. Otherwise, these methods do not make much
sense. Figure 1 illustrates the standard normal probability distribution and a bimodal
distribution. How can you compare means of these two random variables?
There are two ways of testing normality (Table 1). Graphical methods visualize the
distributions of random variables or differences between an empirical distribution and a
theoretical distribution (e.g., the standard normal distribution). Numerical methods present
summary statistics such as skewness and kurtosis, or conduct statistical tests of normality.
Graphical methods are intuitive and easy to interpret, while numerical methods provide
objective ways of examining normality.
Table 1. Graphical Methods versus Numerical Methods
                 Graphical Methods                        Numerical Methods
--------------------------------------------------------------------------------------------
Descriptive      Stem-and-leaf plot, (skeletal) box       Skewness
                 plot, dot plot, histogram                Kurtosis
Theory-driven    P-P plot                                 Shapiro-Wilk test, Shapiro-Francia test
                 Q-Q plot                                 Kolmogorov-Smirnov test (Lilliefors test)
                                                          Anderson-Darling / Cramer-von Mises tests
                                                          Jarque-Bera test, Skewness-Kurtosis test
Graphical and numerical methods are either descriptive or theory-driven. A dot plot and
histogram, for instance, are descriptive graphical methods, while skewness and kurtosis are
descriptive numerical methods. The P-P and Q-Q plots are theory-driven graphical methods for
testing normality, whereas the Shapiro-Wilk W and Jarque-Bera tests are theory-driven numerical
methods.
Figure 2. Histograms of Normally and Non-normally Distributed Variables
[Left: histogram of a normally distributed variable (N=500), randomly drawn from the standard normal distribution (seed=1,234,567). Right: histogram of per capita gross national income in 2005 ($1,000), a non-normally distributed variable (N=164).]
Three variables are employed here. The first is the 2005 unemployment rate of counties in Illinois,
Indiana, and Ohio. The second consists of 500 observations randomly drawn from the standard
normal distribution; this variable is supposed to be normally distributed with mean 0 and variance
1 (left plot in Figure 2). An example of a non-normal distribution is the 2005 per capita gross
national income (GNI) of 164 countries; GNI is severely skewed to the right and is unlikely to be
normally distributed (right plot in Figure 2). See the Appendix for details.
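As an aside, a comparable normal sample can be simulated in any statistical package. The Python/NumPy sketch below is not from the paper (which uses SAS, Stata, and SPSS); reusing the seed number 1,234,567 here is purely illustrative and will not reproduce the paper's exact draws.

```python
import numpy as np

# Draw 500 observations from the standard normal distribution.
# The seed is illustrative; different generators give different draws.
rng = np.random.default_rng(1234567)
normal_sample = rng.standard_normal(500)

# The sample mean and variance should be close to 0 and 1, respectively.
print(normal_sample.mean(), normal_sample.var(ddof=1))
```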
2. Graphical Methods
Graphical methods visualize the distribution of a random variable and compare the distribution
to a theoretical one using plots. These methods are either descriptive or theory-driven: the
former are based on the empirical data alone, whereas the latter consider both empirical and
theoretical distributions.
2.1 Descriptive Plots
Among frequently used descriptive plots are the stem-and-leaf plot, dot plot, (skeletal) box plot,
and histogram. When N is small, a stem-and-leaf plot and dot plot are useful for summarizing
continuous or event count data. Figures 3 and 4 respectively present a stem-and-leaf plot and a
dot plot of the unemployment rates of the three states.
Figure 3. Stem-and-Leaf Plot of Unemployment Rate of Illinois, Indiana, Ohio
-> state = IL
Stem-and-leaf plot for rate (Rate)
rate rounded to nearest multiple of .1
plot in units of .1
3. | 7889
4* | 011122344
4. | 556666666677778888999
5* | 0011122222333333344444
5. | 5555667777777888999
6* | 000011222333444
6. | 555579
7* | 0033
7. |
8* | 0
8. | 8
-> state = IN
Stem-and-leaf plot for rate (Rate)
rate rounded to nearest multiple of .1
plot in units of .1
3* | 1
3. | 89
4* | 012234
4. | 566666778889999
5* | 00000111222222233344
5. | 555666666777889
6* | 002222233344
6. | 5666677889
7* | 1113344
7. | 67
8* | 14
-> state = OH
Stem-and-leaf plot for rate (Rate)
rate rounded to nearest multiple of .1
plot in units of .1
3* | 8
4* | 014577899
5* | 01223333445556667778888888999
6* | 001111122222233444446678899
7* | 01223335677
8* | 1223338
9* | 99
10* | 1
11* |
12* |
13* | 3
Figure 4. Dot Plot of Unemployment Rate of Illinois, Indiana, Ohio
[Dot plot of the unemployment rate (0 to 15 percent) for Illinois (N=102), Indiana (N=92), Ohio (N=88).
Indiana Business Research Center (http://www.stats.indiana.edu/)
Source: Bureau of Labor Statistics]
A box plot presents the minimum, 25th percentile (1st quartile), 50th percentile (median), 75th
percentile (3rd quartile), and maximum in a box and lines.1 Outliers, if any, appear outside the
(adjacent) minimum and maximum lines. As such, a box plot effectively summarizes these major
percentiles using a box and lines. If a variable is normally distributed, its 25th and 75th
percentiles are symmetric about the median, and its median and mean are located at the same
point, exactly in the center of the box.2
In Figure 5, you should see outliers in Illinois and Ohio that affect the shapes of the
corresponding boxes. By contrast, the Indiana unemployment rate has no outliers, and its
symmetric box implies that the rate is approximately normally distributed.
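The percentiles a box plot summarizes are easy to compute directly. The Python/NumPy sketch below is an illustration only (the paper itself uses SAS, Stata, and SPSS); the data are made-up rates, and the 1.5 × IQR fence is one common outlier rule, not the only one.

```python
import numpy as np

# Toy unemployment rates (made up for illustration).
rate = np.array([3.8, 4.4, 4.9, 5.1, 5.3, 5.6, 5.9, 6.2, 6.8, 13.3])

# The five summaries a box plot draws: quartiles plus min/max.
q1, q2, q3 = np.percentile(rate, [25, 50, 75])
iqr = q3 - q1

# A common convention flags points beyond 1.5 * IQR from the box as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = rate[(rate < lower) | (rate > upper)]
print(q1, q2, q3, outliers)
```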
Figure 5. Box Plots of Unemployment Rates of Illinois, Indiana, and Ohio
[Box plots of Unemployment Rate (%), scale 2 to 14, for Illinois (N=102), Indiana (N=92), Ohio (N=88).
Indiana Business Research Center (http://www.stats.indiana.edu/)
Source: Bureau of Labor Statistics]
A histogram shows the proportion of total observations that falls into each category (interval)
and is appropriate when N is large (Figure 6).
Figure 6. Histograms of Unemployment Rates of Illinois, Indiana and Ohio
[Three histograms of the unemployment rate (0 to 15 percent, density 0 to .5): Illinois (N=102), Indiana (N=92), Ohio (N=88).
Indiana Business Research Center (http://www.stats.indiana.edu/)
Source: Bureau of Labor Statistics]
1 The first quartile cuts off the lowest 25 percent of the data; the second quartile (the median) cuts the data set in
half; and the third quartile cuts off the lowest 75 percent (or highest 25 percent) of the data. See
http://en.wikipedia.org/wiki/Quartile
2 SAS reports the mean as “+” between the (adjacent) minimum and maximum lines.
2.2 Theory-driven Plots
P-P and Q-Q plots are considered here. The probability-probability plot (P-P plot or percent
plot) compares an empirical cumulative distribution function of a variable with a specific
theoretical cumulative distribution function (e.g., the standard normal distribution function). In
Figure 7, Ohio appears to deviate more from the fitted line than Indiana.
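The P-P construction described above can be sketched in a few lines. This Python/SciPy example is not from the paper: the rates are made up, the plotting position P[i] = i/(N+1) matches the figure's axis label, and the normal parameters are estimated from the sample.

```python
import numpy as np
from scipy.stats import norm

# Made-up unemployment rates for illustration.
rate = np.array([3.1, 3.8, 4.2, 4.6, 5.0, 5.1, 5.5, 5.8, 6.2, 7.1, 8.4])

x = np.sort(rate)
n = len(x)
empirical = np.arange(1, n + 1) / (n + 1)               # P[i] = i/(N+1)
theoretical = norm.cdf(x, loc=x.mean(), scale=x.std(ddof=1))

# Plotting (theoretical, empirical): points near the 45-degree line
# suggest the fitted normal models the data well.
print(list(zip(theoretical.round(3), empirical.round(3))))
```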
Figure 7. P-P Plots of Unemployment Rates of Indiana and Ohio (Year 2005)
[Two P-P plots of the empirical P[i] = i/(N+1) against the theoretical normal CDF, both axes 0.00 to 1.00. Left: 2005 Indiana Unemployment Rate (N=92 Counties). Right: 2005 Ohio Unemployment Rate (N=88 Counties). Source: Bureau of Labor Statistics]
Similarly, the quantile-quantile plot (Q-Q plot) compares the ordered values of a variable with the
quantiles of a specific theoretical distribution (e.g., the normal distribution). If the two distributions
match, the points on the plot form a linear pattern passing through the origin with a unit
slope. P-P and Q-Q plots are used to see how well a theoretical distribution models the
empirical data. In Figure 8, Indiana appears to have a smaller variation in its unemployment
rate than Ohio. By contrast, Ohio appears to have a wider range of outliers in the upper extreme.
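The Q-Q coordinates can likewise be computed directly. The Python/SciPy sketch below uses made-up data and one common plotting position; note that statistical packages differ slightly in the plotting positions they use.

```python
import numpy as np
from scipy.stats import norm

# Made-up unemployment rates for illustration.
rate = np.array([4.1, 4.7, 5.0, 5.3, 5.5, 5.8, 6.0, 6.4, 7.2, 8.8])

x = np.sort(rate)                                   # ordered sample values
n = len(x)
p = (np.arange(1, n + 1) - 0.5) / n                 # one common plotting position
q_theoretical = norm.ppf(p, loc=x.mean(), scale=x.std(ddof=1))

# If the sample were normal, the points (q_theoretical, x) would lie
# close to the 45-degree line.
print(list(zip(q_theoretical.round(2), x)))
```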
Figure 8. Q-Q Plots of Unemployment Rates of Indiana and Ohio (Year 2005)
[Two Q-Q plots of Unemployment Rate in 2005 (0 to 15) against the inverse normal. Left: 2005 Indiana Unemployment Rate (N=92 Counties). Right: 2005 Ohio Unemployment Rate (N=88 Counties). Grid lines are the 5, 10, 25, 50, 75, 90, and 95 percentiles. Source: Bureau of Labor Statistics]
Detrended normal P-P and Q-Q plots depict the actual deviations of data points from the
straight horizontal line at zero. No specific pattern in a detrended plot indicates normality of the
variable. SPSS can generate detrended P-P and Q-Q plots.
3. Numerical Methods
Graphical methods, although visually appealing, do not provide objective criteria for determining
the normality of variables; interpretation is thus a matter of judgment. Numerical methods use
descriptive statistics and statistical tests to examine normality.
3.1 Descriptive Statistics
Measures of dispersion such as variance reveal how observations of a random variable deviate
from their mean. The second central moment is

$$s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}$$
Skewness is the third standardized moment, which measures the degree of symmetry of a probability
distribution. If skewness is greater than zero, the distribution is skewed to the right, having
more observations on the left.

$$\text{Skewness} = \frac{E[(x-\mu)^3]}{\sigma^3} = \frac{\sum_i (x_i-\bar{x})^3}{(n-1)s^3} \approx \frac{\frac{1}{n}\sum_i (x_i-\bar{x})^3}{\left[\frac{1}{n}\sum_i (x_i-\bar{x})^2\right]^{3/2}}$$
Kurtosis, based on the fourth central moment, measures the thinness of tails or “peakedness” of
a probability distribution.

$$\text{Kurtosis} = \frac{E[(x-\mu)^4]}{\sigma^4} = \frac{\sum_i (x_i-\bar{x})^4}{(n-1)s^4} \approx \frac{n \sum_i (x_i-\bar{x})^4}{\left[\sum_i (x_i-\bar{x})^2\right]^2}$$
Figure 9. Probability Distributions with Different Kurtosis
[Three density plots over the range −5 to 5 (density 0 to .8): Kurtosis < 3, Kurtosis = 3, Kurtosis > 3.]
If the kurtosis of a random variable is less than three (i.e., if kurtosis − 3 is less than zero), the
distribution has a lower, flatter peak and thinner tails than a normal distribution (first plot in
Figure 9).3 By contrast, kurtosis larger than three indicates a higher peak and heavier tails (last
plot). A normally distributed random variable should have skewness near zero and kurtosis near
three (second plot in Figure 9).
state | N mean median max min variance skewness kurtosis
-------+--------------------------------------------------------------------------------
IL | 102 5.421569 5.35 8.8 3.7 .8541837 .6570033 3.946029
IN | 92 5.641304 5.5 8.4 3.1 1.079374 .3416314 2.785585
OH | 88 6.3625 6.1 13.3 3.8 2.126049 1.665322 8.043097
-------+--------------------------------------------------------------------------------
Total | 282 5.786879 5.65 13.3 3.1 1.473955 1.44809 8.383285
----------------------------------------------------------------------------------------
In short, skewness and kurtosis show how the distribution of a variable deviates from a normal
distribution. These statistics are based on the empirical data.
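For readers outside SAS, Stata, and SPSS, the two kurtosis conventions noted above can be illustrated with Python/SciPy; the data below are made up, and the function names are SciPy's, not the paper's.

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Made-up right-skewed data (one large value pulls the tail right).
x = np.array([3.7, 4.1, 4.5, 5.0, 5.2, 5.5, 5.9, 6.3, 7.0, 8.8])

g1 = skew(x)                        # positive for a right-skewed sample
k = kurtosis(x, fisher=False)       # kurtosis as Stata reports it (normal = 3)
excess = kurtosis(x, fisher=True)   # kurtosis - 3, as SAS and SPSS report (normal = 0)

print(g1, k, excess)
```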
3.2 Theory-driven Statistics
The numerical methods of normality testing include the Kolmogorov-Smirnov (K-S) D test
(Lilliefors test), Shapiro-Wilk test, Anderson-Darling test, and Cramer-von Mises test (SAS
Institute 1995).4 The K-S D test and Shapiro-Wilk W test are commonly used. The K-S,
Anderson-Darling, and Cramer-von Mises tests are based on the empirical distribution
function (EDF), which is computed from a set of n independent observations x1, x2, …, xn with a
common distribution function F(x) (SAS 2004).
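As a sketch of how the EDF-based tests look outside these packages, the example below uses SciPy's implementations (not the paper's commands). Note the caveat in the comments: the plain K-S p-value is valid only when the hypothesized distribution is fully specified in advance.

```python
import numpy as np
from scipy.stats import kstest, anderson

rng = np.random.default_rng(0)
x = rng.standard_normal(200)        # simulated data, truly N(0, 1)

# K-S test against a fully specified N(0, 1). If the mean and standard
# deviation were instead estimated from the data, this p-value would be
# invalid and the Lilliefors correction would be needed.
d_stat, p_value = kstest(x, 'norm')

# Anderson-Darling test; SciPy returns a statistic plus critical values
# rather than a p-value.
ad = anderson(x, dist='norm')
print(d_stat, p_value, ad.statistic)
```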
Table 2. Numerical Methods of Testing Normality
Test statistic          N range          Dist.    SAS    Stata       SPSS
Jarque-Bera                              χ²(2)    -      -           -
Skewness-Kurtosis       N ≥ 9            χ²(2)    -      .sktest     -
Shapiro-Wilk W          7 ≤ N ≤ 2,000    -        YES    .swilk      YES
Shapiro-Francia W'      5 ≤ N ≤ 5,000    -        -      .sfrancia   -
Kolmogorov-Smirnov D                     EDF      YES    *           YES
Cramer-von Mises W²                      EDF      YES    -           -
Anderson-Darling A²                      EDF      YES    -           -
* The Stata .ksmirnov command is not used for testing normality.
The Shapiro-Wilk W is the ratio of the best estimator of the variance to the usual corrected sum
of squares estimator of the variance (Shapiro and Wilk 1965).5 The statistic is positive and less
than or equal to one; a value close to one indicates normality.
3 SAS and SPSS produce (kurtosis − 3), while Stata returns the kurtosis itself. SAS uses its weighted kurtosis formula
with the degrees of freedom adjusted. So, if N is small, SAS, Stata, and SPSS may report different kurtosis values.
4 The UNIVARIATE and CAPABILITY procedures have the NORMAL option to produce all four statistics.
5 The W statistic was constructed by considering the regression of ordered sample values on the corresponding
expected normal order statistics, which for a sample from a normally distributed population is linear (Royston
1982). Shapiro and Wilk’s (1965) original W statistic is valid for sample sizes between 3 and 50, but Royston
extended the test by developing a transformation of the null distribution of W to approximate normality throughout
the range between 7 and 2,000.
The W statistic requires that the sample size be greater than or equal to 7 and less than or equal
to 2,000 (Shapiro and Wilk 1965).6

$$W = \frac{\left(\sum_i a_i x_{(i)}\right)^2}{\sum_i (x_i - \bar{x})^2}$$

where a’ = (a1, a2, …, an) = m’V⁻¹[m’V⁻¹V⁻¹m]^(−1/2), m’ = (m1, m2, …, mn) is the vector of expected
values of the standard normal order statistics, V is the n by n covariance matrix of those order
statistics, x’ = (x1, x2, …, xn) is a random sample, and x(1) < x(2) < … < x(n) are the ordered sample values.
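A quick illustration of W in practice, using SciPy's implementation rather than the paper's SAS/Stata/SPSS commands: a normal sample yields a W close to one, while a right-skewed sample yields a noticeably smaller W.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
normal_x = rng.standard_normal(100)     # sample from N(0, 1)
skewed_x = rng.exponential(size=100)    # right-skewed sample

w_norm, p_norm = shapiro(normal_x)      # W near 1: consistent with normality
w_skew, p_skew = shapiro(skewed_x)      # W farther below 1 for skewed data
print(w_norm, w_skew)
```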