EDFtest
The EDFtest package contains functions for the calculation of goodness-of-fit test statistics and their p-values. The three statistics computed are the empirical distribution function statistics known as the Cramér-von Mises statistic (W^2), the Anderson-Darling statistic (A^2), and the Watson statistic (U^2).
The statistics and their p-values can be used to assess an assumed distribution. In the simplest situation you have an i.i.d. sample from some distribution F and want to test the hypothesis that F is a member of some specific parametric family. The following families are available: N(location=μ, scale=σ²), Gamma(shape=α, scale=β), Logistic(location=μ, scale=s), Laplace(location=μ, scale=b), Weibull(shape=α, scale=β), and Exponential(scale=θ).
This package computes goodness-of-fit test statistics A^2, W^2, and U^2 – the Anderson-Darling, Cramér-von Mises, and Watson statistics. These statistics are used to test the null hypothesis that a sample X_1, \ldots, X_n is drawn from a distribution F which belongs to a specified parametric family of distributions, against the alternative that F is not equal to any member of that parametric family.
The three test statistics were originally defined to test the null hypothesis that a sample, say U_1, \ldots, U_n, is drawn from the uniform distribution on [0,1]. For this problem the empirical cumulative distribution function, F_n, of the U sample is compared to the cumulative distribution function F(u)=u (on [0,1]) of the Uniform[0,1] distribution. All three statistics depend on the random function, called the empirical process, W_n(x) = \sqrt{n}\{F_n(x) - x\} defined for 0 \le x \le 1. The oldest of the three statistics is W^2, given by
W^2 = \int_0^1 W_n^2(x) \, dx.
Anderson and Darling suggested dividing W_n(x) by its standard deviation to give more weight to the tails (near 0 and 1). This gives the statistic
A^2 = \int_0^1 \frac{W_n^2(x)}{x(1-x)} \, dx.
Finally Watson adapted the statistic to testing for uniformity of data on a circle. In this case the circle is scaled to have circumference 1 so that, starting from some point on the circle and going around, the arc length traversed runs from 0 to 1. The resulting statistic is
U^2 = \int_0^1 \left( W_n(u) - \int_0^1 W_n(v) \, dv \right)^2 du.
Watson's key observation is that this definition gives a statistic whose value does not depend on the choice of 'some point on the circle' mentioned above. Note, however, that the statistic can be used even if the data did not come from a circle.
Our package contains generic functions (AD, CvM, and Watson) which compute these test statistics using computing formulas which can be found, for instance, in the articles by Michael Stephens in the volume Goodness-of-Fit Techniques edited by Ralph D'Agostino and Michael Stephens.
Other problems in which we want to check whether or not a sample X_1, \ldots, X_n has some specific continuous distribution F_0 (such as N(0,1)) are handled by realizing that the X_i have distribution F_0 if and only if the 'probability integral transforms' U_i = F_0(X_i) have the standard uniform distribution. So to test the null hypothesis F = F_0 we do the probability integral transform and then apply one of the tests of uniformity.
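For instance, here is a small sketch of a test of the simple null hypothesis that x is a sample from N(0,1), using the illustrative functions above (the package's AD, CvM, and Watson could be applied to u in the same way):

set.seed(1)
x <- rnorm(20)   # data to be tested
u <- pnorm(x)    # probability integral transforms under the hypothesized N(0,1)
cvm.stat(u)      # Cramér-von Mises statistic
ad.stat(u)       # Anderson-Darling statistic
watson.stat(u)   # Watson statistic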
These tests are extended to test the composite hypothesis that F is F_0(\cdot \mid \theta) for some parametric model indexed by \theta. The parameter is estimated by maximum likelihood to get \hat\theta and this estimate is used to produce \hat{U}_i = F(X_i \mid \hat\theta). Our statistics are then calculated from these \hat{U}_i values which should, if the null hypothesis is correct, be approximately uniform.
Having computed the test statistic we want to use, we then need to compute an appropriate p-value. In this package we use the following limit theory to compute asymptotic p-values.
For regular families, assuming the null hypothesis holds, each of the statistics converges in distribution to an object of the form \int_0^1 Y^2(u) \, du where Y is a Gaussian process with mean function 0 and covariance function \rho(u,v) which depends on the particular model being tested and usually also on the true parameter values, so that \rho(u,v) = \rho(u,v,\theta). It follows that the distribution of the integral is the same as that of
S \equiv \sum_{i=1}^\infty \lambda_i Z_i^2
where the Z_i are i.i.d. standard normal (so Z_i^2 is \chi^2_1) and the \lambda_i are the eigenvalues of the covariance \rho. A number \lambda is an eigenvalue if there is a non-zero function f solving the equation
\int_0^1 \rho(u,v,\theta) f(v) \, dv = \lambda f(u).
In the i.i.d. sample setting the covariance function \rho for the Cramér-von Mises statistic is given by
\rho(u,v,\theta) = \min(u,v) - uv - \psi(u,\theta)^T {\cal I}^{-1} \psi(v,\theta).
Here {\cal I} is the p \times p Fisher information matrix for a single observation evaluated at \theta and the function \psi is the p vector with ith component \psi_i(v,\theta) = \frac{\partial}{\partial \theta_i} F(y|\theta) evaluated at the value of y solving F(y|\theta) = v.
For the Anderson-Darling statistic this covariance function must be divided by \sqrt{u(1-u)v(1-v)}. For Watson’s statistic we replace \rho above by \rho_U(u,v) = \rho(u,v) -\int_0^1 \rho(s,v) \, ds -\int_0^1 \rho(u,t)\, dt + \int_0^1\int_0^1 \rho(s,t)\, ds \, dt. For the built-in parametric models specified here we use the specific forms of these functions appropriate to the model; for models which are not built-in we offer another approach which is described later.
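As a concrete illustration (a standard calculation, not quoted from the package documentation), consider the normal family with both parameters estimated by maximum likelihood. Writing z_v = \Phi^{-1}(v), one finds \psi(v,\theta) = (-\phi(z_v)/\sigma, -z_v\phi(z_v)/\sigma)^T and {\cal I}^{-1} = \mathrm{diag}(\sigma^2, \sigma^2/2), so the Cramér-von Mises covariance reduces to
\rho(u,v) = \min(u,v) - uv - \phi(z_u)\phi(z_v)\left(1 + \frac{z_u z_v}{2}\right),
which does not involve \theta; for a location-scale family like this one the limiting null distribution does not depend on the true parameter values.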
Our computational flow for built-in distributions is then the following:
Solve the likelihood equations to compute the maximum likelihood estimate \hat\theta for the data set x.
Compute the probability integral transforms of the vector x using the distribution function F(\cdot \mid \hat\theta) of the model being tested. Then compute the value of the test statistic the user has selected.
Compute approximate solutions of the eigenvalue equation
\int_0^1 \rho(u,v,\hat\theta) f(v) \, dv = \lambda f(u)
by discretization. Specifically, let s_i = \frac{i}{m+1} for i running from 1 to m for some m (in the code we take m=100 by default) and create the matrix R given by
R_{i,j} = \rho(s_i,s_j,\hat\theta).
Notice that the precise form of \rho depends on the statistic being used and on the family of distributions being tested.
Then compute the m eigenvalues, say \hat\lambda_1,\ldots,\hat\lambda_m of R.
Approximate the distribution of S above by the distribution of
\sum_{i=1}^m \hat\lambda_i Z_i^2.
We use the package CompQuadForm to compute p-values from this distribution.
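To make the discretization and p-value computation concrete, here is a rough sketch for the normal family, using the covariance function displayed earlier (the object names, and the explicit scaling of the matrix eigenvalues by the grid spacing, are our illustration rather than the package's internal code):

library(EDFtest)
library(CompQuadForm)

rho.normal <- function(u, v) {
  # limiting covariance for the normal family with estimated parameters
  zu <- qnorm(u); zv <- qnorm(v)
  pmin(u, v) - u * v - dnorm(zu) * dnorm(zv) * (1 + zu * zv / 2)
}

set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)   # data to be tested for normality
wsq <- CvM.normal(x)                # observed Cramér-von Mises statistic

m <- 100
s <- (1:m) / (m + 1)                # grid points s_i = i/(m+1)
R <- outer(s, s, rho.normal)        # R[i, j] = rho(s_i, s_j)
lam <- eigen(R, symmetric = TRUE, only.values = TRUE)$values / m
# dividing by m rescales the matrix eigenvalues so that they approximate
# the eigenvalues of the integral operator
imhof(wsq, lam)$Qq                  # asymptotic p-value P(S > observed statistic)

For A^2 or U^2 the covariance would first be adjusted as described above.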
Users can add their own distributions by providing two functions:
Fdist(x, thetahat, ...), which takes parameter estimates and a data set x and computes the probability integral transform for each element of x. Fdist must return a vector of n probabilities.
Score(x, thetahat, ...), which takes parameter estimates and a data set x and computes, for each entry in x, the component of the score function due to that observation. These must be returned in an n by p matrix with 1 row for each observation and 1 column for each parameter.
The user is also expected to supply the value thetahat of the maximum likelihood estimate.
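As a hypothetical illustration, suppose we wish to test the fit of a Rayleigh model with scale parameter \theta, which is not one of the built-in families (the Rayleigh choice and the object names below are ours, used only to show the required interface):

Fdist <- function(x, thetahat, ...) {
  # probability integral transform: F(x | theta) = 1 - exp(-x^2 / (2 theta^2))
  1 - exp(-x^2 / (2 * thetahat^2))
}
Score <- function(x, thetahat, ...) {
  # per-observation score, returned as an n by p matrix (here p = 1):
  # d/d theta log f(x | theta) = -2 / theta + x^2 / theta^3
  cbind(-2 / thetahat + x^2 / thetahat^3)
}

set.seed(2)
x <- rweibull(40, shape = 2, scale = 2)       # a Rayleigh sample, since
                                              # Rayleigh(s) is Weibull(2, s * sqrt(2))
thetahat <- sqrt(sum(x^2) / (2 * length(x)))  # Rayleigh maximum likelihood estimate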
The work-flow above is then modified as follows:
Compute the probability integral transforms of the vector x using Fdist(x, thetahat, ...). Then compute the value of the test statistic the user has selected.
Estimate \rho(s_i,s_j,\hat\theta) using sandwich estimates of the covariance function and of the Fisher information matrix. To estimate the Fisher information matrix we call Score(x, thetahat, ...) to get an n \times p matrix, say A, and compute the p \times p matrix \hat{FI} given by \hat{FI} = A^T A / n.
Recall that the \hat{U}_k = Fdist(x, thetahat, ...) are the probability integral transforms computed in the first step. Then we compute the m \times p matrix D with ij-th entry
D_{ij} = \sum_{k=1}^n 1(s_i \le \hat{U}_k)A_{kj} /n.
We replace the matrix in Step 3 above by the m \times m matrix
R = R_0 - D (\hat{FI}^{-1}) D^T
where the ij component of the matrix R_0 is \min(s_i,s_j) - s_i s_j. For the Anderson-Darling test we then divide R_{ij} by \sqrt{s_i(1-s_i)s_j(1-s_j)}. For Watson's test we replace R by (I-J)R(I-J) where I is the m \times m identity matrix and J is the m \times m matrix with every entry given by 1/m. Thus we have swept out row and column means from R. For the built-in cases these adjustments were made in computing \rho(u,v,\hat\theta).
Steps 4 and 5 are then as in the previous case.
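Continuing the hypothetical Rayleigh example, here is a rough sketch of this modified work-flow for the Cramér-von Mises statistic (illustrative code rather than the package's internals; cvm.stat is the sketch given earlier, and the eigenvalues are again scaled by the grid spacing):

n   <- length(x)
U   <- Fdist(x, thetahat)            # probability integral transforms
wsq <- cvm.stat(U)                   # observed statistic
A   <- Score(x, thetahat)            # n by p score matrix
FI  <- crossprod(A) / n              # estimated Fisher information, A^T A / n

m <- 100
s <- (1:m) / (m + 1)
D <- matrix(0, m, ncol(A))
for (j in seq_len(ncol(A)))
  D[, j] <- sapply(s, function(si) sum(A[si <= U, j]) / n)

R0 <- outer(s, s, pmin) - outer(s, s)   # min(s_i, s_j) - s_i * s_j
R  <- R0 - D %*% solve(FI) %*% t(D)
# (for A^2 divide R[i, j] by sqrt(s_i (1 - s_i) s_j (1 - s_j));
#  for U^2 sweep out the row and column means of R)
lam <- eigen(R, symmetric = TRUE, only.values = TRUE)$values / m
CompQuadForm::imhof(wsq, lam)$Qq        # asymptotic p-value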
Here is an example showing how to use the EDFtest package to perform goodness-of-fit tests for a given data set. We use Fisher's (or Anderson's) iris data set for the demonstration. iris is a data frame with 150 observations of 5 variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. Suppose that we have a special interest in the width of the sepals in this sample. Here is a histogram of it:
hist(iris$Sepal.Width, main="Width of sepal")
From the histogram we might guess that this sample follows a normal or a gamma distribution. Now let's use the EDFtest package to perform goodness-of-fit tests to assess these guesses.
library(EDFtest)
set.seed(100)
x=iris$Sepal.Width
shape=estimate.gamma(x)[1]
# Anderson-Darling statistic and P-value
(asq=AD.gamma(x))
#> [1] 0.7247644
AD.gamma.pvalue(a=asq,shape=shape)$P
#> [1] 0.057625
# Cramér-von Mises statistic and P-value
(wsq=CvM.gamma(x))
#> [1] 0.1459304
CvM.gamma.pvalue(w=wsq,shape=shape)$P
#> [1] 0.02859593
# You can also use the following generic functions
gof.gamma(x,print=TRUE) # Imhof
#> Cramer-von Mises statistic is 0.1459304 with P-value is 0.02859593
#> Anderson-Darling statistic is 0.7247644 with P-value is 0.057625
#> Watson statistic is 0.14585 with P-value is 0.01936176
gof.gamma.bootstrap(x,M=10000) # bootstrap
#> Cramer-von Mises statistic is 0.1459304 with P-value is 0.0289
#> Anderson-Darling statistic is 0.7247644 with P-value is 0.0576
#> Watson statistic is 0.14585 with P-value is 0.0289
We calculated the Anderson-Darling and Cramér-von Mises statistics and their p-values for this sample by both the Imhof and the bootstrap methods. The AD.gamma.pvalue and CvM.gamma.pvalue functions compute 100 eigenvalues by default and then use the imhof function in the CompQuadForm package to calculate the p-values. Using the Imhof method, the p-value for A^2 is 0.057625 and the p-value for W^2 is 0.02859593. The p-values from 10,000 bootstrap samples are 0.0576 for A^2 and 0.0289 for W^2. The two methods are fairly consistent.
Now we can do similar tests for the normal model.
set.seed("100")
# Anderson-Darling statistic and P-value
asq=AD.normal(x))
(#> [1] 0.907955
AD.normal.pvalue(a=asq)$P
#> [1] 0.02037737
#Cramér-von Mises statistic and P-value
wsq=CvM.normal(x))
(#> [1] 0.1806514
CvM.normal.pvalue(w=wsq)$P
#> [1] 0.009486189
#You can also use following generic functions
gof.normal(x,print=TRUE) #Imhof
#> Cramer-von Mises statistic is 0.1806514 with P-value is 0.009486189
#> Anderson-Darling statistic is 0.907955 with P-value is 0.02037737
#> Watson statistic is 0.1712387 with P-value is 0.008225783
gof.normal.bootstrap(x,M=10000) #bootstrap
#> Cramer-von Mises statistic is 0.1806514 with P-value is 0.0105
#> Anderson-Darling statistic is 0.907955 with P-value is 0.0205
#> Watson statistic is 0.1712387 with P-value is 0.0143
We can see that the p-value for A^2 is 0.02037737 and the p-value for W^2 is 0.009486189. The p-values from 10,000 bootstrap samples are 0.0205 for A^2 and 0.0105 for W^2. Again the two methods are fairly consistent.