LIU Yan, LI Shiming, ZHANG Sanguo
(1 School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China; 2 Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100049, China; 3 Beijing Tongren Eye Center, Beijing Tongren Hospital, Capital Medical University, Beijing 100730, China)
Abstract This work is concerned with tests for one-sample mean vectors in high-dimensional settings. Several high-dimensional tests for mean vectors based on the assumption of an elliptical distribution have been proposed recently. To cover a wider class of distributions, we propose a signed-rank-based test. The proposed test statistic is robust and scalar-invariant, and its asymptotic properties are established. Numerical studies show that the proposed test controls the type-I error well and is more efficient. We also apply the proposed method to an ophthalmic data set.
Keywords high dimensional analysis; signed-rank; one-sample test; scalar-invariance
Suppose that $X_1,\ldots,X_n\in\mathbb{R}^p$ are independent and identically distributed random samples with mean vector $\mu$ and covariance matrix $\Sigma$. Consider the following test
$$H_0:\mu=\mu_0 \quad \text{vs.} \quad H_1:\mu\neq\mu_0. \tag{1}$$
undern
The challenge of testing (1) in the high-dimensional situation has attracted many researchers. Ref.[1] constructed a test statistic that avoids the inverse of the sample covariance matrix, but it can only be applied when $p/n\to c\in(0,1)$, which means that the dimension must grow at the same rate as the sample size. Ref.[2] proposed a new test statistic that requires no direct relationship between $p$ and $n$. In practice, different components may have different scales, so scalar-invariance is an important property for a test statistic. Ref.[3], Ref.[4] and Ref.[5] constructed scalar-invariant test statistics under the assumption that $p=o(n^2)$. Ref.[6] proposed a scalar-invariant test that allows the dimension to be arbitrarily large, but their test is not location-shift invariant. Moreover, under heavy-tailed distributions, which frequently arise in genomics and quantitative finance, the asymptotic properties of the above test statistics are not established, and as a natural consequence these tests tend to have unsatisfactory power. Under the assumption of elliptical distributions, Ref.[7] proposed a novel non-parametric test based on spatial signs, which is more powerful than the test in Ref.[2] for heavy-tailed multivariate distributions and has similar power to the test in Ref.[2] for the multivariate normal distribution. However, their test is not scalar-invariant. Ref.[8] proposed a novel scalar-invariant test based on multivariate signs, which is more powerful than the test in Ref.[5] for heavy-tailed multivariate distributions. Their method requires the assumption that $\log(p)=o(n)$.
We propose a novel test for hypothesis (1) based on the signed-rank method, and our study makes two main contributions. Firstly, the proposed test statistic works for more distributions, because the signed-rank method only requires that the distribution of the samples be symmetric, and the test statistic remains available when $p$ is arbitrarily large. Secondly, we show that, under the null hypothesis, the proposed test statistic is asymptotically normal. Moreover, the simulation study shows that our method is scalar-invariant and robust, and that it is more efficient when the assumption of elliptical distributions does not hold.
Suppose that $X_i$, $i=1,\ldots,n$, are independent and identically distributed random samples of dimension $p$. We denote by $X_{(k)}=(X_{1k},\ldots,X_{nk})$, $k=1,\ldots,p$, the sample of the $k$-th dimension, and let $(r_{1k},\ldots,r_{nk})$ be the ranks of $(|X_{1k}|,\ldots,|X_{nk}|)$. To test hypothesis (1), we propose a test statistic based on signed-rank functions, which are defined as:
$$U_i=\operatorname{diag}\{\operatorname{sign}(X_{i1}),\ldots,\operatorname{sign}(X_{ip})\}\,(r_{i1},\ldots,r_{ip})^{\mathrm T},$$
where $i=1,2,\ldots,n$. Then, we consider the following U-statistic:
(2)
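The signed-rank vectors $U_i$ defined above can be computed one column at a time. The following is a minimal Python sketch, assuming NumPy is available and that there are no ties among the absolute values within a column. The function name `signed_rank_vectors` is illustrative and does not come from the paper. The sketch only builds the $U_i$ and does not implement the statistic $T_n$.

```python
import numpy as np

def signed_rank_vectors(X):
    """Return the n x p matrix whose i-th row is the signed-rank vector U_i.

    Entry (i, k) equals sign(X[i, k]) * r[i, k], where r[:, k] holds the
    ranks (1..n) of |X[:, k]| within column k.
    Assumption: no ties among the absolute values in any column.
    """
    X = np.asarray(X, dtype=float)
    abs_x = np.abs(X)
    # Applying argsort twice yields 0-based ranks per column (valid without ties).
    ranks = abs_x.argsort(axis=0).argsort(axis=0) + 1
    return np.sign(X) * ranks

# Small illustration with n = 3 observations and p = 2 components.
X = np.array([[ 0.5, -2.0],
              [-1.5,  0.3],
              [ 3.0,  1.0]])
U = signed_rank_vectors(X)
```

In the first column the absolute values 0.5, 1.5, 3.0 receive ranks 1, 2, 3, so the signed ranks are 1, -2, 3. When $\mu_0\neq 0$, one would presumably pass $X-\mu_0$ instead of $X$; this is an assumption, since the definition above is written for the uncentred data.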
Set $s_i=(s_{i1},\ldots,s_{ip})^{\mathrm T}$ with covariance matrix $\Sigma_s>0$, where $s_{ij}=\operatorname{sign}(X_{ij})$. To establish the asymptotic properties of the U-statistic under the null hypothesis, we need the following conditions:
Remark 1.1 Condition A1 is a necessary condition for the signed-rank test under the null hypothesis, and it indicates that the random samples have symmetric distributions. Under the first term in condition A1, we have $E(s_{ij})=0$. Under the second term in condition A1, $r_{ij}\neq r_{kj}$ for any $i\neq k$ and each $j$, so that $(r_{1j},\ldots,r_{nj})$ is a permutation of the elements of $\{1,\ldots,n\}$. Condition A2 is similar to that applied in Ref.[2], and it is a quite mild condition on the eigenvalues of $\Sigma_s$.
Under $H_0$, if condition A1 holds, it is easy to show that
$$E(T_n)=0,$$
and
Theorem 1.1 below establishes the asymptotic normality of $T_n$.
Theorem 1.1 Under $H_0$, if conditions A1 and A2 hold, then as $n\to\infty$ and $p\to\infty$,
(3)
We compare the performance of the proposed test (SR) with five alternatives: Ref.[1] (BS), Ref.[2] (CQ), Ref.[5] (SKK), Ref.[7] (WPL), and Ref.[8] (FZW). All the following simulations are replicated 1 000 times, and we set $n$ = 20, 50 and $p$ = 200, 1 000.
Table 1 reports the performance of the six tests in Example 1. The power of SR is similar to that of BS, CQ and WPL when $\Sigma=\Sigma_1$, and is higher than that of BS, CQ and WPL when $\Sigma=\Sigma_2$. This indicates that SR performs better when the scales of the components differ. For example, when $(n,p)=(20,200)$, $\Sigma=\Sigma_2$ and $c=0.1$, the powers of SR, BS, CQ, and WPL are 0.547, 0.407, 0.420, and 0.394, respectively. We also observe that SR has higher power than SKK and FZW when $p\gg n$. The reason is that SKK and FZW rely on the assumption that $p$ cannot be much larger than $n$. For example, when $(n,p)=(20,1\,000)$, $\Sigma=\Sigma_1$ and $c=0.15$, the powers of SR, SKK and FZW are 0.589, 0.413 and 0.347, respectively.
Table 1 The empirical size and power at the significance level of 5% in Example 1
Example 2 In this example, $X_i$ is generated from the $p$-variate $t$-distribution with 3 degrees of freedom. The settings of the mean vector $\mu$ and covariance $\Sigma$ are the same as those in Example 1. We select $c=0.1$ and $0.15$ for $\mu$ to calculate the power.
Table 2 shows the simulation results in Example 2. SR has higher power than the other five tests in all settings. For example, when $(n,p)=(50,200)$, $\Sigma=\Sigma_1$ and $c=0.15$, the power of SR is 0.773, while the powers of the other tests in this setting are 0.419, 0.538, 0.549, 0.577, and 0.610, respectively. Since the $t$-distribution is a common heavy-tailed distribution, the results in this table indicate that SR is robust.
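Sampling from a $p$-variate $t$-distribution can be sketched as a normal vector divided by the square root of an independent scaled chi-square variable. The Python sketch below assumes NumPy. The function name `sample_multivariate_t` is hypothetical. It also assumes that $\Sigma$ is meant as the covariance matrix, so the scale matrix is shrunk by $(\nu-2)/\nu$. The identity covariance and zero mean in the usage lines are placeholders, not the settings of Example 1.

```python
import numpy as np

def sample_multivariate_t(n, mu, sigma, df=3, rng=None):
    """Draw n samples from a p-variate t distribution with covariance sigma.

    Construction: X = mu + Z / sqrt(W / df), with Z ~ N(0, S) and
    W ~ chi-square(df) independent of Z.
    Assumption: sigma is the covariance matrix, so S = sigma * (df - 2) / df,
    which requires df > 2.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    p = mu.shape[0]
    scale = sigma * (df - 2) / df
    z = rng.multivariate_normal(np.zeros(p), scale, size=n)  # shape (n, p)
    w = rng.chisquare(df, size=n) / df                       # shape (n,)
    return mu + z / np.sqrt(w)[:, None]

# Placeholder usage: identity covariance and zero mean (not the paper's setting).
X1 = sample_multivariate_t(20, np.zeros(5), np.eye(5), rng=np.random.default_rng(0))
X2 = sample_multivariate_t(20, np.zeros(5), np.eye(5), rng=np.random.default_rng(0))
```

With 3 degrees of freedom the fourth moment is infinite, so sample covariances converge slowly; this is part of what makes the setting a stress test for moment-based statistics.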
Table 2 The empirical size and power at the significance level of 5% in Example 2
Example 3 In this example, $X_i$ is generated from the $p$-variate Laplace distribution, and we consider the same settings of the mean vector $\mu$ and covariance $\Sigma$ as in Example 1. To calculate the power, we select $c=0.1$ and $c=0.15$ when $n=20$, and $c=0.05$ and $c=0.075$ when $n=50$.
Table 3 shows the performance of the six tests in Example 3. SR is more powerful than the other five tests in all settings. For example, when $(n,p)=(20,1\,000)$, $\Sigma=\Sigma_2$, and $c=0.15$, the powers of BS, CQ, SKK, WPL, FZW, and SR are 0.626, 0.615, 0.695, 0.653, 0.650, and 0.949, respectively. The Laplace distribution is not an elliptical distribution, and Table 3 shows that SR is more effective in this situation.
Table 3 The empirical size and power at the significance level of 5% in Example 3
Example 4 In this example, we generate $X_i$ from a mixed distribution. Firstly, we generate $Z_{ij}$ from the normal distribution for $1\le j\le 2p/5$, from the $t$-distribution with 3 degrees of freedom for $2p/5+1\le j\le 7p/10$, and from the Laplace distribution for $7p/10+1\le j\le p$, where all $Z_{ij}$ have mean 0 and variance 1. Then we let $X_i=\Gamma Z_i+\mu$, where $\Gamma$ is a $p\times p$ matrix with $\Gamma\Gamma^{\mathrm T}=\Sigma$, and $Z_i=(Z_{i1},\ldots,Z_{ip})^{\mathrm T}$. We consider the same settings of the mean vector $\mu$ and covariance $\Sigma$ as in Example 1. To calculate the power, we select $c=0.1$ and $c=0.15$ when $n=20$, and $c=0.05$ and $c=0.075$ when $n=50$.
Table 4 reports the simulation results in Example 4. The power of SR is higher than that of the other five tests in all settings. For example, when $(n,p)=(50,1\,000)$, $\Sigma=\Sigma_2$ and $c=0.075$, the power of SR is 0.757, while the powers of the other tests in this setting are 0.214, 0.271, 0.548, 0.299 and 0.613, respectively. In practice, the variates usually have different distributions. Hence, the results in Table 4 indicate that SR can be expected to perform better in applications.
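The mixed-distribution design of Example 4 can be sketched in Python as follows, assuming NumPy. Three choices here are assumptions rather than details from the paper: the name `sample_mixed`, the Cholesky factor as the matrix $\Gamma$, and the floor rounding of the block boundaries $2p/5$ and $7p/10$. The $t_3$ and Laplace components are rescaled so that every $Z_{ij}$ has mean 0 and variance 1.

```python
import numpy as np

def sample_mixed(n, mu, sigma, rng=None):
    """Draw n samples X_i = Gamma Z_i + mu with mixed component distributions.

    Columns 1..2p/5 are standard normal, columns 2p/5+1..7p/10 are t(3)
    rescaled to unit variance, and the remaining columns are Laplace
    rescaled to unit variance.
    Assumption: Gamma is the Cholesky factor of sigma (any Gamma with
    Gamma Gamma^T = sigma would satisfy the stated requirement).
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    p = mu.shape[0]
    a, b = (2 * p) // 5, (7 * p) // 10  # block boundaries (floor is an assumption)
    z = np.empty((n, p))
    z[:, :a] = rng.standard_normal((n, a))
    # t(3) has variance 3 / (3 - 2) = 3, so divide by sqrt(3).
    z[:, a:b] = rng.standard_t(3, size=(n, b - a)) / np.sqrt(3.0)
    # Laplace with scale s has variance 2 s^2, so s = 1 / sqrt(2).
    z[:, b:] = rng.laplace(0.0, 1.0 / np.sqrt(2.0), size=(n, p - b))
    gamma = np.linalg.cholesky(np.asarray(sigma, dtype=float))
    # Row-wise form of X_i = Gamma Z_i + mu.
    return z @ gamma.T + mu

# Placeholder usage: identity covariance and zero mean (not the paper's setting).
X_mixed = sample_mixed(20000, np.zeros(10), np.eye(10), rng=np.random.default_rng(1))
```

With $p=10$ the blocks are columns 1 to 4 (normal), 5 to 7 ($t_3$) and 8 to 10 (Laplace).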
Table 4 The empirical size and power at the significance level of 5% in Example 4
Moreover, we plot the empirical distributions of SR under the settings of the four examples and compare them with the standard normal distribution. Fig.1 confirms the asymptotic normality of SR given in Theorem 1.1.
Fig.1 Tn under the null hypothesis with four different distributions of X
In this section, we apply the proposed signed-rank-based method to an ophthalmic data set collected by the Beijing Tongren Eye Center and Anyang Eye Hospital. We take the data for the fifth and sixth grades of one class and apply the proposed method to study whether the visual factors and their interactions with eye habits differ between grades.
Fig.2 The distribution of the standard deviations
Firstly, we remove the visual factors and their interactions with eye habits whose proportion of missing values is greater than 15%, and impute the missing values of the remaining 945 factors with the sample mean. Then, we let $X_i$ be the difference between the visual factors and their interactions with eye habits of the $i$-th student in the sixth grade and those in the fifth grade. We calculate the standard deviation of each dimension of $X$ and show the distribution of the standard deviations in Fig.2. The standard deviations clearly differ, so scalar-invariant methods are expected to perform better in the analysis of this data set. Applying the proposed SR method, we obtain a $p$-value $<10^{-9}$, which indicates that the visual factors and their interactions with eye habits differ between grades. With the CQ, WPL and FZW methods, the $p$-values obtained are 0.4910, 0.4913 and $<10^{-9}$, respectively. Because the standard deviations of the dimensions in the sample differ, the CQ and WPL methods are relatively ineffective, while the $p$-values obtained through the FZW and SR methods are small.
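The filtering and imputation step described above can be sketched in Python with NumPy. The layout is an assumption: a students-by-factors array with `NaN` marking missing values. The function name `filter_and_impute` and the toy array are hypothetical and are not the ophthalmic data.

```python
import numpy as np

def filter_and_impute(factors, max_missing=0.15):
    """Drop columns with too many missing values, then mean-impute the rest.

    factors: 2-D array (rows = students, columns = factors), NaN = missing.
    Returns the imputed array and a boolean mask of the columns kept.
    """
    factors = np.asarray(factors, dtype=float)
    missing_frac = np.isnan(factors).mean(axis=0)
    keep = missing_frac <= max_missing          # remove columns above 15% missing
    kept = factors[:, keep]
    col_means = np.nanmean(kept, axis=0)        # column means ignoring NaN
    filled = np.where(np.isnan(kept), col_means, kept)
    return filled, keep

# Toy data: 10 students, 3 factors; column 1 has 10% missing, column 2 has 20%.
toy = np.arange(30, dtype=float).reshape(10, 3)
toy[0, 1] = np.nan
toy[0, 2] = np.nan
toy[1, 2] = np.nan
filled, keep = filter_and_impute(toy)
# Per-dimension standard deviations, as summarized in Fig.2 for the real data.
sds = filled.std(axis=0, ddof=1)
```

Here the third column is dropped and the single missing entry in the second column is replaced by the mean of its nine observed values.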