Asymptotically Optimal Sequential Confidence Interval for the Gini Index Under Complex Household Survey Design with Sub-Stratification
We examine the optimality properties of the Gini index estimator under complex survey design involving stratification, clustering, and sub-stratification. While Darku et al. (Econometrics, 26, 2020) considered only stratification and clustering and d…
Authors: Shivam, Bhargab Chattopadhyay, Nil Kamal Hazra
Shivam^a, Bhargab Chattopadhyay^b, Nil Kamal Hazra^{a,*}

^a Department of Mathematics, Indian Institute of Technology Jodhpur, Karwar, 342037, India
^b School of Management and Entrepreneurship, Indian Institute of Technology Jodhpur, Karwar, 342037, India

Abstract

We examine the optimality properties of the Gini index estimator under complex survey design involving stratification, clustering, and sub-stratification. While Darku et al. (Econometrics, 8, 26, 2020) considered only stratification and clustering and did not provide theoretical guarantees, this study addresses these limitations by proposing two procedures: a purely sequential method and a two-stage method. Under suitable regularity conditions, we establish uniform continuity in probability for the proposed estimator, thereby contributing to the development of random central limit theorems under sequential sampling frameworks. Furthermore, we show that the resulting procedures satisfy both asymptotic first-order efficiency and asymptotic consistency. Simulation results demonstrate that the proposed procedures achieve the desired optimality properties across diverse settings. The practical utility of the methodology is further illustrated through an empirical application using data collected by the National Sample Survey agency of India.

Keywords: Asymptotic Consistency, Asymptotic Efficiency, Complex household survey design, Gini index.

1. Introduction

Economic inequality within a state or country primarily arises from the unequal distribution of income among individuals. To address this disparity, policymakers often design economic strategies informed by the underlying income distribution. Over the years, numerous measures of inequality have been proposed and extensively studied in the literature.
Among these, the Gini index is one of the most widely used statistical measures for quantifying economic inequality in a population.

Suppose $X$ is a random variable that represents an individual's income in a given population. Then $G_X$, the population Gini index, is expressed as

$$G_X = 1 - 2\int_0^1 \phi(p)\,dp. \tag{1.1}$$

Here $\phi(\cdot)$ is the Lorenz function given by

$$\phi(p) = \frac{\alpha(p)}{\mu}, \qquad \alpha(p) = \int_0^p z(u)\,du,$$

where $\mu$ is the population mean, $z(u) = \inf\{x : u \le F(x)\}$, $u \in [0,1]$, is the $u$-th quantile of the population, and $F(\cdot)$ is the cumulative distribution function (c.d.f.) of the random variable $X$.

* Corresponding author. Email address: nilkamal@iitj.ac.in (Nil Kamal Hazra)

For a comprehensive review of the Gini index and related problems, we refer the reader to Baydil et al. (2025), Ibragimov et al. (2025), Álvarez-Verdejo et al. (2021), Mukhopadhyay and Sengupta (2021), and the references therein. For other nonparametric inequality measures and their characteristics, we refer to Jokiel-Rokita and Piatek (2025). Sang et al. (2020) estimated the Gini index by solving non-smooth U-statistic based estimating equations using a weighted jackknife empirical likelihood method. Egozcue and Pawlowsky-Glahn (2019) proposed an inequality index based on the Aitchison norm and compared it with the Gini and Atkinson inequality indices using the mean household gross income per capita data from the autonomous communities of Spain.

To estimate the Gini index, data are typically collected through a sample survey drawn from the target population. Several inference procedures for the Gini index under simple random sampling (SRS) have been proposed, including those by Darku et al. (2023), De and Chattopadhyay (2017), Chattopadhyay and De (2016), and Davidson (2009), among others. However, real-world populations are often heterogeneous and geographically dispersed, making SRS inadequate.
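As a minimal illustration of definition (1.1) in the SRS setting, the Python sketch below computes the Gini index of an equally weighted sample by integrating its empirical Lorenz curve. The function name `gini_from_lorenz` and the toy income vectors are assumptions made here for illustration only; they are not taken from the paper, and the sketch ignores survey weights.

```python
import numpy as np

def gini_from_lorenz(x):
    """Gini index as 1 - 2 * (area under the empirical Lorenz curve)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    # Lorenz ordinates at p = i/n: share of total income held by the i poorest units.
    lorenz = np.concatenate(([0.0], np.cumsum(x) / x.sum()))
    p = np.linspace(0.0, 1.0, n + 1)
    # Trapezoidal rule on the grid p = 0, 1/n, ..., 1; the empirical Lorenz
    # curve is piecewise linear between these points.
    area = np.sum((lorenz[1:] + lorenz[:-1]) / 2.0 * np.diff(p))
    return 1.0 - 2.0 * area

# Hypothetical toy samples: perfect equality and extreme concentration.
g_equal = gini_from_lorenz([5, 5, 5, 5])
g_concentrated = gini_from_lorenz([0, 0, 0, 1])
```

With equal incomes the Lorenz curve coincides with the diagonal and the index is zero; when one unit out of four holds all income the same computation gives 0.75.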
Consequently, many national statistical agencies employ complex household survey designs to capture more accurate representations of population characteristics. For example, the Household Consumption Expenditure Survey¹, conducted by the National Sample Survey (NSS) Organization in India, collects detailed data on household consumption expenditure across a wide range of items, recorded in both quantity and value terms. Similarly, the Land and Livestock Holding Survey² provides information on operational land holdings and livestock ownership in rural areas. These surveys serve distinct purposes and are essential for capturing key economic and social indicators. The survey designs incorporated in these studies typically involve stratification, clustering, and sub-stratification to better capture population heterogeneity. Consequently, such surveys are commonly referred to as complex household surveys. A widely used complex survey design, as employed by Bhattacharya (2005), stratifies the population and then forms clusters of households within each stratum. Unlike the SRS scheme, these designs assign unequal selection probabilities to individuals and yield more accurate estimates. For details on the survey designs used in complex household surveys by different survey agencies, along with the corresponding limit theorem, we refer the reader to Bhattacharya (2007). For such a complex household survey design, Darku et al. (2020) addressed the construction of a bounded-width confidence interval for the Gini index. To account for the precision and accuracy of such a confidence interval, Darku et al. (2020) used sequential analysis approaches that adaptively determine the optimal number of clusters, thereby attaining a bounded-width confidence interval for the Gini index.
Using simulations, they demonstrated that, for sufficiently small widths, the final number of clusters closely matches the optimal number on average, and proved that the purely sequential procedure guarantees that the pre-specified interval width is not exceeded. For a detailed review of sequential analysis and its applications, we refer to Tayanagi and Kurozumi (2024), Mukhopadhyay and De Silva (2008), Ham and Qiu (2023), Navarro-Esteban and Cuesta-Albertos (2021), and the references therein.

¹ http://microdata.gov.in/NADA/index.php/catalog/CEXP/?page=1&sort_order=desc&ps=15&repo=CEXP$
² https://microdata.gov.in/NADA/index.php/catalog/LLS/?page=1&sort_order=desc&ps=15&repo=LLS

1.1. Contribution of this article

While Darku et al. (2020) proposed two-stage and purely sequential procedures for finding bounded-width confidence intervals for the Gini index under complex survey design, their framework does not account for sub-stratification of clusters within each stratum, a feature common in practical household surveys. Moreover, they did not establish key asymptotic properties, such as asymptotic first-order efficiency and asymptotic consistency, for their procedures. These properties are critical to ensure that the final cluster size obtained via sequential sampling is nearly optimal on average while also achieving the desired confidence interval with controlled precision and desired accuracy.

Motivated by these gaps, this paper makes several key contributions. We propose purely sequential and two-stage procedures for finding bounded-width confidence intervals for the Gini index under a complex household survey design that incorporates sub-stratification of clusters within each stratum, based on the economic conditions of households.
Under suitable regularity conditions, we establish uniform continuity in probability for the proposed estimator under this design, thereby contributing to the development of random central limit theorems. Furthermore, we show that the resulting procedures satisfy both asymptotic first-order efficiency and asymptotic consistency. Theoretical findings are validated through extensive simulation studies and an empirical application based on data collected by India's official survey agency, the National Sample Survey (NSS). Additionally, we examine the impact of sub-stratification on the estimation of the Gini index and its associated standard error.

The rest of this paper is organized as follows. Section 2 describes the complex household survey design used in this paper, derives the Gini index estimator under that design, and introduces the notation and the assumptions used in subsequent sections. In Section 3, we propose both purely sequential and two-stage procedures for obtaining a bounded-width confidence interval for the Gini index. In Section 4, we discuss the efficiency and consistency properties of the procedures of Section 3. In Section 5, we conduct several simulation studies to validate the theoretical findings of this paper. Subsequently, a real data analysis is carried out in Section 6. In Section 7, we investigate how sub-stratification influences the final number of clusters and the accuracy of Gini index estimation. Finally, concluding remarks are given in Section 8. For improved readability, all proofs of the theorems, along with all supporting lemmas, are deferred to the Appendix.

2. Complex survey design and Estimator of Gini index

Assume that there are $S$ distinct strata in the population, indexed by $s = 1, 2, \ldots, S$. Each stratum $s$ is further divided into $H_s$ clusters, indexed by $c_s = 1, 2, \ldots, H_s$.
Thus, the total number of clusters across all strata is $H = \sum_{s=1}^{S} H_s$. Additionally, each cluster within a stratum is subdivided into two sub-strata based on household affluence: affluent ($b_{c_s} = 1$) and non-affluent ($b_{c_s} = 2$). Further, suppose that within the $c_s$-th cluster of the $s$-th stratum, the $b_{c_s}$-th sub-stratum contains $M_{s c_s b_{c_s}}$ households. Therefore, each cluster contains a total of $M_{s c_s} = \sum_{b_{c_s}=1}^{2} M_{s c_s b_{c_s}}$ households.

For estimation, suppose that from the $s$-th stratum a sample of $n_s$ clusters is drawn using probability proportional to size sampling with replacement, where size refers to the number of households. Within each selected cluster, $k$ households are selected at random from each sub-stratum using simple random sampling. The number of sampled clusters is

$$n = \sum_{s=1}^{S} n_s, \quad \text{with} \quad \frac{n_s}{n} = \frac{H_s}{H} = a_s.$$

Let $x_{s c_s b_{c_s} h}$ denote the observed value for the $h$-th household in sub-stratum $b_{c_s}$ of cluster $c_s$ in stratum $s$ (e.g., household income or expenditure). Each sampled household is assigned a weight $W_{s c_s b_{c_s} h}$, which is the inverse of its probability of inclusion in the sample. This inclusion probability is the product of the probability of including cluster $c_s$ from stratum $s$, denoted $p_{c_s}$, and the probability of selecting household $h$ from sub-stratum $b_{c_s}$ of the selected cluster, denoted $p_{b_{c_s} h}$. Since we sample $n_s$ clusters with probability proportional to size with replacement and $k$ households by simple random sampling, the values of $p_{c_s}$ and $p_{b_{c_s} h}$ are

$$p_{c_s} = \frac{n_s M_{s c_s}}{\sum_{c_s=1}^{H_s} M_{s c_s}}, \qquad p_{b_{c_s} h} = \frac{k}{M_{s c_s b_{c_s}}}.$$

Therefore, the sampling weight is given by

$$W_{s c_s b_{c_s} h} = \frac{1}{p_{c_s} \cdot p_{b_{c_s} h}} = \frac{\left(\sum_{c_s=1}^{H_s} M_{s c_s}\right) M_{s c_s b_{c_s}}}{n_s M_{s c_s} k}.$$

The total of the per-household weights is

$$W = \sum_{s=1}^{S} \sum_{c_s=1}^{n_s} \sum_{b_{c_s}=1}^{2} \sum_{h=1}^{k} W_{s c_s b_{c_s} h}.$$

Define the normalized weight as $w_{s c_s b_{c_s} h} = W^{-1} W_{s c_s b_{c_s} h}$.
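The weight formula above can be checked on a small numerical case. The following Python sketch is only an illustration under assumed inputs: the function name `sampling_weight` and all household counts are hypothetical, and the normalization is taken over a single sampled cluster for brevity, whereas the paper normalizes over all sampled households in all strata.

```python
def sampling_weight(stratum_total, cluster_total, substratum_total, n_s, k):
    """Closed form W = (sum_c M_sc) * M_scb / (n_s * M_sc * k)."""
    return stratum_total * substratum_total / (n_s * cluster_total * k)

# Hypothetical stratum with 1000 households in total, n_s = 2 sampled clusters
# and k = 4 households drawn per sub-stratum. The sampled cluster considered
# here has 100 households: 20 affluent and 80 non-affluent.
w_affluent = sampling_weight(1000, 100, 20, n_s=2, k=4)
w_non_affluent = sampling_weight(1000, 100, 80, n_s=2, k=4)

# Raw weights of the 2 * k sampled households of this cluster, then normalized
# so that they sum to one (illustration restricted to this single cluster).
raw = [w_affluent] * 4 + [w_non_affluent] * 4
total = sum(raw)
normalized = [w / total for w in raw]
```

In this toy case each sampled affluent household carries raw weight 25 and each sampled non-affluent household carries raw weight 100, reflecting the smaller sampling fraction in the larger sub-stratum.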
Using the normalized weights $w_{s c_s b_{c_s} h}$, we now define the unbiased estimators for key population quantities under the complex design. The unbiased estimators of $\mu$ and of the cumulative distribution function $F(x)$ are, respectively,

$$\hat{\mu}_n = \sum_{s=1}^{S} \sum_{c_s=1}^{n_s} \sum_{b_{c_s}=1}^{2} \sum_{h=1}^{k} w_{s c_s b_{c_s} h}\, x_{s c_s b_{c_s} h},$$

and

$$\hat{F}_n(x) = \sum_{a=1}^{S} \sum_{d=1}^{n_a} \sum_{b_d=1}^{2} \sum_{c=1}^{k} w_{a d b_d c}\, \mathbf{1}(x_{a d b_d c} \le x).$$

Based on our design, an estimator of the Gini index $G_X$ is

$$\hat{G}_n = 1 - \frac{2}{\hat{\mu}_n} \sum_{s=1}^{S} \sum_{c_s=1}^{n_s} \sum_{b_{c_s}=1}^{2} \sum_{h=1}^{k} w_{s c_s b_{c_s} h}\, x_{s c_s b_{c_s} h} \left(1 - \hat{F}_n(x_{s c_s b_{c_s} h})\right). \tag{2.1}$$

Alternatively, using the Lorenz curve formulation, the Gini index may be expressed as

$$\hat{G}_n = 1 - 2\int_0^1 \hat{\phi}_n(p)\,dp,$$

where, for $p \in [0,1]$, $\hat{\phi}_n(\cdot)$ is the estimator of the Lorenz function given by

$$\hat{\phi}_n(p) = \frac{\hat{\alpha}_n(p)}{\hat{\mu}_n}, \qquad \hat{\alpha}_n(p) = \int_0^p \hat{z}_n(u)\,du,$$

and, for $p \in [0,1]$, $\hat{z}_n(p) = \inf_{s, c_s, b_{c_s}, h} \{x_{s c_s b_{c_s} h} : \hat{F}_n(x_{s c_s b_{c_s} h}) \ge p\}$ is the $p$-th sample quantile.

We now state the asymptotic distribution of the Gini estimator under the complex survey design. This is essential for constructing a bounded-width confidence interval with prescribed accuracy. Before proceeding with finding a bounded-width confidence interval, we present the notations and assumptions that will be used throughout the paper.

Notations:

N1. For each $p \in [0,1]$, $\theta = (\mu, \alpha(p), z(p))$ is the population parameter, $\theta_0 = (\mu_0, \alpha_0(p), z_0(p))$ is the true value of the population parameter, and $\Theta$ is the parameter space.

N2. $\tilde{m}(x, \theta) = \left(\tilde{m}_1(x, \theta),\ \tilde{m}_2(x, \theta),\ \tilde{m}_3(x, \theta)\right)^{\top}$, where $\tilde{m}_1(x, \theta) = \mu - x$, $\tilde{m}_2(x, \theta) = \alpha(p) - x\,\mathbf{1}(x \le z(p))$, and $\tilde{m}_3(x, \theta) = p - \mathbf{1}(x \le z(p))$.

N3. $\tilde{m}_i(\theta) = \sum_{s=1}^{S} \frac{1}{a_s}\, \mathbf{1}(s_i = s)\, \frac{\sum_{c_{s_i}=1}^{H_{s_i}} M_{s_i c_{s_i}}}{M_{s_i i}\, k} \sum_{b_i=1}^{2} M_{s_i i b_i} \sum_{h=1}^{k} \tilde{m}(x_{s_i i b_i h}, \theta)$.

Assumptions:

AN1.
$\tilde{m}_j(\cdot, \theta)$ is continuous at each $\theta$ with probability 1, for each $j = 1, 2, 3$.

AN2. There exists $d(\cdot)$ with $E(d(\cdot)) < \infty$ such that $|\tilde{m}_j(t, \theta)| \le d(t)$ for each $j = 1, 2, 3$ and for all $t$.

AN3. The parameter space $\Theta$ is compact.

AN4. $E(\tilde{m}(x, \theta))$ is continuously differentiable at $\theta_0$ and $\frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \theta} E(\tilde{m}_i(\theta)) \xrightarrow{a.s.} K$, where $K$ is a non-singular matrix.

AN5. The sequence $v_n(\theta) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \{\tilde{m}_i(\theta) - E(\tilde{m}_i(\theta))\}$ is stochastically equicontinuous.

AN6. $\sup_{\theta \in \Theta} E|\tilde{m}_i(\theta)|^3 < \infty$.

AN7. $\lim_{n \to \infty} \sum_{i=1}^{n} \frac{\mathrm{Var}(\tilde{m}_i(\theta))}{i^2} < \infty$ and $\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \mathrm{Var}(\tilde{m}_i(\theta)) = W_0$.

AN8. Let $s, s' = 1, 2, \ldots, S$; $c_s = 1, 2, \ldots, n_s$; $c'_{s'} = 1, 2, \ldots, n_{s'}$; $b_{c_s}, b_{c'_{s'}} = 1, 2$; and $h, h' = 1, 2, \ldots, k$. If $s \ne s'$ or $c_s \ne c'_{s'}$, then $\{x_{s c_s b_{c_s} h}\}$ and $\{x_{s' c'_{s'} b_{c'_{s'}} h'}\}$ are independent; otherwise they may be dependent.

AN9. For $s \ne s'$, $\{x_{s c_s b_{c_s} h}\}$ and $\{x_{s' c'_{s'} b_{c'_{s'}} h'}\}$ are not necessarily identically distributed.

AN10. The variables at the cluster level across different clusters within each stratum are identically distributed.

AN11. The variance of strata-level variables is finite for each stratum.

AN12. The cluster totals have absolute $(2 + \delta)$-th raw moments ($\delta > 0$) which are uniformly bounded by $B_\delta > 0$.

AN13. For each $p \in [0,1]$, $F(x)$ and $\frac{dF(x)}{dx}$ are continuous for all $x$ in a neighbourhood $J$ of the $p$-th population quantile $z(p)$.

AN14. For some $0 < C < \infty$, $\mathrm{Var}\{\hat{F}_n(x + \delta) - \hat{F}_n(x)\} \le C n^{-1} |\delta|$ for all $n$ and for $x$ and $x + \delta$ in $J$.

AN15. $E[\,|X| \mid s\,] < \infty$.

In what follows, we give statistical interpretations of the above-mentioned assumptions. Assumption AN1 implies that small perturbations in the parameter $\theta$ lead to correspondingly small changes in $\tilde{m}_j(\cdot, \theta)$.
Assumption AN4 ensures that $E(\tilde{m}(x, \theta)) = 0$ holds only when $\theta = \theta_0$, meaning the moment conditions are satisfied uniquely at the true parameter value. Assumptions AN1-AN4 together guarantee the consistency of the estimator $\hat{\theta}$, ensuring that it converges in probability to the true value $\theta_0$. Assumption AN7 is required to maintain stability in the long-run variability of the estimating functions. Consider a population partitioned into two strata, namely Rural and Urban. The Rural stratum is further divided into clusters representing villages, while the Urban stratum consists of clusters corresponding to city blocks. Since rural and urban incomes are shaped by distinct economic environments, employment structures, and cost-of-living conditions, it is reasonable to treat them as independent. Likewise, incomes from different villages or city blocks are assumed to be independent due to differing local economic conditions, whereas individuals within the same village or block experience similar environments, creating intra-cluster dependence. This motivates the assumption of dependence within clusters or strata but independence across them, as stated in assumption AN8. As household income distributions vary across strata due to demographic, geographic, and socioeconomic differences, it is reasonable to assume that incomes from different strata are not identically distributed, corresponding to assumption AN9. In stratified sampling, strata are typically formed to be internally homogeneous with respect to key characteristics such as income, region, or occupation. Clusters within the same stratum (for example, villages or census blocks) represent comparable population segments and are therefore expected to have identically distributed cluster-level variables, as reflected in assumption AN10.
Taken together, assumptions AN1-AN10 establish that, for sufficiently large cluster sizes, the estimator $\hat{\theta}$ is asymptotically normally distributed. Asymptotic normality is very useful in constructing a confidence interval for the true parameter value. Assumption AN13 ensures that the quantile function is well-defined and exhibits smooth behavior, while assumption AN14 guarantees control over random fluctuations in the empirical distribution function within small intervals. In brief, these assumptions are essential because they ensure the validity of applying the law of large numbers, the central limit theorem, and Taylor series expansion in the analysis, enabling consistent estimation and reliable inference.

Under the assumptions AN1-AN10 and AN15 stated above, following Bhattacharya (2007), as $n_s \to \infty$ at the same rate for all strata, we have

$$\sqrt{n}\,(\hat{G}_n - G_X) \xrightarrow{d} N(0, \xi^2),$$

where $\xi^2$ denotes the asymptotic variance of $\sqrt{n}\,\hat{G}_n$. Using the asymptotic distribution, we construct a $100(1-\alpha)\%$ confidence interval for the population Gini index $G_X$, whose accuracy is a pre-assigned number $\alpha \in (0,1)$, given by

$$P\left(\hat{G}_n - z_{\alpha/2} \frac{\xi}{\sqrt{n}} < G_X < \hat{G}_n + z_{\alpha/2} \frac{\xi}{\sqrt{n}}\right) \ge 1 - \alpha. \tag{2.2}$$

To construct the bounded-width confidence interval, we fix the width of the confidence interval to be at most $\omega$. Similar to Darku et al. (2020), we have

$$L = 2 z_{\alpha/2} \frac{\xi}{\sqrt{n}} \le \omega \implies \frac{\omega \sqrt{n}}{2 \xi} \ge z_{\alpha/2} \implies n \ge \frac{4 z_{\alpha/2}^2 \xi^2}{\omega^2} \equiv C. \tag{2.3}$$

We note that $z_{\alpha/2}$ is the $100(1 - \alpha/2)$-th percentile of $N(0,1)$ and $C = \lceil 4 z_{\alpha/2}^2 \xi^2 \omega^{-2} \rceil$ is the optimal total number of clusters to be sampled from all strata. Thus, to ensure that the confidence interval width does not exceed $\omega$, we need to collect data from at least $C$ clusters. Also, the optimal allocation of clusters to the $s$-th stratum is $C_s = C a_s$, where $a_s$ is the allocation proportion.
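The bound (2.3) is easy to evaluate once a value of $\xi^2$ is supplied. The Python sketch below is a hedged illustration only: the function names `optimal_clusters` and `interval_width` and the value `xi2 = 0.0131` are assumptions chosen for this example and are not reported in the paper; only $\alpha = 0.05$ and $\omega = 0.015$ mirror a setting used in the simulations.

```python
import math
from statistics import NormalDist

def optimal_clusters(xi2, omega, alpha):
    """C = ceil(4 * z_{alpha/2}^2 * xi^2 / omega^2), following (2.3)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil(4 * z**2 * xi2 / omega**2)

def interval_width(xi2, n, alpha):
    """Width 2 * z_{alpha/2} * xi / sqrt(n) of the interval in (2.2)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * z * math.sqrt(xi2 / n)

# Hypothetical asymptotic variance; alpha and omega as in one simulation setting.
xi2, omega, alpha = 0.0131, 0.015, 0.05
C = optimal_clusters(xi2, omega, alpha)

# By construction, C clusters meet the width bound while C - 1 clusters do not.
width_at_C = interval_width(xi2, C, alpha)
width_below_C = interval_width(xi2, C - 1, alpha)
```

Because $\xi^2$ is unknown in practice, this calculation is only available as a benchmark, which is what motivates the sequential procedures.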
If $C$ is known, the $100(1-\alpha)\%$ confidence interval for $G_X$, with width not exceeding the pre-assigned bound $\omega$, is

$$\left(\hat{G}_C - z_{\alpha/2} \frac{\xi}{\sqrt{C}},\ \hat{G}_C + z_{\alpha/2} \frac{\xi}{\sqrt{C}}\right).$$

Without knowing the distribution of income, the value of $C$ is unknown; in particular, the value of $\xi^2$ is unknown. Since $C$ is unknown, one must survey the clusters in at least two stages to construct a bounded-width confidence interval achieving the desired coverage probability at least approximately. For that we require a consistent estimator of $\xi^2$. Based on Binder and Kovacevic (1995), a consistent estimator of $\xi^2$ is

$$V_n^2 = n \sum_{s=1}^{S} \frac{n_s}{n_s - 1} \sum_{c_s=1}^{n_s} (u_{s c_s} - \bar{u}_s)^2, \tag{2.4}$$

where

$$\bar{u}_s = \frac{1}{n_s} \sum_{c_s=1}^{n_s} u_{s c_s},$$

and

$$u_{s c_s} = \frac{2}{\hat{\mu}_n} \sum_{b_{c_s}=1}^{2} \sum_{h=1}^{k} w_{s c_s b_{c_s} h} \left[ \left( \hat{F}_n(x_{s c_s b_{c_s} h}) - \frac{\hat{G}_n + 1}{2} \right) x_{s c_s b_{c_s} h} + \sum_{a=1}^{S} \sum_{d=1}^{n_a} \sum_{b_d=1}^{2} \sum_{c=1}^{k} w_{a d b_d c}\, x_{a d b_d c}\, \mathbf{1}(x_{a d b_d c} \ge x_{s c_s b_{c_s} h}) - \frac{\hat{\mu}_n}{2} (\hat{G}_n + 1) \right]. \tag{2.5}$$

For a comprehensive review of variance estimators of the Gini index, we refer to Langel and Tillé (2013).

We begin by constructing confidence intervals using an arbitrarily fixed cluster size $n$. The confidence interval is given by

$$\left(\hat{G}_n - z_{\alpha/2} \frac{V_n}{\sqrt{n}},\ \hat{G}_n + z_{\alpha/2} \frac{V_n}{\sqrt{n}}\right).$$

To examine the adequacy of such a fixed cluster size choice, a simulation study is carried out using three synthetic populations generated from Gamma(2.649, 0.84), Pareto(20,000, 5), and lognormal(2.185, 0.562) distributions. For each selected cluster size $n$, the width of the constructed confidence interval is computed and averaged over 5000 repetitions. The results are summarized in Table 1. The first column specifies the income distribution, while the second and third columns correspond to the fixed cluster sizes $n_1$ and $n_2$. The fifth and sixth columns display the average widths of the respective confidence intervals, denoted by $w_{n_1}$ and $w_{n_2}$.
The eighth and ninth columns show the proportions of intervals whose widths exceed 0.015 for $n_1$ and $n_2$, respectively. The last column shows the proportion of intervals whose widths exceed 0.01 for $n_2$.

From Table 1, it is apparent that the cluster size $n_1$ produces intervals wider than the desired width of 0.015, indicating insufficient precision, whereas the cluster size $n_2$ results in intervals that are much narrower. Moreover, for a stricter target width of 0.01, even $n_2$ fails to meet the requirement. Now, consider selecting a cluster size $n_3$ that lies between $n_1$ and $n_2$. The results indicate that the fixed cluster size approach may not efficiently achieve the desired level of precision. Although the cluster size $n_2$ attains the target width of 0.015, a moderately smaller cluster size $n_3$ can also reach the required precision with high probability, thereby reducing the overall sampling cost. This outcome highlights a fundamental limitation of the fixed cluster size method. Since the optimal cluster size is not known in advance, it often results in either under-sampling or over-sampling. Hence, there is a strong reason for adopting an adaptive strategy, such as a sequential procedure, that can determine the cluster size dynamically to meet the desired precision more effectively.

Table 1: Simulation results under fixed cluster sizes $n_1$, $n_2$ and $n_3$.

| Distribution | $n_1$ | $n_2$ | $n_3$ | $w_{n_1}$ | $w_{n_2}$ | $w_{n_3}$ | $\hat{P}(w_{n_1} > 0.015)$ | $\hat{P}(w_{n_2} > 0.015)$ | $\hat{P}(w_{n_3} > 0.015)$ | $\hat{P}(w_{n_2} > 0.01)$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Gamma(2.649, 0.84) | 750 | 1500 | 1000 | 0.0163 | 0.0116 | 0.0142 | 0.9902 | 0.0 | 0.0354 | 1.0 |
| Pareto(20000, 5) | 350 | 700 | 500 | 0.0172 | 0.0122 | 0.0144 | 0.8008 | 0.0448 | 0.3206 | 0.9792 |
| lognormal(2.185, 0.562) | 900 | 1700 | 1200 | 0.0159 | 0.0116 | 0.0138 | 0.852 | 0.0 | 0.0586 | 1.0 |

3. Sequential Procedures

In this section, we propose both two-stage and purely sequential sampling procedures to estimate the optimal sample size $C$, ensuring that the resulting bounded-width confidence interval achieves the desired confidence level $(1 - \alpha)$ asymptotically.

3.1. Purely Sequential Procedure

In the purely sequential procedure, we begin by selecting a sample of $m_s$ clusters from each stratum $s$. Thus, the total number of clusters sampled in the first stage, often called the pilot stage, is $m = \sum_{s=1}^{S} m_s$; that is, $m$ is the total pilot cluster size and $m_s$ is the pilot cluster size from the $s$-th stratum. As per the design, each cluster is subdivided into two sub-strata, and from each sub-stratum we randomly draw $k$ households. Using the data obtained in the pilot stage, we compute the estimator $V_m^2$ of the asymptotic variance $\xi^2$ and proceed to evaluate the following stopping rule: $N = N(\omega)\ (\le H)$ is the smallest integer $n\ (\ge m)$ such that, for $\delta > 0$,

$$n \ge \frac{4 z_{\alpha/2}^2}{\omega^2} \left( V_n^2 + \frac{1}{n^{\delta}} \right) = \hat{C} \quad \text{and} \quad n_s \ge \hat{C}_s = \hat{C} a_s \ \text{ for all } s. \tag{3.1}$$

The term $1/n^{\delta}$ is included because, in its absence, the stopping rule (3.1) could be satisfied even for very small sample sizes owing to the possibility of a small value of $V_n^2$ at the early stages; moreover, $V_n^2 + 1/n^{\delta}$ is still a consistent estimator of $\xi^2$. If the above condition is satisfied, then we stop the procedure and our pilot size becomes the final sample size. If not, then we choose $m'\ (\ge 1)$ additional clusters from each stratum having $n_s < \hat{C}_s$, and from each selected additional cluster we randomly select $k$ households. Using this dataset, we estimate $\xi^2$ and check our stopping rule. We repeat this process until the stopping rule is met. The final cluster size from each stratum $s$ is $N_s = N a_s$. Based on the final cluster size $N$, the $100(1-\alpha)\%$ bounded-width confidence interval for the Gini index $G_X$ is given by

$$\left(\hat{G}_N - z_{\alpha/2} \frac{V_N}{\sqrt{N}},\ \hat{G}_N + z_{\alpha/2} \frac{V_N}{\sqrt{N}}\right). \tag{3.2}$$

The pilot cluster size is computed as follows. Darku et al. (2020) computed the pilot cluster size for $\delta = 1$. Through the same procedure, for any $\delta > 0$, we have

$$n \ge \frac{4 z_{\alpha/2}^2}{\omega^2} \left( V_n^2 + \frac{1}{n^{\delta}} \right) \ge \frac{4 z_{\alpha/2}^2}{\omega^2} \cdot \frac{1}{n^{\delta}} \implies n^{\delta + 1} \ge \frac{4 z_{\alpha/2}^2}{\omega^2} \implies n \ge \left( \frac{2 z_{\alpha/2}}{\omega} \right)^{2/(\delta + 1)}.$$

Therefore, the total number of clusters to be selected during the pilot stage is

$$m = \min\left\{ H,\ \max\left\{ 2,\ \left\lceil \left( \frac{2 z_{\alpha/2}}{\omega} \right)^{2/(\delta + 1)} \right\rceil \right\} \right\},$$

whereas the number of clusters to be selected from the $s$-th stratum during the pilot stage is

$$m_s = \min\left\{ H_s,\ \max\left\{ 2,\ \left\lceil \left( \frac{2 z_{\alpha/2}}{\omega} \right)^{2/(\delta + 1)} \times a_s \right\rceil \right\} \right\}. \tag{3.3}$$

3.2. Two-Stage Procedure

In the two-stage procedure, at the first stage we take a pilot cluster sample of size $t_s$ from the $s$-th stratum, where $t_s$ is the same as $m_s$ given by equation (3.3). Based on a pilot cluster size of $t = \sum_{s=1}^{S} t_s$ collected at the pilot stage, the total final cluster size to be sampled from all strata is

$$Q = \min\left\{ H,\ \max\left\{ t,\ \left\lceil \frac{4 z_{\alpha/2}^2}{\omega^2} V_t^2 \right\rceil \right\} \right\} = \min\{H, Q^*\}. \tag{3.4}$$

Hence, the final cluster size from each stratum $s$ is $Q_s = \min\{H_s, \lceil Q a_s \rceil\}$. Therefore, in the second stage, we select $Q_s - t_s$ clusters from each stratum $s$. Then, we collect data from $k$ randomly chosen households from each sub-stratum of each selected cluster. Based on the collected data, the $100(1-\alpha)\%$ confidence interval for the Gini index $G_X$ is given by

$$\left(\hat{G}_Q - z_{\alpha/2} \frac{V_Q}{\sqrt{Q}},\ \hat{G}_Q + z_{\alpha/2} \frac{V_Q}{\sqrt{Q}}\right). \tag{3.5}$$

4. Results

In this section, we discuss the theoretical results related to the sequential procedures discussed in Section 3. Under the assumptions provided in Section 2, we prove the asymptotic efficiency and asymptotic consistency properties of our procedures. Theorem 1 concerns the asymptotic efficiency property, whereas Theorem 2 concerns the asymptotic consistency property.

Theorem 1.
If $E[V_n^2]$ exists and, for each stratum $s$ of the population, the total number of clusters $H_s$ is large enough, then the following hold.

(i) Let $N$ be the stopping rule for the Purely Sequential Procedure defined in (3.1). Then, as $\omega \to 0$,

$$\frac{N}{C} \xrightarrow{a.s.} 1 \quad \text{and} \quad E\left[\frac{N}{C}\right] \to 1.$$

(ii) Let $Q$ be the stopping rule for the Two-Stage Procedure defined in (3.4). Then, as $\omega \to 0$,

$$\frac{Q}{C} \xrightarrow{a.s.} 1 \quad \text{and} \quad E\left[\frac{Q}{C}\right] \to 1.$$

Proof. The proof of the theorem is given in the Appendix.

Theorem 2. Under the assumptions provided in Section 2, let $N$ be the stopping rule for the Purely Sequential Procedure defined in (3.1). Then, as $\omega \to 0$,

$$P\left(\hat{G}_N - z_{\alpha/2} \frac{V_N}{\sqrt{N}} < G_X < \hat{G}_N + z_{\alpha/2} \frac{V_N}{\sqrt{N}}\right) \to 1 - \alpha.$$

The same holds if we replace $N$ with $Q$, where $Q$ is the stopping rule for the Two-Stage Procedure defined in (3.4).

Proof. The proof of the theorem is given in the Appendix.

5. Simulation

In this section, we conduct a simulation study under a complex household survey design to illustrate the performance of both the purely sequential and the two-stage procedures in constructing a $100(1-\alpha)\%$ confidence interval for the Gini index, with the constraint that the interval width does not exceed a pre-specified bound $\omega$. To illustrate the effectiveness of these procedures, we compare outcomes such as the achieved sample size and coverage probability with their corresponding values under the ideal scenario where population characteristics are assumed to be known.

The construction of the pseudo population used in the simulation study is given in Section 1 of the supplementary material.

5.1. Results for Purely Sequential Procedure

In this subsection, we carry out a simulation study to explore the properties of our purely sequential procedure. After collecting the pilot sample from the pseudo population, the estimator $V_m^2$ of $\xi^2$ is computed and the stopping rule (3.1) is checked.
If it is satisfied, our pilot cluster size ($m$) is our final cluster size. If not, then we choose $m'\ (= 1)$ additional cluster(s) from each stratum $s$ with $n_s < \hat{C}_s$. We repeat this process until the stopping rule is met. Based on the final cluster size $N$, we estimate $\xi^2$ using $V_N^2$ and the population Gini index $G_X$ using $\hat{G}_N$, and construct the $100(1-\alpha)\%$ confidence interval for the Gini index $G_X$ as given in (3.2). This procedure is repeated 5000 times. For each replication, we estimate $\xi^2$ and the Gini index $G_X$, and construct the corresponding confidence interval.

The empirical results for the purely sequential procedure are summarized in Table 2 for the setting $\alpha = 0.05$, $\omega = 0.015$. The first column in Table 2 lists the distribution of the household's monthly income. The second column reports the average final number of clusters ($\bar{N}$) along with the standard deviation ($sd(N)$). The third column presents the average estimated Gini index ($\bar{\hat{G}}_N$) based on the final sample size. The fourth column shows the ratio of the average final number of clusters to the optimal number of clusters ($\bar{N}/C$). The fifth column displays the average ratio of the estimated variance to the true variance ($\bar{V}_N^2/\xi^2$). The sixth column provides the coverage probability ($\hat{p}$) of the 5000 constructed confidence intervals along with its standard error ($se(\hat{p})$). Column seven gives the average length of these intervals ($\bar{w}_N$) and its standard deviation ($sd(w_N)$). The final column reports the proportion of intervals whose widths exceed the pre-specified bound $\omega$, denoted by $\hat{P}(w_N > \omega)$, along with its standard error ($se(\hat{P}(w_N > \omega))$).

Table 2: Results for $\alpha = 0.05$, $\omega = 0.015$, and $\delta = 2$.

| Distribution | $\bar{N}$ ($sd(N)$) | $\bar{\hat{G}}_N$ | $\bar{N}/C$ | $\bar{V}_N^2/\xi^2$ | $\hat{p}$ ($se(\hat{p})$) | $\bar{w}_N$ ($sd(w_N)$) | $\hat{P}(w_N > \omega)$ ($se$) |
|---|---|---|---|---|---|---|---|
| Gamma(2.649, 0.84) | 888.176 (59.25999) | 0.3311 | 0.9923 | 0.9909 | 0.9486 (0.0031) | 0.01498 (0.000009) | 0 (0) |
| Pareto(20000, 5) | 442.15 (119.2319) | 0.1109 | 0.9367 | 0.9338 | 0.9384 (0.0034) | 0.0149 (0.00002) | 0 (0) |
| lognormal(2.185, 0.562) | 1008.81 (109.1879) | 0.3090 | 0.9910 | 0.9897 | 0.9514 (0.0030) | 0.0150 (0.000009) | 0 (0) |

The results in Table 2 demonstrate that the average widths of the confidence intervals are smaller than the pre-assigned thresholds under the purely sequential procedure. The last column further confirms that, in each replication, the interval width is consistently below the specified bound. The estimated coverage probabilities are close to the nominal level of $100(1-\alpha)\%$. Additionally, the ratio of the average final number of clusters to the optimal number of clusters is approximately 1. These findings collectively support the theoretical properties of the purely sequential procedure discussed in Section 4. Further simulation results corresponding to another setting ($\alpha = 0.05$, $\omega = 0.01$) are given in Section 2 of the supplementary material.

5.2. Results for Two-Stage Procedure

In this subsection, we carry out a simulation study to explore the properties of our two-stage procedure. After collecting the pilot sample from the pseudo population, the estimator $V_t^2$ of $\xi^2$ is computed and the final cluster size is obtained using the stopping rule (3.4). If $Q = t$, our pilot cluster size ($t$) is our final cluster size. If $Q > t$, then at the second stage we sample $Q_s - t_s$ clusters from each stratum $s$. Based on the final cluster size $Q$, we estimate $\xi^2$ using $V_Q^2$ and the population Gini index $G_X$ using $\hat{G}_Q$, and construct the $100(1-\alpha)\%$ confidence interval for the Gini index $G_X$ as given in (3.5). This procedure is repeated 5000 times.
For each replication, we estimate $\xi^2$ and the population Gini index $G_X$, and construct the corresponding confidence interval.

The empirical results for the two-stage procedure are summarized in Table 3 for the setting $\alpha = 0.05$, $\omega = 0.015$. The first column of Table 3 lists the distribution of the household's monthly income. The second column reports the average final number of clusters ($\bar{Q}$) along with its standard deviation ($sd(Q)$). The third column presents the average estimated Gini index ($\bar{\widehat{G}}_Q$) based on the final sample size. The fourth column shows the ratio of the average final number of clusters to the optimal number of clusters ($\bar{Q}/C$). The fifth column displays the average ratio of the estimated variance to the true variance ($\bar{V}_Q^2/\xi^2$). The sixth column provides the coverage probability ($\widehat{p}$) of the 5000 constructed confidence intervals along with its standard error ($se(\widehat{p})$). The seventh column gives the average length of these intervals ($\bar{w}_Q$) and its standard deviation ($sd(w_Q)$). The final column reports the proportion of intervals whose widths exceed the pre-assigned bound $\omega$, denoted by $\widehat{P}(w_Q > \omega)$, along with its standard error ($se(\widehat{P}(w_Q > \omega))$).

Table 3: Results for $\alpha = 0.05$, $\omega = 0.015$, and $\delta = 2$.

| Distribution | $\bar{Q}$ ($sd(Q)$) | $\bar{\widehat{G}}_Q$ | $\bar{Q}/C$ | $\bar{V}_Q^2/\xi^2$ | $\widehat{p}$ ($se(\widehat{p})$) | $\bar{w}_Q$ ($sd(w_Q)$) | $\widehat{P}(w_Q > \omega)$ ($se$) |
|---|---|---|---|---|---|---|---|
| Gamma(2.649, 0.84) | 874.62 (270.03) | 0.3311 | 0.9772 | 0.9904 | 0.954 (0.0029) | 0.0156 (0.0022) | 0.586 (0.0070) |
| Pareto(20000, 5) | 449.305 (382.101) | 0.1111 | 0.9519 | 0.9514 | 0.936 (0.0035) | 0.0169 (0.0053) | 0.623 (0.0068) |
| lognormal(2.185, 0.562) | 983.932 (515.263) | 0.3089 | 0.9665 | 0.9930 | 0.946 (0.0032) | 0.0162 (0.0032) | 0.6394 (0.0068) |

The results in Table 3 demonstrate that the estimated coverage probabilities are close to the nominal level of $100(1-\alpha)\%$. Additionally, the ratio of the average final number of clusters to the optimal number of clusters is close to 1.
These findings collectively support the theoretical properties of the two-stage procedure discussed in Section 4. However, the width of the confidence interval may exceed the pre-assigned bound $\omega$. For simulation results corresponding to another setting ($\alpha = 0.05$, $\omega = 0.01$), see Section 2 of the supplementary material.

5.3. Discussion

In this subsection, we discuss two points arising from the simulation results for the purely sequential and two-stage procedures: the proportion of confidence intervals whose width exceeds the pre-assigned width $\omega$, and the empirical distribution of the final sample sizes for both procedures.

5.3.1. Width of the Confidence Intervals

For the purely sequential procedure, the simulation results in Table 2 show that the width of the confidence intervals does not exceed the pre-assigned width $\omega$. This happens because, by stopping rule (3.1) of the purely sequential procedure, we have

$$N \ge \frac{4 z_{\alpha/2}^2}{\omega^2}\left(V_N^2 + \frac{1}{N^{\delta}}\right) \ge \frac{4 z_{\alpha/2}^2}{\omega^2} V_N^2.$$

This implies

$$\frac{2 z_{\alpha/2} V_N}{\sqrt{N}} \le \omega.$$

Thus, the final cluster size obtained using the purely sequential procedure guarantees a bounded-width $100(1-\alpha)\%$ confidence interval.

The simulation results for the two-stage procedure, reported in Table 3, indicate that the width of the confidence interval frequently exceeds the pre-assigned threshold $\omega$ (in roughly 59% to 64% of replications). In this procedure, data are collected in two stages. Due to the limited information about $\xi^2$ at the first stage, the estimator $V_t^2$ exhibits high variability, which in turn increases the variability of the final cluster size $Q$ and of the final estimator $V_Q^2$. As a result, a considerable proportion of the constructed confidence intervals exceed the desired width. To further investigate this behavior, we examine the proportion of intervals exceeding the specified width under varying pilot cluster sizes.
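The contrast between the two rules can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: the helper names (`z_value`, `sequential_size`, `two_stage_size`, `width`, `v2_at`) are hypothetical, clusters are added one at a time rather than one per stratum, the two-stage size is taken to be $\max\{t, \lceil (2 z_{\alpha/2}/\omega)^2 V_t^2 \rceil\}$ (which is consistent with the bounds used in the Appendix but is not necessarily the exact form of (3.4)), and the survey-based estimator $V_n^2$ is replaced by the sample variance of toy i.i.d. normal values.

```python
import math
import random
from statistics import NormalDist


def z_value(alpha):
    """Upper alpha/2 quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def sequential_size(v2_at, m, alpha, omega, delta, n_max=10**6):
    """Smallest n >= m with n >= (2 z / omega)^2 * (V_n^2 + n^(-delta)).

    v2_at(n) is a caller-supplied (hypothetical) function returning the
    variance estimate based on the first n clusters. If n_max is reached
    before the rule is met, the width guarantee no longer applies.
    """
    k = (2 * z_value(alpha) / omega) ** 2
    n = m
    while n < n_max and n < k * (v2_at(n) + n ** (-delta)):
        n += 1  # simplification: add one cluster, then re-check the rule
    return n


def two_stage_size(v2_pilot, t, alpha, omega):
    """Assumed two-stage size: max(t, ceil((2 z / omega)^2 * V_t^2))."""
    k = (2 * z_value(alpha) / omega) ** 2
    return max(t, math.ceil(k * v2_pilot))


def width(v2, n, alpha):
    """Width 2 * z * V / sqrt(n) of the confidence interval."""
    return 2 * z_value(alpha) * math.sqrt(v2 / n)


def sample_variance(xs):
    """Unbiased sample variance; a toy stand-in for V_n^2."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


# Toy data (assumption): i.i.d. normal values with variance 0.01.
random.seed(1)
data = [random.gauss(0.0, 0.1) for _ in range(5000)]


def v2_at(n):
    return sample_variance(data[:n])


alpha, omega, delta, m = 0.05, 0.015, 2, 42

N = sequential_size(v2_at, m, alpha, omega, delta)
w_seq = width(v2_at(N), N, alpha)   # bounded by omega by construction

Q = two_stage_size(v2_at(m), m, alpha, omega)
w_two = width(v2_at(Q), Q, alpha)   # no such guarantee

print(f"sequential: N = {N}, width = {w_seq:.5f}, within bound: {w_seq <= omega}")
print(f"two-stage:  Q = {Q}, width = {w_two:.5f}, within bound: {w_two <= omega}")
```

In this sketch the sequential width is bounded by $\omega$ because the variance estimate used in the interval is the same one that triggered stopping. The two-stage size depends only on the pilot estimate, whereas the interval uses the variance re-estimated from all $Q$ clusters, so its width can land on either side of $\omega$.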
We conduct simulations for different values of $\delta$ (i.e., different pilot sample sizes) across the specified income distributions. The corresponding results are summarized in Tables 4–6.

Table 4: Results for the two-stage procedure for Gamma(2.649, 0.84), with $\alpha = 0.05$ and $\omega = 0.015$.

| $\delta$ | Pilot cluster size | $\bar{Q}/C$ | $\widehat{P}(w_Q > \omega)$ ($se$) |
|---|---|---|---|
| 2 | 42 | 0.9772 | 0.586 (0.0070) |
| 1 | 262 | 0.9953 | 0.513 (0.0071) |
| 0.8 | 486 | 0.9970 | 0.4742 (0.0071) |
| 0.65 | 852 | 1.007 | 0.255 (0.0062) |

Table 5: Results for the two-stage procedure for Pareto(20000, 5), with $\alpha = 0.05$ and $\omega = 0.015$.

| $\delta$ | Pilot cluster size | $\bar{Q}/C$ | $\widehat{P}(w_Q > \omega)$ ($se$) |
|---|---|---|---|
| 2 | 42 | 0.9519 | 0.623 (0.0068) |
| 1.5 | 86 | 0.9869 | 0.5602 (0.0070) |
| 1 | 262 | 1.0075 | 0.3632 (0.0068) |
| 0.85 | 412 | 1.0541 | 0.1282 (0.0047) |

Table 6: Results for the two-stage procedure for lognormal(2.185, 0.562), with $\alpha = 0.05$ and $\omega = 0.015$.

| $\delta$ | Pilot cluster size | $\bar{Q}/C$ | $\widehat{P}(w_Q > \omega)$ ($se$) |
|---|---|---|---|
| 2 | 42 | 0.9665 | 0.6394 (0.0068) |
| 1 | 262 | 0.9952 | 0.5326 (0.0071) |
| 0.8 | 486 | 1.0009 | 0.4754 (0.0071) |
| 0.62 | 966 | 1.0216 | 0.1702 (0.0047) |

The results in Tables 4–6 show that, as the pilot sample size increases, the proportion of confidence intervals exceeding the pre-assigned width $\omega$ decreases. This is because larger pilot cluster sizes lead to reduced variability in the estimator $V_t^2$. Notably, the results reveal a substantial drop in the proportion of wider intervals beyond a certain pilot sample size (for the Gamma distribution, from 0.4742 at a pilot size of 486 to 0.255 at a pilot size of 852). However, in practical survey applications, selecting a large pilot cluster size may not be desirable due to cost and logistical constraints.

From these observations, we conclude that the purely sequential procedure consistently yields confidence intervals whose widths do not exceed the pre-assigned threshold $\omega$. In contrast, under the two-stage procedure, the confidence interval width may exceed $\omega$, particularly when the pilot sample size is small. A similar discussion for $\omega = 0.01$ is provided in Section 2 of the supplementary material.
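As a small check on how the parenthesized standard errors in Tables 2–6 can be read, the following sketch computes the usual binomial standard error $\sqrt{\widehat{p}(1-\widehat{p})/R}$ of a proportion estimated from $R = 5000$ replications. The text does not state which formula was used, so the binomial form is an assumption; the function name `proportion_se` is hypothetical, and the two inputs are values reported for the Gamma distribution in Tables 2 and 3.

```python
import math


def proportion_se(p_hat, reps):
    """Binomial standard error of a proportion estimated from `reps` replications."""
    return math.sqrt(p_hat * (1.0 - p_hat) / reps)


reps = 5000
examples = [
    ("coverage, Gamma, purely sequential", 0.9486),
    ("P(w_Q > omega), Gamma, two-stage", 0.586),
]
for label, p_hat in examples:
    print(f"{label}: se = {proportion_se(p_hat, reps):.4f}")
```

Under this assumption the two values come out at about 0.0031 and 0.0070, which agrees with the corresponding table entries to four decimal places.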
Further, histograms of the empirical distribution of the final sample sizes are provided in Section 3 of the supplementary material.

6. Application results for 64th round NSS Data

In this section, we illustrate the purely sequential and two-stage procedures using survey data from the 64th round of the National Sample Survey (NSS) for four Indian states: Maharashtra, Uttar Pradesh, West Bengal, and Tamil Nadu. Tables 7 and 8 summarize the results for the purely sequential and two-stage procedures, respectively. Each table lists the state (first column) and the total number of clusters ($H$) in the survey data (second column). The third column shows the final number of clusters ($N$ for the purely sequential procedure in Table 7 and $Q$ for the two-stage procedure in Table 8), with the corresponding pilot sample size in parentheses. The fourth column presents the respective Gini index estimate and its standard error. The fifth and sixth columns report the lower and upper confidence limits, and the last column gives the width of the confidence interval.

Table 7: Results for the purely sequential procedure with $\alpha = 0.1$, $\omega = 0.02$, and $\delta = 1$.

| State | $H$ | $N$ ($m$) | $\widehat{G}_N$ ($se(\widehat{G}_N)$) | Lower CI | Upper CI | $w_N$ |
|---|---|---|---|---|---|---|
| Maharashtra | 1008 | 1004 (166) | 0.2913 (0.0060) | 0.2815 | 0.3012 | 0.0197 |
| Uttar Pradesh | 1262 | 653 (166) | 0.2653 (0.0058) | 0.2557 | 0.2748 | 0.0192 |
| West Bengal | 878 | 540 (166) | 0.2806 (0.0058) | 0.2711 | 0.2901 | 0.0190 |
| Tamil Nadu | 709 | 512 (166) | 0.2652 (0.0057) | 0.2557 | 0.2746 | 0.0189 |

Table 8: Results for the two-stage procedure with $\alpha = 0.1$, $\omega = 0.02$, and $\delta = 1$.
| State | $H$ | $Q$ ($t$) | $\widehat{G}_Q$ ($se(\widehat{G}_Q)$) | Lower CI | Upper CI | $w_Q$ |
|---|---|---|---|---|---|---|
| Maharashtra | 1008 | 822 (166) | 0.2911 (0.0055) | 0.2821 | 0.3002 | 0.0181 |
| Uttar Pradesh | 1262 | 854 (166) | 0.2612 (0.0051) | 0.2529 | 0.2696 | 0.0168 |
| West Bengal | 878 | 586 (166) | 0.2751 (0.0053) | 0.2664 | 0.2839 | 0.0175 |
| Tamil Nadu | 709 | 566 (166) | 0.2677 (0.0055) | 0.2587 | 0.2767 | 0.0180 |

The results in Tables 7–8 show that, for all four states, both procedures achieve the desired precision for the Gini index (i.e., a confidence interval with $w_N \le \omega$ or $w_Q \le \omega$) at the required confidence level while using fewer clusters than are available in the survey data.

Table 9 presents the results obtained from the real survey data under fixed cluster sizes. The first column lists the states, while the second and third columns report the fixed cluster sizes $n_1$ and $n_2$. Here, $n_1$ corresponds to under-sampling and $n_2$ to over-sampling, relative to the cluster size determined through the sequential procedure (see Tables 7 and 8). The fourth and fifth columns show the widths of the confidence intervals associated with $n_1$ and $n_2$, respectively.

As observed, under-sampling ($n_1$) leads to interval widths exceeding $0.02$, whereas over-sampling ($n_2$) yields intervals narrower than $0.02$, but at a higher sampling cost. In comparison, the sequential procedure achieves the prescribed width with a substantially better precision-cost balance, highlighting its practical advantage over fixed cluster size approaches.

Table 9: Real data results under fixed cluster sizes $n_1$ and $n_2$.

| State | $n_1$ | $n_2$ | $w_{n_1}$ | $w_{n_2}$ |
|---|---|---|---|---|
| Maharashtra | 800 | 1008 | 0.0228 | 0.0198 |
| Uttar Pradesh | 500 | 1000 | 0.0232 | 0.0167 |
| West Bengal | 350 | 700 | 0.0261 | 0.0172 |
| Tamil Nadu | 320 | 650 | 0.0224 | 0.0154 |

7. Discussion on Sub-Stratum Effect

In this section, we examine the effect of introducing sub-stratification on the final number of clusters, as well as on the precision of the Gini index estimation.
As discussed earlier, the complex household survey design used by Darku et al. (2020) involves stratifying the population and forming clusters of households within each stratum. Their design does not incorporate sub-strata within the selected clusters, although several survey agencies around the world use such sub-strata to account for further heterogeneity within clusters. Our proposed design introduces this additional layer of sub-stratification by dividing each selected cluster into two sub-strata: affluent and non-affluent households.

To investigate the influence of this additional sub-stratification, we simulate data from three income distributions representative of typical household income patterns: Gamma(2.649, 0.84), Pareto(20000, 5), and Lognormal(2.185, 0.562). For each distribution, we evaluate the final number of clusters determined by the procedures outlined in Darku et al. (2020) and compare them with those obtained using our approach via equations (3.1) and (3.4). To ensure a fair comparison, we use $1/n^2$ instead of $1/n$ as the additional term in the sequential procedure proposed by Darku et al. (2020). For clarity of presentation, we refer to our sampling design as "Proposed" and to the design of Darku et al. (2020) as "Darku et al." in Tables 10–11.

Table 10 summarizes the final number of clusters obtained by the purely sequential and the two-stage procedures based on 5000 simulations. The first column lists the underlying income distributions. The second column presents the average number of clusters under the purely sequential ($\bar{N}$) and two-stage ($\bar{Q}$) procedures using our proposed sub-stratified approach. These are computed via equations (3.1) and (3.4), respectively. The third column reports the corresponding averages obtained using the methods of Darku et al. (2020).
Table 10: Final number of clusters for $\alpha = 0.05$, $\omega = 0.015$.

| Distribution | Proposed: $\bar{N}$ ($\bar{Q}$) | Darku et al.: $\bar{N}$ ($\bar{Q}$) |
|---|---|---|
| Gamma(2.649, 0.84) | 888.176 (874.62) | 948.344 (936.8008) |
| Pareto(20000, 5) | 442.15 (449.305) | 468.7324 (489.8396) |
| lognormal(2.185, 0.562) | 1008.81 (983.932) | 1077.854 (1055.457) |

In Table 11, we compare the performance of both designs under a fixed number of clusters, set at 1200. For each income distribution, we conducted 5000 simulation runs under this setting. The first column of Table 11 lists the household monthly income distributions used in the simulations. The second and third columns present the average estimated Gini index and the estimated asymptotic variance under the proposed design and the design of Darku et al. (2020), respectively, computed using equations (2.1) and (2.4)–(2.5).

Table 11: Estimates of the Gini index and asymptotic variance using 1200 clusters.

| Distribution | Proposed: $\widehat{G}_n$ ($V_n^2$) | Darku et al.: $\widehat{G}_n$ ($V_n^2$) |
|---|---|---|
| Gamma(2.649, 0.84) | 0.331149 (0.01307985) | 0.3311575 (0.01393041) |
| Pareto(20000, 5) | 0.1112591 (0.00692779) | 0.1112986 (0.007426569) |
| lognormal(2.185, 0.562) | 0.3090566 (0.01489573) | 0.30910412 (0.01600393) |

The results presented in Table 11 demonstrate that, on average, the Gini index estimates from the proposed design using sub-stratification are closer to the true values than those obtained using the design of Darku et al. (2020). Additionally, the proposed design yields consistently lower estimates of the asymptotic variance, indicating that accounting for sub-stratification improves the precision of the Gini estimator. Together, Tables 10–11 underscore the benefits of incorporating sub-stratification: more accurate estimates with reduced variability and lower sampling costs compared to designs that do not account for this structure.

8. Concluding Remarks

The Gini index serves as an important indicator of economic inequality and plays a vital role in informing policy decisions.
This article proposes constructing a bounded-width confidence interval for the Gini index using a complex household survey design that stratifies the population, divides each stratum into clusters, and further classifies households within clusters into affluent and non-affluent groups. Using this design, we develop two procedures, a purely sequential procedure and a two-stage procedure, to determine the optimal cluster size needed to construct a bounded-width confidence interval for the Gini index without assuming any specific income distribution. Unlike existing approaches, our design accounts for sub-stratification within clusters, aligning with real survey practices. We show that, under mild regularity conditions, both methods ensure that when the desired width is sufficiently narrow, the final sample cluster size is close to optimal and the confidence interval achieves the required coverage probability. Simulation results confirm these asymptotic properties. The purely sequential procedure yields smaller standard deviations of the final number of clusters and is therefore preferred, while the two-stage procedure offers a simpler implementation in settings where only limited sampling stages are feasible due to logistical or cost considerations.

9. Appendix

Below we give a list of lemmas whose proofs are given in Section 4 of the supplementary material.

Lemma 1. If $\xi^2$ is the asymptotic variance as defined in Section 2 and $V_t^2$ is its estimator, then $E(V_t^2) \to \xi^2$ as $t \to \infty$.

Lemma 2. If $\xi^2$ is the asymptotic variance as defined in Section 2 and $V_n^2$ is its estimator, then $V_n^2 \xrightarrow{a.s.} \xi^2$ as $n \to \infty$.

Lemma 3. $\widehat{\mu}_n = \sum_{s=1}^{S} \sum_{c_s=1}^{n_s} \sum_{b_{c_s}=1}^{2} \sum_{h=1}^{k} w_{s c_s b_{c_s} h}\, x_{s c_s b_{c_s} h}$ satisfies the uniform continuity in probability condition.

Lemma 4. For each $p \in [0, 1]$, $\sqrt{n}\,(F(z(p)) - \widehat{F}_n(\widehat{z}_n(p))) \to 0$ as $n \to \infty$.

Lemma 5.
For any $\epsilon > 0$, there exists $M > 0$ such that $P\{|\sqrt{n}\,(F(z(p)) - \widehat{F}_n(z(p)))| \ge M\} < \epsilon$ for all $n \in \mathbb{N}$.

Lemma 6. For each $p \in [0, 1]$,

$$\sqrt{n}\left\{\widehat{z}_n(p) - z(p) + \frac{\widehat{F}_n(z(p)) - F(z(p))}{f(z(p))}\right\} \xrightarrow{P} 0 \quad \text{as } n \to \infty.$$

Lemma 7. Let $\{R_n\}$ be a sequence of random variables such that $\sqrt{n}\,R_n \xrightarrow{P} 0$. Then $\{R_n\}$ satisfies the uniform continuity in probability condition.

Lemma 8. For each $p \in [0, 1]$, $\widehat{z}_n(p)$ satisfies the uniform continuity in probability condition.

Lemma 9. $\int_0^1 \widehat{\alpha}_n(p)\,dp$ satisfies the uniform continuity in probability condition.

Lemma 10. Under assumptions AN1–AN9, for each $p \in [0, 1]$, as $n \to \infty$,

$$\sqrt{n}\,(\widehat{\mu}_n - \mu) \xrightarrow{d} N(0, V_1) \quad \text{and} \quad \sqrt{n}\,(\widehat{\alpha}_n(p) - \alpha(p)) \xrightarrow{d} N(0, V_{2,p}),$$

for some $V_1 > 0$ and $V_{2,p} > 0$.

Lemma 11. If $N$ is a non-negative integer-valued random variable and $C$ is the optimal number of clusters such that $N/C \xrightarrow{P} 1$, then under assumptions AN1–AN9, for each $p \in [0, 1]$, as $\omega \to 0$:
(i) $\sqrt{N}\,(\widehat{\mu}_N - \mu) \xrightarrow{d} N(0, V_1)$.
(ii) $\sqrt{N}\,(\widehat{\alpha}_N(p) - \alpha(p)) \xrightarrow{d} N(0, V_{2,p})$.
(iii) $(\widehat{\mu}_N - \mu) \xrightarrow{a.s.} 0$ and $(\widehat{\alpha}_N(p) - \alpha(p)) \xrightarrow{a.s.} 0$.
(iv) $\int_0^1 (\widehat{\alpha}_N(p) - \alpha(p))\,dp \xrightarrow{P} 0$.

Proof of Theorem 1 (i): From stopping rule (3.1), we get

$$\left(\frac{2 z_{\alpha/2} V_N}{\omega}\right)^2 \le N \le m + \left(\frac{2 z_{\alpha/2}}{\omega}\right)^2 \left(V_{N-1}^2 + \frac{1}{(N-1)^{\delta}}\right).$$

By dividing the above inequality by $C$, we get

$$\left(\frac{V_N}{\xi}\right)^2 \le \frac{N}{C} \le \frac{m}{C} + \frac{1}{\xi^2}\left(V_{N-1}^2 + \frac{1}{(N-1)^{\delta}}\right). \tag{A.1}$$

Using Lemma 2 and Gut (2009), we have

$$V_N^2 \xrightarrow{a.s.} \xi^2. \tag{A.2}$$

Since $m/C \to 0$ as $\omega \to 0$, by equations (A.1) and (A.2), we conclude that $N/C \xrightarrow{a.s.} 1$ as $\omega \to 0$. Now, we will prove that $E(N)/C \to 1$ as $\omega \to 0$. From equation (A.1), we have

$$\frac{N}{C} - \frac{m}{C} \le \frac{1}{\xi^2}\left(\sup_{n \ge m} V_n^2 + \frac{1}{(m-1)^{\delta}}\right).$$

Since

$$\sup_{n \ge m} V_n^2 = \sup_{n \ge m} \sum_{s=1}^{S} \left(n \times \frac{n_s}{n_s - 1} \sum_{c_s=1}^{n_s} (u_{s c_s} - \bar{u}_s)^2\right) \le \sum_{s=1}^{S} H \times H_s \left(\sup_{n \ge m} \left(\frac{1}{n_s - 1} \sum_{c_s=1}^{n_s} (u_{s c_s} - \bar{u}_s)^2\right)\right)$$
, by taking expectations on both sides, we get

$$E\left[\sup_{n \ge m} V_n^2\right] \le \sum_{s=1}^{S} H \times H_s \times E\left[\sup_{n \ge m} \left(\frac{1}{n_s - 1} \sum_{c_s=1}^{n_s} (u_{s c_s} - \bar{u}_s)^2\right)\right].$$

Using the Cauchy-Schwarz inequality and Lemma 9.2.4 of Ghosh et al. (2011), we get

$$E\left[\sup_{n \ge m} V_n^2\right] \le \sum_{s=1}^{S} H \times H_s \times \left\{E\left[\left(\sup_{n \ge m} \frac{1}{n_s - 1} \sum_{c_s=1}^{n_s} (u_{s c_s} - \bar{u}_s)^2\right)^2\right]\right\}^{1/2} \le \sum_{s=1}^{S} H \times H_s \times 4 \times \left\{E\left[\left(\frac{1}{m_s - 1} \sum_{c_s=1}^{m_s} (u_{s c_s} - \bar{u}_s)^2\right)^2\right]\right\}^{1/2} < \infty.$$

Since $N/C \xrightarrow{P} 1$ as $\omega \to 0$, by the dominated convergence theorem, we conclude that $E(N/C) \to 1$ as $\omega \to 0$.

(ii) From the stopping rule in equation (3.4), we have

$$\left(\frac{2 z_{\alpha/2} V_t}{\omega}\right)^2 \le Q \le t + \left(\frac{2 z_{\alpha/2}}{\omega}\right)^2 V_t^2.$$

By dividing the above inequality by $C$, we get

$$\frac{V_t^2}{\xi^2} \le \frac{Q}{C} \le \frac{t}{C} + \frac{V_t^2}{\xi^2}. \tag{A.3}$$

Since $t \to \infty$ as $\omega \to 0$, $V_t^2 \xrightarrow{a.s.} \xi^2$. Now, $t/C \to 0$ as $\omega \to 0$. Hence, using equation (A.3), we conclude that $Q/C \xrightarrow{a.s.} 1$ as $\omega \to 0$. Taking expectations on both sides of (A.3), we get

$$\frac{E(V_t^2)}{\xi^2} \le \frac{E(Q)}{C} \le \frac{t}{C} + \frac{E(V_t^2)}{\xi^2}.$$

From Lemma 1, we have $E(V_t^2) \to \xi^2$ as $t \to \infty$. Since $t \to \infty$ and $t/C \to 0$ as $\omega \to 0$, the above inequality yields $E(Q)/C \to 1$ as $\omega \to 0$.

Proof of Theorem 2. Define $T(u, v) = u/v$ for $v \ne 0$. Let $u_N = \int_0^1 \widehat{\alpha}_N(p)\,dp$, $v_N = \widehat{\mu}_N$, $u_0 = \int_0^1 \alpha(p)\,dp$, and $v_0 = \mu$. By Taylor series expansion, we get

$$T(u_N, v_N) = T(u_0, v_0) + \frac{(u_N - u_0) v_0 - u_0 (v_N - v_0)}{(v_0)^2} - \frac{(u_N - u_0)(v_N - v_0)}{(v_0 + p(v_N - v_0))^2} + \frac{2(u_0 + p(u_N - u_0))(v_N - v_0)^2}{(v_0 + p(v_N - v_0))^3},$$

for some $p \in (0, 1)$. Multiplying the above equation by $\sqrt{N}$, we get

$$\sqrt{N}\,(T(u_N, v_N) - T(u_0, v_0)) = \sqrt{N}\,\frac{(u_N - u_0) v_0 - u_0 (v_N - v_0)}{(v_0)^2} + \sqrt{N}\,R_N, \tag{A.4}$$

where

$$R_N = -\frac{(u_N - u_0)(v_N - v_0)}{(v_0 + p(v_N - v_0))^2} + \frac{2(u_0 + p(u_N - u_0))(v_N - v_0)^2}{(v_0 + p(v_N - v_0))^3}.$$
Now, consider

$$\frac{\sqrt{N}(u_N - u_0) v_0 - u_0 \sqrt{N}(v_N - v_0)}{(v_0)^2} = \frac{\sqrt{N}(u_N - u_C + u_C - u_0) v_0 - u_0 \sqrt{N}(v_N - v_C + v_C - v_0)}{(v_0)^2}$$

$$= \frac{\sqrt{N}(u_N - u_C) v_0 - u_0 \sqrt{N}(v_N - v_C)}{(v_0)^2} + \frac{\sqrt{N}(u_C - u_0) v_0 - u_0 \sqrt{N}(v_C - v_0)}{(v_0)^2} = A + B \times D,$$

where

$$A = \frac{\sqrt{N}(u_N - u_C) v_0 - u_0 \sqrt{N}(v_N - v_C)}{(v_0)^2}, \quad B = \sqrt{\frac{N}{C}}, \quad D = \frac{\sqrt{C}(u_C - u_0) v_0 - u_0 \sqrt{C}(v_C - v_0)}{(v_0)^2}.$$

Since $N/C \xrightarrow{P} 1$, we get from Lemmas 3 and 9 that $A \xrightarrow{P} 0$. By using Bhattacharya (2007), we get $D \xrightarrow{d} N(0, \xi^2/4)$. Consequently, by using Slutsky's theorem, we get

$$\frac{\sqrt{N}(u_N - u_0) v_0 - u_0 \sqrt{N}(v_N - v_0)}{(v_0)^2} \xrightarrow{d} N(0, \xi^2/4). \tag{A.5}$$

Again,

$$\sqrt{N} R_N = -\frac{(\sqrt{N}(u_N - u_0))(v_N - v_0)}{(v_0 + p(v_N - v_0))^2} + \frac{2(u_0 + p(u_N - u_0))(\sqrt{N}(v_N - v_0))(v_N - v_0)}{(v_0 + p(v_N - v_0))^3},$$

which, in view of Lemma 11 and Slutsky's theorem, gives

$$\sqrt{N} R_N \xrightarrow{P} 0. \tag{A.6}$$

By using equations (A.4), (A.5), (A.6), and Slutsky's theorem, we get

$$\sqrt{N}\,(T(u_N, v_N) - T(u_0, v_0)) \xrightarrow{d} N(0, \xi^2/4). \tag{A.7}$$

Now, we have

$$T(u_N, v_N) = \frac{u_N}{v_N} = \frac{\int_0^1 \widehat{\alpha}_N(p)\,dp}{\widehat{\mu}_N} \quad \text{and} \quad T(u_0, v_0) = \frac{u_0}{v_0} = \frac{\int_0^1 \alpha(p)\,dp}{\mu}.$$

Further, we have $\widehat{G}_N = 1 - 2(u_N/v_N)$ and $G_X = 1 - 2(u_0/v_0)$. Consequently, we get

$$\sqrt{N}\,(\widehat{G}_N - G_X) = -2\sqrt{N}\,(T(u_N, v_N) - T(u_0, v_0)).$$

Then, by equation (A.7) and the symmetry of the normal distribution about zero, we get

$$\sqrt{N}\,(\widehat{G}_N - G_X) \xrightarrow{d} N(0, \xi^2). \tag{A.8}$$

Further, by equation (A.2), we have $V_N^2 \xrightarrow{P} \xi^2$. Consequently, by equation (A.8) and Slutsky's theorem, we get that

$$P\left(\widehat{G}_N - z_{\alpha/2}\frac{V_N}{\sqrt{N}} < G_X < \widehat{G}_N + z_{\alpha/2}\frac{V_N}{\sqrt{N}}\right) = P\left(\left|\frac{\sqrt{N}\,(\widehat{G}_N - G_X)}{V_N}\right| < z_{\alpha/2}\right) \to 1 - \alpha$$

as $\omega \to 0$. Hence the result.

References

Álvarez-Verdejo, E., Moya-Fernández, P.J., Muñoz-Rosas, J.F., 2021. Single imputation methods and confidence intervals for the Gini index. Mathematics 9, 3252.
Baydil, B., de la Peña, V.H., Zou, H., Yao, H., 2025. Unbiased estimation of the Gini coefficient.
Statistics & Probability Letters 222, 110376.
Bhattacharya, D., 2005. Asymptotic inference from multi-stage samples. Journal of Econometrics 126, 145–171.
Bhattacharya, D., 2007. Inference on inequality from household survey data. Journal of Econometrics 137, 674–707.
Binder, D.A., Kovacevic, M.S., 1995. Estimating some measures of income inequality from survey data: an application of the estimating equations approach. Survey Methodology 21, 137–146.
Chattopadhyay, B., De, S.K., 2016. Estimation of Gini index within pre-specified error bound. Econometrics 4, 30.
Darku, F.B., Konietschke, F., Chattopadhyay, B., 2020. Gini index estimation within pre-specified error bound: Application to Indian household survey data. Econometrics 8, 26.
Darku, F.B., Ofori-Boateng, D., Chattopadhyay, B., 2023. Comparison of Gini indices using sequential approach: Application to the US Small Business Administration data. Sequential Analysis 42, 248–268.
Davidson, R., 2009. Reliable inference for the Gini index. Journal of Econometrics 150, 30–40.
De, S.K., Chattopadhyay, B., 2017. Minimum risk point estimation of Gini index. Sankhya B 79, 247–277.
Egozcue, J.J., Pawlowsky-Glahn, V., 2019. Compositional data: the sample space and its structure. TEST 28, 599–638.
Ghosh, M., Mukhopadhyay, N., Sen, P.K., 2011. Sequential estimation. John Wiley & Sons.
Gut, A., 2009. Stopped random walks. Springer.
Ham, D.W., Qiu, J., 2023. Hypothesis testing in adaptively sampled data: ART to maximize power beyond iid sampling. TEST 32, 998–1037.
Ibragimov, R., Kattuman, P., Skrobotov, A., 2025. Robust inference on income inequality: t-statistic based approach. Econometric Reviews 44, 384–415.
Jokiel-Rokita, A., Piatek, S., 2025. Nonparametric estimators of inequality curves and inequality measures. Journal of Statistical Planning and Inference 237, 106251.
Langel, M., Tillé, Y., 2013.
Variance estimation of the Gini index: revisiting a result several times published. Journal of the Royal Statistical Society Series A: Statistics in Society 176, 521–540.
Mukhopadhyay, N., De Silva, B.M., 2008. Sequential methods and their applications. Chapman and Hall/CRC.
Mukhopadhyay, N., Sengupta, P.P., 2021. Gini inequality index: Methods and applications. CRC Press.
Navarro-Esteban, P., Cuesta-Albertos, J.A., 2021. High-dimensional outlier detection using random projections. TEST 30, 908–934.
Sang, Y., Dang, X., Zhao, Y., 2020. Depth-based weighted jackknife empirical likelihood for non-smooth U-structure equations: WJEL for U-structure equations. TEST 29, 573–598.
Tayanagi, T., Kurozumi, E., 2024. Change-point estimators with the weighted objective function when sequentially estimating breaks. Econometrics and Statistics.