An Integer Programming Formulation Applied to Optimum Allocation in Multivariate Stratified Sampling
Authors: Jose Andre de Moura Brito, Gustavo Silva Semaan, Pedro Luis do Nascimento Silva
José André de Moura Brito
National School of Statistical Sciences (ENCE/IBGE) - Brazil
e-mail: jambrito@gmail.com

Gustavo Silva Semaan
Federal University Fluminense – Instituto de Computação (UFF/IC) - Brazil
e-mail: gsemaan@ic.uff.br

Pedro Luis do Nascimento Silva
National School of Statistical Sciences (ENCE/IBGE) - Brazil
e-mail: pedronsilva@gmail.com

Nelson Maculan
Federal University of Rio de Janeiro (COPPE/UFRJ) - Brazil
e-mail: nelson.maculan@gmail.com

Summary: The problem of optimal allocation of samples in surveys using a stratified sampling plan was first discussed by Neyman in 1934. Since then, many researchers have studied the problem of sample allocation in multivariate surveys, and several methods have been proposed. Basically, these methods fall into two classes. The first involves forming a weighted average of the stratum variances and finding the optimal allocation for the average variance. The second comprises methods that require an acceptable coefficient of variation for each of the variables on which the allocation is to be done. This paper proposes a new optimization approach to the second problem, based on an integer programming formulation. Several experiments showed that the proposed approach is an efficient way to solve this problem when compared with an existing approach from the literature.

Keywords: Stratification; Allocation; Integer programming.

1. Introduction

Nowadays, much of the research carried out by statistical institutes relies on a sampling plan. A sample survey yields estimates of population parameters based on a sample selected from that population (Lohr, 2010).
When a sampling plan is used, it is necessary to balance the budget available for the survey against the minimum level of precision required for the estimates to be published. One way to meet both requirements is to use stratification techniques. In other words, by grouping the members of the population into $H$ homogeneous strata, it is possible to produce more precise estimates, where homogeneity is measured by a variance expression associated with a previously chosen stratification variable. Once the strata and a sample size $n$ (set in terms of both survey cost and precision) have been defined, independent samples are selected in each of these strata.

In addition, there are situations in which the sample size is not defined a priori. The allocation of the sample sizes ($n_h$, $h = 1, \dots, H$) to the strata may therefore serve one of two purposes: (i) to minimize a weighted sum of the variances associated with a set of $m$ survey variables; or (ii) to minimize the total sample size to be distributed among the strata, in such a way that the coefficients of variation associated with these variables are less than or equal to previously set coefficients of variation (called target cvs). In both cases, we have a multivariate allocation problem.

This paper presents a new methodology that addresses the second purpose. It is based on an integer programming formulation implemented in the R language. The paper is organized as follows. Section 2 presents some concepts of stratified sampling and describes the multivariate allocation problem. Section 3 introduces the new methodology. Section 4 presents a small set of computational results comparing the new methodology with a method from the literature (Bethel, 1989).

2. Stratified Sampling and Optimal Allocation Problem

In stratified sampling (Cochran, 1977), a population $U$ with $N$ units is divided into $H$ strata $E_1, E_2, \dots, E_H$, formed by $N_1, N_2, \dots, N_H$ units, respectively. These strata do not overlap and together cover the entire population, in such a way that:

$$N = N_1 + N_2 + \dots + N_h + \dots + N_H \tag{1}$$

$$E_i \cap E_k = \emptyset, \quad i = 1, \dots, H-1, \quad k = i+1, \dots, H \tag{2}$$

$$\bigcup_{h=1}^{H} E_h = U \tag{3}$$

Once the strata and a sample size $n$ have been defined, $n_h$ observations (independent samples) are selected from the $N_h$ observations available in each stratum $E_h$, where $n = n_1 + n_2 + \dots + n_H$. In general, this sample provides information on a set of $m$ survey variables. Denoting these variables by $Y_1, Y_2, \dots, Y_j, \dots, Y_m$, the population variance of each variable in each stratum is defined, following Cochran (1977), by:

$$S_{hj}^2 = \frac{1}{N_h - 1} \sum_{i \in E_h} (y_{ij} - \bar{y}_{hj})^2, \quad h = 1, \dots, H, \quad j = 1, \dots, m \tag{4}$$

where $y_{ij}$ is the value of the $j$-th survey variable for the $i$-th observation in stratum $h$, and $\bar{y}_{hj}$ is the mean of this variable in the $h$-th stratum. Still considering stratified simple random sampling, the variance of the estimator of the total ($t_{y_j}$) (Cochran, 1977) for each of the $m$ survey variables is defined by:

$$V(t_{y_j}) = \sum_{h=1}^{H} N_h^2 \cdot \frac{S_{hj}^2}{n_h} \cdot \left(1 - \frac{n_h}{N_h}\right), \quad j = 1, \dots, m \tag{5}$$

Since the values of $N_h$ and $S_{hj}^2$ can be calculated once the strata have been defined, the variance in equation (5) depends solely on the sample size $n_h$ allocated to each stratum. This allocation is very important, because it determines the precision of the sampling procedure. In practice, however, making an allocation requires balancing the precision obtained for each of the survey variables against the cost of the survey, which depends on the number of sampling units to be investigated. According to the literature, there are two approaches that address this issue.
The first minimizes a weighted sum of the variances (or coefficients of variation) of the survey variables of interest, given a fixed maximum sample size $n$. The second determines the sample size $n_h$ to be allocated to each stratum so that the total sample size is minimized and the coefficient of variation of the estimate for each variable $Y_j$ is less than or equal to a target cv defined a priori for that variable. The main works in the literature that deal with these two approaches are: Kokan (1963), Kokan and Khan (1967), Huddleston, Claypool and Hocking (1970), Bethel (1989), Valliant and Gentle (1997), Khan and Ahsan (2003), García and Cortez (2006), Kozak (2006), Day (2010), Khan, Ali and Ahmad (2011), and Ismail, Nasser and Ahmad (2011).

The second approach is equivalent to the following mathematical programming problem, where $Y_j$ is the total of the $j$-th survey variable, i.e.:

$$Y_j = \sum_{h=1}^{H} \sum_{i \in E_h} y_{ij}$$

$$\text{Minimize} \quad \sum_{h=1}^{H} n_h \tag{6}$$

subject to

$$1 \le n_h \le N_h, \quad h = 1, \dots, H \tag{7}$$

$$\frac{\sqrt{V(t_{y_j})}}{Y_j} \le cv_j, \quad j = 1, \dots, m \tag{8}$$

$$n_h \in \mathbb{Z}_+, \quad h = 1, \dots, H \tag{9}$$

In this formulation, the objective function to be minimized (equation 6) is the sum of the sample sizes allocated to the strata. Constraint (7) ensures that at least one sample unit is allocated to each stratum and that the number of units allocated does not exceed the size of the stratum. Constraint (8) ensures that the ratio between the standard deviation of the estimated total of each variable and the corresponding total is less than or equal to a target coefficient of variation set in advance. Finally, constraint (9) ensures that the sample sizes allocated to the strata are integers (the integrality constraint of the problem).
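As a rough illustration of how equations (5), (7) and (8) fit together, the following Python sketch evaluates the variance of the estimated total for a given allocation and checks whether that allocation is feasible. The function names and all numerical values are hypothetical and are not taken from the paper, whose own implementation is in R.

```python
import math


def total_variance(N, S2, n):
    """Variance of the estimated total of one variable, as in equation (5).

    N[h] is the stratum size, S2[h] the stratum variance, n[h] the sample size.
    """
    return sum(Nh ** 2 * (S2h / nh) * (1 - nh / Nh)
               for Nh, S2h, nh in zip(N, S2, n))


def allocation_is_feasible(N, S2, Y, cv, n):
    """Check constraints (7) and (8) for an integer allocation n.

    S2[j][h] is the variance of variable j in stratum h, Y[j] its total,
    and cv[j] the target coefficient of variation (as a fraction).
    """
    # Constraint (7): between 1 and N_h units in every stratum.
    if any(not (1 <= nh <= Nh) for Nh, nh in zip(N, n)):
        return False
    # Constraint (8): coefficient of variation of each estimated total.
    for S2j, Yj, cvj in zip(S2, Y, cv):
        if math.sqrt(total_variance(N, S2j, n)) / Yj > cvj:
            return False
    return True


# Hypothetical data: H = 2 strata and m = 1 survey variable.
N = [10, 20]
S2 = [[4.0, 9.0]]
Y = [300.0]
cv = [0.05]

print(allocation_is_feasible(N, S2, Y, cv, [10, 20]))  # full enumeration of both strata
print(allocation_is_feasible(N, S2, Y, cv, [1, 1]))    # one unit per stratum
```

Taking every unit in each stratum drives the variance in (5) to zero, so that allocation always satisfies (8); the optimization problem consists of finding the smallest total that still does.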
Squaring both sides, the constraints associated with equation (8) can be rewritten as follows:

$$\frac{V(t_{y_j})}{Y_j^2 \cdot cv_j^2} \le 1, \quad j = 1, \dots, m \tag{10}$$

Expanding equation (5), each term becomes $N_h^2 S_{hj}^2 / n_h - N_h S_{hj}^2$. Substituting this into the numerator of (10) yields:

$$\sum_{h=1}^{H} \left( \frac{N_h^2 \cdot S_{hj}^2}{n_h \cdot Y_j^2 \cdot cv_j^2} - \frac{N_h \cdot S_{hj}^2}{Y_j^2 \cdot cv_j^2} \right) \le 1, \quad j = 1, \dots, m \tag{11}$$

Since $N_h$, $S_{hj}$, $Y_j$ and $cv_j$ are known a priori, we can define the following constant:

$$p_{hj} = \frac{N_h^2 \cdot S_{hj}^2}{Y_j^2 \cdot cv_j^2}, \quad h = 1, \dots, H, \quad j = 1, \dots, m$$

With this definition, the second term in (11) equals $p_{hj} / N_h$, and the constraints of type (11) take the following form:

$$\sum_{h=1}^{H} \left( \frac{p_{hj}}{n_h} - \frac{p_{hj}}{N_h} \right) \le 1, \quad j = 1, \dots, m \tag{12}$$

A first alternative for solving the formulation defined by (6), (7), (9) and (12) would be to apply a constrained nonlinear programming method (Bazaraa, Sherali and Shetty, 2006; Luenberger and Ye, 2008), such as penalty methods or multiplier methods, among others. Nevertheless, these methods produce sample sizes (solutions) that in general are not integers. Furthermore, when these solutions are rounded, there is no guarantee of global optimality (Wolsey, 1998).

Alternatively, since the sample sizes (the variables of the problem) must be integers, one could consider applying an integer programming method, for example Branch and Bound (Land and Doig, 1960; Wolsey, 1998; Wolsey and Nemhauser, 1999). However, the nonlinearity of the constraints of type (12) with respect to the variables $n_h$ (sample sizes) prevents the direct application of these methods. Given these observations, the following section proposes a new integer programming formulation that solves this problem and is equivalent to the formulation defined by (6), (7), (9) and (12).
More specifically, solving this formulation produces the integer sample sizes $n_h$ with the smallest total that can be allocated to the strata while satisfying constraints (7) and (12). In other words, its solution is guaranteed to be a global optimum (Wolsey, 1998) with regard to the value of the objective function defined in (6).

3. The Proposed Formulation

From an optimization standpoint, solving the problem defined by (6), (7), (9) and (12) involves determining which sample sizes $n_1, n_2, \dots, n_H$ should be chosen from the sets $A_h = \{1, 2, 3, \dots, N_h\}$ ($h = 1, \dots, H$) in order to satisfy the constraints of the problem defined in the previous section and produce the minimum value of the objective function defined in (6). As in any optimization problem, the decision variables of the model need to be defined. To this end, we introduce a binary variable $x_{hk}$ that takes the value 1 if the sample size $k \in A_h$ is allocated to stratum $h$ ($h = 1, \dots, H$), and 0 otherwise. From the definition of this variable and equations (6), (7), (9) and (12), we can write the following Binary Integer Programming (BIP) formulation (Wolsey and Nemhauser, 1999):

$$\text{Minimize} \quad \sum_{h=1}^{H} \sum_{k=1}^{N_h} k \cdot x_{hk} \tag{13}$$

subject to

$$\sum_{k=1}^{N_h} x_{hk} = 1, \quad h = 1, \dots, H \tag{14}$$

$$\sum_{h=1}^{H} \left( \sum_{k=1}^{N_h} \frac{p_{hj}}{k} \cdot x_{hk} - \frac{p_{hj}}{N_h} \right) \le 1, \quad j = 1, \dots, m \tag{15}$$

$$x_{hk} \in \{0, 1\}, \quad h = 1, \dots, H, \quad k = 1, \dots, N_h \tag{16}$$

In this formulation, constraint (14) ensures that, for each stratum, exactly one variable $x_{hk}$ takes the value 1. This is equivalent to selecting exactly one value $k$ (sample size) from each set $A_h$ ($h = 1, \dots, H$). Because exactly one $x_{hk}$ equals 1 in each stratum, $n_h = \sum_k k \cdot x_{hk}$ and $1/n_h = \sum_k x_{hk}/k$, so constraint (15) is linear in the variables $x_{hk}$ and equivalent to constraint (12) of the original formulation. This formulation does not consider different costs for allocating sample units to the respective strata, i.e., the allocation costs are unitary.
If stratum-specific costs need to be considered, the objective function given in (13) can be defined as follows:

$$\sum_{h=1}^{H} \sum_{k=1}^{N_h} C_h \cdot k \cdot x_{hk} \tag{17}$$

where $C_h$ is the cost of allocating one sample unit to stratum $h$ ($h = 1, \dots, H$). Another issue that can be addressed is a minimum sample size to be allocated to each stratum, that is, $n_h \ge n_{\min}$ ($h = 1, \dots, H$), with $n_{\min} = 2, \dots$. This requirement can be handled by including the following constraint:

$$\sum_{k=1}^{n_{\min} - 1} x_{hk} = 0, \quad h = 1, \dots, H \tag{18}$$

To illustrate this formulation, consider the following simple example (without (17) and (18)), with three strata ($H = 3$), $N_1 = 3$, $N_2 = 5$ and $N_3 = 4$, and only one survey variable ($j = 1$). The proposed formulation is as follows:

$$\text{Minimize} \quad 1 x_{11} + 2 x_{12} + 3 x_{13} + 1 x_{21} + 2 x_{22} + 3 x_{23} + 4 x_{24} + 5 x_{25} + 1 x_{31} + 2 x_{32} + 3 x_{33} + 4 x_{34}$$

subject to

$$x_{11} + x_{12} + x_{13} = 1 \quad (h = 1)$$

$$x_{21} + x_{22} + x_{23} + x_{24} + x_{25} = 1 \quad (h = 2)$$

$$x_{31} + x_{32} + x_{33} + x_{34} = 1 \quad (h = 3)$$

$$\begin{aligned}
& p_{11} \left( x_{11} + \tfrac{1}{2} x_{12} + \tfrac{1}{3} x_{13} \right) - \tfrac{p_{11}}{3} \\
& + p_{21} \left( x_{21} + \tfrac{1}{2} x_{22} + \tfrac{1}{3} x_{23} + \tfrac{1}{4} x_{24} + \tfrac{1}{5} x_{25} \right) - \tfrac{p_{21}}{5} \\
& + p_{31} \left( x_{31} + \tfrac{1}{2} x_{32} + \tfrac{1}{3} x_{33} + \tfrac{1}{4} x_{34} \right) - \tfrac{p_{31}}{4} \le 1
\end{aligned}$$

$$x_{11}, x_{12}, x_{13}, x_{21}, x_{22}, x_{23}, x_{24}, x_{25}, x_{31}, x_{32}, x_{33}, x_{34} \in \{0, 1\}$$

In general, integer programming formulations (including BIPs) are solved by applying an implicit enumeration method such as Branch and Bound. Methods of this kind (Wolsey and Nemhauser, 1999) find an optimal solution of an integer program by solving only a subset of the subproblems associated with the feasible region, rather than enumerating every solution. These methods were developed from the pioneering work of Land and Doig (1960).
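To make the structure of (13)-(16) concrete, the following Python sketch builds the objective and the linear constraint (15) for the three-stratum example and solves it. It is only an illustration under stated assumptions: the values of $p_{h1}$ are hypothetical, the function names are invented for this sketch, and the search is plain exhaustive enumeration rather than Branch and Bound, so it is practical only for tiny instances.

```python
from itertools import product


def build_bip(N, p):
    """Return (c, A_ub, b_ub, offsets) for formulation (13)-(16).

    The variables x_hk are flattened stratum by stratum:
    x_11..x_1N1, x_21..x_2N2, and so on.
    """
    sizes = [k for Nh in N for k in range(1, Nh + 1)]          # k of each column
    strata = [h for h, Nh in enumerate(N) for _ in range(Nh)]  # h of each column
    c = [float(k) for k in sizes]                              # objective (13)
    # Constraint (15), with the constant terms p_hj / N_h moved to the right.
    A_ub = [[pj[h] / k for h, k in zip(strata, sizes)] for pj in p]
    b_ub = [1.0 + sum(phj / Nh for phj, Nh in zip(pj, N)) for pj in p]
    offsets = [sum(N[:h]) for h in range(len(N))]              # first column of stratum h
    return c, A_ub, b_ub, offsets


def solve_by_enumeration(N, p):
    """Try every x satisfying (14) and (16); keep the cheapest one satisfying (15)."""
    c, A_ub, b_ub, offsets = build_bip(N, p)
    best_n, best_cost = None, float("inf")
    for n in product(*(range(1, Nh + 1) for Nh in N)):
        x = [0] * len(c)
        for h, k in enumerate(n):
            x[offsets[h] + k - 1] = 1  # x_hk = 1: size k is chosen for stratum h
        feasible = all(
            sum(a * xi for a, xi in zip(row, x)) <= rhs
            for row, rhs in zip(A_ub, b_ub)
        )
        cost = sum(ci * xi for ci, xi in zip(c, x))
        if feasible and cost < best_cost:
            best_n, best_cost = n, cost
    return best_n, best_cost


# Hypothetical constants p_h1 for the example with N = (3, 5, 4) and m = 1.
N = [3, 5, 4]
p = [[0.9, 2.0, 1.2]]
best_n, best_cost = solve_by_enumeration(N, p)
print(best_n, best_cost)
```

With these hypothetical constants, the smallest feasible total sample size is 7. The vectors returned by `build_bip` have the shape that a mixed-integer solver would typically expect (an objective vector, inequality rows and right-hand sides), which is why constraint (15) is stored with its constant terms moved to the right-hand side.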
4. Computational Results

This section provides a small set of computational results based on the application of the proposed formulation and of an enhanced version of the algorithm proposed by Bethel (1989). For the formulation, we created a function (called BSM) in the statistical software R (http://www.r-project.org) using the lpSolve package. The algorithm of Bethel is available in the SamplingStrata package (also in R). The computational experiments were performed on a computer with 24 GB of RAM and 3.40 GHz (i7) processors.

To evaluate the formulation, three different population databases were used: (1) POP_CAFE (Agricultural Census, 1998), (2) POP_FAZENDA_CANA and (3) POP_FAZENDA_GADO. Table 1 provides some information on these populations: the number of strata (H), the number of survey variables (m) and the total number of units (N).

Table 1 – Information about the databases used.

Population          H    m    N
POP_CAFE            3    3    20472
POP_FAZENDA_CANA    4    3    338
POP_FAZENDA_GADO    7    2    430

Tables 2 and 3 show, respectively, the cvs and sample sizes (n) produced by the proposed formulation and by the algorithm of Bethel. These tables show that the proposed formulation produced smaller total sample sizes than the algorithm from the literature in all three populations (2545 against 2546, 144 against 146, and 217 against 219), while meeting the target cvs.

Table 2 – Results of the proposed formulation (BSM*).

Population          n (BSM)    Target cv    cv produced by BSM (%)
                                            j=1     j=2     j=3
POP_CAFE            2545       5%           1.23    4.99    2.91
POP_FAZENDA_CANA    144        2%           1.81    1.99    1.89
POP_FAZENDA_GADO    217        10%          9.99    8.18

* Proposed formulation by Brito, Semaan and Maculan.

Table 3 – Results of the algorithm of Bethel.

Population          n (Bethel)    Target cv    cv produced by the algorithm of Bethel (%)
                                               j=1     j=2     j=3
POP_CAFE            2546          5%           1.23    5.00    2.91
POP_FAZENDA_CANA    146           2%           1.78    1.96    1.84
POP_FAZENDA_GADO    219           10%          9.89    8.04

Naturally, a significant number of populations needs to be evaluated in order to quantify the gain from applying the proposed formulation compared with the algorithm of Bethel and, eventually, with other algorithms from the literature. New computational experiments will therefore be performed in future work, considering populations of varying size (N) and a larger number of survey variables.

References

Bazaraa, M.S., Sherali, H.D. and Shetty, C.M., Nonlinear Programming: Theory and Algorithms. John Wiley and Sons, New York, Third Edition, 2006.
Bethel, J., Sample Allocation in Multivariate Surveys. Survey Methodology, 15 (1), pp. 47-57, 1989.
Cochran, William G., Sampling Techniques. Third Edition, Wiley, 1977.
Day, C.D., A Multi-Objective Evolutionary Algorithm for Multivariate Optimal Allocation, Section on Survey Research Methods – JSM, 2010.
García, J.A.D. and Cortez, L.U., Optimum Allocation in Multivariate Stratified Sampling: Multi-Objective Programming, Comunicaciones del CIMAT, no. I-06-07/28-03-2006, 2006.
Huddleston, H.F., Claypool, P.L. and Hocking, R.R., Optimal Sample Allocation to Strata Using Convex Programming, Journal of the Royal Statistical Society, Series C (Applied Statistics), 19 (3), 1970.
Ismail, M.V., Nasser, K. and Ahmad, Q.S., Solution of a Multivariate Stratified Sampling Problem through Chebyshev's Goal Programming, Pak. J. Stat. Oper. Res., vol. VII, 1, pp. 101-108, 2011.
Khan, M.G.M. and Ahsan, M.J., A Note on Optimum Allocation in Multivariate Stratified Sampling. S. Pac. Journal Nat. Sci., 21, pp. 91-95, 2003.
Khan, M.F., Ali, I. and Ahmad, Q.S., Chebyshev Approximate Solution to Allocation Problem in Multiple Objective Surveys with Random Costs, American Journal of Computational Mathematics, 1, pp. 247-251, 2011.
Kokan, A.R., Optimum Allocation in Multivariate Surveys. Journal of the Royal Statistical Society, Series A, 126 (4), pp. 557-565, 1963.
Kokan, A.R. and Khan, S., Optimum Allocation in Multivariate Surveys: An Analytical Solution, Journal of the Royal Statistical Society, Series B, 29 (1), pp. 115-125, 1967.
Kozak, M., Multivariate Sample Allocation: Application of Random Search Method, Statistics in Transition, 7 (4), pp. 889-900, 2006.
Land, A.H. and Doig, A.G., An Automatic Method for Solving Discrete Programming Problems, Econometrica, 28 (3), pp. 497-520, 1960.
Lohr, S.L., Sampling: Design and Analysis. Brooks/Cole, Cengage Learning, 2010.
Luenberger, D.G. and Ye, Y., Linear and Nonlinear Programming. Springer, Third Edition, 2008.
Valliant, R. and Gentle, J.E., An Application of Mathematical Programming to Sample Allocation, Computational Statistics & Data Analysis, 25, pp. 337-360, 1997.
Wolsey, L.A., Integer Programming. Wiley-Interscience Series in Discrete Mathematics and Optimization, 1998.
Wolsey, L.A. and Nemhauser, G.L., Integer and Combinatorial Optimization. Wiley-Interscience Series in Discrete Mathematics and Optimization, 1999.