Identifying the potential of sample overlap in evidence synthesis of observational studies

Sample overlap is a common issue in evidence synthesis in the field of medical research, particularly when integrating findings from observational studies utilizing existing databases such as registries. Due to the general inaccessibility of unique i…

Authors: Zhentian Zhang, Tim Friede, Tim Mathes

Zhentian Zhang 1*, Tim Friede 1 and Tim Mathes 1

1* Department of Medical Statistics, University Medical Center Göttingen, Lower Saxony, Germany.

*Corresponding author(s). E-mail(s): zhentian.zhang@med.uni-goettingen.de; Contributing authors: tim.friede@med.uni-goettingen.de; tim.mathes@med.uni-goettingen.de;

Abstract

Sample overlap is a common issue in evidence synthesis in the field of medical research, particularly when integrating findings from observational studies utilizing existing databases such as registries. Due to the general inaccessibility of unique identifiers for each observation, addressing sample overlap has been a complex problem, potentially biasing evidence synthesis outcomes and undermining their credibility. We developed a method to construct indicators for the degree of sample overlap in evidence synthesis of studies based on existing data. Our method, rooted in set theory and based on coding the ranges of several well-selected sample characteristics, offers a practical solution by drawing inferences from sample characteristics rather than from individual participant data. Useful information, such as the overlap-free sample set with the largest sample size in an evidence synthesis, can be derived from this method. We applied our model to several real-world evidence syntheses, demonstrating its effectiveness and flexibility. Our findings highlight the growing importance of addressing sample overlap in evidence synthesis, especially with the increasing relevance of secondary use of data, an area currently under-explored in research.
Keywords: overlap, evidence synthesis, meta-analysis, aggregated data, observational study

1 Introduction

Due to the rapid advancements in information technology and the expansion of data infrastructures in recent years, the medical field has witnessed an unprecedented increase in the volume of data produced and stored. However, a significant proportion of this data is redundant. Specifically, personal medical information, biomedical data, and clinical trial observations are often documented in multiple repositories, servers, and registers.

Observational studies that rely on existing data, such as clinical registries, represent important resources for answering epidemiological and clinical questions. Evidence synthesis is prone to bias due to sample overlap, especially when multiple included studies are based on similar databases (Mathes et al, 2023). Given the escalating volume of electronically collected and stored data, there is an anticipated surge in observational studies, which increases the risk of incorporating overlapping study samples into the same evidence synthesis. For simplicity, in this article we focus on registries, but the problem also exists for other types of electronic databases.

Overlap between samples can have a significant impact on the results of evidence synthesis. For example, Hussein et al (2022) showed that in a meta-analysis, overlap can cause spuriously high precision and considerable changes to the overall relative effect estimates. Despite the potentially critical impact of overlap, awareness of this issue is often missing. Although there are some overviews of the overlap problem (Mathes et al, 2023), the lack of a systematic approach to handle this problem may further hinder researchers in addressing it.
Several attempts to address similar problems statistically have been made in the area of genome research, where secondary use of samples is common practice (Lin and Sullivan, 2009; Han et al, 2016; Jin and Shi, 2020). Lin and Sullivan (2009) introduced a framework for the meta-analysis of GWAS (genome-wide association studies) that can handle overlapping subjects between studies, and that is simple to implement and computationally feasible for large GWAS. Han et al (2016) proposed a method that transforms the covariance structure of the data, which can then be used in downstream analyses, and demonstrated the flexibility of the method using empirical datasets. Jin and Shi (2020) proposed a method that resolves the overlap issue in gene-environment testing, utilizing Lin and Han's correlation structures to generalize covariance matrices from the regular meta-regression model, and provided statistical tests for the joint effects of the gene main effect. However, there is very limited knowledge on the performance of their methods due to a lack of comparison and application to (comparative) observational studies in other health care areas. In addition, in all three studies mentioned above, the overlapping portion of the samples from different genomics databases is assumed to be known, whereas in fields other than genomics it might be difficult to estimate the true proportion of sample overlap.

Efforts were also made in more general scenarios (Bom and Rachinger, 2020; Lunny et al, 2021; Wolery et al, 2010). Bom and Rachinger's approach uses a generalized weight solution to handle sample overlap. Because it approximates the variance-covariance matrix that describes the correlation structure between outcomes, the method again requires the number of overlapping subjects, as in the GWAS studies, which is rarely available outside the GWAS field.
Also, their approach deviates from the common way of inverse variance weighting and is thus less easy to interpret. Lunny et al (2021) considered overlap of studies instead of overlap of observations between studies, which is a related but different problem because the overlapping unit is the study. Wolery et al (2010) compared four overlap methods for synthesizing single-subject data with visual analysts' judgments, which only concerns the special case of single-subject designs (for example, the overlapping percentage of before- and after-treatment observations of a single patient). To our knowledge, no established method or guidance exists on how to identify, quantify and account for overlap of observations (e.g. patients) systematically in evidence synthesis.

The focus of this paper is to provide a first step in addressing the sample overlap problem in evidence synthesis in the field of medical research. We will develop a theory to describe overlap, design methods and algorithms to derive useful information, such as the overlap-free sample sets with the largest sample size and the graph of the potential of overlap, and apply the methods in real-world evidence synthesis to show their practical value. Although the method is designed to improve systematic reviews and meta-analyses of observational studies in medical research, the idea can be relevant in many other fields.

2 Theoretical approach to identify sample overlap in evidence synthesis

2.1 Preliminaries

2.1.1 Overlapping data vs longitudinal/clustered data

The overlap considered in this paper refers to the same observations being included in multiple studies, which causes problems when we want to combine their results. This differs from clustered data structures, such as repeated measurements from the same patient in longitudinal settings or observations within the same unit in hierarchical models.
In those cases, the data points correspond to distinct observations and therefore contribute additional information even if the outcomes are identical, albeit with dependencies that must be modeled. By contrast, in the overlapping case, repeated inclusion of records of the same observation does not increase the amount of information, as the underlying events are identical.

2.1.2 Overlap as a multivariate relationship

Overlap is a multivariate relationship. In contrast to a bivariate relationship such as correlation, it cannot be fully represented using a single matrix. Figure 1 shows an example of that. For an evidence synthesis of $n$ study samples, $2^n - n - 1$ combinations are of interest: $2^n$ is the cardinality of the power set of the set of samples, from which we subtract the $n$ single-study subsets and the empty set. Figure 2 shows all the combinations for an evidence synthesis of 6 study samples. Every column represents a combination, and the 57 combinations in the red rectangle are the ones whose overlap we are interested in.

Fig. 1 Pairwise overlap can be insufficient to characterize multivariate overlap. Two configurations are shown in which the pairwise intersections ($A \cap B$, $A \cap C$, $B \cap C$) are identical, but the three-way intersection differs: $A \cap B \cap C \neq \emptyset$ in the left configuration, whereas $A' \cap B \cap C = \emptyset$ in the right. This illustrates that overlap among multiple studies cannot, in general, be recovered from pairwise overlap information alone.

Fig. 2 Study-combinations of $n = 6$ studies. Columns correspond to subsets $A \subseteq \Omega$, rows correspond to studies $S_1, \ldots, S_6$, and a filled entry indicates membership $S_i \in A$. The highlighted block marks the $2^n - n - 1$ non-empty combinations of size at least two for which overlap assessment is relevant.
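The count of relevant study-combinations can be checked with a short enumeration. The following is a minimal Python sketch, not part of the original method description; the helper name `study_combinations` and the study labels are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def study_combinations(studies):
    """Yield every combination of at least two studies (hypothetical helper)."""
    for size in range(2, len(studies) + 1):
        yield from combinations(studies, size)


# Illustrative labels for an evidence synthesis of n = 6 study samples.
studies = ["S1", "S2", "S3", "S4", "S5", "S6"]
combos = list(study_combinations(studies))

n = len(studies)
# Power set (2^n) minus the n singletons minus the empty set.
expected = 2**n - n - 1
print(len(combos), expected)
```

Because the number of combinations grows exponentially in $n$, exhaustive enumeration of this kind is only practical for evidence syntheses with a moderate number of studies.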
2.2 Formal setup

For simplicity and without loss of generality, we will develop our method in the context of meta-analysis of registry-based studies in medicine, but the theory can be applied to evidence synthesis of observational studies in general. Registry studies utilize existing registry data to answer clinical questions. We use "registry" to represent any organized system that stores uniform data on a population (Gliklich et al, 2020). In order to find the data suitable for a clinical question, the data are filtered using certain characteristics of the event (e.g. measure of outcome, time and location of observation, age of subjects). Such characteristics are the measured and stored versions of the underlying characteristics of the observation. Our approach uses information from these characteristics to model overlap of study samples when combining the study results.

2.2.1 Notations

Definition 1 (Intrinsic characteristic vector). Let $u$ uniquely index an underlying observation event (e.g., a particular subject at a particular measurement time/encounter). The intrinsic characteristic vector of that observation event, given $n_k$ intrinsic characteristics, is a (latent) vector

$$x_u = (d_{u,1}, \ldots, d_{u,n_k})$$

collecting these characteristics inherent to that event. By inherent and intrinsic we mean that the characteristics are well-defined for an event: for any $u, v$ and any $k \in \{1, \ldots, n_k\}$,

$$u = v \Rightarrow d_{u,k} = d_{v,k} \quad (\text{equivalently, } d_{u,k} \neq d_{v,k} \Rightarrow u \neq v).$$

The dimensions of $x_u$ can be, for example, the underlying outcome value, the event time and location, and subject/environmental characteristics at the time of observation. The exact values of certain dimensions of $x_u$ may be unknown in practice.
Even for dimensions that are, in principle, measurable at arbitrarily high resolution (e.g., time or location), real-world observations are recorded with finite precision and may be subject to measurement error or missingness. In this section we stay with $x_u$ for framing the basic setup, and show our approach for its real-world versions in the next sections.

Let $S_C = \{x_1, \ldots, x_{n_0}\}$ denote the collective sample, i.e., the whole set of distinct intrinsic characteristic vectors underlying all existing recorded versions of them in relevant registries for the target population. For study $i \in \{1, \ldots, n\}$, let $S_i \subseteq S_C$ be its within-study overlap-free study sample of size $n_i$, written as

$$S_i = \{x_{I_{i,1}}, \ldots, x_{I_{i,n_i}}\}$$

with indices $I_{i,j} \in \{1, \ldots, n_0\}$ and $I_{i,j} \neq I_{i,\ell}$ for $j \neq \ell$. Let $\Omega = \{S_1, \ldots, S_n\}$ denote the set of all study samples in an evidence synthesis. For any $A \subseteq \Omega$, $\bigcup_{S_i \in A} S_i$ is then the pooled sample of the studies in $A$. Denote $S = \bigcup_{S_i \in \Omega} S_i$ as the sample of the meta-analysis. In other words, the meta-analysis sample is the overlap-free aggregation of the intrinsic characteristic vectors that are used by the studies included in the meta-analysis.

Remark. By requiring $I_{i,j} \neq I_{i,\ell}$ for $j \neq \ell$ in the setup, we assume no overlap within each study sample. In other words, we regard the $S_i$ as sets and can thus use the language of set theory to describe the overlap structure among them.

From the setup, we have $S \subseteq S_C$. In the ideal case where there is no sample overlap within a meta-analysis, $S$ is just the direct aggregation of the $S_i$. However, due to the overlap between the $S_i$, $|S|$ can be smaller than the sum of the individual sample sizes of the studies. To know how much smaller it is, we need definitions to describe the finer structure of the overlap relationship between studies.
This motivates the following definition:

Definition 2 (Overlap set and proportion of overlap). For any $A \subseteq \Omega$ with $|A| \geq 2$, define the overlap set

$$O(A) := \bigcap_{S_i \in A} S_i$$

and the proportion of overlap

$$\pi(A) = \frac{|O(A)|}{\left|\bigcup_{S_i \in A} S_i\right|}.$$

For $|A| < 2$ we set $O(A) := \emptyset$ and $\pi(A) := 0$ by convention.

$O(A)$ provides an unambiguous basis for the discussion of overlap; $\pi(A)$ gives a possible quantitative description of the degree of overlap among $A$.

2.2.2 Overlap structure

We defined $\pi(A)$ to quantify overlap. It is just one of the possible descriptions of overlap. We define the overlap structure to generalize it to a range of such descriptions via real-valued functions.

Definition 3 (Overlap structure). Let $\Omega$ be a set of studies. An overlap structure on $\Omega$ is the set

$$\{(A, f(A)) : A \subseteq \Omega\},$$

i.e., the graph of a set function $f : 2^{\Omega} \to \mathbb{R}$ that assigns each study-combination $A$ a real number $f(A)$ quantifying the overlap within $A$.

Examples for $f(A)$:

$$f_1(A) = \mathbb{I}\{O(A) \neq \emptyset\}, \quad f_2(A) = |O(A)|, \quad f_3(A) = \pi(A), \quad f_4(A) = \frac{|O(A)|}{\left(\prod_{S_i \in A} |S_i|\right)^{1/|A|}}.$$

In other words, for a given set $A$, $f_1(A)$ tells us whether there is overlap in $A$, $f_2(A)$ gives the number of overlapping elements, and $f_3(A)$ and $f_4(A)$ give different quantitative descriptions of the degree of overlap.

Example: Let $\Omega = \{S_1, S_2, S_3, S_4\}$; $S = \{x_a, x_b, \ldots, x_i\}$ and

$S_1$ (sample 1) $= \{x_a, x_b, x_c\}$
$S_2$ (sample 2) $= \{x_c, x_d, x_e, x_f\}$
$S_3$ (sample 3) $= \{x_b, x_f\}$
$S_4$ (sample 4) $= \{x_d, x_f, x_g, x_h, x_i\}$

Fig. 3 Example overlap structure for four study samples. The Venn diagram visualizes which observations are shared across $S_1, \ldots, S_4$. Shading intensity reflects multiplicity under naive sample-size addition (darker regions correspond to observations that would be counted more times if overlaps were ignored). This example underlies Table 1, which reports $f(A)$ values for selected study-combinations.

Table 1 Illustrative values of the overlap set $O(A) = \bigcap_{S_i \in A} S_i$ and four overlap summaries for selected study-combinations $A \subseteq \Omega$, where $\Omega = \{S_1, S_2, S_3, S_4\}$ is the toy example shown in Figure 3. The functions are $f_1(A) = \mathbb{I}\{O(A) \neq \emptyset\}$, $f_2(A) = |O(A)|$, $f_3(A) = \pi(A) = |O(A)| / \left|\bigcup_{S_i \in A} S_i\right|$, and $f_4(A) = |O(A)| / \left(\prod_{S_i \in A} |S_i|\right)^{1/|A|}$. For $|A| < 2$ we use the conventions $O(A) = \emptyset$ and $f_1(A) = f_2(A) = f_3(A) = f_4(A) = 0$.

| No. | $A$ | $O(A)$ | $f_1(A)$ | $f_2(A)$ | $f_3(A)$ | $f_4(A)$ |
|---|---|---|---|---|---|---|
| 1 | $\emptyset$ | $\emptyset$ | 0 | 0 | 0 | 0 |
| 2 | $\{S_1\}$ | $\emptyset$ | 0 | 0 | 0 | 0 |
| 3 | $\{S_2\}$ | $\emptyset$ | 0 | 0 | 0 | 0 |
| 4 | $\{S_3\}$ | $\emptyset$ | 0 | 0 | 0 | 0 |
| 5 | $\{S_4\}$ | $\emptyset$ | 0 | 0 | 0 | 0 |
| 6 | $\{S_1, S_2\}$ | $\{x_c\}$ | 1 | 1 | $1/6 \approx 0.167$ | $1/(2\sqrt{3}) \approx 0.289$ |
| 7 | $\{S_1, S_3\}$ | $\{x_b\}$ | 1 | 1 | $1/4 = 0.25$ | $1/\sqrt{6} \approx 0.408$ |
| 8 | $\{S_1, S_4\}$ | $\emptyset$ | 0 | 0 | 0 | 0 |
| 9 | $\{S_2, S_3\}$ | $\{x_f\}$ | 1 | 1 | $1/5 = 0.2$ | $1/\sqrt{8} \approx 0.354$ |
| 10 | $\{S_2, S_4\}$ | $\{x_d, x_f\}$ | 1 | 2 | $2/7 \approx 0.286$ | $1/\sqrt{5} \approx 0.447$ |
| 11 | $\{S_3, S_4\}$ | $\{x_f\}$ | 1 | 1 | $1/6 \approx 0.167$ | $1/\sqrt{10} \approx 0.316$ |
| 12 | $\{S_1, S_2, S_3\}$ | $\emptyset$ | 0 | 0 | 0 | 0 |
| 13 | $\{S_1, S_2, S_4\}$ | $\emptyset$ | 0 | 0 | 0 | 0 |
| 14 | $\{S_1, S_3, S_4\}$ | $\emptyset$ | 0 | 0 | 0 | 0 |
| 15 | $\{S_2, S_3, S_4\}$ | $\{x_f\}$ | 1 | 1 | $1/8 = 0.125$ | $1/(2\sqrt[3]{5}) \approx 0.292$ |
| 16 | $\{S_1, S_2, S_3, S_4\}$ | $\emptyset$ | 0 | 0 | 0 | 0 |

2.3 Identifying overlap structure using aggregated information

The identification and quantification of overlap in evidence synthesis would be straightforward if the intrinsic characteristic vectors were known: for each of the $2^n - n - 1$ combinations of studies in an evidence synthesis, count the number of intrinsic characteristic vectors that appear in all studies in that combination. One of the problems that make this approach impractical is that individual-level data are in general not available to the meta-analyst, for example due to data security reasons. This problem can be partially addressed by using ranges of characteristics instead of their single values.
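If individual-level identifiers were available, the counting procedure described above could be sketched as follows. This is a minimal Python illustration using the four-sample toy example; the function name `overlap_summaries` and the representation of observations as single-letter labels are assumptions made for this sketch only.

```python
from itertools import combinations
from math import prod

# Toy example: each sample is a set of (hypothetically known) observation labels.
samples = {
    "S1": {"a", "b", "c"},
    "S2": {"c", "d", "e", "f"},
    "S3": {"b", "f"},
    "S4": {"d", "f", "g", "h", "i"},
}


def overlap_summaries(names):
    """Return (O(A), f1, f2, f3, f4) for the study-combination `names`."""
    sets = [samples[name] for name in names]
    if len(sets) < 2:
        # Convention for |A| < 2: empty overlap set, all summaries zero.
        return set(), 0, 0, 0.0, 0.0
    overlap = set.intersection(*sets)           # O(A)
    pooled = set.union(*sets)                   # pooled sample of A
    f1 = int(bool(overlap))                     # indicator of non-empty overlap
    f2 = len(overlap)                           # number of shared observations
    f3 = f2 / len(pooled)                       # proportion of overlap pi(A)
    f4 = f2 / prod(len(s) for s in sets) ** (1 / len(sets))
    return overlap, f1, f2, f3, f4


# Overlap structure over all combinations of at least two studies.
for size in range(2, len(samples) + 1):
    for combo in combinations(sorted(samples), size):
        o, f1, f2, f3, f4 = overlap_summaries(combo)
        print(combo, sorted(o), f1, f2, round(f3, 3), round(f4, 3))

# Naive sample-size addition versus the overlap-free meta-analysis sample.
naive_total = sum(len(s) for s in samples.values())
pooled_total = len(set.union(*samples.values()))
```

In this toy example the naive sum of sample sizes exceeds the size of the pooled sample, which is exactly the double counting that the overlap structure is meant to expose.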
In the following we show how we use aggregated information to construct a proxy of the upper bound for the overlap structure. A more detailed explanation of what we mean by aggregated data/information can be found in Appendix B. Following the previous notation, denote by $d_{I_{i,j},k}$ the value of the $k$-th dimension of $x_{I_{i,j}}$. We have:

Proposition 1 (Exclusion of pair-wise overlap by exclusion of the sets of one intrinsic characteristic). Denote $D_{i,k} := \{d_{I_{i,j},k} \mid j \in \{1, 2, \ldots, n_i\}\}$ as the set of the values of the $k$-th dimension of the observations in $S_i$. We have

$$D_{i_1,k} \cap D_{i_2,k} = \emptyset \Rightarrow S_{i_1} \cap S_{i_2} = \emptyset.$$

The proof follows directly by applying the new notation to Definition 1, which tells us that for all $(i_1, i_2, j_1, j_2)$,

$$d_{I_{i_1,j_1},k} \neq d_{I_{i_2,j_2},k} \Rightarrow I_{i_1,j_1} \neq I_{i_2,j_2}.$$

Proposition 1 is equivalent to:

$$\forall k, \quad \bigcap_{i \mid S_i \in A} D_{i,k} = \emptyset \Rightarrow \bigcap_{i \mid S_i \in A} S_i = \emptyset.$$

Remark. According to Proposition 1, if there are multiple intrinsic characteristics, the mutual exclusiveness of the sets of values of any one of the intrinsic characteristics is enough to exclude the existence of overlap.

Remark. Note that overlap of observations leads to overlap between all the sets of the intrinsic characteristics, but not vice versa.

From individual values to reported envelopes. The set $D_{i,k}$ is the set of individual-level values of characteristic $k$ among the observations used in study $i$. In a typical evidence synthesis, individual values are not available to the meta-analyst. What is usually available instead are study-level restrictions reported in the paper or study protocol (e.g., age 18–65, calendar years 2010–2019, ICD codes in a given list, hospitals in a given region). More discussion can be found in Appendix B. We represent the theoretical counterpart of such reported information by a set $R_{i,k}$ that contains all individual values: $D_{i,k} \subseteq R_{i,k}$.
In other words, $R_{i,k}$ is a description of where those values are allowed to lie according to what is reported.

Proposition 2 (Exclusion of overlap by exclusion of the ranges of intrinsic characteristic). Denote $R_{i,k} \supseteq D_{i,k}$ as the ranges of the $d_{I_{i,j},k}$. For all $A \subseteq \Omega$, we have

$$\exists k, \quad \bigcap_{i \mid S_i \in A} R_{i,k} = \emptyset \Rightarrow \bigcap_{i \mid S_i \in A} S_i = \emptyset.$$

The proof can be found in Appendix A.

2.3.1 Potential of overlap

Proposition 2 allows us to make inferences about the qualitative overlap structure among the studies with only ranges of intrinsic characteristics.

Intuitively, the size of $\bigcap_{i \mid S_i \in A} R_{i,k}$ also contains information about overlap: the smaller the common part of the reported ranges, the less overlap is expected; a larger common part leaves more room for overlap. However, in practice, "size" is hard to define in a uniform way across different kinds of characteristics (continuous time, discrete age groups, ICD codes, regions) and across different reporting formats (intervals, sets, unions of intervals). To obtain a representation that is simple and comparable across characteristics, we map ranges to binary vectors after partitioning the entire range into bins. More specifically, we code $R_{i,k}$ into binary vectors through the following steps. For each dimension $k$:

1. Calculate the entire range of all possible values of the $k$-th intrinsic characteristic, $R_{\cdot,k} = \bigcup_{i=1}^{n} R_{i,k}$.
2. Divide $R_{\cdot,k}$ into $m_k$ ordered and mutually disjoint subsets such that the union of these subsets equals $R_{\cdot,k}$. Denote the $l$-th subset as $R_{\cdot,k,l}$, $l \in \{1, 2, \ldots, m_k\}$.
3. For each study $i$, encode the range $R_{i,k}$ as a binary vector $r_{i,k} \in \{0, 1\}^{m_k}$ with entries $r_{i,k,l} = \mathbb{I}\{R_{i,k} \cap R_{\cdot,k,l} \neq \emptyset\}$, $l = 1, \ldots, m_k$, so that $r_{i,k,l} = 1$ iff $R_{i,k}$ overlaps the $l$-th reference interval $R_{\cdot,k,l}$.

Proposition 3 (Excluding pair-wise overlap based on the range vector of intrinsic characteristics).
$$\forall i_1, i_2, \quad r_{i_1,k} \cdot r_{i_2,k} = 0 \Rightarrow S_{i_1} \cap S_{i_2} = \emptyset.$$

The proof can be found in Appendix A.

Theorem 1 (Exclusion of overlapping sample combination). For all $A \subseteq \Omega$ and for all intrinsic characteristics $k$, we have:

$$\prod_k \left( \sum_{l=1}^{m_k} \prod_{i \mid S_i \in A} r_{i,k,l} \right) = 0 \Rightarrow \bigcap_{S_i \in A} S_i = \emptyset.$$

This theorem follows directly from Proposition 2; it only substitutes the $R_{i,k}$ with their partitioned and coded versions.

Definition 4 (Potential of overlap). For all $A \subseteq \Omega$ and for all intrinsic characteristics $k$, given a family of partitions $P = (P_1, \ldots, P_{n_k})$ of $(R_{\cdot,1}, \ldots, R_{\cdot,n_k})$, where $P_k = (R_{\cdot,k,1}, \ldots, R_{\cdot,k,m_k})$, we define the potential of overlap of $A$:

$$\tilde{\pi}_P(A) := \min_k \frac{\sum_{l=1}^{m_k} \prod_{i \mid S_i \in A} r_{i,k,l}}{m_k - \sum_{l=1}^{m_k} \prod_{i \mid S_i \in A} (1 - r_{i,k,l})}$$

Interpretation of the potential of overlap. Consider a partition family $P = (P_1, \ldots, P_{n_k})$ with $P_k = (R_{\cdot,k,1}, \ldots, R_{\cdot,k,m_k})$. Think of the bins $R_{\cdot,k,1}, \ldots, R_{\cdot,k,m_k}$ as a coarse grid for the entire range of the $k$-th intrinsic characteristic. For each study $i$ and characteristic $k$, the reported range $R_{i,k}$ is encoded as a binary vector $r_{i,k} \in \{0, 1\}^{m_k}$, where $r_{i,k,l} = 1$ if the range $R_{i,k}$ covers the $l$-th bin, and $r_{i,k,l} = 0$ otherwise. In other words, $r_{i,k,l}$ simply records whether the $l$-th bin is included in the study's range for that characteristic.

For a study-combination $A \subseteq \Omega$ with $|A| \geq 2$ and a fixed characteristic $k$, define

$$\tilde{\pi}_{P,k}(A) := \frac{\sum_{l=1}^{m_k} \prod_{i \mid S_i \in A} r_{i,k,l}}{m_k - \sum_{l=1}^{m_k} \prod_{i \mid S_i \in A} (1 - r_{i,k,l})} \in [0, 1].$$

This quantity is easy to read:

• The numerator $\sum_{l=1}^{m_k} \prod_{i \mid S_i \in A} r_{i,k,l}$ counts the number of bins that are included by all studies in $A$ (i.e., bins $l$ for which $r_{i,k,l} = 1$ for every $S_i \in A$).
• The denominator $m_k - \sum_{l=1}^{m_k} \prod_{i \mid S_i \in A} (1 - r_{i,k,l})$ counts the number of bins that are included by at least one study in $A$ (i.e., bins $l$ for which $r_{i,k,l} = 1$ for some $S_i \in A$).
So $\tilde{\pi}_{P,k}(A)$ is the proportion of the "combined range" (measured in bins) that is shared by all studies in $A$. The potential of overlap is then

$$\tilde{\pi}_P(A) := \min_{k = 1, \ldots, n_k} \tilde{\pi}_{P,k}(A).$$

We take the minimum because truly shared observations would have to fall into the shared part of the ranges for every characteristic. If even one characteristic has very little shared range (in bins), that alone already limits how much overlap between the studies is plausible given the aggregated information.

Why this is a sensible proxy. The only information used here is which bins are covered by each study's reported range. If, for some characteristic $k$, there is no bin that is covered by all studies in $A$, then there is no way for an observation to satisfy all studies' range restrictions on that characteristic, and overlap is ruled out for $A$ (cf. Theorem 1). When shared bins do exist, $\tilde{\pi}_{P,k}(A)$ summarizes how large the shared part of the ranges is compared with the overall part of the ranges, in a scale-free way.

Basic properties. For any fixed partition family $P$:

• (Range) $\tilde{\pi}_P(A) \in [0, 1]$ for all $A$ with $|A| \geq 2$.
• (Exclusion) If there exists $k$ such that $\sum_{l=1}^{m_k} \prod_{i \mid S_i \in A} r_{i,k,l} = 0$, then $\tilde{\pi}_P(A) = 0$, which is consistent with Theorem 1.
• (Monotonicity in $A$) For a given $k$, when adding more studies to $A$, the number of bins shared by all studies cannot increase, while the number of bins covered by at least one study cannot decrease. Hence $\tilde{\pi}_{P,k}(A)$ is non-increasing as $A$ grows, and therefore $\tilde{\pi}_P(A)$ is also non-increasing in $A$.
• (Adding characteristics) Adding an additional intrinsic characteristic (i.e., taking the minimum over more $k$) cannot increase $\tilde{\pi}_P(A)$.
• (Dependence on the partition) Using a finer partition typically separates values that were previously merged into the same bin.
This often reduces artificial sharing caused by coarse bins, so $\tilde{\pi}_P(A)$ often becomes smaller under finer partitions, although strict monotonicity is not guaranteed.

Conservativeness and limitations. $\tilde{\pi}_P(A)$ is a deterministic function of the reported ranges and the chosen partitioning scheme. Because binning is a coarsening step, different values that would be distinct on the individual level may fall into the same bin; this can make the shared part (in bins) look larger than it truly is. However, without additional assumptions linking individual-level distributions to the reported ranges, a general inequality such as $\tilde{\pi}_P(A) \geq \pi(A)$ cannot be guaranteed. We therefore interpret $\tilde{\pi}_P(A)$ as a feasibility-based measure of overlap potential rather than an estimator of the real overlap proportion.

3 Overlap identification in the real world

3.1 Intrinsic vs key characteristics

In the theoretical part, we derived a proxy of the upper bound of the proportions of overlap, using ranges of the intrinsic characteristics. Yet intrinsic characteristics and their ranges are theoretical constructs whose real values are unknown in the real world. The individual data used in primary studies are not intrinsic characteristic vectors, but transformed versions of them, often extracted from the so-called IPD (individual participant data) in medical research. The transformations happen at steps such as measuring, recording, extracting and cleaning, possibly with losses of resolution (e.g., rounding or categorization).

Once $x_u$ is defined, each dimension of it has a unique, invariant value. By contrast, different transformed versions of the same intrinsic characteristic may differ in precision (e.g., the number of stored digits), in coarsening rules (e.g., rounding schemes), or in representation (e.g., 12/24-hour clocks or different time zones).
They may also differ in how categorical information is partitioned and coded, for instance by using different levels of granularity (e.g., hospital → postcode → city → province → country).

Motivated by this, we distinguish between the intrinsic characteristic vector of an observation and its transformed version that is used for analysis in a study. For a clear separation between the theoretical value of a variable and its transformed (i.e., observed/recorded/coarsened) version in reality, we denote the transformed value of a variable by appending a prime (′).

Definition 5. Let $u$ uniquely index an underlying observation event. For $I_{i,j} = u$, we define

$$x'_{I_{i,j}} = (d'_{I_{i,j},k_1}, d'_{I_{i,j},k_2}, \ldots, d'_{I_{i,j},k_{n'_k}})$$

as the key characteristic vector of the event indexed by $u$ in study $i$, where $k_1, \ldots, k_{n'_k} \in \{1, \ldots, n_k\}$ are distinct indices identifying which intrinsic characteristics are used as key characteristics in study $i$. The $d'_{I_{i,j},k}$ are transformed versions of the intrinsic characteristics $d_{u,k}$ that are used for the analysis in observational study $i$.

Remark. Characteristics representing administrative data, database-related or study-related information (e.g., registry identifier, storage timestamp, current insurance status of the patient, author names and publication year) are not key characteristics, because they are not intrinsic to the underlying event and could vary between studies. Thus they cannot be used for overlap inference in our approach.

3.2 Potential of overlap based on key characteristics

In practice, the meta-analyst observes only study-level restrictions on key characteristics. For each study $i$ and characteristic $k$, let $R'_{i,k}$ denote the reported range of the key characteristic (e.g., age in years, calendar time in months, ICD code set at a chosen level), and let $R'_{\cdot,k} := \bigcup_{i=1}^{n} R'_{i,k}$ be the overall reported range across studies. Choose a partition family $P' = (P'_1, \ldots, P'_{n'_k})$ with $P'_k = (R'_{\cdot,k,1}, \ldots, R'_{\cdot,k,m'_k})$ and encode

$$r'_{i,k,l} = \mathbb{I}\{R'_{i,k} \cap R'_{\cdot,k,l} \neq \emptyset\}, \quad l = 1, \ldots, m'_k.$$

Define the key-characteristic analogue of the potential of overlap by

$$\tilde{\pi}'_{P'}(A) := \min_k \frac{\sum_{l=1}^{m'_k} \prod_{i \mid S_i \in A} r'_{i,k,l}}{m'_k - \sum_{l=1}^{m'_k} \prod_{i \mid S_i \in A} (1 - r'_{i,k,l})}$$

By assuming that the distortion of the mapping from $d_{I_{i,j},k}$ to $d'_{I_{i,j},k}$ is ignorable given our partition of the aggregated information, we can bridge the gap between theory and practice.

Assumption 1 (Partition-compatibility of the transformation). For every included observation $u = I_{i,j}$, every characteristic $k$, and every bin $l$,

$$d'_{I_{i,j},k} \in R'_{\cdot,k,l} \implies d_{u,k} \in R'_{\cdot,k,l}.$$

Equivalently, relative to the chosen partition $P'_k$, the transformation from $d_{u,k}$ to $d'_{I_{i,j},k}$ does not move values across bin boundaries.

Corollary 1 (Real-world overlap exclusion and properties). Under Assumption 1, the exclusion statements derived from Theorem 1 remain valid when replacing $(R_{i,k}, r_{i,k,l})$ by $(R'_{i,k}, r'_{i,k,l})$. In particular, $\tilde{\pi}'_{P'}(A) = 0$ implies $\bigcap_{S_i \in A} S_i = \emptyset$. Moreover, the basic properties of the potential of overlap (range in $[0, 1]$, monotonicity in $A$, and non-increase when adding characteristics) hold for $\tilde{\pi}'_{P'}(A)$ exactly as before, since they depend only on the binary encoding.
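The encoding and the resulting potential of overlap can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study names, the reported ranges (calendar years and regions) and the helper names `encode` and `potential_of_overlap` are hypothetical, ranges are represented as finite sets, and the partition is taken to be the singleton bins.

```python
def encode(reported_range, bins):
    """Binary vector: entry l is 1 iff the reported range intersects bin l."""
    return [int(bool(reported_range & b)) for b in bins]


def potential_of_overlap(ranges, bins_per_char, combo):
    """Minimum over characteristics of (#bins shared by all) / (#bins covered by any)."""
    per_characteristic = []
    for k, bins in enumerate(bins_per_char):
        vectors = [encode(ranges[study][k], bins) for study in combo]
        columns = list(zip(*vectors))                # one column per bin
        shared = sum(all(col) for col in columns)    # numerator
        covered = sum(any(col) for col in columns)   # denominator
        per_characteristic.append(shared / covered if covered else 0.0)
    return min(per_characteristic)


# Hypothetical reported envelopes: [calendar years, regions] for each study.
ranges = {
    "S1": [set(range(2010, 2015)), {"north", "south"}],
    "S2": [set(range(2013, 2018)), {"south"}],
    "S3": [set(range(2016, 2020)), {"south", "east"}],
}

# Singleton bins over the overall reported range of each characteristic.
bins_per_char = [
    [{year} for year in range(2010, 2020)],
    [{"north"}, {"south"}, {"east"}],
]

p12 = potential_of_overlap(ranges, bins_per_char, ("S1", "S2"))
p13 = potential_of_overlap(ranges, bins_per_char, ("S1", "S3"))
p23 = potential_of_overlap(ranges, bins_per_char, ("S2", "S3"))
p123 = potential_of_overlap(ranges, bins_per_char, ("S1", "S2", "S3"))
print(p12, p13, p23, p123)
```

In this hypothetical setting the calendar-year envelopes of S1 and S3 share no bin, so the potential of overlap is zero for every combination containing both, regardless of the shared region. The guard against an empty denominator is a defensive choice of this sketch; it is not needed when every study reports a non-empty range.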
3.3 Example

This toy example illustrates the following points: (i) if linked individual-level extracted data were available to the meta-analyst, one could in principle identify the overlap exactly; (ii) across studies, the same intrinsic characteristic (e.g., time) may appear in the extracted data at different granularity (year vs month vs timestamp with time zone); (iii) in typical evidence synthesis, only study-level envelopes of a few key characteristics are available, and these can be used to compute the overlap potential $\tilde{\pi}'_{P'}$.

A hypothetical individual-level view (not usually available). Table 2 shows what the extracted individual-level data could look like in a hypothetical setting where the meta-analyst has access to linked records across studies. Each row corresponds to an included observation event $u$, but the recorded variables are study-specific versions $d'_{I_{i,j},k}$, not the real values $d_{u,k}$. Crucially, the extracted representation can differ between studies even for the same intrinsic characteristic: for example, time is recorded as a full timestamp with time zone (Study 3), a month-year value (Study 2), or only the year (Study 4).

Table 2 A hypothetical example of extracted individual-level data (not only key characteristics!) for the four studies. This table is shown for illustration only; meta-analysts typically do not have access to such individual-level records.
| Study | ID | Outcome | Location of observation | Time of observation | Group | Publication year of paper |
|---|---|---|---|---|---|---|
| 1 | a | 1 | area 1 | Aug-18-2021 | Treatment | 2023 |
| 1 | b | 0 | area 2 | Dec-03-2022 | Control | 2023 |
| 1 | c | 0 | area 2 | Nov-13-2022 | Control | 2023 |
| 2 | d | 1 | area 3 | Feb-2023 | Treatment | 2024 |
| 2 | c | 0 | area 2 | Nov-2022 | Control | 2024 |
| 2 | e | 0 | area 1 | Oct-2023 | Control | 2024 |
| 2 | f | 1 | area 3 | Apr-2023 | Treatment | 2024 |
| 3 | f | 1 | area 3 | Apr-18-2023, 4:40 CET | Treatment | 2023 |
| 3 | b | 0 | area 2 | Dec-03-2022, 5:20 CET | Control | 2023 |
| 4 | f | 1 | area 3 | 2023 | Treatment | 2024 |
| 4 | g | 1 | area 3 | 2023 | Treatment | 2024 |
| 4 | d | 1 | area 3 | 2023 | Treatment | 2024 |
| 4 | h | 1 | area 4 | 2023 | Control | 2024 |
| 4 | i | 0 | area 4 | 2023 | Control | 2024 |

In this constructed example, the IDs b, c, d, f appear in multiple studies, so the realized overlaps (on the latent event level) are non-zero for some combinations. For instance, $S_1$ and $S_4$ share no events, while $S_2$ and $S_4$ share two events (IDs d and f). Of course, in real applications such cross-study linkage is usually not available to the meta-analyst; we include it here only to define the "ground truth" overlap proportion $\pi(A)$ for comparison.

Which characteristics are useful for overlap inference via $\tilde{\pi}'_{P'}$? Not every column in Table 2 is suitable for overlap inference. A characteristic is not useful for our overlap-potential approach if it is (a) non-intrinsic (can differ between studies even for the same event), or (b) not reported in a way that yields comparable study-level envelopes across studies. For example, "publication year" (and also author list, registry identifier, extraction timestamp, etc.) is non-intrinsic and therefore excluded. Outcome is intrinsic, but it is often not reported as a restriction that defines the eligible sample, and for many problems it provides little exclusion power (e.g., binary outcomes).
In contrast, time and location are common intrinsic characteristics and are often reported as sample restrictions (eligibility windows, regions), making them natural key characteristics for $\tilde{\pi}'_{P'}$.

What is typically available: study-level envelopes of key characteristics. In most evidence syntheses, the meta-analyst can extract only aggregated information from the included studies. For this example, Table 3 summarizes the kind of information that is usually accessible: sample size, and (crucially) reported envelopes for location and time.

Table 3 A hypothetical example of aggregated data (constructed from Table 2) that is typically accessible to meta-analysts. '#' means 'the number of'.

| Sample | Sample size | Range of location | Range of time | # subjects: control | # subjects: treatment | # events: control | # events: treatment |
|---|---|---|---|---|---|---|---|
| 1 | 3 | {area 1, area 2} | {2021, 2022} | 2 | 1 | 0 | 1 |
| 2 | 4 | {area 1, area 2, area 3} | {2022, 2023} | 2 | 2 | 0 | 2 |
| 3 | 2 | {area 2, area 3} | {2022, 2023} | 1 | 1 | 0 | 1 |
| 4 | 5 | {area 3, area 4} | {2023} | 2 | 3 | 1 | 3 |

Formally, we interpret the reported "range of location" and "range of time" as envelopes of key characteristics:

$$ R'_{i,1} = \text{reported location envelope}, \qquad R'_{i,2} = \text{reported time envelope (here: years)}. $$

For example, $R'_{1,1} = \{\text{area 1}, \text{area 2}\}$ and $R'_{4,1} = \{\text{area 3}, \text{area 4}\}$, hence $R'_{1,1} \cap R'_{4,1} = \emptyset$, which, under Assumption 1, already excludes overlap between $S_1$ and $S_4$, and also excludes any combination containing both.

Encoding the envelopes and computing $\tilde{\pi}'_{P'}$. Here the overall reported ranges are

$$ R'_{\cdot,1} = \{\text{area 1}, \text{area 2}, \text{area 3}, \text{area 4}\}, \qquad R'_{\cdot,2} = \{2021, 2022, 2023\}. $$

Choose partitions $P'_1$ and $P'_2$ given by the singleton bins, and encode $r'_{i,k,l} = I\{R'_{i,k} \cap R'_{\cdot,k,l} \neq \emptyset\}$. The resulting binary vectors are:

Table 4 Binary encoding of the location envelopes $R'_{i,1}$ for the four-study toy example.
The overall domain is partitioned into singleton bins $R'_{\cdot,1,l}$ (one per area). For each study $i$ and bin $l$, the entry equals $r'_{i,1,l} = I\{R'_{i,1} \cap R'_{\cdot,1,l} \neq \emptyset\}$, i.e., it is 1 if the study's reported location envelope includes that area and 0 otherwise.

| Location | Partition of $R'_{\cdot,1}$ | $r'_{1,1}$ | $r'_{2,1}$ | $r'_{3,1}$ | $r'_{4,1}$ |
|---|---|---|---|---|---|
| area 1 | $R'_{\cdot,1,1}$ | 1 | 1 | 0 | 0 |
| area 2 | $R'_{\cdot,1,2}$ | 1 | 1 | 1 | 0 |
| area 3 | $R'_{\cdot,1,3}$ | 0 | 1 | 1 | 1 |
| area 4 | $R'_{\cdot,1,4}$ | 0 | 0 | 0 | 1 |

Table 5 Binary encoding of the year envelopes $R'_{i,2}$ for the four-study example.

| Year | Partition of $R'_{\cdot,2}$ | $r'_{1,2}$ | $r'_{2,2}$ | $r'_{3,2}$ | $r'_{4,2}$ |
|---|---|---|---|---|---|
| 2021 | $R'_{\cdot,2,1}$ | 1 | 0 | 0 | 0 |
| 2022 | $R'_{\cdot,2,2}$ | 1 | 1 | 1 | 0 |
| 2023 | $R'_{\cdot,2,3}$ | 0 | 1 | 1 | 1 |

For $A_1 = \{S_1, S_2\}$,

$$ \tilde{\pi}'_{P',1}(A_1) = \tfrac{2}{3} \ \text{(shared areas: 1 and 2)}, \qquad \tilde{\pi}'_{P',2}(A_1) = \tfrac{1}{3} \ \text{(shared years: only 2022)}, $$

so

$$ \tilde{\pi}'_{P'}(A_1) = \min\left\{\tfrac{2}{3}, \tfrac{1}{3}\right\} = \tfrac{1}{3}. $$

In contrast, using the (hypothetical) linked individual-level IDs in Table 2, the overlap proportion is

$$ \pi(A_1) = \frac{|O(A_1)|}{\left|\bigcup_{S_i \in A_1} S_i\right|} = \frac{1}{6}, $$

since only one event (ID c) is shared and the pooled sample size is 6.

Comparison across all combinations. Table 6 compares the overlap potential $\tilde{\pi}'_{P'}(A)$, computed only from the envelopes, to the overlap proportion $\pi(A)$, known here only because we constructed a linked toy dataset.

Table 6 Comparison between overlap potential $\tilde{\pi}'_{P'}(A)$ (computed from study-level envelopes of key characteristics using the chosen partition family $P'$) and real overlap proportion $\pi(A) = |O(A)| / \left|\bigcup_{S_i \in A} S_i\right|$ in the four-study toy example. The overlap set $O(A)$ and $\pi(A)$ are shown only because the toy data include linked IDs (Table 2); in real applications, meta-analysts typically observe only the envelopes used to compute $\tilde{\pi}'_{P'}(A)$.

| No. | $A$ | $O(A)$ | $\tilde{\pi}'_{P'}(A)$ | $\pi(A)$ |
|---|---|---|---|---|
| 1 | $\emptyset$ | $\emptyset$ | / | / |
| 2 | $\{S_1\}$ | $\emptyset$ | 0 | 0 |
| 3 | $\{S_2\}$ | $\emptyset$ | 0 | 0 |
| 4 | $\{S_3\}$ | $\emptyset$ | 0 | 0 |
| 5 | $\{S_4\}$ | $\emptyset$ | 0 | 0 |
| 6 | $\{S_1, S_2\}$ | $\{x_c\}$ | 1/3 | 1/6 |
| 7 | $\{S_1, S_3\}$ | $\{x_b\}$ | 1/3 | 1/4 |
| 8 | $\{S_1, S_4\}$ | $\emptyset$ | 0 | 0 |
| 9 | $\{S_2, S_3\}$ | $\{x_f\}$ | 2/3 | 1/5 |
| 10 | $\{S_2, S_4\}$ | $\{x_d, x_f\}$ | 1/4 | 2/7 |
| 11 | $\{S_3, S_4\}$ | $\{x_f\}$ | 1/3 | 1/6 |
| 12 | $\{S_1, S_2, S_3\}$ | $\emptyset$ | 1/3 | 0 |
| 13 | $\{S_1, S_2, S_4\}$ | $\emptyset$ | 0 | 0 |
| 14 | $\{S_1, S_3, S_4\}$ | $\emptyset$ | 0 | 0 |
| 15 | $\{S_2, S_3, S_4\}$ | $\{x_f\}$ | 1/4 | 1/8 |
| 16 | $\{S_1, S_2, S_3, S_4\}$ | $\emptyset$ | 0 | 0 |

The role of $\tilde{\pi}'_{P'}(A)$ is thus best interpreted as a feasibility-based measure of overlap potential: values equal to 0 provide strong exclusion evidence under Assumption 1 (Corollary 1), while positive values indicate that overlap is possible with the reported envelopes but not guaranteed. Moreover, because the construction depends on the chosen partitions and on which key characteristics are available, $\tilde{\pi}'_{P'}(A)$ should not be used as a universal upper bound on $\pi(A)$.

3.4 Visualization of overlap potential

The key-characteristic overlap potential $\tilde{\pi}'_{P'}(A)$ is a set function on study-combinations $A \subseteq \Omega$ (with $|A| \geq 2$), computed from reported envelopes $R'_{i,k}$ after choosing a partition family $P'$. Any visualization therefore is conditional on the selected key characteristics and on $P'$. We suggest two complementary visualizations.

Pairwise heat map

A heat map provides a compact overview of which study pairs have large shared envelope mass (in bins) and which pairs are excluded. This visualization is intuitive and often captures a substantial part of the practical overlap risk, even though overlap is genuinely multivariate and cannot be fully summarized by pairwise information (cf. Section 2.2.2). For the four-study example (Table 6), the heat map is shown in Figure 4.
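The pairwise values behind such a heat map can be computed directly from the reported envelopes of Table 3. The sketch below is illustrative only and rests on one simplification: with singleton bins, the ratio for each characteristic reduces to the number of envelope values shared by all studies divided by the number of values reported by at least one of them. The names `ENVELOPES`, `potential`, and `pairwise_matrix` are hypothetical, and the diagonal is left as `None` because the potential is considered here only for $|A| \geq 2$.

```python
from fractions import Fraction
from itertools import combinations

# Reported envelopes from Table 3: (location envelope, time envelope in years).
ENVELOPES = {
    "S1": ({"area 1", "area 2"}, {2021, 2022}),
    "S2": ({"area 1", "area 2", "area 3"}, {2022, 2023}),
    "S3": ({"area 2", "area 3"}, {2022, 2023}),
    "S4": ({"area 3", "area 4"}, {2023}),
}


def potential(combo):
    """Overlap potential of a study combination, assuming singleton bins."""
    ratios = []
    for k in range(2):  # k = 0: location, k = 1: time
        envelopes = [ENVELOPES[s][k] for s in combo]
        shared = set.intersection(*envelopes)   # bins hit by every study
        covered = set.union(*envelopes)         # bins hit by at least one study
        ratios.append(Fraction(len(shared), len(covered)))
    return min(ratios)


def pairwise_matrix(studies):
    """Symmetric matrix of pairwise potentials; diagonal entries stay None."""
    n = len(studies)
    matrix = [[None] * n for _ in range(n)]
    for a, b in combinations(range(n), 2):
        value = potential((studies[a], studies[b]))
        matrix[a][b] = matrix[b][a] = value
    return matrix


studies = ["S1", "S2", "S3", "S4"]
for label, row in zip(studies, pairwise_matrix(studies)):
    print(label, ["-" if v is None else str(v) for v in row])
```

The off-diagonal entries reproduce rows 6 to 11 of Table 6. The resulting matrix can be converted to floats and handed to any heat-map routine.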
Fig. 4 Pairwise overlap-potential heat map for the four-study toy example. Cell $(i, j)$ shows $\tilde{\pi}'_{P'}(\{S_i, S_j\})$, computed from study-level envelopes of the key characteristics (location and time) at the chosen partition resolution. Values equal to 0 indicate overlap is excluded under Assumption 1; larger values indicate greater overlap compatibility given the reported envelopes.

Combination grid plot (UpSet-plot-style)

To visualize multivariate overlap structure, we plot $\tilde{\pi}'_{P'}(A)$ for selected non-empty combinations $A$ (e.g., all combinations with $\tilde{\pi}'_{P'}(A) > 0$ in small examples, or the top $K$ combinations in larger examples). A practical visualization is a "grid plot" inspired by the UpSet plot suggested in Lex et al (2014). Each column corresponds to one combination $A$, rows correspond to studies $S_1, \dots, S_n$, and a filled cell indicates membership $S_i \in A$. Columns are ordered by decreasing $\tilde{\pi}'_{P'}(A)$, and combinations with $\tilde{\pi}'_{P'}(A) = 0$ may be omitted for conciseness. In the four-study example, all non-zero combinations from Table 6 are shown in Figure 5. For larger $n$, showing all $2^n - n - 1$ combinations is not meaningful; in that case we recommend visualizing only the combinations with the largest $\tilde{\pi}'_{P'}(A)$.

Interpretation reminder. In both plots, $\tilde{\pi}'_{P'}(A) = 0$ provides exclusion evidence under Assumption 1 (Corollary 1). Positive values indicate that overlap is possible with the reported envelopes (given $P'$), but do not imply actual overlap.

4 Utilization of the potential of overlap

4.1 Overlap-free sample combination with the maximum sample size

Based on the potential of overlap, we can derive subsets of $\Omega$ for which the aggregation of all the study samples within them is free of any overlap.
As the sample sizes of the studies are derivable from aggregated data, we can go a step further and choose the subset of study samples with the largest aggregated sample size to perform a new meta-analysis that is completely free of the impact of overlap.

Fig. 5 Grid plot of non-zero overlap potentials in the four-study toy example (Table 6). Each column represents a study-combination $A \subseteq \Omega$; filled cells indicate which studies are included in $A$. Columns are ordered by decreasing $\tilde{\pi}'_{P'}(A)$, and only combinations with $\tilde{\pi}'_{P'}(A) > 0$ are shown.

More specifically, we conduct the following steps to find the set of studies:

1. We find the set $B_0$ of all sets of study samples whose potential of overlap is zero: $B_0 := \{A \mid A \in 2^{\Omega}, \tilde{\pi}(A) = 0\}$.
2. In $B_0$, we keep only the sets all of whose subsets are also in $B_0$, and denote the new set as $B_1$.
3. In $B_1$, we keep only the sets that do not have a proper superset in $B_1$, and denote the result as $B_2$.
4. Choose the set in $B_2$ that has the largest sample size.

Example: For the four-study example, the steps give the following results.

Step 1: $B_0 = \{\emptyset, \{S_1\}, \{S_2\}, \{S_3\}, \{S_4\}, \{S_1, S_4\}, \{S_1, S_2, S_4\}, \{S_1, S_3, S_4\}, \{S_1, S_2, S_3, S_4\}\}$, as seen in Table 6.

Step 2: $B_1 = \{\emptyset, \{S_1\}, \{S_2\}, \{S_3\}, \{S_4\}, \{S_1, S_4\}\}$.

Step 3: $B_2 = \{\{S_2\}, \{S_3\}, \{S_1, S_4\}\}$.

Step 4: $\{S_1, S_4\}$ is chosen because it has the largest sample size (3+5=8), as can be derived from the aggregated data in Table 3. We can add the sample sizes directly because Step 2 ensures that every subset of the chosen combination, and in particular every pair of its samples, has zero potential of overlap.

In our example, the original meta-analysis based on $\{\{S_1\}, \{S_2\}, \{S_3\}, \{S_4\}\}$ would have a spurious sample size of 3+4+2+5=14, and because of the considerable overlap, the actual meta-analysis sample size should be 9 instead. By applying the process we obtain a new overlap-free meta-analysis ($\{S_1, S_4\}$) with a sample size of 8.
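The four steps can be traced programmatically for the four-study example. The following is a minimal sketch, not the authors' implementation: the names `SIZES`, `NONZERO`, `potential`, `all_subsets`, and `largest_overlap_free` are hypothetical, the potentials are copied from Table 6, and the empty set is treated as having potential 0 so that it enters $B_0$ as in Step 1. Enumerating all $2^n$ subsets is an assumption that is reasonable only for a small number of studies.

```python
from fractions import Fraction
from itertools import combinations

SIZES = {"S1": 3, "S2": 4, "S3": 2, "S4": 5}  # sample sizes from Table 3

# Non-zero potentials from Table 6; all other combinations have potential 0.
NONZERO = {
    frozenset({"S1", "S2"}): Fraction(1, 3),
    frozenset({"S1", "S3"}): Fraction(1, 3),
    frozenset({"S2", "S3"}): Fraction(2, 3),
    frozenset({"S2", "S4"}): Fraction(1, 4),
    frozenset({"S3", "S4"}): Fraction(1, 3),
    frozenset({"S1", "S2", "S3"}): Fraction(1, 3),
    frozenset({"S2", "S3", "S4"}): Fraction(1, 4),
}


def potential(combo):
    """Look up the overlap potential of a combination (0 if not listed)."""
    return NONZERO.get(frozenset(combo), Fraction(0))


def all_subsets(studies):
    """Yield every subset of `studies`, including the empty set."""
    studies = sorted(studies)
    for r in range(len(studies) + 1):
        for combo in combinations(studies, r):
            yield frozenset(combo)


def largest_overlap_free(studies, sizes, potential):
    # Step 1: combinations with zero potential of overlap.
    b0 = {a for a in all_subsets(studies) if potential(a) == 0}
    # Step 2: keep combinations all of whose subsets are also in B0.
    b1 = {a for a in b0 if all(sub in b0 for sub in all_subsets(a))}
    # Step 3: keep combinations without a proper superset in B1.
    b2 = {a for a in b1 if not any(a < other for other in b1)}
    # Step 4: pick the combination with the largest total sample size.
    best = max(b2, key=lambda a: sum(sizes[s] for s in a))
    return b0, b1, b2, best


b0, b1, b2, best = largest_overlap_free(SIZES, SIZES, potential)
print(sorted(best), sum(SIZES[s] for s in best))  # ['S1', 'S4'] 8
```

Step 4 is isolated in a single `max` call, so a different selection criterion can be substituted by changing only the `key` function.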
The purpose of computing the overlap-free sample combination with the maximum sample size is that we can perform a meta-analysis based on that combination and compare it with the original meta-analysis to evaluate the impact of overlap. The considerable overlap between the study samples in our example makes the application of the process necessary and helpful. In some cases, however, excluding studies might mean throwing away too much information. We thus suggest an alternative way to look at the problem in Section 4.2.

4.1.1 Alternatives to sample-size maximization.

When selecting a final overlap-free study combination from $B_2$ (Section 4.1), maximizing the total sample size is only one possible criterion. Depending on the goal of the evidence synthesis, other reasonable criteria include (but are not limited to) (i) minimizing between-study heterogeneity, (ii) maximizing the number of included studies, (iii) maximizing the total inverse-variance weight, and (iv) maximizing the incorporated risk-of-bias or reporting-quality scores. These criteria can also be used in combination, for example by imposing minimum requirements (e.g., at least $m$ studies) and then selecting among the remaining candidates using a secondary criterion (e.g., lower heterogeneity), or by combining several criteria into a single score.

Example: Suppose we have an evidence synthesis of 7 studies, where

$$ B_0^c := \{A \mid A \in 2^{\Omega}, \tilde{\pi}'_{P'}(A) \neq 0\} = \{\{S_1, S_3\}, \{S_2, S_3\}, \{S_3, S_4\}, \{S_3, S_6\}, \{S_4, S_7\}\}. $$

Step 1: $B_0$ consists of the sets marked in green in Figure C1 in the appendices.

Step 2: $B_1$ consists of the sets marked in green in Figure C2.

Step 3: $B_2$ consists of the sets marked in green in Figure C3, which are $\{S_3, S_5, S_7\}$, $\{S_1, S_2, S_4, S_5, S_6\}$ and $\{S_1, S_2, S_5, S_6, S_7\}$.

Step 4: We then evaluate these candidates using the chosen selection criteria.
For instance, we can compute $\tau^2$ (or $I^2$) for each candidate and select the combination with lower estimated heterogeneity, while simultaneously enforcing other practical preferences (e.g., choosing among combinations with similar heterogeneity the one including more studies, or the one with larger total inverse-variance weight).

4.2 Proxy of the lower bound of the meta-analysis sample size

When the degree of overlap between study samples is low, we might not want to simply exclude studies, as doing so could significantly shrink the meta-analysis sample size. An alternative is to construct a proxy for a lower bound on the effective meta-analysis sample size

$$ |S| = \left| \bigcup_{S_i \in \Omega} S_i \right|. $$

A second-order bound from the principle of inclusion-exclusion. For any finite sets $\Omega = \{S_1, \dots, S_n\}$, the principle of inclusion and exclusion gives

$$ \left| \bigcup_{S_i \in \Omega} S_i \right| = \sum_{\emptyset \neq A \subseteq \Omega} (-1)^{|A|+1} \left| \bigcap_{S_j \in A} S_j \right|. $$

The potentials of overlap are proxies for the overlap proportions $\pi(A)$. As we do not have non-trivial lower bounds for the $\pi(A)$ (each $\pi(A)$ has only the trivial lower bound of 0), we discard the terms in the sum with $|A| > 2$. Truncating the expansion of the sum after the pairwise terms yields: | S | ≥ n X i =1 | S i | − X 1 ≤ i
