Developing Performance-Guaranteed Biomarker Combination Rules with Integrated External Information under Practical Constraint
In clinical practice, there is significant interest in integrating novel biomarkers with existing clinical data to construct interpretable and robust decision rules. Motivated by the need to improve decision-making for early disease detection, we pro…
Authors: Albert Osom, Camden Lopez, Ashley Alex
Albert Osom*1, Camden Lopez2, Ashley Alexander3, Suresh Chari4, Ziding Feng2, and Ying-Qi Zhao2,1

1 Department of Biostatistics, University of Washington, Seattle, WA
2 Public Health Sciences Division, Fred Hutchinson Cancer Center, Seattle, WA
3 Kelsey Research Foundation, Houston, TX
4 Department of Gastroenterology, University of Texas MD Anderson Cancer Center, Houston, TX
* Email: aosom@uw.edu

Abstract

In clinical practice, there is significant interest in integrating novel biomarkers with existing clinical data to construct interpretable and robust decision rules. Motivated by the need to improve decision-making for early disease detection, we propose a framework for developing an optimal biomarker-based clinical decision rule that is both clinically meaningful and practically feasible. Specifically, our procedure constructs a linear decision rule designed to achieve optimal performance among the class of linear rules by maximizing the true positive rate while adhering to a pre-specified positive predictive value constraint. Additionally, our method can adaptively incorporate individual risk information from external sources to enhance performance when such information is beneficial. We establish the asymptotic properties of our proposed estimator and compare it to the standard approach used in practice through extensive simulation studies. Results indicate that our approach offers strong finite-sample performance. We also apply the proposed methods to develop biomarker-based screening rules for pancreatic ductal adenocarcinoma (PDAC) among new-onset diabetes (NOD) patients.

1 Introduction

Early detection is critical in the fight against cancer and other diseases, as it offers the potential to identify diseases at a stage where they are most treatable.
The concept of early detection is grounded in the principle that earlier intervention following the detection of disease can lead to better outcomes, including higher survival rates, reduced treatment complexity, and improved quality of life for patients. An immense amount of research has been dedicated to identifying candidate biomarkers to aid in early detection. The primary challenge lies in effectively combining these biomarkers to develop a clinical decision rule that satisfies specific practical and clinical needs while optimizing the detection of the target disease. In this regard, Vickers and Elkin (2006) and Kerr et al. (2016) proposed net benefit and standardized net benefit as measures of the clinical utility of risk models; these measures balance the trade-off between the harm and benefit of a model at a clinically relevant cost-benefit ratio. Recently, Wang et al. (2020) developed a learning-based biomarker-assisted rule that optimizes clinical benefit under a pre-specified risk constraint, offering a flexible framework for developing both linear and nonlinear rules. Meisner et al. (2021) introduced a framework for linearly combining biomarkers that maximizes the true positive rate (TPR) for a fixed false positive rate (FPR).

Our work is motivated by early detection of pancreatic ductal adenocarcinoma (PDAC). PDAC has an overall 5-year survival rate of around 13% (Siegel et al., 2025) and has been projected to be the second highest cancer-related cause of death in the United States by 2030 (Rahib et al., 2014). Due to the low prevalence of PDAC, biomarker-based screening must target high-risk populations to be feasible, such as patients over 50 years with new-onset hyperglycemia and diabetes (NOD) (Chari et al., 2005).
Further, cost-effective screening tests are essential to selectively identify PDAC cases, as it is impractical to recommend all at-risk individuals for further workup that is more expensive or invasive. For instance, over one million Americans are diagnosed with new-onset diabetes each year, making broad screening in this population prohibitively expensive.

In the setting of population-level screening, clinicians are often interested in keeping the work-up burden that results from recommending further diagnostic procedures at a controllable level, especially when such follow-up evaluations are costly, invasive, or limited by resource constraints. Positive predictive value (PPV) naturally serves as a clinically meaningful criterion for this need, because it directly quantifies the expected number needed to screen (NNS) to detect one cancer case. For example, in ovarian cancer screening, clinical guidelines recommend a minimum PPV of 10% to justify invasive diagnostic surgery (Jacobs and Menon, 2004). A rule achieving a PPV of 10% is expected to recommend 10 patients for further diagnostic work-up to identify one cancer case. This NNS interpretation is intuitive and compelling for clinicians when evaluating the feasibility of a screening modality.

In contrast, existing biomarker-based decision rules that maximize TPR subject to an FPR constraint (Wang et al., 2020; Meisner et al., 2021) are not well suited for this setting. FPR is a prevalence-free measure and does not directly determine the clinical work-up burden of a screening rule. It is not straightforward to translate an FPR-constrained optimization into a rule that satisfies a pre-specified PPV requirement. For a fixed disease prevalence, PPV is a nonlinear function of both TPR and FPR, and the mapping from PPV to FPR is not identifiable, as there exist infinitely many (TPR, FPR) pairs that achieve the same PPV.
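As a numerical illustration of this non-identifiability, the short Python sketch below computes the PPV implied by a (TPR, FPR) pair through Bayes' rule and, conversely, the FPR at which a rule with a given TPR attains a target PPV exactly. The function names and the numbers (prevalence 0.01, target PPV 0.04) are chosen here for illustration only and are not taken from the authors' software.

```python
def ppv(tpr, fpr, p1):
    """PPV implied by TPR, FPR and prevalence p1 (Bayes' rule)."""
    num = p1 * tpr
    return num / (num + (1.0 - p1) * fpr)


def fpr_for_target_ppv(tpr, alpha, p1):
    """FPR at which a rule with the given TPR has PPV exactly alpha."""
    gamma = p1 / (1.0 - p1)  # prevalence odds
    return gamma * tpr * (1.0 - alpha) / alpha


p1, alpha = 0.01, 0.04  # illustrative prevalence and PPV target
for tpr in (0.50, 0.80, 0.95):
    fpr = fpr_for_target_ppv(tpr, alpha, p1)
    # Each (tpr, fpr) pair below yields the same PPV = alpha,
    # i.e. the same expected number needed to screen, 1 / alpha.
    print(f"TPR={tpr:.2f}  FPR={fpr:.4f}  PPV={ppv(tpr, fpr, p1):.3f}  NNS={1 / alpha:.0f}")
```

Every row attains the same PPV with a different FPR, so a single FPR cutoff cannot stand in for the PPV target unless the achieved TPR is already known.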
Furthermore, in our setting, TPR is both the quantity being optimized and a component of the PPV constraint itself. Consequently, for a fixed PPV level, the admissible FPR is not unique but depends on the achieved TPR. Fixing an FPR threshold to approximate a target PPV while maximizing TPR can therefore yield a suboptimal rule relative to an approach that directly optimizes the clinical objective. See Table 1 for an illustrative numerical example.

Table 1: Comparison of the performance of the FPR-constrained method of Meisner et al. (2021), which maximizes TPR subject to an FPR constraint, with our proposed method (DOOLR), which directly maximizes TPR subject to a PPV constraint. To align the FPR constraint in Meisner et al. (2021) with a given PPV target, we solve the optimization problem over a range of FPR constraints and select the value that yields the desired PPV level. Data are generated under a linear decision rule; see Section 4.1. We report average TPR and PPV with corresponding standard errors (in parentheses) in the test sample across 500 simulation replicates for PPV constraints $\alpha \in \{0.030, 0.040, 0.045\}$ and prevalence $p_1 = 0.01$. Across all values of $\alpha$, DOOLR achieves higher TPR than the FPR-constrained method of Meisner et al. (2021).

α       Measure   Meisner et al. (2021)   DOOLR
0.030   TPR       0.870 (0.177)           0.942 (0.139)
        PPV       0.030 (0.001)           0.035 (0.008)
0.040   TPR       0.818 (0.198)           0.926 (0.120)
        PPV       0.040 (0.001)           0.044 (0.007)
0.045   TPR       0.796 (0.187)           0.907 (0.153)
        PPV       0.045 (0.002)           0.048 (0.013)

Additionally, when developing a new biomarker-based decision rule, there are often published risk models or algorithms available that can provide valuable insights into patient risk profiles. For instance, the Enriching New-Onset Diabetes for Pancreatic Cancer (ENDPAC) model was developed and validated for PDAC screening in NOD patients, using patient characteristics such as weight change, blood glucose change, and age at diabetes onset (Sharma et al., 2018). Other examples include the MyProstateScore test for prostate cancer, endorsed by NCCN guidelines for prebiopsy risk stratification (National Comprehensive Cancer Network, 2023), and the Framingham Risk Score for coronary heart disease (Wilson et al., 1998). Integrating this information could enhance the discriminatory power of a new biomarker decision rule.

In the context of risk prediction, various strategies have been proposed to leverage external information when developing a new risk model. Assuming there is a published risk prediction model available, we consider two established approaches discussed in the literature. The first estimates the parameters of the new risk prediction model using penalized likelihood, incorporating a Kullback-Leibler divergence penalty to enforce similarity between the new and the published model (Kullback and Leibler, 1951; Hector and Martin, 2024; Wang et al., 2021, 2023). The second employs a constraint-based maximum likelihood estimation framework, restricting the extent to which the parameters of the new prediction model deviate from those of the published model to encourage alignment (Cheng et al., 2019; Han et al., 2023; Chatterjee et al., 2016). Both approaches assume either access to external data or correct specification of the published risk model. In contrast, our approach imposes no such assumptions, relying solely on the availability of external risk information (such as risk scores calculated under a published risk model, algorithm, or risk calculator) while maintaining flexibility in model specification. Moreover, as these existing methodologies primarily focus on risk prediction, they do not directly address our research objective, rendering them less suitable for our framework.

In this paper, we introduce a new criterion, termed the PPV-constrained benefit function, from which we develop biomarker-based decision rules. Under this criterion, we learn decision rules that maximize the benefit of screening, measured by TPR, while enforcing clinical usefulness through a PPV constraint. We derive the theoretically optimal rule, from which a plug-in rule can be formed using either parametric or flexible nonparametric estimators, such as random forests (Breiman, 2001), additive models (Hastie and Tibshirani, 1990), or an ensemble of estimators implemented via the SuperLearner (Van der Laan et al., 2007). Considering the importance of interpretability in clinical practice, we further develop an approach using Direct Optimization for Optimal Linear decision Rule (DOOLR). In cases where the optimal decision rule is not linear, the proposed method ensures the development of a linear decision rule that delivers the best possible performance within the entire class of linear rules. Moreover, when external risk information, such as decisions from a published risk model or established biomarkers, is available, we extend the linear decision rule to incorporate this auxiliary information, thereby improving predictive performance while preserving interpretability. We formalize this integration by introducing a penalty term into the objective function that adaptively regulates the extent to which external information is incorporated. Specifically, the penalty term quantifies the cost associated with misalignment between the decisions generated by the new biomarker rule and those recommended by the published risk model among patients with positive disease status.
Importantly, our external information transfer approach does not require access to individual-level external risk scores; it only requires binary decisions (i.e., whether further screening is recommended) derived from an external risk model.

The manuscript is organized as follows. We introduce the PPV-constrained benefit function in Section 2. We propose algorithms for estimating the optimal decision rule that maximizes the PPV-constrained benefit function in Section 2.2. In Section 3, we discuss the asymptotic properties of the proposed estimator of the optimal linear rule. Simulation studies evaluating the proposed procedures are presented in Section 4. In Section 5, we apply our method to the Pancreatic Cancer and a retrospective cohort of controls enriched for new onset Diabetes and pre-diabetes for Research Analyses (PANDORA) study, where our goal is to develop a biomarker-assisted decision rule for screening of PDAC among new-onset diabetes (NOD) patients. In Section 6, we provide a discussion.

2 Methodology

2.1 PPV-constrained benefit function

Let $D$ denote a binary outcome, with $D = 1$ indicating presence of the disease of interest and $D = 0$ otherwise. Let $X$ be the set of biomarkers and other clinical characteristics used for deriving clinical decision rules. Denote by $\mathcal{R} = \{d(X) : d(X) \in \{0, 1\}\}$ the class of individualized clinical decision rules, such that $d(X) = 1$ means the patient is recommended for a further diagnostic procedure, and $d(X) = 0$ otherwise. Let $\mathrm{TPR}(d(X)) = \mathrm{pr}(d(X) = 1 \mid D = 1)$ and $\mathrm{PPV}(d(X)) = \mathrm{pr}(D = 1 \mid d(X) = 1)$. We are interested in constructing biomarker combination rules that maximize TPR under a pre-specified, clinically meaningful positive predictive value, $\alpha$. The problem can be formulated as

\[
\max_{d(X) \in \mathcal{R}} \mathrm{TPR}(d(X)) \quad \text{subject to} \quad \mathrm{PPV}(d(X)) \ge \alpha. \tag{1}
\]

For $k \in \{0, 1\}$, let $p_k := \mathrm{pr}(D = k)$, where $p_1$ denotes the disease prevalence, and let $\mathrm{pr}(X)$ and $\mathrm{pr}(D = k \mid X)$ denote the density function of $X$ and the conditional probability of $D = k$ given $X$, respectively. We assume $p_1$ is either known or can be estimated from independent data sources. We further assume that both $\mathrm{pr}(D = 1 \mid X)$ and $\mathrm{pr}(X)$ are continuous functions of $X$, and that $\delta_0 < \mathrm{pr}(D = k \mid X) < 1 - \delta_0$ and $\delta_1 < p_k < 1 - \delta_1$ almost surely for some positive constants $\delta_0$ and $\delta_1$. For notational simplicity, let $\gamma = p_1 / (1 - p_1)$, $\eta_1(X) = \mathrm{pr}(D = 1 \mid X) / p_1$, and $\eta_0(X) = \mathrm{pr}(D = 0 \mid X) / p_0$, where $p_0 = 1 - p_1$. We also denote by $E_X$ the expectation with respect to the marginal distribution of $X$. We can then write (1) as

\[
\max_{d(X) \in \mathcal{R}} E_X\left[1\{d(X) = 1\}\,\eta_1(X)\right]
\quad \text{subject to} \quad
\frac{\gamma E_X\left[1\{d(X) = 1\}\,\eta_1(X)\right]}{\gamma E_X\left[1\{d(X) = 1\}\,\eta_1(X)\right] + E_X\left[1\{d(X) = 1\}\,\eta_0(X)\right]} \ge \alpha. \tag{2}
\]

Our goal is to find the optimal decision rule $d^*(X)$ that solves (2). The Lagrangian corresponding to (2) is given by

\[
L(d(X), \lambda) = E_X\left[1\{d(X) = 1\}\left\{\eta_1(X) - \lambda\alpha\gamma\,\eta_1(X) - \lambda\alpha\,\eta_0(X) + \lambda\gamma\,\eta_1(X)\right\}\right].
\]

We can obtain $d^*(X)$ by maximizing $L(d(X), \lambda)$ for a fixed $\lambda$ and choosing the optimal $\lambda$ as the solution to $E_X\left[1\{d(X) = 1\}\left(\gamma\eta_1(X) - \alpha\gamma\eta_1(X) - \alpha\eta_0(X)\right)\right] = 0$, that is, the value at which the PPV constraint holds with equality. The optimal rule is derived as $d^*(X) = 1\{f_{\lambda^*}(X) > 0\}$, where

\[
f_{\lambda^*}(X) = \eta_1(X) - \lambda^*\alpha\gamma\,\eta_1(X) - \lambda^*\alpha\,\eta_0(X) + \lambda^*\gamma\,\eta_1(X)
\]

for a pre-specified $\alpha$, and $\lambda^* > 0$ solves

\[
E_X\left[1\{f_{\lambda}(X) > 0\}\left(\alpha\gamma\,\eta_1(X) + \alpha\,\eta_0(X) - \gamma\,\eta_1(X)\right)\right] = 0.
\]

We show optimality of $d^*(X)$ in Supplementary Appendix A.

2.2 Estimation of the optimal decision rule

Suppose data $\{D_i, X_i\}_{i=1}^{n}$ are collected from $n$ independent and identically distributed subjects.
We denote by $n_1$ the number of diseased subjects and by $n_0$ the number of non-diseased subjects, such that $n = n_0 + n_1$. Given the form of the theoretical optimal decision rule, a straightforward method for estimation is to replace the expectations in the theoretical optimal decision rule and the constraint with their empirical analogs. Let $\hat\eta_1(X) = \widehat{\mathrm{pr}}(D = 1 \mid X) / p_1$ and $\hat\eta_0(X) = \widehat{\mathrm{pr}}(D = 0 \mid X) / p_0$ be estimates of $\eta_1(X)$ and $\eta_0(X)$, respectively, where $\mathrm{pr}(D = 1 \mid X)$ can be estimated using either parametric (e.g., logistic regression) or data-adaptive approaches (e.g., random forest, neural network, SuperLearner). The estimated optimal decision rule is then

\[
\hat d_{\hat\lambda}(X) = 1\left\{\hat\eta_1(X) - \hat\lambda\alpha\gamma\,\hat\eta_1(X) - \hat\lambda\alpha\,\hat\eta_0(X) + \hat\lambda\gamma\,\hat\eta_1(X) > 0\right\},
\]

where $\hat\lambda$ solves

\[
n^{-1}\sum_{i=1}^{n}\left[1\{\hat d_{\hat\lambda}(X_i) = 1\}\left(\alpha\gamma\,\hat\eta_1(X_i) + \alpha\,\hat\eta_0(X_i) - \gamma\,\hat\eta_1(X_i)\right)\right] = 0.
\]

We refer to the method discussed above as the plug-in approach. This method is flexible and can accommodate various estimation techniques (e.g., random forest, neural network, additive models, local polynomials).

Remark 1. When estimating $\mathrm{pr}(D = 1 \mid X)$ in the above plug-in rule using data obtained under case-control sampling, we need to make adjustments to accurately estimate $\mathrm{pr}(D = 1 \mid X)$. To be specific, let $S$ be the indicator of being included in the case-control sample. We can obtain an estimate of $\mathrm{pr}(D = 1 \mid X, S)$ from the case-control data. However, we are typically more interested in estimating $\mathrm{pr}(D = 1 \mid X)$. In this case, we make an adjustment using the relationship

\[
\frac{\widehat{\mathrm{pr}}(D = 1 \mid X)}{\widehat{\mathrm{pr}}(D = 0 \mid X)} = \frac{\widehat{\mathrm{pr}}(D = 1 \mid X, S)}{\widehat{\mathrm{pr}}(D = 0 \mid X, S)} \cdot \frac{n_0}{n_1} \cdot \frac{p_1}{1 - p_1}
\]

(Huang and Pepe, 2010).
If we estimate $\mathrm{pr}(D = 1 \mid X)$ using logistic regression, this adjustment simplifies to adding $\log\left[p_1 n_0 / \{(1 - p_1) n_1\}\right]$ to the intercept of the estimated coefficients.

In clinical practice, simpler decision rules are preferred for their ease of implementation and interpretability. A common choice is a linear decision rule, usually obtained in two steps: first, estimate a patient's disease probability using logistic regression; second, apply a risk score cutoff to achieve the pre-specified clinical constraint. While intuitive, this method can underperform when model assumptions are incorrect and is often ad hoc, meaning it may not yield the optimal rule for a given performance measure. This challenge led us to develop the Direct Optimization for Optimal Linear Decision Rule (DOOLR) approach, which constructs a robust linear rule that is asymptotically optimal within the class of all linear rules.

2.2.1 Direct optimization for optimal linear decision rule

Let $\mathcal{R}_L$ be the class of linear decision rules of the form $d(\mathbf{X}, \beta) = 1\{\mathbf{X}^T\beta > 0\}$, where $\mathbf{X} = (1, X)$ augments the $p$-dimensional feature variables $X$ with an intercept term and $\beta = (\beta_0, \beta_1)$. Denote by $\mathbf{X}_{1i}$ and $\mathbf{X}_{0j}$ the covariates for diseased and non-diseased patients, respectively. Let $d^*_L(\mathbf{X}, \beta^*) = 1\{\mathbf{X}^T\beta^* > 0\}$ be the optimal linear rule that maximizes the objective in (1) among $\mathcal{R}_L$. The empirical analog to (1) using the observed data for the class of linear decision rules can be written as

\[
\max_{\beta \in \mathbb{R}^{p+1}} \; n_1^{-1}\sum_{i=1}^{n_1} 1\{\mathbf{X}_{1i}^T\beta > 0\}
\quad \text{subject to} \quad
\widehat{\mathrm{PPV}}(\beta) = \frac{\gamma\, n_1^{-1}\sum_{i=1}^{n_1} 1\{\mathbf{X}_{1i}^T\beta > 0\}}{\gamma\, n_1^{-1}\sum_{i=1}^{n_1} 1\{\mathbf{X}_{1i}^T\beta > 0\} + n_0^{-1}\sum_{j=1}^{n_0} 1\{\mathbf{X}_{0j}^T\beta > 0\}} \ge \alpha. \tag{3}
\]

Figure 1: Comparison of the zero-one indicator with $\Phi(x/h)$ for different values of $h$ (curves shown for $h = 1$, $h = 0.5$, $h = 0.05$, and the zero-one indicator, for $x$ between $-4$ and $4$).
We can observe that, when $h$ approaches zero, the approximation of the zero-one indicator with $\Phi(x/h)$ becomes more accurate.

The optimization problem in (3) involving the zero-one indicator is known to be nondeterministic polynomial-time hard (NP-hard) (Natarajan, 1995). The presence of the indicator function introduces discontinuity into the objective and constraint functions, rendering gradient-based optimization methods infeasible. Our goal is not to circumvent this fundamental computational barrier, but to develop a practically tractable approximation that targets the optimal decision rule within the class of linear rules. To address this computational intractability, approximating the zero-one indicator with a smooth surrogate function is a widely used approach in the classification literature, demonstrating both computational and theoretical advantages (Bartlett et al., 2006). While commonly used convex surrogate losses (e.g., logistic or hinge loss) are computationally convenient, they do not in general preserve optimality within a restricted model class and may converge to suboptimal linear rules under model misspecification. To ensure we estimate the optimal linear decision rule with guaranteed performance, we also propose using a smooth approximation loss function that converges to the zero-one indicator. Specifically, we choose our surrogate function to be $\Phi(x/h)$, where $\Phi$ is the standard normal distribution function (Ma and Huang, 2007; Fong et al., 2016; Lin et al., 2011). As $h$ approaches zero, the surrogate function converges to the indicator function (see Figure 1). The parameter $h$ in $\Phi(x/h)$ controls how closely the surrogate function approximates the indicator function and should be chosen in a data-adaptive manner. Ideally, $h$ would be selected through cross-validation to ensure that it yields the highest-performing rule.
However, cross-validation is computationally burdensome, as each candidate value of $h$ requires solving a nonlinear optimization problem. For this reason, we adopt the practical choice $h = n^{-1/3}\,\mathrm{StdDev}\left(\mathbf{X}^T\hat\beta_0 / \|\hat\beta_0\|\right)$, where $\hat\beta_0$ is an initial estimate of the coefficients obtained via, say, logistic regression. This choice of $h$ has demonstrated strong empirical performance in prior work (Lin et al., 2011; Meisner et al., 2021). Replacing the indicator function in (3) with the proposed surrogate function, the relaxed optimization problem is given as

\[
\max_{\beta \in \mathbb{R}^{p+1}} \; n_1^{-1}\sum_{i=1}^{n_1} \Phi\left(\frac{\mathbf{X}_{1i}^T\beta}{h}\right)
\quad \text{subject to} \quad
\frac{\gamma\, n_1^{-1}\sum_{i=1}^{n_1} \Phi\left(\frac{\mathbf{X}_{1i}^T\beta}{h}\right)}{\gamma\, n_1^{-1}\sum_{i=1}^{n_1} \Phi\left(\frac{\mathbf{X}_{1i}^T\beta}{h}\right) + n_0^{-1}\sum_{j=1}^{n_0} \Phi\left(\frac{\mathbf{X}_{0j}^T\beta}{h}\right)} \ge \alpha. \tag{4}
\]

The solution to the relaxed problem in (4), $\hat\beta^{\Phi}_h$, maximizes the Lagrangian

\[
L^{\Phi}_n(\beta, \lambda) = n_1^{-1}\sum_{i=1}^{n_1} \Phi\left(\frac{\mathbf{X}_{1i}^T\beta}{h}\right) + \lambda\gamma\, n_1^{-1}\sum_{i=1}^{n_1} \Phi\left(\frac{\mathbf{X}_{1i}^T\beta}{h}\right) - \lambda\alpha\gamma\, n_1^{-1}\sum_{i=1}^{n_1} \Phi\left(\frac{\mathbf{X}_{1i}^T\beta}{h}\right) - \lambda\alpha\, n_0^{-1}\sum_{j=1}^{n_0} \Phi\left(\frac{\mathbf{X}_{0j}^T\beta}{h}\right)
\]

evaluated at $\lambda = \hat\lambda^{\Phi}$, where $\hat\lambda^{\Phi}$ solves $\widehat{\mathrm{PPV}}\left(1\{\mathbf{X}^T\hat\beta^{\Phi}_h > 0\}\right) - \alpha = 0$. To simplify notation, we write $\hat\beta^{\Phi}$ to denote $\hat\beta^{\Phi}_h$, keeping in mind that the estimator implicitly depends on $h$. For a fixed $\lambda$, the Lagrangian is a smooth objective function, which we optimize using an off-the-shelf nonlinear optimization routine in R (nlm). We then perform a grid search over the Lagrange multiplier to identify the value that satisfies the PPV constraint, $\widehat{\mathrm{PPV}}\left(1\{\mathbf{X}^T\hat\beta^{\Phi} > 0\}\right) - \alpha = 0$. Because the decision rule depends only on the sign of $\mathbf{X}^T\hat\beta$, we enforce the $\ell_2$ constraint $\|\hat\beta^{\Phi}\|_2^2 = 1$, which does not affect the resulting decision rule. In Supplementary Appendix B, we summarize the full procedure in Algorithm 1 and provide justification of the optimization steps. We denote the estimated linear decision function obtained by the above procedure as $\hat d_L(\mathbf{X}, \hat\beta^{\Phi}) = 1\{\mathbf{X}^T\hat\beta^{\Phi} > 0\}$.
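To make the smoothed quantities concrete, the following minimal Python sketch evaluates the surrogate $\Phi(x/h)$, the bandwidth rule, and the smoothed Lagrangian for a fixed $\beta$ and $\lambda$. It is not the authors' R implementation: the function names and the toy data are hypothetical, the sample standard deviation is assumed for StdDev, the norm is taken over the full coefficient vector including the intercept, and no optimization over $\beta$ or grid search over $\lambda$ is performed.

```python
import math
import statistics


def Phi(z):
    """Standard normal distribution function, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def linear_score(x, beta):
    """Linear score beta[0] + beta[1]*x[0] + ... for one feature vector x."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


def bandwidth(X, beta_init):
    """h = n^(-1/3) * StdDev(score / ||beta_init||), with the sample std (assumption)."""
    norm = math.sqrt(sum(b * b for b in beta_init))
    scores = [linear_score(x, beta_init) / norm for x in X]
    return len(X) ** (-1.0 / 3.0) * statistics.stdev(scores)


def smooth_rate(X, beta, h):
    """Smoothed positive rate: average of Phi(score / h) over the rows of X."""
    return sum(Phi(linear_score(x, beta) / h) for x in X) / len(X)


def smooth_lagrangian(beta, lam, X1, X0, h, alpha, p1):
    """Smoothed Lagrangian for fixed beta and multiplier lam.

    X1 holds the features of diseased subjects, X0 those of non-diseased subjects.
    """
    gamma = p1 / (1.0 - p1)
    tpr = smooth_rate(X1, beta, h)  # smoothed TPR
    fpr = smooth_rate(X0, beta, h)  # smoothed FPR
    return tpr + lam * gamma * tpr - lam * alpha * gamma * tpr - lam * alpha * fpr


# Hypothetical toy data: three cases, four controls, two biomarkers each.
X1 = [(2.0, 1.5), (1.0, 2.5), (3.0, 0.5)]
X0 = [(-1.0, 0.0), (0.0, -2.0), (0.5, 0.2), (-2.0, -1.0)]
beta = (-1.0, 0.6, 0.6)  # (intercept, slope 1, slope 2); stands in for an initial fit
h = bandwidth(X1 + X0, beta)
value = smooth_lagrangian(beta, lam=1.0, X1=X1, X0=X0, h=h, alpha=0.04, p1=0.01)
print(f"h = {h:.3f}, smoothed Lagrangian = {value:.4f}")
```

In practice this function would be handed to a numerical optimizer for each candidate $\lambda$; shrinking $h$ moves the smoothed rates toward the empirical TPR and FPR of the rule $1\{\mathbf{X}^T\beta > 0\}$.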
We will show in Section 3 that $\hat d_L(\mathbf{X}, \hat\beta^{\Phi})$ converges asymptotically to the true best linear rule, $d^*_L(\mathbf{X}, \beta^*)$.

2.2.2 Incorporating existing information to improve linear combination rule

In many situations, there already exist risk models or scores developed to aid in patient risk stratification. If this existing risk information is useful, it would be beneficial to include it when learning the new biomarker combination rule. In this section, we extend the DOOLR approach to allow incorporation of useful external risk information.

Assume there exists an external risk algorithm or published model for which only the resulting binary decisions (i.e., recommend further screening or not) are available for all patients in the current dataset. Our framework does not require access to the original risk scores or model parameters; it only requires the external decisions. In some settings, the underlying risk scores may also be available. Let $Z \subseteq X$ denote the subset of features used to develop the external model, and let $r(Z)$ denote the associated risk score with clinical threshold $\delta_0$, where larger values correspond to higher disease risk. For simplicity, we assume the external decision rule takes the threshold form $1\{r(Z) - \delta_0 > 0\}$. For example, if the external model is logistic regression, then $r(Z) = Z^{\top}\tilde\beta$, where $\tilde\beta$ are the published coefficients of the model. More generally, $r(Z)$ may arise from a machine learning or domain-informed algorithm. Our method, however, relies only on the induced binary decisions and remains applicable even when the underlying scores are unavailable.

To efficiently incorporate external risk information when developing the new biomarker combination rule, we propose a strategy that transfers external information only when it improves the performance metric.
Since our goal is to maximize TPR, we introduce a penalty term in the objective function in (4) that quantifies disagreement between the decision rules among diseased patients. The penalty term is

\[
\frac{1}{n_1}\sum_{i=1}^{n_1} 1\left\{\mathbf{X}_{1i}^T\beta \cdot \left(r(Z_i) - \delta_0\right) < 0\right\}.
\]

When the two rules align, $\mathbf{X}_{1i}^T\beta$ and $\left(r(Z_i) - \delta_0\right)$ have the same sign. The penalty therefore imposes an additional cost when the new rule disagrees with the external rule on true disease cases. Additionally, we introduce a tuning parameter, $\eta$, to control the extent of information transfer from the external risk model. This ensures that if the external model is not informative, its influence on developing the biomarker combination rule remains minimal.

We now propose a smooth Information Transfer (IT)-based penalized maximization, built upon (4), to construct the linear decision rule; we term this approach IT-DOOLR. The optimization problem is framed as

\[
\max_{\beta \in \mathbb{R}^{p+1}} \; n_1^{-1}\sum_{i=1}^{n_1} \Phi\left(\frac{\mathbf{X}_{1i}^T\beta}{h}\right) - \eta\, n_1^{-1}\sum_{i=1}^{n_1} \Phi\left(\frac{-\mathbf{X}_{1i}^T\beta \cdot \left(r(Z_i) - \delta_0\right)}{h}\right)
\quad \text{subject to} \quad
\frac{\gamma\, n_1^{-1}\sum_{i=1}^{n_1} \Phi\left(\frac{\mathbf{X}_{1i}^T\beta}{h}\right)}{\gamma\, n_1^{-1}\sum_{i=1}^{n_1} \Phi\left(\frac{\mathbf{X}_{1i}^T\beta}{h}\right) + n_0^{-1}\sum_{j=1}^{n_0} \Phi\left(\frac{\mathbf{X}_{0j}^T\beta}{h}\right)} \ge \alpha. \tag{5}
\]

Similarly, we can solve for $\hat\beta^{\Phi}_{IT} = \arg\max_{\beta \in \mathbb{R}^{1+p}} L^{\Phi}_{IT,n}(\beta, \lambda)$ using gradient-based methods, where

\[
L^{\Phi}_{IT,n}(\beta, \lambda) = n_1^{-1}\sum_{i=1}^{n_1} \Phi\left(\frac{\mathbf{X}_{1i}^T\beta}{h}\right) + \lambda\gamma\, n_1^{-1}\sum_{i=1}^{n_1} \Phi\left(\frac{\mathbf{X}_{1i}^T\beta}{h}\right) - \lambda\alpha\gamma\, n_1^{-1}\sum_{i=1}^{n_1} \Phi\left(\frac{\mathbf{X}_{1i}^T\beta}{h}\right) - \lambda\alpha\, n_0^{-1}\sum_{j=1}^{n_0} \Phi\left(\frac{\mathbf{X}_{0j}^T\beta}{h}\right) - \eta\, n_1^{-1}\sum_{i=1}^{n_1} \Phi\left(\frac{-\mathbf{X}_{1i}^T\beta \cdot \left(r(Z_i) - \delta_0\right)}{h}\right),
\]

and $\hat\lambda^{\Phi}_{IT}$ solves $\widehat{\mathrm{PPV}}\left(1\{\mathbf{X}^T\hat\beta^{\Phi}_{IT} > 0\}\right) - \alpha = 0$. We select $\eta$ through cross-validation to ensure efficient information transfer. The procedure leverages external risk information when it enhances the TPR and maintains control over the PPV by selecting larger values of $\eta$, while minimizing negative information transfer by selecting $\eta$ values closer to zero.
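The penalized objective in (5) can be illustrated with a small Python sketch that evaluates the smoothed disagreement penalty among diseased patients and subtracts it, scaled by $\eta$, from the smoothed TPR. Everything here is hypothetical and for illustration only: the function names, the toy scores, and in particular the coding of binary external decisions as margins of $+1$ (recommend screening) and $-1$ (do not recommend) in place of $r(Z_i) - \delta_0$, which is one possible encoding when only decisions are available and is an assumption of this sketch.

```python
import math


def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def smooth_tpr(case_scores, h):
    """Smoothed TPR: mean of Phi(score / h) over diseased subjects."""
    return sum(Phi(s / h) for s in case_scores) / len(case_scores)


def it_penalty(case_scores, ext_margins, h):
    """Smoothed disagreement between the new and external rule among cases.

    Each term Phi(-score * margin / h) is close to 1 when the signs differ
    and close to 0 when they agree.
    """
    terms = (Phi(-s * m / h) for s, m in zip(case_scores, ext_margins))
    return sum(terms) / len(case_scores)


def it_objective(case_scores, ext_margins, h, eta):
    """Smoothed TPR minus eta times the smoothed disagreement penalty."""
    return smooth_tpr(case_scores, h) - eta * it_penalty(case_scores, ext_margins, h)


# Hypothetical linear scores X^T beta for three diseased patients.
case_scores = [1.2, 0.4, -0.8]
# Hypothetical external binary decisions (1 = recommend further screening).
ext_decisions = [1, 0, 0]
# Assumed encoding of decisions as signed margins standing in for r(Z) - delta_0.
ext_margins = [1.0 if d == 1 else -1.0 for d in ext_decisions]
h = 0.1

for eta in (0.0, 0.5, 1.0):
    print(f"eta={eta:.1f}  objective={it_objective(case_scores, ext_margins, h, eta):.4f}")
```

Setting $\eta = 0$ recovers the unpenalized smoothed TPR, which matches the role of $\eta$ as the dial on how much external information is transferred.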

3 Theoretical properties

In this section, we discuss the asymptotic properties of the estimated linear rule using the proposed DOOLR approach. We present an asymptotic guarantee that the best linear rule is achieved. We further show that the TPR under the estimated linear rule converges to the optimal TPR in the linear class and that PPV is controlled at the pre-specified $\alpha$. We assume the following standard conditions.

(A1) Suppose $n_0/n,\, n_1/n \in (0, 1)$, where $n = n_1 + n_0 \to \infty$.

(A2) For each $d \in \{0, 1\}$, the observations $X_{di}$, $i = 1, 2, \ldots, n_d$, are independent and identically distributed $p$-dimensional random vectors with distribution function $F_d$.

(A3) For each $d \in \{0, 1\}$, the conditional distribution of $X$ given $D = d$ is not supported on any proper linear subspace of $\mathbb{R}^p$; that is, for all proper linear subspaces $S \subset \mathbb{R}^p$, $\Pr(X \in S \mid D = d) < 1$.

(A4) For each $d \in \{0, 1\}$, the distribution and quantile functions of $X^T\beta_1$ given $D = d$ are globally Lipschitz continuous uniformly over $\beta_1 \in \mathbb{R}^p$ such that $\|\beta_1\| = 1$.

(A5) The map $(\beta_0, \beta_1) \mapsto \mathrm{TPR}(\beta_0, \beta_1)$ is globally Lipschitz continuous over $\Omega = \{\beta_1 \in \mathbb{R}^p, \beta_0 \in \mathbb{R} : \|\beta_1\| = 1\}$.

To summarize, Conditions (A1) and (A2) are standard assumptions in asymptotic analysis. Condition (A3) ensures that $X$ has full support in $\mathbb{R}^p$ and cannot be restricted to a lower-dimensional subspace for some $D$. Conditions (A4) and (A5) are less stringent smoothness conditions used to show the asymptotic results; they ensure controlled variation in the functions involved. The above conditions are similar to those stated in Meisner et al. (2021).

Theorem 1. Under conditions (A1)-(A5), we have that as $h \to 0$:
a) $\mathrm{TPR}(\hat\beta^{\Phi}) \to \mathrm{TPR}(\beta^*)$ in probability,
b) $\liminf \mathrm{PPV}(\hat\beta^{\Phi}) \ge \alpha$ almost surely,
c) $\hat\beta^{\Phi} \to \beta^*$ in probability.

The proof of the above theorem is based on Lemma 2 from Meisner et al. (2021) and M-estimation theory (Van der Vaart, 2000). We use Lemma 2 of Meisner et al. (2021) to first show that the empirical smooth approximations of TPR and FPR converge almost surely to the smooth versions of TPR and FPR, respectively, over a fixed parameter space. We then show that the smooth versions of TPR and FPR converge to TPR and FPR as $h \to 0$. We finally show c) by employing results from M-estimation theory (Theorem 5.7 of Van der Vaart (2000)) in addition to the results in b). In showing consistency of $\hat\beta^{\Phi}$ in c), we assume $\beta^*$ is a unique and well-separated maximizer of $\mathrm{TPR}(\beta)$ over the parameter space. Details of the proof can be found in Supplementary Appendix D.

The general implication of the above theorem is that, even in cases where the global optimal rule is not linear, we can guarantee that the performance of our proposed linear rule is asymptotically close to that of the best rule within the class of linear rules. As mentioned earlier, this is not guaranteed when using convex surrogate functions.

4 Numerical studies

In this section, we compare the finite-sample performance of the proposed methods, namely the plug-in approaches, DOOLR, and IT-DOOLR, against the standard logistic regression ('Standard') approach. The Standard approach is implemented in two steps: i) fit a logistic regression model for the outcome; ii) choose a threshold for the predicted probabilities that corresponds to the PPV constraint. For the plug-in approach, we considered two ways of estimating the nuisance risk function, $\mathrm{pr}(D = 1 \mid X)$: 1) logistic regression ('Plug-in Logistic'), and 2) an ensemble of methods (generalized additive models, logistic regression, random forest, mean of outcome) via SuperLearner ('Plug-in SL').
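Step ii) of the Standard approach can be sketched in a few lines of Python. The text does not spell out how the threshold is searched, so the rule used here (scan candidate cutoffs in increasing order and return the first one whose estimated PPV reaches $\alpha$, which gives the largest TPR among feasible cutoffs because TPR cannot increase with the cutoff) is an assumption of this sketch, as are the function names and the toy predicted risks. PPV is estimated from TPR, FPR, and a known prevalence $p_1$ through $\gamma\,\mathrm{TPR} / (\gamma\,\mathrm{TPR} + \mathrm{FPR})$.

```python
def estimated_ppv(tpr, fpr, p1):
    """PPV from TPR, FPR and prevalence p1 via gamma*TPR / (gamma*TPR + FPR)."""
    gamma = p1 / (1.0 - p1)
    denom = gamma * tpr + fpr
    return gamma * tpr / denom if denom > 0 else float("nan")


def choose_cutoff(risk_cases, risk_controls, p1, alpha):
    """Return (cutoff, TPR, PPV) for the smallest cutoff c such that the rule
    1{risk > c} has estimated PPV >= alpha, or None if no cutoff qualifies."""
    candidates = [float("-inf")] + sorted(set(risk_cases) | set(risk_controls))
    for c in candidates:  # increasing c, so TPR is non-increasing along the scan
        tpr = sum(r > c for r in risk_cases) / len(risk_cases)
        fpr = sum(r > c for r in risk_controls) / len(risk_controls)
        if tpr > 0 and estimated_ppv(tpr, fpr, p1) >= alpha:
            return c, tpr, estimated_ppv(tpr, fpr, p1)
    return None


# Hypothetical predicted risks from a fitted model (not real data).
risk_cases = [0.9, 0.7, 0.4, 0.2]
risk_controls = [0.8, 0.3, 0.25, 0.1, 0.05, 0.05, 0.02, 0.01]
result = choose_cutoff(risk_cases, risk_controls, p1=0.2, alpha=0.5)
print(result)
```

Because the estimated PPV need not be monotone in the cutoff, the scan checks every candidate rather than bisecting.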
For our proposed linear rules (DOOLR and IT-DOOLR), we set, across all simulations, $h = n^{-1/3}\,\mathrm{StdDev}(X^{T}\hat{\beta}_0 / \lVert\hat{\beta}_0\rVert)$, where $\hat{\beta}_0$ denotes the coefficient estimate obtained from the Standard approach. A numerical study examining different values of $h$, which demonstrates the robustness of our decision rule, is presented in Supplementary Appendix C (see Table 2). To mimic a rare-disease screening scenario, we generate data with disease prevalence of approximately 1% across all simulation settings and repeat each simulation 500 times. We specify PPV constraints at 3%, 4%, and 4.5% in all simulations. We present only the results for 4% here and report the remaining results in Supplementary Appendix C (see Tables 4-7). Our implementation of the proposed methods and the code to replicate the simulation results in R are available at https://github.com/albertosom/BiomarkerRule.

Table 2: Cohort study sampling simulation scenarios in which the true decision rule is linear, piece-wise linear, or nonlinear. Average TPR and PPV with corresponding standard errors (in parentheses) in the test sample across 500 simulation replicates under the PPV constraint ($\alpha = 0.04$) and prevalence ($p_1 = 0.01$).

Scenario                   n     Measure  Standard       Plug-in Logistic  Plug-in SL     DOOLR
Linear                     2500  TPR      0.988 (0.006)  0.987 (0.010)     0.987 (0.013)  0.967 (0.040)
                                 PPV      0.042 (0.003)  0.040 (0.003)     0.038 (0.012)  0.047 (0.015)
                           5000  TPR      0.990 (0.004)  0.990 (0.006)     0.990 (0.006)  0.980 (0.017)
                                 PPV      0.042 (0.001)  0.040 (0.001)     0.040 (0.005)  0.043 (0.008)
Linear with contamination  2500  TPR      0.508 (0.284)  0.547 (0.281)     0.972 (0.020)  0.922 (0.123)
                                 PPV      0.043 (0.025)  0.032 (0.014)     0.040 (0.003)  0.045 (0.008)
                           5000  TPR      0.571 (0.284)  0.651 (0.295)     0.977 (0.011)  0.959 (0.042)
                                 PPV      0.05 (0.022)   0.037 (0.008)     0.040 (0.002)  0.043 (0.003)
Piece-wise linear          2500  TPR      0.604 (0.152)  0.678 (0.116)     0.771 (0.159)  0.624 (0.163)
                                 PPV      0.055 (0.021)  0.043 (0.011)     0.045 (0.019)  0.050 (0.019)
                           5000  TPR      0.628 (0.144)  0.693 (0.085)     0.840 (0.079)  0.654 (0.115)
                                 PPV      0.051 (0.020)  0.042 (0.007)     0.041 (0.006)  0.046 (0.013)
Nonlinear                  2500  TPR      0.696 (0.256)  0.833 (0.201)     0.984 (0.033)  0.747 (0.218)
                                 PPV      0.047 (0.010)  0.041 (0.006)     0.040 (0.004)  0.042 (0.009)
                           5000  TPR      0.601 (0.296)  0.868 (0.169)     0.989 (0.007)  0.808 (0.189)
                                 PPV      0.048 (0.009)  0.041 (0.004)     0.040 (0.001)  0.043 (0.006)

4.1 Linear decision rule

We consider bivariate normal biomarkers with and without contamination; similar scenarios are described in Croux and Haesbroeck (2003) and Meisner et al. (2021). We first simulate the covariates $X = (X_1, X_2)$, each following a standard normal distribution. The outcome $D$ is generated from a Bernoulli distribution with success probability $\mathrm{pr}(D = 1 \mid X_1, X_2) = 1/[1 + \exp\{-(\beta_0 + X_1\beta_1 + X_2\beta_2)\}]$. The optimal decision rule has a linear form and is expressed as $d(X) = 1\{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + c > 0\}$, where $c$ is chosen to satisfy the pre-specified PPV constraint. We set the parameters $\beta_0 = -8.7$, $\beta_1 = 2.4$, and $\beta_2 = 2.4$ so that the disease prevalence is approximately 1%. We further added 6% contaminated control observations ($D = 0$, $X_1 = 6$, and $X_2 = 6$) to depict the scenario where patients with biomarker values similar to those of cases are labeled as controls. We estimate the decision rule using training samples of sizes $n_{\mathrm{train}} = 2500$ and $5000$, and evaluate the results on an independent test sample of size $n_{\mathrm{test}} = 10^5$.

The simulation results are presented in Table 2. When there are no contaminated observations, the Standard approach has the best performance. This is expected, as the data are generated directly from a logistic model. The performance of the DOOLR approach improves as the sample size increases. The two Plug-in approaches demonstrated performance comparable to the Standard approach. In the presence of contaminated observations, the proposed methods provided better control of the PPV than the Standard approach. For the estimate of TPR, the Plug-in SL approach achieved the best performance, likely due to its flexibility in modeling the nuisance risk function. The linear rule developed via the DOOLR method also performed remarkably well, yielding results comparable to the Plug-in SL while offering the added advantage of a more interpretable decision rule due to its linear structure. This further illustrates its robustness in the presence of outlying observations compared with alternative methods.

4.2 Nonlinear decision rule

In this simulation setting, we consider two scenarios where the true clinical decision rule is not linear. In the first scenario, the true rule is piece-wise linear. Suppose we have two independent biomarkers $(X_1, X_2)$, each generated from a standard normal distribution. The outcome $D$ is generated as

$$D = 1\{[-8.9 + 2X_1 + 2X_2 \cdot 1\{X_2 > X_2^{0.025}\} + \varepsilon] > 0\},$$

where $\varepsilon$ is distributed as a logistic random variable with location parameter zero and scale parameter one, and $X_2^{0.025}$ is the 0.025th quantile of $X_2$. Among observations with $X_2 < X_2^{0.025}$, we randomly sample 0.4% and set $D = 1$. This setting mimics a practical scenario where the true decision rule is piece-wise linear. For example, in PDAC screening, it has been shown that higher values of the serum biomarker CA19-9, in addition to other patient characteristics (e.g., A1c, age, etc.), are generally associated with higher risk of PDAC (Ballehaninna and Chamberlain, 2012). However, a small group of patients (e.g., Lewis $\alpha^{-}\beta^{-}$ genotype) with very low values of CA19-9 are also at high risk of PDAC (Yong and Diyana, 2022).

For the second scenario, we consider the case where the true clinical decision rule is a nonlinear function of the covariates. Suppose we have three independent biomarkers $X = (X_1, X_2, X_3)$, each generated from the standard normal distribution. The outcome $D$ is generated as

$$D = 1\{[\beta_0 + \beta_1 \sin(X_1) + \beta_2 X_2^2 + \beta_3 \cos(X_3) + \varepsilon] > 0\},$$

where $\varepsilon$ is distributed as a logistic random variable with location parameter zero and scale parameter one. We set the parameters $\beta_0 = -8.6$, $\beta_1 = 5$, $\beta_2 = -4$, and $\beta_3 = 3$ so that the disease prevalence is approximately 1%. We train the decision rules on training samples of sizes $n_{\mathrm{train}} = 2500, 5000$ and evaluate the performance on an independent test sample of size $n_{\mathrm{test}} = 10^5$.

The results are presented in Table 2. All proposed methods yield higher TPR estimates compared to the standard logistic regression approach. Notably, the superior performance of DOOLR highlights its optimality among the class of linear rules.
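For reference, the second (nonlinear) data-generating mechanism of this section can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `simulate_nonlinear`, the seed handling, and the inverse-CDF draw of the logistic noise are choices made here, and the sketch is not taken from the authors' R code.

```python
import math
import random


def simulate_nonlinear(n, beta=(-8.6, 5.0, -4.0, 3.0), seed=0):
    """Draw (X, D) with D = 1{b0 + b1*sin(X1) + b2*X2^2 + b3*cos(X3) + eps > 0}."""
    rng = random.Random(seed)
    b0, b1, b2, b3 = beta
    X, D = [], []
    for _ in range(n):
        # Three independent standard normal biomarkers.
        x1, x2, x3 = (rng.gauss(0.0, 1.0) for _ in range(3))
        # Standard logistic noise via the inverse CDF: eps = log(u / (1 - u)).
        # The clamp only guards against log(0) at the edges of the unit interval.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        eps = math.log(u / (1.0 - u))
        latent = b0 + b1 * math.sin(x1) + b2 * x2 ** 2 + b3 * math.cos(x3) + eps
        X.append((x1, x2, x3))
        D.append(1 if latent > 0 else 0)
    return X, D


# Usage: one training sample; the observed prevalence should be on the order of 1%.
X_train, D_train = simulate_nonlinear(20000, seed=1)
prevalence = sum(D_train) / len(D_train)
```

Thresholding a latent linear predictor plus standard logistic noise at zero is equivalent to drawing $D$ from a Bernoulli distribution whose success probability is the logistic function of that predictor.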
In terms of PPV control, the proposed methods also demonstrate favorable results. Among all methods, the Plug-in SL approach has the best performance, due to its flexible estimation of the nuisance risk function.

4.3 Incorporating existing risk information

In this simulation setting, we investigate the performance of IT-DOOLR under three scenarios where existing risk information is available. We generate data using the same data generation mechanism under the nonlinear rule described in Section 4.2, and compare the performance of the IT-DOOLR approach, which incorporates the existing information, to the DOOLR method. We let the existing model be parameterized by $\tilde{\beta}$ such that the decision rule based on the existing model is $1\{(1, H(X))^{T}\tilde{\beta} > 0\}$, where $H(X) = (\sin(X_1), X_2^2, \cos(X_3))$. We train the decision rules on training datasets with sample sizes of 2500 and 5000, and evaluate the performance on a testing dataset of size $10^6$. We consider the following three scenarios:

• Scenario I: the existing risk model is the same as the true model under the data generation, $\tilde{\beta} = \beta = (-8.6, 5, -4, 3)$.

• Scenario II: the existing risk model does not include all covariates under the true data generation mechanism; here $\beta = (-8.6, 5, -4, 3)$ and $\tilde{\beta} = (-8.6, 5, 0, 0)$, with two covariates omitted in the existing model.

• Scenario III: the existing risk model is very different from the model under the true data generation mechanism; here $\beta = (-8.6, 5, -4, 3)$ and $\tilde{\beta} = (-15, -3, -1, -1)$.

We generate the outcome for all three scenarios using the same data generation mechanism described in Section 4.2.

Table 3: Simulation scenarios under incorporating auxiliary information. Average TPR and PPV with corresponding standard errors (in parentheses) in the test sample across 500 simulation replicates under the PPV constraint ($\alpha = 0.04$) and prevalence ($p_1 = 0.01$).

                                  IT-DOOLR
n     Measure  DOOLR          Scenario I     Scenario II    Scenario III
2500  TPR      0.747 (0.218)  0.839 (0.136)  0.802 (0.165)  0.781 (0.179)
      PPV      0.043 (0.009)  0.041 (0.006)  0.041 (0.007)  0.043 (0.007)
5000  TPR      0.808 (0.189)  0.880 (0.097)  0.860 (0.118)  0.845 (0.145)
      PPV      0.043 (0.006)  0.041 (0.004)  0.042 (0.005)  0.042 (0.006)

The results are summarized in Table 3. In Scenario I, where the existing model aligns with the true data-generating process, IT-DOOLR effectively leverages the external risk information to improve upon the performance of DOOLR. In Scenario III, where the existing risk model deviates from the true data-generating mechanism, IT-DOOLR still achieves a modest improvement in TPR estimation. Overall, the proposed integration approach effectively utilizes valuable external information (Scenarios I & II) while avoiding negative information transfer (Scenario III) when the existing model is potentially misspecified. In particular, IT-DOOLR exhibits greater stability and improved estimation of both TPR and PPV compared to DOOLR, especially in scenarios where the external information is informative.

We additionally consider a scenario where the data are obtained from a nested case-control sampling design, commonly considered in biomarker studies (Ernster, 1994). The performance among the methods is similar to the performance under the simulation scenario in Section 4.1. The data generation mechanism and results are summarized in Supplementary Appendix C Table 1.

5 Application to PANDORA study data

Early detection of PDAC can greatly improve patients’ survival. Unfortunately, PDAC lacks an effective strategy for early detection due to several challenges. Notably, the low incidence of PDAC in the general population (10/100,000) makes it cost-prohibitive to implement a biomarker-based screening program for asymptomatic individuals.
Even if a biomarker test has high sensitivity and specificity, it may still cause thousands of false positives for every identified positive cancer. Therefore, biomarker-based screening should be applied in a population that has a higher risk for PDAC. In particular, New Onset Hyperglycemia and Diabetes (NOD) patients have shown a higher risk for PDAC (Chari et al., 2005). Leveraging data from Kelsey-Seybold Clinic (Houston, Texas) during the period 2011-2022, the PANDORA Cohort includes a total of 357,593 patients with data on either their glycated haemoglobin (HbA1c) or fasting blood glucose (FBG) level. Out of this number, we obtained the NOD population by selecting patients based on the first occurrence of any of the following: a) HbA1c $\ge 6.5\%$, b) two consecutive FBG $\ge 126$ mg/dl within 18 months of each other, or c) FBG $\ge 126$ mg/dl followed by the start of diabetes medication within 30 days. Our total NOD population consists of 18,300 patients, of whom 35 developed PDAC within three years. Among the 35 confirmed PDAC cases, 32 have complete information on weight and estimated glucose levels at both the NOD date and the left window (approximately 1 year prior to the NOD date). Among the 18,265 controls, 11,149 have complete three-year follow-up, of which 7,137 have complete biomarker information at both the NOD date and the left window.

The goal of our analysis is to develop a new biomarker combination rule that incorporates broader covariate information to identify NOD patients at high risk of developing PDAC and recommend them for further screening. In particular, the rule should be developed under a practical constraint on the PPV. Sharma et al. (2018) proposed an algorithm (the ENDPAC model) to stratify NOD patients into high or low PDAC risk based on changes in FBG levels, weight, and age at NOD onset.
While the model demonstrated promising performance in their validation cohort, efforts to validate it in other populations have shown lower performance (Chen et al., 2021; Boursi et al., 2022). Several factors may contribute to this reduced performance: (1) the original ENDPAC model was developed in a predominantly white population, and (2) the model relies on FBG measurements, which are unavailable for many patients. To address the latter limitation, the authors proposed a heuristic conversion between HbA1c and FBG, given by FBG $= 28.7 \times$ HbA1c $- 46.7$, for patients without FBG values. However, this conversion is suboptimal, as it fails to fully capture the nuances of FBG and HbA1c measurements. Specifically, FBG reflects glucose control at the time of the blood test and exhibits higher day-to-day variability, whereas HbA1c represents an average blood glucose level over the past 2–3 months and thus has lower day-to-day variability. We incorporate risk information from the ENDPAC model to assess whether it improves the performance of our biomarker combination rule.

Our analysis data consist of the 32 diagnosed PDAC cases and a random sample of 640 controls from a total NOD cohort of size 18,300. For each case, we randomly sample 20 controls from individuals in the cohort who have not developed the outcome by the time the case occurred; this depicts a nested case-control sampling design scenario. We construct the rule based on patients’ change in HbA1c, change in weight, and age at NOD. We standardized these variables before model fitting. To assess the performance of our method, we randomly split the data equally into training and testing sets. We developed the proposed methods on the training data and evaluated their performance on the independent test set.
This random split was repeated 200 times, and the average of the estimates on the test sets is reported. For our analysis, we use a prevalence estimate of PDAC in the NOD population of $p_1 = 0.44\%$, and pre-specified PPV values of 1%, 1.5%, and 2%, corresponding to screening approximately 100, 67, and 50 individuals to detect one cancer, respectively. From the study data, we calculate the ENDPAC score for each patient and identify risk thresholds to define high- and low-risk groups that satisfy the PPV constraints.

We present the results of our analysis based on the proposed methods in Table 4. We considered the PPV constraint at 2% to illustrate the proposed method. The reported values are the average estimates (with standard deviations) of TPR and PPV across the test sets. To achieve a PPV constraint of 2%, the threshold for the ENDPAC score is set at 2, and the corresponding TPR of the ENDPAC score based rule is 68.8%. The Plug-in Logistic approach performs similarly to the Standard approach. The IT-DOOLR estimate of TPR is 77.1%, slightly higher than the TPR estimate for DOOLR (75.5%), indicating some information transfer from the ENDPAC model. Both DOOLR and IT-DOOLR yield PPV estimates that closely align with the pre-specified value of 2%.

Table 4: Application to screening NOD patients at high risk of PDAC given a 2% constraint on PPV, using an estimated prevalence of pancreatic cancer of 0.44%. We report average TPR, PPV, and coefficient estimates with corresponding standard errors (in parentheses). The linear decision rule from IT-DOOLR is $1\{0.453 + 0.208 \times \text{Age} - 0.677 \times \text{Weight change} + 0.672 \times \text{Change in HbA1c} > 0\}$.

                      Standard        Plug-in Logistic  Plug-in SL     DOOLR            IT-DOOLR
Measure (α = 2%)
  TPR                 0.682 (0.114)   0.685 (0.125)     0.760 (0.101)  0.755 (0.124)    0.771 (0.122)
  PPV                 0.023 (0.010)   0.024 (0.010)     0.020 (0.010)  0.020 (0.007)    0.020 (0.006)
Predictors
  Age                 0.444 (0.113)   -                 -              0.204 (0.098)    0.208 (0.097)
  Weight change       -0.805 (0.086)  -                 -              -0.6828 (0.127)  -0.677 (0.122)
  Change in HbA1c     0.355 (0.095)   -                 -              0.678 (0.147)    0.672 (0.149)

Finally, Table 4 also presents the estimated coefficients of the predictors. These coefficients can be used to construct linear decision rules (DOOLR and IT-DOOLR) by comparing linear combinations of covariates and coefficients to thresholds of -0.472 and -0.453, respectively. A reduction in weight between the left window and the NOD date is associated with a higher risk of PDAC. This aligns with current research findings, which suggest that among individuals with new-onset diabetes, unexplained weight loss may indicate underlying pancreatic cancer rather than just the metabolic changes from diabetes itself. Results for the other PPV values are reported in Supplementary Appendix C Table 2.

6 Discussion

This paper proposes a new PPV-constrained benefit function for developing individualized clinical decision rules, designed to maximize TPR while ensuring adherence to a clinically meaningful pre-specified PPV. We further extend this framework to integrate relevant external risk information from published literature. Our proposed linear and nonlinear rules offer practitioners the flexibility to choose between interpretability and predictive power. When interpretability is prioritized, we show that DOOLR is guaranteed to estimate the best rule within the class of linear rules.
On the other hand, the plug-in approach allows a more flexible estimation of the nonlinear rule, serving as an attractive alternative when interpretability is less of a concern and predictive accuracy is more important. A strength of our framework is that both of our approaches are model-free, which makes them robust when the risk model is mis-specified.

A key distinction between our proposed objective and prior methods that maximize TPR under an FPR constraint (Wang et al., 2020; Meisner et al., 2021) lies in the complexity of PPV. Constraining FPR at a pre-specified value to achieve a target PPV is generally intractable because PPV is a nonlinear function of TPR, FPR, and disease prevalence. Instead, our approach directly maximizes TPR while incorporating FPR and prevalence from the same dataset, eliminating the need for an explicit FPR constraint and aligning the optimization with the intended clinical utility: enforcing a PPV constraint that reflects downstream diagnostic burden while optimizing sensitivity. This results in a fundamentally different optimization problem from FPR-constrained approaches and provides a more relevant framework for evaluating and deploying biomarker-based screening tools.

Additionally, the developed biomarker decision rules can be broadly applicable to various early detection initiatives. For instance, in ovarian cancer screening, clinical guidelines recommend that a biomarker-based screening rule achieve a minimum PPV of 10% to justify invasive surgical diagnostic procedures (Jacobs and Menon, 2004). Using our proposed method, an optimal biomarker-based decision rule for ovarian cancer can be developed while ensuring the PPV is maintained at the desired 10% level. Similarly, Mazzone et al. (2024) investigated a biomarker combination rule for the early detection of lung cancer, proposing a decision rule based on a penalized logistic regression model, which yielded a PPV of 1.3%. They reported that their rule reduced the number needed to screen with low-dose computed tomography (CT) to detect one lung cancer from 143 to 75. However, while their rule is straightforward, it may not be optimal for maximizing the screening benefit (i.e., TPR). Our proposed nonparametric approach provides a rigorous alternative for developing optimal decision rules in such settings, ensuring both improved screening efficiency and adherence to clinically relevant constraints. More broadly, this method offers a valuable tool for medical decision-making, accommodating patient heterogeneity while addressing practical implementation challenges in clinical practice.

In practice, the PPV threshold should be specified a priori based on clinical and operational considerations. In particular, PPV directly determines the number needed to screen (NNS) to detect one cancer case, which quantifies the downstream diagnostic burden. A clinically meaningful PPV threshold should reflect the maximum number of false-positive diagnostic evaluations that the health system can support, given the cost, availability, and invasiveness of confirmatory imaging or biopsy. For PDAC, where prevalence is extremely low (less than 1%), PPV levels in the range of 1–5% correspond to realistic NNS values for CT/MRI follow-up, whereas higher PPV levels may be needed to justify invasive procedures.

With our choice of a non-convex surrogate function, some care is needed in choosing initial values for the optimization.
Although we found that using the coefficient estimates from standard logistic regression as initial values works well in general, one could also run the optimization algorithm from multiple initial values to ensure stability. Another choice of surrogate function that could be considered is the ramp loss (Huang and Fong, 2014), which can be implemented using the difference of convex functions algorithm (DCA) (Tao et al., 1996). This could potentially provide a more reliable optimization approach, as it decomposes the problem into convex sub-problems, which are generally easier to solve than non-convex ones.

An interesting direction for future work is to incorporate ranking-based penalties when integrating external risk information. While our proposed method penalizes disagreement in binary decisions, one could instead encourage agreement in pairwise risk ranking in settings where only relative risk ordering is available. This may provide a principled alternative for incorporating external information.

Acknowledgments

The work was supported by a Catalyst Award from the Pancreatic Cancer Action Network. We express our appreciation to Lynn Matrisian for helpful comments.

References

Ballehaninna, U. K. and Chamberlain, R. S. (2012). The clinical utility of serum CA 19-9 in the diagnosis, prognosis and management of pancreatic adenocarcinoma: An evidence based appraisal. Journal of Gastrointestinal Oncology, 3(2):105.

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156.

Boursi, B., Patalon, T., Webb, M., Margalit, O., Beller, T., Yang, Y.-X., and Chodick, G. (2022). Validation of the enriching new-onset diabetes for pancreatic cancer model: a retrospective cohort study using real-world data. Pancreas, 51(2):196–199.

Breiman, L. (2001). Random forests. Machine Learning, 45:5–32.

Chari, S. T., Leibson, C. L., Rabe, K. G., Ransom, J., De Andrade, M., and Petersen, G. M. (2005). Probability of pancreatic cancer following diabetes: a population-based study. Gastroenterology, 129(2):504–511.

Chatterjee, N., Chen, Y.-H., Maas, P., and Carroll, R. J. (2016). Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. Journal of the American Statistical Association, 111(513):107–117.

Chen, W., Butler, R. K., Lustigova, E., Chari, S. T., and Wu, B. U. (2021). Validation of the enriching new-onset diabetes for pancreatic cancer model in a diverse and integrated healthcare setting. Digestive Diseases and Sciences, 66:78–87.

Cheng, W., Taylor, J. M., Gu, T., Tomlins, S. A., and Mukherjee, B. (2019). Informing a risk prediction model for binary outcomes with external coefficient information. Journal of the Royal Statistical Society Series C: Applied Statistics, 68(1):121–139.

Croux, C. and Haesbroeck, G. (2003). Implementing the Bianco and Yohai estimator for logistic regression. Computational Statistics & Data Analysis, 44(1-2):273–295.

Ernster, V. L. (1994). Nested case-control studies. Preventive Medicine, 23(5):587–590.

Fong, Y., Yin, S., and Huang, Y. (2016). Combining biomarkers linearly and nonlinearly for classification using the area under the ROC curve. Statistics in Medicine, 35(21):3792–3809.

Han, P., Taylor, J. M., and Mukherjee, B. (2023). Integrating information from existing risk prediction models with no model details. Canadian Journal of Statistics, 51(2):355–374.

Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models, volume 43. CRC Press.

Hector, E. C. and Martin, R. (2024). Turning the information-sharing dial: efficient inference from different data sources. Electronic Journal of Statistics, 18(2):2974–3020.

Huang, Y. and Fong, Y. (2014). Identifying optimal biomarker combinations for treatment selection via a robust kernel method. Biometrics, 70(4):891–901.

Huang, Y. and Pepe, M. S. (2010). Assessing risk prediction models in case–control studies using semiparametric and nonparametric methods. Statistics in Medicine, 29(13):1391–1410.

Jacobs, I. J. and Menon, U. (2004). Progress and challenges in screening for early detection of ovarian cancer. Molecular & Cellular Proteomics, 3(4):355–366.

Kerr, K. F., Brown, M. D., Zhu, K., and Janes, H. (2016). Assessing the clinical impact of risk prediction models with decision curves: guidance for correct interpretation and appropriate use. Journal of Clinical Oncology, 34(21):2534–2540.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86.

Lin, H., Zhou, L., Peng, H., and Zhou, X.-H. (2011). Selection and combination of biomarkers using ROC method for disease classification and prediction. Canadian Journal of Statistics, 39(2):324–343.

Ma, S. and Huang, J. (2007). Combining multiple markers for classification using ROC. Biometrics, 63(3):751–757.

Mazzone, P. J., Bach, P. B., Carey, J., Schonewolf, C. A., Bognar, K., Ahluwalia, M. S., Cruz-Correa, M., Gierada, D., Kotagiri, S., Lloyd, K., et al. (2024). Clinical validation of a cell-free DNA fragmentome assay for augmentation of lung cancer early detection. Cancer Discovery.

Meisner, A., Carone, M., Pepe, M. S., and Kerr, K. F. (2021). Combining biomarkers by maximizing the true positive rate for a fixed false positive rate. Biometrical Journal, 63(6):1223–1240.

Natarajan, B. K. (1995). Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234.

National Comprehensive Cancer Network (2023). Prostate cancer early detection (version 1.2023). https://www.nccn.org/professionals/physician_gls/pdf/prostate_detection.pdf.

Rahib, L., Smith, B. D., Aizenberg, R., Rosenzweig, A. B., Fleshman, J. M., and Matrisian, L. M. (2014). Projecting cancer incidence and deaths to 2030: the unexpected burden of thyroid, liver, and pancreas cancers in the United States. Cancer Research, 74(11):2913–2921.

Rose, S. and van der Laan, M. J. (2008). A note on risk prediction for case-control studies.

Sharma, A., Kandlakunta, H., Nagpal, S. J. S., Feng, Z., Hoos, W., Petersen, G. M., and Chari, S. T. (2018). Model to determine risk of pancreatic cancer in patients with new-onset diabetes. Gastroenterology, 155(3):730–739.

Siegel, R. L., Kratzer, T. B., Giaquinto, A. N., Sung, H., and Jemal, A. (2025). Cancer statistics, 2025. Ca, 75(1):10.

Tao, P. D. et al. (1996). Numerical solution for optimization over the efficient set by dc optimization algorithms. Operations Research Letters, 19(3):117–128.

Van der Laan, M. J., Polley, E. C., and Hubbard, A. E. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1).

Van der Vaart, A. W. (2000). Asymptotic Statistics, volume 3. Cambridge University Press.

Vickers, A. J. and Elkin, E. B. (2006). Decision curve analysis: a novel method for evaluating prediction models. Medical Decision Making, 26(6):565–574.

Wang, D., Ye, W., Sung, R., Jiang, H., Taylor, J. M., Ly, L., and He, K. (2021). Kullback-Leibler-based discrete failure time models for integration of published prediction models with new time-to-event dataset. arXiv preprint arXiv:2101.02354.

Wang, D., Ye, W., Zhu, J., Xu, G., Tang, W., Zawistowski, M., Fritsche, L. G., and He, K. (2023). Incorporating external risk information with the Cox model under population heterogeneity: Applications to trans-ancestry polygenic hazard scores. arXiv preprint arXiv:2302.11123.

Wang, Y., Zhao, Y.-Q., and Zheng, Y. (2020). Learning-based biomarker-assisted rules for optimized clinical benefit under a risk constraint. Biometrics, 76(3):853–862.

Wilson, P. W., D’Agostino, R. B., Levy, D., Belanger, A. M., Silbershatz, H., and Kannel, W. B. (1998). Prediction of coronary heart disease using risk factor categories. Circulation, 97(18):1837–1847.

Yong, B. J. C. and Diyana, M. W. (2022). Low carbohydrate antigen 19-9 (CA 19-9) levels in a patient highly suspected of having caput pancreas tumor. Cureus, 14(4).

Supplementary Material to “Developing Performance-Guaranteed Biomarker Combination Rules with Integrated External Information under Practical Constraint”

7 Appendix A (Optimality of decision rule)

Let the objective and constraint functions be denoted as

$$f(d(X)) = E_X\big[1\{d(X) = 1\}\,\eta_1(X)\big],$$

and

$$h(d(X)) = E_X\Big[1\{d(X) = 1\}\big\{\gamma\eta_1(X) - \alpha\gamma\eta_1(X) - \alpha\eta_0(X)\big\}\Big],$$

respectively. For $\lambda \ge 0$, we define the Lagrangian

$$L(d(X), \lambda) = E\Big[1\{d(X) = 1\}\big\{\eta_1(X) - \lambda\alpha\gamma\eta_1(X) - \lambda\alpha\eta_0(X) + \lambda\gamma\eta_1(X)\big\}\Big],$$

and its associated dual function

$$g(\lambda) = \sup_{d(X)} L(d(X), \lambda).$$

For any fixed $\lambda$, the Lagrangian is maximized pointwise, yielding the decision rule

$$d_\lambda(X) = 1\{f_\lambda(X) > 0\},$$

where

$$f_\lambda(X) = \eta_1(X) - \lambda\alpha\gamma\eta_1(X) - \lambda\alpha\eta_0(X) + \lambda\gamma\eta_1(X).$$

Our goal is to establish strong duality, namely

$$p^{*} = \sup_{d \in \mathcal{R}:\, h(d) \ge 0} f(d) = \inf_{\lambda \ge 0} g(\lambda) = g^{*}.$$

We note that weak duality holds immediately: for any decision rule $d$ satisfying $h(d) \ge 0$ and any $\lambda \ge 0$,

$$f(d) \le L(d, \lambda) \le g(\lambda),$$

which implies $p^{*} \le g^{*}$. To establish equality, we make the following assumptions:

1. There exists a strictly feasible decision rule $\tilde{d}$ such that $h(\tilde{d}) > 0$.

2. $\Pr\big(f_\lambda(X) = 0\big) = 0$ for all $\lambda \ge 0$, ensuring that $d_\lambda$ is almost surely well-defined.

When $\lambda = 0$ (unconstrained problem), $f_0(X) = \eta_1(X)$.
Since $\eta_1(X) > 0$, the PPV associated with the decision rule $d_0(X)$ is equal to the prevalence, which we assume is not greater than the pre-specified $\alpha$. Denoting the decision rule with $\lambda = 0$ by $d_0$, we have

$$h(d_0) < 0.$$

Since the integrand defining $h(d_\lambda)$,

$$1\{d_\lambda(X) = 1\}\big\{\gamma\eta_1(X) - \alpha\gamma\eta_1(X) - \alpha\eta_0(X)\big\},$$

is bounded and measurable, $h(d_\lambda)$ is continuous in $\lambda$ by the dominated convergence theorem. Therefore, by the intermediate value theorem, there exists $\lambda^{*} > 0$ such that

$$h(d_{\lambda^{*}}) = 0.$$

At this value, $d_{\lambda^{*}}$ is feasible for the primal problem, so

$$p^{*} \ge f(d_{\lambda^{*}}).$$

Since $d_{\lambda^{*}}$ maximizes the Lagrangian at $\lambda^{*}$,

$$f(d_{\lambda^{*}}) = L(d_{\lambda^{*}}, \lambda^{*}) = g(\lambda^{*}).$$

Finally, because $g(\lambda^{*}) \ge g^{*} \ge p^{*}$, all inequalities hold with equality, implying

$$p^{*} = g^{*}.$$

This establishes strong duality and justifies the use of the Lagrangian formulation in our setting.

8 Appendix B (Optimization algorithm and justification)

Consider the population version of the constrained optimization problem:

$$\text{maximize}\quad E\left[D\,\Phi\!\left(\frac{X_i^{T}\beta}{h}\right)\Big/ p_1\right] \quad \text{subject to}\quad E\left[D\,\Phi\!\left(\frac{X_i^{T}\beta}{h}\right)\Big/ p_1\right] - \alpha E\left[\Phi\!\left(\frac{X_i^{T}\beta}{h}\right)\right] \ge 0. \tag{6}$$

The Lagrangian corresponding to (6) is

$$L_\Phi(\beta, \lambda) = E\left[D\,\Phi\!\left(\frac{X_i^{T}\beta}{h}\right)\Big/ p_1\right] - \lambda\left(\alpha E\left[\Phi\!\left(\frac{X_i^{T}\beta}{h}\right)\right] - E\left[D\,\Phi\!\left(\frac{X_i^{T}\beta}{h}\right)\Big/ p_1\right]\right) \tag{7}$$

for Lagrange multiplier $\lambda \in [0, \infty]$. Denote $\kappa = \frac{\lambda}{1 + \lambda} \in [0, 1]$; then, for some fixed $\kappa$, our goal is to solve for $\beta_\kappa^{\Phi}$, where

$$\beta_\kappa^{\Phi} = \arg\max_{\beta} E\left[\{D(1 - \kappa) - \kappa(\alpha p_1 - D)\}\,\Phi\!\left(\frac{X_i^{T}\beta}{h}\right)\right]. \tag{8}$$

Let $B(\kappa)$ and $P(\kappa)$ denote the expected benefit and the PPV constraint, respectively, associated with the optimal decision rules from (7):

$$B(\kappa) = E\left[D\,\Phi\!\left(\frac{X_i^{T}\beta_\kappa^{\Phi}}{h}\right)\Big/ p_1\right],$$

$$P(\kappa) = \alpha E\left[\Phi\!\left(\frac{X_i^{T}\beta_\kappa^{\Phi}}{h}\right)\right] - E\left[D\,\Phi\!\left(\frac{X_i^{T}\beta_\kappa^{\Phi}}{h}\right)\Big/ p_1\right].$$

If it holds that $P(1) < 0 \le P(0)$, we can obtain some $\kappa^{*}$ (which may not be unique) such that $P(\kappa^{*}) = 0$.
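The sample analogue of the objective in (8) for a fixed $\kappa$ can be sketched as follows. This is a minimal illustration, not the authors' R code: it assumes that $\Phi$ is the standard normal CDF, and the names `normal_cdf` and `smoothed_objective` are hypothetical. Any general-purpose optimizer could then be applied to this function over $\beta$.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, used here as the smoothing function Phi (an assumption)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def smoothed_objective(beta, X, D, kappa, alpha, p1, h):
    """Sample average of {D(1 - kappa) - kappa(alpha*p1 - D)} * Phi(x'beta / h)."""
    total = 0.0
    for x, d in zip(X, D):
        score = sum(b * xj for b, xj in zip(beta, x))
        # Algebraically, this weight equals d - kappa * alpha * p1.
        weight = d * (1.0 - kappa) - kappa * (alpha * p1 - d)
        total += weight * normal_cdf(score / h)
    return total / len(D)


# Toy evaluation: one case and one control, both exactly on the decision boundary.
value = smoothed_objective(
    beta=(0.0, 0.0),
    X=[(1.0, 0.0), (1.0, 0.0)],
    D=[1, 0],
    kappa=0.5, alpha=0.04, p1=0.01, h=0.1,
)
```

The weight simplifies to $D - \kappa\alpha p_1$, so each flagged case contributes $1 - \kappa\alpha p_1$ and each flagged control contributes $-\kappa\alpha p_1$; increasing $\kappa$ therefore raises the penalty for flagging controls.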
The restriction $P(1) < 0$ ensures that the constraint admits a feasible solution, and $P(0) \ge 0$ excludes the case where the PPV under the unconstrained problem is already larger than $\alpha$. The above indicates that, to solve (6), we only need to solve the unconstrained problem in (8) for any $\kappa$ and then find $\kappa^*$ such that the PPV associated with $\kappa^*$ satisfies the constraint.

Algorithm 1 Estimating optimal linear decision rule
1: Given training data $(D_i, X_i)$ for $i = 1, \ldots, n$ and pre-specified $\alpha$.
2: Start with an initial value of $\beta$, i.e. $\hat\beta_0$, which one can derive from the standard logistic regression approach.
3: Obtain $\hat\beta^\Phi_\kappa = \arg\max_\beta \sum_{i=1}^{n} \left[\{D_i(1 - \kappa) - \kappa(\alpha p_1 - D_i)\}\,\Phi\!\left(\frac{X_i^T\beta}{h}\right)\right]$ using any non-linear optimization routine of your choice (e.g., nlm in R).
4: Perform a grid search on $[0, 1]$ for $\hat\kappa$ that solves $\widehat{\mathrm{PPV}}\big(1\{X^T\hat\beta^\Phi_\kappa > 0\}\big) - \alpha = 0$.
5: Project $\hat\beta^\Phi_{\hat\kappa}$ to satisfy $\|\hat\beta^\Phi_{\hat\kappa}\|_2^2 = 1$.
6: The final linear rule is of the form $\hat d_L(X, \hat\beta^\Phi_{\hat\kappa}) = 1\{X^T\hat\beta^\Phi_{\hat\kappa} > 0\}$.

9 Appendix C (Additional numerical study and data application)

9.1 Nested case-control sampling

Here we consider the scenario where data are obtained from a nested case-control sampling design, commonly considered in biomarker studies (Ernster, 1994). We first simulate a large cohort of size $N = 10^6$, with covariates $X = (X_1, X_2)$ simulated from the standard normal distribution, and the outcomes $D$ generated from the conditional distribution $\mathrm{pr}(D = 1 \mid X_1, X_2) = 1/\{1 + \exp(-\beta_0 - X_1\beta_1 - X_2\beta_2)\}$. A similar simulation setting is illustrated in Rose and van der Laan (2008). We set $\beta_0 = -8$, $\beta_1 = 2.1$, and $\beta_2 = 2.1$ such that the disease prevalence is approximately 1%. We further added 6% contaminated control ($D = 0$, $X_1 = 6$ and $X_2 = 6$) observations to depict the setting where the data contain some controls that are actually cases.
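The cohort-generation step described so far can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' simulation code: it assumes the standard logistic form $1/\{1 + \exp(-(\beta_0 + X_1\beta_1 + X_2\beta_2))\}$, which is the sign convention that yields a prevalence near 1% for the stated coefficients; it assumes the 6% contamination is taken relative to the cohort size; and it uses a smaller cohort than $N = 10^6$ purely for speed. The function name `simulate_cohort` and its arguments are hypothetical.

```python
import math
import random

def simulate_cohort(n, beta0=-8.0, beta1=2.1, beta2=2.1, contam_frac=0.06, seed=1):
    """Simulate (D, X1, X2) for n subjects, then append contaminated controls."""
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Assumed logistic form: pr(D=1|X) = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2)))
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x1 + beta2 * x2)))
        d = 1 if rng.random() < p else 0
        cohort.append((d, x1, x2))
    # Assumption: "6%" is interpreted relative to the cohort size n.
    n_contam = int(round(contam_frac * n))
    cohort.extend((0, 6.0, 6.0) for _ in range(n_contam))
    return cohort

n_clean = 100_000  # smaller than N = 10^6, for speed only
cohort = simulate_cohort(n_clean)
prevalence = sum(d for d, _, _ in cohort[:n_clean]) / n_clean
n_contaminated = len(cohort) - n_clean
print(f"prevalence before contamination: {prevalence:.4f}")
print(f"contaminated controls appended: {n_contaminated}")
```

The contaminated points sit far out in the high-risk corner of the covariate space while carrying a control label, which is what makes them influential for likelihood-based fits.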
We then obtain the case-control data by randomly sampling $n_1$ cases and $n_0 = 20 n_1$ controls, resulting in a sample size of $n = n_1 + n_0$. We implement our methods on training samples of size $n_{\text{train}} = 2100$ and $4200$. Performance was evaluated on an independent test sample of size $n_{\text{test}} = 21{,}000$. The results are presented in Table 5.

Table 5: Nested case-control study sampling scenario with 6% contaminated control observations. Average TPR and PPV with corresponding standard error (in parentheses) in the test sample across 500 simulation replicates under PPV ($\alpha = 0.04$) and prevalence ($p_1 = 0.01$).

| n | Measure | Standard | Plug-in Logistic | Plug-in SL | DOOLR |
|---|---|---|---|---|---|
| 2100 | TPR | 0.710(0.184) | 0.591(0.273) | 0.956(0.009) | 0.932(0.045) |
| | PPV | 0.053(0.010) | 0.037(0.007) | 0.040(0.003) | 0.042(0.004) |
| 4200 | TPR | 0.770(0.194) | 0.654(0.301) | 0.957(0.007) | 0.945(0.019) |
| | PPV | 0.053(0.008) | 0.040(0.006) | 0.039(0.003) | 0.042(0.003) |

9.2 Results under different PPV constraint

We present simulation and application results for scenarios in the manuscript for different PPV constraints.

Table 6: Application to identifying NOD patients at high risk of PDAC. Estimate of prevalence of pancreatic cancer in the PANDORA study is 0.44%. We pre-specify PPV constraint values at 1%, 1.5%, and 2% and report average TPR, PPV with corresponding standard error (in parentheses). When the PPV constraint is low at 1%, the Standard approach performed better than our proposed linear rules. However, as the PPV constraints increase, we observe reduction in performance of the Standard approach.
| α | Measure | Standard | Plug-in Logistic | Plug-in SL | DOOLR | DOOLR-IT |
|---|---|---|---|---|---|---|
| 0.010 | TPR | 0.964(0.048) | 0.969(0.043) | 0.961(0.055) | 0.906(0.068) | 0.912(0.069) |
| | PPV | 0.010(0.001) | 0.010(0.001) | 0.010(0.001) | 0.010(0.002) | 0.010(0.001) |
| 0.015 | TPR | 0.807(0.110) | 0.817(0.097) | 0.848(0.092) | 0.830(0.111) | 0.839(0.099) |
| | PPV | 0.016(0.006) | 0.015(0.004) | 0.015(0.003) | 0.016(0.004) | 0.015(0.003) |
| 0.020 | TPR | 0.659(0.124) | 0.671(0.127) | 0.745(0.110) | 0.755(0.106) | 0.775(0.105) |
| | PPV | 0.024(0.012) | 0.024(0.015) | 0.020(0.006) | 0.020(0.006) | 0.020(0.004) |

Table 7: Data generation is under the linear rule with contamination scenario ($\beta_0 = -12.5$, $\beta_1 = \beta_2 = 7$, $p_1 = 0.1$, and $\alpha = 0.3$): mean and standard deviation (in parentheses) of TPR and PPV for the DOOLR rule across smoothing parameters $h$. Here, Adaptive refers to $h = n^{-1/3}\,\mathrm{StdDev}(X^T\hat\beta_0 / \|\hat\beta_0\|)$. For this numerical experiment, we observe that $h = 5$ and $h = 0.1$ gave the best empirical performance across the selected $h$ values. The Adaptive choice of $h$ performed comparably to the other choices of $h$ and not far from the best performing $h$ values.

| n | Measure | h = 0.003 | h = 0.02 | h = 0.10 | h = 0.50 | h = 1.00 | h = 5.00 | Adaptive |
|---|---|---|---|---|---|---|---|---|
| 200 | TPR | 0.884(0.197) | 0.911(0.178) | 0.906(0.131) | 0.843(0.236) | 0.844(0.278) | 0.905(0.203) | 0.875(0.154) |
| | PPV | 0.319(0.105) | 0.306(0.120) | 0.376(0.079) | 0.440(0.134) | 0.336(0.087) | 0.339(0.093) | 0.435(0.105) |
| 400 | TPR | 0.887(0.204) | 0.888(0.205) | 0.927(0.118) | 0.863(0.254) | 0.933(0.205) | 0.959(0.100) | 0.894(0.177) |
| | PPV | 0.347(0.091) | 0.348(0.080) | 0.386(0.071) | 0.431(0.117) | 0.354(0.058) | 0.356(0.060) | 0.430(0.100) |
| 800 | TPR | 0.900(0.199) | 0.906(0.193) | 0.959(0.076) | 0.858(0.306) | 0.911(0.264) | 0.964(0.120) | 0.923(0.188) |
| | PPV | 0.368(0.064) | 0.372(0.068) | 0.430(0.067) | 0.402(0.108) | 0.345(0.074) | 0.363(0.058) | 0.425(0.087) |

Table 8: Data generation is under the linear decision rule with no contamination: mean and standard deviation (in parentheses) of estimates of TPR and PPV for varying values of the PPV constraint ($\alpha$).
| α | n | Measure | Standard | Plug-in Logistic | Plug-in SL | DOOLR |
|---|---|---|---|---|---|---|
| 0.030 | 1250 | TPR | 0.993(0.004) | 0.993(0.007) | 0.992(0.010) | 0.984(0.030) |
| | | PPV | 0.033(0.003) | 0.030(0.002) | 0.030(0.012) | 0.032(0.004) |
| | 2500 | TPR | 0.994(0.002) | 0.995(0.003) | 0.995(0.004) | 0.990(0.011) |
| | | PPV | 0.032(0.002) | 0.030(0.001) | 0.029(0.007) | 0.032(0.005) |
| 0.040 | 1250 | TPR | 0.988(0.005) | 0.988(0.009) | 0.988(0.011) | 0.975(0.036) |
| | | PPV | 0.042(0.003) | 0.040(0.002) | 0.039(0.011) | 0.044(0.009) |
| | 2500 | TPR | 0.991(0.003) | 0.991(0.005) | 0.991(0.006) | 0.983(0.017) |
| | | PPV | 0.042(0.002) | 0.040(0.002) | 0.040(0.005) | 0.043(0.006) |
| 0.045 | 1250 | TPR | 0.984(0.007) | 0.984(0.010) | 0.984(0.013) | 0.963(0.049) |
| | | PPV | 0.047(0.003) | 0.045(0.003) | 0.043(0.011) | 0.050(0.013) |
| | 2500 | TPR | 0.986(0.004) | 0.986(0.007) | 0.986(0.007) | 0.976(0.019) |
| | | PPV | 0.046(0.002) | 0.045(0.002) | 0.045(0.005) | 0.049(0.007) |

Table 9: Data generation is under the linear decision rule with contamination: mean and standard deviation (in parentheses) of estimates of TPR and PPV for varying values of the PPV constraint ($\alpha$).

| α | n | Measure | Standard | Plug-in Logistic | Plug-in SL | DOOLR |
|---|---|---|---|---|---|---|
| 0.030 | 1250 | TPR | 0.625(0.308) | 0.690(0.288) | 0.986(0.019) | 0.968(0.068) |
| | | PPV | 0.038(0.019) | 0.028(0.007) | 0.030(0.002) | 0.033(0.006) |
| | 2500 | TPR | 0.677(0.287) | 0.738(0.288) | 0.989(0.008) | 0.979(0.033) |
| | | PPV | 0.043(0.021) | 0.029(0.005) | 0.030(0.001) | 0.032(0.005) |
| 0.040 | 1250 | TPR | 0.519(0.295) | 0.608(0.306) | 0.973(0.022) | 0.939(0.091) |
| | | PPV | 0.044(0.025) | 0.033(0.012) | 0.040(0.003) | 0.044(0.009) |
| | 2500 | TPR | 0.606(0.271) | 0.664(0.291) | 0.977(0.012) | 0.962(0.041) |
| | | PPV | 0.051(0.022) | 0.037(0.008) | 0.040(0.002) | 0.043(0.006) |
| 0.045 | 1250 | TPR | 0.510(0.293) | 0.567(0.298) | 0.965(0.024) | 0.926(0.084) |
| | | PPV | 0.044(0.024) | 0.035(0.014) | 0.045(0.004) | 0.048(0.010) |
| | 2500 | TPR | 0.591(0.269) | 0.626(0.289) | 0.970(0.012) | 0.948(0.048) |
| | | PPV | 0.052(0.022) | 0.040(0.011) | 0.045(0.003) | 0.049(0.007) |

Table 10: Data generation is under the non-linear decision rule: mean and standard deviation (in parentheses) of estimates of TPR and PPV for varying values of the PPV constraint ($\alpha$).
| α | n | Measure | Standard | Plug-in Logistic | Plug-in SL | DOOLR |
|---|---|---|---|---|---|---|
| 0.030 | 1250 | TPR | 0.740(0.282) | 0.930(0.145) | 0.993(0.023) | 0.932(0.120) |
| | | PPV | 0.041(0.014) | 0.031(0.004) | 0.030(0.003) | 0.031(0.006) |
| | 2500 | TPR | 0.641(0.338) | 0.959(0.102) | 0.996(0.004) | 0.952(0.081) |
| | | PPV | 0.041(0.013) | 0.030(0.002) | 0.030(0.001) | 0.031(0.003) |
| 0.040 | 1250 | TPR | 0.682(0.261) | 0.796(0.236) | 0.983(0.027) | 0.862(0.153) |
| | | PPV | 0.047(0.009) | 0.041(0.007) | 0.040(0.003) | 0.038(0.011) |
| | 2500 | TPR | 0.613(0.301) | 0.839(0.201) | 0.988(0.010) | 0.879(0.124) |
| | | PPV | 0.047(0.009) | 0.041(0.004) | 0.040(0.002) | 0.039(0.009) |
| 0.045 | 1250 | TPR | 0.634(0.250) | 0.735(0.236) | 0.965(0.067) | 0.856(0.161) |
| | | PPV | 0.049(0.009) | 0.045(0.009) | 0.046(0.005) | 0.038(0.016) |
| | 2500 | TPR | 0.584(0.259) | 0.755(0.224) | 0.981(0.017) | 0.846(0.141) |
| | | PPV | 0.050(0.008) | 0.045(0.005) | 0.045(0.002) | 0.041(0.013) |

Table 11: Data generation is under the piece-wise linear decision rule: mean and standard deviation (in parentheses) of estimates of TPR and PPV for varying values of the PPV constraint ($\alpha$).

| α | n | Measure | Standard | Plug-in Logistic | Plug-in SL | DOOLR |
|---|---|---|---|---|---|---|
| 0.030 | 1250 | TPR | 0.457(0.115) | 0.514(0.094) | 0.919(0.117) | 0.499(0.101) |
| | | PPV | 0.035(0.012) | 0.031(0.007) | 0.031(0.012) | 0.044(0.032) |
| | 2500 | TPR | 0.508(0.104) | 0.561(0.075) | 0.961(0.057) | 0.545(0.088) |
| | | PPV | 0.036(0.012) | 0.031(0.005) | 0.030(0.003) | 0.044(0.030) |
| 0.040 | 1250 | TPR | 0.373(0.131) | 0.433(0.128) | 0.909(0.130) | 0.458(0.106) |
| | | PPV | 0.044(0.016) | 0.041(0.012) | 0.044(0.018) | 0.069(0.049) |
| | 2500 | TPR | 0.423(0.106) | 0.475(0.100) | 0.946(0.061) | 0.512(0.088) |
| | | PPV | 0.048(0.014) | 0.042(0.009) | 0.041(0.006) | 0.062(0.035) |
| 0.045 | 1250 | TPR | 0.368(0.134) | 0.423(0.128) | 0.869(0.128) | 0.458(0.120) |
| | | PPV | 0.050(0.015) | 0.046(0.014) | 0.049(0.019) | 0.077(0.048) |
| | 2500 | TPR | 0.355(0.123) | 0.415(0.120) | 0.922(0.071) | 0.503(0.093) |
| | | PPV | 0.051(0.015) | 0.046(0.009) | 0.045(0.005) | 0.069(0.038) |

10 Appendix D

10.1 Proof of theorems

The proof of Theorem 1 is based on Lemma 2 from Meisner et al. (2021) and M-estimation theory (Van der Vaart, 2000). Lemma 2 is stated and proved in Meisner et al.
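(2021), so we state it here without proof. We first define the following notation:
$$\widehat{\mathrm{FPR}}_\Phi(\beta_0, \beta_1) := n_0^{-1}\sum_{j=1}^{n_0} \Phi\!\left(\frac{\beta_0 + X_{0j}\beta_1}{h}\right), \qquad \widehat{\mathrm{TPR}}_\Phi(\beta_0, \beta_1) := n_1^{-1}\sum_{i=1}^{n_1} \Phi\!\left(\frac{\beta_0 + X_{1i}\beta_1}{h}\right),$$
$$\mathrm{FPR}_\Phi(\beta_0, \beta_1) := E\left\{\Phi\!\left(\frac{\beta_0 + X\beta_1}{h}\right) \Big|\, D = 0\right\}, \qquad \mathrm{TPR}_\Phi(\beta_0, \beta_1) := E\left\{\Phi\!\left(\frac{\beta_0 + X\beta_1}{h}\right) \Big|\, D = 1\right\}.$$

Lemma 1 (Lemma 2 from Meisner et al. (2021)). Under conditions (1)–(5), we have that
$$\sup_{(\beta_0, \beta_1) \in \Omega} \big|\widehat{\mathrm{FPR}}_\Phi(\beta_0, \beta_1) - \mathrm{FPR}(\beta_0, \beta_1)\big| \to 0, \qquad \sup_{(\beta_0, \beta_1) \in \Omega} \big|\widehat{\mathrm{TPR}}_\Phi(\beta_0, \beta_1) - \mathrm{TPR}(\beta_0, \beta_1)\big| \to 0$$
almost surely as $n \to \infty$, where $\Omega = \{(\beta_0, \beta_1) \in \mathbb{R} \times \mathbb{R}^p : \|\beta_1\| = 1\}$.

Proof of Theorem 1(a)

We want to show that $\mathrm{TPR}(\hat\beta^\Phi) \to \mathrm{TPR}(\beta^*)$ in probability.

Proof. Define $\beta := (\beta_0, \beta_1)$, $\widehat{\mathrm{TPR}}(\beta) := n_1^{-1}\sum_{i=1}^{n_1} 1\{X_{1i}\beta > 0\}$, and $\widehat{\mathrm{FPR}}(\beta) := n_0^{-1}\sum_{j=1}^{n_0} 1\{X_{0j}\beta > 0\}$. Then
$$
\begin{aligned}
\big|\mathrm{TPR}(\hat\beta^\Phi) - \mathrm{TPR}(\beta^*)\big|
&\le \big|\mathrm{TPR}(\hat\beta^\Phi) - \widehat{\mathrm{TPR}}(\hat\beta^\Phi) + \widehat{\mathrm{TPR}}(\hat\beta^\Phi) - \mathrm{TPR}(\beta^*)\big| \\
&\le \sup_{\beta \in \Omega}\big|\mathrm{TPR}(\beta) - \widehat{\mathrm{TPR}}(\beta)\big| + \big|\widehat{\mathrm{TPR}}(\hat\beta^\Phi) - \mathrm{TPR}_\Phi(\hat\beta^\Phi)\big| \\
&\quad + \big|\mathrm{TPR}_\Phi(\hat\beta^\Phi) - \mathrm{TPR}(\hat\beta^\Phi)\big| + \big|\mathrm{TPR}(\hat\beta^\Phi) - \mathrm{TPR}(\beta^*)\big|.
\end{aligned}
$$
Here $\sup_{\beta \in \Omega}|\mathrm{TPR}(\beta) - \widehat{\mathrm{TPR}}(\beta)|$ converges to 0 by Lemma 1, $|\mathrm{TPR}_\Phi(\hat\beta^\Phi) - \mathrm{TPR}(\hat\beta^\Phi)|$ converges to 0 as $h \to 0$, $|\widehat{\mathrm{TPR}}(\hat\beta^\Phi) - \mathrm{TPR}_\Phi(\hat\beta^\Phi)|$ converges to 0 by the Glivenko-Cantelli theorem, and $|\mathrm{TPR}(\hat\beta^\Phi) - \mathrm{TPR}(\beta^*)|$ converges to 0 by the continuous mapping theorem (we assumed $\mathrm{TPR}(\beta)$ is Lipschitz continuous).

Proof of Theorem 1(b)

Here we want to show that $\limsup_n \mathrm{PPV}(\hat\beta^\Phi) \ge \alpha$ in probability.

Proof. First, we define the continuous function $f(x, y) := \frac{p_1 x}{p_1 x + (1 - p_1) y}$ and express
$$\mathrm{PPV}(\hat\beta^\Phi) := \frac{p_1\,\mathrm{TPF}(\hat\beta^\Phi)}{p_1\,\mathrm{TPF}(\hat\beta^\Phi) + (1 - p_1)\,\mathrm{FPF}(\hat\beta^\Phi)},$$
where $\hat\beta^\Phi = (\hat\beta^\Phi_0, \hat\beta^\Phi_1)$, $\beta = (\beta_0, \beta_1)$, and $p_1$ is the prevalence, which we assumed to be fixed.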
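Then we have that
$$
\begin{aligned}
\mathrm{PPV}(\hat\beta^\Phi)
&= \big|\widehat{\mathrm{PPV}}_\Phi(\hat\beta^\Phi) + \{\mathrm{PPV}(\hat\beta^\Phi) - \widehat{\mathrm{PPV}}_\Phi(\hat\beta^\Phi)\}\big| \\
&\ge \big|\widehat{\mathrm{PPV}}_\Phi(\hat\beta^\Phi)\big| - \big|\mathrm{PPV}(\hat\beta^\Phi) - \widehat{\mathrm{PPV}}_\Phi(\hat\beta^\Phi)\big| \\
&\ge \alpha - \sup_{\beta \in \Omega}\big|\mathrm{PPV}(\beta) - \widehat{\mathrm{PPV}}_\Phi(\beta)\big|.
\end{aligned}
$$
From Lemma 1, we have that $\sup_{\beta \in \Omega}|\mathrm{PPV}(\beta) - \widehat{\mathrm{PPV}}_\Phi(\beta)|$ converges to zero almost surely, hence
$$\Pr\Big\{\liminf_n \mathrm{PPV}(\hat\beta) \ge \alpha\Big\} \ge \Pr\Big\{\liminf_n \sup_{\beta \in \Omega}\big|\mathrm{PPV}(\beta) - \widehat{\mathrm{PPV}}_\Phi(\beta)\big| = 0\Big\} = 1.$$

Proof of Theorem 1(c)

We want to show that $\hat\beta^\Phi \to \beta^*$ in probability.

Proof. To show the above, we employ Theorem 5.7 from Van der Vaart (2000). We first assume that $\beta^*$ is a well-separated point of maximum of $\mathrm{TPR}(\beta)$ and then show that the following conditions hold:

1. $\sup_{\beta \in \Omega}\big|\widehat{\mathrm{TPR}}_\Phi(\beta) - \mathrm{TPR}(\beta)\big| \xrightarrow{p} 0$ (convergence in probability);
2. $\widehat{\mathrm{TPR}}_\Phi(\hat\beta^\Phi) \ge \widehat{\mathrm{TPR}}_\Phi(\beta^*) - o_P(1)$.

From Lemma 1, we have $\sup_{\beta \in \Omega}|\widehat{\mathrm{TPR}}_\Phi(\beta) - \mathrm{TPR}(\beta)| \xrightarrow{p} 0$, hence condition 1 holds. To show condition 2, we note that $\hat\beta^\Phi$ is a near maximizer of $\widehat{\mathrm{TPR}}_\Phi(\beta)$, that is, $\widehat{\mathrm{TPR}}_\Phi(\hat\beta^\Phi) \ge \sup_{\beta \in \Omega}\widehat{\mathrm{TPR}}_\Phi(\beta) - o_P(1)$. We then have that
$$
\begin{aligned}
\widehat{\mathrm{TPR}}_\Phi(\hat\beta^\Phi) &\ge \sup_{\beta \in \Omega}\widehat{\mathrm{TPR}}_\Phi(\beta) - o_P(1), \\
\widehat{\mathrm{TPR}}_\Phi(\hat\beta^\Phi) &\ge \sup_{\beta \in \Omega}\widehat{\mathrm{TPR}}_\Phi(\beta) + \widehat{\mathrm{TPR}}_\Phi(\beta^*) - \widehat{\mathrm{TPR}}_\Phi(\beta^*) - o_P(1), \\
\widehat{\mathrm{TPR}}_\Phi(\hat\beta^\Phi) &\ge \widehat{\mathrm{TPR}}_\Phi(\beta^*) + \sup_{\beta \in \Omega}\widehat{\mathrm{TPR}}_\Phi(\beta) - \mathrm{TPR}(\beta^*) + \mathrm{TPR}(\beta^*) - \widehat{\mathrm{TPR}}_\Phi(\beta^*) - o_P(1), \\
\widehat{\mathrm{TPR}}_\Phi(\hat\beta^\Phi) &\ge \widehat{\mathrm{TPR}}_\Phi(\beta^*) + \sup_{\beta \in \Omega}\widehat{\mathrm{TPR}}_\Phi(\beta) - \sup_{\beta \in \Omega}\mathrm{TPR}(\beta) + \mathrm{TPR}(\beta^*) - \widehat{\mathrm{TPR}}_\Phi(\beta^*) - o_P(1).
\end{aligned}
$$
We therefore have that
$$\sup_{\beta \in \Omega}\widehat{\mathrm{TPR}}_\Phi(\beta) - \sup_{\beta \in \Omega}\mathrm{TPR}(\beta) \le \sup_{\beta \in \Omega}\big|\widehat{\mathrm{TPR}}_\Phi(\beta) - \mathrm{TPR}(\beta)\big| \to 0$$
and
$$\mathrm{TPR}(\beta^*) - \widehat{\mathrm{TPR}}_\Phi(\beta^*) \le \sup_{\beta \in \Omega}\big|\mathrm{TPR}(\beta) - \widehat{\mathrm{TPR}}_\Phi(\beta)\big| \to 0$$
almost surely from Lemma 1, hence $\widehat{\mathrm{TPR}}_\Phi(\hat\beta^\Phi) \ge \widehat{\mathrm{TPR}}_\Phi(\beta^*) - o_P(1)$. Therefore, we can conclude that $\hat\beta^\Phi \xrightarrow{p} \beta^*$.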
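The role of the smoothing parameter $h$ in the proofs above, where the smoothed rate based on $\Phi(\cdot/h)$ is required to approach the indicator-based rate as $h \to 0$, can be illustrated numerically. The following is a minimal sketch under illustrative assumptions, not part of the authors' analysis: the linear scores of the cases are drawn from a hypothetical $N(1, 1)$ distribution, and the helper names `normal_cdf`, `empirical_tpr`, and `smoothed_tpr` are introduced here only for the example.

```python
import math
import random

def normal_cdf(z):
    """Standard normal CDF Phi(z), computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def empirical_tpr(scores):
    """Indicator-based TPR: fraction of case scores that are positive."""
    return sum(1 for s in scores if s > 0) / len(scores)

def smoothed_tpr(scores, h):
    """Smoothed TPR: average of Phi(score / h) over the cases."""
    return sum(normal_cdf(s / h) for s in scores) / len(scores)

# Hypothetical linear scores for 2000 cases (illustrative N(1, 1) draws).
rng = random.Random(0)
case_scores = [rng.gauss(1.0, 1.0) for _ in range(2000)]

tpr = empirical_tpr(case_scores)
gaps = {h: abs(smoothed_tpr(case_scores, h) - tpr) for h in (1.0, 0.1, 0.01)}
for h, gap in gaps.items():
    print(f"h = {h:<5} |smoothed TPR - empirical TPR| = {gap:.4f}")
```

Only scores within a few multiples of $h$ of zero contribute to the gap, so for small $h$ the smoothed and indicator-based rates nearly coincide, while a large $h$ trades this approximation accuracy for a smoother objective.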