A General formulation for standardization of rates as a method to control confounding by measured and unmeasured disease risk factors

The Annals of Applied Statistics 2008, Vol. 2, No. 3, 1103–1122. DOI: 10.1214/08-AOAS170. © Institute of Mathematical Statistics, 2008.

By Steven D. Mark¹, University of Colorado School of Public Health

Standardization, a common approach for controlling confounding in population-studies or data from disease registries, is defined to be a weighted average of stratum specific rates. Typically, discussions on the construction of a particular standardized rate regard the strata as fixed, and focus on the considerations that affect the specification of weights. Each year the data from the SEER cancer registries are analyzed using a weighting procedure referred to as "direct standardization for age." To evaluate the performance of direct standardization, we define a general class of standardization operators. We regard a particular standardized rate to be the output of an operator and a given data set. Based on the functional form of the operators, we define a subclass of standardization operators that controls for confounding by measured risk factors. Using the fundamental disease probability paradigm for inference, we establish the conclusions that can be drawn from year-to-year contrasts of standardized rates produced by these operators in the presence of unmeasured cancer risk factors. These conclusions take the form of falsifying specific assumptions about the conditional probabilities of disease given all the risk factors (both measured and unmeasured), and the conditional probabilities of the unmeasured risk factors given the measured risk factors.
We show the one-to-one correspondence between these falsifications and the inferences made from the contrasts of directly standardized rates reported each year in the Annual Report to the Nation on the Status of Cancer. We further show that the "direct standardization for age" procedure is not a member of the class of unconfounded standardization operators. Consequently, it can, and usually will, introduce confounding when confounding is not present in the data. We propose a particular standardization operator, the SCC operator, that is in the class of unconfounded operators. We contrast the mathematical properties of the SCC and the SEER operator (SCA), and present an analysis of SEER cancer registry data that demonstrates the consequences of these differences. We further prove that the SCC operator is a projection operator. We discuss how this property can enable the SCC operator to be developed as a method for comparing nested conditional expectations in the same manner as is currently done with regression methods that control for confounding.

Received January 2008; revised January 2008.

¹ Supported by the University of Colorado Health Sciences Center at Denver.

Key words and phrases. Cancer registry, cancer trends, causal inference, confounding, direct standardization, fundamental disease probability, SEER, standardization.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Applied Statistics, 2008, Vol. 2, No. 3, 1103–1122. This reprint differs from the original in pagination and typographic detail.

1. Introduction. Each year the NCI's Surveillance, Epidemiology, and End Results (SEER) program compiles data (henceforth called SEER data) on cancer incidence and mortality from (currently) 17 population-based cancer registries in the United States [Howe et al. (2006)].
Since 1998 the National Cancer Institute, the American Cancer Society, the Centers for Disease Control, and the North American Association of Central Cancer Registries have analyzed the SEER data to produce an Annual Report to the Nation on the Status of Cancer in the United States (subsequently referred to as the Annual Reports). These reports contain estimates of the overall annual cancer incidence and mortality, as well as incidence/mortality by cancer site, and incidence/mortality within population subgroups defined by gender, race, ethnicity, and geographic location of the cancer registry. Some of the stated goals of these reports are to: (1) report on the cancer burden as it relates to cancer incidence and mortality and patient survival; (2) identify unusual changes and differences in the patterns of occurrence of specific forms of cancer in population subgroups defined by geographic, demographic, and social characteristics; (3) describe temporal changes in cancer incidence, mortality, extent of disease at diagnosis (stage), therapy, and patient survival that might impact cancer prevention and control strategies; (4) monitor the occurrence of possible iatrogenic cancers; and (5) attribute changes in cancer rates to temporal changes in diagnostic criteria, screening, preventive measures, cancer treatments, or environmental exposures [SEER (2005a), Ward et al. (2006)]. In addition to the goals common to all of the Annual Reports, each report has a special sub-focus. Since 2001 these reports have stated conclusions regarding: (1) absolute population rates and changes in cancer rates [Howe et al. (2001), Edwards et al. (2002), Weir et al. (2003), Jemal et al. (2004), Edwards et al. (2005), Howe et al. (2006)]; (2) the impact of screening and treatment on specific cancers [Howe et al.
(2001)]; (3) differences in cancer rates by gender, race, ethnicity, and geographic location [Howe et al. (2001), Weir et al. (2003), Jemal et al. (2004), Edwards et al. (2005)]; (4) causes of the difference in rates within the subgroups listed in (3) [Jemal et al. (2004), Edwards et al. (2005), Howe et al. (2006)]; and (5) the future public policies and expenditures that should be undertaken to increase cancer prevention and improve access to medical care [Weir et al. (2003), Edwards et al. (2005), Howe et al. (2006)].

In order to make meaningful statements about the year-to-year changes in cancer incidence/mortality as a function of one set of characteristics, it is necessary to control for differences in the frequency of cancer risk factors that are not in the set of interest. We refer to any statistical procedure that attempts to separate the effect on cancer rates of one set of measured covariates from that of another set as a procedure that controls for confounding. The common methods of controlling for confounding are as follows: (1) multivariate regression; (2) stratification; and (3) standardization. Standardization is virtually always the method of choice when inference is made from population-studies, or data from disease registries. It is the procedure used in the Annual Reports [Klein and Schoenborn (2001), Ries and Kosary (2005)].

The particular standardization method used to analyze SEER data is designed to control for year-to-year differences in age distributions. We refer to this method as Standardization Controlling for Age (SCA). In this paper we present a new procedure that allows researchers to control for any set of measured covariates. We refer to this procedure as Standardization Controlling for Covariates (SCC).

The paper is organized as follows.
In Section 2 we define nomenclature for a completely general data structure, and describe the SEER data in terms of this nomenclature. In Sections 2.1 and 2.2 we give formulae for the usual representations of standardized rates as weighted averages of a given set of stratum specific rates. We detail the rationale for the specific choice of weights used in SCA standardization. We then define SCA and SCC standardized rates as the output of SCA and SCC operators. These operators are functionals of the empirical distribution of a given set of data, and a user-defined weighting distribution. Using the operator formulation, we define a general class of all standardization operators. In Section 3 we formalize our previous discussion of the goals of standardization. We define criteria that specify when contrasts of crude-cancer rates are "not confounded." We extend these criteria to define the subclass of standardization operators that produce contrasts of standardized rates that are not confounded. The SCC operator falls within this subclass; the SCA operator does not. We show that if one begins with crude rate differences that are not confounded, the SCA operator introduces confounding. We discuss how the differences in properties of the SCA and SCC operators relate to the differences in the functionals. In Section 4 we present analyses of the SEER 13 data that demonstrate the properties described in Section 3.

Up until Section 5 we discuss confounding in terms of measured risk factors. In Section 5 we provide a formal framework for examining what inferences can be made from the standardized rate differences produced by the SCC operator in the presence of unmeasured risk factors. Such inferences require assumptions that can be neither completely falsified nor confirmed by examination of the observed distribution functions.
We show that nonzero between-year differences in standardized rates allow one to reject certain assumptions about unmeasured risk factors, and that violations of these assumptions correspond directly to the inferences made by SEER investigators in the Annual Reports [Ward et al. (2006)].

In Section 6 we change focus from between-year inferences to within-year inferences. We define "nested" standardized rates, derive the properties of nested rates produced by the SCA and SCC operators, and discuss implications for within-year model building of standardized rates.

In the discussion section we summarize our results, suggest how the standardization operators we propose can be used for nonparametric, semiparametric, or parametric estimation of conditional means, and discuss the direction of our current work on developing software that will implement the operators we describe.

2. Data structure, crude-cancer rates and finest-crude-cancer rates. Let $(D^y, Z^y)$ be any vector of real valued random variables, and $P_y(D, Z)$ any set of probability distributions defined on the support of $(D^y, Z^y)$; $y \in Y$; $Y \equiv \{1, 2, \ldots, n\}$; $n$ finite. Since the support of $P_y(D, Z)$ does not vary with $y$, we frequently drop the superscripts for random variables and write $(D, Z)$. In SEER data $D$ is a vector of indicator variables denoting the presence or absence of a specific cancer type; $Z$ is the set of all other measured covariates; $P_y(D, Z)$ is the empirical distribution of $(D^y, Z^y)$ for year $y$; $Y$ is the set of years for which we have data on $(D, Z)$. Thus, the SEER data for year $y$ consists of an observation of $P_y(D, Z)$.
Since in the Annual Reports an individual is classified as either having or not having a specific cancer (or a cancer in a defined set), and the joint distribution of cancers is not of interest, we will regard $D$ to be a binary random variable: $D = 1$ when an individual has a cancer in the set of interest, $D = 0$ otherwise. We assume that our interest is in the distribution $P_y(D, E)$, where $E$ is the (possibly improper) subset of $Z$ which investigators believe contains all the "measured cancer risk factors." Without loss of generality we adhere to the structure of the SEER data, and regard the support of $(D, E)$ as discrete. We assume that in SEER the measured cancer risk factors, $E$, consist entirely of information about an individual's age, gender, race, ethnicity, and catchment area of cancer registry (henceforth called place):

$$E = (\text{age, gender, race, ethnicity, place}). \tag{1}$$

These are, in fact, the only covariates required to produce the standardized rates given in the Annual Reports.

Let $E = (E_1, E_2)$ be any factorization of $E$ such that $E_1 \cap E_2 = \emptyset$; $E_1 \subseteq E$. We use notation such as $P_y(D \mid E)$, $P_y(D \mid E_1)$, $P_y(E_2 \mid E_1)$ to denote the conditional distributions of $P_y(D, E)$. For any $E^\dagger \subseteq E$, we refer to $P_y(D \mid E^\dagger)$ as the crude-cancer rate for $E^\dagger$. When $E^\dagger = E$, we refer to $P_y(D \mid E)$ as the finest-crude-cancer rate. Any crude-cancer rate is related to the finest-crude-cancer rate by the integral given in (2):

$$P_y(D \mid E_1) = \int_{\mathcal{E}_2} P_y(D \mid E_1, E_2)\, dP_y(E_2 \mid E_1). \tag{2}$$

The region of integration in (2) is $\mathcal{E}_2$, the support of $E_2$. Throughout the paper calligraphic letters indicate the support of random variables.
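For discrete covariates the integral in (2) is a weighted sum, so the relation between a crude-cancer rate and the finest-crude-cancer rates can be checked numerically. The following is a minimal sketch only: the function name `crude_rate`, the age groups, and the counts are hypothetical and are not taken from SEER.

```python
def crude_rate(finest_rates, cond_weights):
    """Discrete form of (2): sum over e2 of P_y(D | e1, e2) * P_y(e2 | e1).

    finest_rates: dict e2 -> P_y(D = 1 | e1, e2) for one fixed value e1
    cond_weights: dict e2 -> P_y(e2 | e1) for the same e1
    """
    total_weight = sum(cond_weights.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("P_y(e2 | e1) must sum to 1 over the support of E2")
    return sum(finest_rates[e2] * w for e2, w in cond_weights.items())


# Hypothetical counts for one E1 stratum (say, males), split by age group (E2).
persons = {"40-59": 500, "60-79": 300, "80+": 200}
cases = {"40-59": 1, "60-79": 3, "80+": 4}

n = sum(persons.values())
finest = {a: cases[a] / persons[a] for a in persons}   # P_y(D | male, age)
weights = {a: persons[a] / n for a in persons}         # P_y(age | male)

rate = crude_rate(finest, weights)
print(rate * 100_000)   # crude rate expressed per 100,000 persons
```

Because the weights are the stratum's own conditional frequencies, the weighted sum collapses to total cases divided by total persons, which is the crude rate.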
When, as in the SEER data, $(D^y, E^y)$ have discrete support, we can express the right-hand side of (2) as a sum of the product of discrete conditional probabilities,

$$P_y(D \mid E_1) = \sum_{\mathcal{E}_2} P_y(D \mid E_1, E_2) \times P_y(E_2 \mid E_1).$$

The crude-cancer rate on the left-hand side is the frequency of disease in subjects with a given value of $E_1$.

2.1. General definition and formulae for standardized rates. We define $s^*_y[D \mid E_1]$ to be a standardized cancer rate given $E_1 = e_1^*$ if it can be expressed in the form of the integral given in (3),

$$s^*_y[D \mid e_1^*] \equiv \int_{\mathcal{E}_2^\dagger} P_y(D \mid e_1^*, e_2^\dagger)\, dP^*(e_2^\dagger). \tag{3}$$

Here $E_2^\dagger \subseteq E$, $e_2^\dagger \in \mathcal{E}_2^\dagger$, and $P^*(E_2^\dagger)$ is any user-defined measure that has the same support as $E_2^\dagger$ and is consistent with a probability measure. Under the restriction that $(D, E)$ has discrete support, we can write (3) as

$$s^*_y[D \mid e_1^*] \equiv \sum_{e_2^\dagger \in \mathcal{E}_2^\dagger} P_y(D \mid e_1^*, E_2^\dagger = e_2^\dagger)\, dP^*(e_2^\dagger). \tag{4}$$

Equation (4) is the weighted sum of stratum specific rates: $E_2^\dagger$ defines the strata; $dP^*(e_2^\dagger)$ is the weight for stratum $e_2^\dagger$; $P_y(D \mid e_1^*, E_2^\dagger = e_2^\dagger)$ is the crude-cancer rate within stratum $E_1 = e_1^*$, $E_2^\dagger = e_2^\dagger$. Equation (4) is equivalent to the usual algebraic definition of a standardized rate [Rothman (1986)].

Typically, discussions about standardization assume the strata are fixed and focus on the choice of weights [Rothman (1986)]. The general advice is that the choice of weights should depend upon the interpretation one desires to ascribe to the standardized rates [Rothman (1986)]. The weights used in the SEER implementation of the SCA method are the age-frequency of the US population in year 2000 [Klein and Schoenborn (2001), Ries, Eisner, and Kosary (2005), Ward et al. (2006)].
This is referred to as direct standardization [Klein and Schoenborn (2001), Rothman (1986)]. In particular, the SCA procedure of SEER is described as: "Age adjustment, using the direct method, is the application of observed age-specific rates to a standard age distribution to eliminate differences in crude rates in populations of interest that result from differences in the populations' age distribution [Klein and Schoenborn (2001)]." The justification for direct standardization of the cancer rates in year $y$ is that the standardized rate for year $y$ will represent the cancer rates that would have been observed in year $y$ had the age distribution in year $y$ been identical to the age distribution in year 2000 [Klein and Schoenborn (2001), Rothman (1986), Anderson and Rosenberg (1998)]. The advantage of expressing standardized rates in terms of "what would have been seen in some year $y$" is that such weighting produces standardized rates which preserve the magnitude of the crude-cancer rates. Since the magnitude of the year-to-year differences in cancer rates is of importance to the inferences made in the Annual Reports, it is desirable to choose a standardization procedure that preserves these values.

2.2. Defining the SCA and SCC operators. To contrast the properties of SCA and SCC standardized rates, it is best to regard them as the output of SCA and SCC operators. We define a standardization operator, $S^*_y[D \mid E_1]$, to be any functional of $P_y(D, E)$, $E_1$, $E_2^\dagger$ and a user-defined probability distribution, $P^*(E_2^\dagger)$, that can be expressed by the integral in (5):

$$S^*_y[D \mid E_1] \equiv \int_{\mathcal{E}_2^\dagger} P_y(D \mid E_1, E_2^\dagger)\, dP^*(E_2^\dagger). \tag{5}$$

For a particular $E_1 = e_1^*$, the standardized rate is denoted by the left-hand side of (3).

Let $P^*(E)$ be a user-defined probability distribution with the same support as $E$.
Let $A$ be any random variable in $E$, and $P^*(A)$ the marginal probability of $A$ from distribution $P^*(E)$. We denote the points of support of $A$ as $(a_1, a_2, \ldots, a_N)$. Using "$\setminus$" as the set difference operator, we define $E^a = E \setminus A$, $E_1^a = E_1 \setminus A$; $E_2^a = E_2 \setminus A$. We define the SCA operator, $S^{ca}_y[D \mid E_1^a]$, to be

$$S^{ca}_y[D \mid E_1^a] \equiv \int_{\mathcal{A}} P_y(D \mid E_1^a, A)\, dP^*(A). \tag{6}$$

$s^{ca}_y[D \mid e_1^*]$ denotes the SCA standardized rate given $E_1^a = e_1^*$.

In the Annual Reports standardized rates are produced using the SCA operator: $A$ is age, and the support points are the five-year intervals into which age is categorized. The weighting distribution, $P^*(E)$, is the covariate distribution in year 2000, $P_{2000}(E)$. Thus, $P^*(A)$ is the age frequency in year 2000. For example, let $D$ be colon cancer, $E_1^a = (\text{gender})$, and $E_2^a = (\text{race, ethnicity, place})$. The SEER SCA estimate of the standardized rate of colon cancer conditional on gender being male is

$$s^{ca}_y[\text{colon cancer} \mid \text{male}] = \sum_{j=1}^{N} P_y(\text{colon cancer} \mid \text{male}, \text{age} = a_j) \times P_{2000}(\text{age} = a_j).$$

Here $P_y(\text{colon cancer} \mid \text{male}, \text{age} = a_j)$ is the frequency of colon cancer in year $y$ for males in age category $a_j$, and $P_{2000}(\text{age} = a_j)$ is the frequency of age group $a_j$ in year 2000.

We define the SCC operator, $S^{cc}_y[D \mid E_1]$, to be the functional of $P_y(D, E)$, $P^*(E)$, and $E_1$, given by the integral in (7):

$$S^{cc}_y[D \mid E_1] \equiv \int_{\mathcal{E}_2} P_y(D \mid E_1, E_2)\, dP^*(E_2 \mid E_1). \tag{7}$$

Using the same factorization of $E$, and the weighting distribution specified by (7), the SCC estimate of the standardized colon cancer rate conditional on gender equals male is

$$s^{cc}_y[\text{colon cancer} \mid \text{male}] = \sum_{\mathcal{E}_2} P_y(\text{colon cancer} \mid \text{male, age, race, ethnicity, place}) \times dP_{2000}(\text{age, race, ethnicity, place} \mid \text{male}).$$
Here $P_y(\text{colon cancer} \mid \text{male, age, race, ethnicity, place})$ is the frequency of colon cancer in year $y$ for males within strata defined by age, race, ethnicity, and place; $dP_{2000}(\text{age, race, ethnicity, place} \mid \text{male})$ is the frequency among males for a given (age, race, ethnicity, place).

For completeness we note that we have explicitly presented the SCA and SCC operators in terms of the random variables $(D, E)$ and the empirical distributions $P_y(D, E)$. If we restrict the data set to some $D^\ddagger \subset D$, and/or $E^\ddagger \subset E$, the operators are defined in terms of the empirical distribution $P_y(D^\ddagger, E^\ddagger)$. In practice, given the strong correlation of cancer type with age, such analyses are common. In Section 4 we present standardized rates for colon cancer and breast cancer. These rates were made from data limited to subjects 40 years of age or older. Similarly, when Ward et al. report standardized rates for childhood cancers, they restrict subjects to those age 19 or less.

3. Contrasting the properties of the SCC and SCA operators. Inspection of the formulas for the SCA (6) and SCC (7) operators reveals two important differences. In the SCA operator the crude-cancer rate varies as a function of $E_1$, but the weight does not. In the SCC operator the crude rate is always the finest-crude-cancer rate, and the weight depends on $E_1$. In this section we examine the consequences of these differences on the properties of the standardized rates. We begin by formalizing the goals of standardization discussed at the end of Section 2.1.

3.1. Standardization operators and the control of confounding by measured risk factors. The Annual Reports compare year-to-year differences in cancer rates conditional on some subset $E_1$ of $E$ (1).
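The SCA operator (6) and the SCC operator (7) can be sketched for discrete data as two short functions. This is an illustrative sketch under stated assumptions, not the SEER implementation: the covariate set is reduced to (gender, age, place), the function names `sca_rate` and `scc_rate` are hypothetical, and all counts are invented.

```python
from collections import defaultdict


def sca_rate(cases, persons, p_star, gender):
    """SCA operator (6) with A = age and E1a = gender.

    Age-specific crude rates P_y(D | gender, age) are weighted by the
    marginal standard age distribution P*(age).
    """
    d_by_age = defaultdict(float)   # cases among `gender`, by age
    n_by_age = defaultdict(float)   # persons among `gender`, by age
    w_by_age = defaultdict(float)   # P*(age), summed over all other covariates
    for (g, age, place), n in persons.items():
        w_by_age[age] += p_star[(g, age, place)]
        if g == gender:
            d_by_age[age] += cases[(g, age, place)]
            n_by_age[age] += n
    return sum(d_by_age[a] / n_by_age[a] * w_by_age[a] for a in n_by_age)


def scc_rate(cases, persons, p_star, gender):
    """SCC operator (7) with E1 = gender and E2 = (age, place).

    Finest-crude rates P_y(D | gender, age, place) are weighted by the
    conditional standard distribution P*(age, place | gender).
    """
    keys = [k for k in persons if k[0] == gender]
    p_star_gender = sum(p_star[k] for k in keys)   # P*(gender)
    return sum(cases[k] / persons[k] * p_star[k] / p_star_gender for k in keys)


# Hypothetical year-y counts, keyed by (gender, age, place).
persons_y = {
    ("M", "40-59", "A"): 400, ("M", "40-59", "B"): 200,
    ("M", "60+", "A"): 100, ("M", "60+", "B"): 300,
    ("F", "40-59", "A"): 300, ("F", "40-59", "B"): 300,
    ("F", "60+", "A"): 250, ("F", "60+", "B"): 150,
}
cases_y = {
    ("M", "40-59", "A"): 4, ("M", "40-59", "B"): 1,
    ("M", "60+", "A"): 3, ("M", "60+", "B"): 12,
    ("F", "40-59", "A"): 3, ("F", "40-59", "B"): 2,
    ("F", "60+", "A"): 5, ("F", "60+", "B"): 6,
}
# Hypothetical standard-year counts, normalized to the weighting distribution P*(E).
standard_counts = {
    ("M", "40-59", "A"): 350, ("M", "40-59", "B"): 250,
    ("M", "60+", "A"): 150, ("M", "60+", "B"): 250,
    ("F", "40-59", "A"): 250, ("F", "40-59", "B"): 250,
    ("F", "60+", "A"): 300, ("F", "60+", "B"): 200,
}
total = sum(standard_counts.values())
p_star = {k: v / total for k, v in standard_counts.items()}

sca_male = sca_rate(cases_y, persons_y, p_star, "M")
scc_male = scc_rate(cases_y, persons_y, p_star, "M")
print(sca_male, scc_male)
```

The two functions differ exactly as the formulas do: `sca_rate` collapses the data to age-specific crude rates and uses a weight that ignores gender, while `scc_rate` keeps the finest strata and uses a weight conditional on gender. With these invented counts the two standardized rates for males are not equal.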
The need for standardization arises because of the concern that differences in the crude-cancer rates may reflect year-to-year differences in the distribution of $E_2$. From (2), we see that the distribution of $E_2$ that affects the crude-cancer rate is $P_y(E_2 \mid E_1)$. If for years $y^\dagger$ and $y^{\dagger\dagger}$,

$$P_{y^\dagger}(E_2 \mid E_1) = P_{y^{\dagger\dagger}}(E_2 \mid E_1), \tag{8}$$

we say there is no $E_2$ confounding of the $E_1$ crude rate differences,

$$P_{y^\dagger}(D \mid E_1) - P_{y^{\dagger\dagger}}(D \mid E_1). \tag{9}$$

When (8) is true, contrasts of the crude rates provide the best estimates of the year-to-year differences. From (8) we know that the differences in the crude-cancer rates cannot be due to differences in the $E_2$ distribution; trivially, the crude rate differences achieve the desired goal of having the standardized contrasts preserve the observed magnitude of the differences in crude rates.

Assume now that (8) is true for all $y \in Y$, and that the weighting distribution used is $P_{2000}(E)$. By inspection of (7), we see that for all $y$, and any factorization of $E$, standardized rates produced by the SCC operator equal the crude-cancer rates. Thus, if there is no $E_2$ confounding of the $E_1$ crude rate differences, and one uses the SCC operator, contrasts of the SCC standardized rates are contrasts of the unconfounded crude rates.

To see that this is not the case for the SCA operator, we re-express (6) in terms of the finest-crude-cancer rates:

$$S^{ca}_y[D \mid E_1^a] = \int_{\mathcal{A}} \left\{ \int_{\mathcal{E}_2^a} P_y(D \mid E_1^a, E_2^a, a)\, dP_y(E_2^a \mid E_1^a, a) \right\} dP^*(A = a). \tag{10}$$

Suppose in (10) that $y = 2000$. Even were (8) true, and $P^*(A) = P_{2000}(A)$, the SCA operator does not return the crude-cancer rate.
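Both statements can be checked by substitution. The lines below are a sketch of the algebra rather than a derivation taken from the paper; they assume that (8) holds for all years, that the weights come from year 2000, and, for the SCA case, that age $A$ is a component of $E_2$, so that $E_1 = E_1^a$.

```latex
% SCC: under (8) for all y, P_{2000}(E_2 | E_1) = P_y(E_2 | E_1), so (7) reduces to (2).
\begin{align*}
S^{cc}_y[D \mid E_1]
  &= \int_{\mathcal{E}_2} P_y(D \mid E_1, E_2)\, dP_{2000}(E_2 \mid E_1)
   = \int_{\mathcal{E}_2} P_y(D \mid E_1, E_2)\, dP_y(E_2 \mid E_1)
   = P_y(D \mid E_1).
\end{align*}

% SCA in year 2000 with P^*(A) = P_{2000}(A), compared with the crude rate for the same year.
\begin{align*}
S^{ca}_{2000}[D \mid E_1^a]
  &= \int_{\mathcal{A}} P_{2000}(D \mid E_1^a, a)\, dP_{2000}(a), \\
P_{2000}(D \mid E_1^a)
  &= \int_{\mathcal{A}} P_{2000}(D \mid E_1^a, a)\, dP_{2000}(a \mid E_1^a).
\end{align*}
```

The two SCA lines share an integrand but integrate against different measures, the marginal age distribution and the age distribution conditional on $E_1^a$. They agree when age is independent of $E_1^a$ in year 2000, and in general differ otherwise.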
Thus, contrary to the stated justification for the choice of weights, the SCA standardized rates in year 2000 conditional on $E_1^a$ do not equal the observed cancer rates in year 2000, even though the weights used are the age distribution of year 2000. This will be graphically demonstrated in Figure 1.

Consistent with our definition of no confounding of crude rate differences (8), we define standardized rate differences,

$$s^*_{y^\dagger}[D \mid e_1^*] - s^*_{y^{\dagger\dagger}}[D \mid e_1^*], \tag{11}$$

to be unconfounded by $E_2$, if the standardized rates are produced by a standardization operator, $S^*_y[D \mid E_1]$, that can be expressed as an integral of the finest-crude-cancer rates with respect to a measure that depends only on the factorization of $E$ [see (A.1) for formal definition]. We refer to such operators as standardization operators with no $E_2$ confounding (SONC operators). The SCC operator is one such operator. In fact, if we choose $P^*(E) = P_{2000}(E)$, it is the unique standardization operator that produces the standardized cancer rates that "would have been seen in year $y$ had the covariate distribution in $y$ been identical to the covariate distribution in year 2000." Note that SONC operators produce standardized rate differences that are not confounded by $E_2$ regardless of whether (8) is true.

It is clear from (10) that the SCA operator does not, in general, produce standardized rate differences that are unconfounded. In fact, for a given factorization of $E$ and for specific $y^\dagger, y^{\dagger\dagger} \in Y$, the SCA operator produces unconfounded standardized rate differences if and only if

$$P_{y^\dagger}(E_2^a \mid E_1^a, a) = P_{y^{\dagger\dagger}}(E_2^a \mid E_1^a, a). \tag{12}$$

Since (8) does not imply (12), the SCA operator can produce standardized rate differences that are confounded by $E_2$ even when (8) is true and the crude rate differences are not confounded. For SCA to produce unconfounded standardized rate differences for all factorizations of $E$ requires (13):

$$P_{y^\dagger}(E^a \mid a) = P_{y^{\dagger\dagger}}(E^a \mid a). \tag{13}$$

When (13) is true for all possible combinations of years, $(y^\dagger, y^{\dagger\dagger})$, then conditional on age, the distribution of the other risk factors is identical for all years. Thus, for the SCA operator, which "standardizes only for age," to produce unconfounded standardized rate differences requires that the $P_y(E)$ distributions are identical except possibly for the marginal distribution of age. The equality in (13) does not hold for the analysis of the SEER data we present in Section 4.

If the $P^*(E)$ and $P_y(E)$ distributions are such that

$$P^*(E^a \mid A) = P_y(E^a \mid A) \quad \text{and} \quad P_y(E^a \mid A) = P_y(E^a), \tag{14}$$

then the rates produced by the SCA and SCC operators are identical. Thus, if (8) is true, the SEER SCA operator always returns the crude-cancer rates if and only if the $E^a$ distributions are identical for all years and age is independent of $E^a$.

It is instructive to consider the properties of the SCA and SCC operators with regard to confounding by measured risk factors in terms of the familiar regression model approach to control for such confounding. In this context, we regard the $P_y(D, E)$ as random samples from larger populations. For concreteness, we conceptualize that the $P_y(D \mid E)$ have log-linear Poisson distributions. Given that age is the primary determinant of cancer rates, we propose Poisson models stratified on age category, and obtain estimates of $P_y(D \mid E)$ from weighted sums of the predicted probabilities in each age stratum.
Our goal is to find the most parsimonious Poisson models from which to make inferences regarding the effect of sex and race on within- and between-year cancer rates. We begin with the within-age stratum models saturated in (sex, race, ethnicity, place). The predicted probabilities from these saturated models are identical to the probabilities given by the finest-crude-cancer rates, $P_y(D \mid E)$. We would make inferences from the more parsimonious models saturated only in (age, sex, race) if: (1) the regression coefficients for all covariates that are a function of either ethnicity or place were equal to zero; or (2) within each age stratum, (sex, race) were (in the data) statistically independent of (ethnicity, place). These criteria correspond to the usual rubric that covariates $E_2$ are not confounders of the effect of $E_1$ provided either $E_2$ are not risk factors for disease, or $E_2$ are uncorrelated with $E_1$.

The conditions required for a standardization operator to be in the class of SONC operators are related to, but more stringent than, those given above. SONC operators are integrals of the finest-crude-cancer rates with respect to a measure that does not depend on year (A.1). If the first regression criterion for non-confounding were true, then the integrand in the SCA operator (6) would be equivalent to the finest-crude-cancer rates. However, the second criterion specifies only that $P_y(E_2^a \mid E_1^a, a) = P_y(E_2^a \mid a)$, not that $P_y(E_2^a \mid a)$ is invariant to year. The latter is clearly necessary for SCA to be a SONC operator (10).

4. Using the SCA and SCC operators to analyze SEER data. In this section we present results from our analyses of the SEER data from 13 registries, years 1992–2003 [SEER 13 Regs Limited Use (2005b); henceforth called SEER 13].
We consider the subset of SEER 13 where age is greater than or equal to 40, race is limited to black or white, and ethnicity is limited to either Hispanic or non-Hispanic. Because of the restriction we place on race, we exclude Alaska and consider only 12 of the 13 cancer registries. In our analyses we limit the covariates to the $E$ defined in (1). The intent of this section is to demonstrate the existence of differences in the standardized rates produced by the SCA and SCC operators, particularly those differences discussed in Section 3. We make no comments about the statistical significance of these findings and provide no formal estimates of trends (see Discussion in Section 7). All rates given are per 100,000 persons.

Figure 1 is a graph of the crude and standardized (SCA and SCC) race-and-gender specific colon cancer incidence for each year from 1992 to 2003. For the SCA and SCC operators all rates are produced with $P^*(E) = P_{2000}(E)$. The SCA (dashed line) and SCC (solid line) rates differ for all groups and all years. Thus, (14) is false. For the SCA operator to produce rate differences that are not confounded, (12) must be true for this factorization of $E$. In SEER 13, (12) is false: there exists variation in the year-to-year $P_y(\text{Hispanic, place} \mid \text{age, gender, race} = \text{white})$. During the time period 1992 to 2003 the frequency of Hispanic ethnicity increased in every place, for both genders, and (with rare exception) for every age group (data not shown). Note, however, that in Figure 1 the slopes of each segment of the SCA and SCC plots for white males, and for both white and black females, are virtually identical. This indicates that though SCA rate differences may be confounded (11), inferences about the existence of trends may be identical.
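Whether (12) holds for a pair of years can be checked directly from stratum counts by comparing the conditional distributions of $E_2^a$ given $(E_1^a, \text{age})$. The sketch below is a hypothetical illustration: the function names, the stratum keys, and the counts are invented, and it tests exact equality of empirical frequencies up to a small numerical tolerance, with no allowance for sampling variability.

```python
from collections import defaultdict


def cond_dist(persons):
    """P_y(E2a = e2a | E1a = e1a, age) from counts keyed by (e1a, age, e2a)."""
    totals = defaultdict(float)
    for (e1a, age, _e2a), n in persons.items():
        totals[(e1a, age)] += n
    return {key: n / totals[(key[0], key[1])] for key, n in persons.items()}


def condition_12_holds(persons_y1, persons_y2, tol=1e-9):
    """True if the two years share P(E2a | E1a, age) on every stratum."""
    p1, p2 = cond_dist(persons_y1), cond_dist(persons_y2)
    return all(abs(p1.get(k, 0.0) - p2.get(k, 0.0)) <= tol
               for k in set(p1) | set(p2))


# Hypothetical counts keyed by (race, age, ethnicity).
year_a = {
    ("white", "40-59", "Hispanic"): 100, ("white", "40-59", "non-Hispanic"): 900,
    ("white", "60+", "Hispanic"): 50, ("white", "60+", "non-Hispanic"): 950,
}
# Same population size, but a larger Hispanic share within each age group.
year_b = {
    ("white", "40-59", "Hispanic"): 200, ("white", "40-59", "non-Hispanic"): 800,
    ("white", "60+", "Hispanic"): 80, ("white", "60+", "non-Hispanic"): 920,
}
# Twice the population of year_a with unchanged within-age composition.
year_c = {k: 2 * v for k, v in year_a.items()}

unchanged = condition_12_holds(year_a, year_c)
shifted = condition_12_holds(year_a, year_b)
print(unchanged, shifted)
```

In this toy setting a change in population size alone leaves (12) intact, while a rising share of one ethnicity within age groups, the pattern described for SEER 13, violates it.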
The slope of the SCA curve is determined by between-year differences in the value of the integral of the finest-crude-cancer rates with respect to the measure $P_y(E_2^a \mid E_1^a, a)$ (10); confounding of SCA differences requires only year-to-year variation of $P_y(E_2^a \mid E_1^a, a)$.

Fig. 1. Comparing sex and age specific crude-cancer rates with SCA and SCC standardized rates for the years 1992–2003. The dotted line is the crude colon cancer rate. The dashed line is the SCA standardized colon cancer rate. The solid line is the SCC standardized colon cancer rate.

Fig. 2. The absolute value of the percent difference between the SCA standardized colon cancer rate and the crude colon cancer rate by sex and race for the years 1992–2003. The plotted values for each year were produced by the following formula:

$$\frac{|\text{SCA standardized colon cancer rate} - \text{Crude colon cancer rate}|}{\text{SCA standardized colon cancer rate}} \times 100.$$

The dashed lines are the absolute value of the percent differences for whites. The solid lines are the absolute value of the percent difference for blacks.

The principal motivation for using the weights specified by direct standardization is the desire to produce standardized rates that reflect the true absolute values of the crude-cancer rates in the $E_1$ group. In Figure 1 we see that standardized rates from the SCC operator more closely track the crude-cancer rates than those produced by the SCA operator. In fact, as indicated earlier, the SCA standardized rate in the year 2000 does not equal the crude-cancer rate that would have been seen if the age distribution were identical to that of the year 2000. Figure 2 is a graph of the magnitude (the absolute value) of the percent difference in the SCA and crude-cancer rates.
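The quantity plotted in Figure 2 is a one-line computation. The sketch below follows the formula in the Figure 2 caption; the function name and the two rates (per 100,000 persons) are hypothetical and are not values from SEER 13.

```python
def abs_percent_difference(sca_rate, crude_rate):
    """|SCA rate - crude rate| / SCA rate * 100, as in the Figure 2 caption."""
    if sca_rate <= 0:
        raise ValueError("the SCA standardized rate must be positive")
    return abs(sca_rate - crude_rate) / sca_rate * 100.0


# Hypothetical rates per 100,000 persons for one sex-and-race group in one year.
sca_per_100k = 150.0
crude_per_100k = 120.0
pct = abs_percent_difference(sca_per_100k, crude_per_100k)
print(round(pct, 1))
```

Note that the denominator is the SCA standardized rate rather than the crude rate, so the measure is not symmetric in its two arguments.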
For both males and females the magnitude of these differences is greatest for blacks (solid line). This phenomenon is due to the fact that the $P_{2000}(\text{age})$ distribution is much closer to the age distribution for whites than for blacks. In addition, the year-to-year variation in the percent deviation appears to be greater for blacks. Thus, graphically, blacks always appear to have larger year-to-year changes in cancer incidence than do whites. These differences are more prominent when cancer rates are compared within groups defined by ethnicity (data not shown). Empirically we find that the lower the population frequency of a group, the greater the deviation of SCA standardized rates from the actual crude-cancer rates.

One previously unmentioned limitation of the SEER SCA operator is that it cannot produce age-specific cancer rates that control for differences in the distribution of other risk factors. Since age is by far the largest risk factor for cancer, comparing within-age-strata rates may reveal trends that are otherwise not visible. Figure 3 is a graph of the SCC standardized breast cancer rates for white females for each year from 1992 through 2003. This graph indicates an overall increase in breast cancer rates from 1992 to 1998, and a decrease from 1998 to 2003. Figure 4 contains a plot of the SCC breast cancer rates for white females in the years 1992 (dotted line), 1997 (solid line), and 2003 (dashed line), within each of the five-year age categories 40 years old or greater. The shapes of the graphs of standardized rates are similar for all three years. Consistent with Figure 3, we see that the lowest rates are for year 2003, and the highest for 1997.
What cannot be discerned from Figure 3 is that the differences in cancer rates for the years 2003 and 1997 are greatest for females older than 60; and that the differences between 2003 and 1992 rates are almost entirely due to rate differences in females over 60. The information in Figure 4 suggests that when considering possible causes of the calendar trends shown in Figure 3, one should focus on changes that were more prominent in females age 60 and older. Given that the usual approach in epidemiology is to make inferences about cancer rates within age groups, and not from marginal rates obtained by integrating out age, we anticipate that age-specific standardized rates will provide important additional information about cancer trends and between subgroup differences in cancer rates. Note that if one were to employ a “stratification strategy” and use the SCA operator to calculate separate standardized rates for each age group, the age standardized rates produced would in fact be the crude breast cancer rates, $P_y(\text{breast cancer} \mid \text{white}, \text{female}, \text{age} = a_j)$.

Fig. 3. The SCC standardized breast cancer rates for white females age forty and older for the years 1992–2003.

Fig. 4. The SCC standardized breast cancer rates by five-year interval for white females age forty and older. The dotted line is the SCC standardized breast cancer rates for 1992. The solid line is the SCC standardized breast cancer rates for 1997. The dashed line is the SCC standardized breast cancer rates for 2003.

5. Making inferences from observed rate differences of SCC standardized rates. In Section 3 we established that the SCC operator is in the class of operators in which differences in the $P_y(E_2 \mid E_1)$ do not affect the standardized rates produced by those operators.
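
The property just stated can be illustrated with a small numerical sketch on invented data. The sketch assumes a single $E_1$ group, two age bands, and one binary non-age covariate `h`. The SCA form used here (age-specific crude rates weighted by a standard age distribution) is one reading of the description accompanying (10), not a reproduction of (6), and every name and number below is hypothetical.

```python
def scc_rate(finest, p_star):
    """SCC-type rate: finest rates averaged over the fixed standard P*(age, h).
    The year's own covariate distribution never enters."""
    return sum(finest[cell] * p_star[cell] for cell in finest)


def sca_rate(finest, p_year, p_star_age):
    """Direct age standardization: within each age band the finest rates are
    mixed over h using the year's own P_y(h | age); the resulting crude
    age-specific rates are then weighted by the standard age distribution."""
    total = 0.0
    for age, weight in p_star_age.items():
        cells = [cell for cell in finest if cell[0] == age]
        p_age = sum(p_year[cell] for cell in cells)
        crude_age = sum(finest[cell] * p_year[cell] for cell in cells) / p_age
        total += weight * crude_age
    return total


# Finest rates per 100,000, keyed by (age band, h); identical in both years.
finest = {("40-59", 0): 100.0, ("40-59", 1): 200.0,
          ("60+", 0): 400.0, ("60+", 1): 600.0}

# Standard distribution P*(age, h) and its age margin.
p_star = {("40-59", 0): 0.5, ("40-59", 1): 0.1,
          ("60+", 0): 0.3, ("60+", 1): 0.1}
p_star_age = {"40-59": 0.6, "60+": 0.4}

# Year A matches the standard; in year B only P_y(h | age) has shifted.
p_year_a = dict(p_star)
p_year_b = {("40-59", 0): 0.4, ("40-59", 1): 0.2,
            ("60+", 0): 0.25, ("60+", 1): 0.15}

scc = scc_rate(finest, p_star)                  # same value for both years
sca_a = sca_rate(finest, p_year_a, p_star_age)
sca_b = sca_rate(finest, p_year_b, p_star_age)
print(scc, sca_a, sca_b)
```

With these invented inputs the SCC rate is 250 per 100,000 in both years, whereas the SCA rate moves from 250 to 270 even though no finest rate has changed. The SCA contrast therefore reflects only the shift in the distribution of `h` within age.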
Thus, if the SCC standardized rates for year $y^\dagger$ and $y^{\dagger\dagger}$ differ, we can conclude that the finest-crude-cancer rates differ for those years. However, despite the nomenclature, we do not know whether for fixed $E_2$ the finest-crude-cancer rates differ as a function of $E_1$, for fixed $E_1$ they differ as a function of $E_2$, or whether the differences in finest-crude-cancer rates depend on both $E_1$ and $E_2$.

To make the inferences of interest to the SEER investigators [Ward et al. (2006)], we require that we consider disease rates as a function of both measured and unmeasured risk factors. To incorporate the effect of unmeasured risk factors on inference, we use the fundamental disease probability (FDP) paradigm for inference proposed by Mark (2004, 2005, 2006, 2008). The results we present in this section depend only on the definitions presented in this section, and require no knowledge of, or results from, any of the other material contained in Mark (2004, 2005, 2006, 2008).

We define the fundamental disease probability for year $y$ to be the probability of disease conditional on all risk factors. We denote this by $P_y(D \mid E, U)$. Here $E$ are the measured risk factors; $U$ is a set of unmeasured risk factors that, along with $E$, completely determine the probability of disease. The relationship between the FDP and the finest-crude-cancer rates, $P_y(D \mid E)$, is
$$P_y(D \mid E) = \int_U P_y(D \mid E, U)\, dP_y(U \mid E). \tag{15}$$
Using the FDP paradigm for inference, we are able to falsify a subset of assumptions about the unmeasured risk factors based on contrasts of SCC standardized rates. We define two assumptions: the identical disease probability (IDP) assumption,
$$P_{y^\dagger}(D \mid E, U) = P_{y^{\dagger\dagger}}(D \mid E, U), \tag{16}$$
and the comparable-confounding assumption,
$$P_{y^\dagger}(U \mid E) = P_{y^{\dagger\dagger}}(U \mid E). \tag{17}$$

Without loss of generality, we use as example the inferences that can be made from contrasts of the overall (marginal) SCC standardized cancer rates in years $y^\dagger$ and $y^{\dagger\dagger}$. These are the standardized population cancer rates not conditional on any risk factors ($E_1 = \emptyset$). In terms of the FDP formulation this standardized rate is
$$s^{cc}_y[D] = \int_E \left[ \int_U P_y(D \mid E, U)\, dP_y(U \mid E) \right] dP^*(E). \tag{18}$$
If IDP (16) is true, and $s^{cc}_{y^\dagger}[D] \neq s^{cc}_{y^{\dagger\dagger}}[D]$, then we can conclude that the assumption of comparable-confounding (17) is false. For instance, dietary factors such as folate intake are suspected of being risk factors for colon cancer [Giovannucci (2002)]. SEER contains no measurement of folate intake. If within levels of $E$, the intake of folate has changed over time, then (17) is false. In fact, in the United States a population-wide folate supplementation program began in 1998; it is known that folate intake in the population has increased considerably since then [Quinlivan and Gregory (2003)].

Is the IDP assumption reasonable? If we believe that the determinants of a disease, and the impact of those determinants on the probability of disease, are inherent to the biology of humans and do not vary with year, then IDP is true. Such belief is consistent with our current conceptualization of biological processes. However, hidden in the IDP assumption is the assertion that the classification of disease and exposures is identical in year $y^\dagger$ and $y^{\dagger\dagger}$. If diagnostic criteria for colon cancer have changed, or, if diagnostic procedures for detecting colon cancer have changed (for instance, an increase in procedures that lead to early detection of colon cancer), then $D_{y^\dagger}$ and $D_{y^{\dagger\dagger}}$ may not in fact represent the same biological outcome.
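
The falsification argument built on (16)–(18) can be sketched numerically for discrete $E$ and $U$. The example below is only an illustration under stated assumptions: two levels of $E$, a binary unmeasured factor, and invented probabilities. It holds the FDP fixed across the two years, so that IDP (16) is true by construction, and changes only $P_y(U \mid E)$.

```python
def scc_marginal(fdp, p_u_given_e, p_star_e):
    """Discrete version of (18):
    sum over E of [sum over U of P_y(D | E, U) * P_y(U | E)] * P*(E)."""
    rate = 0.0
    for e, weight in p_star_e.items():
        finest = sum(fdp[(e, u)] * p_u for u, p_u in p_u_given_e[e].items())
        rate += weight * finest
    return rate


# Hypothetical FDP per 100,000, identical in both years (IDP holds).
fdp = {("e0", "u0"): 200.0, ("e0", "u1"): 100.0,
       ("e1", "u0"): 400.0, ("e1", "u1"): 200.0}

# Fixed standard distribution P*(E).
p_star_e = {"e0": 0.7, "e1": 0.3}

# P_y(U | E) differs between the two years (comparable confounding fails).
p_u_year1 = {"e0": {"u0": 0.5, "u1": 0.5}, "e1": {"u0": 0.5, "u1": 0.5}}
p_u_year2 = {"e0": {"u0": 0.2, "u1": 0.8}, "e1": {"u0": 0.2, "u1": 0.8}}

rate_year1 = scc_marginal(fdp, p_u_year1, p_star_e)
rate_year2 = scc_marginal(fdp, p_u_year2, p_star_e)
print(rate_year1, rate_year2)
```

Here the two standardized rates are 195 and 156 per 100,000. Since identical inputs to (18) necessarily give identical outputs, a nonzero contrast with the FDP held fixed can arise only from a change in $P_y(U \mid E)$.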
Similarly, if the measurement tools for ascertaining ethnicity and race have changed, then $E^\dagger$ and $E^{\dagger\dagger}$ may not measure the same attributes. In either case, we would expect IDP (16) to be false.

Identical reasoning proves that if comparable confounding (17) is true and the SCC standardized rates differ, then IDP (16) is false. In summary, if the observed SCC standardized rate differences are nonzero, we can conclude that IDP (16), comparable confounding (17), or both are false.

There is a direct correspondence between falsification of the above assumptions and the conclusions made in the Annual Reports to the Nation. Ward et al. begin their paper, Interpreting Cancer Trends, with the following sentence: “Temporal trends in the incidence of particular types of cancer may reflect changes in exposure to underlying etiologic factors, changes in classification, or the introduction of new screening or diagnostic tests.” The “changes in underlying etiologic factors” corresponds to comparable confounding being false (17); the “changes in classification, or the introduction of new screening or diagnostic tests,” corresponds to IDP (16) being false.

The FDP inferences given above apply to any SONC operator. They do not apply to the SCA standardization operator used in Ward et al. (2006) or any of the Annual Reports to the Nation.

6. Nested standardized rates and within-year model building. The Annual Reports provide and interpret trends in cancer rates over time for various demographic subgroups. The majority of the report examines contrasts in overall cancer rates, contrasts conditional on gender and race, and contrasts conditional only on gender or only on race.
However, unlike in the usual regression analysis, no attempt is made to construct a “parsimonious model.” Were the analyses in the Annual Reports used only to describe within-group trends over time, such model development might be of no interest. When used to make the type of inferences described in the first paragraph of this paper, the ability to test nested models assumes importance. Whether standardized rates conditional on race and gender are identical to standardized rates conditional on race alone has implications for the allocation of health care resources, the construction of preventive programs, and the focus of future etiologic research.

Regarding standardized rates as the output of standardization operators allows us to evaluate the relationship between standardized rates produced by the same operator on the same data. We define the standardized rate $s^*_y[D \mid E^{\dagger\dagger}_1]$ to be nested in $s^*_y[D \mid E^\dagger_1]$ provided the following three conditions are true: (1) both are produced by the same standardization operator, $S^*_y[D \mid E_1]$, (2) the arguments of the operator, $(D_y, E_y)$ and $P^*(E)$, are identical, (3) $E^{\dagger\dagger}_1$ is a proper subset of $E^\dagger_1$.

The relationship of nested standardized rates produced by the SCC operator has familiar properties. The SCC operator is recursive in the sense that
$$s^{cc}_y[D \mid E^{\dagger\dagger}_1] = S^{cc}_y[[s^{cc}_y[D \mid E^\dagger_1]] \mid E^{\dagger\dagger}_1]. \tag{19}$$
The right-hand side of (19) is defined to be
$$S^{cc}_y[[s^{cc}_y[D \mid E^\dagger_1]] \mid E^{\dagger\dagger}_1] \equiv \int_{E^\dagger_1 \setminus E^{\dagger\dagger}_1} s^{cc}_y[D \mid E^\dagger_1]\, dP^*(E^\dagger_1 \setminus E^{\dagger\dagger}_1 \mid E^{\dagger\dagger}_1). \tag{20}$$
The SCC operator does not “discard information.” The standardized rate obtained from equation (7) when $E_1 = E^{\dagger\dagger}_1$ is the same estimate obtained by replacing the finest-crude-cancer rate in (7) with $s^{cc}_y[D \mid E^\dagger_1]$.
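
The recursion in (19) and (20) can be checked numerically on a small discrete example. The sketch below assumes $E = (\text{gender}, \text{race}, \text{age})$ with $E^\dagger_1 = (\text{gender}, \text{race})$ and $E^{\dagger\dagger}_1 = (\text{race})$, and it takes the SCC rate to be the average of the finest rates with respect to $P^*(E_2 \mid E_1)$. The function names, the standard distribution, and the rates are all invented for illustration.

```python
from itertools import product


def scc(finest, p_star, keep):
    """SCC-type rate for each value of the kept coordinates: finest rates
    averaged over the remaining coordinates using P*(rest | kept)."""
    num, den = {}, {}
    for cell, weight in p_star.items():
        key = tuple(cell[i] for i in keep)
        num[key] = num.get(key, 0.0) + finest[cell] * weight
        den[key] = den.get(key, 0.0) + weight
    return {key: num[key] / den[key] for key in num}


def scc_of_scc(outer, p_star, outer_keep, inner_keep):
    """Discrete form of (20): already standardized rates, indexed by the
    outer coordinates, averaged over the coordinates being dropped using
    the standard distribution conditional on the inner coordinates."""
    num, den = {}, {}
    for cell, weight in p_star.items():
        outer_key = tuple(cell[i] for i in outer_keep)
        inner_key = tuple(cell[i] for i in inner_keep)
        num[inner_key] = num.get(inner_key, 0.0) + outer[outer_key] * weight
        den[inner_key] = den.get(inner_key, 0.0) + weight
    return {key: num[key] / den[key] for key in num}


# Cells are (gender, race, age band); weights and rates are hypothetical.
cells = list(product(("F", "M"), ("black", "white"), ("40-59", "60+")))
p_star = dict(zip(cells, [0.10, 0.05, 0.20, 0.15, 0.08, 0.07, 0.20, 0.15]))
finest = dict(zip(cells, [120.0, 300.0, 100.0, 280.0,
                          150.0, 380.0, 130.0, 340.0]))

by_gender_race = scc(finest, p_star, keep=(0, 1))   # s[D | gender, race]
direct = scc(finest, p_star, keep=(1,))             # s[D | race] from finest rates
nested = scc_of_scc(by_gender_race, p_star,
                    outer_keep=(0, 1), inner_keep=(1,))

for race_key in direct:
    print(race_key, direct[race_key], nested[race_key])
```

The direct and nested rates agree for each race up to floating-point error, which is the discrete counterpart of the tower property of conditional expectations.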
Thus, nested rates produced by the SCC operator have the same properties as nested estimates in regression models of conditional expectations. We are currently developing inferential procedures analogous to those that exist for regression.

Though the identity in (19) can easily be verified by substitution, a more instructive proof is based on the functional form of the SCC operator. We define the probability measure $P_{y*}(D, E) \equiv P_y(D \mid E) P^*(E)$. The SCC operator can be regarded as the conditional expectation of the finest-crude-cancer rates, $P_y(D \mid E_1, E_2)$, with respect to $P_{y*}(E_2 \mid E_1)$. Thus, $s^{cc}_y[D \mid E^{\dagger\dagger}_1]$ is the projection (conditional expectation) of the finest-crude-cancer rates on the subspace (subsigma algebra) defined by $E^{\dagger\dagger}$. Projections (conditional expectations) are entirely determined (a.e. unique) by the subspace (subsigma algebra) on which they are defined (measurable) [Dudley (1989)].

Recursion cannot be defined for SCA. The $P_y(D \mid E^a_1, A)$ in the integral in (6) is always a function of $A$; $s^{ca}_y[D \mid E_1]$ is never a function of $A$. Mimicking the form of (20) one might define recursion for SCA to be
$$\int_{E^\dagger_1} s^{ca}_y[D \mid E^\dagger_1]\, dP^*(E^\dagger_1 \mid E^{\dagger\dagger}_1). \tag{21}$$
The integral in (21) does not equal the $s^{ca}_y[D \mid E^{\dagger\dagger}_1]$ obtained from (6). The SCA operator is not a projection operator.

7. Discussion. In order to make inference from observational data, researchers attempt to separate the effect of the exposures of interest from the effect of other disease determinants that covary with the exposures of interest. We have divided these other determinants into two mutually exclusive sets: determinants that are measured and determinants that are unmeasured.
We refer to procedures that attempt to separate the association of the covariates of interest, $E_1$, from the association of the other measured disease determinants, $E_2$, as procedures that control for confounding. Standardization is one such procedure. It is the most common procedure used to control for confounding in the analyses of population-studies or data from disease registries. In this paper we examined the ability of various standardization procedures to control for measured risk factors, and the interpretability of differences in standardized rates in the presence of unmeasured risk factors. Our motivation for conducting this research was to evaluate the properties of the “age adjustment using the direct method” standardization procedure (SCA standardization) used in the analysis of SEER cancer registry data and, if needed, to develop standardization procedures with better properties.

We defined a general class of standardization operators, and regarded standardized rates as the output from a standardization operator. The general class of operators consists of all functionals that are integrals of a crude-cancer rate with respect to a user-defined “weighting” distribution (5). Since all crude-cancer rates are themselves integrals of the finest-crude-cancer rates (2), $P_y(D \mid E)$, standardization operators differ only with respect to the measure used to integrate $P_y(D \mid E)$. Based on this formulation, we defined between-year differences in crude-cancer rates conditional on $E_1$ as being unconfounded by $E_2$, provided the distribution of $E_2$ conditional on $E_1$ is the same in both years (8). By extension, we defined a subclass of standardization operators with no $E_2$ confounding (SONC operators).
This subclass consists of standardization operators in which the finest-crude-cancer rates for each $y$ are integrated with respect to a distribution that is the same for all $y$ (A.1).

The SCA operator is not a SONC operator (10). If the differences in crude-cancer rates are not confounded, the SCA operator can introduce confounding and produce between-year differences in standardized rates that are confounded. In Section 3.1 we showed that the SCA operator will introduce confounding unless the $P_y(E)$ distributions differ only in terms of the marginal distribution of age. This criterion was not met in the SEER 13 data that we analyzed. Figure 1 from that analysis provides a graphic representation of the fact that the standardized rates produced by the SCA operator for year $y$ are not the “cancer rates one would have seen” had the age distribution in year $y$ been identical to the age distribution that generated the weights.

It is clear that one should always choose a standardization operator from the subclass of operators with no $E_2$ confounding. We proposed and examined the properties and performance of one such operator: the standardization controlling for covariates operator (SCC). A desirable characteristic of the SCC operator is that the year-to-year differences in standardized cancer rates preserve the magnitude of the differences of the crude-cancer rates. When the crude-cancer rates are not confounded, and thus no standardization procedure is required, the SCC operator is the only operator for which the standardized rates equal the crude-cancer rates. Figure 1 provides a graphic illustration of these properties. Figure 2 shows that the largest differences between standardized rates from the SCA operator and crude-cancer rates occur in minority populations.
If standardized rates produced by standardization operators with no $E_2$ confounding are different in year $y^\dagger$ and $y^{\dagger\dagger}$, then differences must exist in the finest-crude-cancer rates. However, to make the inferences of interest to the authors of the Annual Reports [Ward et al. (2006)], one must consider the impact of unmeasured cancer risk factors on the year-to-year differences in standardized rates. In Section 5 we used the fundamental disease probability paradigm for inference proposed by Mark (2004, 2005, 2006, 2008) to prove that nonzero contrasts of standardized rates falsify assumptions about the conditional probabilities of disease given all the risk factors [the identical disease probability assumption, (16)], and/or the conditional probabilities of unmeasured risk factors given the measured risk factors [the comparable-confounding assumption, (17)]. We described the one-to-one correspondence that exists between the inferences made in the SEER Annual Reports [Ward et al. (2006)], and the falsification of these assumptions.

The analyses in the Annual Reports only examine between-year differences in cancer rates. In Section 6 we argued that, given the type of inferences made from these reports, it would be desirable to examine within-year contrasts. We defined the concept of nested standardized rates. We proved that, like regression models for conditional expectations, the SCC operator is a projection operator. In our current research we are developing methods for testing nested models analogous to those used in regression.

Though we have discussed the SCC operator in terms of estimating the conditional expectation of a binary variable, these operators extend in an obvious manner to the estimation of conditional expectations in general. The particular operators we have defined are nonparametric estimators.
One could construct parametric or semiparametric operators by, for instance, replacing $P_y(D \mid E)$ with a parametric or semiparametric model, $P_y(D \mid E; \theta)$.

The next step in our research is to derive SCC estimators for the statistics currently used for inference in the Annual Report. Our eventual goal is to program these estimators in the R-language [R Development Core Team (2007)] and produce freely available user-friendly software comparable to the SEER*Stat freeware [SEER*Stat.6.3.6. (2007)] available for SCA analyses of SEER data.

APPENDIX: DEFINING STANDARDIZATION OPERATORS WITH NO $E_2$ CONFOUNDING

For all factorizations $E = (E_1, E_2)$, let $F^*(E_1, E_2)$ be a family of probability measures indexed by $e^*_1$, with the property that $\int_{E_2} dF^*(e^*_1, E_2) = 1$ for all $e^*_1 \in E_1$. $S^*_y[D \mid E_1]$ is a standardization operator unconfounded by $E_2$ if it can be expressed in the form
$$S^*_y[D \mid E_1] = \int_{E_2} P_y(D \mid E_1, E_2)\, dF^*(E_1, E_2). \tag{A.1}$$
Note that for the SCC operator (7), the $P^*(E)$ specifies the $F^*(E_1, E_2)$ for all factorizations of $E$, and all realizations $e^* \in E$. In general, this need not be the case.

SUPPLEMENTARY MATERIAL

Fundamental disease probability inference: A new paradigm for causal inference in the biological sciences (DOI: 10.1214/08-AOAS170SUPP; .pdf).

REFERENCES

Anderson, R. N. and Rosenberg, H. M. (1998). Age standardization of death rates: Implementation of the year 2000 standard. National Vital Statistics Reports 47.
Dudley, R. M. (1989). Real Analysis and Probability, 1st ed. Wadsworth, Pacific Grove, CA. MR0982264
Edwards, B. K., Howe, H. L., Ries, L. A. G. et al. (2002). Annual report to the nation on the status of cancer, 1973–1999, featuring implications of age and aging on U.S. cancer burden. Cancer 94 2766–2792.
Edwards, B. K., Brown, M. L., Wingo, P. A. et al.
(2005). Annual report to the nation on the status of cancer, 1975–2002, featuring population-based trends in cancer treatment. J. National Cancer Institute 97 1407–1427.
Giovannucci, E. (2002). Epidemiologic studies of folate and colorectal neoplasia: A review. J. Nutrition 132 2350S–2355S.
Howe, H. L., Wingo, P. A., Thun, M. J. et al. (2001). Annual report to the nation on the status of cancer (1973 through 1998), featuring cancers with recent increasing trends. J. National Cancer Institute 93 824–842.
Howe, H. L., Wu, X., Ries, L. A. G. et al. (2006). Annual report to the nation on the status of cancer, 1975–2003, featuring cancer among U.S. Hispanic/Latino populations. Cancer 107 1711–1742.
Jemal, A., Clegg, L. X., Ward, E. et al. (2004). Annual report to the nation on the status of cancer, 1975–2001, with a special feature regarding survival. Cancer 101 3–27.
Klein, R. J. and Schoenborn, C. A. (2001). Age adjustment procedures using the 2000 projected U.S. population. Healthy People Statistical Notes 20. National Center for Health Statistics, Hyattsville, MD.
Mark, S. D. (2004). A formal approach for defining and identifying the fundamental effects of exposures on disease from a series of experiments conducted on populations of non-identical subjects. In Proc. Amer. Statist. Assoc. 3120–3142. American Statistical Association, Alexandria, VA.
Mark, S. D. (2005). Using V-range maps to locate exposure regions where observable contrasts identify the effects of exposure on contrasts of fundamental disease probabilities. In Proc. Amer. Statist. Assoc. 299–305. American Statistical Association, Alexandria, VA.
Mark, S. D. (2006). Fundamental disease probability inference: A new paradigm for causal inference in the biological sciences. In Proc. Amer. Statist. Assoc. 283–290. American Statistical Association, Alexandria, VA.
Mark, S. D. (2008). Supplement to “A general formulation for standardization of rates as a method to control confounding by measured and unmeasured disease risk factors.” DOI: 10.1214/08-AOAS170SUPP.
Quinlivan, E. P. and Gregory, J. F. III. (2003). Effect of food fortification on folic acid intake in the United States. American J. Clinical Nutrition 77 221–225.
R Development Core Team. (2007). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria.
Ries, L. A. G., Eisner, M. P. and Kosary, C. L. (2005). SEER Cancer Statistics Review, 1975–2002. National Cancer Institute, Bethesda, MD.
Rothman, K. J. (1986). Modern Epidemiology, 1st ed. Little, Brown, and Company, New York.
SEER, Surveillance, Epidemiology and End Results Program (2005a). NIH Publication No. 05-4772, National Cancer Institute.
SEER, Surveillance, Epidemiology and End Results Program (2005b). SEER 13 Regs Limited USE, Nov. 2005 Sub (1992–2003). National Cancer Institute, DCCPS, Surveillance Research Program, Cancer Statistics Branch, Washington, D.C.
The Surveillance Research Program of the Division of Cancer Control and Population Sciences (2007). SEER*Stat.6.3.3. National Cancer Institute.
Ward, E. M., Thun, M. J. and Hanna, L. M. et al. (2006). Interpreting cancer trends. Ann. New York Acad. Sci. 1076 29–53.
Weir, H. K., Thun, M. J., Hankey, B. F. et al. (2003). Annual report to the nation on the status of cancer, 1975–2000, featuring the uses of surveillance data for cancer prevention and control. J. National Cancer Institute 95 1276–1299.

Department of Biostatistics and Biometrics
University of Colorado School of Public Health
4200 East Ninth Avenue, RM1615
Denver, Colorado 80262
USA
E-mail: steven.mark@uchsc.edu
