Quantifying alternative splicing from paired-end RNA-sequencing data

The Annals of Applie d Statistics 2014, V ol. 8, No. 1, 309–330 DOI: 10.1214 /13-A OAS687 c  Institute of Mathematical Statistics , 2 014 QUANTIFYING AL TERNA TIVE SPLICING FR OM P AIRED-END RNA-SEQUENCING DA T A By D a vid Rossell 1 , ∗ Camille Steph an-Otto A ttolini 2 , † Manuel Kroiss ‡ , § , 3 and Almond St ¨ ocker ‡ , 3 University of Warwick ∗ , Institute for R ese ar ch in Biome dici ne of Bar c elona † , LMU Mu nich ‡ and TU Munich § RNA-sequ encing has rev olutionized biomedical resea rc h and , in particular, our abilit y to study gene alternative splicing. The problem has imp ortant implications for human health, as alternative splici ng ma y b e invo lved in malfunctions at the cellular level and multiple diseases. How ever, the high-d imensional nature of the data and th e existence of exp erimental biases p ose serious data analysis challenges. W e ﬁnd that t he standard d ata summaries u sed to study alternative splicing are severely limited, as they ignore a substantial amount of v aluable information. Current data analysis metho ds are based on such summaries and are hence suboptimal. F urt her, th ey have lim- ited ﬂexibilit y in accounting for technical biases . W e prop ose nov el data summaries and a Bay esian mo deling framew ork t hat o verco me these limitations and determine biases in a nonparametric, highly ﬂexible manner. These summaries adapt naturally t o th e rapid im- prov ements in sequencing technolog y . W e provide eﬃcien t p oint esti- mates an d uncertain ty assessmen ts. The approach allow s to study al- ternative splicing patterns for indiv id u al samples and can also b e th e basis for d o wnstream analyses. W e found a severa lfold improv ement in estimation mean square erro r compared p opular approac h es in sim u lations, and substantial ly higher consistency b etw een replicates in exp erimental data. Ou r ﬁn dings indicate th e need for adjusting the routine summarization and analysis of alternativ e splicing RNA- seq studies. W e provide a softw are implementation in the R pac ka ge casper . 4 Received Sep tember 2012; revised Ma y 2013. 1 Supp orted in part b y Grants R 01 CA158113-01 from the NIH (USA) and MTM2012- 38337 from the Ministerio de Econom ´ ıa y Competitividad ( Spain). 2 Supp orted in part by AGAUR Beatriu de Pin´ os fello wship BP-B 00068 (Spain). 3 Supp orted in p art by a Ba yerisc hes Arb eitsministerium foreign trainee fello wship (Ger- many). Key wor ds and phr ases. Alternative splicing, RNA-Seq , Ba yesian mod eling, estimation. This is an electronic reprint of the or iginal a r ticle published by the Institute o f Mathematical Statis tics in The A nn als of Appli e d Statistics , 2014, V ol. 8, No. 1, 309–33 0 . This r e print diﬀers from the original in pa gination and typog raphic detail. 4 http://w ww.bioconductor.org/pac k ages/release/ bio c/html/casper.html . 1 2 ROSSELL, S TEPHAN-OTTO A TTOLIN I , KROISS AND ST ¨ OCKER Fig. 1. Thr e e splic e variants for a hyp othetic al gene and their r elative abundanc es. Exon 1 i s lo c ate d at p ositions 101–4 00. Exon 2 at 1001–1100. Exon 3 at 2001–2500. 1. In tro d uction. RNA-sequencing (RNA-seq) pr o duces an o ve rwhelm- ing amount of genomic d ata in a single exp erimen t, p ro vid ing an unp rece- den ted resolution to address biological prob lems. W e fo cus on gene expres- sion exp eriments where the goal is to study alternative sp licing (AS), which w e brieﬂy in tro d uce. AS is an imp ortan t biological pro cess b y whic h cells are able to express several v arian ts, also kno wn as isoforms , of a sin gle gene. Eac h splicing v arian t giv es rise t o a d iﬀerent p r otein with a u nique stru c- ture that can p er f orm diﬀeren t functions an d resp ond to in ternal and envi- ronment al n eeds. AS is b eliev ed to con trib ute to the c omplexit y of h igher organisms, and is in fact particularly common in humans [Bl enco we ( 2006 )]. Additionally , it is kno wn to b e in v olved in m ultiple diseases such as cancer and malfunctions at th e cellular lev el. Despite its imp ortance, due to limita- tions in earlier tec h nologies, most gene exp r ession stud ies ha ve ignored AS and focused on ov erall gene expression. Consider the hyp othetical example of a g en e with three splice v arian ts sho w n in Figure 1 . Th e gene is encod ed in the DNA in three exons, sho wn as b oxe s in Figure 1 . When the gene is transcrib ed as messenger RNA (mRNA), it can giv e r ise to three isoforms . V arian t 1 is formed by all three exons, whereas v arian t 2 skip s the second exon and v arian t 3 the third exon. Usually , m u ltiple v arian ts are expressed sim ultaneously at an y giv en time. In our example, v ariant 1 mak es up for 60% of th e o verall exp r ession of the gene, v arian t 2 for 30% and v ariant 3 for 10%. In practice, these pr op ortions are unknown and our goal is to estimate them as acc urately as p ossib le. W e fo cus on p aired-end RNA-seq e xp eriments, as they are the current standard and provide higher resolution for measuring isoform expression than co mp eting tec hnologies, for example, microarrays [Pe pk e, W old and Mortaza vi ( 2009 )]. RNA-seq sequ ences tens or ev en hundreds of millions of mRNA fr agmen ts, whic h ca n th en be aligned to a reference genome us in g AL TERNA TIVE S PLICING FROM R NA-S EQ DA T A 3 T able 1 Thr e e p air e d-end RNA-se q f r agments. Aligne d chr omosome and b ase p airs ar e i ndic ate d for b oth ends, al lowing f or gapp e d alignments. The exon p ath indic ates the se quenc e of exons visite d by e ach end. A typic al exp eriment c ontains tens of mil lions of fr agments Chromosome Left read Right read Exon path F ragment 1 chr1 110–185 200–274 { 1 } , { 1 } F ragment 2 chr1 361– 400; 1001–1035 2011–20 85 { 1, 2 } , { 3 } F ragment 3 chr1 301–375 1021–10 95 { 1 } , { 2 } · · · a v ariet y of s oftw are, f or example, T op Hat [T r apnell, Pac hter and Salzb er g ( 2009 )], SO AP [Li et al. ( 20 09 )] or BW A [Li and Durbin ( 20 09 )]. Through- out, w e assu me th at the softw are can hand le gapp ed alignmen ts (we used T opHat in all our examples). E arly RNA-seq studies used single-end se- quencing, where only the left or right en d of a fragmen t is sequen ced. In con trast, paired-end RNA-seq sequences b oth fragmen t ends . T able 1 sho w s three hyp othetical sequenced fr agments corresp ond in g to the gene in Fig- ure 1 . 75 base p airs (bp) w ere sequenced from eac h end. F or instance, b oth ends of fragmen t 1 align to exon 1. As the three v arian ts con tain exon 1, in principle, this f r agmen t could ha v e b een generated by an y v arian t. F or frag- men t 2 the left read aligned to exons 1 and 2 (i.e., it spanned the jun ction b et w een b oth exons), and th e right read to exon 3. Hence, fragmen t 2 can only ha v e b een generated from v ariant 1. Finally , fragmen t 3 visits exons 1 and 2 and, hen ce, it could ha ve b een generated either by v arian ts 1 or 3. The example is simply mean t to pr o vide s ome intuitio n. In pr actice, most genes are substantial ly longer and ha v e m ore complicated splicing p atterns. Precise probability calc ulations are required to ensur e that the conclusions are sound. Ideally , one wo uld wan t to sequence the wh ole v arian t, so that eac h fr ag- men t can b e uniqu ely assigned to a v ariant. Unfortu nately , current tec hnolo- gies sequence hundreds of base pairs, whic h are ord ers of magnitude shorter than t ypical v ariant le ngths. Curr en t statistical approac hes are based o n the observ ation that, while most sequenced fragmen ts cannot b e uniqu ely assigned to a v arian t, it is p ossible to mak e probabilit y statemen ts. F or in- stance, fragmen t 3 in T able 1 ma y ha ve originate d either from v ariant 1 or 3, but the probabilit y that ea c h v arian t generates suc h a fragmen t is diﬀer- en t. As we sh all see b elo w, this obs er v ation pr ompts a direct use of Bay es theorem. In principle, one could formulat e a pr obabilit y mo del that uses the full data, that is, t he exact b ase pairs co vered by eac h fr agmen t such as pro- vided in T able 1 , for example, Glaus, Honk ela and Rattra y ( 2012 ). Ho we v er, our ﬁndings indicate that such strategies can b e compu tationally prohibitive 4 ROSSELL, S TEPHAN-OTTO A TTOLIN I , KROISS AND ST ¨ OCKER and deliv er no ob vious impr o vemen t (Section 4 ). F urther , data storage and transfer requirements imp ose a need for reducing the size of the data. Sev- eral auth ors p rop osed su mmarizing the data by count ing the num b er of fragmen ts either co v ering eac h exon or eac h exon jun ction [e.g ., Xing et al. ( 2006 ), Mortaza vi et al. ( 2008 ), Jiang and W ong ( 2009 )]. I n fact, large-scal e genomic databases r ep ort pr ecisely these summ aries, for example, The Can- cer Genome A tlas pro ject. 5 One can t hen p ose a probabilit y mo del that uses coun t data from a few cat egories as ra w data, whic h greatly simpliﬁes computation. While useful, this app roac h is seriously limited to considerin g pairwise jun ctions, whic h discards r elev ant information. F or instance, sup- p ose that a fragmen t visits exons 1, 2 and 3. Simply adding 1 to the coun t of fragmen ts sp anning exons 1–2 and 2– 3 ignores the join t information that a single fragmen t visited 3 exons and decreases the conﬁdence wh en infer r ing the v ariant that generated th e f r agmen t. Our results su ggest that ignoring this in formation can r esult in a serious loss of precision. I t is n ot un com- mon that a fragmen t spans m ore than 2 exons. Holt and Jones ( 2008 ) found a sub stan tial p r op ortion of fragments bridging sev eral exons in paired-end RNA-seq exp eriments. In the 2009 RGA SP exp erimental data set (Section 4 ) 38.0% and 40.9% of fragmen ts sp anned ≥ 3 exons in r eplicate 1 and 2, re- sp ectiv ely (w e sub d ivided exons so th at th ey a re fully sh ared /not shared b y all v arian ts in a gene). In the 2012 ENCODE data set we found 64.7% and 65.2% in eac h rep licate. T h e 2012 data had substantial ly longer reads and fr agmen ts, w hic h illustrates the r apid adv ancements in tec hnology . As sequencing ev olv es, these p er centag es are exp ected to increase f urther. W e prop ose nov el data summaries that preserv e m ost information relev an t to alternativ e s plicing, while main taining the computational b u rden at a manageable level . W e record the sequence of exons visited b y eac h fr agmen t end, whic h w e refer to as exon p ath , and then coun t the n umb er of fragmen ts follo wing eac h exon path. Th e left end of F ragmen t 2 in T able 1 v isits exons 1 and 2 and the righ t en d exon 3, whic h we denote as { 1 , 2 } , { 3 } . Notice that a fragmen t follo wing the path { 1 } , { 2 , 3 } visits the same exons, so one could be tempted to simp ly record { 1 , 2 , 3 } in b oth cases. Ho w ever, the p r obabilit y of observing { 1 , 2 } , { 3 } for a giv en v ariant diﬀers from { 1 } , { 2 , 3 } and, hence, com binin g the t w o paths w ould resu lt in a p oten tial loss of information. T able 2 cont ains hypothetical exon path counts for our example gene. W e use these count s as the basic input for our pr ob ab ility mo d el. P aired-end RNA-se q is critical f or AS studies. Intuitiv ely , compared to single-end sequencing, it in cr eases the probabilit y of observing frag men ts that conn ect exo n j unctions. La croix et al. ( 2008 ) sho w ed that, a lth ou gh neither pr oto col guarantee s the existence of a u nique s olution, in practice, 5 http://ca ncergenome.nih.go v . AL TERNA TIVE S PLICING FROM R NA-S EQ DA T A 5 paired-end (b u t not single-end) can p ro vid e asymptotically corr ect estimate s for 99.7% of the human ge nes. In con trast, for single-end d ata the ﬁgure is 1.14%. Unfortun ately , muc h of the current methodology has b een designed with single-end d ata in mind. Xing et al. ( 2006 ) formulat e the p roblem as that of tra versing a directed acyclic graph and formulate a laten t v ariable based approac h to estimate splice v arian t exp ression. Jiang and W ong ( 2009 ) prop ose a similar approac h within the Bay esian framework. Both approac hes w ere designed for single-end RNA-seq d ata. Ameur et al. ( 2010 ) prop osed strategies to detect splicing junctions, and Katz et al. ( 2010 ) and W u et al. ( 2011 ) in tro duced m o dels to estimate the p ercen tage of isoforms skipping individual exons. Ho wev er, these approac hes do n ot estimate expression at the v ariant lev el. Sev eral authors prop ose strategies that use paired-ends. Mortaz a vi et al. ( 2008 ), Mon tgomery et al. ( 2010 ), T rapn ell et al. ( 2010 ) and Salzman, Jiang and W ong ( 2011 ) mod el the num b er of fragmen ts spanning exon ju nctions. These app roac hes fo cus on pairwise exon conn ections, ignoring v aluable higher-order inf orm ation, and ha ve limitations in incorp orating imp ortant tec hnical biases. First, the sample preparation pr oto cols u sually ind uce an enric h m en t to wa rd the 3’ end of th e transcrip t, that is, fragmen ts are n ot uniformly distribu ted along the gene. Rob erts et al. ( 2011a ), W u, W ang and Zhang ( 2011 ) or Glaus, Honkela and Rattra y ( 2012 ) relax the uniformit y as- sumption. F urther, the fragmen t length distribu tion pla ys an imp ortan t role in the pr ob ab ility calculations and needs to b e estimated accurate ly . While the approac h es ab ov e ac k n o wledge this issue, they either use sequencing fa- cilit y rep orts (i.e., th ey d o not estimate the d istribution from the data) or they imp ose s tr ong parametric assumptions. O ur examples illus tr ate that facilit y rep orts can b e inaccurate and that parametric f orm s d o not captur e the observ ed asymmetries, h eavy tails or multi-mod alities. F urther, all p r e- vious app roac hes assume th at f r agmen t start and length distributions are T able 2 Exon p ath c ounts f or hyp othetic al gene Exon path Count { 1 } , { 1 } 2824 { 2 } , { 2 } 105 { 3 } , { 3 } 5042 { 1 } , { 2 } 27 { 1 } , { 1, 2 } 423 { 1 } , { 3 } 127 { 2, 3 } , { 3 } 394 { 1, 2 } , { 3 } 2 { 1 } , { 2, 3 } 13 6 ROSSELL, S TEPHAN-OTTO A TTOLIN I , KROISS AND ST ¨ OCKER constan t across all genes. W e pro vide empirical evidence that this assump - tion ca n b e ﬂa w ed and suggest a strategy to relax the assumption. A concern with current ge nome annotations is that they ma y miss some splicing v ariants. Our appr oac h can b e com bin ed w ith metho ds that pr edict new v ariants s uc h as Cuﬄin ks RABT mo d u le [T rapnell et al. ( 2010 ), Rob erts et al. ( 2011b )], Scripture [Guttman et al. ( 2010 )] or S pliceGrapher [Rogers et al. ( 2012 )]. This option is implement ed in our R pac k age an d illustrated in Secti on 4.3 . In summary , w e prop ose a ﬂ exible framewo rk to estimate alternativ e splic- ing from RNA-seq studies, b y using nov el data summaries and accoun ting for exp erimenta l biases. In Section 2 we formulate a probabilit y mod el that goes b eyo nd pairwise connections b y considering exon paths. W e mo del the read start and fragmen t size distrib u tions n onparametrically and allo w for sepa- rate estimation w ithin subsets of ge nes with similar characte ristics. Sectio n 3 discusses mo del ﬁtting and provides algorithms to obtain p oint estimates, asymptotic cred ibilit y in terv als and p osterior samp les. W e sho w some resu lts in Secti on 4 and pro v id e concluding remarks in S ection 5 . 2. Probabilit y mo del. W e form ulate th e mo del at the gene lev el and p er- form inference separately for eac h gene. In some cases, exons fr om d iﬀeren t genes ov erlap w ith eac h other. When this o ccurs we group the ov erlapping genes and consider all their isoforms simultaneously . It is also p ossible that t wo v arian ts share on ly a part of an exon. W e sub divide suc h exons into the shared part and the part that is sp eciﬁc to eac h v arian t. F or simplicit y , from here on we refer to gene group s simp ly as genes and to s u b d ivided exons simply as exons. Consider a gene with E exons starting at base pairs s 1 , . . . , s E and ending at e 1 , . . . , e E . Denote the set of splicing v ariant s und er consideration b y ν (assumed to b e kno wn) and its cardinalit y b y | ν | . Eac h v arian t is c haracter- ized b y an increasing sequ en ce of natural n um b ers i 1 , i 2 , . . . that indicates the exons con tained th er ein. 2.1. Likeliho o d and prior. As discuss ed ab ov e, w e formulate a mo del for exon p aths. Let k b e the num b er of exons visited b y the left read, and k ′ b e that for the righ t r ead (i.e., k = k ′ = 1 when b oth reads o verla p a single exon). W e denote an exon path by ι = ( ι l , ι r ), wh ere ι l = ( i j , . . . , i j + k ) are the exons visited b y the left-end and ι r = ( i j ′ , . . . , i j ′ + k ′ ) those b y th e righ t- end. Let P ∗ b e the set of all p ossible exon paths and P b e the subset of observ ed paths, that is, the p aths follo w ed by at least 1 sequenced fragmen t. The observed data is a realization of the r andom v ariable Y = ( Y 1 , . . . , Y N ), where N is the num b er of paired-end reads and Y i ∈ { 1 , . . . , P ∗ } indicates the exon p ath follo wed by read p air i . F ormally , Y i arises from a mixtu re of | ν | discrete probabilit y distributions, eac h co m p onent corresp ond ing to AL TERNA TIVE S PLICING FROM R NA-S EQ DA T A 7 a diﬀeren t sp licing v ariant. The mixture weig h ts π = ( π 1 , . . . , π | ν | ) giv e the prop ortion of reads generated by eac h v arian t, that is, its relativ e expression. That is, P ( Y i = y i | π , ν ) = | ν | X d =1 p y i d π d , where p k d = P ( Y i = k | δ i = d ) i s the probabilit y of p ath k und er v arian t d and δ i is a laten t v ariable ind icating the v arian t that originated Y i . Let S i and L i denote the relat iv e start and length (resp .) of fr agment i . The exon path Y i is complete ly determined giv en S i , L i and the v arian t δ i . Hence, p k d = Z Z I ( Y i = k | S i = s i , L i = l i , δ i = d ) dP L ( l i | δ i ) dP S ( s i | δ i , L i ) , (1) where P L is the fr agmen t distribution and P S is the r ead start distrib ution conditional on L . As d iscussed in Section 2.2 , by assu ming th at P S and P L are shared across sets of genes with s im ilar c haracteristics, it is p ossible to estimate th em with high precision. Hence, f or practical pur p oses we can treat p k d as kno wn and pre-compute them b efore mo del ﬁtting. F ull deriv ations for p k d are pro vided in App endix A . Assuming that ea ch fr agmen t is ob s erv ed indep endentl y , the likel iho o d function ca n b e written as P ( Y | π , ν ) = N Y i =1 | ν | X d =1 p y i d π d = |P | Y k =1 | ν | X d =1 p k d π d ! x k , (2) where x k = P N i =1 I( y i = k ) is th e n u m b er of reads follo win g exon path k . Equation ( 2 ) is log-conca ve, wh ich guaran tees the existence of a single max- im u m . Log-co nca vit y is giv en by (i) the log fun ction b eing conca ve and monotone increasing, (ii) P | ν | d =1 p k d π d b eing linear and therefore conca v e, and (iii) the fact that a comp osition g ◦ f where g is conca ve and monotone increasing and f is conca v e is again conca v e. T o see (iii), notice that g ◦ f ( t z 1 + (1 − t ) z 2 ) ≥ g ( tf ( z 1 ) + (1 − t ) f ( z 2 )) ≥ t g ◦ f ( z 1 ) + (1 − t ) g ◦ f ( z 2 ) , where the ﬁrst inequalit y is giv en b y g b eing increasing and f conca ve, an d the second inequalit y is giv en by g b eing conca v e. W e complete the p robabilit y mo d el with a Diric hlet prior on π : π | ν ∼ Dir( q 1 , . . . , q | ν | ) . (3) In Section 4 w e assess sev eral choic es for q d . By default we set th e fa irly uninformative v alues q d = 2, as these in duce negligible b ias and stabilize the 8 ROSSELL, S TEPHAN-OTTO A TTOLIN I , KROISS AND ST ¨ OCKER p osterior mo de b y p o oling it aw a y from the b oundaries 0 and 1. It is easy to s ee that ( 3 ) is log-conca v e when q d ≥ 1 f or all d . Giv en that ( 2 ) is also log-co nca ve , this c hoice of q guaran tees the p osterior to b e log-conca ve , and therefore the u niqueness of the p osterior mo de. 2.2. F r agment length and r e ad start dist ribution estimates. Ev aluating the exon path probabilities in ( 1 ) that app ear in the likelihoo d ( 2 ) requires the fragment start d istribution P S and f ragmen t length distrib u tion P L . Giv en th at it is not p ossible to estimate ( P L , P S ) with p r ecision f or eac h in- dividual gene, we assume they are shared ac ross m ultiple genes (restricting fragmen ts to b e no longer than the v ariant they originated from). By default w e assume that ( P L , P S ) a re c ommon ac ross al l g enes, but w e also stud- ied p osing separate d istributions according to gene length. Su pplementa r y Section 1 s ho ws exp erimen tal evidence that, while P L remains essent ially constan t, P S can dep end on ge ne length and the experimental setup. While this option is imp lemen ted in our R pac k age, to allo w a d irect comparison with previous appr oac hes here, w e assumed a common ( P L , P S ). Denoting b y T the lengt h of v arian t δ i (in bp), w e let P L ( l | δ ) = P L ( l | T ) = P ( L = l )I( l ≤ T ) /P ( l ≤ T ). That is, the conditional distrib ution of L giv en δ is simply a truncated v ersion of the marginal distribution. F u rther, we assume a common fragmen t start distribution relativ e to the v arian t length T . Con d itional on L and T , P S is truncated so that the fragmen t ends b efore the end of the v ariant. Sp eciﬁcally , P S ( S ≤ s | δ i , L = l ) = P  S T ≤ z | T , L = l  (4) = ϕ (min { z , ( T − l + 1) /T } ) ϕ (( T − l + 1) /T ) , where z = s/T and ϕ ( z ) = P ( S T ≤ z ) is the d istribution of the r elativ e read start S T . T o estimate P L note that the fragmen t length is un k n o wn for f r agmen ts that sp an multi ple exons, but it is known exactly when b oth ends fall in the same exon. Therefore, we select all su c h fr agmen ts and estimate P L with the empir ical p r obabilit y mass function of the observe d f ragmen t lengths. In order to prev ent short exons from in ducing a selec tion bias, we only use exons that are sub stan tially longer than the exp ected maxim u m f ragmen t length (b y default > 1000 bp). Estimating the fragmen t start distribution P S is more challe nging, as w e do not kno w th e v arian t that generate d eac h fr agmen t and therefore its relativ e start p osition cannot b e determined. W e address this issue by selecting genes that ha v e a sin gle annotat ed v arian t, as, in principle, for these genes all fragments should ha v e b een generated by that v ariant. Of AL TERNA TIVE S PLICING FROM R NA-S EQ DA T A 9 course, the annotated genome d o es not con tain all v ariants and , therefore, a pr op ortion of the selected fragmen ts ma y not ha v e b een ge nerated by the assumed v arian t. Ho wev er, the annotati ons are exp ected to c on tain most common v arian ts (i.e., with highest expression) and, hence, most of the selected f ragmen ts should corresp ond to the annotated v arian t. Under this assumption, w e can determine the exact start S i and length L i for all selected fragmen ts. A diﬃcult y in estimating the read start d istribution is that the observ ed ( S i , L i ) pairs are truncated so that S i + L i < T , whereas w e require the untruncated cumulativ e distribution function ρ ( · ) in ( 4 ). F ortun ately , th e truncation p oin t for eac h ( S i , L i ) is known and, therefore, one can sim p ly obtain a Kaplan–Me ier estimate of ρ ( · ) [Kaplan and Meier ( 1958 )]. W e use the fu nction sur vfit in the R su rvival pack age [Th erneau and Lumley ( 2011 )]. 3. Mod el ﬁ tting. W e pr o vide algorithms to obtain a p oint estimate for π , asymp totic credibilit y in terv als and p osterior samples. F ollo w ing a 0–1 loss, as a p oin t estimate we rep ort the p osterior mo de, whic h is obtained b y maximizing the pro du ct of ( 2 ) and ( 3 ), sub ject to the constrain t P | ν | d =1 π d = 1 . W e note that maximum likel iho o d estimates are ob- tained by simply setting q d = 1 in ( 3 ). This constrained optimization can b e p erform ed with man y numerical optimization algo rithms. Here we used the EM algorithm [Dempster, Laird and Rubin ( 1977 )], as it is compu tationally eﬃcien t ev en when th e num b er of v arian ts | ν | is large. F or a detailed deriv a- tion see App endix B . As noted ab o ve , for q d > 1 the log-p osterior is conca ve and, therefore, the algorithm con verge s to the single maxim u m. T he steps required for the algorithm are as follo w s: 1. Initialize π (0) d = q d / P | ν | d =1 q d . 2. A t iteration j , up d ate π ( j + 1) d = q d − 1 + P |P | k =1 x k p kd π ( j ) d P | ν | i =1 p ki π ( j ) i . Step 2 is rep eated un til the estimates stabilize. In our examples w e r equired | π ( j + 1) d − π j d | < 10 − 5 for all d . Notice th at p k d and x k remain constant thr ough all iteratio n s and, h ence, they need to b e co mputed only once. W e c h aracterize the p osterior un certain ty asymptoticall y us in g a nor- mal appr oximati on in the re-parameterized space θ d = log( π d +1 /π 1 ), d = 1 , . . . , | ν | − 1 and the delta metho d [Casella and Berger ( 2001 )]. Denote b y µ the p osterior mo d e for θ = ( θ 1 , . . . , θ | ν |− 1 ) and by S the Hessian of th e log- p osterior ev aluated at θ = µ . F u rther, let π ( θ ) b e the in verse tr an s formation and G ( θ ) the matrix with ( d, l ) element G dl = ∂ ∂ θ l π d ( θ ) . Detailed expressions for S , π ( θ ) and G ( θ ) are p ro vid ed in App endix C . The p osterior for θ can b e asymptotically appro ximated b y N ( µ , Σ), where Σ = S − 1 . Hence, the delta metho d appro ximates the p osterior for π with N ( π ( µ ) , G ( µ ) ′ S G ( µ )). 10 ROSSELL, S TEPHAN-OTTO A TTOLIN I , KROISS AND ST ¨ OCKER The asymptotic appr o ximation is also us eful for the follo win g in d ep end en t prop osal Metrop olis–Hastings sc heme. In itialize θ (0) ∼ T 3 ( µ , Σ) and notice that a pr ior P π ( π ) on π induces a p rior P θ ( θ ) = P π ( π ( θ )) × | G ( θ ) | on θ , where G ( θ ) is as ab o v e. A t iterat ion j , p er f orm the follo wing steps: 1. Prop ose θ ∗ ∼ T 3 ( µ , Σ) and let π ∗ = π ( θ ∗ ). 2. Set θ ( j ) = θ ∗ with probabilit y min { 1 , λ } , w here λ = P ( Y | π ∗ , ν ) P π ( π ∗ ) | G ( θ ∗ ) | P ( Y | π ( j − 1) , ν ) P π ( π ( j − 1) ) | G ( θ ( j − 1) ) | T 3 ( θ ( j − 1) ; µ , Σ ) T 3 ( θ ∗ ; µ , Σ ) . (5) Otherwise, set θ ( j ) = θ ( j − 1) . P osterior samples can b e obtained b y discarding some burn -in samples and rep eating the p r o cess until p ractical con v ergence is ac hiev ed. By default w e suggest 10,0 00 samples with a 1000 bu rn-in, as it p ro vid ed suﬃcien tly high n u merical accuracy when comparing tw o indep enden t chains (Supp lemen- tary Sectio n 2). 4. Results. W e a ssess th e p erf orm ance of our app roac h in simulations and t wo exp erimenta l data sets. W e obtained the tw o human samp le K562 replicates 6 from the R GASP pro ject ( www.gencodegenes.org/rgasp ) and t wo ENCODE P ro ject Consortium ( 2004 ) replicated samples obtained from A549 cells (acce s sion n umber wgEncodeEH002625 7 ). W e compare our re- sults with Cuﬄinks [T rapnell et al. ( 2012 )], Flux C apacitor [Mon tgomery et al. ( 2010 )] and BitSeq [Glaus, Honke la and Rattra y ( 2012 )]. Cu ﬄinks is based on a probabilistic mod el akin to C asp er, but uses exon and exon junc- tion coun ts instea d of full exon paths, assum es that fragmen t lengths are normally d istributed and estimates the read start distr ibution in an itera- tiv e manner. FluxCapacitor is also based on exon and exon junction counts, but uses a metho d of m omen ts t yp e estimator. BitSeq uses a fu ll Ba y esian mo del at the base-pair resolution (i.e., d ata is n ot summarized as counts) and estima tes the read start distribution with a t w o-step p ro cedur e. Regarding sequence alignmen t, for Casp er, Cuﬄ inks and FluxCapacitor w e used T opHat [T rapnell, Pac h ter and S alzb erg ( 2009 )] with the human genome hg19, u s ing the default parameters and a 200 bp a verage insert size. BitSeq r equ ired aligning to the transcriptome with Bo wtie [Langmead et al. ( 2009 )]. 6 ftp://ftp.sanger.a c.uk/pub/genco de/rgasp/R GASP1/inputdata/human fa stq/ . 7 genome.ucsc.edu/ENCODE . AL TERNA TIVE S PLICING FROM R NA-S EQ DA T A 11 Fig. 2. Estimate d fr agment length (top) and start (b ottom) distributions in K562 data (left) and A549 data (right). Black dotte d line: di ﬀer enc e in √ P S b etwe en r eplic ates (values in se c ondary y-ax is). 4.1. Simulation study. W e generated h u man genome-wide R NA-seq data, setting th e simulatio ns to resem ble the K562 R GASP data in ord er to k eep them as realistic as p ossible. Figure 2 (left) sho ws our estimates ˆ P S and ˆ P L . W e s et P S and π for eac h gene with 2 or more v ariants to their estimates in the K562 data. F or ea c h gene we sim ulated a n u m b er of fragmen ts equal to that observ ed in th e K562 sample. W e consid ered a Casp er-based and a Cuﬄinks-based s im u lation scenario. In the former w e set π and P L to the C asp er estimates ( q d = 2 ). The second scenario w as designed to fav or Cuﬄink s b y usin g its π estimates and setting P L to its assumed Normal distribution (mean = 200, standard deviation = 20). W e indicated the data-generating P L to C uﬄinks, wh er eas the remaining metho ds estimated it from the data. An im p ortan t diﬀerence b et w een scenarios is that C asp er estimates with q d = 2 are p o oled a wa y from 12 ROSSELL, S TEPHAN-OTTO A TTOLIN I , KROISS AND ST ¨ OCKER T able 3 Me an absolute and squar e err ors, bi as and varianc e for simulation study for Casp er (top) and Cuﬄ inks estimates (b ottom) MAE MSE Bias sq rt V ariance Casper-based simula tions Casper ( q d = 1) 0.094 0.028 0.004 0.024 Casper ( q d = 2) 0.055 0.004 0.004 0.004 Cuﬄinks 0.141 0.050 0. 028 0.022 FluxCapacitor 0.151 0.054 0. 022 0.032 Cuﬄinks-based simulatio ns Casper ( q d = 1) 0.100 0.055 0.021 0.034 Casper ( q d = 2) 0.111 0.035 0.032 0.003 Cuﬄinks 0.127 0.073 0. 045 0.028 FluxCapacitor 0.138 0.078 0. 036 0.042 the b oun dary , hence, π d is nev er exactly 0 or 1, whereas the Cuﬄ inks es- timates were often in the b oun dary (Supplemen tary Figure 4). Genes with less than 10 reads p er kilobase p er millio n (RPKM) w ere excluded fr om all calculatio ns to reduce biases du e to lo w expression. W e estimated π from the sim u lated d ata u sing our app roac h with p rior parameters q d = 1 and q d = 2, Cuﬄinks and FluxCapacitor. T ab le 3 rep orts the absolute and square errors ( | π d − ˆ π d | and ( π d − ˆ π d ) 2 ) a v eraged a cross all 18,909 isoforms and 100 simulated data sets for b oth simulat ion settings. W e also rep ort the a verage squared bias and v ariance. Th e Cu ﬄ inks and FluxCapacitor MAE are ov er 2.5 and 2.7 folds greater th an that for Casp er with q d = 2 (1.5 and 1.6 for q d = 1 , resp.) in the Casp er-based scenario. In the Cuﬄinks-b ased simulatio n the redu ctions w ere 1.14 and 1.24 fold (1. 27 and 1.38 for q d = 1 ). Th e improv emen ts in MSE are eve n more pronoun ced, with an o ver 2 fold impro vemen t for q d = 2 ev en in the Cuﬄinks -b ased sim u lation. Casp er also sho ws a mark ed impr o veme n t in bias for q d = 1 and v ariance for q d = 2 . See Sup plemen tary Figure 4 for corresp onding plots. Figure 3 (top) sho ws the MAE for eac h transcrip t as a function of RPKM, a m easure of o verall gene expression. Casp er improv es the estimates for essen tially all RPKM v alues in b oth sim ulation settings. Figur e 3 (b ottom) assesses t he MAE vs. the mean pairwise d iﬀerence b et wee n v arian ts in a gene (n um b er of b ase p airs not shared). When v ariants in a gene share m ost exons this diﬀerence is small, that is, v ariant s are hard er to distinguish. Casp er estimates are the most accurate at all similarit y lev els, with the MAE decreasing as v arian ts b ecome more diﬀerentia ted. In terestingly , Cuﬄinks and FluxCapacitor sho w lo w er MAE as similarit y increases from lo w to medium, but then MAE b ecomes higher and more v ariable in genes with medium-highly d iﬀeren tiated v ariants. These results illustr ate the adv an tage AL TERNA TIVE S PLICING FROM R NA-S EQ DA T A 13 Fig. 3. Sim ulation study. Me an absolute err or vs. RPKM for Casp er estimates (a) and Cuﬄink estimates (b) and the me an b ase p air diﬀer enc e b etwe en variants in a gene for Casp er (c) and Cuﬄinks-b ase d simulations (d) . of usin g fu ll exon paths, whic h provide more resolution in assignin g reads to splicing v ariant s. Finally , we assessed the fr equen tist co verage probabilities for the asymp- totic 95 % cred ibilit y in terv als (Section 3 ), ﬁ nding that in 95.04% of the cases they con tained the tru e v alue. 4.2. Exp erimental data fr om RGASP pr oje ct. The t wo K562 rep licates w ere sequenced in 2009 with Solexa sequ en cing. The read length w as 75 bp and the mean fragment length ind icated in the d o cumenta tion is 200 bp for b oth replicat es. Figure 2 (top, left) sho ws t he est imated f ragmen t length distributions. W e observ e that the mean length diﬀers signiﬁcan tly from 200 bp and that there are imp ortan t diﬀerences b et w een replicates. Replicate 2 sh o ws a hea vy left tail that indicates a sub set of f ragmen ts sub stan tially 14 ROSSELL, S TEPHAN-OTTO A TTOLIN I , KROISS AND ST ¨ OCKER shorter than a verag e. This distr ib utional shap e cannot b e captured with the usual paramet ric distrib utions. Figure 2 (left, bottom) s ho ws the relativ e start distrib ution. W e obs erv e m ore sequences lo cated n ear the transcript end in r eplicate 1, that is, a h igher 3’ bias. Th e diﬀerences b et w een replicates illustrate the n eed of ﬂexibly mo deling these distr ibutions for eac h sample separately . In fact, we found th at ˆ P S diﬀered across genes with v arying length (Sup plemen tary Sectio n 1 and Sup plemen tary Fig ure 1), the 3’ bias b eing stronger in genes shorter than 3 k ilo-bases. W e estimated the expression of h u man splicing v ariants in the UCSC genome v ersion hg19 for the tw o replicated s amp les separately . Figure 4 (left) and T able 4 compare the estimates obtained in the tw o samples. The Mean Absolute Diﬀerence (MAD) b et wee n replicates w as 0.06 4 for Casp er, 0.126 for Cuﬄinks (97% increase), 16.2 for FluxCapacitor (253% in crease) and 8.5 for BitSeq (31% increase). Figure 4 shows a r oughly lin ear corre- lation for Ca sp er, Cuﬄinks a nd FluxCapacitor, the latter t wo frequen tly pro v id ing ˆ π d = 0 in one replicate and ˆ π d = 1 in the other. Bit Seq a voids these b oundaries b u t exhibits a s trongly nonlinear asso ciation. In terms of computational time, all metho d s requir ed roughly 10–20 min on 4 pro ces- sors. Because BitSeq mo d els the data at the base-pair resolution, it required substanti ally longer time to run on 12 cores. These r esults su ggest that Casp er p ro vides clea r adv anta ges ev en with earlier sequencing tec h nologies. 4.3. Exp erimental data fr om E NCODE pr oje ct. The tw o A549 r eplicated samples were sequenced in 2012 using Illumina HiSeq 2000. The r ead length w as 101 b p and the a v er age f r agmen t length was rou gh ly 300 bp (Figure 2 , top right) . T h ese are su bstant ially longer than th e 2009 samples from Sec- tion 4.2 , and r eﬂect the imp ro vemen t in sequencing tec hnologies. Similar to Section 4.2 , Figure 2 reve als i mp ortant d iﬀerences in the fragment length (top, righ t) and start (bottom, right) distributions b etw een samples. S ee also Supp lemen tary Section 1 and Supp lemen tary Figure 2, wh ere ˆ P S exhibits a stronger 3’ bias for genes longer than 5 kilo-bases. Figure 4 (left) and T able 4 compare the estimates obtained in the tw o replicates. Similar to the RGASP study (Section 4.2 ), Casp er sho ws a roughly linear asso ciation and substan tially higher consistency b et we en r eplicates. The MAD b et wee n replicates w as 0.057 for C asp er, 9.0 for Cuﬄinks (58% in - crease), 12.7 for FluxC apacitor (223% increase) an d 0.098 for BitSeq (72% increase). The computational time for Casp er w as c omparable t o that of Cuﬄinks, higher than FluxCapacitor and sub stan tially lo we r than BitSeq. The ﬁndings sh ow that the a d v an tage of mo d eling exon path counts o v er pairwise exon connections remains p ronounced as tec hnology ev olve s to se- quence longer fr agmen ts. AL TERNA TIVE S PLICING FROM R NA-S EQ DA T A 15 (a) (b) (c) Fig. 4. Comp arison of estimate d isoform expr ession π d b etwe en two r eplic ates in K562 and ENCODE studies. (a) Casp er with q d = 2 ; (b) Cuﬄinks; (c) FluxCap acitor; (d) Bit- Se q. 16 ROSSELL, S TEPHAN-OTTO A TTOLIN I , KROISS AND ST ¨ OCKER (d) Fig. 4. (Continue d). W e no w consider the p ossibilit y that some expr essed transcripts may n ot b e present in the UCS C genome annotations. W e u sed a Cuﬄink s RABT mo dule to iden tify no vel transcripts, and then ru n Casp er to joint ly estimate their expr ession with UCSC transcripts. C uﬄinks-RABT foun d 12,512 gene islands with no n ew transcripts, 6229 with some new transcr ip ts and 1527 completely new genes in sample 1. F or sample 2 the ﬁgur es w ere 11,912, 6983 and 1378 completely new genes. While new transcripts had negligible inﬂuence on genes with no new transcripts, in th e remainin g genes ˆ π d de- creased so that a prop ortion of the exp ression could b e assigned to the new v arian ts. F or further details see Supplement ary Sect ion 4. T hese ﬁndings suggest that curren t genome annotatio n s may miss a nonnegligible n umb er of expressed v ariants. F or a careful assessment we recommend follo wing a strategy akin to ours h ere, that is, co m b in ing our approac h w ith a de novo transcript disco ve ry metho d . T able 4 K562 and Enc o de studies. Me an absolute diﬀer enc e (MAD) i n ˆ π d b etwe en r eplic ates and CPU time on 2.8 GHz , 32 Gb OS X c omputer ( + : 4 c or es; ∗ : 12 c or es) K562 Encod e MAD CPU MAD CPU Casper + 6 . 4 × 10 − 2 11.1 min 5 . 7 × 10 − 2 2 h 11 min Cuﬄinks + 12 . 6 × 10 − 2 21.4 min 9 . 0 × 10 − 2 2h 13 min Flux + 16 . 2 × 10 − 2 9.0 min 12 . 7 × 10 − 2 1 h 17 min BitSeq ∗ 8 . 5 × 10 − 2 1 day 13 h 9 . 8 × 10 − 2 8 h 40 min AL TERNA TIVE S PLICING FROM R NA-S EQ DA T A 17 5. Discussion. W e prop osed a model to estimat e the expression of a set of kno wn alternativ ely sp liced v arian ts from RNA-se q data. T he model im- pro ves up on previous prop osals b y using exon paths, whic h are more in- formativ e than single or p airwise exon counts, and by ﬂ exibly estimating the fr agmen t start and length distribu tions. W e provided compu tationally eﬃcien t algorithms f or obtaining p oin t estimate s, asymptotic credibilit y in- terv als and p osterior samples. W e found that a fairly un informativ e p rior with q d = 2 impro v es precision relativ e to the t ypical q d = 1 equiv alen t to maxim um like liho o d estimation. The adv an tages stem from the usu al shrin k age argument: q d = 2 p o ols the estimates a w ay f rom th e b oun daries and reduces v ariance. Compared to comp eting approac hes, we observ ed s ubstanti al MSE reductions in simula- tions and increased correlation b et ween exp erimen tal replicates. In mo d ern studies we found that roughly 2 sequences out of 3 visited > 2 exon regions distinguishing v arian ts, s uggesting that the curr en t standard of r ep orting pairwise exon ju nctions adopted by most pu blic d atabases is far from op- timal. Rep orting exon paths w ould allo w researc hers to estimate isoform expression at a m uc h h igher precision. Give n that the metho d ology is im- plemen ted in the R pac k age ca sper , we b elieve that it sh ould b e of great v alue to practitioners. APPENDIX A: DERIV A TION OF EX ON P A TH PR OBABILITIE S Here w e describ e how to compu te the probab ility p k d of obs erving exon path k for an y sp licing v arian t d . Equ iv alen tly , we denote d by δ = ( i 1 , . . . , i | δ | ), where i j indicates the j th exon within d . Consider v arian t δ after splicing, that is, after remo ving th e introns. The new exon start p ositions are giv en b y s ∗ 1 = 1 and s ∗ k +1 = s ∗ k + e i k − s i k + 1 for k = 1 , . . . , | δ | − 1. The end of exon k is s ∗ k +1 − 1. Denote by S the read start position, L the fragmen t length, r the read length, and let T = s ∗ | δ | − 1 b e the transcript length of δ . The goal is to compute P ( ι l = ( i j , . . . , i j + k ) , ι r = ( i j ′ , . . . , i j ′ + k ′ ) | δ ) . W e note that b oth i j , . . . , i j + k and i j ′ , . . . , i j ′ + k ′ m u st b e consecutiv e exons un der v arian t δ , otherwise the probabilit y of the p ath is zero. The left read f ollo ws the exon p ath ι l = ( i j , . . . , i j + k ) if and only if the read: 1. Starts in exon j , that is, s ∗ j ≤ S ≤ s ∗ j +1 − 1. 2. Ends in exon j + k , that is, s ∗ j + k ≤ S + r − 1 ≤ s ∗ j + k +1 − 1. Similarly , the righ t read follo ws ι r = ( i j ′ , . . . , i j ′ + k ′ ) if and only if s ∗ j ′ ≤ S + L − r ≤ s j ′ +1 − 1 and s ∗ j ′ + k ′ ≤ S + L − 1 ≤ s ∗ j ′ + k ′ +1 − 1 . This implies that the desired p r obabilit y can b e w ritten as P ( a 1 ≤ S ≤ b 1 , a 2 ≤ S + L ≤ b 2 | δ ), where a 1 = max { s ∗ j , s ∗ j + k − r + 1 } , 18 ROSSELL, S TEPHAN-OTTO A TTOLIN I , KROISS AND ST ¨ OCKER b 1 = min { s ∗ j +1 − 1 , s ∗ j + k +1 − r } , (6) a 2 = max { s ∗ j ′ + r , s ∗ j ′ + k ′ + 1 } , b 2 = min { s ∗ j ′ +1 + r − 1 , s ∗ j ′ + k ′ +1 } . Assuming that the distrib ution of ( S, L ) dep ends on δ only throu gh its transcript length T , w e can write P ( a 1 ≤ S ≤ b 1 , a 2 ≤ S + L ≤ b 2 | T ) = X l P ( a 1 ≤ S ≤ b 1 , a 2 ≤ S + L ≤ b 2 | T , L = l ) P ( L = l | T ) (7) = X l P (max { a 1 , a 2 − L } ≤ S ≤ min { b 1 , b 2 − L }| T , L = l ) P ( L = l | T ) . In order to ev aluate ( 7 ), w e need to estimate the fragmen t length distribu tion P ( L = l | T ) and the distribu tion of the read s tart p osition S giv en L . W e assume that P ( L | T ) = P ( L = l )I( l ≤ T ) /P ( L ≤ T ) , that is, the conditional distribution of L giv en T is simp ly a truncated v ersion of the m arginal distribution. F urther , notice that the fragmen t end must h app en b efore the end of the t ranscript, that is, S + L − 1 ≤ T or, equ iv alen tly , the relativ e start p osition is t runcated S /T ≤ S T = ( T − L + 1) /T . T he relativ e start distribution is therefore truncated, that is, P ( S T ≤ z | T , L = l ) = ϕ (min { z ,S T } ) ϕ ( S T ) , where ϕ ( z ) = P ( S T ≤ z ) is the d istr ibution of the r elativ e read start S T . Under these assumptions, the pr obabilit y of observing th e exon path ι l = ( i j , . . . , i j + k ), ι r = ( i j ′ , . . . , i j ′ + k ′ ) under v arian t δ is equal to X l [( ϕ (min { b 1 /T , ( b 2 − l ) /T , S T } ) − ϕ (min { max { ( a 1 − 1) /T , ( a 2 − l − 1) /T } , S T } )) /ϕ ( S T )] + P ( L = l | T ) , where a 1 , b 1 , a 2 and b 2 are give n in ( 6 ). Giv en th at h ighly precise estimates of P ( L = l ) and ϕ ( · ) are t ypically a v ailable, for computational simplicit y we treat them as kno wn and p lug th em in to ( 8 ). APPENDIX B: EM ALGORITHM DERIV A TION 1. E-step . Let δ i ∈ { 1 , . . . , | ν |} b e laten t v ariables in d icating the v ariant that r eads i = 1 , . . . , N come from. The augmente d log-posterior is prop ortional to l 0 ( π | y , δ ) = log P ( π | ν ) + log P ( y , δ | π ) (8) = | ν | X d =1 ( q d − 1) lo g ( π d ) + N X i =1 | ν | X d =1 I( δ i = d )[ log( p y i d ) + log ( π d )] . AL TERNA TIVE S PLICING FROM R NA-S EQ DA T A 19 Considering δ i as a rand om v ariable, the exp ected v alue of ( 8 ) giv en y and π = π ( j ) is equal to E ( l 0 ( π ′ | y , δ ) | y , π ( j ) ) = | ν | X d =1 ( q d − 1) log ( π d ) (9) + N X i =1 | ν | X d =1 P ( δ i = d | y i , π ( j ) )(log( p y i d ) + log ( π ′ d )) . 2. M-step . The go al is to maximize ( 9 ) with r esp ect to π ′ . Let γ id = P ( δ i = d | y i , π ( j ) ) and re-parameteriz e π | ν | = 1 − P | ν |− 1 d =1 π d . Setting the partial deriv ative s with resp ect to π ′ d to zero giv es the system of equations π ′ d 1 − P | ν |− 1 d =1 π ′ d = q d − 1 + P N i =1 γ id q | ν | − 1 + P N i =1 γ i | ν | , whic h has the trivial solution π ′ d ∝ q d − 1 + P N i =1 γ id . By plu gging in γ id = p y i d π ( j ) d / P | ν | d =1 p y i d π ( j ) d , w e obtain π ′ d ∝ q d − 1 + N X i =1 p y i d π ( j ) d P | ν | d =1 p y i d π ( j ) d . Finally , sin ce x k = P N i =1 I( y i = k ), w e can group a ll y i ’s taking t he same v alue and ﬁnd the maximum as π ′ d ∝ q d − 1 + |P | X k =1 x k p k d π ( j ) d P | ν | d =1 p k d π ( j ) d , (10) re-normalizing π ′ so that P | ν | d =1 π ′ d = 1 . APPENDIX C: ASYMPTO TIC PO STERIOR APPRO XIMA TION Here w e deriv e an asymptotic app ro ximation to P ( π | ν , Y ), the p osterior distribution of the splici ng v ariants expression π conditional on a mo del ν and the obs er ved d ata Y . Giv en that π = ( π 1 , . . . , π | ν | ) ∈ [0 , 1] | ν | with P | ν | i =1 π i = 1, we re-parameterize to θ = ( θ 1 , . . . , θ | ν |− 1 ) ∈ ℜ | ν |− 1 , w here θ d = log( π d +1 π 1 ) for d = 1 , . . . , | ν | − 1. The goal is to appro ximate P ( θ | ν , Y ) ∼ N ( µ , Σ ). F or n otational simplicit y , in the remainder of th e section w e d r op the conditioning on ν . 20 ROSSELL, S TEPHAN-OTTO A TTOLIN I , KROISS AND ST ¨ OCKER A prior P π ( π ) in d uces a prior P θ ( θ ) = P π ( π ( θ )) × | G ( θ ) | on θ , where G ( θ ) is the matrix with ( d, l ) elemen t G dl = ∂ ∂ θ l π d ( θ ) and inv erse tran s form π 1 ( θ ) = (1 + P | ν |− 1 j =1 e θ j ) − 1 , π d ( θ ) = π 1 ( θ ) exp { θ d − 1 } for d > 1. Deﬁne f ( θ ) = log ( P ( Y | θ )) + log( P θ ( θ )). Up to an additive constant, f ( θ ) is equal to th e target log-p osterior distribution of θ giv en Y . W e center the appro ximating Normal at the p osterior m o de, that is, µ = argmax θ ∈ℜ | ν |− 1 f ( θ ). W e set Σ = S − 1 , where S is th e Hessian of f ( θ ) ev aluated at θ = µ w ith ( l, m ) elemen t S lm = ∂ 2 ∂ θ l ∂ θ m f ( θ ). W e appro x im ate µ d = log ( π ∗ d +1 π d ), where π ∗ is the p osterior mo de for π provided by the EM algorithm. Under a π ∼ Diric h let( q ) prior, simple algebra giv es σ lm = ∂ 2 ∂ θ l ∂ θ m f ( θ ) = |P | X k =1 x k ( P | ν | d =1 p k d H dlm )( P | ν | d =1 p k d π d ( θ )) − ( P | ν | d =1 p k d G dl )( P | ν | d =1 p k d G dm ) ( P | ν | d =1 p k d π d ( θ )) 2 (11) + | ν | X d =1 ( q d − 1) H dlm π d ( θ ) − G dl G dm π d ( θ ) 2 , where x k = P N i =1 I( y i = k ) is th e n u m b er of reads follo win g exon path k , p k d = P ( Y i = k | δ = d ) is the probability of observin g path k under v arian t d , th e gradient for π d ( θ ) is G dl = ∂ ∂ θ l π d ( θ ) as b efore and the Hessian is H dlm = ∂ 2 ∂ θ l ∂ θ m π d ( θ ). W e complete the d eriv ation by pro viding exp ressions for G dl and H dlm . Let s ( θ ) = 1 + P | ν |− 1 j =1 e θ j , then G dl = − e θ l s ( θ ) 2 if d = 1 , (12) − e θ d − 1 + θ l s ( θ ) 2 + I( l = d − 1) e θ l s ( θ ) if d ≥ 2 and H dlm = 2 e θ l + θ m s ( θ ) 3 − I( l = m ) e θ l s ( θ ) 2 if d = 1 , (13) 2 e θ d − 1 + θ l + θ m s ( θ ) 3 − I( l = d − 1) e θ l + θ m s ( θ ) if d ≥ 2 , m 6 = l , m 6 = d − 1 , − 2 e 2 θ m s ( θ ) 2 + 2 e 3 θ m s ( θ ) 3 + e θ m s ( θ ) − 2 e 2 θ m s ( θ ) 2 if d ≥ 2 , m = l , m = d − 1 , − e θ d − 1 + θ l s ( θ ) 2 + 2 e θ d − 1 + θ l + θ m s ( θ ) 3 otherwise. AL TERNA TIVE S PLICING FROM R NA-S EQ DA T A 21 Ac kn owledgmen ts. D. Rossell and C. Stephan-Otto A ttolini con tr ib uted equally to this w ork. The authors wish to thank Mod esto Orozco for useful discussions. SUPPLEMENT AR Y MA TERIAL Supp lemen tary resu lts (DOI: 10.1214/1 3-A O AS687SUPP ; .p df ). In Rossell et al. ( 2014 ) we assess th e dep en dence of fragment start and length distri- butions on gene length, sh ow add itional sim ulation results, assess MCMC con verge n ce and apply th e approac h to transcripts foun d de novo . REFERENCES Ameur, A. , We tterbom, A . , Feuk, L. and Gyllensten, U. (2010). Global and unbi- ased detection of splice junctions from RNA -seq data. Gen ome Biol. 11 R34. Blencow e, B. J. (2006). Alternative splicing: New insights from global analyses. Cel l 126 37–47. Casella, G. and Ber ge r, R. L. (2001). Statistic al I nfer enc e , 2nd ed. Duxbu ry , N. Scit- uate. Dempster, A. P. , Laird, N. M . and Rubin, D. B . ( 1977). Maxim um likel ihoo d from incomplete data via the EM algo rithm. J. R. Stat. So c. Ser. B Stat. Metho dol. 39 1–38. MR0501537 ENCODE Pro ject Co nsortium (2004). The EN CODE (ENCyclop edia Of DNA Elements) Pro ject. Scienc e 306 636–640 . Glaus, P. , Honkela, A. and Ra ttra y, M. (2012). Identi fying diﬀerentiall y expressed transcripts from RN A-seq data with biologica l v ariation. Bioinformatics 28 1721–1728. Guttman, M. , G arber, M. , Levi n, J. Z. , Donaghey, J. , Robinson, J. , Adiconis, X. , F an, L. , Koziol, M. J. , Gnirke, A. , N usbaum, C. , Rinn, J. L. , La nder, E. S. and Regev, A . (2010). Ab initio reconstruction of cell type- sp eciﬁc transcriptomes in mouse rev eals the conserved multi-exonic structure of lincRNAs. Natur e Biote chnoly 28 503–510 . Hol t, R. A. and Jones, S. J. M. (2008). The new p aradigm of ﬂow cell sequ encing. Genome R ese ar ch 18 839–846. Jiang, H. and Wong, W . H. (2009). Statistical inferences for isoform expression in RNA- Seq. Bioinformatics 25 1026–10 32. Kaplan, E. L. and Meier, P. (195 8). Nonparametric estimation from in complete obser- v ations. J. Amer. Statist. Asso c. 53 457–4 81. MR 0093867 Ka tz, Y. , W an g, E. T. , Airoldi, E. M. and Bu rge, C. B. (2010). Analysis and d esign of RNA sequencing exp eriments for iden tifying isofor m regulation. Nat. M etho ds 7 1009– 1015. Lacr oix, V. , Sam meth, M. , Guigo, R. and Berger on, A. (2008). Ex act T ranscriptome Reconstruction from Short Sequ ence Reads. In Pr o c e e dings of the 8th International Workshop on Algorithms in Bioinformatics . 50–63. Sp ringer, Berlin. Langmead, B. , Trapnell, C . , Pop, M. and Salzberg, S. L. (2009 ). Ultrafast and memory-eﬃcient alignmen t of short DN A sequences t o t h e human genome. Genome Biol. 10 R25. Li, H. and Durbin, R. (2009). F ast and accurate short read alignmen t with Burrows– Wheeler transform. Bioinformatics 25 1754–176 0. 22 ROSSELL, S TEPHAN-OTTO A TTOLIN I , KROISS AND ST ¨ OCKER Li, R. , Yu, C. , Li, Y . , Lam, T. W . , Yiu, S. M. , Kristi ansen, K. and W ang, J. (2009). SOAP2: An improv ed ultrafast to ol for short read alignment. Bioinformatics 25 1966– 1967. Montgomer y, S. B. , Sammeth, M. , Gutie rrez-Arcelus, M. , Lach, R. P. , Ingle, C . , Nisbett, J. , Guigo, R. and Derm itzakis, E. T. (2010). T ranscriptome genetics using second generation sequencing in a Caucasian p opulation. Natur e 464 773–77 7. Mor t aza vi, A . , W illiams, B. A. , M cCue, K. , S chaeffer, L. and B ., W. (2008). Mapping and quantifying mammalian transcriptomes by R NA-Seq . Natur e M etho ds 5 621–628 . Pepke, S. , W old, B. and Mor t aza vi, A. (2009). Comput ation for ChI P-seq and R NA- seq studies. Nat. Metho ds 6 S22–S32. Ro ber ts, A. , Trapnell, C. , Donaghey, J. , R inn, J. L. and P achter, L. (2011a). Improving R NA-S eq ex pression estimates by correcting for fragment bias. Genome Biol. 12 R22. Ro ber ts, A. , Pimentel, H. , Trapnell, C. and P achter, L. (2011b). Identiﬁcation of nov el transcripts in annotated genomes using RNA-Seq. Bioinformatics 27 2325–2 329. Ro gers, M. F. , Thomas, J. , Red dy, A. S. and Ben-Hur, A. (2012). SpliceGrapher: Detecting patterns of alternativ e splicing from RNA-S eq data in the context of gene mod els and EST data. Genome Bi ol. 13 R4. Ro ssell, D. , Stephan-Otto A ttolini, C. , Kr oiss, M. and St ¨ ocker, A. (2014). S up- plement to “Qu antifying alternative splicing from paired-end RNA-sequencing data.” DOI: 10.1214 /13-A OAS687SUPP . Salzman, J. , Jiang, H. and Wong, W. H. (2011). Statistical mo deling of RN A-Seq d ata. Statist. Sci. 26 62–8 3. MR2849910 Therneau, T. and Lumley, T. (2011). Surviv al: Surviva l analysis, including p en alised lik elihoo d. R pack age version 2.36-10. Trapnell, C. , P achter, L. and Salz berg, S. L. (2009). T op H at: Discov ering splice junctions with RNA- S eq. Bioinformatics 25 110 5–1111. Trapnell, C. , William s, B. A. , Per tea, G. , Mor t aza vi, A . , Kw an, G. , v an Baren, M. J. , Salzberg, S. L. , W old, B. J. and P a ch ter, L. ( 2010). T ranscript assem bly and quantiﬁcation b y RNA- S eq rev eals unannotated transcripts and isoform switc hin g during cell diﬀerentiation. Nat . Biote chnol. 28 511–515. Trapnell, C. , Rober ts, A. , Goff, L. , Per tea, G . , Ki m, D. , Ke lley, D. R. , Pi- mentel, H . , S alzberg, S. L. , Rinn, J. L. and P achter, L. (2012). Diﬀeren tial gene and transcript exp ression analysis of R NA-seq exp eriments with T opH at and Cuﬄinks. Natur e Pr oto c ols 7 562–57 8. Wu, Z. , W ang, X . and Zhang, X. (2011). Using non-u niform read distribution mo dels to impro ve is oform expression inference in RNA-S eq. Bioinformatics 27 502–508 . Wu, J. , Akerman, M. , Sun, S . , McCombie, W. R. , Krai ner, A. R. and Zhang, M. Q. (2011). Sp liceT rap: A method t o quantify alternative splicing under single cellular con- ditions. Bioinf ormatics 27 3010–3016 . Xing, Y . , Yu, T. , Wu, Y. N. , Ro y, M. , Kim, J. and Lee , C. (2006). An exp ectation– maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs. Nucleic. A cids R es. 34 3150–31 60. D. R ossell Dep ar tment of St a tistics University of W ar wick Gibbel Hill Rd. Coventr y CV4 7 AL United Kingdom E-mail: D.Rossell@warwic k.ac.uk C. Stephan-Otto A ttolini Institute for Research in Biom edicine of Barcelona Baldiri Reixac 10 Barcelona 08028 Sp ain E-mail: camille.stephan@irbbarcelona.org AL TERNA TIVE S PLICING FROM R NA-S EQ DA T A 23 M. Krois s LMU Munich Geschwister-Scholl-Pla tz 1 M ¨ unchen 089 2180-0 Germany E-mail: kroissm@in.tum.de A. St ¨ ocker TU Munich Geschwister-Scholl-Pla tz 1 M ¨ unchen 089 2 180-0 Germany E-mail: al.st@web.de The Annals of Applie d Statistics 2015, V ol. 9, No. 3, 1706–170 7 DOI: 10.1214 /15-A OAS855 c  Institute of Mathematical Statistics , 2 015 CORRIGENDUM QUANTIFYING AL TERNA TIVE SPLICING FR OM P A IRED-END RNA-SEQ DA T A By D a vid Rossell ∗ , Camille Ste phan-Otto A ttolini † , Manuel Kroiss ‡ , § and Al mond St ¨ ocker ‡ University of Warwick ∗ , Institute for R ese ar ch in Biome dici ne of Bar c elona † , LMU Mu nich ‡ and TU Munich § In Figure 4(d) of Rossell et al. ( 2014 ) (Section 4.3), we follo w ed the stan- dard pip eline for BitSeq and u sed the b o wtie aligner (v ersus T ophat for th e other metho ds, casp er, Cuﬄ in ks and BitSeq). The BitSeq authors noted that w e u s ed the b owtie 1 v ersion, wh ic h ga ve v ery low mapping rates ( < 2% aligned r eads in b oth samples). Figure 1 b elo w u ses b o w tie2, whic h gives mapping rates > 70% (the MAD b etw een replicates also drops, fr om 0.098 to 0.062). W e thank the Bitseq authors for alerting u s to b o wtie2. Fig. 1. Corr e cte d Figur e 4d , right p anel. Received March 2015; revised June 2015. This is an electronic reprint of the or iginal a r ticle published by the Institute o f Mathematical Statis tics in The A nn als of Appli e d Statistics , 2015, V ol. 9, No. 3, 1706–1 707 . This r e print diﬀers from the original in pa g ination and typog raphic detail. 1 2 ROSSELL, S TEPHAN-OTTO A TTOLIN I, KROISS AND ST ¨ OCKER REFERENCE Ro ssell, D. , S tephan-Otto A ttolini, C. , Kroiss, M. and St ¨ ocker, A . ( 2014). Quan- tifying alternative splicing from p aired-end RN A-sequen cing data. A nn. Appl. Stat. 8 309–330 . MR3191992 D. R ossell Dep ar tment of St a tistics University of W ar wick Gibbel Hill Rd. Coventr y CV4 7 AL United Kingdom E-mail: D.Rossell@warwic k.ac.uk C. Stephan-Otto A ttolini Institute for Research in Biom edicine of Barcelona Baldiri Reixac 10 Barcelona 08028 Sp ain E-mail: camille.stephan@irbbarcelona.org M. Krois s LMU Munich Geschwister-Scholl-Pla tz 1 M ¨ unchen 089 2180-0 Germany E-mail: kroissm@in.tum.de A. St ¨ ocker TU Munich Geschwister-Scholl-Pla tz 1 M ¨ unchen 089 2 180-0 Germany E-mail: al.st@web.de

Quantifying alternative splicing from paired-end RNA-sequencing data

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment