A Transport-Friendly NIC for Multicore/Multiprocessor Systems


Authors: **Wenji Wu, Matt Crawford, Phil DeMar (Fermilab)**

Abstract: Receive side scaling (RSS) is a network interface card (NIC) technology. It provides the benefits of parallel receive processing in multiprocessing environments. However, existing RSS-enabled NICs lack a critical data steering mechanism that would automatically steer incoming network data to the same core on which its application process resides. This absence causes inefficient cache usage if an application is not running on the core on which RSS has scheduled the received traffic to be processed. In Linux systems, it cannot even ensure that packets in a TCP flow are processed by a single core, even if the interrupts for the flow are pinned to a specific core. This results in degraded performance. In this paper, we develop such a data steering mechanism in the NIC for multicore or multiprocessor systems. This data steering mechanism is mainly targeted at TCP, but it can be extended to other transport layer protocols. We term a NIC with such a data steering mechanism "A Transport-Friendly NIC" (A-TFN). Experimental results have proven the effectiveness of A-TFN in accelerating TCP/IP performance.

1. Introduction & Motivation

Computing is now shifting towards multiprocessing (e.g., SMT, CMP, SMP, and NUMA). The fundamental goal of multiprocessing is improved performance through the introduction of additional hardware threads, CPUs, or cores (all of which will be referred to as "cores" for simplicity). The emergence of multiprocessing has brought both opportunities and challenges for TCP/IP performance optimization in such environments. Modern network stacks can exploit parallel cores to allow either message-based parallelism or connection-based parallelism as a means of enhancing performance [1].
To date, major network stacks such as Windows, Solaris, Linux, and FreeBSD have been redesigned and parallelized to better utilize additional cores. While existing OSes exploit parallelism by allowing multiple threads to carry out network operations concurrently in the kernel, supporting this parallelism carries significant costs, particularly in the context of contention for shared resources, software synchronization, and poor cache efficiency [2][3]. Various optimization efforts, such as fine-grained locking and read-copy-update technologies, have been helpful. While these optimizations definitely help improve TCP/IP processing in multiprocessing environments, they alone are not sufficient to keep pace with network speeds. Scalable, efficient network I/O in multiprocessing environments requires further optimization and coordination across all layers of the network stack, from network interface to application. Investigations regarding processor affinity [4][5][6][7] indicate that the coordinated affinity scheduling of protocol processing and network applications on the same target cores can significantly reduce contention for shared resources, minimize software synchronization overheads, and enhance cache efficiency. Coordinated affinity scheduling of protocol processing and network applications on the same target cores has the following goals: (1) Interrupt affinity: network interrupts of the same type should be directed to a single core. Redistributing network interrupts in either a random or round-robin fashion to different cores has undesirable side effects [6]. (2) Flow affinity: packets of each flow should be processed by a single core. Flow affinity is especially important for TCP.
TCP is a connection-oriented protocol, and it has a large and frequently accessed state that must be shared and protected when packets from the same connection are processed. Ensuring that all packets in a TCP flow are processed by a single core reduces contention for shared resources, minimizes software synchronization, and enhances cache efficiency. (3) Network data affinity: incoming network data should be steered to the same core on which its application process resides. This is becoming more important with the advent of Direct Cache Access (DCA) [8][9]. DCA is a NIC technology that seeks to directly place a received packet into a core's cache for immediate access by the protocol stack and application. Network data affinity maximizes cache efficiency and reduces core-to-core synchronization. In a multicore system, the function of network data steering is executed by directing the corresponding network interrupts to a specific core (or cores). RSS [10] is a NIC technology. It supports multiple receive queues and integrates a hashing function in the NIC. The NIC computes a hash value for each incoming packet. Based on hash values, the NIC assigns packets of the same data flow to a single queue and evenly distributes traffic flows across queues. With Message Signaled Interrupt (MSI/MSI-X) [11] support, each receive queue is assigned a dedicated interrupt, and RSS steers interrupts on a per-queue basis. RSS provides the benefits of parallel receive processing in multiprocessing environments. Operating systems like Windows, Solaris, Linux, and FreeBSD now support interrupt affinity.
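On Linux, interrupt affinity is configured by writing a CPU bitmask to `/proc/irq/<N>/smp_affinity`. A small helper for computing that mask is sketched below; the helper name and the IRQ number in the comment are our own illustration, though the `/proc` interface itself is standard Linux.

```python
def smp_affinity_mask(cores):
    """Compute the hex CPU bitmask that Linux expects in
    /proc/irq/<N>/smp_affinity: bit i set means core i may service the IRQ."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return format(mask, "x")

# Pinning an interrupt to core 2 corresponds to mask "4"; as root one
# would run something like: echo 4 > /proc/irq/59/smp_affinity
# (the IRQ number 59 here is hypothetical).
print(smp_affinity_mask({2}))
```

The same bitmask convention is used by `taskset` for process affinity, which is how the experiments later in the paper pin iperf to specific cores.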
When an RSS receive queue (or interrupt) is tied to a specific core, packets from the same flow are steered to that core (flow pinning [12]). This ensures flow affinity on most OSes, except for Linux (see Section 2). However, RSS has a limitation: it cannot steer incoming network data to the same core where its application process resides. The reason is simple: the existing RSS-enabled NICs do not maintain the relationship "Traffic Flows → Network Applications → Cores" in the NIC (since network applications run on cores, the most critical relationship is simply "Traffic Flows → Cores (Applications)"), and the existing OSes do not support such a capability. This is symptomatic of a broader disconnect between existing software architecture and multicore hardware. On OSes like Windows, if an application is not running on the core on which RSS has scheduled the received traffic to be processed, network data affinity cannot be achieved, resulting in degraded cache efficiency [10]. This limitation might cause serious performance degradation for NUMA systems. Furthermore, on OSes like Linux, if an application runs on cores other than those where its corresponding RSS network interrupts are directed, Linux TCP processing might alternate between different cores even if the interrupts for the flow are pinned to one core (see Section 2). There will be neither flow affinity nor network data affinity. As a result, it will lead to poor cache efficiency and cause significant core-to-core synchronization overheads. The overall system efficiency could be severely degraded. NIC technologies such as Intel's VMDq [27] or the PCI-SIG's SR-IOV [28] do provide data steering capabilities for NICs.
However, they are I/O virtualization technologies targeting virtual machines in virtualized environments, with different research issues. In this paper, we propose a NIC mechanism to remedy the RSS limitation. It steers incoming network data to the same core on which its application resides. Our data steering mechanism is mainly targeted at TCP, but it can be extended to UDP and SCTP. We term a NIC with such a data steering mechanism A Transport-Friendly NIC, or A-TFN. The basic idea is simple: A-TFN maintains the relationship "Traffic Flows → Cores (Applications)" in the NIC, and OSes are correspondingly enhanced to support this capability. For transport layer traffic, A-TFN maintains a Flow-to-Core table in the NIC, with one entry per flow; each entry tracks which receive queue (core) a flow should be assigned to. A-TFN makes use of the facts that (1) TCP connections always involve packets flowing in both directions (ACKs, if nothing else), and (2) when an application makes socket-related system calls, the calling application's context may be borrowed to carry out network processing in process context. With each outgoing transport-layer packet, the OS records a processor core ID and uses it to update the entry in the Flow-to-Core table. As soon as any network processing is performed in a process context, A-TFN learns of the core on which an application process resides and can steer future incoming traffic to the right core. Clearly, to design such a mechanism, there is an obvious trade-off between the amount of work done in the NIC and in the OS. In this paper, we discuss two design options. Option 1 is to minimize changes in the OS and focus instead on identifying the minimal set of mechanisms to add to the NIC.
Clearly, this design adds complexity and cost to the NIC. On the other end of the design space, one could let the OS update the Flow-to-Core table directly without changing anything in the NIC hardware (option 2). Conceptually, this approach could be fairly straightforward to implement. However, it might add significant extra communication overheads between the OS and the NIC, especially when the Flow-to-Core table gets large. Due to space limitations, this paper is mainly focused on the first design option. The new NIC is emulated in software, and the results show that the solution is effective and practical in remedying RSS's limitation. In our future work, we will explore the second design option. The contributions of this paper are threefold. First, we show for certain OSes, such as Linux, that tying a traffic flow to a single core does not necessarily ensure flow affinity or network data affinity. Second, we show that existing RSS-enabled NICs lack a mechanism to automatically steer packets of a data flow to the same core(s) where they will be protocol-processed and finally consumed by the application. This is symptomatic of a broader disconnect between existing software architecture and multicore hardware. Third, we develop such a data steering mechanism in the NIC for multicore or multiprocessor systems.

The remainder of the paper is organized as follows: In Section 2, we present the problem formulation. Section 3 describes the A-TFN mechanism. In Section 4, we discuss experiment results that showcase the effectiveness of our A-TFN mechanism. In Section 5, we present related research. We conclude in Section 6.

2. Problem Formulation

2.1 Packet Receive-Processing with RSS

RSS is a NIC technology.
It supports multiple receive queues and integrates a hashing function in the NIC. The NIC computes a hash value for each incoming packet. Based on hash values and an indirection table, the NIC assigns packets of the same data flow to a single queue and evenly distributes traffic flows across queues. With Message Signaled Interrupt (MSI/MSI-X) and flow pinning support, each receive queue is assigned a dedicated interrupt and tied to a specific core. The device driver allocates and maintains a ring buffer for each receive queue within system memory. For packet reception, a ring buffer must be initialized and pre-allocated with empty packet buffers that have been memory-mapped into the address space that is accessible by the NIC over the system I/O bus. The ring buffer size is device- and driver-dependent. Fig. 1 illustrates packet receive-processing with RSS: (1) When incoming packets arrive, the hash function (e.g., Toeplitz hashing [10]) is applied to the header to produce a hash result. The hash type, which is configurable, controls which incoming packet fields are used to generate the hash result. OSes can enable any combination of the following fields: source address, source port, destination address, destination port, and protocol. The hash mask is applied to the hash result to identify the number of bits that are used to index the indirection table. The indirection table is the data structure that contains an array of core numbers to be used for RSS. Each lookup from the indirection table identifies the core and, hence, the associated receive queue. (2) The NIC assigns incoming packets to the corresponding receive queues. (3) The NIC DMAs (direct memory access) the received packets into the corresponding ring buffers in the host system memory.
(4) The NIC sends interrupts to the cores that are associated with the non-empty queues. Subsequently, the cores respond to the network interrupts and process received packets up through the network stack from the corresponding ring buffers one by one. The OS can periodically rebalance the network load on cores by updating the indirection table, based on the assumption that the hash function will evenly distribute incoming traffic flows across the indirection table entries. Since the OS does not know to which specific entry in the indirection table an incoming traffic flow will be mapped, it can only passively react to load imbalance situations by changing each core's number of appearances in the indirection table. For better load balancing performance, the size of the indirection table is typically two to eight times the number of cores in the system [10]. For example, in Fig. 1, the indirection table has 8 entries, which are populated as shown. As such, traffic loads directed to Cores 0, 1, 2, and 3 are 50%, 25%, 12.5%, and 12.5%, respectively. Some OSes like Linux and FreeBSD do not support the function of an indirection table; the incoming packets are directly mapped to the receive queues. These OSes cannot perform dynamic load balancing.

2.2 RSS Limitation and the Reasons

RSS provides the benefits of parallel receive processing. However, this mechanism does present a certain limitation: it cannot steer incoming network data to the same core on which its application resides.
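The receive path of Section 2.1 (hash the header fields, index the indirection table, pick a queue) can be modeled in a few lines. The following is a minimal software sketch, not NIC firmware: the 40-byte key and the 8-entry table population are made-up values, with the table mirroring the Fig. 1 example in which cores 0-3 receive 50%, 25%, 12.5%, and 12.5% of the flows. The sketch also makes the limitation concrete: the chosen queue depends only on the header fields and the table, never on where the application runs.

```python
from collections import Counter

def toeplitz_hash(data: bytes, key: bytes) -> int:
    """Software model of the Toeplitz hash used by RSS: for every set bit
    of the input (MSB first), XOR in the 32-bit window of the secret key
    starting at that bit position."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for b in range(8):
            if byte & (0x80 >> b):
                shift = key_bits - 32 - (i * 8 + b)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result

def rss_queue(pkt_fields: bytes, key: bytes, indirection_table: list) -> int:
    """Apply the hash mask (low-order bits of the hash) to index the
    indirection table; each entry names a core and hence a receive queue."""
    h = toeplitz_hash(pkt_fields, key)
    return indirection_table[h & (len(indirection_table) - 1)]

# Hypothetical 40-byte key and the Fig. 1-style 8-entry table:
# core 0 appears 4 times, core 1 twice, cores 2 and 3 once each.
KEY = bytes(range(40))
TABLE = [0, 0, 1, 0, 2, 1, 0, 3]

# Expected per-core traffic share, assuming the hash spreads flows
# uniformly across table entries: 50% / 25% / 12.5% / 12.5%.
shares = {core: n / len(TABLE) for core, n in Counter(TABLE).items()}

# A flow key is (src addr, dst addr, src port, dst port); every packet
# of the same flow hashes identically, so the flow sticks to one queue.
flow = (bytes([10, 0, 0, 1]) + bytes([10, 0, 0, 2])
        + (2794).to_bytes(2, "big") + (80).to_bytes(2, "big"))
print(rss_queue(flow, KEY, TABLE), shares)
```

Because `rss_queue` is a pure function of the header fields, rebalancing can only be done the way the text describes: by editing how often each core appears in `TABLE`, without knowing which flows land on which entry.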
The reason is simple: the existing RSS-enabled NICs do not maintain the relationship "Traffic Flows → Network Applications → Cores" in the NIC (since network applications run on cores, the most critical relationship is simply "Traffic Flows → Cores (Applications)"), and the existing OSes do not support such a capability. When packets arrive, the hash function is applied to the header to produce a hash result. Based on the hash values, the NIC assigns packets to receive queues and then cores, with no way to consider on which core the corresponding application is running. Although receive queues can be instructed to send interrupts to a specific set of cores, existing general-purpose OSes can only provide limited process-to-interrupt affinity capability; network interrupt delivery is not synchronized with process scheduling. This is because the OS schedulers have other priorities, such as load balancing and fairness, over process-to-interrupt affinity. Besides, multiple network applications' traffic might map to a single interrupt, which brings new challenges to an OS scheduler. Therefore, a network application might be scheduled on cores other than those where its corresponding network interrupts are directed. This is symptomatic of a broader disconnect between existing software architecture and multicore hardware. OSes like Windows implement the function of the indirection table, which can provide limited data steering capabilities for RSS-enabled NICs. However, it still cannot steer packets of a data flow to the same core where the application process resides.

Fig. 1 Packet Receiving Process with RSS

Turning again to Fig. 1, process P is scheduled to run on Core 3. Its traffic might be hashed to an entry that directs to other cores.
The OS does not know to which specific entry in the indirection table a traffic flow will be mapped. With existing RSS capability, there are many cases in OSes in which a network application resides on cores other than those to which its corresponding network interrupts are directed: (1) A single-threaded application might handle multiple concurrent TCP connections. Assuming such an application handles n concurrent TCP connections and runs on an m-core system, an RSS-enabled NIC will evenly (statistically) distribute the n connections across the m cores. Since the application can only run on a single core at any moment, only n/m connections' network interrupts are directed to the same core where the application runs. (2) Soft partition technologies like CPUSET [13] are applied in the context of networking environments. Since the OS (or system administrator) has no way of knowing to which specific cores traffic flows will be mapped, network applications might be soft-partitioned on cores other than those to which their network interrupts are directed. (3) The general-purpose OS scheduler prioritizes load balancing or power saving over process-to-interrupt affinity [14][15]. For OSes like Linux, when the multicore peak performance mode is enabled, the scheduler tries to use all cores in parallel to the greatest extent possible, distributing the load equally among them. When the multicore power saving mode is enabled, the scheduler is biased to restrict the workload to a single physical processor. As a result, a network application might be scheduled on cores other than those to which its network interrupts are directed. For clarity, we illustrate the above cases in Fig. 2. The system contains two physical processors, each with two cores.
P1–P5 are processes that run within the system. P1 is a network process that includes traffic flows. An RSS-enabled NIC steers the traffic flows to different cores, as shown in the figure (red arrows). In all of these cases, P1 resides on cores other than those to which its corresponding network interrupts are directed. On OSes like Windows, when a core responds to the network interrupt, the corresponding interrupt handler is called, within which a deferred procedure call (DPC) is scheduled. On the core, the DPC processes received packets up through the network stack from the corresponding ring buffer one by one [16]. Therefore, on Windows, tying a traffic flow to a single core does ensure interrupt affinity and flow affinity. However, if network interrupts are not directed to cores on which the corresponding applications reside, network data affinity cannot be achieved, resulting in degraded cache efficiency [10]. This might cause serious performance degradation for NUMA systems. On some OSes, like Linux, tying a traffic flow to a single core does not necessarily ensure flow affinity or network data affinity, due to Linux TCP's unique prequeue-backlog queue design. In the following sections, we discuss in detail why the combination of RSS and flow pinning cannot ensure flow affinity and network data affinity in Linux.

2.3 Linux Network Processing in Multicore Systems

As a modern parallel network stack, Linux exploits packet-based parallelism, which allows multiple threads to simultaneously process different packets from the same or different connections. Two types of threads may perform network processing in Linux: application threads in process context and interrupt threads in interrupt context.
When an application makes socket-related system calls, that application's process context may be borrowed to carry out network processing. When a NIC interrupts a core, the associated handler services the NIC and schedules the softirq, softnet. Afterwards, the softnet handler processes received packets up through the network stack in interrupt context. TCP is a connection-oriented protocol, and it has a large and frequently accessed state that must be shared and protected. In the case of Linux TCP, the data structure socket maintains a connection's various TCP states, and there is a per-socket lock to protect it from unsynchronized access. The lock consists of a spinlock and a binary semaphore. The binary semaphore construction is based on the spinlock. In Linux, since an interrupt thread cannot sleep, when it accesses a socket, the socket is protected with the spinlock. When an application thread accesses a socket, the socket is locked with the binary semaphore and is considered "owned-by-user." The binary semaphore synchronizes multiple application threads among themselves. It is also used as a flag to notify interrupt threads that a socket is "owned-by-user," to coordinate synchronized access to the socket between interrupt and application threads. Our previous research [17][18] studied the details of the Linux packet receiving process. Here, we simply summarize Linux TCP processing of the data receive path in interrupt and process contexts, respectively.

Fig. 2 Network Irqs and Apps. on Different Cores: A. P1 has multiple concurrent connections; B. Soft partitioning; C. Load balancing; D. Power saving
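The two-level per-socket lock described above can be sketched as follows. This is a didactic model with invented names, not kernel code: a short-lived spinlock guards the state, while an "owned-by-user" flag plays the role of the binary semaphore that application threads hold across their (possibly sleeping) socket operations.

```python
import threading

class SocketLock:
    """Toy model of the Linux per-socket lock: a spinlock protects a
    binary 'owned-by-user' flag. Interrupt threads (which cannot sleep)
    take only the spinlock; application threads additionally set the flag."""

    def __init__(self):
        self._spin = threading.Lock()  # stand-in for the socket spinlock
        self.owned_by_user = False     # stand-in for the binary semaphore

    def app_lock(self):
        # Application thread: under the spinlock, mark the socket as
        # owned-by-user so interrupt threads will defer their work.
        with self._spin:
            self.owned_by_user = True

    def app_unlock(self):
        with self._spin:
            self.owned_by_user = False

    def irq_may_process(self) -> bool:
        # Interrupt thread: under the spinlock, check ownership; it may
        # process a packet in interrupt context only if no application
        # thread currently owns the socket.
        with self._spin:
            return not self.owned_by_user
```

In this model, `irq_may_process()` returning False corresponds to the interrupt path having to defer the packet, which is exactly the decision enumerated in step (4) below.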
a) TCP Processing in Interrupt Context

(1) When the NIC interrupts a core, the network interrupt's associated handler services the NIC and schedules the softirq, softnet.
(2) The softnet handler moves a packet from the ring buffer and processes the packet up through the network stack. If there is no packet available in the ring buffer, the softnet handler exits.
(3) A TCP packet (segment) is delivered up to the TCP layer. The network stack first tries to identify the socket to which the packet belongs, and then seeks to lock (spinlock) the socket.
(4) The network stack checks if the socket is "owned-by-user" or if an application thread is sleeping and awaiting data:
• If yes, the packet will be enqueued into the socket's backlog queue or prequeue. TCP processing will be performed later in process context by the application thread.
• If not, the network stack will perform TCP processing on the packet in interrupt context.
(5) Unlock the socket; go to step 2.

b) TCP Processing in Process Context

(1) An application thread makes a socket-related receive system call.
(2) Once the system call reaches the TCP layer, the network stack seeks to lock (semaphore) the socket first.
(3) The network stack moves data from the socket into the user space.
(4) If the socket's prequeue and/or backlog queue are not empty, the calling application's process context would be borrowed to carry out TCP processing.
(5) Unlock the socket and return from the system call.

For the data transmit path, network processing starts in the process context when an application makes socket-related system calls to send data.
If TCP gives permission to send (based on TCP receiver window, congestion window, and sender window statuses), network processing in process context can reach down to the bottom of the protocol stack; otherwise, transmit-side network processing is triggered by incoming TCP ACKs for the data receive path, which are performed in their execution environments (interrupt or process contexts). In this paper, we focus mainly on receive-side processing because it is known to be more memory-intensive and complex. Furthermore, TCP processing on the transmit side is also dependent on ACKs in the data receive path. As described above, whether TCP processing is performed in process or interrupt contexts depends on the volatile runtime environments. For example, we used FTP to download Linux kernels from www.kernel.org; we instrumented the Linux network stack to record the percentage of traffic processed in process context. The recorded percentage ranges from 50% to 75%. It is clear that, in a multicore system, when an application's process context is borrowed to execute the network stack, TCP processing is performed on the core(s) where the application is scheduled to run. When TCP processing is performed in interrupt context, it is performed on the cores to which the network interrupts are directed. Take, for example, Fig. 3, in which network interrupts are directed to core 0 and the associated network application is scheduled to run on core 1. In interrupt context, TCP is processed on core 0; in process context, this occurs on core 1. Since TCP processing performed in process or interrupt contexts depends on volatile runtime conditions, it may alternate between these two cores.
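The alternation just described is driven by the check in step (4) of the interrupt-context path. A didactic Python sketch of that dispatch decision follows; the class and function names are invented, and this models the logic described in Section 2.3, not the actual kernel code.

```python
from collections import deque

class Socket:
    """Toy model of the per-socket state consulted in step (4)."""
    def __init__(self):
        self.owned_by_user = False   # an app thread holds the binary semaphore
        self.reader_waiting = False  # an app thread sleeps awaiting data
        self.backlog = deque()       # segments deferred to process context
        self.prequeue = deque()      # segments deferred to the sleeping reader
        self.processed_in_irq = []   # segments TCP-processed in interrupt context

def deliver_in_interrupt_context(sk: Socket, segment) -> str:
    """Model of step (4): either defer the segment to process context
    (backlog or prequeue) or perform TCP processing now, in interrupt
    context. The return value names the path taken."""
    if sk.owned_by_user:
        sk.backlog.append(segment)       # app owns the socket: defer to backlog
        return "backlog"
    if sk.reader_waiting:
        sk.prequeue.append(segment)      # reader is sleeping: defer to prequeue
        return "prequeue"
    sk.processed_in_irq.append(segment)  # process immediately in interrupt context
    return "interrupt"
```

Because `owned_by_user` and `reader_waiting` change with the application's timing, consecutive segments of one flow can take different paths, and hence be TCP-processed on different cores, which is precisely why pinning the interrupt alone does not give Linux flow affinity.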
Therefore, although the combination of RSS and flow pinning can tie a traffic flow to a single core, when a network application resides on some other core, TCP processing might alternate between different cores. We would achieve neither flow affinity nor network data affinity.

2.4 Negative Impacts

Fig. 3 Linux TCP Processing Contexts in the Data Receive Path: A. Linux TCP Processing in Process Context; B. Linux TCP Processing in Interrupt Context

If an application runs on cores other than those where its corresponding RSS network interrupts are directed, various negative impacts result. On both Windows and Linux systems, network data affinity cannot be achieved. Furthermore, on OSes like Linux, TCP processing might alternate between different cores even if the interrupts for the flow are pinned to a specific core. As a result, it will lead to poor cache efficiency and cause significant core-to-core synchronization overheads. Also, it renders the DCA technology ineffective. In multiple-core systems, core-to-core synchronizations involve costly snoops and MESI operations [19], resulting in extra system bus traffic. This is especially expensive when the contending cores exist within different physical processors, which usually involves synchronous read/write operations to a certain memory location. In addition, for Linux, interrupt and application threads contend for shared resources, such as locks, when they concurrently process packets from the same flow. The socket's spinlock, for example, would be in severe contention. When a lock is in contention, contending threads simply wait in a loop ("spin"), repeatedly checking until the lock becomes available. While waiting, no useful work is executed.
Contention for other shared resources, such as memory and the system bus, also occurs frequently. Since this intra-flow contention may occur on a per-packet basis, the total contention overhead could be severe in high-bandwidth network environments. To demonstrate the negative impacts, we ran data transmission experiments over an isolated network. Sender: Opteron 2346 HE, 1.8 GHz; NIC: Broadcom NetXtreme II 5708, 1 Gbps, DCA not supported; OS: Linux 2.6.28. Receiver: Supermicro server; CPU: two Intel Xeon CPUs, 2.66 GHz, Family 6, Model 15; NIC: Intel PRO/1000, 1 Gbps; OS: Linux 2.6.28. The receiver's CPU architecture is as shown in Fig. 4. In the experiments, we used iperf [20] to send data in one direction. The sender transmitted one TCP stream to the receiver for 100 seconds. In the receiver, network interrupts were all directed to core 0. However, iperf was pinned to different cores: (1) iperf was pinned to core 0 (network interrupts and applications were pinned to the same core); (2) iperf was pinned to core 1 (network interrupts and applications were pinned to different cores, but within the same processor); (3) iperf was pinned to core 2 (network interrupts and applications were pinned to different processors). The throughput rates in these experiments all saturated the 1 Gbps link (around 940 Mbps). The experiments were designed to feature the same throughput rates for the sake of better comparisons. We ran oprofile [21] to profile system performance in the case of the receiver. The metrics of interest were: INST_RETIRED, the number of instructions retired; BUS_TRANS_ANY, the total number of completed bus transactions; and BUS_HITM_DRV, the number of HITM (hit modified cache line) signals asserted [22]. For these metrics, the number of events between samples was 10000. We also enabled the Linux Lockstat [13] to collect lock statistics. On this basis, we calculated the total time spent waiting to acquire various kernel locks, and we called this WAITTIME-TOTAL. Consistent results were obtained across repeated runs. The results are as listed in Fig. 5, with a 95% confidence interval. The throughput rates in these experiments all saturated the 1 Gbps link. However, Fig. 5 clearly shows that the metrics of iperf @ Core 1 and Core 2 are much higher than those of iperf @ Core 0. This clearly verifies that when a network application is scheduled on cores other than those to which the corresponding network interrupts are directed, severely degraded system efficiency will result. INST_RETIRED measures the load on the receiver. The results clearly show that contention for shared resources between interrupt and application threads led to an extra load. The extra load is mainly related to time spent waiting for locks. The experimental WAITTIME-TOTAL data verify this point. It is surprising that the BUS_TRANS_ANY of iperf @ Core 2 is almost twice that of iperf @ Core 0. The BUS_HITM_DRV of iperf @ Core 0 is far less than that of iperf @ Core 1 and Core 2.

Fig. 4 Receiver CPUs
Fig. 5 Experiment Results
Since the throughput rates in these experiments all saturated the 1 Gbps link, the extra BUS_TRANS_ANY and BUS_HITM_DRV transactions of iperf @ Core 1 and Core 2 were caused by cache thrashing and lock contention, as analyzed above.

3. A Transport-Friendly NIC (A-TFN)

3.1 A-TFN Design Principles & Alternatives

Previous analyses and experiments clearly show that existing RSS-enabled NICs cannot automatically steer incoming network data to the core on which the receiving application process resides. In this paper, we propose a NIC mechanism to remedy this limitation. It steers packets of a data flow to the same core where they will be protocol-processed and subsequently consumed by the application. Our data steering mechanism is mainly targeted at TCP, but can be extended to UDP and SCTP. We term a NIC with such a data steering mechanism A Transport-Friendly NIC, or A-TFN. A-TFN's basic idea is simple: it maintains the relationship "Traffic Flows → Cores (Applications)" in the NIC, and OSes are correspondingly enhanced to support this capability. For transport-layer traffic, A-TFN maintains a Flow-to-Core table in the NIC, with one entry per flow; each entry tracks which receive queue (core) a flow should be assigned to. A-TFN makes use of two facts: (1) a TCP connection's traffic is bidirectional; even for a unidirectional data flow, ACKs on the reverse path result in bidirectional traffic. And (2) when an application makes socket-related system calls, that application's process context might be borrowed to carry out network processing in process context. This is true for almost all OSes in the data transmit path.
With each outgoing transport-layer packet, the OS records a processor core ID and uses it to update the entry in the Flow-to-Core table. As soon as any network processing is performed in a process context, A-TFN learns of the core on which the application process resides and can steer future incoming traffic to the right core. Clearly, the design of such a mechanism involves a trade-off between the amount of work done in the NIC and in the OS. There are two design options. Option 1 is to minimize changes in the OS and to focus instead on identifying the minimal set of mechanisms to add to the NIC. Clearly, this design adds complexity and cost to the NIC. At the other end of the design space, the OS could update the Flow-to-Core table directly without changing anything in the NIC hardware (Option 2). Conceptually, this approach could be fairly straightforward to implement. However, it might add significant extra communication overheads between the OS and the NIC, especially when the Flow-to-Core table gets large. Due to space limitations, this paper is mainly focused on the first design option. In our future work, we will explore the second design option. Besides, the Option 1 design has other goals: (1) A-TFN must be simple and efficient, because NIC controllers usually utilize a less powerful CPU with a simplified instruction set and insufficient memory to hold complex firmware. (2) A-TFN must preserve in-order packet delivery. (3) The communication overheads between the OS and A-TFN must be minimal.

3.2 A-TFN Details

Fig. 6 illustrates the A-TFN details. A-TFN extends the current RSS technologies. It supports multiple receive queues in the NIC, up to the number of cores in the system.
With MSI and Flow-Pinning support, each receive queue has a dedicated interrupt and is tied to a specific core; each core in the system is assigned a specific receive queue. A-TFN handles non-transport-layer traffic in the same way as does RSS. That is, based on a hash of the incoming packet's headers, the NIC assigns it to the same queue as other packets from the same data flow, and distributes different flows across queues. For transport-layer traffic, A-TFN maintains a Flow-to-Core table with a single entry per flow. Each entry tracks the receive queue (core) to which a flow should be assigned. The entries within the Flow-to-Core table are updated by outgoing packets. For unidirectional TCP data flows, outgoing ACKs update the Flow-to-Core table.

Fig. 6. A-TFN Mechanisms.

For an outgoing transport-layer packet, the OS records a processing core ID in the transmit descriptor and passes it to the NIC. Since each packet contains a complete identification of the flow it belongs to, the specific Flow → Core relationship can be effectively extracted from the outgoing packet and its accompanying transmit descriptor. As soon as any network processing is performed in the process context, A-TFN learns of the core on which an application resides.

3.3 Flow-to-Core Table and its Operations

The Flow-to-Core table appears in Fig. 7. Flow entries are managed in a hash table, with a linked list to resolve collisions. Each entry consists of:
• Traffic Flow. A-TFN makes use of the 5-tuple {src_addr, dst_addr, protocol, src_port, dst_port} in the receive direction to specify a flow.
Therefore, for an outgoing packet with the header {(src_addr: x), (dst_addr: y), (protocol: z), (src_port: p), (dst_port: q)}, its corresponding flow entry in the table is identified as {(src_addr: y), (dst_addr: x), (protocol: z), (src_port: q), (dst_port: p)}.
• Core ID. The core to which the flow should be steered.
• Transition State. A flag to indicate whether the flow is in a transition state. The goal is to ensure in-order packet delivery.
• Packets in Transition. A simple packet list to accommodate packets temporarily when the flow is in a transition state. The goal, again, is to ensure in-order packet delivery.
In addition, to avoid non-deterministic packet processing time, a collision-resolving linked list is limited to a maximum size of MaxListSize. Flows are not evicted in case of collision. When a specific hash's collision-resolving list reaches MaxListSize, later flows with that hash will not enter the table.

a) Flow Entry Generation and Deletion

A-TFN monitors each incoming and outgoing packet to maintain the Flow-to-Core table. An entry is generated in the Flow-to-Core table as soon as A-TFN detects a successful three-way handshake. However, to reduce NIC complexity, A-TFN need not run a full TCP state machine in the NIC. A flow entry is deleted after a configurable period of time, T_delete, has elapsed without traffic. In this way, A-TFN need not handle all exceptions such as missing FIN packets and various timeouts. To prevent memory exhaustion or malicious attacks, A-TFN sets an upper bound on the number of entries in the Flow-to-Core table. When the Flow-to-Core table starts to become full, TCP flows can be aged out more aggressively by using a smaller T_delete.
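The table organization just described can be sketched in C. This is a minimal illustration under our own assumptions, not the authors' firmware: names such as `flow_key`, `fct_lookup`, and `fct_insert` are ours, the hash is a stand-in for whatever function the NIC implements, and `fct_search_ns` simply encodes the lookup-cost model measured later in the software prototype (Section 4.2: roughly 260 ns for the first list item, 150 ns per additional item).

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define FCT_BUCKETS   4096  /* number of hash buckets (illustrative)    */
#define MAX_LIST_SIZE 6     /* cap on a collision-resolving linked list */

struct pkt;                 /* opaque; stands in for a held packet      */

/* 5-tuple identifying a flow in the receive direction */
struct flow_key {
    uint32_t src_addr, dst_addr;
    uint16_t src_port, dst_port;
    uint8_t  protocol;
};

/* One Flow-to-Core table entry, as described in Section 3.3 */
struct flow_entry {
    struct flow_key    key;
    uint8_t            core_id;        /* receive queue (core) for the flow */
    uint8_t            in_transition;  /* "Transition State" flag           */
    struct pkt        *pkts_in_transition;
    struct flow_entry *next;           /* collision-resolving list          */
};

static struct flow_entry *fct[FCT_BUCKETS];

/* Stand-in hash; a real RSS NIC would use something like a Toeplitz hash. */
static unsigned fct_hash(const struct flow_key *k) {
    unsigned h = k->src_addr ^ k->dst_addr ^ k->protocol;
    h ^= ((unsigned)k->src_port << 16) | k->dst_port;
    return (h * 2654435761u) % FCT_BUCKETS;
}

static int key_eq(const struct flow_key *a, const struct flow_key *b) {
    return a->src_addr == b->src_addr && a->dst_addr == b->dst_addr &&
           a->src_port == b->src_port && a->dst_port == b->dst_port &&
           a->protocol == b->protocol;
}

/* Bounded walk keeps the worst-case lookup time deterministic. */
struct flow_entry *fct_lookup(const struct flow_key *k) {
    struct flow_entry *e = fct[fct_hash(k)];
    for (int i = 0; e != NULL && i < MAX_LIST_SIZE; i++, e = e->next)
        if (key_eq(&e->key, k))
            return e;
    return NULL;  /* untracked flow: fall back to plain RSS hashing */
}

/* Insert refuses new flows once a bucket's list is full (no eviction). */
int fct_insert(struct flow_entry *e) {
    struct flow_entry **pp = &fct[fct_hash(&e->key)];
    int n = 0;
    while (*pp != NULL) {
        if (++n >= MAX_LIST_SIZE)
            return -1;      /* later flows with this hash stay untracked */
        pp = &(*pp)->next;
    }
    e->next = NULL;
    *pp = e;
    return 0;
}

/* Worst-case lookup cost per the prototype's measured model (Section 4.2). */
unsigned fct_search_ns(unsigned max_list_size) {
    return 260u + 150u * (max_list_size - 1u);
}
```

With MAX_LIST_SIZE = 6, fct_search_ns(6) gives 1010 ns, which fits inside the roughly 1200 ns available per 1500-byte packet on a 10 Gbps link; this is why bounding the list length matters.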
For traffic flows that are not in the Flow-to-Core table, packets are delivered based on a hash of the incoming packets' headers.

b) Flow Entry Updating

The entries of the Flow-to-Core table are updated by outgoing packets. For each outgoing transport-layer packet, the OS records a processing core ID in the transmit descriptor and passes it to the NIC. A naive way to update the corresponding flow entry is with the passed core ID, omitting any other measures. As soon as any network processing is performed in process context, A-TFN learns of the process migration and can steer future incoming traffic to the right core. However, this simple flow entry updating mechanism cannot guarantee in-order packet delivery. TCP performance suffers in the event of severe packet reordering [23]. In the following sections, we use a simplified model to analyze why this approach cannot guarantee in-order packet delivery. As shown in Fig. 8, at time T − ε, Flow 1's entry maps to Core 0 in the Flow-to-Core table. At this instant, packet S of Flow 1 arrives; based on the Flow-to-Core table, it is assigned to Core 0. At time T, due to process migration, Flow 1's entry is updated and maps to Core 1. At T + ε, packet S+1 of Flow 1 arrives and is assigned to the new core, namely Core 1. As described above, after assigning received packets to the corresponding receive queues, A-TFN copies them into system memory via DMA, and finally fires network interrupts if necessary. When a core responds to a network interrupt, it processes received packets up through the network stack from the corresponding ring buffer one by one.
In our case, Core 0 processes packet S up through the network stack from Ring Buffer 0, and Core 1 services packet S+1 from Ring Buffer 1. Let T_service(S) and T_service(S+1) be the times at which the network stack starts to service packets S and S+1, respectively. If T_service(S) > T_service(S+1), the network stack would see packet S+1 earlier than packet S, resulting in packet reordering.

Fig. 7. Flow-to-Core Table. Fig. 8. A Simplified Model for Packet Reordering Analysis.

Let D be the ring buffer size and let the network stack's packet service rate be R_service (packets per second). Assume there are n packets ahead of S in Ring Buffer 0 and m packets ahead of S+1 in Ring Buffer 1. Then:

T_service(S) = T − ε + n/R_service    (1)
T_service(S+1) = T + ε + m/R_service    (2)

If ε is small and n > m, the condition T_service(S) > T_service(S+1) holds and leads to packet reordering. Since the ring buffer size is D, the worst case is n = D − 1 and m = 0:

T_service(S) = T − ε + (D − 1)/R_service    (3)
T_service(S+1) = T + ε    (4)

However, if the delivery of packet S+1 to Core 1 can be delayed for at least (D − 1)/R_service, then T_service(S+1) ≥ T + ε + (D − 1)/R_service. As a result, T_service(S+1) > T_service(S), and in-order packet delivery can be guaranteed. Therefore, A-TFN adopts the following flow entry updating mechanism: for each outgoing transport-layer packet, the OS records a processing core ID in the transmit descriptor and passes it to the NIC to update the corresponding flow entry. For a TCP flow entry, if the new core ID is different from the old one, the flow enters the "transition" state. Correspondingly, its "Transition State" flag is set to "Yes" and a timer is started for this entry.
The timer's expiration value is set to T_timer = (D − 1)/R_service. Incoming packets of a flow in the transition state are added to the tail of "Packets in Transition" instead of being immediately delivered. When the timer expires, the flow leaves the transition state: the "Transition State" flag is set back to "No" and all of the packets in "Packets in Transition," if any, are assigned to the new core. For a flow in the "non-transition" state, its packets are directly steered to the corresponding core. The ring buffer size D is a design parameter for the NIC and driver. For example, the Myricom 10Gb NIC's ring size is 512, and Intel's 1Gb NIC's is 256. With current computing power, (D − 1)/R_service is usually at the sub-millisecond level, at best. For A-TFN, T_timer is a design parameter and is configurable.

3.4 Required OS Support

Option 1's A-TFN design requires only two small OS changes in order to be properly supported. These can be easily implemented. (1) For an outgoing transport-layer packet, the OS needs to record a processing core ID in the transmit descriptor passed to the NIC. (2) The transmit descriptor needs to be extended with a new element to store this core ID. A single-byte element can support up to 256 cores, which is sufficient for most of today's systems. In addition, the size of a transmit descriptor is usually small, typically less than a cache line. Transmit descriptors are usually copied to the NIC by DMA using whole-cache-line memory transactions. Adding a byte to the transmit descriptor therefore introduces almost no extra communication overhead between the OS and NIC.

4. Analysis and Experiments

The A-TFN mechanism is simple. It guarantees in-order packet delivery and requires only minimal OS support.
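Before evaluating the mechanism, the flow-entry updating rules of Section 3.3 can be restated as two event handlers. This is our own simplified rendering, not NIC firmware: timers, DMA, and the packet list are abstracted away, `now_ns` is a hypothetical clock argument, and all function names are ours.

```c
#include <stdint.h>
#include <assert.h>

/* Per-flow state relevant to the updating mechanism */
struct entry {
    uint8_t  core_id;          /* current Flow -> Core mapping                */
    int      in_transition;    /* "Transition State" flag                     */
    uint64_t timer_expires_ns; /* transition timer deadline                   */
    int      held_pkts;        /* stand-in for the Packets-in-Transition list */
};

/* T_timer = (D - 1) / R_service, expressed in nanoseconds. */
uint64_t t_timer_ns(uint64_t ring_size_d, uint64_t r_service_pps) {
    return (ring_size_d - 1) * 1000000000ull / r_service_pps;
}

/* Outgoing packet: its transmit descriptor carries the processing core ID. */
void on_outgoing(struct entry *e, uint8_t new_core,
                 uint64_t now_ns, uint64_t t_timer) {
    if (new_core != e->core_id) {
        e->core_id = new_core;
        e->in_transition = 1;              /* enter the "transition" state */
        e->timer_expires_ns = now_ns + t_timer;
    }
}

/* Incoming packet: held while the flow is in transition, steered otherwise.
 * Returns the core the packet goes to, or -1 if it is held. */
int on_incoming(struct entry *e, uint64_t now_ns) {
    if (e->in_transition && now_ns < e->timer_expires_ns) {
        e->held_pkts++;                    /* tail of "Packets in Transition" */
        return -1;
    }
    if (e->in_transition) {                /* timer expired: leave transition  */
        e->in_transition = 0;              /* and flush held packets to the    */
        e->held_pkts = 0;                  /* new core                         */
    }
    return e->core_id;
}
```

For the Myricom ring size D = 512 and a hypothetical service rate of one million packets per second, t_timer_ns(512, 1000000) evaluates to 511,000 ns, i.e. 511 µs, consistent with the paper's observation that (D − 1)/R_service is at the sub-millisecond level.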
In addition, the communication overheads between the OS and A-TFN are reduced to a minimum. Compared to the extensively pursued TOE (TCP Offload Engine) technology, which seeks to offload processing of the TCP/IP stack to the NIC, A-TFN is much less complex: (1) A-TFN does not require a complicated TCP engine within the NIC; (2) there is no need to synchronize TCP flow states between the OS and A-TFN; and (3) there is no need to enforce flow access control in the NIC. Therefore, with the latest hardware and software technologies, A-TFN can be effectively implemented. In the following sections, we use a combination of analytical and experimental techniques to evaluate the effectiveness of the A-TFN mechanisms.

4.1 Analytical Evaluation

a) Delay

To ensure in-order packet delivery, incoming packets of a flow in the transition state are added to the tail of "Packets in Transition". These packets are delivered later, when the flow exits the transition state. Obviously, this mechanism not only adds delay to certain packets but also requires extra memory to accommodate them. However, considering that the scheduler of a general-purpose OS tries to the extent possible to schedule a process onto the same core as that on which it was previously running, process migration does not occur frequently. As a result, only a very few packets will be held and will experience extra delay. Our experiments in Section 4.2 confirm this point. Clearly, the maximum delay a held packet can experience is T_timer. Previous analysis has shown that in-order packet delivery is guaranteed when T_timer is set to (D − 1)/R_service. However, in the real world, incoming packets rarely fill a ring buffer.
If T_timer were configured to be smaller, this would still ensure in-order packet delivery in most cases. In [23], we recorded the duration for which the OS processes the ring buffer. Our experiments have clearly shown that this duration is generally shorter than 20 microseconds. In most cases the extra delay is so small that it can be ignored.

b) Flow Affinity and Network Data Affinity

The intent of A-TFN is to automatically steer incoming network data to the same core on which its application process resides. As soon as any network processing is performed in a process context, A-TFN learns of the core on which an application process resides and can steer future incoming traffic to the right core. As a result, the desired flow affinity and network data affinity are guaranteed. For Linux, in rare circumstances, network processing might always occur in the interrupt context. When this happens, A-TFN cannot learn of process migration and cannot steer incoming traffic to the cores on which the corresponding applications are running. Flow affinity still holds, but network data affinity does not in these cases.

c) Hardware Design Considerations

A-TFN's memory is mainly used to maintain the Flow-to-Core table, holding flow entries and accommodating packets for flows in the transition state. To hold a single flow entry, 20 bytes is quite sufficient. Therefore, a 10,000-entry Flow-to-Core table requires only 0.2 MB of memory. (These figures apply to IPv4; IPv6 support would add 24 bytes to the size of each entry, or less if the flow label could be relied upon.) In addition, to accommodate packets for flows in transition, if T_timer is set to 0.2 millisecond, even for a 10 Gbps NIC the memory required is at most 0.2 millisecond × 10 Gbps = 0.25 MB. In the worst case, an extra 0.5 MB of fast SRAM is enough to support the Flow-to-Core table. A Cypress 4 Mb (10 ns) SRAM now costs around $7, with ICC = 90 mA @ 10 ns and ISB2 = 10 mA. Table 1 lists the cost, memory size, and power consumption of three main 10G Ethernet NICs on the market. Clearly, A-TFN's requirement of an extra 0.5 MB of fast SRAM in the NIC won't add much extra cost (<1%) or power consumption (<5% for Intel and Chelsio; <10% for Myricom) to current 10 Gbps NICs. A linked list in hardware is expensive to build, given all the extra handling. However, there will be a tradeoff between hardware complexity and A-TFN effectiveness. We discuss this further later.

Vendor    Cost     Memory      Power
Intel     $1500    N/A         10.4 W
Chelsio   >$750    256 MB      16 W
Myricom   $850     2 MB SRAM   4.1 W

Table 1. 10G PCI-Express Ethernet NICs (single network port, 10GBase-SR, optical fiber transceiver).

4.2 Experimental Evaluation

We prototyped an A-TFN system as shown in Fig. 9A. A sender connects to a receiver via two physical back-to-back 10G links; the NIC driver is modified to support the A-TFN mechanisms. For an outgoing transport-layer packet, the OS records a processing core ID in the "transmit descriptor" and passes it to "A-TFN." Here, we make use of four reserved bits in the TCP header as the "transmit descriptor" to communicate the core ID. When the sender receives a "transmit descriptor," it extracts the passed core ID and updates the corresponding flow entry in the Flow-to-Core table. Unless otherwise specified, T_timer is set to 0.1 millisecond. The Flow-to-Core table is limited to an upper bound of 10,000 entries.
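The memory-sizing arithmetic of Section 4.1(c) and the prototype's limits here (10,000 entries, sub-millisecond T_timer) are easy to check mechanically. The sketch below merely reproduces the paper's figures in exact integer units; it is not a hardware model, and the function names are ours.

```c
#include <stdint.h>
#include <assert.h>

/* Flow-to-Core table footprint: entries x bytes per entry. */
uint64_t table_bytes(uint64_t entries, uint64_t bytes_per_entry) {
    return entries * bytes_per_entry;
}

/* Worst-case buffering for flows in transition: packets can pile up
 * for T_timer at full line rate.  Integer units keep this exact:
 * 1 microsecond x 1 Gbps = 1000 bits, hence the factor 1000/8. */
uint64_t transition_bytes(uint64_t t_timer_us, uint64_t line_rate_gbps) {
    return t_timer_us * line_rate_gbps * 1000ull / 8ull;
}
```

table_bytes(10000, 20) gives 200,000 bytes (0.2 MB) and transition_bytes(200, 10) gives 250,000 bytes (0.25 MB), together within the 0.5 MB SRAM bound; with the prototype's T_timer of 0.1 ms on a 10G link, transition_bytes(100, 10) is only 125,000 bytes.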
\*; * 8.18] ,% 2 '% *;I8' ;'( /'5% .% /24J 1'L'* B'% h+'+ '% O\\ % YC0 ,% .G% G74 2(% *(% : * H=% c ?=% C( % <4 /7% /7 '% G ' (5'1 % .(5 % /7'% 1' L'*B' 1,% /7'% /2 4% -] 1 *L4; % !$ W() E0(+ % ,$,$ 3+01 (!-) .4( % J>() E0(+ %,$, $3+01 (B;; ( ./=> (X(PI +0%/ <0*$ (;3#$ 0<# ( !"#$ %&)* &$#* #!+#$ &), &-&). &/00 &)" &100 / &2 & !"#$ %&)* &$#* #!+#$ &), &-&&3. &/00 &3"& 400 /&2& H/#$ /*=(? (;0* 10%(P I+0% /<0 *$(;' %/+$# ( dst_ port } 341% ' .L7% inco ming pack et . C( % /7'% 1'L' *B'1, % '.L7 %G8. B '%Y C 0%a1' L'*B ' %h+' +'b% *G%I* (('5 %/4%. %GI' L*3*L% L41' =% &'% 1 .(% 5. /.% /1. (G;* GG*4( % ' @I'1 *;'( /G% 2*/ 7% *I'1 3% +G*( H% /7' % /'G/ % G]G/' ; % G74 2(% * ( % :*H= % c=% F7' % '@I' 1*;' (/G% 1 '8*'5 % 4 (% /7' % 34884 2*(H N% a!b % C perf is a mult i-thre aded netwo rk app licati on. W ith m ultipl e para llel TC P da ta stre ams, a de dicate d chi ld thre ad is spaw ned and a ssigne d to h andle e ach strea m in the syste m. (2) W hen ipe rf is pinne d to a specif ic core, its child thr eads are als o pi nned to that cor e. % C(% 4 +1% '@I' 1*;' (/G,% * I'13% G '(5G %3 14; % /7'% G '(5' 1% /4% /7 '% 1'L' *B'1% 2* /7% n" I.1 .88'8% F0 6% G /1'. ; G% 341% !$ $% G'L4 (5G,% /4 % I 41/G% A$ $!% .( 5% E $$!, % 1' GI'L/ *B'8] =% F7' 1 '341' ,% /4/ .88]% 2n % I .1.88 '8% F0 6% G/1 '.; G % .1' % /1.( G;*/ / *(H% *(% '. L7% '@I '1*; '(/=% F 7 '% (+; <'1% n" 2.G % B.1* ' 5% .L1 4GG% ' @I'1* ;'(/ G=% F7 ' % '@I' 1*;' (/% GL 1*I/G % 341% /7'% G'(5 '1% . II'. 1% *( % D*G/ *(H% !=% C( % /7 '% 1' L'*B' 1,% /7'% 1'L' * B'% h +'+' G% . (5% * I'13% .1'% I*((' 5% /4 % 5*3 3'1'( /% L41' G% /4 % G*; +8./' % .% /24 JL41' % G]G / ';=% F7' % '@I' 1*;' (/.8% L4(3* H+1./ *4(G% .1'%8 *G/'5 %*(%F. <8'%" =% In ou r e mula ted sy stem, w e mea sure th e Flow - to - Core Tabl e’s se arch t ime. T he se arch t ime to acce ss the first item in a c ollisi on-re solvi ng li nked list ta kes arou nd 26 0 ns , w hich includ es th e hash ing a nd lo cking over heads . 
For each subsequent item in the list, it takes approximately an extra 150 ns. Therefore, the longest search in our system takes 260 + 150 × (MaxListSize − 1) ns. For a 10 Gbps NIC, the time budget to process a 1500-byte packet is around 1200 ns. To evaluate MaxListSize's effect on A-TFN's performance, we set MaxListSize to 1 and 6, respectively. Correspondingly, A-TFN is termed A-TFN-1 and A-TFN-6.

a) Performance Experiments

Experiments 1 and 2 simulated the network conditions in which a single application must handle multiple concurrent TCP connections. In both experiments, TCP streams of a specific port (5001 or 6001) were pinned to a particular core. Given the same experimental conditions, we compared the results with A-TFN to those with RSS. The metrics of interest were: (1) Throughput; (2) WAITTIME-TOTAL; and (3) BUS_HITM_DRV. (The number of events between samples was 10000.) Consistent results were obtained across repeated runs.

Fig. 10. Experiment Results.

Various conditions can lead to the undesirable situation in which network applications are soft-partitioned on cores other than those to which their network interrupts are directed. Also, an OS scheduler may prioritize load balancing (or power saving) over process-to-interrupt affinity. In these environments, network applications may also be scheduled on cores other than those where their corresponding network interrupts are directed. We ran experiments in these environments. Conclusions similar to those above can be drawn, but due to space limitations, those results are not discussed here.

b) Reordering Experiments

A-TFN uses a special flow entry updating mechanism to guarantee in-order packet delivery.
Experiments 3 and 4 were designed to evaluate whether this mechanism actually works. In both experiments, the iperf instances (ports 5001 and 6001) were allowed to run on both cores where the two receive queues were pinned. Linux was configured to run in multicore peak performance mode; the scheduler tries to use all core resources in parallel as much as possible, distributing the load equally among the cores. As a result, iperf threads may migrate across cores. The receiver was instrumented to record any out-of-order packets, and we calculated the relevant packet reordering ratios. For A-TFN-6, we set T_timer to 0 or 100 µs. The experimental results, with a 95% confidence interval, are shown in Table 6.

Packet Reordering Ratio (Experiment 3)
2n      T_timer = 0 µs            T_timer = 100 µs
40      3.524E-07 ± 3.539E-07     0
200     7.573E-07 ± 8.569E-07     0
1000    1.252E-04 ± 1.015E-04     0
2000    2.076E-04 ± 7.200E-05     0

Packet Reordering Ratio (Experiment 4)
2n      T_timer = 0 µs            T_timer = 100 µs
40      5.110E-07 ± 6.809E-07     0
200     6.278E-06 ± 8.553E-06     0
1000    3.639E-05 ± 2.754E-05     0
2000    2.174E-04 ± 8.515E-05     0

Table 6. Reordering Experiments.

When T_timer is 0, incoming packets of a flow in the transition state are immediately delivered, instead of being added to the tail of "Packets in Transition." As discussed in Section 3.3, this might lead to packet reordering. The results in Table 6 reflect this fact. The reason the packet reordering ratio was so small is that (1) the scheduler tried to the extent possible to schedule a process onto the same core on which it was previously running, so process migration was not frequent; and (2) when a flow entry was in the transition state, its packets might not arrive during this period.
This shows, from another perspective, that very few packets are held as "Packets in Transition," where they would experience extra delay. When T_timer is 100 µs, no out-of-order packets are recorded. This shows that A-TFN can effectively guarantee in-order packet delivery.

5. Related Works

Over the years, research on affinity in network processing has been extensive. Salehi et al. [4] studied the effectiveness of affinity-based scheduling in multiprocessor network protocol processing using both packet-level and connection-level parallelization approaches. But since these approaches worked in the user space, they did not consider either system or implementation costs. In [5] and [6], A. Foong et al. experimented with affinitizing processes/threads, as well as interrupts from NICs, to specific processors in an SMP system. Experimental results suggested that processor affinity in network processing contexts can significantly improve overall performance. In [7], J. Hye-Churn et al. studied the problem of multi-core aware processor affinity for TCP/IP over multiple network interfaces, using a software-only approach. Their research topics are similar to ours. Other researchers have adopted a hard-partition approach [24][25][26], in which a subset of processors in a multiprocessor environment is dedicated to network processing. The limitation of this approach is that the OS architecture requires significant changes. NIC technologies such as Intel's VMDq [27] or the PCI-SIG's SR-IOV [28] also provide data steering capabilities for NICs. But they are I/O virtualization technologies targeting virtual machines in the virtualized environment, with different research issues.

6. Conclusion and Discussion

Existing RSS-enabled NICs cannot automatically steer incoming network data to the core on which its application process resides. This causes various negative impacts. We propose the A-TFN mechanism to remedy this limitation. In the paper, we discuss two A-TFN design options. Option 1 is to minimize changes in the OS and to focus instead on identifying the minimal set of mechanisms to add to the NIC. Clearly, this design adds complexity and cost to the NIC. At the other end of the design space, the OS could update the Flow-to-Core table directly without changing anything in the NIC hardware (Option 2). Conceptually, this approach could be fairly straightforward to implement. However, it might add significant extra communication overheads between the OS and the NIC, especially when the Flow-to-Core table gets large. Due to space limitations, this paper is mainly focused on the first design option. The new NIC is emulated in software, and the results show that the solution is an effective and practical remedy for RSS's limitation. In our future work, we will explore the second design option.

References:
[1] P. Willmann et al., "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems," in Proc. USENIX Annual Technical Conference, pp. 91–96, 2006.
[2] J. Hurwitz et al., "End-to-end performance of 10-gigabit Ethernet on commodity systems," IEEE Micro, vol. 24, no. 1, pp. 10–22, 2004.
[3] V.
Roca et al., "Demultiplexed architectures: a solution for efficient STREAMS-based communication stacks," IEEE Network, July/August 1997, pp. 16–26.
[4] J. Salehi et al., "The effectiveness of affinity-based scheduling in multiprocessor networking," IEEE/ACM Transactions on Networking, vol. 4, no. 4, pp. 516–530, 1996.
[5] A. Foong et al., "Architectural Characterization of Processor Affinity in Network Processing," in Proc. IEEE International Symposium on Performance Analysis of Systems and Software, pp. 207–218, 2005.
[6] A. Foong et al., "An in-depth analysis of the impact of processor affinity on network performance," in Proc. IEEE International Conference on Networks, 2004.
[7] J. Hye-Churn et al., "MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces," in Proc. IEEE Symposium on High Performance Interconnects, 2009.
[8] R. Huggahalli et al., "Direct Cache Access for High Bandwidth Network I/O," in Proc. 32nd Annual International Symposium on Computer Architecture, 2005.
[9] A. Kumar et al., "Impact of Cache Coherence Protocols on the Processing of Network Traffic," in Proc. 40th Annual IEEE/ACM International Symposium on Microarchitecture, 2007.
[10] Microsoft Corporation, Receive-Side Scaling Enhancements in Windows Server, 2008.
[11] MindShare Inc. et al., PCI Express System Architecture, PC System Architecture Series, Addison-Wesley Professional, 2003, ISBN-10: 0321156307.
[12] Intel Corporation, "Supra-linear Packet Processing Performance with Intel Multi-core," http://www.intel.com/technology/advanced_comm/311566.htm, 2006.
[13] http://kernel.org/
[14] S.
Siddha et al., "Chip Multi Processing aware Linux Kernel Scheduler," in Proc. the Linux Symposium, pp. 329–340, 2006.
[15] V. Pallipadi et al., "Processor Power Management features and Process Scheduler: Do we need to tie them together?" in Proc. LinuxConf Europe, 2007.
[16] M. E. Russinovich et al., Microsoft Windows Internals, 4th ed., Microsoft Press, 2004, ISBN 0735619174.
[17] W. Wu et al., "The performance analysis of Linux networking – packet receiving," Computer Communications, vol. 30, pp. 1044–1057, 2007.
[18] W. Wu et al., "Potential performance bottleneck in Linux TCP," International Journal of Communication Systems, vol. 20, no. 11, pp. 1263–1283, 2007.
[19] L. Ivanov et al., "Modeling and verification of cache coherence protocols," in Proc. the IEEE International Symposium on Circuits and Systems, pp. 129–132, 2001.
[20] http://dast.nlanr.net/Projects/Iperf/
[21] http://oprofile.sourceforge.net/
[22] Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Intel Corporation, 2008.
[23] W. Wu et al., "Sorting reordered packets with interrupt coalescing," Computer Networks, vol. 53, no. 15, pp. 2646–2662, 2009.
[24] T. Brecht et al., "Evaluating Network Processing Efficiency with Processor Partitioning and Asynchronous I/O," in Proc. EuroSys, 2006.
[25] G. Regnier et al., "ETA: Experience with an Intel Xeon processor as a packet processing engine," IEEE Micro, 2004.
[26] S. Muir et al., "Piglet: A Low-Intrusion Vertical Operating System," University of Pennsylvania T/R MS-CIS-00-04, 2004.
[27] Intel Corporation, "Intel VMDq Technology," 2008.
[28] http://www.pcisig.com/specifications/iov
