Why Does Flow Director Cause Packet Reordering?

Intel Ethernet Flow Director is an advanced network interface card (NIC) technology. It provides the benefits of parallel receive processing in multiprocessing environments and can automatically steer incoming network data to the same core on which its application process resides.

Authors: Wenji Wu (Fermilab), Phil DeMar (Fermilab), Matt Crawford (Fermilab)

Fermilab, P.O. Box 500, Batavia, IL 60510. This manuscript was submitted to IEEE Communication Letters.

Why Does Flow Director Cause Packet Reordering?
Abstract — Intel Ethernet Flow Director is an advanced network interface card (NIC) technology. It provides the benefits of parallel receive processing in multiprocessing environments and can automatically steer incoming network data to the same core on which its application process resides. However, our analysis and experiments show that Flow Director cannot guarantee in-order packet delivery in multiprocessing environments. Packet reordering causes various negative impacts; for example, TCP performs poorly with severe packet reordering. In this paper, we use a simplified model to analyze why Flow Director can cause packet reordering. Our experiments verify our analysis.

Index Terms — Packet Reordering, Flow Director, TCP, High Performance Networking.

1. Introduction

Computing is now shifting towards multiprocessing (e.g., CMP, SMP, and NUMA). The fundamental goal of multiprocessing is improved performance through the introduction of additional hardware threads, CPUs, or cores (all of which will be referred to as "cores" for simplicity). The emergence of multiprocessing has brought both opportunities and challenges for TCP/IP performance optimization in such environments. Modern network stacks can exploit parallel cores to allow either message-based parallelism or connection-based parallelism as a means of enhancing performance [1]. While existing OSes exploit parallelism by allowing multiple threads to carry out network operations concurrently in the kernel, supporting this parallelism carries significant costs, particularly in the context of contention for shared resources, software synchronization, and poor cache efficiency [1][2]. Investigations of processor affinity [3][4][5] indicate that coordinated affinity scheduling of protocol processing and network applications on the same target cores can significantly reduce contention for shared resources, minimize software synchronization overheads, and enhance cache efficiency. Coordinated affinity scheduling of protocol processing and network applications on the same target cores has the following goals:

(1) Interrupt affinity: network interrupts of the same type should be directed to a single core. Redistributing network interrupts in either a random or round-robin fashion to different cores has undesirable side effects [4].

(2) Flow affinity: packets belonging to a specific flow should be processed by the same core. Flow affinity is especially important for TCP. TCP is a connection-oriented protocol, and it has a large and frequently accessed state that must be shared and protected when packets from the same connection are processed. Ensuring that all packets in a TCP flow are processed by a single core reduces contention for shared resources, minimizes software synchronization, and enhances cache efficiency.

(3) Network data affinity: incoming network data should be steered to the same core on which its application process resides. This is becoming more important with the advent of Direct Cache Access (DCA) [6]. Network data affinity maximizes cache efficiency and reduces core-to-core synchronization.
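As a minimal illustration of the application-side half of such coordinated affinity scheduling, the sketch below pins a process to one core using Linux's scheduler-affinity interface (the Python analogue of the taskset command used later in the paper). The target core and helper name are illustrative assumptions, not part of the original experiments; interrupt affinity is configured separately on Linux, for example through /proc/irq/<N>/smp_affinity.

```python
import os

TARGET_CORE = 0  # illustrative choice; any core in os.sched_getaffinity(0) works


def pin_to_core(pid: int, core: int) -> None:
    """Restrict `pid` to run only on `core` (Linux-only scheduler API)."""
    os.sched_setaffinity(pid, {core})


if __name__ == "__main__":
    print("allowed cores before:", sorted(os.sched_getaffinity(0)))
    pin_to_core(0, TARGET_CORE)  # pid 0 means the calling process
    print("allowed cores after: ", sorted(os.sched_getaffinity(0)))
```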
In a multicore system, the function of network data steering is performed by directing the corresponding network interrupts to a specific core (or cores). Receive Side Scaling (RSS) [7] is a NIC technology. It supports multiple receive queues and integrates a hashing function in the NIC. The NIC computes a hash value for each incoming packet. Based on the hash values, the NIC assigns packets of the same data flow to a single queue and evenly distributes traffic flows across queues. With Message Signaled Interrupt (MSI/MSI-X) [8] support, each receive queue is assigned a dedicated interrupt, and RSS steers interrupts on a per-queue basis. RSS provides the benefits of parallel receive processing in multiprocessing environments. Operating systems like Windows, Solaris, Linux, and FreeBSD now support interrupt affinity. When an RSS receive queue (or interrupt) is tied to a specific core, packets from the same flow are steered to that core (flow pinning [9]). This ensures flow affinity on most OSes, with Linux being the major exception.

However, RSS has a limitation: it cannot steer incoming network data to the same core where its application process resides. The reason is simple: existing RSS-enabled NICs do not maintain the relationship "Traffic Flows → Network Applications → Cores" in the NIC. Since network applications run on cores, the most critical relationship is simply "Traffic Flows → Cores (Applications)". Unfortunately, RSS does not support such a capability. This is symptomatic of a broader disconnect between existing software architecture and multicore hardware. With OSes like Windows and Linux, if an application is running on one core while RSS has scheduled its received traffic to be processed on a different core, poor cache efficiency and significant core-to-core synchronization overheads will result. The overall system efficiency may be severely degraded.
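The sketch below mimics the RSS-style steering just described: a hash of the receive-direction 5-tuple selects the receive queue (and hence the core), with no knowledge of which core the consuming application runs on. It is a simplified stand-in; real RSS hardware uses a Toeplitz hash with a configurable key, which is not reproduced here.

```python
NUM_QUEUES = 4  # assume one RSS receive queue per core


def rss_queue(five_tuple) -> int:
    """Map a packet's receive-direction 5-tuple to a receive queue.
    Real RSS hardware uses a Toeplitz hash over selected header fields;
    Python's built-in hash() stands in for it in this sketch."""
    return hash(five_tuple) % NUM_QUEUES


# (src_addr, dst_addr, protocol, src_port, dst_port)
flow = ("10.0.0.1", "10.0.0.2", 6, 5001, 40000)

# Every packet of the flow maps to the same queue (flow affinity) ...
queue = rss_queue(flow)

# ... but the choice ignores which core the receiving application runs on,
# which is exactly the RSS limitation discussed above.
print("flow steered to queue/core", queue)
```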
To remedy this RSS limitation, the Intel Ethernet Flow Director technology [10] has been introduced. The basic idea is simple: Flow Director maintains the relationship "Traffic Flows → Cores (Applications)" in the NIC, and OSes are correspondingly enhanced to support this capability. Flow Director not only provides the benefits of parallel receive processing in multiprocessing environments, it can also automatically steer packets of a specific data flow to the same core, where they will be protocol-processed and finally consumed by the application. However, our analysis and experiments show that Flow Director cannot guarantee in-order packet delivery in multiprocessing environments. TCP performance suffers in the event of severe packet reordering. In this paper, we use a simplified model to analyze why Flow Director can cause packet reordering. Our experiments verify our analysis.

2. Why Does Flow Director Cause Packet Reordering?

Intel Ethernet Flow Director [10] is a NIC technology. It supports multiple receive queues in the NIC, up to the number of cores in the system. With MSI/MSI-X and flow-pinning support, each receive queue has a dedicated interrupt and is tied to a specific core; each core in the system is assigned a specific receive queue. The NIC device driver allocates and maintains a ring buffer in system memory for each receive queue. For packet reception, a ring buffer must be initialized and pre-allocated with empty packet buffers that have been memory-mapped into address space accessible by the NIC over the system I/O bus. The ring buffer size is device- and driver-dependent.

For Flow-Director-steered traffic, Flow Director maintains a "Traffic Flow → Core" table with a single entry per flow. Each entry tracks the receive queue (core) to which a flow should be assigned. Entries within the "Traffic Flow → Core" table are updated by outgoing packets. To support Flow Director, the OS must be multiple-TX-queue capable [11]. Each core in the system is assigned a specific transmit queue, and outgoing traffic generated on a specific core is transmitted via its corresponding transmit queue. For an outgoing transport-layer packet, the OS records the processing core ID and passes it to the NIC to update the corresponding entry in the table. Flow Director uses the 5-tuple {src_addr, dst_addr, protocol, src_port, dst_port} in the receive direction to specify a flow. Therefore, for an outgoing packet with the header {(src_addr: x), (dst_addr: y), (protocol: z), (src_port: p), (dst_port: q)}, the corresponding flow entry in the table is identified as {(src_addr: y), (dst_addr: x), (protocol: z), (src_port: q), (dst_port: p)}.

Fig. 1 illustrates packet receive-processing for transport-layer packets with Flow Director. (1) When incoming packets arrive, the hash function is applied to the header to produce a hash result. Based on the hash result, the NIC identifies the core and, hence, the associated receive queue. (2) The NIC assigns the incoming packets to the corresponding receive queues. (3) The NIC deposits the received packets, via direct memory access (DMA), into the corresponding ring buffers in system memory. (4) The NIC sends interrupts to the cores associated with the non-empty queues. Subsequently, the cores respond to the network interrupts and process the received packets up through the network stack from the corresponding ring buffers, one by one. For traffic that is not steered by Flow Director, please refer to [10] for more details.

Fig. 1. Flow Director mechanism.

Flow Director not only provides the benefits of parallel receive processing in multiprocessing environments, it can also automatically steer packets of a data flow to the same core, where they will be protocol-processed and finally consumed by the application. However, our analysis shows that Flow Director cannot guarantee in-order packet delivery in multiprocessing environments. TCP performs poorly with severe packet reordering [12]. In the following, we use a simplified model to analyze why Flow Director cannot guarantee in-order packet delivery.
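As a concrete illustration of the table-update and lookup logic described above, the sketch below keeps a "Traffic Flow → Core" dictionary keyed by the receive-direction 5-tuple: transmit-side processing reverses the outgoing packet's addresses and ports to update the entry, and receive-side steering simply looks the entry up. The data structures and function names are illustrative assumptions, not the 82599 controller's actual implementation.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FiveTuple:
    src_addr: str
    dst_addr: str
    protocol: int
    src_port: int
    dst_port: int

    def reversed(self) -> "FiveTuple":
        """Receive-direction key for an outgoing packet: swap addresses and ports."""
        return FiveTuple(self.dst_addr, self.src_addr, self.protocol,
                         self.dst_port, self.src_port)


# "Traffic Flow -> Core" table, one entry per flow (illustrative only).
flow_table: Dict[FiveTuple, int] = {}


def on_transmit(pkt: FiveTuple, tx_core: int) -> None:
    """Outgoing packets update the entry of the matching receive-direction flow."""
    flow_table[pkt.reversed()] = tx_core


def steer_receive(pkt: FiveTuple, default_core: int = 0) -> int:
    """Incoming packets are steered to the core recorded for their flow."""
    return flow_table.get(pkt, default_core)


# Example: the application sends from core 2; its incoming traffic is then steered to core 2.
outgoing = FiveTuple("10.0.0.2", "10.0.0.1", 6, 40000, 5001)
on_transmit(outgoing, tx_core=2)
incoming = FiveTuple("10.0.0.1", "10.0.0.2", 6, 5001, 40000)
print("incoming packet steered to core", steer_receive(incoming))  # -> 2
```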
As shown in Fig. 2, at time $T - \varepsilon$, Flow 1's flow entry maps to Core 0 in the "Traffic Flow → Core" table. At this instant, packet S of Flow 1 arrives; based on the "Traffic Flow → Core" table, it is assigned to Core 0. At time $T$, due to process migration, Flow 1's flow entry is updated and now maps to Core 1. At $T + \varepsilon$, packet S+1 of Flow 1 arrives and is assigned to the new core, namely Core 1.

Fig. 2. A simplified model for packet reordering analysis.

As described above, after assigning received packets to the corresponding receive queues, the NIC copies them into system memory via DMA and fires network interrupts, if necessary. When a core responds to a network interrupt, it processes received packets up through the network stack from the corresponding ring buffer, one by one. In our case, Core 0 processes packet S up through the network stack from Ring Buffer 0, and Core 1 services packet S+1 from Ring Buffer 1.

Let $T_{service}(S)$ and $T_{service}(S+1)$ be the times at which the network stack starts to service packets S and S+1, respectively. If $T_{service}(S) > T_{service}(S+1)$, the network stack receives packet S+1 earlier than packet S, resulting in packet reordering. Let $D$ be the ring buffer size and let the network stack's packet service rate be $R_{service}$ (packets per second). Assume there are $n$ packets ahead of S in Ring Buffer 0 and $m$ packets ahead of S+1 in Ring Buffer 1. Then:

$T_{service}(S) = T - \varepsilon + n / R_{service}$   (1)

$T_{service}(S+1) = T + \varepsilon + m / R_{service}$   (2)

If $\varepsilon$ is small and $n > m$, the condition $T_{service}(S) > T_{service}(S+1)$ easily holds and leads to packet reordering. Since the ring buffer size is $D$, the worst case is $n = D - 1$ and $m = 0$:

$T_{service}(S) = T - \varepsilon + (D - 1) / R_{service}$   (3)

$T_{service}(S+1) = T + \varepsilon$   (4)

The ring buffer size $D$ is a design parameter of the NIC and its driver. For example, the ring buffer size of the Myricom 10Gb NIC is 512, and that of Intel's 1Gb NIC is 256.

In a multicore system, a general-purpose OS scheduler tries to use all core resources in parallel as much as possible, distributing and adjusting the load among the cores. Process migration across cores therefore occurs frequently. Under these conditions, Flow Director can easily cause packet reordering.
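To make the analysis concrete, the short sketch below evaluates equations (1)-(4) for assumed parameter values; the numbers (service rate, ε, and queue occupancies) are illustrative, not measurements from the paper.

```python
# Illustrative evaluation of equations (1)-(4); all numbers are assumptions.
D = 512            # ring buffer size (e.g., the Myricom 10Gb NIC)
R_service = 1e6    # network stack service rate, packets per second (assumed)
eps = 1e-6         # seconds between packet S, the table update, and packet S+1
T = 0.0            # reference time of the flow-table update


def t_service(arrival: float, queued_ahead: int) -> float:
    """Time at which the stack starts servicing a packet that arrives at
    `arrival` with `queued_ahead` packets ahead of it in its ring buffer."""
    return arrival + queued_ahead / R_service


# Worst case from the analysis: n = D - 1 packets ahead of S, m = 0 ahead of S+1.
t_S = t_service(T - eps, D - 1)    # equation (3)
t_S1 = t_service(T + eps, 0)       # equation (4)

print(f"T_service(S)   = {t_S * 1e6:8.1f} us")
print(f"T_service(S+1) = {t_S1 * 1e6:8.1f} us")
print("reordered:", t_S > t_S1)    # True: packet S+1 is serviced before packet S
```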
To validate our analysis, we ran data transmission experiments over an isolated network. A sender was directly connected to a receiver via a physical 10 Gbps link. The sender's and receiver's configurations were as follows:

Sender: Dell R805. CPU: two quad-core AMD Opteron 2346HE, 1.8 GHz. NIC: Myricom 10 Gbps Ethernet NIC. OS: Linux 2.6.28.

Receiver: SuperMicro server. CPU: two Intel Xeon CPUs, 2.66 GHz, Family 6, Model 15. NIC: Intel X520 server adapter with Flow Director enabled (configured with the suggested default parameters [11]: FdirMode=1, AtrSampleRate=20), 10 Gbps. OS: Linux 2.6.34, multiple-TX-queue capable.

In our experiments, iperf [13] is used to send n parallel TCP streams from the sender to the receiver for 100 seconds. We ran "iperf -s" in the receiver. Linux was configured to run in multicore peak performance mode [14]. As a consequence, the scheduler tries to use all core resources in parallel as much as possible, distributing the load equally among the cores. iperf is a multi-threaded network application. With multiple parallel TCP data streams, a dedicated child thread is spawned and assigned to handle each stream. As a result, iperf threads may migrate across cores. The receiver was instrumented to record out-of-order packets, and we calculated the relevant packet reordering ratios. The experiment results, with 95% confidence intervals, are shown in Table 1.

Table 1. Experiment results: packet reordering ratio vs. number of parallel TCP streams n (95% confidence intervals).

| n    | Reordering ratio |
|------|------------------|
| 40   | 0.498% ± 0.067%  |
| 100  | 0.705% ± 0.042%  |
| 200  | 0.897% ± 0.038%  |
| 500  | 0.635% ± 0.154%  |
| 1000 | 0.409% ± 0.009%  |
| 2000 | 0.129% ± 0.003%  |

The degree of packet reordering is significant. At n=200, the packet reordering ratio reaches as high as 0.897%. The experiment results validate our analysis: when the scheduler tries to use all core resources in parallel as much as possible, distributing the load equally among the cores, frequent process migration results. As our analysis suggests, the Flow Director mechanism causes packet reordering when process migration occurs.

In addition, we ran tcpdump to record a single stream's packet trace at the receiver at n=200. The packet trace analysis in Fig. 3 (using tcptrace and xplot [15]) clearly shows the occurrence of duplicate ACKs, SACKs, and data retransmissions due to packet reordering.

Fig. 3. Packet trace analysis (n=200).

We then ran "taskset 0x01 iperf -s" in the receiver to pin iperf to core 0 and repeated the above experiments. No packet reordering was discovered. This is because when iperf is pinned to a specific core, its child threads are also pinned to that core, so there is no process migration in this case. Under these conditions, Flow Director does not cause packet reordering.
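The receiver-side instrumentation itself is not published with the paper. The sketch below shows one plausible way to count out-of-order arrivals and a reordering ratio from a per-flow sequence of TCP sequence numbers (for example, extracted from a tcpdump trace). The counting rule, flagging any segment whose sequence number is below the highest one already seen, is an assumption for illustration; other reordering metrics exist, and TCP sequence-number wraparound is ignored here.

```python
from typing import Iterable, Tuple


def reordering_ratio(seq_numbers: Iterable[int]) -> Tuple[int, int, float]:
    """Count packets arriving with a sequence number below the highest
    sequence number seen so far; return (reordered, total, ratio).
    This is one simple reordering metric, not the paper's exact method."""
    highest = None
    total = 0
    reordered = 0
    for seq in seq_numbers:
        total += 1
        if highest is not None and seq < highest:
            reordered += 1   # arrived late relative to an already-seen segment
        else:
            highest = seq
    return reordered, total, (reordered / total if total else 0.0)


# Example: the segment with seq 3000 arrives after the segment with seq 4000.
arrivals = [1000, 2000, 4000, 3000, 5000]
print(reordering_ratio(arrivals))  # -> (1, 5, 0.2)
```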
3. Conclusion

In this paper, we use a simplified model to analyze why Flow Director can cause packet reordering in multiprocessing environments. Our experiments validate our analysis. The contributions of this paper are twofold. First, we show that Intel Ethernet Flow Director cannot guarantee in-order packet delivery in multiprocessing environments. Second, we develop a simplified model to analyze why Flow Director causes packet reordering.

References

[1] P. Willmann et al., "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems," USENIX ATC, 2006.
[2] J. Hurwitz et al., "End-to-end performance of 10-gigabit Ethernet on commodity systems," IEEE Micro, vol. 24, no. 1, 2004, pp. 10-22.
[3] J. Salehi et al., "The effectiveness of affinity-based scheduling in multiprocessor networking," IEEE/ACM Transactions on Networking, vol. 4, no. 4, 1996, pp. 516-530.
[4] A. Foong et al., "An in-depth analysis of the impact of processor affinity on network performance," in Proc. IEEE International Conference on Networks, 2004.
[5] J. Hye-Churn et al., "MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces," in Proc. IEEE Symposium on High Performance Interconnects, 2009.
[6] R. Huggahalli et al., "Direct Cache Access for High Bandwidth Network I/O," in Proc. 32nd Annual International Symposium on Computer Architecture, 2005.
[7] www.microsoft.com, Receive-Side Scaling Enhancements in Windows Server, 2008.
[8] PCI Express System Architecture, PC System Architecture Series, Addison-Wesley Professional, 2003, ISBN-10: 0321156307.
[9] www.intel.com, Supra-linear Packet Processing Performance with Intel Multi-core, 2006.
[10] Intel 82599 10GbE Controller Datasheet, 2009.
[11] www.intel.com, IXGBE device driver README.
[12] W. Wu et al., "Sorting reordered packets with interrupt coalescing," Computer Networks, vol. 53, no. 15, 2009, pp. 2646-2662.
[13] http://dast.nlanr.net/Projects/Iperf/
[14] www.kernel.org
[15] www.tcptrace.org
