Meaningful Matches in Stereovision
This paper introduces a statistical method to decide whether two blocks in a pair of images match reliably. The method ensures that the selected block matches are unlikely to have occurred "just by chance." The new approach is based on the definit…
Authors: Neus Sabater (CMLA), Andres Almansa (LTCI), Jean-Michel Morel (CMLA)
Neus Sabater*, Andrés Almansa** and Jean-Michel Morel*
*ENS Cachan, CNRS-CMLA, France. **Telecom ParisTech, CNRS-LTCI, France.

Abstract: This paper introduces a statistical method to decide whether two blocks in a pair of images match reliably. The method ensures that the selected block matches are unlikely to have occurred "just by chance." The new approach is based on the definition of a simple but faithful statistical background model for image blocks learned from the image itself. A theorem guarantees that under this model not more than a fixed number of wrong matches occurs (on average) for the whole image. This fixed number (the number of false alarms) is the only method parameter. Furthermore, the number of false alarms associated with each match measures its reliability. This a contrario block-matching method, however, cannot rule out false matches due to the presence of periodic objects in the images. But it is successfully complemented by a parameterless self-similarity threshold. Experimental evidence shows that the proposed method also detects occlusions and incoherent motions due to vehicles and pedestrians in non-simultaneous stereo.

Index Terms: Stereo vision, Block-matching, Number of False Alarms (NFA), a contrario detection.

I. INTRODUCTION

Stereo algorithms aim at reconstructing a 3D model from two or more images of the same scene acquired from different angles. This work only considers previously stereo-rectified image pairs. In that case the 3D reconstruction requires that the matched points in both images belong to the same horizontal epipolar line. The matching process of stereo image pairs has been studied in depth for more than four decades. [1] and [35] contain a fairly complete comparison of the main methods.
According to these surveys there are roughly two main classes of algorithms in binocular stereovision: local matching methods and global methods. Global methods aim at a coherent solution obtained by minimizing an energy functional containing matching fidelity terms and regularity constraints. The most efficient ones seem to be Belief Propagation [15], [42], Graph Cuts [16], Dynamic Programming [9], [28] and solvers of the multi-label problem [13], [30]. They often resolve ambiguous matches by maintaining a coherence along the epipolar line (DP) or along and across epipolar lines (BP & GC). They rely on a regularization term to eliminate outliers and reduce the noise. They give a match to all points which are not detected as occluded. Global methods are, however, at risk of making or propagating errors if the regularization term is not adapted to the scene. A classic example is when a large portion of the scene is nearly constant, for example a scene including a cloudless sky, since there is no information in such a region to compute reliable matches (see Fig. 11 for an example). On such ambiguous regions, global methods perform an interpolation by using the informative pixels. This interpolation can be lucky, as is the case in most images of the Middlebury benchmark (footnote 1). But it can also fail, as is apparent in the above example and in many outdoor scenes. Furthermore, the energy in global methods has at least two terms and one parameter weighting them (and sometimes three terms and two parameters [16]). These parameters are difficult to tune, and even to model. Thus, it remains a valid question how to rule out by a parameterless method the dubious regions where the matches cannot be scientifically demonstrated.

On the other hand local methods are simpler, but equally sensitive to local ambiguities. Local methods start by comparing features of the right and left images.
These features can be blocks in block-matching methods, or even local descriptors [21] like SIFT descriptors [18], [31], curves [36], corners [3], [11], etc. The drawback of local methods is that they do not provide a dense map as global methods do (meaning that the percentage of matched points is lower than 100%). Recent years have therefore seen a blooming of global methods, which reach the best performance in recent benchmarks such as the Middlebury dataset [35]. But our purpose is to show that local methods can also be competitive.

This paper considers the common denominator of most local methods, block-matching. It shows that this tool is amenable to a local statistical decision rule telling us whether a match is reliable. In fact, not all the pixels in an image pair can be reliably matched in real scenes (40 to 80% of pixels). The lack of corresponding points in the second image or the ambiguity in certain points stirs up gross errors in dense stereovision. In particular block-matching methods suffer from two mismatching causes that must be tackled one by one:
1) The main mismatch cause in local methods is the absence of a theoretically well founded threshold to decide whether two blocks really match or not. Our main goal here will be to define such a threshold by an a contrario block-matching (ACBM) rejection rule, ensuring that two blocks do not match "just by chance."
2) A second, minor mismatch cause is the presence on the epipolar line of repetitive shapes or textures, a problem sometimes called "stroboscopic phenomenon," or "self-similarity." The proposed ACBM only rules out stochastic similarities, not deterministic ones. While the ACBM rule mismatches repetitive patterns, these types of mismatches are easily eliminated by a simple self-similarity rule (SS).
We shall, however, verify that a self-similarity rule by itself is far from reaching the ACBM performance. Both rules are necessary and complementary.

The elimination of these two sorts of mismatches is a key issue in block-matching methods. The problem of sifting out matching errors in stereovision has of course been addressed many times. We shall discuss a selection of the significant contributions for each cause of mismatch.

Footnote 1: http://vision.middlebury.edu/stereo/

Occlusions are still an open problem in stereovision and one of the main causes of mismatch. For this reason numerous stereo approaches focus on detecting them. Global energy methods [16] address occlusions by adding a penalty term for occluded pixels in their energy function. In [14] the major contribution is the reasoning about visibility in multi-view stereo. [42] computes two disparity maps symmetrically and verifies the left-right coherence to detect occluded pixels. [28] asserts that if two points in the epipolar line match with two points with a different order then there is an occlusion. Again this can lead to errors if there are narrow objects in the scene. See also [7], which compares a selection of methods to detect occlusions.

Matching pixels in poorly textured regions, where noise dominates signal, is clearly the main cause of error. Based on local SNR estimates, [5] has proposed to reject matches by thresholding the second derivative of the correlation function: the flatter the correlation function, the less reliable the match. In [34], the mismatches due to weakly textured objects or to periodic structures are considered. The author defines a confidently stable matching in order to establish the largest possible unambiguous matching at a given confidence level. Two parameters control the compromise between the percentage of bad matches and the match density of the map.
Yet, the match density falls dramatically when the percentage of mismatches decreases. We will see that the method presented here is able to get denser disparity maps with fewer mismatches. Similarly, [20] tries to eliminate errors on repeated patterns. Yet their matches seem to concentrate mainly on image edges and therefore have a low density. A more primitive version of the rejection method developed here was applied successfully to the detection of moving and disappearing objects in [33]. This is a foremost problem in the quasi-simultaneous stereo usual in aerial or satellite imaging, where vehicles and pedestrians strongly perturb the stereo matching process. The extended method presented here deals with a much broader class of mismatches, including those due to poor signal to noise ratio.

A. Anterior Statistical A Contrario Decision Methods

Because of the above mentioned reasons one cannot presuppose the existence of uniquely determined correspondences for all pixels in the image. Thus, a decision must be taken on whether or not a block in the left image meaningfully matches its best match in the right image. This problem will be addressed by the a contrario approach initiated by [6]. This method is generally viewed as an adaptation to image analysis of classic hypothesis testing. But it also has a psychophysical justification in the so-called Helmholtz principle, according to which all perceptions could be characterized as having a low probability of occurring in noise. Early versions of this principle in computer vision are [17], [10], [38].

A probabilistic a contrario argument is also invoked in the SIFT method [18], which includes an empirical rejection threshold. A match between two descriptors $S_1$ and $S'_1$ is rejected if the second closest match $S'_2$ to $S_1$ is actually almost as close to $S_1$ as $S'_1$ is.
The typical distance ratio rejection threshold is 0.6, which means that $S'_1$ is accepted if $\mathrm{dist}(S'_1, S_1) \le 0.6 \times \mathrm{dist}(S'_2, S_1)$ and rejected otherwise. Interestingly, Lowe justifies this threshold by a probabilistic argument: if the second best match is almost as good as the first, this only means that both matches are likely to occur by chance. Thus, they must be rejected. Recently, [31] proposed a rigorous theory for this intuitive method. SIFT matches are accepted or rejected by an a contrario methodology involving the Earth Mover's Distance.

The a contrario methodology has also already been used in stereo matching. [23] proposed a probabilistic criterion to detect a rigid motion between two point sets taken from a stereo pair, and to estimate the fundamental matrix. This method, ORSA, shows improved robustness compared to a classic RANSAC method.

In the context of foreground detection in video, [22] proposed an a contrario method for discriminating foreground from background pixels that was later refined by [29]. Even though this problem has some points in common with stereo matching, it is in a way less strict, since it only needs to learn to discriminate two classes of pixels. Hence they do not need to resort to image blocks, but rely only on a 5-dimensional feature vector composed of the color and motion vector of each pixel.

Among influential related works, Robin et al. [32] describe a method for change detection in a time series of Earth observation images. The change region is defined as the complement of the maximal region where the time series does not change significantly. Thus, what is controlled by the a contrario method is the number of false alarms (NFA) of the no-change region. This method can therefore be regarded as an a contrario region matching method. It is fundamentally different from the method we shall present.
Indeed, Robin's method assumes (in addition to the statistical background model) a statistical image model that the time series follows in the regions where no change occurs, which is not feasible in stereo matching.

The method in [27] is also worth mentioning. It is an a contrario method for detecting similar regions between two images. This method is a classic statistical test rather than an a contrario detection method in the sense of [6]. Indeed, the roles of the background model ($H_0$ hypothesis) and the structure to be tested ($H_1$ hypothesis) are reversed: this method only controls the false negative rate and not the false positive rate (as in typical a contrario methods). Furthermore the significance level of the statistical test is set to $\alpha \approx 0.1$ in accordance with classical statistical testing, whereas, as demonstrated in [6], the significance level can be made much more secure, of the order of $10^{-6}$.

The a contrario model for region matching in stereo vision used in [12] is simple and efficient. The gradient orientations at all region pixels are assumed independent and uniformly distributed in the background model. A more elaborate version learns the probability distribution of gradient orientation differences under the hypothesis that the disparity (or motion) is zero, and uses this distribution as a background model. Still, pixels are all considered as independent under the background model. Once this background model is learned, a given disparity (or motion model) is considered as meaningful if the number of aligned gradient orientations is sufficiently large within the tested region. This region matching method works well, but requires an initial over-segmentation of the gray-level image which is later refined by an a contrario region merging procedure. Because of the rough background model, false positive region matches can be observed.
The key to a good background or a contrario model in block-matching would be to learn a realistic probability distribution of the high-dimensional space of image patches. The seminal works [25] and [4] in the context of shape matching (where shapes are represented as pieces of level lines of a fixed size) showed that high-dimensional shape distributions can be efficiently approximated by the tensor product of (well chosen) marginal distributions. The marginal laws are one-dimensional, and therefore easily learned. In [26] these marginals are learned along the orientations of the principal components. The present work can be viewed as an extension of this curve matching method to block-matching.

[2] proposed an alternative way of choosing detection thresholds such that the number of false detections under a given background model is ensured to stay below a given threshold. The procedure does not require analytical computations or decomposing the probability as a tensor product of marginal distributions. Instead, detection thresholds are learned by Monte-Carlo simulations in a way that ensures the target NFA rate. This method, which was developed in the context of image segmentation, involves the definition of a set of thresholds to determine whether two neighboring regions are similar. However, as in [27], the detected event whose false positive rate is controlled is "the two regions are different," and not the one we are interested in for region matching, namely "the two regions are similar."

In conclusion, the a contrario methodology is expanding to many matching decision rules, but does not seem to have been previously applied to the block-matching problem. We shall now proceed to describe the a contrario or background model for block-matching.
The proposed model is the simplest that worked, but the reader may wonder if a still simpler model could actually work. In the next section we analyze a list of simpler proposals, and we explain why they must be discarded.

B. Choosing an Adequate A Contrario Model for Patch Comparison

The goal of this section is to reject simpler alternatives to the probabilistic block model that will be used. In recent years, patch models and patch spaces have become increasingly popular. We refer to [19] and references therein for algorithms generating sparse bases of patch spaces. Here, our goal can be formulated in one single question, which clearly depends on the observed set of patches in one particular image and not on the probability space of all patches. The question is: "What is the probability that given two images and two similar patches in these images, this similarity arises just by chance?" The "just by chance" implies the existence of a stochastic background model, often called the a contrario model.

When trying to define a well suited model for image blocks, many possibilities open up. Simple arguments show, however, that over-simplified models do not work. Let $H$ be the gray-level histogram of the second image $I'$. The simplest a contrario model of all might simply assume that the observed values $I'(x)$ are instances of i.i.d. random variables $\mathbf{I}'(x)$ with cumulative distribution $H$. This would lead us to affirm that pixels $q$ in image $I$ and $q'$ in image $I'$ are a meaningful match if their gray level difference is unlikely small,

$$P\big[\,|I(q) - \mathbf{I}'(q')| \le \theta\,\big] \le \frac{1}{N_{tests}}, \qquad \theta := |I(q) - I'(q')|.$$

As we shall see later, the number of tests $N_{tests}$ is quite large in this case ($N_{tests} \approx 10^7$ for typical image sizes), since it must consider all possible pairs of pixels $(q, q')$ that may match.
But such a small probability can be achieved (assume that $H$ is uniform over $[0, 255]$) only if the threshold $\theta = |I(q) - I'(q')| < 128 \cdot 10^{-7}$. On the other hand, $|I(q) - I'(q')|$ cannot be expected to be very small because both images are corrupted by noise, among other distortions. Even in a very optimistic setting, where there would be only a small noise distortion between both images (of about 1 gray level standard deviation), such a small difference would only happen for a tiny proportion ($3.2 \cdot 10^{-5}$) of the correct matches. This means that a pixel-wise comparison would require an extremely strict detection threshold to ensure the absence of false matches, but this leads to an extremely sparse detection (about thirty meaningful matches per mega-pixel image). This suggests that the use of local information around the pixel is unavoidable.

The next simplest approach could be to compare blocks of a certain size $\sqrt{s} \times \sqrt{s}$ with the $\ell^2$ norm, and with the same background model as before. Thus, we could declare blocks $B_q$ and $B_{q'}$ as meaningfully similar if

$$P\Big[\frac{1}{|B_0|}\sum_{x \in B_0} |I(q+x) - \mathbf{I}'(q'+x)|^2 \le \theta\Big] \le \frac{1}{N_{tests}}, \qquad \theta := \frac{1}{|B_0|}\sum_{x \in B_0} |I(q+x) - I'(q'+x)|^2, \tag{1}$$

where $B_0$ is the block of size $\sqrt{s} \times \sqrt{s}$ centered at the position $(0,0)$. Now the test would be passed for a more reasonable threshold ($\theta = 6, 28, 47$ for blocks of size $3 \times 3$, $5 \times 5$, $7 \times 7$ respectively), which would ensure a much denser response. However, this a contrario model is by far too naive and produces many false matches. Indeed, blocks stemming from natural images are much more regular than the white noise generated by the background model. Considering all pixels in a block as independent leads to overestimating the similarity probability of two observed similar blocks. It therefore leads to an over-detection.
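The pixel-wise bound $\theta < 128 \cdot 10^{-7}$ quoted above can be recovered with a one-line computation. The sketch below rests on assumptions that are not spelled out in the text: the gray levels are treated as a continuous uniform variable over a range of width 256, $I(q)$ lies at least $\theta$ away from both ends of that range, $N_{tests} = 10^7$, and $\mathbf{I}'(q')$ denotes the random gray level drawn from the background model.

```latex
\[
P\big[\,|I(q) - \mathbf{I}'(q')| \le \theta\,\big]
  = \frac{2\theta}{256} = \frac{\theta}{128}
  \le \frac{1}{N_{tests}} = 10^{-7}
\quad\Longleftrightarrow\quad
\theta \le 128 \cdot 10^{-7} \approx 1.3 \cdot 10^{-5}.
\]
```

A gray-level difference this small lies far below any realistic noise level, which is why the pixel-wise rule detects almost nothing. The block-wise rule (1) relaxes the threshold, but at the price of the over-detection described above.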
In order to fix this problem, we need a background model better reflecting the statistics of natural image blocks. But directly learning such a probability distribution from a single image in dimension 81 (for $9 \times 9$ blocks) is hopeless.

Fortunately, as pointed out in [25], high-dimensional distributions of shapes can be approximated by the tensor product of their adequately chosen marginal distributions. Such marginal laws, being one-dimensional, are easily learned from a single image. Ideally, ICA (Independent Component Analysis) should be used to learn which marginal laws are the most independent, but the simpler PCA analysis will prove accurate enough for our purposes. Indeed, it ensures that the principal components are decorrelated, a first approximation to independence. Fig. 2 gives a visual assessment of how well a local PCA model simulates image patches in a class. Nevertheless, the independence assumption will be used as a tool for building the a contrario model. This independence is not an empirical finding on the set of patches.

Fig. 1. Left: Reference image of a stereo pair of images. Right: the nine first principal components of the $7 \times 7$ blocks.

C. Plan

Section II introduces the stochastic block model learned from a reference image. Section II-B presents the a contrario method applied to disparity estimation in stereo pairs and treats the main problem of deciding whether two pixels match. Theorem 1 is the main result of this section, ensuring a controlled number of false detections. Section III tackles the stroboscopic problem by a parameterless method, and demonstrates the necessity and complementarity of the a contrario and self-similarity rejections. Experimental results and comparison with other methods are in Section IV. Section V concludes. An appendix summarizes the algorithm and gives its complete pseudo-code.

II. THE A CONTRARIO MODEL FOR BLOCK-MATCHING

We shall denote by $q = (q_1, q_2)$ a pixel in the reference image $I$ and by $B_q$ a block centered at $q$. To fix ideas, the block will be a square throughout this paper, but this is by no means a restriction. A different shape (rectangle, disk) would be possible, and even a variable shape. Given a point $q$ and its block $B_q$ in the reference image, block-matching algorithms look for a point $q'$ in the second image $I'$ whose block $B_{q'}$ is similar to $B_q$.

A. Principal Component Analysis

For building a simple a contrario model the principal component analysis can play a crucial role, as shown in [26]. Indeed, it allows for effective dimension reduction and decorrelates these dimensions, giving a first approximation to independence. This facilitates the construction of a probabilistic density function for the blocks as a tensor product of its marginal densities. Let $B_q$ be the block of a pixel $q$ in the reference image and $(x^q_1, \ldots, x^q_s)$ the intensity gray levels in $B_q$, where $s$ is the number of pixels in $B_q$. Let $n$ be the number of pixels in the image. Consider the matrix $X = (x^j_i)_{1 \le i \le s,\ 1 \le j \le n}$ consisting of the set of all data vectors, one column per pixel in the image. Then, the covariance matrix of the block is $C = E\big[(X - \bar{x}\mathbf{1})(X - \bar{x}\mathbf{1})^T\big]$, where $\bar{x}$ is the column vector of size $s \times 1$ storing the mean values of matrix $X$ and $\mathbf{1} = (1, \cdots, 1)$ a row vector of size $1 \times n$. Notice that $\bar{x}$ corresponds to the block whose $k$-th pixel is the average of all $k$-th pixels of all blocks in the image. Thus, $\bar{x}$ is very close to a constant block, with the constant equal to the image average. The eigenvectors of the covariance matrix are called principal components and are orthogonal. They give the new coordinate system we shall use for blocks. Fig. 1 shows the first principal blocks.
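The construction of the block PCA basis described above can be sketched with NumPy as follows. This is a minimal illustration and not the authors' implementation: the function name `block_pca` is hypothetical, only blocks lying entirely inside the image are used (so the number of columns of $X$ is slightly smaller than the pixel count $n$), and the expectation in the covariance is taken as the empirical average over all blocks.

```python
import numpy as np


def block_pca(image, side):
    """Return (mean block, eigenvalues, principal components) of all side x side blocks.

    Hypothetical helper: eigenvalues are sorted in decreasing order and the
    principal components are the columns of the returned matrix.
    """
    # All side x side windows that fit inside the image (no border handling).
    windows = np.lib.stride_tricks.sliding_window_view(image, (side, side))
    # Data matrix X of shape (s, n): one column per block, s = side * side.
    X = windows.reshape(-1, side * side).T.astype(np.float64)
    x_bar = X.mean(axis=1, keepdims=True)       # s x 1 mean block
    centered = X - x_bar                        # X minus x_bar times the row of ones
    C = centered @ centered.T / X.shape[1]      # s x s empirical covariance
    eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # reorder to decreasing eigenvalue
    return x_bar.ravel(), eigvals[order], eigvecs[:, order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
    x_bar, eigvals, components = block_pca(image, side=7)
    print(components.shape)  # (49, 49): one principal block per column
```

Reshaping a column of `components` to `(side, side)` gives a principal block of the kind displayed in Fig. 1.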
Usually, the eigenvectors are sorted in order of decreasing eigenvalue. In that way the first principal components are the ones that contribute most to the variance of the data set. By keeping the first $N < s$ components with larger eigenvalues, the dimension is reduced but the significant information retained.

While this global ordering could be used to select the main components, a local ordering for each block will instead be used for the statistical matching rule. In other words, for each block, a new order for the principal components will be established given by the corresponding ordered PCA coordinates (the decreasing order is for the absolute values). In that way, comparisons of these components will be made from the most meaningful to the least meaningful one for this particular block.

Each block is represented by $N$ ordered coefficients $(c_{\sigma_q(1)}(q), \ldots, c_{\sigma_q(N)}(q))$, where $c_i(q)$ is the resulting coefficient after projecting $B_q$ onto the principal component $i \in \{1, \ldots, s\}$ and $\sigma_q$ the permutation representing the final order when ordering the absolute values of components for this particular $q$ in decreasing order. By a slight abuse of notation we will write $c_i(q)$ instead of $c_{\sigma_q(i)}(q)$, knowing that it represents the local order of the best principal components. But notice that $\sigma_q(1) = 1$ for most $q$ because of the dominance of the first principal component. Moreover notice that this first component has a quite different coefficient histogram from the other ones (see Fig. 4), because it approximately computes a mean value of the block. Indeed, the barycenter of all blocks is roughly a constant block whose average grey value is the image average grey level. The set of blocks is elongated in the direction of the average grey level and, therefore, the first component computes roughly an average grey level of the block.
This explains why the first component histogram is similar to the image histogram.

B. A Contrario Similarity Measure between Blocks

Definition 1 (A contrario model): We call a contrario block model associated with a reference image a random block $\mathbf{B}$ described by its (random) components $\mathbf{B} = (C_1, \ldots, C_s)$ on the PCA basis of the blocks of the reference image, satisfying
• the components $C_i$, $i = 1, \ldots, s$ are independent random variables;
• for each $i$, the law of $C_i$ is the empirical histogram of the $i$-th PCA component $c_i(\cdot)$ of the blocks of the reference image.

Fig. 2. (a) Patches of the reference image, chosen at random. (b) Simulated random blocks following the law of the reference image. This experiment illustrates the (relative) adequacy of the a contrario model. Nevertheless, the PCA components are empirically uncorrelated, but of course not independent.

The reference image will be the secondary image $I'$. Fig. 2 shows patches generated according to the above a contrario block model and compares them to blocks picked at random in the reference image. The a contrario model will be used for computing a block resemblance probability as the product of the marginal resemblance probabilities of the $C_i$ in the a contrario model, which is justified by the independence of $C_i$ and $C_j$ for $i \ne j$. There is a strong adequacy of the a contrario model to the empirical model, since the PCA transform ensures that $C_i$ and $C_j$ are uncorrelated for $i \ne j$, a first approximation of the independence requirement.

We start by defining the resemblance probability between two blocks for a single component. Denote by $H_i(\cdot) := H_i(c_i(\cdot))$ the normalized cumulative histogram of the $i$-th PCA block component $c_i(\cdot)$ for the secondary image $I'$.

Definition 2 (Resemblance probability): Let $B_q$ be a block in $I$ and $B_{q'}$ a block in $I'$.
Define the probability that a random block $\mathbf{B}$ of $I'$ resembles $B_q$ as closely as $B_{q'}$ does in the $i$-th component by

$$\hat{p}^{\,i}_{qq'} = \begin{cases} H_i(q') & \text{if } H_i(q') - H_i(q) > H_i(q); \\ 1 - H_i(q') & \text{if } H_i(q) - H_i(q') > 1 - H_i(q); \\ 2\,|H_i(q) - H_i(q')| & \text{otherwise.} \end{cases}$$

Fig. 3 illustrates how the resemblance probability $\hat{p}^{\,i}_{qq'}$ is computed and Fig. 4 shows empirical marginal densities.

Fig. 3. Normalized cumulative histogram of $i$-th PCA coordinates of the secondary image. $c_i(q)$ is the $i$-th PCA coordinate value in the first image. The resemblance probability $\hat{p}^{\,i}_{qq'}$ for the $i$-th component is twice the distance $|H_i(q) - H_i(q')|$ when $H_i(q)$ is not too close to the values 0 or 1.

C. Robust Similarity Distance

The first principal components of $B_q$, being in decreasing order, contain the relevant information on the block. Thus, if two blocks are not similar for one of the first components, they should not be matched, even if their next components are similar. Due to this fact, the components of $B_q$ and another block $B_{q'}$ must be compared with a non-decreasing exigency level. In addition, in the a contrario model, the number of tested correspondences should be as small as possible to reduce the number of false alarms. A quantization of the tested resemblance probabilities is therefore required to limit the number of tests.

These two remarks lead to define the quantized resemblance probability as the smallest non-decreasing sequence of quantized probabilities bounding from above the sequence $\hat{p}^{\,i}_{qq'}$.

Definition 3 (Quantized probability): Let $B_q$ be a block in $I$. Let $\Pi := \{\pi_j = 1/2^{\,j-1}\}_{j = 1, \ldots, Q}$ be a set of quantized probability thresholds and let

$$\Upsilon := \big\{\, p = (p_1, \ldots, p_N) \;\big|\; p_i \in \Pi,\ p_i \le p_j \text{ if } i < j \,\big\}$$

be the family of non-decreasing $N$-tuples in $\Pi^N$, endowed with the order $a \ge b$ if and only if $a_i \ge b_i$ for all $i$. The quantized probability sequence associated with the event that the random block $\mathbf{B}$ resembles $B_q$ as closely as $B_{q'}$ does in the $i$-th component is defined by

$$(p^i_{qq'})_{i = 1, \ldots, N} = \inf_{t \in \Upsilon} \big\{\, t \;\big|\; t \ge (\hat{p}^{\,i}_{qq'})_{i = 1, \ldots, N} \,\big\}. \tag{2}$$

Notice that the infimum $(p^1_{qq'}, \ldots, p^N_{qq'})$ is uniquely defined and belongs to $\Upsilon$. Put another way, the quantized probability vector $(p^1_{qq'}, \ldots, p^N_{qq'})$ is the smallest upper bound of the resemblance probabilities $(\hat{p}^{\,1}_{qq'}, \ldots, \hat{p}^{\,N}_{qq'})$ that can be found in $\Upsilon$. Fig. 5 illustrates the quantized probabilities in two cases.

Fig. 5. Two examples of probabilities with $Q = 5$ and $N = 9$. The probability thresholds are in ordinate and the features in abscissa. The resemblance probabilities are represented with small crosses and quantized probabilities with small squares. The example on the left has a final probability of $1/(16^2 \cdot 8^2 \cdot 4^4 \cdot 2)$. The right example has the same resemblance probabilities except for features 1 and 2, but the final probability is $1/2$. Only the configuration on the left corresponds to a meaningful match.

Proposition 1 (Quantized resemblance probability): Let $B_q \in I$ and $B_{q'}$ be two blocks. Assume the principal components $i \in \{1, 2, \ldots, s\}$ are reordered so that $|c_1(q)| \ge |c_2(q)| \ge \cdots \ge |c_s(q)|$. The probability of the event "the random block $\mathbf{B}$ has its $N$ first components as similar to those of $B_q$ as those of $B_{q'}$ are" is

$$\Pr\nolimits_{qq'} = \prod_{i=1}^{N} p^i_{qq'}. \tag{3}$$

Fig. 4. Histogram of the reference image, followed by the first five histograms of the block PCA coordinates. The first principal component roughly computes a mean of the block, which explains why its histogram is so similar to the image histogram.

This is a direct consequence of Def. 1, the principal components of $\mathbf{B}$ being independent. The resemblance probability is the product of the marginal resemblance probabilities.

As is classic in statistical decision, we could stop and use the above resemblance probability. But, despite having a low resemblance probability for each $\Pr_{qq'}$, the large number of resemblance tests allows for a very large number of false matches. Our next goal therefore is to define a number of false alarms, and not a probability, as the right criterion. To this aim, we need to estimate the number of tests.

D. Number of Tests

The number of tests for comparing all the blocks of image $I$ with all the blocks in image $I'$ is the product of three factors. The first one is the image size $\#I$. The second is the size of the search region denoted by $S' \subset I'$. We mentioned before that the search is done on the epipolar line. In practice, a segment of this line is enough. If $q = (q_1, q_2)$ is the point of reference it is enough to look for $q' = (q'_1, q_2) \in I'$ such that $q'_1 \in [q_1 - R, q_1 + R]$ where $R$ is a fixed integer larger than the maximal possible disparity. The third and most important factor is the number of different non-decreasing probability distributions $FC_{N,Q} = \#\Upsilon$ that can be envisaged. Of course not all of these tests are performed, but only the one indicated by the observed block $B_{q'}$.
Yet, the choice of this unique test is steered by an a posteriori observation, while the expectation of the number of false alarms (NFA) must be calculated a priori. Thus we must compute the NFA as though all comparisons for all quantized non-decreasing probabilities were carried out. A test can never be defined a posteriori; it cannot be steered by the observation. Thus the number of tests is not the number of tests effectively performed. There are $\#\Upsilon$ ways each pair of blocks could a priori be compared. In other terms, $\#\Upsilon$ different distances are a priori tested. Theorem 1 will ultimately justify the following definition.

Definition 4 (Number of tests): With the above notation we call the number of tests for matching two images $I$ and $I'$ the integer

$$N_{test} = \#I \cdot \#S' \cdot \#\Upsilon = n\,(2R+1)\,FC_{N,Q}.$$

Lemma 1: With the above notation,

$$FC_{N,Q} = \sum_{t=0}^{Q-1} (t+1) \binom{N+Q-t-3}{Q-t-1}, \qquad (4)$$

where $FC_{N,Q} := \#\{ f : [1,N] \to [1,Q] \mid f(x) \le f(y),\ \forall x \le y \}$.

In order to prove this result we write $\widetilde{FC}_{N,Q} := \#\{ f : [1,N] \to [1,Q] \mid f(1) = 1,\ f(N) = Q;\ f(x) \le f(y),\ \forall x \le y \}$. A non-decreasing function whose values span exactly $Q - t$ consecutive levels can be placed in $t + 1$ ways inside $[1, Q]$. Since

$$FC_{N,Q} = \sum_{t=0}^{Q-1} (t+1)\, \widetilde{FC}_{N,Q-t} \quad \text{and} \quad \widetilde{FC}_{N,Q} = \binom{N+Q-3}{Q-1},$$

the result follows.

We are now in a position to define a number of false alarms, which will control the overall number of false detections on the whole image.

Definition 5 (Number of false alarms): Let $B_q \in I$ and $B_{q'} \in I'$ be two observed blocks. Assume the principal components $i \in \{1, 2, \dots, s\}$ are reordered so that $|c_1(q)| \ge |c_2(q)| \ge \dots \ge |c_s(q)|$. We define the Number of False Alarms associated with the event "the random block $B$ has its $N$ first components as similar to those of $B_q$ as those of $B_{q'}$ are" by

$$NFA_{q,q'} = N_{test} \cdot \mathrm{Pr}_{qq'} = N_{test} \cdot \prod_{i=1}^{N} p^i_{qq'},$$

where $N_{test}$ comes from Def. 4 and $\mathrm{Pr}_{qq'}$ is the probability that the random block $B$ has its first $N$ PCA components as similar to those of $B_q$ as those of $B_{q'}$ are (Prop. 1).

Definition 6 ($\varepsilon$-meaningful match): A pair of pixels $q$ and $q'$ in a stereo pair $(I, I')$ is an $\varepsilon$-meaningful match if

$$NFA_{q,q'} \le \varepsilon. \qquad (5)$$

E. The Main Theorem

As computed above, the NFA has the dimension of a number (of false alarms) per image. An alternative would be to measure the NFA as a number of false alarms per pixel, in which case the number of tests would not contain the image cardinality factor $\#I$. With the proposed NFA, it is up to the users to decide which number of false alarms per image they consider tolerable. The NFA of a match actually gives a security level: the smaller the NFA, the more meaningful the match intuitively is. But Thm. 1 will give the real meaning of the NFA. To state it, we will use a clever trick used by Shannon in his information theory [37], pages 22-23, namely to treat the probability of an event as a random variable and to play with its expectation. Here the NFA will become a random variable, replacing $B_{q'}$ with $B$ in its definition.

In the a contrario model, each comparison of $B_q$ with some $B_{q'}$ is interpreted as a comparison of $B_q$ to a trial of the random block model $B$. In total, $B_q$ is compared with $2R+1$ other blocks for each $q \in I$. So we are led to distinguish, for each $q$, $(2R+1)$ trials which are as many i.i.d. random blocks $B_{q,j}$, $j \in \{1, 2, \dots, 2R+1\}$, all with the same law as $B$. They model a contrario the $(2R+1)$ trials by which $B_q$ is matched to $(2R+1)$ blocks in $I'$. We are interested in the expectation of the number of such trials being successful (i.e. $\varepsilon$-meaningful) "just by chance." Consider the event $E_{q,j}$ that a random block $B_{q,j}$ in the a contrario model with reference image $I'$ meaningfully matches $B_q$.
If this happens, it is obviously a false alarm. We shall denote by $\chi_{q,j}$ the random characteristic function associated with this event, with the convention that $\chi_{q,j} = 1$ if $E_{q,j}$ is true and $\chi_{q,j} = 0$ otherwise. Similarly, $NFA_{q,j}$ and $p^i_{q,j}$ are the NFA and quantized probabilities associated with the event $E_{q,j}$.

Theorem 1: Let $\Gamma = \sum_{q \in I,\, j \in \{1,\dots,2R+1\}} \chi_{q,j}$ be the random variable representing the number of occurrences of an $\varepsilon$-meaningful match between a deterministic patch in the first image and a random patch in the second image. Then the expectation of $\Gamma$ is less than or equal to $\varepsilon$.

Proof: We have

$$\chi_{q,j} = \begin{cases} 1, & \text{if } NFA_{q,j} \le \varepsilon; \\ 0, & \text{if } NFA_{q,j} > \varepsilon. \end{cases}$$

Then, by the linearity of the expectation,

$$E[\Gamma] = \sum_{q,j} E[\chi_{q,j}] = \sum_{q,j} P[NFA_{q,j} \le \varepsilon].$$

The probability inside the above sum can be computed by Definitions 5 and 2:

$$P[NFA_{q,j} \le \varepsilon] = P\left[ \prod_{i=1}^{N} p^i_{q,j} \le \frac{\varepsilon}{N_{test}} \right].$$

Many probability $N$-tuples $p = (p^i_{q,j})_{i=1,\dots,N}$ satisfy the inequality inside the above probability. Nevertheless, the probabilities having been quantized, the event can be reduced to a (non-disjoint) union of events, indexed by all $p \in \Upsilon$ such that $\prod_i p_i \le \varepsilon / N_{test}$. By the Bonferroni (union) bound, the considered probability is upper-bounded by the sum of the probabilities of these events. In addition, the intersection below involves only independent events according to our background model. Thus

$$P\left[ \prod_{i=1}^{N} p^i_{q,j} \le \frac{\varepsilon}{N_{test}} \right] = P\left[ \bigcup_{\substack{p \in \Upsilon \\ \prod_i p_i \le \varepsilon / N_{test}}} \ \bigcap_{i} \left( p^i_{q,j} \le p_i \right) \right] \le \sum_{\substack{p \in \Upsilon \\ \prod_i p_i \le \varepsilon / N_{test}}} \ \prod_{i} p_i \le \frac{\varepsilon}{\#I \, \#S'},$$

where the last inequality holds because the sum has at most $\#\Upsilon$ terms, each of them at most $\varepsilon / N_{test}$, and $N_{test} = \#I \, \#S' \, \#\Upsilon$. So we have shown that

$$E[\Gamma] = \sum_{q,j} E[\chi_{q,j}] \le \sum_{q,j} \frac{\varepsilon}{\#I \, \#S'} = \varepsilon.$$
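As an illustration of Lemma 1 and Definitions 4 to 6, the following Python sketch computes $\#\Upsilon$, the quantized probabilities of (2), and the resulting NFA. It is a minimal sketch and not the authors' implementation: the threshold set $\{2^{-(Q-1)}, \dots, 1/2, 1\}$ is an assumption read off Fig. 5, and all function names as well as the example values of the image size and of $R$ are hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb, prod


def count_nondecreasing(N, Q):
    """#Upsilon = FC_{N,Q}, computed with formula (4) of Lemma 1 (needs N >= 2)."""
    return sum((t + 1) * comb(N + Q - t - 3, Q - t - 1) for t in range(Q))


def quantize(probs, Q):
    """Smallest non-decreasing tuple of thresholds that bounds `probs` from above.

    Assumption: the thresholds are 2^-(Q-1), ..., 1/2, 1 (as suggested by Fig. 5).
    """
    thresholds = [2.0 ** -k for k in range(Q - 1, -1, -1)]  # ascending order
    quantized, running_max = [], 0.0
    for p in probs:
        running_max = max(running_max, p)  # enforces a non-decreasing sequence
        quantized.append(next(t for t in thresholds if t >= running_max))
    return quantized


def number_of_tests(n_pixels, R, N, Q):
    """N_test = #I * #S' * #Upsilon (Definition 4)."""
    return n_pixels * (2 * R + 1) * count_nondecreasing(N, Q)


def nfa(probs, n_pixels, R, Q):
    """NFA = N_test * product of the quantized probabilities (Definition 5)."""
    q = quantize(probs, Q)
    return number_of_tests(n_pixels, R, len(q), Q) * prod(q)


def is_meaningful(probs, n_pixels, R, Q, eps=1.0):
    """Epsilon-meaningful match test (Definition 6)."""
    return nfa(probs, n_pixels, R, Q) <= eps


# Lemma 1 agrees with brute-force enumeration and with the closed form
# C(N+Q-1, Q-1) for the number of non-decreasing N-tuples over Q levels.
brute_force = sum(1 for _ in combinations_with_replacement(range(5), 4))
assert brute_force == count_nondecreasing(4, 5) == comb(4 + 5 - 1, 5 - 1)
```

With the paper's settings $N = 9$ and $Q = 5$, `count_nondecreasing(9, 5)` evaluates the sum in (4). A call such as `is_meaningful([0.01] * 9, 10**6, 5, 5)` then applies the decision rule (5) to nine hypothetical resemblance probabilities for an image of $10^6$ pixels and a search range $R = 5$.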
The $\varepsilon$ parameter is the only legitimate parameter of the method. The other ones, namely the block size $\sqrt{s}$, the number of principal components $N$ and the number of quantized probability thresholds $Q$, can be fixed once and for all for a given SNR (Signal to Noise Ratio). All experiments are made with a common SNR, but a higher SNR would allow smaller blocks and consequently a different set of parameters. The question of how many false alarms should be acceptable in a stereo pair depends on the size of the images. In all experiments with moderate size images, of the order of $10^6$ pixels, the decision was to fix $\varepsilon = 1$. Thanks to Theorem 1, this means that at most one false alarm is expected on average for images with $10^6$ pixels. Fixing $\varepsilon$ in this way makes the method parameterless for all moderately sized images.

III. THE SELF-SIMILARITY THRESHOLD

Urban environments contain many periodic local structures (for example the windows on a façade). Since, in general, the number of repetitions is insignificant with respect to the number of blocks that have been used to estimate the empirical a contrario probability distributions, the a contrario model does not learn this repetition and can be fooled by it, thus signaling a significant match for each repetition of the same structure. Of course, one of those significant matches is the correct one, but chances are that the correct one is not also the most significant. In such a situation two choices are left: (i) try to match the whole set of self-similar blocks of $I$ as a single multi-block (typically, global methods such as graph cuts do that implicitly); or (ii) remove any (probably wrong) response in the case where the stroboscopic effect is detected.
The first alternative would lead to errors anyway if the similar blocks do not have the same height, or if some of them are out of field in one of the images. Fortunately, stereo pair block-matching yields a straightforward adaptive threshold. A distance function $d$ between blocks being defined, let $q$ and $q'$ be points in the reference and secondary images respectively that are candidates to match with each other. The match of $q$ and $q'$ will be accepted if the following self-similarity (SS) condition is satisfied:

$$d(B_q, B_{q'}) < \min \{\, d(B_q, B_r) \mid r \in I \cap S(q) \,\}, \qquad (6)$$

where $S(q) = [q_1 - R, q_1 + R] \setminus \{q_1, q_1 + 1, q_1 - 1\}$ and $R$ is the search range. As noted earlier, the search for correspondences can be restricted to the epipolar line. This is why the automatic threshold is restricted to $S(q)$. The distance used in the self-similarity threshold is the sum of squared differences (SSD) of all the pixels in the block, and the block size is the same as the block size used for ACBM.

Computing the similarity of matches in one of the images is not a new idea in stereovision. In [20] the authors define the distinctiveness of an image point $q$ as the perceptual distance to the most similar point other than itself in the search window. In particular, they study the case of the auto-SSD function (Sum of Squared Differences computed in the same image). The flatness of the function indicates the expected match accuracy, and the height of the smallest minimum of the auto-SSD function besides the one at the origin gives the risk of mismatch. They are able to match ambiguous points correctly by matching intrinsic curves [39]. However, the proposed algorithm only accepts matches when their quality is above a certain threshold. The obtained disparity maps are rather sparse and the accepted matches are completely concentrated on the edges of the image.
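The self-similarity condition (6) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: images are nested lists of grey levels, the block half-width `half = 4` corresponds to 9 × 9 blocks, candidates $r$ whose block would leave the image are skipped (one possible reading of $r \in I \cap S(q)$), and every function name is hypothetical. The two synthetic images imitate the spirit of the texture and stripes experiment, with a secondary image obtained by a 2 pixel translation.

```python
import random


def ssd(img_a, col_a, img_b, col_b, row, half):
    """Sum of squared differences between two (2*half+1) x (2*half+1) blocks
    centred on the same row, at column col_a in img_a and col_b in img_b."""
    total = 0.0
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            diff = img_a[row + dr][col_a + dc] - img_b[row + dr][col_b + dc]
            total += diff * diff
    return total


def passes_self_similarity(ref, sec, row, col, col_match, R, half=4):
    """Condition (6): the candidate match must be strictly closer to B_q than
    every block of the reference image on the same row within the search
    range, excluding the columns col - 1, col and col + 1."""
    width = len(ref[0])
    d_match = ssd(ref, col, sec, col_match, row, half)
    rivals = [
        ssd(ref, col, ref, c, row, half)
        for c in range(col - R, col + R + 1)
        if abs(c - col) > 1 and half <= c < width - half  # r in I and in S(q)
    ]
    return not rivals or d_match < min(rivals)


def shifted(img, shift):
    """Secondary image: cyclic horizontal translation of img by `shift` pixels."""
    return [row[-shift:] + row[:-shift] for row in img]


rng = random.Random(0)
texture = [[rng.random() for _ in range(64)] for _ in range(32)]
stripes = [[float(c % 4) for c in range(64)] for _ in range(32)]  # period 4

# Point (row 16, column 32) of the reference image, candidate at column 34.
texture_ok = passes_self_similarity(texture, shifted(texture, 2), 16, 32, 34, R=8)
stripes_ok = passes_self_similarity(stripes, shifted(stripes, 2), 16, 32, 34, R=8)
```

In this sketch the textured block is accepted, because no other block on the same row is as close as the true match, while the striped block is rejected: an identical block exists one period away in the reference image, so the strict inequality in (6) fails.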
According to [34], the ambiguous correspondences should be rejected. In this work a new stability property is defined. This property is one condition a set of matches must satisfy to be considered unambiguous at a given confidence level. The stability constraint and the tuning of two parameters make it possible to handle flat or periodic autocorrelation functions. The comparison of this last algorithm with our results will be done in Section IV.

A. A Contrario vs Self-Similarity

Is the self-similarity (SS) threshold really necessary? One may wonder whether the a contrario decision rule to accept or reject correspondences between patches would be sufficient by itself. Conversely, is the self-similarity threshold enough to reject false matches in a correlation algorithm? This section addresses both questions and analyzes some simple examples that highlight the necessity and complementarity of both tests. For each example we compare the result of the a contrario test with the result of a classic correlation algorithm combined with the self-similarity threshold alone.

First consider two independent Gaussian noise images (Fig. 6). Obviously, we would like to reject any possible match between these two images. As expected (this is a sanity check!), the a contrario test rejects all the possible patch matches. On the other hand, the correlation algorithm combined with the self-similarity threshold is not sufficient: many false matches are accepted.

Fig. 6. (a) Reference noise image. (b) No match at all has been accepted by the a contrario test! (c) Many false correspondences have been accepted by the self-similarity threshold.

The second comparative test is about occlusions. If a point of the scene can be observed in only one of the images of the stereo pair, then an estimation of its disparity is simply impossible. The best decision is to reject its matches.
A good example to illustrate the performance of both rejection tests, ACBM and SS, is the Map image (Middlebury stereovision database, Fig. 7), which has a large baseline and therefore a large number of occluded pixels. ACBM gives again the best result (see Table I). The table indicates that the self-similarity test only removes a few additional points. Yet, even if the proportion of eliminated points is tiny, such mismatches can be very annoying and the gain is not negligible at all.

Fig. 7. (a) Reference image. (b) Secondary image. The rectangular object occludes part of the background. (c) The a contrario test does not accept any match for pixels in the occluded areas. (d) With the self-similarity threshold the disparity map is denser, but wrong disparities remain in the occluded region.

            Bad matches    Total matches
SS          3.35%          85.86%
ACBM        0.37%          64.85%
ACBM+SS     0.36%          64.87%

TABLE I. Quantitative comparison of several algorithms on Middlebury's Map image: the block-matching algorithm with the self-similarity threshold (SS), the a contrario algorithm (ACBM) and the algorithm combining both (ACBM+SS). The percentage of matches for each algorithm is computed in the whole image, and among these the percentage of wrong matches is also given. A match is considered wrong if its disparity differs from the ground truth disparity by more than one pixel.

The a contrario methodology cannot detect the ambiguity inherent in periodic patterns. Indeed, periodicity certainly does not occur "just by chance." The match between a window and another identical window on a building façade is obviously not fortuitous and is therefore legitimately accepted by an a contrario model. In this situation, the self-similarity test is necessary. A synthetic case has been considered in Fig. 8, where the correspondences accepted by the a contrario test are completely wrong for the repeated lines. On the contrary, the self-similarity threshold is able to reject matches in this region of the image.

Fig. 8. (a) Reference image with a texture and a periodic stripes motif. The secondary image is a 2 pixel translation of the reference image. The obtained disparity map should be a constant image with value 2. (b) The a contrario test gives the right disparity 2 everywhere, except in the stripes region. (c) The repeated stripes are locally similar, so the self-similarity threshold rejects all the patches in this region.

In short, ACBM and SS are both necessary and complementary. SS only removes a tiny additional number of errors, but even a few outliers can be very annoying in stereo. From now on, a possible match $(q, q')$ will therefore be accepted only if it is a meaningful match (ACBM test in Def. 6) and satisfies the SS condition given by (6).

IV. COMPARATIVE RESULTS

The algorithm parameters are identical for all experiments throughout this paper. The comparison window size is 9 × 9, the number of considered principal components is 9, and the number of quantized probabilities is 5. The previous section showed how the proposed method (ACBM+SS) deals with noise, occlusions and repeated structures. The detection method is also adapted to quasi-simultaneous stereo from aerial or satellite images, where moving objects (cars, pedestrians) are a serious disturbance.
Essentially, this is the same problem as the occlusion problem, except that a classical occlusion is caused by camera motion in the presence of a depth difference rather than by object motion. Figure 9 shows a stereo pair of images of the city of Marseille (France). In both cases, several cars have changed position between the two images. They are duly detected. The shadow regions, which contain more noise than signal, have also been rejected. We have also compared our results with Kolmogorov's graph cut implementation [16], which rejects incoherent matches a posteriori and labels them as occlusions. In these examples, graph cuts is able to reject some mismatches due to the moving objects in the scene, but a lot of conspicuous errors remain in the final disparity map. Likewise, OpenCV's stereo matching algorithm [8] fails completely on this kind of pair, even though it obtains correct results in simpler examples like the one in Figure 7.

The proposed algorithm will now be compared with the non-dense algorithms of [34], [40], [41] and [24], whose aims are comparable. All of these papers have published experimental results on the first Middlebury dataset [35] (Tsukuba, Sawtooth, Venus and Map pairs of images), on the non-occluded mask. These four algorithms compute sparse disparity maps and propose techniques for rejecting unreliable pixels. We also show some additional comparisons with the block matching method implemented in the OpenCV library version 2.2.0 [8], because it is possibly the most widely used one since it comes close to real-time performance.

The authors of [24] compute an initial classic correlation disparity map and select correct matches based on the support these pixels receive from their neighboring candidate matches in 3D after tensor voting.
3D points are grouped into smooth surfaces using color and geometric information, and the points which are inconsistent with the surface color distribution are removed. The rejection of wrong pixels is not complete, because the algorithm fails when some objects appear only in one image, or when occluded surfaces change orientation. A variation of the critical rejection parameters can lead to quite different results.

Fig. 9. From top to bottom: reference image, secondary image, ACBM+SS disparity map, graph cuts disparity map, and OpenCV disparity map. In our disparity map, red points are points which have not been matched. Notice that patches containing a moved car or bus have not been matched. Poorly textured regions (shadows) where noise dominates have also been rejected. Red points in the graph cuts disparity map are rejected a posteriori and considered as occlusions. The graph cuts disparity map is denser and smoother, but several mismatches appear in the low textured areas and in regions with moving objects.

The method of [40] detects and matches so-called "dense features," which consist of a connected set of pixels in the left image and a corresponding set of pixels in the right image such that the intensity edges on the boundary of these sets are stronger than their matching error on the boundary (which is the absolute intensity difference between corresponding boundary pixels). They call this the "boundary condition." The idea is that even the boundary of a non-textured region can give a correspondence. Then, each dense feature is associated with a disparity. The main limitation is the way dense features are extracted: a local algorithm processes each scan line independently from the others. As a result, top and bottom boundaries are lost.
On the contrary, [41] uses graph cuts to extract "dense features" (which of course does not necessarily imply a dense disparity map), thus enforcing the boundary conditions. The results in [40] are rather dense and the error rate is one of the most competitive ones. Yet these good results are also due to the particularly well adapted structure of the benchmark. Indeed, the Sawtooth, Venus and Map scenes consist of piecewise planar surfaces, with almost fronto-parallel surface patches. The ground truth of Tsukuba is a piecewise constant disparity map with six different disparities.

Table II summarizes the percentage of matched pixels (density) and the percentage of mismatches (where the estimated disparity differs by more than one pixel from the ground truth). This table reports first the result of ACBM+SS, whose error rate is very small and which yields larger match densities than Sara's results [34]. To compare with other algorithms yielding denser disparity maps, the results of ACBM+SS have been densified by the most straightforward proximal interpolation (a 3 × 3 spatial median filter). Doing this, the match density rises significantly while the error rates remain small. Still, large regions containing poor textures, typically shadows in aerial imaging, are impossible to fill in because they contain no information at all. Besides the algorithms compared in Table II, [14] also published non-dense results for the Tsukuba image (error rate of 2.1% with a density of 45%), but since non-dense results on the other images are not published it does not appear in our table.

Fig. 10 compares the ACBM+SS results with OpenCV, graph cuts and Sara's published results on the classic CMU Shrub pair (footnote 2). Sara's disparity map has several mismatches and the ACBM+SS results are obviously denser.
On the other hand, Kolmogorov's graph cut implementation is denser but the mismatches have risen considerably. OpenCV's disparity map is less dense than Kolmogorov's and denser than Sara's, but it also has the highest number of wrong matches. So the proposed algorithm ACBM+SS has a better trade-off between density and mismatches. In the Kolmogorov graph cuts implementation the occlusions are detected, providing a non-dense disparity map. It is clear that detecting occlusions in real images is not enough to avoid mismatches.

Another example is shown in Fig. 11, where the almost dense disparity map obtained with graph cuts is compared with the ACBM+SS disparity map. Graph cuts assigns a completely wrong disparity to the top left of the image: the sky and the tree branches are clearly not at the same depth in the scene. This type of error is unavoidable with global methods. The depth of the smooth sky is inherently ambiguous. Through the minimization process it inherits the depth of the twigs through which it is seen.

An interesting question arises out of the comparative results about the error/density duality. We have seen that our algorithm gives very low error percentages with densities between 40% and 90%. The parameter $\varepsilon$ can be increased, but then the error rate will rise. Our goal is to match the points between two images with high reliability and to reject any possible false match. So the choice of one expected false alarm ($\varepsilon = 1$) is a conservative choice, but it ensures a very small error percentage.

Footnote 2: http://vasc.ri.cmu.edu/idb/html/jisct

Discussion on the other parameters: We have mentioned that the number of considered principal components $N$ and the number of quantized probabilities $Q$ can be increased without noticeable alteration of the results.
In practice, the two values are set (for computational reasons) to the minimal values that do not affect the quality of the result. They are fixed once and for all to $N = 9$ and $Q = 5$ respectively. Another parameter is the search region size $(2R + 1)$, but it is easy to choose since we only need $R$ to be larger than the largest disparity in the image, which is a classic assumption in stereovision algorithms (in practice $R$ can be estimated from the sparse matching of interest points that was previously obtained for the epipolar rectification step). Finally, the last parameter is the size of the block. We know that very small blocks are affected by image noise but, at the same time, the bigger the block, the bigger the fattening error (also named adhesion error). This error becomes apparent at the object borders of the scene, causing a dilation of their real size which is proportional to the block size. The fattening phenomenon is not the object of this paper, but different solutions have already been suggested to avoid it [5]. Fixing the size of the block to 9 × 9 seems to be a good compromise between noise and fattening for realistic SNR conditions, ranging from 200 to 20 (the SNR is measured as the ratio between the average grey level and the noise standard deviation).

Computational time: For the sake of computational speed, the PCA basis is learnt beforehand on a set of representative images and stored once and for all (footnote 3). Then, this basis is used to compute all image coefficients. Notice that only the image coefficients of the second image need to be sorted in order to compute the resemblance probability between all possible matches. With our implementation, which is still not highly optimized for speed, an experiment with a pair of images of size 512 × 512 and disparity range [-5, 5] takes 4.5 seconds running on a 2.4 GHz Intel Core 2 Duo processor.
A similar experiment with the OpenCV stereo algorithm takes between 5 and 500 milliseconds. This is much closer to real-time requirements, but the results are also much more data-dependent: they are good in easy examples like the Middlebury pair, but much less dense and less reliable than those of our method in more difficult scenes like Shrub, Marseille or even the stereo pairs provided with OpenCV.

Footnote 3: In our experience the (computationally intensive) choice of this basis does not significantly affect the results, but the (computationally fast) learning of marginal distributions for a particular image on this basis does.

                              Tsukuba            Sawtooth           Venus              Map
Method                        Error(%) Dens.(%)  Error(%) Dens.(%)  Error(%) Dens.(%)  Error(%) Dens.(%)
ACBM+SS                       0.31     45.6      0.09     65.7      0.02     54.1      0.0      84.8
ACBM+SS + Median filter       0.33     54.3      0.14     77.9      0.0      66.6      0.0      93.0
Sara [34]                     1.4      45        1.6      52        0.8      40        0.3      74
Veksler 02 [40]               0.38     66        1.62     76        1.83     68        0.22     87
Veksler 03 [41]               0.36     75        0.54     87        0.16     73        0.01     87
Mordohai and Medioni [24]     1.18     74.5      0.27     78.4      0.20     74.1      0.08     94.2

TABLE II. Quantitative results on the first Middlebury benchmark dataset. The error statistics are computed on the mask of non-occluded pixels. Any error larger than 1 pixel is considered a mismatch. ACBM+SS obtains fewer mismatches in all four images.

Fig. 10. CMU Shrub scene. (a) and (b) Reference (left) and secondary (right) images. (c) Method of Sara [34]. Red points are rejected. Density: 24%. (d) Kolmogorov's graph cuts [16]. Red points are points detected as occlusions. Density: 77%. (e) Proposed algorithm ACBM+SS. Red points are rejected points. Density: 42%. (f) OpenCV SGBM. Sara's disparity map has a lower density and has several evident mismatches. Kolmogorov's disparity map is denser but has many obvious errors. The block matching algorithm included in OpenCV is also not very dense and contains many errors. It is only provided as a reference of what can be easily obtained with a freely available quasi-real-time block matching algorithm.

V. CONCLUSION

In our opinion, the absence of reliable thresholds to reject wrong matches was the principal drawback of block-matching algorithms in stereovision. The a contrario block-matching threshold, which was the principal object of the present paper, combined with the self-similarity threshold is able to detect mismatches systematically, by an algorithm which is essentially parameter-free. Indeed, the only user parameter is the expected number of false matches, which can be fixed once and for all in most applications. The method indiscriminately detects occlusions, moving objects and poorly textured or periodic regions.

Mismatches in block-matching have led to the overall dominance of global energy methods. However, global methods have no validation procedure, and the proposed a contrario method must be viewed as a validation procedure, no matter what the stereo matching process was. Block-matching, together with the reliability thresholds established in this paper, gives a fairly dense set of reliable matches (usually from 50% to 80%).

Fig. 11. Flower-garden scene. (a) and (b) Reference and secondary images. (c) Graph cuts (method of [16]). Red points are occluded points. (d) ACBM+SS. Red points are rejected points. Density: 59%. Most rejected points are obviously mismatched by the graph cut algorithm, which equates the depths of trees, sky and house.

It may be objected that the obtained disparity map is not dense. This objection is not crucial for two reasons. First, having
only validated matches opens the path to benchmarks based on accuracy, and raises the challenge of determining which precision can ultimately be attained (on validated matches only). Second, knowing which matches are reliable allows one to complete a given disparity map by fusing several stereo pairs. Since having multiple observations of the same scene by several cameras and/or at several different times is by now a common setting, it becomes more and more important to be able to fuse 3D information obtained from many stereo pairs. Having almost only reliable matches in each pair promises an easy fusion. A straightforward solution in our case would be the following: given $m > 2$ images, the disparity map between each possible pair of images is computed with ACBM+SS. Then the final disparity map is the accumulated disparity map considering all meaningful matches computed with all the image pairs, whenever all the computed disparities for the same pixel are coherent.

VI. ACKNOWLEDGEMENTS

The authors thank Pascal Getreuer for helpful comments on this work. Work partially supported by the following projects: FREEDOM (ANR07-JCJC-0048-01), Callisto (ANR-09-CORD-003), ECOS Sud U06E01 and STIC Amsud (11STIC-01 - MMVPSCV).

REFERENCES

[1] M. Z. Brown, D. Burschka, and G. D. Hager. Advances in computational stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8):993–1008, 2003.
[2] N. Burrus, T. M. Bernard, and J.-M. Jolion. Image segmentation by a contrario simulation. Pattern Recognition, 42(7):1520–1532, July 2009.
[3] F. Cao. Application of the Gestalt principles to the detection of good continuations and corners in image level lines. Computing and Visualisation in Science, 7:3–13, 2004.
[4] F. Cao, J.-L. Lisani, J.-M. Morel, P. Musé, and F. Sur. A Theory of Shape Identification. Springer, 2008.
[5] J. Delon and B. Rougé. Small baseline stereovision. Journal of Mathematical Imaging and Vision, 28(3):209–223, 2007.
[6] A. Desolneux, L. Moisan, and J.-M. Morel. From Gestalt Theory to Image Analysis. A Probabilistic Approach. Springer, 2007.
[7] G. Egnal and R. P. Wildes. Detecting binocular half-occlusions: Empirical comparisons of five approaches. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8):1127–1133, 2002.
[8] Victor Eruhimov. stereo_match.cpp sample program. Provided with OpenCV software library version 2.2.0, http://opencv.willowgarage.com/wiki/, January 2010.
[9] S. Forstmann, Y. Kanou, J. Ohya, S. Thuering, and A. Schmitt. Real-time stereo by using dynamic programming. In Conference on Computer Vision and Pattern Recognition Workshop, 3:29–36, 2004.
[10] W. E. L. Grimson and D. P. Huttenlocher. On the verification of hypothesized matches in model-based recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(12), 1991.
[11] C. G. Harris and M. Stephens. A combined corner and edge detector. 4th Alvey Vision Conference, pages 147–151, 1988.
[12] L. Igual, J. Preciozzi, L. Garrido, A. Almansa, V. Caselles, and B. Rougé. Automatic low baseline stereo in urban areas. Inverse Problems and Imaging, 1(2):319–348, 2007.
[13] H. Ishikawa. Exact optimization for Markov random fields with convex priors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10):1333–1336, 2003.
[14] S. B. Kang, R. Szeliski, and J. Chai. Handling occlusions in dense multi-view stereo. Computer Vision and Pattern Recognition, 1:103–110, 2001.
[15] A. Klaus, M. Sormann, and K. Karner. Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure. In Proceedings of the International Conference on Pattern Recognition, pages 15–18, 2006.
[16] V. Kolmogorov and R. Zabih. Graph Cut Algorithms for Binocular Stereo with Occlusions. Mathematical Models in Computer Vision: The Handbook, Springer-Verlag, 2005.
[17] D. Lowe. Perceptual Organization and Visual Recognition. Kluwer Academic Publishers, 1985.
[18] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[19] J. Mairal, M. Elad, and G. Sapiro. Sparse representation for color image restoration. IEEE Transactions on Image Processing, 17(1):53–69, 2008.
[20] R. Manduchi and C. Tomasi. Distinctiveness maps for image matching. Proceedings of the International Conference on Image Analysis and Processing, pages 26–31, 1999.
[21] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Conference on Computer Vision and Pattern Recognition, 2:257–263, 2003.
[22] A. Mittal and N. Paragios. Motion-based background subtraction using adaptive kernel density estimation. Computer Vision and Pattern Recognition, 2:302–309, 2004.
[23] L. Moisan and B. Stival. A probabilistic criterion to detect rigid point matches between two images and estimate the fundamental matrix. International Journal of Computer Vision, 57(3):201–218, 2004.
[24] P. Mordohai and G. Medioni. Stereo using monocular cues within the tensor voting framework. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:968–982, 2006.
[25] P. Musé, F. Sur, F. Cao, Y. Gousseau, and J.-M. Morel. An a contrario decision method for shape element recognition. International Journal of Computer Vision, 69(3):295–315, 2006.
[26] P. Musé, F. Sur, and J.-M. Morel. Sur les seuils de reconnaissance des formes. Traitement du Signal, 20(3):279–294, 2003.
[27] G. Née, S. Jehan-Besson, L. Brun, and M. Revenu. Significance tests and statistical inequalities for region matching. Structural, Syntactic, and Statistical Pattern Recognition, pages 350–360, 2008.
[28] Y. Ohta and T. Kanade. Stereo by intra- and inter-scanline search using dynamic programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(2):139–154, 1985.
[29] K. A. Patwardhan, G. Sapiro, and V. Morellas. Robust foreground detection in video using pixel layers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(4):746–751, 2008.
[30] T. Pock, T. Schoenemann, G. Graber, H. Bischof, and D. Cremers. A convex formulation of continuous multi-label problems. In Proceedings of the European Conference on Computer Vision, pages 792–805, Berlin, Heidelberg, 2008. Springer-Verlag.
[31] J. Rabin, J. Delon, and Y. Gousseau. A contrario matching of SIFT-like descriptors. In International Conference on Pattern Recognition, 2008.
[32] A. Robin, L. Moisan, and S. Le Hégarat-Mascle. An a-contrario approach for sub-pixel change detection in satellite imagery. Technical Report MAP5 Nro. 2009-15, Université Paris Descartes, 2009.
[33] N. Sabater, J.-M. Morel, A. Almansa, and G. Blanchet. Discarding moving objects in quasi-simultaneous stereovision. In IEEE International Conference on Image Processing, ICIP, 2010.
[34] R. Sara. Finding the largest unambiguous component of stereo matching. In Proceedings of the European Conference on Computer Vision-Part III, pages 900–914. Springer-Verlag, 2002.
[35] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1/2/3):7–42, 2002.
[36] C. Schmid and A. Zisserman. The geometry and matching of lines and curves over multiple views. International Journal of Computer Vision, 40(3):199–234, 2000.
[37] C. E. Shannon. A mathematical theory of communication.
ACM SIGMOBILE Mobile Computing and Communicat ions Revie w , 5(1):3– 55, 2001. [38] C.V . Stewa rt. Minpran: A new robust estimator for comput er vi- sion. IEE E T ransac tions on P attern Analysis and Mach ine Intell igenc e , 17(10):925 –938, 1995. [39] C. T omasi and R. Manduchi. Stereo matching as a nearest-nei ghbor problem. IEEE T ransaction on P attern Analysis and Machine Intelli - genc e , 20(3):333–340, 1998. [40] O. V eksler . Dense features for semi-dense stere o correspon dence. Internati onal J ournal of Computer V ision , 47(1-3):2 47–260, 2002. [41] O. V eksler . E xtract ing dense feature s for visual correspondenc e with graph cuts. In Computer V ision and P attern Recogniti on , volu me 1, pages 689–694, 2003. [42] Q. Y ang, L. W ang, R. Y ang, H. Ste wenius, and D. Nister . Stereo matching with color-weigh ted correlat ion, hierachic al belief propagati on and occlusion handling. In P r oceedi ngs of the IEEE Confer ence on Computer V ision and P attern Recog nition , pages 2347–2354, 2006.
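
Supplementary sketch. The multi-pair fusion outlined in the concluding remarks (accumulate the meaningful matches of every image pair, keeping a pixel only when all its computed disparities are coherent) could be illustrated as follows. This is a minimal sketch and not the authors' implementation. It assumes that each pair's ACBM+SS output has already been expressed as a disparity map in a common reference frame, with `None` marking pixels without a meaningful match. The function name `fuse_disparity_maps`, the coherence tolerance `tol`, and the choice of the mean as the fused value are assumptions introduced here for illustration only.

```python
from statistics import mean


def fuse_disparity_maps(maps, tol=1.0):
    """Fuse sparse disparity maps given as equally sized lists of rows.

    None marks a pixel with no validated (meaningful) match in that map.
    A fused pixel is kept only if at least one map validates it and all
    validated disparities differ by at most `tol` (hypothetical tolerance).
    """
    height, width = len(maps[0]), len(maps[0][0])
    fused = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the disparities validated at this pixel across all pairs.
            values = [m[y][x] for m in maps if m[y][x] is not None]
            # Keep the pixel only when the validated disparities are coherent.
            if values and max(values) - min(values) <= tol:
                fused[y][x] = mean(values)  # assumed fusion rule: average
    return fused


# Toy example with two 2x2 maps (values are illustrative, not measured data).
map_a = [[1.0, None], [2.0, 5.0]]
map_b = [[1.5, None], [4.0, None]]
fused = fuse_disparity_maps([map_a, map_b], tol=1.0)
```

In this toy example the top-left pixel is fused because both maps agree within the tolerance, the bottom-left pixel is rejected because the two disparities are incoherent, and the bottom-right pixel is completed from the single map that validated it. Rejecting incoherent pixels rather than voting keeps the fused map restricted to reliable matches, in the spirit of the text above; other fusion rules are possible.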