Learning an Interactive Segmentation System


Authors: Hannes Nickisch, Pushmeet Kohli, Carsten Rother

Hannes Nickisch, Max Planck Institute for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany (hn@tue.mpg.de)
Pushmeet Kohli, Microsoft Research Cambridge, 7 J J Thomson Avenue, Cambridge, CB3 0FB (pkohli@microsoft.com)
Carsten Rother, Microsoft Research Cambridge, 7 J J Thomson Avenue, Cambridge, CB3 0FB (carrot@microsoft.com)

Abstract

Many successful applications of computer vision to image or video manipulation are interactive by nature. However, parameters of such systems are often trained neglecting the user. Traditionally, interactive systems have been treated in the same manner as their fully automatic counterparts: their performance is evaluated by computing the accuracy of their solutions under some fixed set of user interactions. This paper proposes a new evaluation and learning method which brings the user in the loop. It is based on the use of an active robot user, a simulated model of a human user. We show how this approach can be used to evaluate and learn parameters of state-of-the-art interactive segmentation systems. We also show how simulated user models can be integrated into the popular max-margin method for parameter learning, and propose an algorithm to solve the resulting optimisation problem.

1. Introduction

Problems in computer vision are known to be extremely hard, and very few fully automatic vision systems exist which have been shown to be accurate and robust under all sorts of challenging inputs. In the past, these conditions confined most vision algorithms to the laboratory environment. The last decade, however, has seen computer vision finally come out of the research lab and into the real-world consumer market.
This great sea change has occurred primarily on the back of the development of a number of interactive systems which allow users to help the vision algorithm reach the correct solution by giving hints. Some successful examples are systems for image and video manipulation, and interactive 3D reconstruction tasks. Image stitching and interactive image segmentation are two of the most popular applications in this area. Understandably, interest in interactive vision systems has grown in the last few years, which has led to a number of workshops and special sessions in vision, graphics, and user-interface conferences¹.

The performance of an interactive system strongly depends on a number of factors, one of the most crucial being the user. This user dependence makes interactive systems quite different from their fully automatic counterparts, especially when it comes to learning and evaluation. Surprisingly, there has been little work in computer vision or machine learning devoted to learning interactive systems. This paper tries to bridge this gap.

We choose interactive image segmentation to demonstrate the efficacy of the ideas presented in the paper. However, the theory is general and can be used in the context of any interactive system. Interactive segmentation aims to separate a part of the image (an object of interest) from the rest. It is treated as a binary classification problem where each pixel is assigned one of two labels: foreground (fg) or background (bg). The interaction comes in the form of sets of pixels marked by the user, with the help of brushes, as belonging either to fg or bg. We will refer to each such user interaction as a brush stroke.

This work addresses two questions: (1) How to evaluate any given interactive segmentation system? (2) How to learn the best interactive segmentation system? Observe that an answer to the first question gives us an answer to the second.
One may imagine a learning algorithm generating a number of possible segmentation systems, for instance by changing parameter values of the segmentation algorithm. We can then evaluate all such systems and output the best one. We demonstrate the efficacy of our evaluation methods by learning the parameters of a state-of-the-art system for interactive image segmentation and its variants. We then go further and show how the max-margin method for learning parameters of fully automated structured prediction models can be extended to do learning with the user in the loop.

To summarize, the contributions of this paper are: (1) the study of the problems of evaluating and learning interactive systems; (2) the proposal and use of a user model for evaluating and learning interactive systems; (3) the first thorough comparison of state-of-the-art segmentation algorithms under an explicit user model; (4) a new algorithm for max-margin learning with the user in the loop.

Organization of the paper. In Section 2, we discuss the problem of system evaluation. In Section 3, we give details of our problem setting and explain the segmentation systems we use for our evaluation. Section 4 explains the naïve line-search method for learning segmentation system parameters. In Section 5, we show how the max-margin framework for structured prediction can be extended to handle interactions, and show some basic results. The conclusions are given in Section 6.

¹ e.g. ICV07 and NIPS09.

2. Evaluating Interactive Systems

Performance evaluation is one of the most important problems in the development of real-world systems. There are two choices to be made: (1) the data sets on which the system will be tested, and (2) the quality or accuracy measure. Traditional computer vision and machine learning systems are evaluated on preselected training and test data sets.
For instance, in automatic object recognition, one minimizes the number of misclassified pixels on datasets such as PASCAL VOC [7]. In an interactive system, these choices are much harder to make because of the presence of an active user in the loop. Users behave differently, prefer different interactions, may have different error tolerances, and may also learn over time. The true objective function of an interactive system, although intuitive, is hard to express analytically: the user wants to achieve a satisfying result easily and quickly. We will now discuss a number of possible solutions, some of which are well known in the literature.

2.1. Static User Interactions

This is one of the most commonly used methods in papers on interactive image segmentation [4, 18, 6]. It uses a fixed set of user-made interactions (brush strokes) associated with each image of the dataset. These strokes are mostly chosen by the researchers themselves and are encoded as image trimaps: pixel assignments with foreground, background, and unknown labels (see Figure 2b). The system to be evaluated is given these trimaps as input, and its accuracy is measured by computing the Hamming distance between the obtained result and the ground truth. This scheme of evaluation does not consider how users may change their interaction by observing the current segmentation results. Evaluation and learning methods which work with a fixed set of interactions will be referred to as static in the rest of the paper.

Although the static evaluation method is easy to use, it suffers from a number of problems: (1) The fixed interactions might be very different from the ones made by actual users of the system. (2) Different systems prefer different types of user hints (interaction strokes), and thus a fixed set of hints might not be a good way of comparing two competing segmentation systems.
For instance, geodesic distance based approaches [3, 9, 18] prefer brush strokes which are equidistant from the segmentation boundary, as opposed to graph cut based approaches [5, 16]. (3) The evaluation does not take into account how the accuracy of the results improves with more user strokes. For instance, one system might need only a single user interaction to reach the ground truth result, while another might need many interactions to get the same result; still, both systems will have equal performance under this scheme. These problems make static evaluation a poor tool for judging the relative performance of newly proposed segmentation systems.

2.2. User Studies

A user study involves the system being given to a group of participants who are required to use it to solve a set of tasks. The system which is easiest to use and yields the correct segmentation in the least amount of time is considered the best. Examples are [13], where a full user study has been done, or [3], where an advanced user has done the optimal job with each system for a few images.

While overcoming most of the problems of a static evaluation, we have introduced new ones: (1) User studies are expensive and need a large number of participants to be of statistical significance. (2) Participants need to be given enough time to familiarize themselves with the system. For instance, an average driver steering a Formula 1 car for the first time might be no faster than with a normal car; however, after gaining experience with the car, one would expect the driver to be much faster. (3) Each system has to be evaluated independently by participants, which makes it infeasible to use this scheme in a learning scenario where we are trying to find the optimal parameters of the segmentation system among thousands or millions of possible ones.

2.3. Evaluation using Crowdsourcing

Crowdsourcing has attracted a lot of interest in the machine learning and computer vision communities. This is primarily due to the success of a number of money [19], reputation [24], and community [17] based incentive schemes for collecting training data from users on the web. Crowdsourcing has the potential to be an excellent platform for evaluating interactive vision systems such as those for image segmentation. One could imagine asking Mechanical Turk [1] users to cut out different objects in images with different systems; the one requiring the least number of interactions on average might be considered the best. However, this approach too suffers from a number of problems, such as fraud prevention. Furthermore, as in user studies, it cannot be used for learning in light of thousands or even millions of systems.

Table 1: Comparison of methods for interactive learning.

Method          | user in loop | user can learn | interaction model | effort | parameter learning | time | price
User model      | yes          | yes            | yes               | yes    | this paper         | fast | low
Crowdsourcing   | yes          | yes            | yes               | yes    | conceivable        | slow | a bit
User study      | yes          | yes            | yes               | yes    | infeasible         | slow | very high
Static learning | no           | no             | no                | no     | used so far        | fast | very low

2.4. Evaluation with an Active User Model

In this paper we propose a new evaluation methodology which overcomes most of the problems described above. Instead of using a fixed set of interactions, or an army of human participants, our method only needs a model of user interactions. This model is a simple algorithm which, given the current segmentation and the ground truth, outputs the next user interaction. This user model can be coded up using simple rules, such as "give a hint in the middle of the largest wrongly labelled region in the current solution", or alternatively can be learnt directly from the interaction logs obtained from interactive systems deployed in the market.
There are many similarities between the problem of learning a user model and the learning of an agent policy in reinforcement learning; thus, one may exploit reinforcement learning methods for this task. The pros and cons of the evaluation schemes are summarized in Table 1.

3. Image Segmentation: Problem Setting

3.1. The Database

We use the publicly available GrabCut database of 50 images, in which ground truth segmentations are known [2]. In order to perform large-scale testing and comparison, we down-scaled all images to have a maximum size of 241 × 161 while keeping the original aspect ratio². For each image, we created two different static user inputs: (1) a "static trimap", computed by dilating and eroding the ground truth segmentation by 7 pixels³; (2) a "static brush", consisting of a few user-made brush strokes which very roughly indicate foreground and background. We used on average about 4 strokes per image (the magenta and cyan strokes in Fig. 2c give an example). All this data is visualized in Figure 1. Note that in Sec. 3.3 we will describe a third, "dynamic trimap", called the robot user, where we simulate the user.

Figure 1: (a) Images x_k; (b) user trimaps u_k; (c) tight trimaps u_k. We took the 50 GrabCut images [2] with given ground truth segmentations (coded as black/white). We considered two kinds of user inputs (coded as red/blue): user-defined strokes and tight trimaps generated by eroding the ground truth segmentation. The user strokes were drawn by only looking at the ground truth segmentation y_k and ignoring the image x_k.

² We confirmed by visual inspection that the quality of segmentation results is not affected by this down-scaling operation.
³ This kind of input is used by most systems for both comparison to competitors and learning of parameters, e.g. [4, 18].

3.2. The Segmentation Systems

We now describe the 4 different interactive segmentation systems we use in the paper.
These are "GrabCutSimple" (GCS), "GrabCut" (GC), "GrabCutAdvanced" (GCA), and "GeodesicDistance" (GEO).

GEO is a very simple system. We first learn Gaussian Mixture Model (GMM) based color models for fg/bg from the user-made brush strokes. We then simply compute the shortest path in the likelihood ratio image, as described in [3], to get a segmentation. The other three systems are all built on graph cut. They all work by minimizing the energy function

  E(y) = Σ_{p∈V} E_p(y_p) + Σ_{(p,q)∈E} E_pq(y_p, y_q).   (1)

Here (V, E) is an undirected graph whose nodes correspond to pixels, and y_p ∈ {0, 1} is the segmentation label of image pixel p with color x_p, where 0 and 1 correspond to the background and the foreground respectively. We define (V, E) to be an 8-connected 2D grid graph.

The unary terms are computed as follows. A probabilistic model is learnt for the colors of background (y_p = 0) and foreground (y_p = 1) using two different GMMs, Pr(x | 0) and Pr(x | 1). E_p(y_p) is then computed as −log(Pr(x_p | y_p)), where x_p contains the three color channels of pixel p. An important concept of GrabCut [16] is to update the color models based on the whole segmentation; in practice we use a few iterations.

The pairwise term incorporates both an Ising prior and a contrast-dependent component and is computed as

  E_pq(y_p, y_q) = |y_q − y_p| / dist(p, q) · ( w_i + w_c · exp[ −β ||x_p − x_q||² ] ),

where w_i and w_c are weights for the Ising and contrast-dependent pairwise terms respectively, and β = 0.5 · w_β / ⟨||x_p − x_q||²⟩, where ⟨·⟩ denotes expectation over an image sample [16]. We can scale β with the parameter w_β.

To summarize, the models have two linear free parameters, w_i and w_c, and a single non-linear one, w_β. The system GC minimizes the energy defined above and is pretty close to the original GrabCut system [16].
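To make the pairwise term concrete, the contrast-sensitive edge weights for the 8-connected grid can be sketched as follows. This is a minimal NumPy sketch; the function and helper names are ours, and the default parameter values are arbitrary placeholders rather than the learnt ones:

```python
import numpy as np

def shifted_sq_diff(img, dy, dx):
    """Squared colour difference ||x_p - x_q||^2 between each pixel p
    and its neighbour q at offset (dy, dx), over the valid region."""
    H, W = img.shape[:2]
    a = img[max(dy, 0):H + min(dy, 0), max(dx, 0):W + min(dx, 0)]
    b = img[max(-dy, 0):H + min(-dy, 0), max(-dx, 0):W + min(-dx, 0)]
    return ((a - b) ** 2).sum(axis=-1)

def pairwise_weights(img, w_i=1.0, w_c=0.5, w_beta=1.0):
    """Edge weights (w_i + w_c * exp(-beta * ||x_p - x_q||^2)) / dist(p, q)
    for the four unique neighbour offsets of an 8-connected grid; the
    weight is paid whenever y_p != y_q."""
    offsets = [(0, 1), (1, 0), (1, 1), (1, -1)]  # right, down, two diagonals
    diffs = {o: shifted_sq_diff(img, *o) for o in offsets}
    # beta = 0.5 * w_beta / <||x_p - x_q||^2>, expectation over all edges
    all_d = np.concatenate([d.ravel() for d in diffs.values()])
    beta = 0.5 * w_beta / max(all_d.mean(), 1e-12)
    weights = {}
    for (dy, dx), d in diffs.items():
        dist = np.hypot(dy, dx)  # 1 for axis edges, sqrt(2) for diagonals
        weights[(dy, dx)] = (w_i + w_c * np.exp(-beta * d)) / dist
    return weights
```

On a flat image every edge costs (w_i + w_c)/dist(p, q); across a strong colour edge the contrast term decays, making it cheaper for the segmentation boundary to follow image edges.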
GrabCutSimple (GCS) is a simplified version, where the color models (and unary terms) are fixed up front; they are learnt from the initial user brush strokes (see Sec. 3.2) only. GCS will be used in max-margin learning and to check the active user model, but it is not considered a practical system.

Finally, GrabCutAdvanced (GCA) is an advanced GrabCut system performing considerably better than GC. Inspired by recent work [14], foreground regions are 4-connected to a user-made brush stroke to avoid deserted foreground islands. Unfortunately, such a notion of connectivity leads to an NP-hard problem, and various solutions have been suggested [23, 15]. However, all of these are either very slow and operate on super-pixels [15] or have a very different interaction mechanism [23]. We simply remove deserted foreground islands in a postprocessing step.

3.3. The Robot User

We now describe the different active user models tested and deployed by us. Given the ground truth segmentation y_k and the current segmentation solution y, the active user model is a policy s : (x_k, y_k, u_{k,t}, y) ↦ u_{k,t+1} which specifies which brush stroke to place next. Here, u_{k,t} denotes the user interaction history of image x_k up to time t. We have investigated various options for this policy: (1) brush strokes at random image positions (random); (2) brush strokes in the middle of the wrongly labelled region (centre). For the second strategy, we find the largest connected region of the binary mask given by the absolute difference between the current segmentation and the ground truth. We then mark a circular brush stroke at the pixel which is inside this region and furthest away from the boundary. This is motivated by the observation that users tend to find it hard to mark pixels at the boundary of an object, because they have to be very precise.

We also tested user models which take the segmentation algorithm explicitly into account.
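The "centre" policy, brushing the pixel of the largest wrongly labelled region that lies furthest from that region's boundary, can be sketched as follows. This is our own minimal implementation: it uses 4-connected BFS distances as a stand-in for an exact Euclidean distance to the boundary, and all helper names are ours:

```python
import numpy as np
from collections import deque

def _nbrs(p):
    y, x = p
    return ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))

def _regions(mask):
    """4-connected components of a boolean mask (plain BFS, no SciPy)."""
    H, W = mask.shape
    seen = np.zeros((H, W), dtype=bool)
    comps = []
    for sy in range(H):
        for sx in range(W):
            if mask[sy, sx] and not seen[sy, sx]:
                comp, queue = [], deque([(sy, sx)])
                seen[sy, sx] = True
                while queue:
                    p = queue.popleft()
                    comp.append(p)
                    for ny, nx in _nbrs(p):
                        if 0 <= ny < H and 0 <= nx < W \
                                and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                comps.append(comp)
    return comps

def centre_brush(current_seg, ground_truth):
    """'Centre' robot-user policy: return the centre of the next brush,
    i.e. the pixel of the largest wrongly labelled region furthest from
    that region's boundary, plus the label the brush should paint."""
    gt = np.asarray(ground_truth)
    wrong = np.asarray(current_seg) != gt
    comps = _regions(wrong)
    if not comps:
        return None  # segmentation already matches the ground truth
    comp = max(comps, key=len)           # largest wrongly labelled region
    inside = set(comp)
    # multi-source BFS inwards from the region's boundary pixels
    dist = {p: 0 for p in comp if any(nb not in inside for nb in _nbrs(p))}
    queue = deque(dist)
    while queue:
        p = queue.popleft()
        for nb in _nbrs(p):
            if nb in inside and nb not in dist:
                dist[nb] = dist[p] + 1
                queue.append(nb)
    centre = max(dist, key=dist.get)     # innermost pixel of the region
    return centre, bool(gt[centre])
```

A driver loop would paint a circular stroke of the ground-truth label around the returned position, re-run the segmentation system, and repeat until the error is small or the brush budget is exhausted.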
Such models are analogous to users who have learnt how the segmentation algorithm works and thus interact with it accordingly. We consider user models which mark a circular brush stroke at the pixel (1) with the lowest min-marginal (sensit); (2) which results in the largest change in labeling (roi size); (3) which decreases the Hamming error by the biggest amount (Hamming). For the last one, we consider each pixel as the circle center and choose the one where the Hamming error decreases most. This is very expensive, but in some respects is the best solution⁴. "Hamming" acts as a very "advanced user", who knows exactly which interactions (brush strokes) will reduce the error by the largest amount. It is quite questionable whether a real user is actually able to find the optimal position, and a user study might be needed. On the other hand, the "centre" user model behaves like a "novice user".

Fig. 2c shows the result of a robot user interaction, where cyan and magenta are the initial fixed brush strokes (the "static brush trimap"), and the red and blue dots are the robot user interactions. The robot sets brushes of a maximum fixed size (here 4 pixel radius). Away from the true object boundary, the maximum brush size is used; at the boundary, the brush size is scaled down in order to avoid that the brush straddles the boundary.

Fig. 2d shows the performance of the 5 different user models (robot users) over a range of 20 brushes. Here we used the GCS system, since it is computationally infeasible to apply the (sensit, roi, Hamming) user models to the other interaction systems. GCS can be used because it allows efficient computation of solutions, by recycling computation when doing the optimization [11]. In the other systems this is not possible, since the unaries change with every brush stroke, and hence we have to treat the system as a black box. As expected, the random user performs badly.
Interestingly, the robot users minimizing the energy (roi, sensit) also perform badly. Both "Hamming" and "centre" are considerably better than the rest. It is interesting to note that "centre" is actually only marginally worse than "Hamming". It has to be said that for other systems, e.g. GEO, this conclusion might not hold, since GEO is much more sensitive to the location of the brush stroke than a system based on graph cut, as [18] has shown. To summarize, "centre" is the robot user strategy which simulates a "novice user" and is computationally feasible, since it does not look at the underlying system at all. Also, for GCS, "centre" performed nearly the same as the optimal strategy "Hamming". Hence, for the rest of the paper we always stick to the "centre" user, which we call from here onwards our robot user.

⁴ Note, one could do even better by looking at two or more brushes after each other and then selecting the optimal one. However, the cost grows exponentially with the number of look-ahead steps.

Figure 2: An image from the database (a), tight trimap (b), robot user (red/blue) started from user scribbles (magenta/cyan) with segmentation (black) after B = 20 strokes, er = 1.4% (c), and segmentation performance comparison of different robot users (d).

3.4. The Error Measure

For a static trimap input there are many different ways of obtaining an error rate, see e.g. [4, 10]. In a static setting, most papers use the number of misclassified pixels (Hamming distance) between the ground truth segmentation and the current result. We call this measure "er_b", i.e. the Hamming error for brush b. One could consider variations, e.g. [10] weights distances to the boundary differently, but we have not investigated this here. Fig. 2d shows how the Hamming error behaves with each interaction.
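The per-brush error er_b is just the normalised Hamming distance between two binary masks; as a minimal sketch (the function name is ours):

```python
import numpy as np

def hamming_error_percent(seg, gt):
    """er_b: percentage of pixels where the current segmentation
    disagrees with the ground truth (Hamming distance / image size)."""
    seg, gt = np.asarray(seg, bool), np.asarray(gt, bool)
    return 100.0 * np.count_nonzero(seg != gt) / seg.size
```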
For learning and evaluation we need an error metric giving us a single score for the whole interaction. One choice is the "weighted" Hamming error averaged over a fixed number of brush strokes B. In particular, we choose the error E_r as E_r = [Σ_b f(er_b)] / B with f(er_b) = er_b. Note, to ensure a fair comparison between systems, B must be the same number for all systems. Another choice of quality metric, which matches more closely what the user wants, is as follows. We use a sigmoid-like function f : R⁺ → [0, c] of the form

  f(er_b) = 0                       if er_b ≤ 1.5,
  f(er_b) = c − c / (er_b − 0.5)²   if er_b > 1.5,    with c = 5.   (2)

Observe that f encodes two facts: all errors below 1.5 are considered negligible, and large errors never weigh more than c. The first reason for this setting is that visual inspection showed that, for most images, an error below 1.5% corresponds to a visually pleasing result. Of course this is highly subjective: a missing limb in the segmentation of a cow might be an error of 0.5% but visually unpleasing, while an incorrectly segmented low-contrast area with an error of 2% may not be visually disturbing. The second reason, for having a maximum weight of c, is that users do not discriminate between two systems giving large errors; thus errors of 50% and 55% are equally penalized.

Due to runtime limitations for parameter learning, we do not want to run the robot user for too many brushes (e.g. a maximum of 20 brushes). Thus we start by giving an initial set of brush strokes which are used to learn the colour models. At the same time, we want most images to reach an error level of about 1.5%. When we start with a static brush trimap, we get an error rate smaller than 1.5% for 68% of images, and smaller than 2.5% for 98%, with the GCA system. We also confirmed that the initial static brush trimap does not affect the learning considerably⁵.
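The single-score metric can be sketched as follows. The branch of Eq. 2 for er_b > 1.5 is taken here as c − c/(er_b − 0.5)², our reading of the original piecewise definition: it is continuous at er_b = 1.5 and saturates at c, as the surrounding text requires. Function names are ours:

```python
def f_sigmoid(er_b, c=5.0):
    """Weighting of Eq. 2: errors below 1.5% count as zero,
    large errors saturate at c."""
    return 0.0 if er_b <= 1.5 else c - c / (er_b - 0.5) ** 2

def interaction_score(errors, f=f_sigmoid):
    """E_r = [sum_b f(er_b)] / B over a fixed brush budget B;
    pass f=lambda e: e for the plain weighted Hamming error."""
    return sum(f(e) for e in errors) / len(errors)
```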
4. Interactive Learning by Line Search

Systems with few parameters can be trained by simple line (or grid) search. Our systems, GC and GCA, have only 3 free parameters: w_c, w_i, w_β. Line search is done by fixing all but one free parameter w_φ and simulating the user interaction process for 30 different discrete values w_{φ,i} of the free parameter w_φ over a predefined range. The optimal value w*_φ from the discrete set is chosen to minimize the leave-one-out (LOO) estimate of the test error⁶. Not only do we prevent overfitting, but we can also efficiently compute the Jackknife estimator of the variance [25, ch. 8.5.1]. This procedure is done for all parameters in sequence, with a sensible starting point for all parameters; we do one sweep only. One important thing to notice is that our dataset was big enough (and our parameter set small enough) so as not to suffer from overfitting; we see this by observing that training and test error rates are virtually the same for all experiments. In addition to the optimal value, we obtain the variance for setting this parameter. Roughly speaking, this variance tells us how important it is to have this particular value; for instance, a high variance means that parameters different from the selected one would also perform well.

Note, since our error function (Eq. 2) is defined for both static and dynamic trimaps, the above procedure can be performed for all three different types of trimaps: "static trimap", "static brush", and "dynamic brush". Table 2 summarizes all the results, and Fig. 3 illustrates some results during training and test (the caption explains the details of the plots).

⁵ We started the learning from no initial brushes and let it run for 60 brush strokes. The learned parameters were similar to those obtained when starting from 20 brushes.
⁶ This is number-of-data-points-fold cross validation.

Figure 3: Line search. We compare 3 different training procedures for interactive segmentation: static learning from a fixed set of user brushes, static learning from a tight trimap, and dynamic learning with a robot user starting from a fixed set of user brushes. Train (a-c): reweighted Hamming errors (± stdev.) for the two segmentation systems as a function of the line-search parameters; panels (a) GCA, w_c; (b) GCA, w_β; (c) GC, w_c. The optimal parameter is shown along with its Jackknife variance estimate (black horizontal bar). Test (d): segmentation performance of GCA using the optimal parameters (w*_c, w*_β) after iterated line-search optimisation. Note that the dynamically learnt parameters develop their strength in the course of interaction.

One can observe that the three different trimaps suggest different optimal parameters for each system, and are differently certain about them. This leads to a key contribution of this study: a system which is interactive at test time also has to be trained in an interactive way. We see from the test plots that, indeed, a system trained with a "dynamic trimap" does better than one trained with either a "static brush" or a "static trimap".

Let us look more closely at some learnt settings. For system GCA and parameter w_c (see Table 2, first row, and Fig. 3a), we observe that the optimal value in a dynamic setting (0.03) is lower than in any of the static settings.
This is surprising, since one would have guessed that the true value of w_c lies somewhere in between a loose and a very tight trimap. Interestingly, in [18] the authors had learned a parameter by averaging the performance over two static trimaps. From the above study, one might have concluded that the static "tight trimap" gives good insights about the choice of parameters. However, when we consider the training of the parameter w_β in the GCA system, we see that such a conclusion would be wrong, since the "tight trimap" reaches a very different minimum (9.73) than the dynamic trimap (2.21)⁷. To summarize, conclusions about the optimal parameter setting of an interactive system should be drawn from a large set of interactions and cannot be made by looking solely at a few (here two) static trimaps.

Table 2: System GCA. Optimal values ± stdev.

Trimap        | w_c         | w_i         | w_β
dynamic brush | 0.03 ± 0.03 | 4.31 ± 0.17 | 2.21 ± 3.62
static trimap | 0.07 ± 0.09 | 4.39 ± 4.40 | 9.73 ± 7.92
static brush  | 0.22 ± 0.52 | 0.47 ± 8.19 | 3.31 ± 2.13

⁷ Note, the fact that the uncertainty of the "tight trimap" learning is high gives an indication that this value cannot be trusted very much.

For the sake of completeness, we report the same numbers for the GC system in Table 3; we draw the same conclusions as above. One interesting thing to notice here is that the pairwise terms (esp. w_c) are chosen higher than in GCA. This is expected: without post-processing, a lot of isolated islands may be present which are far away from the true boundary. The post-processing in GCA automatically removes these islands, so its pairwise terms can concentrate on correctly modeling the smoothness of the boundary. In GC, however, the pairwise terms additionally have to make sure that the isolated regions are removed (by choosing a higher value for the pairwise terms) in order to compensate for the missing post-processing step.
Table 3: System GC. Optimal values ± stdev.

Trimap        | w_c         | w_i         | w_β
dynamic brush | 0.24 ± 0.03 | 4.72 ± 1.16 | 1.70 ± 1.11
static trimap | 0.07 ± 0.09 | 4.39 ± 4.40 | 4.85 ± 6.29
static brush  | 0.57 ± 0.90 | 5.00 ± 0.17 | 1.10 ± 0.96

It is interesting to note that for the error metric f(er_b) = er_b we get slightly different values, see Table 4. For instance, we see that w_c = 0.07 ± 0.07 for GCA with our active user. This is not too surprising: this metric gives larger errors more weight (which is what f(er_b) = er_b does), hence it is better to choose a larger value of w_c.

In Figure 3d we plot the actual segmentation error f(er_b), and not the error measure Σ_{b=1}^B f(er_b), for f(er_b) = sigmoid(er_b). In Table 4 we have collected all final error measure values. It is visible from the table that the dynamically adjusted parameters only perform better in terms of the instantaneous error, but not in terms of the cumulative error measure.

Table 4: System GCA. Final error measures ± stdev. under different error weightings.

Error weighting                      | static brush  | static trimap | dynamic brush
f(er_b) = sigmoid(er_b) [shown]      | 0.379 ± 0.134 | 0.416 ± 0.135 | 0.321 ± 0.132
f(er_b) = sigmoid(er_b), Σ_b f(er_b) | 0.984 ± 0.165 | 0.820 ± 0.173 | 1.007 ± 0.163
f(er_b) = er_b                       | 1.191 ± 0.072 | 1.232 ± 0.071 | 1.160 ± 0.074
f(er_b) = er_b, Σ_b f(er_b)          | 1.564 ± 0.103 | 1.422 ± 0.084 | 1.610 ± 0.115

In order to give a complete picture, we provide the full set of plots for the line-search experiments: results for the two systems GCA and GC on the three parameters w_c, w_i and w_β, and the two error weighting functions f(er_b), are reported in Figures 6 and 7.

Novice vs Advanced User. When comparing different interactive systems, we have to decide whether the system is designed for an advanced or a novice user. In a user study, one has full control over selecting advanced or novice users.
This can be done by changing the amount of introduction given to the participants. However, this process is lengthy and therefore infeasible for learning. In our robot user paradigm, we can simulate users with different levels of experience. We run the (centre) user model to simulate a novice user and evaluate four different systems. The results are shown in Fig. 4. The order of the methods is as expected: GCA is best, followed by GC, then GCS, and GEO. GEO performs badly since, compared to the other systems, it does no smoothing at the boundary.

Figure 4: System comparison. Segmentation performance of 4 different systems (GCA, GC, GCS and GEO) using the robot user started from initial user brushes; average Hamming error er as a function of the number of strokes B.

5. Interactive Max-margin Learning

The grid-search method used in Section 4 is suitable only for learning models with few parameters. Max-margin methods deal with models containing large numbers of parameters and have been used extensively in computer vision. However, they work with static training data and cannot be used with an active user model. In this section, we show how the traditional max-margin parameter learning algorithm can be extended to incorporate an active user.

5.1. Static SVMstruct

Our exposition builds heavily on [20] and the references therein. The SVMstruct framework [22] makes it possible to adjust the linear parameters w of the segmentation energy E_w(y, x) (Eq. 1) from a given training set {x_k, y_k}_{k=1..K} of K images x_k ∈ Rⁿ and ground truth segmentations⁸ y ∈ Y := {0, 1}ⁿ, by balancing between the empirical risk Σ_k Δ(y_k, f(x_k)) and regularisation by means of a trade-off parameter C. A (symmetric) loss function⁹ Δ : Y × Y → R⁺ measures the degree of fit between two segmentations y and y*. The current segmentation is given by y* = arg min_y E_w(y, x).
We can write the energy function as an inner product between feature functions ψ(y, x) and our parameter vector w: E_w(y, x) = w^T ψ(y, x). With the two shortcuts δψ^k_y = ψ(x_k, y) − ψ(x_k, y_k) and ℓ^k_y = Δ(y, y_k), the margin-rescaled objective [21] reads

  min_{ξ≥0, w}  o(w) := ½‖w‖² + (C/K)·1^T ξ                        (3)
  s.t.  min_{y ∈ Y∖y_k} ( w^T δψ^k_y − ℓ^k_y ) ≥ −ξ_k   ∀k.

In fact, the convex function o(w) can be rewritten as a sum of a quadratic regulariser and a maximum over an exponentially sized set of linear functions, each corresponding to a particular segmentation y. Which energy functions fit under the umbrella of SVMstruct? In principle, the cutting-planes approach [22] to solving Eq. 3 only requires efficient and exact computation of argmin_y E_w(y) and argmin_{y ≠ y_k} E_w(y) − Δ(y, y_k). At the scale of images, i.e. n > 10^5, submodular energies of the form E_w(y) = y^T F y + b^T y, F_ij ≥ 0, b_i ∈ R, allow for efficient minimisation by graph cuts. As soon as we include connectivity constraints as in Eq. 1, we can only train the SVMstruct approximately; however, some theoretical properties seem to carry over empirically [8].

5.2. Dynamic SVMstruct with "Cheating"

The SVMstruct does not capture the user interaction part. Therefore, we add a third term to the objective that measures the amount of user interaction ι, where u_k ∈ {0,1}^n is a binary image indicating whether the user provided the label of the corresponding pixel or not. One can think of u_k as a partial solution fed into the system by the user brush strokes. In a sense, u_k implements a mechanism for the SVMstruct to cheat, because only the unlabeled pixels have to be segmented by our argmin_y E_w procedure, whereas the labeled pixels stay clamped.
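The clamping mechanism can be made concrete with a small sketch: minimise a binary energy only over the pixels the user has not labeled, keeping the labeled ones fixed. This is an illustration under toy assumptions, a 4-pixel 1-d image, a hand-picked Ising-style energy, and brute-force search in place of the graph cuts the paper actually uses.

```python
import itertools
import numpy as np

def argmin_energy_clamped(energy, n, u, y_user):
    """Minimise a binary energy over {0,1}^n, clamping every pixel i
    with u[i] == 1 to its user-provided label y_user[i] ("cheating"):
    only the unlabeled pixels are searched over."""
    free = [i for i in range(n) if u[i] == 0]
    best, best_e = None, np.inf
    for bits in itertools.product([0, 1], repeat=len(free)):
        y = y_user.copy()        # clamped pixels keep their labels
        y[free] = bits           # candidate labels for free pixels
        e = energy(y)
        if e < best_e:
            best, best_e = y, e
    return best

# Toy 1-d Ising-style energy: unary costs plus a smoothness penalty.
# (A stand-in for E_w; not the paper's GMM/contrast potentials.)
unary = np.array([0.2, 0.8, 0.6, 0.1])           # cost of label 1 per pixel
energy = lambda y: float(unary @ y + 0.5 * np.sum(np.abs(np.diff(y))))
u      = np.array([1, 0, 0, 1])                  # pixels 0 and 3 user-labeled
y_user = np.array([1, 0, 0, 1])                  # their provided labels
y = argmin_energy_clamped(energy, 4, u, y_user)
assert (y[u == 1] == y_user[u == 1]).all()       # clamped pixels untouched
```

Note how each clamped pixel removes one binary variable from the search, which is the geometric halving of the label space described below Eq. 4.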
In the optimisation problem, we also have to modify the constraints such that only segmentations y compatible with the interaction u_k are taken into account. Our modified objective is given by:

  min_{ξ≥0, w, u_k}  o(w, U) := ½‖w‖² + (C/K)·1^T ξ + ι            (4)
  s.t.  min_{y ∈ Y|u_k ∖ y_k} ( w^T δψ^k_y − ℓ^k_y ) ≥ −ξ_k   ∀k
        ι ≥ a^T u_k   ∀k.

For simplicity, we choose the amount of user interaction or cheating ι to be the maximal a-reweighted number of labeled pixels, ι = max_k Σ_i a_i |u_{k,i}|, with uniform weights a = a·1. Other formulations based on the average rather than the maximal amount of interaction proved feasible but less convenient. We denote the set of all user interactions for all K images x_k by U = [u_1, .., u_K]. The compatible label set Y|u_k ⊆ Y = {0,1}^n is given by {ŷ ∈ Y | u_{k,i} = 1 ⇒ ŷ_i = y_{k,i}}, where y_k is the ground truth labeling.

^8 We write images of size (n_x, n_y, n_c) as vectors for simplicity. All involved operations respect the 2-d grid structure absent in general n-vectors.
^9 We use the Hamming loss Δ_H(y*, y_k) = 1^T |y_k − y*|.

Note that o(w, U) is convex in w for all values of U and efficiently minimisable by the cutting-planes algorithm. However, the dependence on u_k is horribly difficult: we basically have to find the smallest set of brush strokes leading to a correct segmentation. Geometrically, setting one u_{k,i} = 1 halves the number of possible labellings and therefore removes half of the label constraints.

The optimisation problem (Eq. 4) can be re-interpreted in two different ways. Firstly, we can define a modified energy Ẽ_{w,u}(y) = E_w(y) + Σ_{i∈V} u_{k,i} φ_i(y_i, y_{k,i}) with additional cheating potentials φ_i(y_i, y_{k,i}) := ∞ for y_i ≠ y_{k,i} and 0 otherwise, allowing us to treat the SVMstruct with cheating as an ordinary SVMstruct with modified energy function Ẽ_{w,u}(y) and extended weight vector w̃ = [w; u_1; ..
; u_K]. A second (but closely related) interpretation starts from the fact that the true label y_k can be regarded as a feature vector of the image x_k.^10 Under this view, choosing u_k is feature selection in a very particular feature space; there is a direct link to multiple kernel learning, a special kind of feature selection.

5.3. Optimisation with strategies

We explored two approaches to minimise o(w, U). Based on the discrete derivative ∂o/∂U, we tried coordinate descent schemes; due to the strong coupling of the variables, only very short steps were possible.^11 Conceptually, the process of optimisation is decoupled from the user interaction process, where removal of already known labels from the cheating does not make sense. At every stage of interaction, a user acts according to a strategy s: (x_k, y_k, u_{k,t}, y, w) ↦ u_{k,t+1}. The notion of strategy or policy is also at the core of a robot user. In order to capture the sequential nature of the human interaction, and assuming a fixed strategy s, we relax Eq. 4 to

  min_{ξ≥0, w}  o(w, T) := ½‖w‖² + (C/K)·1^T ξ + ι                 (5)
  s.t.  min_{y ∈ Y|u_{k,T} ∖ y_k} ( w^T δψ^k_y − ℓ^k_y ) ≥ −ξ_k   ∀k
        ι ≥ a^T u_{k,T},   u_{k,T} = s^T(x_k, y_k, w)   ∀k,

where we denote repeated application of the strategy s by s^T(x_k, y_k, w) = ◦_{t=0}^{T−1} s(x_k, y_k, u_{k,t}, w), with ◦ the function concatenation operator. Note that we still cannot properly optimise Eq. 5. However, as a proxy, we unroll Eq. 5 forward, starting at t = 0 with u_{k,0}. In every step t, we interleave the optimisation of the convex objective o(w_t, t) and the inclusion of a new user stroke, yielding w_T as the final parameter estimate.

^10 In fact, it is probably the most informative feature one can think of. The corresponding predictor is given by the identity function.
^11 In the end, we can only safely flip a single pixel u_{k,i} at a time to guarantee descent.

5.4. Experiments
We ran our optimisation algorithm with GCS on 5-fold CV train/test splits of the GrabCut images. We used unary potentials (GMM and flux) as well as two pairwise potentials (Ising and contrast), and the center robot user with B = 25 strokes. Fig. 5b shows how the relative weight of the linear parameters varies over time: at the beginning, smoothing (high w_i) is needed, whereas later, edges are most important (high w_c). The SVMstruct objective also changes: Fig. 5c makes clear that the data-fit term decreases over time while the regularisation increases. However, looking at the test error in Fig. 5a (averaged over 5 folds), we see only very little difference between the performance of the initial parameter w_0 and the final parameter w_T. Our explanation is that GCS is too simple, as it includes neither connectivity nor unary iterations. In addition to the Gaussian Mixture Model (GMM) based color potentials, we also experimented with flux potentials [12] as a second unary term. Figure 5b shows one example where we included a flux unary potential; we get almost identical behavior without flux unaries.

Figure 5: Max-margin stat/dyn: a) Segmentation performance using GCS when parameters are either statically or dynamically learnt. b) Evolution of w during the optimisation. c) Evolution of the first two terms of o(w).

6. Conclusion

This paper showed how user interaction models (robot users) can be used to train and evaluate interactive systems. We demonstrated the power of this approach on the problem of parameter learning in interactive segmentation systems. We showed how simple grid search can be used to find good parameters for different segmentation systems under an active user interaction model. We also compared the performance of the static and dynamic user interaction models. With more parameters, the approach becomes infeasible, which naturally leads to the max-margin framework. We introduced an extension to SVMstruct which allows it to incorporate user interaction models, and showed how to solve the corresponding optimisation problem. However, crucial parts of state-of-the-art segmentation systems include (1) non-linear parameters, (2) higher-order potentials (e.g. enforcing connectivity) and (3) iterative updates of the unary potentials, ingredients that cannot be handled directly inside the max-margin framework. In future work, we will try to tackle these challenges to enable learning of optimal interactive systems.

References

[1] Amazon Mechanical Turk. https://www.mturk.com.
[2] GrabCut database. http://research.microsoft.com/en-us/um/cambridge/projects/visionimagevideoediting/segmentation/grabcut.htm.
[3] X. Bai and G. Sapiro. A geodesic framework for fast interactive image and video segmentation and matting. 2007.
[4] A. Blake, C. Rother, M. Brown, P. Perez, and P. Torr. Interactive image segmentation using an adaptive GMMRF model. In ECCV, 2004.
[5] Y. Boykov and M. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. 2001.
[6] O. Duchenne, J.-Y. Audibert, R. Keriven, J. Ponce, and F. Ségonne. Segmentation by transduction. In CVPR, 2008.
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge. http://www.pascal-network.org/challenges/VOC.
[8] T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In ICML, 2008.
[9] L. Grady. Random walks for image segmentation. PAMI, 28:1–17, 2006.
[10] P. Kohli, L. Ladicky, and P. Torr. Robust higher order potentials for enforcing label consistency. In CVPR, 2008.
[11] P. Kohli and P. Torr. Efficiently solving dynamic Markov random fields using graph cuts. In ICCV, 2005.
[12] V. Lempitsky and Y. Boykov.
Global optimization for shape fitting. In CVPR, 2007.
[13] Y. Li, J. Sun, C.-K. Tang, and H.-Y. Shum. Lazy snapping. SIGGRAPH, 23:303–308, 2004.
[14] J. Liu, J. Sun, and H.-Y. Shum. Paint selection. In SIGGRAPH, 2009.
[15] S. Nowozin and C. H. Lampert. Global connectivity potentials for random field models. In CVPR, 2009.
[16] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: interactive foreground extraction using iterated graph cuts. 23(3):309–314, 2004.
[17] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. IJCV, 77:157–173, 2008.
[18] D. Singaraju, L. Grady, and R. Vidal. P-brush: Continuous valued MRFs with normed pairwise distributions for image segmentation. In CVPR, 2009.
[19] A. Sorokin and D. Forsyth. Utility data annotation with Amazon Mechanical Turk. In Internet Vision Workshop at CVPR, 2008.
[20] M. Szummer, P. Kohli, and D. Hoiem. Learning CRFs using graph cuts. In ECCV, 2008.
[21] B. Taskar, V. Chatalbashev, and D. Koller. Learning associative Markov networks. In ICML, 2004.
[22] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector learning for interdependent and structured output spaces. In ICML, 2004.
[23] S. Vicente, V. Kolmogorov, and C. Rother. Graph cut based image segmentation with connectivity priors. In CVPR, 2008.
[24] L. von Ahn and L. Dabbish. Labeling images with a computer game. In SIGCHI, pages 319–326, 2004.
[25] L. Wasserman. All of Statistics. Springer, 2004.
Figure 6: Learning with grid search (single parameter at a time), f(er_b) = sigmoid(er_b); panels a-f show training, panels g-l testing. [Plots of weighted Hamming error (brush-stroke average) against w_c, w_i and w_β, and of average weighted error against the number of brush strokes B, for systems GCA and GC under static brush, static trimap and dynamic brush.]

Figure 7: Learning with grid search (single parameter at a time), f(er_b) = er_b; panels a-f show training, panels g-l testing. [Same layout as Figure 6, with average Hamming error on the vertical axis of panels g-l.]
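The single-parameter grid search behind Figures 6 and 7 can be sketched as a plain line search: evaluate the error at each candidate value of one weight while the others stay fixed, and keep the minimiser. This is a toy illustration; the quadratic error curve below stands in for the robot-user evaluation (weighted Hamming error averaged over brush strokes and images), and the grid range mirrors the w_c axis of the plots.

```python
import numpy as np

def line_search(error_fn, grid):
    """Single-parameter grid search: evaluate the error at each
    candidate value and return the best value and its error."""
    errors = [error_fn(v) for v in grid]
    i = int(np.argmin(errors))
    return grid[i], errors[i]

# Toy stand-in for the robot-user evaluation: a smooth error curve
# whose minimum sits at the Table 3 value w_c = 0.24 for illustration.
error_fn = lambda w_c: (w_c - 0.24) ** 2 + 1.1
best_w, best_err = line_search(error_fn, np.linspace(-0.4, 1.4, 91))
print(round(best_w, 2))  # -> 0.24
```

Repeating this search per parameter (w_c, then w_i, then w_β) gives the one-at-a-time curves shown above; it scales linearly in the number of grid points but exponentially if parameters were searched jointly, which is what motivates the max-margin approach of Section 5.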
