Let Me At Least Learn What You Really Like: Dealing With Noisy Humans When Learning Preferences


Authors: Sriram Gopalakrishnan, Utkarsh Soni

Abstract

Learning the preferences of a human improves the quality of the interaction with the human. The number of queries available to learn preferences may be limited, especially when interacting with a human, and so active learning is a must. One approach to active learning is to use uncertainty sampling to decide the informativeness of a query. In this paper, we propose a modification to uncertainty sampling which uses the expected output value to help speed up learning of preferences. We compare our approach with the uncertainty sampling baseline, as well as conduct an ablation study to test the validity of each component of our approach.

1 Introduction

In an AI system that involves a human-in-the-loop, learning the human's preferences allows for a better experience for the person. Consider the example of an AI agent for selecting online ads for a user. This is a very noisy environment, and queries with feedback are few and precious. Users seldom want to click on ads, much less provide feedback. Advertisements on Facebook, for example, do get user feedback: the user has the option to click "Why am I seeing this ad" and provide feedback.

So when the amount of feedback is sparse, rather than trying to learn the entire preference model of the user accurately, if we can at least learn what the important features and weights are, we can better serve the user. Additionally, we should avoid showing product ads that the user hates, and learn a useful model as quickly as possible.

As to what is a useful model, let us say the human's preference is modelled by a function over some features with noise (as we expect that human feedback is typically noisy).
If we had to choose output regions in which to be more accurate, then we argue that it is more important to be accurate when the output (preference) is very high or very low, as compared to being accurate about moderate preferences. We justify this with the following arguments. (A) Given a set of options, the best options should be presented, or if there are no clear good options, then bad options should be avoided. (B) Rather than just detecting (classifying) a liked or disliked sample point, we argue that being accurate in the extreme regions of preferences is also important. This is apparent in online ad selection, as showing the best ad increases the likelihood of clicking the ad.

Figure 1: Illustration of Output Value Biased Uncertainty Sampling

We illustrate the idea of a useful function in Figure 1, where the green region is the high preference region, and the red region is the low preference region. We try to learn a better approximation function (the orange line) that matches the true function more closely in those extreme output regions. We trade off between accuracy in all regions and sample efficiency by prioritizing accuracy when the output value (magnitude) is higher. In this way we can reduce the number of samples needed to learn a useful model of the human's preferences quickly.

We leverage this intuition, as well as the idea of uncertainty sampling from Active Learning (AL), to query with the most valuable and informative data points. In Active Learning, queries are carefully selected so as to have the greatest expected improvement in accuracy. The queries are posed to an oracle (a source of ground-truth information, which could be humans) that gives feedback to help improve the model. Active learning has been widely used to improve the sample efficiency of machine learning models [Settles, 2009], as well as to choose what experiments to run in the design of experiments.
One of the most commonly used AL approaches for sampling queries is uncertainty sampling [Lewis and Gale, 1994]. In this approach, the data point sampled is the one for which the model is least confident about its output value (i.e. has the most variance). In our work we modify uncertainty sampling in a way that is especially applicable to learning preferences. We call our approach Output-Biased Uncertainty Sampling (OBUS). OBUS modifies the uncertainty (variance) measure of informativeness by adjusting it based on the output value. In the context of learning preferences, a query is more informative if the variance is higher and the expected output is a higher value (in magnitude). This deceptively simple change produces valuable gains in learning speed and sample complexity. We will discuss the details further in the methodology section, where we also describe the additional details needed for learning preferences with this sampling paradigm.

Note that Active Learning is still a form of supervised learning. A supervised learning task traditionally involves querying only about the label of a data point, and active learning typically does the same. However, there have been several more recent works in active learning, such as [Hema et al., 2006], [Poulis and Dasgupta, 2017], [Druck et al., 2008], [Druck et al., 2009], and [Mann and McCallum, 2008], where the human annotator is also asked to select relevant (predictive) features for a particular label. In terms of learning preferences, we can liken this to the human explaining their preferences by what features contributed to them. As one might expect, this greatly speeds up learning. There are already both empirical [Hema et al., 2006; Druck et al., 2009; Mann and McCallum, 2008] and theoretical [Poulis and Dasgupta, 2017] works that demonstrate this. We too utilize feature feedback to speed up the learning.
When the human is queried with a data point, that query is given a preference score or level by the user according to their preference function over some (possibly all) of the features in the plan. The learning problem is then to learn this function. We also enable feature feedback from the user to tell us what features were liked and disliked. We assume that we start with a large set of features, F, that could possibly be relevant (prior knowledge accumulated from past data), of which only a small subset is relevant to the current human's preferences. Each data point or query is then just a subset of features from the set F that are found in the query. For example, when a query or sample is presented to the user, there could be many possible features present in it. When the query is given to the human, they provide their preference score as well as telling us of salient features that were liked or disliked.

Let us consider an online ad or message about a local French restaurant. The human could have rated the ad as 4.5 and indicated that French-related ads are liked, and food-related ads are disliked. The human does not tell us how much each feature contributed to the score, just that the score reflects their overall feeling about the ad. Note that the range of preference values is not restricted, and it is the relative values of the preferences that define the preference model. The range can be arbitrary, but needs to be set prior to any queries. In the online ad scenario it makes sense to restrict this range from 0 to 100. We go into the details in the methodology and experimental sections.

Our claim is that using OBUS, we can frugally query the human to quickly learn a useful preference model. We compare our approach with standard uncertainty (variance) based sampling for the regression problem that we will shortly describe.
We also conduct an ablation study to analyze the effect of the different components of our approach. In order to effectively compare the effect of different components and methods of sampling, we use a simulated user. This helps us get copious and comparable feedback for each experimental setting. The feedback from our simulated user is intentionally made noisy in order to mimic a human's noisy feedback.

Our paper is organized as follows: we first define the problem formally in section 2 and go over our approach in detail in section 3. The experimental results are presented in section 4, where we compare the OBUS learning rate to random sampling and traditional uncertainty sampling baselines. We follow this with an ablation study to help evaluate the components of our active learning approach. We then compare our work with existing literature on active learning and eliciting preferences, before concluding with a summary, discussion of results, and future directions for this line of work.

2 Problem Formulation

In our work the problem of learning preferences is represented as a tuple T = <D, F, O, P, R>. D is a set of input data points. F is the set of features that are present in the data points and possibly relevant to the preferences of the user. O is an oracle that provides feedback (the simulated user). P is a probability function that returns the probability of seeing any of an input set of features. The objective is to learn the weights W of a linear preference function over the features in F (not all of which will be relevant to the user/oracle). We will now discuss each of the problem formulation elements in detail.

In our experimental setting, we are given a large pool of data points D split into a training set D_train and test set D_test. The features in each data point come from the superset of features F. Some subset of F is relevant to the oracle whose preferences we wish to learn.
The features could be arbitrarily complex, like compound logical statements that hold true or not, but the complexity of features is an orthogonal dimension that we do not explore. In our work, the features are binary, i.e. each feature is either present or not. Our work is not limited to binary features, as the reader will see from the methodology section; we chose binary features because they are simple, sufficient, and capture the type of features one might expect in online ads.

As for the simulated user, which we will henceforth call the oracle, it is defined as O = <p_relevant, O_mu, O_sigma, N>. The oracle is intended to let us evaluate our method against the other methods with the same consistent feedback. The oracle gives feedback on the preference for the queries posed to it. It returns a preference value for a given query data point based on the oracle's value function, and also gives feature feedback. The feature feedback tells us which features were relevant. The oracle is configured with a likelihood p_relevant of selecting a feature as relevant. A relevant feature is then either liked or disliked with equal (50%) chance. The oracle's preference for an input query is computed as a linear function over the relevant features in the query. Each feature's weight (coefficient) is fixed before the first round of queries, and its magnitude is drawn from a Gaussian distribution whose properties are specified by O_mu and O_sigma. Our reasoning for choosing a Gaussian distribution is that it gives us a few features with high or low weights, while most relevant features have comparable weights. Once the magnitude of the feature weight is obtained from the Gaussian distribution, the feature weight is set to be positive if liked and negative if disliked. The preference value returned from the oracle is noisy, as human feedback typically is. This preference value is a measure of how strongly the agent likes or dislikes the query.
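As a minimal sketch, the simulated oracle described above could look like the following. The class and attribute names are ours; the paper does not specify an implementation, only that weight magnitudes come from a Gaussian, signs are assigned by like/dislike, and Gaussian noise with standard deviation N is added to each preference value.

```python
import numpy as np

class SimulatedOracle:
    """Noisy simulated user with a linear preference over relevant
    binary features. Sketch based on the paper's description;
    names are our own assumptions."""

    def __init__(self, n_features, p_relevant=0.1, mu=8.0, sigma=3.0,
                 noise_sd=6.0, rng=None):
        self.rng = rng or np.random.default_rng(0)
        # Each feature is independently relevant with probability p_relevant.
        self.relevant = self.rng.random(n_features) < p_relevant
        # Weight magnitudes from N(mu, sigma); sign is +/- with equal
        # chance (liked vs. disliked). Irrelevant features get weight 0.
        magnitudes = np.abs(self.rng.normal(mu, sigma, n_features))
        signs = self.rng.choice([-1.0, 1.0], n_features)
        self.weights = np.where(self.relevant, signs * magnitudes, 0.0)
        self.noise_sd = noise_sd

    def query(self, x):
        """x: binary feature vector. Returns a noisy preference value
        and feature feedback on the relevant features present in x."""
        value = self.weights @ x + self.rng.normal(0.0, self.noise_sd)
        feedback = {i: ("liked" if self.weights[i] > 0 else "disliked")
                    for i in np.flatnonzero(x) if self.relevant[i]}
        return value, feedback
```

A query with no features present returns only noise and empty feature feedback, which matches the intent that feedback is given only on features the user actually sees.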
The noise in the preference value is determined by N, which represents the standard deviation of Gaussian noise with mean 0.0. The oracle samples from this distribution and adds the noise to the preference value feedback.

We assume that the oracle's preference or value function is linear in the features: V(x) = W^T · F(x), where W represents the weights of the features, and F(x) represents the feature vector of the query x (representing the relevant features). We think using a linear function is an acceptable simplification, since features can be arbitrarily complex and nonlinearity can be represented as more complex features. This dimension is orthogonal to our contribution.

The objective is to learn a set of weights such that the test set error (in D_test) is minimized, especially when the output preference is an extreme value (very high or very low). In order to prioritize accuracy in the extreme value regions, we use the value-biased error (Error_VB). It is computed as the product of the error and the true value, as described in Equation 1:

Error_VB(y_hat, y*) = |y_hat − y*| · |y*|    (1)

where y_hat is the estimated value and y* is the true value. The error for the entire dataset is the averaged value-biased error over the top 20% and bottom 20% of the range of values. Lastly, we assume the test set follows the same distribution as the training set.

3 Methodology

We discuss our methodology by first presenting an overview of the algorithm, and then going into its parts in detail. To learn which features matter and their weights, the queries are presented to the oracle in rounds, with k = 5 plans per round (a hyperparameter). The queries for each round are selected based on the score given by Equation 10. The equation is discussed in the subsection Scoring Plans. For now, suffice it to say the score represents the informativeness of a query.
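Equation 1 and the dataset-level error can be sketched in a few lines. The function names are ours, and the selection of the top and bottom 20% of the value range is our reading of the text.

```python
import numpy as np

def error_vb(y_hat, y_true):
    """Value-biased error of Equation 1: |y_hat - y*| * |y*|."""
    return np.abs(y_hat - y_true) * np.abs(y_true)

def dataset_error_vb(y_hat, y_true, frac=0.2):
    """Average value-biased error over the top and bottom `frac`
    of the range of true values (our interpretation of the text)."""
    lo, hi = y_true.min(), y_true.max()
    span = hi - lo
    extreme = (y_true <= lo + frac * span) | (y_true >= hi - frac * span)
    return error_vb(y_hat[extreme], y_true[extreme]).mean()
```

For a true value of 28 and an untrained prediction of 0, Equation 1 gives 28 · 28 = 784, matching the worked number quoted in the experiments section.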
The score is computed using a model learned by Ridge Regression (which we will call RRM), using the standard error of the weights to compute the uncertainty in the prediction. We describe this in more detail in the section on Ridge Regression. The oracle gives feedback as to what features were relevant and the score of the plan. After every batch of queries, we train a new RRM model with the additional information. The updated model is used to score the pool of plans. Then the most informative k plans from the remaining pool are selected for the next batch.

Note that for the first batch of queries, since we have no data, the queries are selected based on the likelihood of occurrence of their features. The higher the probability of occurrence, the more important it is to know about the feature. Additionally, each round includes a query that is purely exploratory, i.e. tries to cover as many features from the superset F as possible, chosen using the exploration score S_e described in Equation 2:

S_e(x) = Σ_{f ∈ unseen(x)} p(f)    (2)

where unseen(x) are the features in the query x that have not yet been seen by the user (and so we are unsure if they are relevant). This helps find more features to update the RRM model. We now describe the RRM model and variance computation in more detail.

3.1 Ridge Regression Model (RRM)

Ridge Regression is appropriate for modelling data of the form in Equation 3:

y = w_0 + w · x + ε    (3)

where w_0 is the bias, which we set to 0 in our experiments, i.e. we assume the human is neither positively nor negatively biased; w is the vector of weights for the relevant features in x; and ε is the noise, assumed to be Gaussian as N(0, σ_noise). Ridge Regression optimizes the loss shown in Equation 4:

L(X, Y) = (1/N) Σ_{i=1}^{N} (y_i* − w_0 − w · x_i)² + λ Σ_j w_j²    (4)

which is the mean squared error plus a penalty λ on the sum of the squared weights of the model.
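As a concrete sketch of fitting the model of Equations 3 and 4, the following uses scikit-learn's Ridge (which the paper states it uses) on toy data; the toy weights and sample size are our own illustrative choices, and `fit_intercept=False` matches the paper's choice of w_0 = 0.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# Toy data consistent with Equation 3: y = w . x + Gaussian noise, zero bias.
true_w = np.array([5.0, -3.0, 0.0, 8.0])
X = rng.integers(0, 2, size=(50, 4)).astype(float)   # binary features
y = X @ true_w + rng.normal(0.0, 1.0, size=50)

# fit_intercept=False corresponds to assuming no baseline bias w_0.
model = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)
print(np.round(model.coef_, 2))  # recovered weights, close to true_w
```

The recovered coefficients are slightly shrunk toward zero by the λ penalty (here `alpha=1.0`), which is the behavior the surrounding text describes.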
This penalizes large weights, which implicitly assumes that no single weight is extremely large. We used the Ridge Regression implementation in the Python library scikit-learn [Pedregosa et al., 2011]. If the reader is interested in more details on Ridge Regression, we recommend the lecture notes by Wessel van Wieringen [van Wieringen, 2015].

Ridge Regression can give us the maximum likelihood estimate with noisy data, where the noise is Gaussian with mean zero. The confidence range of each parameter is computed using the standard error (se) of each parameter as in Equation 5, a well-known and often-used metric:

se(w_i) = sqrt(σ̂² / S_xx)    (5)

where σ̂² = Σ_{i=1}^{N} (y_i* − ŷ_i)² / (N − 2) is the residual variance, an unbiased estimate of the error, and S_xx = Σ_{i=1}^{N} (x_i − x̄)². For a desired confidence interval defined by 100(1 − α)%, the range of parameter values is given by Equation 6:

ŵ_i − t_{α/2, N−2} · se(w_i) ≤ w_i* ≤ ŵ_i + t_{α/2, N−2} · se(w_i)    (6)

We used the 90% confidence interval for each parameter, i.e. α = 0.1. In each round, we independently sample parameter values uniformly in this range and store N_m = 10 linear models to represent the possible models given the data. Then, in order to compute the uncertainty (variance) in the output for a given data point, we compute the output under each of the sampled models and take the variance of the outputs. The larger the number of models, the more accurate the variance estimate. For our experiments we chose N_m = 10 as a tradeoff between speed and efficacy of the estimate for sample selection.

Initially, the model has no features. As we get feedback on what features are relevant after each round, we add those (as variables) into the model and retrain it on all the feedback.
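The model-sampling step above can be sketched as follows. Function names are ours; for brevity the sketch takes the per-parameter standard errors as given and uses the normal quantile in place of the t quantile (close for large N), so it is an approximation of Equation 6, not the paper's exact computation.

```python
import numpy as np

def sample_models(w_hat, se, n_models=10, alpha=0.1, rng=None):
    """Sample N_m weight vectors uniformly within each parameter's
    100(1 - alpha)% confidence interval (cf. Equation 6)."""
    rng = rng or np.random.default_rng(0)
    # Two-sided normal quantile for alpha = 0.1 is about 1.645;
    # the paper uses the t quantile t_{alpha/2, N-2}.
    z = 1.645
    lo, hi = w_hat - z * se, w_hat + z * se
    return rng.uniform(lo, hi, size=(n_models, len(w_hat)))

def output_variance(models, x):
    """Prediction uncertainty at x: variance of the outputs of the
    sampled linear models, as described in the text."""
    preds = models @ x
    return preds.var()
```

A point whose features all have weight-interval width zero (or a zero feature vector) gets zero output variance, so only queries touching uncertain weights look informative.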
3.2 Scoring Plans

For sampling by uncertainty, one typically uses the uncertainty (variance) in the prediction as the deciding factor; the more uncertain we are, the more informative that sample is. The main change to the sampling process that we propose, and our core contribution, is Output-Biased Uncertainty Sampling (OBUS). In OBUS, the informativeness of a sample X is computed using three scores. The base information score (S_b) of a sample x_i is defined in Equation 7:

S_b(x_i) = σ(x_i) + σ(x_i)^(ŷ_i / ŷ_max)    (7)

where ŷ_max is the maximum output value predicted in the pool of queries, so the exponent of the second term can be at most 1. This limits the contribution of the output value in the base score to at most doubling the standard deviation. S_b replaces just using variance as the sampling criterion.

The feature frequency score (S_f) is the sum of the probabilities of those relevant features in the sample that have been discovered from user feedback, as shown in Equation 8. This score gives importance to learning about the relevant features that occur more often.

S_f(X) = Σ_{f ∈ rel(X)} p(f)    (8)

where rel(X) returns the set of relevant features in X, and p(f) returns the probability of the feature f. The discovery score (S_d) is the sum of probabilities of the features in the sample that have not yet been shown to the user. This is used to explore features that have not yet been shown to the user; S_d is formalized in Equation 9. Note that this is different from the exploration query described earlier and scored with S_e in Equation 2. Given two queries with equal base and feature frequency scores, the query whose unseen features have a higher probability mass will get selected. This helps squeeze as much information value from a query as possible.
S_d(X) = Σ_{f ∈ unseen(X)} p(f)    (9)

where unseen(X) returns the set of features in X which have not yet been shown to the user. The total information score (S_t) of a data sample X is the base score adjusted with the feature frequency score and discovery score, as shown in Equation 10:

S_t(X) = S_b(X) · (1 + S_f(X) + S_d(X))    (10)

This S_t score is what is used to rank and select the queries from the data pool D for the next round. We will make our code available to the public after publication of this work.

4 Experiments and Discussion

In our experiments we compare OBUS, which uses the informativeness score defined in Equation 10, against two baselines: sampling by uncertainty (variance only) and random sampling. We also conduct an ablation study to analyze the effect of the two modification scores, i.e. the discovery score S_d and the feature frequency score S_f.

Please note that when we compare with the uncertainty sampling method, we include the feature discovery score S_d and feature frequency score S_f to make it a fair comparison, without which we would gain an unfair advantage. In fact, the informativeness score used in the uncertainty sampling method is the same as in Equation 10, with the only difference being in the base score S_b, which is set to just the variance representing the uncertainty.

For our experiments, there are |F| = 200 features. Our pool of queries is of size |D| = 10,000 and each query contains 4 features. We set the probability of a feature being selected by the oracle as relevant to p_relevant = 0.1 (10% of all features). The mean and standard deviation of the oracle's feature weights are set as O_mu = 8.0, O_sigma = 3.0. The oracle's noise in ratings is intentionally set high, to N = 6.0, to test the methodology in the extreme. We run each of our experiments for 30 trials and present the averaged data.
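The scoring and selection of Equations 7–10 can be sketched as follows. The function names are ours, and for simplicity the summed feature probabilities of Equations 8 and 9 are taken as precomputed inputs rather than computed from a feature pool.

```python
import numpy as np

def total_score(sigma, y_hat, y_hat_max, p_rel_sum, p_unseen_sum):
    """Informativeness of one query (Equations 7-10). Inputs: prediction
    std sigma, expected output y_hat, pool-wide max prediction y_hat_max,
    and the summed probabilities of the query's known-relevant features
    (Eq. 8) and never-shown features (Eq. 9)."""
    s_b = sigma + sigma ** (y_hat / y_hat_max)   # Equation 7
    s_f = p_rel_sum                              # Equation 8 (presummed)
    s_d = p_unseen_sum                           # Equation 9 (presummed)
    return s_b * (1.0 + s_f + s_d)               # Equation 10

def select_batch(scores, k=5):
    """Pick the k most informative queries from the pool by S_t."""
    return np.argsort(scores)[::-1][:k]
```

When the expected output equals the pool maximum, the exponent in S_b is 1 and the base score is exactly twice the standard deviation, which is the cap the text describes.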
4.1 Comparing OBUS with Uncertainty Sampling and Random Sampling

Figure 2: Comparison of OBUS with Baselines using Value-Biased Error (Error_VB)

Figure 3: Comparison of OBUS with Baselines using Average Error in All Regions

Using the aforementioned settings, we display the results of the comparison with the baselines in Figure 2 and Figure 3. As the reader can see from Figure 2, OBUS offers a modest but real benefit over vanilla uncertainty sampling in the extreme regions, measured using Error_VB as defined in Equation 1. The range of values may be confusing to the reader; note that the output values in our experiments lie (on average) in the range [−28, 28]. The untrained model's prediction is zero, so its Error_VB would be 28² = 784, and the model's Error_VB drops from there after the first round. The important point to note is the relative difference between OBUS and the other baseline methods, which is consistent across all rounds. This data is averaged over 30 trials. It demonstrates increased accuracy in the extreme output regions. The improvements are modest, and come while staying comparable to uncertainty sampling in the overall error, as seen in Figure 3. In the context of our running example of online ad selection, these modest improvements in the extreme region could translate to a small increase in the odds of a user clicking on an ad. Over a large user base, this translates to many more clicks and more revenue for the host website.

Random sampling does quite well too, as one might expect. Random sampling discovers features well, and samples the frequent features more often. It is still noticeably worse than the other sampling methods. We observed that as the size of the feature set F increases and the number of relevant features decreases, random sampling suffers more, as one might expect.
4.2 Ablation Study

The total score S_t used in OBUS (Equation 10) comprises two additional terms, S_f and S_d. We analyze the effect of these terms in the OBUS score function by comparing the performance of OBUS with one or both of these terms removed. The results are shown in Figure 4 and Figure 5. The second term is the known-feature occurrence term (S_f), which prioritizes learning about relevant features that occur more often. The third term is the feature discovery term (S_d), which helps feature discovery by giving weight to samples containing frequent features that the user has not yet marked as relevant or not. To arrive at these results, we follow the same process of taking the average error for each round over 30 trials, as was done in the previous experiment.

As evidenced in Figure 4, the full OBUS score (S_t) performs better than all others in the extreme regions based on Error_VB. It is closely followed by the trendline "no prob term", which corresponds to the OBUS score with the S_f term turned off. The OBUS score that uses the feature frequency score S_f ("no discovery") but drops the discovery score S_d performs worse because it does not discover as many relevant features. The same holds for OBUS with only the base score (the cyan trendline titled "neither").

In terms of overall error (displayed in Figure 5), the full OBUS score S_t is comparable to the method that only keeps the S_d (discovery) score. Both discover as many relevant features, and the overall accuracy is about even. As expected, without discovering these features, the other two cases suffer. It must be noted that the S_f score without S_d does not confer much benefit, but combined together they give the total OBUS score S_t a benefit in the extreme regions, as seen in Figure 4.
Figure 4: Ablation Study Over the Terms of the OBUS Total Score using Value-Biased Error (Error_VB)

Figure 5: Ablation Study Over the Terms of the OBUS Total Score using Average Error in All Regions

5 Related Work

In this section, we focus on discussing the relevant Active Learning literature with respect to preference learning, using feature feedback, and Active Learning for regression.

Active learning techniques are often used to accelerate preference learning (of humans), as we expect human feedback to be scarce. For instance, [Maystre and Grossglauser, 2017] proposed an efficient method for ranking a list of items in the noisy setting where a pairwise comparison query may be answered incorrectly. They achieved improvements in sampling efficiency as compared to random sampling. However, our preference setting is starkly different and arguably more general, as we are learning a preference function, not just a ranking over a set of points.

A different dimension in Active Learning is using feature feedback to improve the learning rate. In the approach proposed by [Hema et al., 2006] for document classification, the learner interleaves feature queries and instance queries. An instance query asks for the label of a document, whereas a feature query asks about the relevance of features (a feature is relevant if it is highly indicative of the label). To incorporate the feature feedback, the relevant features' values are scaled for all the data points. This boosts the accuracy of the document classifier. The way we use feedback in our work has parallels to theirs: we obtain feedback on features that affect the user's preferences, and use the feedback to eliminate irrelevant features from consideration in subsequent query selection. Additionally, their approach is suited to a classification task, while our method is meant for a regression task.
In learning preferences, it is important to know the order of preference among the options, not just whether an option is preferred or not, so their method would not work well here.

Another work that leverages feature feedback for active learning is that of [Mann and McCallum, 2008]. In it, the authors use labelled features to train a conditional random field for a sequence labelling task. A conditional distribution of labels given a feature is learnt as a reference distribution. Feature feedback is then incorporated into the learning model by matching this distribution to the conditional distribution of labels over the unlabelled data. This learning paradigm gives substantial improvement in sample efficiency: they show that the same accuracy as traditional instance labelling can be achieved with 10 times fewer annotations. A similar approach was used in [Druck et al., 2008] for a classification task; it reports as much as a 20-fold speed-up in time when using labelled features over labelled instances. They assume that labeling a feature is 5 times faster than labeling a document (a result supported empirically by [Hema et al., 2006]).

The task in the previous three approaches is labelling and classification, and their approaches cannot be applied to learning a preference order over data. However, the biggest difference is in the way we do uncertainty sampling, where we utilize the expected output values and not just the variance (uncertainty). This makes our approach suitable for learning preference models where there is a benefit to being more accurate in the extreme output regions, as in the case of online ad selection; the AI system only needs to select one or a few highly preferred ads. Given these works, we can say that the literature on AL has convincingly shown that feature feedback is essential in speeding up the learning rate.
Moreover, theoretical bounds have also been provided for some settings, like document classification [Poulis and Dasgupta, 2017].

A more closely related work, which combines the regression problem setting and feature feedback, is that of [Boutilier et al., 2010]. In their work, they learn a function to recommend products that are represented by features and concepts. A concept is a subjective property, like "safety of the car", which can depend on the values of different features for different users. The utility of a product is given by the weighted sum of features and a bonus for satisfying the concept. Given the uncertainty in weight values, and in which features are relevant to the concept (referred to as versions of the concept), the approach recommends a product which minimizes the maximum regret (MMR). The maximum regret is the maximum possible difference between the utilities of x* (the currently recommended product) and some adversarially chosen x_a (chosen over all possible weights and concept versions). They propose several query strategies to efficiently minimize the regret value of the recommended product x*. This is very different from the uncertainty sampling variation that we use. Additionally, their objective is only to recommend the most preferred product, and they only elicit positive preference information. We learn a model with both positive and negative features, and seek to be accurate in predicting both high and low preference cases.

The most important distinguishing feature of our approach overall is that our sampling is sensitive to the output region in which we expect the input to lie. We found one other work that considers sampling based on the output region, by Eric et al. [Eric et al., 2008]. In that work, the authors use Gaussian Processes (GP) to search for a single region of high preference. They focus on the expected output value to guide the learning, which is similar to us.
They differ in that their search is for a single setting of parameters, rather than learning weights over the parameters (which are features for us). Our approach lets us identify many highly preferred points, as opposed to one parameter setting. They explicitly acknowledge that they do not focus on learning the valuation function, since their problem has an infinite parameter space. Ours has a discrete and finite parameter space, which makes the problem easier but still useful. Additionally, their use of Gaussian Processes makes their method suitable only when the number of parameters/features is small and manageable (as stated in their work), because GPs are computationally very demanding. To the best of our knowledge, our intuitive modification to uncertainty sampling has not been explored in the literature, and it contributes to learning preferences faster in the extreme regions using the methodology we described.

6 Conclusion and Future Work

In summary, our OBUS approach is a method of active learning for preferences that becomes accurate in the high and low output regions faster than uncertainty sampling. OBUS modifies uncertainty sampling by including the expected output value as an adjustment term. We also consider the probability of feature occurrence and the discovery value of queries when selecting queries. We empirically demonstrated that our method works well as compared to the standard uncertainty sampling and random sampling baselines, with consistent results using simulated user feedback. We also conducted an ablation study to show the validity of the additional terms for feature discovery and feature frequency.

With respect to the running example of online ad selection, a web service can learn a personal model of preferences using a methodology like ours.
This personal model can be used on top of the other recommendation algorithms the service uses, which leverage demographic information, history, and location. Such heuristic information can be used to display exploratory ads before building confidence in the user's model and using it instead. An additional benefit of our approach is that the uncertainty in the output can also be used to choose which options are presented. If the situation is higher risk/cost, then a more certain option that is highly preferred can be given; when the cost of failure is lower, a more uncertain option with high expected value can be given. Our work can be extended in terms of the feedback interface, by supporting pairwise comparisons or discrete levels of preference instead of asking the user to return a number in a fixed interval like [0, 100]. This would make the feedback process better for the user. References [Boutilier et al., 2010] Craig Boutilier, Kevin Regan, and Paolo Viappiani. Simultaneous elicitation of preference features and utility. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010. [Druck et al., 2008] Gregory Druck, Gideon Mann, and Andrew McCallum. Learning from labeled features using generalized expectation criteria. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 595–602. ACM, 2008. [Druck et al., 2009] Gregory Druck, Burr Settles, and Andrew McCallum. Active learning by labeling features. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, pages 81–90. Association for Computational Linguistics, 2009. [Eric et al., 2008] Brochu Eric, Nando D Freitas, and Abhijeet Ghosh. Active preference learning with discrete choice data. In Advances in Neural Information Processing Systems, pages 409–416, 2008. [Hema et al., 2006] R Hema, M Omid, and J Rosie. Active learning with feedback on both features and instances. Journal of Machine Learning Research, 7:1655–1686, 2006. [Lewis and Gale, 1994] David D Lewis and William A Gale. A sequential algorithm for training text classifiers. In SIGIR '94, pages 3–12. Springer, 1994. [Mann and McCallum, 2008] Gideon Mann and Andrew McCallum. Generalized expectation criteria for semi-supervised learning of conditional random fields. 2008. [Maystre and Grossglauser, 2017] Lucas Maystre and Matthias Grossglauser. Just sort it! A simple and effective approach to active preference learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2344–2353. JMLR.org, 2017. [Pedregosa et al., 2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. [Poulis and Dasgupta, 2017] Stefanos Poulis and Sanjoy Dasgupta. Learning with feature feedback: from theory to practice. In Artificial Intelligence and Statistics, pages 1104–1113, 2017. [Settles, 2009] Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009. [van Wieringen, 2015] Wessel N van Wieringen. Lecture notes on ridge regression. arXiv preprint arXiv:1509.09169, 2015.