Inducing Honest Reporting Without Observing Outcomes: An Application to the Peer-Review Process


Authors: Arthur Carvalho, Stanko Dimitrov, Kate Larson

Arthur Carvalho, University of Waterloo, a3carval@uwaterloo.ca
Stanko Dimitrov, University of Waterloo, sdimitro@uwaterloo.ca
Kate Larson, University of Waterloo, klarson@uwaterloo.ca

October 23, 2013

Abstract

When eliciting opinions from a group of experts, traditional devices used to promote honest reporting assume that there is an observable future outcome. In practice, however, this assumption is not always reasonable. In this paper, we propose a scoring method built on strictly proper scoring rules to induce honest reporting without assuming observable outcomes. Our method provides scores based on pairwise comparisons between the reports made by each pair of experts in the group. For ease of exposition, we introduce our scoring method by illustrating its application to the peer-review process. In order to do so, we start by modeling the peer-review process using a Bayesian model where the uncertainty regarding the quality of the manuscript is taken into account. Thereafter, we introduce our scoring method to evaluate the reported reviews. Under the assumptions that reviewers are Bayesian decision-makers and that they cannot influence the reviews of other reviewers, we show that risk-neutral reviewers strictly maximize their expected scores by honestly disclosing their reviews. We also show how the group's scores can be used to find a consensual review. Experimental results show that encouraging honest reporting through the proposed scoring method creates more accurate reviews than the traditional peer-review process.

1 Introduction

In the absence of a well-chosen incentive structure, experts are not necessarily honest when reporting their opinions.
For example, when reporting subjective probabilities, experts who have a reputation to protect might tend to produce forecasts near the most likely group consensus, whereas experts who have a reputation to build might tend to overstate the probabilities of outcomes they feel will be understated in a possible consensus [Nakazono, 2013]. Hence, an important question when eliciting experts' opinions is how to incentivize honest reporting.

Proper scoring rules [Winkler and Murphy, 1968] are traditional devices that incentivize honest reporting of subjective probabilities, i.e., experts maximize their expected scores by honestly reporting their opinions. However, proper scoring rules rely on the assumption that there is an observable future outcome, which is not always a reasonable assumption. For example, when market analysts provide sales forecasts on a potential new product, there is no guarantee that the product will ever be produced. Hence, the actual number of sales may never be observed.

In this paper, we propose a scoring method for promoting honest reporting amongst a group of experts when future outcomes are unobservable. In particular, we are interested in settings where experts observe signals from a multinomial distribution with an unknown parameter. Honest reporting then means that experts report exactly the signals that they observed. Our scoring method is built on proper scoring rules. However, different from what is traditionally assumed in the proper scoring rules literature, our method does not assume that there is an observable future outcome. Instead, scores are determined based on pairwise comparisons between experts' reported opinions. The proposed method may be used in a variety of settings, e.g., strategic planning, reputation systems, peer review, etc.
When applied to strategic planning, the proposed method may induce honest evaluation of different strategic plans. A strategic plan is a systematic and coordinated way to develop a direction and a course for an organization, which includes a plan to allocate the organization's resources [Argenti, 1968]. After a candidate strategic plan is discarded, it becomes nearly impossible to observe what the consequences of that plan would have been because strategic plans are long-term in nature. Hence, a method to incentivize honest evaluations of candidate strategic plans cannot assume that the result of a strategic plan is observable in the future.

Our method can also be applied to reputation systems to elicit honest feedback. In reputation systems, individuals rate a product/service after experiencing it; customer product reviews on Amazon.com are one such reputation system. Due to the subjective nature of this task, incentives for honest feedback should not be based on the assumption that an absolute rating exists.

For ease of exposition, we introduce our scoring method by illustrating its application to a domain where traditionally there are no observable outcomes: the peer-review process. Peer review is a process in which an expert's output is scrutinized by a number of other experts with relevant expertise in order to ensure quality control and/or to provide credibility. Peer review is commonly used when there is no objective way to measure the output's quality, i.e., when quality is a subjective matter. Peer review has been widely used in several professional fields, e.g., accounting [AICPA: American Institute of CPAs, 2012], law [LSC: Legal Services Commission, 2005], health care [Dans, 1993], etc.

Currently, a popular application of the peer-review process is in online education. Recent years have seen a surge of massive open online courses, i.e., free online academic courses aimed at large-scale participation. Some of these courses have attracted tens of thousands of students [Pappano, 2012]. One of the biggest challenges faced by online educators brought by this massive number of students is the grading process, since the available resources (personnel, time, etc.) are often insufficient. Auto-grading by computers is not always feasible, e.g., in courses whose assignments consist of essay-style questions and/or questions that do not have clear right/wrong answers. Peer review has been used by some companies like Coursera¹ as a way to overcome this issue.

For simplicity's sake, we focus on peer review as used in modern scientific communication. The process, as we consider it in this paper, can be described as follows: when a manuscript first arrives at the editorial office of an academic journal, it is first examined by the editor, who might reject the manuscript immediately because either it is out of scope or because it is of unacceptable quality. Manuscripts that pass this first stage are then sent out to experts with relevant expertise, who are usually asked to classify the manuscript as publishable immediately, publishable after some revisions, or not publishable at all. Traditionally, the manuscript's authors do not know the reviewers' identities, but the reviewers may or may not know the identity of the authors. In other words, peer review can be seen as a decision-making process where the reviewers serve as cognitive inputs that help a decision maker (chair, editor, course instructor, etc.) judge the quality of a peer's output. A crucial point in this process is that it greatly depends on the reviewers' honesty. In the canonical peer-review process, reviewers have no direct incentives for honestly reporting their reviews.
¹ http://www.coursera.org/

Several potential problems have been discussed in different research areas, e.g., bias against female authors, authors from minor institutions, and non-native English writers [Bornmann et al., 2007, Wenneras and Wold, 1997, Primack and Marrs, 2008, Newcombe and Bouton, 2009].

In order to illustrate the application of our method to peer review, we start by modeling the peer-review process as a Bayesian model so as to take the uncertainty regarding the quality of the manuscript into account. We then introduce our scoring method to evaluate reported reviews. We assume that the scores received by reviewers are somehow coupled with relevant incentives, be they social-psychological, such as praise or visibility, or material rewards through prizes or money. Hence, we naturally assume that reviewers seek to maximize their expected scores and that there are no external incentives. We show that reviewers strictly maximize their expected scores by honestly disclosing their reviews under the additional assumptions that they are Bayesian decision-makers and that they cannot influence the reviews of other reviewers.

Honesty is intrinsically related to accuracy in our peer-review model: as the number of honest reviews increases, the distribution of the reported reviews converges to the probability distribution that represents the quality of the manuscript. We performed peer-review experiments to validate the model and to test the efficiency of the proposed scoring method. Our experimental results corroborate our theoretical model by showing that the act of encouraging honest reporting through the proposed scoring method creates more accurate reviews than the traditional peer-review process, where reviewers have no direct incentives for expressing their true reviews.
In addition to our method for inducing honest reporting, we also propose a method to aggregate opinions that uses information from experts' scores. Our aggregation method is general in the sense that it can be used in any decision-making setting where experts report probability distributions over the outcomes of a discrete random variable. The proposed method works as if the experts were continuously updating their opinions in order to accommodate the expertise of others. Each updated opinion takes the form of a linear opinion pool, where the weight that an expert assigns to a peer's opinion is inversely related to the distance between their opinions. In other words, experts are assumed to prefer opinions that are close to their own opinions, where closeness is defined by an underlying proper scoring rule. We provide conditions under which consensus is achieved under our aggregation method and discuss a behavioral foundation of it. Using data from our peer-review experiments, we find that the consensual review resulting from the proposed aggregation method is consistently more accurate than the canonical average review.

2 Related Work

In recent years, two prominent methods to induce honest reporting without the assumption of observable future outcomes were proposed: the Bayesian truth serum (BTS) method [Prelec, 2004] and the peer-prediction method [Miller et al., 2005].

The BTS method works on a single multiple-choice question with a finite number of alternatives. Each expert is requested to endorse the answer most likely to be true and to predict the empirical distribution of the endorsed answers. Experts are evaluated by the accuracy of their predictions as well as by how surprisingly common their answers are. The surprisingly-common criterion exploits the false-consensus effect to promote truthfulness, i.e.
, the general tendency of experts to overestimate the degree of agreement that others have with them.

The score received by an expert from the BTS method has two major components. The first one, called the information score, evaluates the answer endorsed by the expert according to the log-ratio of its actual-to-predicted endorsement frequencies. The second component, called the prediction score, is a penalty proportional to the relative entropy between the empirical distribution of answers and the expert's prediction of that distribution. Under the BTS scoring method, collective honest reporting is a Bayes-Nash equilibrium.

The BTS method has been used to promote honest reporting in many different domains, e.g., when sharing rewards amongst a set of experts [Carvalho and Larson, 2011] and in policy analysis [Weiss, 2009]. However, the BTS method has two major drawbacks. First, it requires the population of experts to be large. Second, besides reporting their opinions, experts must also make predictions about how their peers will report their opinions. While the artificial intelligence community has recently addressed the former issue [Witkowski and Parkes, 2012, Radanovic and Faltings, 2013], the latter issue is still an intrinsic requirement for using the BTS method.

The drawbacks of the BTS method are not shared by the peer-prediction method [Miller et al., 2005]. In the peer-prediction method, a number of experts experience a product and rate its quality. A mechanism then collects the ratings and makes payments based on those ratings. The peer-prediction method makes use of the stochastic correlation between the signals observed by the experts from the product to achieve a Bayes-Nash equilibrium where every expert reports honestly. A major problem with the peer-prediction method is that it depends on historical data.
For example, when applied to a peer-review setting, after a reviewer i reports his review, say r_i, the mechanism then estimates reviewer i's prediction of the review reported by another reviewer j, P(r_j | r_i), which is then evaluated and rewarded using a proper scoring rule and reviewer j's actual reported review. The mechanism needs to have a history of previously reported reviews for computing P(r_j | r_i), which is not always a reasonable assumption, e.g., when the evaluation criteria may change from review to review and when the peer-review process is being used for the first time. In other words, the peer-prediction method is prone to cold-start problems. Carvalho and Larson [2012] addressed this issue by making the extra assumption that experts have uninformative prior knowledge about the distribution of the observed signals. Given this assumption, honest reporting is induced by simply making pairwise comparisons between reported opinions and rewarding agreements.

In this paper, we extend the method by Carvalho and Larson [2012] in several ways. First, we show that the assumption of uninformative priors is unnecessary as long as experts have common prior distributions and this fact is common knowledge. Moreover, we provide stronger conditions with respect to the underlying proper scoring rule under which pairwise comparisons induce honest reporting.

Another contribution of our work is a method to aggregate the reported opinions into a single consensual opinion. Over the years, both behavioral and mathematical methods have been proposed to establish consensus [Clemen and Winkler, 1999]. Behavioral methods attempt to generate agreement through interaction and exchange of knowledge. Ideally, the sharing of information leads to a consensus.
However, behavioral methods usually provide no conditions under which experts can be expected to reach an agreement. On the other hand, mathematical aggregation methods consist of processes or analytical models that operate on the reported opinions in order to produce a single aggregate opinion. DeGroot [1974] proposed a model which describes how a group of experts can reach agreement on a consensual opinion by pooling their individual opinions. A drawback of DeGroot's method is that it requires each expert to explicitly assign weights to the opinions of other experts. In this paper, we propose a method to set these weights directly which takes the scores received by the experts into account. We also provide a behavioral interpretation of the proposed aggregation method.

A related method for finding consensus was proposed by Carvalho and Larson [2013]. Under the assumption that experts prefer probability distributions close to their own distributions, where closeness is measured by the root-mean-square deviation, the authors showed that a consensus is always achieved. Moreover, if risk-neutral experts are rewarded using the quadratic scoring rule, then the assumption that experts prefer probability distributions that are close to their own distributions follows naturally. The approach in this paper is more general because the underlying proper scoring rule can be any bounded proper scoring rule.

From an empirical perspective, we investigate the efficiency of both our scoring method and our method for finding consensus in a peer-review experiment. Formal experiments involving peer review are still relatively scarce. Even though the application of the peer-review process to scientific communication can be traced back almost 300 years, it was not until the early 1990s that research on this matter became more intensive and formalized [van Rooyen, 2001].
Scientists in the biomedical domain have been at the forefront of research on the peer-review process due to the fact that dependable quality-controlled information can literally be a matter of life and death in this research field. In particular, the staff of the renowned BMJ, formerly the British Medical Journal, have been studying the merits and limitations of peer review over a number of years [Lock, 1985, Godlee et al., 2003]. Most of their work has focused on defining and evaluating review quality [van Rooyen et al., 1999], and on examining the effect of specific interventions on the quality of the resulting reviews [van Rooyen, 2001].

One mechanism used to prevent bias in the peer-review process is called double-blind review, which consists of hiding both the authors' and the reviewers' identities. Indeed, it has been reported that such a practice reduces bias against female authors [Budden et al., 2008]. However, it can be argued that knowing the authors' identities makes it easier for the reviewers to compare the new manuscript with previously published papers, and it also encourages the reviewers to disclose conflicts of interest. Another argument that undermines the benefits of double-blind reviewing is that the authorship of the manuscript is often obvious to a knowledgeable reader from the context, e.g., self-referencing, research topic, writing style, working-paper repositories, seminars, etc. [Falagas et al., 2006, Justice et al., 1998, Yankauer, 1991]. Furthermore, this mechanism does not prevent certain types of bias, e.g., when a reviewer rejects new evidence or new knowledge because it contradicts established norms, beliefs, or paradigms.

Some work has focused on the calibration aspect of peer review. Roos et al.
[2011] proposed a maximum-likelihood method for calibrating reviews by estimating both the bias of each reviewer and the unknown ideal score of the manuscript. Bias is treated as the general rigor of a reviewer across all his reviews. Hence, Roos et al.'s method does not attempt to prevent bias by rewarding honest reporting. Instead, it adjusts reviews a posteriori so that they can be globally comparable.

Instead of calibrating reviews a posteriori, Robinson [2001] suggested "calibrating" reviewers a priori. Reviewers are first asked to review short texts that have gold-standard reviews, i.e., reviews of high quality provided by experts with relevant expertise. Thereafter, they receive calibration scores, which are later used as weighting factors to determine how well their future reviews will be considered. This approach, however, does not guarantee that reviewers will report honestly after the calibration phase, when gold-standard reviews are no longer available. To the best of our knowledge, our peer-review experiments are the first to investigate the use of incentives for honest reporting in a peer-review task.

When objective verification is not possible, as in the peer-review process, economic measures may be used to encourage experts to honestly disclose their opinions. The proposed scoring method does so by making pairwise comparisons between reported reviews and rewarding agreements. Rewarding experts based on pairwise comparisons has been empirically proven to be an effective incentive technique in other domains. Shaw et al. [2011] measured the effectiveness of a collection of social and financial incentive schemes for motivating experts to conduct a qualitative content analysis task.
The authors found that treatment conditions that provided financial incentives and asked experts to prospectively think about the responses of their peers produced more accurate responses. Huang and Fu [2013] showed that informing the experts that their rewards will be based on how similar their responses are to other experts' responses produces more accurate responses than telling the experts that their rewards will be based on how similar their responses are to gold-standard responses. Our work adds to the existing body of literature by theoretically and empirically showing that pairwise comparisons make the peer-review process more accurate.

3 The Basic Model

In our proposed peer-review process, a manuscript is reviewed by a set of reviewers N = {1, ..., n}, with n ≥ 2. The quality of the manuscript is represented by a multinomial distribution² Ω with unknown parameter ω = (ω_0, ..., ω_v), where v ∈ ℕ⁺ represents the best evaluation score that the manuscript can receive and ω_k is the probability assigned to the evaluation score being equal to k. Each reviewer is modeled as possessing a privately observed draw (signal) from Ω. Hence, our model captures the uncertainty of the reviewers regarding the quality of the manuscript. We extend the model to multiple observed signals in Section 5.

We denote the honest review of each reviewer i ∈ N by t_i ∼ Ω, where t_i ∈ {0, ..., v}. Honest reviews are independent and identically distributed, i.e., P(t_i | t_j) = P(t_i). We say that reviewer i is reporting honestly when his reported review r_i is equal to his honest review, i.e., r_i = t_i. Reviews are elicited and aggregated by a trusted entity referred to as the center³, which is also responsible for rewarding the reviewers. Let s_i be reviewer i's review score after he reports r_i.
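As a concrete illustration, the basic model above can be simulated in a few lines. This is only a sketch, not part of the paper's method: the prior parameter alpha = [2, 1, 1] (so v = 2), the seed, and n = 3 reviewers are hypothetical choices.

```python
# Sketch of the basic model: manuscript quality is a multinomial
# distribution Omega with unknown parameter omega, drawn here from a
# Dirichlet prior; each reviewer privately observes one signal t_i.
# alpha = [2, 1, 1] (so v = 2) and n = 3 are illustrative values.
import random

def sample_dirichlet(alpha, rng):
    # A Dirichlet draw via normalized Gamma samples.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_signal(omega, rng):
    # One evaluation score t_i in {0, ..., v}, distributed according to omega.
    u, acc = rng.random(), 0.0
    for k, p in enumerate(omega):
        acc += p
        if u <= acc:
            return k
    return len(omega) - 1

rng = random.Random(0)
alpha = [2, 1, 1]                                 # hypothetical pseudo-counts
omega = sample_dirichlet(alpha, rng)              # unknown quality distribution
honest_reviews = [sample_signal(omega, rng) for _ in range(3)]  # t_1, t_2, t_3
```

Sampling ω from the Dirichlet prior and the signals from ω mirrors the two-stage uncertainty in the model: reviewers are uncertain both about ω and about each other's draws.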
We discuss how s_i is determined in Section 4. Review scores are somehow coupled with relevant incentives, be they social-psychological, such as praise or visibility, or material rewards through prizes or money. We make four major assumptions in our model:

1. Autonomy: Reviewers cannot influence other reviewers' reviews, i.e., they do not know each other's identity and they are not allowed to communicate with each other during the reviewing process.
2. Risk Neutrality: Reviewers behave so as to maximize their expected review scores.
3. Dirichlet Priors: There exists a common prior distribution over ω, i.e., P(ω). We assume that this prior is a Dirichlet distribution and that this is common knowledge.
4. Rationality: After observing t_i, every reviewer i ∈ N updates his belief by applying Bayes' rule to the common prior, i.e., P(ω | t_i).

The first assumption describes how peer review is traditionally done in practice. The second assumption means that reviewers are self-interested and no external incentives exist for each reviewer. The third assumption means that reviewers have common prior knowledge about the quality of the manuscript, a natural assumption in the peer-review process. We discuss the formal meaning of such an assumption in the following subsection. The fourth assumption implies that the posterior distributions are consistent with Bayesian updating, i.e.:

P(ω | t_i) = P(t_i | ω) P(ω) / P(t_i)

The last three assumptions imply that reviewers are Bayesian decision-makers. We note that different modeling choices could have been used, e.g., models based on games of incomplete information. Unlike our model, an incomplete-information game is often used when experts do not know each other's beliefs.
To find strategic equilibria in such incomplete-information models, one would need information about experts' beliefs about each other's private information. A Bayesian structure could be used to model each expert's beliefs about the others, and it would permit the calculation of experts' expected scores, which are maximized at equilibrium. However, the natural autonomy assumption makes such a Bayesian structure unrealistic.

² We use the term multinomial distribution to refer to the generalization of the Bernoulli distribution for discrete random variables with any constant number of outcomes. The parameter of this distribution is a probability vector that specifies the probability of each possible outcome.
³ We refer to a single reviewer as "he" and to the center as "she".

3.1 Dirichlet Distributions

An important assumption in our model is that reviewers have Dirichlet priors over distributions of evaluation scores. The Dirichlet distribution can be seen as a continuous distribution over parameter vectors of a multinomial distribution. Since ω is the unknown parameter of the multinomial distribution that describes the quality of the manuscript, it is natural to consider a Dirichlet distribution as a prior for ω. Given a vector of positive integers, α = (α_0, ..., α_v), that determines the shape of the Dirichlet distribution, the probability density function of the Dirichlet distribution over ω is:

P(ω | α) = (1 / β(α)) ∏_{k=0}^{v} ω_k^{α_k − 1}    (1)

where:

β(α) = (∏_{k=0}^{v} (α_k − 1)!) / ((∑_{k=0}^{v} α_k) − 1)!

Figure 1 shows the above probability density when v = 2 for some parameter vectors α. For the Dirichlet distribution in (1), the expected value of ω_j is E[ω_j | α] = α_j / ∑_{k=0}^{v} α_k. The probability vector E[ω | α] = (E[ω_0 | α], ..., E[ω_v | α]) is called the expected distribution regarding ω.
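In code, the expected distribution is simply the normalized vector of pseudo-counts, and the standard Dirichlet-multinomial conjugate update amounts to incrementing the count of the observed outcome. This is a sketch with an illustrative uniform prior α = (1, 1, 1):

```python
# Expected distribution E[omega_j | alpha] = alpha_j / sum(alpha), plus the
# standard Dirichlet-multinomial conjugate update after one observed signal t
# (increment the matching pseudo-count). alpha = (1, 1, 1) is illustrative.
def expected_distribution(alpha):
    total = sum(alpha)
    return [a / total for a in alpha]

def posterior_predictive(alpha, t):
    # Conjugacy: the posterior is Dirichlet with the t-th count incremented.
    updated = list(alpha)
    updated[t] += 1
    return expected_distribution(updated)

alpha = (1, 1, 1)   # uniform prior over a 3-outcome quality distribution
print(posterior_predictive(alpha, 2))   # [0.25, 0.25, 0.5]
```

With the uniform prior, observing the top score 2 shifts the expected distribution from (1/3, 1/3, 1/3) to (1/4, 1/4, 1/2), exactly the "pseudo-count" reading described below.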
An interesting property of the Dirichlet distribution is that it is the conjugate prior of the multinomial distribution [Bernardo and Smith, 1994], i.e., the posterior distribution P(ω | α, t_i) is itself a Dirichlet distribution. This relationship is often used in Bayesian statistics to estimate hidden parameters of multinomial distributions. To illustrate this point, suppose that reviewer i observes the signal t_i = x, for x ∈ {0, ..., v}. After applying Bayes' rule, reviewer i's posterior distribution is P(ω | α, t_i = x) = P(ω | (α_0, α_1, ..., α_x + 1, ..., α_v)). Consequently, the new expected distribution is:

E[ω | α, t_i = x] = (α_0 / (1 + ∑_{k=0}^{v} α_k), α_1 / (1 + ∑_{k=0}^{v} α_k), ..., (α_x + 1) / (1 + ∑_{k=0}^{v} α_k), ..., α_v / (1 + ∑_{k=0}^{v} α_k))    (2)

We call the probability vector in (2) reviewer i's posterior predictive distribution regarding ω because it provides the distribution of future outcomes given the observed data t_i. With this perspective, we regard the values α_0, ..., α_v as "pseudo-counts" from "pseudo-data", where each α_k can be interpreted as the number of times that the ω_k-probability event has been observed before.

Throughout this paper, we assume that reviewers have common prior Dirichlet distributions and that this fact is common knowledge, i.e., the value of α is initially the same for all reviewers. A practical interpretation of this assumption is that reviewers have common prior knowledge about the quality of the manuscript, i.e., reviewers have a common expectation regarding the quality of arriving manuscripts. By using Dirichlet distributions as priors, belief updating can be expressed as an updating of the parameters of the prior distribution.⁴ Furthermore, the assumption of common knowledge allows the center to estimate reviewers' posterior distributions based solely on their reported reviews, a point which is explored by our proposed scoring method.

⁴ We note that other priors could have been used. However, the inference process would not necessarily be analytically tractable. In general, tractability can be obtained through conjugate distributions. Hence, another modeling choice is to consider that evaluation scores follow a normal distribution with unknown parameters. Assuming exchangeability, we can then use either the normal-gamma distribution or the normal-scaled inverse gamma distribution as the conjugate prior [Bernardo and Smith, 1994]. The major drawback with this approach is that continuous evaluation scores might bring extra complexity to the reviewing process.

Figure 1: Probability densities of Dirichlet distributions when v = 2 for different parameter vectors. Left: α = (1, 1, 1). Center: α = (2, 1, 1). Right: α = (2, 2, 2).

Due to its attractive theoretical properties, the Dirichlet distribution has been used to model uncertainty in a variety of different scenarios, e.g., when experts are sharing a reward based on peer evaluations [Carvalho and Larson, 2012] and when experts are grouped based on their individual differences [Navarro et al., 2006].

4 Scoring Method

In this section, we propose a scoring method to induce honest reporting of reviews. The proposed method is built on proper scoring rules [Winkler and Murphy, 1968].

4.1 Proper Scoring Rules

Consider an uncertain quantity with possible outcomes o_0, ..., o_v, and a probability vector z = (z_0, ..., z_v), where z_k is the probability value associated with the occurrence of outcome o_k. A scoring rule R(z, e) is a function that provides a score for the assessment z upon observing the outcome o_e, for e ∈ {0, ..., v}.
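The three classical strictly proper scoring rules that this section goes on to list (logarithmic, quadratic, and spherical) can be sketched as plain functions of an assessment z and an observed outcome index e. The belief vectors q and z below are illustrative choices:

```python
# Scoring rules R(z, e): z is a probability vector, e is the index of the
# observed outcome. These are the three standard strictly proper rules.
import math

def logarithmic(z, e):
    return math.log(z[e])                            # range (-inf, 0]

def quadratic(z, e):
    return 2 * z[e] - sum(p * p for p in z)          # range [-1, 1]

def spherical(z, e):
    return z[e] / math.sqrt(sum(p * p for p in z))   # range [0, 1]

def expected_score(rule, z, q):
    # E_q[R(z, e)] = sum_e q_e * R(z, e)
    return sum(q[e] * rule(z, e) for e in range(len(q)))

# Strict propriety: reporting the true belief q beats a distorted report z.
q = [0.6, 0.3, 0.1]   # true assessment (illustrative)
z = [0.4, 0.4, 0.2]   # distorted report (illustrative)
for rule in (logarithmic, quadratic, spherical):
    assert expected_score(rule, q, q) > expected_score(rule, z, q)
```

The final loop checks numerically, for one pair of distributions, the defining property of strict propriety discussed next: the expected score is uniquely maximized at the true assessment.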
A scoring rule is called strictly proper when an expert receives his maximum expected score if and only if his stated assessment z corresponds to his true assessment q = (q_0, ..., q_v) [Winkler and Murphy, 1968]. The expected score of z at q for a real-valued scoring rule R(z, e) is:

E_q[R(z, e)] = ∑_{e=0}^{v} q_e R(z, e)

Proper scoring rules have been used as a tool to promote honest reporting in a variety of domains, e.g., when sharing rewards amongst a set of experts based on peer evaluations [Carvalho and Larson, 2010, 2012], to incentivize experts to accurately estimate their own efforts to accomplish a task [Bacon et al., 2012], in prediction markets [Hanson, 2003], in weather forecasting [Gneiting and Raftery, 2007], etc. Some of the best-known strictly proper scoring rules, together with their scoring ranges, are:

logarithmic: R(z, e) = log z_e, with range (−∞, 0]
quadratic: R(z, e) = 2 z_e − ∑_{k=0}^{v} z_k², with range [−1, 1]    (3)
spherical: R(z, e) = z_e / √(∑_{k=0}^{v} z_k²), with range [0, 1]

All the above scoring rules are symmetric, i.e., R((z_0, ..., z_v), e) = R((z_{π_0}, ..., z_{π_v}), π_e), for all probability vectors z = (z_0, ..., z_v), for all permutations π on v + 1 elements, and for all outcomes indexed by e ∈ {0, ..., v}. We say that a scoring rule is bounded if R(z, e) ∈ ℝ, for all probability vectors z and all e ∈ {0, ..., v}. For example, the logarithmic scoring rule is not bounded because it might return −∞ whenever the probability vector z contains an element equal to zero, whereas both the quadratic and spherical scoring rules are always bounded. A well-known property of strictly proper scoring rules is that they remain strictly proper under positive affine transformations [Gneiting and Raftery, 2007], i.e.
argmax_z E_q[γ R(z, e) + λ] = argmax_z E_q[R(z, e)] = q, for a strictly proper scoring rule R, γ > 0, and λ ∈ ℝ.

Proposition 1. If R(z, e) is a strictly proper scoring rule, then a positive affine transformation of R, i.e., γ R(z, e) + λ, for γ > 0 and λ ∈ ℝ, is also strictly proper.

4.2 Review Scores

If we knew a priori reviewers' honest reviews, we could then compare the honest reviews to the reported reviews and reward agreement. However, due to the subjective nature of the peer-review process, we are facing a situation where this objective truth is practically unknowable. Our solution is to induce honest reporting through pairwise comparisons of reported reviews.

The first step towards computing each reviewer i's review score is to estimate his posterior predictive distribution E[ω | α, t_i] shown in (2) based on his reported review r_i. Let E[ω | α, r_i] = (E[ω_0 | α, r_i], ..., E[ω_v | α, r_i]) be such an estimation, where:

E[ω_k | α, r_i] = (α_k + 1) / (1 + Σ_{x=0}^{v} α_x) if r_i = k, and α_k / (1 + Σ_{x=0}^{v} α_x) otherwise.    (4)

Recall that the elements of reviewer i's true posterior predictive distribution are defined as:

E[ω_k | α, t_i] = (α_k + 1) / (1 + Σ_{x=0}^{v} α_x) if t_i = k, and α_k / (1 + Σ_{x=0}^{v} α_x) otherwise.

Clearly, E[ω_k | α, r_i] = E[ω_k | α, t_i] if and only if reviewer i is reporting honestly, i.e., when he reports r_i = t_i. The review score of reviewer i is determined as follows:

s_i = Σ_{j ≠ i} (γ R(E[ω | α, r_i], r_j) + λ)    (5)

where γ and λ are constants, for γ > 0 and λ ∈ ℝ, and R is a strictly proper scoring rule. Scoring rules require an observable outcome, or a "reality", in order to score an assessment. Intuitively, we consider each review reported by every reviewer other than reviewer i as an observed outcome, i.e.,
the evaluation score deserved by the manuscript, and then we score reviewer i's estimated posterior predictive distribution in (4) as an assessment of that value.

Proposition 2. Each reviewer i ∈ N strictly maximizes his expected review score if and only if r_i = t_i.

Proof. Let Θ_i = E[ω | α, t_i] and Φ_i = E[ω | α, r_i]. By the autonomy assumption, reviewers cannot affect their peers' reviews. Hence, we can restrict ourselves to showing that each reviewer i ∈ N strictly maximizes E_{Θ_i}[γ R(Φ_i, r_j) + λ], for j ≠ i, if and only if r_i = t_i.

(If part) Since R is a strictly proper scoring rule, from Proposition 1 we have that:

argmax_{Φ_i} E_{Θ_i}[γ R(Φ_i, r_j) + λ] = Θ_i

If r_i = t_i, then by construction Φ_i = Θ_i, i.e., the estimated posterior predictive distribution in (4) is equal to the true posterior predictive distribution in (2). Consequently, honest reporting strictly maximizes reviewers' expected review scores.

(Only-if part) Using a similar argument, given that R is a strictly proper scoring rule, from Proposition 1 we have that:

argmax_{Φ_i} E_{Θ_i}[γ R(Φ_i, r_j) + λ] = Θ_i

By construction, Φ_i = Θ_i if and only if r_i = t_i (see equations (2) and (4)). Thus, reviewers' expected review scores are strictly maximized only when reviewers are honest.

Another way to interpret the above result is to imagine that each reviewer is betting on the review deserved by the manuscript. Since the most relevant information available to him is the observed signal, the strategy that maximizes his expected review score is to bet on that signal, i.e., to bet on his honest review.
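To make the mechanics of equations (4) and (5) concrete, the following sketch (our own illustrative code, not the authors' implementation; the function names and the choice of the quadratic rule from (3) are ours) checks Proposition 2 empirically for a small informative prior:

```python
# Illustrative sketch (our code, not the authors'): empirical check of
# Proposition 2 in the single-signal model, using the quadratic rule (3).

def posterior_predictive(alpha, report):
    """Estimated posterior predictive distribution, equation (4)."""
    total = 1 + sum(alpha)
    return [(a + 1) / total if k == report else a / total
            for k, a in enumerate(alpha)]

def quadratic_score(z, e):
    """Quadratic scoring rule (3): R(z, e) = 2 z_e - sum_k z_k^2."""
    return 2 * z[e] - sum(p * p for p in z)

def expected_pairwise_score(alpha, report, true_signal, gamma=1.0, lam=0.0):
    """Expected value of one term of (5), taken with respect to the
    reviewer's true posterior predictive distribution (2)."""
    theta = posterior_predictive(alpha, true_signal)  # true distribution
    phi = posterior_predictive(alpha, report)         # estimated from report
    return sum(theta[e] * (gamma * quadratic_score(phi, e) + lam)
               for e in range(len(alpha)))

alpha = [2, 1, 1]  # v = 2 with a hypothetical informative prior
for t in range(len(alpha)):            # t is the observed signal
    best = max(range(len(alpha)),
               key=lambda r: expected_pairwise_score(alpha, r, t))
    assert best == t                   # honest reporting is optimal
```

For every possible signal, the truthful report is the unique maximizer of the expected pairwise score, in line with Proposition 2.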
When this happens, the true posterior predictive distribution in (2) is equal to the estimated posterior predictive distribution in (4) and, consequently, the expected score resulting from a strictly proper scoring rule is strictly maximized when the expectation is taken with respect to the true posterior predictive distribution.

It is important to observe that by incentivizing honest reporting, the scoring function in (5) also incentivizes accuracy since honest reviews are draws from the distribution that represents the true quality of the manuscript. In other words, the center is indirectly observing these draws when reviewers report honestly. Consequently, due to the law of large numbers, the distribution of the reported reviews converges to the distribution that represents the true quality of the manuscript as the number of honestly reported reviews increases. Our experimental results in Section 7 show that there indeed exists a strong correlation between honesty and accuracy.

Different interpretations of the scoring method in (5) arise depending on the underlying strictly proper scoring rule and the hyperparameter α. In the following subsections, we discuss two different interpretations: 1) when R is a symmetric and bounded strictly proper scoring rule and reviewers' prior distributions are non-informative; and 2) when R is a strictly proper scoring rule sensitive to distance.

4.3 Rewarding Agreement

Assume that reviewers' prior distributions are non-informative, i.e., all the elements making up the hyperparameter α have the same value. This happens when reviewers have no relevant prior knowledge about the quality of the manuscript. Consequently, the elements of reviewers' true and estimated posterior predictive distributions can take on only two possible values (see equations (2) and (4) for α_0 = α_1 = ··· = α_v).
Moreover, if R is a symmetric scoring rule, then the term R(E[ω | α, r_i], r_j) in (5) can take on only two possible values because a permutation of elements with similar values does not change the score of a symmetric scoring rule. When R is also strictly proper, it means that R(E[ω | α, r_i], r_j) = δ_max when r_i = r_j, and R(E[ω | α, r_i], r_j) = δ_min when r_i ≠ r_j, where δ_max > δ_min. Consequently, each term of the summation in (5) can be written as:

γ R(E[ω | α, r_i], r_j) + λ = γ δ_max + λ if r_i = r_j, and γ δ_min + λ otherwise.

When R is also bounded, we can then set γ = 1 / (δ_max − δ_min) and λ = −δ_min / (δ_max − δ_min), and the above values become, respectively, 1 and 0. Hence, the resulting review scores do not depend on parameters of the model. Moreover, we obtain an intuitive interpretation of the scoring method in (5): whenever two reported reviews are equal to each other, the underlying reviewers are rewarded by one payoff unit. Thus, in practice, our scoring method works by simply comparing reported reviews and rewarding agreements whenever R is a symmetric and bounded strictly proper scoring rule and reviewers have no informative prior knowledge about the quality of the manuscript.

Another interesting point is that the center can reward different agreements in different ways, i.e., reviewers are not necessarily equally valued. For example, if the center knows a priori that a particular reviewer j is reliable (respectively, unreliable), then she can increase (respectively, decrease) the reward of reviewers whose reviews are in agreement with reviewer j's reported review. Formally, this means that for different reviewers i and j, the center can use different values for γ and λ in (5). Proposition 2 is not affected by this as long as γ > 0, λ ∈ ℝ, and their values are independent of the reported reviews.
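The normalization described above can be checked numerically. The sketch below (our code; the quadratic rule is used as one concrete symmetric and bounded strictly proper rule) computes δ_max and δ_min under a non-informative prior and verifies that the transformed pairwise term equals 1 on agreement and 0 on disagreement:

```python
# Illustrative sketch (our code): with a non-informative prior and the
# symmetric, bounded quadratic rule (3), the affinely transformed pairwise
# term of (5) collapses to 1 on agreement and 0 on disagreement.

def posterior_predictive(alpha, report):
    """Estimated posterior predictive distribution, equation (4)."""
    total = 1 + sum(alpha)
    return [(a + 1) / total if k == report else a / total
            for k, a in enumerate(alpha)]

def quadratic_score(z, e):
    return 2 * z[e] - sum(p * p for p in z)

v = 4
alpha = [1] * (v + 1)                 # non-informative prior
phi = posterior_predictive(alpha, 0)
delta_max = quadratic_score(phi, 0)   # score when r_i = r_j
delta_min = quadratic_score(phi, 1)   # score when r_i != r_j
gamma = 1 / (delta_max - delta_min)
lam = -delta_min / (delta_max - delta_min)

for r_i in range(v + 1):
    z = posterior_predictive(alpha, r_i)
    for r_j in range(v + 1):
        s = gamma * quadratic_score(z, r_j) + lam
        assert abs(s - (1.0 if r_i == r_j else 0.0)) < 1e-12
```

For v = 4 this gives γ = 3 and λ = −1/3, matching the closed forms γ = (v + 2)/2 and λ = −v/(2v + 4) used in the numerical example of Section 4.5.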
Hence, by having a few reliable reviewers, this approach might help to eliminate the hypothetical scenario where a set of reviewers learn over time to report similar reviews. A similar idea was proposed by Jurca and Faltings [2009] to prevent collusions in reputation systems.

4.4 Strictly Proper Scoring Rules Sensitive to Distance

Pairwise comparisons, as defined in the previous subsection, might work well for small values of v, the best evaluation score that the manuscript can receive, but they can be too restrictive and, to some degree, unfair when the best evaluation score is high. For example, when v = 10 and the review used as the observed outcome is also equal to 10, a reported review equal to 9 seems to be more accurate than a reported review equal to 1. One effective way to deal with these issues is by using strictly proper scoring rules in (5) that are sensitive to distance.

Using the notation of Section 4.1, recall that z = (z_0, ..., z_v) is some reported probability distribution. Given that the outcomes are ordered, we denote the cumulative probabilities by capital letters: Z_k = Σ_{j ≤ k} z_j. We first define the notion of distance between two probability vectors as proposed by Staël von Holstein [1970]. We say that a probability vector z′ is more distant from the jth outcome than a probability vector z ≠ z′ if:

Z′_k ≥ Z_k, for k = 0, ..., j − 1
Z′_k ≤ Z_k, for k = j, ..., v

Intuitively, the above definition means that z can be obtained from z′ by successively moving probability mass towards the jth outcome from other outcomes [Staël von Holstein, 1970].
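The cumulative-probability condition above is easy to check mechanically; a minimal sketch (our code and example vectors):

```python
# Minimal sketch (our code; the example vectors are hypothetical) of
# Staël von Holstein's "more distant" relation via cumulative probabilities.
from itertools import accumulate

def more_distant(z_prime, z, j):
    """True if z_prime is more distant from outcome j than z, i.e.,
    Z'_k >= Z_k for k < j and Z'_k <= Z_k for k >= j."""
    Zp, Z = list(accumulate(z_prime)), list(accumulate(z))
    eps = 1e-12  # tolerance for floating-point cumulative sums
    return (all(Zp[k] >= Z[k] - eps for k in range(j)) and
            all(Zp[k] <= Z[k] + eps for k in range(j, len(Z))))

uniform = [1/3, 1/3, 1/3]
peaked = [1/6, 2/3, 1/6]  # mass moved towards outcome 1
assert more_distant(uniform, peaked, 1)      # uniform is farther from 1
assert not more_distant(peaked, uniform, 1)  # but not the other way round
```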
A scoring rule R is said to be sensitive to distance if R(z, j) > R(z′, j) whenever z′ is more distant from the jth outcome than z, for all j. Epstein [1969] introduced the ranked probability score (RPS), a strictly proper scoring rule that is sensitive to distance. Using the formulation of Epstein's result proposed by Murphy [1970], we have for a probability vector z and an observed outcome j ∈ {0, ..., v}:

RPS(z, j) = −Σ_{k=0}^{j−1} Z_k² − Σ_{k=j}^{v} (1 − Z_k)²    (6)

Figure 2: Scores returned by R(E[ω | α, r_i], j) for different reported reviews when v = 4, R is the RPS rule, and α = (1, 1, 1, 1, 1). Each line represents a different value for j (observed outcome).

Figure 2 illustrates the scores returned by (6) for different reported reviews and values of j when reviewers' prior distributions are non-informative. When using RPS as the strictly proper scoring rule in (5), reviewers are rewarded based on how close their reported reviews are to the reviews taken as observed outcomes. For example, when the review used as the observed outcome is equal to 0 (see the dotted line with squares in Figure 2), the returned score monotonically decreases as the reported review increases. Since RPS is strictly proper, Proposition 2 is still valid for any hyperparameter α, i.e., each reviewer strictly maximizes his expected review score by reporting honestly. The scoring range of RPS is [−v, 0]. Hence, review scores are always non-negative when using γ = 1 and λ = v in (5).

4.5 Numerical Example

Consider four reviewers (n = 4) and the best evaluation score being equal to four (v = 4).
Suppose that reviewers have non-informative Dirichlet priors with α = (1, 1, 1, 1, 1), and that reviewers 1, 2, 3, and 4 report, respectively, r_1 = 0, r_2 = 0, r_3 = 1, and r_4 = 4. From (4), the resulting estimated posterior predictive distributions are, respectively, E[ω | α, r_1 = 0] = (2/6, 1/6, 1/6, 1/6, 1/6), E[ω | α, r_2 = 0] = (2/6, 1/6, 1/6, 1/6, 1/6), E[ω | α, r_3 = 1] = (1/6, 2/6, 1/6, 1/6, 1/6), and E[ω | α, r_4 = 4] = (1/6, 1/6, 1/6, 1/6, 2/6). In what follows, we illustrate the scores returned by (5) when using a symmetric and bounded strictly proper scoring rule and when using RPS.

4.5.1 Rewarding Agreements

Assume that R in (5) is the quadratic scoring rule shown in (3), which in turn is symmetric, bounded, and strictly proper. Consequently, as discussed in Section 4.3, the term γ R(E[ω | α, r_i], r_j) + λ in (5) can take on only two values:

γ (4/(v+2) − (2/(v+2))² − Σ_{e=0}^{v−1} (1/(v+2))²) + λ if r_i = r_j,
γ (2/(v+2) − (2/(v+2))² − Σ_{e=0}^{v−1} (1/(v+2))²) + λ otherwise.

Hence, by setting γ = 1/(δ_max − δ_min) = (v+2)/2 and λ = −δ_min/(δ_max − δ_min) = −v/(2v+4), the above values are equal to, respectively, 1 and 0. Using the scoring method in (5), we obtain the following review scores: s_1 = s_2 = 1 and s_3 = s_4 = 0. That is, the review scores received by reviewers 1 and 2 are equal due to the fact that r_1 = r_2. Reviewers 3 and 4's review scores are equal to 0 because there is no match between their reported reviews and others' reported reviews.

4.5.2 Taking Distance into Account

Now, assume that R in (5) is the RPS rule shown in (6). In order to ensure non-negative review scores, let γ = 1 and λ = v = 4. Using the scoring method in (5), we obtain the following review scores: s_1 = s_2 = (−0.8333γ + λ) + (−0.5γ + λ) + (−1.5γ + λ) = 9.1667, s_3 = 2 · (−1.0833γ + λ) + (−1.4167γ + λ) = 8.4167, and s_4 = 2 · (−1.5γ + λ) + (−0.8333γ + λ) = 8.1667. The review score of reviewer 4 is the lowest because his reported review is the most different review, i.e., it has the largest distance between it and all of the other reviews.

5 Multiple Criteria

In our basic model, reviewers observe only one signal from the distribution that represents the quality of the manuscript. However, manuscripts are often evaluated under multiple criteria, e.g., relevance, clarity, originality, etc., meaning that in practice reviewers might observe multiple signals and report multiple evaluation scores. Under the assumption that these signals are independent, each reported evaluation score can be scored individually using the same scoring method proposed in the previous section. Clearly, Proposition 2 is still valid, i.e., honest reporting still strictly maximizes reviewers' expected review scores.

The lack of relationship between different criteria is not always a reasonable assumption. A modeling choice that takes the relationship between observed signals into account, which is also consistent with our basic model, is to assume that the quality of the manuscript is still represented by a multinomial distribution, but now reviewers may observe several signals from that distribution. Formally, let ρ ∈ ℕ+ be the number of draws from the distribution that represents the quality of the manuscript, where each signal represents an evaluation score related to a criterion. Instead of a single number, each reviewer i's private information is now a vector: t_i = (t_{i,1}, ..., t_{i,ρ}), where t_{i,k} ∈ {0, ..., v}, for k ∈ {1, ..., ρ}. The basic assumptions (autonomy, risk neutrality, Dirichlet priors, and rationality) are still the same.
For ease of exposition, we denote reviewer i's true posterior predictive distribution in this section by Θ_i^{(ρ)} = E[ω | α, t_i]. Under this new model, each reviewer i's posterior predictive distribution is now defined as:

Θ_i^{(ρ)} = ( (α_0 + Σ_{k=1}^{ρ} H(0, t_{i,k})) / (ρ + Σ_{x=0}^{v} α_x), (α_1 + Σ_{k=1}^{ρ} H(1, t_{i,k})) / (ρ + Σ_{x=0}^{v} α_x), ..., (α_v + Σ_{k=1}^{ρ} H(v, t_{i,k})) / (ρ + Σ_{x=0}^{v} α_x) )    (7)

where H(x, y) is an indicator function: H(x, y) = 1 if x = y, and 0 otherwise.

Assuming that each reported review is a vector of ρ evaluation scores, i.e., r_i = (r_{i,1}, ..., r_{i,ρ}), where r_{i,k} ∈ {0, ..., v}, for i ∈ N and k ∈ {1, ..., ρ}, the center estimates each reviewer i's posterior predictive distribution by applying Bayes' rule to the common prior. The resulting estimated posterior predictive distribution E[ω | α, r_i], referred to as Φ_i^{(ρ)} for ease of exposition, is:

Φ_i^{(ρ)} = ( (α_0 + Σ_{k=1}^{ρ} H(0, r_{i,k})) / (ρ + Σ_{x=0}^{v} α_x), ..., (α_v + Σ_{k=1}^{ρ} H(v, r_{i,k})) / (ρ + Σ_{x=0}^{v} α_x) )    (8)

Thereafter, the center rewards Φ_i^{(ρ)} by using a strictly proper scoring rule R and other reviewers' reported reviews as observed outcomes:

s_i = Σ_{j ≠ i} ( γ R(Φ_i^{(ρ)}, G(r_j)) + λ )    (9)

where G is some function used by the center to summarize each reviewer j's reported review in a single number, and whose image is equal to the set {0, ..., v}. For example, G can be a function that returns the median or the mode of the reported evaluation scores. Honest reporting, i.e., r_i = t_i, maximizes reviewers' expected review scores under this setting.

Proposition 3. When observing and reporting multiple signals, each reviewer i ∈ N maximizes his expected review score when r_i = t_i.

Proof.
Due to the autonomy assumption, we restrict ourselves to showing that each reviewer i ∈ N maximizes E_{Θ_i^{(ρ)}}[γ R(Φ_i^{(ρ)}, G(r_j)) + λ], for j ≠ i, when r_i = t_i. Given that R is a strictly proper scoring rule, from Proposition 1 we have that:

argmax_{Φ_i^{(ρ)}} E_{Θ_i^{(ρ)}}[γ R(Φ_i^{(ρ)}, G(r_j)) + λ] = Θ_i^{(ρ)}

When r_i = t_i, we have by construction that Φ_i^{(ρ)} = Θ_i^{(ρ)} (see equations (7) and (8)). Thus, reviewers' expected review scores are maximized when reviewers report honestly.

When observing and reporting multiple evaluation scores, a reviewer can weakly maximize his expected review score by reporting a review different from his true review as long as the estimated posterior predictive distributions are the same. For example, when reviewer i reports r_i = (1, 2, 3), the resulting estimated posterior predictive distribution is the same as when he reports r_i = (3, 1, 2), and, consequently, reviewer i receives the same review score in both cases. This implies that the scoring method in (9) is more suitable to a peer-review process where all criteria are equally weighted since honest reporting weakly maximizes expected review scores.

5.1 Summarizing Signals when Prior Distributions are Non-Informative

When reviewers report multiple evaluation scores, the intuitive interpretation of review scores as rewards for agreements that arises when using symmetric and bounded strictly proper scoring rules (see Section 4.3) is lost because the elements of the estimated posterior predictive distribution in (8) can take on more than two different values.
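As an illustration of (8), a reviewer's reported vector is summarized by counting score frequencies; a short sketch (our code, with a hypothetical reported vector):

```python
# Sketch (our code; the reported vector is hypothetical) of the
# multi-criteria estimate (8): count how often each evaluation score
# appears among the rho reported scores.

def posterior_predictive_multi(alpha, reports):
    """Estimated posterior predictive distribution Phi_i^(rho) from (8)."""
    rho = len(reports)
    total = rho + sum(alpha)
    counts = [sum(1 for r in reports if r == k) for k in range(len(alpha))]
    return [(a + c) / total for a, c in zip(alpha, counts)]

alpha = [1, 1, 1, 1, 1]  # non-informative prior, v = 4
phi = posterior_predictive_multi(alpha, [0, 1, 3])
assert phi == [2/8, 2/8, 1/8, 2/8, 1/8]
```

Note that any permutation of the reported vector yields the same estimate, which is why honest reporting is only weakly optimal under Proposition 3.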
A different approach that preserves the aforementioned intuitive interpretation when reviewers' prior distributions are non-informative is to ask the reviewers to summarize their observed signals into a single value before reporting it, instead of the center doing it on their behalf. Hence, each reviewer i is now reporting honestly when r_i = G(t_i), where G is some function suggested by the center whose image is equal to the set {0, ..., v}. This new model can be interpreted as if the reviewers were reviewing the manuscript under several criteria and reporting the manuscript's overall evaluation score by reporting the value G(t_i).

Since reviewers are reporting only one value, we can use the original scoring method in (5) to promote honest reporting. We prove below that for any symmetric and bounded strictly proper scoring rule, honest reporting strictly maximizes reviewers' expected review scores under the scoring method in (5) if and only if G is the mode of the observed signals, i.e., when G(t_i) = argmax_{x ∈ {0,...,v}} Σ_{k=1}^{ρ} H(x, t_{i,k}). Ties between observed signals are broken randomly.

Proposition 4. When observing multiple signals and reporting r_i = G(t_i), each reviewer i ∈ N with non-informative prior strictly maximizes his expected review score under the scoring method in (5), for a symmetric and bounded strictly proper scoring rule R, if and only if r_i = argmax_{x ∈ {0,...,v}} Σ_{k=1}^{ρ} H(x, t_{i,k}).

Proof. Recall that since each reviewer i ∈ N observes multiple signals, his true posterior predictive distribution is equal to E[ω | α, t_i] = Θ_i^{(ρ)} = (θ_0, θ_1, ..., θ_v), as shown in (7).
Due to Proposition 1, and since reviewers cannot affect their peers' reviews because of the autonomy assumption, we restrict ourselves to showing that each reviewer i ∈ N maximizes E_{Θ_i^{(ρ)}}[R(Φ_i^{(1)}, r_j)], for j ≠ i, if and only if r_i = argmax_{x ∈ {0,...,v}} Σ_{k=1}^{ρ} H(x, t_{i,k}), where Φ_i^{(1)} = E[ω | α, r_i = G(t_i)]. Let z ∈ {0, ..., v} be the most common signal observed by reviewer i. Hence, reviewer i's subjective probability associated with z is greater than his subjective probability associated with any other signal y ∈ {0, ..., v}, i.e., θ_{i,z} > θ_{i,y}.

(If part) Given that R is a symmetric and strictly proper scoring rule and that each reviewer is reporting only one evaluation score, the resulting score from R(Φ_i^{(1)}, r_j) can take on only two possible values: δ_max, if r_i = r_j, and δ_min otherwise (see the discussion in Section 4.3). When reporting r_i = argmax_{x ∈ {0,...,v}} Σ_{k=1}^{ρ} H(x, t_{i,k}) = z, reviewer i's expected review score is θ_{i,z} δ_max + Σ_{y ≠ z} θ_{i,y} δ_min. Given that θ_{i,z} > θ_{i,y}, for any y ≠ z, and δ_max > δ_min, this expected review score is maximized. Thus, reporting r_i = argmax_{x ∈ {0,...,v}} Σ_{k=1}^{ρ} H(x, t_{i,k}) = z maximizes reviewer i's expected review score.

(Only-if part) Recall that all the elements making up the hyperparameter α have the same value because reviewers' prior distributions are non-informative. Let Φ_i^{(1)} = (φ_{i,0}, ..., φ_{i,v}) be reviewer i's estimated posterior predictive distribution computed according to the original scoring method in (5) when reviewer i is reporting r_i = argmax_{x ∈ {0,...,v}} Σ_{k=1}^{ρ} H(x, t_{i,k}) = z, i.e.:

φ_{i,k} = (α_k + 1) / (1 + Σ_{x=0}^{v} α_x) = (α_k + 1) / ((v+1) α_k + 1) if k = z, and α_k / (1 + Σ_{x=0}^{v} α_x) = α_k / ((v+1) α_k + 1) otherwise.
For contradiction's sake, suppose that reviewer i maximizes his expected review score by misreporting his review and reporting r_i = y ≠ z. Let Φ̃_i^{(1)} = (φ̃_{i,0}, ..., φ̃_{i,v}) be reviewer i's estimated posterior predictive distribution when he is misreporting his review, i.e.:

φ̃_{i,k} = (α_k + 1) / ((v+1) α_k + 1) if k = y, and α_k / ((v+1) α_k + 1) otherwise.

As discussed in Section 4.3, the term R(Φ_i^{(1)}, r_j) can take on only two possible values whenever R is a symmetric scoring rule. Consequently, R(Φ̃_i^{(1)}, k) = R(Φ_i^{(1)}, k) for k ≠ z, y. A consequence of our assumption that reviewer i maximizes his expected review score by misreporting his review is that E_{Θ_i^{(ρ)}}[R(Φ̃_i^{(1)}, r_j)] ≥ E_{Θ_i^{(ρ)}}[R(Φ_i^{(1)}, r_j)]. Assuming that R is a symmetric and bounded proper scoring rule, this inequality becomes:

Σ_{k=0}^{v} θ_{i,k} R(Φ̃_i^{(1)}, k) ≥ Σ_{k=0}^{v} θ_{i,k} R(Φ_i^{(1)}, k)
⟹ θ_{i,z} R(Φ̃_i^{(1)}, z) + θ_{i,y} R(Φ̃_i^{(1)}, y) ≥ θ_{i,z} R(Φ_i^{(1)}, z) + θ_{i,y} R(Φ_i^{(1)}, y)
⟹ θ_{i,y} ≥ θ_{i,z} · (R(Φ_i^{(1)}, z) − R(Φ̃_i^{(1)}, z)) / (R(Φ̃_i^{(1)}, y) − R(Φ_i^{(1)}, y))

The second line follows from the fact that R(Φ̃_i^{(1)}, k) = R(Φ_i^{(1)}, k) for k ≠ z, y. Regarding the last line, we have by construction that R(Φ_i^{(1)}, z) = R(Φ̃_i^{(1)}, y) = δ_max, and R(Φ̃_i^{(1)}, z) = R(Φ_i^{(1)}, y) = δ_min. Consequently, we obtain that θ_{i,y} ≥ θ_{i,z}. As we stated before, since z is the most common signal observed by reviewer i, then θ_{i,z} > θ_{i,y}. Thus, we have a contradiction. So, E_{Θ_i^{(ρ)}}[R(Φ̃_i^{(1)}, r_j)] < E_{Θ_i^{(ρ)}}[R(Φ_i^{(1)}, r_j)], i.e., reviewer i maximizes his expected review score only if he reports r_i = argmax_{x ∈ {0,...,v}} Σ_{k=1}^{ρ} H(x, t_{i,k}) = z.
In other words, the above proposition says that each reviewer should report the evaluation score most likely to be deserved by the manuscript when their prior distributions are non-informative and they are rewarded according to the scoring method in (5). Any other evaluation score has a lower associated subjective probability and, consequently, reporting it results in a lower expected review score. To summarize, Proposition 4 implies that the scoring method proposed in (5) induces honest reporting by rewarding agreements whenever reviewers' prior distributions are non-informative and the center is interested in the mode of each reviewer's observed signals. It is noteworthy that Proposition 4 does not assume that the center knows a priori the number of observed signals ρ, thus providing more flexibility for practical applications of our method.

6 Finding a Consensual Review

After reviewers report their reviews and receive their review scores, there is still the question of how the center will use the reported reviews in making a suitable decision. Since reviewers are not always in agreement, belief aggregation methods must be used to combine the reported reviews into a single representative review. The traditional average method is not necessarily the best approach since unreliable reviewers might have a big impact on the aggregate review. Moreover, a consensual review is desirable because it represents a review that is acceptable by all.

In this section, we propose an adaptation of a classical mathematical method to find a consensual review. Intuitively, it works as if reviewers were constantly updating their reviews in order to aggregate knowledge from others. The scoring concepts introduced in previous sections are incorporated by the reviewers when updating their reviews.
In what follows, for the sake of generality, we assume that reviewers evaluate the manuscript under ρ ∈ ℕ+ criteria, i.e., each reviewer i observes ρ signals from the underlying distribution that represents the quality of the manuscript and reports a vector r_i = (r_{i,1}, ..., r_{i,ρ}) of evaluation scores, where r_{i,k} ∈ {0, ..., v} for all k. The center then estimates reviewer i's posterior predictive distribution E[ω | α, r_i], referred to as Φ_i^{(ρ)} for ease of exposition, as in (8). We relax our basic model by allowing the evaluation scores in the aggregate review to take on any real value between 0 and the best evaluation score v.

6.1 DeGroot's Model

DeGroot [1974] proposed a model that describes how a group might reach a consensus by pooling their individual opinions. When applying this model to a peer-review setting, each reviewer i is first informed of others' reported reviews. In order to accommodate the information and expertise of the rest of the group, reviewer i then updates his own review as follows:

r_i^{(1)} = Σ_{j=1}^{n} w_{i,j} r_j

where w_{i,j} is a weight that reviewer i assigns to reviewer j's reported review when he carries out this update. It is assumed that w_{i,j} ≥ 0, for every reviewer i and j, and Σ_{j=1}^{n} w_{i,j} = 1. In this way, each updated review takes the form of a linear combination of reported reviews, also known as a linear opinion pool. The weights must be chosen on the basis of the relative importance that reviewers assign to their peers' reviews. The whole updating process can be written in a more general form using matrix notation: R^{(1)} = W R^{(0)}, where W is the n × n matrix with entries w_{i,j}, and R^{(0)} is the n × ρ matrix whose ith row is the reported review r_i = (r_{i,1}, ..., r_{i,ρ}).

Since all the original reviews have changed, the reviewers might wish to update their new reviews in the same way as they did before. If there is no basis for the reviewers to change their assigned weights, the whole updating process after t revisions can then be represented as follows:

R^{(t)} = W R^{(t−1)} = W^t R^{(0)}    (10)

Let r_i^{(t)} = (r_{i,1}^{(t)}, ..., r_{i,ρ}^{(t)}) be reviewer i's review after t revisions, i.e., the ith row of the matrix R^{(t)}. We say that a consensus is reached if and only if r_i^{(t)} = r_j^{(t)}, for every reviewer i and j, when t → ∞.

6.2 Review Scores as Weights

The original method proposed by DeGroot [1974] does not encourage honesty in the sense that reviewers can assign weights to their peers' reviews however they wish as long as the weights are consistent with the construction previously defined. Furthermore, it requires the disclosure of reported reviews to the whole group when reviewers are weighting others' reviews, a fact which might be troublesome when the reviews are of a sensitive nature.

A possible way to circumvent the aforementioned problems is to derive weights from the original reported reviews by taking into account review scores. In particular, we assume the weight that a reviewer assigns to a peer's review is directly related to how close their estimated posterior predictive distributions are, where closeness is defined by an underlying proper scoring rule. We provide behavioral foundations for such an assumption in the following subsection.
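DeGroot's update (10) is straightforward to iterate; the sketch below (our code; the 2 × 2 weight matrix and the initial reviews are hypothetical) shows two reviewers converging to a consensus:

```python
# Minimal sketch of iterating DeGroot's update (10); the 2 x 2 weight
# matrix and the initial reviews below are hypothetical (our choice).

def degroot_step(W, R):
    """One revision round: R(t) = W R(t-1)."""
    n, rho = len(R), len(R[0])
    return [[sum(W[i][j] * R[j][k] for j in range(n)) for k in range(rho)]
            for i in range(n)]

W = [[0.5, 0.5], [0.25, 0.75]]  # rows are non-negative and sum to one
R = [[0.0], [3.0]]              # two reviewers, one criterion
for _ in range(100):
    R = degroot_step(W, R)
# Both rows converge to the same consensual review (approximately 2 here:
# the stationary vector of W is (1/3, 2/3), applied to the initial reviews).
assert abs(R[0][0] - R[1][0]) < 1e-9
assert abs(R[0][0] - 2.0) < 1e-6
```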
Formally, the weight that reviewer i assigns to reviewer j's reported review is computed as follows:

w_{i,j} = E_{Φ_i^{(ρ)}}[γ R(Φ_j^{(ρ)}, e) + λ] / Σ_{k=1}^{n} E_{Φ_i^{(ρ)}}[γ R(Φ_k^{(ρ)}, e) + λ]    (11)

That is, the weight w_{i,j} is proportional to the expected review score that reviewer i would receive if he had reported the review reported by reviewer j, where the expectation is taken with respect to reviewer i's estimated posterior predictive distribution. Consequently, the weight that each reviewer indirectly assigns to his own review is always the highest because R is a strictly proper scoring rule, i.e., argmax_{Φ_j^{(ρ)}} E_{Φ_i^{(ρ)}}[γ R(Φ_j^{(ρ)}, e) + λ] = Φ_i^{(ρ)}.

We assume that E_{Φ_i^{(ρ)}}[γ R(Φ_j^{(ρ)}, e) + λ] > 0, for every reviewer i and j. As long as R is bounded, this assumption can be met by appropriately setting the value of λ. Consequently, 0 < w_{i,j} < 1, for every i, j ∈ N. Moreover, Σ_{j=1}^{n} w_{i,j} = 1 because the denominator of the fraction in (11) normalizes the weights so that they sum to one.

In the interest of reaching a consensus, DeGroot's method in (10) is applied to the original reported reviews using the weights as defined in (11). We show that a consensus is always reached under the proposed method whenever the review scores are positive.

Proposition 5. If E_{Φ_i^{(ρ)}}[γ R(Φ_j^{(ρ)}, e) + λ] > 0, for every reviewer i, j ∈ N, then r_i^{(t)} = r_j^{(t)} when t → ∞.

Proof. Due to the assumption that E_{Φ_i^{(ρ)}}[γ R(Φ_j^{(ρ)}, e) + λ] > 0, for every reviewer i and j, all the elements of the matrix W in (10) are strictly greater than zero and strictly less than one. Moreover, the sum of the elements in any row is equal to one.
Consequently, W can be regarded as an n × n stochastic matrix, i.e., the one-step transition probability matrix of a Markov chain with n states and stationary transition probabilities. Furthermore, the underlying Markov chain is aperiodic and irreducible. Therefore, a standard limit theorem of Markov chains applies in this setting, namely: given an aperiodic and irreducible Markov chain with transition probability matrix W, every row of the matrix W^t converges to the same probability vector when t → ∞ [Ross, 1995].

Recall that r_i^{(t)} = Σ_{j=1}^{n} w_{i,j} r_j^{(t−1)} = Σ_{j=1}^{n} w_{i,j} Σ_{k=1}^{n} w_{j,k} r_k^{(t−2)} = ··· = Σ_{j=1}^{n} β_j r_j^{(0)}, where β = (β_1, β_2, ..., β_n) is a probability vector that incorporates all the previous weights. This equality implies that the consensual review can be represented as an instance of the linear opinion pool. Hence, an interpretation of the proposed method is that reviewers reach a consensus regarding the weights in (10). When β = (1/n, 1/n, ..., 1/n) in the above equality, the underlying linear opinion pool becomes the average of the reported evaluation scores. A drawback of an averaging approach is that it does not take into account the scoring concepts introduced in the previous sections, a fact which might favor unreliable reviewers. Moreover, disparate reviews might have a big impact on the resulting aggregate review. On the other hand, under our approach to finding β, reviewers weight down reviews far from their own reviews, which implies that the proposed method might be less influenced by disparate reviews. A numerical example in Subsection 6.4 illustrates this point. The experimental results discussed in Section 7 show that our method to find a consensual review is consistently more accurate than the traditional average method.
6.3 Behavioral Foundation

The major assumption behind our method for finding a consensual review is that reviewers assign weights according to (11). An interesting interpretation of (11) arises when the proper scoring rule $R$ is effective with respect to a metric $M$. Formally, given a metric $M$ that assigns a real number to any pair of probability vectors, which can be seen as the shortest distance between the two probability vectors, we say that a scoring rule $R$ is effective with respect to $M$ if the following relation holds for all probability vectors $\Phi^{(\rho)}_i$, $\Phi^{(\rho)}_j$, and $\Phi^{(\rho)}_k$ [Friedman, 1983]:

$$M\left(\Phi^{(\rho)}_i, \Phi^{(\rho)}_j\right) < M\left(\Phi^{(\rho)}_i, \Phi^{(\rho)}_k\right) \Longleftrightarrow E_{\Phi^{(\rho)}_i}\left[\gamma R\left(\Phi^{(\rho)}_j, e\right) + \lambda\right] > E_{\Phi^{(\rho)}_i}\left[\gamma R\left(\Phi^{(\rho)}_k, e\right) + \lambda\right]$$

Thus, when $R$ is effective with respect to a metric $M$, the higher the weight one reviewer assigns to a peer's review in (11), the closer their estimated posterior predictive distributions are according to the metric $M$. In other words, when using effective scoring rules, reviewers naturally prefer reviews close to their own reported reviews, and the weight that each reviewer assigns to his own review is always the highest one. Hence, in spirit, the resulting learning model in (10) can be seen as a model of anchoring [Tversky and Kahneman, 1974] in the sense that a reviewer's own review is an "anchor", and subsequent updates are biased towards reviews close to the anchor.

Friedman [1983] discussed some examples of effective scoring rules. For example, the quadratic scoring rule in (3) is effective with respect to the root-mean-square deviation, and the spherical scoring rule is effective with respect to a renormalized $L_2$-metric, whereas the logarithmic scoring rule is not effective with respect to any metric [Nau, 1985].
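For the quadratic rule in its standard form $R(p, e) = 2p_e - \sum_k p_k^2$, effectiveness with respect to the root-mean-square deviation follows from the identity $E_q[R(p, e)] = 2\,q \cdot p - \|p\|^2 = \|q\|^2 - \|q - p\|^2$: the expected score decreases monotonically in the distance between $p$ and $q$. A quick numerical check of the equivalence (the sampling setup is our own, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(7)

def expected_quadratic_score(q, p):
    # E_q[R(p, e)] for the quadratic rule R(p, e) = 2 p_e - sum_k p_k^2
    return 2 * q @ p - p @ p

def rms_deviation(q, p):
    return np.sqrt(np.mean((q - p) ** 2))

# Effectiveness: closer under the metric <=> higher expected score
checks = []
for _ in range(1000):
    q, p1, p2 = rng.dirichlet(np.ones(5), size=3)
    closer = rms_deviation(q, p1) < rms_deviation(q, p2)
    higher = expected_quadratic_score(q, p1) > expected_quadratic_score(q, p2)
    checks.append(closer == higher)
# The equivalence holds for every random triple (ties have probability zero)
```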
6.4 Numerical Example

Consider a peer-review process where the best evaluation score is four ($v = 4$), three reviewers ($n = 3$) observe three signals ($\rho = 3$), and they report the following reviews: $r_1 = (0, 1, 3)$, $r_2 = (0, 2, 3)$, and $r_3 = (4, 4, 4)$. Consequently, the matrix $R^{(0)}$ in (10) is:

$$R^{(0)} = \begin{pmatrix} 0 & 1 & 3 \\ 0 & 2 & 3 \\ 4 & 4 & 4 \end{pmatrix}$$

Consider the hyperparameter $\alpha = (1, 1, 1, 1, 1)$. Hence, the estimated posterior predictive distributions are $\Phi^{(3)}_1 = (2/8, 2/8, 1/8, 2/8, 1/8)$, $\Phi^{(3)}_2 = (2/8, 1/8, 2/8, 2/8, 1/8)$, and $\Phi^{(3)}_3 = (1/8, 1/8, 1/8, 1/8, 4/8)$. Assume that $R$ in (11) is the quadratic scoring rule in (3), and let $\gamma = 1$ and $\lambda = 1$ so that the resulting expected values in (11) are always positive. We obtain:

$$W = \begin{pmatrix} 0.3545 & 0.3455 & 0.3000 \\ 0.3455 & 0.3545 & 0.3000 \\ 0.3158 & 0.3158 & 0.3684 \end{pmatrix}$$

Focusing on the main diagonal of $W$, we notice that each reviewer always assigns the highest weight to his own review. From the first row of $W$, we can see that reviewer 1 assigns a high weight to his own review and to reviewer 2's review, and a lower weight to reviewer 3's review. This happens because reviewer 3's review is very distant from the others' reported reviews. We can draw similar conclusions from the other rows. We then obtain the following weights when carrying out DeGroot's method with the weights calculated according to (11):

$$\lim_{t \to \infty} W^t = \begin{pmatrix} 0.3390 & 0.3390 & 0.3220 \\ 0.3390 & 0.3390 & 0.3220 \\ 0.3390 & 0.3390 & 0.3220 \end{pmatrix}$$

Consequently, the consensual review is represented by any row of the matrix $\lim_{t \to \infty} W^t R^{(0)}$, which results in the vector $(1.288, 2.305, 3.322)$. It is worthwhile to discuss an interesting point regarding the above example. The aggregate review would be $(1.333, 2.333, 3.333)$ if it were equal to the average of the reported reviews.
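The numbers in this example can be reproduced in a few lines, assuming the standard form of the quadratic rule, $R(p, e) = 2p_e - \sum_k p_k^2$, whose expected score under $\Phi^{(3)}_i$ is $2\,\Phi^{(3)}_i \cdot \Phi^{(3)}_j - \|\Phi^{(3)}_j\|^2$:

```python
import numpy as np

# Reported reviews (rows) and estimated posterior predictive distributions
R0 = np.array([[0., 1., 3.],
               [0., 2., 3.],
               [4., 4., 4.]])
Phi = np.array([[2, 2, 1, 2, 1],
                [2, 1, 2, 2, 1],
                [1, 1, 1, 1, 4]]) / 8.0
gamma, lam = 1.0, 1.0

# Numerator of (11): E_{Phi_i}[gamma R(Phi_j, e) + lambda], which for the
# quadratic rule equals gamma * (2 Phi_i . Phi_j - ||Phi_j||^2) + lambda
S = gamma * (2 * Phi @ Phi.T - np.sum(Phi**2, axis=1)) + lam
W = S / S.sum(axis=1, keepdims=True)   # rows sum to one

Wt = np.linalg.matrix_power(W, 100)    # effectively lim W^t
consensus = Wt[0] @ R0                 # any row gives the consensual review
```

Running the sketch recovers the matrix $W$ and the consensual review $(1.288, 2.305, 3.322)$ up to rounding.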
Hence, reviewer 3's review would have more impact on the aggregate review, because the evaluation scores in the average review are all greater than the corresponding evaluation scores in the consensual review. In our proposed method, the influence of reviewer 3 on the aggregate review is diluted because his review is very different from the others' reviews. More formally, reviewer 3's estimated posterior predictive distribution is very distant from the others' estimated posterior predictive distributions when measured according to the root-mean-square deviation, the metric associated with the quadratic scoring rule.

7 Experiments

In this section, we describe a peer-review experiment designed to test the efficacy of both the proposed scoring method and the proposed aggregation method. In the following subsections, we discuss Amazon's Mechanical Turk, the platform used in our experiments, the experimental design, and our results.

7.1 Amazon's Mechanical Turk

Amazon's Mechanical Turk⁵ (AMT) is an online labor market originally developed for human-computation tasks, i.e., tasks that are relatively easy for human beings, but nonetheless challenging or even currently impossible for computers, e.g., audio transcription, filtering adult content, extracting data from images, etc. Several studies have shown that AMT can effectively be used as a means of collecting valid data in these settings [Snow et al., 2008, Marge et al., 2010]. More recently, AMT has also been used as a platform for conducting behavioral experiments [Mason and Suri, 2012].
One of the advantages that AMT offers to researchers is access to a large, diverse, and stable pool of people willing to participate in experiments for relatively low pay, thus simplifying the recruitment process and allowing a faster iteration between developing theory and executing experiments. Furthermore, AMT provides an easy-to-use built-in mechanism to pay workers that greatly reduces the difficulties of compensating individuals for their participation in experiments, and a built-in reputation system that helps requesters distinguish between good and bad workers and, consequently, ensure data quality. Numerous studies have shown that the results of behavioral studies conducted on AMT are comparable to results obtained in other online domains as well as in offline settings [Buhrmester et al., 2011, Horton et al., 2011], thus providing evidence that AMT is a valid means of collecting behavioral data.

7.2 Experimental Design

We designed a task on AMT that required workers, henceforth referred to as reviewers, to review 3 short texts under three different criteria: Grammar, Clarity, and Relevance. The first two texts were extracts from published poems, but with some original words intentionally replaced by misspelled words. The third text contained random words presented in a semi-structured way. All the details regarding the texts are included in the appendix. For each text, three questions were presented to the reviewers, each one having three possible responses ordered in decreasing order of negativity:

• Grammar: does the text contain misspellings, syntax errors, etc.?
  – A lot of grammar mistakes
  – A few grammar mistakes
  – No grammar mistakes

⁵ http://www.mturk.com

• Clarity: does the text, as a whole, make any sense?
  – The text does not make sense
  – The text makes some sense
  – The text makes perfect sense

• Relevance: could the text be part of a poem related to love?
  – The text cannot be part of a love poem
  – The text might be part of a love poem
  – The text is definitely part of a love poem

Words with subjective meaning, e.g., "a lot" and "a few", were intentionally used so as to simulate the subjective nature of the evaluation scores in a review. Each individual response was translated into an evaluation score inside the set $\{0, 1, 2\}$: the most negative response received the score 0, the middle response received the score 1, and the most positive response received the score 2. Thus, each reviewer reported a vector of 9 evaluation scores (3 texts times 3 criteria).

We recruited 150 reviewers on AMT, all of them residing in the United States of America and older than 18 years. They were required to accomplish the task in at most 20 minutes. Reviewers were split into 3 groups of equal size. After accomplishing the task, every reviewer in every group received a payment of 20 cents. A study by Ipeirotis [2010] showed that more than 90% of the tasks on AMT have a baseline payment of less than \$0.10, and 70% of the tasks have a baseline payment of less than \$0.05. Thus, our baseline payment was much higher than the average payment of other jobs posted to the AMT marketplace.

Each reviewer was randomly assigned to one of the three groups. Reviewers in two of the groups, the treatment groups, could earn an additional bonus of up to 10 cents. Reviewers in the first treatment group, referred to as the Bonus Group (BG), were informed that their bonuses would be proportional to the number of reviews similar to their reported reviews.
Reviewers in the second treatment group, the Bonus and Information Group (BIG), received similar information, but they also received a short summary of some theoretical results presented in this paper: "A group of researchers from the University of Waterloo (Canada) formally showed that the best strategy to maximize your expected bonus in this setting is by being honest, i.e., by considering each question thoroughly and deciding the best answers according to your personal opinion". Members of the third group, the Control Group (CG), received neither extra explanations nor bonuses. Their reported reviews were used as the control condition.

Bonuses were computed by rewarding agreements as described in Section 4.3. Due to the one-shot nature of this peer-review task, we assumed that Dirichlet priors were non-informative with hyperparameter $\alpha = (1, 1, 1)$. For each reported evaluation score, there could be at most 49 similar reported evaluation scores because each group had 50 members. We then used the formula $\frac{10}{9} \times \frac{\#\text{agreements}}{49}$ to calculate the reward for an individual evaluation score. Given that each reviewer reported 9 evaluation scores, if the evaluation scores reported by all members of a group were the same, then all group members would receive the maximum bonus of 10 cents. The provided bonuses can be seen as review scores. Our primary objective when performing this experiment was to empirically investigate the extent to which providing review scores affects the quality of the reported reviews.

7.3 Gold-Standard Evaluation Scores

Since the source and original content of each text were known a priori, i.e., before the experiments were conducted, we were able to derive gold-standard reviews for each text.
In order to avoid confirmation bias⁶, we asked five professors and tutors from the English and Literature Department at the University of Waterloo to provide their reviews for each text. We set the gold-standard evaluation score for each criterion in a text as the median of the evaluation scores reported by the professors and tutors. Coincidentally, each median value was also the mode of the underlying evaluation scores. All the evaluation scores reported by the professors and tutors, as well as the respective gold-standard evaluation scores, are in the appendix.

7.4 Hypotheses

Our first research question was whether or not providing review scores through pairwise comparisons makes the reported reviews more accurate, i.e., closer to the gold-standard reviews. Based on our theoretical results, our hypothesis was:

Hypothesis 1. The average accuracy of group BIG is greater than the average accuracy of group BG, which in turn is greater than the average accuracy of group CG.

In other words, the resulting reviews would be on average more accurate when reviewers received review scores, and the extra explanation regarding the theory behind the scoring method would provide more credibility to it, thus making the reviews more accurate. Regarding the resulting bonuses, since honest reporting maximizes reviewers' expected review scores in our model, our second hypothesis was:

Hypothesis 2. The average bonus received by members of group BIG is greater than the average bonus received by members of group BG, which in turn is greater than the average bonus received by members of group CG.

In order to test whether or not Hypothesis 2 was true, we used the bonus the members of group CG would have received had they received any bonus.
It is important to note that Hypothesis 1 was measured by comparing how close the reported reviews were to the gold-standard reviews, whereas Hypothesis 2 was measured by making pairwise comparisons between reported reviews: the higher the number of agreements, the greater the resulting bonus.

Another metric used to compare the groups' performance was the task completion time. The amount of time spent by reviewers on the reviewing task can be seen as a proxy for the effort they exerted to complete the task. Regarding this metric, we expected reviewers who received review scores to be more cautious when completing their tasks. Moreover, the extra explanation regarding the theory behind the scoring method would provide more credibility to it, thus making the members of group BIG work harder on the task. Hence, our third hypothesis was:

Hypothesis 3. The average task completion time of group BIG is greater than the average task completion time of group BG, which in turn is greater than the average task completion time of group CG.

Finally, we believed that the consensual review, computed as described in Section 6, would be more accurate than the average review since disparate reviews are less likely to have a big influence on the consensual review than on the average review. Hence, our fourth hypothesis was:

Hypothesis 4. The average accuracy of the consensual review is greater than the average accuracy of the average review.

⁶ The tendency to interpret information in a way that confirms one's preconceptions [Plous, 1993].

Table 1: Accuracy of each group on individual criteria. The average of the absolute difference between the reported evaluation scores and the corresponding gold-standard evaluation scores is shown below each group. For each criterion, the lowest average is highlighted in bold. The standard deviations are in parentheses.
One-tailed p-values resulting from rank-sum tests are shown in the last three columns. Given the notation A-B, the null hypothesis is that the outcome measures resulting from groups A and B are equivalent, and the alternative hypothesis is that the outcome measure resulting from group A is less than the outcome measure resulting from group B.

                                                                       p-values
                   BG               BIG              CG               BIG-BG   BIG-CG   BG-CG
Text 1  Grammar    0.5000 (0.5051)  0.3200 (0.4712)  0.4400 (0.5014)  0.035**  0.110    0.726
        Clarity    0.8200 (0.6606)  0.6200 (0.6024)  0.8600 (0.7287)  0.065*   0.052*   0.413
        Relevance  0.2200 (0.5067)  0.2000 (0.4518)  0.3000 (0.5803)  0.484    0.213    0.230
Text 2  Grammar    0.4400 (0.5014)  0.3600 (0.4849)  0.3800 (0.4903)  0.209    0.420    0.729
        Clarity    0.5000 (0.6468)  0.3800 (0.6024)  0.5400 (0.6131)  0.155    0.067*   0.325
        Relevance  0.4400 (0.5014)  0.6400 (0.4849)  0.6600 (0.4785)  0.977    0.419    0.014**
Text 3  Grammar    0.7600 (0.8466)  0.7800 (0.8640)  1.0200 (0.8449)  0.539    0.077*   0.061*
        Clarity    0.1400 (0.4046)  0.0000 (0.0000)  0.1600 (0.3703)  0.006**  0.002**  0.301
        Relevance  0.1200 (0.4352)  0.1000 (0.3642)  0.2000 (0.4949)  0.491    0.112    0.122

* p ≤ 0.1    ** p ≤ 0.05

7.5 Experimental Results

7.5.1 Accuracy on Individual Criteria

In our first analysis, we computed the absolute difference between each reported evaluation score and the corresponding gold-standard evaluation score. Thus, the outcome measure was an integer with a value between zero and two, and the closer this value was to zero, the better the resulting accuracy. Table 1 shows the average accuracy of each group on individual criteria. Focusing first on the groups BG and BIG, the group BIG is the most accurate group on all criteria, except for the criterion Relevance in Text 2 and the criterion Grammar in Text 3. This result is statistically significant with p-value ≤ 0.1 in three out of the seven cases in which BIG is more accurate than BG.
In two out of these three statistically significant cases, the result is also statistically significant with p-value ≤ 0.05. BG is more accurate than BIG on only two criteria, and this result is only statistically significant for the criterion Relevance in Text 2 (p-value ≤ 0.05). The group CG, the control condition that involved no incentives beyond the baseline compensation offered for completing the task, never outperforms both BG and BIG at the same time, and it is the least accurate group in seven out of the nine criteria. On two (respectively, four) occasions, CG is statistically significantly less accurate than BG (respectively, BIG) with p-value ≤ 0.1.

Table 2: Aggregate accuracy of each group. The average of the sum of the absolute differences between the reported evaluation scores and the corresponding gold-standard evaluation scores is shown below each group. For each text and for the whole task, the lowest average is highlighted in bold. The standard deviations are in parentheses. One-tailed p-values resulting from rank-sum tests are given in the last three columns. Given the notation A-B, the null hypothesis is that the outcome measures resulting from groups A and B are equivalent, and the alternative hypothesis is that the outcome measure resulting from group A is less than the outcome measure resulting from group B.

                                                           p-values
          BG               BIG              CG               BIG-BG   BIG-CG   BG-CG
Text 1    1.5400 (1.1287)  1.1400 (1.0304)  1.6000 (1.4142)  0.043**  0.085*   0.588
Text 2    1.3800 (1.0669)  1.3800 (0.9666)  1.5800 (0.9916)  0.547    0.163    0.148
Text 3    1.0200 (1.1865)  0.8800 (0.9179)  1.3800 (1.1933)  0.394    0.020**  0.052*
Overall   3.9400 (2.2352)  3.4000 (1.6903)  4.5600 (2.1301)  0.110    0.002**  0.064*

* p ≤ 0.1    ** p ≤ 0.05

Given these results, we conclude that Hypothesis 1 is true for individual criteria, i.e.
, the resulting reviews are on average more accurate when using review scores, and the extra explanation regarding the theory behind the scoring method seems to provide more credibility to it, thus improving the accuracy of the reported reviews.

7.5.2 Aggregate Accuracy

We also computed the aggregate accuracy of each group for each text as well as for the whole task. In the former case, the outcome measure was the sum of the absolute differences between each reported evaluation score for a given text and the corresponding gold-standard evaluation score. For example, given $(0, 1, 2)$ as the reported evaluation scores for Text 1, and $(1, 2, 2)$ as the corresponding gold-standard evaluation scores, the outcome measure for Text 1 would be $|0 - 1| + |1 - 2| + |2 - 2| = 2$. For the whole task, we summed the absolute differences across all criteria and texts.

Table 2 shows the aggregate accuracy of each group. For every single text as well as for the overall task, members of the group CG report less accurate reviews than members of the group BG and the group BIG. For the group BG, this result is statistically significant for Text 3 and for the overall task (p-value ≤ 0.1). For the group BIG, this result is statistically significant for Text 1 (p-value ≤ 0.1), Text 3 (p-value ≤ 0.05), and for the whole task (p-value ≤ 0.05). Thus, the experimental results suggest that providing review scores produces a significant improvement in quality over the control condition. Moreover, providing an extra explanation about the theory behind the scoring method improves the final quality of the reviews because, on average, the reviews from group BIG are more accurate than the reviews from group BG. This result is statistically significant for Text 1 (p-value ≤ 0.05). Therefore, we conclude that Hypothesis 1 is also true at the aggregate level.
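The aggregate outcome measure used above is simply the L1 distance between the reported and gold-standard score vectors; a minimal sketch (the helper name is ours):

```python
import numpy as np

def aggregate_error(reported, gold):
    """Sum of absolute differences between reported and gold-standard scores."""
    return int(np.abs(np.asarray(reported) - np.asarray(gold)).sum())

# The worked example from the text: reported (0, 1, 2) vs. gold standard (1, 2, 2)
err = aggregate_error([0, 1, 2], [1, 2, 2])  # -> 2
```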
Table 3: Average bonus and completion time per group. The highest average values are highlighted in bold. The standard deviations are in parentheses. One-tailed p-values resulting from rank-sum tests are given in the last three columns. Given the notation A-B, the null hypothesis is that the outcome measures resulting from groups A and B are equivalent, and the alternative hypothesis is that the outcome measure resulting from group A is greater than the outcome measure resulting from group B.

                                                              p-values
        BG                BIG                CG                 BIG-BG      BIG-CG      BG-CG
Bonus   0.053 (0.0086)    0.058 (0.0073)     0.050 (0.0078)     <0.0005**   <0.0005**   0.0025**
Time    178.66 (87.4495)  215.90 (127.7471)  196.36 (149.0788)  0.0232**    0.0257**    0.4208

* p ≤ 0.1    ** p ≤ 0.05

7.5.3 Bonus

The average bonus per group is shown in the first row of Table 3. From it, we conclude that Hypothesis 2 is true, i.e., the average bonus received by members of BIG is greater than the average bonus received by members of BG, which in turn is greater than the average bonus hypothetically received by members of CG. All these results are statistically significant with p-value ≤ 0.05. In other words, providing review scores and informing reviewers about the theory behind the scoring method do indeed increase the number of reported reviews that are similar.

Interestingly, there is a strong negative correlation between bonuses and the aggregate absolute error for the whole task shown in the fourth row of Table 2, even though the former is computed by making pairwise comparisons between reported reviews, whereas the latter is computed by comparing reported reviews with gold-standard reviews. The Pearson correlation coefficients for BG, BIG, and CG are, respectively, −0.73, −0.79, and −0.72.
This result implies that there exists a strong positive correlation between honest reporting and accuracy in this task, a fact which is in agreement with our theoretical model.

7.5.4 Completion Time

The average completion time per group is shown in the second row of Table 3. We start by noting that Hypothesis 3 is not true. Surprisingly, the average time spent on the task by members of the group BG is statistically equivalent to the average time spent by members of the group CG, since the null hypothesis cannot be rejected. The average completion time of members of the group BIG is the highest amongst the three groups, and this result is statistically significant with p-value ≤ 0.05. A possible explanation for this result is that reviewers work on the reviewing task more seriously, by taking more time to complete it, when they receive a brief explanation regarding some theoretical results of the proposed scoring method, whereas they could be quickly guessing how their peers would review the texts when the extra explanation about the theoretical results is not provided. It is noteworthy that even though the average values might suggest that spending more time reviewing the texts results in higher bonuses and lower overall absolute errors, we do not find any significant correlation between these variables at an individual level.

Table 4: Accuracy of the average review (AVG) and the consensual review (CR) per group. The average of the absolute difference between the aggregate evaluation scores and the corresponding gold-standard evaluation scores is shown per group and criterion. For each criterion, the lowest average absolute difference in each group is highlighted in bold. The standard deviations are in parentheses. A total of 1000 bootstrap resamples were used.
                   BG                                 BIG                                CG
                   AVG              CR                AVG              CR                AVG              CR
Text 1  Grammar    0.2622 (0.0920)  0.2781 (0.1015)   0.1238 (0.0680)  0.1086 (0.0631)   0.1285 (0.0800)  0.1307 (0.0829)
        Clarity    0.8219 (0.0903)  0.8159 (0.0963)   0.6211 (0.0850)  0.6078 (0.0987)   0.8633 (0.1031)  0.8477 (0.1156)
        Relevance  0.2188 (0.0676)  0.1462 (0.0536)   0.2020 (0.0622)  0.1382 (0.0512)   0.2996 (0.0783)  0.2156 (0.0696)
Text 2  Grammar    0.1643 (0.0819)  0.1667 (0.0855)   0.0754 (0.0571)  0.0698 (0.0541)   0.0716 (0.0545)  0.0679 (0.0531)
        Clarity    0.4965 (0.0879)  0.4314 (0.0970)   0.3812 (0.0858)  0.3036 (0.0860)   0.5405 (0.0857)  0.5022 (0.0987)
        Relevance  0.3590 (0.0804)  0.3508 (0.0933)   0.4360 (0.0948)  0.4957 (0.1061)   0.4996 (0.0909)  0.5642 (0.0999)
Text 3  Grammar    0.7598 (0.1204)  0.6976 (0.1473)   0.7792 (0.1222)  0.7198 (0.1505)   1.0231 (0.1161)  1.0288 (0.1455)
        Clarity    0.1379 (0.0565)  0.0859 (0.0400)   0.0000 (0.0000)  0.0000 (0.0000)   0.1615 (0.0522)  0.1115 (0.0453)
        Relevance  0.1197 (0.0619)  0.0699 (0.0400)   0.1019 (0.0521)  0.0597 (0.0335)   0.2022 (0.0696)  0.1320 (0.0537)

7.5.5 Consensus

Lastly, we tested the accuracy of the method proposed in Section 6 for finding a consensual review. We compared the resulting consensual review with the average review by using a bootstrapping technique. For each group of reviewers, we randomly resampled reviews with replacement from the original dataset so as to obtain bootstrap resamples. The size of each bootstrap resample was equal to the size of the original dataset, i.e., each bootstrap resample contained 50 data points (the original number of reviewers), each one consisting of 9 evaluation scores.

For each bootstrap resample, we aggregated evaluation scores individually using both the proposed method for finding a consensual review and the average method. For each evaluation score, the weight that each reviewer $i$ assigned to reviewer $j$'s reported evaluation score was computed according to equation (11).
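The resampling-and-aggregation loop just described can be sketched as follows. Since scores lie in $\{0, 1, 2\}$ and the prior is the non-informative Dirichlet with $\alpha = (1, 1, 1)$, the numerator of (11) reduces to two values, 0.5 when two reported scores agree and 0.25 otherwise; the helper names and the data in the usage example are our own, hypothetical illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def consensual_score(scores):
    """Consensual aggregate of a single evaluation score across reviewers.

    Weights use the two-valued form of the numerator of (11) in this
    setting: 0.5 when two reviewers report the same score, 0.25 otherwise.
    """
    scores = np.asarray(scores, dtype=float)
    S = np.where(scores[:, None] == scores[None, :], 0.5, 0.25)
    W = S / S.sum(axis=1, keepdims=True)           # row-stochastic weights
    return float(np.linalg.matrix_power(W, 200)[0] @ scores)

def bootstrap_errors(scores, gold, n_boot=1000):
    """Mean absolute error of consensual vs. average aggregation
    over bootstrap resamples drawn with replacement."""
    scores = np.asarray(scores, dtype=float)
    err_cr = err_avg = 0.0
    for _ in range(n_boot):
        sample = rng.choice(scores, size=scores.size, replace=True)
        err_cr += abs(consensual_score(sample) - gold)
        err_avg += abs(sample.mean() - gold)
    return err_cr / n_boot, err_avg / n_boot

# Hypothetical example: eight reviewers report 2, two report 0 (gold = 2);
# the two disparate reviewers pull the average more than the consensus.
cr_err, avg_err = bootstrap_errors([2, 2, 2, 2, 2, 2, 2, 2, 0, 0], 2.0,
                                   n_boot=200)
```

In this hypothetical setup the consensual aggregate stays closer to the gold standard than the plain average, mirroring the dilution of disparate reviews discussed in Section 6.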
The proper scoring rule $R$ and the constants $\gamma$ and $\lambda$ were set so as to reward agreement as in Section 4.3. Given that the best evaluation score in this task was $v = 2$ and $\alpha = (1, 1, 1)$, each element of a reviewer's estimated posterior predictive distribution in (4) could take on only two values: 0.25 and 0.5. Consequently, the numerator in (11) could take on only two values: 0.5, if reviewer $i$'s and reviewer $j$'s reported evaluation scores were the same, and 0.25 otherwise.

After aggregating the evaluation scores, we computed the accuracy of each aggregation method. The outcome measure was the absolute difference between each aggregate evaluation score and the corresponding gold-standard evaluation score. Thus, the outcome measure was a number between zero and two, and the closer this value was to zero, the better the resulting accuracy. Table 4 shows the average accuracy by group resulting from a total of 1000 bootstrap resamples.

Table 4 shows that consensual evaluation scores are more accurate than average evaluation scores in 20 out of 27 cases, and equally accurate in one case. It comes as no surprise that consensual evaluation scores are more accurate in the groups where review scores were provided, since these groups reported more accurate reviews and their reported reviews were more similar, as previously discussed. We performed a statistical analysis to investigate whether or not these differences in accuracy are statistically significant. Since we used the same bootstrap resamples for both aggregation methods, the Wilcoxon signed-rank test was used. The null hypothesis was that the outcome measures resulting from both aggregation methods are equivalent.
The alternative hypothesis was that the outcome measure resulting from the consensual method was less than the outcome measure resulting from the average method, which implies that the former is more accurate than the latter. All the resulting 27 p-values are extremely small ($< 10^{-10}$). Therefore, we conclude that Hypothesis 4 is indeed true, i.e., the proposed method for finding a consensual review is, on average, more accurate than the average approach in this experiment. As discussed in Section 6, we believe this result happens because disparate reported reviews are less likely to have a big influence on the consensual review than on the average review.

8 Conclusion

We proposed a scoring method built on strictly proper scoring rules that induces honest reporting when outcomes are not observable. We illustrated the mechanics behind our scoring method by applying it to the peer-review process. In order to do so, we modeled the peer-review process using a Bayesian model where the uncertainty regarding the quality of the manuscript is taken into account. The main assumptions in our model are that reviewers cannot be influenced by other reviewers, and that reviewers are Bayesian decision-makers. We then showed how our scoring method can be used to evaluate reported reviews and to encourage honest reporting by risk-neutral reviewers. The proposed method assigns scores based on how close reported reviews are, where closeness is defined by an underlying proper scoring rule. Under the aforementioned assumptions, we showed that risk-neutral reviewers strictly maximize their expected scores by honestly disclosing their reviews. We also proposed an extension of our model and scoring method to scenarios where reviewers evaluate a manuscript under several criteria.
We discussed how honest reporting is related to accuracy in our model: when reviewers report honestly, the distribution of reported reviews converges to the distribution that represents the quality of the manuscript as the number of reported reviews increases. Since all reviews are not always in agreement, we suggested an adaptation of the method proposed by DeGroot [1974] to find a consensual review. Intuitively, the proposed method works as if the reviewers were going through several rounds of discussion, where in each round they are informed about others' reported reviews, and they update their own reviews using a predefined updating rule in order to reach a consensus. Formally, each updated review is a convex combination of reported reviews, where review scores are used as part of the weights that reviewers assign to their peers' reviews. We showed that the resulting method always converges to a consensual review when reviewers' expected review scores are positive, and we provided behavioral foundations for the aggregation method.

We tested the efficacy of both the proposed scoring method and the proposed aggregation method in a peer-review experiment using Amazon's Mechanical Turk. Our experimental results corroborated the relationship between honest reporting and accuracy in our model. We empirically showed that providing review scores through pairwise comparisons results in more accurate reviews than the traditional peer-review process, where reviewers have no direct incentives for expressing their true reviews. Moreover, reviewers tended to agree more with each other when they received review scores. In addition, our method for finding a consensual review outperformed the traditional average method in our peer-review experiments.

For ease of exposition, our discussion on peer review was focused on scientific communication.
However, our model and scoring method are readily applicable to most peer-review settings, e.g., academic courses, clinical peer review, etc. Moreover, our proposed method to incentivize honest reporting is readily applicable to different domains. For example, our method can be used to incentivize honest feedback in reputation systems, where individuals rate a product/service after experiencing it, and to induce honest evaluation of different strategic plans in an organization's strategic-planning process. The proposed aggregation method is also general in the sense that it can be applied to any decision-analysis process where experts express their opinions through probability distributions over a set of exhaustive and mutually exclusive outcomes.

Given the positive results obtained in our peer-review experiments, an interesting open question is whether or not the methods proposed in this paper would perform as well in other domains, such as the aforementioned reputation systems and strategic planning in organizations. Another question worth contemplating is whether or not incentives other than the received scores play a role in our scoring method. For example, one can conjecture that altruism may play an important role in our scoring method. In our peer-review experiments, the performance of the reviewers affects not only their own review scores, but also the review scores of their peers. In other words, if reviewers do not put enough effort into reporting high-quality reviews, not only might they receive low review scores, but other reviews evaluated against those erroneous reviews might also receive low review scores. Thus, an interesting direction for future work is to investigate whether or not experts, in general, have an altruistic motive to put more effort into the underlying task in order to maximize the potential payoffs of their peers.
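As a concrete illustration of the consensus step discussed in this conclusion, the following Python snippet iterates DeGroot-style updating: in each round, every reviewer's review (a probability distribution over quality levels) is replaced by a convex combination of all current reviews. The weight matrix below is illustrative; the paper derives the actual weights from review scores.

```python
import numpy as np

def degroot_consensus(reviews, weights, tol=1e-12, max_rounds=10_000):
    """Iterate DeGroot updating: each round, every reviewer's review
    becomes a convex combination (one row of W) of all current reviews.
    With positive weights, all reviews converge to a common distribution."""
    R = np.asarray(reviews, dtype=float)
    W = np.asarray(weights, dtype=float)
    W = W / W.sum(axis=1, keepdims=True)   # normalize rows to sum to 1
    for _ in range(max_rounds):
        R_next = W @ R
        if np.abs(R_next - R).max() < tol:
            break
        R = R_next
    return R

# Three reviewers, three quality levels; the weights stand in for the
# score-based weights used in the paper.
reviews = [[0.7, 0.2, 0.1],
           [0.1, 0.6, 0.3],
           [0.2, 0.2, 0.6]]
weights = [[2.0, 1.0, 1.0],
           [1.0, 2.0, 1.0],
           [1.0, 1.0, 2.0]]
consensus = degroot_consensus(reviews, weights)
# All rows of `consensus` agree: a single consensual review.
```

Because each update is a convex combination of probability distributions, every intermediate review remains a valid distribution, which is one reason disparate individual reviews perturb the consensual review less than they perturb a plain average.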
Acknowledgments

The authors thank Carol Acton, Katherine Acheson, Stefan Rehm, Susan Gow, and Veronica Austen for providing gold-standard reviews for our experiments.

Appendix

In this appendix, we describe the texts used in our experiments as well as the gold-standard reviews reported by five professors and tutors from the English and Literature Department at the University of Waterloo, henceforth referred to as the experts.

Text 1

An excerpt from "Sonnet XVII" by Neruda [2007]. Intentionally misspelled words are highlighted in bold.

"I do not love you as if you was salt-rose, or topaz,
or the arrown of carnations that spread fire:
I love you as certain dark things are loved,
secretly, between the shadown and the soul"

Table 5 shows the evaluation scores reported by the experts. The gold-standard evaluation score for each criterion is the median/mode of the reported evaluation scores.

Table 5: Evaluation scores reported by the experts for Text 1.

Criterion    Expert 1   Expert 2   Expert 3   Expert 4   Expert 5   Median/Mode
Grammar          1          0          1          0          1            1
Clarity          2          2          2          1          2            2
Relevance        2          2          2          2          2            2

Text 2

An excerpt from "The Cow" by Taylor et al. [2010]. Intentionally misspelled words are highlighted in bold.

"THANK you, prety cow, that made
Plesant milk to soak my bread,
Every day and every night,
Warm, and fresh, and sweet, and white."

Table 6 shows the evaluation scores reported by the experts. The gold-standard evaluation score for each criterion is the median/mode of the reported evaluation scores.

Table 6: Evaluation scores reported by the experts for Text 2.

Criterion    Expert 1   Expert 2   Expert 3   Expert 4   Expert 5   Median/Mode
Grammar          1          1          1          1          1            1
Clarity          2          2          2          1          2            2
Relevance        1          0          0          1          1            1

Text 3

Random words arranged in a semi-structured way. Each line starts with a noun followed by a verb in a wrong verb form.
All the words in the same line start with a similar letter in order to mimic a poetic writing style.

"Baby bet binary boundaries bubbles
Carlos cease CIA conditionally curve
Daniel deny disease domino dumb
Faust fest fierce forced furbished"

Table 7 shows the evaluation scores reported by the experts. The gold-standard evaluation score for each criterion is the median/mode of the reported evaluation scores.

Table 7: Evaluation scores reported by the experts for Text 3.

Criterion    Expert 1   Expert 2   Expert 3   Expert 4   Expert 5   Median/Mode
Grammar          0          1          0          0          0            0
Clarity          0          0          0          0          0            0
Relevance        0          1          0          0          0            0

References

AICPA: American Institute of CPAs. Peer Review Program Manual. 2012. Retrieved from http://www.aicpa.org/InterestAreas/PeerReview.

J. Argenti. Corporate Planning: A Practical Guide. Number 2. Routledge, 1968.

D. F. Bacon, Y. Chen, I. Kash, D. C. Parkes, M. Rao, and M. Sridharan. Predicting Your Own Effort. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, pages 695–702, 2012.

J. M. Bernardo and A. F. M. Smith. Bayesian Theory. John Wiley & Sons, 1994.

L. Bornmann, R. Mutz, and H. D. Daniel. Gender Differences in Grant Peer Review: A Meta-Analysis. Journal of Informetrics, 1(3):226–238, 2007.

A. E. Budden, T. Tregenza, L. W. Aarssen, J. Koricheva, R. Leimu, and C. J. Lortie. Double-Blind Review Favours Increased Representation of Female Authors. Trends in Ecology and Evolution, 23(1):4–6, 2008.

M. D. Buhrmester, T. Kwang, and S. D. Gosling. Amazon's Mechanical Turk: A New Source of Inexpensive, Yet High-Quality, Data? Perspectives on Psychological Science, 6(1):3–5, 2011.

A. Carvalho and K. Larson. Sharing a Reward Based on Peer Evaluations. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, pages 1455–1456, 2010.

A. Carvalho and K. Larson. A Truth Serum for Sharing Rewards. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems, pages 635–642, 2011.

A. Carvalho and K. Larson. Sharing Rewards Among Strangers Based on Peer Evaluations. Decision Analysis, 9(3):253–273, 2012.

A. Carvalho and K. Larson. A Consensual Linear Opinion Pool. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, pages 2518–2524, 2013.

R. T. Clemen and R. L. Winkler. Combining Probability Distributions From Experts in Risk Analysis. Risk Analysis, 19:187–203, 1999.

P. E. Dans. Clinical Peer Review: Burnishing a Tarnished Icon. Annals of Internal Medicine, 118(7):566–568, 1993.

M. H. DeGroot. Reaching a Consensus. Journal of the American Statistical Association, 69(345):118–121, 1974.

E. S. Epstein. A Scoring System for Probability Forecasts of Ranked Categories. Journal of Applied Meteorology, 8(6):985–987, 1969.

M. E. Falagas, G. M. Zouglakis, and P. K. Kavvadia. How Masked Is the Masked Peer Review of Abstracts Submitted to International Medical Conferences? Mayo Clinic Proceedings, 81(5):705, 2006.

D. Friedman. Effective Scoring Rules for Probabilistic Forecasts. Management Science, 29(4):447–454, 1983.

T. Gneiting and A. E. Raftery. Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.

F. Godlee, T. Jefferson, M. Callaham, J. Clarke, D. Altman, H. Bastian, C. Bingham, and J. Deeks. Peer Review in Health Sciences. BMJ Books, London, 2003.

R. Hanson. Combinatorial Information Market Design. Information Systems Frontiers, 5(1):107–119, 2003.

J. J. Horton, D. G. Rand, and R. J. Zeckhauser. The Online Laboratory: Conducting Experiments in a Real Labor Market. Experimental Economics, 14(3):399–425, 2011.

S.-W. Huang and W.-T. Fu. Enhancing Reliability Using Peer Consistency Evaluation in Human Computation. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work, pages 639–648, 2013.

P. G. Ipeirotis. Analyzing the Amazon Mechanical Turk Marketplace. XRDS Crossroads: The ACM Magazine for Students, 17(2):16–21, 2010.

R. Jurca and B. Faltings. Mechanisms for Making Crowds Truthful. Journal of Artificial Intelligence Research, 34:209–253, 2009.

A. C. Justice, M. K. Cho, M. A. Winker, J. A. Berlin, D. Rennie, and The PEER Investigators. Does Masking Author Identity Improve Peer Review Quality? A Randomized Controlled Trial. Journal of the American Medical Association, 280(3):240–242, 1998.

S. Lock. A Difficult Balance: Editorial Peer Review in Medicine. Nuffield Provincial Hospitals Trust, 1985.

LSC: Legal Services Commission. Independent Peer Review. 2005. Retrieved from http://www.legalservices.gov.uk/civil/how/mq_peerreview.asp.

M. Marge, S. Banerjee, and A. I. Rudnicky. Using the Amazon Mechanical Turk for Transcription of Spoken Language. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5270–5273, 2010.

W. Mason and S. Suri. Conducting Behavioral Research on Amazon's Mechanical Turk. Behavior Research Methods, 44(1):1–23, 2012.

N. Miller, P. Resnick, and R. Zeckhauser. Eliciting Informative Feedback: The Peer-Prediction Method. Management Science, 51(9):1359–1373, 2005.

A. H. Murphy. A Note on the Ranked Probability Score. Journal of Applied Meteorology, 10(1):155–156, 1970.

Y. Nakazono. Strategic Behavior of Federal Open Market Committee Board Members: Evidence from Members' Forecasts. Journal of Economic Behavior & Organization, 93:62–70, 2013.

R. F. Nau. Should Scoring Rules Be "Effective"? Management Science, 31(5):527–535, 1985.

D. J. Navarro, T. L. Griffiths, M. Steyvers, and M. D. Lee. Modeling Individual Differences with Dirichlet Processes. Journal of Mathematical Psychology, 50(2):101–122, 2006.

P. Neruda. 100 Love Sonnets. Exile, Bilingual edition, 2007.

N. S. Newcombe and M. E. Bouton. Masked Reviews Are Not Fairer Reviews. Perspectives on Psychological Science, 4(1):62–64, 2009.

L. Pappano. The Year of the MOOC. New York Times, page ED26, November 4th, 2012.

S. Plous. The Psychology of Judgment and Decision Making. McGraw-Hill Book Company, 1993.

D. Prelec. A Bayesian Truth Serum for Subjective Data. Science, 306(5695):462–466, 2004.

R. B. Primack and R. Marrs. Bias in the Review Process. Biological Conservation, 141(12):2919–2920, 2008.

G. Radanovic and B. Faltings. A Robust Bayesian Truth Serum for Non-Binary Signals. In Proceedings of the 27th AAAI Conference on Artificial Intelligence, 2013.

R. Robinson. Calibrated Peer Review. The American Biology Teacher, 63(7):474–480, 2001.

M. Roos, J. Rothe, and B. Scheuermann. How to Calibrate the Scores of Biased Reviewers by Quadratic Programming. In Proceedings of the Twenty-Fifth Conference on Artificial Intelligence, pages 255–260, 2011.

S. M. Ross. Stochastic Processes (Wiley Series in Probability and Statistics). Wiley, 2nd edition, 1995.

A. D. Shaw, J. J. Horton, and D. L. Chen. Designing Incentives for Inexpert Human Raters. In Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work, pages 275–284, 2011.

R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and Fast—But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263, 2008.

C.-A. S. Staël von Holstein. A Family of Strictly Proper Scoring Rules Which Are Sensitive to Distance. Journal of Applied Meteorology, 9(3):360–364, 1970.

J. Taylor, A. Taylor, and K. Greenaway. Little Ann and Other Poems. Nabu Press, 2010.

A. Tversky and D. Kahneman. Judgment under Uncertainty: Heuristics and Biases. Science, 185(4157):1124–1131, 1974.

S. van Rooyen. The Evaluation of Peer-Review Quality. Learned Publishing, 14(2):85–91, 2001.

S. van Rooyen, N. Black, and F. Godlee. Development of the Review Quality Instrument (RQI) for Assessing Peer Reviews of Manuscripts. Journal of Clinical Epidemiology, 52(7):625–629, 1999.

R. R. J. Weiss. Optimally Aggregating Elicited Expertise: A Proposed Application of the Bayesian Truth Serum for Policy Analysis. PhD thesis, Massachusetts Institute of Technology, 2009.

C. Wenneras and A. Wold. Nepotism and Sexism in Peer-Review. Nature, 387:341–343, 1997.

R. L. Winkler and A. H. Murphy. "Good" Probability Assessors. Journal of Applied Meteorology, 7(5):751–758, 1968.

J. Witkowski and D. C. Parkes. A Robust Bayesian Truth Serum for Small Populations. In Proceedings of the 26th AAAI Conference on Artificial Intelligence, 2012.

A. Yankauer. How Blind is Blind Review? American Journal of Public Health, 81(7):843–845, 1991.
