Using Mechanical Turk to Build Machine Translation Evaluation Sets
Michael Bloodgood
Human Language Technology Center of Excellence, Johns Hopkins University
bloodgood@jhu.edu

Chris Callison-Burch
Center for Language and Speech Processing, Johns Hopkins University
ccb@cs.jhu.edu

Abstract

Building machine translation (MT) test sets is a relatively expensive task. As MT becomes increasingly desired for more and more language pairs and more and more domains, it becomes necessary to build test sets for each case. In this paper, we investigate using Amazon's Mechanical Turk (MTurk) to make MT test sets cheaply. We find that MTurk can be used to make test sets much more cheaply than professionally-produced test sets. More importantly, in experiments with multiple MT systems, we find that the MTurk-produced test sets yield essentially the same conclusions regarding system performance as the professionally-produced test sets yield.

1 Introduction

Machine translation (MT) research is empirically evaluated by comparing system output against reference human translations, typically using automatic evaluation metrics. One method for establishing a translation test set is to hold out part of the training set to be used for testing. However, this practice typically overestimates system quality when compared to evaluating on a test set drawn from a different domain. Therefore, it is necessary to make new test sets not only for new language pairs but also for new domains.

Creating reasonably sized test sets for new domains can be expensive. For example, the Workshop on Statistical Machine Translation (WMT) uses a mix of non-professional and professional translators to create the test sets for its annual shared translation tasks (Callison-Burch et al., 2008; Callison-Burch et al., 2009). For WMT09, the total cost of creating the test sets, consisting of roughly 80,000 words across 3027 sentences in seven European languages, was approximately $39,800 USD, or slightly more than $0.08 USD/word. For WMT08, creating test sets consisting of 2,051 sentences in six languages cost approximately $26,500 USD, or slightly more than $0.10 USD/word.

In this paper we examine the use of Amazon's Mechanical Turk (MTurk) to create translation test sets for statistical machine translation research. Snow et al. (2008) showed that MTurk can be useful for creating data for a variety of NLP tasks, and that a combination of judgments from non-experts can attain expert-level quality in many cases. Callison-Burch (2009) showed that MTurk could be used for low-cost manual evaluation of machine translation quality, and suggested that it might be possible to use MTurk to create MT test sets after an initial pilot study in which turkers (the people who complete the work assignments posted on MTurk) produced translations of 50 sentences in five languages.

This paper explores this in more detail by asking turkers to translate the Urdu sentences of the Urdu-English test set used in the 2009 NIST Machine Translation Evaluation Workshop. We evaluate multiple MT systems on both the professionally-produced NIST2009 test set and our MTurk-produced test set, and find that the MTurk-produced test set yields essentially the same conclusions about system performance as the NIST2009 set yields.
2 Gathering the Translations via Mechanical Turk

The NIST2009 Urdu-English test set (available at http://www.itl.nist.gov/iad/894.01/tests/mt/2009/ResultsRelease/currentUrdu.html) is a professionally produced machine translation evaluation set, containing four human-produced reference translations for each of 1792 Urdu sentences. We posted the 1792 Urdu sentences on MTurk and asked for translations into English. We charged $0.10 USD per translation, giving us a total translation cost of $179.20 USD.

A challenge we encountered during this data collection was that many turkers would cheat, giving us fake translations. We noticed that many turkers were pasting the Urdu into an online machine translation system and giving us the output as their response, even though our instructions said not to do this. We manually monitored for this, rejected these responses, and blocked these workers from completing any of our future work assignments. In the future, we plan to combat this in a more principled manner by converting our Urdu sentences into images and posting the images. This way, cheating turkers will not be able to cut and paste the text into a machine translation system.

We also noticed that many of the translations had simple mistakes such as misspellings and typos. We wanted to investigate whether these would decrease the value of our test set, so we did a second phase of data collection in which we posted the translations we had gathered and asked turkers (likely to be completely different people than the ones who provided the initial translations) to correct simple grammar mistakes, misspellings, and typos. For this post-editing phase, we paid $0.25 USD per ten sentences, giving a total post-editing cost of $44.80 USD.

In summary, we built two sets of reference translations, one with no editing and one with post-editing. In the next section, we present the results of experiments that test how effective these test sets are for evaluating MT systems.

3 Experimental Results

A main purpose of an MT test set is to evaluate various MT systems' performances relative to each other and to assist in drawing conclusions about the relative quality of the translations produced by the systems. (Another useful purpose would be to get some absolute sense of the quality of the translations, but that seems out of reach currently, as the values of BLEU scores, the de facto standard evaluation metric, are difficult to map to precise levels of translation quality.) Therefore, if a given system, say System A, outperforms another given system, say System B, on a high-quality professionally-produced test set, then we would want to see that System A also outperforms System B on our MTurk-produced test set. It is also desirable that the magnitudes of the differences in performance between systems be maintained.

In order to measure the differences in performance, using the differences in the absolute magnitudes of the BLEU scores will not work well, because the magnitudes of BLEU scores are affected by many factors of the test set being used, such as the number of reference translations per foreign sentence. For determining performance differences between systems, and especially for comparing them across different test sets, we use percentage of baseline performance. To compute percentage of baseline performance, we designate one system as the baseline system and report each system's score as a percentage of that baseline system's score.
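To make the computation explicit, the following is a minimal sketch in Python of the percentage-of-baseline calculation described above, using the NIST-2009 (4 refs) BLEU scores reported in Table 1 below; the helper function name is ours, not from the paper.

def percentage_of_baseline(system_bleu, baseline_bleu):
    """Express a system's BLEU score as a percentage of the baseline system's BLEU score."""
    return 100.0 * system_bleu / baseline_bleu

# BLEU scores from the NIST-2009 (4 refs) row of Table 1.
bleu_scores = {"ISI-Syntax": 33.10, "JHU-Syntax": 32.77, "Joshua-Hierarchical": 26.65}
baseline = bleu_scores["ISI-Syntax"]  # ISI-Syntax is designated the baseline system

for system, score in bleu_scores.items():
    pct = percentage_of_baseline(score, baseline)
    print(f"{system}: BLEU {score:.2f} -> {pct:.2f}% of baseline")
# Prints 100.00%, 99.00%, and 80.51%, matching the percentages in Table 1.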
For example, Table 1 shows both absolute BLEU scores and percentage performance for three MT systems when tested on five different test sets. The first test set in the table is the NIST-2009 set with all four reference translations per Urdu sentence. The next four test sets use only a single reference translation per Urdu sentence (ref 1 uses the first reference translation only, ref 2 the second only, etc.). Note that the BLEU scores for the single-reference test sets are much lower than for the test set with all four reference translations, and that the differences in the absolute magnitudes of the BLEU scores between the three systems vary across the different test sets. However, the percentage performance of the MT systems is maintained (both the ordering of the systems and the amount of the difference between them) across the different test sets.

Eval Set              ISI (Syntax)      JHU (Syntax)      Joshua (Hier.)
NIST-2009 (4 refs)    33.10 / 100%      32.77 / 99.00%    26.65 / 80.51%
NIST-2009 (ref 1)     17.22 / 100%      16.98 / 98.61%    14.25 / 82.75%
NIST-2009 (ref 2)     17.76 / 100%      17.14 / 96.51%    14.69 / 82.71%
NIST-2009 (ref 3)     16.94 / 100%      16.54 / 97.64%    13.80 / 81.46%
NIST-2009 (ref 4)     13.63 / 100%      13.67 / 100.29%   11.05 / 81.07%

Table 1: Three MT systems evaluated on five different test sets. Each cell shows the system's BLEU score on that test set, followed by the percentage of baseline system performance achieved. For example, ISI-Syntax tested on the NIST-2009 test set has a BLEU score of 33.10. ISI-Syntax (the highest-performing system on NIST2009 to our knowledge) is used as the baseline, so its percentage performance is always 100% on every test set. To illustrate computing the percentage performance for the other systems, consider JHU-Syntax tested on NIST2009: its BLEU score of 32.77 divided by the BLEU score of the baseline system gives 32.77 / 33.10 ≈ 99.00%.

We evaluated three different MT systems on the NIST2009 test set and on our two MTurk-produced test sets (MTurk-NoEditing and MTurk-Edited). Two of the MT systems, ISI Syntax (Galley et al., 2004; Galley et al., 2006) and JHU Syntax (Li et al., 2009) augmented with (Zollmann and Venugopal, 2006), were chosen because they represent state-of-the-art performance, having achieved the highest scores on NIST2009 to our knowledge. They also have very similar performance on NIST2009, so we want to see whether that similar performance is maintained as we evaluate on our MTurk-produced test sets. The third MT system, Joshua-Hierarchical (Li et al., 2009), an open-source implementation of (Chiang, 2007), was chosen because, though it is a competitive system, it had clearly and markedly lower performance on NIST2009 than the other two systems, and we want to see whether that difference in performance is also maintained if we shift evaluation to our MTurk-produced test sets.
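As Table 1 illustrates, absolute BLEU magnitudes depend heavily on how many reference translations are available per sentence. The following is a minimal sketch of scoring the same outputs against four references versus a single reference; it uses the sacrebleu toolkit and toy sentences purely for illustration, and is not the evaluation pipeline used in this paper.

# Illustrative only: toy hypotheses/references, not NIST2009 data.
import sacrebleu

hypotheses = [
    "the president met the delegation on tuesday",
    "heavy rain caused flooding in the region",
]

# Four parallel reference streams, mirroring the 4-reference NIST2009 layout.
references = [
    ["the president met with the delegation on tuesday",
     "heavy rains caused floods across the region"],
    ["on tuesday the president met the delegation",
     "the heavy rain led to flooding in the region"],
    ["the president held a meeting with the delegation on tuesday",
     "flooding hit the region after heavy rain"],
    ["the delegation was received by the president on tuesday",
     "heavy rainfall caused the region to flood"],
]

four_ref_bleu = sacrebleu.corpus_bleu(hypotheses, references)        # all four references
single_ref_bleu = sacrebleu.corpus_bleu(hypotheses, references[:1])  # reference 1 only

# BLEU drops sharply with fewer references, which is why systems are compared
# via percentage of baseline performance rather than raw BLEU.
print(f"BLEU with 4 references: {four_ref_bleu.score:.2f}")
print(f"BLEU with 1 reference:  {single_ref_bleu.score:.2f}")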
Table 2 shows the results.

Eval Set           ISI (Syntax)      JHU (Syntax)      Joshua (Hier.)
NIST-2009          33.10 / 100%      32.77 / 99.00%    26.65 / 80.51%
MTurk-NoEditing    13.81 / 100%      13.93 / 100.87%   11.10 / 80.38%
MTurk-Edited       14.16 / 100%      14.23 / 100.49%   11.68 / 82.49%

Table 2: Three MT systems evaluated using the official NIST2009 test set and the two test sets we constructed (MTurk-NoEditing and MTurk-Edited). Each cell shows the system's BLEU score on that test set, followed by the percentage of baseline system performance achieved. For example, ISI-Syntax tested on the NIST-2009 test set has a BLEU score of 33.10. ISI-Syntax (the highest-performing system on NIST2009 to our knowledge) is used as the baseline.

There are a number of observations to make. One is that the absolute magnitude of the BLEU scores is much lower for all systems on the MTurk-produced test sets than on the NIST2009 test set. This is primarily because the NIST2009 set has four reference translations per foreign sentence, whereas the MTurk-produced sets have only one translation per foreign sentence. Due to this different scale of BLEU scores, we compare performances using percentage of baseline performance. We use the ISI Syntax system as the baseline since it achieved the highest results on NIST2009.

The main observation from Table 2 is that both the relative performance of the various MT systems and the amount of the differences in performance (in terms of percentage performance of the baseline) are maintained when we use the MTurk-produced test sets, just as when we use the NIST2009 test set. In particular, we can see that whether using the NIST2009 test set or the MTurk-produced test sets, one would conclude that ISI Syntax and JHU Syntax perform about the same and that Joshua-Hierarchical delivers about 80% of the performance of the two syntax systems. The post-edited test set did not yield different conclusions than the non-edited test set yielded, so the value of post-editing for test set creation remains an open question.

4 Conclusions and Future Work

In conclusion, we have shown that it is feasible to use MTurk to build MT evaluation sets at a significantly reduced cost, and that the large cost savings does not hamper the utility of the test sets for evaluating systems' translation quality. In our experiments, MTurk-produced test sets led to essentially the same conclusions about multiple MT systems' translation quality as much more expensive professionally-produced MT test sets.

It is important to be able to build MT test sets quickly and cheaply because we need new ones for new domains (as discussed in Section 1). Now that we have shown the feasibility of using MTurk to build MT test sets, in the future we plan to build new MT test sets for specific domains (e.g., entertainment, science, etc.) and release them to the community to spur work on domain adaptation for MT.

We also envision using MTurk to collect additional training data to tune an MT system for a new domain. It has been shown that active learning can be used to reduce training data annotation burdens for a variety of NLP tasks (see, e.g., Bloodgood and Vijay-Shanker (2009)). Therefore, in future work, we plan to use MTurk combined with an active learning approach to gather new data in the new domain, to investigate improving MT performance for specialized domains.
But we will need new test sets in the specialized domains to be able to evaluate the effectiveness of this line of research, and therefore we will need to be able to build new test sets. In light of the findings presented in this paper, it seems we can build those test sets using MTurk at relatively low cost without sacrificing much of their utility for evaluating MT systems.

Acknowledgements

This research was supported by the EuroMatrixPlus project funded by the European Commission, by the DARPA GALE program under Contract No. HR0011-06-2-0001, and by the NSF under grant IIS-0713448. Thanks to Amazon Mechanical Turk for providing a $100 credit.

References

Michael Bloodgood and K Vijay-Shanker. 2009. Taking into account the differences between actively and passively acquired data: The case of active learning with support vector machines for imbalanced datasets. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 137-140, Boulder, Colorado, June. Association for Computational Linguistics.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2008. Further meta-evaluation of machine translation. In Proceedings of the Third Workshop on Statistical Machine Translation (WMT08), Columbus, Ohio.

Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 Workshop on Statistical Machine Translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation (WMT09), March.

Chris Callison-Burch. 2009. Fast, cheap, and creative: Evaluating translation quality using Amazon's Mechanical Turk. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 286-295, Singapore, August. Association for Computational Linguistics.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201-228.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL-2004), Boston, Massachusetts.

Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (ACL-CoLing-2006), Sydney, Australia.

Zhifei Li, Chris Callison-Burch, Chris Dyer, Juri Ganitkevitch, Sanjeev Khudanpur, Lane Schwartz, Wren Thornton, Jonathan Weese, and Omar Zaidan. 2009. Joshua: An open source toolkit for parsing-based machine translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 135-139, Athens, Greece, March. Association for Computational Linguistics.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of EMNLP-2008, Honolulu, Hawaii.

Andreas Zollmann and Ashish Venugopal. 2006. Syntax augmented machine translation via chart parsing. In Proceedings of the NAACL-2006 Workshop on Statistical Machine Translation (WMT-06), New York, New York.