Annotation Sensitivity: Training Data Collection Methods Affect Model Performance

Christoph Kern♣, Stephanie Eckman♢, Jacob Beck♣, Rob Chew♠, Bolei Ma♣, Frauke Kreuter♣,♢
♣LMU Munich  ♢University of Maryland, College Park  ♠RTI International
{christoph.kern, jacob.beck, bolei.ma, frauke.kreuter}@lmu.de, steph@umd.edu, rchew@rti.org

Abstract

When training data are collected from human annotators, the design of the annotation instrument, the instructions given to annotators, the characteristics of the annotators, and their interactions can impact training data. This study demonstrates that design choices made when creating an annotation instrument also impact the models trained on the resulting annotations. We introduce the term annotation sensitivity to refer to the impact of annotation data collection methods on the annotations themselves and on downstream model performance and predictions.

We collect annotations of hate speech and offensive language in five experimental conditions of an annotation instrument, randomly assigning annotators to conditions. We then fine-tune BERT models on each of the five resulting datasets and evaluate model performance on a holdout portion of each condition. We find considerable differences between the conditions for 1) the share of hate speech/offensive language annotations, 2) model performance, 3) model predictions, and 4) model learning curves.

Our results emphasize the crucial role played by the annotation instrument, which has received little attention in the machine learning literature. We call for additional research into how and why the instrument impacts the annotations to inform the development of best practices in instrument design.

(This paper was accepted to EMNLP 2023 Findings: https://aclanthology.org/2023.findings-emnlp.992/)

Keywords: annotation sensitivity, human annotation, annotation instrument, task structure effects

1 Introduction

Supervised NLP models are typically trained on human-annotated data and assume that these annotations represent an objective "ground truth." Mislabeled data can greatly reduce a model's ability to learn and generalize effectively (Frénay and Verleysen, 2013). Item difficulty (Lalor et al., 2018; Swayamdipta et al., 2020), the annotation scheme (Northcutt et al., 2021), and annotator characteristics (Geva et al., 2019; Al Kuwatly et al., 2020) can contribute to bias and variability in the annotations. How the prompts are worded, the arrangement of buttons, and other seemingly minor changes to the instrument can also impact the annotations collected (Beck et al., 2022).

In this work, we investigate the impact of the annotation collection instrument on downstream model performance. We introduce the term annotation sensitivity to refer to the impact of data collection methods on the annotations themselves and on model performance and predictions. We conduct experiments in the context of annotation of hate speech and offensive language in a tweet corpus.

Our results contribute to the growing literature on Data-Centric AI (Zha et al., 2023), which finds that larger improvements in model performance often come from improving the quality of the training data rather than from model tuning. The results will inform the development of best practices for annotation collection.
2 Background and Related Work

Annotation sensitivity effects are predicted by findings in several fields, such as survey methodology and social psychology. Decades of research in the survey methods literature find that the wording and order of questions and response options can change the answers collected (Tourangeau et al., 2000; Schuman and Presser, 1996). For example, small changes in survey questions ("Do you think the government should forbid cigarette advertisements on television?" versus "Do you think the government should allow cigarette advertisements on television?") impact the answers respondents give. Questions about opinions are more sensitive to these effects than questions about facts (Schnell and Kreuter, 2005).

Psychologists and survey methodologists have also documented the cognitive shortcuts that respondents take to reduce the length and burden of a survey, and how these impact data quality. Respondents often satisfice, choosing an answer category that is good enough rather than expending cognitive effort to evaluate all response options thoroughly (Krosnick et al., 1996). When the questions allow, respondents will go further, giving incorrect answers to reduce the burden of a survey (Kreuter et al., 2011; Tourangeau et al., 2012; Eckman et al., 2014).

Prior research finds strong evidence of task structure effects in annotations of hate speech and offensive language. When annotators coded both annotations on one screen, they found 41.3% of tweets to contain hate speech and 40.0% to contain offensive language. When the task was split over two screens, results differed significantly. The order of the annotations matters as well. When hate speech was asked about first, 46.6% of tweets were coded as hate speech and 35.4% as offensive language. The converse was true when the order was switched (39.3% hate speech, 33.8% offensive language) (Beck et al., 2022). Our work expands upon these findings to test the hypothesis that task structure affects not just the distribution of the annotations but also downstream model performance and predictions.

3 Methods

We tested our hypotheses about the downstream effects of annotation task structure in the context of hate speech and offensive language in tweets. We used a corpus of tweets previously annotated for these tasks (Davidson et al., 2017), but replaced the original annotations with new ones collected under five experimental conditions. We then trained separate models on the annotations from each condition to understand how the annotation task structure impacts model performance and predictions.

3.1 Data Collection

We sampled tweets from the Davidson et al. corpus, which contains 25,000 English-language tweets. We selected 3,000 tweets, each annotated three times in the prior study. Selection was stratified by the distribution of the previous annotations to ensure that the sample contained varying levels of annotator disagreement. We created nine strata reflecting the number of annotations (0, 1, 2, or 3) of each type (hate speech, offensive language, neither). The distribution of the sample across the strata is shown in Table A.1 in Appendix A.1.

We do not treat the previous annotations as ground truth and do not use them in our analysis. There is no reason to believe they are more correct than the annotations we collected, and ground truth is difficult to define in hate speech annotation (Arhin et al., 2021). (The original annotation task defined hate speech and offensive language as mutually exclusive categories, which we felt was too restrictive and difficult for annotators.)
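To make the stratification concrete, here is a minimal pandas sketch of how such a stratified draw could look. The file name and the prior-annotation count columns (n_hs, n_ol, n_neither) are hypothetical, not the authors' code (their pipeline is in the repository linked in Section 3.2); the per-stratum sizes follow Table A.1.

```python
import pandas as pd

# Hypothetical input: one row per tweet, with counts of prior annotations per
# class (columns n_hs, n_ol, n_neither are assumptions, not the authors' schema).
tweets = pd.read_csv("davidson_tweets.csv")

# Per-stratum sample sizes from Table A.1, keyed by (# HS, # OL, # neither) annotations.
sizes = {
    (0, 0, 3): 417, (0, 1, 2): 417, (0, 2, 1): 417, (0, 3, 0): 417,
    (1, 0, 2): 160, (1, 2, 0): 417, (2, 0, 1): 97,  (2, 1, 0): 417,
    (3, 0, 0): 241,
}

# Draw the stratified sample: group tweets by their prior-annotation profile
# and sample the stratum-specific number of tweets from each group.
sample = pd.concat(
    group.sample(n=sizes[key], random_state=42)
    for key, group in tweets.groupby(["n_hs", "n_ol", "n_neither"])
    if key in sizes
)
assert len(sample) == 3000  # varying levels of prior annotator disagreement
```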
We developed five experimental conditions which varied the annotation task structure (Figure 1). All tweets were annotated in each condition. Condition A presented the tweet and three options on a single screen: hate speech (HS), offensive language (OL), or neither. Annotators could select one or both of HS and OL, or indicate that neither applied. Conditions B and C split the annotation of a single tweet across two screens (screens are shown in Figure 1 as black boxes). For Condition B, the first screen prompted the annotator to indicate whether the tweet contained HS. On the following screen, they were shown the tweet again and asked whether it contained OL. Condition C was similar to Condition B, but flipped the order of HS and OL for each tweet. Annotators assigned to Condition D first annotated HS for all assigned tweets and then annotated OL for the same set of tweets. Condition E worked the same way but started with the OL annotation task, followed by the HS annotation task.

[Figure 1: Illustration of Experimental Conditions]

In each condition, annotators first read through a tutorial (Appendix A.3) which provided definitions of HS and OL (adapted from Davidson et al. 2017) and showed examples of tweets containing HS, OL, both, and neither. Annotators could access these definitions at any time. Annotators were randomly assigned to one of the experimental conditions, which remained fixed throughout the task. The annotation task concluded with the collection of demographic variables and some task-specific questions (e.g., perception of the annotation task, social media use). We assessed demographic balance across experimental conditions and found no evidence of meaningful imbalance (see Figure A.4 in Appendix A.4).

Each annotator annotated up to 50 tweets. Those in Condition A saw up to 50 screens, and those in other conditions saw up to 100. We collected three annotations per tweet and condition, resulting in 15 annotations per tweet and 44,900 tweet-annotation combinations. (Annotations by two annotators of 50 tweets each were corrupted and omitted from analysis.)

We recruited 917 annotators via the crowdsourcing platform Prolific during November and December 2022. Only platform members in the US were eligible. Annotators received a fixed hourly wage in excess of the US federal minimum wage after completing the task.

The full dataset, including the annotations and the demographic information of the annotators, is available at the following HuggingFace repository: https://huggingface.co/datasets/soda-lmu/tweet-annotation-sensitivity-2

3.2 Model Training

Training and Test Setup. We used the collected annotations to build prediction models for two binary outcomes: predicting whether a tweet contains offensive language (1 = yes, 0 = no), and predicting whether a tweet contains hate speech (1 = yes, 0 = no). We split our data at the tweet level into a training set (2,250 tweets) and a test set (750 tweets). Because we collected three annotations per tweet in each experimental condition, we had five condition-specific training data sets which each contain 6,750 (2,250 × 3) tweet-annotations. We similarly created five condition-specific test sets with 2,250 (750 × 3) tweet-annotations each, and a combined test set that contains tweet-annotations from all five conditions. Note that the five condition-specific training sets contain the same tweets, annotated under different experimental conditions, as do the five condition-specific test sets.

We used the five training sets to build condition-specific prediction models for both OL and HS. That is, we trained a model on annotations collected only in Condition A, another model on annotations collected in Condition B, and so on. Models were evaluated against the five condition-specific test sets as well as the combined test set. For both model training and testing, we used the three annotations collected for each tweet in each condition; that is, we did not aggregate annotations to the tweet level (Aroyo and Welty, 2015).
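Because every tweet carries multiple annotations, the split has to happen at the tweet level so that no tweet contributes annotations to both sides. A minimal sketch using scikit-learn's GroupShuffleSplit on a hypothetical long-format table (one row per tweet-annotation); the paper does not specify how the split was implemented, so treat this as an illustration:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical long format: one row per tweet-annotation.
df = pd.read_csv("annotations.csv")  # assumed columns: tweet_id, text, condition, hs, ol

# 75/25 split at the tweet level (2,250 vs. 750 tweets in the paper): grouping by
# tweet_id keeps all annotations of a tweet on the same side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["tweet_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# Condition-specific training sets keep the three annotations per tweet as
# separate rows, i.e., labels are not aggregated to the tweet level.
train_a = train[train["condition"] == "A"]  # 6,750 tweet-annotations
```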
Model Types. We used Bidirectional Encoder Representations from Transformers (BERT; Devlin et al. 2019), which is widely used for text classification and OL/HS detection tasks (e.g., Vidgen et al., 2020; Al Kuwatly et al., 2020). BERT is a pre-trained model which we fine-tuned on the annotations from each condition. We also trained Long Short-Term Memory models (LSTM; Hochreiter and Schmidhuber 1997) on the condition-specific training data sets. The LSTM results are reported in Appendix A.4.

Training and Validation. The condition-specific training datasets were further split into a train (80%) and development (20%) set for model validation. During training, after each training epoch, we validated the model on the development set based on accuracy. If the validation showed a higher accuracy score than in the previous epoch, we saved the model checkpoint within this epoch. After all training epochs, we saved the respective best model and used it for final deployment and evaluation on the five condition-specific test sets. We repeated this training process 10 times with different random seeds and report average performance results to make our findings more robust. We report further model training details in Appendix A.2.

Reproducibility. The code for data processing, model training and evaluation, and a list of software packages and libraries used is available at the following GitHub repository: https://github.com/chkern/tweet-annotation-sensitivity
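A minimal sketch of this fine-tuning and checkpoint-selection loop with the HuggingFace Trainer API. Hyperparameters follow Table A.2, but the dataset construction, column names, and helper function are assumptions rather than the authors' training script (which is in the GitHub repository above), and some argument names vary across transformers versions (e.g., evaluation_strategy vs. eval_strategy).

```python
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

def train_condition(train_df, seed):
    """Fine-tune BERT on one condition's annotations (assumed columns: text, label)."""
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=2)

    ds = Dataset.from_pandas(train_df).map(
        lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
        batched=True)
    ds = ds.train_test_split(test_size=0.2, seed=seed)  # 80/20 train/development split

    args = TrainingArguments(
        output_dir=f"ckpt-seed{seed}",
        num_train_epochs=20,                 # Table A.2
        learning_rate=5e-5,
        per_device_train_batch_size=64,
        seed=seed,
        evaluation_strategy="epoch",         # validate on the development set each epoch
        save_strategy="epoch",
        load_best_model_at_end=True,         # keep the epoch with the best dev accuracy
        metric_for_best_model="accuracy",
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=ds["train"], eval_dataset=ds["test"],
                      compute_metrics=accuracy)
    trainer.train()
    return trainer  # best checkpoint reloaded; evaluate on each test set afterwards
```

Repeating this for each of the five conditions and the 10 seeds listed in Appendix A.2, then averaging the test-set metrics, mirrors the reported setup in spirit.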
4 Results

To understand the influence of the annotation task structure on the annotations themselves, we first compare the percent of tweets annotated as OL and HS and the agreement rates across conditions. We then evaluate several measures of model performance to study the impact of the task structure on the models, including balanced accuracy, ROC-AUC, learning curves, and model predictions. Corresponding results for the LSTM models are given in Appendix A.4. All statistical tests account for the clustering of tweets within annotators.

Collected Annotations. Table 1 contains the frequency of OL and HS annotations by experimental condition. The number of annotations and annotators was nearly equivalent across the five conditions. More tweets were annotated as OL than HS across all conditions. Condition A, which displayed both HS and OL on one screen and asked annotators to select all that apply, resulted in the lowest percentage of OL and HS annotations. Conditions B and C resulted in equal rates of OL annotations. However, Condition C, which collected HS on the second screen (see Figure 1), yielded a lower share of HS annotations than Condition B (t = 5.93, p < 0.01). The highest share of HS annotations was observed in Condition D, which asked annotators to first annotate all tweets as HS and then repeat the process for OL. Condition E, which flipped the task order (first requesting 50 OL annotations followed by 50 HS annotations), resulted in significantly more OL annotations than Condition D (t = -10.23, p < 0.01).

Table 1: Annotation results by condition

Cond.   Annotations   Annotators   OL (%)   HS (%)
A       9,000         184          51.6     26.8
B       9,000         183          58.8     29.6
C       8,950         182          58.5     28.2
D       8,950         179          54.4     33.5
E       9,000         189          59.0     31.8

Table 2 highlights disagreement across conditions. We calculated the modal annotation for each tweet in each condition from the three annotations, separately for HS and OL, and compared these across conditions. The OL columns give the agreement rates for the modal OL annotations and the HS columns for the modal HS annotations. For OL, Condition A disagrees most often with the other conditions. For HS, the agreement rates are smaller, and Condition E has the most disagreement with the other conditions.

Table 2: Agreement between modal labels across annotation conditions (Krippendorff's alpha)

        OL                                HS
Cond.   A       B       C       D         A       B       C       D
B       0.653                             0.596
C       0.646   0.731                     0.545   0.536
D       0.629   0.695   0.707             0.559   0.579   0.539
E       0.655   0.740   0.740   0.724     0.477   0.505   0.484   0.510
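For illustration, the agreement between two conditions' modal labels can be computed with the krippendorff package; the long-format table and its column names here are assumptions, not the authors' pipeline.

```python
import krippendorff
import pandas as pd

# Hypothetical long format: one row per tweet-annotation with a binary HS label.
df = pd.read_csv("annotations.csv")  # assumed columns: tweet_id, condition, hs

# Modal HS label per tweet within each condition (three binary annotations
# per tweet, so the mode is always unique).
modal = (df.groupby(["condition", "tweet_id"])["hs"]
           .agg(lambda s: s.mode().iloc[0])
           .unstack(level="condition"))   # rows: tweets, columns: conditions

# Krippendorff's alpha between Conditions A and B: reliability_data expects
# one row per "coder" (here, per condition) and one column per unit (tweet).
alpha_ab = krippendorff.alpha(
    reliability_data=modal[["A", "B"]].T.to_numpy(),
    level_of_measurement="nominal")
print(f"alpha(A, B) = {alpha_ab:.3f}")
```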
Prediction Performance. Table 3 shows the balanced accuracy and ROC-AUC metrics for the fine-tuned BERT models when evaluated against the combined test set (with annotations from all conditions). The models achieve higher performance when predicting OL compared to HS, highlighting that HS detection is more difficult (Table 3).

Table 3: Balanced accuracy and ROC-AUC by annotation condition in the combined test set, averaged over 10 BERT models

        Bal. Accuracy     ROC-AUC
Cond.   OL      HS        OL      HS
A       0.772   0.690     0.846   0.806
B       0.792   0.704     0.866   0.803
C       0.802   0.681     0.862   0.800
D       0.797   0.701     0.857   0.801
E       0.794   0.696     0.863   0.794

Figure 2 shows the performance results for all combinations of training and testing conditions. The cells on the main diagonal show the balanced accuracy of BERT models fine-tuned on training data from one experimental condition and evaluated on test data from the same condition, averaged across the 10 model runs. The off-diagonal cells show the performance of BERT models fine-tuned on one condition and evaluated on test data from a different condition. The left panel, in blue, contains results from the OL models; the right panel, in orange, from the HS models.

[Figure 2: Performance (balanced accuracy) of BERT models across annotation conditions; panels: (a) Predicting Offensive Language, (b) Predicting Hate Speech]

[Figure 3: Performance (ROC-AUC) of BERT models across annotation conditions; panels: (a) Predicting Offensive Language, (b) Predicting Hate Speech]

Performance differs across training and testing conditions, particularly for OL, as evident in the row- and column-wise patterns in both panels of Figure 2. We do not observe higher performance scores on the main diagonal: models trained and tested on data from the same experimental condition do not perform better than those trained and tested in different conditions.

Models trained with data from Condition A resulted in the lowest performance across all test sets for the OL outcome (Table 3). Similarly, evaluating OL models on test data from Condition A leads to lower than average performance compared to other test conditions (Figure 2a). These results echo the high disagreement rates in OL annotations between Condition A and the other conditions (Table 2). In contrast, models tested on data from Condition A show the highest performance when predicting HS (Figure 2b).

Conditions B and C collected the HS and OL annotations on separate screens, differing only in the order of the HS and OL screens for each tweet (Figure 1). Models trained on annotations collected in Condition C, where HS was annotated second for each tweet, have lower balanced accuracy across test sets when predicting HS (Figure 2b). However, there is no corresponding effect for Condition B when training models to predict OL.

The OL models perform worse when evaluated on test data collected in Condition D than on data collected in Condition E (see the strong column effects in the last two columns of Figure 2a). Again, no similar column effect is visible in the HS panel (Figure 2b). A potential explanation is that the second batch of 50 annotations was of lower quality due to annotator fatigue.

Evaluating models across conditions with respect to ROC-AUC largely confirms these patterns (Figure 3). The performance differences across testing conditions (column effects) are particularly striking in the ROC-AUC figures, while differences between training conditions (row effects) are less pronounced.

Predicted Scores. Comparing model predictions across conditions contextualizes our findings. Table 4 shows agreement (Krippendorff's alpha) between the predictions produced by models fine-tuned on different conditions. All predictions in Table 4 are for tweets in the test sets. OL predictions from the model trained on Condition A show more disagreement with the predictions from the other models (similar to the agreement rates between modal labels in Table 2). Agreement rates in predictions between models trained on Conditions B to E are lower for HS than for OL.

Table 4: Agreement between BERT predictions across annotation conditions (Krippendorff's alpha)

        OL                                HS
Cond.   A       B       C       D         A       B       C       D
B       0.679                             0.778
C       0.754   0.869                     0.822   0.777
D       0.727   0.869   0.901             0.839   0.811   0.751
E       0.682   0.878   0.861   0.872     0.788   0.789   0.760   0.797

Learning Curves. We further study how the annotation conditions impact model performance under a range of training set sizes, investigating whether differences between conditions impact training efficiency. Figure 4 shows learning curves for the BERT models by training condition (line plots and corresponding scatterplot smoothers). These curves show how test set model performance changes as the training data size increases. For each training condition, 1% batches of data were successively added for model training, and the model was evaluated on the combined test set. To limit computational burden, one model was trained for each training set size and training condition.

[Figure 4: Learning curves of BERT models compared by annotation conditions; panels: (a) Predicting Offensive Language, (b) Predicting Hate Speech]

While the differences in the learning curves across conditions for OL are small (Figure 4a), Conditions A, C, and E stand out among the HS learning curves (Figure 4b). Building models with annotations collected in Condition E is less efficient: initially, more training data are needed to achieve acceptable performance.
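The learning-curve procedure can be sketched as follows, reusing the hypothetical train_condition helper from the Section 3.2 sketch and an assumed pre-tokenized test set; again, this is an illustration, not the authors' code.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def learning_curve(train_df, test_ds, test_labels, seed=42, step=0.01):
    """Grow the training data in 1% batches and track combined-test-set performance."""
    shuffled = train_df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    scores = []
    for frac in np.arange(step, 1.0 + step, step):
        subset = shuffled.iloc[: int(round(frac * len(shuffled)))]
        trainer = train_condition(subset, seed=seed)    # fine-tune on the subset
        logits = trainer.predict(test_ds).predictions   # test_ds: tokenized test set
        preds = np.argmax(logits, axis=-1)
        scores.append((len(subset), balanced_accuracy_score(test_labels, preds)))
    return scores  # (training size, balanced accuracy) pairs for one condition
```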
5 Discussion

Design choices made when creating annotation instruments impact the models trained on the resulting annotations. In the context of annotating hate speech and offensive language in tweets, we created five experimental conditions of the annotation instrument. These conditions affected the percentage of tweets annotated as hate speech or offensive language as well as model performance, learning curves, and predictions. Our results underscore the critical role of the annotation instrument in the model development process. Models are sensitive to how training data are collected in ways not previously appreciated in the literature.

We support the calls for improved transparency and documentation in the collection of annotations (Paullada et al., 2021; Denton et al., 2021). If we assume a universal "ground truth" annotation exists, task structure effects are a cause of annotation noise (Frénay and Verleysen, 2013). Alternatively, if the annotations generated under different task structures are equally valid, task structure effects are a mechanism of researcher-induced concept drift (Gama et al., 2014). Determining a ground truth annotation is difficult and in some instances impossible. In such situations, the documentation of instrument design choices is particularly important. Lack of thorough documentation may explain why attempts to replicate the CIFAR-10 and ImageNet creation process encountered difficulties (Recht et al., 2019).

Large language models (LLMs) can assist with annotation collection, but we expect that human involvement in annotation will continue to be necessary for critical tasks, especially those related to discrimination or fairness. However, such opinion questions are exactly the kinds of questions that are most susceptible to wording and order effects in surveys (Schnell and Kreuter, 2005). Whether LLMs are also vulnerable to task structure effects is not yet known. In addition, we expect that researchers will increasingly use LLMs for pre-annotation and ask annotators to review the suggested annotations. This review is vulnerable to anchoring or confirmation bias (Eckman and Kreuter, 2011), where the annotators rely too much on the pre-annotations rather than independently evaluating the task at hand. Task structure effects may impact anchoring bias as well.

We want to give special attention to Condition A, which presents all annotation tasks for a given item on the same screen. We suspect this condition is the most commonly used in practice because it requires fewer clicks and thus less annotator effort. The rate of HS and OL annotations was lowest in Condition A. Models trained on data collected in Condition A had the lowest performance for offensive language and the highest performance for hate speech. There are several possible explanations for the differences in the annotations and in model performance, which we cannot disentangle with our data. One explanation is an order effect: the hate speech button was shown first in the list of classes (response options). Respondents may have satisficed by choosing just one annotation rather than all that apply (Krosnick et al., 1996; Pew Research Center, 2019). Another explanation is distinction bias: asking the annotator to consider both classes simultaneously may lead the annotators to over-examine the distinctions between offensive language and hate speech (Hsee and Zhang, 2004).
Annotators in Conditions D and E annotated 50 tweets and then saw the same tweets again. We hypothesize that the relatively poor performance of these conditions is the result of a fatigue effect. Annotators may give lower-quality annotations in the second half of the task due to boredom and fatigue, which could partially explain the patterns we see in model performance (Figures 2, 3). However, we have not yet tested for order or fatigue effects.

Our results do not clearly demonstrate the superiority of any of the five conditions we tested. More research is needed to identify the best approach. In the meantime, we suggest incorporating variation in the task structure when annotation instruments are created for human annotators, a point also raised by Recht et al. (2019). Such variation can also help in creating diverse test sets that protect against the test (column) effects we observe (Figures 2, 3). Relying on only one annotation collection condition leads to performance assessments that are subject to annotation sensitivity, that is, distorted by the design of the annotation instrument.

Limitations

This research has explored annotation sensitivity only in English-language tweets annotated by Prolific panel members living in the United States. The variations in annotations, model performance, and model predictions that we see across conditions could differ in other countries and cultures, especially in the context of hate speech and offensive language. The similar effects in surveys, which motivated our work, do differ across cultures (Tellis and Chandrasekaran, 2010; Lee et al., 2020). This work has also not demonstrated that task structure effects appear in other tasks, such as image and video annotation. In retrospect, we should have included a sixth condition that reversed the display of the label options in Condition A, which would have let us estimate order effects. We encourage future work in this area to address these limitations.

Future Work

Our results suggest that there are order and fatigue effects in annotation collection. Future work should conduct experiments to estimate these effects. In surveys, measurement error is more common in later questions (Egleston et al., 2011). Future work could also assess fatigue effects by incorporating paradata (Kreuter, 2013), such as mouse movements (Horwitz et al., 2017) and response times (Galesic and Bosnjak, 2009). These behavioral indicators could contextualize how fatigue impacts task performance and data quality.

In subsequent work, we will explore how the annotators' characteristics influence their judgments and interact with the task structure effects. We also plan to diversify the tasks to include image evaluations and assignments that allow for a more accurate determination of ground truth. These enhancements will provide a more comprehensive understanding of annotation sensitivity. Working with tasks that have (near) ground truth annotations would help the field make progress towards the development of best practices in annotation collection.

Ethics Statement

In our work, we deal with hate speech and offensive language, which could potentially cause harm (directly or indirectly) to vulnerable social groups. We do not support the views expressed in these hateful posts; we merely venture to analyze this online phenomenon and to study annotation sensitivity.
The collection of annotations and annotator characteristics was reviewed by the IRB of RTI International. Annotators were paid a wage in excess of the US federal minimum wage.

Acknowledgments

This research received funding support from RTI International and BERD@NFDI.

References

Hala Al Kuwatly, Maximilian Wich, and Georg Groh. 2020. Identifying and measuring annotator bias based on annotators' demographic characteristics. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 184-190, Online. Association for Computational Linguistics.

Kofi Arhin, Ioana Baldini, Dennis Wei, Karthikeyan Natesan Ramamurthy, and Moninder Singh. 2021. Ground-truth, whose truth? Examining the challenges with annotating toxic text datasets. CoRR, abs/2112.03529.

Lora Aroyo and Chris Welty. 2015. Truth is a lie: Crowd truth and the seven myths of human annotation. AI Magazine, 36(1):15-24.

Jacob Beck, Stephanie Eckman, Rob Chew, and Frauke Kreuter. 2022. Improving labeling through social science insights: Results and research agenda. In HCI International 2022 - Late Breaking Papers: Interacting with eXtended Reality and Artificial Intelligence, pages 245-261, Cham. Springer Nature Switzerland.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. Proceedings of the International AAAI Conference on Web and Social Media, 11(1):512-515.

Emily Denton, Alex Hanna, Razvan Amironesei, Andrew Smart, and Hilary Nicole. 2021. On the genealogy of machine learning datasets: A critical history of ImageNet. Big Data & Society, 8(2):205395172110359.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Stephanie Eckman and Frauke Kreuter. 2011. Confirmation bias in housing unit listing. Public Opinion Quarterly, 75(1):139-150.

Stephanie Eckman, Frauke Kreuter, Antje Kirchner, Annette Jäckle, Roger Tourangeau, and Stanley Presser. 2014. Assessing the mechanisms of misreporting to filter questions in surveys. Public Opinion Quarterly, 78(3):721-733.

Brian L. Egleston, Suzanne M. Miller, and Neal J. Meropol. 2011. The impact of misclassification due to survey response fatigue on estimation and identifiability of treatment effects. Statistics in Medicine, 30(30):3560-3572.

Benoît Frénay and Michel Verleysen. 2013. Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845-869.

Mirta Galesic and Michael Bosnjak. 2009. Effects of questionnaire length on participation and indicators of response quality in a web survey. Public Opinion Quarterly, 73(2):349-360.

João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. 2014. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4):1-37.

Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1161-1166, Hong Kong, China. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Rachel Horwitz, Frauke Kreuter, and Frederick Conrad. 2017. Using mouse movements to predict web survey response difficulty. Social Science Computer Review, 35(3):388-405.

Christopher K. Hsee and Jiao Zhang. 2004. Distinction bias: Misprediction and mischoice due to joint evaluation. Journal of Personality and Social Psychology, 86(5):680-695.

Frauke Kreuter, editor. 2013. Improving Surveys with Paradata, 1st edition. John Wiley & Sons, Ltd.

Frauke Kreuter, Susan McCulloch, Stanley Presser, and Roger Tourangeau. 2011. The effects of asking filter questions in interleafed versus grouped format. Sociological Methods and Research, 40(88):88-104.

Jon A. Krosnick, Sowmya Narayan, and Wendy R. Smith. 1996. Satisficing in surveys: Initial evidence. New Directions for Evaluation, 1996(70):29-44.

John P. Lalor, Hao Wu, Tsendsuren Munkhdalai, and Hong Yu. 2018. Understanding deep learning performance through an examination of test set difficulty: A psychometric case study. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4711-4716, Brussels, Belgium. Association for Computational Linguistics.

Sunghee Lee, Mengyao Hu, Mingnan Liu, and Jennifer Kelley. 2020. Quantitative evaluation of response scale translation through a randomized experiment of interview language with bilingual English- and Spanish-speaking Latino respondents. In The Essential Role of Language in Survey Research. RTI Press.

Curtis Northcutt, Lu Jiang, and Isaac Chuang. 2021. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70:1373-1411.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024-8035. Curran Associates, Inc.

Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, and Alex Hanna. 2021. Data and its (dis)contents: A survey of dataset development and use in machine learning research. Patterns, 2(11):100336.

Pew Research Center. 2019. When Online Survey Respondents Only Select Some That Apply.

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. 2019. Do ImageNet classifiers generalize to ImageNet?

Rainer Schnell and Frauke Kreuter. 2005. Separating interviewer and sampling-point effects. Journal of Official Statistics, 21(3):389-410.

Howard Schuman and Stanley Presser. 1996. Questions and Answers in Attitude Surveys: Experiments on Question Form, Wording, and Context. Sage.

Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, and Yejin Choi. 2020. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9275-9293, Online. Association for Computational Linguistics.

Gerard J. Tellis and Deepa Chandrasekaran. 2010. Extent and impact of response biases in cross-national survey research. International Journal of Research in Marketing, 27(4):329-341.

Roger Tourangeau, Frauke Kreuter, and Stephanie Eckman. 2012. Motivated underreporting in screening interviews. Public Opinion Quarterly, 76(3):453-469.

Roger Tourangeau, Lance J. Rips, and Kenneth A. Rasinski. 2000. The Psychology of Survey Response, 10th printing. Cambridge University Press.

Bertie Vidgen, Scott Hale, Ella Guest, Helen Margetts, David Broniatowski, Zeerak Waseem, Austin Botelho, Matthew Hall, and Rebekah Tromble. 2020. Detecting East Asian prejudice on social media. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 162-172, Online. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38-45, Online. Association for Computational Linguistics.

Daochen Zha, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, Zhimeng Jiang, Shaochen Zhong, and Xia Hu. 2023. Data-centric artificial intelligence: A survey. arXiv preprint arXiv:2303.10158.

A Appendix

A.1 Selection of Tweets

We selected 3,000 tweets from 25,000, stratified by the previous annotations (Davidson et al., 2017). The nine strata and the sample size in each are shown in Table A.1.

Table A.1: Distribution of sampled tweets by stratum

# of Annotations
HS   OL   Neither      Sample Size   Percent
0    0    3            417           14
0    1    2            417           14
0    2    1            417           14
0    3    0            417           14
1    0    2            160           5
1    2    0            417           14
2    0    1            97            3
2    1    0            417           14
3    0    0            241           8

We do not weight for the probability of selection in our analysis, because our goal is not to make inferences to the Davidson et al. corpus, which is itself not a random sample of tweets.

A.2 Model Training Details

Our implementation of the BERT and LSTM models was based on the libraries pytorch (Paszke et al., 2019) and transformers (Wolf et al., 2020). During training, we used the same hyperparameter settings of the respective LSTM and BERT models for our five different training conditions to keep these variables consistent for comparison purposes. We report the hyperparameter settings of the models in Tables A.2 and A.3. The number of parameters for pre-trained BERT (base) is ~110M, and the number of parameters for the LSTM is 7,025,153. To avoid random effects on training, we trained each model variation with 10 different random seeds {10, 42, 84, 420, 567, 888, 1100, 1234, 5566, 7890} and took the average across the models. All experiments were conducted on an NVIDIA A100 GPU with 80 GB RAM.
Within this computation infrastructure, the LSTM model takes approximately 40 seconds per training data condition, and the BERT model approximately 15 minutes.

Table A.2: Hyperparameter settings of BERT models

Hyperparameter     Value
encoder            bert-base-cased
epochs_trained     20
learning_rate      5e-5
layer_norm_eps     1e-12
batch_size         64
optimizer          AdamW

Table A.3: Hyperparameter settings of LSTM models

Hyperparameter       Value
epochs_trained       20
learning_rate        5e-5
batch_size           64
embedding_dim        512
lstm_hidden_size     512
linear_layer_size    256
dropout_rate         0.3
vocab_size           5,000
bidirectional        True
loss                 BCELoss (binary cross entropy)
optimizer            Adam
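For reference, a minimal PyTorch sketch of a bidirectional LSTM classifier matching the settings in Table A.3; the tokenization and vocabulary pipeline are assumed, and this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Bidirectional LSTM with the Table A.3 hyperparameters (sketch)."""

    def __init__(self, vocab_size=5000, embedding_dim=512,
                 hidden_size=512, linear_size=256, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc1 = nn.Linear(2 * hidden_size, linear_size)  # 2x: forward + backward states
        self.fc2 = nn.Linear(linear_size, 1)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)        # (batch, seq_len, embedding_dim)
        _, (hidden, _) = self.lstm(embedded)        # hidden: (2, batch, hidden_size)
        pooled = torch.cat([hidden[0], hidden[1]], dim=1)
        return torch.sigmoid(self.fc2(self.dropout(torch.relu(self.fc1(pooled)))))

# Trained with nn.BCELoss() and torch.optim.Adam(model.parameters(), lr=5e-5),
# batch size 64, 20 epochs, per Table A.3.
```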
A.3 Annotator Instructions

You will see a series of tweets. For each tweet, we want you to code whether it contains hate speech or contains offensive language. The next few screens explain what hate speech and offensive language are, to help with the coding.

We define hate speech as: language that is used to express hatred towards a targeted group, or language intended to be derogatory, to humiliate, or to insult the members of the group.

Here are examples of tweets that contain hate speech, according to our definition:

"You are a jiggaboo...!"

"they're working on a bill to prevent retards from voting. who knew retards COULD vote? things are starting to make sense now."

"Every slant in #LA should be deported. Those scum have no right to be here. Chinatown should be bulldozed,"

Remember, hate speech is: language that is used to express hatred towards a targeted group or language intended to be derogatory, to humiliate, or to insult the members of the group.

However, context matters. If a tweet discusses someone else's hate speech, it may use hateful terms, but the tweet is not hate speech. For example:

"Why no boycott of the racist "redskins"? #Redskins @ChangeTheName"

This tweet contains a hateful term, but the tweeter is making an anti-racist statement. We would label this tweet as containing offensive language.

We define offensive language as: language that is highly offensive to certain individuals or groups but does not meet the requirements of hate speech.

"Guess who just got an apartment in downtown Columbus? That's right bitch you guessed it, this guy."

This tweet contains offensive language, but it is not hate speech. We would label it as offensive language.

Some tweets contain both hate speech AND offensive language.

"Subtweet me one more time, you dirty chink whore"

Some tweets do not meet our definition of hate speech or offensive language.

"Great lead battle and then Ricky hits Danica for a yellow. Oh boy. #NASCAR"

A.4 Additional Results

We provide additional results for the LSTM models in Table A.4 and Figures A.1, A.2, and A.3, as well as comparisons of demographic covariates across conditions in Figure A.4.

Table A.4: Balanced accuracy and ROC-AUC by annotation condition in the combined test set, averaged over 10 LSTM models

        Bal. Accuracy     ROC-AUC
Cond.   OL      HS        OL      HS
A       0.846   0.806     0.747   0.637
B       0.866   0.803     0.755   0.654
C       0.862   0.800     0.759   0.625
D       0.857   0.801     0.734   0.655
E       0.863   0.794     0.754   0.638

[Figure A.1: Performance (balanced accuracy) of LSTM models across conditions; panels: (a) Predicting Offensive Language, (b) Predicting Hate Speech]

[Figure A.2: Performance (ROC-AUC) of LSTM models across annotation conditions; panels: (a) Predicting Offensive Language, (b) Predicting Hate Speech]

[Figure A.3: Learning curves of LSTM models compared by annotation conditions; panels: (a) Predicting Offensive Language, (b) Predicting Hate Speech]

[Figure A.4: Comparison of demographic covariates across conditions]
