Using Global Constraints and Reranking to Improve Cognates Detection
Michael Bloodgood
Department of Computer Science
The College of New Jersey
Ewing, NJ 08628
mbloodgood@tcnj.edu

Benjamin Strauss
Computer Science and Engineering Dept.
The Ohio State University
Columbus, OH 43210
strauss.105@osu.edu

Abstract

Global constraints and reranking have not been used in cognates detection research to date. We propose methods for using global constraints by performing rescoring of the score matrices produced by state of the art cognates detection systems. Using global constraints to perform rescoring is complementary to state of the art methods for performing cognates detection and results in significant performance improvements beyond current state of the art performance on publicly available datasets with different language pairs and various conditions such as different levels of baseline state of the art performance and different data size conditions, including with more realistic large data size conditions than have been evaluated with in the past.

1 Introduction

This paper presents an effective method for using global constraints to improve performance for cognates detection. Cognates detection is the task of identifying words across languages that have a common origin. Automatic cognates detection is important to linguists because cognates are needed to determine how languages evolved. Cognates are used for protolanguage reconstruction (Hall and Klein, 2011; Bouchard-Côté et al., 2013). Cognates are important for cross-language dictionary look-up and can also improve the quality of machine translation, word alignment, and bilingual lexicon induction (Simard et al., 1993; Kondrak et al., 2003). A word is traditionally only considered cognate with another if both words proceed from the same ancestor.
Nonetheless, in line with the conventions of previous research in computational linguistics, we set a broader definition. We use the word 'cognate' to denote, as in (Kondrak, 2001): "...words in different languages that are similar in form and meaning, without making a distinction between borrowed and genetically related words; for example, English 'sprint' and the Japanese borrowing 'supurinto' are considered cognate, even though these two languages are unrelated." These broader criteria are motivated by the ways scientists develop and use cognate identification algorithms in natural language processing (NLP) systems. For cross-lingual applications, the advantage of such technology is the ability to identify words for which similarity in meaning can be accurately inferred from similarity in form; it does not matter if the similarity in form is from strict genetic relationship or later borrowing (Mericli and Bloodgood, 2012).

Cognates detection has received a lot of attention in the literature. The research of the use of statistical learning methods to build systems that can automatically perform cognates detection has yielded many interesting and creative approaches for gaining traction on this challenging task. Currently, the highest-performing state of the art systems detect cognates based on the combination of multiple sources of information. Some of the most indicative sources of information discovered to date are word context information, phonetic information, word frequency information, temporal information in the form of word frequency distributions across parallel time periods, and word burstiness information. See section 3 for fuller explanations of each of these sources of information that state of the art systems currently use.
Scores for all pairs of words from language L1 × language L2 are generated by generating component scores based on these sources of information and then combining them in an appropriate manner. Simple methods of combination are giving equal weighting for each score, while state of the art performance is obtained by learning an optimal set of weights from a small seed set of known cognates. Once the full matrix of scores is generated, the word pairs with the highest scores are predicted as being cognates.

The methods we propose in the current paper consume as input the final score matrix that state of the art methods create. We test if our methods can improve performance by generating new rescored matrices by rescoring all of the pairs of words by taking into account global constraints that apply to cognates detection. Thus, our methods are complementary to previous methods for creating cognates detection systems. Using global constraints and performing rescoring to improve cognates detection has not been explored yet. We find that rescoring based on global constraints improves performance significantly beyond current state of the art levels.

[This paper was published in the Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1983-1992, Vancouver, Canada, July 2017. © 2017 Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P17-1181]
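The pipeline just described — combine component scores into one matrix, then predict the highest-scoring pairs as cognates — can be sketched as follows. This is a minimal illustration of the thresholding step only; the function and variable names are our own, not the authors' code.

```python
import numpy as np

def predict_cognates(score_matrix, threshold):
    """Return (i, j) word-pair indices whose combined score exceeds the
    threshold, highest-scoring pairs first.

    score_matrix[i, j] is the combined cognateness score for the i-th
    L1 word paired with the j-th L2 word.
    """
    pairs = np.argwhere(score_matrix > threshold)
    # Sort the predicted pairs by descending score.
    order = np.argsort(-score_matrix[pairs[:, 0], pairs[:, 1]])
    return [(int(i), int(j)) for i, j in pairs[order]]

scores = np.array([[0.9, 0.2],
                   [0.1, 0.7]])
print(predict_cognates(scores, 0.5))  # [(0, 0), (1, 1)]
```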
The cognates detection task is an interesting task to apply our methods to for a few reasons:

• It's a challenging unsolved task where ongoing research is frequently reported in the literature trying to improve performance;

• There is significant room for improvement in performance;

• It has a global structure in its output classifications since if a word lemma¹ w_i from language L1 is cognate with a word lemma w_j from language L2, then w_i is not cognate with any other word lemma from L2 different from w_j and w_j is not cognate with any other word lemma w_k from L1.

• There are multiple standard datasets freely and publicly available that have been worked on with which to compare results.

• Different datasets and language pairs yield initial score matrices with very different qualities. Some of the score matrices built using the existing state of the art best approaches yield performance that is quite low (11-point interpolated average precision of only approximately 16%) while the state of the art score matrices for other language pairs and datasets are already able to achieve 11-point interpolated average precision of 57%.

¹ A lemma is a base form of a word. For example, in English the words 'baked' and 'baking' would both map to the lemma 'bake'. Lemmatizing software exists for many languages and lemmatization is a standard preprocessing task conducted before cognates detection.

Although we are not aware of work using global constraints to perform rescoring to improve cognates detection, there are related methodologies for reranking in different settings. Methodologically related work includes past work in structured prediction and reranking (Collins, 2002; Collins and Roark, 2004; Collins and Koo, 2005; Taskar et al., 2005a,b).
Note that in these past works, there are many instances with structured outputs that can be used as training data to learn a structured prediction model. For example, a seminal application in the past was using online training with structured perceptrons to learn improved systems for performing various syntactic analyses and tagging of sentences such as POS tagging and base noun phrase chunking (Collins, 2002). Note that in those settings the unit at which there are structural constraints is a sentence. Also note that there are many sentences available so that online training methods such as discriminative training of structured perceptrons can be used to learn structured predictors effectively in those settings.

In contrast, for the cognates setting the unit at which there are structural constraints is the entire set of cognates for a language pair and there is only one such unit in existence (for a given language pair). We call this a single overarching global structure to make the distinction clear. The method we present in this paper deals with a single overarching global structure on the predictions of all instances in the entire problem space for a task. For this type of setting, there is only a single global structure in existence, contrasted with the situation of there being many sentences each imposing a global structure on the tagging decisions for that individual sentence. Hence, previous structured prediction methods that require numerous instances each having a structured output on which to train parameters via methods such as perceptron training are inapplicable to the cognates setting. In this paper we present methods for rescoring effectively in settings with a single overarching global structure and show their applicability to improving the performance of cognates detection.
Still, we note that philosophically our method builds on previous structured prediction methods since in both cases there is a similar intuition in that we're using higher-level structural properties to inform and accordingly alter our system's predictions of values for subitems within a structure.

In section 2 we present our methods for performing rescoring of matrices based on global constraints such as those that apply for cognates detection. The key intuition behind our approach is that the scoring of word pairs for cognateness ought not be made independently as is currently done, but rather that global constraints ought to be taken into account to inform and potentially alter system scores for word pairs based on the scores of other word pairs. In section 3 we provide results of experiments testing the proposed methods on the cognates detection task on multiple datasets with multiple language pairs under multiple conditions. We show that the new methods complement and effectively improve performance over state of the art performance achieved by combining the major research breakthroughs that have taken place in cognates detection research to date. Complete precision-recall curves are provided that show the full range of performance improvements over the current state of the art that are achieved. Summary measurements of performance improvements, depending on the language pair and dataset, range from 6.73 absolute MaxF1 percentage points to 16.75 absolute MaxF1 percentage points and from 5.58 absolute 11-point interpolated average precision percentage points to 17.19 absolute 11-point interpolated average precision percentage points. Section 4 discusses the results and possible extensions of the method. Section 5 wraps up with the main conclusions.

2 Algorithm

While our focus in this paper is on using global constraints to improve cognates detection, we believe that our method is useful more generally.
We therefore abstract out some of the specifics of cognates detection and present our algorithm more generally in this section, with the hope that it will be able to be used in the future for other applications in addition to cognates detection. None of our abstraction harms understanding of our method's applicability to cognates detection and the fact that the method may be more widely beneficial does not in any way detract from the utility we show it has for improving cognates detection.

A common setting is where one has a set X = {x_1, x_2, ..., x_n} and a set Y = {y_1, y_2, ..., y_n} where the task is to extract (x, y) pairs such that (x, y) are in some relation R. Here are examples:

• X might be a set of states and Y might be a set of cities and the relation R might be "is the capital of";

• X might be a set of images and Y might be a set of people's names and the relation R might be "is a picture of";

• X might be a set of English words and Y might be a set of French words and the relation R might be "is cognate with".

A common way these problems are approached is that a model is trained that can score each pair (x, y) and those pairs with scores above a threshold are extracted. We propose that often the relation will have a tendency, or a hard constraint, to satisfy particular properties and that this ought to be utilized to improve the quality of the extracted pairs.

The approach we put forward is to re-score each (x, y) pair by utilizing scores generated for other pairs and our knowledge of properties of the relation being extracted. In this paper, we present and evaluate methods for improving the scores of each (x, y) pair for the case when the relation is known to be one-to-one and discuss extensions to other situations. The current approach is to generate a matrix of scores for each candidate pair as follows:

$$\mathit{Score}_{X,Y} = \begin{bmatrix} s_{x_1,y_1} & \cdots & s_{x_1,y_n} \\ \vdots & \ddots & \vdots \\ s_{x_n,y_1} & \cdots & s_{x_n,y_n} \end{bmatrix} \quad (1)$$
Then those pairs with scores above a threshold are predicted as being in the relation. We now describe methods for sharpening the scores in the matrix by utilizing the fact that there is an overarching global structure on the predictions.

2.1 Reverse Rank

We know that if (x_i, y_j) ∈ R, then (x_k, y_j) ∉ R for k ≠ i when R is 1-to-1. We define

$$\mathit{reverse\_rank}(x_i, y_j) = \left|\{ x_k \in X \mid s_{x_k,y_j} \geq s_{x_i,y_j} \}\right|.$$

Intuitively, a high reverse rank means that there are lots of other elements of X that score better to y_j than x_i does; this could be evidence that (x_i, y_j) is not in R and ought to have a lower score. Alternatively, if there are very few or no other elements of X that score better to y_j than x_i does, this could be evidence that (x_i, y_j) is in R and ought to have a higher score. In accord with this intuition, we use reverse rank as the basis for rescaling our scores as follows:

$$\mathit{score}_{RR}(x_i, y_j) = \frac{s_{x_i,y_j}}{\mathit{reverse\_rank}(x_i, y_j)}. \quad (2)$$

2.2 Forward Rank

Analogous to reverse rank, another basis we can use for adjusting scores is the forward rank. We define

$$\mathit{forward\_rank}(x_i, y_j) = \left|\{ y_k \in Y \mid s_{x_i,y_k} \geq s_{x_i,y_j} \}\right|.$$

We then scale the scores analogously to how we did with reverse ranks via an inverse linear function.²

2.3 Combining Reverse Rank and Forward Rank

For combining reverse rank and forward rank, we present results of experiments doing it two ways. The first is a 1-step approach:

$$\mathit{score}_{RR\_FR\_1step}(x_i, y_j) = \frac{s_{x_i,y_j}}{\mathit{product}}, \quad (3)$$

where

$$\mathit{product} = \mathit{reverse\_rank}(x_i, y_j) \times \mathit{forward\_rank}(x_i, y_j). \quad (4)$$

The second combination method involves first computing the reverse rank and re-adjusting every score based on the reverse ranks. Then in a second step the new scores are used to compute forward ranks and then those scores are adjusted based on the forward ranks. We refer to this method as RR_FR_2step.
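The reverse-rank and forward-rank rescorings of equations (2)-(4) can be sketched compactly with NumPy. This is our own illustrative implementation, not the authors' released code; note that a pair is counted in its own rank, so the best-scoring pair gets rank 1.

```python
import numpy as np

def reverse_ranks(S):
    # reverse_rank(x_i, y_j): number of rows k with S[k, j] >= S[i, j].
    return (S[None, :, :] >= S[:, None, :]).sum(axis=1)

def forward_ranks(S):
    # forward_rank(x_i, y_j): number of columns k with S[i, k] >= S[i, j].
    return (S[:, None, :] >= S[:, :, None]).sum(axis=2)

def rescore_rr(S):
    return S / reverse_ranks(S)                        # equation (2)

def rescore_rr_fr_1step(S):
    return S / (reverse_ranks(S) * forward_ranks(S))   # equations (3)-(4)

def rescore_rr_fr_2step(S):
    S1 = S / reverse_ranks(S)      # first rescore by reverse ranks,
    return S1 / forward_ranks(S1)  # then by forward ranks of the new scores

S = np.array([[0.9, 0.4],
              [0.8, 0.3]])
print(rescore_rr(S))           # [[0.9, 0.4], [0.4, 0.15]]
print(rescore_rr_fr_1step(S))  # [[0.9, 0.2], [0.4, 0.075]]
```

The broadcasting here builds an n×n×n intermediate, which is fine for a sketch; for the large data condition in section 3, per-column and per-row sorting would be the memory-efficient route.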
2.4 Maximum Assignment

If one makes the assumption that all elements in X and Y are present and have their partner element in the other set present with no extra elements and the sets are not too large, then it is interesting to compute what the 'maximal assignment' would be using the Hungarian Algorithm to optimize:

$$\max_{Z \subseteq X \times Y} \sum_{(x,y) \in Z} \mathit{score}(x, y)$$
$$\text{s.t. } (x_i, y_j) \in Z \Rightarrow (x_k, y_j) \notin Z, \ \forall k \neq i$$
$$\phantom{\text{s.t. }} (x_i, y_j) \in Z \Rightarrow (x_i, y_k) \notin Z, \ \forall k \neq j. \quad (5)$$

We do this on datasets where the assumptions hold and see how close our methods get to the Hungarian maximal assignment at similar points of the precision-recall curves. For our larger datasets where the assumptions don't hold, the Hungarian either can't complete due to limited computational resources or it functioned poorly in comparison with the performance of our reverse rank and forward rank combination methods.

² For both reverse rank and forward rank we also experimented with exponential decay and step functions, but found that simple division by the ranks worked as well or better than any of those more complicated methods.

3 Experiments

Our goal is to test whether using the global structure algorithms we described in section 2 can significantly boost performance for cognates detection. To test this hypothesis, our first step is to implement a system that uses state of the art research results to generate the initial score matrices as a current state of the art system would currently do for this task. To that end, we implemented a baseline state of the art system that uses the information sources that previous research has found to be helpful for this task such as phonetic information, word context information, temporal context information, word frequency information, and word burstiness information (Kondrak, 2001; Mann and Yarowsky, 2001; Schafer and Yarowsky, 2002; Klementiev and Roth, 2006; Irvine and Callison-Burch, 2013).
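Returning briefly to subsection 2.4: the maximal assignment of equation (5) is a standard linear sum assignment problem, so a sketch can lean on an off-the-shelf Hungarian algorithm implementation such as SciPy's (our own illustration, not the authors' code).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def maximum_assignment(S):
    """One-to-one assignment maximizing the total score, i.e. the
    objective of equation (5), via the Hungarian algorithm."""
    rows, cols = linear_sum_assignment(S, maximize=True)
    return list(zip(rows.tolist(), cols.tolist()))

S = np.array([[0.9, 0.8],
              [0.7, 0.1]])
# Greedily taking the single best pair (0, 0) would leave row 1 with
# only 0.1; the maximal assignment instead picks (0, 1) and (1, 0),
# since 0.8 + 0.7 > 0.9 + 0.1.
print(maximum_assignment(S))  # [(0, 1), (1, 0)]
```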
Consistent with past work (Irvine and Callison-Burch, 2013), we use supervised training to learn the weights for combining the various information sources. The system combines the sources of information by using weights learned by an SVM (Support Vector Machine) on a small seed training set of cognates³ to optimize performance. This baseline system obtains state of the art performance on cognates detection. Using this state of the art system as our baseline, we investigated how much we could improve performance beyond current state of the art levels by applying the rescoring algorithm we described in section 2. We performed experiments on three language pairs: French-English, German-English, and Spanish-English, with different text corpora used as training and test data. The different language pairs and datasets have different levels of performance in terms of their baseline current state of the art score matrices. In the next few subsections, we describe our experimental details.

³ The small seed set was randomly selected and less than 20% in all cases. It was not used for testing. Note that using this data to optimize performance of the baseline system makes our baseline even stronger and makes it even harder for our new rescoring method to achieve larger improvements.

3.1 Lemmatization

We used morphological analyzers to convert the words in text corpora to lemma form. For English, we used the NLTK WordNetLemmatizer (Bird et al., 2009). For French, German, and Spanish we used the TreeTagger (Schmid, 1994).

3.2 Word Context Information

We used the Google N-Gram corpus (Michel et al., 2010). For English we used the English 2012 Google 5-gram corpus, for French we used the French 2012 Google 5-gram corpus, for German we used the German 2012 Google 5-gram corpus, and for Spanish we used the Spanish 2012 Google 5-gram corpus.
From these corpora we compute word context similarity scores across languages using Rapp's method (Rapp, 1995, 1999). The intuition behind this method is that cognates are more likely to occur in correlating context windows and this statistic inferred from large amounts of data captures this correlation.

3.3 Frequency Information

The intuition is that over large amounts of data cognates should have similar relative frequencies. We compute our relative frequencies by using the same corpora mentioned in the previous subsection.

3.4 Temporal Information

The intuition is that cognates will have similar temporal distributions (Klementiev and Roth, 2006). To compute the temporal similarity we use newspaper data and convert it to simple daily word counts. For each word in the corpora the word counts create a time series vector. The Fourier transform is computed on the time series vectors. Spearman rank correlation is computed on the transform vectors. For English we used the English Gigaword Fifth Edition⁴. For French we used French Gigaword Third Edition⁵. For Spanish we used Spanish Gigaword First Edition⁶. The German news corpora were obtained by web crawling http://www.tagesspiegel.de/ and extracting the news articles.

⁴ Linguistic Data Consortium Catalog No. LDC2011T07
⁵ Linguistic Data Consortium Catalog No. LDC2011T10
⁶ Linguistic Data Consortium Catalog No. LDC2006T12

3.5 Word Burstiness

The intuition is that cognates will have similar burstiness measures (Church and Gale, 1995). For word burstiness we used the same corpora as for the temporal information.

3.6 Phonetic Information

The intuition is that cognates will have correspondences in how they are pronounced. For this, we compute a measurement based on Normalized Edit Distance (NED).
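The paper does not spell out its exact NED formula. One common variant — Levenshtein distance normalized by the length of the longer string, flipped into a similarity — can be sketched as follows; this normalization is our assumption, not necessarily the authors' exact measurement.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def ned_similarity(a, b):
    """1 - NED: edit distance normalized by the longer word's length."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

# The paper's own example pair from the introduction:
print(ned_similarity("sprint", "supurinto"))  # 0.666... (distance 3 / length 9)
```

In practice this would be run on phonetic transcriptions rather than raw orthography.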
3.7 Combining Information Sources

We combine the information sources by using a linear Support Vector Machine to learn weights for each of the information sources from a small seed training set of cognates. So our final score assigned to a candidate cognate pair (x, y) is:

$$\mathit{score}(x, y) = \sum_{m \in \mathit{metrics}} w_m \, \mathit{score}_m(x, y), \quad (6)$$

where metrics is the set of measurements such as phonetic similarity measurements, word burstiness similarity, relative frequency similarity, etc. that were explained in subsections 3.2 through 3.6; w_m is the learned weight for metric m; and score_m(x, y) is the score assigned to the pair (x, y) by metric m. The scores thus assigned represent a state of the art approach for filling in the matrix identified in equation 1. At this point the matrix of scores would be used to predict cognates. We now turn to evaluation of the use of the global constraint rescoring methods from section 2 for improving performance beyond the state of the art levels.

3.8 Using Global Constraints to Rescore

For our cognates data we used the French-English pairs from (Bergsma and Kondrak, 2007) and the German-English and Spanish-English pairs from (Beinborn et al., 2013). Figure 1 shows the precision-recall⁷ curves for French-English.

⁷ Precision and recall are the standard measures used for systems that perform search. Precision is the percentage of predicted cognates that are indeed cognate. Recall is the percentage of cognates that are predicted as cognate. We vary the threshold that determines cognateness to generate all points along the precision-recall curve. We start with a very high threshold enabling precision of 100% and lower the threshold until recall of 100% is reached. In particular, we sort the test examples by score in descending order and then go down the list of scores in order to complete the entire precision-recall curve.
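Given learned weights, equation (6) in subsection 3.7 amounts to a weighted sum of the per-metric score matrices. A minimal sketch follows; the metric matrices and weights here are made-up toy values — in the paper the weights come from a linear SVM trained on the seed set.

```python
import numpy as np

# Hypothetical per-metric score matrices for a 2x2 vocabulary; in the
# paper these come from the phonetic, context, frequency, temporal,
# and burstiness measurements of subsections 3.2-3.6.
metric_scores = {
    "phonetic": np.array([[0.9, 0.1], [0.2, 0.8]]),
    "context":  np.array([[0.7, 0.3], [0.4, 0.6]]),
}

# Hypothetical learned weights, one per metric.
weights = {"phonetic": 0.6, "context": 0.4}

# Equation (6): score(x, y) = sum over metrics m of w_m * score_m(x, y).
combined = sum(weights[m] * metric_scores[m] for m in metric_scores)
print(combined)
```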
Figure 1: Precision-Recall Curves for French-English. Baseline denotes state of the art performance. [Figure: precision (y-axis, 0-100) vs. recall (x-axis, 0-100) curves for Baseline, RR, RR_FR_1step, RR_FR_2step, Max Assignment, and Max Assignment Score.]

Figure 2 shows the performance for German-English, and Figure 3 shows the performance for Spanish-English. Note that state of the art performance (denoted in the figures as Baseline) has very different performance across the three datasets, but in all cases the systems from section 2 that incorporate global constraints and perform rescoring greatly exceed current state of the art performance levels. The Max Assignment is really just the single point that the Hungarian finds. We drew lines connecting it, but keep in mind those lines are just connecting the single point to the endpoints. Max Assignment Score traces the precision-recall curve back from the Max Assignment by steadily increasing the threshold so that only points in the maximum assignment set with scores above the increasing threshold are predicted as cognate.

For the non-max-assignment curves, it is sometimes helpful to compute a single metric summarizing important aspects of the full curve. For this purpose, MaxF1 and 11-point interpolated average precision are often used. MaxF1 is the F1 measure (i.e., harmonic mean of precision and recall) at the point on the precision-recall curve where F1 is highest. The interpolated precision p_interp at a given recall level r is defined as the highest precision level found for any recall level r′ ≥ r:

$$p_{\mathit{interp}}(r) = \max_{r' \geq r} p(r'). \quad (7)$$

The 11-point interpolated average precision (11-point IAP) is then the average of the p_interp at r = 0.0, 0.1, ..., 1.0.
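Both summary measures can be computed directly from a score-ranked list of test examples, following footnote 7 and equation (7). This is our own sketch of these standard measures, not the authors' evaluation code.

```python
import numpy as np

def pr_curve(scores, labels):
    """Precision and recall at each prefix of the score-ranked list
    (labels are 1 for cognate, 0 for non-cognate)."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / labels.sum()
    return precision, recall

def max_f1(precision, recall):
    """F1 at the point on the curve where F1 is highest."""
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return f1.max()

def eleven_point_iap(precision, recall):
    """Equation (7): p_interp(r) is the best precision at any recall
    level >= r; average over r = 0.0, 0.1, ..., 1.0."""
    return np.mean([precision[recall >= r].max() if (recall >= r).any() else 0.0
                    for r in np.linspace(0, 1, 11)])

precision, recall = pr_curve([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(round(max_f1(precision, recall), 4))             # 0.8
print(round(eleven_point_iap(precision, recall), 4))   # 0.8485
```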
Table 1 shows these performance measures for French-English, Table 2 shows the results for German-English, and Table 3 shows the results for Spanish-English. In all cases, using global structure greatly improves upon the state of the art baseline performance.

Figure 2: Precision-Recall Curves for German-English. Baseline denotes state of the art performance. [Figure: same curves and axes as Figure 1.]

Figure 3: Precision-Recall Curves for Spanish-English. Baseline denotes state of the art performance. [Figure: same curves and axes as Figure 1.]

METHOD         MAXF1   11-POINT IAP
BASELINE       54.92   50.99
RR             62.94   59.62
RR_FR_1STEP    68.35   64.42
RR_FR_2STEP    69.72   67.29

Table 1: French-English Performance. BASELINE indicates current state of the art performance.

METHOD         MAXF1   11-POINT IAP
BASELINE       21.38   16.25
RR             22.71   17.80
RR_FR_1STEP    28.68   22.37
RR_FR_2STEP    28.11   21.83

Table 2: German-English Performance. BASELINE indicates current state of the art performance.

METHOD         MAXF1   11-POINT IAP
BASELINE       56.26   57.03
RR             68.52   69.33
RR_FR_1STEP    70.66   71.47
RR_FR_2STEP    73.01   74.22

Table 3: Spanish-English Performance. BASELINE indicates current state of the art performance.

In (Bergsma and Kondrak, 2007), for French-English data a result of 66.5 11-point IAP is reported for a situation where word alignments from a bitext are available and a result of 77.7 11-point IAP is reported for a situation where translation pairs are available in large quantities. The setting considered in the current paper is much more challenging since it does not use bilingual dictionaries or word alignments from bitexts. The setting in the current paper is the one mentioned as future work on page 663 of (Bergsma and Kondrak, 2007): "In particular, we plan to investigate approaches that do not require the bilingual dictionaries or bitexts to generate training data."

Note that the evaluation thus far is a bit artificial for real cognates detection because in a real setting you wouldn't only be selecting matches for relatively small subsets of words that are guaranteed to have a cognate on the other side. Such was the case for our evaluation where the French-English set had approx. 600 cognate pairs, the German-English set had approx. 1000 pairs, and the Spanish-English set had approx. 3000 pairs. In a real setting, the system would have to consider words that don't have a cognate match in the other language and not only words that were hand-selected and guaranteed to have cognates. We are not aware of others evaluating according to this much more difficult condition, but we think it is important to consider especially given the potential impacts it could have on the global structure methods we've put forward. Therefore, we run a second set of evaluations where we take the ten thousand most common words in our corpora for each of our languages, which contain many of the cognates from the standard test sets, and we add in any remaining words from the standard test sets that didn't make it into the top ten thousand. We then repeat each of the experiments under this much more challenging condition.

Figure 4: Precision-Recall Curves for French-English (large data). Note that Baseline denotes state of the art performance. [Figure: curves for Baseline, RR, RR_FR_1step, RR_FR_2step.]

With approx. ten thousand squared candidates, i.e., approx.
100 million candidates to consider for cognateness, this is a large data condition. The Hungarian didn't run to completion on two of the datasets due to limited computational resources. On French-English it completed, but achieved poorer performance than any of the other methods. This makes sense, as it is designed for when there really is a bipartite matching to be found, like in the artificial yet standard cognates evaluation that was just presented. When confronted with large amounts of words that create a much denser space and have no match at all on the other side, the all-or-nothing assignments of the Hungarian are not ideal. The reverse rank and forward rank rescoring methods are still quite effective in improving performance, although not by as much as they did in the small data results from above.

Figure 5: Precision-Recall Curves for German-English (large data). Note that Baseline denotes state of the art performance. [Figure: curves for Baseline, RR, RR_FR_1step, RR_FR_2step.]

Figure 6: Precision-Recall Curves for Spanish-English (large data). Note that Baseline denotes state of the art performance. [Figure: curves for Baseline, RR, RR_FR_1step, RR_FR_2step.]

Figure 4 shows the full precision-recall curves for French-English for the large data condition, Figure 5 shows the curves for German-English for the large data condition, and Figure 6 shows the results for Spanish-English for the large data condition. Tables 4 through 6 show the summary metrics for the three language pairs for the large data experiments. We can see that the reverse rank and forward rank methods of taking into account the global structure of interactions among predictions are still helpful, providing large improvements in performance even in this challenging large data condition over strong state of the art baselines that
make cognate predictions independently of each other and don't do any rescoring based on global constraints.

METHOD         MAXF1   11-POINT IAP
BASELINE       55.08   51.35
RR             60.88   58.79
RR_FR_1STEP    65.87   63.55
RR_FR_2STEP    65.76   65.26

Table 4: French-English Performance (large data). BASELINE indicates state of the art performance.

METHOD         MAXF1   11-POINT IAP
BASELINE       21.25   16.17
RR             24.78   19.13
RR_FR_1STEP    30.72   24.97
RR_FR_2STEP    30.34   24.86

Table 5: German-English Performance (large data). BASELINE indicates state of the art performance.

METHOD         MAXF1   11-POINT IAP
BASELINE       54.75   54.55
RR             62.52   61.42
RR_FR_1STEP    66.45   65.89
RR_FR_2STEP    66.38   65.50

Table 6: Spanish-English Performance (large data). BASELINE indicates state of the art performance.

4 Discussion

We believe that this work opens up new avenues for further exploration. A few of these include the following:

• investigating the utility of applying and extending the method to other applications such as Information Extraction applications, many of which have similar global constraints as cognates detection;

• investigating how to handle other forms of global structure including tendencies that are not necessarily hard constraints;

• developing more theory to more precisely understand some of the nuances of using global structure when it's applicable and making connections with other areas of machine learning such as semi-supervised learning, active learning, etc.; and

• investigating how to have a machine learn that global structure exists and learn what form of global structure exists.

5 Conclusions

Cognates detection is an interesting and challenging task.
Previous work has yielded state of the art approaches that create a matrix of scores for all word pairs based on optimized weighted combinations of component scores computed on the basis of various helpful sources of information, such as phonetic information, word context information, temporal context information, word frequency information, and word burstiness information. However, when assigning a score to a word pair, the current state of the art methods do not take into account the scores assigned to other word pairs. We proposed a method for rescoring the matrix that current state of the art methods produce by taking into account the scores assigned to other word pairs.

The methods presented in this paper are complementary to existing state of the art methods, easy to implement, computationally efficient, and practically effective in improving performance by large amounts. Experimental results reveal that the new methods significantly improve state of the art performance in multiple cognates detection experiments conducted on standard, freely and publicly available datasets with different language pairs and various conditions, such as different levels of baseline performance and different data size conditions, including with more realistic large data size conditions than have been evaluated with in the past.

References

Lisa Beinborn, Torsten Zesch, and Iryna Gurevych. 2013. Cognate production using character-based machine translation. In Proceedings of the Sixth International Joint Conference on Natural Language Processing. Asian Federation of Natural Language Processing, Nagoya, Japan, pages 883–891. http://www.aclweb.org/anthology/I13-1112.

Shane Bergsma and Grzegorz Kondrak. 2007. Alignment-based discriminative string similarity. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics, Prague, Czech Republic, pages 656–663. http://www.aclweb.org/anthology/P07-1083.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media, Inc., 1st edition.

Alexandre Bouchard-Côté, David Hall, Thomas L. Griffiths, and Dan Klein. 2013. Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences 110:4224–4229. https://doi.org/10.1073/pnas.1204678110.

Kenneth W. Church and William A. Gale. 1995. Poisson mixtures. Natural Language Engineering 1:163–190.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 1–8. https://doi.org/10.3115/1118693.1118694.

Michael Collins and Terry Koo. 2005. Discriminative reranking for natural language parsing. Computational Linguistics 31(1):25–70. https://doi.org/10.1162/0891201053630273.

Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL '04), Main Volume. Barcelona, Spain, pages 111–118. https://doi.org/10.3115/1218955.1218970.

David Hall and Dan Klein. 2011. Large-scale cognate recovery. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Edinburgh, Scotland, UK, pages 344–354. http://www.aclweb.org/anthology/D11-1032.

Ann Irvine and Chris Callison-Burch. 2013. Supervised bilingual lexicon induction with multiple monolingual signals. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Atlanta, Georgia, pages 518–523. http://www.aclweb.org/anthology/N13-1056.

Alexandre Klementiev and Dan Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Sydney, Australia, pages 817–824. https://doi.org/10.3115/1220175.1220278.

Grzegorz Kondrak. 2001. Identifying cognates by phonetic and semantic similarity. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies. Association for Computational Linguistics, Stroudsburg, PA, USA, NAACL '01, pages 1–8. http://www.aclweb.org/anthology/N/N01/N01-1014.pdf.

Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. 2003. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Companion Volume of the Proceedings of HLT-NAACL 2003–short Papers - Volume 2. Association for Computational Linguistics, Stroudsburg, PA, USA, NAACL-Short '03, pages 46–48. http://www.aclweb.org/anthology/N/N03/N03-2016.pdf.

Gideon S. Mann and David Yarowsky. 2001. Multipath translation lexicon induction via bridge languages. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies. Association for Computational Linguistics, Stroudsburg, PA, USA, NAACL '01, pages 1–8. http://www.aclweb.org/anthology/N/N01/N01-1020.pdf.

Benjamin S. Mericli and Michael Bloodgood. 2012. Annotating cognates and etymological origin in Turkic languages. In Proceedings of the First Workshop on Language Resources and Technologies for Turkic Languages. European Language Resources Association, Istanbul, Turkey, pages 47–51. http://arxiv.org/abs/1501.03191.

Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. 2010. Quantitative analysis of culture using millions of digitized books. Science 331(6014):176–182. https://doi.org/10.1126/science.1199644.

Reinhard Rapp. 1995. Identifying word translations in non-parallel texts. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Cambridge, Massachusetts, USA, pages 320–322. https://doi.org/10.3115/981658.981709.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, College Park, Maryland, USA, pages 519–526. https://doi.org/10.3115/1034678.1034756.

Charles Schafer and David Yarowsky. 2002. Inducing translation lexicons via diverse similarity measures and bridge languages. In Proceedings of the 6th Conference on Natural Language Learning - Volume 20. Association for Computational Linguistics, Stroudsburg, PA, USA, pages 1–7. http://www.aclweb.org/anthology/W/W02/W02-2026.pdf.

Helmut Schmid. 1994. Part-of-speech tagging with neural networks. In COLING. pages 172–176. http://www.aclweb.org/anthology/C/C94/C94-1027.pdf.

Michel Simard, George F. Foster, and Pierre Isabelle. 1993. Using cognates to align sentences in bilingual corpora. In Proceedings of the 1993 Conference of the Centre for Advanced Studies on Collaborative Research: Distributed Computing - Volume 2. IBM Press, CASCON '93, pages 1071–1082.
Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. 2005a. Learning structured prediction models: A large margin approach. In Proceedings of the 22nd International Conference on Machine Learning. ACM, pages 896–903.

Ben Taskar, Simon Lacoste-Julien, and Dan Klein. 2005b. A discriminative matching approach to word alignment. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Vancouver, British Columbia, Canada, pages 73–80. http://www.aclweb.org/anthology/H/H05/H05-1010.pdf.