Persian Wordnet Construction using Supervised Learning

This paper presents an automated supervised method for Persian wordnet construction. Using a Persian corpus and a bi-lingual dictionary, the initial links between Persian words and Princeton WordNet synsets have been generated. These links will be di…

Authors: Zahra Mousavi, Heshaam Faili

Zahra Mousavi
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
sz.mousavi@ut.ac.ir

Heshaam Faili
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
hfaili@ut.ac.ir

Abstract: This paper presents an automated supervised method for Persian wordnet construction. Using a Persian corpus and a bilingual dictionary, initial links between Persian words and Princeton WordNet synsets are generated. These links are then classified as correct or incorrect by a trained classification system that employs seven features. The whole method is thus a classification system, trained on a training set in which the FarsNet links serve as the correct instances. State-of-the-art results for automatically derived Persian wordnets are achieved: the resulting wordnet has a precision of 91.18% and includes more than 16,000 words and 22,000 synsets.

Keywords: wordnet; ontology; supervised; Persian language

I. INTRODUCTION

Over the past years, acquiring semantic knowledge about lexical terms has been the concern of many projects in query expansion, text summarization [1], text categorization [2] and the generation of concept hierarchies [3]. For some languages such as English, a broad-coverage semantic taxonomy like Princeton WordNet (PWN) [4] has been constructed manually at great cost in time and money. Two other major efforts in constructing wordnets for other languages were EuroWordNet [5] and BalkaNet [6].
The former deals with European languages such as English, Dutch, German, French, Spanish, Italian, Czech, and Estonian, and the latter deals with languages from the Balkan area such as Romanian, Bulgarian, Turkish, Slovenian, Greek and Serbian. A common feature of wordnets in different languages is the synset. Synsets are sets of synonyms, which are connected to one another by means of semantic relations.

Two main strategies for automatically constructing a wordnet can be considered: 1) merge and 2) expansion [5]. In the merge approach, an independent wordnet for the target language is created, and for each synset in the generated wordnet the equivalent synset in PWN or another available wordnet is identified. This method is more complex than the expansion approach and requires more time. The available lexical resources and wordnet-building tools, as well as the polysemy of the words in the synsets, directly affect the average time consumed in building each lexical entry of the wordnet.

In the expansion approach, one available wordnet, usually PWN, is taken as the source wordnet, and the words associated with its synsets are translated into the target language to generate the initial synsets of the new wordnet. This process rests on the assumption that concepts and their relations are language-independent, which may not hold in some cases. Therefore, coverage of language-specific concepts and properties is not guaranteed by the produced wordnet, which is a drawback of expansion approaches. In these approaches, the structure of the source wordnet is reused for the target language, and other metadata over the source wordnet, such as domain models, can be reused for the target wordnet too. Consequently, the time-consuming and expensive manual process of providing such information is avoided.
The other advantage of this approach is that the resulting wordnets are automatically aligned with each other, which can be exploited extensively in multilingual NLP tasks. In general, the expansion approach is an efficient method for wordnet construction, but the generated wordnet is heavily biased toward, or limited by, the source wordnet.

In the EuroWordNet and BalkaNet projects a top-down methodology was used. In the first step of this methodology, a core wordnet containing all high-level concepts of the language is developed manually. In the next step, the core wordnet is expanded using automated techniques that yield high-confidence results. Following this approach, a number of automated methods that use PWN and other existing lexical resources have been proposed for constructing wordnets for Asian languages such as Japanese, Arabic, Thai, and Persian.

In recent years, several efforts have been made to create a wordnet for the Persian language, and manual, semi-automatic and automatic construction methods have been proposed. In [7] a semi-automatic method is proposed in which, for each Persian word, a number of PWN synsets are suggested by the system and a human annotator later selects the relevant synset. This work was later expanded with other automated methods under human supervision, and an initial Persian wordnet named FarsNet was developed [8]. In [9] an automatic method for Persian wordnet construction based on PWN is introduced. The proposed method uses a bilingual dictionary and Persian and English corpora to link Persian words to PWN synsets. A score function is defined to rank the mappings between Persian words and PWN synsets.
In the next work [10], a word sense disambiguation (WSD) method is employed in an iterative approach based on the Expectation-Maximization (EM) algorithm to estimate a probability for each candidate synset linked to a Persian word. Another iterative approach is presented in [11], in which the probabilities are estimated with a Markov chain Monte Carlo algorithm. An extension of [10] is described in [12], which improved the results by employing a graph-based WSD method. After execution of the EM algorithm, all links with a probability under a predetermined threshold were removed from the wordnet. A threshold of 0.1 yielded a wordnet composed of 11,899 unique words and 16,472 WordNet synsets with a precision of 90%. In this paper, we use this wordnet, the state-of-the-art automatically constructed Persian wordnet, as the baseline for evaluating our wordnet.

In this paper, an expansion-based approach is proposed for constructing a Persian wordnet. Most previously proposed methods for automatic construction of a Persian wordnet follow unsupervised approaches. We present a supervised construction method, because supervised methods tend to be more accurate than unsupervised ones. However, supervised methods usually suffer from a lack of sufficient reliable labeled data. In this research, a training dataset is produced from FarsNet, the pre-existing Persian wordnet. The main idea of this work is to exploit the available links between FarsNet and PWN synsets in order to link other Persian words to PWN synsets. Similar to the work of [13], the construction method is defined as a classifier. By defining seven features for each link, the classifier is able to classify the links into two categories: correct and incorrect.
Available Persian resources are employed to extract distributional and semantic features. The feature set is also enriched with efficient methods for measuring lexical semantic similarity, such as the Word2Vec model [14]. Evaluation of the results indicates an improvement over previously built Persian wordnets.

The rest of the paper is organized as follows. Section 2 presents an overview of some automated methods proposed for constructing wordnets. Section 3 presents our method for automatically extending the Persian wordnet. Experimental results and evaluation of the proposed method are explained in Section 4. Finally, conclusions and future work are presented in Section 5.

II. RELATED WORKS

Many researchers have proposed different approaches for automatically constructing wordnets. In [13] an automatic method for constructing a Korean wordnet using PWN is presented. In this work, links between Korean words and PWN synsets are made using a bilingual dictionary. These links are classified as correct or incorrect by a classifier with six features, trained on a set of 3260 manually classified instances. The performance of each feature was examined by means of precision and coverage, where coverage is the proportion of linked senses of Korean words to all senses of Korean words in a test set. The best feature had 75.21% precision and 59.5% coverage. In addition, the experiments showed that the precision of each feature is always better than a random-choice baseline. Combining the features with a decision tree yielded 93.59% precision and 77.12% coverage for Korean. In [15] a basic English-Russian wordnet was built from English-Russian lexical resources and morphological analysis tools.
Also, in [16] a pattern-based algorithm for extracting lexical-semantic relations in Polish is presented.

In [17], Arabic WordNet is extended semi-automatically using lexical and morphological rules and Bayesian inference. To associate Arabic words with PWN synsets, a Bayesian network with four layers is proposed. Arabic words are placed in the first layer and their corresponding English translations in the second layer. All synsets of the English words in layer 2 are placed in layer 3. Layer 4 is an additional layer of PWN synsets that are associated with the synsets of layer 3 by semantic relations. For Arabic words with only one English translation that is itself monosemous, and for Arabic words whose English translations all belong to a common synset, the association between the word and the common PWN synset is made directly. In other cases a learning algorithm is applied to measure the reliability of each association. A set of candidates is built from pairs (X, Y), where X is an Arabic word and Y is a PWN synset in layer 3 of the Bayesian network that has a non-zero probability and is reachable by a path from X. Each tuple is scored with the posterior probability of Y given the evidence provided by the Bayesian network. Only tuples scored above a predefined threshold are selected for inclusion in the final set of candidates. The best result obtained by this method showed a precision of 71%.

By examining the candidate synsets of a given word in the target language and their relations, criteria can be defined that characterize correct links. In [18], this idea is proposed for constructing a Thai wordnet.
They defined 13 criteria, categorized into three groups: monosemic criteria, which focus on English words with only one meaning; polysemic criteria, which focus on English words with multiple meanings; and structural criteria, which focus on the structural relations between candidate synsets. To verify the links constructed with these 13 criteria, a stratified sampling technique was applied. The verification showed 92% correctness for the best criterion, and 49.25% was reported as the lowest correctness.

In [7], a Persian core wordnet was constructed for a set of common base concepts. To extend the core wordnet, for each synset in PWN all Persian translations of its English words were extracted using a bilingual dictionary, and the appropriate translations were identified using two heuristics and a WSD method. Manual evaluation of the resulting links between Persian words and PWN synsets showed a precision of about 72% in the resulting Persian lexicon. This work was extended in [8] and published as the first Persian wordnet, called FarsNet. Three methods for extracting conceptual relations for nouns were presented. In the first method, a set of 24 patterns for extracting taxonomic relations is defined. In the second, Wikipedia page structures such as tables, bullets, and hyperlinks are used to extract relations between word pairs. In the third, morphological rules are applied to a corpus to extract antonymy relations between adjectives. Their system employs linguistic and statistical methods to cluster adjectives; adjectives that denote different degrees of the same attribute are put in one cluster. In [9] an automatic method for Persian wordnet construction based on PWN is introduced.
It uses a score function to rank the mappings between Persian words and PWN synsets, and the final wordnet is built by selecting the highest-scoring mappings. In the next work [10], the authors proposed an unsupervised method using the EM algorithm to construct a Persian wordnet. To determine candidate synsets for each Persian word, a bilingual dictionary and PWN were used. Next, a probability was calculated for each candidate synset by applying a WSD method in the expectation step. These probabilities were updated in each iteration of the EM algorithm until convergence. Finally, a wordnet including 7,109 unique words and 9,427 PWN synsets was obtained by keeping the 10% most probable word-synset pairs. The evaluation showed a precision of 86.7% on a manually built test set of about 1,500 randomly selected word-synset pairs. An extension of this work is described in [12], which improved the results by changing the WSD method. Because of the resources it employs, this method is also applicable to low-resource languages. The resulting wordnet consists of 11,899 Persian words and 16,472 PWN synsets, with about 30,000 word-synset pairs, and reached a precision of 90%. A similar iterative approach using a Markov chain Monte Carlo algorithm was presented in [11] to construct a Persian wordnet. This method approximates the probability of each candidate synset assigned to a Persian word based on Bayesian inference. Selecting the 10,000 word-synset pairs with the highest probabilities resulted in a wordnet with a precision of 90.46%.

III. PERSIAN WORDNET CONSTRUCTION

The proposed method uses Princeton WordNet, a bilingual dictionary, a pre-existing Persian wordnet (FarsNet), and a Persian corpus as its resources.
Each concept in English is represented by one synset in PWN. Following the assumption of the expansion method, it is assumed that for most concepts in English there exists an equivalent concept in Persian, and language-specific concepts are ignored. Thus, by identifying the proper translations of the English words appearing in each synset, a Persian synset representing the same concept as the English one can be constructed.

The Bijankhan Persian corpus [19] is employed as the resource for extracting the Persian words of the wordnet, which leads to coverage of the more frequently used Persian words in the resulting wordnet. The Bijankhan corpus is available in two versions, of which the second release is used in our experiments. It is a collection of daily news and common texts. All documents in this collection are grouped into about 4300 different subject categories. The corpus contains about ten million manually tagged words with a tag set of 550 Persian part-of-speech (POS) tags [20].

The first step of wordnet construction is translating the Persian words into their English counterparts with a bilingual dictionary. Before translation, however, a lemmatizer is needed to normalize the different forms of the words; otherwise, some words in the corpus may not be found in the dictionary because they appear in inflected forms. For this purpose the STeP-1 tool [21] is used. It contains Persian text-processing tools such as a tokenizer, spell checker, morphological analyzer and POS tagger. Next, each lemmatized Persian word is translated into its English equivalents using the Aryanpour¹ Persian-to-English dictionary. Then Princeton WordNet 3.0 is used to identify the English candidate synsets for each Persian word.
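The lemmatize, translate and look-up steps described above can be sketched as a small pipeline. This is only a minimal illustration under stated assumptions, not the authors' implementation: the three dictionaries are hypothetical toy stand-ins for STeP-1, the Aryanpour dictionary and PWN 3.0, the Persian words are transliterated, and the synset identifiers are invented for the example.

```python
from collections import defaultdict

# Hypothetical toy resources (assumptions for illustration only).
LEMMAS = {"ketabha": "ketab"}                  # inflected form -> lemma
FA_EN = {"ketab": ["book", "volume"]}          # Persian lemma -> English translations
EN_SYNSETS = {                                 # English word -> candidate synset ids
    "book": ["book.n.01", "book.n.02"],
    "volume": ["book.n.02", "volume.n.01"],
}

def candidate_links(persian_tokens):
    """Map each (Persian lemma, synset) link to the English words that produce it."""
    links = defaultdict(set)
    for token in persian_tokens:
        lemma = LEMMAS.get(token, token)       # fall back to the surface form
        for english in FA_EN.get(lemma, []):
            for synset in EN_SYNSETS.get(english, []):
                links[(lemma, synset)].add(english)
    return dict(links)

links = candidate_links(["ketabha", "ketab"])
```

Keeping the set of supporting English words per link, rather than only the link itself, is a design choice of this sketch; it makes counts over translations available to later processing steps.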
By determining all the PWN synsets that include the English translations of a Persian word, the initial links between that Persian word and PWN synsets are generated. More than one Persian word may be linked to the same PWN synset. Because of English word polysemy, some of these Persian words do not carry the meaning of the synset they are linked to. In fact, there are many invalid links between Persian words and PWN synsets, which should be removed.

Some of these links can be deleted by exploiting extra knowledge about Persian words. As mentioned, the Bijankhan corpus is POS-tagged and therefore gives proper evidence about the POS tags of each Persian word. Using this corpus, the probability of observing each Persian word with each of the POS tags noun, verb, adjective and adverb is calculated. This information is used to eliminate incompatible links, that is, links between a PWN synset and a Persian word with inconsistent POS tags. Consequently, 47,291 of the 247,947 links are pruned and 200,656 candidate links remain.

However, there are still many false links that must be removed. For this purpose, seven features are introduced for each of these links, and a classifier is trained on these features to discriminate correct from incorrect links. Some of these features are defined using measures of corpus-based semantic similarity and relatedness.

Over the past years, many articles have addressed the notion of lexical semantic similarity [22]. Studies in this field attempt to determine how semantically close two words are and, if they are similar, what semantic relation they share. A field more general than semantic similarity is semantic relatedness [22].
In this area, some efforts have targeted the design of similarity measures that exploit more or less structured sources of knowledge such as WordNet, dictionaries, Wikipedia articles and corpora. Most of these measures are based on the distributional hypothesis, the idea that words found in similar contexts are more likely to be similar. Each word in the corpus is characterized by a context vector. Each element of this vector is treated as a feature, and its value is calculated by a lexical association measure. The semantic similarity of two words is then calculated by applying a similarity measure to their context vectors. In our experiments, the context vectors of Persian words were constructed from the Bijankhan corpus, using co-occurrence frequency as the association measure. Contexts were restricted to the words within the sentence containing the target word, and the one hundred words with the highest co-occurrence frequency with a word form the context vector (CV) of that word.

Recently, neural embedding techniques such as Word2Vec [14] have attracted much attention from researchers. Word2Vec is an unsupervised method for learning distributional real-valued representations of words from their contexts, capturing the relations between words. Due to its effectiveness, it has been widely used in many natural language processing (NLP) tasks since its publication. It maps words into a low-dimensional vector space in which words with similar contexts lie close to one another. Hence, it provides a good metric for comparing words semantically by means of vector-based similarity measures.

¹ See http://www.aryanpour.com
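The context-vector construction described above (sentence-level co-occurrence counts, keeping the most frequent co-occurring words) could look roughly like the following sketch. The function name and the `top_n` parameter are assumptions made for illustration, the corpus is assumed to be already tokenized and lemmatized, and ties between equally frequent words are broken arbitrarily.

```python
from collections import Counter, defaultdict

def build_context_vectors(sentences, top_n=100):
    """Return {word: set of its top_n most frequent sentence-level co-occurring words}."""
    cooc = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            for j, other in enumerate(sentence):
                # Count every other token of the sentence, skipping the word itself.
                if i != j and other != word:
                    cooc[word][other] += 1
    return {w: {c for c, _ in counts.most_common(top_n)} for w, counts in cooc.items()}

# Toy corpus of three tokenized sentences; top_n=1 keeps only the strongest neighbour.
sentences = [["a", "b", "c"], ["a", "b"], ["a", "d"]]
cv = build_context_vectors(sentences, top_n=1)
```

Representing each context vector as a set of words is sufficient here because the features that use it only need set intersections and unions.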
In our experiments, 300-dimensional Word2Vec vectors for Persian words were trained on the Bijankhan corpus. Using these vectors, the semantic similarity between each pair of Persian words can be computed; here the cosine similarity measure was used. As with the Persian words, for English words about 500 megabytes of English Wikipedia documents were used to construct a context vector for each English word.

[Figure 1: Overview of the proposed method for construction of the Persian wordnet. Steps shown: Extract Persian words, Translate to English, Extract PWN synsets, Prune links by POS, Extract Features, Extract FarsNet Links, Train Set, Classification, Extract Correct Links.]

As mentioned, the whole method is a classifier trained on a generated training dataset. By employing seven features in this classifier, the links between Persian words and PWN synsets are classified into two distinct categories: correct and incorrect. The final Persian wordnet is the set of all links classified as correct. Figure 1 illustrates an overview of the proposed method for wordnet construction. We used the links between Persian words and PWN synsets present in FarsNet as the correct instances of the training data, and a set of randomly selected links was added to the training data as incorrect instances. By exploiting distributional and semantic information extracted from available Persian resources, seven features were defined for the classification task; they are described in the following subsections.

A. Relatedness Measure

In [9] a measure of relatedness between PWN synsets and Persian words is defined.
One drawback of that measure is its use of the WordNet path similarity, which is applicable only to nouns and verbs. Here another approach is used to define a new relatedness measure for each link. A basic idea for calculating the semantic similarity of two words is that two words are similar if their context vectors are similar [22]. So English words appearing in the same synset are expected to appear in the same contexts and thus to have similar context vectors. Based on this notion, a relatedness measure between an English word e and a PWN synset s can be defined by formula 1:

$$Relatedness(e, s) = \frac{\sum_{e' \in s} \frac{|CV(e) \cap CV(e')|}{|CV(e) \cup CV(e')|}}{|\{e' \mid e' \in s\}|} \tag{1}$$

where the |.| operator gives the size of the given collection. According to this formula, an English word e has the highest relatedness to a PWN synset s if it is related to all the words in s.

As previously mentioned, the context vector of each Persian word was extracted from a corpus. Using the Aryanpour Persian-to-English dictionary, the English translations of the words in this context vector were extracted; the result is called the context vector translation (CVT). Considering the link between a PWN synset s and a Persian word f, if f denotes the same concept as s, then its context vector should be similar to the context vectors of the words in s. Because the words in s are English and f is Persian, the CVT of the Persian word is used to calculate this similarity. Thus the relatedness measure of the link between f and s is high if the members of CVT have high relatedness to s.
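Formula 1 is the mean Jaccard overlap between the context vector of e and the context vectors of the words in s. The sketch below illustrates that reading; the function names, the toy context vectors, and the convention of returning zero for empty inputs are assumptions of this example rather than details given in the paper.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def relatedness(e, synset_words, cv):
    """Formula 1: average Jaccard overlap of CV(e) with CV(e') for each e' in s."""
    if not synset_words:
        return 0.0
    cv_e = cv.get(e, set())
    total = sum(jaccard(cv_e, cv.get(w, set())) for w in synset_words)
    return total / len(synset_words)

# Toy English context vectors (hypothetical data).
cv = {
    "car": {"road", "drive", "wheel"},
    "auto": {"road", "drive", "engine"},
    "machine": {"engine", "factory"},
}
score = relatedness("car", ["auto", "machine"], cv)
```

In this toy case the overlap with "auto" is 2/4 and the overlap with "machine" is 0/5, so the average is 0.25.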
However, it must be taken into account that, despite a high relatedness between a CVT element e and s, there might be other senses of the words in s to which e is even more related. Therefore, the relatedness of e and s relative to the sum of the relatedness between e and all synsets containing words of s is considered, rather than the relatedness of e and s itself. According to the following formula, the average relative relatedness of the CVT elements and s is computed as the relatedness measure (R) of f and s:

$$R(f, s) = \frac{\sum_{e \in CVT} \frac{Relatedness(e, s)}{\sum_{s'} Relatedness(e, s')}}{|CVT|} \tag{2}$$

where s′ ranges over all PWN synsets that contain an English word appearing in s, and Relatedness is calculated using formula 1. Since this feature is not computable for Persian words without a context vector, the English equivalents of the Persian word f that link it to PWN synset s can be used as the CVT instead.

B. Synset Strength

The second feature is based on the idea that two synonymous words usually appear in the same contexts [22]. As previously mentioned, the basic method for discovering synonyms is finding words that have similar context vectors. Persian words that have been correctly linked to the same PWN synset are more likely to be synonyms, so their representative vectors should be similar. Consider k Persian words f_1, f_2, f_3, ..., f_k that are linked to the same PWN synset s. For a Persian word f and PWN synset s, the Synset Strength (SS) feature is set to one if k = 1; otherwise it is defined as follows:

$$SS(f, s) = \frac{\sum_{i=1,\ f_i \neq f}^{k} p(f_i, s) \times Similarity(f, f_i)}{k - 1} \tag{3}$$

where p(f_i, s) is the sum of the inverse polysemy degrees of the English words that link Persian word f_i to PWN synset s.
The Similarity measure between two Persian words f_i and f_j is calculated as the cosine similarity of their vectors trained by the Word2Vec model.

C. Context Overlap

PWN provides a general definition or example sentence (gloss) for each synset. One of the basic algorithms for the word sense disambiguation (WSD) task is the Lesk approach [23]. This algorithm uses the dictionary definitions of the various senses of ambiguous words to identify the most likely meanings of the words in a given context. This idea is used here to rate the Persian translations of each PWN synset. To disambiguate the Persian translations of a PWN synset, the overlap between the context vector of the Persian word and the Persian translations of the words in the synset gloss is considered. This feature is calculated using formula 4:

$$ContextOverlap(f, s) = \frac{|GT(s) \cap CV(f)|}{|GT(s) \cup CV(f)|} \tag{4}$$

where GT(s) represents the set of Persian translations of the gloss words of PWN synset s.

D. Domain Similarity

Another similarity measure between two Persian words is defined here, which exploits the domain categories of documents in the Hamshahri text corpus. Hamshahri is one of the online Persian newspapers in Iran; it has been published for more than 20 years and its archive is publicly available. In [24] this archive was used to construct a standard text corpus of 318,000 documents containing about 110 million words. The documents in this corpus are categorized into nine main categories and 36 subcategories (such as Economy, Economy.Bourse, ...). For each Persian word f, a 9-dimensional vector with one element per category is taken as the domain distribution of f. The value of the i-th element is defined as the probability of Persian word f occurring in the documents of the i-th category.
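One possible way to compute such a domain distribution is sketched below. It rests on an assumption that the text does not spell out: occurrences of the word are counted per category and normalized over all categories, so that the vector sums to one and can be treated as a probability distribution. The function name, the example category label "Sport", and the toy documents are hypothetical.

```python
from collections import Counter

def domain_distribution(word, docs, categories):
    """Share of the word's occurrences that falls in each category.

    docs is an iterable of (category, tokens) pairs; the result is aligned
    with `categories`. Normalizing over categories is an assumption of this sketch.
    """
    counts = Counter()
    for category, tokens in docs:
        counts[category] += tokens.count(word)
    total = sum(counts[c] for c in categories)
    if total == 0:
        return [0.0] * len(categories)       # word never observed
    return [counts[c] / total for c in categories]

# Toy corpus with two categories instead of the nine main Hamshahri categories.
docs = [
    ("Economy", ["bank", "market", "bank"]),
    ("Sport", ["match", "bank"]),
    ("Economy", ["market"]),
]
dist = domain_distribution("bank", docs, ["Economy", "Sport"])
```

A document-level variant (counting documents that contain the word rather than tokens) would fit the description equally well; the choice only changes the counting line.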
Domain similarity between two Persian words is calculated using the Jensen-Shannon divergence, a popular method of measuring the similarity between two probability distributions. The square root of the Jensen-Shannon divergence is a metric often referred to as the Jensen-Shannon distance [25, 26]. The Jensen-Shannon divergence between two distributions P and Q is calculated using formula 5:

$$JS(P, Q) = \frac{1}{2}\left(D(P \,\|\, M) + D(Q \,\|\, M)\right) \tag{5}$$

The function D is the Kullback-Leibler divergence, and M is the average of P and Q. Formula 6 is used to compute the similarity between two distributions P and Q:

$$Similarity(P, Q) = 1 - JS(P, Q) \tag{6}$$

The Domain Similarity measure is based on the expectation that synonymous words appear in the same domains, or that their distributions over the different domains are similar. So, according to this feature, a link between a Persian word f and a PWN synset s is likely to be correct if f appears in the same domains as the other Persian words linked to s. If only one Persian word f is linked to PWN synset s, the value of this feature for the corresponding link is set to one. Now consider Persian words f_1, f_2, f_3, ..., f_k that are linked to the same PWN synset s. For a Persian word f and PWN synset s, Domain Similarity (DS) is defined as follows:

$$DS(f, s) = \frac{\sum_{i=1,\ f_i \neq f}^{k} p(f_i, s) \times Similarity(D_f, D_{f_i})}{k - 1} \tag{7}$$

where p(f_i, s) is the sum of the inverse polysemy degrees of the English words that link Persian word f_i to PWN synset s, and D_f is the domain distribution of Persian word f.

E. Monosemous English

This feature is similar to the first heuristic defined in [7]. Suppose that word e is an English translation of Persian word f.
If there is only one synset s in PWN that contains e as a member, the value of this feature for the link between f and s is set to one; otherwise it is zero. Since e is an English translation of f, it shares some concepts with f, so some senses of e in PWN are equivalent to a concept of the Persian word f. When the English word e appears in only one synset, we assume that this synset denotes the concept shared with Persian word f and set the value of this feature to one. Note that Persian word f may have more than one sense; its other senses will be proposed through its other English translations.

F. Synset Commonality

This feature is defined similarly to the second heuristic in [7]. It gives the number of different English words that link a Persian word f to a PWN synset s. The more English translations suggest PWN synset s for a given Persian word f, the more probable it is that s is the common meaning of f and its English translations. Thus, if Persian word f has several English translations and there is a PWN synset that has m of those translations as members, the value of this feature is set to m.

G. Importance

In a Persian/English dictionary, the different meanings of a Persian word can be represented by different English words. On the other hand, for each English word, one or more senses are given in PWN. Assuming that each English translation of a given Persian word represents one of its meanings, one sense of each English translation has the same meaning as the Persian word. The Importance feature is defined to exploit this assumption, and its value is calculated from the values of other features. Consider a Persian word f and one of its English translations e.
Suppose $s_1, s_2, \ldots, s_k$ are the synsets in PWN which contain e as a member. The Importance feature for a link between f and $s_i$ is calculated as follows: four features, Relatedness Measure, Synset Strength, Context Overlap, and Domain Similarity, are initially taken into consideration. For each of them, if $s_i$ has the maximum value compared to the other synsets of English word e, then the Importance value of the link between f and $s_i$ is increased by one. In fact, the link between Persian word f and PWN synset $s_i$ will have the highest Importance only if the values of all the aforesaid features are the maximum compared to the other synsets of English word e.

IV. EXPERIMENTS AND RESULTS

The goal of the experiments is to assess the effectiveness of the proposed features in discriminating between correct and incorrect links by evaluating the accuracy of the classification system. As mentioned, the approach is to train a classifier that makes use of these features. In order to train such a classifier, we need a collection of classified links as a training set. In this regard, we used the pre-existing Persian wordnet, FarsNet, which is the first published Persian wordnet. The process of building the training data relies on the second release of FarsNet. This version organizes more than 36,000 Persian words and more than 20,000 synsets in different hierarchical structures. It also contains inter-lingual relations connecting Persian synsets to English synsets of Princeton WordNet 3.0. Taking advantage of these links, we are able to obtain correct instances of the training data. Table 1 shows some statistics about FarsNet 2.0. For each available link (f, s) between a Persian word and a PWN synset in FarsNet, an instance (f, s, correct) was considered a correct instance of the training set.
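As a concrete illustration of the features that the classifier consumes, the Domain Similarity feature (formulas 5 to 7) and the Importance feature of Section III can be sketched as below. This is a minimal sketch, not the authors' implementation: the function names, the dictionary-based inputs, the base-2 logarithm in the Kullback-Leibler divergence, and the handling of ties in Importance are all assumptions made for illustration.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); log base 2 is an assumption."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_similarity(p, q):
    """Formulas 5 and 6: 1 - JS(P, Q), with M the average of P and Q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    js = 0.5 * (kl_divergence(p, m) + kl_divergence(q, m))
    return 1.0 - js

def domain_similarity(f, weights, domains):
    """Formula 7 for the link (f, s).

    weights: hypothetical dict mapping every Persian word f_i linked to
             synset s (including f itself) to its weight p(f_i, s).
    domains: hypothetical dict mapping each word to its domain distribution.
    """
    k = len(weights)
    if k == 1:
        return 1.0  # a single linked word gets the value one
    total = sum(w * js_similarity(domains[f], domains[fi])
                for fi, w in weights.items() if fi != f)
    return total / (k - 1)

BASE_FEATURES = ("relatedness", "synset_strength",
                 "context_overlap", "domain_similarity")

def importance(target, scores):
    """Count the base features on which `target` reaches the maximum.

    scores: hypothetical dict mapping each synset of English word e to a
            dict with the four base feature values of the link (f, synset).
    Ties count for every synset that reaches the maximum (an assumption).
    """
    return sum(
        1 for name in BASE_FEATURES
        if scores[target][name] == max(s[name] for s in scores.values())
    )
```

With base-2 logarithms the Jensen-Shannon divergence lies between 0 and 1, so the similarity of formula 6 is 1 for identical domain distributions and 0 for distributions with no domain in common.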
| Category | Words | Synsets | Links to PWN |
|---|---|---|---|
| Noun | 22,180 | 11,954 | 10,108 |
| Adjective | 6,560 | 4,261 | 4,516 |
| Adverb | 2,014 | 923 | 929 |
| Verb | 5,691 | 3,294 | 2,678 |
| Total | 36,445 | 20,432 | 18,231 |

Table 1: Statistics of FarsNet 2.0

Considering all available links in FarsNet, 10,952 links were added to the training set as the correct class. In order to generate incorrect instances, 5,000 links between Persian words and PWN synsets, excluding FarsNet links, were selected randomly and added to the training set as (f, s, incorrect). In general, a training set consisting of 10,952 correct and about 5,000 incorrect instances was obtained. Because some links overlapped with the gold dataset that is used in the evaluation process of the experiments, several links were eliminated. The statistics of the final training set are reported in Table 2.

| POS | Correct | Incorrect | Total |
|---|---|---|---|
| Noun | 7,974 | 3,288 | 11,262 |
| Adjective | 2,357 | 1,261 | 3,618 |
| Adverb | 217 | 82 | 299 |
| Verb | 316 | 363 | 679 |
| Total | 10,864 | 4,994 | 15,858 |

Table 2: Statistics of the training set

For each of the links in the training set, the defined features were calculated. In our experiments, the Weka open-source data mining software [27] was used. In order to evaluate the classifier accuracy, two methods were considered. The first method uses the ten-fold cross-validation testing method provided by Weka. Table 3 shows the precision and recall measures obtained from different classifiers. Because the final Persian wordnet is generated by collecting the links classified as correct, the precision on correct-class instances is more important than the other measures. The last two columns of Table 3 show the precision and recall of the correct class for the different classifiers: Random Forest, KNN, Multilayer Perceptron, and Naïve Bayes.
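The training-set construction described at the start of this section can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name `build_training_set`, its parameters, and the representation of a link as a `(persian_word, synset_id)` tuple are hypothetical, and the fixed random seed is there only to make the sampling repeatable.

```python
import random

def build_training_set(candidate_links, farsnet_links, gold_links,
                       n_negative, seed=0):
    """Label links as "correct" or "incorrect" for classifier training.

    candidate_links: all generated (persian_word, synset_id) links.
    farsnet_links:   links present in FarsNet, used as correct instances.
    gold_links:      manually judged links reserved for the final evaluation.
    """
    farsnet = set(farsnet_links)
    gold = set(gold_links)

    # Incorrect instances: random candidate links that FarsNet does not contain.
    pool = sorted(set(candidate_links) - farsnet)
    rng = random.Random(seed)
    negatives = rng.sample(pool, min(n_negative, len(pool)))

    labelled = [(f, s, "correct") for f, s in sorted(farsnet)]
    labelled += [(f, s, "incorrect") for f, s in negatives]

    # Remove every instance that overlaps with the gold (test) data.
    return [(f, s, label) for f, s, label in labelled if (f, s) not in gold]
```

Sampling first and filtering against the gold data afterwards mirrors the order described above, which is why the final class sizes can end up slightly below the initial ones.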
| Classifier | Precision | Recall | Correct Precision | Correct Recall |
|---|---|---|---|---|
| NaïveBayes | 70.4 | 58.3 | 83.6 | 48.6 |
| KNN (k=10) | 67.7 | 69.9 | 74.2 | 86.1 |
| RandomForest | 67.9 | 70.2 | 74.2 | 86.5 |
| MultilayerPerceptron | 68.0 | 70.6 | 73.3 | 89.6 |

Table 3: Precision and Recall of applied classifiers

As shown in Table 3, the best result with respect to the precision of the correct class was achieved by the Naïve Bayes classifier. Therefore, the Naïve Bayes classifier is employed to construct the final wordnet. The links classified as correct, excluding the existing links in FarsNet, were collected to make the final Persian wordnet, with a precision score of 83.6%.²

² The resulting Persian WordNet is freely downloadable from http://ece.ut.ac.ir/en/node/940

In order to assess the effect of each feature on the resulting wordnet, the Naïve Bayes classifier is trained with different configurations of features. For this purpose, the worth of each feature is evaluated by measuring its information gain using Weka attribute selection. Next, features are incrementally added to the feature set in order of their information gain, and the feature set of each step is given to a classifier. Table 4 shows the results of the classifiers in terms of precision, recall and F-measure scores with respect to the correct class. Features are presented in this table according to their information gain rank.

| Features | Precision | Recall | F-measure |
|---|---|---|---|
| Importance | 68.5 | 100 | 81.3 |
| + Synset Commonality | 80.7 | 58.4 | 67.8 |
| + Relatedness Measure | 82.8 | 52.4 | 64.2 |
| + Domain Similarity | 82.6 | 50.4 | 62.6 |
| + Synset Strength | 83.4 | 48.2 | 61.1 |
| + Monosemous English | 83.7 | 48.4 | 61.3 |
| + Context Similarity | 83.6 | 48.6 | 61.5 |

Table 4: The results gained by classifiers trained on an incrementally increasing feature set

As shown in Table 4, the precision usually increases as features are added.
In some cases, such as adding the Context Similarity feature, precision falls while recall increases. Employing all the features leads to a precision of 83.6% and a recall of 48.6% according to the ten-fold cross-validation testing method.

Similar to other works on PWN synset mapping, a manually judged test set is employed for evaluating the final links between Persian words and PWN synsets. In this regard, the method introduced in [12] is used as the baseline. In that work, as in our method, the initial links were generated by linking Persian words in the Bijankhan corpus to PWN synsets. Next, an unsupervised EM-based algorithm using a cross-lingual WSD method was applied to estimate the probabilities for each link. The final wordnet contained all links except the low-rated ones, which do not meet a pre-determined threshold. The highest precision in those experiments was obtained with 0.1 as the threshold, which gives a precision score of 90% and a recall of 35%. We refer to this wordnet as the "EM-based wordnet", in contrast to our final wordnet, the "Supervised wordnet". In the experiments on the EM-based wordnet, a set of manually judged links was obtained to evaluate the results. A subset of these manual judgments, consisting of about 1,000 links, corresponds to our generated links. Moreover, these links are not present in the built training set. Therefore, we used this collection as the test set in the evaluation of the generated wordnet. Table 5 shows some statistics about the test dataset with respect to POS category and label.
| POS | Correct | Incorrect | Total |
|---|---|---|---|
| Noun | 440 | 109 | 549 |
| Adjective | 181 | 57 | 238 |
| Adverb | 27 | 4 | 31 |
| Verb | 103 | 84 | 187 |
| Total | 751 | 254 | 1,005 |

Table 5: Statistics of the test set

Similar to [12], the precision is defined as the number of correct links that are common to the wordnet and the test data, divided by the total number of wordnet links that belong to the test data. Also, the recall of the wordnet is defined as the number of correct links that are common to the wordnet and the test data, divided by the total number of correct links in the test set. The manual evaluation on the selected links shows a precision of 91.18% and a recall of 45.41%, which surpasses the EM-based wordnet, the state-of-the-art automatically constructed Persian wordnet. Table 6 shows the precision and recall of the supervised wordnet for different POS categories. The best precision was obtained for nouns, with a score of 93.69%, and the best recall for adverbs, with a score of 51.85%.

| POS | Precision | Recall | F-measure |
|---|---|---|---|
| Noun | 93.69 | 47.27 | 62.84 |
| Adjective | 90.43 | 46.96 | 61.82 |
| Adverb | 93.33 | 51.85 | 66.67 |
| Verb | 79.07 | 33.01 | 46.58 |
| Total | 91.18 | 45.41 | 60.62 |

Table 6: Precision and Recall of the resulting wordnet with respect to POS category

In addition to precision, the other notable factor in judging the quality of wordnets is their size, that is, the number of unique words, synsets and word-sense pairs covered by the wordnet. Table 7 presents this information for the induced wordnet. The resulting wordnet covers about 16,000 words and 22,000 synsets and makes about twice as many connections from Persian words to PWN synsets as FarsNet. According to the first column of Table 7, nouns have the largest proportion of the resulting wordnet and verbs have the lowest coverage.

³ See http://wordnetcode.princeton.edu/standoff-files/core-wordnet.txt
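The precision and recall definitions used for the manual evaluation can be sketched as follows. This is a minimal sketch assuming that links are `(persian_word, synset_id)` tuples and that the judged test set is a dictionary from link to label; the function name `evaluate_wordnet` and both parameter names are hypothetical.

```python
def evaluate_wordnet(wordnet_links, judged_links):
    """Precision, recall and F-measure of a wordnet against judged links.

    wordnet_links: iterable of (persian_word, synset_id) pairs in the wordnet.
    judged_links:  dict mapping (persian_word, synset_id) to "correct"
                   or "incorrect".
    """
    # Only wordnet links that were manually judged take part in the evaluation.
    in_test = set(wordnet_links) & set(judged_links)
    hits = sum(1 for link in in_test if judged_links[link] == "correct")
    total_correct = sum(1 for label in judged_links.values()
                        if label == "correct")

    precision = hits / len(in_test) if in_test else 0.0
    recall = hits / total_correct if total_correct else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

As a quick consistency check, plugging the Total row of Table 6 into the F-measure formula, 2 × 91.18 × 45.41 / (91.18 + 45.41), gives about 60.6, in line with the reported value.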
| POS | Words | Synsets | Word-sense Pairs |
|---|---|---|---|
| Noun | 10,486 | 13,947 | 23,425 |
| Adjective | 4,775 | 5,433 | 11,037 |
| Adverb | 460 | 508 | 778 |
| Verb | 408 | 2,883 | 3,107 |
| Total | 16,129 | 22,771 | 38,347 |

Table 7: Number of words, synsets and word-sense pairs in the resulting Persian wordnet

In the following, the sizes of the two wordnets are compared in terms of the number of unique words, synsets and word-sense pairs. Table 8 reports these statistics for the induced wordnet and the baseline method. Also, the number of unique words with more than one sense in the wordnet, divided by the total number of unique words, is given in the last column of this table as the polysemy rate. A higher polysemy rate can be considered a point of strength for a wordnet, since it leads to more efficiency in NLP and IR tasks. According to Table 8, the supervised wordnet outperforms the EM-based wordnet with respect to size, too. However, the proportion of polysemous words, that is, words with more than one sense, is higher in the EM-based wordnet than in the supervised wordnet.

| | Unique Words | Synsets | Word-sense pairs | Polysemy rate |
|---|---|---|---|---|
| EM-based wordnet | 11,899 | 16,472 | 29,944 | 0.73 |
| Supervised wordnet | 16,129 | 22,771 | 38,347 | 0.51 |

Table 8: Size of the supervised wordnet in comparison with the EM-based wordnet

The other measure considered in the evaluation of the EM-based wordnet concerns the coverage of Persian corpus words, PWN synsets and core concepts. Core concepts are the more frequently used synsets in a language; covering them in a wordnet boosts its efficiency. A set of approximately the 5,000 most frequently used PWN word senses was created in [28], which is exploited here.³ Table 9 compares the supervised wordnet and the EM-based wordnet from the coverage point of view.
It is clear that the supervised wordnet has wider coverage of the Bijankhan corpus and PWN synsets, but the EM-based wordnet covers a higher percentage of core concepts.

| | Bijankhan (unique words) | PWN synsets | Core synsets |
|---|---|---|---|
| EM-based wordnet | 11,543 | 14% | 53% |
| Supervised wordnet | 14,797 | 19.35% | 38.76% |

Table 9: Coverage of the supervised wordnet in comparison with the EM-based wordnet

In general, the experiments showed that the supervised wordnet performed better than the EM-based wordnet in many aspects. To the best of our knowledge, the obtained precision is the highest among all automatically built Persian wordnets. Also, it is the largest fully automatically constructed Persian wordnet, covering more than 16,000 words, 22,000 PWN synsets and 38,000 word-sense pairs.

V. CONCLUSION AND FUTURE WORKS

Automatic construction of a Persian wordnet using available resources such as Persian and English monolingual corpora, a bilingual dictionary, and a Persian part-of-speech tagged corpus is the main concern of this paper. Also, FarsNet, the pre-existing Persian wordnet, was exploited to produce a training set. For each link between Persian words and PWN synsets, seven features were defined and a classifier was trained to discriminate between correct and incorrect links. The features were defined using measures of corpus-based semantic similarity and relatedness. Our experiments on the Persian language showed a precision of 91.18% for the links that are classified as correct, which outperforms the previously proposed automated methods.

The experiments revealed that there are problems in calculating some feature values. In PWN, some synsets are given only a short gloss, which causes the calculated Context Overlap feature for the linked Persian words to be lower than for other synsets that are linked to those Persian words.
In order to overcome this problem, synsets that have a semantic relation with these synsets, such as hypernyms, can be considered. Another observation is that the corresponding PWN synsets of some senses of English words contain only one English word. For example, "bank" appears in 10 different noun synsets, 6 of which contain only "bank". In these cases, the values of the Synset Strength and Domain Similarity features become equal for all links derived from such English words. As we examined PWN, we observed that it contains 7,935 English words which appear alone in more than one synset. These words make up 5 percent of all English words in PWN, and in these cases it is expected that the other features discriminate between correct and incorrect links.

The experiments showed that verbs have the lowest proportion of the induced wordnet. Persian verbs are categorized into simple and compound verbs. Compound verbs are composed of a verbal part and one or several non-verbal parts. This category accounts for a larger share of Persian verbs. Since in the proposed method the Bijankhan corpus was used to extract Persian words and each token was treated as a single word, the extracted verbs usually correspond to simple verbs, and our wordnet lacks satisfactory coverage of compound verbs. We need a method for extracting compound verbs from the corpus, which can be considered as future work. Also, the features can be enriched with POS-wise features to obtain more accurate results. The whole method is language-independent and can be applied to any language for which the needed resources are available.

REFERENCES

1. Clarke, C.L., et al. The influence of caption features on clickthrough patterns in web search. in Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. 2007. ACM.
2. Li, C.H., J.C. Yang, and S.C. Park, Text categorization algorithms using semantic approaches, corpus-based thesaurus and WordNet. Expert Systems with Applications, 2012. 39(1): p. 765-772.
3. Lee, S., S.-Y. Huh, and R.D. McNiel, Automatic generation of concept hierarchies using WordNet. Expert Systems with Applications, 2008. 35(3): p. 1132-1144.
4. Fellbaum, C., WordNet: An Electronic Lexical Database: Bradford Book. 1998, Cambridge, MA: MIT Press.
5. Vossen, P., Introduction to EuroWordNet. Computers and the Humanities, 1998. 32(2-3): p. 73-89.
6. Tufis, D., D. Cristea, and S. Stamou, BalkaNet: Aims, methods, results and perspectives. A general overview. Romanian Journal of Information Science and Technology, 2004. 7(1-2): p. 9-43.
7. Shamsfard, M. Developing FarsNet: A lexical ontology for Persian. in 4th Global WordNet Conference, Szeged, Hungary. 2008.
8. Shamsfard, M., et al. Semi automatic development of FarsNet; the Persian wordnet. in Proceedings of 5th Global WordNet Conference, Mumbai, India. 2010.
9. Montazery, M. and H. Faili. Automatic Persian wordnet construction. in Proceedings of the 23rd International Conference on Computational Linguistics: Posters. 2010. Association for Computational Linguistics.
10. Montazery, M. and H. Faili. Unsupervised Learning for Persian WordNet Construction. in RANLP. 2011.
11. Fadaee, M., et al., Automatic WordNet Construction Using Markov Chain Monte Carlo. Polibits, 2013(47): p. 13-22.
12. Taghizadeh, N. and H. Faili, Automatic Wordnet Development for Low-Resource Languages using Cross-Lingual WSD. J. Artif. Intell. Res. (JAIR), 2016. 56: p. 61-87.
13. Lee, C., G. Lee, and S.J. Yun. Automatic WordNet mapping using word sense disambiguation. in Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 13. 2000. Association for Computational Linguistics.
14. Mikolov, T., et al. Distributed representations of words and phrases and their compositionality. in Advances in neural information processing systems. 2013.
15. Yablonsky, S. English-Russian WordNet for Multilingual Mappings. in Proceedings of 2010 Workshop on Cross-Cultural and Cross-Lingual Aspects of the Semantic Web. 2010. Citeseer.
16. Kurc, R., M. Piasecki, and S. Szpakowicz. Automatic acquisition of wordnet relations by distributionally supported morphological patterns extracted from Polish corpora. in International Conference on Text, Speech and Dialogue. 2010. Springer.
17. Rodríguez, H., et al. Arabic WordNet: Semi-automatic Extensions using Bayesian Inference. in LREC. 2008.
18. Sathapornrungkij, P. and C. Pluempitiwiriyawej, Construction of Thai WordNet lexical database from machine readable dictionaries. Proc. 10th Machine Translation Summit, Phuket, Thailand, 2005.
19. Bijankhan, M., The role of the corpus in writing a grammar: An introduction to a software. Iranian Journal of Linguistics, 2004. 19(2).
20. Oroumchian, F., et al., Creating a feasible corpus for Persian POS tagging. Department of Electrical and Computer Engineering, University of Tehran, 2006.
21. Shamsfard, M., H.S. Jafari, and M. Ilbeygi. STeP-1: A Set of Fundamental Tools for Persian Text Processing. in LREC. 2010.
22. Zesch, T. and I. Gurevych, Wisdom of crowds versus wisdom of linguists - measuring the semantic relatedness of words. Natural Language Engineering, 2010. 16(1): p. 25.
23. Lesk, M. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. in Proceedings of the 5th annual international conference on Systems documentation. 1986. ACM.
24. AleAhmad, A., et al., Hamshahri: A standard Persian text collection. Knowledge-Based Systems, 2009. 22(5): p. 382-387.
25. Endres, D.M. and J.E. Schindelin, A new metric for probability distributions. IEEE Transactions on Information Theory, 2003. 49(7): p. 1858-1860.
26. Österreicher, F. and I. Vajda, A new class of metric divergences on probability spaces and its applicability in statistics. Annals of the Institute of Statistical Mathematics, 2003. 55(3): p. 639-653.
27. Hall, M., et al., The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 2009. 11(1): p. 10-18.
28. Boyd-Graber, J., et al. Adding dense, weighted connections to WordNet. in Proceedings of the third international WordNet conference. 2006. Citeseer.
