Dating medieval English charters

Deeds, or charters, dealing with property rights, provide a continuous documentation which can be used by historians to study the evolution of social, economic and political changes. This study is concerned with charters (written in Latin) dating fro…

Authors: Gelila Tilahun, Andrey Feuerverger, Michael Gervers

The Annals of Applied Statistics
2012, Vol. 6, No. 4, 1615–1640
DOI: 10.1214/12-AOAS566
© Institute of Mathematical Statistics, 2012

By Gelila Tilahun, Andrey Feuerverger and Michael Gervers
University of Toronto

Deeds, or charters, dealing with property rights, provide a continuous documentation which can be used by historians to study the evolution of social, economic and political changes. This study is concerned with charters (written in Latin) dating from the tenth through early fourteenth centuries in England. Of these, at least one million were left undated, largely due to administrative changes introduced by William the Conqueror in 1066. Correctly dating such charters is of vital importance in the study of English medieval history. This paper is concerned with computer-automated statistical methods for dating such document collections, with the goal of reducing the considerable efforts required to date them manually and of improving the accuracy of assigned dates. Proposed methods are based on such data as the variation over time of word and phrase usage, and on measures of distance between documents. The extensive (and dated) Documents of Early England Data Set (DEEDS) maintained at the University of Toronto was used for this purpose.

1. Introduction. Our object in this paper is to contribute toward the development of statistical procedures for computerized calendaring (i.e., dating) of text-based documents arising, for example, in collections of historical or other materials. The particular data set which motivated this study is the Documents of Early England Data Set (DEEDS) maintained at the Centre for Medieval Studies of the University of Toronto.
This data set consists of charters, that is, documents evidencing the transfer and/or possession of land and/or movable property, and the rights which govern them. The documents in question date from the tenth through early fourteenth centuries and are written in Latin, the administrative language of their time. They were mostly obtained from cartularies and charter collections produced in England and Wales, with a few from Scotland.

A peculiarity of that era is that most of the charters that were issued do not bear a date or other chronological marker. This is particularly so from the time of the Conquest in 1066, until about 1307, when fewer than 10% of the more than one million surviving charters bore dates. (A more complete background to these circumstances is provided in Section 2.) Charters dating from the twelfth and thirteenth centuries, however, are a vital source for the study of English social, economic and political history, and significant historical information can be derived when such charters can be dated or sequenced accurately. (For some examples, see Section 2.)

Received November 2011; revised April 2012.
Supported by a grant from the Natural Sciences and Engineering Research Council of Canada and by a grant from Google Inc.
Key words and phrases. Bandwidth selection, cross-validation, medieval charters, DEEDS data set, generalized linear models, kernel smoothing, local log-likelihood, maximum prevalence method, nearest neighbor methods (kNN), quantile regression, text mining.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Applied Statistics, 2012, Vol. 6, No. 4, 1615–1640. This reprint differs from the original in pagination and typographic detail.
The charters comprising the DEEDS data set are derived from among those charters which can in fact be accurately dated, and, specifically, to within a year of their actual issue. A key aim of the DEEDS project was to produce a reliable data base from which methods for dating the undated charters could be devised. The DEEDS data set currently consists of some 10,000 documents, in computer readable form, taken from published editions of charter sources. These have all been dated by historians on the basis of internal dates or other internal chronological markers such as person or place names, or reference to a datable event. (Note, however, that dating manually, for instance, by comparing names, is prone to errors which can multiply when charters are used to date other charters; not infrequent names such as “William son of Richard son of William son of Richard” can easily be generationally misaligned.)

One key idea underlying our work is that changes in language use across time can be used to help identify the date of an undated document. For example, a study of dated charters shows that the phrase “amicorum meorum vivorum et mortuorum” (“of my friends living and dead”) was in currency between the years 1150 and 1240. As another example, the phrase “Francis et Anglicis” (a form of address: “to French and English”) was phased out when Normandy was lost by England to the French in 1204. By combining evidence from many words and phrases, and/or by examining measures of distance between documents, our goal is to develop algorithms to help automate the process of estimating the dates of undated charters through purely computational means.

In Section 2 we provide further historical background concerning the charters with which the DEEDS data set is concerned.
We explain there how it happened that so many charters had been left undated, and indicate the importance that dating charters correctly has for research into the social, economic and political history of England in the high middle ages. Following this, we provide a more detailed description of that part of the DEEDS data set on which our work was based. In Section 3 we first briefly discuss some concepts relevant for statistical processing of text-based documents, and set down the notation to which we will adhere throughout. We then review some previous calendaring work that had been carried out using the DEEDS data set. In Sections 4, 5 and 6 we discuss three distinct methods for calendaring undated charters. The methods described in Section 4 are based on nearest neighbors (kNN); essentially, these methods average the dates of documents in a training set which have known dates, and which are “closest” to the one being dated. This approach requires notions of distance between documents which we also discuss there, as well as the selection of tuning parameters using cross-validation. The method proposed in Section 5 is based on an analogue of maximum likelihood which we refer to as the method of maximum prevalence (MP). This method attempts to assign a probability, at every point in time, that the document would have randomly been produced then, and it estimates the date of the document to be the time at which this probability is greatest. Finally, in Section 6, we propose a method based on determining the minimum of a nonparametric quantile regression curve fitted to a scatterplot of the distances from a document to be dated to the documents in a test set, against the known dates of those test documents.
Some asymptotic theory for the estimation methods is discussed briefly in Section 7, and based on the three statistical methods discussed, numerical work we carried out using the DEEDS data set is described in Section 8. Some concluding remarks are provided in Section 9 where avenues for further work are also indicated.

The method discussed in Section 3 is due to R. Fiallos, but is discussed here in statistical terminology and in greater detail than in Fiallos (2000). The methods reviewed in Section 4 are from Feuerverger et al. (2005, 2008) and are included here for comparison and completeness. The maximum prevalence method described in Section 5 is our main new methodological contribution. As well, a key contribution of our work lies in the novel application of the mentioned estimation methods to historical data of the type considered here.

This work may be seen in the context of other work in the digital humanities, temporal language modeling and information retrieval. Some entry points to that literature in the context of calendaring documents include de Jong, Rode and Hiemstra (2005), Kanhabua and Nørvåg (2008, 2009) and the references therein. For broader context see, for example, Berry and Browne (2005) and Manning, Raghavan and Schütze (2008).

2. Description of the data set. The keeping of records pertaining to the ownership and transfer of property is as old as writing itself, and dates back to at least the third millennium BC in Sumeria where such documents were inscribed on clay. Consequently, deeds, or charters (as they are known), provide a continuous legal documentation which can be used by historians to study the evolution of social, economic and administrative changes. For charters to be used in this way, however, establishing an accurate chronology is important.
Below, we will use the term charter to represent an official legal document, often written or issued by a religious, lay or royal institution, which typically provides evidence of the transfer of landed or movable property and the rights which govern them.

It was the fate of England, between the time of the Conquest in 1066 when William the Conqueror (also Duke of Normandy) ascended the English throne, until the start of the reign of Edward II in 1307, that (in contrast to the Roman and papal traditions) most charters issued did not bear a date regardless of the level of society in which the charters originated. William I introduced into the royal chancery the then-current Norman custom of issuing charters without dates or other chronological markers. This custom continued until the reign of England's sixth post-Conquest (and crusader) king, Richard the Lionhearted (1189–1199), when, for the first time, documents issued from the royal chancery began regularly to include a date. It was, however, not until the accession of the tenth king, Edward II (1307–1327), that the custom of including dates also became universally adopted by those responsible (ecclesiastics and laymen) for issuing private charters.

Charters from the twelfth and thirteenth centuries, written in Latin (the administrative language of the time), are the predominant source for the study of English social, economic and political history of that era. It is estimated that at least one million charters have survived from that nearly 250 year period, some as originals, but most as copies in cartularies (i.e., deed books). Of these, well over 90 percent do not bear dates, so that fewer than 10% of them can be dated at all accurately.
Although increasingly less so with the passage of time, even at the turn of the fourteenth century the percentage of English charters bearing dates remains modest.

Significant historical information can be derived when charters can be dated or sequenced correctly as the following three examples attest: (i) A study of donations to the twelfth-century Order of the Hospital of Saint John of Jerusalem allowed historians to conclude that the Order became militarized in response to the fall of Edessa in 1144, and to the call for the Second Crusade in 1145. (ii) Widespread reluctance to incorporate the invocation of divine intervention into legal language of the day evidences the social unrest in England under the Papal Interdict of 1208–1214. (iii) With the Crusades came the foundation of the military-religious orders known as the Templars and the Hospitallers who financed their activities in part through the management of properties in Europe and the Middle East. The relative growth of their estates in London and its suburbs from the twelfth to the fourteenth centuries confirms without a doubt that as London spread outside its ancient Roman walls in the twelfth century, the Templars played a far more significant role in suburban development, and from a much earlier period, than did the Hospitallers. Further background and examples may be found in Gervers (2000), Gervers and Hamonic (2010), and references therein.

The DEEDS database, maintained at the University of Toronto, is now a corpus of over 10,000 medieval Latin charters dealing primarily with land and movable objects (grants, leases, agreements, etc.) and rights regulating their use.
The charters in this corpus are all dated ones; they were either dated internally or they contained sufficient information to enable historians to situate them to within a year of their issue. These charters were all obtained from published editions of charter sources covering England and Wales, and a few from Scotland, and were derived predominantly from the archives of religious houses and towns, as well as lay institutions such as colleges and universities. (Note that because the charters were taken from published sources, they necessarily bear any editorial decisions made by the publishing author.) The DEEDS project has, as a key objective, to establish computerized methodologies for dating the vast number of medieval charters that have not yet been dated in the hope that, taken together, the dated documents from the database, and those to which dates can be attributed via statistical and other means, may allow historians to construct a more precise understanding of the evolution of English society within that era. We remark that due to the paucity of surviving documents, and the rarity among them of charters bearing dates, there is very little in the DEEDS database from before 1160.

Original charters, written on parchment, and bearing the seal of the issuer or his patron, are rare. Most of the charters that have survived today exist as copies in deed books known as cartularies which were produced periodically during the eleventh to fifteenth centuries. (Such copying could occasionally introduce transcriptional or other changes and inaccuracies.) Consequently, palaeography and sigillography generally cannot help in the calendaring process, leaving the evolution over time of vocabulary usage, word patterns and document structure as the primary data from which dating can be carried out.
These charters are preserved today in such repositories as the National Archives, the British Library, the archives of Oxford and Cambridge Universities, in county record offices and in private collections.

The data: Although the DEEDS data set has grown, 3353 documents were available to us when our computations were implemented; we now describe this data set. Prior to their analyses, certain preprocessing steps were applied. Dates were mapped into the Julian calendar. Documents were normalized for variations in spelling, and all punctuation marks were removed. Names were left unchanged, and just as they appeared in the document. All numbers appearing in a document were encased between exclamation signs (thus, xv became !xv!) and all numbers were subsequently treated as being the same distinct word. (We are not referring here to actual dates which might appear in a document allowing it to be dated without difficulty.) The determination of distinct words was taken to be case sensitive; this rule was applied even to the first words of sentences whose first character was generally in upper case. A sample of a document processed in this way is provided at the end of this section.

Figures 1 and 2, as well as Table 1, provide some graphical and tabular information about our 3353 dated DEEDS documents. Figure 1 is a histogram of the known dates for the documents; the earliest of these is dated 1089, and the latest is dated 1438. The mean date of these charters is 1237 with a standard deviation of 46 years. Figure 2 is a histogram of the lengths (i.e., word counts) of the documents; the shortest of these consisted of only 15 words, and the longest of 2054 words; the median and mean of the word counts were 202 and 237, respectively, while the lower and upper quartiles were 151 and 275 words. Very short or very long documents are rare. Words consisted of an average of 6.5 characters. No dependencies worthy of note were detected between the lengths of the documents with their dates, their contents or with any other features.

Fig. 1. Histogram for the distribution of dates of the 3353 dated documents.

Fig. 2. Histogram for the distribution of lengths (word counts) of the 3353 documents.

Table 1. Frequency of word repetitions in the data set of 3353 documents, comprising 50,006 distinct words

Word frequency                            Number of occurrences
Words occurring only once                 28,282
Words occurring exactly twice             7223
Words occurring exactly three times       3265
Words occurring more than three times     11,236
Words occurring more than 10 times        4952
Words occurring more than 30 times        2330
Words occurring more than 100 times       1004
Words occurring more than 300 times       415
Words occurring more than 1000 times      109

Among the 3353 documents, a total of 50,006 distinct words occurred. Of these, 28,282 words (56%) occurred only once. Words which occurred only once were not considered relevant for our study because such words could not simultaneously occur in both a test subset and a validation subset of the data. The frequency of repetition for repeated words is given in Table 1. While it is possible that in a few instances such repetitions all occurred within the same document, we did not keep track of such occurrences.

Finally, we exhibit here one of the DEEDS charters after preprocessing as indicated above. This document deals with the transfer of a messuage (house and appurtenances) in Nottingham for an annual payment of four pounds sterling.
It bears serial number 00650032 in the DEEDS data set and has been dated internally by regnal year to 1230–1231:

Omnibus sancte matris ecclesie filiis ad quos presens scriptum pervenerit Simon abbas de Rufford' et conventus eiusdem loci salutem Noverit universitas vestra nos dedisse concessisse quiete clamasse et hac presenti carta nostra confirmasse Johanni filio Bele de Notingha' unum mesuagium cum pertinentiis in Notingha' quod jacet inter terram Walteri Karkeney et terram Ade de Estweyt habend et tenend eidem Johanni et heredibus suis et heredibus eorum in feodo et hereditate de nobis vel atornatis nostris libere quiete integre pacifice et honorifice reddendo inde annuatim nobis vel atornatis nostris quatuor solidos sterlingorum ad duos terminos anni scilicet duos solidos ad Pentecosten et duos solidos ad festum sancti Martini pro omni servicio consuetudine seculari demanda et exactione Et nos predictam terram cum pertinentiis predicto Johanni et heredibus suis vel assignatis suis vel heredibus eorum contra omnes homines warantizabimus sicut donatores nostri predictam terram nobis warantizabunt Ut autem hec nostra donacio et concessio rata et stabilis imposterum permaneat hanc presentem cartam sigillo nostro roboravimus Hiis testibus Willelmo Brian Astino filio Alicie prepositis Burgi Anglico de Notinga' anno regni Regis Henrici filii Johannis Regis !xv! Henrico Kytte Henrico le Taylur Augustino clerico et aliis.

3. Previous work. In this section we describe some previous work on the problem of calendaring undated English charters that had been carried out using the DEEDS data set. First, however, we define some basic terms and set out the notation that we will adhere to throughout.
We will use $\mathcal{D}$ to denote a generic text document; $\mathcal{D}$ will frequently be considered to be random, that is, a selection from an effectively infinite collection of documents that could have arisen in the relevant random experiment. Our data corpus will typically be denoted as $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_n$; our notation will not distinguish whether these represent random documents or their actual realizations, as this will always be clear from the context.

A document $\mathcal{D}$ consists of a string (ordered sequence) of not necessarily distinct words $(w_1, w_2, \ldots, w_m)$, where $N(\mathcal{D}) \equiv |\mathcal{D}| = m$ is the length of the document. A shingle of size $k$, or $k$-shingle, is a substring $s_k = (w_{j+1}, w_{j+2}, \ldots, w_{j+k})$ of $k$ consecutive words in $\mathcal{D}$; here $0 \le j \le m - k$ so there are $m - k + 1$ (not necessarily distinct) $k$-shingles in $\mathcal{D}$. We will let $s_k(\mathcal{D})$ denote the set of these (not necessarily distinct) $k$-shingles, while $S_k(\mathcal{D})$ will denote the set of distinct $k$-shingles of $\mathcal{D}$. The cardinalities of these sets satisfy $|S_k(\mathcal{D})| \le |s_k(\mathcal{D})| = m - k + 1$. When $k$ is considered to be fixed, and given a $k$-shingle $s \in s_k(\mathcal{D})$, we will let $n_s(\mathcal{D})$ denote the number of times this shingle occurs in $s_k(\mathcal{D})$. Finally, the date, $t$, of a document will be denoted by $t(\mathcal{D}) = t$.

Turning now to previous work on the DEEDS data, Rodolfo Fiallos worked for the DEEDS project for many years, during which time he devised a method for dating the manuscripts called the MT method. See Fiallos (2000). MT stands for Multiplicador Total in Spanish and translates into English as “Total Multiplier.” Fiallos' method is based on matching patterns (shingles of arbitrary length) which occur in the document we seek to date and which occur also in one or more of the documents in a training set of dated documents.
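Since matching patterns are shingles, a small sketch of the shingle notation defined in this section may be useful. It is only an illustration: the function names and the six-word toy document (a phrase taken from the sample charter) are our own choices, not part of the DEEDS software.

```python
from collections import Counter


def shingles(words, k):
    """List s_k(D): all k-shingles of the document, repeats included, in order."""
    # j runs over 0 <= j <= m - k, giving m - k + 1 shingles (none if m < k).
    return [tuple(words[j:j + k]) for j in range(len(words) - k + 1)]


def distinct_shingles(words, k):
    """Set S_k(D) of distinct k-shingles."""
    return set(shingles(words, k))


def shingle_counts(words, k):
    """Map each shingle s to n_s(D), its number of occurrences in s_k(D)."""
    return Counter(shingles(words, k))


# Toy document of m = 6 words (hypothetical input for illustration).
doc = "et heredibus suis et heredibus eorum".split()

all_2 = shingles(doc, 2)             # m - k + 1 = 5 shingles
distinct_2 = distinct_shingles(doc, 2)
counts_2 = shingle_counts(doc, 2)    # ("et", "heredibus") occurs twice
```

Here the inequality $|S_k(\mathcal{D})| \le |s_k(\mathcal{D})| = m - k + 1$ is strict because the shingle ("et", "heredibus") is repeated.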
The underlying idea is that a relatively higher concentration of matching patterns should be found among those documents in the training set whose dates are closer to the unknown date of the document whose date we are trying to estimate.

Fiallos identified three characteristics of matching patterns thought to be important for the calendaring process:

Length: The number of words in the matching pattern (shingle length).

Lifetime: The difference, in years, between the last and first occurrence of the matching pattern in the training set. (If a pattern occurs only within one year, its Lifetime = 0.)

Currency: The Lifetime of the matching pattern divided by the number of distinct years in which it occurs. (Here we are following the definition of R. Fiallos: thus higher values of currency correspond to sparser occurrence of the pattern throughout the years of its lifetime.)

To date a given document $\mathcal{D}$, every substring of consecutive words in $\mathcal{D}$ is examined. [If $\mathcal{D}$ has length $m$, there will be $m + (m - 1) + \cdots + 2 + 1 = m(m + 1)/2$ such substrings in all.] If such a substring occurs also in the training set (it becomes a “matching pattern” and) it produces an MT value defined as

\[ \mathrm{MT} = M_1(\text{Length}) \times M_2(\text{Lifetime}) \times M_3(\text{Currency}). \]

The larger its MT value, the more influential the matching pattern is considered to be for the calendaring process. Here the function $M_1$ is increasing since longer patterns are considered to be more informative; $M_2$ is decreasing since patterns with longer lifetimes are viewed as being less informative; and finally $M_3$ is also decreasing since sparser occurrence of a pattern within its lifetime is thought to reduce its evidentiary worth.
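The three pattern characteristics and the product form of the MT value can be sketched as follows. This is a minimal illustration only: the training set is a made-up list of (year, words) pairs, and the particular forms chosen for $M_1$, $M_2$ and $M_3$ are placeholders that merely respect the stated monotonicity; they are not the functions tuned by Fiallos.

```python
def pattern_features(pattern, training):
    """Length, Lifetime and Currency of a pattern (a tuple of words).

    training is a list of (year, list_of_words) pairs; returns None when
    the pattern does not occur in any training document.
    """
    k = len(pattern)
    # Distinct years in which the pattern occurs as consecutive words.
    years = {
        year
        for year, words in training
        if any(tuple(words[j:j + k]) == pattern
               for j in range(len(words) - k + 1))
    }
    if not years:
        return None
    lifetime = max(years) - min(years)      # 0 if seen in a single year
    currency = lifetime / len(years)        # larger means sparser occurrence
    return {"length": k, "lifetime": lifetime, "currency": currency}


def mt_value(features):
    """MT = M1(Length) * M2(Lifetime) * M3(Currency), with placeholder M's."""
    m1 = features["length"]                     # increasing (assumed form)
    m2 = 1.0 / (1.0 + features["lifetime"])     # decreasing (assumed form)
    m3 = 1.0 / (1.0 + features["currency"])     # decreasing (assumed form)
    return m1 * m2 * m3


# Hypothetical training data: (year, preprocessed words).
training = [
    (1200, "a b c".split()),
    (1210, "x a b".split()),
    (1230, "a b".split()),
    (1250, "c d".split()),
]

ab = pattern_features(("a", "b"), training)   # years 1200, 1210, 1230
cd = pattern_features(("c", "d"), training)   # a single year, 1250
```

In this toy example the pattern ("a", "b") has Lifetime 30 and Currency 10, while ("c", "d") occurs in one year only and so has Lifetime 0.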
The functions $M_1$, $M_2$ and $M_3$ can be defined in many ad hoc ways, and such definitions invariably entail many tuning-type parameters; such functions and their parameters were determined by Fiallos through extensive trial and error and leave-one-out cross-validation.

Once MT values have been assigned to all matching patterns in $\mathcal{D}$, an MT value is computed for every year for which training data is available by summing the MT values of all of the matching patterns of $\mathcal{D}$ that occur among the training data of that year. However, in an attempt to reduce noise, matching patterns whose MT values fall below a certain threshold are excluded from this summing process. This procedure leads to a function of time, called the MT function. To account for the fact that the number of training documents varies over time, the values of this MT function are each divided by the number of training documents in that year. These standardized values are referred to as Global MT or GMT values. In principle, the date having the highest GMT value is taken to be the estimated date of $\mathcal{D}$. However, because such GMT functions are still quite noisy, the GMT values are first averaged over time intervals of, say, 40 or 20 years, leading to an estimated time interval for the date of $\mathcal{D}$. This estimated date range is then expanded, and the GMT averaging process is then repeated over this new range but now using a smaller interval width. This process is repeated several times, leading finally to a point estimate for the unknown date.

Figure 3 (based on computations provided by Fiallos) plots the estimated versus the actual dates for 1484 DEEDS documents which were dated by the MT method. These 1484 documents were randomly selected from a set of approximately 3500 dated documents, and each of these 1484 selected documents was then dated on the basis of the full 3500 documents data set, but with the one being dated left out. The mean absolute error (MAE) was found to be 16 years. The heavy concentration of points occurring near the “x = y” axis is due to documents that have been dated rather accurately. We remark, however, that the MAE estimate of 16 years is likely to be optimistic because it was not based on a held-out test set; that is, the optimization of the many tuning parameters was performed over the same data set.

Fig. 3. Estimated versus true dates for 1484 documents, dated by the method of R. Fiallos, each selected randomly from a training set of approximately 3500 dated documents.

4. Calendaring by nearest neighbors (kNN). Distance based methods for calendaring charters (also referred to as nearest neighbor or kNN methods) were introduced in Feuerverger et al. (2005, 2008), hereafter referred to as FHTG (2005) and FHTG (2008). The underlying idea is to define measures of distance between pairs of documents and to estimate the date of an undated document by a weighted average of the dates of documents in a training set using weights which depend on their distances to the document we seek to date. Alternately, one can use a reciprocal to the concept of distance, namely, similarity (also referred to as resemblance or correspondence), and average over the dates of documents in the training set using weights based on the similarity measures. For completeness and later comparisons, we outline these methods in this section.

Measures of distance and similarity: Distance and similarity measures on documents are discussed, for example, in Djeraba (2003), FHTG (2005), McGill, Koll and Noreault (1979), Quang et al. (1999), Salton, Wang and Yang (1975), Tan, Steinbach and Kumar (2005), Zhang and Korfhagen (1999) and references therein. Let $\mathcal{P}$ and $\mathcal{Q}$ represent two documents whose union consists of $|\mathcal{P} \cup \mathcal{Q}| = \ell$ distinct words. (A discussion based on $k$-shingles would be analogous.) Let $p \equiv (p_1, \ldots, p_\ell)$ and $q \equiv (q_1, \ldots, q_\ell)$, respectively, be vectors corresponding to the occurrence of these distinct words; these vectors can variously be word counts, normalized counts ($\sum_i p_i = \sum_i q_i = 1$) or 0–1 incidence vectors. Then some natural measures of similarity between $\mathcal{P}$ and $\mathcal{Q}$ are given by

\[ \mathrm{Sim}_\gamma(\mathcal{P}, \mathcal{Q}) = \frac{\sum_{i=1}^{\ell} p_i^{\gamma} q_i^{\gamma}}{\sqrt{\sum_{i=1}^{\ell} p_i^{2\gamma}}\,\sqrt{\sum_{i=1}^{\ell} q_i^{2\gamma}}} \tag{4.1} \]

for $0 < \gamma < \infty$. The case $\gamma = 1$ corresponds to the angle-based cosine similarity, while the case $\gamma = 1/2$ with normalized $p$ and $q$ results in a similarity measure that leads to a Hellinger distance. Similarity measures somewhat alike to (4.1) may also be defined as

\[ \mathrm{Sim}_\alpha(\mathcal{P}, \mathcal{Q}) = \frac{\sum_{i=1}^{\ell} p_i^{\alpha} q_i^{\alpha}}{\sum_{i=1}^{\ell} \left( p_i^{2\alpha} + q_i^{2\alpha} - p_i^{\alpha} q_i^{\alpha} \right)} \tag{4.2} \]

for $0 < \alpha < \infty$. Unlike (4.1), these have the advantage that, for all such values of $\alpha$,

\[ \mathrm{Dist}_\alpha(\mathcal{P}, \mathcal{Q}) \equiv 1 - \mathrm{Sim}_\alpha(\mathcal{P}, \mathcal{Q}) \tag{4.3} \]

is a proper metric (i.e., satisfies the triangle inequality).

Broder (1998) defined the resemblance of two documents $\mathcal{D}_1$ and $\mathcal{D}_2$, for a given (fixed) shingle size $k$, as

\[ \mathrm{Res}_k(\mathcal{D}_1, \mathcal{D}_2) \equiv \frac{|S_k(\mathcal{D}_1) \cap S_k(\mathcal{D}_2)|}{|S_k(\mathcal{D}_1) \cup S_k(\mathcal{D}_2)|}. \tag{4.4} \]

Using this definition, a set-based resemblance distance between documents which satisfies the triangle inequality may be defined as

\[ \mathrm{Dist}_k(\mathcal{D}_1, \mathcal{D}_2) \equiv 1 - \mathrm{Res}_k(\mathcal{D}_1, \mathcal{D}_2). \]

There are, of course, many other measures of distance and similarity. We remark that for information retrieval work, many distance measures often behave similarly and that whether or not the triangle inequality holds tends to be inconsequential. [See, e.g., Djeraba (2003), Chapter 4.]
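The similarity measures (4.1), (4.2) and (4.4) can be sketched in a few lines. This is an illustration under stated assumptions rather than the implementation used for the DEEDS computations: the function names are ours, the vectors are assumed to be nonnegative and not identically zero, and documents are assumed to be given as lists of words.

```python
import math


def sim_gamma(p, q, gamma=1.0):
    """Similarity (4.1); gamma = 1 gives the cosine similarity."""
    num = sum((a ** gamma) * (b ** gamma) for a, b in zip(p, q))
    den = (math.sqrt(sum(a ** (2 * gamma) for a in p))
           * math.sqrt(sum(b ** (2 * gamma) for b in q)))
    return num / den


def sim_alpha(p, q, alpha=1.0):
    """Similarity (4.2); 1 - sim_alpha is the distance (4.3)."""
    num = sum((a ** alpha) * (b ** alpha) for a, b in zip(p, q))
    den = sum(a ** (2 * alpha) + b ** (2 * alpha) - (a ** alpha) * (b ** alpha)
              for a, b in zip(p, q))
    return num / den


def resemblance(words1, words2, k):
    """Resemblance (4.4) on sets of distinct k-shingles."""
    s1 = {tuple(words1[j:j + k]) for j in range(len(words1) - k + 1)}
    s2 = {tuple(words2[j:j + k]) for j in range(len(words2) - k + 1)}
    union = s1 | s2
    # Convention assumed here: two documents with no k-shingles get 0.
    return len(s1 & s2) / len(union) if union else 0.0


# Word-count vectors over a shared vocabulary (hypothetical values).
p = [1, 0, 2]
q = [1, 0, 2]

same = sim_gamma(p, q)                       # identical vectors
disjoint = sim_alpha([1, 0], [0, 1])         # no words in common
res = resemblance("a b c d".split(), "a b c e".split(), 2)
```

In the last line the two toy documents share two of four distinct 2-shingles, so the resemblance is 0.5 and the corresponding resemblance distance is also 0.5.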
One potential benefit, however, of having many versions of distance is in permitting the implementation of ensemble-type estimation methods. The use of multidimensional scaling as an alternative to incorporate distances based on similarities is also worth mentioning, but lies outside the scope of this paper.

Calendaring by kNN methods: To develop and evaluate distance based and other estimation methods, the DEEDS documents were first partitioned at random into a training set $\mathcal{T}$, a validation set $\mathcal{V}$ and a test set $\mathcal{A}$. We will frequently interchange notation such as $\mathcal{D}_i \in \mathcal{T}$ and $i \in \mathcal{T}$ for membership in these sets. Our aim is to estimate the unknown date $t_i$ of a document $\mathcal{D}_i$, when $i \notin \mathcal{T}$. Here we follow FHTG (2005, 2008).

Let $d_k(i, j)$, for $k = 1, 2, \ldots, r$, denote $r$ different distance measures between documents $\mathcal{D}_i$ and $\mathcal{D}_j$, say. For instance, these distances could all be Broder distances corresponding to different shingle lengths $k$, with $r$ being the largest shingle size in the procedure. Using these distances, we define an $r$-dimensional kernel weight on the dates $t_j$ of the documents $\mathcal{D}_j$ in the training set $\mathcal{T}$:

\[ a(i, j) \equiv a(i, j \mid h_1, \ldots, h_r) = \prod_{k=1}^{r} K_{h_k}(d_k(i, j)), \tag{4.5} \]

where $i$ corresponds to the document $\mathcal{D}_i$ we seek to date. Here $K_h(\cdot)$ is a nonnegative, nonincreasing function defined on the positive half-line and $h$ is a bandwidth parameter. For example, we could take $K_h(u) \propto \exp\{-(u/h)^2\}$, or $K_h(u) \propto (1 + (u/h)^2)^{-\eta}$ for some choice of $\eta$, with each distance measure permitted to have its own bandwidth. The distance based (or kNN) estimator for the date $t_i$ of $\mathcal{D}_i$ is then defined as

\[ \hat{t} \equiv \hat{t}_i \equiv \arg\min_t \sum_{j \in \mathcal{T}} (t_j - t)^2 a(i, j) = \frac{\sum_{j \in \mathcal{T}} t_j\, a(i, j)}{\sum_{j \in \mathcal{T}} a(i, j)}. \tag{4.6} \]

It remains to consider the selection of the bandwidths $h_1, \ldots, h_r$ in (4.5). In FHTG (2005, 2008) this was based on a form of cross-validation which is local in the sense that it tries to determine the set of bandwidths optimal for each document $\mathcal{D}_i$ individually. Specifically, let $\mathcal{K}(i)$ be the collection of nearest neighbors to $\mathcal{D}_i$, defined as the union, over all $1 \le k \le r$, of the set of all indices $j \in \mathcal{T}$ in the training set such that $d_k(i, j)$ is among the $m$ smallest values of that quantity, where the integer $m$ is some small fraction of the total number of documents in $\mathcal{T}$. Then $m$, as well as the $h_1, \ldots, h_r$ specific to $\mathcal{D}_i$, are chosen to minimize the cross-validation function

\[ \mathrm{CV}(m; h_1, \ldots, h_r) = \frac{1}{|\mathcal{K}(i)|} \sum_{j' \in \mathcal{K}(i)} (t_{j'} - \hat{t}_{-j'})^2, \tag{4.7} \]

where

\[ \hat{t}_{-j'} = \hat{t}_{-j'}(m; h_1, \ldots, h_r) = \arg\min_t \sum_{j \in \mathcal{T},\, j \ne j'} (t_j - t)^2 a(j', j) = \frac{\sum_{j \in \mathcal{T},\, j \ne j'} t_j\, a(j', j)}{\sum_{j \in \mathcal{T},\, j \ne j'} a(j', j)}. \]

While this bandwidth selection process is local in the sense that for each $\mathcal{D}_i$, it tries to determine a set of bandwidths by optimizing over its nearest neighbors $\mathcal{K}(i)$, if we were to choose all $\mathcal{K}(i) \equiv \mathcal{T}$ the procedure would become global with the estimated bandwidths then being the same for all of the documents. The optimization over $m$ and $h_1, \ldots, h_r$ is carried out via a grid search resulting in

\[ (\hat{m}; \hat{h}_1, \ldots, \hat{h}_r) = \arg\min \mathrm{CV}(m; h_1, \ldots, h_r). \]

The mean squared error of the date estimate $\hat{t}_i$ can then be estimated as

\[ \hat{s}^2(i) = \frac{\sum_{j' \in \mathcal{K}(i)} (t_{j'} - \hat{t}_{-j'})^2\, a(i, j' \mid \hat{h}_1, \ldots, \hat{h}_r)}{\sum_{j' \in \mathcal{K}(i)} a(i, j' \mid \hat{h}_1, \ldots, \hat{h}_r)}, \]

where the $\hat{t}_{-j'}$, for all $j' \in \mathcal{K}(i)$, are computed using the same bandwidths as for $\hat{t}_i$.

5. Calendaring by maximum prevalence (MP).
Our method of maximum prevalence (MP) for calendaring a document $\mathcal{D}$ is an analogue of the method of maximum likelihood; it attempts to assign, for each point $t$ in time, a probability for the occurrence of $\mathcal{D}$ at that time, and it estimates the unknown date of $\mathcal{D}$ by that value of $t$ at which $\mathcal{D}$ has the highest probability of occurrence. The MP method is specific to a given shingle size, say, $k$, but the ensemble of estimates produced using different values of $k$ can subsequently be combined.

If now $\mathcal{D}$ consists of a string of $N$ words, it will contain $|s_k(\mathcal{D})| = N - k + 1$ (not necessarily unique) $k$-shingles. We will let $N(\mathcal{D}) \equiv |s_k(\mathcal{D})|$ represent the number of elements in $s_k(\mathcal{D})$, suppressing its dependence on $k$. The assumption is then made that these $N(\mathcal{D})$ shingles occur independently of each other and are drawn from the multivariate distribution over shingles of size $k$ in effect at the true date $t(\mathcal{D})$ of the document. Although this assumption (made here of necessity) is untrue, there are some arguments in its favor. In particular, in some statistical problems, estimators can remain consistent (and even asymptotically efficient) when dependency is ignored. Examples include incorrectly assuming independence when estimating the mean of certain stationary processes. In such cases, it is primarily the variances of the estimates that are affected. Additional arguments are given in Domingos and Pazzani (1996).

Suppose then that for every possible $k$-shingle $s$, we knew the probability $\pi_s(t)$ of its occurrence at every time point $t$. Then the prevalence function for $\mathcal{D}$ is defined as

$$\pi_{\mathcal{D}}(t) = \prod_{s \in s_k(\mathcal{D})} \pi_s(t), \tag{5.1}$$

and by analogy with maximum likelihood, the true date $t(\mathcal{D})$ of $\mathcal{D}$ would be estimated as that value of $t$ at which $\pi_{\mathcal{D}}(t)$ is maximized.
The function $\pi_{\mathcal{D}}(t)$ is intended to represent the probability of the occurrence of $\mathcal{D}$ as a function over time. Of course, we do not know the $\pi_s(t)$, but these may be estimated, as $\hat{\pi}_s(t)$, say, leading to an estimated prevalence function

$$\hat{\pi}_{\mathcal{D}}(t) = \prod_{s \in s_k(\mathcal{D})} \hat{\pi}_s(t), \tag{5.2}$$

and finally to our proposed date estimator

$$\hat{t}_{\mathcal{D}} = \arg\max_t \hat{\pi}_{\mathcal{D}}(t).$$

We must now consider how to estimate the probabilities $\pi_s(t)$ of shingle occurrence. Given a document $\mathcal{D}$ and a $k$-shingle $s$, the number of times $s$ occurs in $\mathcal{D}$ will be denoted by $n_s(\mathcal{D})$. For $n_s(\mathcal{D})$ we postulate the binomial model

$$\mathcal{L}(n_s(\mathcal{D}) \mid N(\mathcal{D}) = N,\ t(\mathcal{D}) = t) \sim \mathrm{Bin}(N, \pi_s(t)),$$

according to which the probability of the observed value $n_s(\mathcal{D})$ is

$$\binom{N(\mathcal{D})}{n_s(\mathcal{D})} \{\pi_s(t)\}^{n_s(\mathcal{D})} \{1 - \pi_s(t)\}^{N(\mathcal{D}) - n_s(\mathcal{D})};$$

here $t(\mathcal{D}) = t$ is the date of $\mathcal{D}$ and $N(\mathcal{D}) = N$ is the number of $k$-shingles it contains. In terms of the canonical log-odds parameter

$$\lambda_s(t) \equiv \log \frac{\pi_s(t)}{1 - \pi_s(t)},$$

the logarithm of this probability is

$$\log \binom{N(\mathcal{D})}{n_s(\mathcal{D})} + n_s(\mathcal{D})\, \lambda_s(t) - N(\mathcal{D}) \log[1 + \exp\{\lambda_s(t)\}].$$

Because the first (combinatorial) term here does not depend on $\lambda_s(t)$, we drop it from subsequent expressions. Hence, given a random sample of documents $\mathcal{D}_i \in \mathcal{T}$, with corresponding dates $t_i$, the log-likelihood function in the parameter $\lambda_s(\cdot)$ is taken to be

$$\sum_{i \in \mathcal{T}} \{ n_s(\mathcal{D}_i)\, \lambda_s(t_i) - N(\mathcal{D}_i) \log[1 + \exp\{\lambda_s(t_i)\}] \}.$$

We next model the function parameter $\lambda_s(\cdot)$ as a $t$-local polynomial of degree $p$; specifically, for $u$ near $t$,

$$\lambda_s(u) \approx \beta_0 + \beta_1 (u - t) + \cdots + \beta_p (u - t)^p. \tag{5.3}$$

Here the dependence of $\lambda_s(\cdot)$ as well as of the $\beta_0, \ldots, \beta_p$ on $t$ has been suppressed. [See, e.g., Loader (1999).]
Finally, we introduce a $t$-localized version of the log-likelihood, namely,

$$\sum_{i \in \mathcal{T}} \{ n_s(\mathcal{D}_i)\, \lambda_s(t_i) - N(\mathcal{D}_i) \log[1 + \exp\{\lambda_s(t_i)\}] \}\, K_h(t_i - t), \tag{5.4}$$

in which $\lambda_s(t_i)$ is replaced by the local polynomial (5.3) evaluated at $u = t_i$, and which is to be maximized over the $\beta_0, \ldots, \beta_p$ for every given $t$. The resulting estimate $\hat{\beta}_0$ for $\beta_0$ is then taken as our estimate for $\lambda_s(t)$. Here $K_h(u)$ is a symmetric weight function which takes on its maximum at $u = 0$, and is nonincreasing as $u$ moves away from the origin. A Gaussian version might be $K_h(u) \propto \exp\{-u^2 / 2h^2\}$, with $h$ corresponding to its standard deviation. More flexibly, we could write $K_{h,\nu}(u)$ in place of $K_h(u)$, with

$$K_{h,\nu}(u) \propto \left(1 + \frac{u^2}{\nu h^2}\right)^{-(\nu+1)/2}$$

corresponding to a $t$-distribution with $\nu$ degrees of freedom; this allows for a tail-weight parameter in addition to a scaling.

If we take the polynomial (5.3) to be of degree $p = 0$, so that $\lambda_s(u) = \beta_0$ there, and then set the derivative with respect to $\beta_0$ in (5.4) to zero, we obtain (in terms of $\pi_s$) the solution

$$\hat{\pi}_s(t) = \frac{\sum_{i \in \mathcal{T}} n_s(\mathcal{D}_i)\, K_h(t_i - t)}{\sum_{i \in \mathcal{T}} N(\mathcal{D}_i)\, K_h(t_i - t)}, \tag{5.5}$$

which is analogous to the estimator of Nadaraya (1964) and Watson (1964). If instead we use a polynomial of degree $p = 1$ in (5.3) (locally linear smoothing) and set derivatives with respect to $\beta_0$ and $\beta_1$ in (5.4) to zero, we obtain the pair of equations

$$\sum_{i \in \mathcal{T}} n_s(\mathcal{D}_i)\, K_h(t_i - t) = \sum_{i \in \mathcal{T}} N(\mathcal{D}_i)\, \frac{\exp\{\beta_0 + \beta_1 (t_i - t)\}}{1 + \exp\{\beta_0 + \beta_1 (t_i - t)\}}\, K_h(t_i - t) \tag{5.6}$$

and

$$\sum_{i \in \mathcal{T}} n_s(\mathcal{D}_i)\, (t_i - t)\, K_h(t_i - t) = \sum_{i \in \mathcal{T}} N(\mathcal{D}_i)\, \frac{\exp\{\beta_0 + \beta_1 (t_i - t)\}}{1 + \exp\{\beta_0 + \beta_1 (t_i - t)\}}\, (t_i - t)\, K_h(t_i - t). \tag{5.7}$$

These must be solved numerically for $\beta_0$ and $\beta_1$ at every $t$, giving $\hat{\beta}_0$ and $\hat{\beta}_1$, and we would then take

$$\hat{\pi}_s(t) = \frac{\exp(\hat{\beta}_0)}{1 + \exp(\hat{\beta}_0)}.$$
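As a rough illustration of the locally constant case, the following Python sketch computes the estimate (5.5) with a $t$-distribution kernel and then maximizes the resulting log-prevalence, the logarithm of (5.2), over a grid of candidate dates. It is a minimal sketch under stated assumptions, not the implementation used with the DEEDS data: the function names, the representation of documents as word lists, the grid of candidate dates and, in particular, the small `floor` substituted when a shingle has zero estimated probability are all illustrative choices that the text does not specify.

```python
import math
from collections import Counter


def t_kernel(u, h, nu):
    """Unnormalized t-distribution kernel K_{h,nu}(u)."""
    return (1.0 + (u / h) ** 2 / nu) ** (-(nu + 1) / 2.0)


def shingle_counts(words, k):
    """Counts n_s(D) of each k-shingle s in a list of words."""
    return Counter(tuple(words[i:i + k]) for i in range(len(words) - k + 1))


def mp_date(doc_words, training, candidate_dates, k=2, h=12.0, nu=3.0,
            floor=1e-9):
    """Maximum prevalence date estimate using the locally constant fit (5.5).

    training        : list of (words, date) pairs with known dates
    candidate_dates : grid of dates t over which the prevalence is maximized
    floor           : ASSUMPTION, a lower bound on estimated probabilities so
                      that log(0) is avoided for shingles unseen near t
    """
    train = [(shingle_counts(w, k), max(len(w) - k + 1, 0), t)
             for w, t in training]
    target = shingle_counts(doc_words, k)

    best_t, best_loglik = None, -math.inf
    for t in candidate_dates:
        weights = [t_kernel(t_i - t, h, nu) for _, _, t_i in train]
        # Denominator of (5.5): kernel-weighted total number of shingles.
        den = sum(n * w for (_, n, _), w in zip(train, weights))
        loglik = 0.0
        for s, multiplicity in target.items():
            # Numerator of (5.5): kernel-weighted occurrences of shingle s.
            num = sum(c[s] * w for (c, _, _), w in zip(train, weights))
            pi_hat = max(num / den, floor) if den > 0 else floor
            # Repeated shingles contribute once per occurrence, as in (5.1).
            loglik += multiplicity * math.log(pi_hat)
        if loglik > best_loglik:
            best_t, best_loglik = t, loglik
    return best_t
```

Working on the log scale avoids numerical underflow in the product (5.2). The grid search mirrors the fact that the estimator is defined as a maximizer over $t$, not as a closed-form expression.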
We remark that we could alternatively have modeled the data using a Poisson distribution as in

$$n_s(\mathcal{D}) \sim \mathrm{Poisson}(\mu_s(t)\, N(\mathcal{D}))$$

and carried out local polynomial fitting using the canonical link parameter $\lambda_s(t) = \log \mu_s(t)$. [Here we have used $\mu_s(t)$ in place of $\pi_s(t)$ for the shingle's probabilities.] If the local polynomial is taken to be of degree 0, this leads again to the Nadaraya–Watson type solution (5.5), with $\hat{\mu}_s(t) = \hat{\pi}_s(t)$. For local polynomials of degree greater than 0, the solutions are approximately, but not exactly, equivalent to the binomial case. Note that due to their exponential family nature, the Hessians associated with these models are strictly negative definite; hence, these various equations are well-behaved and have unique solutions.

As a final remark, we mention that one may consider replacing the definition of the prevalence function in (5.1) by something like

$$\pi_{\mathcal{D}}(t) = \prod_{s \in s_k(\mathcal{D})} \pi_s(t) \prod_{s \notin s_k(\mathcal{D})} [1 - \pi_s(t)] \tag{5.8}$$

with a corresponding change in its empirical version (5.2), so as to try to take better account of shingles that did not occur in the document being dated. However, the logarithm of the second factor in (5.8) is

$$\sum_{s \notin s_k(\mathcal{D})} \log\{1 - \pi_s(t)\} \approx -\sum_{s \notin s_k(\mathcal{D})} \pi_s(t) \approx -\sum_{s} \pi_s(t) = -1, \tag{5.9}$$

since each $\pi_s(t)$ is small, and because the total number of possible shingles far exceeds those in any given document. We computed empirical versions of the logarithm of the second factor in (5.8) and invariably found that such curves stayed close to $-1$, and were therefore not informative.

6. Calendaring via quantile regression (QR). A third proposal for the calendaring problem is based on quantile regression as follows. Suppose that $\mathcal{D}$ is a document whose date we wish to estimate.
A scatterplot is produced of the distances $\mathrm{Dist}(\mathcal{D}, \mathcal{D}_i)$ from $\mathcal{D}$ to each of the documents $\mathcal{D}_i \in \mathcal{T}$ in a training set, against the known dates $t(\mathcal{D}_i)$ of those training set documents. A nonparametric quantile regression (QR) curve is then fit to this scatterplot, and the date at which this QR curve attains its minimum value is taken as the estimate of the date of $\mathcal{D}$. QR algorithms typically have two parameters: a bandwidth $h$ which controls the smoothness of the curve and a quantile $0 < q < 1$. (The bandwidth parameter need not be kept constant over the range of dates and may be larger in regions where dated documents are sparser.) The parameters $h$ and $q$ are meant to be optimized for documents in a validation set which are dated using data in a training set. The procedure is then assessed on the documents in a held-out test set. Figure 8 in Section 8 below illustrates the QR procedure in action. For quantile regression, our key reference is Koenker (2005).

7. Theoretical considerations. In this section we discuss some general considerations concerning the consistency of the estimates proposed in Sections 4 and 5.

Turning first to the distance-based (kNN) method, we have the following result: Let $\mathcal{D}_0$ be an undated document written at time $t_0$, and denote by $\mathcal{D}$ a dated document, written at time $T$, and chosen at random from a potentially infinite (but representative) training set and having a random distance $\Delta$ from $\mathcal{D}_0$. (For simplicity, we assume that our kNN procedure is based on only a single distance measure, but the general case is similar.) We posit five conditions:

(i) Asymptotic unbiasedness: The conditional expectation of the mean of $T$ converges to $t_0$ over neighborhoods $\Delta \to 0$.
(ii) Bounded variance: The second moment of $T$ remains bounded as these distance neighborhoods shrink to 0.
(iii) A technical condition: $\Delta$ can be viewed as possessing a density at the origin which is continuous and positive.
(iv) The kernel $K(u)$ is bounded, continuous, compactly supported and nonincreasing on the positive real line, with $K(0) > 0$.
(v) The number of elements in the training set increases sufficiently quickly as the bandwidth $h$ tends to 0.

Under the conditions (i)–(v), it was proved in FHTG (2008) that the estimator $\hat{t}$ defined at (4.6) is a consistent estimator of the true date $t_0$ of the document $\mathcal{D}_0$, that is, $\hat{t} \to_p t_0$ as the size of the training set tends to infinity.

Turning to the MP method, consistency results may be established along the following lines. Assume time to be integer valued and restricted to a compact domain: $t_{\min} \le t \le t_{\max}$. We again let $\mathcal{D}_0$ denote the document to be dated, and $t_0$ its unknown true date. We consider our training set to be an increasing ($n \to \infty$) sequence of documents $\mathcal{T}_n \equiv \{\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_n\}$ in which the random documents $\mathcal{D}_i$, and their corresponding random dates $T_i$, are viewed as being an i.i.d. sample from an essentially infinite population generically represented by the random object $(\mathcal{D}, T)$. The set of all shingles possible at any point within our time interval will be denoted by $\mathcal{S}$. (The shingle size is considered fixed.) Every shingle $s \in \mathcal{S}$ has associated with it its probabilities $\pi_s(t)$ of being drawn at any of the time points $t$. Note that for each $t$ we will have $\sum_{s \in \mathcal{S}} \pi_s(t) = 1$. We next assume that if $t_1 \ne t_2$, then the collections $\{\pi_s(t_1);\ s \in \mathcal{S}\}$ and $\{\pi_s(t_2);\ s \in \mathcal{S}\}$ are not identical; specifically, $\pi_s(t_1) \ne \pi_s(t_2)$ for some $s \in \mathcal{S}$. In the random object $(\mathcal{D}, T)$, we will assume that, conditionally on $T = t$, the sequence of shingles comprising $\mathcal{D}$ is an i.i.d.
sample drawn from $\mathcal{S}$ under the probability distribution $\{\pi_s(t) : s \in \mathcal{S}\}$. In particular, the shingles of $\mathcal{D}_0$ are assumed to be randomly drawn from $\mathcal{S}$ using the distribution $\pi_s(t_0)$. In $(\mathcal{D}, T)$ the length of $\mathcal{D}$ is assumed to be independent of $T$.

Now, for each $s \in \mathcal{S}$, under standard conditions for the Nadaraya–Watson estimator, we will have

$$\sup_{t_{\min} \le t \le t_{\max}} |\hat{\pi}_s(t) - \pi_s(t)| \to 0 \quad \text{as } n \to \infty, \tag{7.1}$$

so that for a $\mathcal{D}_0$ of fixed, finite length, we will have

$$\sup_{t_{\min} \le t \le t_{\max}} |\hat{\pi}_{\mathcal{D}_0}(t) - \pi_{\mathcal{D}_0}(t)| \to 0 \quad \text{as } n \to \infty. \tag{7.2}$$

On the other hand, the standard argument (based on the Law of Large Numbers and Kullback–Leibler distance) which is used to prove consistency of the MLE in the case when the parameter space is finite applies equally here and allows us to conclude that, with arbitrarily high probability, $\pi_{\mathcal{D}_0}(t)$ will take on its maximum value uniquely at $t_0$, provided only that $|\mathcal{D}_0|$ is sufficiently large. Hence, by requiring $|\mathcal{D}_0|$ to be sufficiently large, and then letting $n \to \infty$, the estimated date $\hat{t}$ of $\mathcal{D}_0$ can be made to equal $t_0$ with arbitrarily high probability.

Of course, asymptotic results do need to be assessed for relevance in any specific application. In particular, it must be borne in mind that any document to be dated will be of finite length and so will necessarily contain only limited "Fisher information" for the estimation of its date parameter.

8. Numerical work. In this section we describe some numerical experiments which we conducted using the kNN and MP estimation methods with the DEEDS data set. This work was carried out using a combination of UNIX commands together with the C programming language, as well as the R statistical computing package.
For the purposes of our experiments, we first randomly partitioned the 3353 DEEDS documents which were available to us into a training set $\mathcal{T}$, a validation set $\mathcal{V}$ and a test set $\mathcal{A}$, with these sets having cardinalities $|\mathcal{T}| = 2608$, $|\mathcal{V}| = 419$ and $|\mathcal{A}| = 326$. Unlike the MP method, however, our experiments with the kNN method as described in Section 4 did not require a validation set, because in that method the parameters for dating any given document are determined solely from its neighbors within the training set, as well as from other members of the training set. Therefore, for our kNN numerical work, $\mathcal{V}$ and $\mathcal{A}$ were combined to form a larger test set consisting of 745 documents.

Our experiments with the kNN method were based on shingle sizes 1, 2 and 3, as well as on all combinations of these sizes. We used the distance (4.3) with $\alpha = 1$, based on the similarity (4.2), which is therefore a proper metric; this distance was computed using argument vectors "$p$" and "$q$" consisting of raw (i.e., unnormalized) shingle counts. These distances are denoted as $d_k(i, j)$, with $k$ representing the shingle sizes on which they are based.
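To make the distance computation concrete, here is a minimal Python sketch of (4.3) evaluated on raw $k$-shingle counts, together with the set-based resemblance distance derived from (4.4). The function names and the whitespace tokenization are assumptions made for illustration only; the preprocessing actually applied to the DEEDS texts is not described in this section.

```python
from collections import Counter


def shingles(text, k):
    """List of the (not necessarily unique) k-shingles of a text."""
    words = text.split()  # ASSUMPTION: simple whitespace tokenization
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]


def dist_alpha(doc_a, doc_b, k=1, alpha=1.0):
    """Distance (4.3), i.e. 1 - Sim_alpha of (4.2), on raw shingle counts."""
    p = Counter(shingles(doc_a, k))
    q = Counter(shingles(doc_b, k))
    keys = set(p) | set(q)  # union of distinct shingles
    num = sum((p[s] ** alpha) * (q[s] ** alpha) for s in keys)
    den = sum(p[s] ** (2 * alpha) + q[s] ** (2 * alpha) for s in keys) - num
    # Two empty documents are treated as identical (a convention chosen here).
    return 1.0 - num / den if den > 0 else 0.0


def resemblance_distance(doc_a, doc_b, k=1):
    """Set-based distance 1 - Res_k built from the resemblance (4.4)."""
    a, b = set(shingles(doc_a, k)), set(shingles(doc_b, k))
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0
```

With 0–1 incidence vectors and $\alpha = 1$, the similarity (4.2) reduces to the resemblance (4.4), so the two functions agree whenever no shingle is repeated within either document.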
Table 2. Performance of the kNN and MP methods on the DEEDS data set. For the MP rows, each performance measure is given as (validation, test); for the kNN rows, a single value is given.

| Dating method | Shingle lengths | Optimal parameters | √MSE (val., test) | MAE (val., test) | MedAE (val., test) |
|---|---|---|---|---|---|
| M1 | 1 | h = 8, df = 5 | 18.3, 19.8 | 11.7, 12.5 | 7.0, 8.0 |
| M2 | 2 | h = 12, df = 3 | 14.8, 14.7 | 9.5, 9.0 | 6.0, 6.0 |
| M3 | 3 | h = 12, df = 5 | 17.0, 15.4 | 10.1, 9.5 | 6.0, 6.0 |
| M4 | 4 | h = 16, df = 12 | 18.8, 22.8 | 11.5, 12.4 | 7.0, 7.0 |
| M1234 | 1–4 | - | 14.3, 14.5 | 9.3, 9.2 | 6.0, 6.0 |
| kNN1 | 1 | m = 1000 | 20.1 | 12.3 | 6.4 |
| kNN2 | 2 | m = 500 | 23.7 | 13.8 | 6.4 |
| kNN3 | 3 | m = 500 | 28.3 | 16.6 | 7.6 |
| kNN12 | 1 & 2 | m = 100 | 20.2 | 12.1 | 6.3 |
| kNN13 | 1 & 3 | m = 100 | 21.7 | 12.9 | 7.0 |
| kNN23 | 2 & 3 | m = 100 | 25.5 | 14.9 | 6.8 |
| kNN123 | 1 & 2 & 3 | m = 10 | 25.4 | 15.0 | 7.9 |

For a given document $\mathcal{D}_i$ in our test set of 745 documents, all 2608 of its distances to the documents in the training set were computed for each of the three shingle sizes. The set $\mathcal{K}(i)$ of neighbors to $\mathcal{D}_i$ was formed by taking all indices $j \in \mathcal{T}$ such that $d_k(i, j)$ is among the $m$ smallest values of that distance. When multiple shingle sizes were used, the set of neighbors $\mathcal{K}(i)$ was taken to be the union over the $m$ smallest distances for each of the shingle sizes used. As the kNN procedure was not very sensitive to the exact choice of $m$, the values of $m$ we experimented with were 5, 10, 20, 100, 500 and 1000. (The smaller $m$ values, of course, result in faster computation times.) The optimal bandwidths for use with $\mathcal{D}_i$ were then determined (entirely from within the training set) using the procedure defined at (4.7) together with a standard Gaussian kernel at (4.5). For each $m$ and $\mathcal{D}_i$, these bandwidths were determined by searching over a one-, two- or three-dimensional grid, depending on the number of shingle lengths used in the procedure; the optimal bandwidths so resulting were therefore different for each $\mathcal{D}_i$.
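Once the distances and bandwidths are in hand, the estimator (4.6) itself is a short computation. The Python sketch below takes the bandwidths as given and forms the product-kernel weights (4.5) with the Gaussian-type kernel $K_h(u) \propto \exp\{-(u/h)^2\}$; it is an illustration under stated assumptions, in which the function names and the input layout are hypothetical and the local cross-validation (4.7) used to select the bandwidths is not reproduced.

```python
import math


def gaussian_kernel(u, h):
    """Kernel K_h(u) proportional to exp{-(u/h)^2}, as suggested after (4.5)."""
    return math.exp(-(u / h) ** 2)


def knn_date_estimate(distances, dates, bandwidths):
    """Kernel-weighted date estimate (4.6).

    distances  : distances[j][k] is the k-th distance measure between the
                 document being dated and training document j
    dates      : dates[j] is the known date t_j of training document j
    bandwidths : one bandwidth h_k per distance measure (assumed given)
    """
    num = den = 0.0
    for d_j, t_j in zip(distances, dates):
        weight = 1.0
        for d, h in zip(d_j, bandwidths):
            weight *= gaussian_kernel(d, h)  # product over measures, as in (4.5)
        num += t_j * weight
        den += weight
    if den == 0.0:
        raise ValueError("all kernel weights are zero; bandwidths too small")
    return num / den
```

Because the kernel's normalizing constant cancels in the ratio, only its shape matters here; restricting the sums to a neighbor set such as $\mathcal{K}(i)$ amounts to passing in only those training documents.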
Finally, we computed the RMSE (root mean squared error), MAE (mean absolute error) and MedAE (median absolute error) performance measures for the resulting date estimates of the 745 documents in our (enlarged) test set. The results of these computations are summarized in the last seven rows of Table 2, labeled kNN1 (based on shingle size 1) to kNN123 (based on using shingle sizes 1, 2 and 3 simultaneously). Among these, the combination kNN12 is seen to be best, although kNN1 performed similarly in terms of MAE, while kNN1 and kNN2 performed similarly in terms of MedAE. The optimal choices for $m$ are also shown in the table. The apparent deterioration of performance for kNN123 appears related to the fact that $m$ was held equal for all three shingle sizes. The relatively large values of RMSE occur because a small number of documents could not be dated at all accurately. By way of comparison, the mean year for the training documents was approximately 1246, and if this value were used to estimate the dates of the documents in the test set, the RMSE would be 47, the MAE would be 37, and the MedAE would be 25.

Figure 4, as an example, shows the estimated versus the (presumed) true dates for the 745 documents in the test set for the kNN12 procedure based on $m = 100$. This figure evidences some degree of edge bias, with early documents having overestimated dates and later ones having somewhat underestimated dates. This bias is due to the one-sided nature of nearest neighbors at the edges.

Fig. 4. Estimated versus true dates for the 745 documents in the test set, using the kNN method with $m = 100$ and combining shingle lengths 1 and 2. The solid line is "$y = x$."

Our experiments with the maximum prevalence (MP) methods required all three of the sets $\mathcal{T}$, $\mathcal{V}$ and $\mathcal{A}$.
To save computational labor, we implemented only the locally constant (i.e., Nadaraya–Watson type) version (5.5) for estimating the shingle probability functions; we used the $t$-distribution kernel $K(x) = (1 + x^2/\nu)^{-(\nu+1)/2}$. For each of the shingle sizes 1, 2, 3 and 4, optimal values of the bandwidth $h$ and degrees of freedom parameter $\nu$ were determined by optimizing the date estimates for the documents in the validation set using the training data. Finally, the performance measures were computed on both the validation and the test set using the parameters that were determined on the validation set. These results are shown on the rows labeled M1, M2, M3 and M4 of Table 2. For each of these methods, the optimized parameter values are shown, and the RMSE, MAE and MedAE performance measures are given for both the validation and the test set data. The best performing of these methods was that based on shingle size 2 (i.e., method M2), with a median absolute error of 6.0. The shingle size 2 is, in some sense, the best compromise (for a data set of this size) between having the deeper information content inherent in longer shingles and having enough of them. The RMSE and MAE figures are again inflated due to the presence of a small number of documents that could not be dated accurately.

Fig. 5. Estimated probability function $\hat{\pi}_s(t)$ for the shingle testimonium huic based on degrees of freedom $\nu = 3$ and bandwidth $h = 12$. The points are the relative frequencies for this shingle at each date.

Figures 5, 6 and 7 exemplify the main components of the MP procedure. Figure 5 shows an estimated probability function $\hat{\pi}_s(t)$ for the 2-shingle testimonium huic ("in witness to which") based on a $t$-distribution kernel with bandwidth $h = 12$ and degrees of freedom $\nu = 3$.
The points on this graph are the occurrence proportions for this shingle over time, and the concentration of points at the bottom of the graph corresponds to years in which this shingle did not occur. Figure 6 is a plot of the logarithm of a typical prevalence curve $\hat{\pi}_{\mathcal{D}}(t)$, based on shingle size $k = 2$, using four different bandwidths, and a document $\mathcal{D}$ in the test set (consisting of 87 words) whose true date is 1299. The MP estimate for this document is 1307; we note that (as was typically the case) the resulting date estimate is not unduly sensitive to the exact bandwidth chosen. Figure 7 is a plot of the estimated versus the true dates for the 326 documents in the test set using the M2 method. Such edge bias as occurs could likely be reduced by using the more computationally intensive locally linear smoothing as in equations (5.6) and (5.7).

Fig. 6. Example of a prevalence function $\hat{\pi}_{\mathcal{D}}(t)$, at four different bandwidths, using a $t_3$ distribution kernel. The true date for this document is 1299; its maximum prevalence estimate is 1307.

Fig. 7. Estimated versus actual dates for the 326 documents in the test set $\mathcal{A}$, using the maximum prevalence method with shingle length 2. The solid line is "$y = x$."

Fig. 8. Quantile regressions (QR) for (lower) quantiles $q = 0.1$ and $q = 0.05$, using bandwidths $h = 30$ (solid lines) and $h = 10$ (dashed lines). The points are distances from the document being dated to documents in the test set, plotted against the true dates of the test documents. The vertical line is at the true date, 1261.

We also attempted to combine the methods M1–M4 using a weighted average determined by minimizing MSE (mean squared error) over the validation set (subject to a constraint that the weights sum to 1).
The weights for the resulting method, labeled M1234 in Table 2, were found to be 0.14, 0.64, 0.12 and 0.10. The results for this method were not much better than for M2 alone.

Our experiments with the QR method were less successful than for the kNN and MP methods. While the QR method did generally provide meaningful estimates, error variation was higher than for kNN or MP, particularly for documents whose dates were in the upper or lower date ranges where test data was relatively sparse. Figure 8 provides an illustration of the QR method using a document $\mathcal{D}$ consisting of 336 words whose true date is 1261 and a test set of 2608 documents $\mathcal{D}_i$. In this plot of the distances $\mathrm{Dist}(\mathcal{D}, \mathcal{D}_i)$ versus the dates $t(\mathcal{D}_i)$, four quantile regression curves are drawn. The two solid lines correspond to bandwidth $h = 30$, and the (lower) quantiles $q = 0.1$ and $q = 0.05$, and lead to date estimates of 1256 and 1252, respectively; the two dashed lines correspond to bandwidth $h = 10$, and (lower) quantiles $q = 0.1$ and $q = 0.05$, and give date estimates 1240 and 1241. Note that this plot is truncated at the far right, where the number of training documents is too small to permit estimation of the quantile curves at all reliably.

In a final series of experiments, we attempted to combine the results of the kNN and MP methods. For example, linearly combining M2 and kNN12 over the validation set using an RMSE criterion, the optimal weights were found to be 0.83 and 0.17, and the RMSE over the test set dropped slightly to 13.5 years. The other performance measures, however, were not significantly changed.

9. Discussion. The problem which motivated this work leads to interesting technical questions and novel techniques, linking statistical methods to work associated with information retrieval.
Automated (i.e., computerized) calendaring and temporal sequencing of text-based documents are known to be difficult problems. In the case of the DEEDS charters, however, two features allow for progress to be made. First, we have available a large (and increasing) training set of documents whose dates are accurately known. And second, the documents in question all have relatively similar formulaic structure.

We remark that the methods we have described can be applied to any collection of documents and have potential applications broader than the one which motivated this study. For instance, as indicated in FHTG (2005), when suitable training data is available kNN-based methods can be adapted to detect other types of missing attributes, such as authorship, potentially providing a methodology complementary to that of Mosteller and Wallace (1963). Another potential application is in the detection of forgeries, a problem related to that of establishing chronology in that a common purpose of forgery is to alter past intent. It is known that the number of forged English medieval charters is not small. One difficulty of this task, however, is the fact that multiple and legitimate rewritings of documents have been made by scribes who may have modernized or slightly altered the language of the documents being transcribed. We also hope that the methods proposed here may help determine more precise chronologies in other contexts as well.

Of the methods investigated, we found that the MP method performed best. This appears to be due to its more detailed sensitivity to the behavior of individual shingles over time. For example, the MP method was more effective in discounting very commonly occurring shingles, since their occurrence probabilities were relatively more constant over time.
In our numerical work, we also encountered two somewhat surprising results. The first is that of the shingle sizes we worked with, shingles of size 1 resulted in estimates not unduly far from the best results; shingles of size 2 were better, but not by a large margin. The second is that (to within the scale of our experiments) combining multiple shingle sizes and combining methods did not lead to striking improvements. Taken together, these observations appear to suggest that, for determining chronology, "single words suffice."

We are, however, not convinced that this observation will be sustained by further work. As the size of the DEEDS data set grows and as our computing resources increase, it will become possible to carry out estimation using larger training sets, using additional methods of estimation and using more distances. The situation is analogous to that encountered in the collaborative filtering problem of the Netflix contest where a blend of no fewer than 800 methods and variations was needed by the winning team. [See, e.g., Feuerverger, He and Khatri (2012).] Thus, with more data, we expect further progress to be possible via ensemble-type methods and by blending methods differently across strata of the data; see, for example, Hastie, Tibshirani and Friedman (2009), Chapter 16. Further, with additional data, it will become feasible to carry out optimization by referring undated documents to other documents of their specific type only (i.e., grant, lease, agreement, etc.), and thus to tune the estimation procedures according to document type.
While further accuracy thus surely seems possible, there must also be some practical limit to what can be achieved via purely automated means, particularly because any document to be dated is of finite length, and therefore carries only a limited amount of "information" regardless of the amount of training data available. While accuracies so far attained suffice to make a material difference to historians studying that era, the ultimate goal of the DEEDS project is to try to attain an accuracy of about ±3 years of error 95% of the time.

We also expect that further progress could be made on the definition of distances between documents. One observation we offer is that such distances should not be regarded as absolute, but rather as relative to a particular collection of documents. In this regard, the Multiplicador Total method of R. Fiallos seems particularly suggestive. A highly effective distance between pairs of documents should take into account all matching patterns between them, as well as the lengths, lifetimes, currencies and other relevant features that these matching patterns possess within the context of the whole document collection. Related to this is the degree of informativeness of shingles. For example, Luhn (1958) suggests that shingles which occur neither too frequently nor too rarely will tend to be the most informative. As we had mentioned, our MP method does tend to discount the very frequently occurring shingles, but it does not discount the very rare ones.

The history of the DEEDS project is not yet fully written and there is no doubt other techniques for the calendaring problem will be explored. For instance, in ongoing work, we are exploring ways in which collections of documents can be correctly sequenced in time (to within time-reversal), without regard to any of the dates associated with them.
We are also exploring ways in which methods such as neural networks and support vector machines might be applied to such calendaring problems.
Remarkably, during the time this work was being carried out, a medieval English charter was discovered in a forgotten drawer of a library at Brock University (near Niagara Falls), a discovery which resulted in a certain amount of local media fanfare. This document records a land grant from a certain Robert of Clopton to his son William. Attempts by historians using paleography (analysis of handwriting), content and other means initially attributed this document to the 14th century, and subsequently to the 13th century. More careful work by Robin Sutherland-Harris (a Ph.D. student of Medieval Studies at the University of Toronto), based on the Patent Rolls (administrative orders of the king) and the eyre records (records of the itinerant courts), suggests a date range of 1235–1245, and perhaps, more precisely, 1238–1242. These estimates are believed to be reliable; a comparison document, believed to belong to the same time period, was also found and was dated 1239. We dated this charter via maximum prevalence (the most reliable among the methods we have discussed) using our training set of 2608 documents; the date estimate we obtained was 1246.
Acknowledgments. It is a pleasure to acknowledge the generous assistance and substantial contributions to this project by Rodolfo Fiallos. Thanks also to Robin Sutherland-Harris for assistance with the Brock document. We also wish to thank David Andrews, Michael Evans, Ben Kedem, Peter Hall, Keith Knight, Radford Neil and Nancy Reid for their interest in this project and for the benefit of many valuable discussions.
We also thank the referees for their careful reading and for suggestions which have helped to improve the paper.
REFERENCES
Berry, M. W. and Browne, M. (2005). Understanding Search Engines: Mathematical Modeling and Text Retrieval, 2nd ed. SIAM, Philadelphia.
Broder, A. Z. (1998). On the resemblance and containment of documents. In International Conference on Compression and Complexity of Sequences (SEQUENCES'97), June 11–13 1997, Positano, Italy 21–29. IEEE Comput. Soc., Los Alamitos, CA.
de Jong, F., Rode, H. and Hiemstra, D. (2005). Temporal language models for the disclosure of historical text. In Proc. 16th Int. Conf. of the Assoc. for History and Computing 161–168. KNAW, Amsterdam.
Djeraba, C. (2003). Multimedia Mining: A Highway to Intelligent Multimedia Documents. Kluwer, Boston.
Domingos, P. and Pazzani, M. (1996). Beyond independence: Conditions for optimality of the Bayes classifier. In Proceedings of the 13th International Conference on Machine Learning 105–112. Association for Computing Machinery, New York.
Fan, J. and Gijbels, I. (2000). Local polynomial fitting. In Smoothing and Regression: Approaches, Computation, and Application (M. G. Schimek, ed.) 229–276. Wiley, New York.
Feuerverger, A., He, Y. and Khatri, S. (2012). Statistical significance of the Netflix challenge. Statist. Sci. 27 202–231.
Feuerverger, A., Hall, P., Tilahun, G. and Gervers, M. (2005). Distance measures and smoothing methodology for imputing features of documents. J. Comput. Graph. Statist. 14 255–262. MR2160812
Feuerverger, A., Hall, P., Tilahun, G. and Gervers, M. (2008). Using statistical smoothing to date medieval manuscripts. In Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen (N. Balakrishnan, E. Pena, M. J. Silvapulle, eds.). Inst. Math. Stat. Collect. 1 321–331. Inst. Math. Statist., Beachwood, OH. MR2462216
Fiallos, R. (2000). An overview of the process of dating undated medieval charters: Latest results and future developments. In Dating Undated Medieval Charters (M. Gervers, ed.). Boydell Press, Woodbridge.
Gervers, M. (2000). Dating Undated Medieval Charters. Boydell Press, Woodbridge.
Gervers, M. and Hamonic, N. (2010). Pro amore dei: Diplomatic evidence of social conflict during the reign of King John. Preprint.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York. MR2722294
Kanhabua, N. and Norvag, K. (2008). Improving Temporal Language Models for Determining Time of Non-Timestamped Documents. Lecture Notes in Computer Science 5173. Springer, Berlin.
Kanhabua, N. and Norvag, K. (2009). Using Temporal Language Models for Documents Dating. Lecture Notes in Computer Science 5782. Springer, Berlin.
Koenker, R. (2005). Quantile Regression. Econometric Society Monographs 38. Cambridge Univ. Press, Cambridge. MR2268657
Loader, C. (1999). Local Regression and Likelihood. Springer, New York. MR1704236
Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM J. Res. Develop. 2 159–165. MR0090905
Manning, C., Raghavan, P. and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge Univ. Press, New York.
McGill, M., Koll, M. and Noreault, T. (1979). An evaluation of factors affecting document ranking by information retrieval systems. Technical Report. School of Information Studies, Syracuse Univ., Syracuse, NY.
Mosteller, F. and Wallace, D. (1963). Inference in an authorship problem. J. Amer. Statist. Assoc. 58 275–302.
Nadaraya, E. A. (1964). On estimating regression. Theory Probab. Appl. 10 186–190.
Quang, P. X., James, B., James, K. L. and Levina, L. (1999). Document similarity measure for the vector space model in information retrieval. NSASAG Problem 99-5.
Salton, G., Wang, A. and Yang, C. (1975). A vector space model for information retrieval. J. Amer. Soc. Inf. Sci. 18 613–620.
Simonoff, J. S. (1996). Smoothing Methods in Statistics. Springer, New York. MR1391963
Tan, P. N., Steinbach, M. and Kumar, V. (2005). Introduction to Data Mining. Addison-Wesley, Reading.
Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Monographs on Statistics and Applied Probability 60. Chapman & Hall, London. MR1319818
Watson, G. S. (1964). Smooth regression analysis. Sankhyā Ser. A 26 359–372. MR0185765
Zhang, J. and Korfhagen, R. (1999). A distance and angle similarity measure method. J. Amer. Soc. Inf. Sci. 50 772–778.
G. Tilahun, A. Feuerverger, Department of Statistics, University of Toronto, 100 St. George Street, Toronto, Ontario M5S 3G3, Canada. E-mail: gelila@utstat.toronto.edu, andrey@utstat.toronto.edu
M. Gervers, Department of History, University of Toronto, 100 St. George Street, Toronto, Ontario M5S 3G3, Canada. E-mail: m.gervers@utoronto.ca
