Origins of Modern Data Analysis Linked to the Beginnings and Early Development of Computer Science and Information Engineering
The history of data analysis that is addressed here is underpinned by two themes: those of tabular data analysis, and the analysis of collected heterogeneous data. "Exploratory data analysis" is taken as the heuristic approach that begins with dat…
Authors: Fionn Murtagh
Fionn Murtagh
Science Foundation Ireland, Wilton Place, Dublin 2, Ireland, and Department of Computer Science, Royal Holloway University of London, Egham TW20 0EX, England
Email fmurtagh@acm.org
October 30, 2018

Abstract

The history of data analysis that is addressed here is underpinned by two themes: those of tabular data analysis, and the analysis of collected heterogeneous data. “Exploratory data analysis” is taken as the heuristic approach that begins with data and information and seeks underlying explanation for what is observed or measured. I also cover some of the evolving context of research and applications, including scholarly publishing, technology transfer and the economic relationship of the university to society.

1 Data Analysis as the Understanding of Information and Not Just Predicting Outcomes

1.1 Mathematical Analysis of Data

The mathematical treatment of data and information has been recognized since time immemorial. Galileo [20] expressed it in this way: “Philosophy is written in this immense book that stands ever open before our eyes (I speak of the Universe), but it cannot be read if one does not first learn the language and recognize the characters in which it is written. It is written in mathematical language, and the characters are triangles, circles, and other geometrical figures, without the means of which it is humanly impossible to understand a word; without these philosophy is confused, wandering in a dark labyrinth.” Plato is reputed to have had the phrase “Let no-one ignorant of geometry enter” at the entrance to his Academy, the school he founded in Athens [46].

1.2 Collecting Data

Large scale collection of data supports the analysis of data.
Such collection is facilitated greatly by modern computer science and information engineering hardware, middleware and software. Large scale data collection in its own right does not necessarily lead to successful exploitation, as I will show with two examples, the VDI technical lexicon and the Carte du Ciel object inventory.

Walter Benjamin, 1892–1940, social and media (including photography and film) technology critic, noted the following, relating to engineering [5]. “Around 1900, the Verband Deutscher Ingenieure [German Engineers’ Association] set to work on a comprehensive technical lexicon. Within three years, index cards for more than three-and-a-half million words had been collected. But ‘in 1907 the association’s managing committee calculated that, with the present number of personnel, it would take forty years to get the manuscript of the technical lexicon ready for printing. The work was abandoned after it had swallowed up half a million marks’ [48]. It had become apparent that a technical dictionary should be structured in terms of its subject matter, arranged systematically. An alphabetical sequence was obsolete.”

The “Carte du Ciel” (sky map) project was, in a way, similar. It was started in 1887 by Paris Observatory and the aim was to map the entire sky down to the 11th or 12th magnitude. It was planned as a collective and laborious study of photographic plates that would take up to 15 years. The work was not completed. There was a widespread view [27] that such manual work led to French, and European, astronomy falling behind other work elsewhere that was driven by instrumentation and new observing methods.

Over the past centuries, the mathematical treatment of data and information became nowhere more central than in physics. As Stephenson notes in his novel ([45], p.
689) – a point discussed by other authors elsewhere also – Isaac Newton’s “Principia Mathematica might never have come about had Nature not sent a spate of comets our way in the 1680s, and so arranged their trajectories that we could make telling observations.” Indeed the difficulties of experimentally verifying the theory of superstrings have led to considerable recent debate (e.g., recent books by P. Woit, Not Even Wrong; and L. Smolin, The Trouble with Physics).

1.3 Data Analysis and Understanding

I will move now to some observations related to the epistemological facets of data analysis and data mining. I use the cases of Gödel and of Benzécri to show how earlier thinkers and scientists were aware of the thorny issues in moving from observed or measured data to the explanatory factors that underlie the phenomena associated with the data.

Logician Kurt Gödel, 1906–1978, had this critique to make of physics: “physics ... combines concepts without analyzing them” (p. 170, [50]). In a 1961 perspective he was able to claim: “... in physics ... the possibility of knowledge of objectivizable states of affairs is denied, and it is asserted that we must be content to predict the results of observations. This is really the end of all theoretical science in the usual sense.” (p. 140, [50].)

This was a view that was in many ways shared by the data analysis perspective espoused by data analyst and theorist, Jean-Paul Benzécri [8]: “... high energy theoretical physics progresses, mainly, by constituting corpora of rare phenomena among immense sets of ordinary cases. The simple observation of one of these ordinary cases requires detection apparatus based on millions of small elementary detectors. ... Practitioners walk straight into the analyses, and transformations ... without knowing what they are looking for.
What is needed, and what I am proud about having somewhat succeeded in doing, is to see what is relevant in the objects studied, ...”

Philosopher Alain Badiou [4] (p. 79) echoes some of these perspectives on blindly walking into the analysis task: “empirical prodigality becomes something like an arbitrary and sterile burden. The problem ends up being replaced by verification pure and simple.”

The debate on the role of data analysis is not abating. In data analysis in neuroscience (see [32] or, as [22] states in regard to the neurogenetic basis of language, “studies using brain imaging must acknowledge that localization of function does not provide explanatory power for the linguist attempting to uncover principles underlying the speaker’s knowledge of language”), the issue of what explains the data analyzed is very often unresolved. This is notwithstanding solid machine or computer learning progress in being able to map characteristics in the data onto outcomes.

In this short discussion of data analysis, I am seeking solely to demarcate how I understand data analysis in this article. In general terms this is also what goes under the terms of: analyse des données in the French tradition; data mining; and unsupervised classification. The latter is the term used in the pattern recognition literature, and it can be counterposed to supervised classification, or machine learning, or discriminant analysis.

2 Beyond Data Tables: Origins of Data Analysis

2.1 Benzécri’s Data Analysis Project

For a few reasons I will begin with a focus on correspondence analysis. Following the best part of two years that I spent using multidimensional scaling and other methods in educational research, I started on a doctoral program in Benzécri’s lab in 1978. I was very impressed by the cohesiveness of theoretical underpinning and breadth of applications there.
Rather than an ad hoc application of an analysis method to a given problem, there was instead a focused and integrated view of theory and practice. What I found was far from being a bag of analytical tricks or curiosities. Instead the physics or psyche or social process lying behind the data was given its due. Benzécri’s early data analysis motivation sprang from text and document analysis. I will return to these areas in section 4.4 below.

The correspondence analysis and “aids to interpretation” research programs as well as software programs were developed and deployed on a broad scale in Benzécri’s laboratory at the Université Pierre et Marie Curie, Paris 6, through the 1970s, 1980s and 1990s. The hierarchical clustering programs distill the best of the reciprocal nearest neighbors algorithm that was published in the early 1980s in Benzécri’s journal, Les Cahiers de l’Analyse des Données (“Journal of Data Analysis”) and have not been bettered since then. (See also section 3.4 below.) Much of the development work in this framework, ranging from Einstein tensor notation through to the myriad application studies published in the journal Les Cahiers de l’Analyse des Données, was in an advanced state of development by the time of my arrival in Paris in late 1978 to start on a doctoral program.

A little book published in 1982, Histoire et Préhistoire de l’Analyse des Données [6], offers insights into multivariate data analysis or multidimensional statistics. It was written in the spring of 1975, circulated internally, published chapter-wise in Les Cahiers de l’Analyse des Données, before taking book form.
It begins with a theme that echoes widely in Benzécri’s writings: namely that the advent of computers overturned statistics as understood up until then, and that the suppositions and premises of statistics had to be rethought in the light of computers. From probability theory, data analysis inherits inspiration but not methods: statistics is not, and cannot be, probability alone. Probability is concerned with infinite sets, while data analysis only touches on infinite sets in the far more finite world expressed by such a typical problem as discovering the system of relationships between rows and columns of a rectangular data table.

2.2 Tabular Data

With a computational infrastructure, the analysis of data tables has come into its own. Antecedents clearly go back much further. Therefore let me place the origins of data analysis in an unorthodox setting. Clark [13] cites Foucault [19] approvingly: “The constitution of tables was one of the great problems of scientific, political and economic technology in the eighteenth century ... The table of the eighteenth century was at once a technique of power and a procedure of knowledge”.

If tabular data led to data analysis, then it can also be pointed out that tabular data – in another line of evolution – led to the computer. Charles Babbage, 1791–1871, is generally avowed to be a father of the computer [47]. Babbage’s early (mechanical) versions of computers, his Difference Engine and Analytical Engine, were designed with tabular data processing in view, for example generating tables ranging from logarithms to longitude. “The need for tables and the reliance placed on them became especially acute during the first half of the nineteenth century, which witnessed a ferment of scientific invention and unprecedented engineering ambition – bridges, railways, shipbuilding, construction and architecture. ...
There was one need for tables that was paramount – navigation. ... The problem was that tables were riddled with errors.” [47]. The computer was called for to avoid these errors.

2.3 Algorithmic and Computational Data Analysis

In discussing R.A. Fisher (English statistician, 1890–1962), Benzécri [6] acknowledges that Fisher in fact developed the basic equations of correspondence analysis but without of course a desire to do other than address the discrimination problem. Discriminant analysis, or supervised classification, took off in a major way with the availability of computing infrastructure. The availability of such methods in turn motivated a great deal of work in pattern recognition and machine learning. It is to be noted that computer-based analysis leads to a change of perspective with options now available that were not heretofore. Fisher’s brilliant approaches implicitly assume that variables are well known, and that relations between variables are strong. In many other fields, a more exploratory and less precise observational reality awaits the analyst.

With computers came pattern recognition and, at the start, neural networks. A conference held in Honolulu in 1964 on Methodologies of Pattern Recognition, that was attended by Benzécri, cited Rosenblatt’s perceptron work many times (albeit his work was cited but not the perceptron as such). Frank Rosenblatt (1928–1971) was a pioneer of neural networks, including the perceptron and neuromimetic computing which he developed in the 1950s. Early neural network research was simply what became known later as discriminant analysis. The problem of discriminant analysis, however, is insoluble if the characterization of observations and their measurements are not appropriate. This leads ineluctably to the importance of the data coding issue for any type of data analysis.
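The dependence of discrimination on data coding can be illustrated with a small sketch. It is not drawn from any of the sources cited here: the four-observation data set, the added product measurement, and the function name train_perceptron are all illustrative assumptions, and the learning rule is the textbook perceptron rule rather than Rosenblatt's original formulation.

```python
def train_perceptron(samples, epochs=1000):
    """Textbook perceptron rule. Returns (weights, bias, converged)."""
    dim = len(samples[0][0])
    w, b = [0] * dim, 0
    for _ in range(epochs):
        errors = 0
        for x, y in samples:  # y is +1 or -1
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified, or on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                errors += 1
        if errors == 0:  # a full pass over the data with no mistakes
            return w, b, True
    return w, b, False


# Illustrative data (an assumption): label +1 when exactly one input is 1.
raw = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]

# Same observations, recoded with one extra measurement: the product x1 * x2.
recoded = [((x1, x2, x1 * x2), y) for (x1, x2), y in raw]

_, _, ok_raw = train_perceptron(raw)
_, _, ok_recoded = train_perceptron(recoded)
print("raw coding discriminated:", ok_raw)
print("recoded data discriminated:", ok_recoded)
```

Under the raw coding no linear rule separates the two classes, so training never completes an error-free pass. Once the same observations are recoded with the extra measurement, a separating rule exists and the perceptron finds one. The observations are unchanged; only their characterization differs.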
Psychometrics made multidimensional or multivariate data analysis what it has now become, namely, “search by induction of the hidden dimensions that are defined by combinations of primary measures”. Psychometrics is a response to the problem of exploring areas where immediate physical measurement is not possible, e.g. intelligence, memory, imagination, patience. Hence a statistical construction is used in such cases (“even if numbers can never quantify the soul!” [6]).

While it is now part of the history of data analysis and statistics that around the start of the 20th century interest came about in human intelligence, and an underlying measure of intelligence, the intelligence quotient (IQ), there is a further link drawn by Benzécri [6] in tracing an astronomical origin to psychometrics. Psychophysics, as also many other analysis frameworks such as the method of least squares, was developed in no small way by astronomers: the desire to penetrate the skies led too to study of the scope and limits of human perception, and hence psychometrics.

Around the mid-1960s Benzécri began a correspondence with Roger N. Shepard which resulted in a visit to Bell Labs. Shepard (“a statistician only in order to serve psychology, and a psychologist out of love for philosophy”) and J. Douglas Carroll (who “joyfully used all his ingenuity – which was large indeed – to move data around in the computer like one would move pearls in a kaleidoscope”) had developed proximity analysis, serving as a lynchpin of multidimensional scaling.

2.4 A Data Analysis Platform

The term “correspondence analysis” was first proposed in the fall of 1962. The first presentation under this title was made by J.-P. Benzécri at the Collège de France in a course in the winter of 1963.
By the late 1970s what correspondence analysis had become was not limited to the extraction of factors from any table of positive values. It also catered for data preparation; rules such as coding using complete disjunctive form; tools for critiquing the validity of results principally through calculations of contribution; provision of effective procedures for discrimination and regression; and harmonious linkage with cluster analysis. Thus a unified approach was developed, for which the formalism remained quite simple, but for which deep integration of ideas was achieved with diverse problems. Many of the latter originally appeared from different sources, and some went back in time by many decades.

Two explanations are proposed in [6] for the success of correspondence analysis. Firstly, the principle of distributional equivalence allows a table of positive values to be given a mathematical structure that compensates, as far as possible, for arbitrariness in the choice of weighting and subdivision of categories. Secondly, a great number of data analysts, working in very different application fields, found available a unified processing framework, and a single software package. Correspondence analysis was considered as a standard, unifying and integrated analysis framework – a platform.

2.5 Origins in Linguistic Data Analysis

Correspondence analysis was initially proposed as an inductive method for analyzing linguistic data. From a philosophy standpoint, correspondence analysis simultaneously processes large sets of facts, and contrasts them in order to discover global order; and therefore it has more to do with synthesis (etymologically, to synthesize means to put together) and induction.
On the other hand, analysis and deduction (viz., to distinguish the elements of a whole; and to consider the properties of the possible combinations of these elements) have become the watchwords of data interpretation. It has become traditional now to speak of data analysis and correspondence analysis, and not “data synthesis” or “correspondence synthesis”.

The structural linguist Noam Chomsky, in the little volume, Syntactic Structures [12], held that there could not be a systematic procedure for determining the grammar of a language, or more generally linguistic structures, based on a set of data such as that of a text repository or corpus. Thus, for Chomsky, linguistics cannot be inductive (i.e., linguistics cannot construct itself using a method, explicitly formulated, from the facts to the laws that govern these facts); instead linguistics has to be deductive (in the sense of starting from axioms, and then deriving models of real languages).

Benzécri did not like this approach. He found it idealist, in that it tends to separate the actions of the mind from the facts that are the inspiration for the mind and the object of the mind. At that time there was not available an effective algorithm to take ten thousand pages of text from a language to a syntax, with the additional purpose of yielding semantics. But now, with the advances in our computing infrastructure, statistics offers the linguist an effective inductive method for usefully processing data tables that one can immediately collect, with – on the horizon – the ambitious layering of successive research that will not leave anything in the shade – from form, meaning or style.
This then is how data analysis is feasible and practical in a world fueled by computing capability: “We call the distribution of a word the set of its possible environments.” In the background there is a consideration that Laplace noted: a well-constructed language automatically leads to the truth, since faults in reasoning are shown up as faults in syntax. Dijkstra, Wirth, Hoare and the other pioneering computer scientists who developed the bases of programming languages that we use today, could not have expressed this better. Indeed, Dijkstra’s view was that “the programmer should let correctness proof and program grow hand in hand” [17].

2.6 Information Fusion

From 1950 onwards, statistical tests became very popular, to verify or to protect the acceptability of a hypothesis (or of a model) proposed a priori. On the other hand correspondence analysis refers from the outset to the hypothesis of independence of observations (usually rows) I and attributes (usually columns) J but aims only at exploring the extent to which this is not verified: hence the spatial representation of uneven affinities between the two sets. Correspondence analysis looks for typical models that are achieved a posteriori and not a priori. This is following the application of mutual processing of all data tables, without restrictive hypotheses. Thus the aim is the inductive conjugating of models.

2.7 Benzécri’s and Hayashi’s Shared Vision of Science

If Benzécri was enormously influential in France in drawing out the lessons of data analysis being brought into a computer-supported age, in Japan Chikio Hayashi played no less a role. Hayashi (1918–2002) led areas that included public opinion research and statistical mathematics, and was first president of the Behaviormetric Society of Japan.

In [24] Hayashi’s data analysis approach is set out very clearly.
Firstly, what Hayashi referred to as “quantification” was the scene setting or data encoding and representation forming the basis of subsequent decision making. He introduced therefore [24] “methods of quantification of qualitative data in multidimensional analysis and especially how to quantify qualitative patterns to secure the maximum success rate of prediction of phenomena from the statistical point of view”. So, firstly data, secondly method, and thirdly decision making are inextricably linked.

Next comes the role of data selection, weighting, decorrelation, low dimensionality selection and related aspects of the analysis, and classification. “The important problem in multidimensional analysis is to devise the methods of the quantification of complex phenomena (intercorrelated behaviour patterns of units in dynamic environments) and then the methods of classification. Quantification means that the patterns are categorized and given numerical values in order that the patterns may be able to be treated as several indices, and classification means prediction of phenomena.” In fact the very aim of factor analysis type analyses, including correspondence analysis, is to prepare the way for classification: “The aim of multidimensional quantification is to make numerical representation of intercorrelated patterns synthetically to maximize the efficiency of classification, i.e. the success rate of prediction.” Factorial methods are insufficient in their own right, maybe leading just to display of data: “Quantification does not mean finding numerical values but giving them patterns on the operational point of view in a proper sense. In this sense, quantification has not absolute meaning but relative meaning to our purpose.”

This became very much the approach of Benzécri too. Note that Hayashi’s perspectives as described above date from 1954.
In [7], a contribution to the journal Behaviormetrika that was invited by Hayashi, Benzécri draws the following conclusions on data analysis: “In data analysis numerous disciplines have to collaborate. The role of mathematics, although essential, remains modest in the sense that classical theorems are used almost exclusively, or elementary demonstration techniques. But it is necessary that certain abstract conceptions penetrate the spirit of the users, who are the specialists collecting the data and having to orientate the analysis in accordance with the problems that are fundamental to their particular science.” This integral linkage of disciplines is an aspect that I will return to in the Conclusions.

Benzécri [7] develops the implications of this. The advance of computing capability (remember that this article was published in 1983) “requires that Data Analysis [in upper case indicating the particular sense of data analysis as – in Hayashi’s terms and equally the spirit of Benzécri’s work – quantification and classification] project ahead of the concrete work, the indispensable source of inspiration, a vision of science.” Benzécri as well as Hayashi developed data analysis as projecting a vision of science. He continues: “This vision is philosophical: it is not a matter of translating directly in mathematical terms the system of concepts of a particular discipline but of linking these concepts in the equations of a model.
Nor is it a matter of accepting the data such as they are revealed, but instead of elaborating them in a deep-going synthesis which allows new entities to be discovered and simple relationships between these new entities.”

Finally, the overall domain of application of data analysis is characterized as follows: “Through differential calculus, experimental situations that are admirably dissected into simple components were translated into so many fundamental laws. We believe that it is reserved for Data Analysis to express adequately the laws of that which, complex by nature (living being, social body, ecosystem), cannot be dissected without losing its very nature.”

While Hayashi and Benzécri shared a vision of science, they also shared greatly a view of methodology to be applied. In a 1952 publication [23] Hayashi referred to “the problem of classification by quantification method” which is not direct and immediate clustering of data, but rather a careful combination of numerical encoding and representation of data as a basis for the clustering. Hayashi’s aim was to discuss: “(1) the methods of quantification of qualitative statistical data obtained by our measurements and observations ...; (2) ... the patterns of behaviour must be represented by some numerical values; (3) ... effective grouping is required.” Data analysis methods are not applied in isolation, therefore. In [7] Benzécri referred to correspondence analysis and hierarchical clustering, and indeed discriminant analysis (“so as not to be illusory, a discriminant procedure has to be applied using a first set of cases – the base set – and then trialled on other cases – the test set”).

In [7] Benzécri refers, just a little, to the breakthrough results achieved in hierarchical clustering algorithms around this time, and described in the work of Juan [28, 29].
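To give a flavor of the reciprocal nearest neighbors idea behind this family of hierarchical clustering algorithms, here is a minimal sketch. It is not the code of the cited works: the use of Ward's criterion, the nearest-neighbor chain organization, and the names ward and rnn_cluster are assumptions made for illustration, and no attempt is made at the efficiency of a production implementation.

```python
from itertools import count


def ward(a, b):
    """Ward dissimilarity between two clusters given as (size, centroid)."""
    (na, ca), (nb, cb) = a, b
    sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * sq


def rnn_cluster(points):
    """Agglomerate by repeatedly merging reciprocal nearest neighbors.

    Returns a list of merges (id_a, id_b, new_id, dissimilarity).
    Original points have ids 0..n-1; merged clusters get new ids n, n+1, ...
    """
    clusters = {i: (1, tuple(p)) for i, p in enumerate(points)}
    new_ids = count(len(points))
    merges, chain = [], []
    while len(clusters) > 1:
        if not chain:  # start a new chain from any remaining cluster
            chain.append(next(iter(clusters)))
        top = chain[-1]
        # Nearest neighbor of the cluster at the end of the chain.
        nn = min((c for c in clusters if c != top),
                 key=lambda c: (ward(clusters[top], clusters[c]), c))
        d_nn = ward(clusters[top], clusters[nn])
        if len(chain) > 1 and ward(clusters[top], clusters[chain[-2]]) <= d_nn:
            # The last two chain elements are each other's nearest neighbor.
            a, b = chain.pop(), chain.pop()
            d = ward(clusters[a], clusters[b])
            (na, ca), (nb, cb) = clusters.pop(a), clusters.pop(b)
            n = na + nb
            centroid = tuple((na * x + nb * y) / n for x, y in zip(ca, cb))
            new = next(new_ids)
            clusters[new] = (n, centroid)
            merges.append((a, b, new, d))
        else:
            chain.append(nn)  # keep following nearest neighbors
    return merges


# Illustrative one-dimensional data (an assumption).
pts = [(0.0,), (0.1,), (5.0,), (5.2,), (10.0,)]
for a, b, new, d in rnn_cluster(pts):
    print(f"merge {a} and {b} into {new} at dissimilarity {d:.4f}")
```

Merges are produced as soon as a reciprocal pair is found, so they are not necessarily emitted in order of increasing dissimilarity; sorting the returned list by its last field gives the usual dendrogram order. Reusing the rest of the chain after a merge is valid for criteria such as Ward's, where merging two clusters cannot bring the result closer to a third cluster than both of its parts were.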
These algorithmic results on hierarchical clustering were furthered in the following year by the work of de Rham [16]. As computational results they have not been bettered since and still represent the state of the art in algorithms for this family of classification method. In [33, 35, 36] I presented surveys of these algorithms relative to other fast algorithms for particular hierarchical clustering methods, and my software code was used in the CLUSTAN and R packages. More software code is available at [38]. In [34] I showed how, in practice, even more efficient algorithms can be easily designed.

3 The Changing University: Academe, Commercialization and Industry

3.1 The Growing Partnership of University and Industry

It is generally considered that a major milestone – perhaps the most important event – in propelling the university into the modern age was the Bayh-Dole Act, brought into legislation in the United States, in 1980. It was a radical change in public policy perspective on the university and on research. Prior to then, the research role of universities was a public role, and research results were to be passed on to industry. Intellectual property was owned, prior to 1980, by the public purse, as embodied in the US Government. The Bayh-Dole Act allowed universities to own intellectual property, to license it, and to otherwise exploit it as they saw fit. The university became a close partner of industry. It was a change in legislation and in perception that echoed around the planet.

3.2 1968 in France: Role in Bringing the University Closer to Industry

The rise of the partnership between the university and industry, between academe and commercialization, that is now so integral, everywhere, had other facets too. Benzécri’s reflections [9] in this regard are of interest.
In fact, these reflections throw a somewhat different light on the 1968 period of significant student and general social unrest.

Benzécri [9] paints the following picture. “Forty years ago a memorable academic year started, the ravages of which are often deplored but also I must confess to having been the happy beneficiary. Charged by Prof. Daniel Dugué with the teaching of the Diplôme d’Études Approfondies (DEA) de Statistique – the Advanced Studies degree in Statistics, constituting the first part of a doctorate, I chose Data Analysis as the theme of the course. Prof. Dugué laughed and said this covered all of statistics! Under this broad banner, I intended to carry out lots of correspondence analyses. With such analyses, thanks to the patience of Brigitte Cordier, working on an IBM 1620 – a pocket calculator today but one that the Dean Yves Martin provided for the price of a chateau! – I was able, in Rennes, to aim at conquering linguistics, economics, and other fields.

To analyze, data were necessary. I was resolved to send the students of the DEA to collect sheafs of these precious flowers. From my first class, I announced that the students should undertake internships. But this call, repeated from week to week, had no response. The students thought that, if they survived a written exam and an oral then no-one could ask them for more. Other than by cramming, they would have nothing to gain from the novelty of an excursion into practical things. Moreover, even if they accepted or even were enticed by my project then who would host their internship?

The pretty month of May 1968 would change all that! Living in Orléans, I did not hear the shouting from Lutetia [i.e.
from Paris and also from the university; Lutetia, or Lutèce, is the name of a town pre-existing Paris and the Arènes de Lutèce is a public park very close to Université Pierre et Marie Curie, Paris 6] but only distant echoes. Finally, in September, both boss and students had to resign themselves to each take back their role, each cautiously but also brazenly!

Since the month of May, the university was stigmatized as being arthritic, not offering a preparation for life. Not I but others had extolled internships. Machine-like of course the students came to tell me that they wanted to carry out internships. I had triumphed! Yes, yes, they stammered in confession, you had told us. It was just so too for those who in November 1967 had refused any intern now in September 1968 were keen to save their native land by pampering young people. The way opened up for correspondence analysis, a methodology that in practice was very soon associated with hierarchical clustering.”

3.3 The Changed Nature of the PhD

It is a source of some pride for me to be able to trace back through my doctoral lineage as follows [31]. My PhD advisor was Jean-Paul Benzécri. He studied with Henri Cartan (of Bourbaki: see [1] for a survey). Tracing back through advisors or mentors I have: Émile Borel and Henri Lebesgue; Simeon Poisson (advisor also to Gustav Dirichlet and Joseph Liouville); Joseph Lagrange (advisor also of Jean-Baptiste Fourier); Leonhard Euler; Johann Bernoulli; Jacob Bernoulli; and Gottfried Leibniz.

The PhD degree, including the title, the dissertation and the evaluation framework as a work of research (the “rite of passage”) came about in the German lands between the 1770s and the 1830s. Clark [13] finds it surprising that it survived the disrepute associated with all academic qualifications in the turmoil of the late 18th century.
In the United States, the first PhD was awarded by Yale University in 1861. In the UK, the University of London introduced the degree between 1857 and 1860. Cambridge University awarded the DPhil or PhD from 1882, and Oxford University only from 1917.

A quite remarkable feature of the modern period is how spectacular the growth of PhD numbers has now become. In [40], I discuss how in the US, to take one example, in Computer Science and Engineering, the number of PhDs awarded has doubled in the three years to 2008. Internationally this evolution holds too. For example, Ireland is pursuing a doubling of PhD output up to 2013.

Concomitant with numbers of PhDs, the very structure of the PhD is changing in many countries outside North America. There is a strong movement away from the traditional German “master/apprentice” model, towards instead a “professional” qualification. This move is seen often as towards the US model. In Ireland there is a strong move to reform the PhD towards what is termed a “structured PhD”. This involves a change from the apprenticeship model consisting of lone or small groups of students over three years in one university department to a new model incorporating elements of the apprenticeship model centered around groups of students possibly in multiple universities where generic and transferable skills (including entrepreneurial) can be embedded in education and training over four years. Unlike in most of Europe, Germany is retaining a traditional “master/apprentice” model.

Numbers of PhDs are dramatically up, and in many countries there is a major restructuring under way of the PhD work content and even timeline. In tandem with this, in North America the majority of PhDs in such areas as computer science and computer engineering now move directly into industry when they graduate.
This trend goes hand in hand with the move from an apprenticeship for a career in academe to, instead, a professional qualification for a career in business or industry.

In quite a few respects Benzécri’s lab was akin to what is now widely targeted in terms of courses and large scale production of PhDs. I recall at the end of the 1970s how there were about 75 students en thèse – working on their dissertations – and 75 or so attending courses in the first year of the doctoral program leading to the DEA, Diplôme d’Études Approfondies, qualification. Not atypically at this time in terms of industrial outreach, I carried out a study for a company CIMSA, Compagnie d’Informatique Militaire, Spatiale et Aéronautique (a subsidiary of Thomson) on future computing needs; and my thesis was in conjunction with the Bureau de Recherches Géologiques et Minières, Orléans, the national geological research and development agency.

3.4 Changing Citation and Other Aspects of Scholarly Publication Practice

In my time in Benzécri’s lab, I developed a view of citation practice, relevant for mathematically-based PhD research in the French tradition, and this was as follows: a good introduction in a PhD dissertation in a mathematical domain would lay down a firm foundation in terms of lemmas, theorems and corollaries with derivation of results. However there was not a great deal of citing. Showing that one had assimilated very well the content was what counted, and not reeling off who had done what and when. On the other hand, it seemed to me to be relatively clear around 1980 that in general a PhD in the “Anglo-Saxon countries” would start with an overview chapter containing plenty of citations to the relevant literature (or literatures).
This different tradition aimed at highlighting what one was contributing in the dissertation, after first laying out the basis on which this contribution was built.

Perhaps the divide was indicative just of the strong mathematical tradition in the French university system. However more broadly speaking citation practices have changed enormously over the past few decades. I will dwell a little on these changes now.

In a recent (mid-2008) review of citation statistics, Adler et al. [2] note that in mathematics and computer science, a published article is cited on average less than once; in chemistry and physics, an article is cited on average about three times; it is just a little higher in clinical medicine; and in the life sciences a published article is on average cited more than six times. It is small wonder therefore that in recent times (2008) the NIH (National Institutes of Health, in the US) has been a key front-runner in pushing Open Access developments – i.e. mandatory depositing of article postprints upon publication or by an agreed date following publication. The NIH Open Access mandate was indeed legislatively enacted [41].

In his use of Les Cahiers de l’Analyse des Données, a journal published by Dunod and running with 4 issues each year over 21 years up to 1997, Benzécri focused and indeed concentrated the work of his lab. Nowadays a lab- or even institute-based journal appears unusual even if it certainly testifies to a wide range of applications and activities. It was not always so. Consider e.g. the Journal für die reine und angewandte Mathematik, referred to as Crelle’s Journal in an earlier age of mathematics when August Leopold Crelle had founded it and edited it up to his death in 1855. Or consider, closer to Benzécri’s lab, the Annales de l’ISUP, ISUP being the Institut Statistique de l’Université de Paris.
It is of interest to dwell here on just what scientific, or scholarly, publication is, given the possible insight from the past into current debates on Open Access, and citation-based performance and resource-allocation models in national research support systems.

As is well known what are commonly regarded as the first scientific journals came about in 1665. These were the Philosophical Transactions of the Royal Society of London early in that year and the Journal des Sçavants in Paris a little later. Guédon [21] contrasts them, finding that “the Parisian publication followed novelty while the London journal was helping to validate originality”. The Philosophical Transactions was established and edited by Henry Oldenburg (c. 1619 to 1677), Secretary of the Royal Society. This journal “aimed at creating a public record of original contributions to knowledge”. Its primary function was not (as such) general communication between peers, nor dissemination to non-scientists, but instead “a public registry of discoveries”. This first scientific journal was a means for creating intellectual property. Journals originating in Oldenburg’s prototypical scientific – indeed scholarly – journal are to be seen “as registers of intellectual property whose functions are close to that of a land register”.

Noting parallels with the modern web age, Guédon [21] sees how in the 17th century, “the roles of writers, printers, and bookstore owners, as well as their boundaries, were still contentious topics.” The stationers sought to establish their claim, just like a claim to landed property. By defining authorship in the writing activity, and simultaneously the intellectual property of that author, the way was open to the stationer as early publisher to have the right to use this property analogous to landed property.
Johns [26] points to how suspicion and mistrust accompanied early publishing so that Oldenburg was also targeting an “innovative use of print technology”. Guédon [21] finds: “The design of a scientific periodical, far from primarily aiming at disseminating knowledge, really seeks to reinforce property rights over ideas; intellectual property and authors were not legal concepts designed to protect writers – they were invented for the printers’ or stationers’ benefits.” Let me temper this to note how important intellectual property over ideas is additionally in terms of motivation of scholars in subsequent times. The context as much as the author-scientist led to this particular form of intellectual property.

What I find convincing enough in the role of the printer or stationer is that the article, or collection of articles in a journal, or other forms of printed product (pamphlet, treatise), became the most important paradigm. Other possible forms did not. Examples could include: the experiment; or the table of experimental data; or catalogs or inventories. Note that the latter have become extremely important in, e.g., observational data based sciences such as astronomy, or the processed or derived data based life sciences. What is interesting is that the publication remains the really dominant form of research output.

Coming now to authorship, there has been an overall shift towards teamwork in authorship, clearly enough led by the life sciences and by “big science”. In the highly cited journal, Nature, it has been noted [49] that “almost all original research papers have multiple authors”. Furthermore (in 2008), “So far this year ... Nature has published only six single-author papers, out of a total of some 700”. What is however very clear is that mathematics or statistics or related methodology work rarely ever appears in Nature.
While social networks of scientists have become very important, notes Whitfield, nonetheless there is room still for a counter-current in scholarly activity: “... however finely honed scientists’ team-building strategies become, there will always be room for the solo effort. In 1963, Derek de Solla Price, the father of authorship-network studies, noted that if the trends of that time persisted, single-author papers in chemistry would be extinct by 1980. In fact, many branches of science seem destined to get ever closer to that point but never reach it.”

With online availability now of the scholarly literature it appears that relatively fewer, rather than more, papers are being cited and, by implication, read. Evans [18] finds that: “as more journal issues came online, the articles referenced tended to be more recent, fewer journals and articles were cited, and more of those citations were to fewer journals and articles”. Evans continues: “The forced browsing of print archives may have stretched scientists and scholars to anchor findings deeply into past and present scholarship. Searching online is more efficient and following hyperlinks quickly puts researchers in touch with prevailing opinion, but this may accelerate consensus and narrow the range of findings and ideas built upon.” Again notwithstanding the 34 million articles used in this study, it is clear that there are major divides between, say, mathematical methodology and large teams and consortia in the life and physical sciences.

In a commentary on the Evans article [18], Couzin [15] refers to “herd behavior among authors” in scholarly publishing. Couzin concludes by pointing to how this trend “may lead to easier consensus and less active debate in academia”.
I would draw the conclusion that mathematical thinking – if only because it lends itself poorly to the particular way that “prevailing opinion” and acceleration of “consensus” are forced by how we now carry out research – is of great importance for innovation and new thinking.

The change in research and scholarly publishing has implications for book publishing. Evans [18] notes this: “The move to online science appears to represent one more step on the path initiated by the much earlier shift from the contextualized monograph, like Newton’s Principia or Darwin’s Origin of Species, to the modern research article. The Principia and Origin, each produced over the course of more than a decade, not only were engaged in current debates, but wove their propositions into conversation with astronomers, geometers, and naturalists from centuries past. As 21st-century scientists and scholars use online searching and hyperlinking to frame and publish their arguments more efficiently, they weave them into a more focused – and more narrow – past and present.”

Undue focus and narrowness, and “herd behavior”, are at odds with the Hayashi and Benzécri vision of science. Fortunately, this vision of science has not lost its sharp edge and its innovative potential for our times.

4 Centrality of Data Analysis in Early Computer Science and Engineering

4.1 Data Stores: Prehistory of the Web

Comprehensive and encyclopedic collection and interlinkage of data and information that is now typified by the web has a very long history. Here I first point to just some of these antecedents that properly belong to the prehistory, well avant la lettre, of the web.

In the 12th century the web could maybe be typified by the work of John Tzetzes, c. 1110–1180, who according to Browning [10] was somewhat dysfunctional in his achievements: “...
His range was immense ... He had a phenomenal memory ... philological commentaries on works of classical Greek poetry ... works of scholarship in verse ... long, allegorical commentaries ... encyclopedia of Greek mythology ... long hexameter poems ... works of popularization ... Tzetzes compiled a collection of his letters, as did many of his contemporaries. He then went on, however, and equipped it with a gigantic commentary in nearly 13,000 lines of ‘political’ verse, which is a veritable encyclopedia of miscellaneous knowledge. Later he went on to add the elements of a prose commentary on his commentary. The whole work conveys an impression of scholarship without an object, of a powerful engine driving nothing. Tzetzes was in some ways a misfit and a failure in his own society. Yet his devotion of immense energy and erudition to a trivial end is a feature found elsewhere in the literature of the twelfth century, and points to a breakdown in the structure of Byzantine society and Byzantine life, a growing discrepancy between ends and means.”

In modern times, the famous 1945 article [11] by Vannevar Bush (1890–1974) set the scene in very clear terms for the web: “Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and, to coin one at random, ‘memex’ will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.”

It is not widely recognized that Bush’s famous essay was preceded by an extensively developed plan by Belgian Paul Otlet, 1868–1944, especially in his book, Traité de Documentation [42], published in Brussels in 1934.
As described by him in the chapter entitled “The preservation and international diffusion of thought: the microphotic book” in [43], a clear view was presented (p. 208) of the physical, logical, and indeed socio-economical, layers necessary to support a prototype of the web: “By combining all the central offices discussed ... one could create a ‘Document Super-Center.’ This would be in contact with national centers to which a country’s principal offices of documentation and libraries would be linked to form stations in a universal network. ... The books, articles and documents ... would be brought together in a great collection. Gradually a classified Microphotic Encyclopedia would be formed from them, the first step toward new microphotolibraries. All of these developments would be linked together to form a Universal Network of Documentation.”

4.2 Between Classification and Search

In the next three sections, sections 4.3, 4.4 and 4.5, I progress through the physical and logical layers supporting data analysis. I detail just some of the work in the 1960s and 1970s that involved the practicalities of data analysis. Such work extended the work of Otlet and Bush. While constituting just small building blocks in the massive edifice of what we now have by way of search and access to data and information, the contribution of underpinning theory and of skillful implementation that I discuss in these sections should not be underestimated.

Let me draw a line between the work of Bush and Otlet, which may have been eclipsed for some decades, but which indicates nonetheless that certain ideas were in the spirit of the times. The disruptive technology that came later with search engines like Google changed the rules of the game, as the software industry often does.
Instead of classifying and categorizing information, search and discovery were to prove fully sufficient. Both classification and search were a legacy of the early years of exploratory data analysis research. Classification and search are two sides of the same coin. Consider how the mainstay to this day of hierarchical clustering algorithms remains the nearest neighbor chain and reciprocal nearest neighbor algorithms, developed in Benzécri’s lab in the early 1980s [36].

The ftp protocol (file transfer protocol) was developed in the 1970s and took its definitive present form by 1985. Increasingly wider and broader uptake of data and information access protocols was the order of the day by around 1990. Archie, a search service for ftp, was developed initially at the McGill University School of Computer Science in 1990. The World Wide Web concept and http (hypertext transfer protocol) were in development by Tim Berners-Lee at CERN by 1991. In 1991 a public version of Wide Area Information Servers (WAIS), invented by Brewster Kahle, was released by Thinking Machines Corporation. WAIS was based on the Z39.50 protocol and was a highly influential (certainly for me!) forerunner of web-wide information search and discovery. In April 1991 Gopher was released by the University of Minnesota Microcomputer, Workstation and Networks Center. Initially the system was a university help service, a “campus-wide document delivery system”. In 1992, a search service for Gopher servers was developed under the name of Veronica, and released by the University of Nevada. The University of Minnesota upset the burgeoning communities using wide area data and information search and discovery by introducing licensing of Gopher. This was just before the release of the Mosaic web browser, developed by Marc Andreessen, an undergraduate student at the University of Illinois, Champaign.
A contemporary vantage point on some of these developments is in my edited compilation [25], which was finalized in the late summer of 1992. It may be noted too that the bibliographies in the Classification Literature Automated Search Services (see Appendix) were set up to be accessed through WAIS in the early 1990s.

It is useful to have sketched out this subsequent evolution in data and information search and discovery because it constitutes one, but unquestionably an enormous, development rooted in earlier work on heterogeneous data collection and multivariate data analysis.

4.3 Environment and Context of Data Analysis

I have noted that even in early times, the role of computational capability was central (see sections 2 and 3.2).

Describing early work with John Gower in the Statistics Department at Rothamsted Experimental Station in 1961, when Frank Yates was head of department, Gavin Ross reviewed data analysis as follows [44].

“... we had several requests for classification jobs, mainly agricultural and biological at first, such as classification of nematode worms, bacterial strains, and soil profiles. On this machine and its faster successor, the Ferranti Orion, we performed numerous jobs, for archaeologists, linguists, medical research laboratories, the Natural History Museum, ecologists, and even the Civil Service Department.

On the Orion we could handle 600 units and 400 properties per unit, and we programmed several alternative methods of classification, ordination and identification, and graphical displays of the minimum spanning tree, dendrograms and data plots. My colleague Roger Payne developed a suite of identification programs which was used to form a massive key to yeast strains.

The world of conventional multivariate statistics did not at first know how to view cluster analysis.
Classical discriminant analysis assumed random samples from multivariate normal populations. Cluster analysis mixed discrete and continuous variables, was clearly not randomly sampled, and formed non-overlapping groups where multivariate normal populations would always overlap. Nor was the choice of variables independent of the resulting classification, as Sneath had originally hoped, in the sense that if one performed enough tests on bacterial strains the proportion of matching results between two strains would reflect the proportion of common genetic information. But we and our collaborators learnt a lot from these early endeavours.”

In establishing the Classification Society [14], the interdisciplinary nature of the objectives was stressed: “The foundation of the society follows the holding of a Symposium, organized by Aslib on 6 April, 1962, entitled ‘Classification: an interdisciplinary problem’, at which it became clear that there are many aspects of classification common to such widely separated disciplines as biology, librarianship, soil science, and anthropology, and that opportunities for joint discussion of these aspects would be of value to all the disciplines concerned.”

How far we have come can be seen in [9] where target areas are sketched out that range over analysis of voting and elections; jet algorithms for the Tevatron and Large Hadron Collider systems; gamma ray bursts; environment and climate management; sociology of religion; data mining in retail; speech recognition and analysis; sociology of natality – analysis of trends and rates of births; and economics and finance – industrial capital in Japan, financial data analysis in France, monetary and exchange rate analysis in the United States. In all cases the underlying explanations are wanted, and not superficial displays or limited regression modeling.
4.4 Information Retrieval and Linguistics: Early Applications of Data Analysis

Roger Needham and Karen Spärck Jones were two of the most influential figures in computing and the computational sciences in the UK and worldwide.

The work of Roger Needham, who died in February 2003, ranged over a wide swathe of computer science. His early work at Cambridge in the 1950s included cluster analysis and information retrieval. In the 1960s, he carried out pioneering work on computer architecture and system software. In the 1970s, his work involved distributed computing. In later decades, he devoted considerable attention to security.

In the 1960s he published on clustering and classification. Information retrieval was among the areas he contributed to. Among his early publications were:

1. “Keywords and clumps”, Journal of Documentation, 20, 5–15, 1964.
2. “Applications of the theory of clumps”, Mechanical Translation, 8, 113–127, 1965.
3. “Automatic classification in linguistics”, The Statistician, 17, 45–54, 1967.
4. “Automatic term classifications and retrieval”, Information Storage and Retrieval, 4, 91–100, 1968.

Needham, who was the husband of Spärck Jones, set up and became first director of Microsoft Research in Cambridge in 1997.

Karen Spärck Jones died in April 2007. Among early and influential publications on her side were the following.

1. “Experiments in semantic classification”, Mechanical Translation, 8, 97–112, 1965.
2. “Some thoughts on classification for retrieval”, Journal of Documentation, 26, 89–101, 1970. (Reprinted in Journal of Documentation, 2005.)
3. With D.M. Jackson, “The use of automatically-obtained keyword classifications for information retrieval”, Information Storage and Retrieval, 5, 175–201, 1970.
4. Automatic Keyword Classification for Information Retrieval, Butterworths, 1971.
Even in disciplines outside of formative or emergent computer science, the centrality of data analysis algorithms is very clear from a scan of publications in earlier times. A leader of classification and clustering research over many decades is James Rohlf (State University of New York). As one among many examples, we note this work of his:

F.J. Rohlf, Algorithm 76. Hierarchical clustering using the minimum spanning tree. Computer Journal, 16, 93–95, 1973.

I will now turn attention to the early years of the Computer Journal.

4.5 Early Computer Journal

A leader in early clustering developments and in information retrieval, C.J. (Keith) van Rijsbergen (now Glasgow University) was Editor-in-Chief of the Computer Journal from 1993 to 2000. A few of his early papers include the following.

1. C.J. van Rijsbergen, “A clustering algorithm”, Computer Journal, 13, 113–115, 1970.
2. N. Jardine and C.J. van Rijsbergen, “The use of hierarchic clustering in information retrieval”, Information Storage and Retrieval, 7, 217–240, 1971.
3. C.J. van Rijsbergen, “Further experiments with hierarchic clustering in document retrieval”, Information Storage and Retrieval, 10, 1–14, 1974.
4. C.J. van Rijsbergen, “A theoretical basis for the use of co-occurrence data in information retrieval”, Journal of Documentation, 33, 106–119, 1977.

From 2000 to 2007, I was in this role as Editor-in-Chief of the Computer Journal. I wrote in an editorial for the 50th Anniversary in 2007 the following:

“When I pick up older issues of the Computer Journal, I am struck by how interesting many of the articles still are. Some articles are still very highly cited, such as Fletcher and Powell on gradient descent.
Others, closer to my own heart, on clustering, data analysis, and information retrieval, by Lance and Williams, Robin Sibson, Jim Rohlf, Karen Spärck Jones, Roger Needham, Keith van Rijsbergen, and others, to my mind established the foundations of theory and practice that remain hugely important to this day. It is a pity that journal impact factors, which mean so much for our day to day research work, are based on publications in just two previous years. It is clear that new work may, or perhaps should, strike out to new shores, and be unencumbered with past work. But there is of course another also important view, that the consolidated literature is both vital and a well spring of current and future progress. Both aspects are crucial, the ‘sleep walking’ innovative element, to use Arthur Koestler’s [30] characterization, and the consolidation element that is part and parcel of understanding.”

The very first issue of the Computer Journal in 1958 had articles by the following authors – note the industrial research lab affiliations for the most part:

1. S. Gill (Ferranti Ltd.), “Parallel Programming”, pp. 2–10.
2. E.S. Page
3. D.T. Caminer (Leo Computers Ltd.)
4. R.A. Brooker (Computing Machine Laboratory, University of Manchester)
5. R.G. Dowse and H.W. Gearing (Business Group of the British Computer Society)
6. A. Gilmour (The English Electric Company Ltd.)
7. A.J. Barnard (Norwich Corporation)
8. R.A. Fairthorne (Royal Aircraft Establishment, Farnborough)
9. S.H. Hollingdale and M.M. Barritt (RAE as previous)

Then from later issues I will note some articles that have very clear links with data analysis:

1. Vol. 1, No. 3, 1958, J.C. Gower, “A note on an iterative method for root extraction”, 142–143.
2. Vol. 4, No. 1, 1961, M.A. Wright, “Matching inquiries to an index”, 38–41.
3. Vol. 4, No.
2, 1961, had lots of articles on character recognition.
4. Vol. 4, No. 4, 1962, J.C. Gower, “The handling of multiway tables on computers”, 280–286.
5. In Vol. 4, No. 4, and in Vol. 6, No. 1, there were articles on regression analysis.
6. Vol. 7, No. 2, 1964, D.B. Lloyd, “Data retrieval”, 110–113.
7. Vol. 7, No. 3, 1964, M.J. Rose, “Classification of a set of elements”, 208–211. Abstract: “The paper describes the use of a computer in some statistical experiments on weakly connected graphs. The work forms part of a statistical approach to some classification problems.”

5 Conclusions

In this article I have focused on early developments in the data mining or unsupervised view of data analysis. Some of those I have referred to, e.g. Vannevar Bush and Paul Otlet, became obscured or even eclipsed for a while. It is clear however that undisputable progress in the longer term may seem to develop in fits and starts when seen at finer temporal scales. (This is quite commonplace. In literature, see how the centenary of Goethe’s birth in 1849, following his death on 22 March 1832, passed unnoticed. Goethe did not come to the fore until the 1870s.)

What I am dealing with therefore in exploratory, heuristic and multivariate data analysis has led me to a sketch of the evolving spirit of the times. This sketch has taken in the evolution of various strands in academic disciplines, scholarly research areas, and commercial, industrial and economic sectors.

I have observed how the seeds of the present – in fact, remarkably good likenesses – were often available in the period up to 1985 that is mainly at issue in this article. This includes the link between scholarly activity and economic and commercial exploitation. It includes various aspects of the PhD degree.
The consequences of the data mining and related exploratory multivariate data analysis work overviewed in this article have been enormous. Nowhere have their effects been greater than in current search engine technologies. Also wide swathes of database management, language engineering, and multimedia data and digital information handling, are all directly related to the pioneering work described in this article.

In section 4.1 I looked at how early exploratory data analysis had come to play a central role in our computing infrastructure. An interesting view has been offered by [3], finding that all of science too has been usurped by exploratory data analysis, principally through Google’s search facilities. Let us look at this argument with an extended quotation from [3].

“ ‘All models are wrong, but some are useful.’ So proclaimed statistician George Box 30 years ago, and he was right. ... Until now. ...

At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later. ...

Speaking at the O’Reilly Emerging Technology Conference this past March [2008], Peter Norvig, Google’s research director, offered an update to George Box’s maxim: ‘All models are wrong, and increasingly you can succeed without them.’ ...

This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do?
The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves. The big target here isn’t advertising, though. It’s science. The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years. ... There is now a better way. Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.”

This interesting view, inspired by our contemporary search engine technology, is provocative. The author maintains that: “Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.” No, in my view, the sciences and humanities are not to be consigned to any dustbin of history – far from it.

As I wrote in [39], a partnership is needed rather than dominance of one view or another. “Data analysts have far too often just assumed the potential for extracting meaning from the given data, telles quelles. The statistician’s way to address the problem works well sometimes but has its limits: some one or more of a finite number of stochastic models (often handled with the verve and adroitness of a maestro) form the basis of the analysis. The statistician’s toolbox (or surgical equipment, if you wish) can be enormously useful in practice.
But the statistician plays second fiddle to the observational scientist or theoretician who really makes his or her mark on the discovery. This is not fair. Without exploring the encoding that makes up primary data we know very, very little. (As examples, we have the DNA codes of the human or any animal; discreteness at Planck scales and in one vista of the quantum universe; and we still have to find the proper encoding to understand consciousness.) ... [Through correspondence analysis there] is the possibility opened up for the data analyst, through the data encoding question, to be a partner, hand in hand, in the process of primary discovery.”

References

[1] A.D. Aczel, The Artist and the Mathematician: The Story of Nicolas Bourbaki, the Genius Mathematician Who Never Existed, High Stakes, 2006.
[2] R. Adler, J. Ewing, P. Taylor, Citation Statistics, A report from the International Mathematical Union (IMU) in cooperation with the International Council of Industrial and Applied Mathematics (ICIAM) and the Institute of Mathematical Statistics (IMS), Joint Committee on Quantitative Assessment of Research, 11 June 2008.
[3] C. Anderson, “The end of theory: the data deluge makes the scientific method obsolete”, Wired Magazine, 16 July 2008, http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
[4] A. Badiou, Theoretical Writings, edited and translated by Ray Brassier and Alberto Toscano, Continuum, 2004.
[5] W. Benjamin, “Problems in the sociology of language”, in Walter Benjamin, Selected Writings: 1935–1938 Vol. 3, Harvard University Press, 2002.
[6] J.-P. Benzécri, Histoire et Préhistoire de l’Analyse des Données, Dunod, 1982.
[7] J.-P. Benzécri, “L’avenir de l’analyse des données”, Behaviormetrika, 10, 1–11, 1983. Accessible from: www.correspondances.info
[8] J.-P. Benzécri, “Foreword”, in [37], 2005.
[9] J.-P.
Benzécri, “Si j’avais un laboratoire...”, 11 pp., 2007. Scanned copy at www.correspondances.info
[10] R. Browning, The Byzantine Empire, Catholic University of America Press, 1992.
[11] V. Bush, “As we may think”, Atlantic Monthly, July 1945. http://www.theatlantic.com/doc/194507/bush
[12] N. Chomsky, Syntactic Structures, 2nd edn., Walter de Gruyter, 2002.
[13] W. Clark, Academic Charisma and the Origins of the Research University, Chicago University Press, 2006.
[14] The Classification Society, Record of the inaugural meeting, held at the offices of Aslib, 3 Belgrave Square, London S.W. 1 at 2.30 p.m. on Friday, 17 April, 1964. http://thames.cs.rhul.ac.uk/~fionn/classification-society/ClassSoc1964.pdf
[15] J. Couzin, “Survey finds citations growing narrower as journals move online”, Science, 321, 329, 2008.
[16] C. de Rham, “La classification hiérarchique ascendante selon la méthode des voisins réciproques”, Les Cahiers de l’Analyse des Données, V, 135–144, 1980.
[17] E.W. Dijkstra, “The humble programmer”, ACM Turing Lecture, 1972, Communications of the ACM, 15 (10), 859–866, 1972.
[18] J.A. Evans, “Electronic publishing and the narrowing of science and scholarship”, Science, 321, 395–399, 2008.
[19] M. Foucault, Surveiller et Punir: Naissance de la Prison, Gallimard, 1975.
[20] Galileo Galilei, Il Saggiatore (The Assayer) in Opere, vol. 6, p. 197, translation by Julian Barbour. Cited on p. 659 of N. Stephenson, Quicksilver, The Baroque Cycle, Vol. 1, William Morrow, 2003.
[21] J.-C. Guédon, In Oldenburg’s Long Shadow: Librarians, Research Scientists, Publishers, and the Control of Scientific Publishing, Association of Research Libraries, 2001, 70 pp., http://www.arl.org/resources/pubs/mmproceedings/138guedon.shtml
[22] M.D. Hauser and T.
Bever, “A biolinguistic agenda”, Science, 322, 1057–1059, 2008.
[23] C. Hayashi, “On the prediction of phenomena from qualitative data and the quantification of qualitative data from the mathematico-statistical point of view”, Annals of the Institute of Statistical Mathematics, 3, 69–98, 1952. Available at: http://www.ism.ac.jp/editsec/aism/pdf/003_2_0069.pdf
[24] C. Hayashi, “Multidimensional quantification I”, Proceedings of the Japan Academy, 30, 61–65, 1954. Available at: http://www.ism.ac.jp/editsec/aism/pdf/005_2_0121.pdf
[25] A. Heck and F. Murtagh, Eds., Intelligent Information Retrieval: The Case of Astronomy and Related Space Science, Kluwer, 1993.
[26] A. Johns, The Nature of the Book: Print and Knowledge in the Making, University of Chicago Press, 1998.
[27] D.H.P. Jones, “Was the Carte du Ciel an obstruction to the development of astrophysics in Europe?”, in A. Heck, Ed., Information Handling in Astronomy – Historical Vistas, Springer, 2002.
[28] J. Juan, “Le programme HIVOR de classification ascendante hiérarchique selon les voisins réciproques et le critère de la variance”, Les Cahiers de l’Analyse des Données, VII, 173–184, 1982.
[29] J. Juan, “Programme de classification hiérarchique par l’algorithme de la recherche en chaîne des voisins réciproques”, Les Cahiers de l’Analyse des Données, VII, 219–225, 1982.
[30] A. Koestler, The Sleepwalkers: A History of Man’s Changing Vision of the Universe, Penguin, 1989. (Originally published 1959.)
[31] Mathematics Genealogy, www.genealogy.ams.org (127,901 entries as of 15 November 2008).
[32] G. Miller, “Growing pains for fMRI”, Science, 320, 1412–1414, 13 June 2008.
[33] F. Murtagh, “A survey of recent advances in hierarchical clustering algorithms”, The Computer Journal, 26, 354–359, 1983.
[34] F.
Murtagh, “Expected-time complexity results for hierarchic clustering algorithms which use cluster centres”, Information Processing Letters, 16, 237–241, 1983.
[35] F. Murtagh, “Complexities of hierarchic clustering algorithms: state of the art”, Computational Statistics Quarterly, 1, 101–113, 1984.
[36] F. Murtagh, Multidimensional Clustering Algorithms, Physica-Verlag, Würzburg, 1985.
[37] F. Murtagh, Correspondence Analysis and Data Coding with Java and R, Chapman and Hall/CRC, 2005.
[38] F. Murtagh, “Multivariate data analysis software and resources”, http://astro.u-strasbg.fr/~fmurtagh/mda-sw
[39] F. Murtagh, “Reply to: Review by Jan de Leeuw of Correspondence Analysis and Data Coding with Java and R, F. Murtagh, Chapman and Hall/CRC, 2005”. Journal of Statistical Software, Vol. 14. http://www.correspondances.info/reply-to-Jan-de-Leeuw.pdf
[40] F. Murtagh, “Between the information economy and student recruitment: present conjuncture and future prospects”, Upgrade: The European Journal for the Informatics Professional, forthcoming, 2008.
[41] National Institutes of Health Public Access, http://publicaccess.nih.gov, 2008. Citation from: “The NIH Public Access Policy implements Division G, Title II, Section 218 of PL 110-161 (Consolidated Appropriations Act, 2008).”
[42] P. Otlet, Traité de Documentation, Brussels, 1934. https://archive.ugent.be/handle/1854/5612
[43] W.B. Rayward, International Organisation and Dissemination of Knowledge: Selected Essays of Paul Otlet, Elsevier, 1990. http://www.archive.org/details/internationalorg00otle
[44] G. Ross, “Earlier days of computer classification”, Vote of thanks, Inaugural Lecture, F. Murtagh, “Thinking ultrametrically: understanding massive data sets and navigating information spaces”, Royal Holloway, University of London, 22 Feb.
2007, http://thames.cs.rhul.ac.uk/~fionn/inaugural
[45] N. Stephenson, The System of the World, The Baroque Cycle, Vol. 3, Arrow Books, 2005.
[46] B. Suzanne, “Frequently asked questions about Plato”, 2004, http://plato-dialogues.org/faq/faq009.htm
[47] D. Swade, The Cogwheel Brain: Charles Babbage and the Quest to Build the First Computer, Little, 2000.
[48] E. Wüster, Internationale Sprachnormung in der Technik, besonders in der Elektrotechnik, Bern, 1931.
[49] J. Whitfield, “Collaboration: group theory”, Nature News items, Nature, 455, 720–723, 2008.
[50] P. Yourgrau, A World Without Time: The Forgotten Legacy of Gödel and Einstein, Allen Lane, 2005.

Appendix: Sources for Early Work

• Classification Literature Automated Search Service, a CD distributed currently with the first issue each year of the Journal of Classification. See http://www.classification-society.org/csna
The following books have been scanned and are available in their entirety on the CD.
1. Algorithms for Clustering Data (1988), AK Jain and RC Dubes
2. Automatische Klassifikation (1974), HH Bock
3. Classification et Analyse Ordinale des Données (1981), IC Lerman
4. Clustering Algorithms (1975), JA Hartigan
5. Information Retrieval (1979, 2nd ed.), CJ van Rijsbergen
6. Multidimensional Clustering Algorithms (1985), F Murtagh
7. Principles of Numerical Taxonomy (1963), RR Sokal and PHA Sneath
8. Numerical Taxonomy: the Principles and Practice of Numerical Classification (1973), PHA Sneath and RR Sokal
• Les Cahiers de l’Analyse des Données was the journal of Benzécri’s lab from 1975 up to 1997, with four issues per year. Scanning of all issues has started, working chronologically backwards with thus far 1994–1997 covered.
See http://thames.cs.rhul.ac.uk/~fionn/CAD
• Some texts by Jean-Paul Benzécri and Françoise Benzécri-Leroy, published between 1954 and 1971, are available at http://www.numdam.org (use e.g. “benzécri” as a search term).