The Structure of Narrative: the Case of Film Scripts

We analyze the style and structure of story narrative using the case of film scripts. The practical importance of this is noted, especially the need to have support tools for television movie writing. We use the Casablanca film script, and scripts fr…

Authors: Fionn Murtagh – Department of Computer Science, Royal Holloway, University of London

Fionn Murtagh (1), Adam Ganz (2) and Stewart McKie (2)
(1) Department of Computer Science
(2) Department of Media Arts
Royal Holloway, University of London, Egham TW20 0EX, England
Contact author: fmurtagh at acm dot org
May 28, 2018

Abstract

We analyze the style and structure of story narrative using the case of film scripts. The practical importance of this is noted, especially the need to have support tools for television movie writing. We use the Casablanca film script, and scripts from six episodes of CSI (Crime Scene Investigation). For analysis of style and structure, we quantify various central perspectives discussed in McKee’s book, Story: Substance, Structure, Style, and the Principles of Screenwriting. Film scripts offer a useful point of departure for exploration of the analysis of more general narratives. Our methodology, using Correspondence Analysis, and hierarchical clustering, is innovative in a range of areas that we discuss. In particular this work is groundbreaking in taking the qualitative analysis of McKee and grounding this analysis in a quantitative and algorithmic framework.

Keywords: data mining, data analysis, factor analysis, correspondence analysis, semantic space, Euclidean display, hierarchical clustering, narrative, story, film script

1 Introduction

As a framework for analysis of narrative in many areas of application, film scripts have a great deal to offer. We present quite a number of innovations in this work: quantifying and automating a range of qualitative ways of addressing pattern recognition in narrative; use of metric embedding using Correspondence Analysis, and ultrametric embedding through hierarchical clustering of a data sequence, in order to capture semantics in the data; and verifying experimentally the well foundedness of much that McKee [15] describes in qualitative terms.
Other than the data mining, two distinct levels of user are at issue here. First and foremost, we have the scriptwriter or screenwriter in mind. See [15] for a content-based description of the scriptwriting process. Secondly and more indirectly we have the movie viewer in mind. The feasibility of using statistical learning methods in order to map a characterization of film scripts (essentially using 22 characteristics) onto box office profitability was pursued by [7]. The importance of such machine learning of what constitutes a good quality and/or potentially profitable film script has been the basis for (successful) commercial initiatives, as described by [8]. The business side of the movie business is elaborated on in some depth in terms of avenues to be explored in the future, in [6].

A film script is semi-structured in that it is subdivided into scenes and sometimes other structural units. Furthermore there is metadata provided related to location (internal, external; particular or general location name); characters; day, night. There is also dialog and description of action in free text. There are literally thousands of film scripts, for all genres, available and openly accessible (e.g. IMSDb, The Internet Movie Script Database, www.imsdb.com). While offering just one data modality, viz. text, there is close linkage to other modalities, visual, speech and often music.

An area of application of our work that is of particular importance is to movies for television. In contrast to a cinema movie, in television a serial dramatization is formulaic in its set of characters, their actions, location, and in content generally. In form it is even more formulaic, in length, initial scenes, positioning of advertisement breakpoints, and in other aspects of format. While screenwriting and subsequent aspects of cinema movie creation and development have been much studied (e.g.
[15]), film writing for television has been less so. The latter is often team-based, and much more “managed” on account of its linkages with other dramatizations in the same series, or in closely related series. These properties of television serial dramatization lead, perhaps even more than for cinema movie, to a need for tools to quantitatively support script writing.

Film scripts constitute an outstanding template or model for other domains. One longer term goal of our work is for our tools to provide a platform for introducing interactivity into a movie. Interactivity with a film or video includes the following possible uses: (i) it provides an approach to developing interactive games; (ii) it allows for use in interactive and even immersive training and learning environments; and (iii) support is also feasible for use in the entertainment area, e.g. interactive television. Leveraging the existing corpus and “fan base” of hit scripts to create new interactive versions could create a whole new sector of movie script based interactive games or filmed variant-sequels to famous movies that take the familiar in a different direction. The convergence of story-telling and narrative, on the one hand, and games, on the other, is explored in [9]. There and in [20], story graphs and narrative trees are used as a way to open up to the user the range of branching possibilities available in the story.

A further longer term goal lies in the area of business analysis, especially in distributed, video-conferencing and other, settings. The narrative structure, similar to a film script, gives the analyst the framework for generating stories. Such narrative structures can be used for analysis of meetings, projects, dialogs, and any interactive sessions in real or virtual meetings.

Our work is hugely topical. A television serial episode costs between $2 million and $3 million for one hour of television.
New series are scripted and run on television channels and only then is viewer reaction used to determine their viability. This points to major potential savings. Understanding and supporting script writing is crucial. An additional motivation is that the entertainment and cultural market is moving towards greater interactivity, and our work is of great potential here too.

The structure of the paper is as follows. In section 2, we provide essential background on the analysis approach taken. In section 3, we discuss the plausibility of taking text as a proxy for, or a practical and useful expression of, underlying content that we refer to as the story. In section 4, we describe briefly the data analysis algorithms that we use, providing citations to more detailed background reading. In section 5 we study a number of aspects of the Casablanca movie script. In section 6, we turn attention to the television series, CSI (Crime Scene Investigation, Las Vegas).

Among important aspects of this work are the following. Firstly, we recast a number of issues discussed by McKee [15] in a quantitative and algorithmic form.

Secondly, we employ Correspondence Analysis as a versatile data analysis framework. Since in Correspondence Analysis, each scene is expressed as an average of attributes used to characterize the scenes, and since each attribute is expressed as an average of the scenes it characterizes, we have in Correspondence Analysis a way to represent and study semantics.

Thirdly, our clustering of film script units (e.g., scenes) is innovative in a few ways, including respecting the sequence of film script units, and taking as input the “direction” of the film script content rather than having a more static framework for the input (which we found empirically to work less well in that it was far less discriminatory). This type of clustering captures the semantics of change.
If scenes are very dissimilar they will be agglomerated late in the sequence of agglomerations.

2 Analysis of Change in Content Over Time

Analysis of a film script has everything to do with change in the course of the story, as we will now discuss, following McKee [15]. “The finest writing ... arcs or changes ... over the course of the telling” (p. 104, [15]).

A typical film has 40 to 60 “story events” or scenes. A scene is a story in miniature and must have activity or change. A scene will typically (with, of necessity, large variability) translate into 2 to 3 minutes of film. In our work we have examples of very short (one or two sentence) dialog or action descriptions in scenes. Compared to longer scenes, such short scenes can be equally revealing and significant.

Units of action or behavior within a scene are termed beats by McKee. We will examine beats below in a case study using the Casablanca script. Ideally every scene is a turning point, in character behavior, or in changing the “values” involved from positive to negative or vice versa.

Moving upwards now in scale of units of film script, a sequence is a series of typically 2 to 5 scenes. A sequence expresses significant albeit moderate change. An act is a set of sequences expressing major impact. Acts are the “macro-structure of story” (p. 217, [15]). It is possible to have a one-act play or story, although this would be quite unusual.

The overall set of scenes (or sequences, or acts) is termed the plot. A climax in a sequence or act or plot is essential. It is final and irreversible (pp. 41–42, [15]). It is possible to distinguish between a closed and an open ending, respectively bringing finality versus leaving some issues hanging, and this does not alter the all-important climax. A sequence climax is of moderate importance, an act climax is of greater importance, and the plot climax is crucial.
In this scheme of things, each scene is a turning point of limited, moderate or great importance in the context of the overall plot or story. The degree of change is the difference between a scene and the scene that climaxes a sequence, or the scene that climaxes an act, or the overall and global climax of the plot. In the plot, there is no room for a missing scene, or a superfluous scene, or a misordering of scenes.

Rhythm and tempo are related to scene length. The former, rhythm, should be very variable in order to keep the viewer’s (reader’s) attention, allowing for a soft upper limit on the time-span of attention. Tempo can exemplify climax in two ways: firstly, through increasingly shortened scenes (or other units) leading up to a climax; and, secondly, through the climax being of clearly greater length than the preceding scene (in order to allow for sufficient time to elaborate and offload its content). A decrease in tempo, through a sudden increase in scene length, can be indicative of dénouement and climax. Increasing tempo, manifested through successively decreasing scene length, expresses a build-up of tension.

Our discussion up to now has been about story. We turn to types of structure. McKee [15] categorizes design into classical; minimalist or miniplot; and anti-structure or antiplot. In this categorization scheme for film, the minimalist design dovetails very well with television film and drama, with e.g. open ending, unreconciled internal conflict, and passive and/or multiple protagonists. Below we will explore case studies based on episodes of CSI (Crime Scene Investigation).

3 Justification for Textual Analysis of Film Script

In this section we address the issue of plausibility of appreciable analysis of content based on what are ultimately the statistical frequencies of co-occurrence of words.

Words are a means or a medium for getting at the substance and energy of a story (p. 179, [15]).
Ultimately sets of phrases express such underlying issues (the “subtext”, as expressed by McKee, a term we avoid due to possible confusion with subsets of text) as conflict or emotional connotation (p. 258). We have already noted that change and evolution is inherent to a plot. Human emotion is based on particular transitions. So this establishes well the possibility that words and phrases are not taken literally but instead can appropriately capture and represent such transition. Text, says McKee, is the “sensory surface” of a work of art (counterposing it to the subtext, or underlying emotion or perception).

Simple words can express complex underlying reality. Aristotle, for example, used words in common usage to express technically loaded concepts ([18], p. 169), and Freud did also.

Best practice in film script writing includes the following. Present tense dominates all ([15], p. 395): “The ontology of the screen is an absolute present tense in constant vivid movement” (emphasis in original). Clipped diction is needed. Generic nouns are avoided in favor of specific terms, and similarly adjectives and adverbs are to be avoided. The verb “to be” is to be avoided because: “Onscreen nothing is in a state of being; story life is an unending flux of change, of becoming” (p. 396). Simile and metaphor are out, as is any explicit positioning of context on behalf of the reader of the script or viewer of the film.

4 Methodology: Euclidean Embedding through Correspondence Analysis, and Clustering the Succession of Film Script Scenes

We have already noted (section 1) some novel aspects of our methodology. We begin with the display of data (e.g., scenes and/or words) where visualization of relationships is greatly facilitated by having a Euclidean embedding.
We show how Correspondence Analysis furnishes such a metric space embedding of the information present in the film script text, and furthermore how this facilitates an ultrametric (i.e. hierarchical) embedding that takes account of the temporal, semantic dynamic of the film script narrative.

4.1 A Note on Correspondence Analysis

Correspondence Analysis [18] takes input data in the form of frequencies of occurrence, or counts, and other forms of data, and produces such a Euclidean embedding. The Appendix provides a short introduction to Correspondence Analysis and hierarchical clustering.

We start with a cross-tabulation of a set of observations and a set of attributes. This starting point is an array of counts of presence versus absence, or frequency of occurrence. From this input data, we can embed the observations and attributes in a Euclidean space. This factor space is mathematically optimal in a certain sense (using the least squares criterion, which is also Huyghens’ principle of decomposition of inertia). Furthermore a Euclidean space allows for easy visualization that would be more awkward to arrange otherwise.

A third reason for the particular embedding used for the observations and attributes in the Euclidean factor space is that weighting of observations and attributes is handled naturally in this framework. The issue of weighting has to be addressed somehow, with one option being to treat the set (of observations or of attributes) as identically weighted, and hence of equal a priori importance. Very often either observations or attributes follow a power law: examples of such power laws include Zipf’s law (in natural language texts, the word frequency is inversely proportional to its rank), the Pareto distribution in economics, and many others [19]. Correspondence Analysis handles weighting of observations and attributes at the “core” of its algorithm [18].
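As an illustration of the embedding just described, the following is a minimal sketch, assuming NumPy is available, of Correspondence Analysis computed through the singular value decomposition of standardized residuals. It follows the standard textbook formulation and is not necessarily the implementation used for the results reported here; the function name `correspondence_analysis` and the toy table are hypothetical. The row and column masses play the role of the weights mentioned above.

```python
import numpy as np

def correspondence_analysis(counts):
    """Return row coordinates, column coordinates and inertia rates.

    `counts` is a cross-tabulation (observations x attributes) with
    no all-zero row or column. This is an illustrative sketch only.
    """
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                      # relative frequencies
    r = P.sum(axis=1)                    # row masses (observation weights)
    c = P.sum(axis=0)                    # column masses (attribute weights)
    # Standardized residuals from the independence model r * c.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sv > 1e-12                    # drop numerically null factors
    U, sv, Vt = U[:, keep], sv[keep], Vt[keep]
    rows = (U * sv) / np.sqrt(r)[:, None]     # principal coordinates of rows
    cols = (Vt.T * sv) / np.sqrt(c)[:, None]  # principal coordinates of columns
    inertia = sv ** 2
    return rows, cols, inertia / inertia.sum()

# Toy presence/absence table: 3 observations by 4 attributes (made up).
toy = np.array([[1, 0, 1, 1],
                [0, 1, 1, 0],
                [1, 1, 0, 1]])
rows, cols, rates = correspondence_analysis(toy)
```

In the full-dimensionality output, Euclidean distances between rows of `rows` equal the chi-squared distances between the corresponding row profiles, which is the sense in which the factor space is a Euclidean embedding of the input table.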
Application of the power law property to social networks has included the network of movie actors [19]. As counterposed to such work, our interest is in analyzing a time series of data. The succession of scenes in a movie, or acts in a play, exemplify this.

4.2 Input Data Used

We take all words into account in the semi-structured texts that are provided by film (or television program) scripts. Punctuation is ignored. Upper case is converted to lower case for our purposes. Words must be at least two characters in length. Any numerical figure, or term beginning with a numerical, is ignored. We have already noted above (section 3; see also the case studies and discussion in [18]) why the use of all words in this way is, in principle, feasible and justified.

Occurrences or presence/absence data therefore are the point of departure. Each scene is cross-tabulated by the set of all words so that, in this cross-tabulation table, at the intersection of scene i and word j we have a presence (1) or absence (0) value.

To employ notions of change or proximity between scenes, we need this data to be appropriately represented in a numerically well-defined semantic space. This is provided by mapping the frequencies of occurrence data into a Euclidean space, using Correspondence Analysis.

4.3 Hierarchical Agglomerative Clustering Respecting Sequence

Unusually in this work, we use hierarchical clustering taking a sequence of film script units (e.g., scenes) into account. Our motivation is the following: large ultrametric or tree distances derived from the hierarchy have a ready interpretation in terms of change. A brief introduction is provided in the Appendix. A short discussion follows.

Hierarchical clustering is carried out through a sequence of agglomerations of successive scenes or of clusters (or temporal segments or intervals) of successive scenes.
Respect for the sequentiality of the scenes is ensured through the requirement that agglomerands must be adjacent. In addition to this, the least dissimilar scenes are checked, with the criterion being: merge two clusters when the greatest dissimilarity between cluster members (scenes) is minimal. The dissimilarity of the most dissimilar pair of scenes, considering two potential agglomerands, is what we use as our agglomeration criterion. This is the contiguity-constrained complete link hierarchical clustering method. Apart from heuristic reasons for favoring it, it has some further properties that will now be discussed briefly.

The sequence-related adjacency requirement must take into consideration whether an agglomerative clustering method would give rise to an inversion, i.e., a later agglomeration in the sequence of agglomerations would have an associated criterion value that is less than the previous criterion value. It is not hard to appreciate that our desire to have gradations of distance represented by the dendrogram would be negated, and severely so, by such absence of the monotonicity of criterion value, which amounts to a contradiction in interpretation of the dendrogram.

It is shown in [17] that two algorithms are feasible. What is involved here is sketched out as follows.

Contiguity-constrained single link hierarchical clustering is simultaneously hierarchical clustering on the spanning graph. This is easy to implement (inefficiently): just fix an infinite (or very large) distance between non-contiguous pairs and proceed to use single link hierarchical clustering [17].

The complete link method, with the constraint that at least one member of each of the two clusters to be agglomerated be contiguous, is guaranteed not to give rise to inversions.
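The sequence-constrained complete link criterion described above can be illustrated with a minimal sketch in plain Python. It is a naive implementation that rescans all neighbouring segments at each step, not the nearest neighbor chain algorithm of [17]; the function name `sequence_complete_link` and the toy points are hypothetical, and points are assumed to be given in sequence order as coordinate lists.

```python
from math import dist

def sequence_complete_link(points):
    """Agglomerate adjacent segments of a sequence by complete link.

    Returns a list of (left_segment, right_segment, height) merges,
    where each segment is a (start, end) pair of inclusive indices.
    """
    segments = [(i, i) for i in range(len(points))]
    merges = []

    def complete_link(a, b):
        # Greatest pairwise distance between members of the two segments.
        return max(dist(points[i], points[j])
                   for i in range(a[0], a[1] + 1)
                   for j in range(b[0], b[1] + 1))

    while len(segments) > 1:
        # Only neighbouring segments are candidates for agglomeration.
        heights = [complete_link(segments[k], segments[k + 1])
                   for k in range(len(segments) - 1)]
        # Merge the neighbouring pair whose complete link value is least.
        k = min(range(len(heights)), key=heights.__getitem__)
        left, right = segments[k], segments[k + 1]
        merges.append((left, right, heights[k]))
        segments[k:k + 2] = [(left[0], right[1])]
    return merges

# Toy one-dimensional sequence of five units (made-up values).
toy_points = [[0.0], [1.0], [3.0], [10.0], [11.5]]
toy_merges = sequence_complete_link(toy_points)
```

On this toy sequence the final agglomeration joins the segment of the first three units with the segment of the last two, and the merge heights never decrease, which is the monotonicity property needed for an interpretable dendrogram.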
The O(n^2) time, O(n^2) space algorithm for the complete link method, based on the nearest neighbor chain (see [17]), is easily modified to include an additional testing of contiguity whenever a linkage in the nearest neighbor chain is created.

In this work we use the latter, in view of the well-balanced hierarchies typically produced, [16].

5 Casablanca Film Script Analysis

5.1 Data Used

Based on the unpublished 1940 screenplay by Murray Burnett and Joan Allison, [3], the Casablanca script by Julius J. Epstein, Philip G. Epstein and Howard Koch led to the film directed by Michael Curtiz, produced by Hal B. Wallis and Jack L. Warner, and shot by Warner Bros. between May and August 1942. We used the script from The Internet Movie Script Database, IMSDb (www.imsdb.com).

The Casablanca film script comprises 77 successive scenes. All told, there were 6710 words in these successive scenes. The source text for the 77 scenes, including metadata, varied between just 5 words, and 1017 words (scene 22). Typical Zipf distributional behavior is seen in the marginals.

All words were used, following the imposing of lower case throughout. This was subject only to words being longer than one character. No stemming or other preprocessing was used. The top word frequencies were interesting: the, 965; rick, 689; you, 651; to, 574; and, 435; in, 332; of, 319; renault, 284; it, 271; ilsa, 256; laszlo, 255; he, 236; is, 232; at, 192; that, 171; for, 158; we, 151; on, 149; strasser, 135. The numerically high presence of personal names is quite unusual relative to more general texts, and characterizes this film script text. A major reason for this is that character names head up each dialog block.

Casablanca is based on a range of miniplots. This occasions considerable variety. Miniplots include: love story, political drama, action sequences, urbane drama, and aspects of a musical.
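The word selection rules given in sections 4.2 and 5.1 can be sketched as follows, in plain Python. This is an illustrative reading of those rules rather than the code actually used: in particular, splitting on every non-alphanumeric character (so that an apostrophe separates a word from its suffix) is an assumption, and the function names and the toy scene strings are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case words of two or more characters, not starting with a digit."""
    # Assumption: any non-alphanumeric character acts as a separator.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if len(t) >= 2 and not t[0].isdigit()]

def presence_table(scenes):
    """Cross-tabulate scenes by words with 1 (present) or 0 (absent)."""
    tokenized = [set(tokenize(s)) for s in scenes]
    vocabulary = sorted(set().union(*tokenized))
    table = [[1 if w in words else 0 for w in vocabulary]
             for words in tokenized]
    return vocabulary, table

def word_frequencies(scenes):
    """Count every retained word over all scenes."""
    return Counter(t for s in scenes for t in tokenize(s))

# Two made-up "scenes", for illustration only.
toy_scenes = ["RICK: The visas, at 10pm!", "ILSA: The visas? Rick has 2."]
vocabulary, table = presence_table(toy_scenes)
frequencies = word_frequencies(toy_scenes)
```

A table of this kind, with scenes as rows and words as columns, is the input from which the Correspondence Analysis embedding is computed; the frequency counts correspond to the kind of ranked word list quoted above.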
The composition of Casablanca is said by McKee [15] to be “virtually perfect” (p. 287).

5.2 Analysis of Casablanca’s “Mid-Act Climax”, Scene 43

To illustrate our methodology, applied below to scenes, we look first in depth at one particular scene. This scene is discussed in detail in McKee [15], who subdivides it into 11 successive beats (here understood as subscenes). It relates to Ilsa and Rick seeking black market exit visas.

Beat 1 is Rick finding Ilsa in the market. Beats 2, 3 and 4 are rejections of him by Ilsa. Beats 5 and 6 express rapprochement by both characters. Beat 7 is guilt-tripping by each in turn. Beat 8 is a jump in content: Ilsa says she will leave Casablanca soon. In beat 9, Rick calls her a coward, and Ilsa calls him a fool. Beat 10 is a major push by Rick, essentially propositioning her. In the climax in beat 11, all goes to rack and ruin as Ilsa says she was married to Laszlo all along, and Rick counters that she is a whore.

Figure 1 shows the best planar projection of the beats. The dots show the locations of the words. The importance of the factors is defined by the percentages of inertia explained by the factors (see Appendix for background details).

We could certainly look further at what words are closest to beat 8. (They are words having to do with Ilsa announcing that she will leave Casablanca.) Look though at the following aspects of Figure 1, as illustrated in Figure 2. Beats 2, 3 and 4 are moving nicely in one direction, so we can claim that moving in the positive direction of the ordinate (vertical axis) is reinforcement of Ilsa’s rejection of Rick. As against this, movement in the negative direction of the ordinate expresses rapprochement by Ilsa and Rick: look at beat 10. Beat 8 is way off, and for Rick points to a real possibility of losing the game (of re-captivating Ilsa).
The climax in beat 11 moves distinctly away from Rick’s aspirations as expressed in beat 10.

The length of the beat can show a lead-up to a climax in the scene, as noted in section 2. We see this very well in the beats of scene 43: the final five beats have lengths (in terms of presence of the words we use) of 50, 44, 38, 30, and then in the climax beat, 46. Earlier beats vary in length, with successive word counts of 51, 23, 99, 39, 30, 17.

The overall change in this scene, scene 43, is defined by the difference between closing and opening beats. Given the Correspondence Analysis output, where we have a Euclidean embedding taking care of weighting and normalization on both beat and word sets, we can easily take the full-dimensionality embedding (unlike the 2-dimensional projection seen in Figure 1) and determine this distance between beat number 1 and beat number 11. As suggested very strongly by Figure 1, this distance will not necessarily be the greatest distance among successive beats.

[Figure 1 plot: “Principal plane of 11 beats in scene 43”. Axes: Factor 1, 12.6% of inertia; Factor 2, 12.2% of inertia.]

Figure 1: Correspondence Analysis of scene 43, which crosses (with presence/absence values) 11 successive beats (numbers) with, in total, 210 words (dots: not labeled for clarity here).

[Figure 2 plot: “Principal plane of 11 beats in scene 43”. Axes: Factor 1, 12.6% of inertia; Factor 2, 12.2% of inertia.]

Figure 2: As Figure 1, with some major movements from beat to beat noted and discussed in the text.

We reiterate that Figure 1 provides us with a planar projection of the beats, which is optimal in a least squares sense, but is of necessity an approximation to the full-dimensionality clouds of beat, and word, points. The quality of this best fit approximation is roughly 24.8% (i.e., the sum of inertias explained by the two axes of Figure 1) of the information content of the overall cloud, considered as either the cloud of beats, or the cloud of words.

For clustering the data displayed in Figure 1 we will use the full dimensionality. We have noted above some of the changes in direction in the succession of beats, as displayed in Figure 1: 2, 3 and 4 following a particular sweep; 10 reverses this; and so on. Let us therefore look at the clustering of beats based solely on changes in direction or orientation.
In the full-dimensionality Correspondence Analysis embedding we will look not at the positions of the beats but instead at their correlation with the factors (i.e., axes or coordinates). Changes from one beat’s correlations with all factors, to those of the next beat, admirably express change in orientation of these successive beats.

Figure 3 shows the hierarchical clustering of the correlations (with all factors) for the 11 successive beats, using the sequence- or chronology-constrained agglomerative method discussed in section 4.3. Note how beats 2, 3, 4 are clustered together; how 5, 6, 7 have a certain unity too; and in particular how beat 8 is a sort of major caesura in the overall sequence of beats.

We do not find everything needed to understand the beat succession of scene 43 of Casablanca in the vantage points offered by Figures 1 and 3. But we are gathering very useful perspectives on this scene. To see how useful this is, let us carry out a benchmarking or baselining of what we see against an alternative of a randomized set of 11 beats in scene 43.

5.3 The Specific Style of a Film Script

Arising out of our exploration so far, we will use the following indicators of style and structure. To be usable across different film scripts, we must look at aggregate quantities. Here we will use first and second order moments. We continue to use scene 43 of Casablanca, with its 11 successive, constituent beats. The attributes used are as follows.

1. Attributes 1 and 2: The relative movement, given by the mean squared distance from one beat to the next. We take the mean and the variance of these relative movements. Attributes 1 and 2 are based on the (full dimensionality) factor space embedding of the beats.

2. Attributes 3 and 4: the changes in direction, given by the squared difference in correlation from one beat to the next. We take the mean and variance of these changes in direction.
Attributes 3 and 4 are based on the (full dimensionality) correlations with factors.

3. Attribute 5 is mean absolute tempo. Tempo is given by the difference in beat length from one beat to the next. Attribute 6 is the mean of the ups and downs of tempo.

4. Attributes 7 and 8 are, respectively, the mean and variance of rhythm, given by the sums of squared deviations from one beat length to the next.

5. Finally, attribute 9 is the mean of the rhythm taking up or down into account.

Figure 3: Hierarchy of the 11 beats in scene 43, using the relationships of beats defined from changes in orientation. (Dendrogram titled “Hierarchical clustering of 11 beats, using their orientations”; vertical axis from 0.0 to 1.0; leaves are beats 1 to 11.)

For the Casablanca scene 43, we found the following as particularly significant. We tested the given scene, with its 11 beats, against 999 uniformly randomized sequences of 11 beats. If we so wish, this provides a Monte Carlo significance test of a null hypothesis up to the 0.001 level.

• In repeated runs, each of 999 randomizations, we find scene 43 to be particularly significant (in 95% of cases) in terms of attribute 2: variability of movement from one beat to the next is smaller than randomized alternatives. This may be explained by the successive beats relating to coming together, or drawing apart, of Ilsa and Rick, as we have already noted.

• In 84% of cases, scene 43 has greater tempo (attribute 5) than randomized alternatives. This attribute is related to absolute tempo, so we do not consider whether decreasing or increasing.
• In 83% of cases, the mean rhythm (attribute 7) is higher than randomized alternatives.

5.4 Analysis of All 77 Scenes

The clustering hierarchy that we focus on here is based on the orientation of the scenes. This we do by taking any given scene's correlations with the factors. We have observed earlier that the flow of the story, in relative terms, involves many “backs and forths” or “tos and fros”. This justifies looking at whether or not a group of scenes maintains an approximately similar orientation for some time, and at how dramatic the changes in direction are.

Our clustering algorithm takes the sequence of scenes into account. As such, it offers a way to look at change over this sequence progression. We could well construct such a hierarchy of changes on data other than scene orientation or direction. We could use tempo or rhythm, for example. However, orientation or direction serves us very well and provides, already, useful insight into deep structure.

In Figure 4 we see how different scene 1 is, relating to narrated scene-setting of the Second World War. Scene 25 is a flashback to Paris in the spring. Scene 39 is set in a black market in Casablanca, where a Native and a Frenchman appear, but not any of the central characters.

In this hierarchy we can see a pronounced redirection of the story in scenes 38, 39 and 40. (Note how the ultrametric distance between scene 39, on the one hand, and scene 38 or any preceding scene on the other hand, is relatively very great. The ultrametric distance between scene 39 and scene 40 is even greater.) In scene 38, Laszlo and Ilsa are in Renault's office, clarifying their visa situation. The black market of scene 39 points the finger at Signor Ferrari, to be found at the Blue Parrot cafe. Scene 40 is then at the Blue Parrot. The essential issue of Laszlo's role and problem of getting a visa is revealed in these scenes.
In the overall story-line, these scenes are more or less right at the mid-point.

The pairing off of scenes at the end (74 and 75, 76 and 77: respectively, airport, hangar, road, hangar) is very much in keeping with the content of these scenes. Together they represent the climax scenes.

A potential use of Figure 4 is to provide an indication of possible commercial breaks between acts or sequences of scenes, such that these breaks are derived automatically from the screenplay and without the writer explicitly marking them in the text. In a cinema movie such breaks are not pre-planned. The hierarchy provides a visualization allowing comparison between the writer's intentions and one (albeit insightful) view of where these breaks are found to be located, coupled with the strength of the breaks.

We again looked at style and structure, using 999 randomizations of the sequence of 77 scenes. Some interesting conclusions were garnered.

• As for the case of beats in scene 43, we find that the entire Casablanca plot is well-characterized by the variability of movement from one scene to the next (attribute 2). Variability of movement from one scene to the next is smaller than randomized alternatives in 82% of cases.

• Similarity of orientation from one scene to the next (attribute 3) is very tight, i.e. smaller than randomized alternatives. We found this to hold in 95% of cases. The variability of orientations (attribute 4) was also tighter, in 82% of cases.

• Attribute 6, the mean of ups and downs of tempos, is also revealing. In 96% of cases, it was smaller in the real Casablanca, as opposed to the randomized alternatives. This points to the “balance” of up and down movement in pace.

6 Television Series Script Analysis

Our discussion so far has been for the Casablanca movie. Now we turn to television drama, for which other constraining aspects hold (such as length, and inter- as well as intra-cohesion and homogeneity).
Figure 4: Hierarchy of the 77 scenes, respecting the sequence, based on the characterization of the 77 scenes by the set of all words used in the scenes. The directional information (i.e., correlations with the factors) associated with the scenes is used. (Dendrogram; vertical axis from 0.0 to 1.2; leaves are scenes 1 to 77.)

We took three CSI (Crime Scene Investigation, Las Vegas – Grissom, Sara, Catherine et al.) television scripts from series 1:

• 1X01, Pilot, original air date on CBS, Oct. 6, 2000. Written by Anthony E. Zuiker, directed by Danny Cannon.

• 1X02, Cool Change, original air date on CBS, Oct. 13, 2000. Written by Anthony E. Zuiker, directed by Michael Watkins.

• 1X03, Crate 'N Burial, original air date on CBS, Oct. 20, 2000.
Written by Ann Donahue, directed by Danny Cannon.

Note the differences between writers and directors in most cases. This lends weight to our goal of furnishing the producer and director teams with a platform for automatically or semi-automatically assessing quality of product. We will refer to these scripts as CSI-101, CSI-102 and CSI-103. All film scripts were obtained from TWIZ TV (Free TV Scripts & Movie Screenplays Archives), http://twiztv.com

From series 3, we took another three scripts.

• 3X21, Forever, original air date on CBS, May 1, 2003. Written by Sara Goldfinger, directed by David Grossman.

• 3X22, Play With Fire, original air date on CBS, May 8, 2003. Written by Naren Shankar and Andrew Lipsitz, directed by Kenneth Fink.

• 3X23, Inside The Box, original air date on CBS, May 15, 2003. Written by Carol Mendelsohn and Anthony E. Zuiker, directed by Danny Cannon.

We will refer to these as CSI-321, CSI-322 and CSI-323.

An example of a very short scene, scene 25 from CSI-101, follows.

    [INT. CSI - EVIDENCE ROOM -- NIGHT]
    (WARRICK opens the evidence package and takes out the shoe.)
    (He sits down and examines the shoe. After several dissolves, WARRICK opens the lip of the shoe and looks inside. He finds something.)
    WARRICK BROWN: Well, I'll be damned.
    (He tips the shoe over and a piece of toe nail falls out onto the table. He picks it up.)
    WARRICK BROWN: Tripped over a rattle, my ass.

We see here scene metadata, characters, dialog, and action information, all of which we use. Frontpiece, preliminary or preceding storyline information, and credits were ignored by us. We took the labeled scenes. The number of scenes in each movie, and the number of unique words of two or more characters used in the movie, are listed in Table 1. All punctuation was ignored. All upper case was converted to lower case. Otherwise there was no pruning of stopwords.

    Script     No. scenes   No. words
    CSI-101        50          1679
    CSI-102        37          1343
    CSI-103        38          1413
    CSI-321        39          1584
    CSI-322        40          1579
    CSI-323        49          1445

Table 1: Numbers of scenes in the plot, and numbers of unique (2-letter or more) words.

The top words and their frequencies of occurrence were:

the 443; to 239; grissom 195; you 176; and 166; gil 114; catherine 105; of 89; he 85; nick 80; in 79; on 79; it 78; at 76; ted 66; sara 65; warrick 65; ...

In order to equalize the Zipf distribution of words, and to homogenize the scenes and words by considering profiles (as opposed to raw data), as before we embedded the set of scenes in a Euclidean factor space in all cases, using Correspondence Analysis. We then clustered the full dimensionality factor space, using the hierarchical agglomerative algorithm that took into account the sequence of scenes, viz. the sequence-constrained complete link agglomerative method. To capture the “drift” of direction of the story, again like before, we used correlations rather than projections. Figures 5, 6 and 7 show the results obtained.

To focus discussion of internal structure which can be appreciated in these figures, let us look at where commercial breaks are flagged. While other considerations are important, like elapsed time from the start, it is clear that continuity of content is also highly relevant. Television episodes are written to create minor cliffhangers at the commercial breaks and therefore if the breaks can be identified by our datamining approach there is prima facie evidence for the finding of deep structures within screenplays.

Four commercial breaks are flagged in the script of CSI-101, and three in the scripts of CSI-102 and CSI-103.

For CSI-101, Figure 5, these commercial breaks, as given in the script, were between scenes 4 and 5; 14 and 15 (substantial change noticeable in Figure 5); and 32 and 33. The change in direction in the climax scene, scene 50, is clear. The early scenes, 1 up to 6, are distinguishable from the scenes that follow.
For CSI-102, Figure 6, the commercial breaks were between scenes 4 and 5; 12 and 13; and 22 and 23. The climax here appears to be a collection of scenes, from 33 to 37.

Finally, for CSI-103, Figure 7, the commercial breaks were between scenes 1 and 2 (clear change); 9 and 10; and 29 and 30 (clear change).

We summarize these findings as follows. There is occasionally a very strong link between commercial breaks and change in thematic content as evidenced by the hierarchy. In other cases we find continuity of content bridging the gap of the commercial breaks.

Figure 5: Sequence-constrained complete link hierarchy for CSI episode 101. Orientations (full dimensionality) were used for each of the 50 scenes. (Dendrogram; vertical axis from 0.0 to 1.0; leaves are scenes 1 to 50.)
Figure 6: Sequence-constrained complete link hierarchy for CSI episode 102. Orientations (full dimensionality) were used for each of the 37 scenes. (Dendrogram; vertical axis from 0.0 to 1.0; leaves are scenes 1 to 37.)

Figure 7: Sequence-constrained complete link hierarchy for CSI episode 103. Orientations (full dimensionality) were used for each of the 38 scenes. (Dendrogram; vertical axis from 0.0 to 0.8; leaves are scenes 1 to 38.)

Programs CSI-102 and CSI-103 are perhaps clearer than CSI-101 in having a more balanced subdivision. For CSI-102, we would put this subdivision as from scenes 1 to 20; 21 to 26; 27 to 31; and 32 to 37. For CSI-103, we would demarcate the plot into scenes 1 to 19; (20 or) 21 to 29; and 30 to 38.

As in section 5.3, we looked at the characteristics of style in the scripts.
For a given script, we characterized it on the basis of our nine attributes. Then we randomized the order of the scenes comprising the script. So the plot (or story) was identical in terms of the scenes that constitute it. But the plot was lacking in sense, in style and in structure, to the extent that the scenes were now in a random order. Such a randomized plot was also characterized on the basis of our nine attributes. We carried out 999 such randomizations. When an attribute's value for the real script was found to be less than, or greater than, the corresponding value in at least 80% of the randomized plots, we report it in Table 2. The threshold of 80% was chosen to be sufficiently decisive. Percentages were rounded to integers in all cases.

In regard to Table 2 we recall from section 5.3 that attributes 1 and 2 are first and second moments, respectively, of relative movement from one scene to the next. Attributes 3 and 4 are first and second moments of relative orientation from one scene to the next. Attributes 5 and 6 relate to tempo. Attributes 7, 8 and 9 relate to rhythm.

Furthermore, whether our script is less than or greater than the randomized alternatives (cf. column 3 of Table 2) can be understood as follows. If the “less than or equal to” case applies we can view this as our script being more compact or more parsimonious or more smooth or low frequency, for the particular attribute at issue, relative to the great bulk of randomized alternatives. Where the “greater than or equal to” case applies, then we can see something exceptional in the way that the plot is handled.

In Table 2, attribute 1 (mean relative movement) is a strong characterizing marker for all scripts, save one. This attribute is “compact” for the real script (in the sense in which we have used the term “compact” in the last paragraph, with reference to column 3 of Table 2).
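The randomization procedure described above can be sketched in a few lines. This is a minimal illustration and not the code used for the results reported here: the function names, the use of a population variance, the fixed random seed, and the counting of ties as “less than or equal to” are all assumptions made for the sketch. Each scene is represented by its vector of factor coordinates, and only attributes 1 and 2 (mean and variance of squared movement between successive scenes) are computed.

```python
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two coordinate vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def movement_stats(coords):
    """Mean and (population) variance of squared distances between
    successive scenes: a sketch of attributes 1 and 2."""
    steps = [sq_dist(coords[i], coords[i + 1]) for i in range(len(coords) - 1)]
    mean = sum(steps) / len(steps)
    var = sum((s - mean) ** 2 for s in steps) / len(steps)
    return mean, var


def share_at_most(coords, attribute, n_random=999, seed=0):
    """Percentage of randomized scene orders for which the real value of
    `attribute` is less than or equal to the randomized value."""
    rng = random.Random(seed)  # fixed seed: an assumption, for repeatability
    real = attribute(coords)
    count = 0
    for _ in range(n_random):
        shuffled = list(coords)
        rng.shuffle(shuffled)  # same scenes, random order
        if real <= attribute(shuffled):
            count += 1
    return 100.0 * count / n_random


def mean_movement(coords):
    """Attribute 1 only."""
    return movement_stats(coords)[0]


# Hypothetical toy "script": 11 scenes drifting steadily along one factor.
toy = [(float(i), 0.0) for i in range(11)]
print(movement_stats(toy), share_at_most(toy, mean_movement))
```

With real data, `coords` would hold the full-dimensionality factor projections of the scenes, and a percentage of 80 or more would correspond to a “≤” entry in Table 2; the “≥” case follows by reversing the comparison.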
Attribute 3 (mean relative reorientation) is a good characterizing marker for four of the six scripts. Attribute 9 (rhythm) is also a good marker for three of the six scripts.

Our Monte Carlo procedure is a rigorous one for assessing significance of patterns in the filmscript data. As we have demonstrated it allows us to validate unique semantic properties underlying the “sensory surface” (McKee) of the filmscripts.

    Program script   Attribute   Program ≤ / ≥ randomized   % of cases
    CSI-101              1                 ≤                    87%
    CSI-101              3                 ≤                    93%
    CSI-101              5                 ≤                    84%
    CSI-101              6                 ≥                    84%
    CSI-101              7                 ≤                    90%
    CSI-101              9                 ≤                    88%
    CSI-102              1                 ≤                    95%
    CSI-102              2                 ≤                    95%
    CSI-102              3                 ≤                    95%
    CSI-102              5                 ≥                    81%
    CSI-102              6                 ≤                    81%
    CSI-103              2                 ≤                    88%
    CSI-103              3                 ≤                    95%
    CSI-103              4                 ≥                    83%
    CSI-103              6                 ≤                    88%
    CSI-103              9                 ≤                    88%
    CSI-321              1                 ≤                    83%
    CSI-321              2                 ≥                    91%
    CSI-322              1                 ≤                    92%
    CSI-322              3                 ≤                    97%
    CSI-322              4                 ≤                    86%
    CSI-322              6                 ≥                    86%
    CSI-323              1                 ≤                    92%
    CSI-323              8                 ≤                    81%
    CSI-323              9                 ≤                    81%

Table 2: Shown are how and where the six television movies considered are unique in structure and style. For a particular attribute (see section 5.3 for description) the program script was different from randomized scene sequences in the indicated % of cases.

7 Conclusions

The basis for accessing semantics is provided by (i) Correspondence Analysis, where each scene is an average of words or other attributes that characterize it, and each attribute is an average of scenes that are characterized; and (ii) the hierarchical clustering of the sequence of scenes, in which relative change is modeled by the dendrogram structure.

We have made excellent progress in this work on having the qualitative precepts of McKee [15] both quantified and operationalized. Our assessments of the Casablanca movie, and the six CSI episodes, show that there is a great deal of commonality in style and structure between film and television.

We have taken into account both the linear and the hierarchical relationships in the plot, expressing the story.
The units used were beats (i.e., subscenes) and scenes, essentially, with the hierarchical clusterings revealing larger scale structures (beginning scenes; climax scenes; and the halves, or thirds, or whatever segments were revealed as appropriate for the entire plot).

Let us look now at how our work is of importance for the study of style and structure in narrative, in general. Chafe [4], in analyzing verbalized memory, used a 7-minute 16 mm color movie, with sound but no language, and collected narrative reminiscences of it from human subjects, 60 of whom were English-speaking and at least 20 spoke/wrote one of nine other languages. Chafe considered the following units.

1. Memory, expressed by a story (memory takes the form of an “island”; it is “highly selective”; it is a “disjointed chunk”; but it is not a book, nor a chapter, nor a continuous record, nor a stream).

2. Episode, expressed by a paragraph.

3. Thought, expressed by a sentence.

4. A focus, expressed by a phrase (often these phrases are linguistic “clauses”). Foci are “in a sense, the basic units of memory in that they represent the amount of information to which a person can devote his central attention at any one time”.

The “flow of thought and the flow of language” are treated at once, the latter proxying the former, and analyzed in their linear and hierarchical structure by [4, 13, 14], among others.

Filmscript affords us clear boundaries between the units of text that are analyzed. For more general text, we must consider segmentation. Examples of text segmentation to open up the analysis of style and structure include [12, 2, 11, 10, 5, 21].

We have shown in this work how useful the story expressed in a film or television movie script can be, in order to provide a framework for analysis of style and structure.
Appendix: Correspondence Analysis and Hierarchical Clustering

Analysis Chain

Correspondence Analysis, in conjunction with hierarchical clustering, provides what could be characterized as a data analysis platform providing access to the semantics of information expressed by the data. The way it does this is (i) by viewing each observation or row vector as the average of all attributes that are related to it, and by viewing each attribute or column vector as the average of all observations that are related to it; and (ii) by taking into account the clustering and dominance relationships given by the hierarchical clustering.

The analysis chain is as follows:

1. The starting point is a matrix that cross-tabulates the dependencies, e.g. frequencies of joint occurrence, of an observations crossed by attributes matrix.

2. By endowing the cross-tabulation matrix with the $\chi^2$ metric on both observation set (rows) and attribute set (columns), we can map observations and attributes into the same space, endowed with the Euclidean metric.

3. A hierarchical clustering is induced on the Euclidean space, the factor space.

4. Interpretation is through projections of observations, attributes or clusters onto factors. The factors are ordered by decreasing importance.

There are various aspects of Correspondence Analysis which follow on from this, such as Multiple Correspondence Analysis, different ways that one can encode input data, and mutual description of clusters in terms of factors and vice versa. See [18] and references therein for further details.

We will use a very succinct and powerful tensor notation in the following, introduced by [1]. At key points we will indicate the equivalent vector and matrix expressions.

Correspondence Analysis: Mapping $\chi^2$ Distances into Euclidean Distances

The given contingency table (or numbers of occurrence) data is denoted $k_{IJ} = \{ k_{IJ}(i,j) = k(i,j);\ i \in I, j \in J \}$.
$I$ is the set of observation indexes, and $J$ is the set of attribute indexes. We have $k(i) = \sum_{j \in J} k(i,j)$. Analogously $k(j)$ is defined, and $k = \sum_{i \in I, j \in J} k(i,j)$. Next, $f_{IJ} = \{ f_{ij} = k(i,j)/k;\ i \in I, j \in J \} \subset \mathbb{R}^{I \times J}$; similarly $f_I$ is defined as $\{ f_i = k(i)/k;\ i \in I \} \subset \mathbb{R}^{I}$, and $f_J$ analogously. What we have described here is taking numbers of occurrences into relative frequencies.

The conditional distribution of $f_J$ knowing $i \in I$, also termed the $i$th profile with coordinates indexed by the elements of $J$, is:

$$f_J^i = \{ f_j^i = f_{ij}/f_i = (k_{ij}/k)/(k_i/k);\ f_i > 0;\ j \in J \}$$

and likewise for $f_I^j$.

Input: Cloud of Points Endowed with the Chi Squared Metric

The cloud of points consists of the couples: (multidimensional) profile coordinate and (scalar) mass. We have $N_J(I) = \{ (f_J^i, f_i);\ i \in I \} \subset \mathbb{R}^{J}$, and again similarly for $N_I(J)$. Included in this expression is the fact that the cloud of observations, $N_J(I)$, is a subset of the real space of dimensionality $|J|$, where $|\cdot|$ denotes cardinality of the attribute set, $J$.

The overall inertia is as follows:

$$M^2(N_J(I)) = M^2(N_I(J)) = \| f_{IJ} - f_I f_J \|^2_{f_I f_J} = \sum_{i \in I, j \in J} (f_{ij} - f_i f_j)^2 / f_i f_j \qquad (1)$$

The term $\| f_{IJ} - f_I f_J \|^2_{f_I f_J}$ is the $\chi^2$ metric between the probability distribution $f_{IJ}$ and the product of marginal distributions $f_I f_J$, with as center of the metric the product $f_I f_J$. Decomposing the moment of inertia of the cloud $N_J(I)$, or of $N_I(J)$ since both analyses are inherently related, furnishes the principal axes of inertia, defined from a singular value decomposition.
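As an illustration of the quantities just defined, the following minimal sketch computes profiles, masses and the overall inertia of equation (1) from a small table of counts. The function names and the toy table are hypothetical, and the sketch assumes every row and column has a positive total. It also evaluates the inertia a second way, as the mass-weighted $\chi^2$ spread of the row profiles about the average profile $f_J$, which is algebraically the same quantity; the principal axis decomposition itself is not attempted here.

```python
def profiles_and_masses(counts):
    """Row profiles f_J^i, row masses f_i and column masses f_j
    from a table of counts k(i, j)."""
    total = float(sum(sum(row) for row in counts))
    row_masses = [sum(row) / total for row in counts]        # f_i
    col_masses = [sum(col) / total for col in zip(*counts)]  # f_j
    profiles = [[k / sum(row) for k in row] for row in counts]  # f_ij / f_i
    return profiles, row_masses, col_masses


def inertia(counts):
    """Equation (1): sum over i, j of (f_ij - f_i f_j)^2 / (f_i f_j)."""
    total = float(sum(sum(row) for row in counts))
    _, f_i, f_j = profiles_and_masses(counts)
    return sum((k / total - fi * fj) ** 2 / (fi * fj)
               for row, fi in zip(counts, f_i)
               for k, fj in zip(row, f_j))


def inertia_from_cloud(counts):
    """The same inertia, written as the mass-weighted chi-squared
    distance of each row profile from the average profile f_J."""
    profiles, f_i, f_j = profiles_and_masses(counts)
    return sum(fi * sum((p - fj) ** 2 / fj for p, fj in zip(prof, f_j))
               for prof, fi in zip(profiles, f_i))


# Hypothetical scenes-by-words table of occurrence counts.
toy_counts = [[4, 1, 0],
              [2, 2, 1],
              [0, 3, 5]]
print(inertia(toy_counts), inertia_from_cloud(toy_counts))
```

A table whose rows are proportional to one another has identical profiles and therefore zero inertia, which is a quick sanity check on either function.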
Output: Cloud of Points Endowed with the Euclidean Metric in Factor Space

The $\chi^2$ distance with center $f_J$ between observations $i$ and $i'$ is written as follows in two different notations:

$$d(i,i') = \| f_J^i - f_J^{i'} \|^2_{f_J} = \sum_{j} \frac{1}{f_j} \left( \frac{f_{ij}}{f_i} - \frac{f_{i'j}}{f_{i'}} \right)^2 \qquad (2)$$

In the factor space this pairwise distance is identical. The coordinate system and the metric change. For factors indexed by $\alpha$ and for total dimensionality $N$ ($N = \min \{ |I| - 1, |J| - 1 \}$; the subtraction of 1 is since the $\chi^2$ distance is centered and hence there is a linear dependency which reduces the inherent dimensionality by 1) we have the projection of observation $i$ on the $\alpha$th factor, $F_\alpha$, given by $F_\alpha(i)$:

$$d(i,i') = \sum_{\alpha = 1..N} \left( F_\alpha(i) - F_\alpha(i') \right)^2 \qquad (3)$$

In Correspondence Analysis the factors are ordered by decreasing moments of inertia. The factors are closely related, mathematically, in the decomposition of the overall cloud, $N_J(I)$ and $N_I(J)$, inertias. The eigenvalues associated with the factors, identically in the space of observations indexed by set $I$, and in the space of attributes indexed by set $J$, are given by the eigenvalues associated with the decomposition of the inertia. The decomposition of the inertia is a principal axis decomposition, which is arrived at through a singular value decomposition.

Hierarchical Clustering

Background on the theory and practice of hierarchical clustering can be found in [17, 18]. For the particular hierarchical clustering algorithm used here, based on the given sequence of observations, [17] should be referred to. A short description follows.

Consider the projection of observation $i$ onto the set of all factors indexed by $\alpha$, $\{ F_\alpha(i) \}$ for all $\alpha$, which defines the observation $i$ in the new coordinate frame. This new factor space is endowed with the (unweighted) Euclidean distance, $d$.
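The sequence-constrained complete link agglomeration used for the hierarchies in this paper operates on such factor coordinates. A minimal sketch follows; it is an illustration under stated assumptions rather than the implementation of [17]. The function names are hypothetical, $d$ is taken in the squared form of equation (3), ties are broken in favour of the earliest pair in the sequence, and no attempt is made at efficiency.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two factor-space vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def complete_link(c1, c2, points):
    """Largest pairwise distance between members of two clusters."""
    return max(sq_dist(points[i], points[j]) for i in c1 for j in c2)


def sequence_constrained_complete_link(points):
    """Agglomerate texts given in sequence order, merging only clusters
    that are adjacent in the sequence. Returns a list of merges as
    (left cluster, right cluster, merge height)."""
    clusters = [[i] for i in range(len(points))]  # singletons, in order
    merges = []
    while len(clusters) > 1:
        # Candidate merges are restricted to neighbouring clusters.
        k = min(range(len(clusters) - 1),
                key=lambda t: complete_link(clusters[t], clusters[t + 1], points))
        height = complete_link(clusters[k], clusters[k + 1], points)
        merges.append((clusters[k], clusters[k + 1], height))
        clusters[k:k + 2] = [clusters[k] + clusters[k + 1]]
    return merges


# Hypothetical one-factor coordinates for five successive texts.
toy_points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
for left, right, height in sequence_constrained_complete_link(toy_points):
    print(left, right, height)
```

Because clusters are always contiguous runs of the sequence, a large merge height between two neighbouring runs marks a candidate break in the narrative, which is how the dendrograms in the figures are read.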
We seek a hierarchical clustering that takes into account the observation sequence, i.e. the linear order on the observations, under which, for any two distinct $i, i' \in I$, one precedes the other. Let us switch to the term texts now, which is what our observations refer to in this work. We refer to “adjacent” texts when one follows the other with respect to this linear order, and this definition of adjacency is extended to allow for adjacent clusters of texts.

The agglomerative hierarchical clustering algorithm is as follows.

1. Consider each text in the sequence of texts as constituting a singleton cluster. Determine the closest pair of adjacent texts, and define a cluster from them.

2. Determine and merge the closest pair of adjacent clusters, $c_1$ and $c_2$, where closeness is defined by $d(c_1, c_2) = \max \{ d_{ii'} \text{ such that } i \in c_1, i' \in c_2 \}$.

3. Repeat step 2 until only one cluster remains.

Here we use a complete link criterion which additionally takes account of the adjacency constraint imposed by the sequence of texts in set $I$. It can be shown (see [17]) that the closeness value, given by $d$, at each agglomerative step is non-decreasing. That is, if cluster $c_3$ is formed earlier in the series of agglomerations compared to cluster $c_4$, then the corresponding distances will satisfy $d_{c_3} \leq d_{c_4}$. ($d$ here is as determined in step 2 of the algorithm above.)

References

[1] J.-P. Benzécri, L'Analyse des Données, Tome I Taxinomie, Tome II Correspondances, 2nd ed., Dunod, Paris, 1979.

[2] Y. Bestgen, Segmentation markers as trace and signal of discourse structure, Journal of Pragmatics, 29, 753–763, 1998.

[3] M. Burnett and J. Allison, Everybody Comes to Rick's, screenplay, 1940.

[4] W.L. Chafe, The flow of thought and the flow of language, in Syntax and Semantics: Discourse and Syntax, ed. Talmy Givón, vol. 12, 159–181, Academic Press, 1979.

[5] F.Y.Y. Choi, Advances in domain independent linear text segmentation, Proc. of the First Conf. of the North American Chapter of the Assoc. for Computational Linguistics (Seattle, WA), ACM Intl. Conf. Proc. Series Vol. 4, 26–33, 2000.

[6] J. Eliashberg, A. Elberse and M.A.A.M. Leenders, The motion picture industry: critical issues in practice, current research, and new research directions, Marketing Science, 25, 638–661, 2006.

[7] J. Eliashberg, S.K. Hui and Z.J. Zhang, From storyline to box office: a new approach for green-lighting movie scripts, Management Science, 53, 881–893, 2007.

[8] M. Gladwell, The formula: what if you built a machine to predict hit movies?, The New Yorker, 16 Oct. 2006. www.newyorker.com/archive/2006/10/16/061016fa_fact6

[9] A. Glassner, Interactive Storytelling, A.K. Peters, 2004.

[10] B.J. Grosz, Discourse structure, intentions, and intonation, in The Languages of the Brain, ed. A. Galaburda, S. Kosslyn and Y. Christen, Harvard University Press, Cambridge, 127–142, 2002.

[11] B.J. Grosz and C.L. Sidner, Attention, intentions, and the structure of discourse, Computational Linguistics, 12, 175–204, 1986.

[12] M. Hearst, Multi-paragraph segmentation of expository text, Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (Las Cruces, New Mexico), Association for Computational Linguistics, Morristown, NJ, USA, 9–16, 1994.

[13] J. Hinds, Organisational patterns in discourse, in Syntax and Semantics: Discourse and Syntax, ed. Talmy Givón, vol. 12, 135–157, Academic Press, 1979.

[14] R.E. Longacre, The paragraph as a grammatical unit, in Syntax and Semantics: Discourse and Syntax, ed. Talmy Givón, vol. 12, 115–134, Academic Press, 1979.

[15] R. McKee, Story: Substance, Structure, Style, and the Principles of Screenwriting, Methuen, 1999.

[16] F. Murtagh, Structures of hierarchic clusterings: implications for information retrieval and for multivariate data analysis, Information Processing and Management, 20, 611–617, 1984.

[17] F. Murtagh, Multidimensional Clustering Algorithms, Physica-Verlag, 1985.

[18] F. Murtagh, Correspondence Analysis and Data Coding with R and Java, Chapman & Hall/CRC, 2005.

[19] M.E.J. Newman, Power laws, Pareto distributions and Zipf's law, Contemporary Physics, 46, 323–351, 2005.

[20] M.O. Riedl and R.M. Young, From linear story generation to branching story graphs, IEEE Computer Graphics and Applications, 26, 23–31, 2006.

[21] E.F. Skorochod'ko, Adaptive method of automatic abstracting and indexing, Proc. of IFIP Congress 71, 1179–1182, 1972.
