A Readability Analysis of Campaign Speeches from the 2016 US Presidential Campaign

February 23, 2026


📝 Abstract

Readability is defined here as the reading level of a speech, from grade 1 to grade 12. It is computed with the REAP readability analysis (vocabulary: Collins-Thompson and Callan, 2004; syntax: Heilman et al., 2006, 2007), which uses the lexical contents and grammatical structure of the sentences in a document to predict its reading level. After analysis, results were grouped into the average readability of each candidate, the evolution of each candidate’s readability over time, and the standard deviation, that is, how much each candidate varied their speech from one venue to another. For comparison, one speech from each of four past presidents and the Gettysburg Address were also analyzed.


📄 Content

A Readability Analysis of Campaign Speeches from the 2016 US Presidential Campaign

Elliot Schumacher, Maxine Eskenazi. CMU-LTI-16-001. March 15, 2016.

Language Technologies Institute School of Computer Science Carnegie Mellon University 5000 Forbes Ave., Pittsburgh, PA 15213 www.lti.cs.cmu.edu

Introduction

The goal of this report is to assess the readability of the campaign speeches of five presidential candidates in the 2016 US presidential race and to examine how it evolves over time and varies with the type of speech. Readability is defined here as the reading level of a document, from grade 1 to grade 12. It is determined by looking at the lexical contents and the grammatical structure of the sentences in a document, and rests on the observation that some words (and grammatical structures) appear with greater frequency at one grade level than at another. For example, we would expect to see the word “win” fairly frequently in third grade documents, while the word “successful” would be more frequent in, say, seventh grade documents. We would not see dependent clauses very often at the second grade level, whereas they would be quite frequent at the seventh grade level.

For this analysis, we use a readability model, REAP, that was developed for vocabulary by Collins-Thompson and Callan (2004) and further developed for grammar by Heilman et al. (2006, 2007). It is based on a database of sets of texts, one set for each grade level. Most of the texts are student-written texts that teachers have published on their websites, noting the grade that each represents. The lexical reading difficulty measure is based on the smoothed individual probabilities of words occurring at each reading level. For example, the word “determine” was predictive of Grade 11 text, and was more predictive of high-school-level text than of lower-level text. The grammar reading difficulty measure is based on parse-tree structures one to three levels deep, taken from the sentences. This means that the measure is based on the grammatical constructions typical of sentences at each grade level.
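The idea behind the lexical measure can be illustrated with a minimal sketch. It is not the REAP implementation: the toy corpus, the function names, and the use of simple add-one (Laplace) smoothing are all assumptions made for illustration, and the smoothing actually used by Collins-Thompson and Callan (2004) is more elaborate. The sketch builds one word-frequency model per grade level and assigns a text to the grade whose model gives it the highest smoothed log-likelihood.

```python
import math
from collections import Counter

# Hypothetical toy corpus: a few short texts per grade level (illustration only).
TOY_CORPUS = {
    3: ["we win the game", "i want to win and play"],
    7: ["a successful plan needs work", "she was successful because she studied"],
}

def train_grade_models(texts_by_grade):
    """Count word occurrences for each grade and collect the shared vocabulary."""
    counts = {
        grade: Counter(word for text in texts for word in text.lower().split())
        for grade, texts in texts_by_grade.items()
    }
    vocab = set().union(*counts.values())
    return counts, vocab

def log_likelihood(words, counter, vocab_size, alpha=1.0):
    """Sum of smoothed log-probabilities of the words under one grade model.

    Add-alpha smoothing is an assumption here; it reserves one extra
    vocabulary slot so that unseen words never get zero probability.
    """
    total = sum(counter.values())
    denom = total + alpha * (vocab_size + 1)
    return sum(math.log((counter[w] + alpha) / denom) for w in words)

def predict_grade(text, counts, vocab, alpha=1.0):
    """Return the grade whose model assigns the text the highest likelihood."""
    words = text.lower().split()
    scores = {
        grade: log_likelihood(words, counter, len(vocab), alpha)
        for grade, counter in counts.items()
    }
    return max(scores, key=scores.get)

counts, vocab = train_grade_models(TOY_CORPUS)
print(predict_grade("we win", counts, vocab))
print(predict_grade("a successful plan", counts, vocab))
```

With this toy data, a text built around “win” is assigned to the third grade model and one built around “successful” to the seventh grade model, mirroring the example in the text. A real system would need far more text per grade and a treatment of grammar as well.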

Background

Early readability measures made assumptions about what a difficult text was. The Dale-Chall Readability Formula (Dale and Chall, 1948) defined the readability level as a linear function of the average number of words in a sentence and the percentage of rare words in the document. Flesch-Kincaid (Kincaid et al., 1975) was based on the average sentence length and the average number of syllables per word. More recently, the Lexile Framework (version 1.0, Stenner, 1996) uses word frequency estimates as a measure of lexical difficulty and sentence length as a grammatical feature.

Other approaches characterize text in more holistic terms. Coh-Metrix (Graesser et al., 2011) measures text cohesiveness, accounting for the reading difficulty of the text, other lexical and syntactic measures, the prior knowledge needed for comprehension, and the genre of the text. These factors account for the difficulty of constructing the mental representation of the text.

All of these measures, REAP included, were originally developed to help teachers choose appropriate documents for their students in reading classes. The campaign speeches, although most were written in advance, are meant to be spoken, and written language is very different from spoken language: when we speak, we usually use less structured language with shorter sentences. So while measures such as Flesch-Kincaid are appropriate for written text, they do not really reflect the structure of spoken language. REAP has also been trained on written texts, as described above, but it concentrates on how often words and grammatical constructs are used at each grade level and less on the length of each sentence and word. REAP is therefore better suited to the analysis of spoken language than its predecessors.
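To make the contrast with length-based measures concrete, the following sketch computes the Flesch-Kincaid grade level, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The formula itself is the published one; the tokenization and the vowel-group syllable counter are rough assumptions made for illustration, and the function names are hypothetical.

```python
import re

def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per group of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level from sentence length and syllables per word."""
    # Naive splitting on ., ! and ? (an assumption; abbreviations will confuse it).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

print(flesch_kincaid_grade("We will win."))
print(flesch_kincaid_grade("Successful candidates communicate deliberately."))
```

Both example sentences contain three words, so the difference in score comes entirely from syllable counts. This shows why such a formula depends only on surface length: it has no notion of which words or constructions are typical of a grade level, which is the information REAP relies on.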

Methodology

A database was collected containing documents from each of the five current presidential candidates, with the number of documents per candidate in parentheses: Ted Cruz (5), Hillary Clinton (7), Marco Rubio (6), Bernie Sanders (6), Donald Trump (8) (see References and Appendix). The documents are transcriptions of their campaign speeches. They range from declaration-of-candidacy speeches to campaign trail speeches to victory and defeat speeches. The small numbers reflect that it was sometimes difficult to find transcriptions rather than videos. In the future, an Automatic Speech Recognition (ASR) system could be used to obtain text from the videos. Because this process would introduce some errors, it was not used for the present study. For comparison, we also analyzed the readability of Lincoln’s Gettysburg Address (Bliss version) and one speech each from Barack Obama, George W. Bush, Bill Clinton and Ronald Reagan (the latter two at the same venue in different years). Two levels of analysis were
