TexComp - A Text Complexity Analyzer for Student Texts
This paper describes a method for giving feedback on the complexity of student texts. Both the method and the accompanying software tool, TexComp, are designed for use in assessing student compositions such as essays and theses. The method applies readability and lexical-diversity formulas cautiously, for reasons analyzed in detail in the paper. We evaluated the tool on USE and BAWE, two corpora of texts written by students who use English as a medium of instruction.
💡 Research Summary
The paper introduces TexComp, a software tool designed to quantify the textual complexity of student‑written compositions such as essays, reports, and theses, and to provide actionable feedback for educators and learners. The authors begin by highlighting the pedagogical importance of assessing text difficulty: it helps teachers gauge the developmental stage of a student’s writing and offers students concrete guidance on how to improve clarity and sophistication. They critique the prevailing reliance on traditional readability formulas (e.g., Flesch‑Kincaid, SMOG, Gunning Fog) and on simple lexical‑diversity measures (e.g., Type‑Token Ratio, MATTR, MTLD). While these metrics have proven useful for general‑audience texts, the authors argue that they are ill‑suited for academic student work because (1) readability formulas are overly sensitive to sentence length and syllable count, ignoring the presence of domain‑specific terminology; and (2) lexical‑diversity indices are heavily influenced by text length, leading to misleadingly low scores for shorter assignments that are nevertheless well‑written.
To address these shortcomings, TexComp adopts a “cautious application” strategy. First, the system tokenizes the input using a combination of NLTK and spaCy, extracting sentences, words, and syllable counts. For readability, it computes both the Flesch‑Kincaid Grade Level and the SMOG index, then averages them to obtain a baseline readability score. For lexical diversity, it calculates two complementary indices: MTLD (Measure of Textual Lexical Diversity) and HD‑D (a diversity index based on the hypergeometric distribution). Both scores are transformed into z‑scores normalized for text length, thereby mitigating the length bias inherent in raw TTR‑type measures.
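The paper's summary does not include implementations, but the underlying metrics are standard. A minimal sketch, assuming the published Flesch‑Kincaid, SMOG, and MTLD definitions (the syllable counts are taken as inputs rather than computed, since syllabification is the hard part that NLTK/spaCy pipelines handle):

```python
def flesch_kincaid_grade(words, sentences, syllables):
    # Standard Flesch-Kincaid Grade Level formula.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def smog_index(polysyllables, sentences):
    # Standard SMOG formula; polysyllables = words with 3+ syllables.
    return 1.043 * (polysyllables * 30 / sentences) ** 0.5 + 3.1291

def mtld(tokens, ttr_threshold=0.72):
    # Forward pass of MTLD (McCarthy & Jarvis): count "factors", i.e.
    # stretches over which the type-token ratio stays above the
    # threshold; diversity = tokens / factors.
    factors, types, count = 0.0, set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= ttr_threshold:
            factors += 1
            types, count = set(), 0
    if count > 0:  # weight the partial factor left at the end
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - ttr_threshold)
    return len(tokens) / factors if factors else 0.0
```

A full implementation would also average the forward and backward MTLD passes and add the length normalization (z-scoring) the paper describes.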
The next step is to combine the two dimensions into a single composite complexity score. The authors weight readability at 60 % and lexical diversity at 40 %, reflecting the intuition that sentence‑level processing difficulty is generally more salient for readers than word‑level variety. However, they introduce a dynamic adjustment: for texts shorter than 500 words, the weight of lexical diversity is increased and the readability weight is decreased, because short assignments cannot reliably support long‑sentence statistics. The final composite score is scaled to a 0‑100 range, where higher values indicate greater textual complexity.
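The combination step can be sketched as follows. Note this is a hypothetical reconstruction: the paper states the 60/40 default weights, the under-500-word adjustment, and the 0–100 scale, but not the exact form of the dynamic shift; here it is assumed to scale linearly with text length, and both inputs are assumed to be pre-normalized to 0–100.

```python
def composite_score(readability, diversity, n_words,
                    base_read_weight=0.6, short_text_cutoff=500):
    # Default split: 60% readability, 40% lexical diversity.
    # For texts under the cutoff, shift weight toward diversity,
    # since short texts cannot support long-sentence statistics.
    if n_words < short_text_cutoff:
        read_weight = base_read_weight * (n_words / short_text_cutoff)
    else:
        read_weight = base_read_weight
    div_weight = 1.0 - read_weight
    raw = read_weight * readability + div_weight * diversity
    return max(0.0, min(100.0, raw))  # clamp to the 0-100 scale
```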
TexComp is implemented as a Python‑based web application. The front‑end presents a dashboard where users upload a document (plain text or PDF). The back‑end processes the file, displays the composite score together with the two constituent scores, and visualizes distributions of sentence length, word length, and lexical‑type frequencies. Importantly, the system also generates natural‑language feedback messages (e.g., “Consider shortening overly long sentences” or “Introduce more domain‑specific vocabulary”) based on threshold rules derived from the underlying metrics.
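The feedback generation described above amounts to threshold rules over the computed metrics. A minimal sketch, where the cutoff values and metric names are illustrative assumptions rather than values reported in the paper:

```python
def feedback_messages(avg_sentence_len, domain_term_ratio,
                      long_sentence_cutoff=25, term_ratio_floor=0.05):
    # Map surface metrics to the kind of natural-language feedback
    # the paper quotes; both thresholds here are assumed values.
    messages = []
    if avg_sentence_len > long_sentence_cutoff:
        messages.append("Consider shortening overly long sentences")
    if domain_term_ratio < term_ratio_floor:
        messages.append("Introduce more domain-specific vocabulary")
    return messages
```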
For empirical validation, the authors evaluate TexComp on two large, publicly available corpora of English‑medium student writing: the Undergraduate Student Essays (USE) corpus (≈12 000 essays) and the British Academic Written English (BAWE) corpus (≈8 000 texts). Both corpora contain metadata on year of study (1st‑4th year), discipline (humanities, social sciences, sciences, engineering), and assignment type. Human raters, experienced university lecturers, independently assigned a difficulty rating on a five‑point Likert scale to a stratified sample of 1 200 texts (600 from each corpus). The authors then computed Pearson correlation coefficients between the human ratings and TexComp’s composite scores, obtaining r = 0.78 (p < 0.001), indicating a strong linear relationship. In contrast, using only the Flesch‑Kincaid score yielded r = 0.62, and using only MTLD yielded r = 0.55. Moreover, TexComp reduced the false‑positive rate (texts flagged as “too complex” when human raters judged them simple) from 12 % (single‑metric approaches) to 5 %, and lowered the mean absolute error (MAE) by 0.42 points on the 0‑100 scale.
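The validation reported above reduces to computing Pearson's r between the human Likert ratings and TexComp's composite scores over the rated sample. A minimal pure-Python version of that computation (not the authors' code):

```python
def pearson_r(xs, ys):
    # Pearson correlation between two paired score lists,
    # e.g. human ratings vs. composite complexity scores.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```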
The discussion acknowledges several limitations. TexComp currently supports only English, because syllable counting and lexical‑diversity baselines are language‑specific; extending to other languages would require new linguistic resources. The system also does not assess higher‑order discourse qualities such as coherence, argument structure, or logical flow, which are critical for academic writing but remain outside the scope of purely surface‑level metrics. The authors propose future work that integrates discourse parsing and semantic similarity measures, as well as user studies to examine how students and instructors perceive and act upon the automated feedback.
In conclusion, the paper demonstrates that a carefully calibrated combination of readability and lexical‑diversity formulas can provide a reliable, interpretable measure of textual complexity for student writing. TexComp’s composite score correlates strongly with expert judgments and offers concrete, data‑driven suggestions for improvement, making it a promising tool for formative assessment in higher‑education contexts.