Statistical Analysis of Sentence Structures through ASCII, Lexical Alignment and PCA
Syntactic tools such as part-of-speech (POS) tagging have helped us understand sentence structures and their distribution across diverse corpora, but they are computationally complex and pose a challenge in natural language processing (NLP). This study focuses on assessing sentence structure balance, that is, the harmonious use of nouns, verbs, determiners, and other lexical categories, without relying on such tools. It proposes a novel statistical method that represents the text of 11 corpora from various sources as American Standard Code for Information Interchange (ASCII) codes, measures their alignment with lexical categories after compressing both representations through principal component analysis (PCA), and analyzes the results with histograms and normality tests (Shapiro-Wilk and Anderson-Darling). By working directly with ASCII codes, this approach simplifies text processing; it does not replace syntactic tools but complements them as a resource-efficient means of assessing text balance. The story generated by Grok shows near normality, indicating balanced sentence structures in LLM outputs, while 4 of the remaining 10 corpora pass the normality tests. Further research could explore applications in text quality evaluation and style analysis, with syntactic integration for broader tasks.
💡 Research Summary
This paper presents a novel statistical methodology for assessing sentence structure balance across diverse text corpora, circumventing the computational complexity of traditional syntactic tools like part-of-speech (POS) tagging. The core innovation lies in using American Standard Code for Information Interchange (ASCII) numerical representations as a foundational layer for text analysis.
The method involves two parallel processing streams. First, seventeen fundamental lexical categories (e.g., noun, verb, adjective, preposition) are converted into their corresponding ASCII code arrays. Principal Component Analysis (PCA) is then applied to compress these category representations into a single one-dimensional summary vector, denoted K̄. Concurrently, the text corpora under analysis, comprising eleven sources including blogs, news articles, movie reviews, and a story generated by the Grok LLM, are preprocessed. Each sentence from these corpora is converted to ASCII codes and subsequently reduced via PCA to a 17-dimensional vector, denoted J̄. The key measurement, “lexical alignment,” is computed as the dot product between each sentence vector J̄ and the lexical category summary vector K̄.
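One plausible reading of this pipeline can be sketched in Python. Since the summary only paraphrases the paper's method, the padding scheme, the shortened category list, and the exact way PCA is applied are assumptions here; a full run would use all seventeen categories:

```python
import numpy as np

def to_ascii(text, width):
    """Encode text as a fixed-width array of ASCII codes, zero-padded."""
    codes = [ord(c) for c in text][:width]
    return np.array(codes + [0] * (width - len(codes)), dtype=float)

def pca_scores(matrix, k):
    """Project the rows of `matrix` onto their first k principal components."""
    centered = matrix - matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# Hypothetical lexical categories (the paper uses seventeen; three shown here).
categories = ["noun", "verb", "adjective"]
cat_matrix = np.stack([to_ascii(c, 12) for c in categories])
K = pca_scores(cat_matrix, 1).ravel()        # one summary score per category

# Toy sentences; each is reduced to a vector matching K's length, then the
# "lexical alignment" is the dot product of the two.
sentences = [
    "The cat sat on the mat.",
    "Dogs bark loudly at night.",
    "Rain fell softly over the hills.",
    "She reads a new book every week.",
]
sent_matrix = np.stack([to_ascii(s, 40) for s in sentences])
J = pca_scores(sent_matrix, len(categories))  # n_sentences x 3 here
alignment = J @ K                             # one alignment score per sentence
```

The distribution of `alignment` values across a whole corpus is what the histograms and normality tests below are applied to.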
The central hypothesis is that well-balanced text, characterized by a harmonious and varied use of different lexical resources, will yield a distribution of these dot product values that approximates a normal distribution. This hypothesis is tested rigorously using Shapiro-Wilk and Anderson-Darling normality tests on the dot product distributions from each corpus.
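Both tests are available in SciPy. A minimal sketch of such a check follows; the `is_balanced` helper and the 5% threshold are illustrative conventions, not the paper's exact decision rule:

```python
import numpy as np
from scipy import stats

def is_balanced(alignment_scores, alpha=0.05):
    """Treat a corpus as 'balanced' if its scores pass both normality tests."""
    _, sw_p = stats.shapiro(alignment_scores)
    ad = stats.anderson(alignment_scores, dist="norm")
    # anderson() reports critical values at the 15%, 10%, 5%, 2.5%, and 1%
    # significance levels; index 2 corresponds to the 5% level.
    ad_ok = ad.statistic < ad.critical_values[2]
    return bool(sw_p > alpha and ad_ok)

rng = np.random.default_rng(0)
skewed = rng.exponential(size=500)   # clearly non-normal scores
print(is_balanced(skewed))           # exponential data should fail both tests
```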
The results reveal a clear distinction. Several human-written blog posts and articles passed the normality tests, suggesting a balanced sentence structure. In contrast, the movie review datasets (train, test, and validation splits) and the combined corpus all significantly deviated from normality. Interestingly, the Grok-generated story, while visually exhibiting a bell-shaped curve in its histogram, failed the statistical normality tests. Further analysis linked this to its limited sentence length distribution (no sentences over 200 characters) and potentially unnatural positioning of certain lexical categories, despite being grammatically correct.
A critical technical insight is the high cumulative explained variance (close to 1.0) observed during PCA reduction of both the lexical categories and the text sentences. This indicates strong redundancy in the ASCII-encoded data, validating the efficiency of the dimensionality reduction step for this specific representation.
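This redundancy check can be reproduced from the singular values alone. A small NumPy-only sketch, where the sample strings are invented and real corpora would be far larger:

```python
import numpy as np

def cumulative_explained_variance(matrix):
    """Fraction of total variance captured by the first 1, 2, ... components."""
    centered = matrix - matrix.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    return np.cumsum(var) / var.sum()

# ASCII-encode a few short strings; structurally similar text yields highly
# correlated columns, so the variance concentrates in the first components.
texts = ["the cat sat", "the dog sat", "the cat ran", "a dog ran"]
width = max(len(t) for t in texts)
data = np.stack([[ord(c) for c in t.ljust(width)] for t in texts]).astype(float)

cev = cumulative_explained_variance(data)
print(cev)   # climbs toward 1.0 within the first few components
```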
The study concludes that the proposed ASCII-and-PCA pipeline offers a resource-efficient, complementary tool for high-level text analysis. It does not replace deep syntactic analysis but provides a swift statistical lens to evaluate structural balance and style, with potential applications in automated text quality assessment and comparative style analysis between human and AI-generated content.