Artificial Intelligence versus Maya Angelou: Experimental evidence that people cannot differentiate AI-generated from human-written poetry

The release of openly available, robust natural language generation (NLG) algorithms has spurred much public attention and debate, in part because of the algorithms' purported ability to generate human-like text across various domains. Yet empirical evidence using incentivized tasks to assess whether people (a) can distinguish and (b) prefer algorithm-generated versus human-written text is lacking. We conducted two experiments assessing behavioral reactions to the state-of-the-art NLG algorithm GPT-2 (N_total = 830). Using the identical starting lines of human poems, GPT-2 produced samples of poems. From these samples, either a random poem was chosen (Human-out-of-the-loop) or the best one was selected (Human-in-the-loop), and each was in turn matched with a human-written poem. In a new incentivized version of the Turing Test, participants failed to reliably detect the algorithmically generated poems in the Human-in-the-loop treatment, yet succeeded in the Human-out-of-the-loop treatment. Further, participants revealed a slight aversion to algorithm-generated poetry, independent of whether they were informed about the algorithmic origin of the poem (Transparency) or not (Opacity). We discuss what these results convey about the ability of NLG algorithms to produce human-like text and propose methodologies for studying such learning algorithms in human-agent experimental settings.


💡 Research Summary

This paper presents a rigorous behavioral investigation of whether people can distinguish poetry generated by a state-of-the-art natural language generation (NLG) model, GPT-2, from human-written poems, and which of the two they prefer. The authors recruited 830 participants through an online platform and offered monetary incentives tied to performance, thereby creating an "incentivized Turing Test." The experimental design crossed two independent variables. The first manipulated how the GPT-2 output was selected: "Human-in-the-loop" (HITL), in which the researchers manually chose the highest-quality poem from a batch of GPT-2 generations, and "Human-out-of-the-loop" (HOOTL), in which a poem was drawn at random from the same batch. In both cases the selected AI poem was paired with a human-written poem that began with the identical opening line, ensuring that content differences stemmed only from the generation process. The second variable manipulated transparency: in the "Transparency" condition participants were explicitly told which poem was AI-generated, whereas in the "Opacity" condition no source information was provided.
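To make the stimulus construction concrete, here is a minimal sketch of how such poem pairs could be assembled with the openly available GPT-2 model via the Hugging Face `transformers` library. The model checkpoint, sampling parameters, and the `rate_quality` stand-in are illustrative assumptions rather than the authors' pipeline; in the study, human judges performed the curation.

```python
import random
from transformers import pipeline

# Illustrative settings: the authors' exact checkpoint and sampling
# parameters are not reproduced here.
generator = pipeline("text-generation", model="gpt2")

def rate_quality(poem: str) -> float:
    # Stand-in for the human curation step: in the study, researchers
    # (not a function) judged which sample was best. Crude proxy only.
    return len(set(poem.split()))

def build_stimulus_pair(opening_line: str, human_poem: str,
                        curate: bool, n_samples: int = 10) -> dict:
    """Pair a human poem with a GPT-2 continuation of the same opening line.

    curate=True mimics Human-in-the-loop (pick the 'best' sample);
    curate=False mimics Human-out-of-the-loop (pick one at random).
    """
    samples = generator(
        opening_line,
        max_new_tokens=100,          # assumed length cap
        num_return_sequences=n_samples,
        do_sample=True,
        top_k=50,                    # assumed sampling parameters
    )
    poems = [s["generated_text"] for s in samples]
    ai_poem = max(poems, key=rate_quality) if curate else random.choice(poems)
    return {"human": human_poem, "ai": ai_poem}
```

Starting both poems from the same opening line is the key design choice: it holds topic and tone constant, so any detectable difference must come from the continuation.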

Participants completed two tasks for each pair: (1) a discrimination task in which they indicated which poem they believed was written by a human, and (2) a preference task in which they selected the poem they liked more. Correct discrimination was rewarded financially, creating a strong motivation to perform accurately.
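The per-trial logic can be summarized in a short sketch; the bonus amount and field names are hypothetical, since the summary does not report the exact payment scheme.

```python
from dataclasses import dataclass

@dataclass
class TrialResponse:
    guessed_human: str   # "A" or "B": the poem judged to be human-written
    preferred: str       # "A" or "B": the poem liked more

def score_trial(response: TrialResponse, human_slot: str,
                bonus_per_correct: float = 0.50) -> dict:
    """Score one incentivized trial (bonus amount is hypothetical)."""
    correct = response.guessed_human == human_slot
    return {
        "discrimination_correct": correct,                    # task 1: paid
        "preferred_human": response.preferred == human_slot,  # task 2: unpaid
        "bonus": bonus_per_correct if correct else 0.0,
    }
```

For example, `score_trial(TrialResponse("A", "B"), human_slot="A")` pays the bonus for a correct discrimination while recording a preference for the AI poem, keeping the two judgments independent.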

The results reveal a striking asymmetry. In the HITL condition, discrimination accuracy hovered around 52%, not significantly different from chance, indicating that when a human curator selects the best GPT-2 output, participants cannot reliably tell it apart from a human poem. By contrast, in the HOOTL condition discrimination accuracy rose to approximately 71%, demonstrating that randomly chosen GPT-2 poems retain detectable artifacts that humans can spot. Preference data show a modest but consistent bias toward human poems; participants exhibited a slight aversion to AI-generated poetry regardless of whether they were told about its origin. Transparency (source disclosure) did not meaningfully alter either discrimination performance or preference, suggesting that simply informing users about AI authorship is insufficient to change affective responses.
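The chance-level comparison can be illustrated with an exact binomial test against the 50% benchmark. The trial counts below are assumptions chosen to match the reported accuracies; the summary does not give per-condition judgment counts.

```python
from scipy.stats import binomtest

# Assumed counts matching ~52% (HITL) and ~71% (HOOTL) accuracy;
# actual per-condition Ns are not reported in the summary.
for label, n_correct, n_trials in [("HITL", 208, 400), ("HOOTL", 284, 400)]:
    result = binomtest(n_correct, n_trials, p=0.5)
    ci = result.proportion_ci(confidence_level=0.95)
    print(f"{label}: accuracy={n_correct / n_trials:.2f}, "
          f"p={result.pvalue:.3g}, 95% CI=({ci.low:.2f}, {ci.high:.2f})")
```

With these numbers, the HITL accuracy's confidence interval straddles 0.5 while the HOOTL interval sits well above it, mirroring the pattern the paper reports.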

Statistical analysis employed mixed‑effects logistic regression to account for participant‑level random effects and experimental block effects. Effect sizes, confidence intervals, and p‑values were reported for all main effects and interactions, confirming the robustness of the findings.
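As a sketch of that analysis on synthetic stand-in data: statsmodels offers a Bayesian variational approximation to mixed-effects logistic regression (`BinomialBayesMixedGLM`); the authors' actual specification (likely a frequentist model, e.g. lme4-style) is not reproduced here, and the data below are simulated to echo the reported accuracies.

```python
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Synthetic stand-in data: one row per discrimination judgment.
rng = np.random.default_rng(0)
n_participants, trials = 80, 5
df = pd.DataFrame({
    "participant": np.repeat(np.arange(n_participants), trials),
    "selection": rng.choice(["HITL", "HOOTL"], n_participants * trials),
    "transparency": rng.choice(["opaque", "transparent"], n_participants * trials),
})
# Accuracy near chance for HITL, well above chance for HOOTL.
p = np.where(df["selection"] == "HOOTL", 0.71, 0.52)
df["correct"] = rng.binomial(1, p)

model = BinomialBayesMixedGLM.from_formula(
    "correct ~ selection * transparency",     # fixed effects + interaction
    {"participant": "0 + C(participant)"},    # participant random intercepts
    df,
)
result = model.fit_vb()                       # variational Bayes fit
print(result.summary())
```

The random intercept per participant plays the same role as the participant-level random effects in the paper: it keeps repeated judgments from the same person from being treated as independent observations.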

In the discussion, the authors argue that the results provide empirical support for the claim that modern large-scale language models can produce text that is, at least in curated form, indistinguishable from human writing. That curated GPT-2 poems evaded detection while randomly drawn ones did not underscores the importance of a human-in-the-loop quality-control step: the model's raw output may be noisy, but selective curation can elevate it to human-level quality. The modest preference for human poems, together with the negligible impact of transparency, points to an underlying "human-authorship bias" that persists even when participants are fully aware of AI involvement.

Methodologically, the study contributes a novel experimental paradigm that combines incentivized discrimination with preference measurement, uses identical prompts for human and AI texts, and systematically varies both output selection and source transparency. This design can serve as a template for future research on the perceptual and affective consequences of AI‑generated creative content.

The authors conclude that while GPT‑2 can generate poetry that fools humans when the best samples are hand‑picked, people still retain a slight preference for human‑written verses and are not swayed merely by disclosure of AI authorship. These findings have implications for the deployment of AI‑assisted creative tools, for policy discussions about algorithmic transparency, and for the broader debate on the future of human‑machine co‑creativity.