How LLMs Are Changing the Way Scientists Do Research
Three papers on how large language models are reshaping scientific workflows — from literature discovery to multi-turn research conversations.
By 일리케 — KOINEU curator
I run a site that exists at the intersection of LLMs and academic research, so I pay close attention to papers studying that intersection directly. How are large language models actually being used in scientific work? What are their failure modes? Here are three papers that give grounded, empirical answers.
People Use Research AI Tools — But Not How Developers Expect
Understanding Usage and Engagement in AI-Powered Scientific Research Tools: The Case of… is a human-computer interaction study of how researchers actually use AI tools in practice. The finding that stuck with me: usage patterns diverge significantly from what tool designers intended. Researchers tend to use AI tools for narrow, specific tasks (finding related papers, summarizing a methods section) rather than the broad exploratory workflows that tools are often designed around.
This is both reassuring and humbling. It suggests AI research tools are useful, but in a more constrained, instrumental way than the demos suggest. The implication for tool design is clear: support the narrow, high-frequency tasks well, rather than optimizing for impressive-but-rare broad capabilities.
Multi-Turn Research Conversations Are Hard
MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations addresses a specific weakness in retrieval-augmented generation (RAG): the multi-turn case. Most RAG research evaluates single queries — you ask one question, the system retrieves relevant documents, and the model generates an answer. But real research conversations unfold over multiple turns, with context building up, clarifications, follow-ups, and sometimes contradictory information.
The paper introduces a benchmark for this harder case, and the results are sobering: current systems degrade significantly when conversations extend beyond two or three turns. The main failure modes are losing track of context from earlier turns and inconsistently handling conflicting information that accumulates over a conversation. These are important gaps to fill before RAG systems are reliable enough for serious research use.
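To make the multi-turn setup concrete, here is a minimal sketch of a multi-turn RAG loop, with my own toy retriever and names (this is not the paper's benchmark or system). The simplest way to keep follow-up questions grounded is to fold the accumulated conversation into each retrieval query, so that a turn like "what are its limitations" still carries the entity from earlier turns:

```python
# Toy multi-turn RAG loop (illustrative names and corpus, not from the paper).

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query."""
    words = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def answer_turn(history: list[str], user_msg: str, corpus: list[str]) -> str:
    """One conversation turn: fold the whole history into the retrieval query."""
    history.append(user_msg)
    query = " ".join(history)   # naive context carry-over across turns
    docs = retrieve(query, corpus)
    return docs[0]              # stand-in for LLM generation over the docs

corpus = [
    "BERT is a transformer encoder pretrained with masked language modeling",
    "BERT limitations include a fixed 512 token context window",
    "ResNet is a convolutional network with residual connections",
]
history: list[str] = []
print(answer_turn(history, "tell me about BERT", corpus))
print(answer_turn(history, "what are its limitations", corpus))
```

Even this naive carry-over shows where the failure modes come from: the query grows with every turn, stale context dilutes retrieval, and nothing in the loop reconciles documents that contradict each other, which is exactly the degradation the benchmark measures.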
Why Diffusion Language Models Struggle to Think in Parallel
Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Generation is more theoretical, but it provides important context for anyone following the LLM landscape. Diffusion language models are an alternative to the standard autoregressive approach (which generates text left-to-right, one token at a time). The appeal is speed — parallel generation could be much faster.
But the paper shows there’s a fundamental tension: language has sequential dependencies that make parallel generation difficult without sacrificing quality. The analysis clarifies exactly why this is hard and identifies what would need to change for parallel language models to work well. It’s a useful paper for calibrating expectations about where the technology is heading.
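The sequential-dependency tension can be seen in a toy example (my own construction, not the paper's analysis). If every position is sampled independently from its marginal distribution, which is what a single fully parallel denoising step effectively does, the sampler places probability mass on token combinations the data never contains:

```python
# Toy illustration of why independent per-position sampling breaks
# sequential dependencies (my own construction, not the paper's analysis).
from itertools import product

# True data distribution over two-token sequences.
data = {("new", "york"): 0.5, ("los", "angeles"): 0.5}

# Per-position marginals: all a fully parallel (independent) sampler sees.
pos0: dict[str, float] = {}
pos1: dict[str, float] = {}
for (w0, w1), p in data.items():
    pos0[w0] = pos0.get(w0, 0.0) + p
    pos1[w1] = pos1.get(w1, 0.0) + p

# Joint implied by sampling each position independently from its marginal.
parallel_joint = {(w0, w1): pos0[w0] * pos1[w1]
                  for w0, w1 in product(pos0, pos1)}

# "new angeles" has zero probability in the data, but gets mass 0.25 here,
# while each real phrase drops from 0.5 to 0.25.
print(parallel_joint[("new", "angeles")])  # mass on an impossible phrase
print(parallel_joint[("new", "york")])     # reduced mass on a real phrase
```

An autoregressive sampler avoids this by conditioning the second token on the first; a parallel sampler has to recover the same dependency some other way (for diffusion models, across many denoising steps), which is the quality-versus-speed tension the paper formalizes.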
The Bottom Line
LLMs are genuinely useful for scientific work — but the hype is running ahead of the reality in several specific ways. The papers above all point in the same direction: the problems are real and well-defined, and solving them requires careful engineering and honest evaluation rather than simply scaling up.
Papers from cs.CL and cs.HC. — 일리케