NutriBench: A Dataset for Evaluating Large Language Models on Nutrition Estimation from Meal Descriptions
Accurate nutrition estimation helps people make informed dietary choices and is essential for preventing serious health complications. We present NutriBench, the first publicly available benchmark of natural-language meal descriptions for nutrition estimation. NutriBench consists of 11,857 meal descriptions generated from real-world global dietary intake data, human-verified and annotated with macronutrient labels, including carbohydrates, proteins, fats, and calories. We conduct an extensive evaluation on NutriBench for the task of carbohydrate estimation, testing twelve leading Large Language Models (LLMs), including GPT-4o, Llama3.1, Qwen2, Gemma2, and OpenBioLLM models, using standard, Chain-of-Thought, and Retrieval-Augmented Generation prompting strategies. Additionally, we present a study involving professional nutritionists, finding that LLMs can provide comparable but significantly faster estimates. Finally, we perform a real-world risk assessment by simulating the effect of carbohydrate predictions on the blood glucose levels of individuals with diabetes. Our work highlights the opportunities and challenges of using LLMs for nutrition estimation, demonstrating their potential to aid professionals and laypersons alike and to improve health outcomes. Our benchmark is publicly available at: https://mehak126.github.io/nutribench.html
💡 Research Summary
The paper introduces NutriBench, the first publicly available benchmark that evaluates large language models (LLMs) on nutrition estimation from natural‑language meal descriptions. NutriBench comprises 11,857 human‑verified meal descriptions derived from real‑world dietary intake data collected across eleven countries through the United States Department of Agriculture’s What We Eat in America (WWEIA) survey and the FAO/WHO GIFT database. Each description is paired with precise macro‑nutrient labels—carbohydrates, proteins, fats, and total calories—annotated at the gram level.
To generate the natural‑language descriptions, the authors first cleaned the raw intake records, mapping food items to the FoodData Central (FDC) database to obtain standardized nutrient values. They then used GPT‑4o‑mini to translate the structured data into everyday language, preserving both metric (grams) and colloquial serving units (e.g., “a cup”, “half a slice”). Human annotators verified the output for grammatical correctness and factual consistency, ensuring a high‑quality dataset that captures the variability of real‑world self‑reporting.
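The description-generation step above can be sketched in code. Note that the record schema, prompt wording, and function name below are illustrative assumptions for exposition, not the authors' actual pipeline; in the paper, the prompt is sent to GPT-4o-mini rather than printed.

```python
# Sketch of turning a structured intake record (already mapped to FDC
# nutrient values) into a generation prompt. Field names and wording
# are hypothetical.

def build_generation_prompt(record):
    """Ask for an everyday-language meal description that preserves the
    original serving sizes and units."""
    items = "; ".join(
        f"{it['food']} ({it['amount']} {it['unit']})" for it in record["items"]
    )
    return (
        "Rewrite the following meal log as a single natural sentence, "
        "keeping the quantities and serving units exactly as given:\n"
        f"{items}"
    )

record = {
    "items": [
        {"food": "white rice, cooked", "amount": 1, "unit": "cup"},
        {"food": "grilled chicken breast", "amount": 120, "unit": "g"},
    ]
}
prompt = build_generation_prompt(record)
print(prompt)
```

In the actual pipeline, the model's output would then go to human annotators for the verification step described above.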
The evaluation focuses on carbohydrate estimation, a critical task for diabetes management. Twelve state‑of‑the‑art LLMs are tested: closed‑source models (GPT‑4o, GPT‑4o‑mini), open‑source models (Llama 3, Llama 3.1, Gemma 2, Qwen 2), and a domain‑specific medical model (OpenBioLLM‑70B). Each model is prompted with four strategies: (1) standard direct prompting; (2) Chain‑of‑Thought (CoT) prompting, which elicits step‑by‑step reasoning; (3) Retrieval‑Augmented Generation (RAG), which supplies nutrition facts retrieved from an external knowledge base; and (4) a hybrid RAG + CoT approach.
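The four strategies can be sketched as prompt templates. The exact wording used in the paper is not reproduced here; these templates only illustrate how each strategy augments the basic query.

```python
# Illustrative prompt templates for the four strategies; wording is
# an assumption, not the paper's actual prompts.

MEAL = "I had a cup of cooked white rice and a grilled chicken breast."

def standard_prompt(meal):
    # (1) Direct prompting: just ask for the number.
    return f"How many grams of carbohydrates are in this meal?\nMeal: {meal}"

def cot_prompt(meal):
    # (2) CoT: request per-item reasoning before the final answer.
    return (standard_prompt(meal)
            + "\nThink step by step: estimate the carbohydrates in each "
              "food item, then sum them before giving a final number.")

def rag_prompt(meal, retrieved_facts):
    # (3) RAG: prepend nutrition facts retrieved from a knowledge base.
    context = "\n".join(retrieved_facts)
    return f"Nutrition facts:\n{context}\n\n" + standard_prompt(meal)

def rag_cot_prompt(meal, retrieved_facts):
    # (4) Hybrid: retrieved context plus step-by-step reasoning.
    return (rag_prompt(meal, retrieved_facts)
            + "\nThink step by step and use the facts above.")
```

The design difference is what each strategy adds to the query: CoT changes how the model reasons, while RAG changes what evidence it sees; the hybrid combines both.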
Results show that CoT prompting consistently improves performance across models by encouraging explicit calculation of each food item’s carbohydrate content. GPT‑4o with CoT achieves the highest accuracy of 66.82 % and an answer rate of 99.16 %, outperforming even the largest open‑source Llama 3.1‑405B‑FP8. RAG alone provides mixed benefits; its effectiveness depends heavily on the relevance and correctness of the retrieved snippets. The hybrid RAG + CoT configuration yields modest gains for some models but does not surpass pure CoT on average.
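The two reported metrics, accuracy and answer rate, can be computed as below. This is a minimal sketch assuming accuracy counts predictions within a fixed absolute-error tolerance of the gram label; the ±7.5 g tolerance here is an illustrative choice, not necessarily the paper's threshold.

```python
def evaluate(predictions, labels, tol=7.5):
    """Answer rate = fraction of queries with a parseable numeric answer
    (None marks a refusal or unparseable output); accuracy = fraction of
    answered queries within +/- tol grams of the label. The tolerance
    value is an assumption for illustration."""
    answered = [(p, y) for p, y in zip(predictions, labels) if p is not None]
    answer_rate = len(answered) / len(predictions)
    correct = sum(abs(p - y) <= tol for p, y in answered)
    accuracy = correct / len(answered) if answered else 0.0
    return answer_rate, accuracy

preds = [45.0, None, 30.0, 12.0]   # one unparseable answer
labels = [48.0, 20.0, 50.0, 10.0]
rate, acc = evaluate(preds, labels)
# rate = 0.75 (3 of 4 answered); acc = 2/3 (two answers within +/- 7.5 g)
```

Measuring accuracy only over answered queries is why both numbers are reported: a model can trade answer rate for accuracy by refusing hard cases.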
A comparative study with three professional nutritionists reveals that LLMs are not only faster—producing estimates in a fraction of the time—but also comparable in accuracy. The average absolute error of the best LLMs is statistically indistinguishable from that of the human experts, suggesting that LLMs could serve as viable decision‑support tools in clinical or consumer settings.
To assess real‑world health impact, the authors simulate 44,800 insulin‑dosing scenarios for Type 1 diabetes patients, feeding the carbohydrate estimates from each model into a standard insulin‑glucose dynamics model. Using LLM‑derived estimates, 87 % of simulated glucose excursions remain within the safe target range (70–180 mg/dL), a 12 percentage‑point improvement over manual estimates. This risk analysis demonstrates that accurate language‑model nutrition estimation can meaningfully reduce hypo‑ and hyper‑glycemic events.
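The full simulation relies on an insulin-glucose dynamics model, but the point where a carbohydrate estimate enters dosing can be sketched with the standard carbohydrate-bolus formula (grams of carbs divided by the patient's insulin-to-carb ratio). The 10 g/U ratio below is an illustrative default, not a value from the paper; real ratios are individualized by a clinician.

```python
def carb_bolus(carb_estimate_g, insulin_carb_ratio=10.0):
    """Standard carbohydrate bolus: grams of carbohydrate divided by the
    insulin-to-carb ratio (grams covered per unit of insulin).
    The 10 g/U default is an illustrative assumption."""
    return carb_estimate_g / insulin_carb_ratio

# How estimation error propagates into dosing: an under-estimate leads
# to an under-dose and a higher post-meal glucose excursion, while an
# over-estimate risks hypoglycemia.
true_dose = carb_bolus(60.0)    # dose for the actual 60 g meal
model_dose = carb_bolus(52.0)   # dose from an estimate that is 8 g low
```

This is why absolute carbohydrate error, not just relative ranking of models, matters for the simulated time-in-range results above.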
The paper’s contributions are fourfold: (1) the creation and public release of NutriBench, a rigorously annotated dataset that fills a gap in nutrition‑AI research; (2) a comprehensive benchmark of diverse LLMs across multiple prompting paradigms, highlighting the importance of reasoning‑oriented prompts; (3) an expert‑human baseline that validates LLM performance against professional standards; and (4) a clinically‑oriented risk assessment that quantifies the potential health benefits of deploying LLM‑based nutrition tools.
Future work should extend the benchmark to other nutrients (e.g., micronutrients, fiber), incorporate longitudinal user feedback loops, and explore more sophisticated retrieval mechanisms that can dynamically query up‑to‑date nutrition databases. By providing a solid dataset and thorough analysis, NutriBench sets the stage for the next generation of AI‑driven dietary assistance, bridging the gap between natural language interaction and precise nutritional guidance.