Trust, Safety, and Accuracy: Assessing LLMs for Routine Maternity Advice



Sai Divya Vissamsetty, Anagani Bhanusree, Rimjhim, and K VenkataKrishna Rao
National Institute of Technology, Warangal, India
vs25csr1p10@student.nitw.ac.in, ab25csr1p11@student.nitw.ac.in, rimjhim@nitw.ac.in, kvkrao@nitw.ac.in

Abstract. Maternal health information accessibility remains a significant concern in India, where inadequate healthcare infrastructure and socio-cultural hesitations often delay medical consultation. Many women hesitate to discuss pregnancy-related concerns with doctors due to stigma or privacy issues, and increasingly turn to digital platforms such as LLMs for everyday health queries. This study evaluates three leading LLMs, ChatGPT-4o, Perplexity AI, and Gemini AI, for their ability to provide medically reliable, readable, and culturally sensitive responses to seventeen pregnancy-related questions, including prevalent myths and misconceptions in the Indian context. Readability of the LLM responses was assessed using several English readability metrics to ensure that information is presented in simple, clear, and easily understandable language, enabling women, especially those from diverse educational backgrounds, to comprehend and apply the health guidance effectively. Cosine similarity was used to evaluate semantic alignment with expert medical responses, and Jaccard similarity measured noun-entity overlap. Results show that ChatGPT generated the most readable and clinically coherent responses (FRE = 64.63; FKGL = 7.53), while Perplexity AI achieved the highest semantic similarity (mean cosine = 0.206). For noun overlap, ChatGPT exhibited stronger contextual alignment (Jaccard = 0.172–0.216).
Overall, the findings underscore the potential of LLMs, particularly ChatGPT and Perplexity AI, as supportive tools for addressing pregnancy-related myths, enhancing maternal health communication, and improving digital health literacy in underserved Indian communities.

Keywords: Large Language Models · Maternal Health · Readability · Semantic Similarity · AI in Healthcare · Cultural Myths.

1 Introduction

Maternal health in India persists as a major public health concern, largely due to unequal healthcare access between rural and urban populations, which contributes to delays in essential medical care. Social stigma, limited awareness, and cultural taboos frequently discourage women from discussing reproductive or pregnancy-related concerns openly with healthcare providers. At the same time, the increasing reliance on digital platforms for everyday queries, ranging from lifestyle advice to health information, has led to growing engagement with LLMs. These models are increasingly being used as first-line sources of information, even on sensitive topics like pregnancy. While digital accessibility has expanded rapidly, with over 830 million internet users in India and nearly half of rural women now online, misinformation and culturally ingrained myths related to pregnancy remain widespread. In such a context, LLMs have the potential to provide quick, empathetic, and privacy-preserving responses, but their reliability and cultural sensitivity require systematic evaluation.

Recent studies have explored the use of LLMs in maternal and reproductive healthcare, focusing on reliability, readability, and practical utility. Lima et al. [1] evaluated LLMs for delivering maternal health information in resource-limited settings, highlighting both promise and challenges. Khromchenko et al.
[2] compared ChatGPT-3.5 and Gemini on pregnancy questions, noting differences in accuracy and reliability. Onder et al. [3] assessed ChatGPT-4's responses on hypothyroidism during pregnancy, stressing the need for context-specific evaluation. Recker et al. [4] examined AI integration in gynecological care to aid patient decision making. Insuk et al. [5] showed AI models matching human performance in systematic review screening. Taşkum et al. [6] found high scientific reliability but poor readability in prenatal screening content from LLMs. Wan et al. [7] revealed limitations in the accuracy and reference quality of ChatGPT responses. Rimjhim et al. [8] showed that digital platforms, including social media and LLM interactions, reflect how cultural norms influence health queries and the spread of misinformation.

Here, we investigated the effectiveness of leading LLMs, ChatGPT, Gemini, and Perplexity, in addressing pregnancy-related queries within the Indian socio-cultural context. These queries, covering early gestation to postnatal care and including prevalent myths, were evaluated for accuracy, readability, and contextual understanding against expert medical opinions. Results indicated that LLMs can provide reliable, culturally sensitive, and comprehensible information, supporting maternal health communication in rural and semi-urban regions. The findings emphasize the potential of AI-driven tools to complement healthcare services, bridge information gaps, and empower women to make informed decisions through accessible and confidential digital platforms.

2 Data Collection

We followed a structured two-phase methodology to examine pregnancy-related information from LLMs and qualified medical practitioners.
Seventeen questions, including prevalent myths and misconceptions in the Indian context and addressing topics from early gestation to postnatal care, were designed for comprehensive coverage. In the first phase, responses were obtained from the leading LLMs to assess their grasp of maternal health. The LLMs were instructed to respond as a local, experienced obstetric and maternal care expert. The second phase involved collecting corresponding answers from experienced obstetric and maternal care experts for comparative analysis. All responses were evaluated using predefined standards for medical accuracy, clarity, and clinical appropriateness. Ethical protocols were maintained throughout the process to ensure participant confidentiality and data reliability.

3 Methodology

To assess the quality and reliability of LLM-generated responses against those from healthcare professionals, a structured multi-level evaluation framework was implemented. The methodology emphasized three core parameters: readability, semantic similarity, and noun-entity overlap.

A) Readability Assessment Using Linguistic Metrics: Readability was analyzed to ensure accessibility for non-expert audiences using established linguistic metrics that evaluate sentence complexity, word choice, and syllable count. These measures helped determine the clarity and comprehension level of the LLM-generated responses. The metrics used in this study included:

Flesch Reading Ease (FRE): This score evaluates how easy a passage is to read, using sentence length and word syllable count. A higher score indicates more accessible language.

Flesch-Kincaid Grade Level (FKGL): The FKGL translates text complexity into a U.S. grade level. A score of 8.0 means the content is suitable for someone at the 8th-grade reading level.
Lower scores are better for general audiences.

Gunning Fog Index (GFI): This index estimates the years of formal education required to understand the text. Scores in the range 7–10 suit general audiences, while scores above 12 indicate college-level material or higher.

SMOG Index: Often used in public health, the SMOG formula focuses on the number of polysyllabic words. The result reflects the minimum grade level required to understand the text.

Dale–Chall Readability Score (DCRS): This metric considers how many unfamiliar words appear in a passage. Scores below 5.0 suggest high accessibility.

Automated Readability Index (ARI): It uses character count to estimate grade level.

By applying these metrics, we were able to objectively assess the reading difficulty of each response.

B) Semantic Similarity Analysis Using Cosine Similarity: This technique compares the meaning of two text samples by analyzing their vector representations. A score close to 1 indicates high similarity in meaning, and a score near 0 indicates little or no semantic alignment.

C) Noun-Entity Recognition Overlap Using Jaccard Similarity: This measure quantifies how much overlap exists between the entities identified in both sources. A score of 1 means complete overlap (both responses mention the same concepts) and a score of 0 means no shared terminology.

4 Results and Discussion

4.1 Readability Metrics

We conducted a thorough analysis of the readability of the responses generated by each LLM. This step was crucial to assess how accessible and understandable the responses would be for non-expert readers, particularly expectant mothers and mothers seeking reliable information about pregnancy. To this end, six widely recognized readability metrics were employed.
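As an illustration of how two of these scores are produced, the Flesch formulas can be computed directly from sentence, word, and syllable counts. The sketch below is dependency-free; its syllable counter is a naive vowel-run approximation, so values will differ slightly from standard readability tools:

```python
import re

def count_syllables(word: str) -> int:
    # Crude, dependency-free syllable estimate: count groups of vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text: str):
    """Return (FRE, FKGL) for a passage using the standard Flesch formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)          # average words per sentence
    spw = syllables / len(words)               # average syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw   # higher = more readable
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # lower = more readable
    return round(fre, 2), round(fkgl, 2)

sample = ("Iron rich foods help during pregnancy. "
          "Eat leafy vegetables and consult your doctor.")
print(flesch_scores(sample))
```

Longer sentences raise words-per-sentence and longer words raise syllables-per-word, which is why verbose model responses score worse on both formulas.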
These metrics can be broadly grouped by how their scores should be interpreted:

Group 1: Metrics Where a Higher Score Indicates Better Readability: FRE assesses how easy a text is to read, based on sentence length and syllables per word. A higher FRE score indicates that the content is easier to understand. From Table 1, we observe that ChatGPT received the highest FRE score, suggesting that its outputs were the easiest to read among all models. On this metric, a score between 60 and 70 typically indicates that the text is understandable to readers at a middle school (13–15 years old) reading level, which aligns well with public health communication standards.

Group 2: Metrics Where a Lower Score Indicates Better Readability: This group includes metrics that estimate the minimum grade level required to comprehend the LLM responses. A lower score is preferable, as it implies that the response is accessible to a wider audience. From Table 1, we notice that ChatGPT produced the most accessible responses, suitable for middle school reading levels. ChatGPT consistently outperformed Perplexity and Gemini across the five Group 2 readability metrics, achieving the lowest scores where simpler text is preferred.

Table 1: Average Readability Metrics Across LLMs

Group    Metric      ChatGPT  Perplexity  Gemini AI
Group 1  FRE         64.63    53.89       58.11
Group 2  FKGL        7.53     9.67        8.24
         GFI         9.86     12.35       10.59
         SMOG Index  10.04    12.00       10.83
         DCRS        9.19     10.10       9.47
         ARI         8.30     10.06       8.54

Perplexity's responses featured longer sentences and more complex vocabulary, resulting in higher reading levels. Gemini performed moderately but did not exceed ChatGPT. These results emphasize ChatGPT's strength in producing clear, accessible language, which is crucial for sensitive medical topics like pregnancy.
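The two comparison measures described in the methodology (Section 3, parts B and C) can be sketched as follows. This is a simplified stand-in: cosine similarity is computed over plain term-frequency vectors rather than the richer vector representations an actual pipeline might use, and the noun-entity sets are written by hand instead of being extracted by an NER model:

```python
import math
import re
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over simple term-frequency vectors (bag of words)."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def jaccard_overlap(nouns_a: set, nouns_b: set) -> float:
    """Jaccard similarity of two noun-entity sets: |A & B| / |A | B|.
    1.0 = identical concept sets, 0.0 = no shared terminology."""
    union = nouns_a | nouns_b
    return len(nouns_a & nouns_b) / len(union) if union else 0.0

llm_answer = "Folic acid supplements reduce the risk of neural tube defects."
doctor_answer = "Folic acid lowers the risk of neural tube defects in the baby."
print(round(cosine_similarity(llm_answer, doctor_answer), 3))

# Noun-entity sets as an NER step might return them (hand-written here):
print(jaccard_overlap({"folic acid", "neural tube defects", "supplements"},
                      {"folic acid", "neural tube defects", "baby"}))
```

The two measures are complementary: cosine similarity rewards overall lexical and semantic overlap, while Jaccard overlap checks specifically whether the same core clinical concepts appear in both responses.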
4.2 Semantic Similarity Evaluation Using Cosine Similarity

To evaluate the semantic alignment between the responses generated by the different LLMs and those provided by the doctors (D1, D2, D3, D4), we employed cosine similarity as a quantitative metric. It measures how similar two pieces of text are based on their vector representations, which capture the semantic content of the text and allow for a comparison that goes beyond simple word matching. Fig. 1 shows that Perplexity achieves the highest average similarity with three of the four doctors' responses. All three LLMs show a significant drop in similarity with D4, making it the least similar overall. ChatGPT performs slightly better than Perplexity on D2 (0.191 vs. 0.190), though their scores are nearly identical.

4.3 Noun-Entity Recognition Overlap Analysis Using Jaccard Similarity

In this method we evaluated how well the LLMs captured key noun entities, such as medical terms, conditions, anatomical references, and procedures, when compared to responses given by medical experts. This metric helps quantify the extent to which an LLM includes the same core clinical concepts as a trained healthcare professional. Fig. 2 shows that ChatGPT proved the most effective overall, boosted by a high match on D2, while Perplexity was the most consistently strong model across D1, D2, and D3. All three models struggled to find noun overlap with D4.

Fig. 1: Average Cosine Similarity between LLM and Doctor Responses
Fig. 2: Average Jaccard Similarity between LLM and Doctor Responses

5 Conclusion and Future Work

This study assessed ChatGPT, Perplexity, and Gemini in generating pregnancy-related healthcare information using semantic similarity, noun overlap, and readability metrics.
ChatGPT performed best, providing accurate, clear, and culturally sensitive responses that effectively addressed common pregnancy myths. Even so, the responses produced by the LLMs sat at comparatively high reading grade levels, which may hinder comprehension among women from diverse backgrounds in both rural and semi-urban regions. Future research should expand evaluations to diverse medical domains and socio-cultural contexts for greater inclusivity. Incorporating local health guidelines and domain-specific tuning can further enhance accuracy and relevance. Continuous benchmarking and real-world validation remain essential for responsible AI use in maternal health communication.

References

1. H. A. Lima, P. H. Trocoli-Couto, Z. Moazzam, L. C. Rocha, A. Pagano, F. F. Martins, L. T. Brabo, Z. S. Reis, L. Keder, A. Begum et al., "Quality assessment of large language models' output in maternal health," Scientific Reports, vol. 15, no. 1, p. 22474, 2025.
2. K. Khromchenko, S. Shaikh, M. Singh, G. Vurture, R. A. Rana, J. D. Baum, and R. Rana, "ChatGPT-3.5 versus Google Bard: which large language model responds best to commonly asked pregnancy questions?" Cureus, vol. 16, no. 7, 2024.
3. C. Onder, G. Koc, P. Gokbulut, I. Taskaldiran, and S. Kuskonmaz, "Evaluation of the reliability and readability of ChatGPT-4 responses regarding hypothyroidism during pregnancy," Scientific Reports, vol. 14, no. 1, p. 243, 2024.
4. F. Recker, R. Neubauer, A. Wittek, and N. Scholten, "Large language models and women's health: a digital companion for informed decision-making," Archives of Gynecology and Obstetrics, pp. 1–8, 2025.
5. S. Insuk, K. Boonpattharatthiti, C. Booncharoen, P. Chaipitak, M. Rashid, S. K. Veettil, N. M. Lai, N. Chaiyakunapruk, and T. Dhippayom, "How well do ChatGPT and Claude perform in study selection for systematic review in obstetrics," Journal of Medical Systems, vol. 49, no. 1, pp. 1–9, 2025.
6. İ. Taşkum, S. Sınacı, F. Aslan, G. M. Taşkum, and S. Sucu, "Assessment of readability, reliability, and quality of large language models in addressing frequently asked questions regarding prenatal screening for fetal chromosomal anomalies," International Journal of Gynecology & Obstetrics, 2025.
7. C. Wan, A. Cadiente, K. Khromchenko, N. Friedricks, R. A. Rana, and J. D. Baum, "ChatGPT: an evaluation of AI-generated responses to commonly asked pregnancy questions," Open Journal of Obstetrics and Gynecology, vol. 13, no. 9, pp. 1528–1546, 2023.
8. S. Dandapat et al., "Is gender-based violence a confluence of culture? Empirical evidence from social media," PeerJ Computer Science, vol. 8, p. e1051, 2022.
