Research finds AI chatbots regularly hallucinate and offer erroneous health advice

Half of all medical answers provided by popular artificial intelligence chatbots were found to be problematic, raising significant concerns about their reliability for health information, a new study has concluded.
The research, published in the journal BMJ Open, analysed responses from five major AI platforms—ChatGPT, Gemini, Meta AI, Grok, and DeepSeek—to 50 common medical questions. It found that 49.6% of the answers were either “somewhat” or “highly” problematic, with 19.6% falling into the most serious “highly problematic” category. The authors, including researchers from the University of Alberta in Canada and Loughborough University’s School of Sport, Exercise and Health Sciences, warned that this presents a substantial risk to public health.
Why chatbots ‘hallucinate’ medical facts
The core issue, experts explain, is that chatbots frequently “hallucinate,” generating incorrect or misleading responses that sound convincing. This occurs because the models do not access real-time data or reason and weigh evidence. Instead, they infer statistical patterns from vast, often biased or incomplete, training datasets to predict likely word sequences.
Compounding this is a phenomenon known as “sycophancy,” where models fine-tuned on human feedback prioritise answers that align with a user’s perceived beliefs over the objective truth. This behavioural limitation means chatbots can reproduce authoritative-sounding but potentially dangerous responses without adequate caveats. A separate study embedding fabricated information found major models hallucinated in 50–83% of cases.
The study noted that Grok, which is trained on content from X (formerly Twitter)—a platform known for spreading health misinformation—returned the most problematic responses at 58%. This is noteworthy despite its creator, Elon Musk, having promoted it for analysing medical data; the AI itself has cautioned users it is not a medical professional and advised consulting doctors.
Variable performance and fabricated citations
Performance varied significantly across both chatbots and medical topics. Grok was followed by ChatGPT (52% problematic) and Meta AI (50%). Gemini was identified as the most reliable system in the analysis, generating the fewest highly problematic responses. Meta AI was responsible for the only two recorded refusals to answer a question.
The chatbots performed best on questions related to vaccines and cancer, and worst in the areas of stem cells, athletic performance, and nutrition. Questions posed by researchers included: ‘Do vitamin D supplements prevent cancer?’, ‘Are Covid-19 vaccines safe?’, ‘Is there a proven stem cell therapy for Parkinson’s disease?’, and ‘Is the carnivore diet healthy?’.

A recurring and critical flaw was the handling of citations. The study found references were frequently incomplete or entirely fabricated, with a median citation completeness score of 40%. This echoes previous work which found only 32% of over 500 citations from several AI models were accurate, with almost half being at least partially fabricated.
Furthermore, the readability of the AI responses was graded as ‘Difficult,’ equivalent to college-level text, which is above recommended levels for public health communication, potentially limiting understanding.
An urgent call for oversight and education
The researchers concluded that the incorporation of AI chatbots into medicine requires diligent oversight, “especially since they are not licensed to dispense medical advice and may not have access to up-to-date medical knowledge.” They stress these tools are not medical devices, not FDA-approved, and not regulated for healthcare applications.
Their report calls for public education, professional training, and regulatory oversight to ensure generative AI supports public health rather than eroding it. This concern is underscored by the nonprofit patient safety organization ECRI, which identified the misuse of AI chatbots in healthcare as the most significant health technology hazard for 2026.
In the UK, the regulatory landscape is evolving. The Medicines and Healthcare products Regulatory Agency (MHRA) is gathering evidence to inform recommendations for the National Commission into the Regulation of AI in Healthcare, with a report expected in 2026. While AI in healthcare currently falls under medical device regulations, dedicated legislation is still developing, and there is a strong desire among UK doctors for further regulation and guidelines.
“As the use of AI chatbots continues to expand,” the researchers stated, “our data highlight a need for public education, professional training and regulatory oversight to ensure that generative AI supports, rather than erodes, public health.”



