Study questions accuracy of mental health diagnostic interviews

Diagnostic interviews for mental health conditions vary in reliability depending on the disorder being assessed, according to a major new review that challenges their long-standing status as a “gold standard” in psychiatry. The meta-analysis, published in JAMA Network Open, found that even when the same interview is given to the same patient twice, the results often change – a problem that could lead to misdiagnosis, inappropriate treatment or delays in care.
Moderate reliability, wide variation
The review pooled data from 57 studies, published between February 2024 and September 2025, involving a total of 8,146 adults. Researchers led by Laura Duncan, a psychiatry professor at McMaster University in Ontario, Canada, used Cohen’s kappa coefficient – a statistical measure that accounts for chance agreement – to assess test-retest reliability. A kappa of 1 indicates perfect agreement; 0 means agreement no better than chance. The overall pooled kappa was 0.69, which falls into the “substantial” or good range, but the figure masks considerable differences between conditions.
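For readers unfamiliar with the statistic, the sketch below shows how Cohen’s kappa can be computed for a single test-retest comparison: observed agreement is adjusted for the agreement expected by chance, given how often each diagnosis was assigned on each occasion. The patient data and the names `cohens_kappa`, `first_visit` and `second_visit` are hypothetical illustrations, not taken from the review, and the code does not reproduce the authors’ pooling of kappas across studies.

```python
from collections import Counter


def cohens_kappa(first, second):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(first)

    # Observed agreement: share of patients given the same label both times.
    p_observed = sum(a == b for a, b in zip(first, second)) / n

    # Chance agreement: for each label, the product of how often it was
    # assigned on each occasion, summed over all labels.
    counts_first, counts_second = Counter(first), Counter(second)
    labels = counts_first.keys() | counts_second.keys()
    p_chance = sum(counts_first[x] * counts_second[x] for x in labels) / (n * n)

    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical example: 10 patients interviewed twice with the same tool.
# 1 = diagnosis present, 0 = diagnosis absent.
first_visit = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
second_visit = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

print(round(cohens_kappa(first_visit, second_visit), 2))  # 0.58
```

In this made-up case the two interviews agree for 8 of 10 patients (80%), but because about half of that agreement would be expected by chance alone, kappa comes out at roughly 0.58.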
Substance use disorders (SUDs) showed higher reliability, with a pooled kappa of 0.72. Opioid use disorder had the highest score of any individual condition at 0.81 – at the threshold of what is conventionally labelled “almost perfect” agreement. Professor Duncan attributed this to the behavioural nature of SUD criteria. “It’s often easier to estimate how many drinks you had in a week than the number of days you felt sad or anxious,” she said, explaining that clearer behaviours and timelines make diagnoses more consistent.
By contrast, other mental health disorders had a pooled kappa of 0.65. Nonaffective psychoses performed worst, at 0.55, while bipolar disorders scored a relatively strong 0.74. Anxiety, depressive and personality disorders clustered in the low to mid 0.60s. The findings underscore that interviews based on subjective personal experience – how sad, anxious or paranoid a person feels – are less reliable than those for conditions with observable patterns of substance use.
The study’s authors noted significant heterogeneity between the included papers, meaning that interview structure alone does not guarantee stable diagnoses. Methodological factors such as sample size and retest interval did not explain the variability, though diagnostic criteria did account for some of the differences seen in SUDs.
Interview structure debate: fully structured vs semi-structured
Dr Michael First, a psychiatrist and professor at Columbia University who authored the Structured Clinical Interview for DSM-5 (SCID), expressed frustration with the study’s approach. While he agreed that diagnostic interviews vary in reliability, he argued the review failed to provide the granular detail clinicians need to choose the best tool. “It’d be nice to be able to look at this and say: ‘Oh, based upon this paper, I should pick this one because of this.’ That would be doing the field a real service,” he said. “But there’s simply not enough information here.”
First took particular issue with how the study lumped together two fundamentally different types of interviews: fully structured and semi-structured. Fully structured interviews are rigid scripts from which the interviewer cannot deviate. “If the person says something contradictory, you’re not allowed to even point out that it’s contradictory,” First explained. This format is designed for epidemiological research on large populations and can be administered by people with minimal training. Because the questions are asked identically each time, fully structured interviews tend to yield more consistent results across repeated administrations – but at the potential cost of diagnostic nuance.
Semi-structured interviews, by contrast, are intended for trained clinicians who can adapt their questions as needed. “If a patient’s answer is vague or contradictory, their provider is able to ask follow-up questions to clarify,” First said. This flexibility can improve diagnostic accuracy, but it also means a patient’s answers might vary more from session to session because the clinician probes differently. The SCID is a widely used semi-structured interview; studies have shown its severity scales may have superior psychometric properties compared to its categorical diagnoses. The Mini International Neuropsychiatric Interview (MINI) is a brief, structured alternative that can be administered in about 15 minutes and has been translated into more than 70 languages. Another tool, the Clinician-Administered PTSD Scale (CAPS), is considered a “gold standard” for PTSD assessment and has demonstrated excellent reliability across items, raters and testing occasions.
Professor Duncan acknowledged that it would be useful to address all of First’s concerns, but said the data simply does not exist. In the papers her team analysed, they “attempted to extract information on interview format, but this was often unclear or not reported,” she said. The lack of available detail to compare specific instruments and designs is itself a sign of the need for greater rigour in psychiatric diagnosis, she added.
The fact that diagnostic interviews continue to be widely treated as a definitive benchmark, despite mixed evidence, reflects a lack of better alternatives, Duncan noted. First readily admitted that even the tools he helps design are less than ideal. “We’ve been saying for 50 years” that psychiatry needs objective laboratory tests for mental conditions, he said.
Future directions: moving beyond strict categories
Duncan pointed to an alternative approach: “move away from strict diagnostic categories, where a condition is either present or absent, and think about symptoms on a spectrum or continuum.” Such a model could better capture the fluid nature of mental health and reduce the variability seen when patients are forced into discrete boxes. For now, the study serves as a warning to clinicians and researchers not to rely solely on any single diagnostic interview as a gold standard, but to incorporate broader clinical judgment and contextual information. Inconsistent diagnoses can lead to over- or under-treatment, delayed care, or inappropriate interventions, the authors cautioned – and until more rigorous reporting and alternative frameworks emerge, the field will have to navigate the gap between what interviews promise and what they deliver.



