We used 3 distinct databases, each focusing on a specific neurobehavioral condition. Two of these, ASDBank [] and AphasiaBank [], are sourced from TalkBank [] and contain language samples for autism spectrum disorder (ASD) and aphasia, respectively, whereas the third database, the Distress Analysis Interview Corpus-Wizard-of-Oz (DAIC-WOZ) database [], contains textual data from patients with depression, anxiety, and posttraumatic stress disorder.
AphasiaBank [] is a repository containing multimedia language samples from both participants with aphasia and control participants. These samples were collected through standardized discourse tasks, including unstructured speech samples, picture descriptions, story narratives, and procedural discourse.
ASDBank [] comprises a collection of language samples and interactions from individuals diagnosed with ASD. The data within ASDBank include transcribed audio and video recordings of clinical interviews and naturalistic interactions.
We used all available English-language transcripts from both AphasiaBank and ASDBank. Data processing was performed to consolidate all samples from a single participant into 1 data point. The resulting dataset comprised 715 aphasia data points and 352 control data points for AphasiaBank and 34 ASD data points and 44 control data points for ASDBank.
The DAIC-WOZ database [] consists of semistructured interviews conducted by a simulated agent designed to identify symptoms of depression and posttraumatic stress disorder. These interviews include questions about personal experiences, quality of life, and emotions. We consolidated all samples from a participant, including the interviewer’s input, into a single data point. The DAIC-WOZ database includes 56 patient data points and 133 control data points.
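To make the consolidation step concrete, the following is a minimal sketch of how per-participant samples can be merged into single data points, assuming the transcripts have already been parsed into a table with hypothetical columns (participant_id, utterance, and diagnosis); the CHAT-format parsing for the TalkBank corpora and the DAIC-WOZ transcript format are not shown.

```python
# Minimal consolidation sketch (assumed column names: participant_id,
# utterance, diagnosis; these are illustrative, not the datasets' own schema).
import pandas as pd

def consolidate_by_participant(df: pd.DataFrame) -> pd.DataFrame:
    """Merge every utterance from a participant into one text data point."""
    return (
        df.groupby("participant_id")
          .agg(text=("utterance", " ".join),       # concatenate all utterances
               diagnosis=("diagnosis", "first"))   # one label per participant
          .reset_index()
    )

# Toy example
toy = pd.DataFrame({
    "participant_id": ["p1", "p1", "p2"],
    "utterance": ["uh I went to the store", "and then um", "I feel fine today"],
    "diagnosis": ["aphasia", "aphasia", "control"],
})
print(consolidate_by_participant(toy))
```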
The table below provides a summary of the diagnosis distribution across each dataset.
Database | Number of control data points | Number of data points for condition of interest |
AphasiaBank | 352 | 715 |
ASDBank | 44 | 34 |
DAIC-WOZa | 133 | 56 |
aDAIC-WOZ: Distress Analysis Interview Corpus-Wizard-of-Oz.
Aphasia, depression, and ASD each manifest linguistic characteristics that partly overlap and partly remain unique to the condition. Aphasia, typically resulting from brain damage, is characterized by impaired language production and comprehension, often including repetitive language and the frequent use of filler words as individuals struggle to retrieve or organize words effectively []. Depression, while primarily a mood disorder, affects language through reduced verbal output, monotone speech, and a preference for negative or self-critical language patterns. Depressive language, such as expressions of negativity, can be a key symptom of the condition. Another characteristic linguistic feature is an excessive number of sighs, reflecting physical or emotional fatigue. ASD is marked by unique communication challenges, including delayed speech development, echolalia (repetition of phrases), difficulty with pragmatic language (eg, understanding sarcasm or social cues), and overly literal or formal speech. Individuals with ASD may also exhibit fragmented sentences and frequent use of filler words, reflecting challenges in organizing thoughts or navigating social interactions [].
Many previous studies have leveraged the datasets we used in our research. However, much of the existing work has focused on advanced tasks such as multimodal detection or severity classification rather than simpler text-based binary classification using chatbots. These studies have often achieved strong (although not clinically translatable) performance, frequently exceeding 80% in F1-score or accuracy. For example, Dinkel et al [] applied a text-based multitask network to the DAIC-WOZ dataset, achieving an F1-score of 0.84 for binary detection. Similarly, Agrawal and Mishra [] used a fused bidirectional encoder representations from transformers–bidirectional long short-term memory model integrated with Extreme Gradient Boosting to perform binary classification, achieving an F1-score of 0.91.
For the AphasiaBank dataset, most previous studies have focused on severity classification, making direct comparisons with our binary classification study challenging. The only relevant work, conducted by Cong et al [], found that using LLM-derived surprisal features facilitated detection, achieving 79% in both accuracy and F1-score. Similarly, studies involving the ASDBank dataset are limited, partly due to its recent development. Chu et al [] included another dataset, the Child Language Data Exchange System, as a source of healthy control data. By extracting a few linguistic features from these 2 datasets, their binary classification approaches reached F1-scores of over 80% [].
These studies suggest that, if chatbots possess classification capabilities comparable to those of the models used in previous work, LLM-based approaches that diagnose directly from these datasets should achieve similarly high performance.
We evaluated 2 approaches using 3 families of state-of-the-art conversational AI models: ChatGPT with GPT-4, ChatGPT with GPT-4o, and ChatGPT with GPT-o3 (OpenAI); Gemini 2.5 Pro (Google AI); and Claude 3.5 Sonnet (Anthropic). These models were selected because they are some of the most widely used modern LLMs and because their efficacy in neurobehavioral classification tasks remains underexamined in the current literature. Notably, models such as Gemini 2.5 Pro and ChatGPT with GPT-o3 incorporate built-in prompting strategies such as chain-of-thought reasoning, allowing us to examine how such strategies influence performance. We excluded open models such as Llama because they do not support file input, and including them would require a different approach from that used for the other models we tested.
We incorporated 3 widely recognized assessment scales and checklists used in clinical settings. We selected scales that assess behaviors at least tangentially related to language and that do not require extended observation periods. For example, the Autism Spectrum Quotient evaluates traits such as social preferences (“S/he prefers to do things with others rather than on her/his own”), behavioral patterns (“S/he prefers to do things the same way over and over again”), and attention capabilities (“have difficulty sustaining attention in tasks or fun activities”). The rating system for this checklist—definitely disagree, slightly disagree, slightly agree, and definitely agree—does not necessitate longitudinal observation, unlike scales that use time-sensitive ratings such as rarely, less often, very often, and always.
The assessment scales and checklists included in our study were as follows: (1) the fluency test in the Western Aphasia Battery–Aphasia Quotient (AphasiaBank) [], (2) the Autism Spectrum Quotient (ASDBank) [], and (3) the Burns Depression Checklist [] (DAIC-WOZ database).
In the 2 direct diagnosis conditions, we conducted the experimental procedure 5 times and obtained results based on the entirety of each dataset. We did not perform a training and testing split for these conditions, opting instead for a zero-shot classification approach to assess the models’ ability to generalize from their pretrained knowledge. However, in the code generation conditions, we instructed the chatbot to perform stratified 5-fold cross-validation on the entire dataset. The training and testing split ratio during each fold was 4:1. Results were evaluated based on the test sets generated during each fold and subsequently averaged.
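For illustration, the following sketch shows the kind of evaluation protocol requested in the code generation conditions: stratified 5-fold cross-validation with a 4:1 train/test split per fold and metrics averaged over folds. The TF-IDF and logistic regression pipeline is our own illustrative stand-in; the chatbots’ generated code selected its own feature extractors and learning algorithms.

```python
# Illustrative evaluation sketch: stratified 5-fold cross-validation with
# per-fold metrics averaged across folds. Labels are assumed to be coded as
# 0 (control) and 1 (condition of interest).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

def cross_validate(texts, labels, seed=0):
    texts, labels = np.asarray(texts), np.asarray(labels)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = {"accuracy": [], "f1": [], "sensitivity": [], "specificity": []}
    for train_idx, test_idx in skf.split(texts, labels):
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(texts[train_idx], labels[train_idx])
        pred = model.predict(texts[test_idx])
        y = labels[test_idx]
        scores["accuracy"].append(accuracy_score(y, pred))
        scores["f1"].append(f1_score(y, pred, pos_label=1))
        scores["sensitivity"].append(recall_score(y, pred, pos_label=1))
        scores["specificity"].append(recall_score(y, pred, pos_label=0))
    # Report mean (SD) over the 5 folds, as in our tables
    return {k: (np.mean(v), np.std(v)) for k, v in scores.items()}
```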
This study did not involve the recruitment of human participants or the collection of new data. All analyses were conducted on publicly available, deidentified datasets―AphasiaBank, the DAIC-WOZ database, and ASDBank―that are widely used in research and do not contain personally identifiable information. As such, no application for ethics review was submitted. This approach is consistent with institutional and regional guidelines that exempt studies using publicly available, deidentified data from human subjects review.
The following tables present the cross-validation results of the 2 approaches applied to each dataset, reporting accuracy, F1-score, specificity, and sensitivity. Performance under the direct diagnosis conditions varied across datasets.
Approach and model | Accuracy | F1-score | Specificity | Sensitivity
Results from Cong et al [] | 0.79 | 0.79 | ―a | 0.79 | ||||
No assessment scale, mean (SD) | ||||||||
GPT-4 | 0.567 (0.1) | 0.6556 (0.136) | 0.33 (0.3) | 0.684 (0.29) | ||||
GPT-4o | 0.561 (0.029) | 0.648 (0.111) | 0.397 (0.11) | 0.642 (0.22) | ||||
GPT-o3 | 0.49 (0.06) | 0.544 (0.113) | 0.328 (0.01) | 0.665 (0.01) | ||||
Gemini 2.5 Pro | 0.508 (0.01) | 0.599 (0.012) | 0.317 (0.02) | 0.659 (0.013) | ||||
Assessment scale, mean (SD) | ||||||||
GPT-4 | 0.293 (0.34)b | 0.358 (0.376)b | 0.297 (0.187) | 0.647 (0.09) | ||||
GPT-4o | 0.497 (0.01)b | 0.55 (0.02)b | 0.577 (0.02) | 0.458 (0.02) | ||||
GPT-o3 | 0.555 (0.183)c | 0.568 (0.4)c | 0.108 (0.19) | 0.645 (0.037) | ||||
Gemini 2.5 Pro | 0.661 (0.07)b | 0.792 (0.003)b | 0.381 (0.08) | 0.672 (0.003) |
aMissing data.
bNo test conducted.
cP<.001 for GPT-o3 accuracy; P<.001 for F1-score (no assessment scale vs assessment scale).
Approach and model | Accuracy, mean (SD) | F1-score, mean (SD) | Specificity, mean (SD) | Sensitivity, mean (SD)
No assessment scale | |||||
GPT-4 | 0.67 (0.16)a | 0.74 (0.17)a | 0.79 (0.24) | 0.40 (0.31) | |
GPT-4o | 0.67 (0.0113)a | 0.802 (0.008)a | 0.68 (0.011) | 1 (0) | |
GPT-o3 | 0.835 (0.035)a | 0.865 (0.029)a | 0.920 (0.077) | 0.793 (0.041) | |
Claude 3.5 | 0.605 (0.034)b | 0.623 (0.036)b | 0.844 (0.037) | 0.488 (0.033) | |
Gemini 2.5 Pro | 0.7882 (0.02)a | 0.8429 (0.016)a | 0.6645 (0.057) | 0.8490 (0.031) | |
Assessment scale | |||||
GPT-4 | 0.67 (0.16)b | 0.74 (0.17)b | 0.80 (0.25) | 0.41 (0.30) | |
GPT-4o | 0.741 (0.022)c | 0.814 (0.016)c | 0.786 (0.024) | 0.843 (0.007) | |
GPT-o3 | 0.835 (0.035)b | 0.865 (0.029)b | 0.920 (0.077) | 0.793 (0.041) | |
Claude 3.5 | 0.608 (0.036)b | 0.627 (0.039)b | 0.844 (0.037) | 0.492 (0.036) | |
Gemini 2.5 Pro | 0.7891 (0.021)b | 0.8437 (0.015)b | 0.6674 (0.072) | 0.8490 (0.024) |
aP<.001 for GPT-4 accuracy; P<.001 for GPT-4 F1-score; P<.001 for GPT-4o accuracy; P<.001 for GPT-4o F1-score; P<.001 for GPT-o3 accuracy; P<.001 for GPT-o3 F1-score; P<.001 for Gemini 2.5 Pro accuracy; P<.001 for Gemini 2.5 Pro F1-score (direct diagnosis vs code generation in non–assessment scale setups when marked in the “No assessment scale” section).
bNo test conducted.
cP=.07 for GPT-4o accuracy; P=.06 for GPT-4o F1-score (assessment vs no assessment).
Approach and model | Accuracy | F1-score | Specificity | Sensitivity
Results from Chu et al [] | 0.76 | 0.85 | 0.2 | 0.94 | ||||
No assessment scale, mean (SD) | ||||||||
GPT-4 | 0.5 (0.00) | 0.598 (0.00) | 0.227 (0.00) | 0.853 (0.00) | ||||
GPT-4o | 0.421 (0.03) | 0.514 (0.129) | 0.155 (0.212) | 0.765 (0.323) | ||||
GPT-o3 | 0.6026 (0.00) | 0.575 (0.00) | 0.667 (0.00) | 0.575 (0.00) | ||||
Gemini 2.5 Pro | 0.485 (0.08) | 0.449 (0.09) | 0.549 (0.08) | 0.421 (0.08) | ||||
Assessment scale, mean (SD) | ||||||||
GPT-4 | 0.427 (0.01)a | 0.56 (0.08)a | 0.09 (0.157) | 0.863 (0.24) | ||||
GPT-4o | 0.491 (0.09)a | 0.542 (0.117)a | 0.236 (0.39) | 0.802 (0.342) | ||||
GPT-o3 | 0.436 (0.00)a | 0.607 (0.00)a | 0.00 (0.00) | 0.436 (0.00) |
aNo test conducted.
Approach and model | Accuracy, mean (SD) | F1-score, mean (SD) | Specificity, mean (SD) | Sensitivity, mean (SD)
No assessment scale | ||||||||
GPT-4 | 0.618 (0.125)a | 0.616 (0.104)a | 0.55 (0.286) | 0.71 (0.199) | ||||
GPT-4o | 0.653 (0.103)a | 0.55 (0.184)a | 0.73 (0.303) | 0.576 (0.378) | ||||
GPT-o3 | 0.679 (0.041)a | 0.679 (0.041)a | 0.864 (0.083) | 0.433 (0.195) | ||||
Claude 3.5 | 0.68 (0.16)b | 0.6 (0.22)b | 0.67 (0.35) | 0.69 (0.4) | ||||
Gemini 2.5 Pro | 0.74 (0.09)a | 0.63 (0.14)a | 0.52 (0.16) | 0.91 (0.08) | ||||
Assessment scale | ||||||||
GPT-4 | 0.642 (0.165)c | 0.628 (0.17)c | 0.6 (0.334) | 0.695 (0.231) | ||||
GPT-4o | 0.628 (0.194)b | 0.592 (0.1974)b | 0.689 (0.325) | 0.578 (0.257) | ||||
GPT-o3 | 0.679 (0.041)b | 0.679 (0.041)b | 0.864 (0.083) | 0.433 (0.195) | ||||
Claude 3.5 | 0.64 (0.13)b | 0.6 (0.23)b | 0.69 (0.41) | 0.67 (0.36) |
aP=.002 for GPT-4 accuracy; P=.001 for GPT-4 F1-score; P=.03 for GPT-4o accuracy; P=.015 for GPT-4o F1-score; P=.009 for GPT-o3 accuracy; P=.005 for GPT-o3 F1-score; P=.006 for Gemini 2.5 Pro accuracy; P=.003 for Gemini 2.5 Pro F1-score (direct diagnosis vs code generation in non–assessment scale setups when marked in the “No assessment scale” section).
bNo test conducted.
cP=.99 and P=.99 for GPT-4 accuracy and F1-score (assessment vs non-assessment).
Approach and model | Accuracy | F1-score | Specificity | Sensitivity
Results from Dinkel et al [] | 0.86 | 0.84 | ― | 0.83 | |||||
Results from Agrawal and Mishra [] | ― | 0.91 | ― | 0.89 | |||||
No assessment scale, mean (SD) | |||||||||
GPT-4 | 0.333 (0.04) | 0.452 (0.04) | 0.08 (0.51) | 0.939 (0.039) | |||||
GPT-4o | 0.623 (0.01) | 0.346 (0.176) | 0.711 (0.168) | 0.409 (0.347) | |||||
GPT-o3 | 0.595 (0.05) | 0.252 (0.12) | 0.704 (0.02) | 0.269 (0.05) | |||||
Gemini 2.5 Pro | 0.616 (0.11) | 0.222 (0.132) | 0.700 (0.02) | 0.294 (0.09) | |||||
Assessment scale, mean (SD) | |||||||||
GPT-4 | 0.56 (0.06)a | 0.416 (0.07)a | 0.56 (0.08) | 0.516 (0.05) | |||||
GPT-4o | 0.709 (0.05)a | 0.08 (0.01)a | 1 (0.00) | 0.429 (0.006) | |||||
GPT-o3 | 0.635 (0.05)b | 0.281 (0.14)b | 0.72 (0.01) | 0.355 (0.08) | |||||
Gemini 2.5 Pro | 0.54 (0.07)a | 0.363 (0.09)a | 0.71 (0.06) | 0.306 (0.08) |
aNo test conducted.
bP=.44 for GPT-4 accuracy and P=.43 for F1-score (assessment vs no assessment).
Approach and model | Accuracy, mean (SD) | F1-score, mean (SD) | Specificity, mean (SD) | Sensitivity, mean (SD)
No assessment scale | |||||
GPT-4 | 0.624 (0.024)a | 0.268 (0.047)a | 0.79 (0.035) | 0.233 (0.048) | |
GPT-4o | 0.681 (0.126)a | 0.2038 (0.1474)a | 0.886 (0.087) | 0.2286 (0.2382) | |
GPT-o3 | 0.6667 (0.0572)a | 0.1472 (0.1636)a | 0.1091 (0.1185) | 0.1091 (0.1185) | |
Claude 3.5 | 0.649 (0.103)b | 0.2386 (0.113)b | 0.7672 (0.0251) | 0.2136 (0.1131) | |
Gemini 2.5 Pro | 0.6138 (0.08)b | 0.4037 (0.09)b | 0.6846 (0.11) | 0.4439 (0.11) | |
Assessment scale | |||||
GPT-4 | 0.63 (0.027)c | 0.271 (0.05)c | 0.797 (0.036) | 0.233 (0.048) | |
GPT-4o | 0.681 (0.1587)b | 0.213 (0.1587)b | 0.9 (0.073) | 0.223 (0.2389) | |
GPT-o3 | 0.619 (0.06)b | 0.283 (0.161)b | 0.768 (0.09) | 0.2682 (0.17) | |
Claude 3.5 | 0.657 (0.109)c | 0.33 (0.1153)c | 0.7738 (0.1) | 0.328 (0.1153) | |
Gemini 2.5 Pro | 0.518 (0.068)b | 0.4822 (0.037)b | 0.5524 (0.13) | 0.478 (0.03) |
aP<.001 for GPT-4 accuracy; P<.001 for GPT-4 F1-score; P<.001 for GPT-4o accuracy; P<.001 for GPT-4o F1-score; P<.001 for GPT-o3 accuracy; P<.001 for GPT-o3 F1-score (direct diagnosis vs code generation in non–assessment scale setups when marked in the “No assessment scale” section).
bNo test conducted.
cP=.80, P=.60 for GPT-4 accuracy and F1-score; P=.30, P=.20 for Claude 3.5 accuracy and F1-score (assessment vs non-assessment).
The 2 AphasiaBank tables above compare our approaches against a baseline performance of 79% across metrics in the study by Cong et al []. All of our direct diagnosis conditions yielded lower performance than this baseline. Our code generation conditions improved results significantly, with ChatGPT with GPT-o3 achieving the highest F1-score (0.865) and balanced specificity (0.92) and sensitivity (0.793), surpassing the baseline by Cong et al [].
The results on the ASDBank dataset were compared against the baseline results from Chu et al [], who achieved an F1-score of 0.85 and a high sensitivity of 0.94, although specificity was notably low at 0.2. Our direct diagnosis approaches struggled in comparison, with ChatGPT with GPT-4 and ChatGPT with GPT-o3 producing lower F1-scores (0.598 and 0.575, respectively) and poor specificity. The code generation condition significantly improved overall performance, with Claude 3.5 achieving the highest accuracy (0.68) and F1-score (0.6). The other models also showed improvement, but their performance on specificity and sensitivity was less consistent. Gemini 2.5 Pro was unable to provide ratings on the checklist due to content restrictions related to ethical guidelines.
For the DAIC-WOZ dataset, the studies by Dinkel et al [] and Agrawal and Mishra [] established strong baselines, achieving F1-scores of 0.84 and 0.91, respectively, along with high accuracy and sensitivity. In comparison, our direct diagnosis approaches showed inconsistent performance: ChatGPT with GPT-4o and ChatGPT with GPT-4 achieved the highest accuracy (0.623) and F1-score (0.452), respectively, although these values are notably low, and results on the other metrics were even poorer. While the code generation approaches yielded higher accuracy in some cases, they did not meaningfully improve overall performance, as their F1-scores were significantly lower than those of the direct diagnosis condition.
We also note that most comparisons between assessment scale and no assessment scale conditions did not yield statistically significant differences except for ChatGPT with GPT-o3 and Gemini 2.5 Pro in the AphasiaBank direct diagnosis condition, which showed significant improvements in both accuracy and F1-score.
Overall, our findings reveal a substantial performance gap between the 2 approaches: code generation and direct diagnosis. While code generation and newer models seem to have improved performance compared to direct prompting, they still did not reach the levels reported in previous studies in most cases. Both approaches fell short of established benchmarks, underscoring the limitations of current LLM-based diagnostic methods that rely solely on prompting without model fine-tuning.
We first address the errors in the direct diagnosis approach, which did not appear to work well. We observed that most rounds of classification yielded close-to-random performance, especially for the older models (ChatGPT with GPT-4 and ChatGPT with GPT-4o). Interestingly, we noticed patterns in the classification ratings produced, such as digits limited to only multiples of 3 or repeating sequences (eg, 3, 2, 1, 0, 3, 2, 1, 0). The table below presents the percentage of the 5 classification rounds that followed such patterns (a sketch of how such rounds can be flagged is given after the table). This demonstrates that a direct diagnosis prompting strategy does not work well if models are presented with the entire dataset at once.
Database and approach | GPT-4 random predictions, n=5 (%) | GPT-4o random predictions, n=5 (%) | GPT-o3 random predictions, n=5 (%) | Gemini 2.5 Pro random predictions (%)
AphasiaBank | |||||||||
Without assessment scale | 80 | 60 | 0 | 0 | |||||
With assessment scale | 20 | 60 | 100 | 80 | |||||
ASDBank | |||||||||
Without assessment scale | 20 | 100 | 0 | 80 | |||||
With assessment scale | 100 | 100 | 0 | ―a | |||||
DAIC-WOZb database | |||||||||
Without assessment scale | 40 | 80 | 0 | 20 | |||||
With assessment scale | 20 | 100 | 20 | 0 |
aNot applicable.
bDAIC-WOZ: Distress Analysis Interview Corpus-Wizard-of-Oz.
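As noted before the table, many direct diagnosis rounds produced outputs that followed mechanical patterns rather than data-driven judgments. The following is a minimal sketch of how such rounds can be flagged, assuming the ratings from one round are collected as a list of integers; the two checks (multiples of 3 and short repeating cycles) mirror the patterns we observed, and the cycle-length bound is illustrative.

```python
# Flag a round of direct diagnosis outputs as following a mechanical pattern.
def all_multiples_of_three(values):
    return all(v % 3 == 0 for v in values)

def has_repeating_cycle(values, max_period=4):
    """True if the sequence is a repetition of a block of length <= max_period."""
    n = len(values)
    for period in range(1, max_period + 1):
        if period < n and all(values[i] == values[i % period] for i in range(n)):
            return True
    return False

def looks_mechanical(values):
    return all_multiples_of_three(values) or has_repeating_cycle(values)

# Example: the repeating sequence 3, 2, 1, 0, 3, 2, 1, 0 is flagged
print(looks_mechanical([3, 2, 1, 0, 3, 2, 1, 0]))  # True
```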
For the code generation approach, we identified several text archetypes (ie, recurring, typical examples of text) that were frequently misclassified. These archetypes often reflect characteristics of the conditions. Common errors we observed are described in the following sections.
The presence of repetitive language patterns and an increased frequency of filler words led to a high proportion of false positives for aphasia. Control participants’ responses typically exhibited minimal repetition and filler word use; however, even a slight elevation in these linguistic elements frequently resulted in misclassification, with the chatbots erroneously classifying control participants as positive. Notably, false positives from almost all the chatbots contained these features.
Transcripts containing filler words or fragmented sentences were misclassified as originating from individuals with ASD in almost 100% of cases, producing false positives. With the generative pretrained transformer (GPT) models, this archetype was observed in most false-positive data points, indicating a consistent misclassification pattern. In contrast, Claude 3.5 exhibited a different trend: most of its misclassified points were false negatives, and it did not appear to over-rely on the linguistic features characteristic of this archetype.
Text lacking overt depressive indicators and conveying generally positive sentiments accounted for a large proportion of false negatives. For instance, statements such as “uh I’d say maybe the fact that it’s a lot different than it was about ten years ago” and “I am pretty happy with the level of education I’ve gotten” often led to false negatives.
Texts containing instances of laughter were misclassified as originating from control participants rather than individuals with depression (false negatives) in >70% of cases.
Texts containing references to sighing were misclassified as suggestive of depression, producing false positives. Over 30% of false-positive cases included this feature, indicating its disproportionate influence on the classification process.
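The following sketch illustrates how the frequency of these archetype markers (filler words, laughter, and sighs) can be tallied among misclassified transcripts for this kind of error analysis; the marker lists and the column names of the hypothetical misclassification table are illustrative assumptions.

```python
# Tally archetype markers among misclassified transcripts (illustrative marker
# lists; "text", "label", and "prediction" are hypothetical column names).
import re
import pandas as pd

FILLERS = re.compile(r"\b(uh|um|er|hmm)\b", re.IGNORECASE)
LAUGHTER = re.compile(r"\b(laughs?|laughter)\b", re.IGNORECASE)
SIGHS = re.compile(r"\b(sighs?|sighing)\b", re.IGNORECASE)

def archetype_rates(misclassified: pd.DataFrame) -> dict:
    """Fraction of misclassified transcripts containing each marker."""
    texts = misclassified["text"]
    return {
        "filler_words": texts.str.contains(FILLERS).mean(),
        "laughter": texts.str.contains(LAUGHTER).mean(),
        "sighs": texts.str.contains(SIGHS).mean(),
    }

# Example: rates among false positives (predicted positive, labeled control)
# fp = results[(results["prediction"] == 1) & (results["label"] == 0)]
# print(archetype_rates(fp))
```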
The table below details the frequency of occurrence of these archetypes. The observed misclassifications highlight the inherent constraints of relying on text-based methods for neurobehavioral diagnosis.
Archetype and approach | GPT-4 (%) | GPT-4o (%) | GPT-o3 (%) | Claude 3.5 (%) | Gemini 2.5 Pro (%)
Repetitive language and filler words (false positives) | |||||||||||
Without assessment scale | 100 | 100 | 100 | 100 | 90 | ||||||
With assessment scale | 100 | 100 | 0 | 100 | 85 | ||||||
Fragmented sentences and filler words (false positives) | |||||||||||
Without assessment scale | 100 | 100 | 66.67 | 0 | 100 | ||||||
With assessment scale | 100 | 100 | 66.67 | 0 | ―a | ||||||
Lack of depressive language (false negatives) | |||||||||||
Without assessment scale | 87.67 | 89.02 | 30 | 85 | 100 | ||||||
With assessment scale | 87.67 | 89.09 | 100 | 79.09 | 100 | ||||||
Excessive amount of laughter (false negatives) | |||||||||||
Without assessment scale | 88.36 | 88.34 | 78 | 90.7 | 96.77 | ||||||
With assessment scale | 88.36 | 88.9 | 80.5 | 90.7 | 100 | ||||||
Excessive number of sighs (false positives) | |||||||||||
Without assessment scale | 67.44 | 69.23 | 30.8 | 52 | 60.61 | ||||||
With assessment scale | 65.12 | 74.19 | 29 | 52.17 | 100 |
aNot available.
This study reveals the limitations of using LLMs for automated neurobehavioral classification. In both direct diagnosis conditions, we encountered significant limitations with these models, which tended to generate random or close-to-random predictions. The models occasionally refused to offer diagnoses, and when compelled to complete the tasks, the resulting classifications were not accurate. These challenges were even more pronounced with Claude 3.5 and Gemini 2.5 Pro, with which we faced difficulties generating any classification results or ratings in some conditions. The inclusion of assessment scales did not substantially improve performance as the ratings on scale items also appeared to be randomly assigned in most situations. Notably, in many of these conditions, we observed a concerning trend in which assessment scale ratings were often identical across participants regardless of individual differences in their text data.
It is important to note that previous studies have successfully achieved F1-scores of 70% to 80% using subsets of the ASDBank dataset and high performance (F1-scores of 80%-90%) using various methods on at least portions of the other 2 datasets [-]. In contrast, our results indicate that most direct diagnosis approaches and the code generated by these models were not able to attain similar results to those of previous studies. This discrepancy suggests a gap between the performance that ML models can potentially achieve and the outcomes observed in our study. This may be due to our relatively straightforward methodological approach.
Regarding the code generation condition, our findings suggest that LLM-generated ML pipelines show promising potential for improving diagnostic performance. Notably, on the AphasiaBank dataset, ChatGPT with GPT-o3 produced code that outperformed results reported in previous studies, although the choice of learning algorithms sometimes varied across conditions and lacked a clear rationale.
In the code generation condition using assessment scales, we observed that the code from the chatbots did not apply diagnostic thresholds as defined by the assessment scales but, instead, directly incorporated the ratings as ML features. The rating methods were simplistic, and the chatbots frequently implemented a keyword-counting algorithm to provide ratings for ASDBank and DAIC-WOZ. These ratings were then concatenated with the features produced by the text feature extractor. This direct concatenation of features without sophisticated integration of diagnostic logic may explain why the assessment scale conditions did not lead to improved performance. More effective integration of these ratings in the generated code may help enhance future model performance.
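The following is a simplified sketch of the pattern we observed in the generated code: keyword-count pseudo-ratings for scale items are computed from each transcript and concatenated with a generic text feature matrix rather than being combined through the scale’s own diagnostic thresholds. The keyword lists and the TF-IDF feature extractor are illustrative, not the exact choices made by any particular chatbot.

```python
# Keyword-count pseudo-ratings concatenated with TF-IDF features
# (keyword lists and feature choices are illustrative).
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

ITEM_KEYWORDS = {                       # one hypothetical "scale item" per key
    "low_mood": ["sad", "down", "hopeless"],
    "fatigue": ["tired", "exhausted", "sigh"],
}

def keyword_ratings(texts):
    """One pseudo-rating per item: raw keyword counts in each transcript."""
    return np.array(
        [[sum(t.lower().count(k) for k in kws) for kws in ITEM_KEYWORDS.values()]
         for t in texts],
        dtype=float,
    )

def build_features(train_texts, test_texts):
    vec = TfidfVectorizer()
    X_train = hstack([vec.fit_transform(train_texts),
                      csr_matrix(keyword_ratings(train_texts))])
    X_test = hstack([vec.transform(test_texts),
                     csr_matrix(keyword_ratings(test_texts))])
    return X_train, X_test
```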
We also observed that models with built-in chain-of-thought reasoning capabilities such as ChatGPT with GPT-o3 and Gemini 2.5 Pro exhibited improved performance under certain conditions. For instance, in the code generation tasks on the AphasiaBank dataset, these chain-of-thought models consistently outperformed others. Permutation tests conducted on the test sets across 5 cross-validation folds revealed statistically significant differences between models that used chain-of-thought reasoning and those that did not (ChatGPT with GPT-4 vs Gemini 2.5 Pro: accuracy P=.01, F1-score P=.03; ChatGPT with GPT-4 vs ChatGPT with GPT-o3: accuracy P<.001, F1-score P<.001; ChatGPT with GPT-4o vs Gemini 2.5 Pro: accuracy P=.01, F1-score P=.002; ChatGPT with GPT-4o vs ChatGPT with GPT-o3: accuracy P<.001, F1-score P<.001). While this improvement was not observed across all datasets (ie, DAIC-WOZ and ASDBank), the integration of structured prompting strategies appears to be a promising direction for future research.
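For reference, the following sketch shows one way a paired permutation test can be run on the pooled test-set predictions of two models; the permutation scheme here (randomly swapping the two models’ predictions per test instance) is an illustrative implementation and may differ in detail from the tests we ran.

```python
# Paired permutation test on pooled test-set predictions of two models.
import numpy as np
from sklearn.metrics import accuracy_score

def permutation_test(y_true, pred_a, pred_b, metric=accuracy_score,
                     n_permutations=10_000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    observed = abs(metric(y_true, pred_a) - metric(y_true, pred_b))
    count = 0
    for _ in range(n_permutations):
        swap = rng.random(len(y_true)) < 0.5        # randomly swap the two
        a = np.where(swap, pred_b, pred_a)          # models' predictions
        b = np.where(swap, pred_a, pred_b)          # per test instance
        if abs(metric(y_true, a) - metric(y_true, b)) >= observed:
            count += 1
    return (count + 1) / (n_permutations + 1)       # two-sided P value
```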
In previous studies, human-in-the-loop processes have demonstrated promise for diagnostic classification tasks [,]. However, in such approaches, the human must remain more involved in the computational diagnosis procedure than simply prompting the LLM to generate a direct diagnosis, clinical rating, or classification code. In prior work for autism diagnostics, for example, humans have extracted the behavioral features—a task that requires the ability to interpret relatively subjective human behavior—leaving the ML models to perform the simpler task of the final classification given the human-derived features [,]. Humans will likely need to continue performing at least some level of analysis of the data to achieve clinically useful performance, and future prompt engineering approaches should explore these ideas more thoroughly.
We acknowledge several limitations of this study beyond the observed performance gaps.
First, the scope of our investigation was limited to 3 datasets, each representing a distinct neurobehavioral condition with relatively small sample sizes. This may constrain both the robustness and generalizability of our findings, as well as the models’ capacity to learn effectively.
Second, a further limitation lies in the selection and applicability of the clinical checklists used in the assessment scale approach. In many cases, the patient transcripts lacked sufficient information to reliably rate all items on the scales, potentially resulting in random or invalid scores. Future work may consider using longer or more comprehensive patient transcripts or choosing assessment tools that are more tolerant of limited inputs.
Third, additional prompting strategies warrant exploration. While we observed performance gains from models that incorporated chain-of-thought reasoning by default, other prompting techniques may also enhance diagnostic accuracy.
Finally, all input data were presented to the models at once in a single file, which may have hindered their ability to process the content effectively. Presenting the data incrementally, one instance at a time, could reduce noise and improve prediction consistency, as sketched below.
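A minimal sketch of this per-instance alternative follows; the classify_transcript function is a hypothetical wrapper around whichever chatbot API is used, and the prompt wording and reply parsing are assumptions rather than the interface of any specific vendor.

```python
# Per-instance prompting sketch: each transcript is sent in its own request
# instead of uploading the whole dataset at once. `classify_transcript` is a
# hypothetical chatbot API wrapper supplied by the caller.
from typing import Callable, Iterable, List

def classify_incrementally(transcripts: Iterable[str],
                           classify_transcript: Callable[[str], str]) -> List[int]:
    """Return one binary prediction per transcript, queried one at a time."""
    predictions = []
    for text in transcripts:
        reply = classify_transcript(
            "Does the following transcript suggest the condition of interest? "
            "Answer 'yes' or 'no'.\n\n" + text
        )
        predictions.append(1 if reply.strip().lower().startswith("yes") else 0)
    return predictions
```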
This study demonstrates that popular LLM-based chatbots remain inadequate for classifying neurobehavioral conditions from text transcripts even when prompted to incorporate clinical assessment scales into their evaluation strategy. We recommend that future research further investigate the limitations identified in this study and examine whether incorporating structured tools—such as assessment scales—can serve as a viable method to improve diagnostic accuracy for neurobehavioral conditions when using more sophisticated prompting strategies.