JMIR AI – Aiding Large Language Models Using Clinical Scoresheets for Neurobehavioral Diagnostic Classification From Text: Algorithm Development and Validation

By Kaiying Lin
21/10/2025

Datasets

We used 3 distinct databases, each focusing on a specific neurobehavioral condition. Two of these, ASDBank [] and AphasiaBank [], are sourced from TalkBank [] and contain language samples for autism spectrum disorder (ASD) and aphasia, respectively, whereas the third database, the Distress Analysis Interview Corpus-Wizard-of-Oz (DAIC-WOZ) database [], contains textual data from patients with depression, anxiety, and posttraumatic stress disorder.

AphasiaBank [] is a repository containing multimedia language samples from both participants with aphasia and control participants. These samples were collected through standardized discourse tasks, including unstructured speech samples, picture descriptions, story narratives, and procedural discourse.

ASDBank [] comprises a collection of language samples and interactions from individuals diagnosed with ASD. The data within ASDBank include transcribed audio and video recordings of clinical interviews and naturalistic interactions.

We used all available English-language transcripts from both AphasiaBank and ASDBank. Data processing was performed to consolidate all samples from a single participant into 1 data point. The resulting dataset comprised 715 aphasia data points and 352 control data points for AphasiaBank and 34 ASD data points and 44 control data points for ASDBank.

The DAIC-WOZ database [] consists of semistructured interviews conducted by a simulated agent designed to identify symptoms of depression and posttraumatic stress disorder. These interviews include questions about personal experiences, quality of life, and emotions. We consolidated all samples from a participant, including the interviewer’s input, into a single data point. The DAIC-WOZ database includes 56 patient data points and 133 control data points.
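The per-participant consolidation described above can be sketched as follows. This is a minimal illustration, not the study's preprocessing code: the `(participant_id, label, text)` tuple layout and the function name `consolidate_by_participant` are assumptions made for the example.

```python
from collections import defaultdict


def consolidate_by_participant(samples):
    """Merge all transcript samples of one participant into a single data point.

    `samples` is an iterable of (participant_id, label, text) tuples. This
    layout is a hypothetical stand-in for the parsed transcripts.
    """
    texts = defaultdict(list)
    labels = {}
    for participant_id, label, text in samples:
        # A participant must keep one diagnosis across all of their samples.
        if labels.setdefault(participant_id, label) != label:
            raise ValueError(f"conflicting labels for {participant_id}")
        texts[participant_id].append(text.strip())
    # One dictionary per participant, in order of first appearance.
    return [
        {"participant": pid, "label": labels[pid], "text": "\n".join(parts)}
        for pid, parts in texts.items()
    ]


rows = consolidate_by_participant([
    ("p1", "aphasia", "first sample "),
    ("p2", "control", "only sample"),
    ("p1", "aphasia", "second sample"),
])
print(len(rows))  # one entry per participant
```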

Table 2 provides a summary of the diagnosis distribution across each dataset.

Table 2. The number of control and patient data points in each of the datasets we evaluated.

Database | Number of control data points | Number of data points for condition of interest
AphasiaBank | 352 | 715
ASDBank | 44 | 34
DAIC-WOZ^a | 133 | 56

^a DAIC-WOZ: Distress Analysis Interview Corpus-Wizard-of-Oz.
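Because the classes in Table 2 are imbalanced, it can be useful to know the accuracy that a trivial classifier would reach by always predicting the larger class. The sketch below computes that figure from the Table 2 counts. The majority-class framing is an illustrative addition and not an analysis reported in the study.

```python
# Counts copied from Table 2: (control, condition of interest).
COUNTS = {
    "AphasiaBank": (352, 715),
    "ASDBank": (44, 34),
    "DAIC-WOZ": (133, 56),
}


def majority_baseline(control, condition):
    """Accuracy obtained by always predicting the larger of the two classes."""
    return max(control, condition) / (control + condition)


for name, (control, condition) in COUNTS.items():
    print(f"{name}: {majority_baseline(control, condition):.3f}")
```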

Aphasia, depression, and ASD each manifest linguistic characteristics, some of which overlap across conditions and some of which are unique. Aphasia, typically resulting from brain damage, is characterized by impaired language production and comprehension, often including repetitive language and the frequent use of filler words as individuals struggle to retrieve or organize words effectively []. Depression, while primarily a mood disorder, affects language through reduced verbal output, monotone speech, and a preference for negative or self-critical language patterns. Depressive language, such as expressions of negativity, can be a key symptom of the condition. Another characteristic linguistic feature is an excessive number of sighs, reflecting physical or emotional fatigue. ASD is marked by unique communication challenges, including delayed speech development, echolalia (repetition of phrases), difficulty with pragmatic language (eg, understanding sarcasm or social cues), and overly literal or formal speech. Individuals with ASD may also exhibit fragmented sentences and frequent use of filler words, reflecting challenges in organizing thoughts or navigating social interactions [].

Many previous studies have leveraged the datasets we used in our research. However, much of the existing work has focused on advanced tasks such as multimodal detection or severity classification rather than simpler text-based binary classification using chatbots. These studies have often achieved strong (although not clinically translatable) performances, frequently exceeding 80% in F1-scores or accuracy. For example, Dinkel et al [] applied a text-based multitask network to the DAIC-WOZ dataset, achieving an F1-score of 0.84 for binary detection. Similarly, Agrawal and Mishra [] used a fused bidirectional encoder representations from transformers (BERT)–bidirectional long short-term memory (BiLSTM) model integrated with Extreme Gradient Boosting to perform binary classification, achieving an F1-score of 0.91.

For the AphasiaBank dataset, most previous studies have focused on severity classification, making direct comparisons with our binary classification study challenging. The only relevant work, conducted by Cong et al [], found that using LLM-derived surprisal features facilitated detection, achieving 79% in both accuracy and F1-score. Similarly, studies involving the ASDBank dataset are limited, partly due to its recent development. Chu et al [] included another dataset, the Child Language Data Exchange System, as a source of healthy control data. By extracting a few linguistic features from these 2 datasets, their binary classification approaches reached F1-scores of over 80% [].

These studies suggest that LLM-based models directly diagnosing from the datasets used in this study should achieve high performance if chatbots exhibit comparable classification capabilities to those models in the previous studies.

Models

We evaluated 2 approaches using 3 types of state-of-the-art conversational AI models: ChatGPT with GPT-4, ChatGPT with GPT-4o, and ChatGPT with GPT-o3 (OpenAI); Gemini 2.5 Pro (Google AI); and Claude 3.5 Sonnet (Anthropic). These models were selected because they are some of the most widely used modern LLMs and because their efficacy in neurobehavioral classification tasks remains underexamined in the current literature. Notably, models such as Gemini 2.5 Pro and ChatGPT with GPT-o3 incorporate built-in prompting strategies such as chain-of-thought reasoning, allowing us to examine how such strategies influence performance. We excluded open models such as Llama because they do not support file input and including them would require a different approach from that used for the other models we tested.

Assessment Scales

We incorporated 3 widely recognized assessment scales and checklists used in clinical settings. We selected scales that assess behaviors at least tangentially related to language and that do not require extended observation periods. For example, the Autism Spectrum Quotient evaluates traits such as social preferences (“S/he prefers to do things with others rather than on her/his own”), behavioral patterns (“S/he prefers to do things the same way over and over again”), and attention capabilities (“have difficulty sustaining attention in tasks or fun activities”). The rating system for this checklist—definitely disagree, slightly disagree, slightly agree, and definitely agree—does not necessitate longitudinal observation, unlike scales that use time-sensitive ratings such as rarely, less often, very often, and always.

The assessment scales and checklists included in our study were as follows: (1) the fluency test in the Western Aphasia Battery–Aphasia Quotient (AphasiaBank) [], (2) the Autism Spectrum Quotient (ASDBank) [], and (3) the Burns Depression Checklist [] (DAIC-WOZ database).

In the 2 direct diagnosis conditions, we conducted the experimental procedure 5 times and obtained results based on the entirety of each dataset. We did not perform a training and testing split for these conditions, opting instead for a zero-shot classification approach to assess the models’ ability to generalize from their pretrained knowledge. However, in the code generation conditions, we instructed the chatbot to perform stratified 5-fold cross-validation on the entire dataset. The training and testing split ratio during each fold was 4:1. Results were evaluated based on the test sets generated during each fold and subsequently averaged.
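A stratified 5-fold split with a 4:1 training-to-testing ratio can be sketched in plain Python as shown below. This is an assumption-laden illustration of the general technique, not the code that the chatbots generated; the function name `stratified_k_fold` and the round-robin assignment are choices made for this example.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k stratified folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold keeps the class ratio.
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train, test


# Example with the DAIC-WOZ class counts from Table 2 (1 = patient, 0 = control).
labels = [1] * 56 + [0] * 133
for train, test in stratified_k_fold(labels):
    print(len(train), len(test), sum(labels[i] for i in test))
```

With k=5, each test fold holds roughly one-fifth of every class, which gives the 4:1 ratio described above. Metrics would then be computed on each test fold and averaged.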

Ethical Considerations

This study did not involve the recruitment of human participants or the collection of new data. All analyses were conducted on publicly available, deidentified datasets (AphasiaBank, the DAIC-WOZ database, and ASDBank) that are widely used in research and do not contain personally identifiable information. As such, no application for ethics review was submitted. This approach is consistent with institutional and regional guidelines that exempt studies using publicly available, deidentified data from human subjects review.

Results

Core Results

Tables 3 to 8 present the cross-validation results of the 2 approaches applied to each dataset, reporting accuracy, F1-score, specificity, and sensitivity. Performance under the direct diagnosis conditions varied across datasets.
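For reference, the 4 reported metrics follow directly from the binary confusion matrix. The sketch below uses the standard definitions, with 1 denoting the condition of interest and 0 denoting control; the function name `binary_metrics` is an assumption for this example.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, F1-score, specificity, and sensitivity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Report 0.0 instead of failing when a class is absent.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "specificity": ratio(tn, tn + fp),  # true-negative rate
        "sensitivity": ratio(tp, tp + fn),  # true-positive rate (recall)
    }


print(binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1]))
```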

Table 3. Results of 4 approaches on the AphasiaBank dataset in the direct diagnosis condition.

Approach and model | Accuracy | F1-score | Specificity | Sensitivity
Results from Cong et al [] | 0.79 | 0.79 | ―^a | 0.79
No assessment scale, mean (SD)
GPT-4 | 0.567 (0.1) | 0.6556 (0.136) | 0.33 (0.3) | 0.684 (0.29)
GPT-4o | 0.561 (0.029) | 0.648 (0.111) | 0.397 (0.11) | 0.642 (0.22)
GPT-o3 | 0.49 (0.06) | 0.544 (0.113) | 0.328 (0.01) | 0.665 (0.01)
Gemini 2.5 Pro | 0.508 (0.01) | 0.599 (0.012) | 0.317 (0.02) | 0.659 (0.013)
Assessment scale, mean (SD)
GPT-4 | 0.293 (0.34)^b | 0.358 (0.376)^b | 0.297 (0.187) | 0.647 (0.09)
GPT-4o | 0.497 (0.01)^b | 0.55 (0.02)^b | 0.577 (0.02) | 0.458 (0.02)
GPT-o3 | 0.555 (0.183)^c | 0.568 (0.4)^c | 0.108 (0.19) | 0.645 (0.037)
Gemini 2.5 Pro | 0.661 (0.07)^b | 0.792 (0.003)^b | 0.381 (0.08) | 0.672 (0.003)

^a Missing data.

^b No test conducted.

^c P<.001 for GPT-o3 accuracy; P<.001 for F1-score (no assessment scale vs assessment scale).

Table 4. Results of 4 approaches on the AphasiaBank dataset in the code generation condition.

Approach and model | Accuracy, mean (SD) | F1-score, mean (SD) | Specificity, mean (SD) | Sensitivity, mean (SD)
No assessment scale
GPT-4 | 0.67 (0.16)^a | 0.74 (0.17)^a | 0.79 (0.24) | 0.40 (0.31)
GPT-4o | 0.67 (0.0113)^a | 0.802 (0.008)^a | 0.68 (0.011) | 1 (0)
GPT-o3 | 0.835 (0.035)^a | 0.865 (0.029)^a | 0.920 (0.077) | 0.793 (0.041)
Claude 3.5 | 0.605 (0.034)^b | 0.623 (0.036)^b | 0.844 (0.037) | 0.488 (0.033)
Gemini 2.5 Pro | 0.7882 (0.02)^a | 0.8429 (0.016)^a | 0.6645 (0.057) | 0.8490 (0.031)
Assessment scale
GPT-4 | 0.67 (0.16)^b | 0.74 (0.17)^b | 0.80 (0.25) | 0.41 (0.30)
GPT-4o | 0.741 (0.022)^c | 0.814 (0.016)^c | 0.786 (0.024) | 0.843 (0.007)
GPT-o3 | 0.835 (0.035)^b | 0.865 (0.029)^b | 0.920 (0.077) | 0.793 (0.041)
Claude 3.5 | 0.608 (0.036)^b | 0.627 (0.039)^b | 0.844 (0.037) | 0.492 (0.036)
Gemini 2.5 Pro | 0.7891 (0.021)^b | 0.8437 (0.015)^b | 0.6674 (0.072) | 0.8490 (0.024)

^a P<.001 for GPT-4 accuracy; P<.001 for GPT-4 F1-score; P<.001 for GPT-4o accuracy; P<.001 for GPT-4o F1-score; P<.001 for GPT-o3 accuracy; P<.001 for GPT-o3 F1-score; P<.001 for Gemini 2.5 Pro accuracy; P<.001 for Gemini 2.5 Pro F1-score (direct diagnosis versus code generation in non–assessment scale setups when marked in the “No assessment scale” section).

^b No test conducted.

^c P=.07 for GPT-4o accuracy; P=.06 for GPT-4o F1-score (assessment vs no assessment).

Table 5. Results of 4 approaches on the ASDBank dataset in the direct diagnosis condition.

Approach and model | Accuracy | F1-score | Specificity | Sensitivity
Results from Chu et al [] | 0.76 | 0.85 | 0.2 | 0.94
No assessment scale, mean (SD)
GPT-4 | 0.5 (0.00) | 0.598 (0.00) | 0.227 (0.00) | 0.853 (0.00)
GPT-4o | 0.421 (0.03) | 0.514 (0.129) | 0.155 (0.212) | 0.765 (0.323)
GPT-o3 | 0.6026 (0.00) | 0.575 (0.00) | 0.667 (0.00) | 0.575 (0.00)
Gemini 2.5 Pro | 0.485 (0.08) | 0.449 (0.09) | 0.549 (0.08) | 0.421 (0.08)
Assessment scale, mean (SD)
GPT-4 | 0.427 (0.01)^a | 0.56 (0.08)^a | 0.09 (0.157) | 0.863 (0.24)
GPT-4o | 0.491 (0.09)^a | 0.542 (0.117)^a | 0.236 (0.39) | 0.802 (0.342)
GPT-o3 | 0.436 (0.00)^a | 0.607 (0.00)^a | 0.00 (0.00) | 0.436 (0.00)

^a No test conducted.

Table 6. Results of 4 approaches on the ASDBank dataset in the code generation condition.

Approach and model | Accuracy, mean (SD) | F1-score, mean (SD) | Specificity, mean (SD) | Sensitivity, mean (SD)
No assessment scale
GPT-4 | 0.618 (0.125)^a | 0.616 (0.104)^a | 0.55 (0.286) | 0.71 (0.199)
GPT-4o | 0.653 (0.103)^a | 0.55 (0.184)^a | 0.73 (0.303) | 0.576 (0.378)
GPT-o3 | 0.679 (0.041)^a | 0.679 (0.041)^a | 0.864 (0.083) | 0.433 (0.195)
Claude 3.5 | 0.68 (0.16)^b | 0.6 (0.22)^b | 0.67 (0.35) | 0.69 (0.4)
Gemini 2.5 Pro | 0.74 (0.09)^a | 0.63 (0.14)^a | 0.52 (0.16) | 0.91 (0.08)
Assessment scale
GPT-4 | 0.642 (0.165)^c | 0.628 (0.17)^c | 0.6 (0.334) | 0.695 (0.231)
GPT-4o | 0.628 (0.194)^b | 0.592 (0.1974)^b | 0.689 (0.325) | 0.578 (0.257)
GPT-o3 | 0.679 (0.041)^b | 0.679 (0.041)^b | 0.864 (0.083) | 0.433 (0.195)
Claude 3.5 | 0.64 (0.13)^b | 0.6 (0.23)^b | 0.69 (0.41) | 0.67 (0.36)

^a P=.002 for GPT-4 accuracy; P=.001 for GPT-4 F1-score; P=.03 for GPT-4o accuracy; P=.015 for GPT-4o F1-score; P=.009 for GPT-o3 accuracy; P=.005 for GPT-o3 F1-score; P=.006 for Gemini 2.5 Pro accuracy; P=.003 for Gemini 2.5 Pro F1-score (direct diagnosis versus code generation in non–assessment scale setups when marked in the “No assessment scale” section).

^b No test conducted.

^c P=.99 and P=.99 for GPT-4 accuracy and F1-score (assessment vs non-assessment).

Table 7. Results of 4 approaches on the Distress Analysis Interview Corpus-Wizard-of-Oz (DAIC-WOZ) database in the direct diagnosis condition.

Approach and model | Accuracy | F1-score | Specificity | Sensitivity
Results from Dinkel et al [] | 0.86 | 0.84 | ― | 0.83
Results from Agrawal and Mishra [] | ― | 0.91 | ― | 0.89
No assessment scale, mean (SD)
GPT-4 | 0.333 (0.04) | 0.452 (0.04) | 0.08 (0.51) | 0.939 (0.039)
GPT-4o | 0.623 (0.01) | 0.346 (0.176) | 0.711 (0.168) | 0.409 (0.347)
GPT-o3 | 0.595 (0.05) | 0.252 (0.12) | 0.704 (0.02) | 0.269 (0.05)
Gemini 2.5 Pro | 0.616 (0.11) | 0.222 (0.132) | 0.700 (0.02) | 0.294 (0.09)
Assessment scale, mean (SD)
GPT-4 | 0.56 (0.06)^a | 0.416 (0.07)^a | 0.56 (0.08) | 0.516 (0.05)
GPT-4o | 0.709 (0.05)^a | 0.08 (0.01)^a | 1 (0.00) | 0.429 (0.006)
GPT-o3 | 0.635 (0.05)^b | 0.281 (0.14)^b | 0.72 (0.01) | 0.355 (0.08)
Gemini 2.5 Pro | 0.54 (0.07)^a | 0.363 (0.09)^a | 0.71 (0.06) | 0.306 (0.08)

^a No test conducted.

^b P=.44 for GPT-4 accuracy and P=.43 for F1-score (assessment vs no assessment).

Table 8. Results of 4 approaches on the Distress Analysis Interview Corpus-Wizard-of-Oz (DAIC-WOZ) database in the code generation condition.

Approach and model | Accuracy, mean (SD) | F1-score, mean (SD) | Specificity, mean (SD) | Sensitivity, mean (SD)
No assessment scale
GPT-4 | 0.624 (0.024)^a | 0.268 (0.047)^a | 0.79 (0.035) | 0.233 (0.048)
GPT-4o | 0.681 (0.126)^a | 0.2038 (0.1474)^a | 0.886 (0.087) | 0.2286 (0.2382)
GPT-o3 | 0.6667 (0.0572)^a | 0.1472 (0.1636)^a | 0.1091 (0.1185) | 0.1091 (0.1185)
Claude 3.5 | 0.649 (0.103)^b | 0.2386 (0.113)^b | 0.7672 (0.0251) | 0.2136 (0.1131)
Gemini 2.5 Pro | 0.6138 (0.08)^b | 0.4037 (0.09)^b | 0.6846 (0.11) | 0.4439 (0.11)
Assessment scale
GPT-4 | 0.63 (0.027)^c | 0.271 (0.05)^c | 0.797 (0.036) | 0.233 (0.048)
GPT-4o | 0.681 (0.1587)^b | 0.213 (0.1587)^b | 0.9 (0.073) | 0.223 (0.2389)
GPT-o3 | 0.619 (0.06)^b | 0.283 (0.161)^b | 0.768 (0.09) | 0.2682 (0.17)
Claude 3.5 | 0.657 (0.109)^c | 0.33 (0.1153)^c | 0.7738 (0.1) | 0.328 (0.1153)
Gemini 2.5 Pro | 0.518 (0.068)^b | 0.4822 (0.037)^b | 0.5524 (0.13) | 0.478 (0.03)

^a P<.001 for GPT-4 accuracy; P<.001 for GPT-4 F1-score; P<.001 for GPT-4o accuracy; P<.001 for GPT-4o F1-score; P<.001 for GPT-o3 accuracy; P<.001 for GPT-o3 F1-score (direct diagnosis versus code generation in non–assessment scale setups when marked in the “No assessment scale” section).

^b No test conducted.

^c P=.80, P=.60 for GPT-4 accuracy and F1-score; P=.30, P=.20 for Claude 3.5 accuracy and F1-score (assessment vs non-assessment).

Tables 3 and 4 compare approaches on the AphasiaBank dataset against a baseline performance of 79% across metrics in the study by Cong et al []. All of our direct diagnosis conditions yielded a lower performance than this baseline. Our code generation conditions improved results significantly, with ChatGPT with GPT-o3 achieving the highest F1-score (0.865) along with balanced specificity (0.92) and sensitivity (0.793), surpassing the baseline of Cong et al [].

The results on the ASDBank dataset were compared against the baseline results from Chu et al [], who achieved an F1-score of 0.85 and a high sensitivity of 0.94, although specificity was notably low at 0.2. Our direct diagnosis approaches struggled in comparison, with ChatGPT with GPT-4 and ChatGPT with GPT-o3 producing lower F1-scores (0.598 and 0.575, respectively) and poor specificity. The code generation condition significantly improved overall performance, with Claude 3.5 achieving the highest accuracy (0.68) and F1-score (0.6). The other models also showed improvement, but their performance on specificity and sensitivity was less consistent. Gemini 2.5 Pro was unable to provide ratings on the checklist due to content restrictions related to ethical guidelines.

For the DAIC-WOZ dataset, the studies by Dinkel et al [] and Agrawal and Mishra [] established strong baselines, achieving F1-scores of 0.84 and 0.91, respectively, along with high accuracy and sensitivity. In comparison, our direct diagnosis approaches showed inconsistent performance, with ChatGPT with GPT-4o and ChatGPT with GPT-4 achieving the highest accuracy (0.623) and F1-score (0.452)—notably low values—with even poorer results on the other metrics. While the code generation approaches yielded higher accuracy in some cases, they did not meaningfully improve overall performance as their F1-scores were significantly lower than those of the direct diagnosis condition.

We also note that most comparisons between assessment scale and no assessment scale conditions did not yield statistically significant differences except for ChatGPT with GPT-o3 and Gemini 2.5 Pro in the AphasiaBank direct diagnosis condition, which showed significant improvements in both accuracy and F1-score.

Overall, our findings reveal a substantial gap between the 2 approaches: code generation and direct diagnosis. While code generation and newer models seem to have improved performance compared to direct prompting, they still did not reach the levels reported in previous studies in most cases. Both approaches fell short of established benchmarks, underscoring the limitations of current LLM-based diagnostic methods that rely solely on prompting without model fine-tuning.

Error Analysis

Overview

We first address the errors in the direct diagnosis approach, which did not appear to work well. We observed that most rounds of classification yielded close-to-random performances, especially for older models (ChatGPT with GPT-4 and ChatGPT with GPT-4o). Interestingly, we noticed patterns in the classification ratings produced, such as digits limited to only multiples of 3 or repeating sequences (eg, 3, 2, 1, 0, 3, 2, 1, 0). Table 9 presents the percentage of the 5 classification rounds that followed such patterns. This demonstrates that a direct diagnosis prompting strategy does not work well if models are presented with the entire dataset at once.

Table 9. Percentage of random predictions.

Database and approach | GPT-4 random predictions n=5 (%) | GPT-4o random predictions n=5 (%) | GPT-o3 random predictions n=5 (%) | Gemini 2.5 Pro random predictions (%)
AphasiaBank
Without assessment scale | 80 | 60 | 0 | 0
With assessment scale | 20 | 60 | 100 | 80
ASDBank
Without assessment scale | 20 | 100 | 0 | 80
With assessment scale | 100 | 100 | 0 | ―^a
DAIC-WOZ^b database
Without assessment scale | 40 | 80 | 0 | 20
With assessment scale | 20 | 100 | 20 | 0

^a Not applicable.

^b DAIC-WOZ: Distress Analysis Interview Corpus-Wizard-of-Oz.
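The rating patterns described above (repeating sequences and values restricted to multiples of 3) could be flagged automatically with a check along the following lines. The study does not state how rounds were labeled as patterned, so the heuristics, thresholds, and function names here are assumptions for illustration only.

```python
def is_periodic(ratings, max_period=8):
    """True if the whole sequence is a short block repeated at least twice.

    A period of 1 also catches the case in which every rating is identical.
    """
    n = len(ratings)
    for period in range(1, min(max_period, n // 2) + 1):
        if all(ratings[i] == ratings[i % period] for i in range(n)):
            return True
    return False


def only_multiples_of(ratings, base=3):
    """True if every rating in a non-empty sequence is a multiple of `base`."""
    return len(ratings) > 0 and all(r % base == 0 for r in ratings)


def looks_patterned(ratings):
    """Flag a round of ratings that matches either heuristic."""
    return is_periodic(ratings) or only_multiples_of(ratings)


print(looks_patterned([3, 2, 1, 0, 3, 2, 1, 0]))  # the repeating example from the text
```

A real audit would need to account for chance matches, since short or low-variance rating sequences can satisfy these heuristics without the model behaving randomly.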

For the code generation approach, we found some examples of text archetypes (ie, typical examples) that were frequently misclassified. These archetypes often reflect characteristics of the conditions. Common errors we observed are described in the following sections.

Repetitive Language and Filler Words (Aphasia)

Repetitive language patterns and an increased frequency of filler words accounted for a high proportion of false positives for aphasia. Control participants’ responses typically exhibited minimal repetition and filler word use. However, even a slight elevation in these linguistic elements frequently resulted in misclassification, with the chatbots erroneously classifying control participants as positives. Notably, misclassified false positives from almost all the chatbots contained these features.

Fragmented Sentences and Filler Words (ASD)

Transcripts containing filler words or fragmented sentences were misclassified as originating from individuals with ASD (false positives) in almost 100% of cases. With generative pretrained transformer models, this archetype was observed in most false-positive data points, indicating a consistent misclassification pattern. In contrast, Claude 3.5 exhibited a different trend because most misclassified points were false negatives. Claude 3.5 did not appear to rely excessively on the linguistic features characteristic of this archetype.

Lack of Depressive Language (Depression)

Text lacking overt depressive indicators and conveying generally positive sentiments accounted for a large number of false negatives. For instance, statements such as “uh I’d say maybe the fact that it’s a lot different than it was about ten years ago” and “I am pretty happy with the level of education I’ve gotten” often led to false negatives.

Excessive Amount of Laughter (Depression)

Texts containing instances of laughter were misclassified as originating from control participants rather than individuals with depression (false negatives) in >70% of cases.

Excessive Number of Sighs (Depression)

Texts containing references to sighing were often misclassified as originating from individuals with depression (false positives). Over 30% of false-positive cases included this feature, indicating its disproportionate influence on the classification process.

Frequency of Occurrence of Archetypes

Table 10 details the frequency of occurrence of these archetypes. The observed misclassifications highlight the inherent constraints of relying on text-based methods for neurobehavioral diagnosis.

Table 10. Percentage of each text archetype in false-positive or false-negative data points in the code generation conditions averaged across folds.

Archetype and approach | GPT-4 (%) | GPT-4o (%) | GPT-o3 (%) | Claude 3.5 (%) | Gemini 2.5 Pro (%)
Repetitive language and filler words (false positives)
Without assessment scale | 100 | 100 | 100 | 100 | 90
With assessment scale | 100 | 100 | 0 | 100 | 85
Fragmented sentences and filler words (false positives)
Without assessment scale | 100 | 100 | 66.67 | 0 | 100
With assessment scale | 100 | 100 | 66.67 | 0 | ―^a
Lack of depressive language (false negatives)
Without assessment scale | 87.67 | 89.02 | 30 | 85 | 100
With assessment scale | 87.67 | 89.09 | 100 | 79.09 | 100
Excessive amount of laughter (false negatives)
Without assessment scale | 88.36 | 88.34 | 78 | 90.7 | 96.77
With assessment scale | 88.36 | 88.9 | 80.5 | 90.7 | 100
Excessive number of sighs (false positives)
Without assessment scale | 67.44 | 69.23 | 30.8 | 52 | 60.61
With assessment scale | 65.12 | 74.19 | 29 | 52.17 | 100

^a Not available.

Discussion

Principal Findings

This study reveals the limitations of using LLMs for automated neurobehavioral classification. In both direct diagnosis conditions, we encountered significant limitations with these models, which tended to generate random or close-to-random predictions. The models occasionally refused to offer diagnoses, and when compelled to complete the tasks, the resulting classifications were not accurate. These challenges were even more pronounced with Claude 3.5 and Gemini 2.5 Pro, with which we faced difficulties generating any classification results or ratings in some conditions. The inclusion of assessment scales did not substantially improve performance as the ratings on scale items also appeared to be randomly assigned in most situations. Notably, in many of these conditions, we observed a concerning trend in which assessment scale ratings were often identical across participants regardless of individual differences in their text data.

It is important to note that previous studies have achieved F1-scores of 70% to 80% using subsets of the ASDBank dataset and high performance (F1-scores of 80%-90%) using various methods on at least portions of the other 2 datasets. In contrast, our results indicate that most direct diagnosis approaches and the code generated by these models were not able to attain results similar to those of previous studies. This discrepancy suggests a gap between the performance that ML models can potentially achieve and the outcomes observed in our study. This may be due to our relatively straightforward methodological approach.

Regarding the code generation condition, our findings suggest that LLM-generated ML pipelines show promising potential for improving diagnostic performance. Notably, on the AphasiaBank dataset, ChatGPT with GPT-o3 produced code that outperformed results reported in previous studies, although the choice of learning algorithms sometimes varied across conditions and lacked a clear rationale.

In the code generation condition using assessment scales, we observed that the code from the chatbots did not apply diagnostic thresholds as defined by the assessment scales but, instead, directly incorporated the ratings as ML features. The rating methods were simplistic, and the chatbots frequently implemented a keyword-counting algorithm to provide ratings for ASDBank and DAIC-WOZ. These ratings were then concatenated with features extracted from the feature extractor. This direct concatenation of features without sophisticated integration of diagnostic logic may explain why the assessment scale conditions did not lead to improved performance. More effective integration of these ratings in the generated code may help enhance future model performance.
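To make the distinction concrete, the following is a minimal sketch of the two ways a rating can be used: concatenated with other features, as the generated code did, or compared against a diagnostic cutoff, as an assessment scale prescribes. The keyword lists, the item names, the 2 text features, and the cutoff value are all hypothetical and chosen only for illustration; they are not taken from the study or from any clinical instrument.

```python
import re
from collections import Counter

# Hypothetical keyword lists, one per scale item (illustration only,
# not taken from the study or from any real assessment scale).
SCALE_ITEM_KEYWORDS = {
    "low_mood": {"sad", "hopeless", "down"},
    "fatigue": {"tired", "exhausted"},
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_ratings(text, max_rating=3):
    """Rate each scale item by counting keyword hits, capped at max_rating."""
    counts = Counter(tokenize(text))
    ratings = []
    for keywords in SCALE_ITEM_KEYWORDS.values():
        hits = sum(counts[word] for word in keywords)
        ratings.append(min(hits, max_rating))
    return ratings


def text_features(text):
    """Stand-in feature extractor: token count and type-token ratio."""
    tokens = tokenize(text)
    n_tokens = len(tokens)
    type_token_ratio = len(set(tokens)) / n_tokens if n_tokens else 0.0
    return [float(n_tokens), type_token_ratio]


def concatenated_features(text):
    """Direct concatenation: ratings are appended as extra ML features."""
    return text_features(text) + [float(r) for r in keyword_ratings(text)]


def threshold_decision(text, cutoff=3):
    """Alternative: apply a (hypothetical) scale cutoff to the total score."""
    return sum(keyword_ratings(text)) >= cutoff
```

In the concatenation variant, the classifier is free to ignore or reweight the ratings, so the scale's own decision rule never enters the pipeline. The threshold variant keeps that rule explicit, although it is only as good as the ratings it receives.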

We also observed that models with built-in chain-of-thought reasoning capabilities such as ChatGPT with GPT-o3 and Gemini 2.5 Pro exhibited improved performance under certain conditions. For instance, in the code generation tasks on the AphasiaBank dataset, these chain-of-thought models consistently outperformed others. Permutation tests conducted on the test sets across 5 cross-validation folds revealed statistically significant differences between models that used chain-of-thought reasoning and those that did not (ChatGPT with GPT-4 vs Gemini 2.5 Pro: accuracy P=.01, F1-score P=.03; ChatGPT with GPT-4 vs ChatGPT with GPT-o3: accuracy P<.001, F1-score P<.001; ChatGPT with GPT-4o vs Gemini 2.5 Pro: accuracy P=.01, F1-score P=.002; ChatGPT with GPT-4o vs ChatGPT with GPT-o3: accuracy P<.001, F1-score P<.001). While this improvement was not observed across all datasets (ie, DAIC-WOZ and ASDBank), the integration of structured prompting strategies appears to be a promising direction for future research.
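For readers unfamiliar with permutation tests, the sketch below shows one common form: a paired test on the accuracy difference between 2 models evaluated on the same test items. The text does not specify the exact test statistic or how predictions were pooled across folds, so the pairing scheme, the two-sided statistic, and the function names here are assumptions rather than the procedure used in the study.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def paired_permutation_test(y_true, pred_a, pred_b,
                            n_permutations=10000, seed=0):
    """Two-sided paired permutation test on the accuracy difference.

    Under the null hypothesis the 2 models are interchangeable, so
    swapping their predictions on any test item should not matter.
    """
    rng = random.Random(seed)
    observed = abs(accuracy(y_true, pred_a) - accuracy(y_true, pred_b))
    extreme = 0
    for _ in range(n_permutations):
        perm_a, perm_b = [], []
        for a, b in zip(pred_a, pred_b):
            # Swap the pair of predictions with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        diff = abs(accuracy(y_true, perm_a) - accuracy(y_true, perm_b))
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated P value above zero.
    return (extreme + 1) / (n_permutations + 1)
```

One way to apply this to cross-validated results would be to concatenate the test-set labels and predictions from all folds before calling `paired_permutation_test`; the same structure works for the F1-score by replacing `accuracy` with an F1 function.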

In previous studies, human-in-the-loop processes have demonstrated promise for diagnostic classification tasks. However, in such approaches, the human must remain more involved in the computational diagnosis procedure than simply prompting the LLM to generate a direct diagnosis, clinical rating, or classification code. In prior work on autism diagnostics, for example, humans extracted the behavioral features (a task that requires the ability to interpret relatively subjective human behavior), leaving the ML models to perform the simpler task of final classification given the human-derived features. It is likely that humans will need to continue performing at least some level of data analysis for such systems to achieve clinically useful performance, and future prompt engineering approaches should explore these ideas more thoroughly.

Limitations

We acknowledge several limitations of this study beyond the observed performance gaps.

First, the scope of our investigation was limited to 3 datasets, each representing a distinct neurobehavioral condition with relatively small sample sizes. This may constrain both the robustness and generalizability of our findings, as well as the models’ capacity to learn effectively.

Second, the selection and applicability of the clinical checklists used in the assessment scale approach were limited. In many cases, the patient transcripts lacked sufficient information to reliably rate all items on the scales, potentially resulting in random or invalid scores. Future work may consider using longer or more comprehensive patient transcripts or choosing assessment tools that are more tolerant of limited inputs.

Third, additional prompting strategies warrant exploration. While we observed performance gains from models that incorporated chain-of-thought reasoning by default, other prompting techniques may also enhance diagnostic accuracy.

Finally, all input data were presented to the models at once in a single file. This may have hindered their ability to process the content effectively. Presenting the data incrementally one instance at a time could reduce noise and improve prediction consistency.
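A minimal sketch of the incremental alternative is shown below: each transcript is sent in its own prompt, and the reply is checked against a fixed label set. The `ask_model` callable, the prompt wording, and the label names are hypothetical placeholders for whichever chatbot interface and labels are used; nothing here reflects the prompts used in this study.

```python
def build_prompt(transcript, instructions):
    """Wrap a single transcript in a self-contained prompt."""
    return (
        f"{instructions}\n\n"
        f"Transcript:\n{transcript}\n\n"
        "Answer with exactly one label."
    )


def classify_one_at_a_time(transcripts, ask_model, instructions,
                           labels=("positive", "negative")):
    """Query the model once per transcript instead of once per file.

    transcripts: dict mapping a participant ID to transcript text.
    ask_model:   hypothetical callable taking a prompt string and
                 returning the model's reply as a string.
    Replies outside the allowed label set are stored as None so that
    refusals or malformed answers can be counted separately.
    """
    results = {}
    for participant_id, text in transcripts.items():
        reply = ask_model(build_prompt(text, instructions)).strip().lower()
        results[participant_id] = reply if reply in labels else None
    return results
```

Because each call sees only 1 participant, identical ratings across participants would be easier to detect and attribute to the model rather than to cross-contamination within a single long input.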

Conclusions

This study demonstrates that popular LLM-based chatbots remain inadequate for classifying neurobehavioral conditions from text transcripts even when prompted to incorporate clinical assessment scales into their evaluation strategy. We recommend that future research further investigate the limitations identified in this study and examine whether incorporating structured tools—such as assessment scales—can serve as a viable method to improve diagnostic accuracy for neurobehavioral conditions when using more sophisticated prompting strategies.


