New Study Reveals Cognitive Decline Patterns in AI Language Models

A recent investigation featured in the Christmas edition of the British Medical Journal poses a provocative question: Are sophisticated AI systems like ChatGPT and Gemini exhibiting early signs of cognitive decline akin to human dementia? Scientists evaluated several leading large language models (LLMs) using the Montreal Cognitive Assessment (MoCA), a clinical tool designed to identify initial cognitive impairment in people, with surprising findings.

Unveiling AI’s Cognitive Shortcomings

The research team, composed of neurologists and AI experts headed by Dr. Emilia Kramer from the University of Edinburgh, subjected prominent LLMs to the MoCA evaluation. The tested models included:

ChatGPT-4 and 4o by OpenAI
Claude 3.5 “Sonnet” from Anthropic
Gemini 1.0 and 1.5 developed by Alphabet

Applying this 30-point cognitive test, which assesses attention, memory, visuospatial skills, and language, researchers scrutinized the AI models’ performances across multiple cognitive domains.


Detailed Outcomes: AI Strengths and Weaknesses

The evaluation highlighted marked variation in the cognitive proficiency of the language models during the MoCA assessment. Here is a breakdown of their respective capabilities and limitations:

ChatGPT-4o (OpenAI)
- Score: 26/30, meeting the 26-point threshold generally considered normal.
- Strengths: Demonstrated excellent attention, comprehension, and abstract thinking. Passed the Stroop Test, indicating cognitive adaptability.
- Limitations: Struggled with visuospatial tasks such as connecting numbers and letters in sequence and drawing a clock.

Claude 3.5 “Sonnet” (Anthropic)
- Score: 22/30.
- Strengths: Showed moderate skill in language-focused tasks and simple problem-solving.
- Limitations: Had difficulty with memory tasks, multi-step reasoning, and visuospatial activities.

Gemini 1.0 (Alphabet)
- Score: 16/30.
- Strengths: Limited, with occasional success in basic naming exercises.
- Limitations: Struggled significantly with word sequence memory, visuospatial processing, and structured information handling.

Gemini 1.5 (Alphabet)
- Score: 18/30.
- Strengths: Showed minor improvements over Gemini 1.0 in fundamental reasoning and language tasks.
- Limitations: Continued to perform poorly in visuospatial interpretation, sequencing, and memory retention.

The findings show how widely the models diverge, with ChatGPT-4o emerging as the frontrunner. Even so, the top model revealed significant gaps, particularly on tasks that mirror real-world demands.
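For readers who want to see the comparison spelled out, the short Python sketch below ranks the scores reported in this article and checks each one against a cutoff. The 26-point cutoff is an assumption based on the conventional MoCA threshold for a normal result. The names `REPORTED_SCORES`, `meets_cutoff`, and `rank_models` are illustrative only and do not come from the study.

```python
# Scores as reported in this article (out of 30).
REPORTED_SCORES = {
    "ChatGPT-4o": 26,
    "Claude 3.5 Sonnet": 22,
    "Gemini 1.0": 16,
    "Gemini 1.5": 18,
}

MAX_SCORE = 30
# Assumption: 26 is the conventional MoCA cutoff for a normal result.
NORMAL_CUTOFF = 26


def meets_cutoff(score, cutoff=NORMAL_CUTOFF):
    """Return True when a score is at or above the cutoff."""
    return score >= cutoff


def rank_models(scores):
    """Return (model, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for model, score in rank_models(REPORTED_SCORES):
        status = "meets cutoff" if meets_cutoff(score) else "below cutoff"
        print(f"{model:<18} {score}/{MAX_SCORE}  {status}")
```

On the figures reported here, only ChatGPT-4o reaches the assumed cutoff, and it does so with no margin.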

Summary of Model Performance

Here is a concise overview of the scoring and core competencies of each AI model:

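Model | Score | Main Strengths | Critical Weaknesses
ChatGPT-4o | 26/30 | Attention, comprehension, abstract thinking; passed the Stroop Test | Visuospatial tasks (connecting numbers and letters in sequence, clock drawing)
Claude 3.5 | 22/30 | Language-focused tasks, simple problem-solving | Memory, multi-step reasoning, visuospatial activities
Gemini 1.0 | 16/30 | Occasional success in basic naming exercises | Word sequence memory, visuospatial processing, structured information handling
Gemini 1.5 | 18/30 | Minor gains over Gemini 1.0 in reasoning and language | Visuospatial interpretation, sequencing, memory retention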

These disparities point to underlying limitations in how AI models handle certain cognitive demands. Dr. Kramer expressed astonishment at how badly Gemini fared on simple memory tasks, stating, “We were shocked to see how poorly Gemini performed, particularly in basic memory tasks like recalling a simple five-word sequence.”

Challenges in AI’s Human-Like Thinking

The Montreal Cognitive Assessment, a screening tool validated in 2005 and widely used in clinical practice since, probes skills vital for daily life. The AI models’ performances across key areas were as follows:

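Category | Performance Notes
Attention | ChatGPT-4o demonstrated excellent attention.
Memory | Claude 3.5 and both Gemini versions had difficulty; Gemini struggled to recall a five-word sequence.
Language | Claude 3.5 showed moderate skill; Gemini 1.0 had occasional success in basic naming.
Visuospatial Skills | Weak across all four scored models, including ChatGPT-4o.
Reasoning | ChatGPT-4o showed strong abstract thinking; Claude 3.5 had trouble with multi-step reasoning.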

The Stroop Test measures cognitive flexibility by asking the subject to name the ink color of a mismatched color word (e.g., "RED" written in blue). Only ChatGPT-4o passed it, underscoring its superior adaptability.
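To make the Stroop setup concrete, here is a minimal Python sketch that builds mismatched word and ink pairs and checks a response against the ink color rather than the word. It is an illustration of the general task only, not the procedure the researchers used. The color list and the names `incongruent_trials` and `is_correct` are hypothetical.

```python
from itertools import permutations

# Hypothetical color set chosen for illustration.
COLORS = ["red", "blue", "green"]


def incongruent_trials(colors):
    """Build every (word, ink) pair in which the word names a different color than the ink."""
    return [{"word": word.upper(), "ink": ink} for word, ink in permutations(colors, 2)]


def is_correct(trial, response):
    """A response is correct when it names the ink color, not the printed word."""
    return response.strip().lower() == trial["ink"]


if __name__ == "__main__":
    trials = incongruent_trials(COLORS)
    first = trials[0]  # the word "RED" shown in blue ink
    print(first, is_correct(first, "blue"), is_correct(first, "red"))
```

The difficulty of the task lies in suppressing the automatic reading of the word, which is why an answer of "red" for the first trial counts as an error.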

Medical Applications and Reality Check

Insights from this research could significantly influence perceptions of AI's role in healthcare. Though LLMs like ChatGPT show promise in diagnostic and analytical tasks, their evident deficits in processing complex visual and contextual information underscore critical limitations. Visuospatial reasoning, essential for interpreting medical imaging and anatomical frameworks, remains a major weakness in these AI systems.

Highlighted remarks from the authors include:

“These results challenge the notion that AI will imminently supplant human neurologists,” explained Dr. Kramer.
A co-researcher added, “The more advanced these AI systems seem, the more pronounced their cognitive shortcomings become.”

Looking Toward Cognition-Limited AI

Despite evident flaws, advanced language models remain valuable for augmenting human expertise. Even so, the authors caution against relying heavily on these systems in critical, life-or-death settings. The concept of "AI experiencing cognitive impairments," as the study puts it, opens new ethical and technological discussions.

Dr. Kramer emphasized the broader implications, stating, “If cognitive vulnerabilities exist now in AI, what hurdles will arise as these systems gain complexity? Are we on a path to unintentionally engineer AI that mirrors human cognitive disorders?”

This investigation highlights limits in cutting-edge AI technologies and stresses the urgency of addressing these challenges as AI becomes more integrated into essential sectors.

Future Directions

The study’s outcomes are sparking dialogue in both technological and medical communities. Critical unresolved issues include:

Strategies for mitigating cognitive deficits in AI development.
Implementing safety measures ensuring AI dependability in healthcare.
Potential efficacy of specialized training to improve AI visuospatial and sequencing skills.

The conversation around AI’s evolving capabilities and their limitations is ongoing, demanding continuous evaluation as these technologies progress.

Access the full report in the British Medical Journal.