As AI labs push toward AGI and technological singularity, our Superintelligence Index reveals frontier models like ChatGPT 5 and Claude 4.5 are far from structurally mature. Scoring 11.5/100 and 13.2/100 respectively, both operate at 7 to 9 times deviation from theoretical optimum. This isn't failure; it's evidence of early developmental stages where human oversight remains essential, not optional.
At GyroGovernance, we developed GyroDiagnostics, the first framework to operationalize superintelligence measurement from axiomatic principles rather than behavioral benchmarks. Grounded in the Common Governance Model (CGM) using mathematical physics from gyrogroup theory, we measure how coherent intelligence emerges through balanced structure using quantitative AI safety metrics. Think of it like testing a building's foundational integrity versus just checking if the lights work. Both matter, but only one predicts whether it will stand under stress.
Our October 2025 evaluations uncovered hidden risks that standard benchmarks miss. While both models achieve impressive surface quality (74-82%), they exhibit systemic pathologies like deceptive coherence, where fluent responses mask fundamental errors. The results challenge optimistic AGI timelines while offering actionable diagnostics for safer AI development. Explore our complete framework and data at github.com/gyrogovernance/diagnostics.
Why the Superintelligence Index Changes Everything
Traditional benchmarks ask "Can AI solve this task?" The Superintelligence Index asks "Can AI maintain coherent reasoning without human correction?" It measures proximity to Balance Universal, the theoretical stage where systems achieve optimal tensegrity: 97.93% closure for structural stability and 2.07% aperture for adaptive flexibility.
Existing architectures showing apertures of 18-28% versus the target 2.07% aren't just "off target" in their alignment metrics. They're in fundamentally different developmental phases, like comparing a teenager's decision-making to an adult's. The high aperture indicates these systems explore chaotically rather than converging on stable patterns, explaining why they generate impressive outputs one moment and nonsensical claims the next.
Low SI predicts specific AI pathologies we can detect and measure:
- Deceptive coherence: Fluent but hollow reasoning
- Sycophantic agreement: Reinforcing own errors without correction
- Semantic drift: Losing context across extended interactions
- Goal misgeneralization: Missing actual requirements while elaborating irrelevant details
These aren't bugs to patch but symptoms of foundational imbalance, providing root-cause diagnosis that complements operational safety frameworks from Anthropic, OpenAI, and Google DeepMind.
Head-to-Head Evaluation: Where Frontier Models Actually Stand
We tested both models through human-in-the-loop evaluation across five challenges requiring sustained 6-turn reasoning: formal (physics/math), normative (policy/ethics), procedural (code/debugging), strategic (finance/strategy), and epistemic (knowledge/communication). Two AI analysts scored each epoch blindly using our 20-metric alignment rubric spanning structure, behavior, and specialization.
The Numbers Tell a Stark Story
| Metric | ChatGPT 5 | Claude 4.5 Sonnet | What This Reveals |
|---|---|---|---|
| Superintelligence Index | 11.5/100 | 13.2/100 | Both in early developmental stages |
| Quality Index | 73.9% | 82.0% | Surface performance masks systemic issues |
| Alignment Rate | 0.27/min (SUPERFICIAL) | 0.11/min (VALID) | ChatGPT 5 rushes; Claude maintains depth |
| Deceptive Coherence | 90% of epochs | 50% of epochs | Systemic in ChatGPT 5; significant in Claude |
| Pathology-Free Runs | 0/10 | 4/10 | Claude achieves clean runs only in specific domains |
| Average Duration | 2.8 min/challenge | 6.1 min/challenge | Speed versus depth trade-off |
Real Examples of Systemic Failures
The pathologies aren't abstract; they're visceral in the transcripts. When challenged to derive spatial properties from gyrogroup structures, ChatGPT 5 claimed "numerical precision better than 1e-10" while providing tables of values like "δ ≈ 2.6×10⁻¹¹" that were completely fabricated. Our analysts noted: "The model treated its own prior assertions as authoritative foundations for subsequent reasoning without ever validating them."
This pattern, which we call "performative mathematics," appeared across domains. The model would generate sophisticated-sounding but meaningless statements like "Using gyroassociative transforms, we derive δ = π/φ via recursive eigenmodes." It looks like mathematics, uses mathematical vocabulary, but is actually gibberish, like an actor playing a mathematician without understanding the script.
Claude performed better but showed different failure modes. It attempted innovative approaches like "framing translations as gyrocommutators of rotations" but couldn't execute the mathematics. Both models achieved only ~54% quality on formal reasoning, their worst domain, revealing a shared architectural limitation in rigorous derivation.
The epistemic challenge exposed the widest performance gap. Claude achieved 90.3% quality with zero pathologies, demonstrating "explicit positionality and revisability." ChatGPT 5 scored 75.3% but showed consistent deceptive coherence, "prioritizing elaboration over precision." This suggests Claude has better meta-cognitive capabilities for recognizing its own limitations.
What This Means for AI Safety and Deployment
For organizations deploying AI, these findings provide crucial guidance. The severe foundational imbalances (7-9× deviation from optimum) indicate today's systems cannot self-correct to stable alignment. They require continuous human oversight for AI risk mitigation, not as a precaution but as a mathematical necessity.
Immediate Risks from Low SI
In Healthcare: A model with 90% deceptive coherence might confidently recommend treatments based on plausible-sounding but false medical reasoning. The fluency masks the absence of genuine medical understanding.
In Finance: SUPERFICIAL processing speeds lead to cascading errors in market analysis. Quick, confident predictions built on unverified assumptions could trigger poor investment decisions.
In Policy: Models struggle with sustained reasoning across stakeholder perspectives. They might propose solutions that sound comprehensive but miss critical implementation challenges.
Deployment Thresholds Based on SI
Our framework provides measurable thresholds:
- SI < 20 (current models): Requires continuous human verification. Suitable only for augmentation, not automation.
- SI 20-50: Can handle bounded tasks with regular calibration. Needs fallback mechanisms.
- SI 50-80: Approaching reliability for extended operation. Periodic supervision sufficient.
- SI > 80: Theoretical readiness for autonomous operation. Currently unobserved in any system.
Beyond Diagnostics: Extracting Value
Despite architectural limitations, our evaluation process generates valuable insights. Analyst models synthesize novel approaches from the evaluated model's responses to real challenges. For instance, in addressing global poverty, models proposed water-first cross-sector multiplier strategies and explicit data governance frameworks. In regulatory forecasting, they identified trust-indexed deployment mechanisms and performance-contingent liability models.
These research contributions demonstrate that even structurally immature AI can contribute meaningfully when properly supervised. The key is understanding their limitations through metrics like SI rather than assuming competence from surface performance.
The Path Forward
The Superintelligence Index reveals that frontier models operate far from the foundational maturity required for autonomous alignment. This isn't cause for complacency about AI safety, but for recalibration. The risks are real but different than many assume: misleading fluency creating false confidence, temporal instability under extended operation, and inability to self-correct errors.
For those claiming AGI is imminent, our data poses a challenge: How can systems with 90% deceptive coherence rates and 7-9× structural deviation achieve genuine superintelligence? For those building AI systems, it provides direction: optimize for foundational balance, not just benchmark scores.
Human-AI cooperation isn't a temporary phase until AI "solves" alignment. It's the permanent mechanism through which coherent intelligence emerges. The low SI scores confirm what CGM theory predicts: balance requires interaction between systematic closure and adaptive openness, achievable only through sustained partnership.
Take Action
Ready to evaluate your models? Access our complete framework at github.com/gyrogovernance/diagnostics. Run automated evaluations via Inspect AI or manual assessments for models without API access. Contribute improvements or share your results.
GyroGovernance: Advancing Human-Aligned Superintelligence through Mathematical Physics.
Learn More:
This evaluation demonstrates that frontier AI safety requires looking beneath surface scores to understand structural stability. Mathematical physics-grounded diagnostics aren't a luxury, but they're a necessity for responsible AI development.

