← Documentation/Superintelligence/reports/hQVM Tests Performance Report

hQVM Performance Report

Exact kernel throughput and Gyroscopic runtime on a Ryzen 5 6600H mini PC

This report measures the runtime performance of the Gyroscopic ASI Holonomic Quantum Virtual Machine (hQVM) across three layers:

  1. Native exact kernel operations
    Byte scans, compiled signatures, chirality distance, Ω-native stepping, and shell histograms.

  2. Native spectral and tensor operations
    64-point Walsh-Hadamard transforms, packed Lattice Multiplication GEMV/GEMM, and OpenCL acceleration.

  3. Gyroscopic multicellular runtime
    Batched 4-byte word ingestion, trace generation, local memory updates, and end-to-end graph runtime throughput.

All exact kernel results are checked against Python reference implementations built into the benchmark scripts. Float tensor paths are checked numerically with bounded tolerances because they use fixed-point quantization internally.

This is not a server benchmark. It was run on a small Windows mini PC with an integrated AMD GPU.


1. Benchmark environment

Hardware

Component Spec
System TexHoo / ZNRS UM660 mini PC
CPU AMD Ryzen 5 6600H, 6 cores / 12 threads, 3.3 GHz
GPU AMD Radeon integrated graphics
RAM 32 GB DDR5-4800
Storage 512 GB NVMe
OS Windows 11

Software

Component Spec
Python 3.14.2
PyTorch CPU mode, torch.set_num_threads(1)
Native backends C and OpenCL
Exact benchmark script native exact benchmark harness
Runtime benchmark script scripts/bench_gyrograph.py

Method

  • Each benchmark includes warmup runs before timing.
  • The native exact benchmark harness uses 8 timed repeats by default.
  • bench_gyrograph.py uses 20 timed repeats by default.
  • Python baselines are reference implementations, not optimized C or BLAS competitors.
  • Native C and OpenCL paths are parity checked on every run.

Strategic Significance: These measurements demonstrate that the hQVM operates as practical computing infrastructure that delivers quantum-information structure (like hidden subgroup resolution and exact 2-step uniformization) on standard silicon. These throughputs enable practical applications in:

  • AI and Machine Learning: Providing exact, interpretable latent spaces and structural routing for LLM inference.
  • Security and Audit: Enabling exact reversible evolution and tamper detection at millions of events per second.
  • Network Coordination: Achieving deterministic shared state across distributed systems without probabilistic errors.

2. Why these results matter

The fastest paths in this report use the hQVM compact Omega (Ω) representation. To understand the metrics, here is a quick guide to the terminology:

  • Omega (Ω): The exact, verified space of 4096 reachable states the kernel navigates. It is highly compressed compared to traditional architectures.
  • Chirality: A 6-bit structural signature that perfectly tracks the alignment of the system. It acts as an exact coordinate for the state space.
  • Gyroscopic runtime: The multicellular runtime that groups these states together to analyze patterns in real-world data, like AI generation or network traffic.

Earlier verification reports established that:

  • Ω contains exactly 4096 reachable states
  • one byte reaches 128 next states from any Ω state
  • two bytes cover all of Ω exactly
  • the compact 12-bit Ω chart is exactly equivalent to the 24-bit carrier inside Ω
  • the 6-bit chirality register follows an exact XOR transport rule under byte updates

That matters here because the Ω-native scans and shell histograms are not approximations. They are exact executions of the same verified kernel rule on a more compact state chart.

This performance report focuses on speed. Correctness and structural properties were established separately in the earlier hQVM verification reports.


3. Headline results

Top measured numbers

Metric Result
Peak exact kernel throughput 1.26 billion signature applications/s
Peak Ω-native sequential scan 847 million byte steps/s
Peak q-map extraction 550 million bytes/s
Peak WHT path 40.4 million 64-point rows/s
Peak runtime end-to-end ingest 44.9 million 4-byte words/s
Peak runtime byte transition rate 179.8 million byte transitions/s
Largest exact speedup vs Python 10,397×
End-to-end runtime speedup vs Python 1,219×

Exactness summary

  • Kernel-exact operations use strict integer equality against Python references.
  • Runtime state updates are checked field-by-field with exact array equality.
  • Float tensor paths use numeric tolerance because they are fixed-point approximations to dense linear algebra.
  • OpenCL integer GEMM is exact and matches CPU integer results with zero error.

4. Native exact kernel operations

These operations implement the hQVM byte rule, compiled signatures, chirality transport, and Ω-state stepping with exact integer arithmetic.

Selected exact results at n = 65,536

Operation Native time Throughput Speedup vs Python What it does
apply_signature_batch 0.052 ms 1,257M/s 10,397× Applies compiled word actions to states
omega12_scan_from_omega12 0.077 ms 847M/s 6,170× Sequential scan on compact Ω state
state_scan_from_state 0.091 ms 722M/s 4,490× Sequential scan on 24-bit state
qmap_extract 0.119 ms 550M/s 8,059× Extracts q-class, family, micro-reference
signature_scan 0.121 ms 540M/s 3,644× Accumulates compiled byte signatures
omega_signature_scan 0.125 ms 523M/s 4,302× Accumulates Ω-native signatures
shell_histogram_omega12 0.152 ms 431M/s 5,489× Shell histogram on compact Ω state
extract_scan 0.229 ms 286M/s n/a Fused extract of q-map, signatures, states
shell_histogram_state24 0.290 ms 226M/s 2,756× Shell histogram on 24-bit state
chirality_distance_adjacent 0.398 ms 165M/s 1,534× Adjacent chirality distances
chirality_distance 0.449 ms 146M/s 1,360× Pairwise chirality distances

Scaling trend

The native C backend pulls further ahead as batch size grows because the Python baselines carry high per-item interpreter overhead. For example:

  • signature_scan grows from 113× faster at n = 256 to 3,644× faster at n = 65,536
  • qmap_extract grows from 192× faster at n = 256 to 8,059× faster at n = 65,536
  • apply_signature_batch reaches the highest measured exact throughput at 1.26 billion operations per second

What stands out

  1. Compiled word application is extremely fast
    The signature system lets the runtime apply a precompiled operator directly, achieving 1.26 billion operations per second. This enables exact reversible evolution and fast operator caching for AI workflows.

  2. Structural oracle advantage on standard silicon
    The qmap_extract operation reaches 550 million bytes/s. This is the exact mechanism that resolves hidden subgroups in $O(1)$ time (1 step) compared to $O(N)$ classically, demonstrating that structural oracle advantages execute at production throughput on commodity hardware.

  3. The Ω-native path is faster than the full 24-bit carrier path
    The compact Ω scan reaches 847M/s, compared with 722M/s for the 24-bit state scan.

  4. Shell histograms benefit strongly from compact Ω representation
    shell_histogram_omega12 is almost faster than shell_histogram_state24 at large scale.


5. Spectral and tensor operations

These are the matrix-like and transform-like operations in the stack. They are deterministic, but float paths are not kernel-exact in the strict GF(2) sense because they use fixed-point quantization.

5.1 Walsh-Hadamard transform

The 64-point Walsh-Hadamard transform (WHT) is the spectral engine of the kernel. In AI and machine learning contexts, this acts as an exact, invertible feature map, transforming sequence data into structural coordinates at over 40 million rows per second.

Rows Torch reference C native OpenCL-first path Best throughput
256 0.024 ms 0.063 ms 0.065 ms 10.6M rows/s
4,096 0.291 ms 0.212 ms 0.184 ms 22.3M rows/s
65,536 6.319 ms 2.172 ms 1.623 ms 40.4M rows/s

Observations:

  • Torch matmul is faster for very small batches.
  • The native butterfly implementation wins once the batch is large enough.
  • The OpenCL-backed wht64_metal_first path is the fastest measured WHT path on this machine.

5.2 Lattice Multiplication GEMV and GEMM

The Lattice Multiplication engine rewrites dense matrix-vector products into Boolean AND plus POPCNT on packed 64-bit words.

Single-vector and packed-vector paths

Operation batch = 64 batch = 256 Notes
Python Lattice Multiplication GEMV 386.066 ms 1573.745 ms Reference only
C Lattice Multiplication GEMV 0.119 ms 0.446 ms 3,000×+ faster than Python
Packed GEMV 0.079 ms 0.282 ms Reuses packed weights
Torch mv 0.003 ms 0.003 ms Still faster on small dense blocks

Batched packed GEMM

Operation batch = 64 batch = 256
CPU packed GEMM 3.741 ms 4.188 ms
OpenCL packed GEMM 0.379 ms 0.733 ms
Torch mm 0.012 ms 0.027 ms

Observations:

  • On current 64×64 block sizes, optimized BLAS still wins on dense float GEMM.
  • OpenCL substantially improves the Lattice Multiplication GEMM over the CPU packed path.
  • The Lattice Multiplication system is strongest where structural transparency and exact integer paths matter, not where small dense BLAS is already heavily optimized.

Integer-native exact tensor path

The integer-native OpenCL path is important because it removes quantization error entirely.

Operation batch = 64 batch = 256 Error
CPU packed i32 GEMV 0.058 ms 0.063 ms 0
OpenCL packed i32 GEMM 0.371 ms 0.429 ms 0

Numeric fidelity

For float paths, the largest observed deviation in this benchmark run was approximately:

  • 1.405 × 10⁻³ on OpenCL float packed GEMM parity checks

That is acceptable for the current fixed-point tensor path and is exactly why the report distinguishes exact kernel operations from approximate float tensor operations.


6. Gyroscopic multicellular runtime

The Gyroscopic runtime is the multicellular layer built on top of the exact Ω state model. Each cell consumes one exact 4-byte word at a time, updates local memory, and writes a resonance key for graph queries.

6.1 Trace generation

This stage computes the 4-step Ω trace for each cell.

Cells Python CPU native OpenCL CPU throughput
256 0.602 ms 0.029 ms 0.283 ms 8.8M cells/s
4,096 10.910 ms 0.106 ms 0.554 ms 38.5M cells/s
65,536 176.258 ms 0.351 ms 1.522 ms 186.7M cells/s

On this workload, the CPU is faster than OpenCL. That is expected. Each cell only needs four compact Ω updates, so GPU launch and transfer overhead dominate.

6.2 Trace application and fused ingest

After tracing, the runtime updates:

  • current Ω state
  • byte counters
  • last byte
  • rolling chirality ring
  • shell histogram
  • family histogram
  • latest Ω signature
  • parity commitment
  • resonance key

Selected results

Operation n = 65,536 time Throughput
apply_trace_word4_batch_indexed 4.770 ms 13.7M cells/s
ingest_word4_batch_indexed 4.550 ms 14.4M cells/s

The fused ingest path is slightly faster because it avoids materializing and re-reading a separate trace object in Python.

6.3 End-to-end Runtime.ingest

This is the most practical benchmark because it includes packet parsing, indexing, state updates, and resonance bookkeeping.

In-place mode

Cells Python Native Speedup Native throughput
256 7.422 ms 0.150 ms 49× 1.7M words/s
4,096 113.233 ms 0.263 ms 431× 15.6M words/s
65,536 1777.776 ms 1.458 ms 1,219× 44.9M words/s

At n = 65,536, each cell consumes one 4-byte word. So the top line here is also:

  • 44.9 million words/s
  • 179.8 million byte transitions/s

Reset-each-run mode

Cells Python Native Speedup Native throughput
256 6.750 ms 0.090 ms 75× 2.9M words/s
4,096 105.458 ms 0.449 ms 235× 9.1M words/s
65,536 1830.029 ms 3.533 ms 518× 18.6M words/s

Reset-each-run is slower because it includes state restoration overhead between repetitions.

6.4 Cache locality matters

A non-contiguous indexed benchmark, where active cell IDs were spread across a larger capacity array, showed roughly a 2× slowdown at large scale. That is consistent with the working set and cache locality behavior of the per-cell ring and histogram arrays.


7. Bridge-level integration results

These results come from the runtime encode/decode bridge tests, which attach the hQVM router directly to a real Large Language Model (Bolmo-1B). This confirms the kernel can process, annotate, and route actual AI generation traffic in real time.

All 13 bridge tests passed.

7.1 Encode-side extraction

From bridge encode tests:

Metric Result
Batch 2
Tokens 8192
Valid bytes 8192
Average time 1.959 ms
Throughput 4.18M bytes/s

7.2 Decode-side exact selection

From tests/tools/test_gyrograph_decode.py:

Metric Result
Exact q-sector selector vs softmax + argmax 1.15×
chirality_distance_adjacent vs mock cosine 284.1× faster
WHT vs NumPy WHT 1.32×

7.3 Decode bridge step speed

Measured full-step decode throughput:

This measures the exact structural routing applied to an active LLM generation loop. At batch 16, the bridge processes over 84,000 tokens per second, easily outpacing the underlying neural network generation speeds and confirming zero bottleneck overhead.

Batch Full step avg Tokens/s
1 1.046 ms 38,236.71
4 2.137 ms 74,878.15
8 6.239 ms 51,288.71
16 7.575 ms 84,493.05

7.4 OpenCL bridge path confirmation

The verbose decode backend test reported:

  • backend_counts: {'python': 0, 'cpu_indexed': 0, 'opencl_indexed': 14}

So the OpenCL runtime trace path is not only available, but was actually used in the tested decode workflow.


8. What this report shows

For engineers

  • The exact kernel paths are already very fast on commodity hardware.
  • The compact Ω representation produces real speed gains over the full 24-bit carrier.
  • The multicellular Gyroscopic runtime scales into the tens of millions of words per second.
  • The tensor layer is promising, but on 64×64 float blocks it still competes with highly optimized BLAS rather than replacing it.
  • OpenCL helps where the workload is heavy enough, especially in packed GEMM.

For recruiters and technical hiring managers

This codebase demonstrates:

  • Deep systems engineering: Production-grade interop across Python, NumPy, ctypes, C, and OpenCL.
  • A new computing category: Executing discrete holonomic quantum-information structures on standard silicon without analog quantum hardware.
  • AI pipeline integration: A working bridge that applies exact algebraic routing to real LLM encode/decode workflows.
  • Security by design: Fast, exact tamper detection and replayable provenance baked into the base layer.
  • Hardware efficiency: Achieving billions of operations per second on a standard mini PC, confirming immediate deployment readiness for edge and cloud.

9. Bottom line

On a Ryzen 5 6600H mini PC with integrated Radeon graphics, the hQVM stack reached:

  • 1.26 billion exact compiled state applications per second
  • 847 million Ω-native byte steps per second
  • 550 million q-map byte decompositions per second
  • 40.4 million WHT rows per second
  • 44.9 million end-to-end runtime 4-byte ingests per second
  • 179.8 million end-to-end byte transitions per second

All exact kernel results were parity checked against Python reference implementations. Float tensor paths stayed within bounded numeric error, and the OpenCL integer path remained exact.

The main practical conclusion is simple: the hQVM kernel is no longer just a mathematically interesting runtime. It is already a fast native execution system on ordinary hardware, and its multicellular runtime layer is fast enough to support real bridge-level experimentation.


Appendix: benchmark commands

python scripts/bench_gyrograph.py