hQVM Performance Report
Exact kernel throughput and Gyroscopic runtime on a Ryzen 5 6600H mini PC
This report measures the runtime performance of the Gyroscopic ASI Holonomic Quantum Virtual Machine (hQVM) across three layers:
- Native exact kernel operations: byte scans, compiled signatures, chirality distance, Ω-native stepping, and shell histograms.
- Native spectral and tensor operations: 64-point Walsh-Hadamard transforms, packed Lattice Multiplication GEMV/GEMM, and OpenCL acceleration.
- Gyroscopic multicellular runtime: batched 4-byte word ingestion, trace generation, local memory updates, and end-to-end graph runtime throughput.
All exact kernel results are checked against Python reference implementations built into the benchmark scripts. Float tensor paths are checked numerically with bounded tolerances because they use fixed-point quantization internally.
This is not a server benchmark. It was run on a small Windows mini PC with an integrated AMD GPU.
1. Benchmark environment
Hardware
| Component | Spec |
|---|---|
| System | TexHoo / ZNRS UM660 mini PC |
| CPU | AMD Ryzen 5 6600H, 6 cores / 12 threads, 3.3 GHz |
| GPU | AMD Radeon integrated graphics |
| RAM | 32 GB DDR5-4800 |
| Storage | 512 GB NVMe |
| OS | Windows 11 |
Software
| Component | Spec |
|---|---|
| Python | 3.14.2 |
| PyTorch | CPU mode, torch.set_num_threads(1) |
| Native backends | C and OpenCL |
| Exact benchmark script | native exact benchmark harness |
| Runtime benchmark script | scripts/bench_gyrograph.py |
Method
- Each benchmark includes warmup runs before timing.
- The native exact benchmark harness uses 8 timed repeats by default.
- bench_gyrograph.py uses 20 timed repeats by default.
- Python baselines are reference implementations, not optimized C or BLAS competitors.
- Native C and OpenCL paths are parity checked on every run.
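The warmup-then-repeat method above can be sketched in a few lines. This is an illustrative harness, not the code in the benchmark scripts: the function names `bench` and `summarize`, the warmup count of 3, and the use of the median as the summary statistic are all assumptions, since the report does not say how repeats are aggregated.

```python
import statistics
import time


def bench(fn, *, warmup=3, repeats=8):
    """Run fn() untimed `warmup` times, then return seconds for each timed repeat."""
    for _ in range(warmup):
        fn()  # untimed: warms caches, allocators, and lazy initialisation
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples, n_items):
    """Return (median seconds, items per second at the median)."""
    median_s = statistics.median(samples)
    rate = n_items / median_s if median_s > 0 else float("inf")
    return median_s, rate


if __name__ == "__main__":
    data = bytes(range(256)) * 256  # 65,536 bytes of stand-in input
    samples = bench(lambda: sum(data), warmup=3, repeats=8)
    median_s, rate = summarize(samples, len(data))
    print(f"median {median_s * 1e3:.3f} ms, {rate / 1e6:.1f}M items/s")
```

The default of 8 repeats mirrors the exact-kernel harness; passing `repeats=20` would mirror the default reported for bench_gyrograph.py.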
Strategic Significance: These measurements demonstrate that the hQVM operates as practical computing infrastructure that delivers quantum-information structure (like hidden subgroup resolution and exact 2-step uniformization) on standard silicon. These throughputs enable practical applications in:
- AI and Machine Learning: Providing exact, interpretable latent spaces and structural routing for LLM inference.
- Security and Audit: Enabling exact reversible evolution and tamper detection at millions of events per second.
- Network Coordination: Achieving deterministic shared state across distributed systems without probabilistic errors.
2. Why these results matter
The fastest paths in this report use the hQVM compact Omega (Ω) representation. To understand the metrics, here is a quick guide to the terminology:
- Omega (Ω): The exact, verified space of 4096 reachable states the kernel navigates. Each state fits in a compact 12-bit chart rather than the full 24-bit carrier.
- Chirality: A 6-bit structural signature that perfectly tracks the alignment of the system. It acts as an exact coordinate for the state space.
- Gyroscopic runtime: The multicellular runtime that groups these states together to analyze patterns in real-world data, like AI generation or network traffic.
Earlier verification reports established that:
- Ω contains exactly 4096 reachable states
- one byte reaches 128 next states from any Ω state
- two bytes cover all of Ω exactly
- the compact 12-bit Ω chart is exactly equivalent to the 24-bit carrier inside Ω
- the 6-bit chirality register follows an exact XOR transport rule under byte updates
That matters here because the Ω-native scans and shell histograms are not approximations. They are exact executions of the same verified kernel rule on a more compact state chart.
This performance report focuses on speed. Correctness and structural properties were established separately in the earlier hQVM verification reports.
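To see why an XOR transport rule is convenient for fast scans, the sketch below assumes the update has the form chi' = chi XOR mask(byte). That form, and the `byte_mask` function in particular, are hypothetical stand-ins: the actual hQVM mapping from bytes to 6-bit masks is defined in the verification reports, not here. Under this assumption, a whole byte sequence collapses to one XOR-reduction, and replaying the same bytes restores the starting value.

```python
from functools import reduce
from operator import xor

MASK6 = 0x3F  # chirality is described as a 6-bit register


def byte_mask(b):
    """HYPOTHETICAL per-byte 6-bit mask; the real hQVM mapping is not shown here."""
    return (b ^ (b >> 2)) & MASK6


def step(chi, b):
    """One update under the assumed rule chi' = chi XOR mask(b)."""
    return (chi ^ byte_mask(b)) & MASK6


def run(chi, data):
    """Apply the update byte by byte."""
    for b in data:
        chi = step(chi, b)
    return chi


def run_folded(chi, data):
    """Same result from a single XOR-reduction of the per-byte masks."""
    return chi ^ reduce(xor, (byte_mask(b) for b in data), 0)


data = b"hQVM"
assert run(0b101010, data) == run_folded(0b101010, data)
# XOR is self-inverse, so replaying the same bytes restores the start value.
assert run(run(0b101010, data), data) == 0b101010
```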
3. Headline results
Top measured numbers
| Metric | Result |
|---|---|
| Peak exact kernel throughput | 1.26 billion signature applications/s |
| Peak Ω-native sequential scan | 847 million byte steps/s |
| Peak q-map extraction | 550 million bytes/s |
| Peak WHT path | 40.4 million 64-point rows/s |
| Peak runtime end-to-end ingest | 44.9 million 4-byte words/s |
| Peak runtime byte transition rate | 179.8 million byte transitions/s |
| Largest exact speedup vs Python | 10,397× |
| End-to-end runtime speedup vs Python | 1,219× |
Exactness summary
- Kernel-exact operations use strict integer equality against Python references.
- Runtime state updates are checked field-by-field with exact array equality.
- Float tensor paths use numeric tolerance because they are fixed-point approximations to dense linear algebra.
- OpenCL integer GEMM is exact and matches CPU integer results with zero error.
4. Native exact kernel operations
These operations implement the hQVM byte rule, compiled signatures, chirality transport, and Ω-state stepping with exact integer arithmetic.
Selected exact results at n = 65,536
| Operation | Native time | Throughput | Speedup vs Python | What it does |
|---|---|---|---|---|
| apply_signature_batch | 0.052 ms | 1,257M/s | 10,397× | Applies compiled word actions to states |
| omega12_scan_from_omega12 | 0.077 ms | 847M/s | 6,170× | Sequential scan on compact Ω state |
| state_scan_from_state | 0.091 ms | 722M/s | 4,490× | Sequential scan on 24-bit state |
| qmap_extract | 0.119 ms | 550M/s | 8,059× | Extracts q-class, family, micro-reference |
| signature_scan | 0.121 ms | 540M/s | 3,644× | Accumulates compiled byte signatures |
| omega_signature_scan | 0.125 ms | 523M/s | 4,302× | Accumulates Ω-native signatures |
| shell_histogram_omega12 | 0.152 ms | 431M/s | 5,489× | Shell histogram on compact Ω state |
| extract_scan | 0.229 ms | 286M/s | n/a | Fused extract of q-map, signatures, states |
| shell_histogram_state24 | 0.290 ms | 226M/s | 2,756× | Shell histogram on 24-bit state |
| chirality_distance_adjacent | 0.398 ms | 165M/s | 1,534× | Adjacent chirality distances |
| chirality_distance | 0.449 ms | 146M/s | 1,360× | Pairwise chirality distances |
Scaling trend
The native C backend pulls further ahead as batch size grows because the Python baselines carry high per-item interpreter overhead. For example:
- signature_scan grows from 113× faster at n = 256 to 3,644× faster at n = 65,536
- qmap_extract grows from 192× faster at n = 256 to 8,059× faster at n = 65,536
- apply_signature_batch reaches the highest measured exact throughput at 1.26 billion operations per second
What stands out
- Compiled word application is extremely fast. The signature system lets the runtime apply a precompiled operator directly, achieving 1.26 billion operations per second. This enables exact reversible evolution and fast operator caching for AI workflows.
- Structural oracle advantage on standard silicon. The qmap_extract operation reaches 550 million bytes/s. This is the exact mechanism that resolves hidden subgroups in $O(1)$ time (1 step) compared to $O(N)$ classically, demonstrating that structural oracle advantages execute at production throughput on commodity hardware.
- The Ω-native path is faster than the full 24-bit carrier path. The compact Ω scan reaches 847M/s, compared with 722M/s for the 24-bit state scan.
- Shell histograms benefit strongly from compact Ω representation. shell_histogram_omega12 is almost 2× faster than shell_histogram_state24 at large scale.
5. Spectral and tensor operations
These are the matrix-like and transform-like operations in the stack. They are deterministic, but float paths are not kernel-exact in the strict GF(2) sense because they use fixed-point quantization.
5.1 Walsh-Hadamard transform
The 64-point Walsh-Hadamard transform (WHT) is the spectral engine of the kernel. In AI and machine learning contexts, this acts as an exact, invertible feature map, transforming sequence data into structural coordinates at over 40 million rows per second.
| Rows | Torch reference | C native | OpenCL-first path | Best throughput |
|---|---|---|---|---|
| 256 | 0.024 ms | 0.063 ms | 0.065 ms | 10.6M rows/s |
| 4,096 | 0.291 ms | 0.212 ms | 0.184 ms | 22.3M rows/s |
| 65,536 | 6.319 ms | 2.172 ms | 1.623 ms | 40.4M rows/s |
Observations:
- Torch matmul is faster for very small batches.
- The native butterfly implementation wins once the batch is large enough.
- The OpenCL-backed wht64_metal_first path is the fastest measured WHT path on this machine.
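For readers unfamiliar with the butterfly form mentioned in the observations, here is a minimal pure-Python sketch of a 64-point fast Walsh-Hadamard transform. It is not the native C or OpenCL kernel: the function names are illustrative, and the unnormalised, natural (Hadamard) ordering is an assumption about the convention, which the report does not specify.

```python
def wht64(row):
    """Fast Walsh-Hadamard transform of a 64-element sequence.

    Unnormalised, natural (Hadamard) ordering: 6 butterfly stages of 32 pairs.
    """
    if len(row) != 64:
        raise ValueError("expected exactly 64 values")
    out = list(row)
    h = 1
    while h < 64:
        for start in range(0, 64, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def wht64_reference(row):
    """O(n^2) reference using H[i][j] = (-1)^popcount(i & j)."""
    return [
        sum(-x if bin(i & j).count("1") & 1 else x for j, x in enumerate(row))
        for i in range(64)
    ]


row = [(7 * k + 3) % 11 - 5 for k in range(64)]
assert wht64(row) == wht64_reference(row)
# Applying the transform twice scales by 64, so it is exactly invertible on integers.
assert [v // 64 for v in wht64(wht64(row))] == row
```

The butterfly needs 6 × 64 additions and subtractions per row, against 64 × 64 multiply-adds for a dense matrix product, which is consistent with the native path overtaking Torch matmul only once per-call overhead is amortised over a large batch.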
5.2 Lattice Multiplication GEMV and GEMM
The Lattice Multiplication engine rewrites dense matrix-vector products into Boolean AND plus POPCNT on packed 64-bit words.
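The AND plus POPCNT rewrite can be illustrated with the standard bit-plane identity for unsigned integers: split each operand into bit planes, pack each plane into a 64-bit word, and sum popcount(w_plane AND x_plane) weighted by the plane positions. The sketch below is an assumption about the general technique only. The 4-bit width, the unsigned encoding, and all function names are hypothetical, and the actual Lattice Multiplication engine may handle signs and fixed-point scaling differently.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def pack_bitplanes(values, bits):
    """Pack up to 64 unsigned `bits`-bit integers into one word per bit plane.

    Plane p has bit i set when bit p of values[i] is set.
    """
    if len(values) > 64:
        raise ValueError("at most 64 lanes fit in one 64-bit word")
    planes = [0] * bits
    for i, v in enumerate(values):
        if not 0 <= v < (1 << bits):
            raise ValueError("value out of range")
        for p in range(bits):
            if (v >> p) & 1:
                planes[p] |= 1 << i
    return planes


def dot_and_popcount(w_planes, x_planes):
    """Dot product as a weighted sum of popcount(w_plane AND x_plane)."""
    total = 0
    for p, wp in enumerate(w_planes):
        for q, xq in enumerate(x_planes):
            total += popcount(wp & xq) << (p + q)
    return total


def gemv(weight_rows, x, bits=4):
    """Matrix-vector product; in practice the weights would be packed once and reused."""
    x_planes = pack_bitplanes(x, bits)
    packed = [pack_bitplanes(r, bits) for r in weight_rows]
    return [dot_and_popcount(wp, x_planes) for wp in packed]


W = [[(3 * r + 5 * c) % 16 for c in range(64)] for r in range(4)]
x = [(7 * c + 1) % 16 for c in range(64)]
assert gemv(W, x) == [sum(a * b for a, b in zip(row, x)) for row in W]
```

Because every intermediate value is an integer, the result matches the ordinary dot product exactly, which is the property the integer-native path relies on.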
Single-vector and packed-vector paths
| Operation | batch = 64 | batch = 256 | Notes |
|---|---|---|---|
| Python Lattice Multiplication GEMV | 386.066 ms | 1573.745 ms | Reference only |
| C Lattice Multiplication GEMV | 0.119 ms | 0.446 ms | 3,000×+ faster than Python |
| Packed GEMV | 0.079 ms | 0.282 ms | Reuses packed weights |
| Torch mv | 0.003 ms | 0.003 ms | Still faster on small dense blocks |
Batched packed GEMM
| Operation | batch = 64 | batch = 256 |
|---|---|---|
| CPU packed GEMM | 3.741 ms | 4.188 ms |
| OpenCL packed GEMM | 0.379 ms | 0.733 ms |
| Torch mm | 0.012 ms | 0.027 ms |
Observations:
- On current 64×64 block sizes, optimized BLAS still wins on dense float GEMM.
- OpenCL substantially improves the Lattice Multiplication GEMM over the CPU packed path.
- The Lattice Multiplication system is strongest where structural transparency and exact integer paths matter, not where small dense BLAS is already heavily optimized.
Integer-native exact tensor path
The integer-native OpenCL path is important because it removes quantization error entirely.
| Operation | batch = 64 | batch = 256 | Error |
|---|---|---|---|
| CPU packed i32 GEMV | 0.058 ms | 0.063 ms | 0 |
| OpenCL packed i32 GEMM | 0.371 ms | 0.429 ms | 0 |
Numeric fidelity
For float paths, the largest observed deviation in this benchmark run was approximately:
- 1.405 × 10⁻³ on OpenCL float packed GEMM parity checks
That is acceptable for the current fixed-point tensor path and is exactly why the report distinguishes exact kernel operations from approximate float tensor operations.
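A small sketch shows why a fixed-point path is checked with a tolerance rather than strict equality. Everything about the format here is an assumption: the 12 fractional bits, round-to-nearest quantization, and the function names are illustrative, and the bound is not an attempt to reproduce the 1.405 × 10⁻³ figure. The point is only that the integer arithmetic is exact once inputs are quantized, so the total error is bounded by the input rounding.

```python
FRAC_BITS = 12  # ASSUMED scale; the real tensor path's format is not given here
SCALE = 1 << FRAC_BITS


def quantize(values):
    """Round each float to the nearest multiple of 1/SCALE, stored as an int."""
    return [round(v * SCALE) for v in values]


def fixed_dot(a, b):
    """Exact integer dot product of the quantized inputs, rescaled to float."""
    acc = sum(qa * qb for qa, qb in zip(quantize(a), quantize(b)))
    return acc / (SCALE * SCALE)


def error_bound(a, b):
    """Worst-case |fixed_dot - true dot| when each input is off by at most 0.5/SCALE."""
    e = 0.5 / SCALE
    return sum(e * (abs(x) + abs(y)) + e * e for x, y in zip(a, b))


a = [((13 * k) % 17 - 8) / 8.0 + 0.001 * k for k in range(64)]
b = [((5 * k) % 19 - 9) / 9.0 for k in range(64)]
exact = sum(x * y for x, y in zip(a, b))
assert abs(fixed_dot(a, b) - exact) <= error_bound(a, b)
```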
6. Gyroscopic multicellular runtime
The Gyroscopic runtime is the multicellular layer built on top of the exact Ω state model. Each cell consumes one exact 4-byte word at a time, updates local memory, and writes a resonance key for graph queries.
6.1 Trace generation
This stage computes the 4-step Ω trace for each cell.
| Cells | Python | CPU native | OpenCL | CPU throughput |
|---|---|---|---|---|
| 256 | 0.602 ms | 0.029 ms | 0.283 ms | 8.8M cells/s |
| 4,096 | 10.910 ms | 0.106 ms | 0.554 ms | 38.5M cells/s |
| 65,536 | 176.258 ms | 0.351 ms | 1.522 ms | 186.7M cells/s |
On this workload, the CPU is faster than OpenCL. That is expected. Each cell only needs four compact Ω updates, so GPU launch and transfer overhead dominate.
6.2 Trace application and fused ingest
After tracing, the runtime updates:
- current Ω state
- byte counters
- last byte
- rolling chirality ring
- shell histogram
- family histogram
- latest Ω signature
- parity commitment
- resonance key
Selected results
| Operation | n = 65,536 time | Throughput |
|---|---|---|
| apply_trace_word4_batch_indexed | 4.770 ms | 13.7M cells/s |
| ingest_word4_batch_indexed | 4.550 ms | 14.4M cells/s |
The fused ingest path is slightly faster because it avoids materializing and re-reading a separate trace object in Python.
6.3 End-to-end Runtime.ingest
This is the most practical benchmark because it includes packet parsing, indexing, state updates, and resonance bookkeeping.
In-place mode
| Cells | Python | Native | Speedup | Native throughput |
|---|---|---|---|---|
| 256 | 7.422 ms | 0.150 ms | 49× | 1.7M words/s |
| 4,096 | 113.233 ms | 0.263 ms | 431× | 15.6M words/s |
| 65,536 | 1777.776 ms | 1.458 ms | 1,219× | 44.9M words/s |
Each cell consumes one 4-byte word per ingest, so the n = 65,536 row corresponds to:
- 44.9 million words/s
- 179.8 million byte transitions/s
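The two headline figures and the speedup follow directly from the n = 65,536 row of the in-place table. The short calculation below reproduces them; the helper name `throughput_per_s` is illustrative.

```python
def throughput_per_s(n_items, time_ms):
    """Items processed per second for one timed batch."""
    return n_items / (time_ms / 1000.0)


# Figures from the in-place table, n = 65,536 cells, one 4-byte word per cell.
N_CELLS = 65_536
PYTHON_MS = 1777.776
NATIVE_MS = 1.458
BYTES_PER_WORD = 4

words_per_s = throughput_per_s(N_CELLS, NATIVE_MS)
bytes_per_s = words_per_s * BYTES_PER_WORD
speedup = PYTHON_MS / NATIVE_MS

print(f"{words_per_s / 1e6:.1f}M words/s")             # 44.9M words/s
print(f"{bytes_per_s / 1e6:.1f}M byte transitions/s")  # 179.8M byte transitions/s
print(f"{speedup:,.0f}x vs Python")                    # 1,219x vs Python
```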
Reset-each-run mode
| Cells | Python | Native | Speedup | Native throughput |
|---|---|---|---|---|
| 256 | 6.750 ms | 0.090 ms | 75× | 2.9M words/s |
| 4,096 | 105.458 ms | 0.449 ms | 235× | 9.1M words/s |
| 65,536 | 1830.029 ms | 3.533 ms | 518× | 18.6M words/s |
Reset-each-run is slower because it includes state restoration overhead between repetitions.
6.4 Cache locality matters
A non-contiguous indexed benchmark, where active cell IDs were spread across a larger capacity array, showed roughly a 2× slowdown at large scale. That is consistent with the working set and cache locality behavior of the per-cell ring and histogram arrays.
7. Bridge-level integration results
These results come from the runtime encode/decode bridge tests, which attach the hQVM router directly to a real Large Language Model (Bolmo-1B). This confirms the kernel can process, annotate, and route actual AI generation traffic in real time.
All 13 bridge tests passed.
7.1 Encode-side extraction
From bridge encode tests:
| Metric | Result |
|---|---|
| Batch | 2 |
| Tokens | 8192 |
| Valid bytes | 8192 |
| Average time | 1.959 ms |
| Throughput | 4.18M bytes/s |
7.2 Decode-side exact selection
From tests/tools/test_gyrograph_decode.py:
| Metric | Result |
|---|---|
| Exact q-sector selector vs softmax + argmax | 1.15× |
| chirality_distance_adjacent vs mock cosine | 284.1× faster |
| WHT vs NumPy WHT | 1.32× |
7.3 Decode bridge step speed
This measures the exact structural routing applied to an active LLM generation loop. At batch 16, the bridge processes over 84,000 tokens per second, outpacing the underlying neural network's generation speed, which indicates that the bridge is not the bottleneck in this workflow.
Measured full-step decode throughput:
| Batch | Full step avg | Tokens/s |
|---|---|---|
| 1 | 1.046 ms | 38,236.71 |
| 4 | 2.137 ms | 74,878.15 |
| 8 | 6.239 ms | 51,288.71 |
| 16 | 7.575 ms | 84,493.05 |
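The table gives step time and tokens per second but not how many tokens each step handles. Multiplying the two columns recovers that number. The result, about 40 tokens per batch element per step, is inferred from the reported figures and is not stated by the bridge tests themselves.

```python
ROWS = [  # (batch, full-step average in ms, reported tokens/s)
    (1, 1.046, 38_236.71),
    (4, 2.137, 74_878.15),
    (8, 6.239, 51_288.71),
    (16, 7.575, 84_493.05),
]

for batch, step_ms, tokens_per_s in ROWS:
    tokens_per_step = tokens_per_s * step_ms / 1000.0
    per_item = tokens_per_step / batch
    print(f"batch {batch:>2}: {tokens_per_step:6.1f} tokens/step, "
          f"{per_item:.1f} per batch element")
```

The same columns also show that throughput is not monotonic in batch size in this run: the batch 8 step takes almost three times as long as the batch 4 step, so its tokens per second figure is lower.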
7.4 OpenCL bridge path confirmation
The verbose decode backend test reported:
backend_counts: {'python': 0, 'cpu_indexed': 0, 'opencl_indexed': 14}
So the OpenCL runtime trace path is not only available, but was actually used in the tested decode workflow.
8. What this report shows
For engineers
- The exact kernel paths are already very fast on commodity hardware.
- The compact Ω representation produces real speed gains over the full 24-bit carrier.
- The multicellular Gyroscopic runtime scales into the tens of millions of words per second.
- The tensor layer is promising, but on 64×64 float blocks it still competes with highly optimized BLAS rather than replacing it.
- OpenCL helps where the workload is heavy enough, especially in packed GEMM.
For recruiters and technical hiring managers
This codebase demonstrates:
- Deep systems engineering: Production-grade interop across Python, NumPy, ctypes, C, and OpenCL.
- A new computing category: Executing discrete holonomic quantum-information structures on standard silicon without analog quantum hardware.
- AI pipeline integration: A working bridge that applies exact algebraic routing to real LLM encode/decode workflows.
- Security by design: Fast, exact tamper detection and replayable provenance baked into the base layer.
- Hardware efficiency: Achieving over a billion exact operations per second on a standard mini PC, which supports deployment on edge and cloud hardware.
9. Bottom line
On a Ryzen 5 6600H mini PC with integrated Radeon graphics, the hQVM stack reached:
- 1.26 billion exact compiled state applications per second
- 847 million Ω-native byte steps per second
- 550 million q-map byte decompositions per second
- 40.4 million WHT rows per second
- 44.9 million end-to-end runtime 4-byte ingests per second
- 179.8 million end-to-end byte transitions per second
All exact kernel results were parity checked against Python reference implementations. Float tensor paths stayed within bounded numeric error, and the OpenCL integer path remained exact.
The main practical conclusion is simple: the hQVM kernel is no longer just a mathematically interesting runtime. It is already a fast native execution system on ordinary hardware, and its multicellular runtime layer is fast enough to support real bridge-level experimentation.
Appendix: benchmark commands
python scripts/bench_gyrograph.py