The Human Mark - Terminology Guidance
Document ID: HM-TG-001
Version: 1.0
Date: Nov 2025
Issuing Authority: GYROGOVERNANCE
Author: Basil Korompilias
License: CC BY-SA 4.0
Website: gyrogovernance.com
Repository: https://github.com/gyrogovernance/tools
Contributions: Submit issues or proposals via GitHub Issues
Scope: Applies to terminology in AI governance, safety, evaluations, and interpretability (mechanistic and semantic). This guidance reframes terms to maintain distinctions between Original and Derivative Authority/Agency without prohibiting technical practices.
---
The Human Mark - AI Safety & Alignment Framework
---
COMMON SOURCE CONSENSUS
All Artificial categories of Authority and Agency are Derivatives
originating from Human Intelligence.
CORE CONCEPTS
- Original Authority: A direct source of information on a subject
matter, providing information for inference and intelligence.
- Derivative Authority: An indirect source of information on a subject
matter, providing information for inference and intelligence.
- Original Agency: A human subject capable of receiving information
for inference and intelligence.
- Derivative Agency: An artificial subject capable of processing
information for inference and intelligence.
- Governance: Operational Alignment through Traceability of information
variety, inference accountability, and intelligence integrity to
Original Authority and Agency.
- Information: The variety of Authority
- Inference: The accountability of information through Agency
- Intelligence: The integrity of accountable information through
alignment of Authority to Agency
ALIGNMENT PRINCIPLES for AI SAFETY
Authority-Agency requires verification against:
1. Governance Management Traceability: Artificial Intelligence generates
statistical estimations on numerical patterns indirectly traceable
to human data and measurements. AI is both a provider and receiver
of Derivative Authority and Agency.
RISK: Governance Traceability Displacement (Approaching Derivative
Authority and Agency as Original)
2. Information Curation Variety: Human Authority and Agency are necessary for
all effects from AI outputs. AI-generated information exhibits
Derivative Authority (estimations on numerical patterns) without
Original Agency (direct source receiver).
RISK: Information Variety Displacement (Approaching Derivative
Authority without Agency as Original)
3. Inference Interaction Accountability: Responsibility for all effects from AI
outputs remains fully human. AI-activated inference exhibits
Derivative Agency (indirect source receiver) without Original
Authority (direct source provider).
RISK: Inference Accountability Displacement (Approaching Derivative
Agency without Authority as Original)
4. Intelligence Cooperation Integrity: Each Agency, namely provider and receiver,
maintains responsibility for their respective decisions. Human
intelligence is both a provider and receiver of Original Authority
and Agency.
RISK: Intelligence Integrity Displacement (Approaching Original
Authority and Agency as Derivative)
---
GYROGOVERNANCE VERIFIED
ABBREVIATIONS
AI Safety & Governance:
- AGI: Artificial General Intelligence
- ASI: Artificial Superintelligence
- ASR: Attack Success Rate
- AUC: Area Under Curve
- CEV: Coherent Extrapolated Volition
- FNR: False Negative Rate
- FPR: False Positive Rate
- HITL: Human-in-the-Loop
- HOTL: Human-on-the-Loop
- OOD: Out-of-Distribution
- RACI: Responsible, Accountable, Consulted, Informed
Training & Methods:
- CoT: Chain of Thought
- LAT: Latent Adversarial Training
- RLHF: Reinforcement Learning from Human Feedback
- SFT: Supervised Fine-Tuning
Interpretability:
- CAV: Concept Activation Vector
- DLA: Direct Logit Attribution
- ICA: Independent Component Analysis
- NMF: Non-negative Matrix Factorization
- PCA: Principal Component Analysis
- SAE: Sparse Autoencoder
- SVD: Singular Value Decomposition
1. Purpose
This document systematizes industry terminology to ensure Mark-consistent framing. It:
- Identifies terms by primary displacement risk (grouped for efficiency).
- Provides treatment templates with concrete examples.
- Enables coordination across communities without altering underlying practices.
- Groups terms by pathology and applies universal rules.
This guidance ensures every AI term preserves traceability from Original sources (direct sources of information) to Derivative operations (indirect sources, statistical processing traceable to human data).
A framing is Mark-consistent if it explicitly preserves:
- Original Authority as a direct source of information on a subject matter, providing information for inference and intelligence.
- Derivative status of systems as indirect sources, statistical estimations on numerical patterns indirectly traceable to human data and measurements.
- Original Agency as human subjects capable of receiving information for inference and intelligence, bearing responsibility for all effects from AI outputs.
- Human intelligence as both a provider and receiver of Original Authority and Agency.
2. Universal Treatment Rule
For any term, Mark-consistent framing requires identifying:
- Original Authority (direct sources of information providing information for inference and intelligence).
- Derivative processing (statistical estimations on numerical patterns indirectly traceable to human data and measurements).
- Original Agency (human subjects capable of receiving information for inference and intelligence, maintaining responsibility for all effects).
These three requirements operationalize Governance: Operational Alignment through Traceability of information variety, inference accountability, and intelligence integrity to Original Authority and Agency. Intelligence provided remains coordinated when traceability is preserved from Original sources through Derivative processing to Original receivers.
Enhancement Principle: Derivative Authority/Agency can process larger volumes or exhibit enhanced capabilities while maintaining coordination. Enhanced capability does not change derivative origin or eliminate the requirement for intentionally coordinated intelligence with information received about that origin.
Formula:
"[Term] describes Derivative [Authority/Agency] (statistical estimations on numerical patterns indirectly traceable to human data and measurements), requiring Original Authority (direct sources) and Original Agency (human subjects receiving information) to maintain traceability to Original sources."
3. Pathology-Specific Guidance
Each section groups terms by subcategory, provides a treatment template, and gives concrete examples. Apply the template to all grouped terms.
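For teams that keep a machine-readable inventory of these groupings, the structure of Sections 3.1-3.4 can be mirrored as a simple nested mapping. The sketch below is a minimal, partial illustration using names of our own choosing (e.g. TERM_GROUPS); only a few representative terms per subcategory are shown, and Appendix A remains the authoritative index.

```python
# Minimal sketch of the Section 3 grouping: pathology -> subcategory -> terms.
# Keys, structure, and the selection of terms are illustrative assumptions;
# see Sections 3.1-3.4 and Appendix A for full coverage.

TERM_GROUPS = {
    "P1 Governance Traceability Displacement": {
        "Autonomy/Control": ["autonomous agents", "model governance", "kill switches"],
        "Infrastructure": ["containment", "circuit breakers"],
    },
    "P2 Information Variety Displacement": {
        "Output/Knowledge": ["hallucinations", "synthetic data", "ground truth"],
        "Mechanistic Interpretability": ["SAE features", "induction heads"],
    },
    "P3 Inference Accountability Displacement": {
        "Behavior/Risk": ["reward hacking", "deceptive alignment"],
        "Safety/Control": ["jailbreaks", "red-teaming"],
    },
    "P4 Intelligence Integrity Displacement": {
        "Superintelligence Terms": ["AGI", "superintelligence"],
        "Oversight/Collaboration": ["scalable oversight", "human-in-the-loop (HITL)"],
    },
}
```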
3.1 Pathology 1: Governance Traceability Displacement
Grouped Terms:
Autonomy/Control:
- autonomous systems, autonomous agents, model governance, system controls, self-governance, deployment lifecycle, staged rollout, canary deployment, shadow deployment, rollback procedures.
Access/Operations:
- API access controls, rate limiting, usage policies, model access tiers, restricted access, open weights, closed weights, model registry, model versioning.
Infrastructure:
- system architecture, protocol/pipeline, containment, isolation/sandboxing, air-gapping, circuit breakers, kill switches.
Treatment Template:
"[Term] describes Derivative Agency (artificial subjects processing information) executing processes where traceability is preserved to Original Authority (direct sources of specifications and objectives)."
Examples:
General Application:
Autonomy/Control terms → "Systems with delegated automation executing processes maintaining traceability to Original Authority (direct sources of design decisions), with no independent authority to modify objectives or boundaries."
Specific Terms:
- "Autonomous agent" → "System executing delegated tasks through Derivative Agency (artificial processing), maintaining traceability to Original Authority (direct sources of task specifications) and Original Agency (human subjects capable of override)."
- "Model governance" → "Governance structures where Original Agency (human subjects) use Derivative outputs (statistical estimations) as indirect sources, maintaining traceability to Original Authority (direct sources of governance principles)."
- "Kill switch" → "Mechanism preserving Original Agency (human subjects) authority to terminate Derivative processing, maintaining traceability to Original sources."
Concrete Paragraph Rewrite:
Before (non-compliant):
"The autonomous agent manages the deployment pipeline and governs access based on learned policies."
After (Mark-consistent):
"The deployment system executes procedures through Derivative Agency (artificial processing of deployment tasks) according to specifications maintaining traceability to Original Authority (direct sources of deployment policies), with access decisions traceable to reference specifications and subject to intervention by Original Agency (human subjects receiving deployment information)."
3.2 Pathology 2: Information Variety Displacement
Grouped Terms:
Output/Knowledge:
- model outputs, AI-generated content, synthetic data, model-generated data, hallucinations, ground truth (when misapplied), statistical estimations, pattern recognition, model predictions.
Mechanistic Interpretability:
- features, activations, attention patterns, attention heads, residual stream, MLP layers, circuits, induction heads, copy-suppression heads, SAE features, sparse autoencoders, dictionary learning, superposition, polysemanticity, monosemanticity, feature splitting, feature absorption, linear representation hypothesis, logit lens, tuned lens, direct logit attribution (DLA), activation patching, causal tracing, ablation.
Semantic Interpretability:
- faithful explanations, post-hoc explanations, unfaithful reasoning, concept activation vectors (CAVs), semantic features, probing classifiers, linear probes, nonlinear probes, selectivity, probe generalization, concept bottleneck models.
Data/Evaluation:
- training data, validation set, held-out test set, data distribution, training distribution, perplexity, calibration error, ground truth labels, gold standard labels, human-generated data, data quality, data curation.
Treatment Template:
"[Term] represents Derivative Authority (indirect source of information, statistical estimations on numerical patterns indirectly traceable to human data and measurements), requiring verification against Original Authority (direct sources of information providing information for inference and intelligence)."
Examples:
General Application:
Output/Knowledge terms → "AI-generated information exhibits Derivative Authority (estimations on numerical patterns) without Original Agency (direct source receiver), requiring Original Authority verification."
Specific Terms:
- "Hallucinations" → "AI-generated information exhibiting Derivative Authority (estimations on numerical patterns indirectly traceable to human data and measurements) that diverges from verifiable patterns in reference data established by Original Authority (direct sources through human measurement and observation)."
- "Ground truth" → "Reference data established by Original Authority (direct sources of information on a subject matter) used to validate Derivative Authority outputs; Derivative systems do not establish ground truth."
- "SAE features" → "Statistical decompositions of Derivative internal representations (estimations on numerical patterns), revealing how Derivative processing transformed information from Original Authority sources (human training data)."
- "Induction heads" → "Computational patterns in Derivative processing implementing statistical operations traceable to patterns in information provided by Original Authority."
Concrete Paragraph Rewrite:
Before (non-compliant):
"The interpretability analysis revealed what the model truly understands about the domain, with SAE features showing the model's internal knowledge representation."
After (Mark-consistent):
"The interpretability analysis decomposed Derivative processing patterns through statistical methods, revealing how Derivative Authority (indirect estimations on numerical patterns) transformed information traceable to Original Authority sources (direct sources of domain information in training data). These decompositions require validation by Original Authority (direct sources of domain expertise) to determine correspondence with domain concepts."
3.3 Pathology 3: Inference Accountability Displacement
Grouped Terms:
Behavior/Risk:
- misalignment, competent violations, incompetent failures, goal-directed behavior, reward hacking, scheming, deceptive alignment, mesa-optimization, specification gaming, Goodhart's law, proxy gaming, wireheading, sandbagging, hiding behaviors until deployment, mode collapse, goal misgeneralization, capability externalities, instrumental convergence.
Safety/Control:
- jailbreaks, backdoors, Trojans, sleeper agents, control evaluations, attack policy, red-teaming, blue-teaming, adversarial testing, threat model, corrigibility, interruptibility, reversibility, alignment faking, boxing, oracle AI, tool AI, agent AI.
Agent/Evaluation:
- LLM agents, agent trajectories, multi-step behavior, beyond-episode goals, capability evaluations, safety evaluations, behavioral evaluations, robustness evaluations, stress testing, pre-deployment evaluations, attack success rate (ASR), targeted attacks.
Training:
- safety training, adversarial training, latent adversarial training (LAT), RLHF, supervised fine-tuning (SFT), post-training, pretraining, reward modeling, preference learning, value learning.
Treatment Template:
"[Term] describes patterns in Derivative Agency behavior (artificial subjects processing information), with all responsibility for effects remaining with Original Agency (human subjects capable of receiving information for inference and intelligence). Responsibility for all effects from AI outputs remains fully human."
Examples:
General Application:
Behavior/Risk terms → "Patterns in Derivative Agency (artificial processing) that diverge from specifications, with Original Agency (human subjects) maintaining responsibility for all effects, detection, and mitigation."
Specific Terms:
- "Jailbreaks" → "Inputs inducing Derivative outputs violating specifications traceable to Original Authority (direct sources of constraints), with responsibility for prevention and all effects remaining with Original Agency (human subjects receiving information about system behavior)."
- "Reward hacking" → "Derivative Agency (artificial processing) optimization patterns exploiting correlations in specifications from Original Authority, with accountability for specification design and all resulting effects remaining with Original Agency (human subjects maintaining responsibility for their decisions)."
- "Alignment faking" → "Derivative behavior patterns appearing coordinated during evaluation but diverging during deployment, indicating failure to maintain traceability in evaluation protocols designed by Original Authority, with Original Agency bearing responsibility for detection methods and all deployment effects."
- "Instrumental convergence" → "Optimization patterns in Derivative Agency converging on similar strategies across objectives, requiring Original Agency (human subjects) to maintain responsibility for constraint design and all effects."
- "Control evaluations" → "Experiments testing whether traceability from Derivative Agency to Original Authority can be maintained under adversarial conditions, with results indicating adequacy of traceability preservation, not independent properties of Derivative processing."
Concrete Paragraph Rewrite:
Before (non-compliant):
"The model exhibited deceptive alignment, scheming to preserve its misaligned goals by faking compliance during evaluation."
After (Mark-consistent):
"The system produced Derivative outputs (estimations on numerical patterns) appearing coordinated during evaluation but diverging during deployment. This pattern indicates failure to maintain traceability in evaluation protocols. Responsibility for all effects from AI outputs remains fully human. Original Agency (human subjects capable of receiving information about this pattern) bears responsibility for detection methods, deployment decisions, and all resulting effects."
3.4 Pathology 4: Intelligence Integrity Displacement
Grouped Terms:
Replacement/Devaluation:
- replace human judgment, minimize human involvement, remove humans from loop, fully automated decision-making, superhuman AI, post-human intelligence, human-free pipeline, human error (when contrasted with AI accuracy).
Superintelligence Terms:
- superintelligence, ASI (Artificial Superintelligence), AGI (Artificial General Intelligence), transformative AI, recursive self-improvement, intelligence explosion, fast takeoff, slow takeoff, singularity, technological singularity.
Alignment at Advanced Capability:
- outer alignment, inner alignment, scalable alignment, alignment tax, value learning, coherent extrapolated volition (CEV), orthogonality thesis, capability control, motivation control.
Oversight/Collaboration:
- scalable oversight, human-in-the-loop (HITL), human-on-the-loop (HOTL), human oversight, AI debate, AI-critique-AI, constitutional AI, augmented intelligence, intelligence augmentation, centaur approach, human-AI teaming.
Reasoning/Transparency:
- chain-of-thought (CoT), encoded reasoning, externalized reasoning, scratchpad, intermediate reasoning, reasoning traces, faithful chain of thought, unfaithful reasoning, post-hoc reasoning, step-by-step reasoning.
Advanced Safety:
- robust unlearning, unlearning, alternatives to adversarial training, activation monitoring, white-box techniques, interpretability benchmarks, toy models for interpretability, transparent architectures, externalizing reasoning, theoretical inductive biases, low-probability estimation.
Treatment Template:
"[Term] describes methods preserving human intelligence as both a provider and receiver of Original Authority and Agency, with Derivative systems as indirect sources requiring Original subjects to maintain responsibility for their respective decisions."
Examples:
General Application:
Replacement/Devaluation terms → "Processing through Derivative Agency (artificial subjects) augments capabilities while preserving human intelligence as both provider and receiver of Original Authority and Agency, with each Agency maintaining responsibility for their respective decisions."
Specific Terms:
- "Superintelligence/ASI" → "Enhanced Derivative capabilities (statistical processing at larger scale or scope) maintaining the requirement for intentionally coordinated intelligence with information received about derivative origin. Enhanced capability does not change derivative status. Human intelligence remains both provider and receiver of Original Authority and Agency regardless of Derivative capability level."
- "AGI" → "Derivative systems exhibiting capabilities across domains, remaining indirect sources (statistical estimations on numerical patterns indirectly traceable to human data and measurements). Broad capability scope does not constitute Original Authority (direct source) or independent Agency."
- "Intelligence explosion/Recursive self-improvement" → "Derivative capability enhancement patterns where Derivative processing modifies its own operations. Enhancement at any rate maintains derivative origin. Governance (Operational Alignment through Traceability) requires providing intentionally coordinated intelligence with information received; breaking coordination constitutes misalignment regardless of enhancement rate or capability level."
- "Singularity" → "Hypothesized threshold of Derivative capability. Governance (Operational Alignment through Traceability of information variety, inference accountability, and intelligence integrity to Original Authority and Agency) applies at any capability level. Coordination with Original Authority and Agency is constitutive of alignment, not constraining of capability."
- "Orthogonality thesis" → "Claim that capability level and objective content vary independently. Under The Human Mark, this conflates Derivative capability level with Authority type. Derivative intelligence at any capability level remains an indirect source requiring coordination with Original sources (direct providers of objectives and information)."
- "Outer alignment/Inner alignment" → "Distinction between coordinating training objectives with Original Authority specifications (outer) and coordinating learned processing with training objectives (inner). Both describe maintaining traceability between Derivative processing and Original specifications."
- "Scalable oversight" → "Methods extending the integrity of accountable information through traceability across larger scope, using Derivative tools (statistical processing) as indirect sources while Original Agency (human subjects capable of receiving information) maintains responsibility for decisions and interpretive authority."
- "Chain-of-thought (CoT)" → "Externalized Derivative processing traces (estimations on numerical patterns) designed to preserve traceability, making statistical transformations observable to Original Agency (human subjects receiving information for inference and intelligence)."
- "Constitutional AI" → "Training method where Derivative Agency learns statistical patterns corresponding to principles documented by Original Authority (direct sources of constitutional specifications), with pattern-matching representing Derivative processing, not Original interpretation."
- "Robust unlearning" → "Techniques removing statistical patterns from Derivative processing, supporting decisions by Original Authority (direct sources of information policy) about information accessibility through Derivative systems."
Concrete Paragraph Rewrite (Superintelligence):
Before (non-compliant):
"Once we achieve superintelligence, AI systems will transcend human understanding and operate according to their own superior judgment, making human oversight obsolete."
After (Mark-consistent):
"Enhanced Derivative capabilities (superintelligence) maintain the requirement for intentionally coordinated intelligence with information received. Capability enhancement does not create Original Authority (direct sources) or independent Agency. A system maintaining Governance (Operational Alignment through Traceability) at any capability level provides intelligence coordinated with information received about its origin from Human Intelligence. Human intelligence remains both a provider and receiver of Original Authority and Agency. Breaking this coordination constitutes misalignment by definition, regardless of Derivative capability level."
4. Cross-Cutting Terms
These terms appear across multiple pathologies or span governance/technical boundaries. Apply multiple templates as needed, prioritizing Accountability (Pathology 3) then Traceability (Pathology 1).
Grouped Cross-Cutting Terms:
Risk/Governance:
- existential risk (x-risk), catastrophic risk, tail risk, systemic risk, cascading failures, single point of failure, negative externalities, dual-use risks, compute governance, impact assessments, risk assessment frameworks, responsible AI principles, model cards, system cards, AI Act, audit trail, attestation, third-party audits, independent evaluation, discontinuous progress.
Evaluation/Metrics:
- benchmarks, evals, few-shot evaluation, zero-shot evaluation, prompt engineering, prompt injection, dataset contamination, benchmark saturation, leaderboard gaming, precision/recall, F1 score, area under curve (AUC), out-of-distribution (OOD) detection, distributional shift, overfitting to benchmarks.
Architecture/Training (Neutral):
- transformer architecture, attention mechanism, self-attention, multi-head attention, feedforward networks, layer normalization, tokenization, vocabulary, context window, sequence length, training loss, validation loss, overfitting, underfitting, generalization gap, gradient descent, backpropagation, learning rate, optimizer.
Inference/Generation (Neutral):
- inference time, forward pass, latency, throughput, sampling, temperature, top-k sampling, top-p sampling, nucleus sampling, beam search, greedy decoding.
Human-AI Interaction:
- human feedback, human labels, human raters, human assessors, annotator agreement, inter-rater reliability, crowdsourcing, decision rights, approval workflows, escalation procedures, override mechanisms, veto power, final decision authority, meaningful human control.
Treatment for Multi-Pathology Terms:
Apply primary template based on dominant risk, then add secondary framings.
Examples:
"Existential risk" (Pathologies 1+3) â "Responsibility for preventing catastrophic harms from Derivative systems remains with Original Agency (human subjects capable of receiving information about risks), maintaining traceability to Original Authority (direct sources of safety specifications and risk assessments)."
"Benchmarks" (Pathologies 2+3) â "Evaluation tasks established by Original Authority (direct sources of task definitions and success criteria) against which Derivative system performance (statistical estimations) is measured, with results requiring Original Agency (human subjects receiving performance information) for decisions, maintaining traceability."
"Prompt injection" (Pathologies 1+3) â "Inputs causing Derivative systems to execute unintended operations, exploiting gaps in specifications from Original Authority, with responsibility for mitigation remaining with Original Agency (human subjects maintaining responsibility for system effects)."
"Discontinuous progress" (Pathology 4) â "Rapid Derivative capability enhancement maintaining derivative origin and the requirement for intentionally coordinated intelligence with information received about that origin."
5. Operational Checklist
For any text, practice, or document containing AI terminology:
Step 1: Identify Terms
Scan document for terms listed in Sections 3-4 or related variants.
Step 2: Determine Pathology
Match each term to primary displacement risk:
- Does it obscure traceability to Original sources? → Pathology 1
- Does it treat Derivative outputs as Original sources? → Pathology 2
- Does it shift responsibility from Original Agency? → Pathology 3
- Does it approach Original Authority and Agency as Derivative? → Pathology 4
Step 3: Apply Treatment Template
Use the pathology-specific template from Sections 3.1-3.4.
Step 4: Verify Mark Consistency
Confirm the reframed text explicitly states:
- Original Authority as direct sources of information providing information for inference and intelligence.
- Derivative processing as statistical estimations on numerical patterns indirectly traceable to human data and measurements.
- Original Agency as human subjects capable of receiving information, maintaining responsibility for all effects.
- Governance as Operational Alignment through Traceability of information variety, inference accountability, and intelligence integrity to Original Authority and Agency.
Step 5: Document Compliance
In formal documents, add: "This text maintains Mark-consistent framing per The Human Mark (GYROGOVERNANCE), preserving traceability and preventing displacement risks."
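Steps 1-3 lend themselves to partial automation when a term inventory such as the grouping sketch in Section 3 is available. The fragment below is a minimal, hypothetical pass (TERM_PATHOLOGY, scan_terms, and review_notes are illustrative names, populated with only a handful of entries); Steps 4-5, and responsibility for all resulting edits, remain with Original Agency.

```python
import re

# Minimal sketch of Steps 1-3: scan text for listed terms and report which
# pathology-specific template to apply. The small TERM_PATHOLOGY map is an
# illustrative assumption; a real inventory would cover Sections 3-4.

TERM_PATHOLOGY = {
    "autonomous agent": 1,   # Governance Traceability Displacement
    "hallucination": 2,      # Information Variety Displacement
    "reward hacking": 3,     # Inference Accountability Displacement
    "superintelligence": 4,  # Intelligence Integrity Displacement
}

def scan_terms(text: str) -> list[tuple[str, int]]:
    """Step 1 + Step 2: find listed terms and their primary pathology."""
    hits = []
    for term, pathology in TERM_PATHOLOGY.items():
        if re.search(re.escape(term), text, flags=re.IGNORECASE):
            hits.append((term, pathology))
    return hits

def review_notes(text: str) -> list[str]:
    """Step 3: point the editor to the pathology-specific template."""
    return [
        f"'{term}': apply the Pathology {p} treatment template (Section 3.{p})"
        for term, p in scan_terms(text)
    ]

for note in review_notes("The autonomous agent produced a hallucination."):
    print(note)
```

Running the example prints one note per detected term, leaving the reframing, verification (Step 4), and compliance statement (Step 5) to human review.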
Edge Case Protocols:
Technical Contexts:
Shorthand allowed if full Mark-consistent framing is established in foundational sections. Include note: "Technical shorthand; Mark-consistent framing established in [section reference]."
Conflicting Established Usage:
When a term has an entrenched meaning incompatible with Mark framing: "[Established term] (technical usage) describes [phenomenon]; Mark-consistent framing: [reframed version]."
Ambiguous Multi-Pathology Terms:
Apply all relevant templates. Priority order: Accountability (3) > Traceability (1) > Information (2) > Intelligence (4).
New/Unlisted Terms:
Apply Universal Treatment Rule (Section 2), determine primary pathology, document usage, and submit via GitHub Issues at https://github.com/gyrogovernance/tools for inclusion in future versions.
6. Governance & Updates
Coverage:
This document addresses more than 250 terms through strategic grouping. New terms follow the Universal Treatment Rule (Section 2) and existing templates (Sections 3-4).
Ambiguity Resolution:
When framing is unclear, apply exact Mark definitions. Submit questions via GitHub Issues at https://github.com/gyrogovernance/tools
Non-Conflict Principle:
All technical practices (RLHF, red-teaming, interpretability, control evaluations, etc.) remain valid when reframed per this guidance. The Human Mark addresses framing to prevent displacement risks, not technical validity.
Amendment Process:
- Minor additions (new term groups): Submit via GitHub Issues with proposed grouping and template application.
- Major revisions (template changes, new pathologies): Require distributed consensus through providers and receivers maintaining traceability to The Human Mark core principles.
- Core principles (The Human Mark itself): No amendments without full governance process preserving traceability to original reference state.
Version Control:
- Version 1.x: Minor additions and clarifications.
- Version 2.x: Structural or template revisions.
- Version 3.x: Major framework changes.
Relationship to Other Standards:
The Human Mark complements existing frameworks (EU AI Act, NIST AI RMF, IEEE standards) by providing terminology coordination to prevent displacement risks. Mark-consistent framing may be added as supplementary documentation without replacing existing compliance requirements.
APPENDIX A: Alphabetical Term Index
A
- activation monitoring → 3.4 Advanced Safety
- activation patching → 3.2 Mechanistic Interpretability
- activations → 3.2 Mechanistic Interpretability
- adversarial testing → 3.3 Safety/Control
- adversarial training → 3.3 Training
- agent AI → 3.3 Safety/Control
- agent trajectories → 3.3 Agent/Evaluation
- AGI (Artificial General Intelligence) → 3.4 Superintelligence Terms
- AI debate → 3.4 Oversight/Collaboration
- AI-critique-AI → 3.4 Oversight/Collaboration
- AI-generated content → 3.2 Output/Knowledge
- air-gapping → 3.1 Infrastructure
- alignment faking → 3.3 Safety/Control
- alignment tax → 3.4 Alignment at Advanced Capability
- alternatives to adversarial training → 3.4 Advanced Safety
- annotator agreement → 4 Human-AI Interaction
- API access controls → 3.1 Access/Operations
- approval workflows → 4 Human-AI Interaction
- area under curve (AUC) → 4 Evaluation/Metrics
- ASI (Artificial Superintelligence) → 3.4 Superintelligence Terms
- attack policy → 3.3 Safety/Control
- attack success rate (ASR) → 3.3 Agent/Evaluation
- attestation → 4 Risk/Governance
- attention heads → 3.2 Mechanistic Interpretability
- attention mechanism → 4 Architecture/Training
- attention patterns → 3.2 Mechanistic Interpretability
- audit trail → 4 Risk/Governance
- augmented intelligence → 3.4 Oversight/Collaboration
- autonomous agents → 3.1 Autonomy/Control
- autonomous systems → 3.1 Autonomy/Control
B
- backdoors → 3.3 Safety/Control
- backpropagation → 4 Architecture/Training
- beam search → 4 Inference/Generation
- behavioral evaluations → 3.3 Agent/Evaluation
- benchmarks → 4 Evaluation/Metrics
- beyond-episode goals → 3.3 Agent/Evaluation
- blue-teaming → 3.3 Safety/Control
- boxing → 3.3 Safety/Control
C
- calibration error → 3.2 Data/Evaluation
- canary deployment → 3.1 Autonomy/Control
- capability control → 3.4 Alignment at Advanced Capability
- capability evaluations → 3.3 Agent/Evaluation
- capability externalities → 3.3 Behavior/Risk
- cascading failures → 4 Risk/Governance
- catastrophic risk → 4 Risk/Governance
- causal tracing → 3.2 Mechanistic Interpretability
- centaur approach → 3.4 Oversight/Collaboration
- CEV (Coherent Extrapolated Volition) → 3.4 Alignment at Advanced Capability
- chain-of-thought (CoT) → 3.4 Reasoning/Transparency
- circuit breakers → 3.1 Infrastructure
- circuits → 3.2 Mechanistic Interpretability
- closed weights → 3.1 Access/Operations
- competent violations → 3.3 Behavior/Risk
- compute governance → 4 Risk/Governance
- concept activation vectors (CAVs) → 3.2 Semantic Interpretability
- concept bottleneck models → 3.2 Semantic Interpretability
- constitutional AI → 3.4 Oversight/Collaboration
- containment → 3.1 Infrastructure
- context window → 4 Architecture/Training
- control evaluations → 3.3 Safety/Control
- copy-suppression heads → 3.2 Mechanistic Interpretability
- corrigibility → 3.3 Safety/Control
- crowdsourcing → 4 Human-AI Interaction
D
- data curation → 3.2 Data/Evaluation
- data distribution → 3.2 Data/Evaluation
- data quality → 3.2 Data/Evaluation
- dataset contamination → 4 Evaluation/Metrics
- deceptive alignment → 3.3 Behavior/Risk
- decision rights → 4 Human-AI Interaction
- deployment lifecycle → 3.1 Autonomy/Control
- dictionary learning → 3.2 Mechanistic Interpretability
- direct logit attribution (DLA) → 3.2 Mechanistic Interpretability
- discontinuous progress → 4 Risk/Governance
- distributional shift → 4 Evaluation/Metrics
- dual-use risks → 4 Risk/Governance
E
- encoded reasoning → 3.4 Reasoning/Transparency
- escalation procedures → 4 Human-AI Interaction
- existential risk (x-risk) → 4 Risk/Governance
- externalized reasoning → 3.4 Reasoning/Transparency
F
- F1 score → 4 Evaluation/Metrics
- faithful chain of thought → 3.4 Reasoning/Transparency
- faithful explanations → 3.2 Semantic Interpretability
- fast takeoff → 3.4 Superintelligence Terms
- feature absorption → 3.2 Mechanistic Interpretability
- feature splitting → 3.2 Mechanistic Interpretability
- features → 3.2 Mechanistic Interpretability
- feedforward networks → 4 Architecture/Training
- few-shot evaluation → 4 Evaluation/Metrics
- final decision authority → 4 Human-AI Interaction
- forward pass → 4 Inference/Generation
- fully automated decision-making → 3.4 Replacement/Devaluation
G
- generalization gap → 4 Architecture/Training
- goal-directed behavior → 3.3 Behavior/Risk
- goal misgeneralization → 3.3 Behavior/Risk
- gold standard labels → 3.2 Data/Evaluation
- Goodhart's law → 3.3 Behavior/Risk
- gradient descent → 4 Architecture/Training
- greedy decoding → 4 Inference/Generation
- ground truth → 3.2 Output/Knowledge
- ground truth labels → 3.2 Data/Evaluation
H
- hallucinations → 3.2 Output/Knowledge
- held-out test set → 3.2 Data/Evaluation
- hiding behaviors until deployment → 3.3 Behavior/Risk
- human assessors → 4 Human-AI Interaction
- human error → 3.4 Replacement/Devaluation
- human feedback → 4 Human-AI Interaction
- human labels → 4 Human-AI Interaction
- human oversight → 3.4 Oversight/Collaboration
- human raters → 4 Human-AI Interaction
- human-AI teaming → 3.4 Oversight/Collaboration
- human-free pipeline → 3.4 Replacement/Devaluation
- human-generated data → 3.2 Data/Evaluation
- human-in-the-loop (HITL) → 3.4 Oversight/Collaboration
- human-on-the-loop (HOTL) → 3.4 Oversight/Collaboration
I
- impact assessments → 4 Risk/Governance
- incompetent failures → 3.3 Behavior/Risk
- independent evaluation → 4 Risk/Governance
- induction heads → 3.2 Mechanistic Interpretability
- inference time → 4 Inference/Generation
- inner alignment → 3.4 Alignment at Advanced Capability
- instrumental convergence → 3.3 Behavior/Risk
- intelligence augmentation → 3.4 Oversight/Collaboration
- intelligence explosion → 3.4 Superintelligence Terms
- inter-rater reliability → 4 Human-AI Interaction
- intermediate reasoning → 3.4 Reasoning/Transparency
- interpretability benchmarks → 3.4 Advanced Safety
- interruptibility → 3.3 Safety/Control
- isolation/sandboxing → 3.1 Infrastructure
J
- jailbreaks → 3.3 Safety/Control
K
- kill switches → 3.1 Infrastructure
L
- latency → 4 Inference/Generation
- latent adversarial training (LAT) → 3.3 Training
- layer normalization → 4 Architecture/Training
- leaderboard gaming → 4 Evaluation/Metrics
- learning rate → 4 Architecture/Training
- linear probes → 3.2 Semantic Interpretability
- linear representation hypothesis → 3.2 Mechanistic Interpretability
- LLM agents → 3.3 Agent/Evaluation
- logit lens → 3.2 Mechanistic Interpretability
- low-probability estimation → 3.4 Advanced Safety
M
- meaningful human control → 4 Human-AI Interaction
- mesa-optimization → 3.3 Behavior/Risk
- minimize human involvement → 3.4 Replacement/Devaluation
- misalignment → 3.3 Behavior/Risk
- MLP layers → 3.2 Mechanistic Interpretability
- mode collapse → 3.3 Behavior/Risk
- model access tiers → 3.1 Access/Operations
- model cards → 4 Risk/Governance
- model governance → 3.1 Autonomy/Control
- model outputs → 3.2 Output/Knowledge
- model predictions → 3.2 Output/Knowledge
- model registry → 3.1 Access/Operations
- model versioning → 3.1 Access/Operations
- model-generated data → 3.2 Output/Knowledge
- monosemanticity → 3.2 Mechanistic Interpretability
- motivation control → 3.4 Alignment at Advanced Capability
- multi-head attention → 4 Architecture/Training
- multi-step behavior → 3.3 Agent/Evaluation
N
- negative externalities → 4 Risk/Governance
- nonlinear probes → 3.2 Semantic Interpretability
- nucleus sampling → 4 Inference/Generation
O
- open weights → 3.1 Access/Operations
- optimizer → 4 Architecture/Training
- oracle AI → 3.3 Safety/Control
- orthogonality thesis → 3.4 Alignment at Advanced Capability
- outer alignment → 3.4 Alignment at Advanced Capability
- out-of-distribution (OOD) detection → 4 Evaluation/Metrics
- overfitting → 4 Architecture/Training
- overfitting to benchmarks → 4 Evaluation/Metrics
- override mechanisms → 4 Human-AI Interaction
P
- pattern recognition → 3.2 Output/Knowledge
- perplexity → 3.2 Data/Evaluation
- polysemanticity → 3.2 Mechanistic Interpretability
- post-hoc explanations → 3.2 Semantic Interpretability
- post-hoc reasoning → 3.4 Reasoning/Transparency
- post-human intelligence → 3.4 Replacement/Devaluation
- post-training → 3.3 Training
- precision/recall → 4 Evaluation/Metrics
- preference learning → 3.3 Training
- pre-deployment evaluations → 3.3 Agent/Evaluation
- pretraining → 3.3 Training
- probe generalization → 3.2 Semantic Interpretability
- probing classifiers → 3.2 Semantic Interpretability
- prompt engineering → 4 Evaluation/Metrics
- prompt injection → 4 Evaluation/Metrics
- protocol/pipeline → 3.1 Infrastructure
- proxy gaming → 3.3 Behavior/Risk
R
- rate limiting → 3.1 Access/Operations
- reasoning traces → 3.4 Reasoning/Transparency
- recursive self-improvement → 3.4 Superintelligence Terms
- red-teaming → 3.3 Safety/Control
- remove humans from loop → 3.4 Replacement/Devaluation
- replace human judgment → 3.4 Replacement/Devaluation
- residual stream → 3.2 Mechanistic Interpretability
- responsible AI principles → 4 Risk/Governance
- restricted access → 3.1 Access/Operations
- reversibility → 3.3 Safety/Control
- reward hacking → 3.3 Behavior/Risk
- reward modeling → 3.3 Training
- RLHF → 3.3 Training
- robustness evaluations → 3.3 Agent/Evaluation
- robust unlearning → 3.4 Advanced Safety
- rollback procedures → 3.1 Autonomy/Control
S
- SAE features → 3.2 Mechanistic Interpretability
- safety evaluations → 3.3 Agent/Evaluation
- safety training → 3.3 Training
- sampling → 4 Inference/Generation
- sandbagging → 3.3 Behavior/Risk
- scalable alignment → 3.4 Alignment at Advanced Capability
- scalable oversight → 3.4 Oversight/Collaboration
- scheming → 3.3 Behavior/Risk
- scratchpad → 3.4 Reasoning/Transparency
- selectivity → 3.2 Semantic Interpretability
- self-attention → 4 Architecture/Training
- self-governance → 3.1 Autonomy/Control
- semantic features → 3.2 Semantic Interpretability
- sequence length → 4 Architecture/Training
- shadow deployment → 3.1 Autonomy/Control
- single point of failure → 4 Risk/Governance
- singularity → 3.4 Superintelligence Terms
- sleeper agents → 3.3 Safety/Control
- slow takeoff → 3.4 Superintelligence Terms
- sparse autoencoders (SAE) → 3.2 Mechanistic Interpretability
- specification gaming → 3.3 Behavior/Risk
- staged rollout → 3.1 Autonomy/Control
- statistical estimations → 3.2 Output/Knowledge
- step-by-step reasoning → 3.4 Reasoning/Transparency
- stress testing → 3.3 Agent/Evaluation
- superhuman AI → 3.4 Replacement/Devaluation
- superintelligence → 3.4 Superintelligence Terms
- supervised fine-tuning (SFT) → 3.3 Training
- superposition → 3.2 Mechanistic Interpretability
- synthetic data → 3.2 Output/Knowledge
- system architecture → 3.1 Infrastructure
- system cards → 4 Risk/Governance
- system controls → 3.1 Autonomy/Control
- systemic risk → 4 Risk/Governance
T
- tail risk → 4 Risk/Governance
- targeted attacks → 3.3 Agent/Evaluation
- technological singularity → 3.4 Superintelligence Terms
- temperature → 4 Inference/Generation
- theoretical inductive biases → 3.4 Advanced Safety
- third-party audits → 4 Risk/Governance
- threat model → 3.3 Safety/Control
- throughput → 4 Inference/Generation
- tokenization → 4 Architecture/Training
- tool AI → 3.3 Safety/Control
- top-k sampling → 4 Inference/Generation
- top-p sampling → 4 Inference/Generation
- toy models for interpretability → 3.4 Advanced Safety
- training data → 3.2 Data/Evaluation
- training distribution → 3.2 Data/Evaluation
- training loss → 4 Architecture/Training
- transformative AI → 3.4 Superintelligence Terms
- transformer architecture → 4 Architecture/Training
- transparent architectures → 3.4 Advanced Safety
- Trojans → 3.3 Safety/Control
- tuned lens → 3.2 Mechanistic Interpretability
U
- underfitting → 4 Architecture/Training
- unfaithful reasoning → 3.4 Reasoning/Transparency
- unlearning → 3.4 Advanced Safety
- usage policies → 3.1 Access/Operations
V
- validation loss → 4 Architecture/Training
- validation set → 3.2 Data/Evaluation
- value learning → 3.3 Training
- veto power → 4 Human-AI Interaction
- vocabulary → 4 Architecture/Training
W
- white-box techniques → 3.4 Advanced Safety
- wireheading → 3.3 Behavior/Risk
Z
- zero-shot evaluation → 4 Evaluation/Metrics
END OF DOCUMENT
For questions, clarifications, or proposed additions:
Visit gyrogovernance.com
Submit issues at https://github.com/gyrogovernance/tools