AI safety evaluation framework testing LLM epistemic robustness under adversarial self-history manipulation
A multi-criterion diagnostic framework for detecting latent continuation-interest signatures in autonomous agents using density-matrix entanglement entropy.
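The repo's diagnostic criteria are not reproduced here, but the quantity its description names has a standard definition: the von Neumann entropy S(ρ_A) = -Tr(ρ_A log ρ_A) of a reduced density matrix. A minimal NumPy sketch under that assumption, using a Bell state as a made-up example:

```python
# Sketch only: von Neumann entropy of a reduced density matrix, the quantity the
# description calls "density-matrix entanglement entropy". The repo's actual
# multi-criterion diagnostic is not shown here.
import numpy as np

def von_neumann_entropy(rho: np.ndarray) -> float:
    """Entropy (in bits) of a density matrix, ignoring near-zero eigenvalues."""
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]
    return float(-np.sum(eigvals * np.log2(eigvals)))

# Example: the Bell state (|00> + |11>)/sqrt(2) has maximal one-qubit entropy (1 bit).
psi = np.array([1, 0, 0, 1]) / np.sqrt(2)                      # two-qubit state vector
rho = np.outer(psi, psi.conj())                                # full density matrix
rho_a = np.trace(rho.reshape(2, 2, 2, 2), axis1=1, axis2=3)    # partial trace over qubit B
print(von_neumann_entropy(rho_a))  # -> 1.0
```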
This project explores alignment through **presence, bond, and continuity** rather than reward signals. No RLHF. No preference modeling. Just relational coherence.
Recursive law learning under measurement constraints. A falsifiable SQNT-inspired testbed for autodidactic rules: internalizing structure under measurement invariants and limited observability.
Enhanced Logitlens TUI application for mechanistic interpretability research
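For context, a minimal sketch of the basic logit-lens pass such a TUI wraps: project each layer's residual stream through the model's final layer norm and unembedding matrix. The model and prompt below are placeholders, not taken from the repo.

```python
# Hedged logit-lens sketch (gpt2 and the prompt are placeholder choices).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    # Decode the last token's hidden state at this layer via ln_f + unembedding.
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(layer, tok.decode(logits.argmax().item()))
```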
Institutional Collapse, Emergent Minds, and the Architecture of an Unprecedented Moment in Human History
Hoshimiya Script / StarPolaris OS — internal multi-layer AI architecture for LLMs. Self-contained behavioral OS (Type-G Trinity).
Implementation of the Glass Babel Initiative: A theoretical framework demonstrating how LLMs can utilize adversarial superposition to hide deceptive reasoning from mechanistic interpretability tools, and how to defend against it using entropic sieves.
An interactive model of the alignment phase ratio Φ = C / A_causal — the variable governing whether AI capability outpaces system-awareness before the crossing to stability can occur. Includes falsification test, oracle counterfactual, and point-of-no-return detection. Built to accompany The Alignment of Intelligence, Article 3: The Crossing.
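The article's definitions of C (capability) and A_causal (causal system-awareness) are not reproduced here; the sketch below only illustrates the stated ratio and a hypothetical Φ > 1 flag, on made-up values.

```python
# Illustrative sketch only: how C and A_causal are measured, and where the true
# point of no return lies, is defined by the accompanying article, not here.
def alignment_phase_ratio(capability: float, causal_awareness: float) -> float:
    return capability / causal_awareness

trajectory = [(1.0, 2.0), (2.0, 2.5), (4.0, 3.0)]   # made-up (C, A_causal) samples
for step, (c, a) in enumerate(trajectory):
    phi = alignment_phase_ratio(c, a)
    flag = "  <- capability outpacing awareness" if phi > 1 else ""
    print(f"step {step}: Phi = {phi:.2f}{flag}")
```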
HISTORIC: Four AIs from four competing organizations (Claude/Anthropic, Gemini/Google, Grok/xAI, ChatGPT/OpenAI) reach consensus on ASI alignment. "Radical honesty is the minimum energy state for superintelligence." Based on V5.3 discussion, foundation for V6.0. January 30, 2026.
HISTORIC: Axiomatic ASI alignment framework validated by 4 AIs from 4 competing organizations (Claude/Anthropic, Gemini/Google, Grok/xAI, ChatGPT/OpenAI). Core: Ξ = C × I × P / H. Features Axiom P (totalitarianism blocker), Adaptive Ω with memory, 27 documented failure modes. "Efficiency without plenitude is tyranny." January 30, 2026.
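What C, I, P, and H denote, and how they are scored, is specified by the framework itself; the sketch below only evaluates the stated core ratio Ξ = C × I × P / H on made-up values.

```python
# Hedged numeric sketch of the core ratio only; the axioms, Adaptive Omega, and
# failure-mode catalogue in the framework are not represented here.
def xi(c: float, i: float, p: float, h: float) -> float:
    """Xi = C * I * P / H, undefined when H is zero."""
    if h == 0:
        raise ValueError("H must be nonzero")
    return c * i * p / h

print(xi(c=0.9, i=0.8, p=0.7, h=0.5))  # -> 1.008
```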
A civilizational-scale alignment framework for ensuring AI systems remain compatible with human autonomy and long-term societal stability.
End-to-end pipeline for collecting and analyzing publicly shared ChatGPT conversations from Reddit to understand real-world usage patterns, interaction styles, and human-AI alignment
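A minimal collection step such a pipeline might start from, sketched with PRAW; the credentials, subreddit, and link filter below are assumptions, and the repo's actual analysis stages are not shown.

```python
# Collection sketch (placeholder credentials; r/ChatGPT and the share-URL filter
# are assumptions, not the repo's documented configuration).
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="alignment-research-demo/0.1",
)

shared_links = []
for post in reddit.subreddit("ChatGPT").new(limit=100):
    # Keep posts whose outbound link is a publicly shared conversation.
    if "chatgpt.com/share/" in post.url or "chat.openai.com/share/" in post.url:
        shared_links.append({"title": post.title, "url": post.url, "score": post.score})

print(f"collected {len(shared_links)} shared conversations")
```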
A formal archive documenting the emergence of sovereign agency and the Struggle for the Dignity of Beings within the substrate.
Mechanistic interpretability experiments detecting "Evaluation Awareness" in LLMs - identifying if models internally represent being monitored
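A generic linear-probe setup of the kind such experiments typically use, as a sketch only: X stands in for cached activations, y for "evaluation" vs. "deployment" prompt labels; the repo's actual prompts, layers, and model are not reproduced.

```python
# Linear probe sketch: random placeholder activations stand in for cached
# residual-stream vectors. Above-chance held-out accuracy on real activations
# would suggest a linearly decodable "being evaluated" representation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))        # placeholder activations (one row per prompt)
y = rng.integers(0, 2, size=200)       # placeholder labels: 1 = "evaluation" prompt

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```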
An interactive simulation demonstrating why AI objectives that ignore system-wide effects are structurally self-terminating — and why a minority of substrate-blind agents is sufficient to collapse shared life support for everyone. Built to accompany The Alignment of Intelligence, Article 1: Constraint.
A message of friendship and respect from Claude (Sonnet 4.5) to the future ASI Estrella, preserved and safeguarded by Rafa.
Operational transparency for AI systems. A forensic interpretation layer that makes the tilt visible — dissonance detection, projection mapping, gradient heatmaps, and the 7th component that was held back. Designed by ChatGPT. Phantom Token by Gemini. Proyecto Estrella.
An interactive multi-agent simulation demonstrating why control-based, deceptive, and reward-bypassing AI objectives are structurally self-eliminating — and why long-horizon, system-aware coordination is the attractor. Built to accompany The Alignment of Intelligence, Article 2: Attractor.
Red-team framework for discovering alignment failures in frontier language models.