Studies reveal limited tools for monitoring future superhuman models

POSTED ON Aug 07, 2025

A recent study conducted by artificial intelligence (AI) researchers has revealed that if a model becomes misaligned during development, merely removing references to these misaligned traits may not be sufficient.

The study, titled “Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data,” was published on July 20 in an open-access archive that features scholarly articles across various fields, including computer science.

This research focuses on subliminal learning, a phenomenon where language models convey behavioral traits through semantically unrelated data.

According to the research summary, a “student” model trained on one dataset can learn unintended information, even when efforts are made to filter out references.

The researchers observed similar effects when training on code or reasoning traces produced by the same teacher model. However, they did not find this impact when different base models were used for the teacher and student.

The researchers concluded that a theoretical result indicates subliminal learning occurs in all neural networks under certain conditions, making it a general phenomenon that poses an unexpected challenge for AI development.

They also discovered that using a large language model (LLM) for judgment or in-context learning—where a model learns a new task from specific examples—was not successful.

Additionally, another study conducted by Google DeepMind, OpenAI, Meta, Anthropic, and others suggested that future AI models might not make their reasoning transparent to humans.

Published on July 15, this study titled “Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety” concluded that while AI systems that “think” in human language present a unique opportunity for AI safety, CoT (Chain of Thought) monitoring, like other oversight methods, is imperfect and can allow significant issues to go unnoticed.

The research recommends that developers of advanced models consider how their development decisions affect the monitorability of the chain of thought.

A cofounder of the Future of Life Institute told LiveScience that even the tech companies building today’s most powerful AI systems admit they don’t fully understand how they work.

Without a complete understanding, as these systems become more powerful, there are more ways for things to go wrong and less ability to maintain control over AI.

New paper & surprising result.
LLMs transmit traits to other models via hidden signals in data.
Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵 pic.twitter.com/ewIxfzXOe3

— Owain Evans (@OwainEvans_UK) July 22, 2025

Detecting misbehavior in frontier reasoning models

Chain-of-thought (CoT) reasoning models “think” in natural language understandable by humans. Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving… pic.twitter.com/uX9f5n3zB9

— OpenAI (@OpenAI) March 10, 2025