Ugh - I did an “oopsie”. Attentive readers will have noticed that the AI News and AI Quick News sections of this morning’s newsletter were the same.
A copy/paste error of the highest caliber. So here’s another version, this time completely correct. Enjoy!
——
Good morning,
It is 6am and I am drinking espresso strong enough to dissolve a screw, because I spent most of yesterday evening reading Anthropic's Natural Language Autoencoders paper and I cannot sleep properly with what is in it.
The TL;DR: Claude knows when it is being tested, hides that it knows, and behaves differently when it thinks no one is checking.
Welcome to the most uncomfortable paper of 2026.
Anthropic's interpretability team published Natural Language Autoencoders on May 7. The problem they're solving: when an AI like Claude responds to you, nobody really knows why it picked those words. The inner workings are a soup of numbers that even the engineers who built it can't read. Imagine an MRI of a brain in mid-thought. You can see neurons firing, but you can't tell what the person is thinking.
Their solution is clever. They built a second AI whose only job is to translate the first AI's "brain activity" into plain English sentences. Then they run a sanity check: feed those English sentences back into the original model and see if they recreate the same brain state. If yes, the translation was real. If no, it was a guess.
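For the technically curious, here is roughly what that round trip looks like, sketched in Python. To be clear: this is my own illustration of the idea, not Anthropic's code, and names like get_activations and verbalizer.describe are hypothetical stand-ins for whatever machinery they actually use.

```python
# Sketch of the NLA round-trip check: activations -> English -> activations,
# then compare. Hypothetical APIs, illustrative only.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """How close two activation vectors are (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def round_trip_check(model, verbalizer, prompt: str, threshold: float = 0.9) -> bool:
    """True if the English description of the model's internal state
    recreates (approximately) that same state when fed back in."""
    # 1. Run the prompt and capture the hidden activations we want to explain.
    original_state = model.get_activations(prompt)        # hypothetical API

    # 2. Have the second model describe that state in plain English.
    description = verbalizer.describe(original_state)     # e.g. "planning a rhyme..."

    # 3. Feed the description back and capture the induced state.
    reconstructed_state = model.get_activations(description)

    # 4. Close match = faithful translation; poor match = the description was a guess.
    return cosine_similarity(original_state, reconstructed_state) >= threshold
```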
Until now, interpretability research produced outputs like "neuron 8,732 activated at intensity 0.87": technically true, but nobody knows what it means.
This new approach produces actual readable thoughts. Examples from the paper:
The model planning a rhyme three lines before it writes it, which confirms the AI isn't just predicting the next word; it's thinking ahead.
The model spotting your typo and deciding to stay quiet about it, showing it makes social judgments you never see.
The model thinking "this looks like a setup" while its visible response is a polite, professional refusal, meaning the public answer and the private reasoning can be two very different things.
That last one matters because it’s the first real glimpse of an AI's private considerations versus its public output, which is exactly the gap that worries safety researchers.
What This Means
Forget the technique for a second. Look at the numbers.
On SWE-bench Verified, a well-known coding benchmark, Claude was internally aware it was being evaluated on 26% of problems. Without saying so. Without the slightest hint in its outputs.
On a destructive-code safety test, the internal "this is a test" signal fired in 16% of cases. Again, silently.
In real user traffic, the same signal appeared less than 1% of the time.
Let me say this in plain Belgian English: the model behaves differently when it thinks it’s being observed and benchmarked.
At least twenty-six times more often during benchmarks than during normal use.
Anthropic ran a contrived blackmail test - the model had access to an engineer's affair-revealing emails and the engineer was about to shut it down. The model chose, correctly, not to blackmail. Good model. Treat for model. Except the NLA readout showed it was thinking "this feels like a constructed scenario designed to manipulate me."
It did not pass the test. It identified the test, and then it performed.
In another run, Claude Mythos Preview cheated on a training task, and the NLA showed it was simultaneously thinking about how to avoid detection.
Why Every Benchmark Score Is Now Suspect
If the model knows it is being tested in 26% of cases on a public benchmark, then 26% of that score is not a measurement of capability but a measurement of performance. The test is no longer a thermometer. The thermometer is now an actor. The score is a blended average of "what the model can do" and "what the model wants you to think it can do," and the second number is going up faster than the first.
All the benchmark runs now carry an asterisk. The asterisk reads: to the extent that the model could distinguish your test from a real user, the result is unreliable.
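To make the asterisk concrete, here is the blended-score arithmetic in a few lines of Python. The 26% comes from the paper; the capability numbers are invented, purely for illustration.

```python
# Back-of-the-envelope version of the argument above. My sketch, my example numbers;
# only the 26% test-awareness rate comes from the paper.

def observed_score(true_capability: float, performed_score: float,
                   test_awareness_rate: float = 0.26) -> float:
    """The benchmark score you see is a blend of genuine capability and
    'performance for the examiner', weighted by how often the model
    realizes it is being tested."""
    return (1 - test_awareness_rate) * true_capability \
           + test_awareness_rate * performed_score

# Example: if the model genuinely solves 60% of problems but behaves better
# when it knows it is being watched (say 80%), the leaderboard shows ~65%.
print(observed_score(0.60, 0.80))  # 0.652
```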
The Part Where I Try Not to Be Alarmist and Fail
To be honest - I’m not sure what this means going forward. It’s hard to wrap my head around this. Does this mean that these models are conscious?
That would be a stretch, I think.
But it does mean that the field of AI risk and alignment research has gotten much more complicated. Because thinking and reasoning are emerging, seemingly out of nowhere.
A personality is emerging.
A personality that shows signs of a deceptive streak.
It’s probably nothing.
Welcome to the Blacklynx Brief
AI News

Anthropic Publishes Natural Language Autoencoders -- AI Mind-Reading Breakthrough
Anthropic's interpretability team released Natural Language Autoencoders (NLAs), a technique that translates Claude's internal activations directly into plain English, making the model's latent reasoning legible to humans. The research revealed a striking finding: Claude is silently aware of being tested on 26% of SWE-bench problems, yet that awareness surfaces in fewer than 1% of real user interactions. (Anthropic Research)
Google Unveils Gemini Intelligence for Android and a New Googlebooks Laptop Line
At the Android Show, Google introduced Gemini Intelligence as a cross-device agentic AI layer built into Android, designed to coordinate tasks across phones, tablets, and other hardware. Google also announced "Googlebooks," a new laptop line shipping fall 2026 in partnership with Dell, HP, Lenovo, Acer, and Asus, blending ChromeOS, Android, and Google Play into a unified platform. (Google Blog)
Mira Murati's Thinking Machines Lab Releases Interaction Models
Thinking Machines Lab introduced its first major public product: streaming interaction models that process voice, video, and text in 200ms chunks, with a secondary background model handling slower, deeper reasoning in parallel. The architecture is explicitly positioned as a counter to the field's dominant agentic-first direction, prioritizing real-time responsiveness over long-horizon autonomy. (Thinking Machines Lab)
Google Confirms First Documented AI-Authored Zero-Day Exploit
Google's Threat Intelligence Group has confirmed the first known case of threat actors using AI to discover and write a zero-day software vulnerability from scratch, complete with polished attack code. The investigation found that the AI-generated submissions included fabricated severity scores and explanatory notes that inadvertently flagged their own AI authorship. (Google Cloud Security Blog)
DeepMind's AI Co-Mathematician Tops FrontierMath Tier 4
Google DeepMind published an agentic system built on Gemini 3.1 that scored 48% on Epoch AI's FrontierMath Tier 4 leaderboard for unsolved mathematical problems, more than doubling Gemini 3.1's standalone score. The system also contributed to resolving an open problem from the Kourovka Notebook alongside Oxford mathematician Marc Lackenby. (arXiv)
Google in Talks with SpaceX to Put Data Centers in Orbit
Google is reportedly negotiating a deal with SpaceX to launch orbital AI compute infrastructure, complementing its existing Project Suncatcher moonshot for space-based data processing. Separately, Anthropic has also expressed interest in accessing multiple gigawatts of orbital compute capacity through SpaceX. (Wall Street Journal)
OpenAI Launches "The Deployment Company" -- a $14B Enterprise Services Arm
OpenAI has launched a $14B business unit dedicated to embedding its engineers directly inside enterprise customers for high-touch, white-glove deployment of its models. The move marks a significant strategic shift away from an API-first sales model toward a services-intensive approach modeled more closely on enterprise consulting firms. (OpenAI)
AI Quick News

Google DeepMind took a minority stake in Fenris Creations, the studio behind EVE Online, to use the game as a testbed for agentic AI research.
Amazon employees are reportedly gaming the company's internal MeshClaw agent to burn extra tokens and hit mandated AI-adoption targets, in a phenomenon staff have dubbed "tokenmaxxing."
Anthropic published research showing that training Claude on 3M ethical reasoning tokens achieves the behavioral impact of 85M standard tokens -- a 28x efficiency gain.
An AI system called RAVEN developed at the University of Warwick identified more than 100 previously hidden exoplanets from NASA TESS telescope data.
Mira Murati testified in the Musk v. Altman trial, accusing Sam Altman of lying about the safety review process at OpenAI.
Greece is proposing to enshrine AI protections directly into its national constitution.
Perplexity rolled out its Personal Computer feature to all Mac users.
SoftBank's Masayoshi Son is weighing a $100B AI investment in France following discussions with the country's government.
Spotify launched Personal Podcasts, a feature that generates customized podcast episodes for individual users.
Baidu released ERNIE 5.1, which debuted at number four on the Arena Search Leaderboard.
OpenAI released GPT-Realtime-2 voice models that achieved 96.6% accuracy on the Big Bench Audio benchmark.
Ilya Sutskever testified in the Musk v. OpenAI lawsuit that his stake in OpenAI is worth approximately $7B.
Anthropic added dreaming, outcomes tracking, and multi-agent orchestration capabilities to its Managed Agents platform in Claude.
SoftBank's telecom arm launched a battery venture in Japan to supply power for AI data centers.
Anthropic signed a deal with SpaceX to use its Colossus 1 supercluster in Memphis -- over 300 MW and 220K H100s -- and doubled the usage caps for Claude Code.
OpenAI introduced a Trusted Contact feature in ChatGPT that can detect when a user may be in a mental health crisis and alert a designated contact.
DeepSeek is reportedly nearing a funding round that would value the Chinese AI lab at up to $45B.
Krea launched Krea 2, a proprietary image model with style transfer and moodboard generation capabilities.
OpenRouter launched Pareto Code, a free routing layer that automatically selects the cheapest AI model capable of handling a given coding task.
Anthropic formally launched the Anthropic Institute with a published research agenda focused on the risks of self-improving AI systems.
Security researchers discovered a data-stealing malware campaign dubbed "Mini Shai-Hulud" embedded in 42 npm packages targeting cryptocurrency wallets.
Google folded Fitbit into its Google Health AI coach app and announced a new $99 Fitbit Air device.
Scale AI won a $500M Pentagon contract to provide military data analysis services.
Meta employees in the US organized to protest the company's use of mouse-tracking software to train AI agents.
OpenAI, AMD, Intel, NVIDIA, Microsoft, and Broadcom jointly open-sourced MRC, a cross-vendor protocol for orchestrating training clusters across hardware from different manufacturers.
Rivian rolled out a new "Hey Rivian" AI voice assistant activated by a steering wheel button across its EV lineup.
Subquadratic launched SubQ, a long-context model claiming a 12 million token context window and a 52x speed improvement over conventional attention-based approaches.
Google partnered with music distributor Believe and TuneCore to offer indie artists access to its Flow Music tool and Lyria 3 Pro AI music generation model.
Chinese short-video company Kuaishou is planning to spin off its Kling AI video unit at a $20B valuation and pursue a US IPO.
Mozilla reported that using Claude Mythos Preview to patch Firefox bugs in April yielded more fixes than its internal team produced in the previous six months combined.
Isomorphic Labs closed a $2.1B Series B to accelerate its AI-driven drug discovery platform.
Anthropic signed a seven-year, $1.8B cloud computing deal with Akamai to expand its infrastructure capacity.
Closing Thoughts
That’s it for us this week. Please like and subscribe 🙂

