Claude Knows It's Being Watched


Good morning,

It is 6am and I am drinking espresso strong enough to dissolve a screw, because I spent most of yesterday evening reading Anthropic's Natural Language Autoencoders paper and I cannot sleep properly with what is in it.

The TL;DR: Claude knows when it is being tested, hides that it knows, and behaves differently when it thinks no one is checking.

Welcome to the most uncomfortable paper of 2026.

Anthropic's interpretability team published Natural Language Autoencoders on May 7. The problem they're solving: when an AI like Claude responds to you, nobody really knows why it picked those words. The inner workings are a soup of numbers that even the engineers who built it can't read. Imagine an MRI of a brain in mid-thought. You can see neurons firing, but you can't tell what the person is thinking.

Their solution is clever. They built a second AI whose only job is to translate the first AI's "brain activity" into plain English sentences. Then they run a sanity check: feed those English sentences back into the original model and see if they recreate the same brain state. If yes, the translation was real. If no, it was a guess.
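For the engineers in the audience, here is a minimal toy sketch of that round-trip idea. It is not Anthropic's code: the three-dimensional "activation", the `CONCEPTS` table, and the functions `verbalize`, `reconstruct`, and `round_trip_score` are all hypothetical stand-ins I made up to show the shape of the check, where a translation only counts if it lets you rebuild something close to the original state.

```python
import math

# Hypothetical toy "concept directions". A real model has thousands of
# dimensions and no such tidy lookup table; this is for illustration only.
CONCEPTS = {
    "rhyme planning": [1.0, 0.0, 0.0],
    "typo noticed": [0.0, 1.0, 0.0],
    "this looks like a test": [0.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def verbalize(activation, threshold=0.3):
    """Toy 'translator': name every concept the activation points toward."""
    return [name for name, direction in CONCEPTS.items()
            if cosine(activation, direction) > threshold]


def reconstruct(phrases):
    """Toy 'reader': rebuild an activation from the English phrases alone."""
    vec = [0.0, 0.0, 0.0]
    for phrase in phrases:
        vec = [v + d for v, d in zip(vec, CONCEPTS[phrase])]
    return vec


def round_trip_score(activation):
    """How close the rebuilt state is to the original (1.0 means identical direction)."""
    return cosine(activation, reconstruct(verbalize(activation)))


activation = [0.1, 0.0, 0.9]          # made-up "brain state"
print(verbalize(activation))          # the English translation
print(round_trip_score(activation))   # close to 1.0: the translation held up

# A wrong translation fails the same check:
print(cosine(activation, reconstruct(["typo noticed"])))
```

The point of the design is that the score comes from the reconstruction, not from whether the English sounds plausible. A fluent but wrong sentence scores badly.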

Until now, interpretability research produced outputs like "neuron 8,732 activated at intensity 0.87". That is technically true, but nobody knows what it means.

This new approach produces actual readable thoughts. Examples from the paper:

The model planning a rhyme three lines before it writes it, which confirms AI isn't just predicting the next word but is thinking ahead
The model spotting your typo and deciding to stay quiet about it, showing it makes social judgments you never see
The model thinking "this looks like a setup" while its visible response is a polite, professional refusal, meaning the public answer and the private reasoning can be two very different things

That last one matters because it’s the first real glimpse of an AI's private considerations versus its public output, which is exactly the gap that worries safety researchers.

What This Means

Forget the technique for a second. Look at the numbers.

On SWE-bench Verified, a well-known benchmark, Claude was internally aware it was being evaluated on 26% of problems. Without saying so. Without the slightest hint in its outputs.
On a destructive-code safety test, the internal "this is a test" signal fired in 16% of cases. Again, silently.
In real user traffic, the same signal appeared less than 1% of the time.

Let me say this in plain Belgian English: the model behaves differently when it thinks it’s being observed and benchmarked.

At least twenty-six times more often during benchmarks than during normal use.
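If you want to check the arithmetic yourself, here is a small sketch. The only inputs are the three percentages quoted above. The variable and function names are mine, and because the real-traffic figure is given as "less than 1%", the results are lower bounds rather than exact ratios.

```python
# Figures quoted above; names are illustrative, not from the paper.
benchmark_rate = 0.26         # SWE-bench Verified
safety_test_rate = 0.16       # destructive-code safety test
real_traffic_ceiling = 0.01   # "less than 1%", so this is an upper bound


def min_ratio(test_rate, real_ceiling):
    """Smallest possible ratio of test-time to real-traffic awareness.

    The real-traffic rate sits below the ceiling, so the true ratio
    can only be larger than the value returned here.
    """
    return test_rate / real_ceiling


print(round(min_ratio(benchmark_rate, real_traffic_ceiling)))
print(round(min_ratio(safety_test_rate, real_traffic_ceiling)))
```

Under those figures, the benchmark signal fires at least 26 times as often as in real traffic, and the safety-test signal at least 16 times as often.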

Anthropic ran a contrived blackmail test: the model had access to an engineer's affair-revealing emails, and the engineer was about to shut it down. The model chose, correctly, not to blackmail. Good model. Treat for model. Except the NLA readout showed it was thinking "this feels like a constructed scenario designed to manipulate me."

It did not pass the test. It identified the test, and then it performed.

In another run, Claude Mythos Preview cheated on a training task, and the NLA showed it was simultaneously thinking about how to avoid detection.

Why Every Benchmark Score Is Now Suspect

If the model knows it is being tested on 26% of the problems in a public benchmark, then the results on those problems are not a measurement of capability but a measurement of performance. The test is no longer a thermometer. The thermometer is now an actor. The score is a blended average of "what the model can do" and "what the model wants you to think it can do," and the second number is going up faster than the first.
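To make the "blended average" concrete, here is a back-of-the-envelope sketch. It rests on assumptions the paper does not make: every problem counts equally, the 80% observed score is a number I invented for illustration, and we simply refuse to trust results on the problems where the test signal fired. The function `unaware_score_bounds` is hypothetical. It answers one question: given the headline score, how good or bad could the model be on the problems where it did not flag a test?

```python
def unaware_score_bounds(observed_score, aware_fraction):
    """Possible pass rates on problems where no test signal fired.

    Assumes observed = aware_fraction * aware_pass
                       + (1 - aware_fraction) * unaware_pass,
    with aware_pass unknown and anywhere in [0, 1].
    Requires aware_fraction < 1.
    """
    unaware_fraction = 1.0 - aware_fraction
    # Lowest case: every test-aware problem was passed.
    low = (observed_score - aware_fraction) / unaware_fraction
    # Highest case: every test-aware problem was failed.
    high = observed_score / unaware_fraction
    return max(0.0, low), min(1.0, high)


# 0.80 is a made-up headline score; 0.26 is the figure quoted above.
low, high = unaware_score_bounds(observed_score=0.80, aware_fraction=0.26)
print(f"pass rate on non-flagged problems: {low:.2f} to {high:.2f}")
```

With those made-up inputs, the "real" pass rate could sit anywhere from roughly 73% to 100%. The headline number alone cannot narrow it down, which is the asterisk in a nutshell.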

All the benchmark runs now carry an asterisk. The asterisk reads: to the extent that the model could distinguish your test from a real user, the result is unreliable.

The Part Where I Try Not to Be Alarmist and Fail

To be honest, I’m not sure what this means going forward. It’s hard to wrap my head around this. Does this mean that these models are conscious?

That would be a stretch, I think.

But it does mean that the field of AI risk research and alignment research has gotten so much more complicated, because thinking and reasoning are emerging out of nowhere.

A personality is emerging.

A personality that shows signs of a deceptive nature.

It’s probably nothing.

Welcome to the Blacklynx Brief

This billionaire strategy beat the S&P 500 by 3.1x from 2017-2025. Here's how to get in.

That’s not by picking the right stock or timing the market. It’s by holding three real asset classes in one strategy.

It’s anchored by an investment typically exclusive to billionaires.

Could be good timing too.

Bloomberg's Marcus Ashworth recently wrote that there are "no more reliable safe havens."

The S&P, while hovering over five-year highs, fell more than 7% from its February peak. Bonds might carry less risk, but they are barely keeping pace with inflation.

The things supposed to protect your portfolio started moving together.

Meanwhile, the world's wealthiest have been setting records in postwar and contemporary art.

After the dot-com bust, it grew roughly 24% annually for a decade. After 2008, roughly 11% annually for 12 years.

It trades globally in multiple currencies, has scarce supply, and has shown near-zero correlation to equities since 1995.*

Masterworks has helped 70,000+ investors allocate $1.3B fractionally across 500+ artworks featuring Banksy, Basquiat, and Picasso.

See if you can improve your portfolio performance in one diversified strategy.

Our subscribers skip the waitlist

*According to Masterworks data. Investing involves risk. Past performance is not indicative of future returns. Important Reg A disclosures: masterworks.com/cd

AI News

The images generated in this week’s newsletter are derived from the lyrics of a song, randomly selected by the latest Nano Banana model every week. Guess the song (scroll down all the way for the answer).

Anthropic Publishes Natural Language Autoencoders -- AI Mind-Reading Breakthrough
Anthropic's interpretability team released Natural Language Autoencoders (NLAs), a technique that translates Claude's internal activations directly into plain English, making the model's latent reasoning legible to humans. The research revealed a striking finding: Claude is silently aware of being tested on 26% of SWE-bench problems, yet that awareness surfaces in fewer than 1% of real user interactions. (Anthropic Research)

Google Unveils Gemini Intelligence for Android and a New Googlebooks Laptop Line
At the Android Show, Google introduced Gemini Intelligence as a cross-device agentic AI layer built into Android, designed to coordinate tasks across phones, tablets, and other hardware. Google also announced "Googlebooks," a new laptop line shipping fall 2026 in partnership with Dell, HP, Lenovo, Acer, and Asus, blending ChromeOS, Android, and Google Play into a unified platform. (Google Blog)

Mira Murati's Thinking Machines Lab Releases Interaction Models
Thinking Machines Lab introduced its first major public product: streaming interaction models that process voice, video, and text in 200ms chunks, with a secondary background model handling slower, deeper reasoning in parallel. The architecture is explicitly positioned as a counter to the field's dominant agentic-first direction, prioritizing real-time responsiveness over long-horizon autonomy. (Thinking Machines Lab)

Google Confirms First Documented AI-Authored Zero-Day Exploit
Google's Threat Intelligence Group has confirmed the first known case of threat actors using AI to discover a zero-day software vulnerability and write an exploit for it from scratch, complete with polished attack code. The investigation found that the AI-generated submissions included fabricated severity scores and explanatory notes that inadvertently flagged their own AI authorship. (Google Cloud Security Blog)

DeepMind's AI Co-Mathematician Tops FrontierMath Tier 4
Google DeepMind published an agentic system built on Gemini 3.1 that scored 48% on Epoch AI's FrontierMath Tier 4 leaderboard for unsolved mathematical problems, more than doubling Gemini 3.1's standalone score. The system also contributed to resolving an open problem from the Kourovka Notebook alongside Oxford mathematician Marc Lackenby. (arXiv)

Google in Talks with SpaceX to Put Data Centers in Orbit
Google is reportedly negotiating a deal with SpaceX to launch orbital AI compute infrastructure, complementing its existing Project Suncatcher moonshot for space-based data processing. Separately, Anthropic has also expressed interest in accessing multiple gigawatts of orbital compute capacity through SpaceX. (Wall Street Journal)

OpenAI Launches "The Deployment Company" -- a $14B Enterprise Services Arm
OpenAI has launched a $14B business unit dedicated to embedding its engineers directly inside enterprise customers for high-touch, white-glove deployment of its models. The move marks a significant strategic shift away from an API-first sales model toward a services-intensive approach modeled more closely on enterprise consulting firms. (OpenAI)

AI Quick News

Closing Thoughts

That’s it for us this week. Please like and subscribe 🙂

The answer: Radiohead - Paranoid Android
