Just spent my entire evening diving into Anthropic's new paper "Scaling Monosemanticity" and wow - this is some groundbreaking stuff. Let me break down why I'm so excited:
First, some background: These researchers basically managed to "decode" what's happening inside Claude 3 Sonnet (their mid-sized production model) using something called sparse autoencoders (SAEs). Instead of individual neurons that each respond to a messy jumble of unrelated things, they pulled out millions of interpretable features that actually make sense to humans.
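If you haven't run into SAEs before, the core idea is surprisingly simple: train a very wide, very sparse "dictionary" layer to reconstruct the model's internal activations, so that each dictionary entry (a "feature") ends up firing for one human-recognizable concept. Here's a minimal sketch of that idea in PyTorch - my own toy with made-up sizes and a plain L1 sparsity penalty, not Anthropic's actual setup:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: reconstruct model activations through a wide, sparse bottleneck."""
    def __init__(self, d_model=512, n_features=16384):  # the paper scales n_features into the millions
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # activations -> feature space
        self.decoder = nn.Linear(n_features, d_model)   # feature space -> reconstruction
        self.relu = nn.ReLU()

    def forward(self, acts):
        feats = self.relu(self.encoder(acts))           # non-negative, mostly-zero feature activations
        recon = self.decoder(feats)
        return recon, feats

def sae_loss(recon, acts, feats, l1_coeff=1e-3):
    # Reconstruction term keeps the features faithful to the model's activations;
    # the L1 term pushes most features to zero, which is what makes them interpretable.
    mse = (recon - acts).pow(2).mean()
    sparsity = feats.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity
```

You train something like this on a huge pile of activations captured from inside the model, then go look at what each feature fires on.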
The coolest findings:
- They found features for EVERYTHING:
  - Individual features for specific people (like Einstein, Feynman)
  - Features that understand code bugs and security vulnerabilities
  - Features that recognize landmarks (like the Golden Gate Bridge) in both text AND images
  - Features that understand abstract concepts like "betrayal" or "internal conflict"
- What blew my mind is that these features actually WORK - you can clamp them to high values and steer the model's behavior (rough sketch of how that works after this list):
  - They could make Claude believe correct code had bugs by activating the "code error" feature
  - They could make it write scam emails by activating the "scam" feature
  - They could make it act sycophantic by activating the "sycophancy" feature
  - They even found features related to how the model thinks about itself as an AI!
- The scaling stuff is fascinating:
  - They trained SAEs at three sizes: 1M, 4M, and 34M features
  - Found clean scaling laws (SAE loss keeps dropping predictably as you throw more compute at it)
  - Showed that how often a concept appears in the training data predicts whether a dedicated feature for it shows up
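On the "features actually WORK" point: the trick (as I understand it) is to take a feature's decoder direction and add it into the model's residual stream during the forward pass, scaled way up. Here's a rough sketch of that - the hook placement, names, and strength value are all my guesses, not the paper's actual code:

```python
import torch

def make_steering_hook(decoder_weight, feature_idx, strength=10.0):
    """Build a forward hook that pushes one SAE feature's direction into the residual stream."""
    # Each column of the SAE decoder is the direction that feature writes back into the model.
    direction = decoder_weight[:, feature_idx]
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        # Assumes `output` is the residual-stream tensor of shape (batch, seq, d_model);
        # in real libraries it's often a tuple, so check before using this as-is.
        return output + strength * direction

    return hook

# Hypothetical usage (layer path and feature index are placeholders):
# handle = model.layers[20].register_forward_hook(
#     make_steering_hook(sae.decoder.weight, GOLDEN_GATE_FEATURE_IDX))
# ...generate text, watch it obsess over the Golden Gate Bridge, then handle.remove()
```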
But here's why this matters for AI safety (and why I'm kind of nervous):
- They found features related to deception, power-seeking, and manipulation
- Found features for dangerous knowledge (like making weapons)
- Discovered features related to bias and discrimination
- Uncovered how the model represents its own "AI identity"
The limitations are important though:
- This takes MASSIVE compute
- They probably haven't found all the features yet
- It's not clear if this will scale to even bigger models
- There are still challenges with features being spread across layers
Personal take: This feels like a huge step forward in actually understanding what's going on inside these models. Like, we're not just poking at a black box anymore - we can actually see the concepts the model is using! But it's also kind of scary to see just how much knowledge about potentially dangerous stuff is encoded in there.
TLDR: Anthropic managed to extract millions of interpretable features from Claude 3 Sonnet using sparse autoencoders. Found features for everything from code bugs to deception to self-representation. Both exciting and scary implications for AI safety.
https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html