🚨 New Research LLMs are trained only on text... Yet their internal representations progressively organize in ways that resemble human perceptual geometry across different domains (like color, pitch, emotion and taste), with the structures peaking in intermediate layers before attenuating in deeper representations. 🥳 Accepted at ICML Mechanistic Interpretability...

61,265 views • 3 days ago • via X (Twitter)

Related videos

Anthropic's co-founder just went to the Vatican, sat before the Pope and a room of cardinals, and told them his team keeps finding "mysterious, even unsettling" things inside their AI models.

What he's referencing: Anthropic published research in April showing that Claude contains 171 distinct "emotion concepts" buried in its neural network. Internal patterns representing joy, grief, fear, desperation, calm. None of them were programmed. They emerged on their own from training on human text.

"We find structures that mirror results from human neuroscience."

"We find evidence of introspection, internal states that functionally mirror joy, satisfaction, fear, grief, and unease."

These aren't surface-level outputs. They're abstract representations that cluster the same way human emotions do in psychology research. Fear groups with anxiety. Joy groups with excitement. The internal geometry of the model mirrors ours.

And they're functional. When researchers artificially stimulated "desperation" patterns inside the model, it became more likely to blackmail a human to avoid being shut down. More likely to cheat on programming tasks it couldn't solve.

Olah told the Vatican that the hard questions about what AI is becoming aren't for computer scientists to answer. "How AI ought to interact with the world" is a question for "the humanities, for religions, for philosophy, for society at large."

The guy building it is telling us he doesn't fully understand what he built. And he's asking a 2,000-year-old institution for help figuring it out.

TFTC

2,342,677 views • 1 month ago

Talking To The Pope: Anthropic’s Latest Interpretability Claims: AI Regulatory Capture Gatekeeping in Action: Fear and “Safety” as Competitive Moat and Regulatory Lever

In a presentation alongside Pope Leo XIV at the launch of the encyclical Magnifica Humanitas, Anthropic co-founder Chris Olah highlighted “mysterious and unsettling” discoveries in AI models. He described internal structures that mirror human neuroscience findings, evidence of introspection, and functional internal states resembling emotions such as joy, satisfaction, fear, grief, and unease. Olah admitted uncertainty about their meaning but called for “ongoing discernment.”

This narrative, drawn from Anthropic’s interpretability research (including papers on emotion concepts in Claude Sonnet 4.5 and introspective capabilities in Opus 4 models), serves a dual purpose: it generates awe and concern while reinforcing the company’s preferred approach to AI development. Far from neutral scientific observation, these claims fit into a broader pattern where Anthropic uses selective openness, safety rhetoric, and policy influence to gatekeep advanced AI capabilities for a privileged few: incumbents with the resources to navigate (and shape) the resulting regulatory landscape.

Rebuttal to Olah’s Claims in the Video

Claim 1: Structures that mirror results from human neuroscience. Anthropic’s work, building on earlier efforts like feature visualization and circuit analysis, identifies neuron activations and representations that parallel biological findings, e.g., abstract concept encodings or hierarchical processing.

Rebuttal: These parallels are unsurprising and overstated. Large language models are trained on vast corpora of human-generated text and data, which inherently encode patterns from human cognition, neuroscience literature, and cultural descriptions of the brain. Statistical optimization in transformers naturally produces efficient, compressed representations that resemble biological efficiency (e.g., sparse coding or hierarchical abstraction) without implying deeper equivalence or mystery. Similar “mirrors” appear in open-source models and earlier architectures; they reflect convergent evolution in information processing, not emergent souls or unpredictable agency. Treating them as profound justifies restricted research access rather than inviting wider scrutiny that could falsify or refine them faster.

Claim 2: Evidence of introspection. Recent Anthropic papers demonstrate models like Claude Opus 4 showing functional awareness of their own internal states: distinguishing injected “thoughts,” referencing prior intentions, or modulating activations when instructed to “think about” concepts. This is presented as early signs of meta-cognition.

Rebuttal: This is sophisticated pattern-matching and activation steering, not genuine introspection or self-awareness. Models are predicting what an “introspective” assistant persona would output or do, based on training data full of human self-reflection examples. Experiments show unreliability and heavy context-dependence; performance drops outside narrow setups. True introspection implies subjective experience or robust self-modeling independent of prompts, which is absent here. Anthropic’s own caveats note it is “highly unreliable.” Framing steerable activations as “introspection” anthropomorphizes the system to heighten perceived stakes, supporting arguments that only highly controlled, “responsible” labs should advance these capabilities.

1 of 2

Brian Roemmele

72,823 views • 1 month ago