Topics

  • LLM Diversity & The Artificial Hivemind (Best Paper, D&B): Explore how LLMs produce homogeneous outputs and the societal implications of model monoculture.
  • Gated Attention Mechanisms (Best Paper): Learn about attention sink problems and how gating fixes long-context LLM performance.
  • 1000-Layer Deep RL Networks (Best Paper): Discover how extreme depth enables self-supervised RL agents to reach goals without rewards.
  • Diffusion Training Dynamics (Best Paper): Understand why diffusion models generalize instead of memorizing their training data.
  • RLVR & LLM Reasoning (Runner-Up): Examine whether reinforcement learning truly expands LLM reasoning capabilities.
  • Transductive Online Learning (Runner-Up): A 30-year-old open problem solved: the value of unlabeled data in online learning.
  • Superposition & Neural Scaling (Runner-Up): Learn how feature superposition drives the famous neural scaling laws.

Quick Start

New to ML?

Start with the Diffusion Process simulator for an intuitive visual introduction, then explore the Attention Visualizer to understand how transformers work.

Experienced Practitioner?

Dive into Superposition & Scaling for cutting-edge theory, or explore the RLVR & Reasoning analysis for insights on LLM training.

The Problem: LLM Monoculture

When millions of people ask similar questions to the same LLM, they receive remarkably similar answers. This phenomenon, termed the "Artificial Hivemind", raises fundamental concerns about the diversity of information, perspectives, and creative outputs in an AI-augmented society.

Key Insight: Even when prompted with open-ended questions that should yield diverse responses, LLMs exhibit strong intra-model repetition (same model gives similar answers) and inter-model homogeneity (different models converge on similar outputs).

Understanding Diversity Metrics

The paper introduces several metrics to quantify output diversity:

1. Lexical Diversity

Measures variety at the word/token level using metrics like:

  • Type-Token Ratio (TTR): Unique words / Total words
  • N-gram Diversity: Unique n-grams across outputs
  • Self-BLEU: How similar outputs are to each other (lower = more diverse)

2. Semantic Diversity

Captures meaning-level variation:

  • Embedding Distance: Cosine distance between response embeddings
  • Topic Distribution: Spread across identified themes
  • Stance Diversity: Range of positions on subjective questions

As an example, Self-BLEU averages each response's BLEU score against the rest (lower values mean more diverse outputs):

Self-BLEU = (1/N) * sum(BLEU(response_i, {all other responses}))
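To make the lexical metrics concrete, here is a minimal sketch (illustrative, not from the paper) of TTR and n-gram diversity; Self-BLEU would additionally require a BLEU implementation (e.g., nltk's) and is omitted.

```python
# Minimal sketches of two lexical diversity metrics
def type_token_ratio(tokens):
    # Unique words / total words
    return len(set(tokens)) / len(tokens)

def ngram_diversity(tokens, n=2):
    # Fraction of n-grams in the text that are unique
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

tokens = "the cat sat on the mat".split()
print(type_token_ratio(tokens))    # 5 unique tokens out of 6
print(ngram_diversity(tokens, 2))  # all 5 bigrams are distinct
```

Applying these across many sampled responses, rather than within one response, is what reveals intra-model repetition.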

The Infinity-Chat Benchmark

The paper introduces Infinity-Chat, a benchmark of 26,000+ open-ended queries with dense human annotations specifically designed to evaluate response diversity.

Query Categories

  • Creative writing prompts
  • Opinion & preference questions
  • Open-ended problem solving
  • Cultural & subjective topics

Annotation Types

  • Response quality ratings
  • Diversity judgments
  • Originality scores
  • Human baseline comparisons

Key Findings

Intra-Model Homogeneity

When sampling multiple responses from the same model with the same prompt, outputs cluster tightly in semantic space. Even with high temperature settings, the fundamental "answer shape" remains remarkably consistent.

Inter-Model Convergence

Different LLM families (GPT, Claude, Llama, etc.) trained on overlapping internet data produce responses that are more similar to each other than human responses are to each other. This suggests a concerning convergence toward a shared "AI perspective."

Implications

  • Echo chambers: AI-assisted content creation may reduce information diversity
  • Cultural flattening: Minority viewpoints underrepresented in training data get further marginalized
  • Creative stagnation: Automated content converges toward "average" outputs
  • Value alignment: Whose values are reflected in homogeneous outputs?

Technical Deep Dive

Why Does Homogeneity Occur?

Several factors contribute to the hivemind effect:

  1. Training Data Overlap: Major LLMs are trained on largely overlapping web corpora, leading to similar statistical patterns.
  2. RLHF Homogenization: Reinforcement Learning from Human Feedback pushes models toward "safe," mainstream responses that satisfy average annotator preferences.
  3. Mode Collapse in Generation: Autoregressive sampling tends to find high-probability paths, which are often the same paths across similar models.
  4. Benchmark Optimization: Models optimized for the same benchmarks develop similar capabilities and failure modes.

Try It: Use the Diversity Analyzer simulator to measure homogeneity in text samples and visualize clustering patterns.

Related Topics

RLVR & LLM Reasoning

How RL training affects model behavior and output diversity.

Neural Scaling Laws

Why larger models might not solve the diversity problem.

Self-Attention Refresher

The transformer architecture relies on self-attention to allow each token to attend to all other tokens in a sequence. The standard formulation:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

Where:

  • Q (Query): What information am I looking for?
  • K (Key): What information do I contain?
  • V (Value): What information do I provide if matched?
  • d_k: Dimension of keys (for scaling)
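As a sanity check on the formula above, a minimal numpy sketch (illustrative, not any particular library's implementation):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # query-key similarities, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row is a distribution over keys
    return weights @ V, weights
```

Each row of `weights` sums to exactly 1, which is the softmax constraint behind the attention sink phenomenon discussed next.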

The Attention Sink Problem

A critical issue emerges in long-context scenarios: attention sink. This phenomenon causes attention to concentrate heavily on early tokens (often the BOS token or initial punctuation), regardless of their semantic relevance.

Why Does This Happen?

The softmax function must produce a probability distribution that sums to 1. When no tokens are particularly relevant, the model "dumps" excess attention mass onto early tokens as a default behavior learned during training.

Consequences of Attention Sink

  • Wasted capacity: Attention heads spend resources on uninformative tokens
  • Long-context degradation: Performance drops as context length increases
  • Position bias: Early tokens disproportionately influence outputs
  • Training instability: Gradient flow becomes uneven across positions

The Gated Attention Solution

The paper proposes a simple yet effective modification: adding a head-specific sigmoid gate after the scaled dot-product attention (SDPA).

GatedAttn(Q, K, V) = sigmoid(g) * softmax(QK^T / sqrt(d_k)) * V

Where g is a learnable scalar parameter per attention head.

How It Works

Gate Value Near 0

The attention head's output is suppressed. This allows heads to "opt out" when they have nothing meaningful to contribute, rather than dumping attention on sinks.

Gate Value Near 1

Standard attention behavior. The head contributes normally when it has found relevant key-query matches.

Experimental Results

The paper conducts extensive comparisons across dozens of gating variants on both 15B MoE (Mixture of Experts) and 1.7B dense models:

Key Findings

  1. Attention Sink Elimination: Gated attention removes the concentration of attention mass on early tokens, leading to more uniform and semantically meaningful attention patterns.
  2. Training Stability: Models train more smoothly with reduced gradient variance across layers and positions.
  3. Long-Context Performance: Significant improvements on tasks requiring attention over 32K+ tokens.
  4. Sparse Attention Emergence: Gates naturally learn to create sparse attention patterns, improving efficiency.

Real-World Impact: This gating mechanism is now deployed in Qwen3-Next models, demonstrating its practical value at scale.

Technical Deep Dive

Gating Variants Explored

The paper systematically evaluates multiple gating designs:

  • Per-head scalar gate: Single learnable parameter per head (winner)
  • Per-position gate: Gate varies by sequence position
  • Content-dependent gate: Gate computed from input embeddings
  • Key-query gate: Gate based on attention scores themselves

Surprisingly, the simplest variant (per-head scalar) outperforms more complex alternatives, suggesting that the key benefit comes from allowing heads to suppress themselves entirely rather than fine-grained per-token control.

Implementation

import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        # Assumes a MultiHeadAttention module that returns per-head outputs
        # of shape [batch, seq, heads, head_dim] before the output projection
        self.attn = MultiHeadAttention(d_model, n_heads)
        # One gate per head; sigmoid(1.0) ~ 0.73, i.e. close to pass-through
        self.gates = nn.Parameter(torch.ones(n_heads))

    def forward(self, x):
        attn_out = self.attn(x)            # [batch, seq, heads, head_dim]
        gates = torch.sigmoid(self.gates)  # [heads], each in (0, 1)
        return attn_out * gates.view(1, 1, -1, 1)

Visualize It

Try It: Use the Attention Visualizer to see attention patterns with and without gating, and observe how the sink phenomenon disappears.

Related Topics

Neural Scaling Laws

How architectural improvements interact with model scale.

Deep Network Scaling

Depth scaling in neural networks beyond transformers.

The Depth Paradox in RL

For years, the conventional wisdom held that reinforcement learning and very deep networks don't mix. While computer vision scaled to hundreds of layers (ResNet-152, ViT-Giant), RL algorithms struggled with networks beyond a few dozen layers.

Why Was Depth Problematic?

RL training involves non-stationary targets, high-variance gradients, and bootstrapping errors that compound through deep networks. The result: training instability, gradient explosion/vanishing, and poor sample efficiency.

Self-Supervised RL Fundamentals

The paper leverages self-supervised reinforcement learning, which learns useful representations and policies without explicit reward signals. Instead, the agent learns through intrinsic objectives like:

Goal-Conditioned Learning

The agent learns to reach arbitrary goal states. Given current state s and goal g, learn policy pi(a|s,g) that reaches g.

Contrastive Objectives

Learn state representations where "reachable" state pairs are close in embedding space and "unreachable" pairs are far apart.

The Contrastive RL Formulation

L = -log[exp(f(s,g)) / (exp(f(s,g)) + sum_neg(exp(f(s,g_neg))))]

This objective encourages the network to learn representations that capture the temporal structure of the environment: which states can be reached from which other states.

Why Depth Enables New Capabilities

The paper's central finding is that extreme depth (up to 1024 layers) isn't just possible in self-supervised RL; it enables qualitatively new behaviors that shallower networks cannot achieve.

Emergent Capabilities with Depth

  1. Long-Horizon Planning: Deeper networks can represent longer chains of reasoning about state reachability, enabling goals that require many intermediate steps.
  2. Compositional Skills: The network learns to compose primitive behaviors into complex sequences, similar to how deep vision models compose edges into objects.
  3. Generalization to Novel Goals: Deeper networks show better zero-shot transfer to goal configurations not seen during training.
  4. Robust Locomotion: In simulated robotics tasks, very deep networks discover more stable and efficient movement patterns.

Architecture & Training

Network Architecture

The 1024-layer network uses:

  • Residual connections: Essential for gradient flow
  • Layer normalization: Stabilizes activations at each layer
  • Careful initialization: Scaled initialization to prevent explosion
  • Bottleneck blocks: Reduce computation while maintaining depth

Training Innovations

  • Hindsight goal relabeling: Every trajectory provides supervision for multiple goal-reaching problems
  • Hard negative mining: Sample challenging negative examples for the contrastive loss
  • Progressive depth: Start with fewer layers and gradually increase

# Simplified contrastive goal-reaching loss (InfoNCE-style sketch)
import torch
import torch.nn.functional as F

def compute_loss(encoder, states, goals, negatives):
    s_embed = encoder(states)        # very deep encoder, [batch, dim]
    g_embed = encoder(goals)         # [batch, dim]
    neg_embed = encoder(negatives)   # [batch, n_neg, dim]

    pos_sim = F.cosine_similarity(s_embed, g_embed, dim=-1)                 # [batch]
    neg_sim = F.cosine_similarity(s_embed.unsqueeze(1), neg_embed, dim=-1)  # [batch, n_neg]

    # True goal sits at index 0, so cross-entropy yields -log softmax(pos_sim)
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1)
    labels = torch.zeros(logits.shape[0], dtype=torch.long)
    return F.cross_entropy(logits, labels)

Experimental Results

Maze Navigation

1024-layer: 94% success
64-layer: 71% success
Deep networks find shorter paths through complex mazes.

Robotic Manipulation

1024-layer: 87% success
64-layer: 62% success
Enables multi-step object rearrangement.

Key Insight: There's a phase transition around 256-512 layers where the network suddenly gains the ability to solve previously impossible long-horizon tasks.

Try the Simulator

Interactive Demo: Watch a goal-reaching agent learn in real-time. Adjust network depth and observe how deeper networks discover more efficient paths.

Related Topics

RLVR & LLM Reasoning

How RL interacts with language model capabilities.

Attention Mechanisms

Architectural innovations for deep networks.

Diffusion Models 101

Diffusion models generate data by learning to reverse a noise-adding process. The key idea is elegantly simple: if we know how to gradually add noise until data becomes pure Gaussian noise, we can learn to reverse this process and generate new data from noise.

Forward Process (Adding Noise)

x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon

Here alpha_bar_t, the cumulative product of the per-step coefficients alpha_s, controls how much of the original signal survives. Over T timesteps, clean data x_0 becomes increasingly noisy until x_T is nearly pure Gaussian noise.
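The closed-form forward process can be sketched in a few lines, assuming a standard linear beta schedule (the schedule and its constants are our illustrative choices, not taken from the paper):

```python
import numpy as np

def forward_diffuse(x0, t, T=1000, beta_min=1e-4, beta_max=0.02, rng=None):
    # Jump straight from clean data x_0 to noisy x_t in one step
    if rng is None:
        rng = np.random.default_rng(0)
    betas = np.linspace(beta_min, beta_max, T)   # per-step noise schedule
    alpha_bar = np.cumprod(1.0 - betas)          # cumulative signal retention
    eps = rng.standard_normal(x0.shape)          # Gaussian noise
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps
```

At small t almost all of x_0 survives; by t = T-1 the output is essentially pure noise.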

Reverse Process (Removing Noise)

x_{t-1} = (1/sqrt(alpha_t)) * (x_t - (1-alpha_t)/sqrt(1-alpha_bar_t) * epsilon_theta(x_t, t))

A neural network epsilon_theta learns to predict the noise at each step, enabling step-by-step denoising.

The Over-Parameterization Paradox

Modern diffusion models have billions of parameters trained on datasets of millions of images. Classical learning theory suggests these models should massively overfit, essentially memorizing their training data.

The Puzzle: Despite having enough capacity to memorize every training example, diffusion models generate novel images that don't exist in the training set. How?

This paper provides a theoretical framework explaining why diffusion models generalize instead of memorizing, based on the implicit regularization of training dynamics.

Two Characteristic Timescales

The paper identifies two critical timescales in diffusion model training:

t_learn

The time required to learn the underlying distribution. The model captures the statistical structure of the data class.

Scales as: O(sqrt(N))

t_memorize

The time required to memorize individual examples. The model starts reproducing specific training images.

Scales as: O(N)

The Generalization Window

Because t_memorize grows faster than t_learn with dataset size N, there exists a "generalization window" where the model has learned the data distribution but hasn't yet memorized specific examples.

Generalization Window = [t_learn, t_memorize] ~ [O(sqrt(N)), O(N)]

For large datasets, this window is huge! With 10M images, the window spans roughly from 3,000 to 10,000,000 training steps.

Implicit Dynamical Regularization

The key insight is that gradient descent on the diffusion objective has an inherent regularization effect that emerges from the training dynamics themselves, without any explicit regularization term.

How It Works

  1. Early Training: The network learns broad, coarse-grained features that capture the general structure of the data distribution.
  2. Mid Training: Fine-grained details are learned, but the model still generates diverse outputs by interpolating between training examples.
  3. Late Training: Individual training examples begin to emerge as the network capacity starts encoding specific images.

Practical Implication: Training diffusion models too long can lead to memorization and potential privacy/copyright issues. The paper provides theoretical guidance for optimal stopping times.

Mathematical Framework

The paper analyzes training dynamics using tools from statistical physics:

Score Function Learning

Diffusion models learn the score function: the gradient of log probability density.

s_theta(x, t) approximates grad_x log p_t(x)

Information-Theoretic Bound

The generalization error is bounded by the mutual information between the model parameters and individual training examples, which grows slowly during the generalization window.

Kernel Regime Analysis

In the infinite-width limit, diffusion training dynamics can be analyzed as a kernel method, revealing the implicit bias toward smooth, generalizing solutions.

Visualize the Diffusion Process

Interactive Demo: Use the Diffusion Process Visualizer to see forward and reverse diffusion in action, and explore the generalization window concept.

Related Topics

Neural Scaling Laws

How model capacity relates to generalization.

Online Learning Theory

Theoretical foundations of learning dynamics.

What is RLVR?

Reinforcement Learning with Verifiable Rewards (RLVR) is a training paradigm where LLMs are fine-tuned using RL with rewards based on objectively verifiable outcomes. Unlike RLHF (human feedback), RLVR rewards are programmatic: math problems have correct answers, code either passes tests or doesn't.

Verifiable Domains

  • Mathematical reasoning
  • Code generation
  • Logic puzzles
  • Formal proofs

RLVR Methods

  • PPO with outcome rewards
  • Expert iteration
  • GRPO / DPO variants
  • Process reward models

The Central Question

Does RLVR teach LLMs new reasoning capabilities, or does it simply make the model better at finding and selecting reasoning patterns it already knows?

This distinction matters enormously. If RLVR adds new reasoning, we can expect continued improvement with more RL training. If it only improves sampling efficiency, we'll hit a ceiling defined by the base model's latent capabilities.

Key Findings

Finding 1: No New Reasoning Patterns

Across extensive experiments, the paper finds that RLVR models solve problems using the same reasoning strategies present in base model samples. When you sample enough responses from the base model, you find the same solution patterns that RLVR models produce more reliably.

Finding 2: Improved Sampling Efficiency

What RLVR does accomplish is dramatic improvement in sampling efficiency. Where a base model might need 100 samples to find a correct solution, an RLVR model might need only 5.

Pass@1(RLVR) approx Pass@k(Base) for some k >> 1
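The relation above can be estimated from finite samples with the standard unbiased pass@k estimator (the combinatorial estimator popularized in code-generation evaluations; shown here as an illustration, not as this paper's code):

```python
from math import comb

def pass_at_k(n, c, k):
    # n samples drawn, c of them correct: probability that at least one
    # of k randomly chosen samples is correct
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, pass_at_k(2, 1, 1) is 0.5: with one correct sample out of two, a single draw succeeds half the time.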

Finding 3: Narrowed Exploration

RLVR training narrows the model's exploration. It learns to avoid low-reward trajectories, effectively pruning the sampling tree. This improves efficiency but doesn't expand what's reachable.

Before RLVR

Wide exploration, many wrong paths sampled, occasional correct solutions found through extensive sampling.

After RLVR

Narrow exploration, high-reward paths favored, correct solutions found reliably but from the same set of reachable solutions.

The Distillation Alternative

Interestingly, the paper finds that distillation from stronger models does add genuinely new reasoning capabilities. When a weaker model learns from a stronger teacher's traces, it acquires reasoning patterns that weren't in its original distribution.

Implication: For expanding reasoning capabilities, distillation from stronger models may be more effective than RLVR on the same model. RLVR is better suited for improving reliability of existing capabilities.

Technical Analysis

Methodology

The paper uses several techniques to probe reasoning boundaries:

  • Coverage Analysis: Compare the set of problems solvable by base model (with many samples) vs RLVR model (with few samples).
  • Solution Pattern Clustering: Embed solution traces and cluster to identify distinct reasoning strategies.
  • Ablation Studies: Test across model sizes, RL algorithms, and problem domains.

Results Across Conditions

The core finding holds across:

  • Model sizes from 7B to 70B parameters
  • PPO, DPO, GRPO algorithms
  • Math (GSM8K, MATH) and code (HumanEval, MBPP) benchmarks
  • Different reward models and training durations

Implications for AI Development

Scaling RLVR

These findings suggest diminishing returns from RLVR as training continues. Once the model has learned to reliably select good reasoning paths, further RL provides minimal benefit.

Compute Allocation

For pushing reasoning frontiers, compute may be better spent on:

  1. Training larger base models with more diverse data
  2. Distillation from ensemble of strong reasoners
  3. Test-time compute (search, verification, retry)

The Reasoning Ceiling

Current LLMs may have a "reasoning ceiling" determined by their pretraining. RLVR helps reach this ceiling efficiently but doesn't raise it. Breaking through may require fundamentally different approaches.

Related Topics

LLM Diversity

How training affects output diversity and homogeneity.

Deep RL Networks

When RL does enable new capabilities.

Online Learning Setup

In online learning, a learner faces a sequence of examples one at a time. For each example, the learner must make a prediction before seeing the true label. After predicting, the learner receives feedback and can update their hypothesis.

The Protocol

  1. Adversary presents instance x_t
  2. Learner predicts label y_hat_t
  3. True label y_t revealed
  4. Learner updates hypothesis
  5. Repeat for T rounds

Performance Metric

Mistake bound: The maximum number of prediction errors the learner makes over the entire sequence, in the worst case.
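To make the protocol concrete, here is a sketch (illustrative, not from the paper) of the classic halving-style learner over a finite hypothesis class: predicting with the majority of still-consistent hypotheses guarantees that every mistake at least halves the version space, bounding mistakes by log2 of the class size.

```python
# Halving algorithm sketch: majority vote over the version space
def halving_run(hypotheses, instances, true_h):
    version_space = list(hypotheses)
    mistakes = 0
    for x in instances:
        votes = [h(x) for h in version_space]        # adversary presents x_t
        y_hat = max(set(votes), key=votes.count)     # learner predicts by majority
        y = true_h(x)                                # true label revealed
        if y_hat != y:
            mistakes += 1
        version_space = [h for h in version_space
                         if h(x) == y]               # keep consistent hypotheses
    return mistakes
```

With 8 threshold classifiers, for instance, any realizable sequence yields at most 3 mistakes.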

Transductive vs Standard Online Learning

Standard Online Learning

The adversary can present any instance at each round. The learner has no advance knowledge of what instances will appear.

Mistake Bound = O(d) where d = VC dimension

Transductive Online Learning

The learner knows in advance which instances will appear (but not their labels or order). This represents having access to unlabeled data.

The Open Problem (30 years!): How much does knowing the unlabeled instances in advance help? Can we prove tight bounds on the improvement?

The Main Result

The paper proves tight bounds showing a quadratic gap between transductive and standard online learning:

Standard Online

Mistakes = Theta(d)

Linear in VC dimension

Transductive

Mistakes = Theta(sqrt(d))

Square root of VC dimension

The Quadratic Gap

This means that knowing which instances you'll face (but not their labels) can provide a quadratic improvement in learning efficiency. For a hypothesis class with d = 100, standard online learning may make ~100 mistakes while transductive learning can achieve ~10.

Technical Insights

Lower Bound Construction

The paper constructs an adversarial instance sequence where any online learner must make at least Omega(sqrt(d)) mistakes, even with advance knowledge of instances. The construction uses sophisticated combinatorial arguments involving:

  • Carefully designed hypothesis classes
  • Probabilistic adversary strategies
  • Information-theoretic lower bounds

Upper Bound Algorithm

The paper provides an algorithm achieving O(sqrt(d)) mistakes in the transductive setting. Key techniques include:

  • Version space analysis: Track consistent hypotheses
  • Weighted majority voting: Hedge across plausible hypotheses
  • Careful tie-breaking: Exploit structure of unlabeled instances

Intuition: Knowing the instances in advance lets you precompute which hypotheses are "similar" on the specific instances you'll face, enabling more efficient version space reduction.

Why This Matters

The Value of Unlabeled Data

This result formally quantifies a fundamental question in machine learning: how much does unlabeled data help? In the online learning setting, the answer is: a lot, specifically a quadratic improvement.

Connections to Semi-Supervised Learning

The transductive setting models scenarios where we have a fixed pool of unlabeled examples and must predict labels as they're requested. This connects to:

  • Active learning: choosing which examples to label
  • Semi-supervised learning: leveraging unlabeled structure
  • Transductive SVMs: predicting on a fixed test set

Practical Implications

  • Data collection: Knowing your test distribution can dramatically improve sample efficiency
  • Domain adaptation: Access to unlabeled target data enables better transfer
  • Cold start: In recommendation systems, knowing the items enables better user modeling

Historical Context

This problem was first posed in the 1990s during the early theoretical study of online learning. Notable attempts include:

  • 1995: Initial upper bounds proved but not tight
  • 2000s: Progress on special cases (linear classifiers, finite classes)
  • 2010s: Improved bounds but still gaps between upper and lower
  • 2025: This paper finally closes the gap!

Related Topics

Neural Scaling Laws

How sample complexity relates to model capacity.

Diffusion Training Dynamics

Another lens on learning-theoretic phenomena.

Neural Scaling Laws Refresher

One of the most remarkable empirical findings in deep learning is that model performance follows predictable power laws as we scale compute, data, and parameters:

Loss ~ C^(-alpha) where C = compute, parameters, or data

The famous Chinchilla scaling laws showed that for a fixed compute budget, there's an optimal balance between model size and training data. But why do these clean power laws emerge?
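In practice, the exponent alpha is read off as the slope of a log-log plot; a synthetic sketch (made-up numbers, purely illustrative):

```python
import numpy as np

# Synthetic losses that follow an exact power law with alpha = 0.35
C = np.array([1e3, 1e4, 1e5, 1e6, 1e7])  # compute budgets
loss = C ** (-0.35)

# A power law is a straight line in log-log space; fit its slope
slope, intercept = np.polyfit(np.log(C), np.log(loss), 1)
alpha_est = -slope  # recovers 0.35
```

Real measurements are noisy, of course, but the same log-log regression is how scaling exponents are typically reported.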

The Mystery: Scaling laws seem almost magical. Why should such complex systems follow such simple mathematical relationships? This paper proposes an answer: feature superposition.

What is Feature Superposition?

In a neural network, the model must represent many more features (concepts, patterns) than it has dimensions (neurons, embedding coordinates). The solution: pack multiple features into the same dimensions using superposition.

The Geometry

Imagine representing N features in a d-dimensional space where N >> d. Pure representation would be impossible (we'd need N dimensions). Instead, features are represented as non-orthogonal vectors that partially overlap:

Without Superposition

Each feature gets its own dimension. N features need N dimensions. Clean but capacity-limited.

With Superposition

Features share dimensions. N >> d features packed into d dimensions. More capacity but with interference.

The Trade-off

Superposition allows encoding more features but introduces interference. When features share dimensions, activating one feature partially activates others. The model must manage this interference to make accurate predictions.

The Main Result

The paper proves that when models use superposition to pack features, the resulting interference naturally produces power-law scaling of loss with model size:

Loss proportional to 1/d^alpha where alpha depends on feature statistics

Key Insights

  1. Interference Decreases with Dimension: In higher-dimensional spaces, random vectors are more nearly orthogonal, reducing interference.
  2. Power-Law Emerges from Geometry: The rate at which orthogonality improves with dimension follows a power law, explaining the scaling behavior.
  3. Feature Importance Distribution Matters: The exponent alpha depends on how features are distributed by importance (Zipfian, uniform, etc.).

Technical Deep Dive

The Toy Model

The paper develops analysis using a controlled toy model:

  • Input: Sparse binary feature vectors
  • Task: Reconstruct features from compressed representation
  • Architecture: Linear encoder-decoder with bottleneck
  • Objective: Minimize reconstruction loss

# Simplified toy model (runnable numpy sketch)
import numpy as np

def superposition_model(x, W_encode, W_decode):
    # Encode: N features -> d dimensions (d << N)
    h = W_encode @ x      # shape: (d,)
    # Decode: d dimensions -> N features (with interference)
    x_hat = W_decode @ h  # shape: (N,)
    return x_hat

# Loss includes interference from other features
def reconstruction_loss(x, x_hat):
    return np.sum((x - x_hat) ** 2)

Analysis Framework

Using random matrix theory and concentration inequalities, the paper shows:

  • Optimal encoding distributes features as nearly-orthogonal vectors
  • Expected interference scales as O(N/d) per feature
  • Total loss scales as O(N/d) = O(1/d) when N is fixed
  • For Zipfian feature importance, this becomes O(1/d^alpha)
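The interference scaling can be checked numerically; in this short sketch (the helper below is ours, purely illustrative), we sample N random unit feature vectors in d dimensions and measure their mean squared overlap, which concentrates around 1/d:

```python
import numpy as np

def avg_interference(N, d, rng):
    W = rng.standard_normal((N, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)  # N random unit feature vectors
    G = W @ W.T                                    # pairwise overlaps (Gram matrix)
    off_diag = G[~np.eye(N, dtype=bool)]           # ignore self-overlaps
    return np.mean(off_diag ** 2)                  # mean squared interference ~ 1/d
```

Doubling d roughly halves the interference; this geometric fact is the root of the predicted power law.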

Connection to Real LLMs

The paper validates predictions on open LLMs (Pythia, OLMo) by:

  1. Extracting feature representations using sparse autoencoders
  2. Measuring actual superposition levels
  3. Correlating superposition with scaling behavior
  4. Finding quantitative agreement with theory

When Scaling Laws Break

The superposition framework also predicts when scaling laws should fail:

Scaling Breaks When...

  • Features can't be approximated linearly
  • Feature co-occurrence is highly structured
  • Model is too small for superposition
  • Task requires exact feature recovery

Scaling Holds When...

  • Many sparse features exist
  • Approximate reconstruction suffices
  • Features have Zipfian importance
  • Model capacity allows superposition

Implications

For Scaling Research

Understanding why scaling laws work helps predict when they'll break. Tasks requiring precise, non-superposed representations may not follow standard scaling predictions.

For Interpretability

Superposition is both a blessing (efficiency) and a curse (interpretability). Features entangled in superposition are hard to disentangle, explaining why neural network interpretability is challenging.

For Architecture Design

Architectures that manage superposition better (e.g., sparse models, modular networks) may achieve better scaling or interpretability trade-offs.

Explore Superposition

Interactive Demo: Use the Feature Superposition Demo to visualize how features interfere in low-dimensional spaces and see scaling law emergence.

Related Topics

Gated Attention

Architectural innovations that improve efficiency.

Diffusion Dynamics

How implicit regularization affects model behavior.
