Explore NeurIPS 2025 Best Papers
Deep-dive into the research behind the 7 award-winning papers with comprehensive explanations and interactive simulators.
Topics
LLM Diversity & The Artificial Hivemind
Explore how LLMs produce homogeneous outputs and the societal implications of model monoculture.
Gated Attention Mechanisms
Learn about attention sink problems and how gating fixes long-context LLM performance.
1000-Layer Deep RL Networks
Discover how extreme depth enables self-supervised RL agents to reach goals without rewards.
Diffusion Training Dynamics
Understand why diffusion models generalize instead of memorizing their training data.
RLVR & LLM Reasoning
Examine whether reinforcement learning truly expands LLM reasoning capabilities.
Transductive Online Learning
A 30-year-old open problem solved: the value of unlabeled data in online learning.
Superposition & Neural Scaling
Learn how feature superposition drives the famous neural scaling laws.
Quick Start
New to ML?
Start with the Diffusion Process simulator for an intuitive visual introduction, then explore the Attention Visualizer to understand how transformers work.
Experienced Practitioner?
Dive into Superposition & Scaling for cutting-edge theory, or explore the RLVR & Reasoning analysis for insights on LLM training.
LLM Diversity & The Artificial Hivemind
Understanding homogeneity in large language model outputs and its implications.
The Problem: LLM Monoculture
When millions of people ask similar questions to the same LLM, they receive remarkably similar answers. This phenomenon, termed the "Artificial Hivemind", raises fundamental concerns about the diversity of information, perspectives, and creative outputs in an AI-augmented society.
Understanding Diversity Metrics
The paper introduces several metrics to quantify output diversity:
1. Lexical Diversity
Measures variety at the word/token level using metrics like:
- Type-Token Ratio (TTR): Unique words / Total words
- N-gram Diversity: Unique n-grams across outputs
- Self-BLEU: How similar outputs are to each other (lower = more diverse)
2. Semantic Diversity
Captures meaning-level variation:
- Embedding Distance: Cosine distance between response embeddings
- Topic Distribution: Spread across identified themes
- Stance Diversity: Range of positions on subjective questions
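A rough sketch of how a couple of the lexical metrics above might be computed; the Jaccard overlap here is a crude stand-in for Self-BLEU, and a semantic variant would compare embedding vectors rather than token sets:

from itertools import combinations

def type_token_ratio(text):
    # Unique words divided by total words
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_pairwise_overlap(responses):
    # Average Jaccard overlap between response pairs; higher = less diverse
    sets = [set(r.lower().split()) for r in responses]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)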
The Infinity-Chat Benchmark
The paper introduces Infinity-Chat, a benchmark of 26,000+ open-ended queries with dense human annotations specifically designed to evaluate response diversity.
Query Categories
- Creative writing prompts
- Opinion & preference questions
- Open-ended problem solving
- Cultural & subjective topics
Annotation Types
- Response quality ratings
- Diversity judgments
- Originality scores
- Human baseline comparisons
Key Findings
Intra-Model Homogeneity
When sampling multiple responses from the same model with the same prompt, outputs cluster tightly in semantic space. Even with high temperature settings, the fundamental "answer shape" remains remarkably consistent.
Inter-Model Convergence
Different LLM families (GPT, Claude, Llama, etc.) trained on overlapping internet data produce responses that are more similar to each other than human responses are to each other. This suggests a concerning convergence toward a shared "AI perspective."
Implications
- Echo chambers: AI-assisted content creation may reduce information diversity
- Cultural flattening: Minority viewpoints underrepresented in training data get further marginalized
- Creative stagnation: Automated content converges toward "average" outputs
- Value alignment: Whose values are reflected in homogeneous outputs?
Technical Deep Dive
Why Does Homogeneity Occur?
Several factors contribute to the hivemind effect:
- Training Data Overlap: Major LLMs are trained on largely overlapping web corpora, leading to similar statistical patterns.
- RLHF Homogenization: Reinforcement Learning from Human Feedback pushes models toward "safe," mainstream responses that satisfy average annotator preferences.
- Mode Collapse in Generation: Autoregressive sampling tends to find high-probability paths, which are often the same paths across similar models.
- Benchmark Optimization: Models optimized for the same benchmarks develop similar capabilities and failure modes.
Related Topics
RLVR & LLM Reasoning
How RL training affects model behavior and output diversity.
Neural Scaling Laws
Why larger models might not solve the diversity problem.
Gated Attention for Large Language Models
Non-linearity, sparsity, and attention-sink-free mechanisms.
Self-Attention Refresher
The transformer architecture relies on self-attention to allow each token to attend to all other tokens in a sequence. The standard formulation:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Where:
- Q (Query): What information am I looking for?
- K (Key): What information do I contain?
- V (Value): What information do I provide if matched?
- d_k: Dimension of keys (for scaling)
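A minimal PyTorch sketch of this formulation (single head, no masking), assuming Q, K, V are already-projected tensors:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # query-key relevance scores
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ V                              # weighted mix of values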
The Attention Sink Problem
A critical issue emerges in long-context scenarios: attention sink. This phenomenon causes attention to concentrate heavily on early tokens (often the BOS token or initial punctuation), regardless of their semantic relevance.
The softmax function must produce a probability distribution that sums to 1. When no tokens are particularly relevant, the model "dumps" excess attention mass onto early tokens as a default behavior learned during training.
Consequences of Attention Sink
- Wasted capacity: Attention heads spend resources on uninformative tokens
- Long-context degradation: Performance drops as context length increases
- Position bias: Early tokens disproportionately influence outputs
- Training instability: Gradient flow becomes uneven across positions
The Gated Attention Solution
The paper proposes a simple yet effective modification: adding a head-specific sigmoid gate after the scaled dot-product attention (SDPA):
output = sigmoid(g) * SDPA(Q, K, V)
Where g is a learnable scalar parameter per attention head.
How It Works
Gate Value Near 0
The attention head's output is suppressed. This allows heads to "opt out" when they have nothing meaningful to contribute, rather than dumping attention on sinks.
Gate Value Near 1
Standard attention behavior. The head contributes normally when it has found relevant key-query matches.
Experimental Results
The paper conducts extensive comparisons across dozens of gating variants on both 15B MoE (Mixture of Experts) and 1.7B dense models:
Key Findings
- Attention Sink Elimination: Gated attention removes the concentration of attention mass on early tokens, leading to more uniform and semantically meaningful attention patterns.
- Training Stability: Models train more smoothly with reduced gradient variance across layers and positions.
- Long-Context Performance: Significant improvements on tasks requiring attention over 32K+ tokens.
- Sparse Attention Emergence: Gates naturally learn to create sparse attention patterns, improving efficiency.
Technical Deep Dive
Gating Variants Explored
The paper systematically evaluates multiple gating designs:
- Per-head scalar gate: Single learnable parameter per head (winner)
- Per-position gate: Gate varies by sequence position
- Content-dependent gate: Gate computed from input embeddings
- Key-query gate: Gate based on attention scores themselves
Surprisingly, the simplest variant (per-head scalar) outperforms more complex alternatives, suggesting that the key benefit comes from allowing heads to suppress themselves entirely rather than fine-grained per-token control.
Implementation
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One gate per head, initialized near pass-through (sigmoid(1.0) ~ 0.73)
        self.gates = nn.Parameter(torch.ones(n_heads))

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)                           # [batch, seq, d_model]
        b, s, _ = attn_out.shape
        heads = attn_out.view(b, s, self.n_heads, self.head_dim)   # split per head
        gates = torch.sigmoid(self.gates)                          # [heads], each in (0, 1)
        return (heads * gates.view(1, 1, -1, 1)).reshape(b, s, -1)
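A quick shape check with hypothetical sizes:

x = torch.randn(2, 16, 64)                      # [batch, seq, d_model]
layer = GatedAttention(d_model=64, n_heads=8)
print(layer(x).shape)                           # torch.Size([2, 16, 64])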
Visualize It
Related Topics
Neural Scaling Laws
How architectural improvements interact with model scale.
Deep Network Scaling
Depth scaling in neural networks beyond transformers.
1000-Layer Networks for Self-Supervised RL
Scaling depth can enable new goal-reaching capabilities.
The Depth Paradox in RL
For years, the conventional wisdom held that reinforcement learning and very deep networks don't mix. While computer vision scaled to hundreds of layers (ResNet-152, ViT-Giant), RL algorithms struggled with networks beyond a few dozen layers.
RL training involves non-stationary targets, high variance gradients, and bootstrapping errors that compound through deep networks. The result: training instability, gradient explosion/vanishing, and poor sample efficiency.
Self-Supervised RL Fundamentals
The paper leverages self-supervised reinforcement learning, which learns useful representations and policies without explicit reward signals. Instead, the agent learns through intrinsic objectives like:
Goal-Conditioned Learning
The agent learns to reach arbitrary goal states. Given current state s and goal g, learn policy pi(a|s,g) that reaches g.
Contrastive Objectives
Learn state representations where "reachable" state pairs are close in embedding space and "unreachable" pairs are far apart.
The Contrastive RL Formulation
Schematically, the objective is an InfoNCE-style contrastive loss over state-goal pairs:
L = -log [ exp(sim(s, g+)) / (exp(sim(s, g+)) + sum over g- of exp(sim(s, g-))) ]
where sim is a similarity between learned embeddings, g+ is a state actually reached later in the same trajectory, and g- are states drawn from other trajectories. This objective encourages the network to learn representations that capture the temporal structure of the environment: which states can be reached from which other states.
Why Depth Enables New Capabilities
The paper's central finding is that extreme depth (up to 1024 layers) isn't just possible in self-supervised RL; it enables qualitatively new behaviors that shallower networks cannot achieve.
Emergent Capabilities with Depth
- Long-Horizon Planning: Deeper networks can represent longer chains of reasoning about state reachability, enabling goals that require many intermediate steps.
- Compositional Skills: The network learns to compose primitive behaviors into complex sequences, similar to how deep vision models compose edges into objects.
- Generalization to Novel Goals: Deeper networks show better zero-shot transfer to goal configurations not seen during training.
- Robust Locomotion: In simulated robotics tasks, very deep networks discover more stable and efficient movement patterns.
Architecture & Training
Network Architecture
The 1024-layer network uses:
- Residual connections: Essential for gradient flow
- Layer normalization: Stabilizes activations at each layer
- Careful initialization: Scaled initialization to prevent explosion
- Bottleneck blocks: Reduce computation while maintaining depth
Training Innovations
- Hindsight goal relabeling: Every trajectory provides supervision for multiple goal-reaching problems
- Hard negative mining: Sample challenging negative examples for the contrastive loss
- Progressive depth: Start with fewer layers and gradually increase
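A rough illustration of hindsight goal relabeling (a sketch, assuming a trajectory is stored as a list of states): each visited state is paired with a goal the agent actually reached later, so every trajectory yields positive examples. The contrastive loss below then treats these relabeled pairs as positives.

import random

def relabel_trajectory(trajectory):
    relabeled = []
    for t, state in enumerate(trajectory[:-1]):
        future = random.randrange(t + 1, len(trajectory))   # an achieved future state
        relabeled.append((state, trajectory[future]))        # (state, hindsight goal) pair
    return relabeled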
import torch
import torch.nn.functional as F

# Simplified contrastive goal-reaching loss
def compute_loss(encoder, states, goals, negatives):
    s_embed = encoder(states)          # very deep encoder, [batch, dim]
    g_embed = encoder(goals)
    neg_embed = encoder(negatives)
    pos_sim = F.cosine_similarity(s_embed, g_embed, dim=-1)    # [batch]
    neg_sim = F.cosine_similarity(s_embed, neg_embed, dim=-1)  # [batch]
    logits = torch.stack([pos_sim, neg_sim], dim=-1)           # [batch, 2]
    # Push goal similarity up relative to the negative
    return -F.log_softmax(logits, dim=-1)[:, 0].mean()
Experimental Results
Maze Navigation
1024-layer: 94% success
64-layer: 71% success
Deep networks find shorter paths through complex mazes.
Robotic Manipulation
1024-layer: 87% success
64-layer: 62% success
Enables multi-step object rearrangement.
Try the Simulator
Related Topics
RLVR & LLM Reasoning
How RL interacts with language model capabilities.
Attention Mechanisms
Architectural innovations for deep networks.
Why Diffusion Models Don't Memorize
The role of implicit dynamical regularization in training.
Diffusion Models 101
Diffusion models generate data by learning to reverse a noise-adding process. The key idea is elegantly simple: if we know how to gradually add noise until data becomes pure Gaussian noise, we can learn to reverse this process and generate new data from noise.
Forward Process (Adding Noise)
Over T timesteps, clean data x_0 becomes increasingly noisy until x_T is nearly pure Gaussian noise.
Reverse Process (Removing Noise)
A neural network epsilon_theta learns to predict the noise at each step, enabling step-by-step denoising.
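A minimal sketch of one training step under these two processes, assuming a noise-prediction network eps_theta(x_t, t) and a precomputed cumulative-product noise schedule alphas_cumprod (illustrative, not the paper's code):

import torch

def diffusion_training_step(eps_theta, x0, alphas_cumprod):
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (x0.shape[0],))                        # random timestep per sample
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))    # broadcast to x0's shape
    noise = torch.randn_like(x0)
    # Forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    # The network is trained to predict the injected noise at step t
    return ((eps_theta(x_t, t) - noise) ** 2).mean()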
The Over-Parameterization Paradox
Modern diffusion models have billions of parameters trained on datasets of millions of images. Classical learning theory suggests these models should massively overfit, essentially memorizing their training data.
This paper provides a theoretical framework explaining why diffusion models generalize instead of memorizing, based on the implicit regularization of training dynamics.
Two Characteristic Timescales
The paper identifies two critical timescales in diffusion model training:
t_learn
The time required to learn the underlying distribution. The model captures the statistical structure of the data class.
Scales as: O(sqrt(N))
t_memorize
The time required to memorize individual examples. The model starts reproducing specific training images.
Scales as: O(N)
The Generalization Window
Because t_memorize grows faster than t_learn with dataset size N, there exists a "generalization window" where the model has learned the data distribution but hasn't yet memorized specific examples.
For large datasets, this window is huge! With 10M images, the window spans roughly from 3,000 to 10,000,000 training steps.
Implicit Dynamical Regularization
The key insight is that gradient descent on the diffusion objective has an inherent regularization effect that emerges from the training dynamics themselves, without any explicit regularization term.
How It Works
- Early Training: The network learns broad, coarse-grained features that capture the general structure of the data distribution.
- Mid Training: Fine-grained details are learned, but the model still generates diverse outputs by interpolating between training examples.
- Late Training: Individual training examples begin to emerge as the network capacity starts encoding specific images.
Mathematical Framework
The paper analyzes training dynamics using tools from statistical physics:
Score Function Learning
Diffusion models learn the score function, grad_x log p_t(x): the gradient of the log probability density.
Information-Theoretic Bound
The generalization error is bounded by the mutual information between the model parameters and individual training examples, which grows slowly during the generalization window.
Kernel Regime Analysis
In the infinite-width limit, diffusion training dynamics can be analyzed as a kernel method, revealing the implicit bias toward smooth, generalizing solutions.
Visualize the Diffusion Process
Related Topics
Neural Scaling Laws
How model capacity relates to generalization.
Online Learning Theory
Theoretical foundations of learning dynamics.
Does RL Really Incentivize Reasoning in LLMs?
Beyond the base model: examining RLVR's true impact.
What is RLVR?
Reinforcement Learning with Verifiable Rewards (RLVR) is a training paradigm where LLMs are fine-tuned using RL with rewards based on objectively verifiable outcomes. Unlike RLHF (human feedback), RLVR rewards are programmatic: math problems have correct answers, code either passes tests or doesn't.
Verifiable Domains
- Mathematical reasoning
- Code generation
- Logic puzzles
- Formal proofs
RLVR Methods
- PPO with outcome rewards
- Expert iteration
- GRPO / DPO variants
- Process reward models
The Central Question
Does RLVR elicit genuinely new reasoning capabilities that the base model could never produce, or does it merely make the model sample its pre-existing capabilities more reliably? This distinction matters enormously. If RLVR adds new reasoning, we can expect continued improvement with more RL training. If it only improves sampling efficiency, we'll hit a ceiling defined by the base model's latent capabilities.
Key Findings
Finding 1: No New Reasoning Patterns
Across extensive experiments, the paper finds that RLVR models solve problems using the same reasoning strategies present in base model samples. When you sample enough responses from the base model, you find the same solution patterns that RLVR models produce more reliably.
Finding 2: Improved Sampling Efficiency
What RLVR does accomplish is dramatic improvement in sampling efficiency. Where a base model might need 100 samples to find a correct solution, an RLVR model might need only 5.
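One common way to make this comparison precise is the standard unbiased pass@k estimator used in code-generation benchmarks, sketched here; the paper's exact evaluation code may differ:

import math

def pass_at_k(n, c, k):
    # n: samples drawn per problem, c: correct samples, k: sampling budget
    if n - c < k:
        return 1.0
    # 1 - C(n - c, k) / C(n, k), expanded as a numerically stable product
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

For example, pass_at_k(100, 5, 1) is 0.05 while pass_at_k(100, 5, 100) is 1.0, which mirrors the gap between few-sample RLVR performance and many-sample base-model coverage.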
Finding 3: Narrowed Exploration
RLVR training narrows the model's exploration. It learns to avoid low-reward trajectories, effectively pruning the sampling tree. This improves efficiency but doesn't expand what's reachable.
Before RLVR
Wide exploration, many wrong paths sampled, occasional correct solutions found through extensive sampling.
After RLVR
Narrow exploration, high-reward paths favored, correct solutions found reliably but from the same set of reachable solutions.
The Distillation Alternative
Interestingly, the paper finds that distillation from stronger models does add genuinely new reasoning capabilities. When a weaker model learns from a stronger teacher's traces, it acquires reasoning patterns that weren't in its original distribution.
Technical Analysis
Methodology
The paper uses several techniques to probe reasoning boundaries:
- Coverage Analysis: Compare the set of problems solvable by base model (with many samples) vs RLVR model (with few samples).
- Solution Pattern Clustering: Embed solution traces and cluster to identify distinct reasoning strategies.
- Ablation Studies: Test across model sizes, RL algorithms, and problem domains.
Results Across Conditions
The core finding holds across:
- Model sizes from 7B to 70B parameters
- PPO, DPO, GRPO algorithms
- Math (GSM8K, MATH) and code (HumanEval, MBPP) benchmarks
- Different reward models and training durations
Implications for AI Development
Scaling RLVR
These findings suggest diminishing returns from RLVR as training continues. Once the model has learned to reliably select good reasoning paths, further RL provides minimal benefit.
Compute Allocation
For pushing reasoning frontiers, compute may be better spent on:
- Training larger base models with more diverse data
- Distillation from ensemble of strong reasoners
- Test-time compute (search, verification, retry)
The Reasoning Ceiling
Current LLMs may have a "reasoning ceiling" determined by their pretraining. RLVR helps reach this ceiling efficiently but doesn't raise it. Breaking through may require fundamentally different approaches.
Related Topics
LLM Diversity
How training affects output diversity and homogeneity.
Deep RL Networks
When RL does enable new capabilities.
Optimal Mistake Bounds for Transductive Online Learning
Resolving a 30-year-old open problem on the value of unlabeled data.
Online Learning Setup
In online learning, a learner faces a sequence of examples one at a time. For each example, the learner must make a prediction before seeing the true label. After predicting, the learner receives feedback and can update their hypothesis.
The Protocol
- Adversary presents instance x_t
- Learner predicts label y_hat_t
- True label y_t revealed
- Learner updates hypothesis
- Repeat for T rounds
Performance Metric
Mistake bound: The maximum number of prediction errors the learner makes over the entire sequence, in the worst case.
Transductive vs Standard Online Learning
Standard Online Learning
The adversary can present any instance at each round. The learner has no advance knowledge of what instances will appear.
Transductive Online Learning
The learner knows in advance which instances will appear (but not their labels or order). This represents having access to unlabeled data.
The Main Result
The paper proves tight bounds showing a quadratic gap between transductive and standard online learning:
Standard Online
Linear in VC dimension
Transductive
Square root of VC dimension
The Quadratic Gap
This means that knowing which instances you'll face (but not their labels) can provide a quadratic improvement in learning efficiency. For a hypothesis class with d = 100, standard online learning may make ~100 mistakes while transductive learning can achieve ~10.
Technical Insights
Lower Bound Construction
The paper constructs an adversarial instance sequence where any online learner must make at least Omega(sqrt(d)) mistakes, even with advance knowledge of instances. The construction uses sophisticated combinatorial arguments involving:
- Carefully designed hypothesis classes
- Probabilistic adversary strategies
- Information-theoretic lower bounds
Upper Bound Algorithm
The paper provides an algorithm achieving O(sqrt(d)) mistakes in the transductive setting. Key techniques include:
- Version space analysis: Track consistent hypotheses
- Weighted majority voting: Hedge across plausible hypotheses
- Careful tie-breaking: Exploit structure of unlabeled instances
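As an illustrative building block (not the paper's algorithm), here is a weighted-majority learner over a finite hypothesis class, where each hypothesis is a callable mapping an instance to a 0/1 label:

import numpy as np

def weighted_majority(hypotheses, instances, labels, eta=0.5):
    weights = np.ones(len(hypotheses))
    mistakes = 0
    for x, y in zip(instances, labels):
        preds = np.array([h(x) for h in hypotheses])
        # Predict with the weighted vote of hypotheses still considered plausible
        y_hat = int(weights @ preds >= weights.sum() / 2)
        mistakes += int(y_hat != y)
        # Downweight hypotheses that erred on this round
        weights[preds != y] *= eta
    return mistakes

The transductive variant additionally uses the known instance pool to organize the version space and break ties, which is where the improvement toward O(sqrt(d)) comes from.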
Why This Matters
The Value of Unlabeled Data
This result formally quantifies a fundamental question in machine learning: how much does unlabeled data help? In the online learning setting, the answer is: a lot, specifically a quadratic improvement.
Connections to Semi-Supervised Learning
The transductive setting models scenarios where we have a fixed pool of unlabeled examples and must predict labels as they're requested. This connects to:
- Active learning: choosing which examples to label
- Semi-supervised learning: leveraging unlabeled structure
- Transductive SVMs: predicting on a fixed test set
Practical Implications
- Data collection: Knowing your test distribution can dramatically improve sample efficiency
- Domain adaptation: Access to unlabeled target data enables better transfer
- Cold start: In recommendation systems, knowing the items enables better user modeling
Historical Context
This problem was first posed in the 1990s during the early theoretical study of online learning. Notable attempts include:
- 1995: Initial upper bounds proved but not tight
- 2000s: Progress on special cases (linear classifiers, finite classes)
- 2010s: Improved bounds but still gaps between upper and lower
- 2025: This paper finally closes the gap!
Related Topics
Neural Scaling Laws
How sample complexity relates to model capacity.
Diffusion Training Dynamics
Another lens on learning-theoretic phenomena.
Superposition Yields Robust Neural Scaling
How feature superposition drives the famous scaling laws.
Neural Scaling Laws Refresher
One of the most remarkable empirical findings in deep learning is that model performance follows predictable power laws as we scale compute, data, and parameters: loss falls smoothly as roughly L ∝ N^(-alpha) in parameter count N, and analogously in dataset and compute size.
The famous Chinchilla scaling laws showed that for a fixed compute budget, there's an optimal balance between model size and training data. But why do these clean power laws emerge?
What is Feature Superposition?
In a neural network, the model must represent many more features (concepts, patterns) than it has dimensions (neurons, embedding coordinates). The solution: pack multiple features into the same dimensions using superposition.
The Geometry
Imagine representing N features in a d-dimensional space where N >> d. Pure representation would be impossible (we'd need N dimensions). Instead, features are represented as non-orthogonal vectors that partially overlap:
Without Superposition
Each feature gets its own dimension. N features need N dimensions. Clean but capacity-limited.
With Superposition
Features share dimensions. N >> d features packed into d dimensions. More capacity but with interference.
The Trade-off
Superposition allows encoding more features but introduces interference. When features share dimensions, activating one feature partially activates others. The model must manage this interference to make accurate predictions.
The Main Result
The paper proves that when models use superposition to pack features, the resulting interference naturally produces power-law scaling of loss with model size, roughly Loss ∝ 1/d^alpha in the representation dimension d:
Key Insights
- Interference Decreases with Dimension: In higher-dimensional spaces, random vectors are more nearly orthogonal, reducing interference.
- Power-Law Emerges from Geometry: The rate at which orthogonality improves with dimension follows a power law, explaining the scaling behavior.
- Feature Importance Distribution Matters: The exponent alpha depends on how features are distributed by importance (Zipfian, uniform, etc.).
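A quick numerical check of the first insight, assuming nothing beyond NumPy: the average |cosine similarity| between random unit vectors falls roughly as 1/sqrt(d), so interference between superposed features shrinks as the dimension grows.

import numpy as np

rng = np.random.default_rng(0)
for d in (16, 64, 256, 1024):
    v = rng.normal(size=(200, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)        # 200 random unit vectors
    cos = np.abs(v @ v.T)[np.triu_indices(200, k=1)]     # pairwise |cosine| values
    print(f"d={d:5d}  mean |cos| = {cos.mean():.3f}")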
Technical Deep Dive
The Toy Model
The paper develops analysis using a controlled toy model:
- Input: Sparse binary feature vectors
- Task: Reconstruct features from compressed representation
- Architecture: Linear encoder-decoder with bottleneck
- Objective: Minimize reconstruction loss
import numpy as np

# Simplified toy model: N sparse features compressed into d dimensions
def superposition_model(x, W_encode, W_decode):
    # Encode: N features -> d dimensions
    h = W_encode @ x          # shape: (d,)
    # Decode: d dimensions -> N features (with interference)
    x_hat = W_decode @ h      # shape: (N,)
    return x_hat

def reconstruction_loss(x, x_hat):
    # Loss includes interference from other features sharing dimensions
    return np.sum((x - x_hat) ** 2)
Analysis Framework
Using random matrix theory and concentration inequalities, the paper shows:
- Optimal encoding distributes features as nearly-orthogonal vectors
- Expected interference scales as O(N/d) per feature
- Total loss scales as O(N/d) = O(1/d) when N is fixed
- For Zipfian feature importance, this becomes O(1/d^alpha)
Connection to Real LLMs
The paper validates predictions on open LLMs (Pythia, OLMo) by:
- Extracting feature representations using sparse autoencoders
- Measuring actual superposition levels
- Correlating superposition with scaling behavior
- Finding quantitative agreement with theory
When Scaling Laws Break
The superposition framework also predicts when scaling laws should fail:
Scaling Breaks When...
- Features can't be approximated linearly
- Feature co-occurrence is highly structured
- Model is too small for superposition
- Task requires exact feature recovery
Scaling Holds When...
- Many sparse features exist
- Approximate reconstruction suffices
- Features have Zipfian importance
- Model capacity allows superposition
Implications
For Scaling Research
Understanding why scaling laws work helps predict when they'll break. Tasks requiring precise, non-superposed representations may not follow standard scaling predictions.
For Interpretability
Superposition is both a blessing (efficiency) and a curse (interpretability). Features entangled in superposition are hard to disentangle, explaining why neural network interpretability is challenging.
For Architecture Design
Architectures that manage superposition better (e.g., sparse models, modular networks) may achieve better scaling or interpretability trade-offs.
Explore Superposition
Related Topics
Gated Attention
Architectural innovations that improve efficiency.
Diffusion Dynamics
How implicit regularization affects model behavior.
LLM Diversity Analyzer
Measure and visualize text homogeneity across multiple samples.
Attention Mechanism Visualizer
Interactive exploration of attention patterns and gating effects.
Goal-Reaching RL Agent
Watch an agent learn with different network depths.
Diffusion Process Visualizer
Explore forward and reverse diffusion with the generalization window.
Feature Superposition Demo
Visualize how features interfere in low-dimensional spaces.