ThinkLLM


Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

861 papers | 9 this month | 12 topics
Efficiency 37 | Reasoning 36 | Training 35 | Evaluation 29 | Architecture 23 | Agents 23 | Multimodal 17 | Applications 15 | Alignment 9 | Safety 8 | Scaling 8 | Data 3

May 18 – May 24 (4)

Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

May 20, 2026

Benhao Huang, Zhengyang Geng, Zico Kolter

Iterative reasoning models work by learning task-specific attractors in their latent space; scaling test-time compute (more iterations and parallel paths) improves performance on hard problems without needing external verifiers.

This paper explains how AI models can solve hard problems by iteratively refining internal states, like a brain thinking through steps. The key insight is that models learn to create 'attractors'—stable patterns that pull the model toward correct answers.

reasoning, scaling
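To make the attractor idea concrete, here is a minimal sketch. It is not the paper's model: a hypothetical contraction map `step` stands in for the learned update, and the constants and function names are assumptions chosen only for illustration. Iterating the map from different starting states pulls the latent state toward the same fixed point, and more iterations get closer.

```python
def step(z: float, x: float) -> float:
    """Hypothetical update rule: a contraction in z for any fixed input x."""
    return 0.5 * z + 0.5 * x  # the fixed point (attractor) is z* = x


def iterate(x: float, z0: float, n_iters: int) -> float:
    """Refine the latent state z for n_iters steps, starting from z0."""
    z = z0
    for _ in range(n_iters):
        z = step(z, x)
    return z
```

With this toy map the distance to the attractor halves on every iteration, so spending more test-time iterations directly reduces the error regardless of the starting state.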

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

May 20, 2026

Dayal Singh Kalra, Maissam Barkeshli

When scaling up LLM training, use a higher embedding layer learning rate (scaled by model width) to stabilize training and reliably transfer hyperparameters from small to large models—this is the primary reason μP outperforms standard parameterization.

This paper explains why μP (Maximal Update Parameterization) works better than standard parameterization for transferring learning rates across different model sizes. The key finding: μP's advantage mainly comes from using a higher learning rate for the embedding layer, which stabilizes training and improves hyperparameter transfer when scaling up language models.

May 11 – May 17 (3)

Eradicating Negative Transfer in Multi-Physics Foundation Models via Sparse Mixture-of-Experts Routing

May 14, 2026

Ellwil Sharma, Arastu Sharma

Sparse mixture-of-experts routing can solve the problem of conflicting physics domains in foundation models by automatically routing different physics problems to specialized experts while maintaining shared knowledge for universal principles.

This paper tackles negative transfer in multi-physics AI models—where training on different physics problems simultaneously hurts performance. The authors propose Shodh-MoE, which uses sparse expert routing to let different parts of the model specialize in different physics regimes (like fluid dynamics vs. porous media flows) while sharing knowledge where it helps.

architecture, scaling, efficiency

Elastic Attention Cores for Scalable Vision Transformers

May 12, 2026

Alan Z. Song, Yinjie Chen, Mu Nan et al.

You can build efficient vision transformers by routing all patch interactions through a small set of learned core tokens instead of using all-to-all attention, achieving linear complexity without sacrificing performance.

This paper proposes VECA, a vision transformer that replaces quadratic all-to-all attention with linear-time attention using learned "core" tokens as communication hubs. Instead of every patch attending to every other patch, all patches only interact through a small set of learned cores, reducing computation from O(N²) to O(N) while maintaining competitive accuracy on vision tasks.
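The routing idea can be sketched in a few lines of plain Python. This is an illustrative assumption about the general pattern (cores read from patches, then patches read from cores), not VECA's actual architecture; the helper names `attend` and `core_attention` are hypothetical. The sketch also counts query-key scores so the cost difference is visible.

```python
import math


def attend(queries, keys, values):
    """Softmax attention; returns (outputs, number of query-key scores computed)."""
    outputs, n_scores = [], 0
    dim = len(values[0])
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        n_scores += len(scores)
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)
        ])
    return outputs, n_scores


def core_attention(patches, cores):
    """Route patch interactions through cores: cores read patches, patches read cores."""
    updated_cores, s1 = attend(cores, patches, patches)
    new_patches, s2 = attend(patches, updated_cores, updated_cores)
    return new_patches, s1 + s2
```

For N patches and C cores this computes 2·N·C scores instead of N², so with 64 patches and 4 cores the count is 512 rather than 4096. The cost is linear in N as long as C stays fixed.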

May 4 – May 10 (2)

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

May 7, 2026

Minbin Huang, Han Shi, Chuanyang Zheng et al.

You don't need separate expert sets per layer in MoE models—a shared expert pool with independent routers works better and uses fewer parameters, suggesting the standard per-layer expert allocation is unnecessarily wasteful.

UniPool replaces the standard Mixture-of-Experts design, in which each layer has its own expert set, with a single shared pool of experts accessed by all layers. This reduces redundancy and lets expert parameters grow sublinearly with model depth, while improving performance and cutting parameter count by 30-60% compared to standard MoE.

architecture, efficiency, scaling
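A shared pool with independent per-layer routers might look roughly like the sketch below. Everything here is an assumption for illustration: the toy experts, the fixed router weights, the top-1 routing rule, and the names `EXPERT_POOL`, `ROUTERS`, `route`, and `forward` are not taken from the paper.

```python
import math

N_LAYERS, POOL_SIZE, DIM = 4, 3, 2

# One pool of experts shared by every layer (toy elementwise functions).
EXPERT_POOL = [
    lambda h: [2.0 * v for v in h],
    lambda h: [v + 1.0 for v in h],
    lambda h: [-v for v in h],
]

# Each layer keeps its own router: one weight vector per expert in the pool.
ROUTERS = [
    [[math.sin(layer + e + d) for d in range(DIM)] for e in range(POOL_SIZE)]
    for layer in range(N_LAYERS)
]


def route(layer: int, h: list[float]) -> int:
    """Top-1 routing: index of the pool expert with the highest router score."""
    scores = [sum(w * v for w, v in zip(row, h)) for row in ROUTERS[layer]]
    return max(range(POOL_SIZE), key=scores.__getitem__)


def forward(h: list[float]) -> tuple[list[float], list[int]]:
    """Run all layers; every layer picks from the same pool via its own router."""
    chosen = []
    for layer in range(N_LAYERS):
        e = route(layer, h)
        chosen.append(e)
        update = EXPERT_POOL[e](h)
        h = [a + b for a, b in zip(h, update)]  # residual connection
    return h, chosen
```

The point of the sketch is the ownership split: expert parameters live once in the pool, while only the small routers are duplicated per layer.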

Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval

May 6, 2026

Nicholas Barnfield, Juno Kim, Eshaan Nichani et al.

Linear memory systems face a fundamental logarithmic penalty for top-1 retrieval but can achieve quadratic capacity if you only need the correct answer ranked highly rather than first—a distinction that matters for building efficient retrieval systems.

This paper analyzes how many key-value pairs a linear memory matrix can store, showing the answer depends on the retrieval task. For winner-take-all retrieval (finding the single best match), capacity scales as d² ≈ n log n due to extreme-value statistics. For listwise retrieval (keeping the correct answer in a top-k list), capacity improves to d² ≈ n.
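The two scaling relations can be compared numerically. The sketch below ignores constant factors and assumes the logarithm is natural, neither of which the summary specifies; the function names are hypothetical.

```python
import math


def listwise_capacity(d: int) -> float:
    """n such that d^2 ≈ n (constants ignored)."""
    return float(d * d)


def winner_take_all_capacity(d: int) -> float:
    """Solve n * ln(n) = d^2 for n by bisection (constants ignored, d >= 2)."""
    target = float(d * d)
    lo, hi = 2.0, max(target, 4.0)
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid * math.log(mid) < target:
            lo = mid
        else:
            hi = mid
    return lo
```

For any fixed dimension d, `winner_take_all_capacity(d)` comes out smaller than `listwise_capacity(d)` by roughly a logarithmic factor, which is the penalty the summary describes.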

Apr 27 – May 3 (2)

Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

Apr 29, 2026

Andrea Agazzi, Giuseppe Bruno, Eloy Mosig García et al.

Noise in transformers can synchronize token behavior and stabilize learning—a counterintuitive finding that suggests randomness plays a constructive role in how these models process sequences.

This paper proves that transformer models with finite depth and width converge to a stochastic particle system as they scale. The researchers show that token evolution follows a continuous-time process with noise-driven synchronization, meaning random perturbations actually help tokens align rather than diverge.

scaling, architecture, training

Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

Apr 27, 2026

Parsa Ashrafi Fashi, Utkarsh Saxena, Mehdi Rezagholizadeh et al.

You can efficiently extend pretrained LLMs to handle much longer contexts by converting them to hybrid architectures without retraining from scratch—this is more practical than building new models entirely.

This paper presents HyLo, a method to convert pretrained Transformer language models into hybrid architectures that combine Transformers with efficient linear sequence models (like Mamba2). By reusing existing model checkpoints and adding long-context training, HyLo extends context length by 32x while reducing memory use by 90%, enabling 2M-token processing on standard hardware.

Apr 20 – Apr 26 (4)

Spend Less, Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection

Apr 24, 2026

Sijie Li, Shanda Li, Haowei Lin et al.

Use active learning to strategically pick which small experiments to run when fitting scaling laws—you can predict large-scale model performance with 90% less compute by choosing experiments that reduce uncertainty about the target region you care about.

Training large AI models costs millions, and figuring out how they'll scale costs millions more. This paper proposes a smarter way to choose which smaller pilot experiments to run so you can accurately predict how a massive training run will perform, using only about 10% of the budget that naive approaches would need.

scaling, efficiency, training

A Scale-Adaptive Framework for Joint Spatiotemporal Super-Resolution with Diffusion Models

Apr 23, 2026

Max Defez, Filippo Quarenghi, Mathieu Vrac et al.

A single neural network architecture can handle multiple super-resolution scales by adapting just three hyperparameters (noise schedule, context length, and mass conservation), eliminating the need to train separate models for each upscaling factor.

This paper presents a flexible deep-learning framework for video super-resolution that works across different spatial and temporal upscaling factors without retraining from scratch.

Apr 6 – Apr 12 (2)

ADAPTive Input Training for Many-to-One Pre-Training on Time-Series Classification

Apr 9, 2026

Paul Quinlan, Qingguo Li, Xiaodan Zhu

ADAPT solves a critical problem in time-series AI: you can now pre-train on many diverse datasets together instead of just one, making it possible to build generalist foundation models that work across different time-series domains.

This paper introduces ADAPT, a new pre-training method that lets AI models learn from many different time-series datasets simultaneously, even when those datasets have different sizes and structures. By aligning the physical properties of diverse time-series data, the approach enables training a single foundation model on 162 datasets at once—something previous methods couldn't do well.

data, scaling

PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

Apr 7, 2026

David Picard, Nicolas Dufour, Lucas Degeorge et al.

You can replace attention with a linear-time polynomial mixer and get similar results with much faster inference—especially valuable for long sequences where attention becomes prohibitively expensive.

PoM replaces the expensive attention mechanism in transformers with a polynomial-based token mixer that runs in linear time instead of quadratic. It compresses all tokens into a learned polynomial representation, letting each token extract relevant context from this compact form.

Mar 30 – Apr 5 (4)

The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling

Apr 3, 2026

Takuya Shiba

For robot learning systems, discrete action tokenization creates a hard ceiling on performance gains from better vision models—you need to increase action representation capacity, not just encoder quality, to see improvements.

This paper explains why upgrading vision encoders in robot learning models doesn't always improve performance. The key issue is the 'Compression Gap': when robot actions are represented as discrete tokens (like a limited vocabulary), the token codebook becomes an information bottleneck that prevents improvements from better vision encoders from helping.

architecture, scaling, efficiency
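The bottleneck can be quantified with a standard information bound: a token drawn from a codebook of size K carries at most log2(K) bits. The sketch below applies that bound; the codebook size and tokens-per-action figures in the usage note are hypothetical and not from the paper.

```python
import math


def bits_per_action_token(codebook_size: int) -> float:
    """Upper bound on information carried by one discrete token, in bits."""
    return math.log2(codebook_size)


def max_bits_per_action(codebook_size: int, tokens_per_action: int) -> float:
    """Upper bound for an action encoded as a fixed number of tokens."""
    return tokens_per_action * bits_per_action_token(codebook_size)
```

For example, a 256-entry codebook with 7 tokens per action caps each action at 56 bits, no matter how much detail the vision encoder extracts upstream.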

go-mHC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices

Apr 2, 2026

Torque Dandachi, Sophia Diggs-Galligan

go-mHC enables efficient learned mixing of residual streams in transformers with a single tunable hyperparameter that trades off between speed and expressivity, potentially unlocking a new dimension for scaling model capacity.

This paper solves a mathematical problem in neural network design: how to efficiently mix information across different processing paths (residual streams) in transformers.

Mar 23 – Mar 29 (2)

On Neural Scaling Laws for Weather Emulation through Continual Training

Mar 26, 2026

Shashank Subramanian, Alexander Kiefer, Arnur Nigmetov et al.

Neural scaling laws can predict weather model performance and guide efficient resource allocation—models trained with periodic cooldowns outperform standard approaches and enable longer, more accurate forecasts.

This paper studies how neural networks for weather forecasting improve as you scale up the model size, training data, and compute.

scaling, efficiency, training

Scaling Recurrence-aware Foundation Models for Clinical Records via Next-Visit Prediction

Mar 25, 2026

Haresh Rengaraj Rajamohan, Xiang Gao, Weicheng Zhu et al.

Foundation models can effectively predict clinical outcomes from EHR data, but scaling model size alone doesn't improve performance—you need proportionally more training data, and careful handling of repeated events is critical to avoid inflated evaluation metrics.

RAVEN is a foundation model trained on electronic health records (EHRs) from over one million patients to predict what clinical events will happen at a patient's next visit.

Mar 16 – Mar 22 (4)

Optimal Splitting of Language Models from Mixtures to Specialized Domains

Mar 19, 2026

Skyler Seto, Pierre Ablin, Anastasiia Filippova et al.

You can train better domain-specific models by mathematically optimizing how many tokens to spend on general pretraining versus specialized training, rather than using a fixed two-stage recipe.

This paper shows how to efficiently train multiple specialized language models by splitting compute between general pretraining and domain-specific training. Using scaling laws, the authors predict optimal token allocation for each stage, improving performance on reasoning and knowledge tasks across different model sizes.

training, scaling, efficiency

ShapleyLaw: A Game-Theoretic Approach to Multilingual Scaling Laws

Mar 18, 2026

Xuyang Cao, Qianying Liu, Chuan Xiao et al.

By measuring how much each language helps other languages learn during training, you can predict model performance more accurately and find better language mixture ratios than methods that ignore cross-lingual transfer effects.

This paper treats multilingual language model training as a cooperative game where each language contributes to overall performance. It uses game theory to measure how much each language helps others learn (cross-lingual transfer), then uses these insights to predict the best mix of languages for training data.
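The cooperative-game view can be illustrated with the textbook Shapley value, computed exactly by averaging each player's marginal contribution over all orderings. The scores in `TOY_SCORES` are made-up numbers for three languages, and the paper's own estimator may well differ, since exact enumeration grows factorially with the number of languages.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical scores (higher is better) for models trained on each language subset.
TOY_SCORES = {
    frozenset(): 0.0,
    frozenset({"en"}): 10.0,
    frozenset({"de"}): 6.0,
    frozenset({"sw"}): 2.0,
    frozenset({"en", "de"}): 18.0,
    frozenset({"en", "sw"}): 13.0,
    frozenset({"de", "sw"}): 9.0,
    frozenset({"en", "de", "sw"}): 24.0,
}
```

Calling `shapley_values(["en", "de", "sw"], TOY_SCORES.__getitem__)` splits the full-mixture score of 24 among the three languages, crediting each one for both its standalone value and the gains it unlocks in the others.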

Mar 9 – Mar 15 (1)

IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL

Mar 12, 2026

Zhoujun Cheng, Yutao Xie, Yuxiao Qu et al.

When doing RL training on LLMs, increase parallel rollouts per problem as your compute budget grows, but expect diminishing returns; this single principle helps you allocate compute efficiently across sampling and training.

This paper studies how to optimally distribute computing resources when training language models with reinforcement learning. The researchers found that the number of parallel attempts per problem should increase with total compute budget before leveling off, and this pattern holds whether problems are easy or hard—though for different reasons.

scaling, training

A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability

May 18, 2026

Ruitao Liu, Xinyang Tian, Shuo Chen et al.

For distributed model training, executing tasks based on actual readiness rather than pre-committed schedules can dramatically reduce GPU idle time and improve throughput, especially when computation times vary unpredictably.

This paper introduces RRFP, a runtime system that improves GPU training efficiency by executing ready tasks immediately instead of waiting for a pre-planned order. When training large models across multiple GPUs, unpredictable delays in computation cause stages to sit idle.

training, efficiency, scaling
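The difference between a pre-committed order and readiness-driven execution can be shown with a toy single-worker simulation. This is not RRFP itself: the task tuples, the earliest-ready tie-breaking rule, and the function names are assumptions, and real pipeline schedules also involve dependencies and communication that are left out here.

```python
def fixed_order_finish(tasks):
    """Run tasks in list order; wait whenever the next task is not ready yet."""
    t = 0.0
    for ready, duration in tasks:
        t = max(t, ready) + duration
    return t


def readiness_driven_finish(tasks):
    """Always run a task that is already ready; idle only if none is."""
    pending = list(tasks)
    t = 0.0
    while pending:
        ready_now = [task for task in pending if task[0] <= t]
        if not ready_now:
            t = min(task[0] for task in pending)  # jump to the next arrival
            continue
        task = min(ready_now)  # earliest-ready first (ties broken by duration)
        pending.remove(task)
        t += task[1]
    return t


# (ready_time, duration) pairs; the first planned task arrives late.
TASKS = [(5.0, 2.0), (0.0, 2.0), (1.0, 2.0), (2.0, 2.0)]
```

On `TASKS`, the fixed order finishes at time 13.0 because the worker idles until the late first task arrives, while the readiness-driven version finishes at 8.0 by filling that gap with work that is already available.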

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

May 18, 2026

Matthew L. Smith, Jonathan P. Shock, Samuel T. Segun et al.

LLM factual accuracy isn't random—it scales predictably with model size and training data frequency, meaning you can estimate what facts a model will reliably remember based on these two factors.

This paper reveals that LLM factual recall follows a predictable pattern based on two factors: model size and how often a topic appears in training data.

scaling, evaluation, training

A proximal gradient algorithm for composite log-concave sampling

May 12, 2026

Linghai Liu, Sinho Chewi

The algorithm enables efficient sampling from composite log-concave distributions with convergence guarantees that scale well with dimension and condition number, useful for Bayesian inference and optimization problems.

This paper presents a new algorithm for sampling from complex probability distributions that combine two functions (f and g). The method uses gradient information and a special sampling oracle, achieving efficient convergence rates that match the best known results. The approach extends to non-smooth functions and distributions with weaker mathematical properties.

scaling

Generalization at the Edge of Stability

Apr 21, 2026

Mario Tuci, Caner Korkmaz, Umut Şimşekli et al.

Training at the edge of stability (where optimization becomes chaotic) generalizes better because the optimizer converges to a lower-dimensional fractal attractor, and you can predict generalization by measuring the complete structure of the loss landscape's curvature, not just simple summaries.

This paper explains why training neural networks with large learning rates—which causes chaotic, oscillatory behavior—actually improves generalization. The authors model optimizers as random dynamical systems that converge to fractal attractors and introduce 'sharpness dimension' to measure generalization.

training, scaling, reasoning

Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

Apr 20, 2026

A. Sophia Koepke, Daniil Zverev, Shiry Ginosar et al.

Cross-modal alignment between vision and language models is much weaker than previously claimed—it only appears in small-scale experiments and reflects broad semantic overlap, not deep structural convergence.

This paper challenges the popular "Platonic Representation Hypothesis"—the idea that AI models trained on different types of data (like text and images) learn the same underlying representation of reality.

multimodal, evaluation, scaling

Screening Is Enough

Apr 1, 2026

Ken M. Nakanishi

Screening attention removes the need for global competition among keys by using absolute relevance thresholds, achieving 40% parameter reduction and 3.2× faster inference compared to Transformers.

This paper introduces Multiscreen, a language model architecture that replaces standard softmax attention with a 'screening' mechanism. Instead of distributing attention weights across all keys, screening evaluates each key against a threshold to decide which ones are relevant, eliminating the need for keys to compete with each other.

architecture, efficiency, scaling
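The contrast between relative and absolute relevance can be sketched as follows. The summary does not give Multiscreen's actual screening function, so the thresholded rule in `screening_weights` is an assumption made purely to illustrate the idea that each key is judged on its own.

```python
import math


def softmax_weights(scores):
    """Standard attention: weights are relative and always sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def screening_weights(scores, threshold):
    """Assumed screening rule: each key is judged alone against a fixed threshold."""
    return [max(0.0, s - threshold) for s in scores]
```

Under softmax, adding one high-scoring key lowers the weight of every other key, and a set of uniformly irrelevant keys still receives a total weight of 1. Under the thresholded rule, a key's weight depends only on its own score, so irrelevant keys get exactly zero and new keys leave existing weights unchanged.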

Rethinking Language Model Scaling under Transferable Hypersphere Optimization

Mar 30, 2026

Liliang Ren, Yang Liu, Yelong Shen et al.

Hypersphere-constrained optimization enables predictable scaling of language models with a single transferable learning rate, eliminating expensive hyperparameter retuning when scaling up and improving training stability.

This paper introduces HyperP, a framework for scaling language models more efficiently by constraining weights to a hypersphere during training. The key innovation is showing that a single learning rate tuned at small scale transfers reliably across different model sizes, depths, and training amounts—achieving 1.58× better compute efficiency while maintaining training stability.

training, scaling, efficiency
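Constraining weights to a hypersphere is commonly done by taking an ordinary gradient step and then rescaling the weights back to a fixed norm. The sketch below shows that generic projected-SGD pattern; it is an assumption about the general technique, not HyperP's actual optimizer, and the function names are hypothetical.

```python
import math


def project_to_sphere(w, radius=1.0):
    """Rescale w so that its Euclidean norm equals radius (w must be nonzero)."""
    norm = math.sqrt(sum(v * v for v in w))
    return [radius * v / norm for v in w]


def sphere_sgd_step(w, grad, lr, radius=1.0):
    """One plain SGD step followed by projection back onto the hypersphere."""
    stepped = [v - lr * g for v, g in zip(w, grad)]
    return project_to_sphere(stepped, radius)
```

Because the weight norm is pinned after every step, the learning rate controls how far the weights rotate rather than how much they grow. That is one intuition for why a single learning rate could transfer across model sizes.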

GIST: Gauge-Invariant Spectral Transformers for Scalable Graph Neural Operators

Mar 17, 2026

Mattia Rigotti, Nicholas Thumiger, Thomas Frick

GIST enables efficient, mathematically-principled graph transformers that generalize across different mesh resolutions and discretizations, making neural operators practical for large-scale physics simulations.

GIST is a graph transformer that solves a fundamental problem: how to add positional information to graph neural networks without breaking mathematical symmetries or requiring expensive computations.

architecture, scaling, reasoning

Mixture-of-Depths Attention

Mar 16, 2026

Lianghui Zhu, Yuxin Fang, Bencheng Liao et al.

MoDA lets deep language models selectively attend to earlier layers, preventing information loss as models get deeper while adding only 3.7% computational overhead.

This paper introduces Mixture-of-Depths Attention (MoDA), a mechanism that lets attention heads skip layers by accessing key-value pairs from both the current and earlier layers. This solves a problem in very deep language models where useful information gets diluted as it passes through many layers.

architecture, efficiency, scaling
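Attending over key-value pairs from more than one layer can be sketched by concatenating the stored pairs before a normal softmax attention. This is a simplified illustration, not MoDA's implementation: the `kv_by_layer` cache layout, the way earlier layers are chosen, and the function names are all assumptions.

```python
import math


def attend(query, keys, values):
    """Single-query softmax attention over the given key-value pairs."""
    scores = [sum(a * b for a, b in zip(query, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(len(values[0]))
    ]


def depth_aware_attend(query, kv_by_layer, current_layer, earlier_layers):
    """Attend over key-value pairs from the current layer plus chosen earlier layers."""
    keys, values = [], []
    for layer in [*earlier_layers, current_layer]:
        k, v = kv_by_layer[layer]
        keys.extend(k)
        values.extend(v)
    return attend(query, keys, values)
```

Passing an empty `earlier_layers` list reduces this to ordinary attention within the current layer. Including earlier layers lets the query recover values that would otherwise have to survive every intermediate transformation.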