Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

861 papers9 this month12 topics

All Efficiency 37 Reasoning 36 Training 35 Evaluation 29 Architecture 23 Agents 23 Multimodal 17 Applications 15 Alignment 9 Safety 8 scaling 8 Data 3

May 18 – May 24(4)

Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

May 20, 2026

Benhao Huang, Zhengyang Geng, Zico Kolter

Iterative reasoning models work by learning task-specific attractors in their latent space; scaling test-time compute (more iterations and parallel paths) improves performance on hard problems without needing external verifiers.

This paper explains how AI models can solve hard problems by iteratively refining internal states, like a brain thinking through steps. The key insight is that models learn to create 'attractors'—stable patterns that pull the model toward correct answers.

reasoningscaling

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

May 20, 2026

Dayal Singh Kalra, Maissam Barkeshli

When scaling up LLM training, use a higher embedding layer learning rate (scaled by model width) to stabilize training and reliably transfer hyperparameters from small to large models—this is the primary reason μP outperforms standard parameterization.

This paper explains why μP (Maximal Update) parameterization works better than standard parameterization for transferring learning rates across different model sizes. The key finding: μP's advantage mainly comes from using a higher learning rate for the embedding layer, which stabilizes training and improves hyperparameter transfer when scaling up language models.

May 11 – May 17(3)

Eradicating Negative Transfer in Multi-Physics Foundation Models via Sparse Mixture-of-Experts Routing

May 14, 2026

Ellwil Sharma, Arastu Sharma

Sparse mixture-of-experts routing can solve the problem of conflicting physics domains in foundation models by automatically routing different physics problems to specialized experts while maintaining shared knowledge for universal principles.

This paper tackles negative transfer in multi-physics AI models—where training on different physics problems simultaneously hurts performance. The authors propose Shodh-MoE, which uses sparse expert routing to let different parts of the model specialize in different physics regimes (like fluid dynamics vs. porous media flows) while sharing knowledge where it helps.

architecturescalingefficiency

Elastic Attention Cores for Scalable Vision Transformers

May 12, 2026

Alan Z. Song, Yinjie Chen, Mu Nan et al.

You can build efficient vision transformers by routing all patch interactions through a small set of learned core tokens instead of using all-to-all attention, achieving linear complexity without sacrificing performance.

This paper proposes VECA, a vision transformer that replaces quadratic all-to-all attention with linear-time attention using learned "core" tokens as communication hubs. Instead of every patch attending to every other patch, all patches only interact through a small set of learned cores, reducing computation from O(N²) to O(N) while maintaining competitive accuracy on vision tasks.

May 4 – May 10(2)

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

May 7, 2026

Minbin Huang, Han Shi, Chuanyang Zheng et al.

You don't need separate expert sets per layer in MoE models—a shared expert pool with independent routers works better and uses fewer parameters, suggesting the standard per-layer expert allocation is unnecessarily wasteful.

UniPool replaces the standard Mixture-of-Experts design where each layer has its own expert set with a single shared pool of experts accessed by all layers. This reduces redundancy and allows expert parameters to grow sublinearly with model depth while improving performance and reducing parameter count by 30-60% compared to standard MoE.

architectureefficiencyscaling

Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval

May 6, 2026

Nicholas Barnfield, Juno Kim, Eshaan Nichani et al.

Linear memory systems face a fundamental logarithmic penalty for top-1 retrieval but can achieve quadratic capacity if you only need the correct answer ranked highly rather than first—a distinction that matters for building efficient retrieval systems.

This paper analyzes how many key-value pairs a linear memory matrix can store, showing the answer depends on the retrieval task. For winner-take-all retrieval (finding the single best match), capacity scales as d² ≈ n log n due to extreme-value statistics. For listwise retrieval (keeping the correct answer in a top-k list), capacity improves to d² ≈ n.

Apr 27 – May 3(2)

Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

Apr 29, 2026

Andrea Agazzi, Giuseppe Bruno, Eloy Mosig García et al.

Noise in transformers can synchronize token behavior and stabilize learning—a counterintuitive finding that suggests randomness plays a constructive role in how these models process sequences.

This paper proves that transformer models with finite depth and width converge to a stochastic particle system as they scale. The researchers show that token evolution follows a continuous-time process with noise-driven synchronization, meaning random perturbations actually help tokens align rather than diverge.

scalingarchitecturetraining

Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

Apr 27, 2026

Parsa Ashrafi Fashi, Utkarsh Saxena, Mehdi Rezagholizadeh et al.

You can efficiently extend pretrained LLMs to handle much longer contexts by converting them to hybrid architectures without retraining from scratch—this is more practical than building new models entirely.

This paper presents HyLo, a method to convert pretrained Transformer language models into hybrid architectures that combine Transformers with efficient linear sequence models (like Mamba2). By reusing existing model checkpoints and adding long-context training, HyLo extends context length by 32x while reducing memory use by 90%, enabling 2M-token processing on standard hardware.

Apr 20 – Apr 26(4)

Spend Less, Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection

Apr 24, 2026

Sijie Li, Shanda Li, Haowei Lin et al.

Use active learning to strategically pick which small experiments to run when fitting scaling laws—you can predict large-scale model performance with 90% less compute by choosing experiments that reduce uncertainty about the target region you care about.

Training large AI models costs millions, and figuring out how they'll scale costs millions more. This paper proposes a smarter way to choose which smaller pilot experiments to run so you can accurately predict how a massive training run will perform, using only about 10% of the budget that naive approaches would need.

scalingefficiencytraining

A Scale-Adaptive Framework for Joint Spatiotemporal Super-Resolution with Diffusion Models

Apr 23, 2026

Max Defez, Filippo Quarenghi, Mathieu Vrac et al.

A single neural network architecture can handle multiple super-resolution scales by adapting just three hyperparameters (noise schedule, context length, and mass conservation), eliminating the need to train separate models for each upscaling factor.

This paper presents a flexible deep-learning framework for video super-resolution that works across different spatial and temporal upscaling factors without retraining from scratch.

Apr 6 – Apr 12(2)

ADAPTive Input Training for Many-to-One Pre-Training on Time-Series Classification

Apr 9, 2026

Paul Quinlan, Qingguo Li, Xiaodan Zhu

ADAPT solves a critical problem in time-series AI: you can now pre-train on many diverse datasets together instead of just one, making it possible to build generalist foundation models that work across different time-series domains.

This paper introduces ADAPT, a new pre-training method that lets AI models learn from many different time-series datasets simultaneously, even when those datasets have different sizes and structures. By aligning the physical properties of diverse time-series data, the approach enables training a single foundation model on 162 datasets at once—something previous methods couldn't do well.

datascaling

PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

Apr 7, 2026

David Picard, Nicolas Dufour, Lucas Degeorge et al.

You can replace attention with a linear-time polynomial mixer and get similar results with much faster inference—especially valuable for long sequences where attention becomes prohibitively expensive.

PoM replaces the expensive attention mechanism in transformers with a polynomial-based token mixer that runs in linear time instead of quadratic. It compresses all tokens into a learned polynomial representation, letting each token extract relevant context from this compact form.

Mar 30 – Apr 5(4)

The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling

Apr 3, 2026

Takuya Shiba

For robot learning systems, discrete action tokenization creates a hard ceiling on performance gains from better vision models—you need to increase action representation capacity, not just encoder quality, to see improvements.

This paper explains why upgrading vision encoders in robot learning models doesn't always improve performance. The key issue is the 'Compression Gap': when robot actions are represented as discrete tokens (like a limited vocabulary), the token codebook becomes an information bottleneck that prevents improvements from better vision encoders from helping.

architecturescalingefficiency

go-$m$HC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices

Apr 2, 2026

Torque Dandachi, Sophia Diggs-Galligan

go-mHC enables efficient learned mixing of residual streams in transformers with a single tunable hyperparameter that trades off between speed and expressivity, potentially unlocking a new dimension for scaling model capacity.

This paper solves a mathematical problem in neural network design: how to efficiently mix information across different processing paths (residual streams) in transformers.

Mar 23 – Mar 29(2)

On Neural Scaling Laws for Weather Emulation through Continual Training

Mar 26, 2026

Shashank Subramanian, Alexander Kiefer, Arnur Nigmetov et al.

Neural scaling laws can predict weather model performance and guide efficient resource allocation—models trained with periodic cooldowns outperform standard approaches and enable longer, more accurate forecasts.

This paper studies how neural networks for weather forecasting improve as you scale up the model size, training data, and compute.

scalingefficiencytraining

Scaling Recurrence-aware Foundation Models for Clinical Records via Next-Visit Prediction

Mar 25, 2026

Haresh Rengaraj Rajamohan, Xiang Gao, Weicheng Zhu et al.

Foundation models can effectively predict clinical outcomes from EHR data, but scaling model size alone doesn't improve performance—you need proportionally more training data, and careful handling of repeated events is critical to avoid inflated evaluation metrics.

RAVEN is a foundation model trained on electronic health records (EHRs) from over one million patients to predict what clinical events will happen at a patient's next visit.

Mar 16 – Mar 22(4)

Optimal Splitting of Language Models from Mixtures to Specialized Domains

Mar 19, 2026

Skyler Seto, Pierre Ablin, Anastasiia Filippova et al.

You can train better domain-specific models by mathematically optimizing how many tokens to spend on general pretraining versus specialized training, rather than using a fixed two-stage recipe.

This paper shows how to efficiently train multiple specialized language models by splitting compute between general pretraining and domain-specific training. Using scaling laws, the authors predict optimal token allocation for each stage, improving performance on reasoning and knowledge tasks across different model sizes.

trainingscalingefficiency

ShapleyLaw: A Game-Theoretic Approach to Multilingual Scaling Laws

Mar 18, 2026

Xuyang Cao, Qianying Liu, Chuan Xiao et al.

By measuring how much each language helps other languages learn during training, you can predict model performance more accurately and find better language mixture ratios than methods that ignore cross-lingual transfer effects.

This paper treats multilingual language model training as a cooperative game where each language contributes to overall performance. It uses game theory to measure how much each language helps others learn (cross-lingual transfer), then uses these insights to predict the best mix of languages for training data.

Mar 9 – Mar 15(1)

IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL

Mar 12, 2026

Zhoujun Cheng, Yutao Xie, Yuxiao Qu et al.

When doing RL training on LLMs, increase parallel rollouts per problem as your compute budget grows, but expect diminishing returns; this single principle helps you allocate compute efficiently across sampling and training.

This paper studies how to optimally distribute computing resources when training language models with reinforcement learning. The researchers found that the number of parallel attempts per problem should increase with total compute budget before leveling off, and this pattern holds whether problems are easy or hard—though for different reasons.

scalingtraining