ThinkLLM


Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

326 papers · 2 this month · 12 topics
All · Efficiency 35 · Reasoning 35 · Multimodal 28 · Applications 28 · Evaluation 27 · Training 26 · Architecture 24 · Agents 24 · Safety 13 · Scaling 5 · Data 5 · Alignment 1

Mar 30 – Apr 5 (3)

go-mHC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices

Apr 2, 2026

Torque Dandachi, Sophia Diggs-Galligan

go-mHC enables efficient learned mixing of residual streams in transformers with a single tunable hyperparameter that trades off between speed and expressivity, potentially unlocking a new dimension for scaling model capacity.

This paper solves a mathematical problem in neural network design: how to efficiently mix information across different processing paths (residual streams) in transformers.
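The key object in the title is the "orthostochastic" matrix: elementwise-squaring an orthogonal matrix yields a doubly stochastic matrix, which can mix residual streams without inflating or shrinking total signal mass. A toy numpy sketch of that construction (illustrative only; the paper's actual parameterization and training setup are more involved):

```python
import numpy as np

def orthostochastic(O):
    # Elementwise square of an orthogonal matrix is doubly stochastic
    # (every row and column sums to 1): an "orthostochastic" matrix.
    return O ** 2

def mix_streams(streams, O):
    # streams: (n_streams, d). Mixing with an orthostochastic matrix
    # forms convex combinations of the residual streams, so the total
    # signal mass across streams is conserved.
    return orthostochastic(O) @ streams

# Toy orthogonal matrix obtained from a QR decomposition.
rng = np.random.default_rng(0)
O, _ = np.linalg.qr(rng.normal(size=(4, 4)))
M = orthostochastic(O)
```

Because rows of `M` sum to 1, mixing a set of identical streams returns them unchanged.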

architecture · efficiency · scaling

Screening Is Enough

Apr 1, 2026

Ken M. Nakanishi

Screening attention removes the need for global competition among keys by using absolute relevance thresholds, achieving 40% parameter reduction and 3.2× faster inference compared to Transformers.

This paper introduces Multiscreen, a language model architecture that replaces standard softmax attention with a 'screening' mechanism. Instead of distributing attention weights across all keys, screening evaluates each key against a threshold to decide which ones are relevant, eliminating the need for keys to compete with each other.
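A rough numpy sketch of the contrast between the two mechanisms (my own toy illustration; the threshold rule and normalization here are assumptions, not the paper's exact formulation):

```python
import numpy as np

def softmax_attention(q, K, V):
    # Standard attention: all keys compete via one global softmax.
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def screening_attention(q, K, V, threshold=0.0):
    # Screening: each key is judged against an absolute threshold,
    # independently of the other keys -- no global competition.
    scores = K @ q / np.sqrt(q.shape[0])
    mask = scores > threshold            # per-key relevance decision
    if not mask.any():
        return np.zeros(V.shape[1])      # no key passed the screen
    w = np.where(mask, scores, 0.0)      # keep scores of passing keys
    w /= w.sum()                         # normalize over survivors only
    return w @ V
```

The point of the threshold form is that adding or removing an irrelevant key never changes whether another key passes the screen.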

architecture

Mar 23 – Mar 29 (2)

On Neural Scaling Laws for Weather Emulation through Continual Training

Mar 26, 2026

Shashank Subramanian, Alexander Kiefer, Arnur Nigmetov et al.

Neural scaling laws can predict weather model performance and guide efficient resource allocation—models trained with periodic cooldowns outperform standard approaches and enable longer, more accurate forecasts.

This paper studies how neural networks for weather forecasting improve as you scale up the model size, training data, and compute.
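The basic tool behind such studies is a power-law fit of loss against scale, done in log space. A minimal sketch with made-up numbers (the coefficients here are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical loss measurements at increasing model sizes,
# generated from an exact power law L = 2.0 * N^(-0.1).
sizes = np.array([1e6, 1e7, 1e8, 1e9])
losses = 2.0 * sizes ** -0.1

# Fit log L = log a - b * log N by least squares.
A = np.vstack([np.ones_like(sizes), np.log(sizes)]).T
coef, *_ = np.linalg.lstsq(A, np.log(losses), rcond=None)
log_a, neg_b = coef

# Extrapolate to a model size we haven't trained yet.
predicted = np.exp(log_a) * (1e10) ** neg_b
```

Once fitted, the law lets you decide whether a larger run is worth its compute before launching it.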

scaling · efficiency · training

Scaling Recurrence-aware Foundation Models for Clinical Records via Next-Visit Prediction

Mar 25, 2026

Haresh Rengaraj Rajamohan, Xiang Gao, Weicheng Zhu et al.

Foundation models can effectively predict clinical outcomes from EHR data, but scaling model size alone doesn't improve performance—you need proportionally more training data, and careful handling of repeated events is critical to avoid inflated evaluation metrics.

RAVEN is a foundation model trained on electronic health records (EHRs) from over one million patients to predict what clinical events will happen at a patient's next visit.

Mar 16 – Mar 22 (4)

Optimal Splitting of Language Models from Mixtures to Specialized Domains

Mar 19, 2026

Skyler Seto, Pierre Ablin, Anastasiia Filippova et al.

You can train better domain-specific models by mathematically optimizing how many tokens to spend on general pretraining versus specialized training, rather than using a fixed two-stage recipe.

This paper shows how to efficiently train multiple specialized language models by splitting compute between general pretraining and domain-specific training. Using scaling laws, the authors predict optimal token allocation for each stage, improving performance on reasoning and knowledge tasks across different model sizes.
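In spirit, the optimization is: given per-stage scaling laws, find the token split that minimizes predicted domain loss under a fixed budget. A toy sketch with invented coefficients (the functional form and numbers are assumptions for illustration):

```python
import numpy as np

# Hypothetical two-term scaling law: loss falls as a power law in both
# general-pretraining tokens and domain-specific tokens.
def domain_loss(general_tokens, domain_tokens,
                a=5.0, alpha=0.3, b=2.0, beta=0.5):
    return a * general_tokens ** -alpha + b * domain_tokens ** -beta

def optimal_split(total_tokens, n_grid=9999):
    # Grid search over the fraction of the budget spent on pretraining.
    fracs = np.linspace(0.01, 0.99, n_grid)
    losses = domain_loss(fracs * total_tokens,
                         (1 - fracs) * total_tokens)
    return fracs[np.argmin(losses)]

frac = optimal_split(1e9)  # best pretraining fraction for 1B tokens
```

With fitted rather than invented coefficients, the same search gives a budget-dependent recipe instead of a fixed two-stage one.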

training · scaling · efficiency

ShapleyLaw: A Game-Theoretic Approach to Multilingual Scaling Laws

Mar 18, 2026

Xuyang Cao, Qianying Liu, Chuan Xiao et al.

By measuring how much each language helps other languages learn during training, you can predict model performance more accurately and find better language mixture ratios than methods that ignore cross-lingual transfer effects.

This paper treats multilingual language model training as a cooperative game where each language contributes to overall performance. It uses game theory to measure how much each language helps others learn (cross-lingual transfer), then uses these insights to predict the best mix of languages for training data.
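Shapley values are the standard game-theoretic way to credit each player for a coalition's performance. A self-contained toy sketch with three "languages" and an invented transfer bonus (the game and numbers are illustrative, not from the paper):

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(players, coalition_value):
    # Exact Shapley computation: average each player's marginal
    # contribution over all coalitions of the other players.
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += weight * (coalition_value(S + (p,))
                                    - coalition_value(S))
    return phi

# Toy "performance" of training on a set of languages; the bonus term
# stands in for cross-lingual transfer between en and de.
def perf(langs):
    base = {'en': 3.0, 'de': 2.0, 'fr': 1.0}
    v = sum(base[l] for l in langs)
    if 'en' in langs and 'de' in langs:
        v += 1.0  # transfer bonus
    return v

phi = shapley_values(('en', 'de', 'fr'), perf)
```

Here `en` and `de` each get half of the transfer bonus credited to them, while `fr` gets only its standalone value, which is exactly the kind of signal a transfer-aware data mixture can exploit.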

Mar 9 – Mar 15 (1)

IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL

Mar 12, 2026

Zhoujun Cheng, Yutao Xie, Yuxiao Qu et al.

When doing RL training on LLMs, increase parallel rollouts per problem as your compute budget grows, but expect diminishing returns; this single principle helps you allocate compute efficiently across sampling and training.

This paper studies how to optimally distribute computing resources when training language models with reinforcement learning. The researchers found that the number of parallel attempts per problem should increase with total compute budget before leveling off, and this pattern holds whether problems are easy or hard—though for different reasons.

scaling · training

Rethinking Language Model Scaling under Transferable Hypersphere Optimization

Mar 30, 2026

Liliang Ren, Yang Liu, Yelong Shen et al.

Hypersphere-constrained optimization enables predictable scaling of language models with a single transferable learning rate, eliminating expensive hyperparameter retuning when scaling up and improving training stability.

This paper introduces HyperP, a framework for scaling language models more efficiently by constraining weights to a hypersphere during training. The key innovation is showing that a single learning rate tuned at small scale transfers reliably across different model sizes, depths, and training amounts—achieving 1.58× better compute efficiency while maintaining training stability.
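The core constraint is simple to state: keep weights on a hypersphere throughout training. A minimal sketch of one way to do that, via projection after each update (this is my own toy illustration; HyperP's actual optimizer and parameterization may differ):

```python
import numpy as np

def project_to_hypersphere(W, radius=1.0):
    # Renormalize each row of the weight matrix to a fixed norm,
    # keeping parameters on a hypersphere.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return radius * W / norms

def sgd_step_on_sphere(W, grad, lr):
    # Plain gradient step followed by projection back to the sphere.
    return project_to_hypersphere(W - lr * grad)

rng = np.random.default_rng(0)
W = project_to_hypersphere(rng.normal(size=(8, 16)))
W = sgd_step_on_sphere(W, rng.normal(size=(8, 16)), lr=0.1)
```

Fixing the weight norm removes one source of scale drift between runs, which is the intuition for why a single learning rate can transfer across model sizes.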

training · scaling · efficiency

GIST: Gauge-Invariant Spectral Transformers for Scalable Graph Neural Operators

Mar 17, 2026

Mattia Rigotti, Nicholas Thumiger, Thomas Frick

GIST enables efficient, mathematically principled graph transformers that generalize across different mesh resolutions and discretizations, making neural operators practical for large-scale physics simulations.

GIST is a graph transformer that solves a fundamental problem: how to add positional information to graph neural networks without breaking mathematical symmetries or requiring expensive computations.
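The symmetry issue is concrete: Laplacian eigenvectors are a natural positional signal for graphs, but each eigenvector is only defined up to a sign (a "gauge"). A toy numpy sketch of one simple way to remove the sign ambiguity, by squaring (illustrative only; GIST's actual construction is more general and also handles eigenspace degeneracies):

```python
import numpy as np

def laplacian(adj):
    # Combinatorial graph Laplacian L = D - A.
    return np.diag(adj.sum(axis=1)) - adj

def spectral_features(adj, k=2):
    # Eigenvectors of L encode position on the graph, but flipping the
    # sign of any eigenvector gives an equally valid basis. Squaring
    # elementwise yields features invariant to that sign gauge.
    _, vecs = np.linalg.eigh(laplacian(adj))
    return vecs[:, 1:k + 1] ** 2  # skip the constant eigenvector

# 4-cycle graph.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
feats = spectral_features(adj)
```

Any sign-invariant positional feature like this gives the same answer no matter which eigenvector basis the solver happened to return.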

architecture · scaling · reasoning

Mixture-of-Depths Attention

Mar 16, 2026

Lianghui Zhu, Yuxin Fang, Bencheng Liao et al.

MoDA lets deep language models selectively attend to earlier layers, preventing information loss as models get deeper while adding only 3.7% computational overhead.

This paper introduces Mixture-of-Depths Attention (MoDA), a mechanism that lets attention heads skip layers by accessing key-value pairs from both the current and earlier layers. This solves a problem in very deep language models where useful information gets diluted as it passes through many layers.
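A minimal numpy sketch of the mechanism as described, attending over key-value pairs pooled from the current and an earlier layer (names and the simple concatenation are my assumptions; the paper's routing is more refined):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moda_attention(q, kv_current, kv_earlier):
    # Concatenate key-value pairs from the current layer with cached
    # pairs from an earlier layer, so a head can re-read information
    # that would otherwise be diluted by depth.
    K = np.concatenate([kv_current[0], kv_earlier[0]], axis=0)
    V = np.concatenate([kv_current[1], kv_earlier[1]], axis=0)
    w = softmax(K @ q / np.sqrt(q.shape[0]))
    return w @ V
```

Since the earlier layer's keys and values are already cached during the forward pass, the extra cost is mostly the wider attention, which is consistent with the small overhead the summary cites.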

architecture · efficiency · scaling