Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.
Torque Dandachi, Sophia Diggs-Galligan
go-mHC enables efficient learned mixing of residual streams in transformers with a single tunable hyperparameter that trades off between speed and expressivity, potentially unlocking a new dimension for scaling model capacity.
This paper addresses a mathematical problem in neural network design: how to efficiently mix information across a transformer's parallel processing paths (residual streams).
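The core idea, several residual streams combined through learned weights, can be illustrated with a minimal NumPy sketch. All names, sizes, and the random stand-in weights below are ours for illustration, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

n_streams, d_model = 4, 8                 # illustrative sizes
streams = rng.normal(size=(n_streams, d_model))

# Learned mixing matrix: each output stream is a weighted
# combination of all input streams (random stand-ins here
# where a trained model would have learned weights).
mix = rng.normal(size=(n_streams, n_streams))

# Row-normalize so each output stream's weights sum to 1.
mix = np.exp(mix) / np.exp(mix).sum(axis=1, keepdims=True)

mixed = mix @ streams                     # (n_streams, d_model)
print(mixed.shape)                        # (4, 8)
```

The structure of `mix` (dense, low-rank, constrained, etc.) is exactly the kind of speed-versus-expressivity knob the summary describes.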
Ken M. Nakanishi
Screening attention removes the need for global competition among keys by using absolute relevance thresholds, achieving a 40% parameter reduction and 3.2× faster inference compared to standard Transformers.
This paper introduces Multiscreen, a language model architecture that replaces standard softmax attention with a 'screening' mechanism. Instead of distributing attention weights across all keys, screening evaluates each key against a threshold to decide which ones are relevant, eliminating the need for keys to compete with each other.
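To see the contrast with softmax, here is a toy sketch of threshold-based screening: each key is judged independently against an absolute threshold, so keys never compete for a fixed weight budget. This is our illustrative reconstruction of the idea, not the paper's exact formulation:

```python
import numpy as np

def screening_attention(q, K, V, tau=0.0):
    """Each key is kept or dropped by comparing its score to an
    absolute threshold tau; no softmax, so no global competition.
    (Illustrative sketch, not Multiscreen's exact mechanism.)"""
    scores = K @ q                                    # (n_keys,)
    # Independent per-key gate: zero below the threshold,
    # sigmoid-scaled contribution above it.
    gates = np.where(scores > tau, 1.0 / (1.0 + np.exp(-scores)), 0.0)
    return gates @ V

rng = np.random.default_rng(1)
q = rng.normal(size=6)
K = rng.normal(size=(10, 6))
V = rng.normal(size=(10, 4))
out = screening_attention(q, K, V)
print(out.shape)                                      # (4,)
```

With softmax, admitting one more key necessarily shrinks every other key's weight; here each key's gate depends only on its own score.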
Shashank Subramanian, Alexander Kiefer, Arnur Nigmetov et al.
Neural scaling laws can predict weather model performance and guide efficient resource allocation—models trained with periodic cooldowns outperform standard approaches and enable longer, more accurate forecasts.
This paper studies how neural networks for weather forecasting improve as you scale up the model size, training data, and compute.
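The scaling-law methodology here follows a familiar pattern: fit a power law to small-scale runs, then extrapolate to predict performance at larger scale. A minimal sketch with made-up numbers (the exponents and losses below are illustrative, not the paper's measurements):

```python
import numpy as np

# Synthetic loss-vs-parameter-count points following a power law
# L(N) = a * N**(-b); a and b here are invented for illustration.
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
a_true, b_true = 5.0, 0.1
L = a_true * N ** (-b_true)

# Fit the law in log space: log L = log a - b * log N.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
b_fit, a_fit = -slope, np.exp(intercept)

# Extrapolate to predict loss at a larger, untrained scale,
# guiding where to spend compute before committing to a big run.
L_pred = a_fit * (1e9) ** (-b_fit)
print(b_fit, L_pred)
```

The same fit-then-extrapolate logic is what lets the authors allocate resources across model size, data, and compute before training the large forecast models.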
Haresh Rengaraj Rajamohan, Xiang Gao, Weicheng Zhu et al.
Foundation models can effectively predict clinical outcomes from EHR data, but scaling model size alone doesn't improve performance—you need proportionally more training data, and careful handling of repeated events is critical to avoid inflated evaluation metrics.
RAVEN is a foundation model trained on electronic health records (EHRs) from over one million patients to predict what clinical events will happen at a patient's next visit.
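The repeated-events pitfall is easy to see with a toy example: a baseline that just predicts a repeat of the previous visit looks strong until events already seen are excluded from evaluation. The events and metric below are invented for illustration, not RAVEN's actual evaluation protocol:

```python
# Toy illustration: a "copy the previous visit" baseline scores
# well when repeated events count, and collapses once evaluation
# is restricted to events that are new at the next visit.
prev_visit = {"HbA1c test", "metformin"}
next_visit = {"HbA1c test", "metformin", "retinal exam"}

predicted = set(prev_visit)  # trivial baseline: predict a repeat

hits_all = len(predicted & next_visit) / len(next_visit)
novel = next_visit - prev_visit
hits_novel = len(predicted & novel) / len(novel) if novel else 0.0
print(hits_all, hits_novel)  # hit rate drops from 2/3 to 0
```

This is why the summary flags careful handling of repeated events as critical: without it, recurring labs and refills dominate the metric.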
Skyler Seto, Pierre Ablin, Anastasiia Filippova et al.
You can train better domain-specific models by mathematically optimizing how many tokens to spend on general pretraining versus specialized training, rather than using a fixed two-stage recipe.
This paper shows how to efficiently train multiple specialized language models by splitting compute between general pretraining and domain-specific training. Using scaling laws, the authors predict optimal token allocation for each stage, improving performance on reasoning and knowledge tasks across different model sizes.
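The optimization itself can be sketched simply: model each stage's contribution with a scaling-law-style loss surrogate, then search over splits of the token budget. The surrogate and all constants below are hypothetical stand-ins, not the paper's fitted laws:

```python
import numpy as np

T = 1e9  # total token budget (illustrative)

def domain_loss(t_general, T=T):
    """Hypothetical scaling-law surrogate: general pretraining
    tokens and domain tokens each reduce loss with diminishing
    returns; the constants are made up for illustration."""
    t_domain = T - t_general
    return 3.0 * (t_general + 1) ** -0.3 + 1.5 * (t_domain + 1) ** -0.4

# Grid-search the split that the surrogate predicts is optimal.
grid = np.linspace(0.0, T, 10001)
best = grid[np.argmin([domain_loss(t) for t in grid])]
print(f"allocate {best / T:.1%} of tokens to general pretraining")
```

In the paper this prediction comes from fitted scaling laws rather than a hand-written surrogate, but the payoff is the same: the split is chosen per model size and domain instead of being a fixed two-stage recipe.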
Xuyang Cao, Qianying Liu, Chuan Xiao et al.
By measuring how much each language helps other languages learn during training, you can predict model performance more accurately and find better language mixture ratios than methods that ignore cross-lingual transfer effects.
This paper treats multilingual language model training as a cooperative game where each language contributes to overall performance. It uses game theory to measure how much each language helps others learn (cross-lingual transfer), then uses these insights to predict the best mix of languages for training data.
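The standard game-theoretic tool for "how much does each player contribute" is the Shapley value: average a language's marginal contribution over all orders in which the coalition could be assembled. A small exact computation with three languages and invented benchmark scores (the value table is ours, not the paper's data):

```python
from itertools import permutations

langs = ["en", "es", "ja"]

# Hypothetical coalition-value function: benchmark score when
# training on a subset of languages. The numbers encode
# cross-lingual transfer, e.g. es and ja each gain from en.
value = {
    frozenset(): 0.0,
    frozenset({"en"}): 0.50,
    frozenset({"es"}): 0.30,
    frozenset({"ja"}): 0.25,
    frozenset({"en", "es"}): 0.70,
    frozenset({"en", "ja"}): 0.65,
    frozenset({"es", "ja"}): 0.50,
    frozenset({"en", "es", "ja"}): 0.85,
}

def shapley(lang):
    """Average marginal contribution of `lang` over all join orders."""
    perms = list(permutations(langs))
    total = 0.0
    for order in perms:
        before = frozenset(order[: order.index(lang)])
        total += value[before | {lang}] - value[before]
    return total / len(perms)

for lang in langs:
    print(lang, round(shapley(lang), 3))
```

Exact enumeration is exponential in the number of languages, so at the scale of real multilingual corpora the contributions would be estimated rather than enumerated; the sketch only shows the quantity being measured.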