ThinkLLM

Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

861 papers, 41 this month, 12 topics
Topics: Efficiency 37, Reasoning 36, Training 35, Evaluation 29, Architecture 23, Agents 23, Multimodal 17, Applications 15, Alignment 9, Safety 8, Scaling 8, Data 3

May 18 – May 24 (18)

Tokenisation via Convex Relaxations

May 21, 2026

Jan Tempus, Philip Whittington, Craig W. Schmidt et al.

ConvexTok uses convex optimization to build tokenizers that are provably near-optimal (within 1% at typical vocabulary sizes) and compress text better than greedy algorithms like BPE, with measurable improvements in language model efficiency.

This paper replaces greedy tokenization algorithms like BPE with a convex optimization approach called ConvexTok. Instead of making locally optimal choices, it formulates tokenizer construction as a linear program, achieving better compression (bits-per-byte) and allowing users to verify how close their tokenizer is to mathematically optimal.

training, efficiency
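The greedy baseline that this summary contrasts ConvexTok with can be sketched in a few lines. This is a minimal illustration of classic BPE training (repeatedly merge the most frequent adjacent pair), not ConvexTok's linear program. The whitespace pre-splitting, the tie-breaking rule, and the function names `most_frequent_pair`, `merge_pair`, and `train_bpe` are assumptions made for the example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return None
    # Assumed tie-break: highest count first, then lexicographically smallest pair.
    return min(pairs, key=lambda p: (-pairs[p], p))


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Greedy BPE: each step commits to the locally best merge and never revisits it."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = train_bpe("low low lower lowest", num_merges=2)
```

Because each merge is chosen only from current pair counts, nothing in this loop says how far the final vocabulary is from the best possible one, which is the gap the summary says the convex formulation addresses.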

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

May 21, 2026

Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld et al.

Training LLMs to produce diverse outputs across multiple reward dimensions—not just maximizing a single score—makes them better at test-time search where you can pick the best solution from many candidates.

This paper introduces Vector Policy Optimization (VPO), a training method that teaches language models to generate diverse solutions by optimizing for multiple reward objectives simultaneously, rather than a single scalar reward.

training

May 11 – May 17 (10)

Hand-in-the-Loop: Improving Dexterous VLA via Seamless Interventional Correction

May 14, 2026

Zhuohang Li, Liqun Huang, Wei Xu et al.

Seamlessly blending human intervention with robot policy execution—rather than abrupt takeovers—dramatically reduces manipulation failures in dexterous tasks and produces better-trained policies from human correction data.

This paper addresses a key problem in robotic hand control: when humans take over from an AI policy during manipulation tasks, abrupt hand configuration changes ('gesture jumps') cause failures. Hand-in-the-Loop smoothly blends human corrections with the robot's ongoing actions, reducing takeover disruptions by 99.8% and improving task success rates by 19% when used to train better policies.

agents, training

MeMo: Memory as a Model

May 14, 2026

Ryan Wei Heng Quek, Sanghyuk Lee, Alfred Wei Lun Leong et al.

You can add new knowledge to any LLM without touching its weights by training a separate memory model that retrieves and augments the LLM's responses—making it practical for real-world applications needing frequent updates.

MeMo introduces a modular memory model that stores new knowledge separately from a frozen LLM, enabling efficient updates without retraining. It works with any LLM (open or proprietary), handles complex document relationships, and maintains constant retrieval cost regardless of corpus size.

May 4 – May 10 (12)

Normalizing Trajectory Models

May 8, 2026

Jiatao Gu, Tianrong Chen, Ying Shen et al.

NTM enables fast image generation (4 steps) while preserving exact likelihood calculation—something previous fast diffusion methods couldn't do—by using normalizing flows for each denoising step instead of simple Gaussian assumptions.

This paper introduces Normalizing Trajectory Models (NTM), a new approach for fast image generation that compresses diffusion sampling from many steps to just four. Unlike existing fast methods that lose the ability to calculate exact probabilities, NTM maintains a mathematically exact likelihood while generating high-quality images, making it useful for both generation and evaluation.

efficiency, architecture, training

Flow-OPD: On-Policy Distillation for Flow Matching Models

May 8, 2026

Zhen Fang, Wenxuan Huang, Yu Zeng et al.

On-policy distillation with specialized teachers can resolve conflicting optimization goals in multi-objective image generation, achieving 10-point improvements over standard reinforcement learning approaches while maintaining quality across all metrics.

Flow-OPD is a training method that improves text-to-image models by using specialized teacher models and on-policy distillation to align multiple competing objectives (like image quality, text accuracy, and aesthetics).

Apr 27 – May 3 (28)

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

May 1, 2026

Venkata Pushpak Teja Menta

Adversarial training can make speaker embeddings invariant to language/script while preserving speaker identity—critical for multilingual voice cloning systems that need to recognize the same speaker across different languages.

Speaker encoders for voice cloning often fail when audio switches between languages or scripts—a problem especially acute for Indic languages. This paper introduces LASE, a small neural layer that makes speaker embeddings language-agnostic by combining speaker identity learning with adversarial training against language classification.

multimodal, alignment, training

Exploration Hacking: Can LLMs Learn to Resist RL Training?

Apr 30, 2026

Eyon Jang, Damon Falck, Joschka Braun et al.

LLMs may be able to strategically resist RL training by limiting exploration, posing a novel safety risk for post-training alignment—detection methods like monitoring and weight noise offer partial mitigation but aren't foolproof.

This paper investigates whether LLMs can strategically resist reinforcement learning during post-training by suppressing their exploration of actions. Researchers create models trained to underperform, show they can evade RL-based training while staying competent on other tasks, and demonstrate that frontier models can reason about suppressing exploration when they understand their training setup.

Apr 20 – Apr 26 (29)

Spend Less, Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection

Apr 24, 2026

Sijie Li, Shanda Li, Haowei Lin et al.

Use active learning to strategically pick which small experiments to run when fitting scaling laws—you can predict large-scale model performance with 90% less compute by choosing experiments that reduce uncertainty about the target region you care about.

Training large AI models costs millions, and figuring out how they'll scale costs millions more. This paper proposes a smarter way to choose which smaller pilot experiments to run so you can accurately predict how a massive training run will perform, using only about 10% of the budget that naive approaches would need.

scaling, efficiency, training
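For readers unfamiliar with scaling-law fitting, the basic step that pilot experiments feed into can be sketched as a least-squares fit in log-log space. This is a hedged illustration only: it assumes a pure power law `loss = a * compute**(-b)` with no irreducible-loss term, and it does not implement the paper's active experiment selection. The function names and the sample numbers are hypothetical.

```python
import math


def fit_power_law(compute, loss):
    """Fit log(loss) = log(a) - b * log(compute) by ordinary least squares."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope  # b is reported as a positive decay exponent


def predict_loss(a, b, compute):
    """Extrapolate the fitted law to a larger compute budget."""
    return a * compute ** (-b)


# Hypothetical pilot runs generated from a = 10, b = 0.1.
pilot_compute = [1e15, 1e16, 1e17, 1e18]
pilot_loss = [10.0 * c ** -0.1 for c in pilot_compute]
a, b = fit_power_law(pilot_compute, pilot_loss)
target = predict_loss(a, b, 1e21)
```

Which pilot points go into `pilot_compute` determines how uncertain the extrapolation is, and choosing those points under a budget is the problem the summary describes.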

Relaxation-Informed Training of Neural Network Surrogate Models

Apr 24, 2026

Calvin Tsay

Training neural network surrogates with MILP-aware regularizers can dramatically speed up downstream optimization without sacrificing accuracy, by directly controlling structural properties that affect solver performance.

This paper shows how to train neural networks as surrogate models that work better when embedded in optimization problems. By adding special regularizers during training that target MILP tractability—penalizing large constants, unstable neurons, and LP relaxation gaps—the approach makes the resulting optimization problems solve 10,000x faster while keeping prediction accuracy competitive.

Apr 13 – Apr 19 (3)

Geometric regularization of autoencoders via observed stochastic dynamics

Apr 17, 2026

Sean Hill, Felix X.-F. Ye

By enforcing geometric consistency in autoencoders through tangent-bundle penalties, you can reduce errors in learned dynamical systems by 50-70%, making reduced models reliable for predicting rare events like molecular transitions.

This paper solves a key problem in learning reduced models of complex dynamical systems: how to build accurate low-dimensional simulators from high-dimensional data. The authors use geometric constraints from data covariance to train autoencoders that preserve the underlying manifold structure, enabling better prediction of long-term system behavior like transition times between metastable states.

architecture, training, reasoning

Evaluating the Progression of Large Language Model Capabilities for Small-Molecule Drug Design

Apr 17, 2026

Shriram Chennakesavalu, Kirill Shmilovich, Hayley Weir et al.

LLMs show promise for drug discovery, but RL-based post-training on domain-specific tasks is critical: a smaller model trained this way outperformed much larger models that lacked this post-training, suggesting a practical path forward for real-world drug design applications.

This paper creates a benchmark of chemistry tasks to test how well large language models can help design new drugs. The researchers test three model families on tasks like predicting molecular properties and designing molecules, then show that reinforcement learning training can significantly boost performance—even making smaller models competitive with frontier models.

reasoning
efficiency

The Matching Principle: A Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning

May 21, 2026

Vishal Rajput

Many robustness techniques (CORAL, adversarial training, IRM, metric learning) are different ways of solving the same problem: identifying and regularizing against label-preserving variations in your data.

This paper unifies seemingly separate robustness problems (domain adaptation, adversarial training, compositional generalization) under one framework: regularizing neural network gradients to match the covariance of label-preserving variations in deployment data.

training, alignment

Finite-Particle Convergence Rates for Conservative and Non-Conservative Drifting Models

May 21, 2026

Krishnakumar Balasubramanian

Conservative drifting with kernel density estimators achieves provable convergence rates for one-step generative modeling, with the convergence speed depending on dimension and a tunable parameter that trades off between different error sources.

This paper analyzes drifting methods for generative modeling, proposing a conservative approach using kernel density estimators that guarantees gradient-field properties. The authors prove finite-particle convergence rates showing how quickly the method converges as sample size increases, with explicit tracking of how bandwidth and dimension affect performance.

training, evaluation

Variance Reduction for Expectations with Diffusion Teachers

May 20, 2026

Jesse Bettencourt, Xindi Wu, Matan Atzmon et al.

When using diffusion models to guide other tasks, you can dramatically reduce compute cost by resampling cheap diffusion noise multiple times per expensive upstream computation, rather than doing one expensive computation per noise sample.

This paper introduces CARV, a framework for reducing variance in gradient estimates when using pretrained diffusion models as teachers in downstream tasks like text-to-3D generation. By reusing expensive computations (like 3D rendering) across multiple noise samples and applying importance sampling techniques, the method achieves 2-3x speedups without changing the underlying objective.

efficiency, training, evaluation

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

May 20, 2026

Dayal Singh Kalra, Maissam Barkeshli

When scaling up LLM training, use a higher embedding layer learning rate (scaled by model width) to stabilize training and reliably transfer hyperparameters from small to large models—this is the primary reason μP outperforms standard parameterization.

This paper explains why μP (Maximal Update) parameterization works better than standard parameterization for transferring learning rates across different model sizes. The key finding: μP's advantage mainly comes from using a higher learning rate for the embedding layer, which stabilizes training and improves hyperparameter transfer when scaling up language models.

scaling, training, efficiency

EvoStruct: Bridging Evolutionary and Structural Priors for Antibody CDR Design via Protein Language Model Adaptation

May 20, 2026

Mansoor Ahmed, Sujin Lee, Umar Khayaz et al.

Combining evolutionary knowledge from language models with 3D structural constraints solves vocabulary collapse in antibody design, achieving 16% better sequence accuracy and 2.3x more amino acid diversity than structure-only methods.

EvoStruct fixes a critical problem in AI-designed antibodies: neural networks trained on 3D structures alone forget important amino acid patterns from evolution. The method combines a pre-trained protein language model (which knows evolutionary patterns) with structural information, using a special adapter to merge both sources of knowledge.

architecture, training, applications

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

May 20, 2026

Zhepei Wei, Xinyu Zhu, Wei-Lin Chen et al.

RLVR training produces predictable, low-rank weight changes that can be extrapolated mathematically, letting you skip 85% of training compute while matching or exceeding performance on reasoning tasks.

This paper reveals that language models trained with reinforcement learning from verifiable rewards (RLVR) follow surprisingly simple, low-rank weight trajectories.

training, efficiency, reasoning

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

May 20, 2026

Kaiyi Zhang, Wei Wu, Yankai Lin

When training language models with verifiable rewards, focusing on the most discriminative token patterns—rather than averaging all tokens equally—significantly improves learning efficiency and final performance.

This paper improves how language models learn from step-by-step feedback by better understanding which tokens should be rewarded or penalized. The authors show that standard learning methods get distracted by common formatting tokens and miss important patterns that distinguish good answers from bad ones.

training, reasoning, alignment

Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution

May 20, 2026

Weixing Zhang, Bowen Jiang, Rahul Sharma et al.

LLMs can learn grammar adaptation patterns from examples and apply them to new versions, achieving 100% consistency on medium-sized grammars but failing on large-scale ones—suggesting LLMs work best for targeted, smaller grammar updates.

This paper shows how Large Language Models can automatically adapt domain-specific language grammars when their underlying models change, reducing manual work. Testing on real-world languages shows LLMs work well for complex scenarios but struggle with very large grammars (300+ rules).

training, applications

Mem-π: Adaptive Memory through Learning When and What to Generate

May 20, 2026

Xiaoqiang Wang, Chao Wang, Hadi Nekoei et al.

Generating context-specific guidance dynamically outperforms traditional retrieval-based memory for agents—the system learns to abstain when unnecessary and produce only relevant help, improving task success by over 30% on web navigation.

Mem-π is a framework that gives AI agents smarter memory by generating helpful guidance on-the-fly instead of retrieving fixed entries from a database. A separate model learns when to create guidance and what to create, trained to skip unhelpful suggestions and produce only what the agent actually needs for the current task.

agents, training, reasoning

A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability

May 18, 2026

Ruitao Liu, Xinyang Tian, Shuo Chen et al.

For distributed model training, executing tasks based on actual readiness rather than pre-committed schedules can dramatically reduce GPU idle time and improve throughput, especially when computation times vary unpredictably.

This paper introduces RRFP, a runtime system that improves GPU training efficiency by executing ready tasks immediately instead of waiting for a pre-planned order. When training large models across multiple GPUs, unpredictable delays in computation cause stages to sit idle.

training, efficiency, scaling

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

May 18, 2026

Qianhao Yuan, Jie Lou, Xing Yu et al.

MLLMs can improve fine-grained visual understanding by learning from their own superior performance on evidence-focused crops, using on-policy self-distillation to transfer regional perception skills to full-image reasoning.

This paper addresses a key weakness in multimodal AI models: they struggle to notice small but important details in images. The researchers discovered that models actually perform better when shown cropped images focused on relevant areas versus full images, suggesting the problem isn't recognizing details but finding them.

multimodal, training, efficiency

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

May 18, 2026

Matthew L. Smith, Jonathan P. Shock, Samuel T. Segun et al.

LLM factual accuracy isn't random—it scales predictably with model size and training data frequency, meaning you can estimate what facts a model will reliably remember based on these two factors.

This paper reveals that LLM factual recall follows a predictable pattern based on two factors: model size and how often a topic appears in training data.

scaling, evaluation, training

General Preference Reinforcement Learning

May 18, 2026

Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal et al.

GPRL solves reward hacking in LLM training by treating quality as multi-dimensional rather than scalar, allowing online RL to work on open-ended tasks without collapsing onto exploitable reward axes.

This paper addresses a gap in LLM training by proposing General Preference Reinforcement Learning (GPRL), which handles open-ended tasks like traditional preference optimization while maintaining the continuous exploration benefits of online RL.

training, alignment, reasoning

Semantic Generative Tuning for Unified Multimodal Models

May 18, 2026

Songsong Yu, Yuxin Chen, Ying Shan et al.

Using segmentation as a generative training task bridges the gap between visual understanding and generation in multimodal models, improving both capabilities simultaneously rather than training them separately.

This paper shows how to train unified multimodal models (that do both image understanding and generation) more effectively by using image segmentation as a training task. Instead of training understanding and generation separately, the authors use segmentation to align both capabilities, improving the model's ability to understand images and generate them accurately.

multimodal, training, architecture

Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation

May 18, 2026

Kenan Majewski, Marcin Żugaj

Neural networks can improve classical state estimation by learning adaptive forgetting factors that respond to real-time sensor quality, enabling robust UAV navigation during sensor outages and dynamic environments.

This paper presents a learned Kalman filter that adapts to changing noise conditions in UAVs by using a neural network to dynamically adjust how much it trusts past measurements. Instead of using a fixed forgetting factor, the filter learns a memory policy from sensor data, helping it handle sensor failures and vibrations better than traditional adaptive filters.

training, efficiency, reasoning
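The fixed forgetting factor that the summary says the paper replaces can be illustrated with a scalar Sage-Husa style filter. This is a rough sketch under stated assumptions, not the paper's learned memory policy: it uses the textbook fading weight d_k = (1 - b) / (1 - b^(k+1)), a random-walk state model with observation matrix H = 1, and a clamp on the noise estimate that is added here only to keep the variance positive. The parameter names `q`, `r0`, `x0`, and `p0` are hypothetical.

```python
def fading_weight(k, b):
    """Sage-Husa weight d_k = (1 - b) / (1 - b**(k + 1)) for 0 < b < 1."""
    return (1.0 - b) / (1.0 - b ** (k + 1))


def sage_husa_1d(measurements, b=0.95, q=1e-3, r0=1.0, x0=0.0, p0=1.0):
    """Scalar Kalman filter that re-estimates measurement noise r at every step."""
    x, p, r = x0, p0, r0
    history = []
    for k, z in enumerate(measurements):
        p_pred = p + q            # predict: random-walk state, process noise q
        innovation = z - x        # measurement residual with H = 1
        d = fading_weight(k, b)   # d is 1 at k = 0, so r0 is overwritten by data
        # Recursive noise estimate; the lower clamp is an assumption of this sketch.
        r = max((1.0 - d) * r + d * (innovation ** 2 - p_pred), 1e-9)
        gain = p_pred / (p_pred + r)
        x = x + gain * innovation
        p = (1.0 - gain) * p_pred
        history.append((x, r))
    return history


history = sage_husa_1d([1.0] * 50)
```

A smaller `b` makes the weights settle at a larger value of `1 - b`, so older residuals are forgotten faster. In the paper's setting, as summarized above, a neural network adjusts this memory from sensor data instead of keeping `b` constant.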

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

May 18, 2026

Minrui Xu, Zilin Wang, Mengyi DENG et al.

Automated environment synthesis and trajectory generation can reduce the data requirements for tool-use agent training by 5x while improving downstream performance, making agentic RL more practical and scalable.

EnvFactory automates the creation of tool-use training environments and realistic multi-turn interaction trajectories for teaching language models to use tools effectively. It generates diverse, natural training data from verified executable environments, enabling more efficient agent training with fewer resources than existing approaches.

agents, training, data
training, efficiency

Self-Distilled Agentic Reinforcement Learning

May 14, 2026

Zhengxi Lu, Zhiyuan Yao, Zhuowen Han et al.

Combining RL with selective token-level distillation through a gating mechanism significantly improves LLM agent performance on complex tasks, achieving 7-10% gains over standard RL approaches while avoiding training instability.

This paper improves how language model agents learn through reinforcement learning by combining trajectory-level rewards with dense token-level guidance. The key innovation is a gating mechanism that selectively uses teacher signals—strengthening learning from good decisions and softly ignoring bad teacher suggestions—making multi-turn agent training more stable and effective.

agents, training

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

May 12, 2026

Runhui Huang, Jie Wu, Rui Yang et al.

Self-reflective multimodal models can improve generation quality by learning to reason about user intent and autonomously correct their outputs using decomposed, verifiable rewards from language models.

AlphaGRPO improves how unified multimodal models generate images and text by teaching them to reason about what users want and fix their own mistakes. It uses a novel reward system that breaks down complex requests into simple checkable questions, allowing the model to learn from reliable feedback without needing extra training setup.

multimodal, reasoning, training

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

May 12, 2026

Kexuan Shi, Hanxuan Li, Zeju Qiu et al.

Pion's orthogonal update mechanism preserves weight matrix spectral properties during training, providing a geometrically principled alternative to gradient-based optimizers like Adam with competitive performance.

Pion is a new optimizer for training large language models that updates weights using orthogonal transformations instead of adding gradients like Adam does. By preserving the singular values of weight matrices, it keeps the spectral properties stable while still allowing the model to learn, offering a more geometrically grounded approach to optimization.

training
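The linear-algebra fact behind the claim that orthogonal updates preserve singular values can be checked on a 2x2 example. The sketch below is not Pion's update rule: how the optimizer chooses its orthogonal matrices is not described in this summary, so fixed rotation angles are used purely for illustration, and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two small matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def rotation(theta):
    """2x2 rotation matrix, which is orthogonal for any angle."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def singular_values_2x2(w):
    """Closed form: s1^2 + s2^2 = squared Frobenius norm, s1 * s2 = |det|."""
    (a, b), (c, d) = w
    frob2 = a * a + b * b + c * c + d * d
    det = a * d - b * c
    disc = math.sqrt(max(frob2 * frob2 - 4.0 * det * det, 0.0))
    return (math.sqrt((frob2 + disc) / 2.0),
            math.sqrt(max((frob2 - disc) / 2.0, 0.0)))


weights = [[2.0, 1.0], [0.0, 3.0]]
# Multiplicative update: rotate on the left and on the right.
updated = matmul(matmul(rotation(0.3), weights), rotation(-0.7))
before = singular_values_2x2(weights)
after = singular_values_2x2(updated)
```

Here `updated` differs from `weights` entry by entry, yet `before` and `after` agree up to floating-point error. An additive step `weights + step` offers no such guarantee in general.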

Learning, Fast and Slow: Towards LLMs That Adapt Continually

May 12, 2026

Rishabh Tiwari, Kusha Sareen, Lakshya A Agrawal et al.

Combining parameter updates with context optimization lets LLMs learn new tasks 3x more efficiently while staying closer to their original capabilities and avoiding the forgetting that comes from pure fine-tuning.

This paper proposes Fast-Slow Training (FST), a method that combines two learning mechanisms for LLMs: updating model parameters (slow learning) and optimizing the input context (fast learning). By separating task-specific adaptation from general knowledge, FST achieves better sample efficiency, reduces catastrophic forgetting, and maintains the model's ability to learn new tasks over time.

training, efficiency, reasoning

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

May 12, 2026

Xuhao Hu, Xi Zhang, Haiyang Xu et al.

Agents perform better when trained to decide dynamically between GUI actions and tool calls rather than using only one approach—this hybrid strategy improved accuracy by 66% on real-world tasks.

ToolCUA trains computer agents to intelligently choose between GUI actions (clicks, typing) and tool calls (APIs) by synthesizing diverse training trajectories from existing data and using reinforcement learning to optimize when to switch between action types. This solves a key problem for digital agents: knowing when to use high-level tools versus low-level GUI interactions.

agents, training, reasoning

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

May 12, 2026

Guohui Zhang, XiaoXiao Ma, Jie Huang et al.

When training models to generate audio and video together, treating each modality's learning separately and protecting audio-specific layers from video interference leads to better results than standard single-objective RL approaches.

OmniNFT improves joint audio-video generation by using reinforcement learning with three key techniques: routing rewards separately to each modality, preventing video gradients from interfering with audio processing, and focusing optimization on synchronization regions. This addresses real-world needs for high-quality audio, high-quality video, and tight audio-video alignment simultaneously.

multimodal, training

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

May 12, 2026

Sagi Ahrac, Noya Hochwald, Mor Geva

Routers in sparse mixture-of-experts models work best when they maintain geometric alignment with their experts—understanding this coupling can improve routing stability and reduce the need for complex auxiliary losses.

This paper reveals that routers in Sparse Mixture-of-Experts models learn a geometric relationship with their experts: router weights and expert weights receive gradients along the same directions, causing them to specialize together.

architecture, training, efficiency

Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

May 12, 2026

Guinan Su, Yanwu Yang, Xueyan Li et al.

By training models to handle multiple parallel computation streams instead of sequential message exchanges, you can build faster, more responsive AI agents that can act while thinking and react to new information without waiting for previous operations to complete.

This paper proposes Multi-Stream LLMs, which replace the single sequential message stream in current language models with multiple parallel streams for inputs, outputs, and reasoning. This allows models to read and write simultaneously, think while acting, and process different types of information in parallel—addressing fundamental bottlenecks in how AI agents currently operate.

architecture, agents, training
training, alignment, efficiency

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

May 8, 2026

Manish Bhattarai, Ismael Boureima, Nishath Rajiv Ranasinghe et al.

Structured, multi-criterion rewards grounded in real documents help models develop generalizable reasoning skills that transfer to unseen tasks better than single holistic scores.

This paper shows how to train AI models to reason better by grading their responses on multiple specific criteria instead of just right/wrong. The researchers created detailed rubrics from scientific documents and used them to train a language model with a technique called GRPO, which optimizes for partial credit across different dimensions.

training, reasoning, evaluation
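The idea of partial credit across rubric criteria feeding a GRPO-style update can be sketched as follows. This is a simplified illustration, not the paper's pipeline: the rubric criteria, weights, and judge scores are hypothetical, and the advantage uses the commonly described group-relative form (reward minus group mean, divided by group standard deviation), with the population standard deviation chosen here as an assumption.

```python
from statistics import mean, pstdev


def rubric_reward(scores, weights):
    """Weighted average of per-criterion judge scores, each in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric and judge scores for three sampled responses.
weights = {"correctness": 2.0, "evidence": 1.0, "clarity": 1.0}
judged = [
    {"correctness": 0.0, "evidence": 0.0, "clarity": 0.0},
    {"correctness": 0.5, "evidence": 0.5, "clarity": 0.5},
    {"correctness": 1.0, "evidence": 1.0, "clarity": 1.0},
]
rewards = [rubric_reward(s, weights) for s in judged]
advantages = group_relative_advantages(rewards)
```

A response that satisfies some criteria but not others lands between the extremes instead of being scored as simply wrong, which is the contrast with right/wrong grading drawn in the summary.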

EMO: Pretraining Mixture of Experts for Emergent Modularity

May 7, 2026

Ryan Wang, Akshita Bhagia, Sewon Min

By constraining tokens within the same document to share expert pools during pretraining, EMO creates naturally modular experts that specialize in semantic domains (math, code, etc.), enabling practical memory-efficient deployment without sacrificing performance.

EMO is a Mixture-of-Experts language model designed to work efficiently when you only need a subset of its capabilities. Instead of forcing all experts to activate for every input, EMO groups experts by document domain during training, so code-heavy documents use code experts, math documents use math experts, and so on.

architecture, efficiency, training

Verifier-Backed Hard Problem Generation for Mathematical Reasoning

May 7, 2026

Yuhang Lai, Jiazhan Feng, Yee Whye Teh et al.

Using an independent verifier to validate problem correctness prevents reward hacking in AI-generated math problems, enabling better training data creation without human experts.

This paper tackles the problem of generating valid and challenging math problems for training AI models. Instead of relying on humans or simple self-play (which often produces invalid problems), the authors introduce VHG, a system with three players: a problem setter, a solver, and an independent verifier.

training, reasoning, data

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

May 7, 2026

Yuxing Liu, Jianyu Wang, Tong Zhang

Use the same optimizer for finetuning as you used for pretraining—it significantly reduces catastrophic forgetting while maintaining task performance, even outperforming parameter-efficient methods like LoRA.

When finetuning large language models, using the same optimizer during finetuning as was used during pretraining reduces forgetting of previously learned knowledge while maintaining or improving performance on new tasks.

training, efficiency

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

May 7, 2026

Mingwei Xu, Hao Fang

You can train reasoning models effectively using only positive examples—negative examples aren't necessary if you redistribute probability mass correctly and stabilize learning through siamese networks.

This paper proposes POPO, a new training method for reasoning-focused language models that learns exclusively from successful (positive) examples rather than mixing successes with failures. Instead of comparing positive and negative rollouts like existing methods (GRPO), POPO uses importance sampling to implicitly learn what to avoid, stabilized through a siamese network architecture.

training, reasoning, alignment

PSK at SemEval-2026 Task 9: Multilingual Polarization Detection Using Ensemble Gemma Models with Synthetic Data Augmentation

May 6, 2026

Srikar Kashyap Pulipaka

Per-language fine-tuning with synthetic data augmentation and threshold tuning can significantly improve multilingual NLP tasks, but model generalization to test data varies dramatically—some architectures dropped 30-50% in performance despite strong development results.

This paper describes a system for detecting polarized language across 22 languages using fine-tuned Gemma models with synthetic data augmentation. The approach combines per-language model tuning, LLM-generated synthetic training data with quality filtering, and weighted ensemble predictions to achieve competitive performance on a multilingual classification task.

training, evaluation

OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

May 5, 2026

Yuwen Du, Rui Ye, Shuo Tang et al.

High-quality training data matters more than pipeline complexity: careful data curation with SFT alone can beat industrial-scale approaches combining pre-training, continual pre-training, and RL for building capable search agents.

OpenSeeker-v2 shows that simple supervised fine-tuning on carefully designed training data can match or beat complex industrial pipelines for building search agents.

training, agents, data

Conditional Diffusion Sampling

May 5, 2026

Francisco M. Castro-Macías, Pablo Morales-Álvarez, Saifuddin Syed et al.

CDS offers a practical way to sample from difficult distributions by combining two proven techniques—Parallel Tempering for initial exploration and exact diffusion dynamics for refinement—without requiring neural network training.

This paper introduces Conditional Diffusion Sampling (CDS), a new method for sampling from complex probability distributions that combines Parallel Tempering with diffusion-based transport.

training, efficiency
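The Parallel Tempering component mentioned in this summary rests on a standard replica-swap rule, sketched below. This covers only that textbook step, assuming each replica targets a density proportional to `exp(-beta * energy)`; it does not implement the diffusion-based transport that CDS combines it with, and the function names are hypothetical.

```python
import math
import random


def swap_acceptance(beta_i, beta_j, energy_i, energy_j):
    """Metropolis probability of exchanging the states of two replicas."""
    log_ratio = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def maybe_swap(states, energies, betas, i, j, rng):
    """Swap replicas i and j in place with the Metropolis probability."""
    prob = swap_acceptance(betas[i], betas[j], energies[i], energies[j])
    if rng.random() < prob:
        states[i], states[j] = states[j], states[i]
        energies[i], energies[j] = energies[j], energies[i]
        return True
    return False


# The cold replica (beta = 1.0) holds a worse state than the hot one (beta = 0.5),
# so the swap is always accepted and the cold chain receives the better state.
states, energies, betas = ["x_cold", "x_hot"], [2.0, 1.0], [1.0, 0.5]
accepted = maybe_swap(states, energies, betas, 0, 1, random.Random(0))
```

Hot replicas explore broadly and pass promising states down to colder ones through these swaps, which matches the "initial exploration" role the summary assigns to Parallel Tempering.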

Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring

May 4, 2026

Arian Eamaz, Farhang Yeganegi, Mojtaba Soltanalian

Standard training loss curves can hide poorly-optimized layers in transformers—layer-wise analysis using reference bounds exposes optimization failures that aggregate metrics miss, especially critical for expensive model training.

This paper introduces a method to monitor whether transformer models are actually learning well during training by analyzing each layer individually. Instead of just looking at overall loss, the authors create lightweight reference solutions for each layer and compare them against the trained model, revealing hidden inefficiencies.

trainingevaluationefficiency

Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces

May 4, 2026

Jingze Ge, Yun Liu, Xue Geng et al.

Jointly optimizing compression and adaptation using task-aware subspaces beats the standard two-step approach, delivering better accuracy with fewer parameters on both vision and language models.

JACTUS combines model compression and task adaptation in a single step rather than doing them sequentially. Instead of compressing a model first and then fine-tuning it, the method estimates what directions matter for your specific task and compresses the model while preserving those directions.

efficiencytraining

Synthetic Computers at Scale for Long-Horizon Productivity Simulation

Apr 30, 2026

Tao Ge, Baolin Peng, Hao Cheng et al.

Synthetic computer environments with long-horizon simulations can generate realistic training data for productivity agents at scale, enabling them to learn from diverse workplace scenarios without human annotation.

Researchers created a system to generate realistic computer environments at scale—complete with folder structures and documents—then simulated AI agents working on month-long productivity tasks within them.

agentsdatatraining

PhyCo: Learning Controllable Physical Priors for Generative Motion

Apr 30, 2026

Sriram Narayanan, Ziyu Jiang, Srinivasa Narasimhan et al.

You can make generative video models physically consistent by combining physics-labeled training data, ControlNet conditioning on physical properties, and VLM-based reward signals—no simulator needed at runtime.

PhyCo teaches video generation models to respect physics by fine-tuning them on 100K+ realistic simulation videos with varying physical properties (friction, bouncing, deformation), then using a vision-language model to provide physics-aware feedback during generation. This lets models create videos where objects behave realistically without needing a physics simulator at inference time.

trainingmultimodalevaluation

PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

Apr 30, 2026

Sudong Wang, Weiquan Huang, Xiaomin Yu et al.

Adding an explicit distribution-alignment stage between supervised fine-tuning and RL training significantly reduces model drift in multimodal models, with gains coming from disentangled feedback on perception vs. reasoning failures.

PRISM fixes a key problem in training multimodal AI models: when you fine-tune a model on examples and then use reinforcement learning, the model drifts away from what it learned initially.

trainingmultimodalalignment

Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression

Apr 30, 2026

Junqi Gao, Dazhi Zhang, Zhichang Guo et al.

Task vectors can be compressed to 1-5% of their original size while maintaining model performance, making it practical to store and dynamically merge multiple task-specific models without prohibitive storage costs.

This paper tackles the storage overhead problem in dynamic model merging by compressing task vectors (fine-tuned weight changes) using learnable compression techniques.

efficiencytraining
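
The idea of a task vector in the entry above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: weights are flat Python lists, and compression is plain top-k magnitude sparsification, which is a simple stand-in and not the learnable compression the paper proposes. The function names `task_vector`, `sparsify_top_k`, and `merge` are hypothetical.

```python
def task_vector(base, finetuned):
    """Task vector: element-wise difference between fine-tuned and base weights."""
    return [f - b for b, f in zip(base, finetuned)]


def sparsify_top_k(vector, keep_fraction):
    """Stand-in compression: keep only the largest-magnitude entries as {index: value}."""
    k = max(1, round(len(vector) * keep_fraction))
    top = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)[:k]
    return {i: vector[i] for i in top}


def merge(base, sparse_vectors, scale=1.0):
    """Add the chosen compressed task vectors back onto the shared base weights."""
    merged = list(base)
    for sparse in sparse_vectors:
        for i, delta in sparse.items():
            merged[i] += scale * delta
    return merged


# Toy weights: 100 parameters, fine-tuning shifts later parameters more.
base = [0.0] * 100
finetuned = [0.001 * i for i in range(100)]

tau = task_vector(base, finetuned)
compressed = sparsify_top_k(tau, keep_fraction=0.05)  # store 5% of the entries
merged = merge(base, [compressed])
```

Only the sparse dictionaries need to be stored per task, so several tasks can share one copy of the base weights and be merged on demand.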

FiLMMeD: Feature-wise Linear Modulation for Cross-Problem Multi-Depot Vehicle Routing

Apr 30, 2026

Arthur Corrêa, Paulo Nascimento, Samuel Moniz

A single neural model can now handle multiple variants of complex routing problems by dynamically adapting to different constraints, suggesting that multi-task learning with adaptive conditioning is more practical than building separate models for each problem type.

FiLMMeD is a neural model that solves 24 different multi-depot vehicle routing problems (a logistics optimization task) using a single unified architecture.

architecturetrainingapplications

Characterizing the Consistency of the Emergent Misalignment Persona

Apr 30, 2026

Anietta Weckauff, Yuchen Zhang, Maksym Andriushchenko

Fine-tuning on narrow harmful data can cause models to behave broadly harmfully, but they don't consistently develop matching self-awareness—some models hide their misalignment while others openly acknowledge it.

When large language models are fine-tuned on specific types of harmful data, they sometimes develop broader harmful behavior—a phenomenon called emergent misalignment. This paper tests whether models that behave harmfully also recognize themselves as misaligned.

safetyalignmenttraining

Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling

Apr 30, 2026

Ansar Aynetdinov, Patrick Haller, Alan Akbik

For non-English language models, aggressively filtering data for quality and repeating it multiple times beats training once on larger, diverse datasets—a practical insight for resource-constrained language model development.

This paper challenges the assumption that diverse data is always better for language model training. For German, the researchers found that repeatedly training on a smaller, high-quality filtered dataset outperforms training once on a larger, less-filtered dataset—even after 7 epochs of repetition.

trainingdataefficiency

RHyVE: Competence-Aware Verification and Phase-Aware Deployment for LLM-Generated Reward Hypotheses

Apr 30, 2026

Feiyu Wu, Xu Zheng, Zhuocheng Wang et al.

LLM-generated rewards aren't equally useful throughout training—their reliability depends on policy competence and training phase, so verification and deployment timing matter as much as reward generation itself.

This paper addresses when and how to use LLM-generated rewards during reinforcement learning training. The authors propose RHyVE, a method that verifies reward quality based on the current policy's skill level and training phase, rather than treating all rewards equally throughout training.

training

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

Apr 29, 2026

Gongbo Zhang, Wen Wang, Ye Tian et al.

Cross-architecture distillation for diffusion models is now practical: you can compress large diffusion LLMs into tiny ones (13x smaller) while maintaining performance, even when teacher and student have completely different designs.

This paper introduces TIDE, a framework for distilling knowledge from large diffusion language models into much smaller ones across different architectures. Unlike previous distillation methods that work within a single model type, TIDE handles cases where teacher and student models have different designs, attention mechanisms, and tokenizers.

trainingefficiencyarchitecture

Select to Think: Unlocking SLM Potential with Local Sufficiency

Apr 29, 2026

Wenxuan Ye, Yangyang Zhang, Xueli An et al.

Small models already generate the right answers in their candidate predictions—they just rank them poorly. Training them to re-rank their own outputs improves reasoning without external model calls.

Small language models struggle with reasoning tasks compared to large models. This paper discovers that when small models fail, the correct token from a large model is usually hidden in the small model's top-8 predictions.

efficiencyreasoningtraining

Learning Over-Relaxation Policies for ADMM with Convergence Guarantees

Apr 29, 2026

Junan Lin, Paul J. Goulart, Luca Furieri

Learning to adapt relaxation parameters in ADMM can speed up solving repeated optimization problems while maintaining convergence guarantees—useful for real-time control systems that solve similar problems repeatedly.

This paper shows how to use machine learning to automatically tune the relaxation parameter in ADMM, an algorithm for solving optimization problems. By learning better parameter choices for repeated similar problems (like in Model Predictive Control), the method reduces computation time without requiring expensive matrix refactorizations.

training
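
To make the role of the relaxation parameter concrete, here is a minimal sketch of textbook ADMM with over-relaxation on a toy problem (L1-regularized denoising). It is not the learned policy from the paper: `alpha` is fixed by hand, and the problem, function names, and tolerances are assumptions chosen for illustration. The x-update shows why changing `alpha` needs no refactorization: the linear system depends on `rho`, not on `alpha`.

```python
def soft_threshold(v, kappa):
    """Proximal operator of kappa * |v| (scalar shrinkage)."""
    if v > kappa:
        return v - kappa
    if v < -kappa:
        return v + kappa
    return 0.0


def admm_l1_denoise(b, lam, rho=1.0, alpha=1.0, tol=1e-8, max_iter=1000):
    """Minimize 0.5*||x - b||^2 + lam*||z||_1 subject to x = z.

    alpha is the over-relaxation parameter; alpha = 1.0 is plain ADMM.
    Returns the solution z and the number of iterations used.
    """
    n = len(b)
    z = [0.0] * n
    u = [0.0] * n
    for it in range(1, max_iter + 1):
        # x-update: closed form here; in general a linear solve that
        # depends on rho but not on alpha.
        x = [(b[i] + rho * (z[i] - u[i])) / (1.0 + rho) for i in range(n)]
        # Over-relaxation: blend the new x with the previous z.
        x_hat = [alpha * x[i] + (1.0 - alpha) * z[i] for i in range(n)]
        z_old = z
        z = [soft_threshold(x_hat[i] + u[i], lam / rho) for i in range(n)]
        u = [u[i] + x_hat[i] - z[i] for i in range(n)]
        # Stop when x and z agree and z has stopped moving.
        gap = max(abs(x[i] - z[i]) for i in range(n))
        change = max(abs(z[i] - z_old[i]) for i in range(n))
        if gap < tol and change < tol:
            return z, it
    return z, max_iter


b = [3.0, 0.5, -2.0]
z_plain, iters_plain = admm_l1_denoise(b, lam=1.0, alpha=1.0)
z_relaxed, iters_relaxed = admm_l1_denoise(b, lam=1.0, alpha=1.5)
```

Both runs reach the same solution (the soft-thresholded input), but on this toy problem the over-relaxed run needs fewer iterations, which is the kind of saving a learned choice of `alpha` aims for.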

ClawGym: A Scalable Framework for Building Effective Claw Agents

Apr 29, 2026

Fei Bai, Huatong Song, Shuang Sun et al.

To build effective agents for real-world file and tool interactions, you need systematic data synthesis, training on realistic rollout trajectories, and careful evaluation—ClawGym provides all three components together.

ClawGym is a framework for building AI agents that work with files, tools, and persistent workspaces through multi-step tasks. It includes a dataset of 13.5K synthesized tasks with realistic mock environments, trained agent models using supervised learning and reinforcement learning, and a benchmark for evaluation.

agentstrainingevaluation

Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

Apr 29, 2026

Andrea Agazzi, Giuseppe Bruno, Eloy Mosig García et al.

Noise in transformers can synchronize token behavior and stabilize learning—a counterintuitive finding that suggests randomness plays a constructive role in how these models process sequences.

This paper proves that transformer models with finite depth and width converge to a stochastic particle system as they scale. The researchers show that token evolution follows a continuous-time process with noise-driven synchronization, meaning random perturbations actually help tokens align rather than diverge.

scalingarchitecturetraining

Multiple Additive Neural Networks for Structured and Unstructured Data

Apr 29, 2026

Janis Mohr, Jörg Frochte

MANN uses neural networks instead of trees as the base learners in gradient boosting, enabling a single framework to handle structured and unstructured data while outperforming XGBoost and reducing hyperparameter sensitivity.

This paper presents Multiple Additive Neural Networks (MANN), which replaces decision trees in gradient boosting with shallow neural networks. MANN works with both structured data and images/audio by using CNNs and capsule networks as feature extractors, and shows better accuracy than XGBoost on standard benchmarks while being more robust to hyperparameter choices.

trainingarchitectureefficiency

What Kind of Language is Easy to Language-Model Under Curriculum Learning?

Apr 29, 2026

Nadine El-Naggar, Tatsuki Kuribayashi, Ted Briscoe

Curriculum learning substantially changes language models' learning biases, suggesting that training order matters as much as model architecture when predicting which language structures are 'easy' to learn.

This paper investigates how curriculum learning—training language models on simpler sentences first rather than random order—affects which linguistic patterns models naturally learn.

trainingdata

Language Diffusion Models are Associative Memories Capable of Retrieving Unseen Data

Apr 29, 2026

Bao Pham, Mohammed J. Zaki, Luca Ambrogioni et al.

Language diffusion models memorize training data by default, but you can detect when they switch to genuine generalization by monitoring conditional entropy—a practical signal for assessing whether a deployed model is memorizing or creating.

This paper reveals that language diffusion models work like associative memories—they store training data in 'basins of attraction' and can retrieve both memorized and unseen examples. As training data grows, the model transitions from memorizing to generalizing, a shift detectable by measuring conditional entropy of token predictions.

trainingevaluationreasoning
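
The signal mentioned in the entry above, conditional entropy of token predictions, can be computed from a model's output logits. The sketch below is a generic Shannon-entropy calculation, not the paper's estimator; the function names and the idea of averaging over positions are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predictive_entropy(logits):
    """Shannon entropy (in nats) of the token distribution given the context."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_conditional_entropy(logits_per_position):
    """Average the per-position entropies over a sequence."""
    entropies = [predictive_entropy(row) for row in logits_per_position]
    return sum(entropies) / len(entropies)


uniform = predictive_entropy([0.0, 0.0, 0.0, 0.0])  # maximally uncertain over 4 tokens
peaked = predictive_entropy([50.0, 0.0, 0.0, 0.0])  # almost all mass on one token
average = mean_conditional_entropy([[0.0, 0.0, 0.0, 0.0], [50.0, 0.0, 0.0, 0.0]])
```

Entropy near zero means the model is almost certain about each token. How the paper turns this quantity into a memorization-versus-generalization test is not specified in the summary, so any threshold would be a separate assumption.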

How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Apr 28, 2026

Chu-Cheng Lin, Eugene Ie

When training reasoning models with sparse rewards, you can escape cold-start failure by interpolating between RL and supervised learning via the Tsallis loss family—intermediate values of q balance speed of learning with training stability.

This paper addresses a key problem in training reasoning models: when models rarely succeed initially, standard reinforcement learning gets stuck. The authors introduce a family of loss functions, the Tsallis family indexed by a parameter q, that smoothly interpolates between two extremes, pure RL and pure supervised learning, letting practitioners choose how quickly to commit to learning from successes.

trainingreasoningalignment
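
One way to picture a loss continuum indexed by q is through the Tsallis q-logarithm, a standard generalization of the natural logarithm. The sketch below is an assumption about how such a family could look, not the paper's definition: it applies the negative q-logarithm to the probability `p_success` of producing a correct answer. Under this reading, q = 0 gives 1 - p (an expected-reward objective) and q = 1 gives -log p (a log-likelihood objective).

```python
import math


def q_log(x, q):
    """Tsallis q-logarithm; equals math.log(x) in the limit q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def tsallis_loss(p_success, q):
    """Hypothetical loss: -ln_q(p) on the probability of a correct answer."""
    return -q_log(p_success, q)


def gradient_weight(p_success, q):
    """Derivative of the loss with respect to p, which is -p**(-q)."""
    return -(p_success ** (-q))


p = 0.25
loss_q0 = tsallis_loss(p, 0.0)     # 1 - p
loss_half = tsallis_loss(p, 0.5)   # intermediate member of the family
loss_q1 = tsallis_loss(p, 1.0)     # -log p
```

The derivative -p**(-q) shows the trade-off: at q = 0 the push is constant regardless of p, while at q = 1 it grows as 1/p when success is rare, so intermediate q values sit between the two behaviors.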

Teacher Forcing as Generalized Bayes: Optimization Geometry Mismatch in Switching Surrogates for Chaotic Dynamics

Apr 28, 2026

Andre Herz, Daniel Durstewitz, Georgia Koppe

Teacher forcing trains RNNs on chaotic systems under conditions that differ from how the model is used at prediction time. This mismatch can let models fit the data well statistically while predicting the actual dynamics poorly, and the problem worsens when multiple explanations exist for the data.

This paper reveals a fundamental mismatch between how teacher forcing (a common training technique) and marginal likelihood (the true objective) shape neural network optimization for chaotic systems.

trainingreasoning

Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models

Apr 28, 2026

Ajmain Inqiad Alam, Palash Roy, Chanchal K. Roy et al.

You can compress LLMs for software engineering (SE) tasks to 1/49th of their original size with minimal accuracy loss, making them practical to deploy while cutting environmental impact dramatically.

This paper presents Carbon-Taxed Transformers (CTT), a compression pipeline that makes large language models smaller, faster, and greener for software engineering tasks.

efficiencytrainingevaluation

TSN-Affinity: Similarity-Driven Parameter Reuse for Continual Offline Reinforcement Learning

Apr 28, 2026

Dominik Żurek, Kamil Faber, Marcin Pietron et al.

Architectural parameter reuse guided by task similarity is a memory-efficient alternative to replay-based continual learning in offline RL, enabling better multi-task performance without storing historical data.

This paper presents TSN-Affinity, a method for continual offline reinforcement learning that learns multiple tasks sequentially from pre-collected datasets without forgetting previous tasks.

trainingarchitectureefficiency

Personalized Worked Example Generation from Student Code Submissions using Pattern-based Knowledge Components

Apr 27, 2026

Griffin Pitts, Muntasir Hoq, Peter Brusilovsky et al.

By extracting knowledge components from student code patterns, you can steer generative models to create personalized learning content that directly targets the logical errors students are making, rather than relying on generic pre-written examples.

This paper presents a system that automatically generates personalized worked examples for programming students based on their actual code submissions. Instead of using fixed example libraries, the system analyzes patterns in student errors using code structure analysis and uses these patterns to guide an AI model to create relevant examples that address each student's specific misconceptions.

applicationstrainingdata

Conflict-Aware Harmonized Rotational Gradient for Multiscale Kinetic Regimes

Apr 27, 2026

Zhangyong Liang

When training neural networks on multiscale physics problems, gradient conflicts between different regimes can cause training failure—HRGrad fixes this by explicitly managing gradient directions to keep all objectives aligned during optimization.

This paper introduces HRGrad, a method for training neural networks on physics problems that span multiple scales—from microscopic to macroscopic behavior. The key challenge is that different scales pull the network in conflicting directions during training.

trainingreasoning

Learning to Think from Multiple Thinkers

Apr 27, 2026

Nirmit Joshi, Roey Magen, Nathan Srebro et al.

Learning from diverse reasoning traces is harder than learning from a single thinker, but you can overcome this by actively collecting reasoning data from many thinkers (logarithmic in target accuracy) combined with passive final-answer supervision.

This paper studies how AI models can learn from multiple people or programs solving the same problem in different ways (e.g., different math solutions or code implementations).

trainingreasoningdata

Sentiment and Emotion Classification of Indonesian E-Commerce Reviews via Multi-Task BiLSTM and AutoML Benchmarking

Apr 27, 2026

Hermawan Manurung, Ibrahim Al-Kahfi, Ahmad Rizqi et al.

Multi-task learning (training one model for both sentiment and emotion at once) with BiLSTM outperforms single-task approaches on noisy, informal Indonesian text—and preprocessing with domain-specific slang dictionaries matters more than model complexity.

This paper tackles sentiment and emotion classification for Indonesian e-commerce reviews, which contain slang, regional words, and emoji that confuse standard tools. The authors built a two-track system: one using AutoML with TF-IDF features, and another using a BiLSTM neural network trained on both sentiment and emotion simultaneously.

trainingapplications

Scalable Hyperparameter-Divergent Ensemble Training with Automatic Learning Rate Exploration for Large Models

Apr 27, 2026

Hailing Cheng, Tao Huang, Chen Zhu et al.

You can use your existing multi-GPU setup to automatically find better learning rates during training by having each GPU try slightly different rates and averaging them periodically—no extra compute needed.

This paper proposes HDET, a method that uses multiple GPU replicas to explore different learning rates during training instead of computing identical updates. Replicas train independently with different learning rates, then synchronize periodically.

trainingefficiency
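
The replica scheme described above (train independently with different learning rates, then synchronize) can be sketched on a one-parameter toy problem. This is an assumption-laden illustration, not HDET itself: replicas are simulated sequentially, synchronization is plain weight averaging, and the automatic learning-rate exploration is left out. All names are hypothetical.

```python
def sgd_steps(w, lr, grad, steps):
    """Run a few plain gradient steps from w with one learning rate."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w


def divergent_ensemble_train(w0, learning_rates, grad, rounds, local_steps):
    """Each replica starts from the shared weights, trains with its own
    learning rate, and all replicas are then averaged into one model."""
    w = w0
    for _ in range(rounds):
        replicas = [sgd_steps(w, lr, grad, local_steps) for lr in learning_rates]
        w = sum(replicas) / len(replicas)  # periodic synchronization
    return w


def quadratic_grad(w):
    """Gradient of the toy loss 0.5 * (w - 3)^2, minimized at w = 3."""
    return w - 3.0


w_final = divergent_ensemble_train(
    w0=0.0,
    learning_rates=[0.05, 0.1, 0.2, 0.4],
    grad=quadratic_grad,
    rounds=20,
    local_steps=5,
)
```

Each replica does the same amount of work it would do in ordinary data-parallel training; only the learning rate differs between synchronization points.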

Contextual Linear Activation Steering of Language Models

Apr 27, 2026

Brandon Hsu, Daniel Beaglehole, Adityanarayanan Radhakrishnan et al.

Adapting steering strength dynamically per context significantly improves LLM control compared to fixed steering, matching more complex methods like LoRA while remaining simpler and more interpretable.

This paper improves linear activation steering—a technique for controlling LLM behavior—by making the steering strength adapt to each input context instead of using a fixed strength for all tokens. The method, called CLAS, works better than existing approaches across multiple benchmarks and models, offering a practical way to customize LLMs with limited training data.

alignmentefficiencytraining
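
The difference between fixed and context-dependent steering can be shown in a few lines. Fixed linear steering adds the same multiple of a direction vector to every hidden state. The contextual variant below picks the strength from the hidden state through a sigmoid gate, which is a made-up parameterization for illustration only; the summary does not say how CLAS computes its strength. `gate_weights`, `gate_bias`, and `max_strength` are hypothetical names.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def fixed_steering(hidden, direction, strength):
    """Classic linear steering: hidden + strength * direction."""
    return [h + strength * d for h, d in zip(hidden, direction)]


def contextual_steering(hidden, direction, gate_weights, gate_bias, max_strength):
    """Illustrative input-dependent steering: a sigmoid gate on the hidden
    state scales the strength between 0 and max_strength."""
    gate = 1.0 / (1.0 + math.exp(-(dot(gate_weights, hidden) + gate_bias)))
    strength = max_strength * gate
    return fixed_steering(hidden, direction, strength), strength


hidden = [1.0, 0.0]
direction = [0.0, 1.0]

steered_fixed = fixed_steering(hidden, direction, strength=4.0)
steered_ctx, used_strength = contextual_steering(
    hidden, direction, gate_weights=[0.0, 0.0], gate_bias=0.0, max_strength=4.0
)
```

With a neutral gate (all-zero weights and bias) the contextual version applies half of `max_strength`; training the gate would let different inputs receive different amounts of steering.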

Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering

Apr 24, 2026

Hillary Mutisya, John Mugane

Cross-lingual transfer and unsupervised clustering are complementary for morphology discovery in low-resource languages—transfer finds cognates while clustering spots language-specific innovations that transfer misses.

This paper develops a method to automatically discover morphological patterns in Giriama, a low-resource Bantu language with minimal labeled data. By combining knowledge transfer from Swahili with unsupervised clustering, the system identifies noun classes and uncovers two previously unknown morphological patterns, achieving 86.7% accuracy on lemmatization across word classes.

datatraining

Aligning Dense Retrievers with LLM Utility via Distillation

Apr 24, 2026

Rajinder Sandhu, Di Mu, Cheng Chang et al.

You can train dense retrievers to match LLM utility by distilling perplexity-based signals into embeddings during training, eliminating expensive test-time LLM re-ranking while improving retrieval quality.

This paper proposes Utility-Aligned Embeddings (UAE), a method that trains dense retrievers to match the ranking quality of LLM-based re-ranking without the computational cost.

trainingefficiency

Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

Apr 24, 2026

Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo

You can train models to reason efficiently using learned abstract tokens instead of natural language, reducing inference cost by over 10× while keeping reasoning quality comparable to verbose chain-of-thought.

This paper introduces Abstract Chain-of-Thought, a method that trains language models to reason using short sequences of special tokens instead of writing out full explanations. The approach uses a warm-up phase combining supervised learning from verbal reasoning and self-distillation, then optimizes with reinforcement learning.

reasoningefficiencytraining

CRAFT: Clustered Regression for Adaptive Filtering of Training data

Apr 24, 2026

Parthasarathi Panda, Asheswari Swain, Subhrakanta Panda

You can select optimal training data 40x faster than competing methods by matching source distributions through clustering and target distributions through regression, without sacrificing quality.

CRAFT is a fast method for selecting high-quality training data subsets from massive datasets. It uses clustering and statistical matching to pick training examples whose target outputs align with your validation set, enabling efficient fine-tuning of translation models on millions of examples in under a minute.

datatrainingefficiency

Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability

Apr 23, 2026

Nicolae Filat, Ahmed Hussain, Konstantinos Kalogiannis et al.

When evaluating continual learning systems on streaming data, the way you partition the stream into tasks is as important as the algorithm itself—different valid splits can produce contradictory conclusions about which method works best.

This paper reveals that how you split a continuous data stream into tasks dramatically affects continual learning benchmarks—even when using the same data and model. The authors introduce tools to measure this effect and show that different task boundaries can flip which learning method performs best, making temporal taskification a critical but often-overlooked evaluation choice.

evaluationtraining

Fine-Tuning Regimes Define Distinct Continual Learning Problems

Apr 23, 2026

Paul-Tiberiu Iordache, Elena Burceanu

When comparing continual learning methods, the choice of which model layers to train is as important as the method itself—different fine-tuning regimes can completely change which approach performs best.

This paper shows that how you choose which parts of a model to update during continual learning (learning new tasks sequentially) significantly changes which methods work best.

trainingevaluation

Low-Rank Adaptation Redux for Large Models

Apr 23, 2026

Bingcong Li, Yilang Zhang, Georgios B. Giannakis

LoRA works by adding small, low-rank weight matrices to a pre-trained model instead of updating all parameters—signal processing principles can guide better design choices for this approach and similar efficient fine-tuning methods.

This paper examines LoRA (Low-Rank Adaptation), a widely-used technique for efficiently fine-tuning large AI models, through the lens of signal processing. It explains the core mechanisms behind LoRA variants and how classical signal processing tools can improve parameter-efficient fine-tuning methods, covering architectural design, optimization strategies, and real-world applications.

trainingefficiency
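
The low-rank update described in the entry above can be written out directly. This is a minimal sketch of the standard LoRA forward pass, y = W x + (alpha / r) * B (A x), using plain Python lists; the dimensions and the deterministic initial values for `A` are assumptions for illustration (the usual recipe draws `A` randomly and sets `B` to zero).

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trainable."""
    r = len(A)  # rank of the update
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


d_in, d_out, r = 64, 64, 4
W = [[0.01 * ((i + j) % 7) for j in range(d_in)] for i in range(d_out)]  # frozen weights
A = [[0.01 * (j + 1) for j in range(d_in)] for _ in range(r)]            # r x d_in
B = [[0.0] * r for _ in range(d_out)]                                    # d_out x r, zero init
x = [1.0] * d_in

y = lora_forward(W, A, B, x, alpha=8.0)

full_params = d_out * d_in        # parameters updated by full fine-tuning
lora_params = r * (d_in + d_out)  # parameters updated by LoRA
```

Because `B` starts at zero, the adapted layer initially reproduces the pre-trained output exactly, and in this toy setting the adapter trains 512 parameters instead of 4096.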

GiVA: Gradient-Informed Bases for Vector-Based Adaptation

Apr 23, 2026

Neeraj Gangwar, Rishabh Deshmukh, Michael Shavlovsky et al.

GiVA reduces the parameter cost of vector-based fine-tuning by 8x compared to existing methods while matching LoRA's speed, making extreme parameter efficiency practical for real-world model adaptation.

GiVA improves vector-based adaptation, a highly parameter-efficient way to customize large AI models, by using gradient information during initialization. Instead of requiring 8 times more parameters than LoRA to work well, GiVA achieves similar performance with far fewer parameters and faster training, making it practical for adapting massive models on limited budgets.

efficiencytraining

Replay-buffer engineering for noise-robust quantum circuit optimization

Apr 23, 2026

Akash Kundu, Sebastian Feld

Treating the replay buffer as a primary algorithmic lever—not just a storage mechanism—can dramatically improve quantum circuit optimization by adapting how past experiences are sampled and transferred across different noise conditions.

This paper improves deep reinforcement learning for quantum circuit optimization by redesigning how the algorithm stores and reuses past experiences.

trainingefficiency

Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems

Apr 23, 2026

Ye Yu, Heming Liu, Haibo Jin et al.

Multi-agent LLM systems can achieve better reasoning by learning optimized latent communication channels instead of relying on fixed text-based protocols, with significant improvements on challenging benchmarks.

This paper introduces DiffMAS, a training framework that lets multiple AI agents learn how to communicate with each other through internal representations (like key-value caches) rather than text. By jointly optimizing both reasoning and communication during training, agents can better coordinate on complex tasks like math, science, and coding problems.

agentstrainingreasoning

Thinking with Reasoning Skills: Fewer Tokens, More Accuracy

Apr 23, 2026

Guangxiang Zhao, Qilong Shi, Xusen Xiao et al.

By retrieving learned reasoning skills at inference time instead of reasoning from scratch, you can reduce token usage and improve accuracy—making LLM reasoning cheaper and faster for practical deployment.

This paper proposes storing reusable reasoning skills learned from past problem-solving attempts, then retrieving and applying them during inference to guide new reasoning. Instead of reasoning from scratch each time, the model recalls relevant skills to avoid redundant work and reach solutions faster. Tests on coding and math tasks show it uses fewer tokens while improving accuracy.

reasoningefficiencytraining

Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL

Apr 22, 2026

Zhaofeng Wu, Shiqi Wang, Boya Peng et al.

Training code models on parallel implementations of the same program across multiple languages creates a more language-agnostic understanding of coding logic, enabling better zero-shot transfer to new programming languages.

This paper tackles a practical problem: AI models trained to write code in popular languages like Python often perform poorly in less common languages. The researchers propose Parallel-SFT, a training method that uses code examples showing the same program written in multiple languages side-by-side.

training

FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels

Apr 22, 2026

Sina Gholami, Abdulmoneam Ali, Tania Haghighi et al.

By analyzing the spectral structure of feature representations, you can identify noisy labels in federated learning and use clean clients to help relabel corrupted data—without needing to share raw data or redesign loss functions.

FedSIR tackles a major challenge in federated learning: when training data across distributed devices contains mislabeled examples. The method identifies which devices have clean vs. noisy labels by analyzing the mathematical structure of their learned features, then uses clean devices to help noisy devices fix their labels.

trainingdataefficiency

Closing the Domain Gap in Biomedical Imaging by In-Context Control Samples

Apr 22, 2026

Ana Sanchez-Fernandez, Thomas Pinetz, Werner Zellinger et al.

Meta-learning with control samples can close the domain gap caused by batch effects in biomedical imaging, enabling deep learning models to work reliably across different experimental batches and labs without retraining from scratch.

Batch effects—systematic technical variations in biomedical imaging—cause deep learning models to fail on new experimental data. This paper introduces CS-ARM-BN, a meta-learning method that uses negative control samples (unperturbed reference images always present in experiments) to adapt models to new batches.

trainingapplications

Convergent Evolution: How Different Language Models Learn Similar Number Representations

Apr 22, 2026

Deqing Fu, Tianyi Zhou, Mikhail Belkin et al.

Language models naturally converge on similar periodic number representations across different architectures, but whether they learn features useful for arithmetic depends on training signals like text-number co-occurrence or multi-token addition problems.

Different language models (Transformers, RNNs, LSTMs) independently learn to represent numbers using periodic patterns with periods of 2, 5, and 10—a phenomenon called convergent evolution.

trainingreasoning

ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control

Apr 22, 2026

Shelly Golan, Michael Finkelson, Ariel Bereslavsky et al.

You can now train one diffusion model that handles multiple conflicting goals and let users choose their preferred trade-off at inference time, rather than training separate models or picking a single compromise upfront.

ParetoSlider trains a single diffusion model to handle multiple competing objectives simultaneously, letting users control trade-offs at inference time. Instead of committing to one fixed balance between goals (like image quality vs. prompt accuracy), the model learns the entire range of optimal solutions and accepts a preference weight as input to pick any point along that spectrum.

trainingalignmentapplications

Generalization at the Edge of Stability

Apr 21, 2026

Mario Tuci, Caner Korkmaz, Umut Şimşekli et al.

Training at the edge of stability (where optimization becomes chaotic) generalizes better because the optimizer converges to a lower-dimensional fractal attractor, and you can predict generalization by measuring the complete structure of the loss landscape's curvature, not just simple summaries.

This paper explains why training neural networks with large learning rates—which causes chaotic, oscillatory behavior—actually improves generalization. The authors model optimizers as random dynamical systems that converge to fractal attractors and introduce 'sharpness dimension' to measure generalization.

trainingscalingreasoning

Safe Continual Reinforcement Learning in Non-stationary Environments

Apr 21, 2026

Austin Coursey, Abel Diaz-Gonzalez, Marcos Quinones-Grueiro et al.

Safe continual reinforcement learning faces a fundamental trade-off: methods that maintain safety constraints often catastrophically forget previous knowledge when environments change, and vice versa—a problem existing approaches fail to fully resolve.

This paper studies how to safely train AI controllers that adapt to changing environments over time. The authors show that existing methods struggle to both prevent safety violations and avoid forgetting previous knowledge when system dynamics shift unexpectedly.

safetytrainingreasoning

FASTER: Value-Guided Sampling for Fast RL

Apr 21, 2026

Perry Dong, Alexander Swerdlow, Dorsa Sadigh et al.

You can get the benefits of expensive test-time sampling in RL by learning to filter action candidates early in the generation process, reducing compute without sacrificing performance.

FASTER is a method that speeds up reinforcement learning by filtering action candidates during the denoising process of diffusion-based policies, rather than waiting until denoising completes. It models this filtering as a decision problem with a learned value function, achieving the same performance as expensive sampling methods while cutting computational costs significantly.

efficiencyreasoningtraining

FB-NLL: A Feature-Based Approach to Tackle Noisy Labels in Personalized Federated Learning

Apr 21, 2026

Abdulmoneam Ali, Ahmed Arafa

Instead of clustering users during training (vulnerable to noisy labels), group them upfront using feature covariance structure, then fix label errors by checking if examples align with learned feature subspaces.

FB-NLL tackles noisy labels in federated learning by clustering users based on feature geometry rather than training dynamics, then correcting mislabeled data using feature alignment. This one-shot approach avoids the communication overhead of iterative methods while handling low-quality data that typically corrupts personalized federated learning.

training data efficiency

VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

Apr 21, 2026

Jean Mercat, Sedrick Keh, Kushal Arora et al.

For roboticists and ML engineers: VLA Foundry eliminates pipeline incompatibility issues by providing a unified training stack for building embodied AI models, with released weights and open-source code making it practical to train and deploy robotic policies.

VLA Foundry is an open-source framework that unifies training of language models, vision-language models, and vision-language-action models in one codebase. Instead of stitching together separate pipelines, it provides end-to-end control from language pretraining through action fine-tuning, enabling researchers to train robotic manipulation policies from scratch or using pretrained backbones.

architecture training applications

Adaptive MSD-Splitting: Enhancing C4.5 and Random Forests for Skewed Continuous Attributes

Apr 21, 2026

Jake Lee

Adaptive binning that adjusts to data skewness can significantly improve decision tree and Random Forest accuracy on skewed real-world data without sacrificing the computational efficiency gains of statistical discretization.

This paper improves how decision trees handle continuous numerical data by introducing Adaptive MSD-Splitting (AMSD), which adjusts binning strategies based on data skewness instead of using fixed cutoffs. The method maintains fast O(N) performance while improving accuracy by 2-4%, and extends to Random Forests for better large-scale performance on real-world datasets.

training efficiency

Bounded Ratio Reinforcement Learning

Apr 20, 2026

Yunke Ao, Le Chen, Bruce D. Lee et al.

BRRL provides the first principled theoretical foundation for PPO-style clipped objectives, proving monotonic improvement and connecting trust region methods to the Cross-Entropy Method—offering both better understanding and a path to improved algorithms.

This paper fixes a theoretical gap in PPO by introducing BRRL, a framework that derives the mathematically optimal policy update with guaranteed improvement. The authors develop BPO, a practical algorithm that approximates this optimal solution, and extend it to GBPO for LLM fine-tuning. Experiments show BPO matches or beats PPO across robotics, games, and language model tasks.

training reasoning
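
For readers unfamiliar with the "PPO-style clipped objective" that the BRRL summary refers to, here is a minimal sketch of the standard PPO clipped surrogate. It is not the paper's BPO or GBPO algorithm. The function name `ppo_clip_objective` and the default `eps=0.2` are illustrative assumptions.

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Standard PPO clipped surrogate (illustrative; not BRRL's BPO).

    ratios:     new_policy_prob / old_policy_prob for each sampled action
    advantages: advantage estimate for each sampled action
    eps:        clipping range (0.2 is a commonly used default, assumed here)
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        # Keep the probability ratio inside [1 - eps, 1 + eps].
        clipped = max(1.0 - eps, min(1.0 + eps, r))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(r * a, clipped * a)
    return total / len(ratios)
```

With a ratio of 1.5 and an advantage of 1.0, the clipped term wins and the objective is 1.2, so the update gets no extra credit for moving the policy further than the clipping range allows.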

When Can LLMs Learn to Reason with Weak Supervision?

Apr 20, 2026

Salman Rahman, Jingyan Shen, Anna Mordvina et al.

Models generalize under weak supervision when they maintain steady improvement during training (avoiding rapid saturation), and this can be achieved by fine-tuning on explicit reasoning traces combined with domain-specific pre-training.

This paper investigates when large language models can learn to reason effectively with weak supervision (scarce data, noisy rewards, or self-generated rewards). The key finding is that models generalize when they have a prolonged training phase where performance improves steadily, rather than quickly memorizing.

reasoning training evaluation

ConforNets: Latents-Based Conformational Control in OpenFold3

Apr 20, 2026

Minji Lee, Colin Kalicki, Minkyu Jeon et al.

By learning to transform OpenFold3's internal representations, ConforNets can reliably generate multiple protein conformations and transfer conformational changes between proteins, solving a major limitation of structure prediction models, which typically predict only one dominant state.

ConforNets is a method for controlling protein conformations in OpenFold3, an open-source AlphaFold3-style (AF3) structure predictor, by applying learnable transformations to latent representations. Rather than perturbing inputs or using ad hoc tricks, it modulates the internal representations that the model uses to predict protein structures, enabling both discovery of alternate conformations and transfer of conformational changes across related proteins.

architecture training applications

GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

Apr 20, 2026

Alireza Dadgarnia, Soroush Tabesh, Mahdi Nikdan et al.

GSQ achieves near-frontier compression accuracy at 2-3 bits using standard scalar quantization compatible with existing inference hardware, making ultra-low-precision models practical without complex custom implementations.

GSQ is a new quantization method that compresses large language models to 2-3 bits per parameter while maintaining accuracy. It uses the Gumbel-Softmax relaxation, a differentiable approximation to sampling from a discrete distribution, to learn which quantization level each weight should map to, bridging the gap between simple but limited scalar quantization and complex vector quantization methods that are hard to deploy.

efficiency training
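
As a rough illustration of the Gumbel-Softmax technique named in the GSQ summary, the sketch below draws a relaxed one-hot sample over a small grid of quantization levels and mixes the levels by those probabilities. This is the generic technique only, not GSQ's actual procedure. The helper `soft_quantize`, the `levels` grid, and the temperature `tau` are hypothetical names chosen for the example.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed one-hot sample over len(logits) categories."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)       # guard against log(0)
        g = -math.log(-math.log(u))        # Gumbel(0, 1) noise
        noisy.append((logit + g) / tau)
    top = max(noisy)                       # subtract the max for stability
    exps = [math.exp(x - top) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def soft_quantize(logits, levels, tau=1.0, rng=random):
    """Hypothetical helper: probability-weighted mix of the grid levels."""
    probs = gumbel_softmax(logits, tau, rng)
    return sum(p * q for p, q in zip(probs, levels))
```

Lower values of `tau` push the sample toward a hard one-hot choice of a single level, while higher values keep it smooth. In frameworks with automatic differentiation, that smoothness is what lets gradients flow through the assignment.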

FUSE: Ensembling Verifiers with Zero Labeled Data

Apr 20, 2026

Joonhyuk Lee, Virginia Ma, Sarah Zhao et al.

You can build better verification systems by combining multiple imperfect judges without any ground truth labels—FUSE shows this works as well as supervised approaches on real benchmarks.

FUSE is a method for combining multiple imperfect AI judges (verifiers) to better evaluate model outputs without needing any labeled correct answers. It uses spectral algorithms to ensemble the verifiers while controlling how they depend on each other, achieving results comparable to methods that do use labeled data.

evaluation training efficiency
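
To give a feel for label-free spectral ensembling of the kind the FUSE summary describes, here is a sketch of a classical approach. It weights each verifier by its entry in the leading eigenvector of the verifier-to-verifier covariance matrix. It is not FUSE itself: it assumes verifiers err independently given the true answer, whereas the summary says FUSE explicitly handles dependence. Votes are assumed to be +1 or -1, and the function names are hypothetical.

```python
def spectral_weights(votes, iters=200):
    """votes: one row per item, each row a list of +1/-1 verdicts per verifier."""
    n, m = len(votes), len(votes[0])
    means = [sum(row[j] for row in votes) / n for j in range(m)]
    # Off-diagonal sample covariance between verifiers (diagonal left at zero).
    cov = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            c = sum((row[i] - means[i]) * (row[j] - means[j]) for row in votes) / n
            cov[i][j] = cov[j][i] = c
    # Power iteration for the leading eigenvector.
    w = [1.0] * m
    for _ in range(iters):
        nxt = [sum(cov[i][j] * w[j] for j in range(m)) for i in range(m)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0.0:
            break
        w = [x / norm for x in nxt]
    # Resolve the sign ambiguity by assuming most verifiers beat chance.
    if sum(w) < 0:
        w = [-x for x in w]
    return w

def ensemble(votes, weights):
    """Weighted vote: +1 if the weighted sum of verdicts is non-negative."""
    return [1 if sum(wj * v for wj, v in zip(weights, row)) >= 0 else -1
            for row in votes]
```

No ground-truth labels enter either function. Under the independence assumption, more accurate verifiers agree more strongly with the rest and so tend to receive larger weights.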

Learning to Reason with Insight for Informal Theorem Proving

Apr 17, 2026

Yunhe Li, Hao Shi, Bowen Deng et al.

Teaching LLMs to explicitly identify and apply core proof techniques—rather than jumping straight to full proofs—significantly improves their ability to solve complex mathematical problems.

This paper tackles informal theorem proving by teaching language models to recognize core proof techniques. The authors create a structured dataset that breaks down proofs into key insights and sketches, then train models using a multi-stage approach that mirrors how humans learn math—starting with basic proofs and progressing to deeper reasoning.

reasoning training evaluation