ThinkLLM


Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

861 papers, 100 this month, 12 topics
Topics: Efficiency 37, Reasoning 36, Training 35, Evaluation 29, Architecture 23, Agents 23, Multimodal 17, Applications 15, Alignment 9, Safety 8, Scaling 8, Data 3

May 18 – May 24 (45)

Tokenisation via Convex Relaxations

May 21, 2026

Jan Tempus, Philip Whittington, Craig W. Schmidt et al.

ConvexTok uses convex optimization to build tokenizers that are provably near-optimal (within 1% at typical vocabulary sizes) and compress text better than greedy algorithms like BPE, with measurable improvements in language model efficiency.

This paper replaces greedy tokenization algorithms like BPE with a convex optimization approach called ConvexTok. Instead of making locally optimal choices, it formulates tokenizer construction as a linear program, achieving better compression (bits-per-byte) and allowing users to verify how close their tokenizer is to mathematically optimal.

training, efficiency

Integrable Elasticity via Neural Demand Potentials

May 21, 2026

Carlos Heredia, Daniel Roncel

Neural demand models can be designed to respect economic constraints (integrability), producing more reliable price-elasticity estimates that are both mathematically consistent and practically useful for retail pricing.

This paper introduces ICDN, a neural network model that learns demand patterns for multiple products based on prices. Unlike traditional approaches, it directly models how demand changes with price (elasticity) in a mathematically consistent way, making the learned relationships more economically realistic and stable.

May 11 – May 17 (30)

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

May 14, 2026

Ziyu Guo, Rain Liu, Xinyan Chen et al.

A single discrete token can serve dual purposes—executing visual operations like code while also functioning as a learnable reasoning unit—making visual reasoning more efficient and trainable without architectural changes.

ATLAS introduces a single 'functional token' that acts as both an agentic operation and a latent visual reasoning unit, enabling models to reason about images without generating intermediate visual content. This approach combines the interpretability of code-based reasoning with the efficiency of latent reasoning, while remaining compatible with standard language model training.

reasoning, multimodal, agents

EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

May 14, 2026

Ruozhen He, Meng Wei, Ziyan Yang et al.

Maintaining consistent characters and objects across long video sequences is hard; explicit memory of each entity's appearance significantly improves consistency, especially when characters reappear after many shots.

EntityBench is a benchmark for evaluating multi-shot video generation—creating coherent video sequences with multiple scenes. It includes 140 episodes with detailed tracking of characters, objects, and locations across shots, plus an evaluation system that measures both video quality and consistency.

May 4 – May 10 (25)

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

May 8, 2026

Tong Zheng, Haolin Liu, Chengsong Huang et al.

You can automatically discover better inference strategies for LLMs by treating strategy design as a search problem over execution traces rather than hand-designing heuristics, and the search is cheap to run at scale.

This paper presents AutoTTS, a framework that automatically discovers test-time scaling strategies for LLMs instead of relying on hand-crafted heuristics.

reasoning

Normalizing Trajectory Models

May 8, 2026

Jiatao Gu, Tianrong Chen, Ying Shen et al.

NTM enables fast image generation (4 steps) while preserving exact likelihood calculation—something previous fast diffusion methods couldn't do—by using normalizing flows for each denoising step instead of simple Gaussian assumptions.

This paper introduces Normalizing Trajectory Models (NTM), a new approach for fast image generation that compresses diffusion sampling from many steps to just four. Unlike existing fast methods that lose the ability to calculate exact probabilities, NTM maintains a mathematically exact likelihood while generating high-quality images, making it useful for both generation and evaluation.

efficiency, architecture, training

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

May 21, 2026

Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld et al.

Training LLMs to produce diverse outputs across multiple reward dimensions—not just maximizing a single score—makes them better at test-time search where you can pick the best solution from many candidates.

This paper introduces Vector Policy Optimization (VPO), a training method that teaches language models to generate diverse solutions by optimizing for multiple reward objectives simultaneously, rather than a single scalar reward.

training, reasoning, efficiency

Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration

May 21, 2026

Lily Goli, Justin Kerr, Daniele Reda et al.

Effective curiosity-driven exploration in 3D environments requires both a persistent, continuously updated world model and episodic memory of the agent's trajectory; without these, agents waste effort revisiting forgotten states instead of discovering new regions.

This paper shows how to make AI agents explore 3D environments effectively using curiosity-driven learning. The key insight is that agents need two things: a persistent 3D map of the world that updates continuously, and memory of where they've been.

reasoning, agents

The Matching Principle: A Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning

May 21, 2026

Vishal Rajput

Many robustness techniques (CORAL, adversarial training, IRM, metric learning) are different ways of solving the same problem: identifying and regularizing against label-preserving variations in your data.

This paper unifies seemingly separate robustness problems (domain adaptation, adversarial training, compositional generalization) under one framework: regularizing neural network gradients to match the covariance of label-preserving variations in deployment data.

training, alignment

Finite-Particle Convergence Rates for Conservative and Non-Conservative Drifting Models

May 21, 2026

Krishnakumar Balasubramanian

Conservative drifting with kernel density estimators achieves provable convergence rates for one-step generative modeling, with the convergence speed depending on dimension and a tunable parameter that trades off between different error sources.

This paper analyzes drifting methods for generative modeling, proposing a conservative approach using kernel density estimators that guarantees gradient-field properties. The authors prove finite-particle convergence rates showing how quickly the method converges as sample size increases, with explicit tracking of how bandwidth and dimension affect performance.

training, evaluation

MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

May 21, 2026

Qianshu Cai, Yonggang Zhang, Xianzhang Jia et al.

Self-evolving agents need source-code access, not just prompt editing—structural bugs in routing and state management can't be fixed by text-layer changes alone, and MOSS demonstrates this works in production with measurable improvements.

MOSS is a system that lets autonomous agents automatically fix themselves by rewriting their own source code based on real failures. Unlike existing approaches that only modify text files like prompts, MOSS can change the actual code structure—routing logic, state management, dispatch—making it possible to fix a much broader class of problems.

agents, safety

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

May 21, 2026

Ali Hatamizadeh, Yejin Choi, Jan Kautz

Decoupling erase and write operations in linear attention with separate gates improves language model performance, especially on long-context tasks, while maintaining constant-memory decoding.

This paper improves linear attention mechanisms by separating the control of what to forget from what to remember in compressed memory. Instead of using a single gate to control both erasing old information and writing new information, Gated DeltaNet-2 uses separate channel-wise gates for each operation, making memory updates more flexible and efficient.
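
To make the erase/write distinction concrete, here is a toy sketch of a delta-rule style key-value memory with separate channel-wise gates. The update rule, the function names `read` and `gated_update`, and the gate shapes are assumptions chosen for illustration; they are not taken from the paper and may differ from the actual Gated DeltaNet-2 equations.

```python
from typing import List

Matrix = List[List[float]]


def read(memory: Matrix, key: List[float]) -> List[float]:
    """Return memory @ key: the value currently associated with `key`."""
    return [sum(m * k for m, k in zip(row, key)) for row in memory]


def gated_update(
    memory: Matrix,
    key: List[float],
    value: List[float],
    erase_gate: List[float],
    write_gate: List[float],
) -> Matrix:
    """Illustrative update: S <- S - (e * (S k)) k^T + (w * v) k^T.

    erase_gate (e) and write_gate (w) hold one value in [0, 1] per value
    channel. Setting e == w recovers a single-gate delta-rule update.
    The memory has a fixed size, so decoding cost does not grow with length.
    """
    old = read(memory, key)
    return [
        [
            cell
            - erase_gate[i] * old[i] * key[j]      # forget what was stored
            + write_gate[i] * value[i] * key[j]    # write the new value
            for j, cell in enumerate(row)
        ]
        for i, row in enumerate(memory)
    ]


memory = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]  # unit-norm key
memory = gated_update(memory, key, [3.0, 4.0], [1.0, 1.0], [1.0, 1.0])
# Overwrite channel 0 only: channel 1 is neither erased nor written.
memory = gated_update(memory, key, [10.0, 20.0], [1.0, 0.0], [1.0, 0.0])
print(read(memory, key))
```

With a single shared gate, a channel that is erased is always written by the same amount; the separate gates in this sketch let a channel be cleared without being written, or left alone entirely.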

architecture, efficiency, reasoning

LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

May 21, 2026

Sadia Asif, Mohammad Mohammadi Amiri, Momin Abbas et al.

When LLM agents communicate through shared KV caches for efficiency, you need explicit safeguards—LCGuard shows how to block sensitive information leakage at the representation level without breaking task coordination.

LCGuard is a safety framework that protects sensitive information when multiple AI agents share transformer key-value caches to coordinate tasks. It uses adversarial training to transform shared cache data so that agents can't reconstruct each other's private inputs, while keeping the information useful for task performance.

safety, agents, efficiency

Evaluating Commercial AI Chatbots as News Intermediaries

May 21, 2026

Mirac Suzgun, Emily Shen, Federico Bianchi et al.

AI chatbots excel at retrieving and synthesizing recent news but have three critical weaknesses: they systematically underperform on non-English content, fail primarily due to retrieval errors rather than reasoning mistakes, and are easily fooled by questions containing subtle false information.

This study evaluates six major AI chatbots (Gemini, Grok, Claude, GPT models) on their ability to answer factual news questions across six languages and regions.

evaluation, multimodal, data

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

May 21, 2026

Yunpeng Dong, Jingkai He, Yuze Hou et al.

By tracking only differences between consecutive states rather than full duplicates, DeltaBox reduces AI agent checkpoint/rollback latency from seconds to milliseconds, directly enabling deeper search and larger-scale exploration for reasoning and RL tasks.

DeltaBox is a system that makes AI agents much faster by storing only the changes between checkpoints instead of copying entire sandbox states. Using new OS-level mechanisms for filesystems and process state, it reduces checkpoint/rollback time from hundreds of milliseconds to just milliseconds, enabling agents to explore more possibilities in the same time budget.
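
The general idea of storing only changes between checkpoints can be sketched with an in-memory undo log. This is a toy illustration over a Python dict, not DeltaBox's OS-level mechanism; the class `DeltaCheckpointer` and its methods are hypothetical names.

```python
from typing import Any, Dict, List, Tuple

_MISSING = object()  # sentinel: the key did not exist before the write


class DeltaCheckpointer:
    """Toy sandbox state with delta-based checkpoints (hypothetical)."""

    def __init__(self) -> None:
        self.state: Dict[str, Any] = {}
        # Undo log: one (key, previous_value) entry per write.
        self._undo: List[Tuple[str, Any]] = []
        # A checkpoint is only a position in the undo log, never a state copy.
        self._checkpoints: List[int] = []

    def write(self, key: str, value: Any) -> None:
        self._undo.append((key, self.state.get(key, _MISSING)))
        self.state[key] = value

    def checkpoint(self) -> int:
        self._checkpoints.append(len(self._undo))
        return len(self._checkpoints) - 1

    def rollback(self, checkpoint_id: int) -> None:
        target = self._checkpoints[checkpoint_id]
        # Undo writes in reverse order until the checkpoint position is reached.
        while len(self._undo) > target:
            key, previous = self._undo.pop()
            if previous is _MISSING:
                del self.state[key]
            else:
                self.state[key] = previous
        # Checkpoints taken after this one are no longer reachable.
        del self._checkpoints[checkpoint_id + 1:]


box = DeltaCheckpointer()
box.write("cwd", "/tmp")
cp = box.checkpoint()
box.write("cwd", "/home")
box.write("file.txt", "draft")
box.rollback(cp)
print(box.state)
```

Checkpoint cost here is constant and rollback cost scales with the number of writes since the checkpoint rather than with total state size, which is the property the summary attributes to delta tracking.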

efficiency, agents

FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection

May 21, 2026

Huanchi Wang, Zihang Huang, Yifang Tian et al.

You can build practical, label-efficient log anomaly detectors by using LLMs once offline to structure the problem, then training lightweight domain-specific models that run continuously without expensive LLM calls.

FAME is a system for detecting anomalies in individual log messages rather than groups, using a mixture-of-experts approach that leverages an LLM offline to organize log templates into failure domains. It requires minimal labeled data (as few as 100 examples) and runs efficiently on-premise, achieving 98% accuracy on real production logs while reducing annotation effort by 76x.

efficiency, evaluation, applications

SDPM: Survival Diffusion Probabilistic Model for Continuous-Time Survival Analysis

May 21, 2026

Stanislav R. Kirpichenko, Andrei V. Konstantinov, Lev V. Utkin

Diffusion models can effectively handle continuous-time survival analysis by modeling censored outcomes directly, avoiding parametric assumptions and discretization errors that limit traditional survival methods.

SDPM uses diffusion models to estimate time-to-event distributions from data with censored observations, without requiring assumptions about the hazard function or discretizing time. The model generates samples that can be converted to survival curves, achieving competitive performance on real datasets while accurately recovering underlying continuous distributions.

applications, evaluation

MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking Data

May 21, 2026

Amir Mousavi, Mohammad Sadegh Sirjani, Erfan Nourbakhsh et al.

Mamba's linear-complexity architecture enables real-time cognitive load monitoring from noisy eye-tracking signals on wearable devices—a practical alternative to Transformers for temporal sensor data with frequent gaps.

MambaGaze uses a bidirectional Mamba neural network to assess cognitive load from eye-tracking data in real-time. It handles missing data from eye blinks and tracking failures by explicitly encoding uncertainty, and runs efficiently on edge devices like smartglasses for applications like driver monitoring.

architecture, efficiency, applications

CogAdapt: Transferring Clinical ECG Foundation Models to Wearable Cognitive Load Assessment via Lead Adaptation

May 21, 2026

Amir Mousavi, Mohammad Sadegh Sirjani, Erfan Nourbakhsh et al.

Foundation models trained on large clinical datasets can be effectively adapted to wearable sensor tasks through domain-specific adapters and careful fine-tuning, enabling better cognitive load assessment with limited labeled data.

CogAdapt adapts pre-trained clinical ECG models to assess cognitive load from wearable devices. It uses a learnable adapter to convert 3-lead wearable signals into 12-lead clinical format and a progressive fine-tuning strategy to preserve learned knowledge while adapting to the new task, achieving strong performance on cognitive load prediction.

applications

Variance Reduction for Expectations with Diffusion Teachers

May 20, 2026

Jesse Bettencourt, Xindi Wu, Matan Atzmon et al.

When using diffusion models to guide other tasks, you can dramatically reduce compute cost by resampling cheap diffusion noise multiple times per expensive upstream computation, rather than doing one expensive computation per noise sample.

This paper introduces CARV, a framework for reducing variance in gradient estimates when using pretrained diffusion models as teachers in downstream tasks like text-to-3D generation. By reusing expensive computations (like 3D rendering) across multiple noise samples and applying importance sampling techniques, the method achieves 2-3x speedups without changing the underlying objective.
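
The reuse pattern described above can be sketched as follows: run the expensive upstream step once, then average a cheap signal over several fresh noise draws. The function `amortized_estimate` and its callback parameters are hypothetical, and the sketch leaves out the importance-sampling component mentioned in the summary.

```python
import random
from typing import Any, Callable


def amortized_estimate(
    expensive_step: Callable[[Any], Any],
    cheap_signal: Callable[[Any, float], float],
    params: Any,
    num_noise_samples: int,
    rng: random.Random,
) -> float:
    """Average a noisy signal over many noise draws per expensive computation."""
    upstream = expensive_step(params)  # e.g. a 3D render; executed once
    total = 0.0
    for _ in range(num_noise_samples):
        noise = rng.gauss(0.0, 1.0)  # cheap: a fresh noise sample each time
        total += cheap_signal(upstream, noise)
    return total / num_noise_samples


calls = {"render": 0}


def fake_render(x: float) -> float:
    """Stand-in for an expensive renderer; counts how often it is called."""
    calls["render"] += 1
    return 2.0 * x


estimate = amortized_estimate(
    fake_render, lambda image, eps: image + eps, 1.0, 1000, random.Random(0)
)
print(calls["render"], round(estimate, 2))
```

Averaging over more noise draws lowers the variance of the estimate while the number of expensive calls stays at one.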

efficiency, training, evaluation

Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

May 20, 2026

Benhao Huang, Zhengyang Geng, Zico Kolter

Iterative reasoning models work by learning task-specific attractors in their latent space; scaling test-time compute (more iterations and parallel paths) improves performance on hard problems without needing external verifiers.

This paper explains how AI models can solve hard problems by iteratively refining internal states, like a brain thinking through steps. The key insight is that models learn to create 'attractors'—stable patterns that pull the model toward correct answers.

reasoning, scaling

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

May 20, 2026

Dayal Singh Kalra, Maissam Barkeshli

When scaling up LLM training, use a higher embedding layer learning rate (scaled by model width) to stabilize training and reliably transfer hyperparameters from small to large models—this is the primary reason μP outperforms standard parameterization.

This paper explains why μP (Maximal Update) parameterization works better than standard parameterization for transferring learning rates across different model sizes. The key finding: μP's advantage mainly comes from using a higher learning rate for the embedding layer, which stabilizes training and improves hyperparameter transfer when scaling up language models.

scaling, training, efficiency

EvoStruct: Bridging Evolutionary and Structural Priors for Antibody CDR Design via Protein Language Model Adaptation

May 20, 2026

Mansoor Ahmed, Sujin Lee, Umar Khayaz et al.

Combining evolutionary knowledge from language models with 3D structural constraints solves vocabulary collapse in antibody design, achieving 16% better sequence accuracy and 2.3x more amino acid diversity than structure-only methods.

EvoStruct fixes a critical problem in AI-designed antibodies: neural networks trained on 3D structures alone forget important amino acid patterns from evolution. The method combines a pre-trained protein language model (which knows evolutionary patterns) with structural information, using a special adapter to merge both sources of knowledge.

architecture, training, applications

Velocityformer: Broken-Symmetry-Matched Equivariant Graph Transformers for Cosmological Velocity Reconstruction

May 20, 2026

Tilman Tröster, David Mirkovic, Veronika Oehl et al.

Matching a model's architectural symmetries to the actual symmetries present in your data—not just the underlying physics—significantly improves performance and data efficiency.

Velocityformer is a specialized neural network that reconstructs galaxy velocities from survey data to improve cosmological measurements. By designing the model to match the asymmetric structure of real observations (where one direction—the line of sight—is special), it achieves 35% better accuracy than traditional methods and works well even with very limited training data.

architecture, reasoning, applications

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

May 20, 2026

Sixiong Xie, Zhuofan Shi, Haiyang Shen et al.

Retrieval isn't the main problem for frontier models on deep research tasks; instead, they fail primarily at deriving answers from evidence and calibrating confidence correctly, suggesting future improvements should focus on reasoning and verification rather than search.

DeepWeb-Bench is a challenging benchmark for evaluating AI agents that research questions by searching the web, collecting evidence, and reasoning through answers. Unlike existing benchmarks, it focuses on tasks requiring massive evidence gathering, cross-source verification, and complex multi-step reasoning—areas where current frontier models still struggle significantly.

evaluation, reasoning, agents

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

May 20, 2026

Basel Shbita, Pengyuan Li, Anna Lisa Gentile

Most vision-language models struggle with knowledge-grounded visual reasoning—even large models only reach 75% accuracy when questions require combining visual evidence with external facts, suggesting a major gap in real-world VQA capabilities.

WikiVQABench is a new benchmark for testing vision-language models on questions that require both visual understanding and external knowledge from Wikipedia and Wikidata.

evaluation, multimodal

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

May 20, 2026

Caleb Winston, Ron Yifeng Wang, Azalia Mirhoseini et al.

Compiling agent tasks into code upfront—rather than deciding actions one step at a time—enables parallelization and validation, dramatically reducing latency and errors in web automation.

This paper introduces a compilation approach for web agents that converts natural language tasks into executable code plans instead of executing step-by-step. By generating multiple candidate plans, validating them against tool specifications, and optimizing for parallelization, the system achieves 10x faster execution and better accuracy than existing sequential approaches.

agents, efficiency, reasoning

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

May 20, 2026

Zhepei Wei, Xinyu Zhu, Wei-Lin Chen et al.

RLVR training produces predictable, low-rank weight changes that can be extrapolated mathematically, letting you skip 85% of training compute while matching or exceeding performance on reasoning tasks.

This paper reveals that language models trained with reinforcement learning from verifiable rewards (RLVR) follow surprisingly simple, low-rank weight trajectories.

training, efficiency, reasoning

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

May 20, 2026

Kaiyi Zhang, Wei Wu, Yankai Lin

When training language models with verifiable rewards, focusing on the most discriminative token patterns—rather than averaging all tokens equally—significantly improves learning efficiency and final performance.

This paper improves how language models learn from step-by-step feedback by better understanding which tokens should be rewarded or penalized. The authors show that standard learning methods get distracted by common formatting tokens and miss important patterns that distinguish good answers from bad ones.

training, reasoning, alignment

Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution

May 20, 2026

Weixing Zhang, Bowen Jiang, Rahul Sharma et al.

LLMs can learn grammar adaptation patterns from examples and apply them to new versions, achieving 100% consistency on medium-sized grammars but failing on large-scale ones—suggesting LLMs work best for targeted, smaller grammar updates.

This paper shows how Large Language Models can automatically adapt domain-specific language grammars when their underlying models change, reducing manual work. Testing on real-world languages shows LLMs work well for complex scenarios but struggle with very large grammars (300+ rules).

training, applications

Mem-π: Adaptive Memory through Learning When and What to Generate

May 20, 2026

Xiaoqiang Wang, Chao Wang, Hadi Nekoei et al.

Generating context-specific guidance dynamically outperforms traditional retrieval-based memory for agents—the system learns to abstain when unnecessary and produce only relevant help, improving task success by over 30% on web navigation.

Mem-π is a framework that gives AI agents smarter memory by generating helpful guidance on-the-fly instead of retrieving fixed entries from a database. A separate model learns when to create guidance and what to create, trained to skip unhelpful suggestions and produce only what the agent actually needs for the current task.

agents, training, reasoning

HITL-D: Human In The Loop Diffusion Assisted Shared Control

May 20, 2026

Riley Zilka, Sergey Khlynovskiy, Allie Wang et al.

Diffusion models can effectively assist human operators in robotic control by automating specific subtasks (like orientation), reducing cognitive load while maintaining human oversight—a practical model for human-AI collaboration in physical systems.

This paper presents HITL-D, a shared control system that combines diffusion-based AI policies with human input for robotic manipulation tasks. Instead of requiring operators to control every aspect of a robot arm, the system automatically handles orientation adjustments while the human focuses on positioning, reducing mental workload and task completion time by 40% in user studies.

agents, applications, reasoning

Mitigating Label Bias with Interpretable Rubric Embeddings

May 20, 2026

Calvin Isley, Johann D. Gaebler, Sharad Goel

Replace opaque learned embeddings with interpretable features derived from expert-defined rubrics to reduce bias inheritance from biased training labels in high-stakes decisions.

When training AI models on biased historical data (like past hiring decisions), the models learn and perpetuate those biases. This paper proposes using 'rubric embeddings'—features based on expert-defined criteria—instead of black-box embeddings to make fairer predictions. Testing on university admissions data, the approach reduces group disparities while maintaining quality.

alignment, evaluation

Quality and Security Signals in AI-Generated Python Refactoring Pull Requests

May 20, 2026

Mohamed Almukhtar, Anwar Ghammam, Hua Ming

AI-generated refactoring often improves code but frequently introduces new quality and security issues that developers accept anyway, highlighting the need for automated quality checks before merging AI contributions.

This study examines Python refactoring pull requests created by AI agents, measuring their impact on code quality and security.

evaluation, safety, applications

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

May 18, 2026

Yuxiang Huang, Nuno M. T. Gonçalves, Federico Alvetreti et al.

DashAttention enables efficient long-context processing by combining adaptive sparse selection with differentiable training, outperforming fixed-sparsity methods while maintaining gradient flow through both attention stages.

DashAttention improves how language models handle long documents by using a smarter two-stage attention mechanism. Instead of always selecting the same number of relevant tokens, it adaptively picks different amounts based on what each query needs, while keeping the entire process trainable. This achieves full-attention quality with 75% fewer computations.

efficiency, architecture

A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability

May 18, 2026

Ruitao Liu, Xinyang Tian, Shuo Chen et al.

For distributed model training, executing tasks based on actual readiness rather than pre-committed schedules can dramatically reduce GPU idle time and improve throughput, especially when computation times vary unpredictably.

This paper introduces RRFP, a runtime system that improves GPU training efficiency by executing ready tasks immediately instead of waiting for a pre-planned order. When training large models across multiple GPUs, unpredictable delays in computation cause stages to sit idle.
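
A toy single-stage simulation can show why running ready tasks immediately reduces idle time compared with a pre-committed order. The task tuples, the function names, and the earliest-ready tie-break are assumptions for illustration; RRFP's actual runtime and its dependency tracking are not modeled here.

```python
from typing import List, Tuple

# (name, time at which the task's inputs become ready, run time)
Task = Tuple[str, float, float]


def run_fixed_order(tasks: List[Task]) -> float:
    """Execute strictly in list order, idling whenever the next task is not ready."""
    clock = 0.0
    for _, ready_at, duration in tasks:
        clock = max(clock, ready_at) + duration
    return clock


def run_when_ready(tasks: List[Task]) -> float:
    """Execute any task whose inputs are ready; idle only if none is."""
    pending = list(tasks)
    clock = 0.0
    while pending:
        ready = [t for t in pending if t[1] <= clock]
        if not ready:
            clock = min(t[1] for t in pending)  # jump to the next arrival
            continue
        task = min(ready, key=lambda t: t[1])  # earliest-ready first
        pending.remove(task)
        clock += task[2]
    return clock


# The pre-planned order puts a delayed backward pass second.
schedule = [
    ("fwd_0", 0.0, 1.0),
    ("bwd_0", 5.0, 1.0),
    ("fwd_1", 1.0, 1.0),
    ("fwd_2", 2.0, 1.0),
]
print(run_fixed_order(schedule), run_when_ready(schedule))
```

In this example the fixed order stalls for four time units waiting on `bwd_0`, while the readiness-driven loop fills that gap with the two forward passes that are already runnable.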

training, efficiency, scaling

Code as Agent Harness

May 18, 2026

Xuying Ning, Katherine Tieu, Dongqi Fu et al.

Code is becoming the primary substrate for building reliable, verifiable AI agents. Understanding code as agent harness—the infrastructure layer—is essential for building systems that can plan, remember, use tools, and coordinate across multiple agents.

This survey examines how code serves as the operational foundation for AI agents—not just as output, but as the infrastructure that enables agents to reason, act, model environments, and verify their own behavior.

agents, architecture, reasoning

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

May 18, 2026

Yining Hong, Jiageng Liu, Han Yin et al.

AI agents fail at embodied spatial reasoning primarily because they make poor action choices, not because they can't see—and they confidently stick to wrong answers even when evidence contradicts them, unlike humans who actively seek disconfirming evidence.

ESI-Bench is a benchmark for testing how well AI agents actively explore physical environments to understand spatial relationships. Rather than passively looking at images, agents must decide when to move, manipulate objects, and gather observations to solve tasks.

multimodal, reasoning

SURGE: Approximation-free Training Free Particle Filter for Diffusion Surrogate

May 18, 2026

Lifu Wei, Yinuo Ren, Naichen Shi et al.

You can guide diffusion models without computing gradients or scores—just reweight trajectories and resample periodically, making inference-time improvements cheaper and easier to implement.

This paper introduces SURGE, a gradient-free method for improving diffusion model outputs at inference time. Instead of computing expensive gradients, SURGE reweights and resamples trajectories using a mathematical technique called Girsanov estimation, making guidance simpler and faster while maintaining theoretical guarantees.
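
The reweight-and-resample step of a generic particle filter can be sketched as below. How the log-weights are computed (the Girsanov-based part) is specific to the paper and is not shown; here they are plain inputs, and the function names are hypothetical.

```python
import math
import random
from typing import Any, List, Sequence, Tuple


def normalize_log_weights(log_weights: Sequence[float]) -> List[float]:
    """Convert log-weights to probabilities using the log-sum-exp trick."""
    peak = max(log_weights)
    unnormalized = [math.exp(lw - peak) for lw in log_weights]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def effective_sample_size(weights: Sequence[float]) -> float:
    """1 / sum(w^2): equals the particle count when weights are uniform."""
    return 1.0 / sum(w * w for w in weights)


def resample(
    particles: Sequence[Any], log_weights: Sequence[float], rng: random.Random
) -> Tuple[List[Any], List[float]]:
    """Multinomial resampling: draw particles in proportion to their weights,
    then reset every log-weight to zero."""
    weights = normalize_log_weights(log_weights)
    chosen = rng.choices(particles, weights=weights, k=len(particles))
    return chosen, [0.0] * len(particles)


rng = random.Random(0)
trajectories = ["a", "b", "c", "d"]
log_w = [0.0, -1.0, -2.0, -8.0]  # placeholder log-weights
weights = normalize_log_weights(log_w)
# A common heuristic: resample only when the effective sample size drops.
if effective_sample_size(weights) < 0.5 * len(trajectories):
    trajectories, log_w = resample(trajectories, log_w, rng)
print(trajectories)
```

No gradient of the guidance signal appears anywhere in this loop; only scalar weights per trajectory are needed, which is what makes this family of methods cheap at inference time.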

efficiency

Actionable World Representation

May 18, 2026

Kunqi Xu, Jitao Li, Jianglong Ye et al.

By explicitly modeling object state changes as a learnable manifold, WorldString provides a unified way to represent how objects respond to actions—bridging the gap between perception and control for physical world models.

WorldString is a neural architecture that learns to represent how real-world objects change state over time by processing point clouds or video data. It creates a digital twin of objects that captures their actionable properties, serving as a building block for world models that can predict and interact with the physical world.

architecture, reasoning

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

May 18, 2026

Qianhao Yuan, Jie Lou, Xing Yu et al.

MLLMs can improve fine-grained visual understanding by learning from their own superior performance on evidence-focused crops, using on-policy self-distillation to transfer regional perception skills to full-image reasoning.

This paper addresses a key weakness in multimodal AI models: they struggle to notice small but important details in images. The researchers discovered that models actually perform better when shown cropped images focused on relevant areas versus full images, suggesting the problem isn't recognizing details but finding them.

multimodal, training, efficiency

What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models

May 18, 2026

Payal Chandak, Victoria Alkin, David Wu et al.

LLMs deployed for medical advice have hidden, consistent ethical biases that don't reflect real physician diversity; without explicit auditing and balancing, a single model's values could be imposed at scale on thousands of patients.

This paper audits how large language models handle ethical dilemmas in medicine, revealing that while models discuss multiple ethical perspectives in their reasoning, they make near-identical decisions across repeated attempts.

safety, evaluation, alignment

PIXLRelight: Controllable Relighting via Intrinsic Conditioning

May 18, 2026

Miguel Farinha, Ronald Clark

By conditioning on intrinsic image properties (albedo and shading) extracted from both photos and 3D renders, you can achieve photorealistic relighting with full PBR lighting control while staying fast enough for practical use.

PIXLRelight is a fast neural relighting method that lets you change lighting in photos using physically-based rendering controls. It decomposes images into intrinsic components (albedo, shading, residuals) and uses these to condition a transformer model, enabling realistic lighting adjustments in under 0.1 seconds per image without per-image optimization.

multimodal, architecture, efficiency

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

May 18, 2026

Matthew L. Smith, Jonathan P. Shock, Samuel T. Segun et al.

LLM factual accuracy isn't random—it scales predictably with model size and training data frequency, meaning you can estimate what facts a model will reliably remember based on these two factors.

This paper reveals that LLM factual recall follows a predictable pattern based on two factors: model size and how often a topic appears in training data.

scaling, evaluation, training

DexHoldem: Playing Texas Hold'em with Dexterous Embodied System

May 18, 2026

Feng Chen, Tianzhe Chu, Li Sun et al.

Current embodied systems struggle with the full loop: even when vision models perform well on isolated tasks (67% accuracy), they fail at recovering complete game state needed for decision-making (34% accuracy), and execution errors cascade during real deployment.

DexHoldem is a real-world benchmark that tests embodied AI systems playing Texas Hold'em with a dexterous robot hand. It combines three challenges: executing 14 card-manipulation skills precisely, perceiving game state from images, and making decisions based on that perception—revealing how errors compound when all three run together in closed-loop control.

evaluationagents

General Preference Reinforcement Learning

May 18, 2026

Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal et al.

GPRL solves reward hacking in LLM training by treating quality as multi-dimensional rather than scalar, allowing online RL to work on open-ended tasks without collapsing onto exploitable reward axes.

This paper addresses a gap in LLM training by proposing General Preference Reinforcement Learning (GPRL), which handles open-ended tasks like traditional preference optimization while maintaining the continuous exploration benefits of online RL.

trainingalignmentreasoning

Semantic Generative Tuning for Unified Multimodal Models

May 18, 2026

Songsong Yu, Yuxin Chen, Ying Shan et al.

Using segmentation as a generative training task bridges the gap between visual understanding and generation in multimodal models, improving both capabilities simultaneously rather than training them separately.

This paper shows how to train unified multimodal models (that do both image understanding and generation) more effectively by using image segmentation as a training task. Instead of training understanding and generation separately, the authors use segmentation to align both capabilities, improving the model's ability to understand images and generate them accurately.

multimodaltrainingarchitecture

Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation

May 18, 2026

Kenan Majewski, Marcin Żugaj

Neural networks can improve classical state estimation by learning adaptive forgetting factors that respond to real-time sensor quality, enabling robust UAV navigation during sensor outages and dynamic environments.

This paper presents a learned Kalman filter that adapts to changing noise conditions in UAVs by using a neural network to dynamically adjust how much it trusts past measurements. Instead of using a fixed forgetting factor, the filter learns a memory policy from sensor data, helping it handle sensor failures and vibrations better than traditional adaptive filters.

trainingefficiencyreasoning
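
To make the role of the forgetting factor concrete, here is a minimal sketch of the fading-memory weighting used in classical Sage-Husa noise estimation, reduced to a scalar running estimate of residual variance. The function names are hypothetical, and the per-step list `forgetting_factors` merely stands in for whatever the paper's network outputs; applying the fixed-factor weight formula with a factor that changes each step is a simplification made for illustration, not the authors' method.

```python
def fading_weight(k, b):
    """Sage-Husa fading weight d_k = (1 - b) / (1 - b**(k + 1)), with 0 < b < 1."""
    return (1.0 - b) / (1.0 - b ** (k + 1))


def track_noise_variance(sq_residuals, forgetting_factors, initial=1.0):
    """Running variance estimate that discounts old samples.

    A smaller forgetting factor puts more weight on the newest squared
    residual, so the estimate reacts faster to changes in sensor quality.
    """
    estimate = initial
    history = []
    for k, (r2, b) in enumerate(zip(sq_residuals, forgetting_factors)):
        d = fading_weight(k, b)
        estimate = (1.0 - d) * estimate + d * r2
        history.append(estimate)
    return history


# Noise level jumps from 1.0 to 9.0 after 20 steps (made-up data).
residuals = [1.0] * 20 + [9.0] * 5
slow = track_noise_variance(residuals, [0.99] * len(residuals))
fast = track_noise_variance(residuals, [0.50] * len(residuals))
print(round(slow[-1], 3), round(fast[-1], 3))
```

With a factor near 1 the estimate behaves almost like a plain average over all samples, which is why a fixed value struggles when sensor noise changes abruptly.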

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

May 18, 2026

Minrui Xu, Zilin Wang, Mengyi DENG et al.

Automated environment synthesis and trajectory generation can reduce the data requirements for tool-use agent training by 5x while improving downstream performance, making agentic RL more practical and scalable.

EnvFactory automates the creation of tool-use training environments and realistic multi-turn interaction trajectories for teaching language models to use tools effectively. It generates diverse, natural training data from verified executable environments, enabling more efficient agent training with fewer resources than existing approaches.

agentstrainingdata

RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

May 14, 2026

Xiang Fan, Yuheng Wang, Bohan Fang et al.

Video generation systems lose detail because their decoders ignore the input image—adding reference conditioning to the decoder recovers this information and improves quality by up to 2.1dB PSNR.

RefDecoder improves video generation by conditioning the decoder on a reference image, fixing a common architectural flaw where decoders ignore input details. By injecting reference image information through attention mechanisms during decoding, it preserves fine details and consistency without requiring retraining of existing systems.

architecturemultimodalefficiency

FutureSim: Replaying World Events to Evaluate Adaptive Agents

May 14, 2026

Shashwat Goel, Nikhil Chandak, Arvindh Arun et al.

Current AI agents struggle with long-horizon real-world adaptation—the best models achieve only 25% accuracy predicting events three months ahead, showing this is a critical capability gap for deployed AI systems.

FutureSim is a benchmark that tests AI agents' ability to adapt and predict real-world events over time by replaying actual news and events in chronological order. Agents must forecast future events beyond their training data while interacting with a live stream of information, revealing significant gaps in current frontier models' capabilities.

evaluationagentsreasoning

Quantitative Video World Model Evaluation for Geometric-Consistency

May 14, 2026

Jiaxin Wu, Yihao Pi, Yinling Zhang et al.

Video generators often fail at maintaining consistent 3D geometry in ways that human raters and perceptual metrics don't catch; PDI-Bench provides a diagnostic tool to measure and improve these failures systematically.

This paper introduces PDI-Bench, a quantitative framework for evaluating whether generated videos maintain physically plausible 3D structure and motion.

evaluationmultimodal

Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

May 14, 2026

Sahil Sen, Akhil Kasturi, Elias Lumer et al.

When building agentic search systems, simple grep-based retrieval can outperform vector search, but the agent architecture and how you present tool outputs to the model matter more than retrieval method alone.

This paper compares different retrieval strategies (grep vs. vector search) in AI agent systems that autonomously retrieve information and call tools.

agentsevaluation

Eradicating Negative Transfer in Multi-Physics Foundation Models via Sparse Mixture-of-Experts Routing

May 14, 2026

Ellwil Sharma, Arastu Sharma

Sparse mixture-of-experts routing can solve the problem of conflicting physics domains in foundation models by automatically routing different physics problems to specialized experts while maintaining shared knowledge for universal principles.

This paper tackles negative transfer in multi-physics AI models—where training on different physics problems simultaneously hurts performance. The authors propose Shodh-MoE, which uses sparse expert routing to let different parts of the model specialize in different physics regimes (like fluid dynamics vs. porous media flows) while sharing knowledge where it helps.

architecturescalingefficiency

OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

May 14, 2026

Shang Zhou, Wenhao Chai, Kaiyuan Liu et al.

Instead of judging multiple reasoning attempts individually (which is noisy), compare them pairwise and aggregate votes to find the best solution—this scales test-time compute breadth more reliably than single-trace depth scaling.

OpenDeepThink improves LLM reasoning by running multiple solution attempts in parallel and selecting the best one using pairwise comparisons between candidates, rather than trying to judge each solution independently. The method uses Bradley-Terry aggregation to rank candidates based on LLM pairwise judgments, then evolves the top solutions using critiques from comparisons.

reasoningevaluation
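
As a rough illustration of the aggregation step, the sketch below fits Bradley-Terry strengths from a matrix of pairwise wins using the standard minorization-maximization update. The win counts are invented, the function name is hypothetical, and nothing here reproduces the paper's judging prompts or its evolution step; it only shows how pairwise outcomes can be turned into a ranking.

```python
def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is how many times candidate i was preferred over candidate j
    (the diagonal is assumed to be zero). Returns strengths that sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(updated)
        p = [x / norm for x in updated]
    return p


# Three candidate solutions, four pairwise judgments per pair (made-up data).
wins = [
    [0, 3, 4],
    [1, 0, 3],
    [0, 1, 0],
]
strengths = bradley_terry(wins)
best = max(range(len(strengths)), key=strengths.__getitem__)
print(best, [round(s, 3) for s in strengths])
```

Fitting a single strength per candidate lets noisy individual judgments average out, which is the intuition behind preferring pairwise aggregation over scoring each attempt in isolation.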

MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs

May 14, 2026

Rui Wen, Mark Russinovich, Andrew Paverd et al.

LLM backdoors don't need suspicious text triggers—attackers can hide them in positional encoding, making them invisible to content-based defenses and activatable through normal conversation length patterns.

This paper reveals a new way to attack large language models by exploiting how they process word positions rather than modifying the text itself. Researchers show that backdoors can be triggered by input length alone, allowing attackers to make models leak secrets or misbehave without leaving obvious traces in the conversation.

safety

Evidential Reasoning Advances Interpretable Real-World Disease Screening

May 14, 2026

Chenyu Lian, Hong-Yu Zhou, Jing Qin

Adding case-based reasoning with historical examples to medical screening models improves both accuracy and interpretability—doctors can see which past cases influenced the diagnosis and where the model detected problems.

EviScreen is a medical image screening system that improves disease detection by retrieving and reasoning with similar historical cases. Instead of making predictions in isolation, it finds relevant past examples and uses them to guide its decision-making, while also generating interpretable maps showing where abnormalities are located.

Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment

May 14, 2026

Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim et al.

Combining unstructured clinical text with structured EHR tables through retrieval-augmented alignment produces significantly more accurate and complete patient timelines than using either source alone, with 35% of clinically important events appearing only in text.

This paper tackles a critical healthcare problem: reconstructing accurate timelines of patient events from messy clinical records. Clinical narratives (text) contain rich context but vague timing, while structured EHR tables have precise timestamps but miss many events.

multimodalapplications

Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands

May 14, 2026

Pratinav Seth, Vinay Kumar Sankarapu

Behavioral evaluations alone cannot verify the safety claims regulators now demand—you need mechanistic evidence like activation analysis to actually verify what's happening inside AI models, not just what they output.

This paper argues that current AI safety evaluation methods (like red-teaming and behavioral testing) cannot verify the deep safety properties that AI governance frameworks now require, such as absence of hidden objectives or resistance to loss-of-control.

safetyevaluationalignment

Hand-in-the-Loop: Improving Dexterous VLA via Seamless Interventional Correction

May 14, 2026

Zhuohang Li, Liqun Huang, Wei Xu et al.

Seamlessly blending human intervention with robot policy execution—rather than abrupt takeovers—dramatically reduces manipulation failures in dexterous tasks and produces better-trained policies from human correction data.

This paper addresses a key problem in robotic hand control: when humans take over from an AI policy during manipulation tasks, abrupt hand configuration changes ('gesture jumps') cause failures. Hand-in-the-Loop smoothly blends human corrections with the robot's ongoing actions, reducing takeover disruptions by 99.8% and improving task success rates by 19% when used to train better policies.

agentstraining

MeMo: Memory as a Model

May 14, 2026

Ryan Wei Heng Quek, Sanghyuk Lee, Alfred Wei Lun Leong et al.

You can add new knowledge to any LLM without touching its weights by training a separate memory model that retrieves and augments the LLM's responses—making it practical for real-world applications needing frequent updates.

MeMo introduces a modular memory model that stores new knowledge separately from a frozen LLM, enabling efficient updates without retraining. It works with any LLM (open or proprietary), handles complex document relationships, and maintains constant retrieval cost regardless of corpus size.

trainingefficiency

Self-Distilled Agentic Reinforcement Learning

May 14, 2026

Zhengxi Lu, Zhiyuan Yao, Zhuowen Han et al.

Combining RL with selective token-level distillation through a gating mechanism significantly improves LLM agent performance on complex tasks, achieving 7-10% gains over standard RL approaches while avoiding training instability.

This paper improves how language model agents learn through reinforcement learning by combining trajectory-level rewards with dense token-level guidance. The key innovation is a gating mechanism that selectively uses teacher signals—strengthening learning from good decisions and softly ignoring bad teacher suggestions—making multi-turn agent training more stable and effective.

agentstraining

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

May 12, 2026

Runhui Huang, Jie Wu, Rui Yang et al.

Self-reflective multimodal models can improve generation quality by learning to reason about user intent and autonomously correct their outputs using decomposed, verifiable rewards from language models.

AlphaGRPO improves how unified multimodal models generate images and text by teaching them to reason about what users want and fix their own mistakes. It uses a novel reward system that breaks down complex requests into simple checkable questions, allowing the model to learn from reliable feedback without needing extra training setup.

multimodalreasoningtraining

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

May 12, 2026

Di Wu, Zixiang Ji, Asmi Kawatkar et al.

Long-term memory for agents requires more than just storing task outcomes; agents need to internalize environment-specific patterns, workflows, and failure modes to become truly experienced colleagues, and current memory systems still struggle with this despite recent advances.

This paper introduces LongMemEval-V2, a benchmark for testing whether AI agents can build long-term memory of specialized web environments. It includes 451 questions about five types of memory (state recall, workflow knowledge, failure modes, etc.) paired with massive history trajectories up to 500 steps and 115M tokens.

agentsevaluationreasoning

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

May 12, 2026

Kexuan Shi, Hanxuan Li, Zeju Qiu et al.

Pion's orthogonal update mechanism preserves the singular values of weight matrices during training, providing a geometrically principled alternative to additive-update optimizers like Adam with competitive performance.

Pion is a new optimizer for training large language models that updates weights using orthogonal transformations instead of adding gradients like Adam does. By preserving the singular values of weight matrices, it keeps the spectral properties stable while still allowing the model to learn, offering a more geometrically-grounded approach to optimization.

training
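
A short derivation shows why an orthogonal equivalence transformation leaves the spectrum untouched. It assumes the update has the generic form W' = R W Q with square orthogonal R and Q; the paper's exact parameterization is not given in this summary, so treat the symbols as illustrative.

```latex
W' = R\,W\,Q, \qquad R^\top R = I, \quad Q^\top Q = I
\;\Longrightarrow\;
W'^\top W' = Q^\top W^\top R^\top R\, W\, Q = Q^\top \left(W^\top W\right) Q .
```

Because Q is orthogonal, the right-hand side is similar to W^T W and has the same eigenvalues, so W' and W share the same singular values.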

Elastic Attention Cores for Scalable Vision Transformers

May 12, 2026

Alan Z. Song, Yinjie Chen, Mu Nan et al.

You can build efficient vision transformers by routing all patch interactions through a small set of learned core tokens instead of using all-to-all attention, achieving linear complexity without sacrificing performance.

This paper proposes VECA, a vision transformer that replaces quadratic all-to-all attention with linear-time attention using learned "core" tokens as communication hubs. Instead of every patch attending to every other patch, all patches only interact through a small set of learned cores, reducing computation from O(N²) to O(N) while maintaining competitive accuracy on vision tasks.

architectureefficiencyscaling
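
The complexity argument can be sketched with a two-stage attention in which patches never attend to each other directly. This is an assumed, simplified reading of "cores as communication hubs" (single head, no learned projections, keys reused as values), not the VECA architecture itself, and all names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head dot-product attention where keys double as values."""
    d = len(keys_values[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, kv)) / math.sqrt(d) for kv in keys_values
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * kv[c] for w, kv in zip(weights, keys_values)) for c in range(d)]
        )
    return out


def core_attention(patches, cores):
    # Stage 1: each of the M cores gathers from all N patches (M * N scores).
    updated_cores = attend(cores, patches)
    # Stage 2: each of the N patches reads from the M cores (N * M scores).
    return attend(patches, updated_cores)


def score_counts(n_patches, n_cores):
    """Attention scores computed by all-to-all vs. core-routed attention."""
    return n_patches * n_patches, 2 * n_patches * n_cores


patches = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [0.2, 0.9], [0.9, 0.1], [0.4, 0.6]]
cores = [[1.0, 0.0], [0.0, 1.0]]
mixed = core_attention(patches, cores)
print(len(mixed), score_counts(1024, 16))
```

With the number of cores held fixed, the score count grows linearly in the number of patches, which is the source of the O(N) claim.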

Task-Adaptive Embedding Refinement via Test-time LLM Guidance

May 12, 2026

Ariel Gera, Shir Ashury-Tahan, Gal Bloch et al.

You can boost embedding model performance on hard search tasks by having an LLM refine queries at test-time, making embeddings practical for scenarios where running LLMs on all documents is too expensive.

This paper shows how to improve embedding models for search and classification by using an LLM to refine user queries in real-time. Instead of changing the embedding model itself, the approach adjusts the query representation based on feedback from a small sample of documents, achieving up to 25% improvement on challenging tasks without requiring expensive LLM processing at scale.

efficiencyevaluation

Learning, Fast and Slow: Towards LLMs That Adapt Continually

May 12, 2026

Rishabh Tiwari, Kusha Sareen, Lakshya A Agrawal et al.

Combining parameter updates with context optimization lets LLMs learn new tasks 3x more efficiently while staying closer to their original capabilities and avoiding the forgetting that comes from pure fine-tuning.

This paper proposes Fast-Slow Training (FST), a method that combines two learning mechanisms for LLMs: updating model parameters (slow learning) and optimizing the input context (fast learning). By separating task-specific adaptation from general knowledge, FST achieves better sample efficiency, reduces catastrophic forgetting, and maintains the model's ability to learn new tasks over time.

trainingefficiencyreasoning

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

May 12, 2026

Xuhao Hu, Xi Zhang, Haiyang Xu et al.

Agents perform better when trained to decide dynamically between GUI actions and tool calls rather than using only one approach—this hybrid strategy improved accuracy by 66% on real-world tasks.

ToolCUA trains computer agents to intelligently choose between GUI actions (clicks, typing) and tool calls (APIs) by synthesizing diverse training trajectories from existing data and using reinforcement learning to optimize when to switch between action types. This solves a key problem for digital agents: knowing when to use high-level tools versus low-level GUI interactions.

agentstrainingreasoning

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

May 12, 2026

Guohui Zhang, XiaoXiao Ma, Jie Huang et al.

When training models to generate audio and video together, treating each modality's learning separately and protecting audio-specific layers from video interference leads to better results than standard single-objective RL approaches.

OmniNFT improves joint audio-video generation by using reinforcement learning with three key techniques: routing rewards separately to each modality, preventing video gradients from interfering with audio processing, and focusing optimization on synchronization regions. This addresses real-world needs for high-quality audio, high-quality video, and tight audio-video alignment simultaneously.

multimodaltraining

MEME: Multi-entity & Evolving Memory Evaluation

May 12, 2026

Seokwon Jung, Alexander Rubinstein, Arnas Uselis et al.

LLM agents struggle with dependency reasoning in persistent memory—when facts relate to each other, systems collapse to near-random performance, and fixing this requires impractically expensive configurations.

This paper introduces MEME, a benchmark for evaluating how well AI agents manage information across multiple sessions. It tests six memory tasks including complex scenarios like tracking dependencies between facts and handling deletions.

evaluationagentsreasoning

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

May 12, 2026

Sagi Ahrac, Noya Hochwald, Mor Geva

Routers in sparse mixture-of-experts models work best when they maintain geometric alignment with their experts—understanding this coupling can improve routing stability and reduce the need for complex auxiliary losses.

This paper reveals that routers in Sparse Mixture-of-Experts models learn a geometric relationship with their experts: router weights and expert weights receive gradients along the same directions, causing them to specialize together.

architecturetrainingefficiency

KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference

May 12, 2026

Alireza Nadali, Patrick Cooper, Ashutosh Trivedi et al.

You can extend transformer context length by simply reusing and accumulating the KV cache across chunks—no training needed, and the approach stays numerically stable even across very long sequences.

KV-Fold enables long-context inference by treating the key-value cache as an accumulator that gets passed between sequence chunks. The model processes each chunk while attending to cached information from previous chunks, allowing it to handle contexts up to 128K tokens without retraining or architectural changes.

efficiency

Solve the Loop: Attractor Models for Language and Reasoning

May 12, 2026

Jacob Fein-Ashley, Paria Rashidinejad

Attractor Models make iterative refinement practical by using implicit differentiation to solve fixed points, enabling smaller models (27M-770M parameters) to outperform much larger ones on reasoning and language tasks without the training instability of traditional recurrent architectures.

This paper introduces Attractor Models, which improve on looped Transformers by using implicit differentiation to solve for fixed points in latent representations.

architecturereasoningefficiency

Search Your Block Floating Point Scales!

May 12, 2026

Tanmaey Gupta, Hayden Prairie, Xiaoxia Wu et al.

Smarter scale selection in Block Floating Point quantization can reduce quantization error by 27% and improve language model performance by up to 15 points without slowing down inference.

This paper improves quantization for AI models by optimizing how Block Floating Point formats choose their scale factors. Instead of using a fixed maximum-based scale, ScaleSearch searches for better scales that minimize quantization error. The method works with existing quantization techniques and includes a specialized attention algorithm, showing 15-point improvements on math reasoning tasks.

efficiency
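
The idea of searching for a scale instead of deriving it from the block maximum can be sketched as follows. This is a toy version under stated assumptions: power-of-two scales, signed integer mantissas, mean squared error as the objective, and a tiny candidate set. The function names are hypothetical and this is not the paper's ScaleSearch procedure.

```python
import math


def quantize_block(block, scale, bits=4):
    """Round each value to a signed integer mantissa times a shared scale."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in block]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def max_based_scale(block, bits=4):
    """Smallest power-of-two scale that covers the block maximum without clipping."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / qmax))


def searched_scale(block, bits=4, n_candidates=3):
    """Try the max-based scale and a few smaller ones; keep the lowest-error one."""
    base = max_based_scale(block, bits)
    candidates = [base / (2 ** k) for k in range(n_candidates)]
    return min(candidates, key=lambda s: mse(block, quantize_block(block, s, bits)))


# One block with a single outlier (made-up values).
block = [0.11, -0.27, 0.34, 0.19, -0.08, 0.41, 0.23, 1.05]
s_max = max_based_scale(block)
s_best = searched_scale(block)
print(mse(block, quantize_block(block, s_max)), mse(block, quantize_block(block, s_best)))
```

Because the max-based scale is always among the candidates, the searched scale can never do worse on the chosen error metric; a smaller scale wins when clipping one outlier costs less than coarsely rounding everything else.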

A proximal gradient algorithm for composite log-concave sampling

May 12, 2026

Linghai Liu, Sinho Chewi

The algorithm enables efficient sampling from composite log-concave distributions with convergence guarantees that scale well with dimension and condition number, useful for Bayesian inference and optimization problems.

This paper presents a new algorithm for sampling from log-concave probability distributions whose potential (negative log-density) is a sum of two functions, f and g. The method uses gradient information and a special sampling oracle, achieving convergence rates that match the best known results. The approach extends to non-smooth functions and distributions with weaker mathematical properties.

scaling

Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

May 12, 2026

Guinan Su, Yanwu Yang, Xueyan Li et al.

By training models to handle multiple parallel computation streams instead of sequential message exchanges, you can build faster, more responsive AI agents that can act while thinking and react to new information without waiting for previous operations to complete.

This paper proposes Multi-Stream LLMs, which replace the single sequential message stream in current language models with multiple parallel streams for inputs, outputs, and reasoning. This allows models to read and write simultaneously, think while acting, and process different types of information in parallel—addressing fundamental bottlenecks in how AI agents currently operate.

architectureagentstraining

Conformal Path Reasoning: Trustworthy Knowledge Graph Question Answering via Path-Level Calibration

May 8, 2026

Shuhang Lin, Chuhao Zhou, Xiao Lin et al.

Conformal Path Reasoning provides statistical guarantees that your KGQA system will include the correct answer in its output set, while keeping that set compact and practical—solving a real reliability problem in knowledge graph reasoning.

This paper improves Knowledge Graph Question Answering by adding statistical guarantees to answer reliability. It uses conformal prediction—a technique that creates sets of answers with proven coverage rates—combined with a neural network that learns to score reasoning paths better. The result is more trustworthy answers with smaller, more useful prediction sets.

reasoningevaluationsafety
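
For readers unfamiliar with conformal prediction, here is a minimal sketch of the generic split-conformal recipe: calibrate a threshold on held-out nonconformity scores, then keep every candidate answer whose score falls under it. The scores and answer names are invented, the function names are hypothetical, and the paper's learned path scorer is not modeled.

```python
import math


def conformal_threshold(cal_scores, alpha=0.1):
    """Finite-sample quantile of calibration nonconformity scores.

    cal_scores holds one score per calibration question, computed for the
    true answer (lower means the model ranked the truth more confidently).
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: keep everything
    return sorted(cal_scores)[k - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate answer whose nonconformity is within the threshold."""
    return [name for name, score in candidate_scores.items() if score <= threshold]


# 19 calibration scores (made-up): 0.05, 0.10, ..., 0.95.
calibration = [i / 20 for i in range(1, 20)]
threshold = conformal_threshold(calibration, alpha=0.1)

# Hypothetical candidate answers for a new question, scored the same way.
candidates = {"Paris": 0.20, "Lyon": 0.95, "Marseille": 0.90}
print(threshold, prediction_set(candidates, threshold))
```

Under the usual exchangeability assumption, sets built this way contain the true answer with probability at least 1 - alpha; a better scoring function does not change that guarantee but makes the sets smaller.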

Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping

May 8, 2026

Maryam Maghsoudi, Shihab Shamma

You can decode what someone is imagining saying by training on their brain activity while listening to speech, then mapping imagined brain patterns to listened patterns—solving the data scarcity problem in brain-computer interfaces.

This paper shows how to decode imagined speech from brain recordings (MEG) by training on the more abundant listened speech data instead. Researchers mapped brain activity from imagining speech to brain activity from listening to speech, then used a decoder trained on listened speech to identify imagined words. This approach works without needing large imagined speech datasets.

multimodal

GRAPHLCP: Structure-Aware Localized Conformal Prediction on Graphs

May 8, 2026

Peyman Baghershahi, Fangxin Wang, Debmalya Mandal et al.

When using GNNs for predictions, you can get tighter, more reliable uncertainty estimates by explicitly using graph structure rather than just embedding similarity—this gives you both statistical guarantees and practical efficiency.

GRAPHLCP improves uncertainty quantification for graph neural networks by using graph structure to make better predictions with guaranteed coverage. Instead of just looking at embedding similarity, it uses graph topology and a PageRank-based approach to identify similar nodes and weight predictions appropriately, reducing wasted prediction sets while maintaining statistical guarantees.

evaluationreasoning

EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction

May 8, 2026

Wei Yu, Yunhang Qian

State space models offer a practical alternative to transformers for event-based image reconstruction, achieving better results with linear computational complexity instead of quadratic, making high-resolution processing feasible.

EmambaIR uses a new type of neural network architecture (state space models) to reconstruct clear images from event camera data.

architectureefficiencymultimodal

VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection

May 8, 2026

James Petullo, Sonny George, Dylan Cashman et al.

You can make confidence-weighted answer selection 47% cheaper by clustering similar reasoning traces and only evaluating unique ones, without sacrificing accuracy.

VecCISC reduces the cost of weighted majority voting for LLM reasoning by filtering out duplicate or low-quality reasoning traces before sending them to a critic model. It uses semantic similarity to identify which candidate answers are worth evaluating, cutting token usage by 47% while maintaining accuracy across math, science, and reasoning tasks.

efficiencyreasoning

Flow-OPD: On-Policy Distillation for Flow Matching Models

May 8, 2026

Zhen Fang, Wenxuan Huang, Yu Zeng et al.

On-policy distillation with specialized teachers can resolve conflicting optimization goals in multi-objective image generation, achieving 10-point improvements over standard reinforcement learning approaches while maintaining quality across all metrics.

Flow-OPD is a training method that improves text-to-image models by using specialized teacher models and on-policy distillation to align multiple competing objectives (like image quality, text accuracy, and aesthetics).

trainingalignmentefficiency

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

May 8, 2026

Manish Bhattarai, Ismael Boureima, Nishath Rajiv Ranasinghe et al.

Structured, multi-criterion rewards grounded in real documents help models develop generalizable reasoning skills that transfer to unseen tasks better than single holistic scores.

This paper shows how to train AI models to reason better by grading their responses on multiple specific criteria instead of just right/wrong. The researchers created detailed rubrics from scientific documents and used them to train a language model with a technique called GRPO, which optimizes for partial credit across different dimensions.

trainingreasoningevaluation

The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents

May 8, 2026

Jiayuan Liu, Tianqin Li, Shiyi Du et al.

Giving LLM agents access to longer memory doesn't automatically improve performance; it can actually harm cooperation in multi-agent settings by shifting how they reason about the future, not by making them more suspicious.

When LLMs can remember more conversation history, they actually cooperate less in multi-agent games, a problem the authors call the memory curse. The researchers found that expanded context windows cause models to lose forward-looking intent rather than become paranoid, and they support this by showing that synthetic positive history and targeted fine-tuning can restore cooperation.

agentsreasoningalignment

CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

May 8, 2026

James Petullo, Nianwen Xue

Allocating more computational effort to harder SQL generation tasks—by exploring more candidate solutions—significantly improves accuracy without needing larger models.

CA-SQL improves LLM performance on complex SQL generation tasks by estimating question difficulty and dynamically adjusting how many candidate queries to explore. It uses evolutionary search principles and a custom voting method to find better SQL solutions, achieving state-of-the-art results on the BIRD benchmark's hardest problems.

reasoningapplications

Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs

May 8, 2026

Gugan Thoppe, L. A. Prashanth, Ankur Naskar et al.

You can now use principled Q-learning algorithms for risk-sensitive decision-making (exponential utility), with mathematical guarantees that they find optimal policies—previously this lacked solid theoretical foundations.

This paper develops reinforcement learning algorithms for optimizing exponential utility in decision-making problems, which is important for risk-sensitive applications. The authors prove that their Q-learning-style algorithms converge to optimal policies and provide theoretical guarantees on convergence speed.

reasoning

Accurate and Efficient Statistical Testing for Word Semantic Breadth

May 8, 2026

Yo Ehara

When statistically comparing semantic breadth of words using embeddings, you must account for directional differences or your significance tests will be unreliable—this paper provides a practical, GPU-accelerated solution.

This paper solves a statistical problem in measuring how broadly a word's meaning spreads across different contexts using word embeddings. When comparing two words' semantic breadth, naive statistical tests fail because they confuse directional differences (where words point in different semantic directions) with actual breadth differences.

evaluation

Uncertainty-Aware Structured Data Extraction from Full CMR Reports via Distilled LLMs

May 8, 2026

Yi Yu, Parker Martin, Zhenyu Bu et al.

Distilled LLMs can extract medical data from unstructured reports with high accuracy and built-in confidence estimates, enabling clinicians to prioritize which extractions need human review.

CMR-EXTR converts free-text cardiac MRI reports into structured data with confidence scores for each extracted field. Using a lightweight distilled language model, it achieves 99.65% accuracy while running entirely offline, making it practical for clinical use without requiring constant API access.

applicationsefficiencyevaluation

Fast Byte Latent Transformer

May 8, 2026

Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz et al.

Byte-level models can cut memory-bandwidth cost during generation by over 50% by predicting multiple bytes in parallel instead of one at a time, making them more practical for real-world use without sacrificing quality.

Byte-level language models match token-based models but generate slowly because they produce one byte at a time. This paper introduces three faster variants: BLT-D uses diffusion to generate multiple bytes per step, BLT-S uses local drafting with verification, and BLT-DV combines both. All reduce memory bandwidth costs by over 50% during generation.

efficiencyarchitecture

ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation

May 7, 2026

Omar El Khalifi, Thomas Rossi, Oscar Fossey et al.

You can control both character motion and camera angles in video generation by using a two-phase conditioning approach that prioritizes geometric consistency, without needing to train new models.

ActCam enables precise control over both actor motion and camera movement in AI-generated videos without requiring training. It works with existing video generation models by providing carefully sequenced guidance: first using pose and depth information to establish scene structure, then refining details with pose-only guidance.

multimodal, applications, architecture

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

May 7, 2026

Minbin Huang, Han Shi, Chuanyang Zheng et al.

You don't need separate expert sets per layer in MoE models: a shared expert pool with independent routers works better and uses fewer parameters, which suggests that standard per-layer expert allocation is wasteful.

UniPool replaces the standard Mixture-of-Experts design, in which each layer has its own expert set, with a single pool of experts shared by all layers. This reduces redundancy and lets expert parameters grow sublinearly with model depth, while improving performance and cutting parameter count by 30-60% compared to standard MoE.
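A quick parameter count shows why sharing the pool helps. This is back-of-the-envelope arithmetic under stated assumptions, not the paper's accounting: it assumes one linear router per layer, and every number below (24 layers, 8 experts per layer, a pool of 96 experts, 1M parameters per expert, model width 1024) is made up for illustration. The saving you get depends entirely on the pool size you pick.

```python
def per_layer_moe_params(num_layers, experts_per_layer, params_per_expert, d_model):
    """Expert plus router parameters when every layer owns its experts."""
    experts = num_layers * experts_per_layer * params_per_expert
    routers = num_layers * d_model * experts_per_layer  # one linear router per layer
    return experts + routers


def shared_pool_params(num_layers, pool_size, params_per_expert, d_model):
    """Expert plus router parameters when all layers route into one shared pool."""
    experts = pool_size * params_per_expert     # experts are stored once
    routers = num_layers * d_model * pool_size  # each layer scores the whole pool
    return experts + routers


# Illustrative sizes only; none of these come from the paper.
baseline = per_layer_moe_params(24, 8, 1_000_000, 1024)
shared = shared_pool_params(24, 96, 1_000_000, 1024)
reduction = 1 - shared / baseline
print(baseline, shared, round(reduction, 3))
```

Note the trade-off the arithmetic exposes: expert storage no longer scales with depth, but each router grows because it has to score the whole pool.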

architecture, efficiency, scaling

BAMI: Training-Free Bias Mitigation in GUI Grounding

May 7, 2026

Borui Zhang, Bo Zhang, Bo Wang et al.

You can significantly improve GUI agent accuracy on complex interfaces without retraining by using a two-step approach: first narrow down the region of interest, then select the best candidate from remaining options.

This paper identifies why GUI grounding models (used by AI agents to click and interact with interfaces) fail on complex screens, finding two main problems: high image resolution causes precision errors, and complex UI elements create ambiguity.

agents, evaluation, efficiency

EMO: Pretraining Mixture of Experts for Emergent Modularity

May 7, 2026

Ryan Wang, Akshita Bhagia, Sewon Min

By constraining tokens within the same document to share expert pools during pretraining, EMO creates naturally modular experts that specialize in semantic domains (math, code, etc.), enabling practical memory-efficient deployment without sacrificing performance.

EMO is a Mixture-of-Experts language model designed to work efficiently when you only need a subset of its capabilities. Instead of letting every token route to any expert, EMO restricts the tokens of a document to a shared expert pool during pretraining, so code-heavy documents use code experts, math documents use math experts, and so on.
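The routing constraint can be sketched as a mask on router logits. This is a minimal illustration under assumptions: how EMO chooses each document's pool is not described in the summary, so the pool is simply passed in as `allowed`, and the top-k softmax router shown here is a common MoE pattern rather than the paper's confirmed design.

```python
import math


def route_token(logits, allowed, top_k=2):
    """Pick top_k experts for one token, restricted to the document's expert pool.

    logits: one router score per expert.
    allowed: set of expert indices the current document may use (assumed given).
    Returns {expert_index: routing_weight}, with weights summing to 1.
    """
    assert len(allowed) >= top_k, "pool must contain at least top_k experts"
    masked = [score if i in allowed else -math.inf for i, score in enumerate(logits)]
    chosen = sorted(range(len(masked)), key=lambda i: masked[i], reverse=True)[:top_k]
    peak = max(masked[i] for i in chosen)
    exps = [math.exp(masked[i] - peak) for i in chosen]  # stable softmax over chosen experts
    total = sum(exps)
    return {i: e / total for i, e in zip(chosen, exps)}


# Expert 3 has the highest score but is outside this document's pool.
weights = route_token(logits=[2.0, 0.5, 1.5, 3.0], allowed={0, 1, 2}, top_k=2)
print(weights)
```

If routing is constrained this way throughout training, experts outside a domain's pool are never needed for that domain, which is what makes loading only a subset at deployment plausible.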

architecture, efficiency, training

Verifier-Backed Hard Problem Generation for Mathematical Reasoning

May 7, 2026

Yuhang Lai, Jiazhan Feng, Yee Whye Teh et al.

Using an independent verifier to validate problem correctness prevents reward hacking in AI-generated math problems, enabling better training data creation without human experts.

This paper tackles the problem of generating valid and challenging math problems for training AI models. Instead of relying on humans or simple self-play (which often produces invalid problems), the authors introduce VHG, a system with three players: a problem setter, a solver, and an independent verifier.
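Why the independent verifier matters can be shown with a toy reward. The reward shape below is an assumption made for illustration (zero for problems the verifier rejects, otherwise the solver's failure rate); the summary does not give VHG's actual reward, and the arithmetic "problems" and `verify_addition` are hypothetical.

```python
def setter_reward(problem, verifier, solver_answers):
    """Reward for the problem setter (assumed shape, not the paper's formula).

    Invalid problems earn nothing; valid ones earn more the more often the
    solver fails, so the setter is pushed toward hard but correct problems.
    """
    if not verifier(problem):
        return 0.0
    failures = sum(1 for answer in solver_answers if answer != problem["answer"])
    return failures / len(solver_answers)


def verify_addition(problem):
    # Hypothetical verifier: independently recomputes the claimed answer.
    return problem["a"] + problem["b"] == problem["answer"]


valid = {"a": 17, "b": 25, "answer": 42}
invalid = {"a": 17, "b": 25, "answer": 40}  # the setter's claimed answer is wrong

print(setter_reward(valid, verify_addition, [42, 41, 42, 40]))
print(setter_reward(invalid, verify_addition, [42, 42, 42, 42]))
```

Without the verifier gate, the invalid problem would look maximally hard, because a correct solver never matches its wrong answer key. That is the reward-hacking route the verifier closes.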

training, reasoning, data

Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML

May 7, 2026

Jai Moondra, Ayela Chughtai, Bhargavi Lanka et al.

Don't trust global LLM leaderboards—they hide structured disagreement across languages and tasks. Use language-specific rankings or small model portfolios instead to match diverse user needs.

Current LLM leaderboards rank models using global voting patterns, but this masks how much opinions differ by language and task. This paper shows that two-thirds of votes cancel out and that top models are statistically indistinguishable at the global level. Grouping by language instead reveals coherent subpopulations with consistent rankings.
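One simple way to read "votes cancel out" is that, for each pair of models, every vote for A over B is offset by a vote for B over A. The sketch below uses that reading, which is an assumption and may differ from the paper's definition; the vote data is made up.

```python
from collections import Counter


def cancelled_fraction(votes):
    """votes: iterable of (winner, loser) pairs.

    For each model pair, opposing votes offset one another; returns the
    share of all votes that are offset this way.
    """
    wins = Counter(votes)
    cancelled = 0
    seen = set()
    for (a, b), n in wins.items():
        pair = frozenset((a, b))
        if pair in seen:
            continue
        seen.add(pair)
        cancelled += 2 * min(n, wins.get((b, a), 0))
    return cancelled / sum(wins.values())


# Made-up votes as (language, winner, loser): English voters prefer A, German voters prefer B.
votes = (
    [("en", "A", "B")] * 9 + [("en", "B", "A")] * 1
    + [("de", "A", "B")] * 1 + [("de", "B", "A")] * 9
)
global_fraction = cancelled_fraction((w, l) for _, w, l in votes)
english_fraction = cancelled_fraction((w, l) for lang, w, l in votes if lang == "en")
print(global_fraction, english_fraction)
```

In this toy data the pooled votes cancel completely and the models look tied, while within each language the preference is clear.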

evaluation, multimodal

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

May 7, 2026

Yuxing Liu, Jianyu Wang, Tong Zhang

Use the same optimizer for finetuning as you used for pretraining—it significantly reduces catastrophic forgetting while maintaining task performance, even outperforming parameter-efficient methods like LoRA.

When finetuning large language models, reusing the pretraining optimizer reduces forgetting of previously learned knowledge while maintaining or improving performance on new tasks.

training, efficiency

When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

May 7, 2026

Sushant Gautam, Finn Schwall, Annika Willoch Olstad et al.

When deploying LLMs in new languages or sectors without existing safety benchmarks, you can't collapse safety comparisons into a single score—you must report the full context: which scenarios, which judge, which risk measure, and the uncertainty around each comparison.

This paper tackles a real-world problem: comparing AI models for safety when no labeled benchmark exists yet. Instead of relying on ground-truth labels, the authors validate safety scores through three checks—whether models respond to safety changes, whether model differences dominate over measurement noise, and whether results stay consistent across retests.
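The second check, whether model differences dominate measurement noise, can be approximated with a variance ratio. This is a rough sketch under assumptions: the paper's actual statistic is not given in the summary, the scores are invented, and `separation_ratio` is a hypothetical helper that compares the spread of per-model mean scores with the average spread across retests of the same model.

```python
from statistics import mean, pvariance


def separation_ratio(scores_by_model):
    """scores_by_model: {model_name: [safety score from each retest]}.

    Returns between-model variance divided by average within-model (retest)
    variance. Values well above 1 suggest real differences between models;
    values near or below 1 suggest the comparison is mostly noise.
    """
    model_means = [mean(scores) for scores in scores_by_model.values()]
    between = pvariance(model_means)
    within = mean([pvariance(scores) for scores in scores_by_model.values()])
    return between / within if within > 0 else float("inf")


# Invented scores for illustration.
clear_gap = {"model_a": [0.80, 0.82, 0.81], "model_b": [0.60, 0.62, 0.61]}
mostly_noise = {"model_a": [0.50, 0.90, 0.70], "model_b": [0.60, 0.80, 0.70]}

print(separation_ratio(clear_gap))
print(separation_ratio(mostly_noise))
```

A single ratio like this is still only one piece of the reporting the paper calls for; the scenarios, the judge, and the risk measure behind the scores need to be stated alongside it.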

safety, evaluation

AI Co-Mathematician: Accelerating Mathematicians with Agentic AI

May 7, 2026

Daniel Zheng, Ingrid von Glehn, Yori Zwols et al.

AI agents work best for complex research when designed as collaborative partners that maintain context, track what didn't work, and produce native mathematical artifacts, rather than acting as answer machines.

Researchers built an interactive AI workbench that helps mathematicians explore open-ended research problems by combining agents for literature search, computation, theorem proving, and theory building. The system tracks failed ideas, manages uncertainty, and outputs mathematical artifacts—mimicking how human collaborators work together.

agents, reasoning, applications

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

May 7, 2026

Mingwei Xu, Hao Fang

You can train reasoning models effectively using only positive examples—negative examples aren't necessary if you redistribute probability mass correctly and stabilize learning through siamese networks.

This paper proposes POPO, a new training method for reasoning-focused language models that learns exclusively from successful (positive) examples rather than mixing successes with failures. Instead of comparing positive and negative rollouts, as existing methods such as GRPO do, POPO uses importance sampling to implicitly learn what to avoid, stabilized through a siamese network architecture.

training, reasoning, alignment