Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

861 papers · 52 this month · 12 topics

Topics: Efficiency (37), Reasoning (36), Training (35), Evaluation (29), Architecture (23), Agents (23), Multimodal (17), Applications (15), Alignment (9), Safety (8), Scaling (8), Data (3)

May 18 – May 24 (16)

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

May 21, 2026

Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld et al.

Training LLMs to produce diverse outputs across multiple reward dimensions, rather than maximizing a single score, makes them better at test-time search, where you pick the best solution from many candidates.

This paper introduces Vector Policy Optimization (VPO), a training method that teaches language models to generate diverse solutions by optimizing for multiple reward objectives simultaneously, rather than a single scalar reward.

training, reasoning, efficiency
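
To make the test-time search step concrete, here is a minimal sketch of best-of-N selection over candidates scored on several reward dimensions. It illustrates only the selection step, not VPO training itself; the function name `best_of_n`, the two reward dimensions, and the weights are hypothetical choices for illustration, not taken from the paper.

```python
def best_of_n(candidates, reward_vectors, weights):
    """Return the candidate with the highest weighted sum of its rewards."""
    def scalarize(rewards):
        # Collapse a reward vector to one number using the caller's weights.
        return sum(w * r for w, r in zip(weights, rewards))

    best_index = max(range(len(candidates)),
                     key=lambda i: scalarize(reward_vectors[i]))
    return candidates[best_index]


# Hypothetical candidates scored on (speed, readability).
candidates = ["fast", "readable", "balanced"]
reward_vectors = [(0.9, 0.2), (0.3, 0.9), (0.7, 0.7)]

print(best_of_n(candidates, reward_vectors, weights=(1.0, 0.0)))  # fast
print(best_of_n(candidates, reward_vectors, weights=(0.0, 1.0)))  # readable
print(best_of_n(candidates, reward_vectors, weights=(0.5, 0.5)))  # balanced
```

Because the pool covers different trade-offs, each weighting finds a strong candidate. A pool of near-identical outputs would give the selector nothing to choose between.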

Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration

May 21, 2026

Lily Goli, Justin Kerr, Daniele Reda et al.

Effective curiosity-driven exploration in 3D environments requires both a persistent, continuously-updated world model and episodic memory of the agent's trajectory—without these, agents waste effort revisiting forgotten states instead of discovering new regions.

This paper shows how to make AI agents explore 3D environments effectively using curiosity-driven learning. The key insight is that agents need two things: a persistent 3D map of the world that updates continuously, and memory of where they've been.

May 11 – May 17 (9)

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

May 14, 2026

Ziyu Guo, Rain Liu, Xinyan Chen et al.

A single discrete token can serve dual purposes—executing visual operations like code while also functioning as a learnable reasoning unit—making visual reasoning more efficient and trainable without architectural changes.

ATLAS introduces a single 'functional token' that acts as both an agentic operation and a latent visual reasoning unit, enabling models to reason about images without generating intermediate visual content. This approach combines the interpretability of code-based reasoning with the efficiency of latent reasoning, while remaining compatible with standard language model training.

reasoning, multimodal, agents

FutureSim: Replaying World Events to Evaluate Adaptive Agents

May 14, 2026

Shashwat Goel, Nikhil Chandak, Arvindh Arun et al.

Current AI agents struggle with long-horizon real-world adaptation—the best models achieve only 25% accuracy predicting events three months ahead, showing this is a critical capability gap for deployed AI systems.

FutureSim is a benchmark that tests AI agents' ability to adapt and predict real-world events over time by replaying actual news and events in chronological order. Agents must forecast future events beyond their training data while interacting with a live stream of information, revealing significant gaps in current frontier models' capabilities.

May 4 – May 10 (22)

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

May 8, 2026

Tong Zheng, Haolin Liu, Chengsong Huang et al.

You can automatically discover better inference strategies for LLMs by treating strategy design as a search problem over execution traces, rather than manually designing heuristics, and the search is cheap to run at scale.

This paper presents AutoTTS, a framework that automatically discovers test-time scaling strategies for LLMs instead of relying on hand-crafted heuristics.

reasoning

Conformal Path Reasoning: Trustworthy Knowledge Graph Question Answering via Path-Level Calibration

May 8, 2026

Shuhang Lin, Chuhao Zhou, Xiao Lin et al.

Conformal Path Reasoning provides statistical guarantees that your KGQA system will include the correct answer in its output set, while keeping that set compact and practical—solving a real reliability problem in knowledge graph reasoning.

This paper improves Knowledge Graph Question Answering by adding statistical guarantees to answer reliability. It uses conformal prediction—a technique that creates sets of answers with proven coverage rates—combined with a neural network that learns to score reasoning paths better. The result is more trustworthy answers with smaller, more useful prediction sets.

reasoning
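
For readers new to conformal prediction, the following is a minimal sketch of the standard split-conformal recipe that produces answer sets with a coverage guarantee. It is not the paper's path-level method or its learned scorer; the function names, the nonconformity definition (1 minus the model score), and all numbers are assumptions chosen for illustration.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Threshold on nonconformity that targets 1 - alpha coverage.

    calibration_scores: nonconformity of the TRUE answer on each held-out
    calibration question (assumed here to be 1 - model score).
    """
    n = len(calibration_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: keep every candidate
    return sorted(calibration_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity (1 - score) is within the threshold."""
    return {answer for answer, score in candidate_scores.items()
            if 1 - score <= threshold}


# Hypothetical calibration data and candidate answers for one question.
calibration = [0.125, 0.25, 0.25, 0.375, 0.5, 0.625, 0.875]
threshold = conformal_threshold(calibration, alpha=0.25)
print(threshold)  # 0.625
print(prediction_set({"Paris": 0.75, "Lyon": 0.5, "Nice": 0.25}, threshold))
```

With this threshold the set keeps "Paris" and "Lyon" and drops "Nice". A scorer that ranks the correct answer more sharply lowers the threshold, which is how better path scoring can shrink the sets while the coverage target stays the same.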

Apr 27 – May 3 (19)

HyCOP: Hybrid Composition Operators for Interpretable Learning of PDEs

May 1, 2026

Jinpai Zhao, Nishant Panda, Yen Ting Lin et al.

Composing interpretable numerical and learned modules with learned policies outperforms monolithic neural operators on PDEs, generalizes better to out-of-distribution cases, and lets you swap components (like boundary conditions) without retraining.

HyCOP learns to solve PDEs by composing simple, interpretable modules (like advection and diffusion) rather than training a single neural network. It learns a policy that decides which module to apply and for how long based on the current state, enabling better generalization to new scenarios and easier transfer to different problems.

reasoning, architecture, efficiency

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

May 1, 2026

Sailesh Panda, Pritam Kadasi, Abhishek Upperwal et al.

LLMs fail at executing multi-step procedures faithfully, with accuracy collapsing as procedure length increases. This means strong benchmark performance can hide critical weaknesses in following instructions step-by-step.

This paper tests whether large language models actually follow step-by-step procedures correctly, not just whether they get the right final answer. Researchers created a benchmark where models execute arithmetic algorithms of varying length and complexity.
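
The difference between checking a final answer and checking each step can be sketched as follows. This is an illustrative toy, not the paper's benchmark: the operations, the `run_procedure` and `first_divergence` helpers, and the example traces are all hypothetical.

```python
def run_procedure(start, steps):
    """Execute (op, operand) steps in order and return every intermediate value."""
    ops = {
        "add": lambda x, y: x + y,
        "sub": lambda x, y: x - y,
        "mul": lambda x, y: x * y,
    }
    trace = []
    value = start
    for op, operand in steps:
        value = ops[op](value, operand)
        trace.append(value)
    return trace


def first_divergence(reference, reported):
    """Return the 0-based index of the first mismatched step, or None if traces agree."""
    for i, expected in enumerate(reference):
        if i >= len(reported) or reported[i] != expected:
            return i
    if len(reported) > len(reference):
        return len(reference)  # the reported trace has extra steps
    return None


steps = [("add", 4), ("mul", 2), ("sub", 5), ("mul", 3)]
reference = run_procedure(3, steps)
print(reference)  # [7, 14, 9, 27]

# A hypothetical model trace that lands on the right final value
# but gets the third step wrong.
reported = [7, 14, 10, 27]
print(reported[-1] == reference[-1])          # True
print(first_divergence(reference, reported))  # 2
```

A final-answer check passes this trace, while the step-level check flags the third step, which is the kind of hidden failure the study is looking for.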

Apr 20 – Apr 26 (34)

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

Apr 24, 2026

Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin et al.

World models are essential for agents that act in the world, but they need different architectures and evaluation methods depending on what they're modeling (physics vs. software vs. social dynamics) and how sophisticated their predictions need to be.

This paper creates a framework for understanding world models—systems that predict how environments change—by organizing them into three capability levels (from simple one-step prediction to autonomous model revision) and four domain types (physical, digital, social, scientific).

agents, reasoning, evaluation

Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

Apr 24, 2026

Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo

You can train models to reason efficiently using learned abstract tokens instead of natural language, reducing inference cost by over 10× while keeping reasoning quality comparable to verbose chain-of-thought.

This paper introduces Abstract Chain-of-Thought, a method that trains language models to reason using short sequences of special tokens instead of writing out full explanations. The approach uses a warm-up phase combining supervised learning from verbal reasoning and self-distillation, then optimizes with reinforcement learning.

Papers

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

Velocityformer: Broken-Symmetry-Matched Equivariant Graph Transformers for Cosmological Velocity Reconstruction

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

Mem-π: Adaptive Memory through Learning When and What to Generate

HITL-D: Human In The Loop Diffusion Assisted Shared Control

Code as Agent Harness

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

Actionable World Representation

General Preference Reinforcement Learning

Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation

OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

Learning, Fast and Slow: Towards LLMs That Adapt Continually

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

MEME: Multi-entity & Evolving Memory Evaluation

Solve the Loop: Attractor Models for Language and Reasoning

GRAPHLCP: Structure-Aware Localized Conformal Prediction on Graphs

VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents

CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs

Verifier-Backed Hard Problem Generation for Mathematical Reasoning

AI Co-Mathematician: Accelerating Mathematicians with Agentic AI

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

Almost-Orthogonality in Lp Spaces: A Case Study with Grok

LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents

Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer

Superposition Is Not Necessary: A Mechanistic Interpretability Analysis of Transformer Representations for Time Series Forecasting

A Closed-Form Adaptive-Landmark Kernel for Certified Point-Cloud and Graph Classification

An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

A Closed-Form Persistence-Landmark Pipeline for Certified Point-Cloud and Graph Classification

SCPRM: A Schema-aware Cumulative Process Reward Model for Knowledge Graph Question Answering

FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents

AIs and Humans with Agency

RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution

Observable Performance Does Not Fully Reflect System Organization: A Multi-Level Analysis of Gait Dynamics Under Occlusal Constraint

Characterizing the Expressivity of Local Attention in Transformers

Towards Neuro-symbolic Causal Rule Synthesis, Verification, and Evaluation Grounded in Legal and Safety Principles

TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering

Select to Think: Unlocking SLM Potential with Local Sufficiency

HealthNLP_Retrievers at ArchEHR-QA 2026: Cascaded LLM Pipeline for Grounded Clinical Question Answering

Language Diffusion Models are Associative Memories Capable of Retrieving Unseen Data

Recursive Multi-Agent Systems

How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Teacher Forcing as Generalized Bayes: Optimization Geometry Mismatch in Switching Surrogates for Chaotic Dynamics

Toward a Functional Geometric Algebra for Natural Language Semantics

Variational Neural Belief Parameterizations for Robust Dexterous Grasping under Multimodal Uncertainty

Conflict-Aware Harmonized Rotational Gradient for Multiscale Kinetic Regimes

Learning to Think from Multiple Thinkers

SpecRLBench: A Benchmark for Generalization in Specification-Guided Reinforcement Learning

Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft

BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering

MathDuels: Evaluating LLMs as Problem Posers and Solvers

From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation

Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models

A Multi-Stage Warm-Start Deep Learning Framework for Unit Commitment

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale