Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.
Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld et al.
Training LLMs to produce diverse outputs across multiple reward dimensions—not just maximizing a single score—makes them better at test-time search where you can pick the best solution from many candidates.
This paper introduces Vector Policy Optimization (VPO), a training method that teaches language models to generate diverse solutions by optimizing for multiple reward objectives simultaneously, rather than a single scalar reward.
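To see why a vector of rewards helps at test time, here is a minimal sketch of the selection step only, not the VPO training procedure. It assumes each sampled candidate already has one score per objective, keeps the candidates that no other candidate beats on every objective (a plain Pareto filter, swapped in here for illustration), then picks one under caller-supplied weights. The function names and the two-objective data are hypothetical.

```python
def dominates(a, b):
    """True if reward vector a is at least as good as b on every objective
    and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """candidates: list of (name, reward_vector). Keep the non-dominated ones."""
    return [
        (name, r)
        for name, r in candidates
        if not any(dominates(other, r) for _, other in candidates)
    ]


def best_under_weights(candidates, weights):
    """Pick the candidate with the highest weighted sum of rewards."""
    return max(candidates, key=lambda c: sum(w * x for w, x in zip(weights, c[1])))


# Hypothetical samples scored on two objectives, e.g. (correctness, brevity).
samples = [("a", (0.9, 0.1)), ("b", (0.5, 0.5)), ("c", (0.1, 0.9)), ("d", (0.4, 0.4))]
front = pareto_front(samples)
```

A diverse sample set gives a wide front, so different weightings at search time can each find a strong candidate. A set that collapsed onto one scalar score would offer only one trade-off.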
Lily Goli, Justin Kerr, Daniele Reda et al.
Effective curiosity-driven exploration in 3D environments requires both a persistent, continuously-updated world model and episodic memory of the agent's trajectory—without these, agents waste effort revisiting forgotten states instead of discovering new regions.
This paper shows how to make AI agents explore 3D environments effectively using curiosity-driven learning. The key insight is that agents need two things: a persistent 3D map of the world that updates continuously, and memory of where they've been.
Ziyu Guo, Rain Liu, Xinyan Chen et al.
A single discrete token can serve two purposes: it executes visual operations the way code would, and it also functions as a learnable reasoning unit. This makes visual reasoning more efficient and trainable without architectural changes.
ATLAS introduces a single 'functional token' that acts as both an agentic operation and a latent visual reasoning unit, enabling models to reason about images without generating intermediate visual content. This approach combines the interpretability of code-based reasoning with the efficiency of latent reasoning, while remaining compatible with standard language model training.
Shashwat Goel, Nikhil Chandak, Arvindh Arun et al.
Current AI agents struggle with long-horizon real-world adaptation—the best models achieve only 25% accuracy predicting events three months ahead, showing this is a critical capability gap for deployed AI systems.
FutureSim is a benchmark that tests AI agents' ability to adapt and predict real-world events over time by replaying actual news and events in chronological order. Agents must forecast future events beyond their training data while interacting with a live stream of information, revealing significant gaps in current frontier models' capabilities.
Tong Zheng, Haolin Liu, Chengsong Huang et al.
You can automatically discover better inference strategies for LLMs by treating strategy design as a search problem over execution traces rather than hand-designing heuristics, and the search is cheap to run at scale.
This paper presents AutoTTS, a framework that automatically discovers test-time scaling strategies for LLMs instead of relying on hand-crafted heuristics.
Shuhang Lin, Chuhao Zhou, Xiao Lin et al.
Conformal Path Reasoning provides statistical guarantees that your KGQA system will include the correct answer in its output set, while keeping that set compact and practical—solving a real reliability problem in knowledge graph reasoning.
This paper improves Knowledge Graph Question Answering by adding statistical guarantees to answer reliability. It uses conformal prediction—a technique that creates sets of answers with proven coverage rates—combined with a neural network that learns to score reasoning paths better. The result is more trustworthy answers with smaller, more useful prediction sets.
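The coverage guarantee comes from the calibration step of conformal prediction, which can be sketched in a few lines. This is generic split conformal prediction, not the paper's path-scoring network. It assumes you already have a model score in [0, 1] for each candidate answer and a held-out calibration set where the true answer is known. The function names and example data are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """cal_scores: nonconformity (1 - model score) of the TRUE answer for each
    calibration question. Returns the cutoff for a target miss rate alpha."""
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: keep every candidate
    return sorted(cal_scores)[k - 1]


def prediction_set(candidate_scores, threshold):
    """candidate_scores: dict of answer -> model score (higher = more plausible).
    Keep every answer whose nonconformity is within the threshold."""
    return {a for a, s in candidate_scores.items() if 1.0 - s <= threshold}


# Hypothetical calibration nonconformity scores and a new question's candidates.
cal = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
threshold = conformal_threshold(cal, alpha=0.5)
answers = prediction_set({"Paris": 0.9, "Lyon": 0.6, "Nice": 0.2}, threshold)
```

A better scoring function pushes true answers toward low nonconformity, which lowers the threshold and shrinks the sets. That is the link between learning to score reasoning paths and getting more compact prediction sets.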
Jinpai Zhao, Nishant Panda, Yen Ting Lin et al.
Composing interpretable numerical and learned modules with learned policies outperforms monolithic neural operators on PDEs, generalizes better to out-of-distribution cases, and lets you swap components (like boundary conditions) without retraining.
HyCOP learns to solve PDEs by composing simple, interpretable modules (like advection and diffusion) rather than training a single neural network. It learns a policy that decides which module to apply and for how long based on the current state, enabling better generalization to new scenarios and easier transfer to different problems.
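The control flow of "a policy picks a module and a duration" can be sketched as follows. This is a toy under stated assumptions, not HyCOP itself: the grid is 1D and periodic, the two modules are textbook finite-difference steps, and `toy_policy` is a hand-written stand-in for what the paper learns. All names and constants are hypothetical.

```python
def advect(u, dt, c=1.0, dx=1.0):
    """First-order upwind step for u_t + c * u_x = 0 (c > 0), periodic grid."""
    return [u[i] - c * dt / dx * (u[i] - u[i - 1]) for i in range(len(u))]


def diffuse(u, dt, nu=0.1, dx=1.0):
    """Explicit step for u_t = nu * u_xx, periodic grid."""
    n = len(u)
    return [
        u[i] + nu * dt / dx**2 * (u[(i + 1) % n] - 2 * u[i] + u[i - 1])
        for i in range(n)
    ]


MODULES = {"advect": advect, "diffuse": diffuse}


def toy_policy(u):
    """Hand-written placeholder for a learned policy: returns (module, duration).
    Here: smooth the field first if neighbouring cells differ sharply."""
    roughness = max(abs(u[i] - u[i - 1]) for i in range(len(u)))
    return ("diffuse", 0.5) if roughness > 0.5 else ("advect", 0.5)


def rollout(u, steps, policy=toy_policy):
    """Repeatedly ask the policy which module to apply and for how long."""
    trace = []
    for _ in range(steps):
        name, dt = policy(u)
        u = MODULES[name](u, dt)
        trace.append(name)
    return u, trace


u_final, trace = rollout([0.0, 0.0, 1.0, 0.0, 0.0, 0.0], steps=4)
```

Because each module is an ordinary function of the state, replacing one (for example, a version with a different boundary treatment) only changes an entry in `MODULES`; the rollout loop stays the same.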
Sailesh Panda, Pritam Kadasi, Abhishek Upperwal et al.
LLMs fail at executing multi-step procedures faithfully, with accuracy collapsing as procedure length increases. This means strong benchmark performance can hide critical weaknesses in following instructions step-by-step.
This paper tests whether large language models actually follow step-by-step procedures correctly, not just whether they get the right final answer. Researchers created a benchmark where models execute arithmetic algorithms of varying length and complexity.
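The difference between checking the final answer and checking the procedure can be sketched with a small scorer. This is an illustration only: the paper's actual tasks and metrics are not described here, so the operation format and function names below are hypothetical.

```python
def reference_trace(start, ops):
    """ops: list of (op, operand) with op in '+', '-', '*'.
    Returns the running value after every step."""
    value, trace = start, []
    for op, k in ops:
        if op == "+":
            value += k
        elif op == "-":
            value -= k
        elif op == "*":
            value *= k
        else:
            raise ValueError(f"unknown op: {op}")
        trace.append(value)
    return trace


def first_error(model_trace, ref_trace):
    """Index of the first step where the model diverges from the reference,
    or None if every step matches."""
    for i, ref in enumerate(ref_trace):
        if i >= len(model_trace) or model_trace[i] != ref:
            return i
    return None


ops = [("+", 3), ("*", 2), ("-", 4)]
ref = reference_trace(1, ops)      # values after each step: 4, 8, 4
model = [4, 9, 4]                  # hypothetical model output: wrong middle step
final_answer_correct = model[-1] == ref[-1]
step_of_first_error = first_error(model, ref)
```

In this example the final answer matches while step 1 (zero-indexed) is wrong, which is the kind of failure a final-answer-only benchmark would miss.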
Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin et al.
World models are essential for agents that act in the world, but they need different architectures and evaluation methods depending on what they're modeling (physics vs. software vs. social dynamics) and how sophisticated their predictions need to be.
This paper creates a framework for understanding world models—systems that predict how environments change—by organizing them into three capability levels (from simple one-step prediction to autonomous model revision) and four domain types (physical, digital, social, scientific).
Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo
You can train models to reason efficiently using learned abstract tokens instead of natural language, reducing inference cost by over 10× while keeping reasoning quality comparable to verbose chain-of-thought.
This paper introduces Abstract Chain-of-Thought, a method that trains language models to reason using short sequences of special tokens instead of writing out full explanations. The approach uses a warm-up phase combining supervised learning from verbal reasoning and self-distillation, then optimizes with reinforcement learning.