Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.
Lily Goli, Justin Kerr, Daniele Reda et al.
Effective curiosity-driven exploration in 3D environments requires both a persistent, continuously-updated world model and episodic memory of the agent's trajectory—without these, agents waste effort revisiting forgotten states instead of discovering new regions.
This paper shows how to make AI agents explore 3D environments effectively using curiosity-driven learning. The key insight is that agents need two things: a persistent 3D map of the world that updates continuously, and memory of where they've been.
Qianshu Cai, Yonggang Zhang, Xianzhang Jia et al.
Self-evolving agents need source-code access, not just prompt editing—structural bugs in routing and state management can't be fixed by text-layer changes alone, and MOSS demonstrates this works in production with measurable improvements.
MOSS is a system that lets autonomous agents automatically fix themselves by rewriting their own source code based on real failures. Unlike existing approaches that only modify text files like prompts, MOSS can change the actual code structure—routing logic, state management, dispatch—making it possible to fix a much broader class of problems.
Ziyu Guo, Rain Liu, Xinyan Chen et al.
A single discrete token can serve two purposes, executing visual operations the way code would while also functioning as a learnable reasoning unit, which makes visual reasoning more efficient and trainable without architectural changes.
ATLAS introduces a single 'functional token' that acts as both an agentic operation and a latent visual reasoning unit, enabling models to reason about images without generating intermediate visual content. This approach combines the interpretability of code-based reasoning with the efficiency of latent reasoning, while remaining compatible with standard language model training.
Shashwat Goel, Nikhil Chandak, Arvindh Arun et al.
Current AI agents struggle with long-horizon real-world adaptation—the best models achieve only 25% accuracy predicting events three months ahead, showing this is a critical capability gap for deployed AI systems.
FutureSim is a benchmark that tests AI agents' ability to adapt and predict real-world events over time by replaying actual news and events in chronological order. Agents must forecast future events beyond their training data while interacting with a live stream of information, revealing significant gaps in current frontier models' capabilities.
Jiayuan Liu, Tianqin Li, Shiyi Du et al.
Giving LLM agents access to longer memory doesn't automatically improve performance; it can actually harm cooperation in multi-agent settings by shifting how they reason about the future, not by making them more suspicious.
When LLMs can remember more conversation history, they actually cooperate less in multi-agent games, a problem called the memory curse. The researchers found that expanded context windows cause models to lose forward-looking intent rather than become paranoid, and they support this by showing that synthetic positive history and targeted fine-tuning can restore cooperation.
Borui Zhang, Bo Zhang, Bo Wang et al.
You can significantly improve GUI agent accuracy on complex interfaces without retraining by using a two-step approach: first narrow down the region of interest, then select the best candidate from remaining options.
This paper identifies why GUI grounding models (used by AI agents to click and interact with interfaces) fail on complex screens, finding two main problems: high image resolution causes precision errors, and complex UI elements create ambiguity.
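The two-step idea above (narrow the region, then pick among the remaining candidates) can be sketched roughly as follows. This is an illustration only, not the paper's method: the `Candidate` fields, the grid-cell heuristic, and the split into a coarse and a fine score are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical click target with two scores (both assumed, not from the paper)."""
    label: str
    x: float
    y: float
    coarse_score: float  # confidence from a low-resolution pass over the full screen
    fine_score: float    # confidence from a zoomed-in pass over a small crop


def cell_of(c: Candidate, cell_size: float) -> tuple[int, int]:
    """Map a candidate to the grid cell that contains it."""
    return (int(c.x // cell_size), int(c.y // cell_size))


def narrow_region(candidates: list[Candidate], cell_size: float) -> tuple[int, int]:
    """Step 1: choose the grid cell with the highest total coarse score."""
    if not candidates:
        raise ValueError("no candidates to ground")
    totals: dict[tuple[int, int], float] = defaultdict(float)
    for c in candidates:
        totals[cell_of(c, cell_size)] += c.coarse_score
    return max(totals, key=totals.get)


def select_candidate(candidates: list[Candidate], cell: tuple[int, int],
                     cell_size: float) -> Candidate:
    """Step 2: among candidates inside the chosen cell, take the best fine score."""
    in_cell = [c for c in candidates if cell_of(c, cell_size) == cell]
    return max(in_cell, key=lambda c: c.fine_score)


def ground(candidates: list[Candidate], cell_size: float = 200.0) -> Candidate:
    """Run both steps: narrow the region, then select within it."""
    cell = narrow_region(candidates, cell_size)
    return select_candidate(candidates, cell, cell_size)
```

Splitting the decision this way means the fine-grained comparison only runs over a handful of nearby elements, which is where ambiguity between similar controls tends to matter.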
Ziyang Huang, Yi Cao, Ali K. Shargh et al.
AI coding agents are far from ready for autonomous scientific research: they excel at software engineering but fail at the domain-specific reasoning, procedure reconstruction, and result interpretation needed to reproduce real computational science claims.
This paper introduces AutoMat, a benchmark that tests whether AI coding agents can reproduce scientific findings from materials science papers. The benchmark reveals that current AI agents struggle significantly, achieving only 54% success, because they cannot fully reconstruct experimental procedures from paper descriptions, they deviate from required methods, and they fail during execution.
Arunabh Srivastava, Mohammad A. Khojastepour et al.
To make LLMs reliable at executing plans, you need to enforce structure through explicit control constructs, validate outputs against derived constraints at each step, and dynamically route to the best execution method (reasoning, tools, or code).
RunAgent is a system that helps AI agents execute multi-step plans written in natural language by converting them into a structured format with explicit control flow (like IF statements and loops).
Longju Bai, Zhemin Huang, Xingyao Wang et al.
AI agents are expensive and unpredictable: token costs vary wildly (up to 30x difference on the same task), models differ dramatically in efficiency, and even frontier models can't accurately predict their own token usage before running.
This paper analyzes how much AI agents spend on tokens when solving coding tasks. Researchers studied eight frontier LLMs on real-world coding benchmarks and found that agentic tasks consume 1000x more tokens than simpler coding tasks, with huge variability between runs. Surprisingly, spending more tokens doesn't guarantee better results—accuracy often peaks at intermediate costs then plateaus.
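A figure like "up to 30x difference on the same task" is simply the ratio between the most and least expensive runs. The helper below is a small sketch of how one might measure that spread from logged token counts; the function name and the sample numbers in the usage note are invented for illustration and are not data from the paper.

```python
from statistics import median


def token_spread(runs: list[int]) -> tuple[float, float]:
    """Return (max/min ratio, median) for token counts from repeated runs of one task."""
    if not runs or min(runs) <= 0:
        raise ValueError("need at least one run with a positive token count")
    return max(runs) / min(runs), median(runs)
```

For example, three hypothetical runs that used 40,000, 120,000 and 1,200,000 tokens give a ratio of 30.0 and a median of 120,000, so budgeting from the median alone would badly underestimate the worst case.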
Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin et al.
World models are essential for agents that act in the world, but they need different architectures and evaluation methods depending on what they're modeling (physics vs. software vs. social dynamics) and how sophisticated their predictions need to be.
This paper creates a framework for understanding world models—systems that predict how environments change—by organizing them into three capability levels (from simple one-step prediction to autonomous model revision) and four domain types (physical, digital, social, scientific).
Eric Gan, Aryan Bhatt, Buck Shlegeris et al.
Current AI systems and auditors are poor at detecting subtle sabotage in research code—even frontier LLMs only catch 77% of cases—highlighting a critical gap in oversight for autonomous AI research.
This paper introduces ASMR-Bench, a benchmark for testing whether AI systems and human auditors can detect sabotage hidden in ML research code. The benchmark includes 9 real ML projects with intentionally introduced bugs that change experimental results while keeping the paper's description accurate.
Yan Li, Zezi Zeng, Yifan Yang et al.
Generating webpages with AI requires coordinating multiple content types (text, images, video) at both global and local levels—treating layout and content generation as interconnected problems rather than separate tasks.
MM-WebAgent is a hierarchical AI system that generates complete webpages by coordinating the creation of layouts, text, images, and videos together. Unlike simpler approaches that generate each element separately, it uses planning and self-reflection to ensure all parts work together visually and stylistically.
Igor Jankowski
Event-driven temporal graph networks can bridge the simulation-to-reality gap in multi-agent cyber defense by processing asynchronous, noisy alerts in continuous time rather than synchronous ticks, enabling policies trained in simulation to work on real systems.
NetForge_RL is a cyber defense simulator that trains AI agents to protect networks in realistic, continuous-time conditions rather than simplified turn-based games. It uses a new technique called CT-GMARL that processes irregular security alerts like a human analyst would, achieving 2x better performance than existing methods and successfully transferring trained policies to real systems.
Anthony T. Nixon
Agents with different computational limits need different semantic representations of the world; communication between them hits a hard threshold determined by capacity mismatch, and you can derive the minimum communication rate needed from the agents' capacity constraints alone.
This paper shows how agents with different computational capacities develop different 'semantic alphabets' when interacting with the same environment. It proves that communication between mismatched agents has a sharp threshold: below a critical rate, meaningful communication is impossible, but above it, information flows efficiently.
Wancong Zhang, Basile Terver, Artem Zholus et al.
Hierarchical planning with multi-scale world models enables robots to handle long-horizon tasks with 4x less compute and works zero-shot in new environments, a practical win for embodied AI systems.
This paper tackles long-horizon robot control by learning world models at multiple time scales and planning hierarchically across them. Instead of predicting every single step far into the future (which accumulates errors), the approach learns coarse and fine-grained models and plans at both levels, reducing computation while improving success on real-world tasks like pick-and-place.
Maximiliano Armesto, Christophe Kolb
Agentic AI systems need tightly integrated control, memory, and verification mechanisms working together; separating these concerns (as robotics, retrieval, and alignment research typically do) misses critical robustness gains that come from their coupling.
This paper proposes SCRAT, a framework for agentic AI that couples control, memory, and verification by drawing parallels from squirrel behavior.