Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

861 papers, 42 this month, 12 topics

All Efficiency 37 Reasoning 36 Training 35 Evaluation 29 Architecture 23 Agents 23 Multimodal 17 Applications 15 Alignment 9 Safety 8 Scaling 8 Data 3

May 18 – May 24 (11)

Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration

May 21, 2026

Lily Goli, Justin Kerr, Daniele Reda et al.

Effective curiosity-driven exploration in 3D environments requires both a persistent, continuously-updated world model and episodic memory of the agent's trajectory—without these, agents waste effort revisiting forgotten states instead of discovering new regions.

This paper shows how to make AI agents explore 3D environments effectively using curiosity-driven learning. The key insight is that agents need two things: a persistent 3D map of the world that updates continuously, and memory of where they've been.

reasoning, agents

MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

May 21, 2026

Qianshu Cai, Yonggang Zhang, Xianzhang Jia et al.

Self-evolving agents need source-code access, not just prompt editing—structural bugs in routing and state management can't be fixed by text-layer changes alone, and MOSS demonstrates this works in production with measurable improvements.

MOSS is a system that lets autonomous agents automatically fix themselves by rewriting their own source code based on real failures. Unlike existing approaches that only modify text files like prompts, MOSS can change the actual code structure—routing logic, state management, dispatch—making it possible to fix a much broader class of problems.

May 11 – May 17 (9)

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

May 14, 2026

Ziyu Guo, Rain Liu, Xinyan Chen et al.

A single discrete token can serve dual purposes—executing visual operations like code while also functioning as a learnable reasoning unit—making visual reasoning more efficient and trainable without architectural changes.

ATLAS introduces a single 'functional token' that acts as both an agentic operation and a latent visual reasoning unit, enabling models to reason about images without generating intermediate visual content. This approach combines the interpretability of code-based reasoning with the efficiency of latent reasoning, while remaining compatible with standard language model training.

reasoning, multimodal, agents

FutureSim: Replaying World Events to Evaluate Adaptive Agents

May 14, 2026

Shashwat Goel, Nikhil Chandak, Arvindh Arun et al.

Current AI agents struggle with long-horizon real-world adaptation—the best models achieve only 25% accuracy predicting events three months ahead, showing this is a critical capability gap for deployed AI systems.

FutureSim is a benchmark that tests AI agents' ability to adapt and predict real-world events over time by replaying actual news and events in chronological order. Agents must forecast future events beyond their training data while interacting with a live stream of information, revealing significant gaps in current frontier models' capabilities.

May 4 – May 10 (20)

The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents

May 8, 2026

Jiayuan Liu, Tianqin Li, Shiyi Du et al.

Giving LLM agents access to longer memory doesn't automatically improve performance; it can actually harm cooperation in multi-agent settings by shifting how they reason about the future, not by making them more suspicious.

When LLMs can remember more conversation history, they cooperate less in multi-agent games, a problem called the memory curse. The researchers found that expanded context windows cause models to lose forward-looking intent rather than become paranoid, and they support this by showing that synthetic positive history and targeted fine-tuning can restore cooperation.

agents, reasoning, alignment

BAMI: Training-Free Bias Mitigation in GUI Grounding

May 7, 2026

Borui Zhang, Bo Zhang, Bo Wang et al.

You can significantly improve GUI agent accuracy on complex interfaces without retraining by using a two-step approach: first narrow down the region of interest, then select the best candidate from remaining options.

This paper identifies why GUI grounding models (used by AI agents to click and interact with interfaces) fail on complex screens, finding two main problems: high image resolution causes precision errors, and complex UI elements create ambiguity.

Apr 27 – May 3 (16)

Can Coding Agents Reproduce Findings in Computational Materials Science?

May 1, 2026

Ziyang Huang, Yi Cao, Ali K. Shargh et al.

AI coding agents are far from ready for autonomous scientific research: they excel at software engineering but fail at the domain-specific reasoning, procedure reconstruction, and result interpretation needed to reproduce real computational science claims.

This paper introduces AutoMat, a benchmark that tests whether AI coding agents can reproduce scientific findings from materials science papers. The benchmark reveals that current AI agents struggle significantly, achieving only 54% success, because they fail to fully reconstruct experimental procedures from paper descriptions, deviate from required methods, and fail during execution.

agents, evaluation, applications

RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution

May 1, 2026

Arunabh Srivastava, Mohammad A. Khojastepour et al.

To make LLMs reliable at executing plans, you need to enforce structure through explicit control constructs, validate outputs against derived constraints at each step, and dynamically route to the best execution method (reasoning, tools, or code).

RunAgent is a system that helps AI agents execute multi-step plans written in natural language by converting them into a structured format with explicit control flow (like IF statements and loops).

Apr 20 – Apr 26 (15)

How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks

Apr 24, 2026

Longju Bai, Zhemin Huang, Xingyao Wang et al.

AI agents are expensive and unpredictable: token costs vary wildly (up to 30x difference on the same task), models differ dramatically in efficiency, and even frontier models can't accurately predict their own token usage before running.

This paper analyzes how much AI agents spend on tokens when solving coding tasks. Researchers studied eight frontier LLMs on real-world coding benchmarks and found that agentic tasks consume 1000x more tokens than simpler coding tasks, with huge variability between runs. Surprisingly, spending more tokens doesn't guarantee better results—accuracy often peaks at intermediate costs then plateaus.

efficiency, agents, evaluation

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

Apr 24, 2026

Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin et al.

World models are essential for agents that act in the world, but they need different architectures and evaluation methods depending on what they're modeling (physics vs. software vs. social dynamics) and how sophisticated their predictions need to be.

This paper creates a framework for understanding world models—systems that predict how environments change—by organizing them into three capability levels (from simple one-step prediction to autonomous model revision) and four domain types (physical, digital, social, scientific).

Apr 13 – Apr 19 (13)

ASMR-Bench: Auditing for Sabotage in ML Research

Apr 17, 2026

Eric Gan, Aryan Bhatt, Buck Shlegeris et al.

Current AI systems and auditors are poor at detecting subtle sabotage in research code—even frontier LLMs only catch 77% of cases—highlighting a critical gap in oversight for autonomous AI research.

This paper introduces ASMR-Bench, a benchmark for testing whether AI systems and human auditors can detect sabotage hidden in ML research code. The benchmark includes 9 real ML projects with intentionally introduced bugs that change experimental results while keeping the paper's description accurate.

safety, evaluation, agents

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

Apr 16, 2026

Yan Li, Zezi Zeng, Yifan Yang et al.

Generating webpages with AI requires coordinating multiple content types (text, images, video) at both global and local levels—treating layout and content generation as interconnected problems rather than separate tasks.

MM-WebAgent is a hierarchical AI system that generates complete webpages by coordinating the creation of layouts, text, images, and videos together. Unlike simpler approaches that generate each element separately, it uses planning and self-reflection to ensure all parts work together visually and stylistically.

Apr 6 – Apr 12 (14)

Event-Driven Temporal Graph Networks for Asynchronous Multi-Agent Cyber Defense in NetForge_RL

Apr 10, 2026

Igor Jankowski

Event-driven temporal graph networks can bridge the simulation-to-reality gap in multi-agent cyber defense by processing asynchronous, noisy alerts in continuous time rather than synchronous ticks, enabling policies trained in simulation to work on real systems.

NetForge_RL is a cyber defense simulator that trains AI agents to protect networks in realistic, continuous-time conditions rather than simplified turn-based games. It uses a new technique called CT-GMARL that processes irregular security alerts like a human analyst would, achieving 2x better performance than existing methods and successfully transferring trained policies to real systems.

agents, training, applications

Semantic Rate-Distortion for Bounded Multi-Agent Communication: Capacity-Derived Semantic Spaces and the Communication Cost of Alignment

Apr 10, 2026

Anthony T. Nixon

Agents with different computational limits need different semantic representations of the world; communication between them hits a hard threshold determined by capacity mismatch, and you can derive the minimum communication rate needed from the agents' capacity constraints alone.

Mar 30 – Apr 5 (2)

Hierarchical Planning with Latent World Models

Apr 3, 2026

Wancong Zhang, Basile Terver, Artem Zholus et al.

Hierarchical planning with multi-scale world models enables robots to handle long-horizon tasks with 4x less compute and works zero-shot in new environments—a practical win for embodied AI systems.

This paper tackles long-horizon robot control by learning world models at multiple time scales and planning hierarchically across them. Instead of predicting every single step far into the future (which accumulates errors), the approach learns coarse and fine-grained models and plans at both levels, reducing computation while improving success on real-world tasks like pick-and-place.

reasoning, efficiency, agents

Coupled Control, Structured Memory, and Verifiable Action in Agentic AI (SCRAT -- Stochastic Control with Retrieval and Auditable Trajectories): A Comparative Perspective from Squirrel Locomotion and Scatter-Hoarding

Apr 3, 2026

Maximiliano Armesto, Christophe Kolb

Agentic AI systems need tightly integrated control, memory, and verification mechanisms working together; separating these concerns (as robotics, retrieval, and alignment research typically do) misses critical robustness gains that come from their coupling.

This paper proposes SCRAT, a framework for agentic AI that couples control, memory, and verification by drawing parallels from squirrel behavior.

Papers

LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

Mem-$π$: Adaptive Memory through Learning When and What to Generate

HITL-D: Human In The Loop Diffusion Assisted Shared Control

Code as Agent Harness

DexHoldem: Playing Texas Hold'em with Dexterous Embodied System

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

Hand-in-the-Loop: Improving Dexterous VLA via Seamless Interventional Correction

Self-Distilled Agentic Reinforcement Learning

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

MEME: Multi-entity & Evolving Memory Evaluation

Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

AI Co-Mathematician: Accelerating Mathematicians with Agentic AI

Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents

Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours

OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment

Physics-Grounded Multi-Agent Architecture for Traceable, Risk-Aware Human-AI Decision Support in Manufacturing

An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

From Intent to Execution: Composing Agentic Workflows with Agent Recommendation

From Sensors to Insight: Rapid, Edge-to-Core Application Development for Sensor-Driven Applications

(POSTER) From Sensors to Insight: Rapid, Edge-to-Core Application Development for Sensor-Driven Applications

HAAS: A Policy-Aware Framework for Adaptive Task Allocation Between Humans and Artificial Intelligence Systems

SCPRM: A Schema-aware Cumulative Process Reward Model for Knowledge Graph Question Answering

FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents

AIs and Humans with Agency

Synthetic Computers at Scale for Long-Horizon Productivity Simulation

Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Crab: A Semantics-Aware Checkpoint/Restore Runtime for Agent Sandboxes

What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

ClawGym: A Scalable Framework for Building Effective Claw Agents

Recursive Multi-Agent Systems

DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios

A paradox of AI fluency

Variational Neural Belief Parameterizations for Robust Dexterous Grasping under Multimodal Uncertainty

No Pedestrian Left Behind: Real-Time Detection and Tracking of Vulnerable Road Users for Adaptive Traffic Signal Control

SpecRLBench: A Benchmark for Generalization in Specification-Guided Reinforcement Learning

The Chameleon's Limit: Investigating Persona Collapse and Homogenization in Large Language Models

Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft

From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation

Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

TraceScope: Interactive URL Triage via Decoupled Checklist Adjudication

Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows

Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems

Compliance Moral Hazard and the Backfiring Mandate