ThinkLLM

Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

861 papers · 52 this month · 12 topics
All · Efficiency 37 · Reasoning 36 · Training 35 · Evaluation 29 · Architecture 23 · Agents 23 · Multimodal 17 · Applications 15 · Alignment 9 · Safety 8 · Scaling 8 · Data 3

May 18 – May 24 (16)

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

May 21, 2026

Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld et al.

Training LLMs to produce diverse outputs across multiple reward dimensions—not just maximizing a single score—makes them better at test-time search where you can pick the best solution from many candidates.

This paper introduces Vector Policy Optimization (VPO), a training method that teaches language models to generate diverse solutions by optimizing for multiple reward objectives simultaneously, rather than a single scalar reward.

training · reasoning · efficiency

Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration

May 21, 2026

Lily Goli, Justin Kerr, Daniele Reda et al.

Effective curiosity-driven exploration in 3D environments requires both a persistent, continuously updated world model and episodic memory of the agent's trajectory; without these, agents waste effort revisiting forgotten states instead of discovering new regions.

This paper shows how to make AI agents explore 3D environments effectively using curiosity-driven learning. The key insight is that agents need two things: a persistent 3D map of the world that updates continuously, and memory of where they've been.

May 11 – May 17 (9)

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

May 14, 2026

Ziyu Guo, Rain Liu, Xinyan Chen et al.

A single discrete token can serve dual purposes—executing visual operations like code while also functioning as a learnable reasoning unit—making visual reasoning more efficient and trainable without architectural changes.

ATLAS introduces a single 'functional token' that acts as both an agentic operation and a latent visual reasoning unit, enabling models to reason about images without generating intermediate visual content. This approach combines the interpretability of code-based reasoning with the efficiency of latent reasoning, while remaining compatible with standard language model training.

reasoning · multimodal · agents

FutureSim: Replaying World Events to Evaluate Adaptive Agents

May 14, 2026

Shashwat Goel, Nikhil Chandak, Arvindh Arun et al.

Current AI agents struggle with long-horizon real-world adaptation—the best models achieve only 25% accuracy predicting events three months ahead, showing this is a critical capability gap for deployed AI systems.

FutureSim is a benchmark that tests AI agents' ability to adapt and predict real-world events over time by replaying actual news and events in chronological order. Agents must forecast future events beyond their training data while interacting with a live stream of information, revealing significant gaps in current frontier models' capabilities.

May 4 – May 10 (22)

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

May 8, 2026

Tong Zheng, Haolin Liu, Chengsong Huang et al.

You can automatically discover better inference strategies for LLMs by treating strategy design as a search problem over execution traces rather than hand-designing heuristics, and the search is cheap to run at scale.

This paper presents AutoTTS, a framework that automatically discovers test-time scaling strategies for LLMs instead of relying on hand-crafted heuristics.

reasoning

Conformal Path Reasoning: Trustworthy Knowledge Graph Question Answering via Path-Level Calibration

May 8, 2026

Shuhang Lin, Chuhao Zhou, Xiao Lin et al.

Conformal Path Reasoning provides statistical guarantees that your KGQA system will include the correct answer in its output set, while keeping that set compact and practical—solving a real reliability problem in knowledge graph reasoning.

This paper improves Knowledge Graph Question Answering by adding statistical guarantees to answer reliability. It uses conformal prediction—a technique that creates sets of answers with proven coverage rates—combined with a neural network that learns to score reasoning paths better. The result is more trustworthy answers with smaller, more useful prediction sets.

reasoning
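
The coverage guarantee described in this entry comes from conformal prediction. The sketch below shows generic split conformal prediction only, not the paper's path-level calibration or its learned path scorer. The function names, calibration scores, and candidate answers are hypothetical.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Return the conformal quantile of calibration nonconformity scores.

    cal_scores: nonconformity of the true answer on each held-out
    calibration example (higher means a worse fit).
    alpha: target miscoverage rate, e.g. 0.2 for 80% coverage.
    """
    n = len(cal_scores)
    # Finite-sample correction: use the ceil((n + 1) * (1 - alpha))-th smallest score.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points, so keep every candidate
    return sorted(cal_scores)[rank - 1]

def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return {c for c, s in candidate_scores.items() if s <= threshold}

# Hypothetical calibration scores (1 - model confidence in the true answer).
cal = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60]
q = conformal_threshold(cal, alpha=0.2)

# Hypothetical candidate answers for a new question, scored the same way.
candidates = {"Paris": 0.08, "Lyon": 0.38, "Berlin": 0.90}
answers = prediction_set(candidates, q)
```

A sharper scoring function shrinks the returned set without changing the coverage level, which is the trade-off the summary points at.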

Apr 27 – May 3 (19)

HyCOP: Hybrid Composition Operators for Interpretable Learning of PDEs

May 1, 2026

Jinpai Zhao, Nishant Panda, Yen Ting Lin et al.

Composing interpretable numerical and learned modules with learned policies outperforms monolithic neural operators on PDEs, generalizes better to out-of-distribution cases, and lets you swap components (like boundary conditions) without retraining.

HyCOP learns to solve PDEs by composing simple, interpretable modules (like advection and diffusion) rather than training a single neural network. It learns a policy that decides which module to apply and for how long based on the current state, enabling better generalization to new scenarios and easier transfer to different problems.

reasoning · architecture · efficiency

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

May 1, 2026

Sailesh Panda, Pritam Kadasi, Abhishek Upperwal et al.

LLMs fail at executing multi-step procedures faithfully, with accuracy collapsing as procedure length increases. This means strong benchmark performance can hide critical weaknesses in following instructions step-by-step.

This paper tests whether large language models actually follow step-by-step procedures correctly, not just whether they get the right final answer. Researchers created a benchmark where models execute arithmetic algorithms of varying length and complexity.

Apr 20 – Apr 26 (34)

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

Apr 24, 2026

Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin et al.

World models are essential for agents that act in the world, but they need different architectures and evaluation methods depending on what they're modeling (physics vs. software vs. social dynamics) and how sophisticated their predictions need to be.

This paper creates a framework for understanding world models—systems that predict how environments change—by organizing them into three capability levels (from simple one-step prediction to autonomous model revision) and four domain types (physical, digital, social, scientific).

agents · reasoning · evaluation

Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

Apr 24, 2026

Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo

You can train models to reason efficiently using learned abstract tokens instead of natural language, reducing inference cost by over 10× while keeping reasoning quality comparable to verbose chain-of-thought.

This paper introduces Abstract Chain-of-Thought, a method that trains language models to reason using short sequences of special tokens instead of writing out full explanations. The approach uses a warm-up phase combining supervised learning from verbal reasoning and self-distillation, then optimizes with reinforcement learning.

reasoning · agents

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

May 21, 2026

Ali Hatamizadeh, Yejin Choi, Jan Kautz

Decoupling erase and write operations in linear attention with separate gates improves language model performance, especially on long-context tasks, while maintaining constant-memory decoding.

This paper improves linear attention mechanisms by separating the control of what to forget from what to remember in compressed memory. Instead of using a single gate to control both erasing old information and writing new information, Gated DeltaNet-2 uses separate channel-wise gates for each operation, making memory updates more flexible and efficient.

architecture · efficiency · reasoning

Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

May 20, 2026

Benhao Huang, Zhengyang Geng, Zico Kolter

Iterative reasoning models work by learning task-specific attractors in their latent space; scaling test-time compute (more iterations and parallel paths) improves performance on hard problems without needing external verifiers.

This paper explains how AI models can solve hard problems by iteratively refining internal states, like a brain thinking through steps. The key insight is that models learn to create 'attractors'—stable patterns that pull the model toward correct answers.

reasoning · scaling
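
The idea of an attractor can be illustrated with a toy fixed-point iteration. This is a scalar sketch only, not the paper's model: the update rule (Heron's method for the square root of 2) and all names are stand-ins chosen because the map converges to the same point from different starting states.

```python
def iterate_to_fixed_point(step, z0, max_iters=100, tol=1e-12):
    """Apply `step` repeatedly until the state stops changing."""
    z = z0
    for n in range(1, max_iters + 1):
        z_next = step(z)
        if abs(z_next - z) < tol:
            return z_next, n
        z = z_next
    return z, max_iters

def step(z):
    # Toy refinement rule whose only positive fixed point is sqrt(2).
    return 0.5 * (z + 2.0 / z)

# Two different initial states end up at the same attractor.
near, near_iters = iterate_to_fixed_point(step, 1.0)
far, far_iters = iterate_to_fixed_point(step, 10.0)
```

Starting further from the attractor costs more iterations, a loose analogy for spending more test-time compute on harder problems.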

Velocityformer: Broken-Symmetry-Matched Equivariant Graph Transformers for Cosmological Velocity Reconstruction

May 20, 2026

Tilman Tröster, David Mirkovic, Veronika Oehl et al.

Matching a model's architectural symmetries to the actual symmetries present in your data—not just the underlying physics—significantly improves performance and data efficiency.

Velocityformer is a specialized neural network that reconstructs galaxy velocities from survey data to improve cosmological measurements. By designing the model to match the asymmetric structure of real observations (where one direction—the line of sight—is special), it achieves 35% better accuracy than traditional methods and works well even with very limited training data.

architecture · reasoning · applications

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

May 20, 2026

Sixiong Xie, Zhuofan Shi, Haiyang Shen et al.

Retrieval isn't the main problem for frontier models on deep research tasks; instead, they fail primarily at deriving answers from evidence and calibrating confidence correctly, suggesting future improvements should focus on reasoning and verification rather than search.

DeepWeb-Bench is a challenging benchmark for evaluating AI agents that research questions by searching the web, collecting evidence, and reasoning through answers. Unlike existing benchmarks, it focuses on tasks requiring massive evidence gathering, cross-source verification, and complex multi-step reasoning—areas where current frontier models still struggle significantly.

evaluation · reasoning · agents

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

May 20, 2026

Caleb Winston, Ron Yifeng Wang, Azalia Mirhoseini et al.

Compiling agent tasks into code upfront—rather than deciding actions one step at a time—enables parallelization and validation, dramatically reducing latency and errors in web automation.

This paper introduces a compilation approach for web agents that converts natural language tasks into executable code plans instead of executing step-by-step. By generating multiple candidate plans, validating them against tool specifications, and optimizing for parallelization, the system achieves 10x faster execution and better accuracy than existing sequential approaches.

agents · efficiency · reasoning

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

May 20, 2026

Zhepei Wei, Xinyu Zhu, Wei-Lin Chen et al.

RLVR training produces predictable, low-rank weight changes that can be extrapolated mathematically, letting you skip 85% of training compute while matching or exceeding performance on reasoning tasks.

This paper reveals that language models trained with reinforcement learning from verifiable rewards (RLVR) follow surprisingly simple, low-rank weight trajectories.

training · efficiency · reasoning

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

May 20, 2026

Kaiyi Zhang, Wei Wu, Yankai Lin

When training language models with verifiable rewards, focusing on the most discriminative token patterns—rather than averaging all tokens equally—significantly improves learning efficiency and final performance.

This paper improves how language models learn from step-by-step feedback by better understanding which tokens should be rewarded or penalized. The authors show that standard learning methods get distracted by common formatting tokens and miss important patterns that distinguish good answers from bad ones.

training · reasoning · alignment

Mem-π: Adaptive Memory through Learning When and What to Generate

May 20, 2026

Xiaoqiang Wang, Chao Wang, Hadi Nekoei et al.

Generating context-specific guidance dynamically outperforms traditional retrieval-based memory for agents—the system learns to abstain when unnecessary and produce only relevant help, improving task success by over 30% on web navigation.

Mem-π is a framework that gives AI agents smarter memory by generating helpful guidance on-the-fly instead of retrieving fixed entries from a database. A separate model learns when to create guidance and what to create, trained to skip unhelpful suggestions and produce only what the agent actually needs for the current task.

agents · training · reasoning

HITL-D: Human In The Loop Diffusion Assisted Shared Control

May 20, 2026

Riley Zilka, Sergey Khlynovskiy, Allie Wang et al.

Diffusion models can effectively assist human operators in robotic control by automating specific subtasks (like orientation), reducing cognitive load while maintaining human oversight—a practical model for human-AI collaboration in physical systems.

This paper presents HITL-D, a shared control system that combines diffusion-based AI policies with human input for robotic manipulation tasks. Instead of requiring operators to control every aspect of a robot arm, the system automatically handles orientation adjustments while the human focuses on positioning, reducing mental workload and task completion time by 40% in user studies.

agents · applications · reasoning

Code as Agent Harness

May 18, 2026

Xuying Ning, Katherine Tieu, Dongqi Fu et al.

Code is becoming the primary substrate for building reliable, verifiable AI agents. Understanding code as agent harness—the infrastructure layer—is essential for building systems that can plan, remember, use tools, and coordinate across multiple agents.

This survey examines how code serves as the operational foundation for AI agents—not just as output, but as the infrastructure that enables agents to reason, act, model environments, and verify their own behavior.

agents · architecture · reasoning

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

May 18, 2026

Yining Hong, Jiageng Liu, Han Yin et al.

AI agents fail at embodied spatial reasoning primarily because they make poor action choices, not because they can't see—and they confidently stick to wrong answers even when evidence contradicts them, unlike humans who actively seek disconfirming evidence.

ESI-Bench is a benchmark for testing how well AI agents actively explore physical environments to understand spatial relationships. Rather than passively looking at images, agents must decide when to move, manipulate objects, and gather observations to solve tasks.

multimodal · reasoning

Actionable World Representation

May 18, 2026

Kunqi Xu, Jitao Li, Jianglong Ye et al.

By explicitly modeling object state changes as a learnable manifold, WorldString provides a unified way to represent how objects respond to actions—bridging the gap between perception and control for physical world models.

WorldString is a neural architecture that learns to represent how real-world objects change state over time by processing point clouds or video data. It creates a digital twin of objects that captures their actionable properties, serving as a building block for world models that can predict and interact with the physical world.

architecture · reasoning

General Preference Reinforcement Learning

May 18, 2026

Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal et al.

GPRL solves reward hacking in LLM training by treating quality as multi-dimensional rather than scalar, allowing online RL to work on open-ended tasks without collapsing onto exploitable reward axes.

This paper addresses a gap in LLM training by proposing General Preference Reinforcement Learning (GPRL), which handles open-ended tasks as traditional preference optimization does, while keeping the continuous exploration benefits of online RL.

training · alignment · reasoning

Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation

May 18, 2026

Kenan Majewski, Marcin Żugaj

Neural networks can improve classical state estimation by learning adaptive forgetting factors that respond to real-time sensor quality, enabling robust UAV navigation during sensor outages and dynamic environments.

This paper presents a learned Kalman filter that adapts to changing noise conditions in UAVs by using a neural network to dynamically adjust how much it trusts past measurements. Instead of using a fixed forgetting factor, the filter learns a memory policy from sensor data, helping it handle sensor failures and vibrations better than traditional adaptive filters.

training · efficiency · reasoning

OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

May 14, 2026

Shang Zhou, Wenhao Chai, Kaiyuan Liu et al.

Instead of judging multiple reasoning attempts individually (which is noisy), compare them pairwise and aggregate votes to find the best solution—this scales test-time compute breadth more reliably than single-trace depth scaling.

OpenDeepThink improves LLM reasoning by running multiple solution attempts in parallel and selecting the best one using pairwise comparisons between candidates, rather than trying to judge each solution independently. The method uses Bradley-Terry aggregation to rank candidates based on LLM pairwise judgments, then evolves the top solutions using critiques from comparisons.

reasoning · evaluation
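
Bradley-Terry aggregation turns pairwise judgments into one strength score per candidate. The sketch below fits those strengths with the classic minorize-maximize update. It is an illustration under stated assumptions, not the paper's pipeline: the win counts are hypothetical, and the critique-driven evolution step is not modeled.

```python
def bradley_terry(wins, n_items, iters=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[(i, j)] is the number of times candidate i was judged better than j.
    Returns strengths normalized to sum to 1.
    """
    p = [1.0 / n_items] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            total_wins = sum(wins.get((i, j), 0) for j in range(n_items) if j != i)
            denom = 0.0
            for j in range(n_items):
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    denom += games / (p[i] + p[j])
            # Candidates with no comparisons keep their previous strength.
            new_p.append(total_wins / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p

# Hypothetical judge outcomes for three candidate solutions.
wins = {
    (0, 1): 3, (1, 0): 1,
    (0, 2): 3, (2, 0): 1,
    (1, 2): 3, (2, 1): 1,
}
strengths = bradley_terry(wins, n_items=3)
best = max(range(3), key=lambda i: strengths[i])
```

Because each strength is fitted from all comparisons at once, a single noisy judgment moves the ranking less than it would under independent per-candidate scoring.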

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

May 12, 2026

Runhui Huang, Jie Wu, Rui Yang et al.

Self-reflective multimodal models can improve generation quality by learning to reason about user intent and autonomously correct their outputs using decomposed, verifiable rewards from language models.

AlphaGRPO improves multimodal models that generate images and text by teaching them to reason about what users want and fix their own mistakes. It uses a novel reward system that breaks down complex requests into simple checkable questions, allowing the model to learn from reliable feedback without needing extra training setup.

multimodal · reasoning · training

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

May 12, 2026

Di Wu, Zixiang Ji, Asmi Kawatkar et al.

Long-term memory for agents requires more than just storing task outcomes; agents need to internalize environment-specific patterns, workflows, and failure modes to become truly experienced colleagues, and current memory systems still struggle with this despite recent advances.

This paper introduces LongMemEval-V2, a benchmark for testing whether AI agents can build long-term memory of specialized web environments. It includes 451 questions about five types of memory (state recall, workflow knowledge, failure modes, etc.) paired with massive history trajectories up to 500 steps and 115M tokens.

agents · evaluation · reasoning

Learning, Fast and Slow: Towards LLMs That Adapt Continually

May 12, 2026

Rishabh Tiwari, Kusha Sareen, Lakshya A Agrawal et al.

Combining parameter updates with context optimization lets LLMs learn new tasks 3x more efficiently while staying closer to their original capabilities and avoiding the forgetting that comes from pure fine-tuning.

This paper proposes Fast-Slow Training (FST), a method that combines two learning mechanisms for LLMs: updating model parameters (slow learning) and optimizing the input context (fast learning). By separating task-specific adaptation from general knowledge, FST achieves better sample efficiency, reduces catastrophic forgetting, and maintains the model's ability to learn new tasks over time.

training · efficiency · reasoning

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

May 12, 2026

Xuhao Hu, Xi Zhang, Haiyang Xu et al.

Agents perform better when trained to decide dynamically between GUI actions and tool calls rather than using only one approach—this hybrid strategy improved accuracy by 66% on real-world tasks.

ToolCUA trains computer agents to intelligently choose between GUI actions (clicks, typing) and tool calls (APIs) by synthesizing diverse training trajectories from existing data and using reinforcement learning to optimize when to switch between action types. This solves a key problem for digital agents: knowing when to use high-level tools versus low-level GUI interactions.

agents · training · reasoning

MEME: Multi-entity & Evolving Memory Evaluation

May 12, 2026

Seokwon Jung, Alexander Rubinstein, Arnas Uselis et al.

LLM agents struggle with dependency reasoning in persistent memory—when facts relate to each other, systems collapse to near-random performance, and fixing this requires impractically expensive configurations.

This paper introduces MEME, a benchmark for evaluating how well AI agents manage information across multiple sessions. It tests six memory tasks including complex scenarios like tracking dependencies between facts and handling deletions.

evaluation · agents · reasoning

Solve the Loop: Attractor Models for Language and Reasoning

May 12, 2026

Jacob Fein-Ashley, Paria Rashidinejad

Attractor Models make iterative refinement practical by using implicit differentiation to solve fixed points, enabling smaller models (27M-770M parameters) to outperform much larger ones on reasoning and language tasks without the training instability of traditional recurrent architectures.

This paper introduces Attractor Models, which improve on looped Transformers by using implicit differentiation to solve for fixed points in latent representations.

architecture · reasoning · efficiency

GRAPHLCP: Structure-Aware Localized Conformal Prediction on Graphs

May 8, 2026

Peyman Baghershahi, Fangxin Wang, Debmalya Mandal et al.

When using GNNs for predictions, you can get tighter, more reliable uncertainty estimates by explicitly using graph structure rather than just embedding similarity—this gives you both statistical guarantees and practical efficiency.

GRAPHLCP improves uncertainty quantification for graph neural networks by using graph structure to make better predictions with guaranteed coverage. Instead of just looking at embedding similarity, it uses graph topology and a PageRank-based approach to identify similar nodes and weight predictions appropriately, reducing wasted prediction sets while maintaining statistical guarantees.

evaluation · reasoning

VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection

May 8, 2026

James Petullo, Sonny George, Dylan Cashman et al.

You can make confidence-weighted answer selection 47% cheaper by clustering similar reasoning traces and only evaluating unique ones, without sacrificing accuracy.

VecCISC reduces the cost of weighted majority voting for LLM reasoning by filtering out duplicate or low-quality reasoning traces before sending them to a critic model. It uses semantic similarity to identify which candidate answers are worth evaluating, cutting token usage by 47% while maintaining accuracy across math, science, and reasoning tasks.

efficiency · reasoning

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

May 8, 2026

Manish Bhattarai, Ismael Boureima, Nishath Rajiv Ranasinghe et al.

Structured, multi-criterion rewards grounded in real documents help models develop generalizable reasoning skills that transfer to unseen tasks better than single holistic scores.

This paper shows how to train AI models to reason better by grading their responses on multiple specific criteria instead of just right/wrong. The researchers created detailed rubrics from scientific documents and used them to train a language model with a technique called GRPO, which optimizes for partial credit across different dimensions.

training · reasoning · evaluation
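
Multi-criterion grading with partial credit can be sketched in two steps: collapse per-criterion judge scores into one reward, then standardize rewards within a group of responses to the same prompt, as GRPO is commonly described. The criteria, weights, and scores below are hypothetical, and the paper's actual rubric construction and reward shaping may differ.

```python
import statistics

def rubric_reward(scores, weights):
    """Weighted average of per-criterion judge scores, each in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total_weight

def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rubric: correctness counts twice as much as the other criteria.
weights = {"correctness": 2.0, "evidence": 1.0, "clarity": 1.0}

# Hypothetical judge scores for three sampled responses.
group = [
    {"correctness": 1.0, "evidence": 1.0, "clarity": 0.5},
    {"correctness": 1.0, "evidence": 0.0, "clarity": 0.5},
    {"correctness": 0.0, "evidence": 0.5, "clarity": 1.0},
]
rewards = [rubric_reward(s, weights) for s in group]
advantages = group_advantages(rewards)
```

The second response is correct but unsupported, so it lands between the other two instead of tying with the first, which is the partial-credit signal a right/wrong reward would miss.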

The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents

May 8, 2026

Jiayuan Liu, Tianqin Li, Shiyi Du et al.

Giving LLM agents access to longer memory doesn't automatically improve performance; it can actually harm cooperation in multi-agent settings by shifting how they reason about the future, not by making them more suspicious.

When LLMs can remember more conversation history, they actually cooperate less in multi-agent games—a problem called the memory curse. The researchers found that expanded context windows cause models to lose forward-looking intent rather than become paranoid, and they proved this by showing that synthetic positive history and targeted fine-tuning can restore cooperation.

agents · reasoning · alignment

CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

May 8, 2026

James Petullo, Nianwen Xue

Allocating more computational effort to harder SQL generation tasks—by exploring more candidate solutions—significantly improves accuracy without needing larger models.

CA-SQL improves LLM performance on complex SQL generation tasks by estimating question difficulty and dynamically adjusting how many candidate queries to explore. It uses evolutionary search principles and a custom voting method to find better SQL solutions, achieving state-of-the-art results on the BIRD benchmark's hardest problems.

reasoning · applications

Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs

May 8, 2026

Gugan Thoppe, L. A. Prashanth, Ankur Naskar et al.

You can now use principled Q-learning algorithms for risk-sensitive decision-making (exponential utility), with mathematical guarantees that they find optimal policies—previously this lacked solid theoretical foundations.

This paper develops reinforcement learning algorithms for optimizing exponential utility in decision-making problems, which is important for risk-sensitive applications. The authors prove that their Q-learning-style algorithms converge to optimal policies and provide theoretical guarantees on convergence speed.

reasoning
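
One common way to write an exponential utility objective is the certainty equivalent (1/β) · log E[exp(β · R)] of a return R. The sketch below evaluates it on sampled returns to show how the sign of β encodes risk attitude. It does not reproduce the paper's Q-learning algorithms, and the function name and sample returns are hypothetical.

```python
import math

def entropic_value(returns, beta):
    """Certainty equivalent (1/beta) * log E[exp(beta * R)] of sampled returns."""
    if beta == 0:
        return sum(returns) / len(returns)  # limit as beta -> 0 is the plain mean
    scaled = [beta * r for r in returns]
    m = max(scaled)  # shift by the max for numerical stability
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return log_mean_exp / beta

# Two hypothetical return distributions with the same mean of 1.0.
safe = [1.0, 1.0, 1.0, 1.0]
risky = [0.0, 0.0, 2.0, 2.0]

averse_safe = entropic_value(safe, beta=-1.0)
averse_risky = entropic_value(risky, beta=-1.0)    # below the mean
seeking_risky = entropic_value(risky, beta=1.0)    # above the mean
```

With negative β the risky option is valued below the safe one despite the equal mean, which is the behavior risk-sensitive applications want.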

Verifier-Backed Hard Problem Generation for Mathematical Reasoning

May 7, 2026

Yuhang Lai, Jiazhan Feng, Yee Whye Teh et al.

Using an independent verifier to validate problem correctness prevents reward hacking in AI-generated math problems, enabling better training data creation without human experts.

This paper tackles the problem of generating valid and challenging math problems for training AI models. Instead of relying on humans or simple self-play (which often produces invalid problems), the authors introduce VHG, a system with three players: a problem setter, a solver, and an independent verifier.

training · reasoning · data

AI Co-Mathematician: Accelerating Mathematicians with Agentic AI

May 7, 2026

Daniel Zheng, Ingrid von Glehn, Yori Zwols et al.

AI agents work best for complex research when designed as collaborative partners that maintain context, track what didn't work, and produce native outputs—not just as answer machines.

Researchers built an interactive AI workbench that helps mathematicians explore open-ended research problems by combining agents for literature search, computation, theorem proving, and theory building. The system tracks failed ideas, manages uncertainty, and outputs mathematical artifacts—mimicking how human collaborators work together.

agents · reasoning · applications

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

May 7, 2026

Mingwei Xu, Hao Fang

You can train reasoning models effectively using only positive examples—negative examples aren't necessary if you redistribute probability mass correctly and stabilize learning through siamese networks.

This paper proposes POPO, a new training method for reasoning-focused language models that learns exclusively from successful (positive) examples rather than mixing successes with failures. Instead of comparing positive and negative rollouts like existing methods (GRPO), POPO uses importance sampling to implicitly learn what to avoid, stabilized through a siamese network architecture.

training · reasoning · alignment

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

May 7, 2026

Xiangyuan Xue, Yifan Zhou, Zidong Wang et al.

Adding explicit strategy planning at the start of a task—rather than pure reactive decision-making—dramatically improves both learning efficiency and success rates for LLM agents on long-horizon tasks.

StraTA improves how language models learn to make decisions over many steps by having them first plan a high-level strategy before acting. Instead of reacting moment-by-moment, the model samples a strategy from the initial state, follows it through actions, and learns both strategy planning and action execution together.

agents · reasoning

Almost-Orthogonality in Lp Spaces: A Case Study with Grok

May 6, 2026

Ziang Chen, Jaume de Dios Pont, Paata Ivanisvili et al.

AI language models can contribute meaningfully to mathematical discovery by helping identify intermediate lemmas and inequalities, though human mathematicians remain essential for rigorous proof construction and validation.

This paper proves new bounds on how sums of functions behave in mathematical spaces, showing when certain inequalities hold and when they fail. The authors use a large language model called Grok to help discover intermediate results, demonstrating how AI can assist in mathematical research.

reasoning · evaluation

LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents

May 6, 2026

Yijun Lu, Rui Ye, Yuwen Du et al.

Agents performing long-horizon tasks need adaptive context management—selectively compressing or discarding information—rather than naively accumulating everything, which improves efficiency and reduces hallucination.

LongSeeker introduces Context-ReAct, a framework that helps AI agents manage growing context during long tasks by selectively compressing, skipping, or deleting information based on relevance. The agent uses five operations to reshape its working memory, reducing costs and errors while maintaining task-critical information.

agents · reasoning · efficiency

Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer

May 6, 2026

Alexander Hsu, Zhaiming Shen, Wenjing Liao et al.

Transformer attention can act as a feature learner for nonlinear functions during in-context learning, and this capability can be theoretically analyzed with concrete error bounds—bridging the gap between empirical success and mathematical understanding.

This paper explains how transformers perform in-context learning for nonlinear regression tasks. The researchers show that transformer attention mechanisms can automatically create nonlinear features (like polynomials or splines) from examples in the prompt, enabling the model to solve complex regression problems without updating weights.

reasoning · architecture · evaluation

Superposition Is Not Necessary: A Mechanistic Interpretability Analysis of Transformer Representations for Time Series Forecasting

May 6, 2026

Alper Yıldırım

Transformers for time series don't rely on superposition like they do in language tasks, meaning time series forecasting may not require the compositional complexity that makes Transformers powerful for NLP.

This paper investigates how Transformers work internally for time series forecasting by analyzing their hidden representations using sparse autoencoders. The key finding: Transformers don't need complex, overlapping feature representations (superposition) to forecast well—their representations stay sparse and simple, which explains why basic linear models remain competitive.

reasoningevaluation

A Closed-Form Adaptive-Landmark Kernel for Certified Point-Cloud and Graph Classification

May 5, 2026

Sushovan Majhi, Atish Mitra, Žiga Virk et al.

You can build certified graph classifiers without gradient training by using topology-aware landmark selection and closed-form kernel methods—achieving competitive accuracy with built-in confidence bounds.

PALACE is a method for classifying point clouds and graphs using persistent homology (a topological data analysis technique) with adaptive landmark placement.

evaluationreasoning

An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

May 5, 2026

Dutao Zhang, Tian Liao

Retrieval strategy selection can be packaged as a reusable agent skill that learns from experience, rather than hard-coded into workflows, enabling better performance across diverse question types without changing the underlying retrievers.

This paper presents Experience-RAG Skill, a smart retrieval orchestration layer that learns which retrieval strategy works best for different types of questions.

agentsreasoning

A Closed-Form Persistence-Landmark Pipeline for Certified Point-Cloud and Graph Classification

May 4, 2026

Sushovan Majhi, Atish Mitra, Žiga Virk et al.

This approach trades the flexibility of learned models for interpretability and formal guarantees: you get provable error bounds and confidence scores for each prediction, but performance lags behind neural baselines on some datasets due to limited descriptor expressiveness.

PLACE is a method for classifying point clouds and graphs using topological features (persistent homology) with mathematical guarantees.

evaluationreasoning

SCPRM: A Schema-aware Cumulative Process Reward Model for Knowledge Graph Question Answering

May 4, 2026

Jiujiu Chen, Yazheng Liu, Sihong Xie et al.

Process reward models need to account for the full context of reasoning paths and penalize risky intermediate steps, not just reward final correctness—this matters most in domains where wrong reasoning paths are costly.

This paper addresses a key problem in evaluating AI reasoning: process reward models often give high scores to flawed reasoning paths because later correct steps mask earlier mistakes. The authors propose SCPRM, which evaluates reasoning steps by looking at what came before and measuring distance to the target, then use it with tree search to answer questions about knowledge graphs.

reasoningevaluationagents

FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents

May 4, 2026

Quang Hieu Pham, Yang He, Ping Nie et al.

Flexible database interaction throughout reasoning—exploring schemas and data on-demand rather than upfront—is more effective for text-to-SQL than fixed pipelines, even with smaller models.

FlexSQL is a text-to-SQL agent that can explore database schemas, inspect data, and run verification queries at any point during reasoning—rather than retrieving schema once upfront. It generates multiple execution plans, implements them in SQL or Python, and uses a two-tiered repair system to recover from mistakes.

reasoningagentsapplications

AIs and Humans with Agency

May 4, 2026

David Mumford

Building AI systems with genuine agency isn't about making LLMs act alone—it requires new architectures where AI and humans co-develop plans and actions together for specific real-world situations.

This paper examines what agency means for both humans and AI systems, noting that human agency develops gradually through brain maturation while current LLMs struggle to act autonomously. The author argues that effective AI agency requires a fundamentally different architecture where AI systems and humans jointly plan and execute actions together in real-world contexts.

agentsarchitecturereasoning

RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution

May 1, 2026

Arunabh Srivastava, Mohammad A. Khojastepour et al.

To make LLMs reliable at executing plans, you need to enforce structure through explicit control constructs, validate outputs against derived constraints at each step, and dynamically route to the best execution method (reasoning, tools, or code).

RunAgent is a system that helps AI agents execute multi-step plans written in natural language by converting them into a structured format with explicit control flow (like IF statements and loops).

agentsreasoningarchitecture

Observable Performance Does Not Fully Reflect System Organization: A Multi-Level Analysis of Gait Dynamics Under Occlusal Constraint

May 1, 2026

Jacques Raynal, Pierre Slangen, Jacques Margerit

Observable performance metrics can mask fundamentally different internal system organizations—a critical insight for understanding adaptive biological systems where multiple solutions may produce identical outputs.

This study shows that measuring a system's output performance alone doesn't reveal how it's actually organized internally. Using gait analysis in a Parkinson's patient with dental constraints, researchers found that similar-looking movement patterns can come from very different internal system states when examined through dynamical systems and machine learning lenses.

evaluationreasoning

Characterizing the Expressivity of Local Attention in Transformers

May 1, 2026

Jiaoda Li, Ryan Cotterell

Local attention isn't just an efficiency trick—it fundamentally expands what a transformer can learn by recognizing different patterns than global attention, and combining both types creates the most powerful model.

This paper explains why local attention (where tokens only look at nearby predecessors instead of all previous tokens) sometimes improves transformer performance. The authors prove that local attention expands what patterns a transformer can recognize, and combining local and global attention together creates the most expressive model.

architecturereasoningevaluation

Towards Neuro-symbolic Causal Rule Synthesis, Verification, and Evaluation Grounded in Legal and Safety Principles

Apr 30, 2026

Zainab Rehan, Christian Medeiros Adriano, Sona Ghahremani et al.

You can use LLMs with formal verification to automatically synthesize safety rules from human goals, catching errors before deployment—reducing the gap between what we want AI to do and what it actually does.

This paper presents a system that automatically creates and verifies safety rules for AI systems by combining language models, formal logic, and causal reasoning. It takes high-level goals from humans (like "avoid collisions") and converts them into formal logical rules that can be checked for correctness. The approach is tested in autonomous driving scenarios.

safetyreasoningalignment

TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering

Apr 30, 2026

An-Yang Ji, Jun-Peng Jiang, De-Chuan Zhan et al.

LLMs fail at implicit prediction tasks on tables because they don't recognize when a question requires inference from patterns rather than lookup; intent disambiguation is the critical bottleneck.

TopBench is a benchmark for testing how well language models can answer questions about tables that require prediction and reasoning, not just data lookup. It includes 779 examples across tasks like forecasting values, analyzing treatment effects, and complex filtering—revealing that current models struggle to recognize when prediction is needed and often default to simple retrieval instead.

evaluationreasoningdata

Select to Think: Unlocking SLM Potential with Local Sufficiency

Apr 29, 2026

Wenxuan Ye, Yangyang Zhang, Xueli An et al.

Small models already generate the right answers in their candidate predictions—they just rank them poorly. Training them to re-rank their own outputs improves reasoning without external model calls.

Small language models struggle with reasoning tasks compared to large models. This paper discovers that when small models fail, the correct token from a large model is usually hidden in the small model's top-8 predictions.

efficiencyreasoningtraining

HealthNLP_Retrievers at ArchEHR-QA 2026: Cascaded LLM Pipeline for Grounded Clinical Question Answering

Apr 29, 2026

Md Biplob Hosen, Md Alomgeer Hussein, Md Akmol Masud et al.

Cascading multiple specialized modules (query reformulation, evidence ranking, grounded generation, answer-evidence linking) with an LLM outperforms end-to-end approaches for clinical QA, especially when grounding answers to source documents matters for patient safety.

This paper presents a clinical question-answering system that helps patients understand their electronic health records. It uses a four-stage LLM pipeline to interpret patient questions, find relevant evidence in medical notes, generate grounded answers, and link answers back to source documents.

applicationsreasoningevaluation

Language Diffusion Models are Associative Memories Capable of Retrieving Unseen Data

Apr 29, 2026

Bao Pham, Mohammed J. Zaki, Luca Ambrogioni et al.

Language diffusion models memorize training data by default, but you can detect when they switch to genuine generalization by monitoring conditional entropy—a practical signal for assessing whether a deployed model is memorizing or creating.

This paper reveals that language diffusion models work like associative memories—they store training data in 'basins of attraction' and can retrieve both memorized and unseen examples. As training data grows, the model transitions from memorizing to generalizing, a shift detectable by measuring conditional entropy of token predictions.

trainingevaluationreasoning

Recursive Multi-Agent Systems

Apr 28, 2026

Xiyuan Yang, Jiaru Zou, Rui Pan et al.

Multi-agent systems can be made faster and more efficient by having agents refine their reasoning through recursive loops in latent space rather than text-based communication, achieving 1.2-2.4× speedup with 35-76% fewer tokens.

This paper introduces RecursiveMAS, a framework that improves multi-agent AI systems by having agents collaborate through repeated refinement cycles in a shared latent space rather than exchanging text.

agentsreasoningefficiency

How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Apr 28, 2026

Chu-Cheng Lin, Eugene Ie

When training reasoning models with sparse rewards, you can escape cold-start failure by interpolating between RL and supervised learning via the Tsallis loss family—intermediate values of q balance speed of learning with training stability.

This paper solves a key problem in training reasoning models: when models rarely succeed initially, standard reinforcement learning gets stuck. The authors introduce a family of loss functions (using Tsallis math) that smoothly blend between two extremes—pure RL and pure supervised learning—letting practitioners choose how quickly to commit to learning from successes.
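As a rough illustration of such a continuum (a sketch under stated assumptions, not the paper's exact formulation): the Tsallis q-logarithm is ln_q(p) = (p^(1-q) - 1) / (1 - q), which tends to log p as q approaches 1. If the loss is assumed to be -ln_q(p) for a success probability p, then q = 0 gives 1 - p (an expected-failure objective resembling RL on a binary reward) and q = 1 gives the negative log-likelihood used in supervised learning. The gradient is the log-likelihood gradient scaled by p^(1-q), so at q = 0 it nearly vanishes when successes are rare. The function names below are hypothetical.

```python
import math

def q_log(p: float, q: float) -> float:
    """Tsallis q-logarithm; equals math.log(p) in the limit q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(p)
    return (p ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_loss(p: float, q: float) -> float:
    """Assumed loss form: negative q-logarithm of the success probability p."""
    return -q_log(p, q)

def grad_weight(p: float, q: float) -> float:
    """Factor p**(1-q) that scales grad(log p) in the gradient of tsallis_loss."""
    return p ** (1.0 - q)

# Compare loss and gradient scale for a model that succeeds 1% of the time.
for q in (0.0, 0.5, 1.0):
    print(q, tsallis_loss(0.01, q), grad_weight(0.01, q))
```

Under this assumed form, intermediate q keeps a larger gradient signal than q = 0 when p is small, while still down-weighting low-probability successes relative to q = 1.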

trainingreasoningalignment

Teacher Forcing as Generalized Bayes: Optimization Geometry Mismatch in Switching Surrogates for Chaotic Dynamics

Apr 28, 2026

Andre Herz, Daniel Durstewitz, Georgia Koppe

Teacher forcing trains RNNs on chaotic systems differently than the model will actually be used—this mismatch can make models fit data well statistically while performing poorly at predicting actual dynamics, a problem that becomes worse when multiple explanations exist for the data.

This paper reveals a fundamental mismatch between how teacher forcing (a common training technique) and marginal likelihood (the true objective) shape neural network optimization for chaotic systems.

trainingreasoning

Toward a Functional Geometric Algebra for Natural Language Semantics

Apr 28, 2026

James Pustejovsky

Geometric algebra expands n-dimensional embeddings into a 2^n-dimensional structure that can represent both base concepts and their interactions in a single unified framework, potentially solving long-standing problems in how neural networks compose meanings.

This paper proposes using geometric algebra (Clifford algebras) instead of conventional linear algebra as the mathematical foundation for representing word and sentence meanings in AI.

architecturereasoning

Variational Neural Belief Parameterizations for Robust Dexterous Grasping under Multimodal Uncertainty

Apr 28, 2026

Clinton Enwerem, Shreya Kalyanaraman, John S. Baras et al.

Using differentiable Gaussian mixtures to represent grasp uncertainty enables fast, gradient-based optimization for worst-case robustness—achieving 10x speedup over particle filters while maintaining or improving success rates.

This paper tackles the problem of robust robotic grasping when contact forces, sensing, and external disturbances are unpredictable. Instead of using slow particle-filter approaches, the authors represent uncertainty as a learnable Gaussian mixture and optimize for worst-case performance (CVaR) using gradient-based methods.
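For readers unfamiliar with CVaR (conditional value at risk), here is a minimal sketch of the standard sample-based estimate: the average of the worst alpha fraction of outcomes. It only illustrates the risk measure itself; it is not the paper's differentiable Gaussian-mixture method, and the function name and example numbers are hypothetical.

```python
import math

def empirical_cvar(losses, alpha=0.1):
    """Mean of the worst ceil(alpha * n) losses, where a larger loss is worse."""
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    k = math.ceil(alpha * len(losses))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k

# Hypothetical per-trial grasp losses; the worst half averages to 3.5.
print(empirical_cvar([1.0, 4.0, 2.0, 3.0], alpha=0.5))
```

With alpha = 1.0 the estimate reduces to the ordinary mean, so smaller alpha values put more weight on rare bad outcomes.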

reasoningefficiencyagents

Conflict-Aware Harmonized Rotational Gradient for Multiscale Kinetic Regimes

Apr 27, 2026

Zhangyong Liang

When training neural networks on multiscale physics problems, gradient conflicts between different regimes can cause training failure—HRGrad fixes this by explicitly managing gradient directions to keep all objectives aligned during optimization.

This paper introduces HRGrad, a method for training neural networks on physics problems that span multiple scales—from microscopic to macroscopic behavior. The key challenge is that different scales pull the network in conflicting directions during training.

trainingreasoning

Learning to Think from Multiple Thinkers

Apr 27, 2026

Nirmit Joshi, Roey Magen, Nathan Srebro et al.

Learning from diverse reasoning traces is harder than learning from a single thinker, but you can overcome this by actively collecting reasoning data from many thinkers (logarithmic in target accuracy) combined with passive final-answer supervision.

This paper studies how AI models can learn from multiple people or programs solving the same problem in different ways (e.g., different math solutions or code implementations).

trainingreasoningdata

SpecRLBench: A Benchmark for Generalization in Specification-Guided Reinforcement Learning

Apr 27, 2026

Zijian Guo, İlker Işık, H. M. Sabbir Ahmad et al.

Current specification-guided RL methods generalize poorly to new environments and complex tasks—this benchmark helps identify where they fail and guides development of more robust approaches.

SpecRLBench is a benchmark for testing how well reinforcement learning agents can follow formal task specifications (written in linear temporal logic) across different, unseen environments and robot types. The benchmark reveals that current methods struggle as tasks and environments become more complex, providing a structured way to develop better specification-guided RL systems.

evaluationreasoningagents

Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft

Apr 27, 2026

Zhou Ziheng, Huacong Tang, Jinyuan Zhang et al.

Current AI agents struggle most with identifying knowledge gaps and formulating the right questions, not just answering them—a shift in bottleneck that suggests we need better ways to help AI systems recognize what they don't know.

This paper introduces SciCrafter, a Minecraft-based benchmark that tests whether AI agents can discover causal rules and apply them to solve increasingly complex problems.

reasoningagentsevaluation

BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering

Apr 24, 2026

Jinghong Chen, Jingbiao Mei, Guangyu Yang et al.

By treating retrieved documents as an ensemble with probabilistic weights updated during generation, BERAG avoids concatenating long contexts while improving both performance and interpretability—especially valuable for visual question answering where context length is expensive.

This paper proposes BERAG, a retrieval-augmented generation system that processes retrieved documents individually rather than concatenating them into one long context. Instead of treating all documents equally, BERAG uses Bayesian inference to weight documents based on how useful they are during answer generation, updating these weights token-by-token.

multimodalreasoning

MathDuels: Evaluating LLMs as Problem Posers and Solvers

Apr 23, 2026

Zhiqiu Xu, Shibo Jin, Shreya Arya et al.

Models can be strong at solving math problems but weak at creating challenging ones—dual-role evaluation exposes capability gaps that single-role benchmarks miss, and the benchmark naturally scales with model strength.

MathDuels is a new way to test AI math abilities by having models both create and solve problems against each other. Unlike static benchmarks that get too easy, this self-play approach reveals hidden differences between models—some are great solvers but poor problem creators, and vice versa.

evaluationreasoning

From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation

Apr 23, 2026

Bartosz Balis, Michal Orzechowski, Piotr Kica et al.

By separating LLM interpretation from deterministic workflow generation and encoding domain knowledge in reusable "Skills" documents, you can reliably automate the conversion of research questions into executable scientific workflows with minimal cost and overhead.

This paper presents an AI system that automatically converts research questions into executable scientific workflows. It uses three layers: an LLM to understand natural language, validated generators to create reproducible workflow specifications, and domain expert "Skills" documents that guide the process.

agentsapplicationsreasoning

Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models

Apr 23, 2026

Chee Wei Tan, Yuchen Wang, Shangxin Guo

LLMs can be operationalized as strategic game agents that adapt their reasoning approach based on game type, and interactive platforms like Nemobot let developers actively experiment with and refine these agents in real time.

Nemobot is an interactive platform that uses large language models to create game-playing AI agents across different game types—from word games to strategy games. Users can build, customize, and deploy these agents while watching them learn and improve through reinforcement learning, human feedback, and self-critique.

agentsreasoningapplications

A Multi-Stage Warm-Start Deep Learning Framework for Unit Commitment

Apr 23, 2026

Muhy Eddin Za'ter, Anna Van Boven, Bri-Mathias Hodge et al.

Deep learning can accelerate hard optimization problems by providing intelligent warm-start solutions that reduce the search space, rather than replacing traditional solvers entirely.

This paper uses a transformer neural network to predict electricity generator schedules 72 hours ahead, then refines those predictions with rule-based corrections and feeds them to a traditional optimization solver as a starting point.

applicationsreasoningefficiency

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

Apr 23, 2026

Jun Wang, Ziyin Zhang, Rui Wang et al.

LLMs can be practical for production incident detection when paired with efficient indexing, noise filtering, and domain-specific routing—not just as standalone models, but as part of a multi-stage system that handles real-world scale and complexity.

TingIS is a production system that detects critical technical incidents from noisy customer reports in real-time at enterprise scale.

applicationsagentsreasoning

Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions

Apr 23, 2026

Jiseon Kim, Jea Kwon, Luiz Felipe Vecchietti et al.

LLMs can model human moral reasoning but don't use that understanding in their own decisions—they follow abstract rules instead of social context, creating a dangerous misalignment between their internal understanding and external behavior.

This study tests whether large language models understand how human morality shifts based on relationships and context. Using a whistleblower dilemma scenario, researchers found that LLMs can predict how humans actually behave (favoring loyalty to friends), but their own decisions follow rigid fairness rules instead.

alignmentreasoningevaluation

Modulating Cross-Modal Convergence with Single-Stimulus, Intra-Modal Dispersion

Apr 23, 2026

Eghbal A. Hosseini, Brian Cheung, Evelina Fedorenko et al.

Single images with high agreement among vision models show dramatically stronger alignment with language models, suggesting that representational convergence across modalities is driven by how unambiguously the environment constrains perception.

This paper reveals that how consistently different vision models represent individual images (intra-modal agreement) strongly predicts whether vision and language models will represent those same images similarly (cross-modal alignment).

multimodalevaluationreasoning

On the algebra of Koopman eigenfunctions and on some of their infinities

Apr 23, 2026

Zahra Monfared, Saksham Malhotra, Sekiya Hajime et al.

You can generate many more Koopman eigenfunctions from a few computed ones by treating them as an algebraic group, enabling better system representations from sparse or incomplete data.

This paper shows how to compute more eigenfunctions of the Koopman operator—a mathematical tool for analyzing dynamical systems—by using algebraic relationships between a small set of known eigenfunctions.

reasoningarchitecture

Probably Approximately Consensus: On the Learning Theory of Finding Common Ground

Apr 23, 2026

Carter Blair, Ben Armstrong, Shiri Alouf-Heffetz et al.

You can find practical consensus in large communities by treating it as a learning problem—identifying opinion intervals that maximize agreement while accounting for topic importance, with provable guarantees on how many user queries you actually need.

This paper tackles finding consensus in online communities by modeling agreement as an interval in opinion space. Rather than just looking at specific statements users provide, the method accounts for which topics matter most to the community.

reasoningevaluation

Quotient-Space Diffusion Models

Apr 23, 2026

Yixian Xu, Yusong Wang, Shengjie Luo et al.

Quotient-space diffusion models reduce learning complexity for symmetric generative tasks by formally accounting for group symmetries, enabling better molecular and protein structure generation without learning redundant symmetric variations.

This paper introduces a mathematical framework for diffusion models that accounts for symmetries in generative tasks, particularly molecular structure generation. By modeling distributions on quotient spaces (which treat symmetric objects as equivalent), the approach simplifies learning compared to existing symmetry-aware methods and guarantees correct sampling of target distributions.

architecturereasoningapplications

Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems

Apr 23, 2026

Ye Yu, Heming Liu, Haibo Jin et al.

Multi-agent LLM systems can achieve better reasoning by learning optimized latent communication channels instead of relying on fixed text-based protocols, with significant improvements on challenging benchmarks.

This paper introduces DiffMAS, a training framework that lets multiple AI agents learn how to communicate with each other through internal representations (like key-value caches) rather than text. By jointly optimizing both reasoning and communication during training, agents can better coordinate on complex tasks like math, science, and coding problems.

agentstrainingreasoning

Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications

Apr 23, 2026

Yvon K. Awuklu, Meghyn Bienvenu, Katsumi Inoue et al.

You can build practical event detection systems using logical rules and constraint satisfaction that work efficiently on real timestamped data while handling conflicting inferences—demonstrated on medical records.

This paper presents a logic-based system for detecting high-level events from timestamped data, like inferring disease episodes from patient medical records. The system uses logical rules to identify events, handles conflicts between inferred events, and can run efficiently on real data while staying aligned with expert knowledge.

reasoningdataapplications

Thinking with Reasoning Skills: Fewer Tokens, More Accuracy

Apr 23, 2026

Guangxiang Zhao, Qilong Shi, Xusen Xiao et al.

By retrieving learned reasoning skills at inference time instead of reasoning from scratch, you can reduce token usage and improve accuracy—making LLM reasoning cheaper and faster for practical deployment.

This paper proposes storing reusable reasoning skills learned from past problem-solving attempts, then retrieving and applying them during inference to guide new reasoning. Instead of reasoning from scratch each time, the model recalls relevant skills to avoid redundant work and reach solutions faster. Tests on coding and math tasks show it uses fewer tokens while improving accuracy.

reasoningefficiencytraining

Transferable Physics-Informed Representations via Closed-Form Head Adaptation

Apr 23, 2026

Jian Cheng Wong, Isaac Yin Chung Lai, Pao-Hsiung Chiu et al.

Physics-informed neural networks can be made dramatically faster and more generalizable by learning shared representations across PDE families and using closed-form adaptation, enabling accurate predictions on new problems without retraining.

This paper introduces Pi-PINN, a physics-informed neural network that learns reusable representations for solving different partial differential equations (PDEs). Instead of training separate models for each PDE, Pi-PINN learns a shared representation and adapts quickly to new PDEs by solving for the output head in closed form with the pseudoinverse, achieving 100-1000x faster predictions than standard PINNs.

efficiencyreasoning

Convergent Evolution: How Different Language Models Learn Similar Number Representations

Apr 22, 2026

Deqing Fu, Tianyi Zhou, Mikhail Belkin et al.

Language models naturally converge on similar periodic number representations across different architectures, but whether they learn features useful for arithmetic depends on training signals like text-number co-occurrence or multi-token addition problems.

Different language models (Transformers, RNNs, LSTMs) independently learn to represent numbers using periodic patterns with periods of 2, 5, and 10—a phenomenon called convergent evolution.

trainingreasoning

Diagnosing CFG Interpretation in LLMs

Apr 22, 2026

Hanqi Li, Lu Chen, Kai Yu

LLMs can maintain surface-level syntax when following grammars but fail at deeper semantic understanding, especially with complex nested structures—a critical limitation for building reliable AI agents that need to follow formal specifications.

This paper tests whether large language models can correctly interpret and follow context-free grammars (formal rules for structured output). The researchers created RoboGrid, a testing framework that checks if LLMs produce syntactically correct, semantically meaningful outputs when given novel grammars.

evaluationreasoningagents

OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

Apr 22, 2026

Qiguang Chen, Chengyu Luan, Jiajun Wu et al.

Current vision-language models struggle with multi-image reasoning even on problems they might solve with single images—this benchmark shows that connecting information across multiple images is a major unsolved challenge.

OMIBench is a benchmark for testing how well vision-language models can solve Olympiad-level problems that require reasoning across multiple images. Unlike existing benchmarks that focus on single images, OMIBench tests whether models can connect evidence scattered across different images to solve complex problems in biology, chemistry, math, and physics.

evaluationmultimodalreasoning

Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems

Apr 22, 2026

Pavel Salovskii, Iuliia Gorshkova

Pairing LLMs with structured ontologies creates a verification layer that catches errors and enables long-term memory—turning language models into more reliable reasoning systems for planning and decision-making.

This paper proposes adding a structured knowledge graph layer to LLMs using RDF/OWL ontologies, enabling persistent memory and verifiable reasoning. The system automatically builds ontologies from documents and APIs, then combines graph-based reasoning with LLM inference to improve multi-step planning tasks and add formal validation to AI outputs.

reasoningagents

Generalization at the Edge of Stability

Apr 21, 2026

Mario Tuci, Caner Korkmaz, Umut Şimşekli et al.

Training at the edge of stability (where optimization becomes chaotic) generalizes better because the optimizer converges to a lower-dimensional fractal attractor, and you can predict generalization by measuring the complete structure of the loss landscape's curvature, not just simple summaries.

This paper explains why training neural networks with large learning rates—which causes chaotic, oscillatory behavior—actually improves generalization. The authors model optimizers as random dynamical systems that converge to fractal attractors and introduce 'sharpness dimension' to measure generalization.

trainingscalingreasoning

Safe Continual Reinforcement Learning in Non-stationary Environments

Apr 21, 2026

Austin Coursey, Abel Diaz-Gonzalez, Marcos Quinones-Grueiro et al.

Safe continual reinforcement learning faces a fundamental trade-off: methods that maintain safety constraints often catastrophically forget previous knowledge when environments change, and vice versa—a problem existing approaches fail to fully resolve.

This paper studies how to safely train AI controllers that adapt to changing environments over time. The authors show that existing methods struggle to both prevent safety violations and avoid forgetting previous knowledge when system dynamics shift unexpectedly.

safetytrainingreasoning

FASTER: Value-Guided Sampling for Fast RL

Apr 21, 2026

Perry Dong, Alexander Swerdlow, Dorsa Sadigh et al.

You can get the benefits of expensive test-time sampling in RL by learning to filter action candidates early in the generation process, reducing compute without sacrificing performance.

FASTER is a method that speeds up reinforcement learning by filtering action candidates during the denoising process of diffusion-based policies, rather than waiting until denoising completes. It models this filtering as a decision problem with a learned value function, achieving the same performance as expensive sampling methods while cutting computational costs significantly.

efficiencyreasoningtraining

Benign Overfitting in Adversarial Training for Vision Transformers

Apr 21, 2026

Jiaming Zhang, Meng Ding, Shaopeng Fu et al.

Vision Transformers can be made adversarially robust through standard adversarial training, and surprisingly, overfitting doesn't necessarily hurt robustness if the signal-to-noise ratio is favorable—a finding that challenges conventional wisdom about the robustness-generalization tradeoff.

This paper provides the first theoretical analysis of adversarial training in Vision Transformers, showing that under certain conditions, ViTs can achieve strong robustness against adversarial attacks even when overfitting occurs.

safetyarchitecturereasoning

Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views

Apr 21, 2026

Feihao Fang, My T. Thai, Yuanyuan Lei

LLMs have a hidden logical reasoning layer that works the same way whether reasoning in English or symbolic notation—you can exploit this to improve reasoning by steering the model toward this shared space without retraining.

This paper discovers that LLMs contain a shared internal logical reasoning space that aligns natural language and symbolic reasoning. By analyzing how the model's internal activations correlate across both reasoning styles, researchers created a method to steer the model toward better logical reasoning without additional training, improving accuracy on reasoning tasks by up to 11%.

reasoningarchitecture
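
As a rough illustration of what steering toward a shared subspace can look like, the sketch below amplifies the component of a hidden vector that lies in a given subspace. It assumes an orthonormal `basis` for the subspace is already available; how the paper estimates that subspace from natural-language and symbolic activations is not shown. The names `project`, `steer`, and `alpha` are hypothetical, and the paper's actual steering rule may differ.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(hidden, basis):
    """Project `hidden` onto the span of an orthonormal `basis`."""
    projection = [0.0] * len(hidden)
    for direction in basis:
        coeff = dot(hidden, direction)
        projection = [p + coeff * d for p, d in zip(projection, direction)]
    return projection


def steer(hidden, basis, alpha):
    """Add `alpha` times the in-subspace component back onto `hidden`."""
    in_subspace = project(hidden, basis)
    return [h + alpha * p for h, p in zip(hidden, in_subspace)]


hidden = [2.0, 3.0, 4.0]
basis = [[1.0, 0.0, 0.0]]  # toy one-dimensional "logical" subspace
print(steer(hidden, basis, alpha=0.5))  # [3.0, 3.0, 4.0]
```

With `alpha=0` the vector is unchanged, so the strength of the intervention can be tuned without touching model weights.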

Ultrametric OGP - parametric RDT symmetric binary perceptron connection

Apr 21, 2026

Mihailo Stojnic

Overlap gap properties and parametric RDT appear to be two sides of the same coin for characterizing computational hardness—the paper provides strong numerical evidence that they converge to the same algorithmic threshold, offering a unified geometric-algorithmic perspective on statistical-comp...

This paper connects two mathematical frameworks for understanding hard computational problems in machine learning: overlap gap properties (OGPs) and random duality theory (RDT).

reasoningevaluation

Planning in entropy-regularized Markov decision processes and games

Apr 21, 2026

Jean-Bastien Grill, Omar Darwiche Domingues, Pierre Ménard et al.

Entropy regularization makes planning problems mathematically smoother, enabling algorithms with provable sample-efficiency guarantees that are not known for the unregularized setting.

SmoothCruiser is a planning algorithm that efficiently estimates value functions in entropy-regularized decision-making problems. By exploiting the smoothness that entropy regularization provides, it achieves polynomial sample complexity guarantees, whereas no algorithm for the non-regularized setting is known to have such a guarantee in the worst case.

reasoning
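
The smoothness mentioned for entropy-regularized planning comes from replacing the hard `max` over actions with a log-sum-exp. The sketch below shows only that regularized backup for a single state, with a temperature named `lam` (an assumed symbol); it is not an implementation of SmoothCruiser.

```python
import math


def soft_max_value(q_values, lam):
    """Entropy-regularized backup: lam * log(sum(exp(q / lam)))."""
    top = max(q_values)  # subtract the max for numerical stability
    return top + lam * math.log(sum(math.exp((q - top) / lam) for q in q_values))


def soft_policy(q_values, lam):
    """Boltzmann policy associated with the regularized backup."""
    top = max(q_values)
    weights = [math.exp((q - top) / lam) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


q = [1.0, 0.5, -0.2]
for lam in (1.0, 0.1, 0.01):
    value = soft_max_value(q, lam)
    policy = [round(p, 3) for p in soft_policy(q, lam)]
    print(lam, round(value, 4), policy)
```

The regularized value always lies between `max(q)` and `max(q) + lam * log(n)` for `n` actions, and it approaches the hard maximum as `lam` shrinks. Unlike `max`, it is differentiable everywhere, and its gradient is the Boltzmann policy.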

A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

Apr 21, 2026

Shuai Wang, Hongyi Zhu, Jia-Hong Huang et al.

Planning retrieval steps before searching for evidence improves both explanation quality and interpretability—the system can show why it chose specific evidence rather than just providing answers.

A-MAR is an AI system that explains artworks by breaking down questions into structured reasoning steps, then retrieving relevant evidence for each step. Unlike standard AI models that give answers based on internal knowledge, A-MAR shows its work—decomposing art questions into explicit goals, finding supporting evidence, and building explanations step-by-step.

agentsmultimodalreasoning

An Answer is just the Start: Related Insight Generation for Open-Ended Document-Grounded QA

Apr 21, 2026

Saransh Sharma, Pritika Ramu, Aparna Garimella et al.

Open-ended QA isn't just about finding one answer—users need follow-up insights to refine their thinking. This work shows how to systematically generate those related insights from document collections to support iterative question-answering.

This paper introduces a new task where AI systems generate additional insights from documents to help users refine and improve answers to open-ended questions. The authors release SCOpE-QA, a dataset of 3,000 questions, and propose InsightGen, a method that clusters documents thematically and selects relevant context to generate diverse insights using language models.

evaluationapplicationsreasoning

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

Apr 20, 2026

Shaden Alshammari, Kevin Wen, Abrar Zainal et al.

Current state-of-the-art models achieve only 69-78% on Olympiad-level math problems, and embedding models struggle to find mathematically equivalent problems—showing that both mathematical reasoning and math-aware retrieval remain open challenges for AI systems.

MathNet is a large-scale benchmark with 30,676 Olympiad-level math problems across 17 languages and 47 countries, designed to evaluate both how well AI models solve math problems and how well they retrieve similar problems. The benchmark reveals that even top models struggle with complex reasoning, and that retrieval quality significantly impacts performance in retrieval-augmented problem solving.

reasoningmultimodal
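
For readers unfamiliar with how retrieval of similar problems is typically scored, here is a minimal recall@k sketch over embedding vectors using cosine similarity. The toy vectors, the single-relevant-item setup, and the choice of recall@k are assumptions for illustration; the MathNet summary above does not specify the benchmark's exact metric.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(query_vecs, corpus_vecs, relevant, k):
    """Fraction of queries whose relevant corpus index appears in the top k."""
    hits = 0
    for query_index, query in enumerate(query_vecs):
        ranked = sorted(
            range(len(corpus_vecs)),
            key=lambda j: cosine(query, corpus_vecs[j]),
            reverse=True,
        )
        if relevant[query_index] in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


queries = [[1.0, 0.0], [0.0, 1.0]]
corpus = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
relevant = [0, 2]  # hypothetical index of the equivalent problem per query
print(recall_at_k(queries, corpus, relevant, k=1))  # 0.5
print(recall_at_k(queries, corpus, relevant, k=2))  # 1.0
```

In the toy data the second query's equivalent problem ranks second, so it is missed at k=1 and found at k=2. This is the kind of near miss that hurts retrieval-augmented solving when only the top result is used.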

Sessa: Selective State Space Attention

Apr 20, 2026

Liubomyr Horbatko

Sessa's hybrid architecture lets the influence of distant tokens decay as a power law in distance, O(ℓ^(-β)), rather than exponentially or linearly, making it more effective for long-context language modeling while staying competitive on standard benchmarks.

Sessa combines attention mechanisms with state-space model feedback paths to improve how models retrieve information from long contexts.

architectureefficiencyreasoning
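
To see why a power-law tail matters for long contexts, the sketch below compares a power-law weight ℓ^(-β) with an exponential weight exp(-αℓ) at increasing distances. The values of `beta` and `alpha` are arbitrary assumptions chosen for illustration, not parameters reported for Sessa.

```python
import math


def power_law_weight(distance, beta):
    """Weight that decays as distance ** (-beta)."""
    return distance ** (-beta)


def exponential_weight(distance, alpha):
    """Weight that decays as exp(-alpha * distance)."""
    return math.exp(-alpha * distance)


for distance in (10, 100, 1000, 10000):
    print(
        distance,
        power_law_weight(distance, beta=1.0),
        exponential_weight(distance, alpha=0.01),
    )
```

With these assumed constants the exponential weight is larger at short range, but the power-law weight overtakes it by a distance of 1000 and stays orders of magnitude larger beyond that.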

Bounded Ratio Reinforcement Learning

Apr 20, 2026

Yunke Ao, Le Chen, Bruce D. Lee et al.

BRRL provides the first principled theoretical foundation for PPO-style clipped objectives, proving monotonic improvement and connecting trust region methods to the Cross-Entropy Method—offering both better understanding and a path to improved algorithms.

This paper fixes a theoretical gap in PPO by introducing BRRL, a framework that derives the mathematically optimal policy update with guaranteed improvement. The authors develop BPO, a practical algorithm that approximates this optimal solution, and extend it to GBPO for LLM fine-tuning. Experiments show BPO matches or beats PPO across robotics, games, and language model tasks.

trainingreasoning
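
For context, the PPO-style clipped objective that BRRL sets out to justify can be written in a few lines. The sketch below is the standard PPO clipped surrogate, not BRRL, BPO, or GBPO; `eps=0.2` is a commonly used default and is assumed here.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Batch average of the clipped surrogate (to be maximized)."""
    terms = [ppo_clip_term(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


# A ratio above 1 + eps earns no extra credit when the advantage is positive.
print(ppo_clip_term(1.5, 1.0))
# A ratio below 1 - eps is clipped when the advantage is negative.
print(ppo_clip_term(0.5, -1.0))
```

The clip removes the incentive to move the probability ratio far from 1 in the direction the advantage favors. This heuristic trust-region behavior is what the paper aims to put on a principled footing.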

Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs

Apr 20, 2026

Kevin Murphy

Structured belief representations that combine numbers with natural language evidence, updated iteratively, outperform simply appending all retrieved information to context—and this structured approach is as valuable as having web search access.

BLF is an AI forecasting system that makes better predictions by maintaining a structured belief state, which combines probabilities with evidence summaries, and updating it iteratively through tool use. It combines multiple independent forecasting trials and applies statistical calibration to avoid overconfident predictions, achieving top performance on forecasting benchmarks.

reasoningagentsevaluation
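
One common way to combine several independent probability forecasts and temper overconfidence is to average them in log-odds space and then apply a linear recalibration (Platt-style scaling). The sketch below shows that generic recipe; the function names and the `slope` and `bias` parameters are assumptions, and the BLF paper's actual aggregation and calibration steps may differ.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def aggregate_forecasts(probabilities, slope=1.0, bias=0.0):
    """Average trials in log-odds space, then recalibrate linearly.

    A slope below 1 pulls the result toward 0.5 (hypothetical calibration).
    """
    mean_logit = sum(logit(p) for p in probabilities) / len(probabilities)
    return sigmoid(slope * mean_logit + bias)


trials = [0.9, 0.8, 0.95]  # made-up outputs of three independent trials
print(round(aggregate_forecasts(trials), 3))
print(round(aggregate_forecasts(trials, slope=0.5), 3))
```

In practice `slope` and `bias` would be fit on past resolved questions. With a slope below 1, a confident aggregate is pulled toward 0.5, which is the direction needed to correct overconfidence.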