Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

1492 papers, 12 this month, 12 topics

Topics: Evaluation (42), Training (39), Agents (31), Reasoning (27), Efficiency (25), Safety (18), Multimodal (17), Applications (17), Alignment (11), Data (11), Architecture (8), Scaling (6)

Jul 6 – Jul 12 (2)

From Fixed to Free Cameras: Calibration-Free View-Robust Vision-Language-Action Model

Jul 6, 2026

Wenhao Li, Xueying Jiang, Quanhao Qian et al.

Robot policies can achieve view robustness without camera calibration by learning to predict both actions in camera space and camera-to-robot geometry, making deployment more practical when camera positions vary.

This paper introduces CamVLA, a robot vision-language-action model that infers where the camera sits relative to the robot instead of requiring explicit calibration. By predicting both camera-relative actions and the geometric relationship between camera and robot, the model handles varying camera placements without depth data or prior calibration.

multimodal, agents, applications

Search Beyond What Can Be Taught: Evolving the Knowledge Boundary in Agentic Visual Generation

Jul 6, 2026

Haozhe Wang, Weijia Feng, Jinpeng Yu et al.

Visual generators need to learn *when* to search for external knowledge, not just *how* to use it—and this knowledge boundary is discoverable through co-training, not fixed in advance.

This paper identifies a critical gap in visual generators: they confidently create incorrect images for requests about new entities, trending topics, and post-training events. The authors show that naive search-augmentation fails because generators have an evolving 'knowledge boundary'—a threshold between what they learned and what needs external context.

Jun 29 – Jul 5 (15)

Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

Jul 2, 2026

Yuxuan Li, Lingxi Xie, Xinyue Huo et al.

Reasoning models can improve speaker identification in video by combining multiple modalities and contextual evidence, outperforming traditional audio-only approaches on challenging cases.

This paper tackles speaker recognition in long-form TV dramas by introducing DramaSR-532K, a large benchmark with 532K annotated dialogue lines, and DramaSR-LRM, a reasoning-based approach that combines audio, text, and visual information to accurately identify which character is speaking. The method works especially well on short utterances where voice alone isn't reliable.

multimodal, reasoning, applications

Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

Jul 2, 2026

Liyan Tang, Fangcong Yin, Greg Durrett

Vision-language models can be trained to self-correct more effectively by explicitly grounding their reflection in visual inputs, rather than just generating text-based corrections—this matters especially when models encounter out-of-distribution images.

This paper improves how vision-language models correct their own mistakes by training them to look back at images while reasoning. The authors use reinforcement learning with two key techniques: masking earlier reasoning steps to force the model to recover from errors, and replaying diverse failure scenarios. Their method helps models stay accurate even when given unfamiliar images.

Jun 22 – Jun 28 (22)

Parameter Efficient Hybrid Transformer (PEHT) for Network Traffic Prediction via Dynamic Urban Congestion Integration

Jun 26, 2026

Abdolazim Rezaei, Mehdi Sookhak, Mahboobeh Haghparast

By combining parameter-efficient fine-tuning (LoRA) with multimodal fusion of urban context, you can build accurate traffic prediction models that use fewer trainable parameters without sacrificing performance.

This paper presents PEHT, a traffic prediction model that combines Transformers with urban mobility data to forecast cellular network demand. It uses LoRA to reduce the number of trainable parameters, while a multimodal fusion strategy integrates congestion and mobility information, achieving better accuracy than existing methods on real telecom data.

efficiency, multimodal, applications
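To make the LoRA idea in the PEHT summary concrete, here is a minimal sketch of a single linear layer with a frozen weight and a low-rank update. The layer sizes, rank, and `alpha` value are hypothetical and chosen only for illustration; none of them come from the paper, and this is not the authors' implementation.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Apply a frozen weight W plus a scaled low-rank update B @ A."""
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

def trainable_fraction(d_out, d_in, r):
    """Adapter parameters as a share of the full weight matrix's parameters."""
    return r * (d_in + d_out) / (d_out * d_in)

# Hypothetical sizes for illustration only; not taken from the paper.
d_out, d_in, r = 512, 512, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init so the update starts at zero

x = rng.normal(size=(4, d_in))
y = lora_forward(x, W, A, B, alpha=16)
fraction = trainable_fraction(d_out, d_in, r)
print(y.shape, fraction)
```

With these assumed sizes, only `A` and `B` would be trained, which is about 3% of the parameters of the full weight matrix. Because `B` starts at zero, the layer initially behaves exactly like the frozen one.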

Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models

Jun 26, 2026

Niclas Lietzow, Danielle Bitterman, Carsten Eickhoff et al.

Vision-language models have a sparse, identifiable causal circuit that controls whether they trust visual input or stored knowledge—removing just a few attention heads flips the model from knowledge-based to vision-based answers in most cases.

This paper reveals how vision-language models choose between visual evidence and memorized knowledge when they conflict. Using activation analysis, researchers identified a small set of attention heads (2.5–4.8% of heads) that act as a causal switch: removing them makes models trust their eyes instead of what they've learned.
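One common way to "remove" attention heads is zero-ablation: the selected heads' outputs are zeroed before the output projection. The sketch below shows that operation on random data. The shapes, the head indices, and the function name `multi_head_output` are hypothetical, and the paper may use a different intervention (for example, mean-ablation or patching).

```python
import numpy as np

def multi_head_output(head_outputs, W_o, ablate=()):
    """Combine per-head outputs, zeroing any heads listed in `ablate`.

    head_outputs: (n_heads, seq_len, d_head)
    W_o: (n_heads * d_head, d_model) output projection
    """
    n_heads, seq_len, d_head = head_outputs.shape
    mask = np.ones(n_heads)
    mask[np.asarray(list(ablate), dtype=int)] = 0.0
    kept = head_outputs * mask[:, None, None]
    # Concatenate heads per token: (seq_len, n_heads * d_head)
    concat = kept.transpose(1, 0, 2).reshape(seq_len, n_heads * d_head)
    return concat @ W_o

# Hypothetical sizes and head indices, for illustration only.
n_heads, seq_len, d_head, d_model = 8, 3, 4, 16
rng = np.random.default_rng(0)
heads = rng.normal(size=(n_heads, seq_len, d_head))
W_o = rng.normal(size=(n_heads * d_head, d_model))

full = multi_head_output(heads, W_o)
ablated = multi_head_output(heads, W_o, ablate=(2, 5))
```

The difference `full - ablated` is exactly the contribution that heads 2 and 5 made to the layer output, which is what a causal test of those heads measures downstream.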

Jun 15 – Jun 21 (15)

UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

Jun 18, 2026

Wenhao Chi, Arkaprava Sinha, Dominick Reilly et al.

Using proxy models as intermediaries between diverse teachers prevents conflicting gradients and enables learning richer egocentric representations from heterogeneous knowledge sources—achieving better results than naive multi-teacher distillation.

This paper introduces UNIEGO, a unified egocentric video encoder trained through a novel multi-teacher distillation framework.

multimodal, training, architecture

How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech

Jun 18, 2026

Nityanand Mathur, Hamees Sayed, Wasim Madha et al.

Style instructions in TTS are processed differently than content words—they influence acoustic properties like pitch and energy globally rather than locally, with maximum effect in early generation steps and mid-depth network layers.

This paper reveals how individual words in style descriptions influence speech generation by analyzing attention patterns in a text-to-speech system.

Jun 8 – Jun 14 (17)

Gaze Heads: How VLMs Look at What They Describe

Jun 12, 2026

Rohit Gandikota, David Bau

VLMs have interpretable internal mechanisms (gaze heads) that can be surgically edited at inference time to control what the model describes, offering a practical way to steer multimodal outputs without model retraining.

This paper discovers that vision-language models develop specialized attention heads called 'gaze heads' that track which image regions they're describing. By redirecting these heads' attention during inference, researchers can steer the model to describe any chosen image region without retraining—achieving 83% accuracy on comic panels and extending to natural images.

multimodal

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

Jun 12, 2026

Sicheng Yang, Hangjie Yuan, Wenjun Zhang et al.

Medical AI hallucinations have different sources (visual, knowledge, reasoning); diagnosing which stage fails helps you fix the right problem and improve trustworthiness.

ClinHallu is a benchmark with 7,031 medical cases that diagnoses where hallucinations occur in medical AI systems—whether from misreading images, recalling wrong medical facts, or flawed reasoning. It includes detailed reasoning traces and shows that training on these traces reduces errors.

evaluation

Jun 1 – Jun 7 (11)

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Jun 5, 2026

Cong Chen, Guo Gan, Kaixiang Ji et al.

For long-form video understanding, decoupling perception (building structured memory) from reasoning (agentic exploration) is more efficient than end-to-end processing, achieving better accuracy while using only 2% of the context that full-video processing would require.

MemDreamer solves the problem of understanding very long videos by splitting the task into two parts: a perception system that builds a memory structure from video frames, and a reasoning system that explores this memory like an agent using tools.

multimodal, agents, reasoning

TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

Jun 4, 2026

Dong Jing, Jingchen Nie, Tianqi Zhang et al.

Robot policies can control execution speed by scaling action magnitudes, enabling a single model to adapt between fast and slow motions without retraining—useful for tasks requiring both speed and precision.

TempoVLA enables robots to execute manipulation tasks at variable speeds by conditioning a Vision-Language-Action model on a speed parameter. The approach uses trajectory augmentation to create training data at different speeds and adds a conditioning mechanism to the policy, allowing a single model to handle both fast transit phases and slow, precise contact phases.
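As a rough illustration of speed-based trajectory augmentation, the sketch below resamples a demonstration along time so the same path is covered in fewer or more control steps. It assumes linear interpolation and delta-position actions; the function names and the toy trajectory are hypothetical, and the paper's actual augmentation may differ.

```python
import numpy as np

def resample_trajectory(positions, speed):
    """Replay a trajectory `speed` times faster (or slower if speed < 1).

    positions: array of shape (T, D), one waypoint per control step.
    Returns waypoints on the same path, linearly interpolated in time.
    """
    T = positions.shape[0]
    n_steps = max(2, int(round((T - 1) / speed)) + 1)
    src_t = np.arange(T)
    new_t = np.linspace(0, T - 1, n_steps)
    return np.stack(
        [np.interp(new_t, src_t, positions[:, d]) for d in range(positions.shape[1])],
        axis=1,
    )

def delta_actions(positions):
    """Per-step displacement, used here as a stand-in for the action."""
    return np.diff(positions, axis=0)

# Hypothetical straight-line reach: 21 waypoints in 2-D.
demo = np.linspace([0.0, 0.0], [1.0, 0.5], 21)
fast = resample_trajectory(demo, speed=2.0)   # 11 waypoints
slow = resample_trajectory(demo, speed=0.5)   # 41 waypoints
```

Under these assumptions, each step of `fast` moves twice as far as a step of `demo`, and each step of `slow` moves half as far, so the per-step action magnitude scales with the speed factor while the start and end points stay the same.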

May 25 – May 31 (15)

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

May 29, 2026

Jiazheng Xing, Hangjie Yuan, Lingling Cai et al.

By separating training (lightweight generator) from inference (high-capacity generator), you can build reasoning-driven video models that produce cinema-quality results without prohibitive training costs.

Lumos-Nexus is a video generation system that combines reasoning capabilities with high visual quality by using a lightweight generator during training and progressively handing off to a powerful generator at inference time. This two-stage approach lets models understand user intent and generate coherent videos without the computational cost of training with large generators.

multimodal, efficiency, architecture

TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation

May 29, 2026

Ruotong Liao, Guowen Huang, Qing Cheng et al.

You can steer video generation at inference time by identifying and leveraging natural turning points in the diffusion denoising process—no retraining needed, and it scales better with more events.

This paper presents TunerDiT, a method for generating videos with multiple sequential events from text descriptions without requiring additional training. By identifying key moments in the diffusion process where text conditioning affects different aspects of video generation, the authors use strategic masking and prompt fusion to control event boundaries and transitions in long-form videos.

May 18 – May 24 (3)

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

May 22, 2026

Jianshu Zhang, Yijiang Li, Huifeixin Chen et al.

Current VLMs struggle to genuinely understand spatial numbers—they can't reliably map between visual coordinates and numerical values, which is critical for embodied AI tasks like robotics that require precise spatial outputs.

This paper tests whether Vision-Language Models (VLMs) truly understand spatial numbers like coordinates and distances. Using SpaceNum, a framework with two tasks (converting numbers to spatial positions and vice versa), researchers find that VLMs largely fail to ground numbers in actual spatial meaning, relying on shallow visual cues rather than genuine spatial reasoning.

evaluation, multimodal, reasoning

ETCHR: Editing To Clarify and Harness Reasoning

May 22, 2026

Beichen Zhang, Yuhong Liu, Jinsong Li et al.

Decoupling image editing from language understanding—and training the editor specifically for reasoning tasks—improves multimodal reasoning accuracy across diverse visual tasks without modifying the base model.

ETCHR is a specialized image editing model that helps multimodal AI systems reason better by transforming images based on questions. Unlike general image editors, it's trained to understand abstract reasoning tasks and produce clearer images for downstream analysis, improving performance across visual reasoning tasks by 4–5% without retraining the main AI model.

Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning

Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs

QFedAgent: Quantum-Enhanced Personalized Federated Learning for Multi-Agent Activity Recognition

LIME: Learning Intent-aware Camera Motion from Egocentric Video

Text-Driven 3D Indoor Scene Synthesis in Non-Manhattan Environments

VisionAId: An Offline-First Multimodal Android Assistant for People with Visual Impairment, Featuring Personalized Object Retrieval

World Wide Models: Literary Tools for Cultural AI

FurnitureVLA: Learning Long-Horizon Bimanual Furniture Assembly with Vision-Language-Action Model

FedLAB: Traceable Semantic Codebooks for Federated Multimodal Graph Foundation Learning

CoMet: Context and Multiplicity Decomposition for Multimodal Uncertainty Estimation

LeVo 2: Stable and Melodious Song Generation via Hierarchical Representation Modeling and Progressive Post-Training

GROW²: Grounding Which and Where for Robot Tool Use

Beyond 2D Matching: A Unified Single-Stage Framework for Geometry-Aware Cross-View Object Geo-Localization

HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech

DanceOPD: On-Policy Generative Field Distillation

Mapping Political-Elite Networks in Europe with a Multilingual Joint Entity-Relation Extraction Pipeline

Language-Based Digital Twins for Elderly Cognitive Assistance

EO-WM: A Physically Informed World Model for Probabilistic Earth Observation Forecasting

HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models

Automating Potential-based Reward Shaping with Vision Language Model Guidance

Learning Action Priors for Cross-embodiment Robot Manipulation

Real-Time Voice AI Hears but Does Not Listen

Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models

How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations

TriViewBench: Controlled Complexity Scaling for Multi-View Structural Reasoning in MLLMs

InSight: Self-Guided Skill Acquisition via Steerable VLAs

FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation

IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

OrbitForge: Text-to-3D Scene Generation via Reconstruction-Anchored Video Synthesis

Semantic Browsing: Controllable Diversity for Image Generation

AIR: Adaptive Interleaved Reasoning with Code in MLLMs

PsyBridge: A Hybrid Intelligent Framework for Multi-Dimensional Mental Health Assessment and Decision Support

TailorMind: Towards Preference-Aligned Multimodal Content Generation

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

DataMagic: Transforming Tabular Data into Data Insight Video

Native Active Perception as Reasoning for Omni-Modal Understanding

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

Risk Stratification for ICU Delirium using Pervasive Ambient Sensing Information

Context-Aware RL for Agentic and Multimodal LLMs

Geometric Action Model for Robot Policy Learning

FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

Beyond task performance: Decoding bioacoustic embeddings with speech features

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation

EvTexture++: Event-Driven Texture Enhancement for Video Super-Resolution

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

Edit the Bits, Diff the Codes: Bitwise Residual Editing for Visual Autoregressive Models

Adaptive Turn-Taking for Real-time Multi-Party Voice Agents