ThinkLLM

Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

861 papers · 26 this month · 12 topics
All · Efficiency 37 · Reasoning 36 · Training 35 · Evaluation 29 · Architecture 23 · Agents 23 · Multimodal 17 · Applications 15 · Alignment 9 · Safety 8 · Scaling 8 · Data 3

May 18 – May 24 (6)

Evaluating Commercial AI Chatbots as News Intermediaries

May 21, 2026

Mirac Suzgun, Emily Shen, Federico Bianchi et al.

AI chatbots excel at retrieving and synthesizing recent news but have three critical weaknesses: they systematically underperform on non-English content, fail primarily due to retrieval errors rather than reasoning mistakes, and are easily fooled by questions containing subtle false information.

This study evaluates six major AI chatbots (Gemini, Grok, Claude, GPT models) on their ability to answer factual news questions across six languages and regions.

evaluation, multimodal, data

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

May 20, 2026

Basel Shbita, Pengyuan Li, Anna Lisa Gentile

Most vision-language models struggle with knowledge-grounded visual reasoning—even large models only reach 75% accuracy when questions require combining visual evidence with external facts, suggesting a major gap in real-world VQA capabilities.

WikiVQABench is a new benchmark for testing vision-language models on questions that require both visual understanding and external knowledge from Wikipedia and Wikidata.

May 11 – May 17 (7)

EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

May 14, 2026

Ruozhen He, Meng Wei, Ziyan Yang et al.

Maintaining consistent characters and objects across long video sequences is hard; explicit memory of each entity's appearance significantly improves consistency, especially when characters reappear after many shots.

EntityBench is a benchmark for evaluating multi-shot video generation—creating coherent video sequences with multiple scenes. It includes 140 episodes with detailed tracking of characters, objects, and locations across shots, plus an evaluation system that measures both video quality and consistency.

evaluation, multimodal, architecture

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

May 14, 2026

Ziyu Guo, Rain Liu, Xinyan Chen et al.

A single discrete token can serve dual purposes—executing visual operations like code while also functioning as a learnable reasoning unit—making visual reasoning more efficient and trainable without architectural changes.

ATLAS introduces a single 'functional token' that acts as both an agentic operation and a latent visual reasoning unit, enabling models to reason about images without generating intermediate visual content. This approach combines the interpretability of code-based reasoning with the efficiency of latent reasoning, while remaining compatible with standard language model training.

May 4 – May 10 (10)

Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping

May 8, 2026

Maryam Maghsoudi, Shihab Shamma

You can decode what someone is imagining saying by training on their brain activity while listening to speech, then mapping imagined brain patterns to listened patterns—solving the data scarcity problem in brain-computer interfaces.

This paper shows how to decode imagined speech from brain recordings (MEG) by training on the more abundant listened speech data instead. Researchers mapped brain activity from imagining speech to brain activity from listening to speech, then used a decoder trained on listened speech to identify imagined words. This approach works without needing large imagined speech datasets.

multimodal

EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction

May 8, 2026

Wei Yu, Yunhang Qian

State space models offer a practical alternative to transformers for event-based image reconstruction, achieving better results with linear computational complexity instead of quadratic, making high-resolution processing feasible.

EmambaIR uses a new type of neural network architecture (state space models) to reconstruct clear images from event camera data.

architecture
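
The linear-versus-quadratic cost difference mentioned above can be seen in a toy state space recurrence. This is a minimal sketch, not EmambaIR's architecture: the scalar parameters `a`, `b`, and `c` and the function name `ssm_scan` are hypothetical, and real visual state space models use learned, multi-dimensional versions of the same update. The point is only that each output depends on a running state, so the sequence is processed in a single pass.

```python
def ssm_scan(xs, a=0.5, b=1.0, c=1.0):
    """Run a scalar linear state space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t   (state update)
    y_t = c * h_t                 (readout)

    One loop iteration per input, so cost grows linearly with length,
    unlike full self-attention, which compares every pair of positions.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


# An impulse at the first step decays geometrically through the state.
outputs = ssm_scan([1.0, 0.0, 0.0])
print(outputs)
```

Doubling the input length doubles the number of loop iterations, which is the property that makes high-resolution inputs tractable for this model family.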

Apr 27 – May 3 (7)

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

May 1, 2026

Siyuan Huang, Xiaoye Qu, Yafu Li et al.

PVM solves a fundamental problem in vision-language models where visual understanding degrades during long text generation by creating a separate, always-accessible pathway to visual information—improving reasoning tasks with minimal added parameters.

Large vision-language models struggle when generating long text because visual information gets diluted by accumulated text tokens. This paper introduces Persistent Visual Memory (PVM), a lightweight add-on module that maintains direct access to visual embeddings throughout generation, preventing the model from losing sight of the image as it produces longer outputs.

architecture, multimodal, efficiency

Make Your LVLM KV Cache More Lightweight

May 1, 2026

Xihao Chen, Yangyang Guo, Roger Zimmermann

You can cut vision-language model KV cache memory in half by intelligently compressing vision tokens based on what the text prompt actually needs, rather than keeping all visual information.

LightKV reduces GPU memory overhead in vision-language models by compressing the Key-Value cache during inference. It uses text prompts to guide which vision tokens are most important, keeping only 55% of tokens while maintaining performance and cutting memory use in half.
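
As a rough illustration of prompt-guided pruning, the sketch below scores each vision token against a text query and keeps the top 55%. It is not LightKV's actual algorithm: the dot-product scoring, the function name `prune_vision_tokens`, and the toy vectors are all assumptions made for the example, and the summary above does not say how the method scores or merges tokens.

```python
import math


def prune_vision_tokens(vision_tokens, text_query, keep_ratio=0.55):
    """Keep the vision tokens most similar to the text query.

    Returns the kept indices (in original order) and the kept tokens.
    Similarity here is a plain dot product, a stand-in for whatever
    relevance signal a real system would use.
    """
    scores = [sum(v * q for v, q in zip(tok, text_query)) for tok in vision_tokens]
    k = max(1, math.ceil(keep_ratio * len(vision_tokens)))
    top = sorted(range(len(vision_tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep = sorted(top)  # restore original token order
    return keep, [vision_tokens[i] for i in keep]


# Toy example (hypothetical 2-d embeddings): four vision tokens, one text query.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.2]]
query = [1.0, 0.0]
kept_indices, kept_tokens = prune_vision_tokens(tokens, query)
print(kept_indices)
```

Only the kept tokens would then contribute entries to the Key-Value cache, which is where the memory saving comes from.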

Apr 20 – Apr 26 (16)

BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering

Apr 24, 2026

Jinghong Chen, Jingbiao Mei, Guangyu Yang et al.

By treating retrieved documents as an ensemble with probabilistic weights updated during generation, BERAG avoids concatenating long contexts while improving both performance and interpretability—especially valuable for visual question answering where context length is expensive.

This paper proposes BERAG, a retrieval-augmented generation system that processes retrieved documents individually rather than concatenating them into one long context. Instead of treating all documents equally, BERAG uses Bayesian inference to weight documents based on how useful they are during answer generation, updating these weights token-by-token.

multimodal, reasoning
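
The token-by-token reweighting described for BERAG can be illustrated with a generic Bayesian mixture update. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names and toy probabilities are hypothetical, and it assumes each retrieved document yields its own next-token distribution from a separate forward pass.

```python
def mix_next_token(weights, per_doc_probs):
    """Ensemble next-token distribution: sum over documents of w_d * p_d(token)."""
    vocab = per_doc_probs[0].keys()
    return {
        tok: sum(w * p[tok] for w, p in zip(weights, per_doc_probs))
        for tok in vocab
    }


def update_weights(weights, per_doc_probs, chosen):
    """Bayes' rule: the new w_d is proportional to w_d * p_d(chosen token)."""
    unnorm = [w * p[chosen] for w, p in zip(weights, per_doc_probs)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Toy example (hypothetical numbers): two documents, three-token vocabulary.
weights = [0.5, 0.5]  # uniform prior over retrieved documents
per_doc_probs = [
    {"Paris": 0.8, "Rome": 0.1, "Oslo": 0.1},
    {"Paris": 0.2, "Rome": 0.6, "Oslo": 0.2},
]
mixed = mix_next_token(weights, per_doc_probs)
chosen = max(mixed, key=mixed.get)  # greedy decoding step
weights = update_weights(weights, per_doc_probs, chosen)
print(chosen, weights)
```

After one step, the document that predicted the chosen token more strongly carries more weight, so the weights double as a readable signal of which source supports the answer.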

Seeing Fast and Slow: Learning the Flow of Time in Videos

Apr 23, 2026

Yen-Siang Wu, Rundong Luo, Jingsen Zhu et al.

Time is a learnable visual concept—models can be trained to perceive temporal changes in videos and use that understanding to generate or transform videos with precise control over playback speed and temporal detail.

This paper teaches AI models to understand and control the passage of time in videos. The researchers develop self-supervised models that detect when videos are sped up or slowed down, then use this capability to build a large dataset of slow-motion videos.

Apr 13 – Apr 19 (14)

VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

Apr 17, 2026

Xiangbo Gao, Sicong Jiang, Bangya Liu et al.

To build better video editing systems, you need specialized evaluation tools—generic vision-language models don't understand editing quality the way humans do.

This paper introduces VEFX-Bench, a comprehensive dataset and evaluation system for video editing. It includes 5,049 human-annotated video editing examples across multiple categories, a specialized reward model (VEFX-Reward) that judges editing quality across three dimensions, and a 300-video benchmark for comparing editing systems.

evaluation, multimodal, applications

Information Router for Mitigating Modality Dominance in Vision-Language Models

Apr 17, 2026

Seulgi Kim, Mohit Prabhushankar, Ghassan AlRegib

Fixing modality dominance requires enriching missing information, not just redirecting attention—MoIR routes complementary information between modalities to create more balanced, information-dense representations before the language model processes them.

Vision-language models often rely too heavily on one modality (vision or text), ignoring useful information from the other. This paper proposes MoIR, a method that identifies weak or ambiguous tokens in one modality and enriches them with information from the stronger modality before processing.

Apr 6 – Apr 12 (27)

Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise

Apr 10, 2026

Zibin Geng, Xuefeng Jiang, Jia Li et al.

When training with noisy labels, anchoring text prompts to visual evidence makes them more robust—visual information is inherently more reliable than potentially incorrect labels, so using it to guide prompt updates reduces memorization of mislabeled samples.

VisPrompt is a lightweight framework that makes prompt learning for vision-language models more robust to mislabeled data. It uses visual information to guide and stabilize prompt learning by injecting image semantics into text prompts through a cross-modal attention mechanism, while adaptively controlling how much visual information to use per sample.

multimodal, training, efficiency

VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Apr 10, 2026

Guanyu Zhou, Yida Yin, Wenhao Chai et al.

Synthetic data targeted at specific visual skills can significantly improve VLM performance on perception tasks, suggesting that natural images alone don't provide enough supervision for low-level visual understanding.

VisionFoundry is a system that generates synthetic training data for vision-language models to improve their visual perception skills like spatial understanding and 3D reasoning.

Mar 30 – Apr 5 (13)

Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

Apr 3, 2026

Gengwei Zhang, Jie Peng, Zhen Tan et al.

RL post-training of multimodal models may improve performance through learned hallucination patterns rather than genuine visual reasoning, challenging assumptions about how these models actually learn from images.

This paper investigates how reinforcement learning improves multimodal AI models' visual reasoning by studying the role of hallucination—when models generate plausible-sounding but incorrect information.

training, multimodal, reasoning

ActionParty: Multi-Subject Action Binding in Generative Video Games

Apr 2, 2026

Alexander Pondaven, Ziyi Wu, Igor Gilitschenski et al.

This is the first video world model that can reliably control multiple independent agents in the same scene—a critical capability for simulating multi-player games and complex interactive environments.

ActionParty is a video diffusion model that can control multiple characters simultaneously in interactive game environments. Unlike existing models limited to single agents, it uses special 'subject state tokens' to track each character's state separately, allowing precise control of up to seven players at once while maintaining their identity and following their assigned actions correctly.

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

May 18, 2026

Yining Hong, Jiageng Liu, Han Yin et al.

AI agents fail at embodied spatial reasoning primarily because they make poor action choices, not because they can't see—and they confidently stick to wrong answers even when evidence contradicts them, unlike humans who actively seek disconfirming evidence.

ESI-Bench is a benchmark for testing how well AI agents actively explore physical environments to understand spatial relationships. Rather than passively looking at images, agents must decide when to move, manipulate objects, and gather observations to solve tasks.

multimodal, reasoning

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

May 18, 2026

Qianhao Yuan, Jie Lou, Xing Yu et al.

MLLMs can improve fine-grained visual understanding by learning from their own superior performance on evidence-focused crops, using on-policy self-distillation to transfer regional perception skills to full-image reasoning.

This paper addresses a key weakness in multimodal AI models: they struggle to notice small but important details in images. The researchers discovered that models actually perform better when shown cropped images focused on relevant areas versus full images, suggesting the problem isn't recognizing details but finding them.

multimodal, training, efficiency

PIXLRelight: Controllable Relighting via Intrinsic Conditioning

May 18, 2026

Miguel Farinha, Ronald Clark

By conditioning on intrinsic image properties (albedo and shading) extracted from both photos and 3D renders, you can achieve photorealistic relighting with full PBR lighting control while staying fast enough for practical use.

PIXLRelight is a fast neural relighting method that lets you change lighting in photos using physically-based rendering controls. It decomposes images into intrinsic components (albedo, shading, residuals) and uses these to condition a transformer model, enabling realistic lighting adjustments in under 0.1 seconds per image without per-image optimization.

multimodal, architecture, efficiency

Semantic Generative Tuning for Unified Multimodal Models

May 18, 2026

Songsong Yu, Yuxin Chen, Ying Shan et al.

Using segmentation as a generative training task bridges the gap between visual understanding and generation in multimodal models, improving both capabilities simultaneously rather than training them separately.

This paper shows how to train unified multimodal models (that do both image understanding and generation) more effectively by using image segmentation as a training task. Instead of training understanding and generation separately, the authors use segmentation to align both capabilities, improving the model's ability to understand images and generate them accurately.

multimodal, training, architecture

RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

May 14, 2026

Xiang Fan, Yuheng Wang, Bohan Fang et al.

Video generation systems lose detail because their decoders ignore the input image; adding reference conditioning to the decoder recovers this information and improves quality by up to 2.1 dB PSNR.

RefDecoder improves video generation by conditioning the decoder on a reference image, fixing a common architectural flaw where decoders ignore input details. By injecting reference image information through attention mechanisms during decoding, it preserves fine details and consistency without requiring retraining of existing systems.

architecture, multimodal, efficiency
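
To put the PSNR figure in context, the sketch below uses the standard definition, PSNR = 10 · log10(MAX² / MSE), to convert a decibel gain into a change in mean squared error. The baseline error of 100 and the 8-bit peak value of 255 are assumptions chosen only for illustration; the summary does not state the paper's evaluation settings.

```python
import math


def psnr(mse, max_value=255.0):
    """Peak signal-to-noise ratio in decibels for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive; identical images have unbounded PSNR")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE must shrink to raise PSNR by gain_db decibels."""
    return 10.0 ** (gain_db / 10.0)


baseline_mse = 100.0  # hypothetical baseline error, for illustration only
ratio = mse_ratio_for_gain(2.1)
gain = psnr(baseline_mse / ratio) - psnr(baseline_mse)
print(f"MSE shrinks by a factor of {ratio:.2f} for a {gain:.1f} dB gain")
```

Because PSNR is logarithmic, a 2.1 dB gain corresponds to the same relative drop in MSE (a factor of roughly 1.6) regardless of the baseline error.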

Quantitative Video World Model Evaluation for Geometric-Consistency

May 14, 2026

Jiaxin Wu, Yihao Pi, Yinling Zhang et al.

Video generators often fail at maintaining consistent 3D geometry in ways that human raters and perceptual metrics don't catch; PDI-Bench provides a diagnostic tool to measure and improve these failures systematically.

This paper introduces PDI-Bench, a quantitative framework for evaluating whether generated videos maintain physically plausible 3D structure and motion.

evaluation, multimodal

Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment

May 14, 2026

Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim et al.

Combining unstructured clinical text with structured EHR tables through retrieval-augmented alignment produces significantly more accurate and complete patient timelines than using either source alone, with 35% of clinically important events appearing only in text.

This paper tackles a critical healthcare problem: reconstructing accurate timelines of patient events from messy clinical records. Clinical narratives (text) contain rich context but vague timing, while structured EHR tables have precise timestamps but miss many events.

multimodal, applications

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

May 12, 2026

Runhui Huang, Jie Wu, Rui Yang et al.

Self-reflective multimodal models can improve generation quality by learning to reason about user intent and autonomously correct their outputs using decomposed, verifiable rewards from language models.

AlphaGRPO improves how multimodal AI models generate images and text by teaching them to reason about what users want and to fix their own mistakes. It uses a novel reward system that breaks down complex requests into simple checkable questions, allowing the model to learn from reliable feedback without needing extra training setup.

multimodal, reasoning, training

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

May 12, 2026

Guohui Zhang, XiaoXiao Ma, Jie Huang et al.

When training models to generate audio and video together, treating each modality's learning separately and protecting audio-specific layers from video interference leads to better results than standard single-objective RL approaches.

OmniNFT improves joint audio-video generation by using reinforcement learning with three key techniques: routing rewards separately to each modality, preventing video gradients from interfering with audio processing, and focusing optimization on synchronization regions. This addresses real-world needs for high-quality audio, high-quality video, and tight audio-video alignment simultaneously.

multimodal, training

ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation

May 7, 2026

Omar El Khalifi, Thomas Rossi, Oscar Fossey et al.

You can control both character motion and camera angles in video generation by using a two-phase conditioning approach that prioritizes geometric consistency, without needing to train new models.

ActCam enables precise control over both actor motion and camera movement in AI-generated videos without requiring training. It works with existing video generation models by providing carefully sequenced guidance: first using pose and depth information to establish scene structure, then refining details with pose-only guidance.

multimodal, applications, architecture

Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML

May 7, 2026

Jai Moondra, Ayela Chughtai, Bhargavi Lanka et al.

Don't trust global LLM leaderboards—they hide structured disagreement across languages and tasks. Use language-specific rankings or small model portfolios instead to match diverse user needs.

Current LLM leaderboards rank models using global voting patterns, but this masks the reality: opinions differ dramatically by language and task. This paper shows that two-thirds of votes cancel out and that the top models are statistically indistinguishable at the global level. Instead, grouping by language reveals coherent subpopulations with consistent rankings.

evaluation, multimodal

Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

May 7, 2026

Hao Dong, Hongzhao Li, Shupan Li et al.

Despite claims of progress, multimodal domain generalization methods show only marginal improvements over basic approaches when fairly compared—the field needs better methods and standardized evaluation to make real progress.

This paper creates MMDG-Bench, the first standardized benchmark for multimodal domain generalization across action recognition, fault diagnosis, and sentiment analysis. Testing 9 methods on 6 datasets with 7,402 trained models, it reveals that recent specialized methods barely beat simple baselines, no method works consistently across tasks, and all methods struggle with corrupted or missing data.

evaluation, multimodal

GlazyBench: A Benchmark for Ceramic Glaze Property Prediction and Image Generation

May 7, 2026

Ziyu Zhai, Siyou Li, Juexi Shao et al.

This dataset bridges AI and materials science by providing standardized benchmarks for predicting ceramic properties and generating glaze visuals—showing that multimodal AI can accelerate traditionally trial-and-error design processes.

GlazyBench is the first large-scale dataset for AI-assisted ceramic glaze design, containing 23,148 real glaze formulations. It enables two tasks: predicting glaze properties (color, transparency) from raw materials, and generating visual images of glazes.

multimodal, applications, data

Geometry-Aware State Space Model: A New Paradigm for Whole-Slide Image Representation

May 6, 2026

Enhui Chai, Sicheng Chen, Tianyi Zhang et al.

Representing pathology image features in complementary geometric spaces (hyperbolic + Euclidean) with efficient sequence modeling enables more accurate whole-slide image analysis by capturing both tissue hierarchy and cellular details.

This paper presents BatMIL, a new approach for analyzing whole-slide images (gigapixel pathology scans) by representing tissue features in dual geometric spaces—hyperbolic for hierarchical structures and Euclidean for local details.

architecture, multimodal, efficiency

Aes3D: Aesthetic Assessment in 3D Gaussian Splatting

May 6, 2026

Chuanzhi Xu, Boyu Wei, Haoxian Zhou et al.

You can now automatically evaluate whether a 3D scene looks visually appealing by analyzing its Gaussian Splatting representation directly, which is faster and cheaper than traditional rendering-based assessment methods.

This paper introduces Aes3D, the first framework for evaluating the visual aesthetics of 3D scenes created with Gaussian Splatting. It includes a new dataset with aesthetic annotations and a lightweight model that directly assesses aesthetic qualities like composition and harmony from 3D Gaussian primitives, without needing to render images.

evaluation, multimodal

Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures

May 5, 2026

Evangelos Ntavelis, Sean Wu, Mohamad Shahbazi et al.

Feed-forward 3D reconstruction from multi-view images can match or exceed optimization-based methods while being much faster, and UV-parameterization lets you train with many high-resolution views without memory explosion.

HeadsUp reconstructs detailed 3D head models from multiple camera views using an efficient neural network that compresses images into a compact representation, then decodes them into 3D Gaussians (mathematical shapes). The method scales to thousands of subjects and works on new people without extra optimization, enabling applications like generating new identities and animating expressions.

architecture, multimodal

VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

May 4, 2026

Tanush Yadav, Mohammadreza Salehi, Jae Sung Park et al.

Vision-language models perform surprisingly poorly on domain-specific action recognition even in simplified settings, but fine-tuning on domain-specific video data significantly closes the gap.

VideoNet is a new benchmark and dataset for testing how well AI models recognize specific actions in videos across 37 different domains. The researchers found that current vision-language models struggle with domain-specific action recognition—even simple binary choices—and created a 500k video question-answer dataset to improve performance through fine-tuning.

evaluation, data, multimodal

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

May 1, 2026

Venkata Pushpak Teja Menta

Adversarial training can make speaker embeddings invariant to language/script while preserving speaker identity—critical for multilingual voice cloning systems that need to recognize the same speaker across different languages.

Speaker encoders for voice cloning often fail when audio switches between languages or scripts—a problem especially acute for Indic languages. This paper introduces LASE, a small neural layer that makes speaker embeddings language-agnostic by combining speaker identity learning with adversarial training against language classification.

multimodal, alignment, training

PhyCo: Learning Controllable Physical Priors for Generative Motion

Apr 30, 2026

Sriram Narayanan, Ziyu Jiang, Srinivasa Narasimhan et al.

You can make generative video models physically consistent by combining physics-labeled training data, ControlNet conditioning on physical properties, and VLM-based reward signals—no simulator needed at runtime.

PhyCo teaches video generation models to respect physics by fine-tuning them on 100K+ realistic simulation videos with varying physical properties (friction, bouncing, deformation), then using a vision-language model to provide physics-aware feedback during generation. This lets models create videos where objects behave realistically without needing a physics simulator at inference time.

training, multimodal, evaluation

FlexiTac: A Low-Cost, Open-Source, Scalable Tactile Sensing Solution for Robotic Systems

Apr 30, 2026

Binghao Huang, Yunzhu Li

Practical tactile sensing for robotics is now accessible to researchers and developers without expensive custom hardware—FlexiTac provides a plug-and-play solution that integrates with standard robot learning pipelines.

FlexiTac is an affordable, open-source tactile sensor system for robot hands and grippers that combines flexible sensor pads with simple electronics to provide real-time touch feedback. It works with existing robot platforms and supports modern AI training methods like learning from combined vision and touch data.

applications, multimodal, data

PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

Apr 30, 2026

Sudong Wang, Weiquan Huang, Xiaomin Yu et al.

Adding an explicit distribution-alignment stage between supervised fine-tuning and RL training significantly reduces model drift in multimodal models, with gains coming from disentangled feedback on perception vs. reasoning failures.

PRISM fixes a key problem in training multimodal AI models: when you fine-tune a model on examples and then use reinforcement learning, the model drifts away from what it learned initially.

training, multimodal, alignment

Beyond Gaussian Bottlenecks: Topologically Aligned Encoding of Vision-Transformer Feature Spaces

Apr 30, 2026

Andrew Bond, Ilkin Umut Melanlioglu, Erkut Erdem et al.

Using geometrically-aligned latent spaces (hyperspheres instead of Gaussian distributions) in autoencoders preserves 3D structure and physics better than standard approaches, which matters for building world models that understand real 3D scenes.

This paper proposes S²VAE, a new type of autoencoder that uses hyperspherical (spherical geometry) latent representations instead of traditional Gaussian ones to better preserve 3D geometry and camera motion in visual world models.

architecture, multimodal, efficiency

When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

Apr 23, 2026

Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny et al.

Hallucinations in vision-language models are primarily caused by over-reliance on textual instructions rather than vision limitations—and preference-based fine-tuning can effectively reduce this by teaching models to prioritize visual grounding.

Vision-language models often generate false descriptions that aren't supported by images, especially when text instructions are misleading. This paper introduces HalluScope, a benchmark to measure when and why this happens, and HalluVL-DPO, a fine-tuning method that teaches models to trust images over text instructions by learning from examples of correct vs. hallucinated responses.

evaluation, safety, multimodal

A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents

Apr 23, 2026

Praval Sharma

Combining graph representations with LLM embeddings enables open-domain event extraction that generalizes to unseen event types while maintaining document-level reasoning that LLMs alone struggle with.

This paper presents MODEE, a method for extracting events from documents that works with any type of event, not just predefined ones. It combines graph-based learning with large language models to better understand document structure and context, addressing limitations where LLMs struggle with long documents and lose important information in the middle.

multimodal, applications

Modulating Cross-Modal Convergence with Single-Stimulus, Intra-Modal Dispersion

Apr 23, 2026

Eghbal A. Hosseini, Brian Cheung, Evelina Fedorenko et al.

Single images with high agreement among vision models show dramatically stronger alignment with language models, suggesting that representational convergence across modalities is driven by how unambiguously the environment constrains perception.

This paper reveals that how consistently different vision models represent individual images (intra-modal agreement) strongly predicts whether vision and language models will represent those same images similarly (cross-modal alignment).

multimodal, evaluation, reasoning

Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos

Apr 23, 2026

Bowen Liu, Li Yang, Shanshan Song et al.

Diagnosis-driven video summarization for medical imaging requires organizing sparse diagnostic events into coherent clinical contexts rather than treating frames independently—DiCE shows this contextual reasoning approach outperforms standard methods on ultra-long endoscopy videos.

This paper tackles video-level analysis of capsule endoscopy (CE) videos by introducing a new task: extracting key diagnostic frames and making accurate diagnoses from ultra-long videos containing thousands of normal frames mixed with rare abnormal findings.

evaluation, multimodal, applications

SyMTRS: Benchmark Multi-Task Synthetic Dataset for Depth, Domain Adaptation and Super-Resolution in Aerial Imagery

Apr 23, 2026

Safouane El Ghazouali, Nicola Venturi, Michael Rueegsegger et al.

Synthetic data with perfect annotations can accelerate progress on multiple aerial imagery tasks simultaneously—depth, domain shift, and resolution—without the cost and difficulty of collecting real-world ground truth.

SyMTRS is a large synthetic dataset for aerial imagery that provides pixel-perfect depth maps, day/night image pairs, and multi-scale variants for training AI models on three tasks: depth estimation, domain adaptation, and super-resolution.

data, evaluation, multimodal

AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

Apr 23, 2026

Tasnim Kabir, Dmytro Kurdydyk, Aadi Palnitkar et al.

Current audio AI models fail dramatically on genuine audio understanding tasks—they likely exploit dataset biases and metadata rather than actually listening to and reasoning about sound.

AUDITA is a new benchmark dataset with real-world audio and human-authored trivia questions designed to test whether AI models can truly understand audio content rather than relying on shortcuts. Humans answer correctly 32% of the time, but state-of-the-art models score below 9%, revealing a significant gap in audio reasoning capabilities.

evaluation, multimodal, data

SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

Apr 22, 2026

Ruohan Liu, Shukang Yin, Tao Wang et al.

Current large audio-language models fail to properly control or interpret paralinguistic cues (emotion, tone, style) in speech, with these failures accounting for 43% of errors in conversational tasks—a critical gap for building natural-sounding voice assistants.

SpeechParaling-Bench is a benchmark for testing how well AI speech models handle paralinguistic features—things like emotion, tone, and speaking style. It includes over 100 fine-grained features tested across 1,000+ English-Chinese speech samples, and uses an AI judge to compare outputs fairly. Tests show current models struggle significantly with controlling these subtle speech qualities.

evaluationmultimodalapplications

OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

Apr 22, 2026

Qiguang Chen, Chengyu Luan, Jiajun Wu et al.

Current vision-language models struggle with multi-image reasoning even on problems they might solve with single images—this benchmark shows that connecting information across multiple images is a major unsolved challenge.

OMIBench is a benchmark for testing how well vision-language models can solve Olympiad-level problems that require reasoning across multiple images. Unlike existing benchmarks that focus on single images, OMIBench tests whether models can connect evidence scattered across different images to solve complex problems in biology, chemistry, math, and physics.

evaluationmultimodalreasoning

LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image

Apr 22, 2026

Dimitrije Antić, Alvaro Budria, George Paschalidis et al.

Instead of sparse contact points, modeling continuous proximity fields across entire surfaces—guided by learned interaction patterns—enables more realistic and physically plausible 3D human-object reconstruction from images.

This paper tackles 3D human-object interaction reconstruction from single images by introducing InterFields—a dense representation of proximity between body and object surfaces—and LEXIS, a learned discrete manifold of interaction patterns.

multimodal

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

Apr 21, 2026

Boyu Chen, Yi Chen, Lu Qiu et al.

By representing actions as embodiment-agnostic physical intents grounded in visual outcomes, UniT enables humanoid robots to learn directly from human video data, dramatically improving data efficiency and enabling zero-shot task transfer without robot-specific training.

UniT solves a major bottleneck in training humanoid robots: the lack of robot data. Instead of collecting expensive robot videos, it learns from abundant human videos by finding a shared "physical language"—a unified way to represent actions that works across different body types.

multimodalagents

A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

Apr 21, 2026

Shuai Wang, Hongyi Zhu, Jia-Hong Huang et al.

Planning retrieval steps before searching for evidence improves both explanation quality and interpretability—the system can show why it chose specific evidence rather than just providing answers.

A-MAR is an AI system that explains artworks by breaking down questions into structured reasoning steps, then retrieving relevant evidence for each step. Unlike standard AI models that give answers based on internal knowledge, A-MAR shows its work—decomposing art questions into explicit goals, finding supporting evidence, and building explanations step-by-step.

agentsmultimodalreasoning

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

Apr 20, 2026

Shaden Alshammari, Kevin Wen, Abrar Zainal et al.

Current state-of-the-art models achieve only 69-78% on Olympiad-level math problems, and embedding models struggle to find mathematically equivalent problems—showing that both mathematical reasoning and math-aware retrieval remain open challenges for AI systems.

MathNet is a large-scale benchmark with 30,676 Olympiad-level math problems across 17 languages and 47 countries, designed to evaluate both how well AI models solve math problems and how well they retrieve similar problems. The benchmark reveals that even top models struggle with complex reasoning, and that retrieval quality significantly impacts performance in retrieval-augmented problem solving.

reasoningmultimodal

Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

Apr 20, 2026

A. Sophia Koepke, Daniil Zverev, Shiry Ginosar et al.

Cross-modal alignment between vision and language models is much weaker than previously claimed—it only appears in small-scale experiments and reflects broad semantic overlap, not deep structural convergence.

This paper challenges the popular "Platonic Representation Hypothesis"—the idea that AI models trained on different types of data (like text and images) learn the same underlying representation of reality.

multimodalevaluationscaling

A multimodal and temporal foundation model for virtual patient representations at healthcare system scale

Apr 20, 2026

Andrew Zhang, Tong Ding, Sophia J. Wagner et al.

A single model can integrate all types of clinical data (images, text, lab results, medications, procedures) into patient embeddings that enable multiple downstream clinical tasks, suggesting that unified patient representations are feasible and useful at healthcare system scale.

Apollo is a foundation model trained on 30 years of hospital records from 7.2 million patients that learns unified representations of entire patient care journeys across 28 medical data types.

multimodalapplications

Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

Apr 17, 2026

Yige Xu, Yongjie Wang, Zizhuo Wu et al.

Vision-language models struggle to genuinely reason about visual information—they primarily reason in text space, and adding images often degrades performance compared to text alone.

This paper reveals that vision-language models often rely on text reasoning rather than truly understanding images. Researchers created CrossMath, a benchmark with identical problems in text-only, image-only, and image+text formats, and found that adding images actually hurts performance. They show VLMs can be improved through targeted fine-tuning on multimodal reasoning tasks.

evaluationmultimodalreasoning

Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization

Apr 17, 2026

Habibeh Naderi, Behrouz Haji Soleimani, Stan Matwin

When combining audio and text, align them indirectly through a shared joint embedding rather than directly contrasting them, and use structural consistency losses to prevent one modality from dominating the learned representation.

HILBERT is a multimodal framework that learns document-level representations from long audio-text sequences in low-resource settings.

multimodaltrainingarchitecture

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

Apr 16, 2026

Yan Li, Zezi Zeng, Yifan Yang et al.

Generating webpages with AI requires coordinating multiple content types (text, images, video) at both global and local levels—treating layout and content generation as interconnected problems rather than separate tasks.

MM-WebAgent is a hierarchical AI system that generates complete webpages by coordinating the creation of layouts, text, images, and videos together. Unlike simpler approaches that generate each element separately, it uses planning and self-reflection to ensure all parts work together visually and stylistically.

agentsmultimodalapplications

Why Do Vision Language Models Struggle To Recognize Human Emotions?

Apr 16, 2026

Madhav Agarwal, Sotirios A. Tsaftaris, Laura Sevilla-Lara et al.

VLMs fail at emotion recognition due to two fixable problems: long-tailed training data that biases them toward common emotions, and an inability to capture the fleeting temporal changes in facial expressions that are critical for understanding emotions.

Vision-language models struggle with emotion recognition because they inherit dataset biases that collapse rare emotions into common categories, and they can't effectively process the temporal dynamics of facial expressions. This paper identifies these vulnerabilities and proposes using natural language summaries of intermediate frames to preserve emotional context within memory constraints.

multimodalevaluationdata

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

Apr 16, 2026

Mélanie Roschewitz, Kenneth Styppa, Yitian Tao et al.

Making AI medical diagnosis interpretable matters: RadAgent's step-by-step reasoning with visible tool interactions improves both accuracy and clinician trust compared to end-to-end models, showing that transparency and performance aren't trade-offs.

RadAgent is an AI agent that interprets chest CT scans by breaking down the analysis into step-by-step reasoning with tool use, producing reports alongside a transparent trace of how findings were derived.

agentsmultimodalreasoning

Learning to Think Like a Cartoon Captionist: Incongruity-Resolution Supervision for Multimodal Humor Understanding

Apr 16, 2026

Hatice Merve Vural, Doga Kukul, Ege Erdem Ozlu et al.

For complex reasoning tasks like humor, supervising the intermediate thinking process with structured traces outperforms scaling alone—models need to learn *why* something is funny, not just predict captions.

This paper teaches AI models to understand humor like professional cartoon captionists by breaking down the reasoning process into three steps: spotting visual mismatches, reinterpreting them creatively, and judging which interpretations are funniest.

reasoningmultimodalevaluation

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Apr 15, 2026

Tianshuo Yang, Guanyu Chen, Yutian Chen et al.

Decoupling high-level reasoning from low-level control in robotic systems preserves the planning abilities of large vision-language models while improving execution accuracy on physical manipulation tasks.

HiVLA splits robot manipulation into two parts: a vision-language model that plans tasks and identifies objects, and a specialized action model that executes precise movements. This separation lets robots reason about complex tasks while staying accurate at fine-grained control, outperforming end-to-end approaches on real robot tasks.

agentsmultimodalarchitecture

UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception

Apr 15, 2026

Ziming Wang

Adding LiDAR to wrist-mounted robot interfaces makes data collection more robust in real-world conditions, letting robots learn complex tasks like deformable object manipulation that were previously impossible with vision alone.

UMI-3D improves robot data collection by adding LiDAR to the Universal Manipulation Interface, replacing unreliable monocular vision with 3D spatial sensing. This enables robots to learn manipulation tasks in cluttered, dynamic environments where the original vision-only system failed, while keeping the system portable and affordable.

multimodalagentsdata

SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis

Apr 14, 2026

Kathakoli Sengupta, Kai Ao, Paola Cascante-Bonilla

Symbolic rule-based evaluation of 3D scenes is more reliable and interpretable than vision-language model judges, and text-only LLMs can outperform vision models at refining spatial layouts when given explicit constraint feedback.

SceneCritic is a symbolic evaluator that assesses 3D indoor scene layouts by checking semantic, orientation, and geometric consistency against a structured spatial ontology built from real-world scene data.

evaluationmultimodalreasoning

Visual Preference Optimization with Rubric Rewards

Apr 14, 2026

Ya-Qi Yu, Fangyu Hong, Xiangyang Qu et al.

Using detailed, instruction-specific rubrics to score model outputs significantly improves preference-based training for vision tasks—achieving 82.69% on benchmarks versus 75.82% with simpler outcome-based scoring.

This paper introduces rDPO, a method for improving visual AI models by using detailed rubrics (checklists of criteria) to evaluate and rank image responses. Instead of simple yes/no judgments, the approach creates specific evaluation criteria for each image-instruction pair, which helps the model learn finer distinctions in visual reasoning tasks.

trainingevaluationmultimodal

Representation geometry shapes task performance in vision-language modeling for CT enterography

Apr 14, 2026

Cristian Minoccheri, Emily Wittrup, Kayvan Najarian et al.

For medical imaging with vision-language models, representation geometry matters more than you might expect: how you aggregate information and encode tissue properties has a bigger impact on performance than simply adding more spatial coverage.

This study explores how to best represent CT scan slices in vision-language models for diagnosing inflammatory bowel disease. The researchers find that different ways of combining slice embeddings work better for different tasks: simple averaging helps disease classification, while attention-based pooling improves image-text matching.

multimodalevaluation
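
The two aggregation strategies named in the entry above (simple averaging and attention-based pooling of slice embeddings) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the 2-dimensional toy embeddings, the fixed `query` vector (in a real model it would be learned), and the function names. It is not the paper's implementation.

```python
import math

def mean_pool(slices):
    """Average the slice embeddings elementwise (every slice counts equally)."""
    n = len(slices)
    dim = len(slices[0])
    return [sum(s[d] for s in slices) / n for d in range(dim)]

def attention_pool(slices, query):
    """Weight each slice by the softmax of its dot product with a query vector.

    `query` is a hypothetical stand-in for a learned attention parameter.
    """
    scores = [sum(a * b for a, b in zip(s, query)) for s in slices]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(slices[0])
    pooled = [sum(w * s[d] for w, s in zip(weights, slices)) for d in range(dim)]
    return pooled, weights

# Toy data: three CT slices, each embedded in 2 dimensions (made up).
slices = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]  # favors slices with a large first component

mean_vec = mean_pool(slices)
attn_vec, attn_weights = attention_pool(slices, query)
print("mean pooled:", mean_vec)
print("attention pooled:", attn_vec, "weights:", attn_weights)
```

Mean pooling treats all slices alike, while attention pooling lets slices that match the query dominate the pooled vector. That difference is one plausible reason the two could suit different downstream tasks.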

PAL: Personal Adaptive Learner

Apr 14, 2026

Megha Chakraborty, Darssan L. Eswaramoorthi, Madhur Thareja et al.

Real-time adaptive learning systems can analyze multimodal lecture content and adjust difficulty dynamically, offering personalized feedback and summaries that respond to individual student understanding as lessons unfold.

PAL is an AI platform that transforms lecture videos into interactive learning experiences by analyzing video content in real time and dynamically adjusting question difficulty based on student responses. It generates personalized summaries tailored to each learner's interests, moving beyond static quiz-based systems to provide truly adaptive, responsive education.

applicationsmultimodalevaluation

VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

Apr 10, 2026

Wenyi Xiao, Xinchi Xu, Leilei Gan

Vision-language models need separate confidence scores for perception and reasoning, not a single overall confidence score, to better detect hallucinations and improve reliability in real-world applications.

This paper addresses a critical problem in vision-language models: they often give confident wrong answers, especially in high-stakes applications. The authors propose VL-Calibration, which separates confidence into two parts—visual confidence (did the model see the right thing?) and reasoning confidence (did it think correctly about what it saw?)—using reinforcement learning.

safetymultimodalevaluation

Envisioning the Future, One Step at a Time

Apr 10, 2026

Stefan Andreas Baumann, Jannik Wiese, Tommaso Martorella et al.

Predicting sparse point trajectories instead of dense pixels makes future scene simulation orders of magnitude faster while maintaining accuracy—enabling practical exploration of many possible futures with uncertainty quantification.

This paper predicts how scenes will evolve by tracking sparse point trajectories instead of predicting dense pixel values. An autoregressive diffusion model generates thousands of plausible futures from a single image while explicitly modeling uncertainty growth over time, achieving faster simulation than dense video prediction methods.

reasoningefficiencymultimodal

VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

Apr 10, 2026

Yucheng Shen, Jiulong Wu, Jizhou Huang et al.

For building agentic systems that reason over visual documents, maintaining structured evidence across pages and actively managing context drift through sliding windows and intent injection significantly improves both accuracy and efficiency.

VISOR is an AI system that helps vision-language models retrieve and reason over visually rich documents by combining iterative search with multi-step reasoning.

agentsreasoningmultimodal

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Apr 9, 2026

Shilin Yan, Jintao Tong, Hongwei Xue et al.

Agents can learn to use tools more wisely by training them with separate optimization objectives for accuracy and efficiency, rather than combining both into a single reward signal that creates conflicting incentives.

This paper addresses a critical problem in AI agents: they overuse external tools even when they could solve problems using their own knowledge. The authors propose HDPO, a training framework that teaches agents to be smarter about when to use tools by separating the optimization into two independent channels—one for accuracy and one for efficiency.

agentsreasoningmultimodal

Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

Apr 9, 2026

Haolei Xu, Haiwen Hong, Hongxing Li et al.

Multimodal MoE models suffer from 'routing distraction'—visual inputs cause the routing mechanism to activate the wrong experts for reasoning. A simple intervention that guides expert selection toward domain experts significantly improves visual reasoning performance.

This paper identifies a problem in multimodal mixture-of-experts models where they can see images correctly but fail at reasoning tasks that they solve easily with text.

architecturemultimodalreasoning

AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

Apr 9, 2026

Ziwei Zhou, Zeyuan Lai, Rui Wang et al.

Current text-to-audio-video generators look good but fail at semantic tasks like rendering text, maintaining speech coherence, and controlling musical pitch—evaluation needs to go beyond visual aesthetics to catch these failures.

AVGen-Bench is a benchmark for evaluating text-to-audio-video generation systems across 11 real-world tasks. It uses specialist models and multimodal AI to assess both perceptual quality and semantic accuracy, revealing that current systems struggle with text rendering, speech coherence, physical reasoning, and musical pitch control despite producing visually appealing outputs.

evaluationmultimodalapplications

OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

Apr 9, 2026

Wenbo Hu, Xin Chen, Yan Gao-Tian et al.

Gaussian GRPO normalizes reward distributions across diverse visual tasks to improve training stability, enabling open-source multimodal models to match proprietary systems on reasoning and perception tasks.

OpenVLThinkerV2 is a multimodal AI model that combines vision and language understanding for complex visual reasoning tasks. The key innovation is Gaussian GRPO, a new training method that stabilizes learning across different types of visual tasks by normalizing reward signals, while task-specific techniques help the model balance detailed visual perception with multi-step reasoning.

multimodalreasoningtraining
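
To make "normalizing reward signals" across task types concrete, here is a minimal sketch of plain per-task standardization (subtract the mean, divide by the standard deviation). This is a generic technique chosen for illustration and should not be read as the Gaussian GRPO procedure itself, whose details are not given in the summary above. The task names and reward values are invented.

```python
import statistics

def normalize_rewards(rewards_by_task, eps=1e-8):
    """Standardize rewards within each task so tasks with different
    reward scales contribute comparably sized training signals."""
    normalized = {}
    for task, rewards in rewards_by_task.items():
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards)  # population standard deviation
        normalized[task] = [(r - mean) / (std + eps) for r in rewards]
    return normalized

# Hypothetical rewards: one task scores 0/1, the other uses a small continuous range.
raw = {
    "counting": [0.0, 1.0, 1.0, 0.0],
    "grounding_iou": [0.10, 0.30, 0.20, 0.40],
}
normalized = normalize_rewards(raw)
for task, values in normalized.items():
    print(task, [round(v, 3) for v in values])
```

After standardization both tasks have roughly zero mean and unit spread, so neither dominates the update simply because its raw rewards are numerically larger.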

Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding

Apr 9, 2026

Mu Nan, Muquan Yu, Weijian Mai et al.

Meta-learning enables brain decoding models to generalize across different people with minimal examples, eliminating the need for expensive per-subject training while maintaining strong performance.

This paper presents a meta-learning approach that decodes visual information from brain scans (fMRI) without requiring subject-specific training. By conditioning on just a few examples of a new person's brain activity, the model learns their unique neural patterns and can decode what they're seeing—all without fine-tuning or anatomical alignment.

multimodalreasoning

RewardFlow: Generate Images by Optimizing What You Reward

Apr 9, 2026

Onkar Susladkar, Dong-Hwan Jang, Tushar Prakash et al.

You can steer pretrained image models at inference time using multiple differentiable rewards and adaptive weighting—no retraining needed—to get better control over semantic accuracy, visual quality, and spatial grounding.

RewardFlow guides image generation by optimizing multiple reward signals during inference without modifying the model. It combines semantic, visual quality, and spatial rewards with adaptive weighting that adjusts how much each reward matters based on the editing task, achieving better image editing and composition results.

multimodal

What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric

Apr 9, 2026

Mohamed Amine Kerkouri, Marouane Tliba, Bin Wang et al.

Eye-tracking analysis can be enriched by measuring semantic similarity of attended regions using VLMs and NLP metrics, capturing content agreement that spatial-only metrics miss.

This paper proposes a new way to compare eye-tracking scanpaths by focusing on what people looked at semantically, not just where spatially. Using vision-language models, the researchers convert fixation points into text descriptions and measure similarity using NLP metrics, revealing that two people can look at different locations but see the same meaningful content.

multimodalevaluation
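
The contrast between "where they looked" and "what they saw" can be sketched as follows. The fixation coordinates and text descriptions are invented; in the paper the descriptions come from a vision-language model, whereas here they are hard-coded strings. Token-set Jaccard overlap stands in for "NLP metrics" and is an assumption, not necessarily a metric the authors use.

```python
import math

def jaccard(text_a, text_b):
    """Token-set overlap between two descriptions (1.0 means identical sets)."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def semantic_similarity(descs_a, descs_b):
    """Mean Jaccard overlap of descriptions paired by fixation order.

    Assumes both scanpaths are non-empty; zip() ignores unpaired extras.
    """
    pairs = list(zip(descs_a, descs_b))
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)

def mean_spatial_distance(fix_a, fix_b):
    """Mean Euclidean distance (in pixels) between order-paired fixations."""
    pairs = list(zip(fix_a, fix_b))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

# Two hypothetical viewers fixating different pixels on the same objects.
viewer_a = {"fixations": [(100, 120), (400, 300)],
            "descriptions": ["a red car", "a woman walking"]}
viewer_b = {"fixations": [(160, 90), (430, 380)],
            "descriptions": ["red car door", "woman walking dog"]}

semantic = semantic_similarity(viewer_a["descriptions"], viewer_b["descriptions"])
spatial = mean_spatial_distance(viewer_a["fixations"], viewer_b["fixations"])
print("semantic similarity:", semantic)
print("mean spatial distance (px):", round(spatial, 1))
```

The two scanpaths are tens of pixels apart at every fixation, yet their descriptions share half of their tokens. This is the kind of content agreement that a spatial-only metric would miss.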

Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

Apr 9, 2026

Sai Srinivas Kancheti, Aditya Kanade, Rohit Sinha et al.

When training vision-language models with reinforcement learning, enforcing that reasoning steps must be logically consistent and visually grounded—not just accurate—produces better explanations and even improves final answer accuracy.

This paper identifies a critical problem in multimodal AI models: they achieve high accuracy on visual reasoning tasks but produce reasoning explanations that contradict their answers and don't match what's actually in the image.

reasoningmultimodaltraining

OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

Apr 9, 2026

Haoxi Zeng, Qiankun Liu, Yi Bin et al.

By aligning DINO's semantic features with SAM's structural priors through specialized encoder-decoder modules, you can achieve both semantic generalization and precise edge detection for segmentation tasks without predefined categories.

This paper tackles open-vocabulary segmentation—identifying and outlining objects in images even when they're not in the training set—by combining two foundation models: DINO for semantic understanding and SAM for precise edge detection.

multimodalarchitectureevaluation

CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

Apr 9, 2026

Rui Gan, Junyi Ma, Pei Li et al.

Vision-language models perform well at describing traffic scenes but fail at reasoning about crash mechanics, causality, and temporal progression—critical gaps for infrastructure-based autonomous driving safety systems.

CrashSight is a benchmark dataset of 250 real-world traffic crash videos with 13K questions designed to test how well AI vision-language models understand crash scenes from roadside cameras. The benchmark reveals that current models struggle with temporal reasoning and causal analysis in safety-critical scenarios, despite being good at describing scenes.

evaluationmultimodalsafety

Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

Apr 9, 2026

Marcel Gröpl, Jaewoo Jung, Seungryong Kim et al.

You can improve VLM grounding without training by using entropy gradients to identify uncertain regions, then iteratively refining focus—useful for detail-heavy tasks like document QA and compositional reasoning.

This paper proposes a training-free method to improve vision-language models by automatically identifying which visual regions are most important for answering questions. Instead of using external tools, it leverages the model's own uncertainty about its next word to create relevance maps, then iteratively zooms into key areas until confident.

multimodalevaluationreasoning

Exploring Temporal Representation in Neural Processes for Multimodal Action Prediction

Apr 9, 2026

Marco Gabriele Fedozzi, Yukie Nagai, Francesco Rea et al.

Adding explicit positional time encoding to neural process models significantly improves their ability to generalize to unseen action sequences in robotic action prediction tasks.

This paper applies Conditional Neural Processes to predict robot actions from partial observations, inspired by how humans understand others' movements. The authors improve an existing multimodal prediction model by adding better temporal encoding, enabling robots to forecast actions over longer sequences and refine predictions as new information arrives.

reasoningmultimodalarchitecture
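
As an illustration of explicit positional time encoding, the sketch below implements the standard sinusoidal encoding familiar from Transformer models. Whether the paper uses exactly this form is an assumption; the function name and the `dim` and `max_period` parameters are illustrative choices.

```python
import math

def time_encoding(t, dim=8, max_period=10000.0):
    """Map a scalar time step t to `dim` sine/cosine features.

    Each pair of features uses a different frequency, so nearby time
    steps get similar codes and distant ones stay distinguishable.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    features = []
    for i in range(dim // 2):
        freq = 1.0 / (max_period ** (2 * i / dim))
        features.append(math.sin(t * freq))
        features.append(math.cos(t * freq))
    return features

# Encode a few time steps; each encoding would be concatenated with
# (or added to) the observation fed to the model at that step.
for step in (0, 1, 50):
    print(step, [round(v, 3) for v in time_encoding(float(step))])
```

Because the encoding is a fixed function of time rather than a learned lookup table, it is defined for time steps never seen during training, which is one reason such encodings are commonly associated with better generalization to longer sequences.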

Selective Attention System (SAS): Device-Addressed Speech Detection for Real-Time On-Device Voice AI

Apr 9, 2026

David Joohun Kim, Daniyal Anjum, Bonny Banerjee et al.

Device-addressed speech detection works much better when you consider the conversation context and history rather than analyzing each utterance in isolation—and this sequential approach can run efficiently on edge devices.

This paper tackles the problem of detecting whether spoken audio is addressed to a device (like a smart speaker) before sending it for transcription. Rather than treating each utterance independently, the authors model it as a sequential decision problem that considers conversation history.

agentsefficiencymultimodal

Phantasia: Context-Adaptive Backdoors in Vision Language Models

Apr 9, 2026

Nam Duong Tran, Phi Le Nguyen

Backdoor attacks on multimodal AI models can be made significantly stealthier by generating context-aware poisoned outputs rather than fixed patterns—a critical finding for securing VLMs in production.

This paper shows that existing backdoor attacks on Vision-Language Models are easier to detect than previously thought. It then introduces Phantasia, a new attack that generates contextually appropriate malicious responses instead of fixed patterns, which makes it much harder to spot while the model keeps its normal performance.

safetymultimodal

MoRight: Motion Control Done Right

Apr 8, 2026

Shaowei Liu, Xuanchi Ren, Tianchang Shen et al.

This framework solves two critical problems in motion-controlled video generation: disentangling camera from object motion, and modeling causal interactions between objects so actions produce realistic consequences rather than just pixel displacement.

MoRight enables users to generate videos where objects move realistically and interact with each other, while freely choosing the camera angle. It separates object motion control from camera control and learns how objects causally affect each other—so when you push one object, others react naturally rather than just shifting pixels around.

multimodalreasoningapplications

Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

Apr 8, 2026

Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir et al.

Vision-language models can identify visual features but fail at inferring structured cultural metadata from images, with significant performance gaps across different cultural regions—a critical limitation for cultural heritage applications.

This paper creates a benchmark to test how well vision-language models can extract structured cultural information (like creator, origin, period) from images of cultural artifacts. The researchers find that current models struggle with this task, showing inconsistent performance across different cultures and metadata types, revealing gaps in cultural reasoning beyond basic visual recognition.

evaluationmultimodal

RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild

Apr 8, 2026

Wenjing Margaret Mao, Jefferson Ng, Luyang Hu et al.

Hybrid sensor fusion (IMUs + egocentric vision) enables robust, portable human motion capture in uncontrolled environments—critical for scaling robot learning with real-world human demonstrations.

RoSHI is a wearable system that combines IMU sensors with AR glasses to capture full-body human motion in real-world settings. By fusing inertial measurements with egocentric camera data, it creates accurate 3D pose estimates that work even when body parts are hidden or moving fast, making it practical for collecting robot training data from human activities.

dataapplicationsmultimodal

Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction

Apr 8, 2026

Jackson Petty, Jaulie Goe, Tal Linzen

LLMs can use in-context grammatical descriptions for translation, but their performance degrades significantly with grammar complexity and sentence length—suggesting limits to learning language structure from textual descriptions alone.

This paper tests whether large language models can translate between formal languages when given grammatical rules as context. Using synchronous context-free grammars, the researchers found that LLM translation accuracy drops as grammar complexity and sentence length increase, and that models perform worse when the source and target languages differ in structure or writing system.

evaluationreasoningmultimodal

Region-Graph Optimal Transport Routing for Mixture-of-Experts Whole-Slide Image Classification

Apr 8, 2026

Xin Tian, Jiuliu Lu, Ephraim Tsalik et al.

Mixture-of-Experts routing in medical image analysis works better when constrained by optimal transport to prevent expert collapse and when routing decisions respect spatial tissue neighborhoods.

This paper proposes ROAM, a new method for analyzing gigapixel medical images (whole-slide images) with a mixture of specialized expert networks and a router that assigns different tissue regions to the appropriate experts.

architecturemultimodalapplications
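
The idea of using optimal transport to prevent expert collapse can be sketched with Sinkhorn normalization, a standard way to compute an entropy-regularized transport plan. Whether ROAM uses Sinkhorn specifically is an assumption, and the sketch leaves out the spatial neighborhood constraint mentioned above. The scores are invented.

```python
import math

def sinkhorn_routing(scores, n_iters=50, eps=1.0):
    """Turn token-to-expert scores into a balanced soft assignment.

    Returns a plan where each token (row) carries about 1/n of the mass
    and each expert (column) receives 1/k of it, so no expert is starved.
    """
    n, k = len(scores), len(scores[0])
    plan = [[math.exp(s / eps) for s in row] for row in scores]
    for _ in range(n_iters):
        # Rescale rows so every token distributes a total mass of 1/n.
        for i in range(n):
            row_sum = sum(plan[i])
            plan[i] = [v / (row_sum * n) for v in plan[i]]
        # Rescale columns so every expert receives a total mass of 1/k.
        for j in range(k):
            col_sum = sum(plan[i][j] for i in range(n))
            for i in range(n):
                plan[i][j] /= col_sum * k
    return plan

# Four tissue regions, two experts; every region scores expert 0 highest,
# so greedy argmax routing would leave expert 1 unused.
scores = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0], [1.5, 0.5]]
greedy_load = [sum(1 for row in scores if row.index(max(row)) == j) for j in range(2)]

plan = sinkhorn_routing(scores)
expert_load = [sum(plan[i][j] for i in range(4)) for j in range(2)]
print("greedy load per expert:", greedy_load)
print("balanced load per expert:", [round(v, 3) for v in expert_load])
```

Greedy routing sends all four regions to the same expert, whereas the transport plan splits the total mass evenly between experts while still giving the last region, which likes expert 1 relatively more, a larger share of that expert.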

HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models

Apr 7, 2026

Reihaneh Zohrabi, Hosein Hasani, Akshita Gupta et al.

Attention-based hallucination detection is fundamentally flawed due to confounders; HaloProbe's Bayesian approach separates external and internal signals to detect hallucinations more reliably and mitigate them without degrading model performance.

Vision-language models often hallucinate objects that aren't in images. This paper shows that using attention weights to detect hallucinations is unreliable due to hidden confounders like token position.

safetyevaluationmultimodal

DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models

Apr 7, 2026

Zhengming Yu, Li Ma, Mingming He et al.

Video diffusion models can recover lost dynamic range information by learning to synthesize plausible scene radiance in over- and underexposed regions, making LDR-to-HDR conversion practical without paired training data.

DiffHDR converts standard video (8-bit LDR) to high dynamic range (HDR) by using a video diffusion model to fill in lost highlight and shadow details.

multimodalarchitectureapplications

MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

Apr 7, 2026

Yuchi Wang, Haiyang Yu, Weikang Bian et al.

Reasoning in embeddings works best when applied selectively—use counterfactual analysis to identify which query-target pairs benefit from reasoning, then apply reinforcement learning to invoke it only when necessary.

This paper improves multimodal embedding models by selectively using reasoning only when needed. Instead of forcing all inputs through a reasoning process, the model learns when reasoning helps align queries with targets, reducing computation while achieving better performance on benchmark tasks.

multimodalreasoningefficiency

Steerable Visual Representations

Apr 2, 2026

Jona Ruthardt, Manu Gaur, Deva Ramanan et al.

You can now guide vision models with text prompts to focus on non-obvious visual concepts while maintaining strong performance on generic vision tasks—without needing separate language-centric models.

This paper introduces steerable visual representations that can be guided by natural language to focus on specific objects or concepts in images.

multimodalarchitectureevaluation

VOID: Video Object and Interaction Deletion

Apr 2, 2026

Saman Motamed, William Harvey, Benjamin Klein et al.

Video editing can be improved by treating it as a physics simulation problem: identify what changes when an object is removed, then use diffusion models guided by causal reasoning to generate realistic results.

VOID removes objects from videos while maintaining realistic physics—like correcting how other objects move or collide after removal. It uses a vision-language model to identify affected regions and a diffusion model to generate physically plausible outcomes, trained on synthetic data where physics interactions are carefully controlled.

multimodalapplicationsreasoning

Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

Apr 2, 2026

Chongjie Ye, Cheng Cao, Chuanyu Pan et al.

By unifying 2D and 3D generation in one model and leveraging plentiful 2D data as a structural constraint, you can train better 3D generators with limited 3D assets—no separate 2D-to-3D conversion pipeline needed.

Omni123 is a 3D foundation model that generates both 2D images and 3D objects from text by treating them as sequences of tokens. It uses abundant 2D image data as a guide to improve 3D generation, avoiding the need for scarce aligned text-image-3D datasets. The model cycles through different modalities (text→image→3D→image) to ensure consistency across all forms.

multimodalarchitecturedata

BVFLMSP : Bayesian Vertical Federated Learning for Multimodal Survival with Privacy

Apr 2, 2026

Abhilash Kar, Basisth Saha, Tanmay Sen et al.

This framework enables hospitals and clinics to collaboratively build better survival prediction models without sharing raw patient data, while also quantifying prediction confidence—critical for clinical adoption.

BVFLMSP combines Bayesian neural networks with federated learning to predict survival outcomes from sensitive multimodal data distributed across multiple parties. Each organization keeps its data private while contributing predictions to a shared model, with added privacy protections and uncertainty estimates for more reliable medical decision-making.

safetymultimodaltraining

Impact of Multimodal and Conversational AI on Learning Outcomes and Experience

Apr 2, 2026

Karan Taneja, Anjali Singh, Ashok K. Goel

Combining conversation with visual content (multimodality) improves learning in STEM, but conversation alone can create a false sense of understanding without actual learning gains.

This study compares three ways to learn biology: a conversational AI with images and text, one with text only, and a traditional search interface. Students using the multimodal conversational system learned best and felt most satisfied, while text-only conversation felt easier but didn't improve learning—showing that engagement doesn't always mean better outcomes.

multimodalapplicationsevaluation

Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges

Apr 2, 2026

Srivaths Ranganathan, Abhishek Dharmaratnakar, Anushree Sinha et al.

Multi-agent video recommenders coordinate specialized agents for different tasks (understanding, reasoning, memory) rather than relying on single models, enabling more explainable and adaptive recommendations—a shift that's becoming practical with LLMs.

This survey examines how video recommender systems are evolving from single models to multi-agent architectures where specialized AI agents coordinate to understand videos, reason about user preferences, and provide better recommendations.

applicationsagentsmultimodal

LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications

Apr 2, 2026

Mayank Mayank, Bharanidhar Duraisamy, Florian Geiss

A hybrid approach that merges classical Bayesian tracking with learned neural fusion achieves production-grade performance without requiring dense annotations, making it practical for autonomous vehicle systems.

LEO combines graph neural networks with Bayesian tracking to estimate the shape and trajectory of vehicles for autonomous driving. It fuses data from multiple sensors (radar, lidar, camera) to track complex objects like articulated trucks while remaining computationally efficient for real-time use.

architecturemultimodalagents

TRU: Targeted Reverse Update for Efficient Multimodal Recommendation Unlearning

Apr 2, 2026

Zhanting Zhou, KaHou Tam, Ziqiang Zheng et al.

Machine unlearning in recommendation systems works better when you target the specific model components most affected by deleted data rather than applying uniform updates across the entire model.

This paper addresses the challenge of removing user data from multimodal recommendation systems efficiently. The authors show that existing unlearning methods apply uniform updates across the entire model, but deleted-data influence is actually concentrated in specific areas like ranking behavior and certain network layers.

safetyefficiencymultimodal

HippoCamp: Benchmarking Contextual Agents on Personal Computers

Apr 1, 2026

Zhe Yang, Shulin Tian, Kairui Hu et al.

Current AI agents fail at real-world personal file management: the best models only achieve 48% accuracy on user profiling tasks, with multimodal perception and evidence grounding being the main bottlenecks.

HippoCamp is a benchmark that tests AI agents on realistic file management tasks using real personal computers with 42.4 GB of actual user files. It measures how well agents can search files, understand context, and reason across multiple file types to answer questions about a user's data—revealing that even top AI models struggle with these practical tasks.

evaluationmultimodalagents

True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies

Apr 1, 2026

Graziano Blasilli, Marco Angelini

Multimodal AI models detect misleading visualizations inconsistently: their ability varies dramatically by model size and architecture, and they often miss the intentional rhetorical techniques that human experts easily spot.

This study tests whether AI models can detect misleading visualizations and understand why they're deceptive. Researchers analyzed 2,336 tweets with COVID-19 charts—half containing intentional or accidental distortions—using 16 different AI models and compared their performance to how visualization experts judge the same images.

evaluationmultimodalapplications

A ROS 2 Wrapper for Florence-2: Multi-Mode Local Vision-Language Inference for Robotic Systems

Apr 1, 2026

J. E. Domínguez-Vidal

Florence-2 can now be easily integrated into robot software stacks through a standardized ROS 2 wrapper, enabling local vision-language inference on consumer GPUs without cloud dependencies.

This paper presents a ROS 2 software wrapper that integrates Florence-2, a vision-language model, into robotic systems for local inference.

applicationsmultimodalefficiency