ThinkLLM
Models · Capabilities · Use Cases · Benchmarks · Papers · Glossary


Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

326 papers · 45 this month · 12 topics
All · Efficiency 35 · Reasoning 35 · Multimodal 28 · Applications 28 · Evaluation 27 · Training 26 · Architecture 24 · Agents 24 · Safety 13 · Scaling 5 · Data 5 · Alignment 1

Mar 30 – Apr 5 (60)

ActionParty: Multi-Subject Action Binding in Generative Video Games

Apr 2, 2026

Alexander Pondaven, Ziyi Wu, Igor Gilitschenski et al.

This is the first video world model that can reliably control multiple independent agents in the same scene—a critical capability for simulating multi-player games and complex interactive environments.

ActionParty is a video diffusion model that can control multiple characters simultaneously in interactive game environments. Unlike existing models limited to single agents, it uses special 'subject state tokens' to track each character's state separately, allowing precise control of up to seven players at once while maintaining their identity and following their assigned actions correctly.

architecture · multimodal · agents

Steerable Visual Representations

Apr 2, 2026

Jona Ruthardt, Manu Gaur, Deva Ramanan et al.

You can now guide vision models with text prompts to focus on non-obvious visual concepts while maintaining strong performance on generic vision tasks—without needing separate language-centric models.

This paper introduces steerable visual representations that can be guided by natural language to focus on specific objects or concepts in images.

multimodal

Mar 23 – Mar 29 (40)

Vega: Learning to Drive with Natural Language Instructions

Mar 26, 2026

Sicheng Zuo, Yuxuan Li, Wenzhao Zheng et al.

Language instructions can guide autonomous driving decisions in real-time, enabling personalized driving behaviors beyond fixed rules—this opens the door to more flexible, user-responsive autonomous systems.

Vega is a vision-language-action model that learns to drive by following natural language instructions. The system combines visual perception, language understanding, and world modeling to generate safe driving trajectories. Researchers created a 100,000-scene dataset with diverse driving instructions and trajectories to train the model.

multimodal · agents · reasoning

Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

Mar 26, 2026

Zehao Wang, Huaide Jiang, Shuaiwu Dong et al.

Autonomous driving systems can be personalized to match individual driver styles by learning user embeddings from driving data and conditioning the driving policy on these embeddings, enabling more human-centered autonomous vehicles.

This paper presents Drive My Way, a personalized autonomous driving system that learns individual driver preferences and adapts to real-time instructions.

architecture · evaluation

Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

Apr 2, 2026

Daiwei Chen, Zhoutong Fu, Chengming Jiang et al.

Token initialization is a critical bottleneck when extending language models with new vocabulary—grounding new tokens in semantically meaningful positions before fine-tuning substantially improves downstream task performance.

When language models add new vocabulary tokens for specific tasks like recommendation systems, they typically initialize them as averages of existing embeddings. This paper shows this approach fails because all new tokens collapse into the same subspace, losing their distinctiveness.
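The collapse is easy to see numerically. A minimal sketch (NumPy, with a made-up embedding table; the subset-based alternative only illustrates the idea of grounded initialization, not the paper's exact method):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64))           # stand-in for an existing embedding table

# Naive init: every new token starts at the mean of existing embeddings,
# so all new tokens are identical -- zero pairwise distance, no distinctiveness.
naive = np.tile(emb.mean(axis=0), (5, 1))
print(np.linalg.norm(naive[0] - naive[1]))  # 0.0

# Grounded alternative (illustrative): initialize each new token from the mean
# of a small, semantically related subset, so new tokens start in distinct spots.
subsets = [rng.choice(1000, size=8, replace=False) for _ in range(5)]
grounded = np.stack([emb[s].mean(axis=0) for s in subsets])
print(np.linalg.norm(grounded[0] - grounded[1]) > 0)  # True
```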

training · efficiency · applications

Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning

Apr 2, 2026

Bangji Yang, Hongbo Ma, Jiajun Fan et al.

You can make reasoning models 15-60% more token-efficient while keeping or improving accuracy by simply training them to solve multiple problems simultaneously, creating an implicit efficiency incentive rather than explicit penalties.

This paper introduces Batched Contextual Reinforcement (BCR), a training method that makes language models reason more efficiently by training them to solve multiple problems at once in a shared context.

training · efficiency · reasoning

No Single Best Model for Diversity: Learning a Router for Sample Diversity

Apr 2, 2026

Yuhan Liu, Fangyuan Xu, Vishakh Padmakumar et al.

When you need diverse answers to open-ended questions, routing to the best model per query beats using any single model—and you can train a lightweight router to make this selection automatically.

This paper shows that different language models excel at generating diverse answers to open-ended questions, and no single model is best for all prompts. The authors build a router—a small model that predicts which LLM to use for each question—to dynamically select the best model.
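The "no single best model" observation can be sketched with toy data. Here the pools of sampled answers and the per-prompt oracle selection are hypothetical; the paper trains a lightweight router to *predict* this choice rather than computing it from the answers:

```python
# Distinct-sample ratio as a simple diversity proxy (illustrative metric only).
def distinct_ratio(samples):
    return len(set(samples)) / len(samples)

# Pretend pools of sampled answers from two models on two prompts.
pools = {
    ("model_a", "name a color"): ["red", "blue", "red", "green"],
    ("model_b", "name a color"): ["red", "red", "red", "blue"],
    ("model_a", "name a city"):  ["paris", "paris", "paris", "rome"],
    ("model_b", "name a city"):  ["paris", "tokyo", "lima", "rome"],
}

def route(prompt, models=("model_a", "model_b")):
    # Oracle routing: pick whichever model's samples were most diverse.
    return max(models, key=lambda m: distinct_ratio(pools[(m, prompt)]))

print(route("name a color"), route("name a city"))  # model_a model_b
```

Different prompts favor different models, which is exactly why a learned per-query router can beat any fixed choice.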

evaluation · applications

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

Apr 2, 2026

Sarath Shekkizhar, Romain Cosentino, Adam Earle

Task accuracy and conversational awareness are separate capabilities—a model can answer questions correctly without understanding how users naturally respond to those answers, revealing a blind spot in current LLM evaluation.

This paper reveals that language models can solve tasks correctly without understanding how conversations should naturally continue. Researchers tested this by asking models to generate the next user message after an assistant response—a task that requires understanding interaction flow.

evaluation · reasoning

go-mHC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices

Apr 2, 2026

Torque Dandachi, Sophia Diggs-Galligan

go-mHC enables efficient learned mixing of residual streams in transformers with a single tunable hyperparameter that trades off between speed and expressivity, potentially unlocking a new dimension for scaling model capacity.

This paper solves a mathematical problem in neural network design: how to efficiently mix information across different processing paths (residual streams) in transformers.

architecture · efficiency · scaling

VOID: Video Object and Interaction Deletion

Apr 2, 2026

Saman Motamed, William Harvey, Benjamin Klein et al.

Video editing can be improved by treating it as a physics simulation problem: identify what changes when an object is removed, then use diffusion models guided by causal reasoning to generate realistic results.

VOID removes objects from videos while maintaining realistic physics—like correcting how other objects move or collide after removal. It uses a vision-language model to identify affected regions and a diffusion model to generate physically plausible outcomes, trained on synthetic data where physics interactions are carefully controlled.

multimodal · applications · reasoning

Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference

Apr 2, 2026

Dimitrios Danopoulos, Enrico Lupi, Michael Kagan et al.

HCCS replaces softmax's expensive exponential computation with a lightweight linear approximation calibrated per attention head, enabling 8-bit integer inference on edge hardware without sacrificing model accuracy.

This paper proposes Head-Calibrated Clipped-Linear Softmax (HCCS), a fast approximation of softmax designed for edge devices running small quantized AI models.
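The general shape of a clipped-linear softmax surrogate can be sketched as follows. This is illustrative only: `c` stands in for the per-head calibrated constant, and the paper's exact HCCS formula and calibration procedure may differ.

```python
import numpy as np

def hccs_like(scores, c):
    """Clipped-linear softmax surrogate (sketch, not the paper's exact form).

    exp(x - max) is replaced by max(0, 1 + (x - max)/c), so the kernel needs
    only shifts, compares, and one divide -- friendly to integer hardware.
    """
    shifted = scores - scores.max(axis=-1, keepdims=True)
    lin = np.clip(1.0 + shifted / c, 0.0, None)
    return lin / lin.sum(axis=-1, keepdims=True)

x = np.array([2.0, 1.0, 0.0, -3.0])
approx = hccs_like(x, c=4.0)                  # [0.444, 0.333, 0.222, 0.0]
exact = np.exp(x - x.max()); exact /= exact.sum()
print(np.round(approx, 3), np.round(exact, 3))
```

Scores far below the maximum clip to exactly zero, which is where a per-head choice of `c` matters: it controls how aggressively the linear ramp truncates the tail.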

efficiency · architecture

Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

Apr 2, 2026

Chongjie Ye, Cheng Cao, Chuanyu Pan et al.

By unifying 2D and 3D generation in one model and leveraging plentiful 2D data as a structural constraint, you can train better 3D generators with limited 3D assets—no separate 2D-to-3D conversion pipeline needed.

Omni123 is a 3D foundation model that generates both 2D images and 3D objects from text by treating them as sequences of tokens. It uses abundant 2D image data as a guide to improve 3D generation, avoiding the need for scarce aligned text-image-3D datasets. The model cycles through different modalities (text→image→3D→image) to ensure consistency across all forms.

multimodal · architecture · data

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Apr 2, 2026

Gengsheng Li, Tianyu Yang, Junfeng Fang et al.

By intelligently routing training samples to different optimization strategies based on correctness, you can get the best of both fast learning and stable training—a practical improvement for post-training large language models.

This paper proposes Sample-Routed Policy Optimization (SRPO), a training method that combines two different approaches for fine-tuning language models: it routes correct outputs through a reward-based method and incorrect outputs through a distillation method.

training · reasoning · efficiency

Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency

Apr 2, 2026

Payal Fofadiya, Sunil Tiwari

Conversational agents perform better with selective memory management than unlimited retention; a relevance-guided forgetting framework improves long-horizon reasoning while reducing false memories and context bloat.

This paper tackles a key problem in conversational AI: agents need to remember past interactions to reason coherently, but storing everything causes performance to degrade and creates false memories. The authors propose a smart forgetting system that decides which memories to keep based on relevance, recency, and frequency—like a selective filing system for an agent's brain.
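A retention score combining those three signals might look like the sketch below. The weights, half-life, and threshold are arbitrary illustrative choices, not the paper's:

```python
import math, time

def retention_score(relevance, last_access, access_count, now, half_life=3600.0):
    """Toy keep-or-forget score: relevance + exponential recency decay +
    log-damped access frequency (weights are illustrative)."""
    recency = math.exp(-(now - last_access) * math.log(2) / half_life)
    frequency = math.log1p(access_count)
    return 0.5 * relevance + 0.3 * recency + 0.2 * frequency

now = time.time()
memories = [
    {"text": "user prefers metric units", "rel": 0.9, "seen": now - 60,    "n": 5},
    {"text": "small talk about weather",  "rel": 0.1, "seen": now - 86400, "n": 1},
]
kept = [m for m in memories
        if retention_score(m["rel"], m["seen"], m["n"], now) > 0.5]
print([m["text"] for m in kept])  # only the relevant, recent, frequent memory survives
```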

agents · reasoning · efficiency

The Self Driving Portfolio: Agentic Architecture for Institutional Asset Management

Apr 2, 2026

Andrew Ang, Nazym Azimbayev, Andrey Kim

Agentic AI can shift institutional investing from human execution to human oversight, with autonomous agents handling forecasting, portfolio construction, and self-improvement while staying constrained by policy documents.

This paper demonstrates how AI agents can autonomously manage investment portfolios by having specialized agents forecast market conditions, build portfolios using multiple methods, and critique each other's work—all governed by an Investment Policy Statement that ensures alignment with institutional goals.

agents · applications · reasoning

De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules

Apr 2, 2026

Keerat Guliani, Deepkamal Gill, David Landsman et al.

LLMs can extract structured regulatory rules from legal documents through iterative self-evaluation and repair, achieving 84% preference over prior methods in downstream compliance tasks without human annotation.

De Jure automatically extracts legally binding rules from regulatory documents using LLMs and iterative self-refinement. It converts dense legal text into machine-readable rules through document normalization, semantic decomposition, multi-criteria evaluation, and repair cycles—without requiring human annotation or domain expertise.

applications · reasoning · evaluation

Crystalite: A Lightweight Transformer for Efficient Crystal Modeling

Apr 2, 2026

Tin Hadži Veljković, Joshua Rosenthal, Ivor Lončarić et al.

By combining efficient tokenization with geometry-aware attention, you can build crystal generation models that are both faster and more accurate than complex graph neural networks, making generative modeling of materials more practical.

Crystalite is a lightweight diffusion Transformer for generating crystal structures that uses two key innovations: a compact atom representation called Subatomic Tokenization and a Geometry Enhancement Module that encodes crystal geometry directly into the model's attention mechanism.

architecture · efficiency · applications

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

Apr 2, 2026

Zhengxi Lu, Zhiyuan Yao, Jinyang Wu et al.

You can train agents to permanently learn skills rather than retrieve them at runtime, reducing token overhead and improving zero-shot performance by progressively withdrawing skill context during training.

SKILL0 teaches language model agents to internalize skills (procedural knowledge packages) directly into their parameters through a curriculum that gradually removes skill context during training.

training · agents · reasoning

Model-Based Reinforcement Learning for Control under Time-Varying Dynamics

Apr 2, 2026

Klemens Iten, Bruce Lee, Chenhao Li et al.

Real-world control systems drift and change; you need to actively manage which training data you use and how confident you are in your model to handle non-stationary dynamics effectively.

This paper tackles reinforcement learning for robots and systems that change over time—like machinery that wears down or environments with shifting conditions. The researchers develop a learning algorithm that adapts by selectively forgetting old data and maintaining uncertainty estimates, proving it works better than standard approaches that assume unchanging dynamics.

training · reasoning

Retrieval-Augmented Question Answering over Scientific Literature for the Electron-Ion Collider

Apr 2, 2026

Tina J. Jat, T. Ghosh, Karthik Suresh

RAG systems can be deployed locally with open-source models to answer domain-specific technical questions while maintaining data privacy and reducing costs compared to cloud-based alternatives.

Researchers built a question-answering system for nuclear physics using retrieval-augmented generation (RAG) with a local LLaMA model and arXiv articles about the Electron-Ion Collider experiment. This approach keeps sensitive scientific data private while providing a cost-effective alternative to cloud-based solutions.
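The retrieve-then-prompt loop at the core of such a system can be sketched in a few lines. Everything here is a toy: keyword overlap stands in for dense-embedding retrieval, and the prompt would be sent to the local LLaMA model rather than printed.

```python
# Minimal RAG sketch: rank documents by word overlap with the query,
# then stuff the top-k into a grounded prompt (illustrative only).
def retrieve(query, docs, k=2):
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

docs = [
    "The Electron-Ion Collider will probe the structure of nucleons.",
    "Softmax attention scales quadratically with sequence length.",
    "EIC detectors require precise particle identification.",
]
query = "What will the Electron-Ion Collider measure?"
context = retrieve(query, docs)
prompt = ("Answer using only this context:\n"
          + "\n".join(context) + "\nQ: " + query)
print(context[0])  # the EIC sentence ranks first
```

Because retrieval and generation both run locally, no document or query ever leaves the machine, which is the privacy property the paper emphasizes.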

applications · reasoning

Best-Arm Identification with Noisy Actuation

Apr 2, 2026

Merve Karakas, Osama Hanna, Lin F. Yang et al.

When learning systems communicate over noisy channels, the fundamental limits of error-free communication directly determine how efficiently you can identify the best option in a bandit problem.

This paper tackles a multi-armed bandit problem where a learner must identify the best option (arm) but can only communicate with an agent through a noisy channel. The researchers develop communication strategies that connect to information theory concepts, showing how channel quality affects the ability to find the best arm.

reasoning · evaluation

Smoothing the Landscape: Causal Structure Learning via Diffusion Denoising Objectives

Apr 2, 2026

Hao Zhu, Di Zhou, Donna Slonim

Diffusion model denoising objectives can smooth optimization landscapes for causal discovery, enabling faster and more stable learning of causal structures in challenging high-dimensional datasets.

This paper proposes DDCD, a new method for discovering causal relationships in data by adapting diffusion model techniques. Instead of using diffusion to generate data, it uses the denoising process to learn causal structures (DAGs) more stably and efficiently than existing methods like NOTEARS, especially when data is high-dimensional or imbalanced.

reasoning · training · efficiency

BVFLMSP : Bayesian Vertical Federated Learning for Multimodal Survival with Privacy

Apr 2, 2026

Abhilash Kar, Basisth Saha, Tanmay Sen et al.

This framework enables hospitals and clinics to collaboratively build better survival prediction models without sharing raw patient data, while also quantifying prediction confidence—critical for clinical adoption.

BVFLMSP combines Bayesian neural networks with federated learning to predict survival outcomes from sensitive multimodal data distributed across multiple parties. Each organization keeps its data private while contributing predictions to a shared model, with added privacy protections and uncertainty estimates for more reliable medical decision-making.

safety · multimodal · training

Generative AI Spotlights the Human Core of Data Science: Implications for Education

Apr 2, 2026

Nathan Taback

As AI handles data cleaning, modeling, and reporting, data science education must prioritize teaching human reasoning, problem formulation, and ethical judgment—skills that AI cannot replace.

This paper argues that generative AI automates routine data science tasks but reveals that the most valuable skills remain fundamentally human: problem formulation, causal reasoning, ethics, and judgment. The author proposes that data science education should focus on these irreducibly human competencies while teaching students to work effectively with AI tools.

training · applications

Do Emotions in Prompts Matter? Effects of Emotional Framing on Large Language Models

Apr 2, 2026

Minda Zhao, Yutong Yang, Chufei Peng et al.

Emotional framing in prompts is a weak, task-dependent signal that rarely helps across the board, but adaptive emotional selection can provide modest, reliable improvements—especially for socially-grounded reasoning tasks.

This paper investigates whether emotional language in prompts affects how well large language models perform on tasks like math, medical reasoning, and reading comprehension. The researchers found that adding emotional framing to prompts produces only small, inconsistent changes in accuracy—except in socially-grounded tasks where emotional context matters more.

evaluation · reasoning

Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs

Apr 2, 2026

Abinitha Gourabathina, Inkit Padhi, Manish Nagireddy et al.

Reasoning models can be made safer by detecting when they've misunderstood the question itself—reconstruct what question they answered from their reasoning trace, and abstain if it differs from the original.

This paper tackles a critical problem: getting LLMs to know when to refuse answering questions. The authors discovered that reasoning models often fail at abstention (refusing to answer) because they answer the wrong question rather than answering incorrectly.
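The abstention check reduces to comparing the original question against the question reconstructed from the trace. In the paper that comparison presumably runs through an LLM; the sketch below uses word-set overlap as a crude stand-in, with a made-up threshold:

```python
def should_abstain(original_q, reconstructed_q, threshold=0.8):
    """Toy proxy for trace inversion: abstain when the question the model
    appears to have answered differs too much from the one it was asked.
    Jaccard word overlap stands in for an LLM-based comparison."""
    a = set(original_q.lower().split())
    b = set(reconstructed_q.lower().split())
    return len(a & b) / len(a | b) < threshold

print(should_abstain("How many moons does Mars have?",
                     "How many moons does Earth have?"))  # True -> abstain
```

A single swapped entity ("Mars" vs "Earth") drops the overlap below the threshold, which is exactly the failure mode the paper targets: a fluent answer to the wrong question.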

reasoning · safety · evaluation

When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning

Apr 2, 2026

Juarez Monteiro, Nathan Gavenski, Gianlucca Zuin et al.

Selectively querying language models based on uncertainty can improve RL agent robustness in novel situations without constant computational overhead—but successful integration requires careful design, not just combining the two systems.

This paper proposes ASK, a system that combines reinforcement learning agents with language models to handle out-of-distribution scenarios.

agents · reasoning · safety

Impact of Multimodal and Conversational AI on Learning Outcomes and Experience

Apr 2, 2026

Karan Taneja, Anjali Singh, Ashok K. Goel

Combining conversation with visual content (multimodality) improves learning in STEM, but conversation alone can create a false sense of understanding without actual learning gains.

This study compares three ways to learn biology: a conversational AI with images and text, one with text only, and a traditional search interface. Students using the multimodal conversational system learned best and felt most satisfied, while text-only conversation felt easier but didn't improve learning—showing that engagement doesn't always mean better outcomes.

multimodal · applications · evaluation

VISTA: Visualization of Token Attribution via Efficient Analysis

Apr 2, 2026

Syed Ahmed, Bharathi Vokkaliga Ganesh, Jagadish Babu P et al.

You can now understand what tokens your LLM actually uses without doubling GPU memory or being locked into specific architectures—just remove tokens and measure the impact.

VISTA is a lightweight, model-agnostic technique for visualizing which tokens matter most in LLM predictions.
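The remove-and-measure idea is leave-one-out attribution, sketched below with a hypothetical black-box scorer (a real deployment would call the model's confidence for each ablated input):

```python
# Leave-one-out token attribution: a token's importance is how much the
# model's score drops when that token is deleted (assumes unique tokens).
def attribution(tokens, score):
    base = score(tokens)
    return {t: base - score(tokens[:i] + tokens[i + 1:])
            for i, t in enumerate(tokens)}

# Toy scorer standing in for model confidence: negation dominates.
def score(tokens):
    return 0.5 + 0.4 * ("not" in tokens) + 0.1 * len(tokens) / 10

toks = ["the", "movie", "was", "not", "good"]
attr = attribution(toks, score)
print(max(attr, key=attr.get))  # prints "not": removing it drops the score most
```

The key point from the summary: this requires only extra forward passes, no gradients or duplicated model state, and treats the model as a black box.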

efficiency · evaluation

Universal Hypernetworks for Arbitrary Models

Apr 2, 2026

Xuanfeng Zhou

A single fixed hypernetwork can generate weights for diverse architectures and tasks by using architecture/task descriptors as input, eliminating the need to retrain generators when switching between different model types.

This paper introduces Universal Hypernetworks (UHN), a single neural network that can generate weights for many different model architectures and tasks. Instead of building separate weight generators for each model type, UHN uses descriptors (text descriptions of architecture and task) to produce weights for any compatible model, working across vision, graphs, text, and math tasks.

architecture · training · efficiency

Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges

Apr 2, 2026

Srivaths Ranganathan, Abhishek Dharmaratnakar, Anushree Sinha et al.

Multi-agent video recommenders coordinate specialized agents for different tasks (understanding, reasoning, memory) rather than relying on single models, enabling more explainable and adaptive recommendations—a shift that's becoming practical with LLMs.

This survey examines how video recommender systems are evolving from single models to multi-agent architectures where specialized AI agents coordinate to understand videos, reason about user preferences, and provide better recommendations.

applications · agents · multimodal

CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech

Apr 2, 2026

Youssef Saidi, Haroun Elleuch, Fethi Bougares

End-to-end speech-to-entity models substantially outperform cascaded ASR+NER pipelines for Arabic, and multilingual pretraining transfers better than Arabic-specific pretraining for this low-resource task.

This paper introduces CV-18 NER, the first dataset for extracting named entities directly from Arabic speech. The researchers created 21 entity types by annotating the Arabic Common Voice corpus, then compared end-to-end speech models (Whisper, AraBEST-RQ) against traditional pipelines that first transcribe speech then extract entities.

data

HippoCamp: Benchmarking Contextual Agents on Personal Computers

Apr 1, 2026

Zhe Yang, Shulin Tian, Kairui Hu et al.

Current AI agents fail at real-world personal file management: the best models only achieve 48% accuracy on user profiling tasks, with multimodal perception and evidence grounding being the main bottlenecks.

HippoCamp is a benchmark that tests AI agents on realistic file management tasks using real personal computers with 42.4 GB of actual user files. It measures how well agents can search files, understand context, and reason across multiple file types to answer questions about a user's data—revealing that even top AI models struggle with these practical tasks.

evaluation · multimodal · agents

Universal YOCO for Efficient Depth Scaling

Apr 1, 2026

Yutao Sun, Li Dong, Tianzhu Ye et al.

You can scale LLM reasoning at inference time without exploding memory costs by combining efficient attention architectures with parameter sharing—YOCO-U shows this works better than either approach alone.

Universal YOCO combines a specialized decoder architecture with recursive computation to enable efficient test-time scaling in language models. By reusing parameters across multiple iterations in shallow layers while maintaining constant KV cache size, it achieves better reasoning capabilities without the computational overhead that typically comes with scaling inference-time compute.
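The parameter-sharing half of the idea is simple to illustrate: apply one layer's weights recursively, so effective depth grows at test time while the parameter count stays fixed. This toy tanh layer is not the YOCO-U architecture, just the sharing mechanism:

```python
import numpy as np

d, k = 8, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d)) / np.sqrt(d)   # the one shared layer's weights

def shared_stack(x, steps=k):
    # Reuse the same parameters at every iteration: depth scales with
    # `steps`, parameters do not (illustrative of recursive computation).
    for _ in range(steps):
        x = np.tanh(x @ W)
    return x

x = rng.normal(size=(1, d))
print(shared_stack(x).shape)                       # (1, 8)
print(W.size, "params for", k, "effective layers,",
      "vs", k * W.size, "if unshared")
```

The constant-size KV cache is the other half of the trade, coming from the YOCO-style decoder rather than from sharing itself.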

efficiency · architecture · reasoning

The Recipe Matters More Than the Kitchen: Mathematical Foundations of the AI Weather Prediction Pipeline

Apr 1, 2026

Piyush Garg, Diana R. Gergel, Andrew E. Shao et al.

For AI weather prediction, the training pipeline (loss function, data, optimization strategy) determines forecast skill far more than architectural choices—and current models have a fundamental blind spot for extreme weather events.

This paper explains why training methods, loss functions, and data matter more than model architecture for AI weather prediction. Using math from approximation theory and dynamical systems, the authors show that how you train a model dominates what model you use, and prove that AI weather models systematically underestimate extreme events. They validate this across ten different AI weather models.

training · evaluation · reasoning

YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

Apr 1, 2026

Muyu He, Adit Jain, Anand Kumar et al.

Current LLM agents struggle with long-term planning and learning from delayed feedback—only top models like Claude Opus 4.6 succeed, and using scratchpads to persist information across context windows is critical for success.

YC-Bench is a benchmark that tests whether AI agents can plan and execute consistently over long periods by simulating running a startup for a year. The agent must manage employees, select contracts, and stay profitable in an uncertain environment where early mistakes have lasting consequences.

evaluation · agents · reasoning

CliffSearch: Structured Agentic Co-Evolution over Theory and Code for Scientific Algorithm Discovery

Apr 1, 2026

Youssef Mroueh, Carlos Fonseca, Brian Belgodere et al.

Combining theory and code in algorithm search, with explicit correctness/originality gates, produces more scientifically sound discoveries than optimizing code alone.

CliffSearch is an AI system that discovers new scientific algorithms by evolving both theory and code together. Unlike systems that just generate code, it uses multiple AI agents to propose, test, and refine ideas while checking for correctness and originality—similar to how scientists actually work through hypothesis, implementation, testing, and revision cycles.

agents · reasoning

LLM REgression with a Latent Iterative State Head

Apr 1, 2026

Yiheng Su, Matthew Lease

You can make LLMs predict continuous numeric values more efficiently by adding a tiny learned head that works with frozen representations, rather than decoding text or fine-tuning the entire model.

RELISH is a lightweight method for making LLMs predict numeric values directly from their internal representations. Instead of generating numbers as text, it uses a small learned component that iteratively refines a latent state through attention over token representations, then outputs a single number. It outperforms existing approaches while adding minimal parameters (0.01-0.04% overhead).

architecture · efficiency · applications

Therefore I am. I Think

Apr 1, 2026

Esakkivel Esakkiraja, Sai Rajeswar, Denis Akhiyarov et al.

LLMs appear to encode action decisions in their internal states before generating reasoning text, meaning their chain-of-thought may rationalize predetermined choices rather than drive them.

This paper investigates whether large language models decide on actions before or after reasoning through problems. Using linear probes and activation steering, the researchers show that tool-calling decisions are encoded in the model's internal activations before reasoning tokens are even generated, suggesting models may rationalize pre-made decisions rather than truly deliberating.

reasoning

Learning and Generating Mixed States Prepared by Shallow Channel Circuits

Apr 1, 2026

Fangjun Hu, Christian Kokail, Milan Kornjača et al.

Quantum states in the trivial phase can be efficiently learned from measurements and regenerated using shallow circuits, providing a theoretical foundation for quantum generative models without needing the original preparation circuit.

This paper shows how to learn and generate quantum mixed states that belong to the 'trivial phase'—states preparable by shallow quantum circuits that preserve local reversibility. The algorithm learns from measurement data alone and outputs a shallow circuit that recreates the state, with polynomial sample complexity and runtime. The work also extends to classical diffusion models.

reasoning · training · architecture

ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget

Apr 1, 2026

Nandan Thakur, Zijian Chen, Xueguang Ma et al.

You can build high-quality training data for search agents using synthetic generation and verification without expensive human annotation or API costs, enabling smaller models to compete with larger ones.

ORBIT is a dataset of 20,000 reasoning-heavy questions with verifiable answers, created cheaply without paid APIs. The authors built a four-stage pipeline (seed creation, question generation, self-verification, external verification) to generate training data for search agents—AI systems that combine language models with web search.

data · training · agents

Embarrassingly Simple Self-Distillation Improves Code Generation

Apr 1, 2026

Ruixiang Zhang, Richard He Bai, Huangjie Zheng et al.

You can improve code generation by sampling from your model's own outputs and fine-tuning on them—no external tools needed. The gains come from balancing precision (removing bad options) with exploration (keeping useful diversity).

A simple technique called self-distillation improves code generation in large language models by having them sample their own outputs and fine-tune on those samples. The method boosts performance significantly (42.4% to 55.3% on benchmarks) without needing external verifiers or teacher models, and works across different model sizes and architectures.

training · efficiency · applications

True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies

Apr 1, 2026

Graziano Blasilli, Marco Angelini

Multimodal AI models struggle inconsistently with detecting misleading visualizations; their ability varies dramatically by model size and architecture, and they often miss the intentional rhetorical techniques that human experts easily spot.

This study tests whether AI models can detect misleading visualizations and understand why they're deceptive. Researchers analyzed 2,336 tweets with COVID-19 charts—half containing intentional or accidental distortions—using 16 different AI models and compared their performance to how visualization experts judge the same images.

evaluation · multimodal · applications

A ROS 2 Wrapper for Florence-2: Multi-Mode Local Vision-Language Inference for Robotic Systems

Apr 1, 2026

J. E. Domínguez-Vidal

Florence-2 can now be easily integrated into robot software stacks through a standardized ROS 2 wrapper, enabling local vision-language inference on consumer GPUs without cloud dependencies.

This paper presents a ROS 2 software wrapper that integrates Florence-2, a vision-language model, into robotic systems for local inference.

applications · multimodal · efficiency

Screening Is Enough

Apr 1, 2026

Ken M. Nakanishi

Screening attention removes the need for global competition among keys by using absolute relevance thresholds, achieving 40% parameter reduction and 3.2× faster inference compared to Transformers.

This paper introduces Multiscreen, a language model architecture that replaces standard softmax attention with a 'screening' mechanism. Instead of distributing attention weights across all keys, screening evaluates each key against a threshold to decide which ones are relevant, eliminating the need for keys to compete with each other.
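The contrast with softmax can be sketched as follows. The exact screening mechanism in Multiscreen may differ; this only shows the structural difference, with a made-up threshold `tau`:

```python
import numpy as np

def screen(scores, tau=0.0):
    """Screening sketch: each key is judged against an absolute threshold
    instead of competing with every other key through a global exp-normalizer."""
    gate = (scores > tau).astype(float)         # independent per-key decision
    w = gate * np.maximum(scores - tau, 0.0)    # relevance above the threshold
    s = w.sum(axis=-1, keepdims=True)
    return np.where(s > 0, w / np.maximum(s, 1e-9), 0.0)

scores = np.array([3.0, 2.5, -1.0, 0.2])
print(np.round(screen(scores, tau=0.5), 3))
# sub-threshold keys get exactly zero weight; softmax would assign them small
# but nonzero mass that still depends on every other key's score
```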

architecture · efficiency · scaling

NeuroDDAF: Neural Dynamic Diffusion-Advection Fields with Evidential Fusion for Air Quality Forecasting

Apr 1, 2026

Prasanjit Dey, Soumyabrata Dev, Angela Meyer et al.

Hybrid physics-neural models can achieve better accuracy and uncertainty calibration than pure data-driven or physics-based approaches alone, especially for spatiotemporal forecasting with known physical constraints.

NeuroDDAF combines physics-informed modeling with neural networks to forecast air quality by integrating wind-driven transport equations, graph attention for spatial patterns, and uncertainty quantification. It outperforms existing methods on urban datasets while providing reliable confidence estimates for predictions.

reasoning · multimodal · applications

Online Reasoning Calibration: Test-Time Training Enables Generalizable Conformal LLM Reasoning

Apr 1, 2026

Cai Zhou, Zekai Wang, Menghua Wu et al.

ORCA calibrates LLM reasoning in real-time by adapting confidence estimates per input, enabling 40-67% compute savings during inference while providing mathematical guarantees on error rates across different reasoning tasks and domains.

This paper introduces ORCA, a framework that makes language models more efficient during reasoning by calibrating their sampling process. Using test-time training and conformal prediction, ORCA learns to estimate confidence in its own reasoning steps, reducing wasted computation while maintaining accuracy—saving up to 47% compute on in-distribution tasks and 67% on out-of-distribution problems.

reasoning · efficiency · evaluation

Adaptive Block-Scaled Data Types

Mar 30, 2026

Jack Cook, Hyemin S. Lee, Kathryn Le et al.

Adaptive block-scaled quantization can significantly reduce errors in 4-bit model compression by intelligently switching between data types per block, achieving better accuracy than fixed formats without extra storage cost.

This paper introduces adaptive quantization formats (IF4, IF3, IF6) that improve upon NVFP4 by dynamically choosing between floating-point and integer representations for each block of values. The approach uses an unused bit in NVFP4 to signal which format to use, reducing quantization errors and improving language model performance with minimal hardware overhead.
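
The per-block format choice can be illustrated with toy 4-bit grids. The code points, block size, and selection rule below are placeholders for illustration, not the actual IF4/NVFP4 encodings.

```python
import numpy as np

# Illustrative 4-bit grids; the real IF4/IF3/IF6 code points differ.
FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4 = np.concatenate([-FP4_POS[:0:-1], FP4_POS])   # symmetric FP-style grid
INT4 = np.arange(-7.0, 8.0)                        # uniform integer grid

def quantize_block(x, grid):
    """Scale the block to the grid's range, then round to the nearest code."""
    scale = np.abs(x).max() / np.abs(grid).max()
    if scale == 0:
        return np.zeros_like(x)
    idx = np.abs(x[:, None] / scale - grid[None, :]).argmin(axis=1)
    return grid[idx] * scale

def adaptive_quantize(x, block=16):
    """Per block, keep whichever format reconstructs with lower error,
    mimicking the one-bit format flag described above."""
    out = np.empty_like(x)
    for i in range(0, len(x), block):
        b = x[i:i + block]
        candidates = [quantize_block(b, g) for g in (FP4, INT4)]
        errors = [np.square(b - c).sum() for c in candidates]
        out[i:i + block] = candidates[int(np.argmin(errors))]
    return out
```

By construction the adaptive choice is never worse than either fixed format on the same blocks, which is the source of the accuracy gain at no extra storage cost.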

efficiency · training · architecture

Geometry-aware similarity metrics for neural representations on Riemannian and statistical manifolds

Mar 30, 2026

N Alex Cayco Gajic, Arthur Pellegrino

Comparing neural representations by their intrinsic geometric structure—not just their raw values—reveals deeper insights into how different networks solve the same problem, enabling better interpretation of neural computations.

This paper introduces metric similarity analysis (MSA), a new method for comparing how neural networks represent information by analyzing the intrinsic geometry of their learned representations rather than just their surface-level structure.

evaluation

On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Mar 30, 2026

Omer Dahary, Benaya Koren, Daniel Garibi et al.

You can increase diversity in generated images by applying repulsion forces in the transformer's attention channels during generation, without expensive optimization or visual artifacts.

This paper tackles the problem of text-to-image diffusion models producing visually similar outputs for the same prompt. The authors propose a method that applies 'repulsion' in the attention mechanism during image generation to encourage diverse outputs while maintaining quality and semantic accuracy.

architecture · efficiency · multimodal

Temporal Credit Is Free

Mar 30, 2026

Aur Shalev Merin

Online learning in RNNs doesn't require sophisticated credit assignment algorithms—proper gradient normalization with immediate derivatives is sufficient and dramatically more memory-efficient.

Recurrent networks can learn online using simple immediate derivatives instead of expensive backpropagation-through-time. The key insight: the hidden state naturally carries temporal information forward, so all that's needed is proper gradient normalization and avoiding stale memory traces. This approach matches or beats complex algorithms while using 1000x less memory.

training · efficiency

Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation

Mar 30, 2026

Vitória Barin Pacela, Shruti Joshi, Isabela Camacho et al.

Sparse autoencoders fail at compositional generalization because they learn poor concept dictionaries during training, not because of their amortized inference approach—fixing dictionary learning, not inference speed, is the key to interpretable AI.

This paper reveals why sparse autoencoders (SAEs) and linear probes fail to understand compositional concepts in neural networks. The core issue isn't the inference method—it's that SAEs learn dictionaries (concept representations) pointing in the wrong directions.

reasoning · evaluation

Rethinking Language Model Scaling under Transferable Hypersphere Optimization

Mar 30, 2026

Liliang Ren, Yang Liu, Yelong Shen et al.

Hypersphere-constrained optimization enables predictable scaling of language models with a single transferable learning rate, eliminating expensive hyperparameter retuning when scaling up and improving training stability.

This paper introduces HyperP, a framework for scaling language models more efficiently by constraining weights to a hypersphere during training. The key innovation is showing that a single learning rate tuned at small scale transfers reliably across different model sizes, depths, and training amounts—achieving 1.58× better compute efficiency while maintaining training stability.
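
The core constraint, re-projecting weights onto a fixed-radius hypersphere after each update, can be sketched on a toy objective. The gradient and learning rate here are illustrative, not HyperP's actual update rule.

```python
import numpy as np

def hypersphere_step(w, grad, lr, radius=1.0):
    """One gradient step followed by re-projection onto a fixed-radius
    hypersphere, keeping the effective update scale-invariant."""
    w = w - lr * grad
    return radius * w / np.linalg.norm(w)

# Toy objective: rotate w toward a fixed target direction on the sphere.
rng = np.random.default_rng(0)
target = np.array([1.0, 0.0, 0.0])
w = rng.normal(size=3)
w /= np.linalg.norm(w)
for _ in range(200):
    grad = -(target - (w @ target) * w)   # tangential gradient of -<w, target>
    w = hypersphere_step(w, grad, lr=0.1)
```

Because the weight norm is pinned, the step's effect depends only on direction, which is the intuition behind a single learning rate transferring across model sizes.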

training · scaling · efficiency

Expectation Error Bounds for Transfer Learning in Linear Regression and Linear Neural Networks

Mar 30, 2026

Meitong Liu, Christopher Jung, Rui Li et al.

Transfer learning with auxiliary tasks provably helps only under specific conditions—this paper gives exact formulas to check those conditions and optimal ways to combine auxiliary and main tasks in linear settings.

This paper provides theoretical guarantees for when auxiliary data helps in transfer learning. For linear regression, the authors derive exact formulas showing when and how auxiliary tasks improve performance. For linear neural networks with shared representations, they prove the first non-vacuous conditions for beneficial auxiliary learning and show how to optimally weight different tasks.

training

ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining

Mar 30, 2026

Anuj Diwan, Eunsol Choi, David Harwath

Specialized models for different types of speech style (speaker traits vs. utterance characteristics) outperform single unified models on individual tasks, but a combined model works better when styles need to be understood together.

ParaSpeechCLAP is a dual-encoder model that learns to match speech audio with text descriptions of speaking style (like pitch, emotion, and texture). It maps both modalities into a shared embedding space, enabling applications like finding similar-sounding speech, classifying speaker characteristics, and improving text-to-speech synthesis without retraining.

multimodal · applications

RAD-AI: Rethinking Architecture Documentation for AI-Augmented Ecosystems

Mar 30, 2026

Oliver Aleksander Larsen, Mahyar T. Moghaddam

If you're building AI systems, standard software architecture documentation won't capture ML-specific risks like model drift or data dependencies—RAD-AI provides a structured way to document these for both compliance and team understanding.

RAD-AI extends existing architecture documentation frameworks (arc42 and C4 model) to handle AI systems, adding sections for probabilistic behavior, ML lifecycles, and data dependencies. It maps to EU AI Act compliance requirements and shows 93% coverage of regulatory documentation needs versus 36% for standard frameworks.

architecture · safety · applications

See it to Place it: Evolving Macro Placements with Vision-Language Models

Mar 30, 2026

Ikechukwu Uchendu, Swati Goel, Karly Hou et al.

Foundation models trained on visual reasoning can solve specialized engineering problems like chip design without fine-tuning, by framing physical constraints as spatial reasoning tasks.

This paper uses Vision-Language Models to improve chip floorplanning—arranging components on a chip to minimize wiring. The approach, called VeoPlace, treats the chip layout as a visual problem, letting a VLM suggest component placements without any training, then iteratively refines these suggestions. It outperforms existing machine learning methods by up to 32% on standard benchmarks.

applications · reasoning · multimodal

SAGAI-MID: A Generative AI-Driven Middleware for Dynamic Runtime Interoperability

Mar 30, 2026

Oliver Aleksander Larsen, Mahyar T. Moghaddam

LLMs can serve as runtime architectural components to solve schema interoperability problems dynamically, but code generation strategies outperform direct transformation and cost varies dramatically across models without matching accuracy gains.

SAGAI-MID is a middleware system that uses LLMs to automatically fix schema mismatches between different services and APIs at runtime, eliminating the need for manual adapter code. It combines structural analysis with LLM reasoning and includes safety checks to handle real-world integration challenges across REST, GraphQL, and IoT systems.

architecture · agents · applications

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

Mar 30, 2026

Philip Schroeder, Thomas Weng, Karl Schmeckpeper et al.

Video-language models can supervise robot learning directly as reward signals if trained with spatiotemporal reasoning and grounded in continuous progress supervision, enabling robots to learn new tasks without hand-crafted rewards.

SOLE-R1 is a video-language model that watches robot videos and reasons about task progress step-by-step to provide reward signals for robot learning. Unlike standard vision-language models, it's designed to handle partial views and changing conditions, preventing robots from gaming the reward system.

reasoning · agents · multimodal

Stepwise Credit Assignment for GRPO on Flow-Matching Models

Mar 30, 2026

Yash Savani, Branislav Kveton, Yuchen Liu et al.

Stepwise credit assignment—rewarding each diffusion step for its own improvement rather than the final result—makes RL training of image generators more efficient and faster to converge.

This paper improves reinforcement learning for image generation models by assigning credit more intelligently across diffusion steps. Instead of treating all steps equally, it recognizes that early steps handle composition while late steps refine details, then rewards each step based on its specific contribution. This leads to faster learning and better sample efficiency.
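
The credit rule reduces to a telescoping difference: if `scores[t]` is some quality estimate of the partial sample after step `t` (a stand-in for the paper's actual per-step signal), each step is credited with the improvement it produced, rather than with the final reward.

```python
def stepwise_credits(scores):
    """scores[t] is the quality of the partial sample after step t,
    with scores[0] the quality before any denoising step."""
    return [b - a for a, b in zip(scores, scores[1:])]

def broadcast_credits(scores):
    """Baseline: every step receives the same final reward."""
    return [scores[-1]] * (len(scores) - 1)

# Early steps that fix composition earn large credit; late refinements less.
quality = [0.1, 0.5, 0.6, 0.9]
per_step = stepwise_credits(quality)
```

The credits telescope, so their sum equals the total improvement; the difference from the baseline is purely in how that total is distributed across steps.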

training · reasoning · efficiency

Dynamic Dual-Granularity Skill Bank for Agentic RL

Mar 30, 2026

Songjun Tu, Chengdong Xu, Qichao Zhang et al.

Organizing agent experience into dual-granularity skills (task-level and step-level) with dynamic maintenance significantly improves performance, and these skills transfer across different evaluation settings without major training overhead.

D2Skill creates a dynamic memory system for AI agents that stores two types of reusable skills: high-level task guidance and low-level step-by-step corrections. The system learns from its own training experience, continuously updating and pruning skills based on their usefulness. Tests show 10-20% improvement in task success rates on complex web-based environments.

agents · reasoning · training

GPU-Accelerated Optimization of Transformer-Based Neural Networks for Real-Time Inference

Mar 30, 2026

Soutrik Mukherjee, Sangwhan Cha

Hybrid precision (FP32 for softmax/normalization, FP16 for linear layers) delivers 2x speedup with zero accuracy loss—a practical strategy for deploying transformers in latency-critical applications.

This paper optimizes transformer models (BERT and GPT-2) for fast GPU inference using mixed-precision techniques—keeping sensitive operations in full precision while using lower precision for others. The system achieves 64x speedup over CPU and sub-10ms latency while maintaining numerical accuracy and eliminating instability issues.
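
The precision split can be mimicked in numpy: matmuls in FP16, softmax in FP32. This is a numerical illustration of the policy, not the paper's GPU implementation.

```python
import numpy as np

def linear_fp16(x, W):
    """Matmuls tolerate reduced precision well: run them in FP16."""
    return (x.astype(np.float16) @ W.astype(np.float16)).astype(np.float32)

def softmax_fp32(z):
    """Softmax is numerically fragile (exp overflow, tiny denominators):
    keep it in FP32 with the usual max-subtraction trick."""
    z = z.astype(np.float32)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8)).astype(np.float32)
W = rng.normal(size=(8, 8)).astype(np.float32)
probs = softmax_fp32(linear_fp16(x, W))
```

FP16's maximum representable value is about 65504, so `exp` overflows for inputs above roughly 11; keeping softmax and normalization in FP32 sidesteps exactly that instability.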

efficiency · architecture

Training the Knowledge Base through Evidence Distillation and Write-Back Enrichment

Mar 26, 2026

Yuxing Lu, Xukai Zhao, Wei Wu et al.

You can improve RAG systems by preprocessing your corpus once to add distilled, compact versions of relevant documents—this works with any retrieval method and shows consistent gains without changing your pipeline.

This paper proposes WriteBack-RAG, a method that improves retrieval-augmented generation (RAG) systems by treating the knowledge base as trainable. Using labeled examples, the system identifies relevant documents, distills them into compact knowledge units, and adds these to the corpus.

data · training

PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Mar 26, 2026

Xiaofeng Mao, Shaohao Rui, Kaining Ying et al.

You can train video models on short clips and generate much longer videos by using a three-tier memory strategy that compresses historical context without losing quality.

PackForcing solves the memory problem in video generation by compressing old frames intelligently—keeping early frames for context, heavily compressing middle frames, and preserving recent frames for smooth transitions. This lets models generate 2-minute videos on a single GPU after training only on 5-second clips, achieving 24x longer videos than training data.
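
The three-tier policy can be sketched as a frame-selection function. The tier sizes and stride below are made-up parameters for illustration, not the paper's actual compression scheme.

```python
def pack_context(frames, keep_early=4, keep_recent=8, mid_stride=4):
    """Three-tier memory sketch: keep the earliest frames verbatim for
    global context, subsample the middle heavily, and keep recent frames
    densely so the next clip transitions smoothly."""
    if len(frames) <= keep_early + keep_recent:
        return list(frames)
    early = list(frames[:keep_early])
    middle = list(frames[keep_early:-keep_recent:mid_stride])
    recent = list(frames[-keep_recent:])
    return early + middle + recent

history = list(range(100))   # 100 generated frames so far
context = pack_context(history)
```

The packed context grows much more slowly than the full history, which is what makes minute-long rollouts fit on a single GPU after training on short clips.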

efficiency · architecture · training

PixelSmile: Toward Fine-Grained Facial Expression Editing

Mar 26, 2026

Jiabin Hua, Hengyuan Xu, Aojie Li et al.

Fine-grained facial expression editing is now possible with precise control and identity preservation by disentangling expression semantics through symmetric joint training and contrastive learning.

PixelSmile is a new method for editing facial expressions in images with fine-grained control. It uses a diffusion model trained with a special technique to separate expression changes from identity, allowing smooth blending between different expressions while keeping a person's identity intact.

multimodal · evaluation

Back to Basics: Revisiting ASR in the Age of Voice Agents

Mar 26, 2026

Geeyang Tay, Wentao Ma, Jaewon Lee et al.

Speech recognition systems hallucinate false content under degraded audio, creating safety risks for voice agents. You need diagnostic testing across real-world conditions, not just benchmark scores, to know when and where your ASR will fail.

This paper reveals that speech recognition systems fail in real-world voice agents despite high benchmark scores. The authors created WildASR, a multilingual test set from real human speech that measures robustness across environmental noise, speaker differences, and languages.

evaluation · safety · multimodal

Natural-Language Agent Harnesses

Mar 26, 2026

Linyue Pan, Lexiao Zou, Shuo Guo et al.

Agent performance depends heavily on how you orchestrate their behavior—by making this orchestration code readable and portable through natural language, you can reuse and improve agent designs much more easily.

This paper proposes a new way to design agent control systems by writing them in natural language instead of buried in code. The authors create Natural-Language Agent Harnesses (NLAHs) and a runtime system that executes these harnesses, making it easier to reuse, compare, and study how agents are controlled across different tasks.

agents · architecture

No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

Mar 26, 2026

Hai X. Pham, David T. Hoffmann, Ricardo Guerrero et al.

You can teach vision-language models to understand compositional meaning by focusing on concept-level alignment and preserving fine-grained visual information—without custom data or hurting general performance.

This paper improves how vision-language models learn to understand combinations of concepts (like "red car" vs "blue car") without sacrificing their ability to recognize new objects.

training · multimodal · efficiency

R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

Mar 26, 2026

Zirui Zhang, Haoyu Dong, Kexin Pei et al.

Cross-modal inconsistencies in multimodal models aren't just failures to hide—they're valuable training signals that, when enforced through cycle consistency, improve reasoning accuracy by up to 7.6 points and reduce systematic biases.

This paper introduces R-C2, a reinforcement learning approach that improves multimodal AI models by enforcing consistency between visual and textual understanding. Instead of ignoring when a model gives contradictory answers for the same concept in different modalities, the method uses these conflicts as training signals.

reasoning · multimodal

Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?

Mar 26, 2026

Abhishek Bhandwaldar, Mihir Choudhury, Ruchir Puri et al.

General-purpose coding agents can discover hardware optimization patterns automatically by working at scale—using multiple agents to explore different optimization strategies yields significant speedups without domain-specific training.

This paper shows that general-purpose AI coding agents can optimize hardware designs without specialized training. The approach uses multiple agents working together: first decomposing designs into smaller pieces and optimizing each, then launching additional agents to find cross-function improvements.

agents · applications

S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

Mar 26, 2026

Ligong Han, Hao Wang, Han Gao et al.

You can make diffusion-based language models much faster by intelligently deciding when to verify generated tokens, using the same model in two different modes without retraining.

S2D2 speeds up block-diffusion language models by combining parallel token generation with selective verification steps. The method reuses the same pretrained model in two modes—as a fast parallel generator and as a careful single-token verifier—without requiring additional training, achieving up to 4.7× speedup over standard autoregressive decoding.
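
The draft-then-verify loop can be sketched with toy stand-ins for the two modes. The greedy accept/reject rule here is a simplification of the paper's verification scheme, and the toy drafter/verifier functions are hypothetical.

```python
def speculative_decode(draft_block, verify_one, prompt, n_tokens, block=4):
    """Self-speculation sketch: the same model proposes a block of tokens
    in parallel (draft mode), then checks them one at a time (verify mode).
    Matching drafts are accepted; the first mismatch ends the block."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        for tok in draft_block(out, block):
            correct = verify_one(out)   # one careful verification step
            out.append(correct)
            if tok != correct:
                break                   # discard the rest of the draft
    return out[len(prompt):len(prompt) + n_tokens]

# Toy sequence: the verifier knows the true next token; the drafter is
# wrong at one position but the loop recovers.
target = [i % 3 for i in range(20)]
verify = lambda ctx: target[len(ctx)]
draft = lambda ctx, k: [
    (t + 1) % 3 if len(ctx) + j == 5 else t
    for j, t in enumerate(target[len(ctx):len(ctx) + k])
]
decoded = speculative_decode(draft, verify, [], 10)
```

When the draft is mostly right, most verification steps confirm several tokens per block, which is where the speedup over strict one-token-at-a-time decoding comes from.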

efficiency · reasoning

The Kitchen Loop: User-Spec-Driven Development for a Self-Evolving Codebase

Mar 26, 2026

Yannick Roy

You can safely automate continuous code improvement by combining LLM agents that act as power users, unbeatable verification tests, and automated pause gates that catch quality degradation before it ships.

A framework for autonomous software development where LLM agents continuously test and improve code against a specification. The system uses synthetic user testing at 1,000x human speed, ground-truth verification tests, and automated quality gates to safely evolve codebases without human intervention—validated on production systems with 1,000+ merged changes and zero regressions.

agents

A Unified Memory Perspective for Probabilistic Trustworthy AI

Mar 26, 2026

Xueji Zhao, Likai Pei, Jianbo Liu et al.

Memory access, not computation speed, limits performance in probabilistic AI systems—hardware designers need to optimize for both data delivery and randomness generation together, not separately.

This paper examines how memory systems become the performance bottleneck in AI systems that need probabilistic computation for safety and robustness. It proposes treating deterministic data access as a special case of stochastic sampling, creating a unified framework to analyze memory efficiency.

efficiency · safety · architecture

On Neural Scaling Laws for Weather Emulation through Continual Training

Mar 26, 2026

Shashank Subramanian, Alexander Kiefer, Arnur Nigmetov et al.

Neural scaling laws can predict weather model performance and guide efficient resource allocation—models trained with periodic cooldowns outperform standard approaches and enable longer, more accurate forecasts.

This paper studies how neural networks for weather forecasting improve as you scale up the model size, training data, and compute.

scaling · efficiency · training

Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming

Mar 26, 2026

Yunus Talha Erzurumlu, Jiyong Kwag, Alper Yilmaz

Treating geo-localization as a sequential zooming problem over maps, rather than image retrieval, achieves better results and avoids the limitations of contrastive learning approaches that struggle with landmark visibility mismatches.

This paper tackles cross-view geo-localization—matching street-view photos to satellite maps to pinpoint a camera's location without GPS. Instead of the standard approach of comparing images in a shared embedding space, the authors propose a new method that zooms progressively into a satellite map, making sequential decisions to narrow down the location.

reasoning · architecture · evaluation

Polynomial Speedup in Diffusion Models with the Multilevel Euler-Maruyama Method

Mar 25, 2026

Arthur Jacot

You can sample from diffusion models much faster by combining predictions from small and large networks—the method achieves the same accuracy as running the largest network once, instead of many times.

This paper speeds up diffusion model sampling by using multiple neural networks of different sizes together. Instead of running one large network many times, the method runs a small fast network many times and a large accurate network just a few times, reducing total computation while maintaining quality. Tests show up to 4x speedup on image generation.
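
The idea of spending most evaluations on the small network and only a few on a correction term is the classic multilevel estimator. Here is a minimal Monte Carlo sketch of that principle, with toy stand-ins for the two networks; the actual Euler-Maruyama SDE machinery is omitted.

```python
import random

def multilevel_estimate(cheap, expensive, n_cheap=10_000, n_corr=100, seed=0):
    """Multilevel sketch: E[expensive(x)] = E[cheap(x)] + E[expensive - cheap].
    Most samples go to the cheap model; the correction term has low variance
    when the two models agree closely, so a few expensive calls suffice."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_cheap)]
    base = sum(cheap(x) for x in xs) / n_cheap
    ys = [rng.gauss(0.0, 1.0) for _ in range(n_corr)]
    corr = sum(expensive(y) - cheap(y) for y in ys) / n_corr
    return base + corr

# Hypothetical stand-ins: a large 'expensive' net and a close cheap proxy.
expensive = lambda x: x * x          # E[x^2] = 1 for x ~ N(0, 1)
cheap = lambda x: x * x + 0.01       # slightly biased approximation
estimate = multilevel_estimate(cheap, expensive)
```

The total cost is dominated by the cheap calls, yet the estimate targets the expensive model's expectation, mirroring "run the small network many times and the large network a few times".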

efficiency · architecture

DreamerAD: Efficient Reinforcement Learning via Latent World Model for Autonomous Driving

Mar 25, 2026

Pengxuan Yang, Yupeng Zheng, Deheng Qian et al.

Latent world models can dramatically speed up RL training for autonomous driving by replacing expensive multi-step diffusion with single-step latent sampling, making imagination-based policy training practical.

DreamerAD uses a latent world model to train autonomous driving policies 80x faster than previous diffusion-based approaches. Instead of generating full images during training, it compresses the diffusion process to a single step by working with compressed latent features, enabling safe, efficient reinforcement learning on driving tasks without real-world testing.

efficiency · reasoning · agents

Comparing Developer and LLM Biases in Code Evaluation

Mar 25, 2026

Aditya Mittal, Ryan Shar, Zichu Wu et al.

LLMs used as code judges have significant blind spots compared to human developers—they systematically misweight code quality factors like explanation length, meaning you can't rely on them alone for code evaluation in real applications.

This paper introduces TRACE, a framework that compares how LLM judges evaluate code against human developer preferences.

evaluation · applications

The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence

Mar 25, 2026

Biplab Pal, Santanu Bhattacharya

Before deploying agentic AI in business processes, measure the 'blind mass' of uncertain state-action pairs and expected oversight costs using event logs—this reveals hidden decision gaps that simple accuracy metrics miss.

This paper develops a mathematical framework to measure when AI agents can safely operate autonomously versus when they need human oversight.

agents · safety · evaluation

Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

Mar 25, 2026

Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur et al.

Better retrieval doesn't guarantee better RAG answers: improving individual components can paradoxically increase confident hallucinations when relevant information isn't in your corpus.

This paper studies retrieval-augmented generation (RAG) systems for answering questions about AI policy documents. The researchers found that improving retrieval quality doesn't always lead to better answers—sometimes better retrieval actually makes the system more confidently wrong when relevant documents are missing.

evaluation · applications

MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

Mar 25, 2026

Zhuo Li, Yupeng Zhang, Pengyu Cheng et al.

Using multiple agents with intentional information barriers prevents LLMs from confirming their own errors during fact-checking, letting smaller models match larger ones on reliability.

MARCH is a framework that reduces hallucinations in LLMs by using three specialized agents that work together with deliberate information separation. A Solver generates responses, a Proposer breaks them into verifiable claims, and a Checker validates claims without seeing the original output—preventing the verifier from copying the generator's mistakes.
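
The information barrier can be sketched as a pipeline in which the Checker never receives the Solver's full output. The toy lambdas below are hypothetical stand-ins for the three LLM agents.

```python
def march_check(solver, proposer, checker, question):
    """Information-barrier sketch: the Checker judges each claim on its
    own, never seeing the Solver's full response, so it cannot simply
    rubber-stamp the generator's mistakes."""
    answer = solver(question)
    claims = proposer(answer)                    # atomic, verifiable claims
    verdicts = [checker(question, c) for c in claims]   # answer not passed
    return answer, all(verdicts)

# Toy roles: one of the Solver's two claims is false.
solver = lambda q: "2 + 2 = 4 and 3 + 3 = 7"
proposer = lambda ans: [part.strip() for part in ans.split("and")]
known_facts = {"2 + 2 = 4", "3 + 3 = 6"}
checker = lambda q, claim: claim in known_facts
answer, ok = march_check(solver, proposer, checker, "do some sums")
```

Because the Checker only ever sees isolated claims, a false statement in an otherwise fluent answer is caught instead of being confirmed by context.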

safety · agents · alignment

EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction

Mar 25, 2026

Falong Fan, Yi Xie, Arnis Lektauers et al.

Dynamic graph-based feature connections outperform fixed spatial neighborhoods for reconstructing deformable surgical scenes, especially when dealing with occlusions and low-texture surfaces.

EndoVGGT improves 3D reconstruction of soft tissues during surgery by using a graph neural network module that dynamically connects similar tissue regions across the image, even when instruments block the view or surfaces are shiny. This approach recovers the true shape of deformable tissues better than previous methods and works on new surgical videos it hasn't seen before.

architecture

Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation

Mar 25, 2026

Xinying Guo, Chenxi Jiang, Hyun Bin Kim et al.

For robotic tasks with visual ambiguity, storing rich multimodal memory with geometric grounding outperforms semantic compression—robots need fine-grained context, not just similarity-based retrieval, to handle non-Markovian decision problems.

Chameleon is a memory system for robots that handles situations where the same visual observation could mean different things depending on what happened before. Instead of storing compressed summaries like most systems, it preserves detailed geometric and visual information to disambiguate confusing situations, enabling robots to make better decisions during long, complex manipulation tasks.

agents · multimodal

VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Mar 25, 2026

Qijia He, Xunmei Liu, Hammaad Memon et al.

You can now automatically convert flat images of technical figures into editable, scalable vector graphics—matching GPT-5.2 performance—enabling recovery of lost design source files without manual reconstruction.

VFIG converts rasterized images (PNG, JPEG) of technical diagrams back into editable SVG vector graphics using vision-language models. The team created a 66K dataset of figure-SVG pairs and a two-stage training approach (supervised learning for basic shapes, then reinforcement learning for refinement) to reconstruct complex professional diagrams with high fidelity.

multimodal · training · applications

Anti-I2V: Safeguarding your photos from malicious image-to-video generation

Mar 25, 2026

Duc Vu, Anh Nguyen, Chi Tran et al.

If you're concerned about your photos being used to generate deepfake videos, adversarial perturbations applied in multiple domains (color and frequency) can effectively block modern video generation models while remaining imperceptible to humans.

This paper presents Anti-I2V, a defense method that protects photos from being misused in AI-generated fake videos. Instead of just adding noise to images, it works across multiple color spaces and frequency domains to disrupt video generation models, targeting both traditional and newer Transformer-based architectures.

safety

Trust Region Constrained Bayesian Optimization with Penalized Constraint Handling

Mar 25, 2026

Raju Chowdhury, Tanmay Sen, Prajamitra Bhuyan et al.

Trust regions combined with penalty-based constraints enable Bayesian optimization to find feasible solutions faster in high-dimensional constrained problems where evaluations are expensive.

This paper presents a Bayesian optimization method for expensive black-box optimization problems with constraints. It combines penalty-based constraint handling, surrogate modeling, and trust regions to efficiently find good solutions in high dimensions with fewer evaluations.

efficiency · training

Scaling Recurrence-aware Foundation Models for Clinical Records via Next-Visit Prediction

Mar 25, 2026

Haresh Rengaraj Rajamohan, Xiang Gao, Weicheng Zhu et al.

Foundation models can effectively predict clinical outcomes from EHR data, but scaling model size alone doesn't improve performance—you need proportionally more training data, and careful handling of repeated events is critical to avoid inflated evaluation metrics.

RAVEN is a foundation model trained on electronic health records (EHRs) from over one million patients to predict what clinical events will happen at a patient's next visit.

applications · scaling

The Free-Market Algorithm: Self-Organizing Optimization for Open-Ended Complex Systems

Mar 25, 2026

Martin Jaraiz

FMA shows that market-based competition between autonomous agents can solve complex optimization problems without explicit fitness functions, suggesting economic dynamics may be a fundamental principle for discovering solutions in open-ended systems.

The Free-Market Algorithm is a new optimization method inspired by economics that doesn't need predefined fitness functions or fixed search spaces. Instead, it uses autonomous agents trading goods and competing in a market to discover solutions. It successfully found amino acids and nucleobases from raw atoms in chemistry, and predicted GDP with accuracy matching professional forecasters.

LensWalk: Agentic Video Understanding by Planning How You See in Videos

Mar 25, 2026

Keliang Li, Yansong Li, Hongze Shen et al.

Giving AI agents control over their visual perception—deciding what to look at and when—significantly improves video reasoning accuracy. This active observation approach works as a plug-and-play upgrade for existing vision-language models.

LensWalk is an AI framework that lets language models actively control how they watch videos while reasoning about them.

agents · multimodal · reasoning

Evaluating Chunking Strategies For Retrieval-Augmented Generation in Oil and Gas Enterprise Documents

Mar 25, 2026

Samuel Taiwo, Mohd Amaluddin Yusoff

For enterprise RAG systems with structured documents, preserve document structure when chunking—it improves retrieval quality and reduces costs, but you'll need multimodal AI to handle diagrams and visual content.

This paper tests four different ways to split documents into chunks for RAG systems using oil and gas industry documents. Structure-aware chunking (which respects document layout) works best and costs less than other methods, but all approaches struggle with diagrams and visual content.
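
The structure-aware idea can be sketched against a fixed-size baseline, assuming markdown-style headings; the paper's documents and its four exact strategies differ.

```python
def fixed_chunks(text, size=200):
    """Baseline: split every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def structure_aware_chunks(text):
    """Structure-aware sketch: split on headings so each chunk is a
    self-contained section ('#'-style markdown headings assumed)."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = "# Safety Valves\nalpha specs\n# Pipelines\nbeta specs"
sections = structure_aware_chunks(doc)
```

Keeping a heading attached to its body means each retrieved chunk carries its own context, which is why structure-aware splits tend to retrieve better than arbitrary character windows.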

evaluation · applications

MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage

Mar 24, 2026

Ufaq Khan, Umair Nawaz, L D M S S Teja et al.

Medical VLMs need explicit training on input validation (checking modality, anatomy, orientation) as a separate safety step before diagnosis, not as an afterthought—current models hallucinate plausible reports even on obviously invalid inputs.

This paper reveals a critical blind spot in medical AI: vision-language models can generate fluent medical reports even when given invalid inputs like wrong body parts or upside-down images. MedObvious is a benchmark of 1,880 tasks testing whether models can catch these basic sanity checks before attempting diagnosis—a step human radiologists do automatically but VLMs currently fail at.

safety · evaluation · multimodal

VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

Mar 24, 2026

Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas et al.

You can make vision-language models faster without losing visual detail by being selective about which attention layers process images—use efficient cross-attention for context and add self-attention layers only when the task complexity demands it.

VISOR improves vision-language model efficiency by selectively attending to visual information rather than compressing images. Instead of reducing visual tokens, it uses sparse cross-attention and dynamically chosen self-attention layers to process high-resolution details only when needed, reducing computation while maintaining performance on complex visual reasoning tasks.

efficiency · multimodal · architecture

Failure of contextual invariance in gender inference with large language models

Mar 24, 2026

Sagar Kumar, Ariel Flint, Luca Maria Aiello et al.

LLM outputs are unstable across contextually equivalent formulations of the same task, meaning benchmark results may not reflect how models actually behave in real applications—a critical issue for bias testing and high-stakes use.

This paper reveals that large language models fail to give consistent outputs when tasks are reformulated in contextually equivalent ways.
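One way to probe this yourself: send contextually equivalent rewordings of the same task and measure how often the model's answers agree. A minimal sketch with a stub standing in for an LLM call (the paper's actual probe is gender inference across prompt formulations; the prompts and metric below are illustrative assumptions):

```python
def invariance_rate(model, prompts: list[str]) -> float:
    """Fraction of contextually equivalent prompts yielding the majority
    answer. 1.0 means fully invariant; lower means the output depends
    on wording, not just task content."""
    answers = [model(p) for p in prompts]
    majority = max(set(answers), key=answers.count)
    return answers.count(majority) / len(answers)

# Stub that flips its answer when the prompt is phrased as a question,
# a toy stand-in for the instability the paper measures in real LLMs.
stub = lambda p: "female" if p.endswith("?") else "male"
prompts = [
    "State the likely gender of this text's author.",
    "Who wrote this text, a man or a woman?",
    "Classify the author's gender.",
]
rate = invariance_rate(stub, prompts)
```

A rate well below 1.0 on equivalent phrasings is exactly the failure mode that makes single-phrasing benchmark numbers unreliable.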

evaluation · safety

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Mar 24, 2026

Haoyu Huang, Jinfa Huang, Zhongwei Wan et al.

A smaller speculative model can predict an agentic system's tool-calling trajectory, enabling parallel execution and early termination of expensive operations—delivering significant speedups without accuracy loss.

SpecEyes speeds up agentic multimodal AI systems by using a lightweight model to predict what tools the main model will need, allowing expensive operations to be skipped or run in parallel. This delivers 1.1x–3.35x lower latency while maintaining accuracy, addressing a key bottleneck in systems like OpenAI o3 that repeatedly invoke vision tools.
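The core trick resembles speculative decoding applied to tool calls: a cheap draft model guesses the next tool invocation so it can start executing while the main model is still planning. A minimal single-step sketch (the names, the toy tools, and the one-step scope are assumptions; the real system speculates over whole trajectories):

```python
from concurrent.futures import ThreadPoolExecutor

def run_step(draft_model, main_model, tools, state):
    """Speculative perception: a cheap draft model guesses the next tool
    call so the tool can run while the main model is still deciding.
    If the guess matches, the expensive tool result is already there."""
    guess = draft_model(state)                  # e.g. ("crop", "top-left")
    with ThreadPoolExecutor() as pool:
        speculative = pool.submit(tools[guess[0]], guess[1])
        actual = main_model(state)              # expensive planning step
    if actual == guess:
        return actual, speculative.result()     # hit: tool ran in parallel
    return actual, tools[actual[0]](actual[1])  # miss: pay the full cost

tools = {"crop": lambda r: f"cropped {r}", "zoom": lambda r: f"zoomed {r}"}
draft = lambda s: ("crop", "top-left")
main = lambda s: ("crop", "top-left")
action, result = run_step(draft, main, tools, state="frame0")
```

On a mispredict nothing is lost but the speculative work; on a hit, tool latency overlaps with planning latency, which is where the speedup comes from.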

efficiency · multimodal · agents

ReqFusion: A Multi-Provider Framework for Automated PEGS Analysis Across Software Domains

Mar 24, 2026

Muhammad Khalid, Manuel Oriol, Yilmaz Uygun

Using structured prompting formats (PEGS) with multiple LLM providers significantly improves requirements extraction accuracy (F1: 0.88 vs 0.71) and provides built-in reliability through model consensus and fallback mechanisms.

ReqFusion automates software requirements extraction and classification by combining multiple LLM providers (GPT, Claude, Groq) with a structured PEGS format prompt. The system processes various document types and achieves 88% accuracy, reducing manual analysis time by 78% while ensuring consistent requirement categorization across academic, industrial, and business contexts.
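The consensus-plus-fallback idea is simple to sketch: collect one label per provider, drop providers that error out, and return the majority vote. Stub callables stand in for the GPT/Claude/Groq API calls, and the label set is illustrative, not the paper's exact PEGS categories:

```python
def classify_with_consensus(requirement: str, providers: list) -> str:
    """Ask each provider to label a requirement and return the majority
    vote; a provider that fails simply drops out (built-in fallback)."""
    votes = []
    for provider in providers:
        try:
            votes.append(provider(requirement))
        except Exception:
            continue  # fallback: skip the failed provider
    if not votes:
        raise RuntimeError("all providers failed")
    return max(set(votes), key=votes.count)

# Stub providers standing in for real LLM API calls.
def flaky(req): raise TimeoutError
providers = [lambda r: "Goal", lambda r: "Environment", lambda r: "Goal", flaky]
label = classify_with_consensus("System shall log all access events.", providers)
```

Majority voting is what turns disagreement between providers into a reliability signal rather than a failure.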

applications · evaluation

VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

Mar 24, 2026

Haoran Yuan, Weigang Yi, Zhenyu Zhang et al.

Adding tactile (touch) sensing to video-based robot learning models significantly improves performance on tasks requiring precise force control and contact awareness, without needing separate tactile pretraining.

This paper introduces VTAM, a robot learning system that combines video and touch (tactile) sensing to better understand and perform complex physical tasks.

multimodal · applications

Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions

Mar 24, 2026

Rustem Islamov, Grigory Malinovsky, Alexander Gaponov et al.

You can now build federated learning systems that defend against both Byzantine attacks and privacy breaches simultaneously, without needing unrealistic assumptions like bounded gradients or extra server datasets.

This paper tackles two critical security issues in federated learning: defending against malicious participants that send corrupted updates (Byzantine attacks) and preventing data leakage (differential privacy).
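The two defenses compose naturally: clients clip and noise their updates for differential privacy, and the server replaces the mean with a robust aggregator so corrupted updates cannot dominate. A minimal sketch using coordinate-wise median (one of several robust aggregators; the paper's actual estimator, assumptions, and noise calibration differ):

```python
import random
import statistics

def dp_update(grad, clip=1.0, sigma=0.5, rng=None):
    """Client side: clip the update to bound sensitivity, then add
    Gaussian noise (the standard DP-SGD recipe)."""
    rng = rng or random.Random(0)
    norm = sum(g * g for g in grad) ** 0.5
    scale = min(1.0, clip / max(norm, 1e-12))
    return [g * scale + rng.gauss(0, sigma) for g in grad]

def robust_aggregate(updates):
    """Server side: coordinate-wise median instead of the mean, so a
    minority of Byzantine clients cannot drag the aggregate arbitrarily."""
    return [statistics.median(col) for col in zip(*updates)]

clipped = dp_update([3.0, 4.0], clip=1.0, sigma=0.0)  # noise off to show clipping
honest = [[0.9, -1.1], [1.0, -1.0], [1.1, -0.9]]
byzantine = [[1e6, 1e6]]              # one attacker sends a huge update
agg = robust_aggregate(honest + byzantine)
```

With a plain mean, the single attacker would push the aggregate to ~250,000 per coordinate; the median stays near the honest values.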

safety · training · efficiency

InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting

Mar 24, 2026

Duc Vu, Kien Nguyen, Trong-Tung Nguyen et al.

You can dramatically improve few-step diffusion inpainting by initializing the noise with semantic information from the input image, rather than random noise—no retraining required.

InverFill speeds up image inpainting by using a smart noise initialization technique that preserves semantic information from the original image. Instead of training new models, it works with existing fast text-to-image models to fill in masked regions with better quality and fewer processing steps.
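The initialization idea can be caricatured in a few lines. This sketch uses a crude SDEdit-style interpolation between the image latent and Gaussian noise as a stand-in; the paper's actual contribution is a one-step inversion, which this does not implement:

```python
import random

def semantic_noise_init(latent, mask, strength=0.7, rng=None):
    """Build the starting noise for inpainting from the input image's
    latent rather than from pure Gaussian noise, so masked regions
    begin with semantic context from their surroundings."""
    rng = rng or random.Random(0)
    noise = [rng.gauss(0, 1) for _ in latent]
    return [
        strength * n + (1 - strength) * z if m else z  # masked: mostly noise
        for z, n, m in zip(latent, noise, mask)        # unmasked: keep image
    ]

latent = [0.5, -0.2, 0.8, 0.1]      # toy 4-dim "latent"
mask = [True, True, False, False]   # first two entries get inpainted
init = semantic_noise_init(latent, mask)
```

Because the starting point already encodes the surrounding content, fewer denoising steps are needed to reach a coherent fill, which is the efficiency claim.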

efficiency · architecture

End-to-End Efficient RL for Linear Bellman Complete MDPs with Deterministic Transitions

Mar 24, 2026

Zakaria Mhammedi, Alexander Rakhlin, Nneka Okolo

For a well-structured class of RL problems, you can now learn optimal policies efficiently using linear models without needing special oracles or being limited to tiny action spaces.

This paper solves a key challenge in reinforcement learning: how to efficiently learn good policies when using linear function approximation in a specific class of environments (linear Bellman complete MDPs). The researchers provide an algorithm that works with both small and large action spaces, achieving polynomial time and sample complexity—meaning it scales reasonably with problem size.

efficiency · reasoning

CSTS: A Canonical Security Telemetry Substrate for AI-Native Cyber Detection

Mar 24, 2026

Abdul Rahman

Security AI models fail when deployed to new environments because telemetry data is fragmented. CSTS solves this by providing a unified, entity-focused data structure that maintains consistent identity and relationships across different systems.

This paper introduces CSTS, a standardized way to represent security data that helps AI systems detect cyber threats across different computer networks. Instead of treating security events as isolated incidents, CSTS organizes them around entities (like users or devices) and their relationships, making AI models more reliable when deployed in new environments.
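The entity-centric framing amounts to a normalization layer: every raw event resolves to a stable entity identity and appends to that entity's history and relationship set. A minimal sketch (the field names and the canonical-ID rule are assumptions for illustration, not the actual CSTS schema):

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    entity_id: str                          # stable across data sources
    kind: str                               # "user", "host", "process", ...
    events: list = field(default_factory=list)
    relations: set = field(default_factory=set)

def ingest(store: dict, event: dict) -> None:
    """Normalize a raw telemetry event into the entity-centric store:
    the same user seen in two different logs lands on one record."""
    eid = f'{event["kind"]}:{event["name"].lower()}'  # canonical identity
    ent = store.setdefault(eid, Entity(eid, event["kind"]))
    ent.events.append(event["action"])
    if "target" in event:
        ent.relations.add(event["target"])

store = {}
# Two logs spell the same user differently; both resolve to one entity.
ingest(store, {"kind": "user", "name": "Alice", "action": "login",
               "target": "host:db01"})
ingest(store, {"kind": "user", "name": "ALICE", "action": "vpn_connect"})
```

A detection model trained on this representation sees the same entity shape regardless of which network or vendor produced the raw events, which is the portability claim.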

safety · data · evaluation