ThinkLLM

Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

861 papers, 36 this month, 12 topics
Topics: Efficiency 37, Reasoning 36, Training 35, Evaluation 29, Architecture 23, Agents 23, Multimodal 17, Applications 15, Alignment 9, Safety 8, Scaling 8, Data 3

May 18 – May 24 (10)

Integrable Elasticity via Neural Demand Potentials

May 21, 2026

Carlos Heredia, Daniel Roncel

Neural demand models can be designed to respect economic constraints (integrability), producing more reliable price-elasticity estimates that are both mathematically consistent and practically useful for retail pricing.

This paper introduces ICDN, a neural network model that learns demand patterns for multiple products based on prices. Unlike traditional approaches, it directly models how demand changes with price (elasticity) in a mathematically consistent way, making the learned relationships more economically realistic and stable.

architecture, applications

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

May 21, 2026

Ali Hatamizadeh, Yejin Choi, Jan Kautz

Decoupling erase and write operations in linear attention with separate gates improves language model performance, especially on long-context tasks, while maintaining constant-memory decoding.

This paper improves linear attention mechanisms by separating the control of what to forget from what to remember in compressed memory. Instead of using a single gate to control both erasing old information and writing new information, Gated DeltaNet-2 uses separate channel-wise gates for each operation, making memory updates more flexible and efficient.
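
The idea of separate erase and write gates can be sketched on a delta-rule style memory update. This is a minimal illustration, not the paper's equations: the function name, the exact update form, and the use of per-channel gate vectors over the key dimension are all assumptions made here.

```python
import numpy as np

def decoupled_delta_step(S, k, v, erase_gate, write_gate):
    """One recurrent update of a (d_k, d_v) memory matrix S.

    erase_gate and write_gate are per-channel vectors of shape (d_k,)
    with entries in [0, 1]. Both gates and this update form are
    assumptions for illustration only.
    """
    k = k / (np.linalg.norm(k) + 1e-8)           # unit-norm key
    recalled = k @ S                              # value the memory currently returns for k
    S = S - np.outer(erase_gate * k, recalled)    # erase: remove the old association
    S = S + np.outer(write_gate * k, v)           # write: store the new value
    return S
```

When both gates are set to the same scalar, this reduces to the usual single-gate delta rule. The state keeps its shape at every step, which is what gives constant-memory decoding.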

May 11 – May 17 (7)

EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

May 14, 2026

Ruozhen He, Meng Wei, Ziyan Yang et al.

Maintaining consistent characters and objects across long video sequences is hard; explicit memory of each entity's appearance significantly improves consistency, especially when characters reappear after many shots.

EntityBench is a benchmark for evaluating multi-shot video generation—creating coherent video sequences with multiple scenes. It includes 140 episodes with detailed tracking of characters, objects, and locations across shots, plus an evaluation system that measures both video quality and consistency.

evaluation, multimodal, architecture

RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

May 14, 2026

Xiang Fan, Yuheng Wang, Bohan Fang et al.

Video generation systems lose detail because their decoders ignore the input image—adding reference conditioning to the decoder recovers this information and improves quality by up to 2.1dB PSNR.

RefDecoder improves video generation by conditioning the decoder on a reference image, fixing a common architectural flaw where decoders ignore input details. By injecting reference image information through attention mechanisms during decoding, it preserves fine details and consistency without requiring retraining of existing systems.

May 4 – May 10 (15)

Normalizing Trajectory Models

May 8, 2026

Jiatao Gu, Tianrong Chen, Ying Shen et al.

NTM enables fast image generation (4 steps) while preserving exact likelihood calculation—something previous fast diffusion methods couldn't do—by using normalizing flows for each denoising step instead of simple Gaussian assumptions.

This paper introduces Normalizing Trajectory Models (NTM), a new approach for fast image generation that compresses diffusion sampling from many steps to just four. Unlike existing fast methods that lose the ability to calculate exact probabilities, NTM maintains a mathematically exact likelihood while generating high-quality images, making it useful for both generation and evaluation.

efficiency, architecture, training

EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction

May 8, 2026

Wei Yu, Yunhang Qian

State space models offer a practical alternative to transformers for event-based image reconstruction, achieving better results with linear computational complexity instead of quadratic, making high-resolution processing feasible.

EmambaIR uses a new type of neural network architecture (state space models) to reconstruct clear images from event camera data.

Apr 27 – May 3 (19)

HyCOP: Hybrid Composition Operators for Interpretable Learning of PDEs

May 1, 2026

Jinpai Zhao, Nishant Panda, Yen Ting Lin et al.

Composing interpretable numerical and learned modules with learned policies outperforms monolithic neural operators on PDEs, generalizes better to out-of-distribution cases, and lets you swap components (like boundary conditions) without retraining.

HyCOP learns to solve PDEs by composing simple, interpretable modules (like advection and diffusion) rather than training a single neural network. It learns a policy that decides which module to apply and for how long based on the current state, enabling better generalization to new scenarios and easier transfer to different problems.

reasoning, architecture, efficiency

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

May 1, 2026

Siyuan Huang, Xiaoye Qu, Yafu Li et al.

PVM solves a fundamental problem in vision-language models where visual understanding degrades during long text generation by creating a separate, always-accessible pathway to visual information—improving reasoning tasks with minimal added parameters.

Large vision-language models struggle when generating long text because visual information gets diluted by accumulated text tokens. This paper introduces Persistent Visual Memory (PVM), a lightweight add-on module that maintains direct access to visual embeddings throughout generation, preventing the model from losing sight of the image as it produces longer outputs.

Apr 20 – Apr 26 (11)

Operational Feature Fingerprints of Graph Datasets via a White-Box Signal-Subspace Probe

Apr 24, 2026

Yuchen Xiong, Swee Keong Yeap, Zhen Hong Ban

You can diagnose what graph datasets require and why GNNs work by replacing learned message passing with interpretable signal components—this white-box approach is competitive with black-box models while revealing which graph properties (smoothing, raw features, class geometry) matter most.

This paper introduces WG-SRC, a transparent method for understanding what graph neural networks learn on node classification tasks.

evaluation, architecture

A Scale-Adaptive Framework for Joint Spatiotemporal Super-Resolution with Diffusion Models

Apr 23, 2026

Max Defez, Filippo Quarenghi, Mathieu Vrac et al.

A single neural network architecture can handle multiple super-resolution scales by adapting just three hyperparameters (noise schedule, context length, and mass conservation), eliminating the need to train separate models for each upscaling factor.

This paper presents a flexible deep-learning framework for video super-resolution that works across different spatial and temporal upscaling factors without retraining from scratch.

architecture

Apr 13 – Apr 19 (13)

Geometric regularization of autoencoders via observed stochastic dynamics

Apr 17, 2026

Sean Hill, Felix X.-F. Ye

By enforcing geometric consistency in autoencoders through tangent-bundle penalties, you can reduce errors in learned dynamical systems by 50-70%, making reduced models reliable for predicting rare events like molecular transitions.

This paper solves a key problem in learning reduced models of complex dynamical systems: how to build accurate low-dimensional simulators from high-dimensional data. The authors use geometric constraints from data covariance to train autoencoders that preserve the underlying manifold structure, enabling better prediction of long-term system behavior like transition times between metastable states.

architecture, training, reasoning

FL-MHSM: Spatially-adaptive Fusion and Ensemble Learning for Flood-Landslide Multi-Hazard Susceptibility Mapping at Regional Scale

Apr 17, 2026

Aswathi Mundayatt, Jaya Sreevalsan-Nair

Combining multiple machine learning approaches with spatial awareness—rather than using one uniform model across an entire region—significantly improves predictions of natural hazard risks and reveals how different geographic areas are affected by different environmental factors.

Apr 6 – Apr 12 (16)

ANTIC: Adaptive Neural Temporal In-situ Compressor

Apr 10, 2026

Sandeep S. Cranganore, Andrei Bodnar, Gianluca Galleti et al.

Neural compression combined with smart temporal sampling can reduce physics simulation storage by orders of magnitude, making exabyte-scale HPC data manageable without sacrificing scientific accuracy.

ANTIC is a compression system that reduces storage needs for massive physics simulations by intelligently selecting which time snapshots to save and compressing spatial data using neural networks. It works during simulation rather than after, enabling petabyte-scale datasets to be stored efficiently while preserving physics accuracy.

efficiency, architecture, data

Integrated electro-optic attention nonlinearities for transformers

Apr 10, 2026

Luis Mickeler, Kai Lion, Alfonso Nardi et al.

Optical nonlinear computation can eliminate a key latency bottleneck in transformers without sacrificing accuracy, opening a path to faster inference through specialized hardware.

Researchers use optical hardware (lithium niobate modulators) to speed up the Softmax and Sigmoid functions in transformers, which are computational bottlenecks despite being a tiny fraction of operations. The system maintains accuracy even with aggressive quantization and works at very high speeds, suggesting optical components could accelerate transformer inference in hybrid hardware setups.

Mar 30 – Apr 5 (9)

The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling

Apr 3, 2026

Takuya Shiba

For robot learning systems, discrete action tokenization creates a hard ceiling on performance gains from better vision models—you need to increase action representation capacity, not just encoder quality, to see improvements.

This paper explains why upgrading vision encoders in robot learning models doesn't always improve performance. The key issue is the 'Compression Gap': when robot actions are represented as discrete tokens (like a limited vocabulary), the token codebook becomes an information bottleneck that prevents improvements from better vision encoders from helping.

architecture, scaling, efficiency
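
The bottleneck described in the Compression Gap summary follows from a simple counting bound: a sequence of discrete tokens cannot carry more bits than its length times the log of the codebook size. The sketch below computes that bound; the function name and the example numbers are hypothetical and not taken from the paper.

```python
import math

def action_bits_upper_bound(codebook_size, tokens_per_action):
    """Maximum information, in bits, that one tokenized action can carry."""
    return tokens_per_action * math.log2(codebook_size)

# Hypothetical setting: 7 action tokens drawn from a 256-entry codebook.
bound = action_bits_upper_bound(256, 7)
```

Whatever the vision encoder extracts, the action output cannot express more than this bound, which is why the summary points to action representation capacity rather than encoder quality.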

Gradient Boosting within a Single Attention Layer

Apr 3, 2026

Saleh Sargolzaei

Attention can be improved by treating it like gradient boosting: a second attention pass with separate projections learns to correct the first pass's mistakes, boosting performance without major architectural changes.

This paper improves transformer attention by adding a second pass that corrects the first pass's errors, similar to how gradient boosting works in machine learning. The method uses a gated correction mechanism and achieves better language modeling performance than standard attention with minimal computational overhead.

architecture, efficiency, reasoning
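
A rough sketch of the two-pass idea from the Gradient Boosting entry: a second attention pass with its own projections produces a correction that is added through a gate. Feeding the residual `x - base` to the second pass, the per-channel gate, and all names are assumptions for illustration. The causal mask needed for language modeling is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    """Plain single-head attention over a (seq_len, d) input."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def boosted_attention(x, first, second, gate):
    """first and second are (Wq, Wk, Wv) tuples of (d, d) matrices.

    gate is a per-channel vector in [0, 1]. Using x - base as the
    second pass's input is a hypothetical choice of error signal.
    """
    base = attention(x, *first)
    residual = x - base                        # what the first pass failed to reproduce
    correction = attention(residual, *second)  # second pass with separate projections
    return base + gate * correction
```

With the gate at zero the layer falls back to standard attention, so the correction can only change the output where the gate opens.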

MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking Data

May 21, 2026

Amir Mousavi, Mohammad Sadegh Sirjani, Erfan Nourbakhsh et al.

Mamba's linear-complexity architecture enables real-time cognitive load monitoring from noisy eye-tracking signals on wearable devices—a practical alternative to Transformers for temporal sensor data with frequent gaps.

MambaGaze uses a bidirectional Mamba neural network to assess cognitive load from eye-tracking data in real-time. It handles missing data from eye blinks and tracking failures by explicitly encoding uncertainty, and runs efficiently on edge devices like smartglasses for applications like driver monitoring.

architecture, efficiency, applications

EvoStruct: Bridging Evolutionary and Structural Priors for Antibody CDR Design via Protein Language Model Adaptation

May 20, 2026

Mansoor Ahmed, Sujin Lee, Umar Khayaz et al.

Combining evolutionary knowledge from language models with 3D structural constraints solves vocabulary collapse in antibody design, achieving 16% better sequence accuracy and 2.3x more amino acid diversity than structure-only methods.

EvoStruct fixes a critical problem in AI-designed antibodies: neural networks trained on 3D structures alone forget important amino acid patterns from evolution. The method combines a pre-trained protein language model (which knows evolutionary patterns) with structural information, using a special adapter to merge both sources of knowledge.

architecture, training, applications

Velocityformer: Broken-Symmetry-Matched Equivariant Graph Transformers for Cosmological Velocity Reconstruction

May 20, 2026

Tilman Tröster, David Mirkovic, Veronika Oehl et al.

Matching a model's architectural symmetries to the actual symmetries present in your data—not just the underlying physics—significantly improves performance and data efficiency.

Velocityformer is a specialized neural network that reconstructs galaxy velocities from survey data to improve cosmological measurements. By designing the model to match the asymmetric structure of real observations (where one direction—the line of sight—is special), it achieves 35% better accuracy than traditional methods and works well even with very limited training data.

architecture, reasoning, applications

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

May 18, 2026

Yuxiang Huang, Nuno M. T. Gonçalves, Federico Alvetreti et al.

DashAttention enables efficient long-context processing by combining adaptive sparse selection with differentiable training, outperforming fixed-sparsity methods while maintaining gradient flow through both attention stages.

DashAttention improves how language models handle long documents by using a smarter two-stage attention mechanism. Instead of always selecting the same number of relevant tokens, it adaptively picks different amounts based on what each query needs, while keeping the entire process trainable. This achieves full-attention quality with 75% fewer computations.

efficiency, architecture
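
The DashAttention summary does not say how tokens are selected. As a stand-in, the sketch below uses sparsemax (Martins and Astudillo, 2016), a well-known mapping that is differentiable almost everywhere and produces a different number of nonzero weights depending on the scores. It illustrates adaptive sparsity in general and should not be read as the paper's mechanism.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of a score vector onto the probability simplex.

    Unlike softmax, the output can contain exact zeros, and how many
    entries stay nonzero depends on the scores rather than a fixed k.
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, z.size + 1)
    in_support = 1 + ks * z_sorted > cumsum   # sorted entries that stay nonzero
    k = ks[in_support][-1]                    # support size
    tau = (cumsum[k - 1] - 1) / k             # threshold subtracted from all scores
    return np.maximum(z - tau, 0.0)
```

A sharply peaked score vector keeps a single token, while a flat one keeps all of them, which is the behavior a fixed top-k rule cannot provide.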

Code as Agent Harness

May 18, 2026

Xuying Ning, Katherine Tieu, Dongqi Fu et al.

Code is becoming the primary substrate for building reliable, verifiable AI agents. Understanding code as agent harness—the infrastructure layer—is essential for building systems that can plan, remember, use tools, and coordinate across multiple agents.

This survey examines how code serves as the operational foundation for AI agents—not just as output, but as the infrastructure that enables agents to reason, act, model environments, and verify their own behavior.

agents, architecture, reasoning

Actionable World Representation

May 18, 2026

Kunqi Xu, Jitao Li, Jianglong Ye et al.

By explicitly modeling object state changes as a learnable manifold, WorldString provides a unified way to represent how objects respond to actions—bridging the gap between perception and control for physical world models.

WorldString is a neural architecture that learns to represent how real-world objects change state over time by processing point clouds or video data. It creates a digital twin of objects that captures their actionable properties, serving as a building block for world models that can predict and interact with the physical world.

architecture, reasoning

PIXLRelight: Controllable Relighting via Intrinsic Conditioning

May 18, 2026

Miguel Farinha, Ronald Clark

By conditioning on intrinsic image properties (albedo and shading) extracted from both photos and 3D renders, you can achieve photorealistic relighting with full PBR lighting control while staying fast enough for practical use.

PIXLRelight is a fast neural relighting method that lets you change lighting in photos using physically-based rendering controls. It decomposes images into intrinsic components (albedo, shading, residuals) and uses these to condition a transformer model, enabling realistic lighting adjustments in under 0.1 seconds per image without per-image optimization.

multimodal, architecture, efficiency

Semantic Generative Tuning for Unified Multimodal Models

May 18, 2026

Songsong Yu, Yuxin Chen, Ying Shan et al.

Using segmentation as a generative training task bridges the gap between visual understanding and generation in multimodal models, improving both capabilities simultaneously rather than training them separately.

This paper shows how to train unified multimodal models (that do both image understanding and generation) more effectively by using image segmentation as a training task. Instead of training understanding and generation separately, the authors use segmentation to align both capabilities, improving the model's ability to understand images and generate them accurately.

multimodal, training, architecture

Eradicating Negative Transfer in Multi-Physics Foundation Models via Sparse Mixture-of-Experts Routing

May 14, 2026

Ellwil Sharma, Arastu Sharma

Sparse mixture-of-experts routing can solve the problem of conflicting physics domains in foundation models by automatically routing different physics problems to specialized experts while maintaining shared knowledge for universal principles.

This paper tackles negative transfer in multi-physics AI models—where training on different physics problems simultaneously hurts performance. The authors propose Shodh-MoE, which uses sparse expert routing to let different parts of the model specialize in different physics regimes (like fluid dynamics vs. porous media flows) while sharing knowledge where it helps.

architecture, scaling, efficiency

Elastic Attention Cores for Scalable Vision Transformers

May 12, 2026

Alan Z. Song, Yinjie Chen, Mu Nan et al.

You can build efficient vision transformers by routing all patch interactions through a small set of learned core tokens instead of using all-to-all attention, achieving linear complexity without sacrificing performance.

This paper proposes VECA, a vision transformer that replaces quadratic all-to-all attention with linear-time attention using learned "core" tokens as communication hubs. Instead of every patch attending to every other patch, all patches only interact through a small set of learned cores, reducing computation from O(N²) to O(N) while maintaining competitive accuracy on vision tasks.

architecture, efficiency, scaling
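
The core-token routing in the VECA entry can be sketched as two small attention steps: cores gather from patches, then patches read back from the cores. Learned projections are left out, and this gather-then-broadcast form is an assumed simplification rather than the paper's exact layer.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def core_attention(patches, cores):
    """Route patch-to-patch interaction through M core tokens.

    patches: (N, d) patch embeddings; cores: (M, d) core tokens.
    Returns the updated patches plus both attention maps.
    """
    d = patches.shape[1]
    gather = softmax(cores @ patches.T / np.sqrt(d))              # (M, N): cores read from patches
    cores_updated = gather @ patches                              # (M, d)
    broadcast = softmax(patches @ cores_updated.T / np.sqrt(d))   # (N, M): patches read from cores
    return broadcast @ cores_updated, gather, broadcast
```

Both attention maps have N x M entries, so for a fixed number of cores the cost grows linearly in the number of patches instead of quadratically.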

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

May 12, 2026

Sagi Ahrac, Noya Hochwald, Mor Geva

Routers in sparse mixture-of-experts models work best when they maintain geometric alignment with their experts—understanding this coupling can improve routing stability and reduce the need for complex auxiliary losses.

This paper reveals that routers in Sparse Mixture-of-Experts models learn a geometric relationship with their experts: router weights and expert weights receive gradients along the same directions, causing them to specialize together.

architecture, training, efficiency

Solve the Loop: Attractor Models for Language and Reasoning

May 12, 2026

Jacob Fein-Ashley, Paria Rashidinejad

Attractor Models make iterative refinement practical by using implicit differentiation to solve fixed points, enabling smaller models (27M-770M parameters) to outperform much larger ones on reasoning and language tasks without the training instability of traditional recurrent architectures.

This paper introduces Attractor Models, which improve on looped Transformers by using implicit differentiation to solve for fixed points in latent representations.

architecture, reasoning, efficiency
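
To make "solving for a fixed point" in the Attractor Models entry concrete, here is a minimal sketch that iterates a layer until its output stops changing. The toy layer, its weight scaling, and the plain iteration are assumptions; the paper's solver and its implicit differentiation step are not shown.

```python
import numpy as np

def solve_fixed_point(f, z0, tol=1e-8, max_iter=500):
    """Iterate z <- f(z) until successive iterates differ by less than tol."""
    z = z0
    for step in range(1, max_iter + 1):
        z_next = f(z)
        if np.linalg.norm(z_next - z) < tol:
            return z_next, step
        z = z_next
    return z, max_iter

# Toy layer: scaling W to spectral norm 0.5 makes the map a contraction,
# so the iteration is guaranteed to converge.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
W = 0.5 * A / np.linalg.norm(A, 2)
x = rng.normal(size=6)
layer = lambda z: np.tanh(W @ z + x)
z_star, steps = solve_fixed_point(layer, np.zeros(6))
```

Implicit differentiation then computes gradients at `z_star` directly, without backpropagating through every iteration of the loop.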

Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

May 12, 2026

Guinan Su, Yanwu Yang, Xueyan Li et al.

By training models to handle multiple parallel computation streams instead of sequential message exchanges, you can build faster, more responsive AI agents that can act while thinking and react to new information without waiting for previous operations to complete.

This paper proposes Multi-Stream LLMs, which replace the single sequential message stream in current language models with multiple parallel streams for inputs, outputs, and reasoning. This allows models to read and write simultaneously, think while acting, and process different types of information in parallel—addressing fundamental bottlenecks in how AI agents currently operate.

architecture, agents, training

Fast Byte Latent Transformer

May 8, 2026

Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz et al.

Byte-level models can now generate 50% faster by predicting multiple bytes in parallel instead of one at a time, making them practical for real-world use without sacrificing quality.

Byte-level language models match token-based models but generate slowly because they produce one byte at a time. This paper introduces three faster variants: BLT-D uses diffusion to generate multiple bytes per step, BLT-S uses local drafting with verification, and BLT-DV combines both. All reduce memory bandwidth costs by over 50% during generation.

efficiency, architecture
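
Drafting with verification, mentioned for BLT-S in the Fast Byte Latent Transformer entry, can be illustrated with a generic greedy draft-and-verify loop. This is a simplified sketch of the general technique, not the paper's algorithm: `draft_next` and `target_next` are hypothetical callables, and a real system would verify all drafted bytes in one parallel forward pass rather than one call per byte.

```python
def draft_and_verify(prefix, draft_next, target_next, n_draft):
    """Propose n_draft bytes with a cheap model, keep the verified prefix.

    draft_next and target_next each map a sequence to its next byte.
    On the first mismatch the target's byte is used and drafting stops.
    """
    proposal = list(prefix)
    for _ in range(n_draft):
        proposal.append(draft_next(proposal))

    accepted = list(prefix)
    for byte in proposal[len(prefix):]:
        expected = target_next(accepted)
        if byte != expected:
            accepted.append(expected)   # correct the first wrong byte, then stop
            break
        accepted.append(byte)
    return accepted
```

Every accepted byte matches what the target model would have produced on its own, so output quality is unchanged while several bytes can be committed per verification round.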

ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation

May 7, 2026

Omar El Khalifi, Thomas Rossi, Oscar Fossey et al.

You can control both character motion and camera angles in video generation by using a two-phase conditioning approach that prioritizes geometric consistency, without needing to train new models.

ActCam enables precise control over both actor motion and camera movement in AI-generated videos without requiring training. It works with existing video generation models by providing carefully sequenced guidance: first using pose and depth information to establish scene structure, then refining details with pose-only guidance.

multimodal, applications, architecture

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

May 7, 2026

Minbin Huang, Han Shi, Chuanyang Zheng et al.

You don't need separate expert sets per layer in MoE models—a shared expert pool with independent routers works better and uses fewer parameters, suggesting the standard per-layer expert allocation is unnecessarily wasteful.

UniPool replaces the standard Mixture-of-Experts design where each layer has its own expert set with a single shared pool of experts accessed by all layers. This reduces redundancy and allows expert parameters to grow sublinearly with model depth while improving performance and reducing parameter count by 30-60% compared to standard MoE.

architecture, efficiency, scaling
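
The UniPool design of one shared expert pool with a separate router per layer can be sketched as follows. Expert shapes, the tanh expert, top-2 routing, and the pool size in the parameter count are all assumptions chosen for illustration.

```python
import numpy as np

def route_top_k(x, router_w, k):
    """Return indices and normalized weights of the top-k experts for one token."""
    logits = x @ router_w                          # (num_experts,)
    top = np.argsort(logits)[-k:][::-1]            # k largest logits, best first
    w = np.exp(logits[top] - logits[top].max())
    return top, w / w.sum()

def shared_pool_forward(x, pool, routers, k=2):
    """pool: list of (d, d) expert matrices shared by every layer.
    routers: one (d, num_experts) routing matrix per layer."""
    for router_w in routers:
        top, weights = route_top_k(x, router_w, k)
        x = x + sum(w * np.tanh(x @ pool[i]) for i, w in zip(top, weights))
    return x

def expert_param_counts(num_layers, experts_per_layer, pool_size, d):
    """Expert parameters for per-layer experts versus one shared pool."""
    return num_layers * experts_per_layer * d * d, pool_size * d * d
```

With per-layer experts the expert parameter count scales with depth, while in the shared design only the small routers are added per layer.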

EMO: Pretraining Mixture of Experts for Emergent Modularity

May 7, 2026

Ryan Wang, Akshita Bhagia, Sewon Min

By constraining tokens within the same document to share expert pools during pretraining, EMO creates naturally modular experts that specialize in semantic domains (math, code, etc.), enabling practical memory-efficient deployment without sacrificing performance.

EMO is a Mixture-of-Experts language model designed to work efficiently when you only need a subset of its capabilities. Instead of forcing all experts to activate for every input, EMO groups experts by document domain during training, so code-heavy documents use code experts, math documents use math experts, and so on.

architecture, efficiency, training

Edge-specific signal propagation on mature chromophore-region 3D mechanism graphs for fluorescent protein quantum-yield prediction

May 7, 2026

Yuchen Xiong, Swee Keong Yeap, Steven Aw Yoong Kit

Local 3D structure around a protein's light-emitting center matters more than overall sequence for predicting brightness—and you can build interpretable models by explicitly encoding which atoms contact which chromophore regions.

This paper predicts how bright fluorescent proteins will be by analyzing their 3D structure around the light-emitting chromophore region. Instead of just looking at protein sequences, the method builds a graph of how atoms and chemical groups physically contact the chromophore, then uses machine learning to predict brightness.

evaluation, architecture

Taming Outlier Tokens in Diffusion Transformers

May 6, 2026

Xiaoyu Wu, Yifei Wang, Tsu-Jui Fu et al.

Outlier tokens in diffusion transformers aren't just extreme values but represent corrupted local information; controlling them with register tokens significantly improves image generation quality.

This paper identifies and fixes a problem in Diffusion Transformers where certain tokens develop unusually high values that degrade image quality. The authors show this happens in both the image encoder and the generation model itself, and propose Dual-Stage Registers—a technique using learnable tokens to stabilize these problematic values and improve image generation.

architecture, efficiency, evaluation

Estimating the expected output of wide random MLPs more efficiently than sampling

May 6, 2026

Wilson Wu, Victor Lecomte, Michael Winer et al.

You can estimate a wide MLP's expected output more efficiently than sampling by directly computing activation distributions layer-by-layer using mathematical tools, which is particularly useful for detecting tail risks.

This paper presents a mathematical method to estimate what a randomly initialized neural network will output on average, without actually running data through it. Instead of sampling (the standard approach), the authors use statistical tools like cumulants and Hermite expansions to track how activations behave at each layer.

efficiency, evaluation, architecture

Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer

May 6, 2026

Alexander Hsu, Zhaiming Shen, Wenjing Liao et al.

Transformer attention can act as a feature learner for nonlinear functions during in-context learning, and this capability can be theoretically analyzed with concrete error bounds—bridging the gap between empirical success and mathematical understanding.

This paper explains how transformers perform in-context learning for nonlinear regression tasks. The researchers show that transformer attention mechanisms can automatically create nonlinear features (like polynomials or splines) from examples in the prompt, enabling the model to solve complex regression problems without updating weights.

reasoning, architecture, evaluation

Geometry-Aware State Space Model: A New Paradigm for Whole-Slide Image Representation

May 6, 2026

Enhui Chai, Sicheng Chen, Tianyi Zhang et al.

Representing pathology image features in complementary geometric spaces (hyperbolic + Euclidean) with efficient sequence modeling enables more accurate whole-slide image analysis by capturing both tissue hierarchy and cellular details.

This paper presents BatMIL, a new approach for analyzing whole-slide images (gigapixel pathology scans) by representing tissue features in dual geometric spaces—hyperbolic for hierarchical structures and Euclidean for local details.

architecture, multimodal, efficiency

Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures

May 5, 2026

Evangelos Ntavelis, Sean Wu, Mohamad Shahbazi et al.

Feed-forward 3D reconstruction from multi-view images can match or exceed optimization-based methods while being much faster, and UV-parameterization lets you train with many high-resolution views without memory explosion.

HeadsUp reconstructs detailed 3D head models from multiple camera views using an efficient neural network that compresses images into a compact representation, then decodes them into 3D Gaussians (mathematical shapes). The method scales to thousands of subjects and works on new people without extra optimization, enabling applications like generating new identities and animating expressions.

architecture, multimodal

From Intent to Execution: Composing Agentic Workflows with Agent Recommendation

May 5, 2026

Kishan Athrey, Ramin Pishehvar, Brian Riordan et al.

Automating agent selection in multi-agent systems using retrieval-based matching and LLM re-ranking improves reliability and scalability compared to manual composition, especially when a critique agent validates the full workflow.

This paper presents an automated framework for building multi-agent systems that replaces manual steps with AI-driven composition. It uses an LLM planner to break down user requests into tasks, then automatically selects the best agents from registries using a two-stage retrieval system (fast retriever + LLM re-ranker), with a critique agent validating the entire plan.

agents, architecture, evaluation
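
The two-stage agent selection described in this entry (fast retriever, then re-ranker) can be sketched with an embedding shortlist followed by a slower scoring pass. Using cosine similarity for the first stage is an assumption, and `rerank_score` is a hypothetical callable standing in for the LLM re-ranker.

```python
import numpy as np

def retrieve_then_rerank(task_vec, agent_vecs, rerank_score, top_n=3):
    """Stage 1: shortlist agents by cosine similarity to the task.
    Stage 2: reorder the shortlist with a slower, more careful scorer."""
    a = agent_vecs / np.linalg.norm(agent_vecs, axis=1, keepdims=True)
    t = task_vec / np.linalg.norm(task_vec)
    shortlist = np.argsort(a @ t)[::-1][:top_n]    # most similar agents first
    return sorted(shortlist.tolist(), key=rerank_score, reverse=True)
```

The expensive scorer only sees `top_n` candidates, so the cost of re-ranking stays fixed as the agent registry grows.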

Feature-Augmented Transformers for Robust AI-Text Detection Across Domains and Generators

May 5, 2026

Mohamed Mady, Johannes Reschke, Björn Schuller

AI-text detectors need feature augmentation and careful threshold calibration to work reliably across different domains and generators; linguistic features like readability are crucial for robustness under distribution shift.

This paper tackles the challenge of detecting AI-generated text across different domains and AI models. Researchers trained transformer-based detectors and found that while they perform nearly perfectly on their training data, they struggle when tested on new domains or text from different AI generators.

evaluation, safety, architecture

AIs and Humans with Agency

May 4, 2026

David Mumford

Building AI systems with genuine agency isn't about making LLMs act alone—it requires new architectures where AI and humans co-develop plans and actions together for specific real-world situations.

This paper examines what agency means for both humans and AI systems, noting that human agency develops gradually through brain maturation while current LLMs struggle to act autonomously. The author argues that effective AI agency requires a fundamentally different architecture where AI systems and humans jointly plan and execute actions together in real-world contexts.

agents, architecture, reasoning

RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution

May 1, 2026

Arunabh Srivastava, Mohammad A. Khojastepour et al.

To make LLMs reliable at executing plans, you need to enforce structure through explicit control constructs, validate outputs against derived constraints at each step, and dynamically route to the best execution method (reasoning, tools, or code).

RunAgent is a system that helps AI agents execute multi-step plans written in natural language by converting them into a structured format with explicit control flow (like IF statements and loops).

agents, reasoning, architecture

Characterizing the Expressivity of Local Attention in Transformers

May 1, 2026

Jiaoda Li, Ryan Cotterell

Local attention isn't just an efficiency trick—it fundamentally expands what a transformer can learn by recognizing different patterns than global attention, and combining both types creates the most powerful model.

This paper explains why local attention (where tokens only look at nearby predecessors instead of all previous tokens) sometimes improves transformer performance. The authors prove that local attention expands what patterns a transformer can recognize, and combining local and global attention together creates the most expressive model.

architecture, reasoning, evaluation
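
The difference between local and global attention in this entry comes down to which positions each token may look at. A minimal sketch of the two causal masks follows; the convention that a window of size `window` includes the current token is an assumption.

```python
import numpy as np

def causal_mask(n):
    """Global causal attention: token i may attend to every j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def local_causal_mask(n, window):
    """Local causal attention: token i may attend to j with i - window < j <= i."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)
```

For example, with `n=5` and `window=2` the last token sees only positions 3 and 4, while under the global mask it sees all five.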

LLM as Clinical Graph Structure Refiner: Enhancing Representation Learning in EEG Seizure Diagnosis

Apr 30, 2026

Lincan Li, Zheng Chen, Yushun Dong

LLMs can effectively refine noisy graph structures in medical signal analysis by identifying and removing redundant connections, improving both seizure detection accuracy and model interpretability.

This paper uses large language models to improve how neural networks analyze EEG brain signals for seizure detection. The key innovation is treating LLMs as 'graph refiners'—they remove unnecessary connections in a graph representation of EEG data, making the model more accurate and interpretable.

architectureevaluation

Beyond Gaussian Bottlenecks: Topologically Aligned Encoding of Vision-Transformer Feature Spaces

Apr 30, 2026

Andrew Bond, Ilkin Umut Melanlioglu, Erkut Erdem et al.

Using geometrically-aligned latent spaces (hyperspheres instead of Gaussian distributions) in autoencoders preserves 3D structure and physics better than standard approaches, which matters for building world models that understand real 3D scenes.

This paper proposes S²VAE, a new type of autoencoder that uses hyperspherical (spherical geometry) latent representations instead of traditional Gaussian ones to better preserve 3D geometry and camera motion in visual world models.

architecturemultimodalefficiency

Do Sparse Autoencoders Capture Concept Manifolds?

Apr 30, 2026

Usha Bhalla, Thomas Fel, Can Rager et al.

SAEs don't cleanly capture continuous concept structures—they fragment them across features in ways that hide geometric relationships, suggesting interpretability research needs to look for groups of features rather than individual directions.

Sparse autoencoders (SAEs) are popular tools for finding interpretable features in AI models, but this paper shows they struggle to capture concepts organized as continuous geometric structures (manifolds).

architectureevaluation

DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures

Apr 30, 2026

Sigma Jahan, Saurabh Singh Rajput, Tushar Sharma et al.

When transformer models fail silently, DEFault++ can pinpoint exactly which component is broken and why—helping developers fix issues 46% faster than manual debugging.

DEFault++ automatically detects, categorizes, and diagnoses faults in transformer models by analyzing internal component behavior. It identifies 12 types of transformer-specific faults and pinpoints root causes among 45 mechanisms, helping developers fix silent failures that don't trigger runtime errors.

evaluationsafetyarchitecture

FiLMMeD: Feature-wise Linear Modulation for Cross-Problem Multi-Depot Vehicle Routing

Apr 30, 2026

Arthur Corrêa, Paulo Nascimento, Samuel Moniz

A single neural model can now handle multiple variants of complex routing problems by dynamically adapting to different constraints, suggesting that multi-task learning with adaptive conditioning is more practical than building separate models for each problem type.

FiLMMeD is a neural model that solves 24 different multi-depot vehicle routing problems (a logistics optimization task) using a single unified architecture.

architecturetrainingapplications

A Unified Framework of Hyperbolic Graph Representation Learning Methods

Apr 30, 2026

Sofía Pérez Casulo, Marcelo Fiori, Bernardo Marenco et al.

Hyperbolic embeddings can represent complex hierarchical networks in low dimensions, but practitioners now have a standardized framework to fairly compare methods and understand their trade-offs before choosing one for their application.

This paper presents a unified framework for hyperbolic graph embedding methods—techniques that represent networks in hyperbolic space to capture hierarchical structures efficiently. The framework consolidates multiple embedding approaches under one interface, enabling fair comparison and reproducible evaluation on real-world networks for tasks like link prediction and node classification.

architectureevaluation

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

Apr 29, 2026

Gongbo Zhang, Wen Wang, Ye Tian et al.

Cross-architecture distillation for diffusion models is now practical: you can compress large diffusion LLMs into tiny ones (13x smaller) while maintaining performance, even when teacher and student have completely different designs.

This paper introduces TIDE, a framework for distilling knowledge from large diffusion language models into much smaller ones across different architectures. Unlike previous distillation methods that work within a single model type, TIDE handles cases where teacher and student models have different designs, attention mechanisms, and tokenizers.

trainingefficiencyarchitecture

Hyper Input Convex Neural Networks for Shape Constrained Learning and Optimal Transport

Apr 29, 2026

Shayan Hundrieser, Insung Kong, Johannes Schmidt-Hieber

HyCNNs are a more parameter-efficient way to build neural networks that must output convex functions, requiring exponentially fewer parameters than previous methods while maintaining theoretical guarantees.

This paper introduces Hyper Input Convex Neural Networks (HyCNNs), a new neural network architecture that guarantees convex outputs while using far fewer parameters than existing methods.

architectureefficiency
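
For readers unfamiliar with the idea of a network that is convex in its input, here is a minimal sketch of the classic input-convex construction (non-negative weights on the hidden path, a convex non-decreasing activation, and a direct affine path from the input). It is not the HyCNN architecture from the paper, whose details the summary does not give; all names and sizes are assumptions.

```python
import random


def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def make_icnn(in_dim, hidden, seed=0):
    """Random parameters for a one-hidden-layer input-convex network."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

    return {
        "Wx0": rand(hidden, in_dim),  # input -> hidden, unconstrained
        "Wx1": rand(1, in_dim),       # input -> output skip path, unconstrained
        # Hidden -> output weights are forced non-negative; this keeps convexity.
        "Wz1": [[abs(w) for w in row] for row in rand(1, hidden)],
    }


def icnn(params, x):
    """Scalar output that is convex in x."""
    z = relu(matvec(params["Wx0"], x))  # ReLU of affine maps: convex in x
    # Non-negative combination of convex terms plus an affine term stays convex.
    return matvec(params["Wz1"], z)[0] + matvec(params["Wx1"], x)[0]
```

A quick sanity check is the midpoint inequality: for any two inputs `a` and `b`, `icnn(params, midpoint)` should not exceed the average of `icnn(params, a)` and `icnn(params, b)`.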

Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

Apr 29, 2026

Andrea Agazzi, Giuseppe Bruno, Eloy Mosig García et al.

Noise in transformers can synchronize token behavior and stabilize learning—a counterintuitive finding that suggests randomness plays a constructive role in how these models process sequences.

This paper proves that transformer models with finite depth and width converge to a stochastic particle system as they scale. The researchers show that token evolution follows a continuous-time process with noise-driven synchronization, meaning random perturbations actually help tokens align rather than diverge.

scalingarchitecturetraining

Multiple Additive Neural Networks for Structured and Unstructured Data

Apr 29, 2026

Janis Mohr, Jörg Frochte

MANN combines gradient boosting with neural networks instead of trees, enabling a single framework to handle structured and unstructured data while outperforming XGBoost and reducing hyperparameter sensitivity.

This paper presents Multiple Additive Neural Networks (MANN), which replaces decision trees in gradient boosting with shallow neural networks. MANN works with both structured data and images/audio by using CNNs and capsule networks as feature extractors, and shows better accuracy than XGBoost on standard benchmarks while being more robust to hyperparameter choices.

trainingarchitectureefficiency

KAYRA: A Microservice Architecture for AI-Assisted Karyotyping with Cloud and On-Premise Deployment

Apr 29, 2026

Attila Pintér, Javier Rico, Attila Répai et al.

Containerized microservice architectures enable clinical AI systems to meet real-world constraints like data privacy while maintaining high performance, and this approach is ready for real-world deployment (TRL 6).

KAYRA is an AI system for analyzing chromosomes (karyotyping) in clinical labs using a pipeline of deep learning models. It can run in the cloud or on-premise to handle privacy requirements, and achieves 98.91% accuracy on chromosome segmentation—significantly better than existing commercial systems.

applicationsarchitectureevaluation

Toward a Functional Geometric Algebra for Natural Language Semantics

Apr 28, 2026

James Pustejovsky

Geometric algebra expands n-dimensional embeddings into a 2^n-dimensional structure that can represent both base concepts and their interactions in a single unified framework, potentially solving long-standing problems in how neural networks compose meanings.

This paper proposes using geometric algebra (Clifford algebras) instead of conventional linear algebra as the mathematical foundation for representing word and sentence meanings in AI.

architecturereasoning
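
The 2^n figure above follows from counting basis blades: a geometric algebra over n basis vectors has one basis element per subset of those vectors. The short sketch below enumerates them; the function name is illustrative and the code says nothing about how the paper uses the algebra.

```python
from itertools import combinations


def basis_blades(n):
    """All basis blades over n basis vectors, as tuples of vector indices.

    The empty tuple is the scalar, 1-tuples are the base vectors, and longer
    tuples are the higher-grade blades that can encode interactions.
    """
    return [c for k in range(n + 1) for c in combinations(range(1, n + 1), k)]


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, len(basis_blades(n)))
```

The exponential growth is also why such representations are usually kept to small n or restricted to a few grades in practice.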

TSN-Affinity: Similarity-Driven Parameter Reuse for Continual Offline Reinforcement Learning

Apr 28, 2026

Dominik Żurek, Kamil Faber, Marcin Pietron et al.

Architectural parameter reuse guided by task similarity is a memory-efficient alternative to replay-based continual learning in offline RL, enabling better multi-task performance without storing historical data.

This paper presents TSN-Affinity, a method for continual offline reinforcement learning that learns multiple tasks sequentially from pre-collected datasets without forgetting previous tasks.

trainingarchitectureefficiency

Learning to Rotate: Temporal and Semantic Rotary Encoding for Sequential Modeling

Apr 27, 2026

Hailing Cheng, Daqi Sun, Xinyu Lu

Positional encodings in Transformers can be made learnable and signal-dependent by treating the rotation manifold as a separate dimension from token embeddings, unlocking better performance without significant overhead.

This paper treats the rotation space in Rotary Positional Embeddings (RoPE) as learnable rather than fixed, introducing SIREN-RoPE to encode temporal and semantic information into rotations.

architectureefficiency
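
As background for the entry above, this is a minimal sketch of standard, fixed RoPE: each pair of dimensions is rotated by an angle proportional to the token position, so the dot product of two rotated vectors depends only on their relative offset. The learnable, signal-dependent rotations of SIREN-RoPE are not shown, and the `base` value is the commonly used default rather than something stated in the summary.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `position`.

    `vec` must have even length. Pair number i/2 uses frequency base**(-i/d).
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]
    k = [0.5, -1.0, 2.0, 0.0]
    # Same relative offset (2) at different absolute positions.
    print(dot(rope(q, 5), rope(k, 3)))
    print(dot(rope(q, 12), rope(k, 10)))
```

The two printed scores agree up to floating-point error, which is the relative-position property that makes the rotation angles a natural place to inject extra signals.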

Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

Apr 27, 2026

Parsa Ashrafi Fashi, Utkarsh Saxena, Mehdi Rezagholizadeh et al.

You can efficiently extend pretrained LLMs to handle much longer contexts by converting them to hybrid architectures without retraining from scratch—this is more practical than building new models entirely.

This paper presents HyLo, a method to convert pretrained Transformer language models into hybrid architectures that combine Transformers with efficient linear sequence models (like Mamba2). By reusing existing model checkpoints and adding long-context training, HyLo extends context length by 32x while reducing memory use by 90%, enabling 2M-token processing on standard hardware.

architectureefficiencyscaling

On the algebra of Koopman eigenfunctions and on some of their infinities

Apr 23, 2026

Zahra Monfared, Saksham Malhotra, Sekiya Hajime et al.

You can generate many more Koopman eigenfunctions from a few computed ones by treating them as an algebraic group, enabling better system representations from sparse or incomplete data.

This paper shows how to compute more eigenfunctions of the Koopman operator—a mathematical tool for analyzing dynamical systems—by using algebraic relationships between a small set of known eigenfunctions.

reasoningarchitecture

Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows

Apr 23, 2026

Anuj Sadani, Deepak Kumar

Tool schema injection is a hidden operational cost in agent systems—Tool Attention solves this by filtering irrelevant tools and deferring full schema loading, reducing per-turn tokens from ~47k to ~2.4k without sacrificing capability.

This paper introduces Tool Attention, a middleware system that dramatically reduces the token overhead from injecting tool schemas into LLM agents. By using smart filtering (based on task intent and access rules) and lazy loading of full schemas only when needed, it cuts tool-related tokens by 95% in multi-tool deployments, making agentic workflows more efficient and cost-effective.

agentsefficiencyarchitecture
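
The two ideas in the summary above, gating tools before they reach the prompt and loading full schemas lazily, can be sketched roughly as follows. This is not the paper's middleware: the catalog, the role names, and the word-overlap scoring (a stand-in for intent-based filtering) are all hypothetical.

```python
from functools import lru_cache

# Hypothetical catalog: only the one-line summary is cheap enough to keep around.
CATALOG = {
    "search_flights": {"summary": "search flights between cities",
                       "allowed_roles": {"travel", "admin"}},
    "book_hotel": {"summary": "book hotel room",
                   "allowed_roles": {"travel", "admin"}},
    "run_sql": {"summary": "run sql query on warehouse",
                "allowed_roles": {"admin"}},
}


@lru_cache(maxsize=None)
def load_full_schema(name):
    """Stand-in for an expensive fetch of a tool's full JSON schema."""
    return {"name": name,
            "description": CATALOG[name]["summary"],
            "parameters": {"type": "object"}}


def gate_tools(query, role, top_k=1):
    """Keep tools the role may call, ranked by word overlap with the query."""
    words = set(query.lower().split())
    scored = []
    for name, meta in CATALOG.items():
        if role not in meta["allowed_roles"]:
            continue  # access rule: never surface tools the caller cannot use
        overlap = len(words & set(meta["summary"].split()))
        if overlap:
            scored.append((overlap, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


def build_tool_context(query, role, top_k=1):
    """Load full schemas only for the tools that pass the gate."""
    return [load_full_schema(name) for name in gate_tools(query, role, top_k)]
```

For example, `build_tool_context("find flights to Lisbon", "travel")` loads a single schema instead of all three, which is where the token savings would come from in a larger catalog.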

Quotient-Space Diffusion Models

Apr 23, 2026

Yixian Xu, Yusong Wang, Shengjie Luo et al.

Quotient-space diffusion models reduce learning complexity for symmetric generative tasks by formally accounting for group symmetries, enabling better molecular and protein structure generation without learning redundant symmetric variations.

This paper introduces a mathematical framework for diffusion models that accounts for symmetries in generative tasks, particularly molecular structure generation. By modeling distributions on quotient spaces (which treat symmetric objects as equivalent), the approach simplifies learning compared to existing symmetry-aware methods and guarantees correct sampling of target distributions.

architecturereasoningapplications

Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

Apr 22, 2026

Yiming Bian, Joshua M. Akey

You can now run exact attention on billion-token sequences on a single GPU by streaming chunks through memory—no approximation needed, just smarter scheduling of the computation.

This paper solves the memory problem that prevents long-context language models from running on single GPUs. Instead of approximating attention (which loses accuracy), it mathematically decomposes attention into smaller independent chunks that can be processed one at a time, streaming results without keeping everything in memory at once.

efficiencyarchitecture
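
To illustrate how attention can be computed exactly while only one chunk of keys and values is held at a time, the sketch below uses the well-known online-softmax accumulation for a single query. The summary does not describe Stream-CQSA's own decomposition or scheduler, so this should be read as a generic example of exact streamed attention, not as the paper's method; scaling by the square root of the dimension is omitted for brevity.

```python
import math


def attention_full(q, keys, values):
    """Reference: softmax attention for one query with everything in memory."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_streamed(q, keys, values, chunk=2):
    """Same result, visiting keys/values one chunk at a time."""
    dim = len(values[0])
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), chunk):
        ks = keys[start:start + chunk]
        vs = values[start:start + chunk]
        scores = [sum(a * b for a, b in zip(q, k)) for k in ks]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, vs):
            w = math.exp(s - new_max)
            denom += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / denom for a in acc]
```

Only the running maximum, the normalizer, and one output-sized accumulator persist between chunks, so peak memory depends on the chunk size and not on the sequence length.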

VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

Apr 21, 2026

Jean Mercat, Sedrick Keh, Kushal Arora et al.

For roboticists and ML engineers: VLA Foundry eliminates pipeline incompatibility issues by providing a unified training stack for building embodied AI models, with released weights and open-source code making it practical to train and deploy robotic policies.

VLA Foundry is an open-source framework that unifies training of language models, vision-language models, and vision-language-action models in one codebase. Instead of stitching together separate pipelines, it provides end-to-end control from language pretraining through action fine-tuning, enabling researchers to train robotic manipulation policies from scratch or using pretrained backbones.

architecturetrainingapplications

Benign Overfitting in Adversarial Training for Vision Transformers

Apr 21, 2026

Jiaming Zhang, Meng Ding, Shaopeng Fu et al.

Vision Transformers can be made adversarially robust through standard adversarial training, and surprisingly, overfitting doesn't necessarily hurt robustness if the signal-to-noise ratio is favorable—a finding that challenges conventional wisdom about the robustness-generalization tradeoff.

This paper provides the first theoretical analysis of adversarial training in Vision Transformers, showing that under certain conditions, ViTs can achieve strong robustness against adversarial attacks even when overfitting occurs.

safetyarchitecturereasoning

Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views

Apr 21, 2026

Feihao Fang, My T. Thai, Yuanyuan Lei

LLMs have a hidden logical reasoning layer that works the same way whether reasoning in English or symbolic notation—you can exploit this to improve reasoning by steering the model toward this shared space without retraining.

This paper discovers that LLMs contain a shared internal logical reasoning space that aligns natural language and symbolic reasoning. By analyzing how the model's internal activations correlate across both reasoning styles, researchers created a method to steer the model toward better logical reasoning without additional training, improving accuracy on reasoning tasks by up to 11%.

reasoningarchitecture

Sessa: Selective State Space Attention

Apr 20, 2026

Liubomyr Horbatko

Sessa's hybrid architecture enables power-law decay of information loss over distance (O(ℓ^-β)) instead of exponential or linear decay, making it more effective for long-context language modeling while staying competitive on standard benchmarks.

Sessa combines attention mechanisms with state-space model feedback paths to improve how models retrieve information from long contexts.

architectureefficiencyreasoning

ConforNets: Latents-Based Conformational Control in OpenFold3

Apr 20, 2026

Minji Lee, Colin Kalicki, Minkyu Jeon et al.

By learning to transform AF3's internal representations, ConforNets can reliably generate multiple protein conformations and transfer conformational changes between proteins—solving a major limitation of structure prediction models that typically predict only one dominant state.

ConforNets is a method for controlling protein conformations in AlphaFold3 by applying learnable transformations to latent representations. Rather than perturbing inputs or using ad hoc tricks, it modulates the internal representations that AF3 uses to predict protein structures, enabling both discovery of alternate conformations and transfer of conformational changes across related proteins.

architecturetrainingapplications

This study develops a deep learning system to predict flood and landslide risks across large regions by combining multiple prediction approaches (Early Fusion, Late Fusion, and Mixture of Experts).

evaluationarchitectureapplications

Information Router for Mitigating Modality Dominance in Vision-Language Models

Apr 17, 2026

Seulgi Kim, Mohit Prabhushankar, Ghassan AlRegib

Fixing modality dominance requires enriching missing information, not just redirecting attention—MoIR routes complementary information between modalities to create more balanced, information-dense representations before the language model processes them.

Vision-language models often rely too heavily on one modality (vision or text), ignoring useful information from the other. This paper proposes MoIR, a method that identifies weak or ambiguous tokens in one modality and enriches them with information from the stronger modality before processing.

multimodalarchitectureevaluation

Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization

Apr 17, 2026

Habibeh Naderi, Behrouz Haji Soleimani, Stan Matwin

When combining audio and text, align them indirectly through a shared joint embedding rather than directly contrasting them, and use structural consistency losses to prevent one modality from dominating the learned representation.

HILBERT is a multimodal framework that learns document-level representations from long audio-text sequences in low-resource settings.

multimodaltrainingarchitecture

How Embeddings Shape Graph Neural Networks: Classical vs Quantum-Oriented Node Representations

Apr 16, 2026

Nouhaila Innan, Antonello Rosato, Alberto Marchisio et al.

When choosing node embeddings for graph neural networks, quantum-oriented representations offer consistent improvements on molecular and structural datasets, but classical baselines are still optimal for social graphs—the choice depends heavily on your data type.

This paper compares different ways to represent nodes in graph neural networks, testing classical embeddings against quantum-inspired alternatives on standard benchmarks.

evaluationarchitecture

Stability and Generalization in Looped Transformers

Apr 16, 2026

Asher Labovich

Looped transformers need recall mechanisms combined with outer normalization to reliably generalize to harder problems; without these, they memorize training solutions and fail at test time.

This paper analyzes looped transformers—models that iterate multiple times at test time to solve harder problems—by studying when they generalize versus memorize.

architecturereasoningtraining

A Nonlinear Separation Principle: Applications to Neural Networks, Control and Learning

Apr 16, 2026

Anand Gokhale, Anton V. Proskurnikov, Yu Kawano et al.

You can mathematically guarantee that recurrent neural networks and control systems remain stable by checking specific matrix conditions, enabling you to design more reliable AI systems with fewer parameters.

This paper develops mathematical tools for designing stable neural networks and control systems. It introduces a 'nonlinear separation principle' that guarantees stability when combining controllers and observers, derives conditions for ensuring neural networks behave predictably, and shows how to use these insights to build efficient deep learning models that maintain stability while learning.

architecturetrainingreasoning

VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models

Apr 16, 2026

Huawei Ji, Yuanhao Sun, Yuan Jin et al.

Automatically optimizing token pruning configurations across layers can achieve better speed-accuracy trade-offs than fixed pruning strategies, and progressive multi-layer pruning outperforms single-layer approaches for vision-language models.

This paper introduces VisPCO, a framework that automatically finds the best way to remove unnecessary visual tokens from vision-language models to speed them up.

efficiencyarchitectureevaluation

AdaSplash-2: Faster Differentiable Sparse Attention

Apr 16, 2026

Nuno Gonçalves, Hugo Pitorro, Vlad Niculae et al.

Sparse attention can now match or beat FlashAttention-2's speed when processing long contexts, making it practical for building models that handle extended input sequences without the quadratic memory cost.

AdaSplash-2 makes sparse attention faster by using a histogram-based trick to quickly compute the normalizer needed for differentiable sparse attention. This lets transformers handle long contexts efficiently—matching softmax speed at short lengths while being significantly faster for long sequences.

efficiencyarchitecturetraining

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Apr 15, 2026

Tianshuo Yang, Guanyu Chen, Yutian Chen et al.

Decoupling high-level reasoning from low-level control in robotic systems preserves the planning abilities of large vision-language models while improving execution accuracy on physical manipulation tasks.

HiVLA splits robot manipulation into two parts: a vision-language model that plans tasks and identifies objects, and a specialized action model that executes precise movements. This separation lets robots reason about complex tasks while staying accurate at fine-grained control, outperforming end-to-end approaches on real robot tasks.

agentsmultimodalarchitecture

ID and Graph View Contrastive Learning with Multi-View Attention Fusion for Sequential Recommendation

Apr 15, 2026

Xiaofan Zhou, Kyumin Lee

Combining sequential and graph representations through multi-view contrastive learning and attention fusion significantly improves sequential recommendation accuracy, showing that different data perspectives can be effectively integrated for better predictions.

This paper proposes MVCrec, a recommendation system that learns from user interaction histories by combining two complementary views: sequential ID-based patterns and graph-based relational structures. Using contrastive learning across both views and a multi-view attention mechanism to fuse them, the approach achieves significant improvements on benchmark datasets without requiring external data.

applicationsarchitecturetraining

Neural architectures for resolving references in program code

Apr 15, 2026

Gergő Szalay, Gergely Zsolt Kovács, Sándor Teleki et al.

Custom neural architectures designed for reference resolution in code can dramatically improve both robustness and scalability compared to generic sequence-to-sequence models, with practical benefits for code analysis tasks.

This paper tackles the problem of resolving references in program code by framing it as a sequence-to-sequence task. The authors create synthetic benchmarks for reference rewriting and propose new neural architectures that significantly outperform standard models, handling sequences 10x longer than baselines.

architecture

CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations

Apr 14, 2026

Benzhao Tang, Shiyu Yang

You can detect log anomalies 2.7% more accurately than existing methods while completely skipping decompression—just analyze the compressed bytes directly using a purpose-built deep learning model.

CLAD detects anomalies in system logs by analyzing compressed byte streams directly, without decompressing them first. It uses a specialized neural architecture that recognizes how normal logs compress into predictable patterns while anomalies create irregular ones, achieving 99% accuracy while eliminating expensive preprocessing steps.

efficiencyevaluationarchitecture

Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

Apr 9, 2026

Haolei Xu, Haiwen Hong, Hongxing Li et al.

Multimodal MoE models suffer from 'routing distraction'—visual inputs cause the routing mechanism to activate the wrong experts for reasoning. A simple intervention that guides expert selection toward domain experts significantly improves visual reasoning performance.

This paper identifies a problem in multimodal mixture-of-experts models where they can see images correctly but fail at reasoning tasks that they solve easily with text.

architecturemultimodalreasoning

PSI: Shared State as the Missing Layer for Coherent AI-Generated Instruments in Personal AI Agents

Apr 9, 2026

Zhiyuan Wang, Erzhen Hu, Mark Rucker et al.

Shared state is the critical missing layer for turning individually generated AI tools into a unified personal computing system—new tools can automatically integrate with existing ones through a common state contract.

PSI is a shared-state architecture that connects independently generated AI tools into a coherent personal computing environment. Instead of creating isolated apps from natural language requests, PSI lets these tools share state through a central bus, enabling them to reason together and sync actions across chat and GUI interfaces.

agentsarchitectureapplications

OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

Apr 9, 2026

Haoxi Zeng, Qiankun Liu, Yi Bin et al.

By aligning DINO's semantic features with SAM's structural priors through specialized encoder-decoder modules, you can achieve both semantic generalization and precise edge detection for segmentation tasks without predefined categories.

This paper tackles open-vocabulary segmentation—identifying and outlining objects in images even when they're not in the training set—by combining two foundation models: DINO for semantic understanding and SAM for precise edge detection.

multimodalarchitectureevaluation

HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment

Apr 9, 2026

Changdao Chen

Using hypergraphs instead of traditional graphs lets you capture complex, multi-way relationships in facial expressions, while bidirectional state space models efficiently process long video sequences—enabling accurate fatigue detection on edge devices.

This paper proposes HST-HGN, a neural network for detecting driver fatigue from video that combines hypergraph networks (to model complex facial relationships) with state space models (to efficiently track facial changes over time). The approach balances accuracy with computational efficiency, making it practical for real-time deployment in vehicles.

architectureefficiencyapplications

Small-scale photonic Kolmogorov-Arnold networks using standard telecom nonlinear modules

Apr 9, 2026

Luca Nogueira Calçado, Sergei K. Turitsyn, Egor Manuylovich

Photonic neural networks can perform complex nonlinear computations using only standard telecom components, achieving competitive accuracy with far fewer parameters than software models and maintaining performance under realistic hardware constraints.

Researchers built small photonic neural networks using standard telecom components that perform nonlinear computations entirely in optics, without converting to electronics.

architectureefficiencyapplications

Exploring Temporal Representation in Neural Processes for Multimodal Action Prediction

Apr 9, 2026

Marco Gabriele Fedozzi, Yukie Nagai, Francesco Rea et al.

Adding explicit positional time encoding to neural process models significantly improves their ability to generalize to unseen action sequences in robotic action prediction tasks.

This paper applies Conditional Neural Processes to predict robot actions from partial observations, inspired by how humans understand others' movements. The authors improve an existing multimodal prediction model by adding better temporal encoding, enabling robots to forecast actions over longer sequences and refine predictions as new information arrives.

reasoningmultimodalarchitecture

What a Comfortable World: Ergonomic Principles Guided Apartment Layout Generation

Apr 9, 2026

Piotr Nieciecki, Aleksander Plocharski, Przemyslaw Musialski

You can improve generative models by encoding domain expertise as differentiable loss functions during training—this forces the model to learn better design principles rather than just mimicking flawed real-world data.

This paper improves AI-generated apartment layouts by embedding architectural design principles into a transformer model. Instead of just learning from real floor plans (which often have poor ergonomics), the model is guided by differentiable loss functions based on established design standards, resulting in layouts that are more livable and follow better architectural practices.

architecturetrainingapplications

Fast Spatial Memory with Elastic Test-Time Training

Apr 8, 2026

Ziqiao Ma, Xueyang Yu, Haoyu Zhen et al.

Test-time training can now process arbitrarily long sequences by maintaining an anchor state that balances learning new information with remembering old information, solving the catastrophic forgetting problem that limited previous approaches to single large chunks.

This paper improves test-time training for 3D/4D scene reconstruction by preventing the model from forgetting previous information as it processes long sequences. The key innovation is using an elastic prior (inspired by elastic weight consolidation) to stabilize learning during inference, allowing the model to handle longer sequences in smaller chunks without catastrophic forgetting.

efficiencyarchitecturereasoning
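
The elastic prior mentioned above is described as inspired by elastic weight consolidation, so a rough way to picture it is a quadratic pull toward an anchor state added to each update. The sketch below shows that generic form only; the paper's exact formulation is not given in the summary, and `importance`, `strength`, and `lr` are assumed names.

```python
def elastic_penalty(params, anchor, importance, strength):
    """Quadratic cost for drifting away from the anchor, weighted per parameter."""
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor, importance)
    )


def elastic_step(params, task_grad, anchor, importance, strength, lr):
    """One gradient step on task loss plus the elastic penalty."""
    return [
        p - lr * (g + strength * f * (p - a))
        for p, g, a, f in zip(params, task_grad, anchor, importance)
    ]


if __name__ == "__main__":
    params = [1.0, -2.0]
    anchor = [0.0, -2.0]
    importance = [1.0, 1.0]
    # With no task gradient, the update only pulls the drifted weight back.
    print(elastic_step(params, [0.0, 0.0], anchor, importance, strength=1.0, lr=0.1))
```

A larger `strength` favors remembering earlier chunks, while a smaller one lets the model adapt faster to the current chunk.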

Region-Graph Optimal Transport Routing for Mixture-of-Experts Whole-Slide Image Classification

Apr 8, 2026

Xin Tian, Jiuliu Lu, Ephraim Tsalik et al.

Mixture-of-Experts routing in medical image analysis works better when constrained by optimal transport to prevent expert collapse and when routing decisions respect spatial tissue neighborhoods.

This paper proposes ROAM, a new method for analyzing gigapixel medical images (whole-slide images) by using specialized expert networks that intelligently route different tissue regions to appropriate experts.

architecturemultimodalapplications

Graph Neural ODE Digital Twins for Control-Oriented Reactor Thermal-Hydraulic Forecasting Under Partial Observability

Apr 8, 2026

Akzhol Almukhametov, Doyeong Lim, Rui Hu et al.

Graph neural networks coupled with neural ODEs can forecast complex physical systems under partial observability at speeds enabling real-time control, while learning physically meaningful relationships from limited experimental data.

This paper develops a physics-informed neural network that combines graph neural networks with neural differential equations to predict thermal-hydraulic states in nuclear reactors, even at locations without sensors. The model runs 105× faster than simulation and can be adapted to real experimental data with minimal retraining, enabling real-time control of advanced reactors.

architectureefficiency

DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models

Apr 7, 2026

Zhengming Yu, Li Ma, Mingming He et al.

Video diffusion models can recover lost dynamic range information by learning to synthesize plausible scene radiance in over- and underexposed regions, making LDR-to-HDR conversion practical without paired training data.

DiffHDR converts standard video (8-bit LDR) to high dynamic range (HDR) by using a video diffusion model to intelligently fill in lost highlight and shadow details.

multimodalarchitectureapplications

Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement

Apr 7, 2026

Qimin Zhong, Hao Liao, Haiming Qin et al.

Multi-token prediction helps LLMs learn better world models than single-token prediction, but requires grounding in actual state representations to avoid learning shortcuts that violate real-world constraints.

This paper investigates whether large language models develop coherent internal world models by comparing next-token prediction with multi-token prediction. The authors propose LSE-MTP, a method that anchors token predictions to ground-truth hidden states to reduce hallucinations and improve the model's ability to learn structured representations of the world.

trainingreasoningarchitecture

Shot-Based Quantum Encoding: A Data-Loading Paradigm for Quantum Neural Networks

Apr 7, 2026

Basil Kyriacou, Viktoria Patapovich, Maniraman Periyasamy et al.

SBQE offers a practical data-loading method for near-term quantum machine learning that avoids deep circuits, works within hardware constraints, and achieves competitive accuracy by treating shot allocation as a learnable parameter.

This paper introduces Shot-Based Quantum Encoding (SBQE), a way to load data into quantum computers that suits current noisy hardware: it avoids deep encoding circuits and treats shot allocation as a learnable parameter.

architectureefficiencytraining

PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

Apr 7, 2026

David Picard, Nicolas Dufour, Lucas Degeorge et al.

You can replace attention with a linear-time polynomial mixer and get similar results with much faster inference—especially valuable for long sequences where attention becomes prohibitively expensive.

PoM replaces the expensive attention mechanism in transformers with a polynomial-based token mixer that runs in linear time instead of quadratic. It compresses all tokens into a learned polynomial representation, letting each token extract relevant context from this compact form.
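
The general shape of such a mixer can be sketched as follows: one pass compresses all tokens into a fixed-size state built from polynomial features, and a second pass lets each token read from that state, so cost grows linearly with sequence length. This is an assumption-laden toy, not the paper's formulation: learned projections are omitted, and `poly_features`, `linear_time_mixer`, and the elementwise gate are hypothetical choices.

```python
def poly_features(x, degree):
    """Elementwise powers x, x^2, ..., x^degree, concatenated into one list."""
    return [v ** p for p in range(1, degree + 1) for v in x]


def linear_time_mixer(tokens, degree=2):
    """Mix tokens through one shared summary state: O(n) rather than O(n^2)."""
    n = len(tokens)
    feats = [poly_features(x, degree) for x in tokens]
    # Pass 1: average every token's polynomial features into a fixed-size state.
    state = [sum(col) / n for col in zip(*feats)]
    # Pass 2: each token reads from the state, gated by its own features
    # (an illustrative choice of read-out, assumed for this sketch).
    return [[f * s for f, s in zip(feat, state)] for feat in feats]


tokens = [[1.0, 2.0], [3.0, 4.0]]
mixed = linear_time_mixer(tokens, degree=2)
print(mixed)
```

No token ever attends to another token directly; all interaction goes through `state`, whose size does not depend on the number of tokens.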

efficiencyarchitecturescaling

DSBD: Dual-Aligned Structural Basis Distillation for Graph Domain Adaptation

Apr 3, 2026

Yingxu Wang, Kunyu Zhang, Jiaxin Huang et al.

When adapting graph neural networks across domains, you need to explicitly handle changes in graph topology and structure—not just features—using techniques like topological moment matching and spectral calibration.

This paper tackles graph domain adaptation by addressing structural differences between source and target graphs, not just feature differences. It proposes DSBD, which learns a flexible structural basis that can be adapted across domains while preserving important graph properties like geometry and spectral characteristics, then trains a fresh neural network on this adapted structure.

architecturetraining

ActionParty: Multi-Subject Action Binding in Generative Video Games

Apr 2, 2026

Alexander Pondaven, Ziyi Wu, Igor Gilitschenski et al.

ActionParty is presented as the first video world model that can reliably control multiple independent agents in the same scene, a critical capability for simulating multi-player games and complex interactive environments.

ActionParty is a video diffusion model that can control multiple characters simultaneously in interactive game environments. Unlike existing models limited to single agents, it uses special 'subject state tokens' to track each character's state separately, allowing precise control of up to seven players at once while maintaining their identity and following their assigned actions correctly.

architecturemultimodalagents

Steerable Visual Representations

Apr 2, 2026

Jona Ruthardt, Manu Gaur, Deva Ramanan et al.

You can now guide vision models with text prompts to focus on non-obvious visual concepts while maintaining strong performance on generic vision tasks—without needing separate language-centric models.

This paper introduces steerable visual representations that can be guided by natural language to focus on specific objects or concepts in images.

multimodalarchitectureevaluation

go-$m$HC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices

Apr 2, 2026

Torque Dandachi, Sophia Diggs-Galligan

go-mHC enables efficient learned mixing of residual streams in transformers with a single tunable hyperparameter that trades off between speed and expressivity, potentially unlocking a new dimension for scaling model capacity.

This paper solves a mathematical problem in neural network design: how to efficiently mix information across different processing paths (residual streams) in transformers.
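
One standard fact behind the title helps here: squaring the entries of an orthogonal matrix yields a doubly stochastic matrix (an orthostochastic matrix), whose rows and columns each sum to 1. The sketch below checks this for a 2×2 rotation and uses the result to mix two toy residual streams. The paper's generalized construction and its parameterization are not reproduced; the function names and stream values are hypothetical.

```python
import math


def rotation(theta):
    """A 2x2 orthogonal (rotation) matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def orthostochastic(q):
    """Square each entry of an orthogonal matrix; rows and columns then sum to 1."""
    return [[v * v for v in row] for row in q]


def mix(matrix, streams):
    """Apply the mixing matrix across streams (each stream is a list of floats)."""
    width = len(streams[0])
    return [
        [sum(matrix[i][j] * streams[j][k] for j in range(len(streams)))
         for k in range(width)]
        for i in range(len(matrix))
    ]


m = orthostochastic(rotation(0.7))
streams = [[1.0, 2.0], [3.0, -4.0]]
mixed = mix(m, streams)
print(m)
print(mixed)
```

Because every column of `m` sums to 1, the sum over streams is preserved at each coordinate, which is one reason doubly stochastic mixing is attractive for residual paths.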

architectureefficiencyscaling

Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference

Apr 2, 2026

Dimitrios Danopoulos, Enrico Lupi, Michael Kagan et al.

HCCS replaces softmax's expensive exponential computation with a lightweight linear approximation calibrated per attention head, enabling 8-bit integer inference on edge hardware without sacrificing model accuracy.

This paper proposes Head-Calibrated Clipped-Linear Softmax (HCCS), a fast approximation of softmax designed for edge devices running small quantized AI models.
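
The summary does not give the exact HCCS formula, so the following is only a generic clipped-linear stand-in for softmax, written in floating point for readability rather than in integer arithmetic. The `slope` argument plays the role of a per-head calibration constant; it and the function name are assumptions made for this sketch.

```python
def clipped_linear_softmax(logits, slope, floor=0.0):
    """Softmax-like weights using a clipped line in place of exp.

    The weight is 1 at the largest logit, decreases linearly with the gap
    to that logit, and is clipped at `floor`.
    """
    m = max(logits)
    weights = [max(floor, 1.0 + slope * (x - m)) for x in logits]
    total = sum(weights)  # at least 1.0, since the max logit has weight 1
    return [w / total for w in weights]


probs = clipped_linear_softmax([2.0, 1.0, -3.0], slope=0.5)
print(probs)
```

The only operations are a max, a multiply-add, a clip, and one normalization, which is the kind of arithmetic that maps readily onto low-precision integer hardware.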

efficiencyarchitecture

Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

Apr 2, 2026

Chongjie Ye, Cheng Cao, Chuanyu Pan et al.

By unifying 2D and 3D generation in one model and leveraging plentiful 2D data as a structural constraint, you can train better 3D generators with limited 3D assets—no separate 2D-to-3D conversion pipeline needed.

Omni123 is a 3D foundation model that generates both 2D images and 3D objects from text by treating them as sequences of tokens. It uses abundant 2D image data as a guide to improve 3D generation, avoiding the need for scarce aligned text-image-3D datasets. The model cycles through different modalities (text→image→3D→image) to ensure consistency across all forms.

multimodalarchitecturedata

Crystalite: A Lightweight Transformer for Efficient Crystal Modeling

Apr 2, 2026

Tin Hadži Veljković, Joshua Rosenthal, Ivor Lončarić et al.

By combining efficient tokenization with geometry-aware attention, you can build crystal generation models that are both faster and more accurate than complex graph neural networks, making generative modeling of materials more practical.

Crystalite is a lightweight diffusion Transformer for generating crystal structures that uses two key innovations: a compact atom representation called Subatomic Tokenization and a Geometry Enhancement Module that encodes crystal geometry directly into the model's attention mechanism.

architectureefficiencyapplications