ThinkLLM


Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

326 papers · 11 this month · 12 topics
All · Efficiency 35 · Reasoning 35 · Multimodal 28 · Applications 28 · Evaluation 27 · Training 26 · Architecture 24 · Agents 24 · Safety 13 · Scaling 5 · Data 5 · Alignment 1

Mar 30 – Apr 5 (16)

ActionParty: Multi-Subject Action Binding in Generative Video Games

Apr 2, 2026

Alexander Pondaven, Ziyi Wu, Igor Gilitschenski et al.

This is the first video world model that can reliably control multiple independent agents in the same scene—a critical capability for simulating multi-player games and complex interactive environments.

ActionParty is a video diffusion model that can control multiple characters simultaneously in interactive game environments. Unlike existing models limited to single agents, it uses special 'subject state tokens' to track each character's state separately, allowing precise control of up to seven players at once while maintaining their identity and following their assigned actions correctly.

architecture · multimodal · agents

Steerable Visual Representations

Apr 2, 2026

Jona Ruthardt, Manu Gaur, Deva Ramanan et al.

You can now guide vision models with text prompts to focus on non-obvious visual concepts while maintaining strong performance on generic vision tasks—without needing separate language-centric models.

This paper introduces steerable visual representations that can be guided by natural language to focus on specific objects or concepts in images.

multimodal

Mar 23 – Mar 29 (14)

PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Mar 26, 2026

Xiaofeng Mao, Shaohao Rui, Kaining Ying et al.

You can train video models on short clips and generate much longer videos by using a three-tier memory strategy that compresses historical context without losing quality.

PackForcing solves the memory problem in video generation by compressing old frames intelligently—keeping early frames for context, heavily compressing middle frames, and preserving recent frames for smooth transitions. This lets models generate 2-minute videos on a single GPU after training only on 5-second clips, achieving 24x longer videos than training data.

efficiency · architecture · training
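The three-tier strategy above can be sketched in a few lines. This is a toy illustration of the idea, not the paper's implementation; `n_early`, `n_recent`, and `mid_stride` are hypothetical knobs, and real frames would be latent tensors rather than list items:

```python
def pack_context(frames, n_early=2, n_recent=4, mid_stride=4):
    """Three-tier context compression (sketch): keep early frames for
    global context, subsample the middle, keep recent frames intact."""
    if len(frames) <= n_early + n_recent:
        return list(frames)
    early = frames[:n_early]                              # anchor context
    middle = frames[n_early:len(frames) - n_recent][::mid_stride]  # heavy compression
    recent = frames[-n_recent:]                           # smooth transitions
    return early + middle + recent
```

The packed context stays roughly constant in size no matter how long the history grows, which is what lets short-clip training transfer to long sampling.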

Natural-Language Agent Harnesses

Mar 26, 2026

Linyue Pan, Lexiao Zou, Shuo Guo et al.

An agent's performance depends heavily on how you orchestrate its behavior—by making this orchestration readable and portable through natural language, you can reuse and improve agent designs much more easily.

This paper proposes a new way to design agent control systems by writing them in natural language instead of buried in code. The authors create Natural-Language Agent Harnesses (NLAHs) and a runtime system that executes these harnesses, making it easier to reuse, compare, and study how agents are controlled across different tasks.

Mar 16 – Mar 22 (21)

LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

Mar 20, 2026

Jiazheng Xing, Fei Du, Hangjie Yuan et al.

To generate videos with multiple people where each person's appearance stays consistent with their attributes, you need both better training data that captures identity-attribute relationships and model attention mechanisms designed to enforce those relationships.

LumosX improves personalized video generation by explicitly linking identities to their attributes. It uses a data pipeline with multimodal AI to extract subject relationships, then applies specialized attention mechanisms in diffusion models to ensure faces stay consistent with their assigned attributes across video frames.

multimodal · architecture · data

Kolmogorov-Arnold causal generative models

Mar 20, 2026

Alejandro Almodóvar, Mar Elizo, Patricia A. Apellániz et al.

You can build causal models that are both powerful and interpretable by using Kolmogorov-Arnold Networks as the building blocks for structural equations—enabling you to see exactly how variables influence each other.

This paper introduces KaCGM, a causal generative model that uses Kolmogorov-Arnold Networks to learn causal relationships in tabular data. Unlike black-box approaches, each causal mechanism is interpretable and can be visualized or converted to symbolic equations, making it suitable for high-stakes applications like healthcare where understanding *why* a model makes decisions matters.

Mar 9 – Mar 15 (11)

Towards Faithful Multimodal Concept Bottleneck Models

Mar 13, 2026

Pierre Moreau, Emeline Pineau Ferrand, Yann Choho et al.

Concept Bottleneck Models can now work reliably across text and images by jointly addressing concept detection and information leakage—enabling interpretable AI without sacrificing accuracy.

This paper introduces f-CBM, a framework for building interpretable multimodal AI models that make predictions through human-understandable concepts. The key innovation is solving two problems simultaneously: accurately detecting concepts and preventing 'leakage' (where irrelevant information sneaks into predictions).

multimodal · architecture

Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Mar 12, 2026

Fangfu Liu, Diankun Wu, Jiawei Chi et al.

Test-time training—updating model parameters on-the-fly during inference—enables better spatial reasoning from video by letting the model continuously organize and retain 3D spatial information rather than relying on fixed context windows.

This paper introduces Spatial-TTT, a system that helps AI models understand 3D spaces from continuous video streams by adapting and updating their internal parameters during inference. It combines efficient video processing with a spatial prediction mechanism and specialized training data to maintain accurate spatial understanding over long videos.

Feb 23 – Mar 1 (13)

Mode Seeking meets Mean Seeking for Fast Long Video Generation

Feb 27, 2026

Shengqu Cai, Weili Nie, Chao Liu et al.

Decoupling long-term coherence from local quality lets you generate minute-scale videos without needing massive amounts of long-form training data.

This paper solves a key problem in video generation: making long videos (minutes) that are both sharp and coherent. The trick is training two separate components—one learns long-term story structure from rare long videos, while another copies local quality from abundant short videos. This lets the model generate minute-long videos that look crisp and stay consistent throughout.

training · efficiency · architecture

Memory Caching: RNNs with Growing Memory

Feb 27, 2026

Ali Behrouz, Zeman Li, Yuan Deng et al.

Memory Caching lets RNNs scale their memory capacity with sequence length while staying faster than Transformers.

This paper fixes a major weakness of fast RNN models: they forget information too quickly because they have fixed-size memory. The authors introduce Memory Caching, which lets RNNs save snapshots of their memory as they process longer sequences. This gives RNNs the ability to remember more without becoming as slow as Transformers, creating a sweet spot between speed and accuracy.

architecture
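The snapshot idea behind Memory Caching can be illustrated with a toy linear recurrence. This is a sketch of the general mechanism, not the paper's architecture; `cache_every` is a hypothetical interval, and a real model would learn how to read from the cache:

```python
def rnn_with_memory_cache(tokens, cache_every=4):
    """Toy linear RNN: fixed-size state that decays old information,
    plus periodic snapshots so older context stays retrievable."""
    state = 0.0
    cache = []
    for t, x in enumerate(tokens, 1):
        state = 0.5 * state + x        # fast fixed-size recurrence
        if t % cache_every == 0:
            cache.append(state)        # cache grows with sequence length
    # a read-out can attend over the cached snapshots, not just the final state
    return state, cache
```

The cache grows sublinearly in the sequence length, which is what keeps the model faster than a Transformer while recovering long-range recall.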

go-$m$HC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices

Apr 2, 2026

Torque Dandachi, Sophia Diggs-Galligan

go-mHC enables efficient learned mixing of residual streams in transformers with a single tunable hyperparameter that trades off between speed and expressivity, potentially unlocking a new dimension for scaling model capacity.

This paper solves a mathematical problem in neural network design: how to efficiently mix information across different processing paths (residual streams) in transformers.

architecture · efficiency · scaling

Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference

Apr 2, 2026

Dimitrios Danopoulos, Enrico Lupi, Michael Kagan et al.

HCCS replaces softmax's expensive exponential computation with a lightweight linear approximation calibrated per attention head, enabling 8-bit integer inference on edge hardware without sacrificing model accuracy.

This paper proposes Head-Calibrated Clipped-Linear Softmax (HCCS), a fast approximation of softmax designed for edge devices running small quantized AI models.

efficiency · architecture
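The summary doesn't specify HCCS's exact formula, so here is a minimal sketch of the general idea: replace the exponential with a clipped linear ramp and normalize. The `alpha`/`beta` parameters are hypothetical stand-ins for the per-head calibration HCCS performs, and a real edge kernel would do this in integer arithmetic:

```python
def clipped_linear_softmax(scores, alpha=0.25, beta=1.0):
    """Softmax surrogate (sketch): clipped linear ramp instead of exp,
    shifted by the max score for stability, then normalized."""
    m = max(scores)
    w = [max(0.0, alpha * (s - m) + beta) for s in scores]  # no exponential
    z = sum(w)
    return [x / z for x in w]
```

Scores far below the maximum clip to exactly zero, so distant keys drop out entirely rather than receiving tiny exponential tails.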

Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

Apr 2, 2026

Chongjie Ye, Cheng Cao, Chuanyu Pan et al.

By unifying 2D and 3D generation in one model and leveraging plentiful 2D data as a structural constraint, you can train better 3D generators with limited 3D assets—no separate 2D-to-3D conversion pipeline needed.

Omni123 is a 3D foundation model that generates both 2D images and 3D objects from text by treating them as sequences of tokens. It uses abundant 2D image data as a guide to improve 3D generation, avoiding the need for scarce aligned text-image-3D datasets. The model cycles through different modalities (text→image→3D→image) to ensure consistency across all forms.

multimodal · architecture · data

Crystalite: A Lightweight Transformer for Efficient Crystal Modeling

Apr 2, 2026

Tin Hadži Veljković, Joshua Rosenthal, Ivor Lončarić et al.

By combining efficient tokenization with geometry-aware attention, you can build crystal generation models that are both faster and more accurate than complex graph neural networks, making generative modeling of materials more practical.

Crystalite is a lightweight diffusion Transformer for generating crystal structures that uses two key innovations: a compact atom representation called Subatomic Tokenization and a Geometry Enhancement Module that encodes crystal geometry directly into the model's attention mechanism.

architecture · efficiency · applications

Universal Hypernetworks for Arbitrary Models

Apr 2, 2026

Xuanfeng Zhou

A single fixed hypernetwork can generate weights for diverse architectures and tasks by using architecture/task descriptors as input, eliminating the need to retrain generators when switching between different model types.

This paper introduces Universal Hypernetworks (UHN), a single neural network that can generate weights for many different model architectures and tasks. Instead of building separate weight generators for each model type, UHN uses descriptors (text descriptions of architecture and task) to produce weights for any compatible model, working across vision, graphs, text, and math tasks.

architecture · training · efficiency

Universal YOCO for Efficient Depth Scaling

Apr 1, 2026

Yutao Sun, Li Dong, Tianzhu Ye et al.

You can scale LLM reasoning at inference time without exploding memory costs by combining efficient attention architectures with parameter sharing—YOCO-U shows this works better than either approach alone.

Universal YOCO combines a specialized decoder architecture with recursive computation to enable efficient test-time scaling in language models. By reusing parameters across multiple iterations in shallow layers while maintaining constant KV cache size, it achieves better reasoning capabilities without the computational overhead that typically comes with scaling inference-time compute.

efficiency · architecture · reasoning

LLM REgression with a Latent Iterative State Head

Apr 1, 2026

Yiheng Su, Matthew Lease

You can make LLMs predict continuous numeric values more efficiently by adding a tiny learned head that works with frozen representations, rather than decoding text or fine-tuning the entire model.

RELISH is a lightweight method for making LLMs predict numeric values directly from their internal representations. Instead of generating numbers as text, it uses a small learned component that iteratively refines a latent state through attention over token representations, then outputs a single number. It outperforms existing approaches while adding minimal parameters (0.01-0.04% overhead).

architecture · efficiency · applications

Learning and Generating Mixed States Prepared by Shallow Channel Circuits

Apr 1, 2026

Fangjun Hu, Christian Kokail, Milan Kornjača et al.

Quantum states in the trivial phase can be efficiently learned from measurements and regenerated using shallow circuits, providing a theoretical foundation for quantum generative models without needing the original preparation circuit.

This paper shows how to learn and generate quantum mixed states that belong to the 'trivial phase'—states preparable by shallow quantum circuits that preserve local reversibility. The algorithm learns from measurement data alone and outputs a shallow circuit that recreates the state, with polynomial sample complexity and runtime. The work also extends to classical diffusion models.

reasoning · training · architecture

Screening Is Enough

Apr 1, 2026

Ken M. Nakanishi

Screening attention removes the need for global competition among keys by using absolute relevance thresholds, achieving 40% parameter reduction and 3.2× faster inference compared to Transformers.

This paper introduces Multiscreen, a language model architecture that replaces standard softmax attention with a 'screening' mechanism. Instead of distributing attention weights across all keys, screening evaluates each key against a threshold to decide which ones are relevant, eliminating the need for keys to compete with each other.

architecture · efficiency · scaling
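A toy sketch of the screening idea, using scalar scores for brevity (the actual Multiscreen mechanism is surely richer): each key is tested against an absolute threshold `tau` instead of competing with other keys in a softmax, so relevance is decided locally per key:

```python
def screening_attention(q, keys, values, tau=0.0):
    """Screening sketch: keys pass or fail an absolute relevance test;
    no global normalization over the key set."""
    out, kept = 0.0, 0
    for k, v in zip(keys, values):
        s = q * k              # toy 1-D "dot product" score
        if s > tau:            # absolute threshold, not softmax competition
            out += s * v       # unnormalized local weight
            kept += 1
    return out, kept
```

Because a key's contribution doesn't depend on the other keys, adding or removing context entries never reshuffles existing attention weights.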

Adaptive Block-Scaled Data Types

Mar 30, 2026

Jack Cook, Hyemin S. Lee, Kathryn Le et al.

Adaptive block-scaled quantization can significantly reduce errors in 4-bit model compression by intelligently switching between data types per block, achieving better accuracy than fixed formats without extra storage cost.

This paper introduces adaptive quantization formats (IF4, IF3, IF6) that improve upon NVFP4 by dynamically choosing between floating-point and integer representations for each block of values. The approach uses an unused bit in NVFP4 to signal which format to use, reducing quantization errors and improving language model performance with minimal hardware overhead.

efficiency · training · architecture
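The format-switching idea can be illustrated with a toy example. The level grids below are invented for illustration and do not match NVFP4's actual encodings; the point is choosing, per block, whichever grid (uniform integer-style vs. FP-style with levels denser near zero) yields lower reconstruction error:

```python
def quantize_block(block, levels):
    """Scale a block to its max magnitude and snap each value
    to the nearest level, returning the dequantized block."""
    s = max(abs(x) for x in block) or 1.0
    return [min(levels, key=lambda l: abs(l - x / s)) * s for x in block]

def adaptive_quantize(block):
    """Per-block format choice (sketch): pick the grid with lower MSE."""
    int_levels = [i / 3.0 for i in range(-3, 4)]               # uniform spacing
    fp_levels = [-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0]       # denser near zero
    cands = {"int": quantize_block(block, int_levels),
             "fp": quantize_block(block, fp_levels)}
    errs = {k: sum((a - b) ** 2 for a, b in zip(block, v))
            for k, v in cands.items()}
    fmt = min(errs, key=errs.get)
    return fmt, cands[fmt]
```

Blocks whose values cluster near zero (with one outlier setting the scale) favor the FP-style grid, while evenly spread blocks favor the integer grid—the same intuition the paper exploits with its spare NVFP4 bit.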

On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Mar 30, 2026

Omer Dahary, Benaya Koren, Daniel Garibi et al.

You can increase diversity in generated images by applying repulsion forces in the transformer's attention channels during generation, without expensive optimization or visual artifacts.

This paper tackles the problem of text-to-image diffusion models producing visually similar outputs for the same prompt. The authors propose a method that applies 'repulsion' in the attention mechanism during image generation to encourage diverse outputs while maintaining quality and semantic accuracy.

architecture · efficiency · multimodal

RAD-AI: Rethinking Architecture Documentation for AI-Augmented Ecosystems

Mar 30, 2026

Oliver Aleksander Larsen, Mahyar T. Moghaddam

If you're building AI systems, standard software architecture documentation won't capture ML-specific risks like model drift or data dependencies—RAD-AI provides a structured way to document these for both compliance and team understanding.

RAD-AI extends existing architecture documentation frameworks (arc42 and C4 model) to handle AI systems, adding sections for probabilistic behavior, ML lifecycles, and data dependencies. It maps to EU AI Act compliance requirements and shows 93% coverage of regulatory documentation needs versus 36% for standard frameworks.

architecture · safety · applications

SAGAI-MID: A Generative AI-Driven Middleware for Dynamic Runtime Interoperability

Mar 30, 2026

Oliver Aleksander Larsen, Mahyar T. Moghaddam

LLMs can serve as runtime architectural components to solve schema interoperability problems dynamically, but code generation strategies outperform direct transformation and cost varies dramatically across models without matching accuracy gains.

SAGAI-MID is a middleware system that uses LLMs to automatically fix schema mismatches between different services and APIs at runtime, eliminating the need for manual adapter code. It combines structural analysis with LLM reasoning and includes safety checks to handle real-world integration challenges across REST, GraphQL, and IoT systems.

architecture · agents · applications

GPU-Accelerated Optimization of Transformer-Based Neural Networks for Real-Time Inference

Mar 30, 2026

Soutrik Mukherjee, Sangwhan Cha

Hybrid precision (FP32 for softmax/normalization, FP16 for linear layers) delivers 2x speedup with zero accuracy loss—a practical strategy for deploying transformers in latency-critical applications.

This paper optimizes transformer models (BERT and GPT-2) for fast GPU inference using mixed-precision techniques—keeping sensitive operations in full precision while using lower precision for others. The system achieves 64x speedup over CPU and sub-10ms latency while maintaining numerical accuracy and eliminating instability issues.

efficiency · architecture
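The precision split can be emulated in pure Python by round-tripping values through IEEE half precision (the `struct` format `'e'`). This is a sketch of the general recipe, not the paper's CUDA implementation:

```python
import math
import struct

def to_fp16(x):
    """Round-trip a float through IEEE 754 half precision to emulate FP16 storage."""
    return struct.unpack('e', struct.pack('e', x))[0]

def hybrid_linear_softmax(x, w):
    """Hybrid precision sketch: linear layer in FP16, softmax in FP32."""
    # Elementwise "linear layer" in FP16: throughput-critical, tolerant of rounding.
    h = [to_fp16(to_fp16(xi) * to_fp16(wi)) for xi, wi in zip(x, w)]
    # Softmax in full precision: exp/normalization is numerically sensitive.
    m = max(h)
    e = [math.exp(v - m) for v in h]
    z = sum(e)
    return [v / z for v in e]
```

Keeping only the exponentials and normalization in FP32 is what avoids the overflow and cancellation issues that plague fully half-precision softmax.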

A Unified Memory Perspective for Probabilistic Trustworthy AI

Mar 26, 2026

Xueji Zhao, Likai Pei, Jianbo Liu et al.

Memory access, not computation speed, limits performance in probabilistic AI systems—hardware designers need to optimize for both data delivery and randomness generation together, not separately.

This paper examines how memory systems become the performance bottleneck in AI systems that need probabilistic computation for safety and robustness. It proposes treating deterministic data access as a special case of stochastic sampling, creating a unified framework to analyze memory efficiency.

efficiency · safety · architecture

Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming

Mar 26, 2026

Yunus Talha Erzurumlu, Jiyong Kwag, Alper Yilmaz

Treating geo-localization as a sequential zooming problem over maps, rather than image retrieval, achieves better results and avoids the limitations of contrastive learning approaches that struggle with landmark visibility mismatches.

This paper tackles cross-view geo-localization—matching street-view photos to satellite maps to pinpoint a camera's location without GPS. Instead of the standard approach of comparing images in a shared embedding space, the authors propose a new method that zooms progressively into a satellite map, making sequential decisions to narrow down the location.

reasoning · architecture · evaluation

Polynomial Speedup in Diffusion Models with the Multilevel Euler-Maruyama Method

Mar 25, 2026

Arthur Jacot

You can sample from diffusion models much faster by combining predictions from small and large networks—the method achieves the same accuracy as running the largest network once, instead of many times.

This paper speeds up diffusion model sampling by using multiple neural networks of different sizes together. Instead of running one large network many times, the method runs a small fast network many times and a large accurate network just a few times, reducing total computation while maintaining quality. Tests show up to 4x speedup on image generation.

efficiency · architecture
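The summary describes combining a cheap network run many times with an expensive one run a few times. A generic multilevel-correction sketch (not the paper's Euler-Maruyama scheme) conveys the cost accounting: evaluate the cheap model on many points and correct with the expensive-minus-cheap difference estimated on only a few:

```python
def multilevel_estimate(f_small, f_large, xs_many, xs_few):
    """Multilevel sketch: cheap surrogate everywhere, expensive
    correction on a small subset of points."""
    base = sum(f_small(x) for x in xs_many) / len(xs_many)       # cheap, many evals
    corr = sum(f_large(x) - f_small(x) for x in xs_few) / len(xs_few)  # few evals
    return base + corr
```

If the small and large networks agree closely, the correction term has low variance, so a handful of expensive evaluations suffices—the source of the polynomial speedup.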

EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction

Mar 25, 2026

Falong Fan, Yi Xie, Arnis Lektauers et al.

Dynamic graph-based feature connections outperform fixed spatial neighborhoods for reconstructing deformable surgical scenes, especially when dealing with occlusions and low-texture surfaces.

EndoVGGT improves 3D reconstruction of soft tissues during surgery by using a graph neural network module that dynamically connects similar tissue regions across the image, even when instruments block the view or surfaces are shiny. This approach recovers the true shape of deformable tissues better than previous methods and works on new surgical videos it hasn't seen before.

architecture

VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

Mar 24, 2026

Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas et al.

You can make vision-language models faster without losing visual detail by being selective about which attention layers process images—use efficient cross-attention for context and add self-attention layers only when the task complexity demands it.

VISOR improves vision-language model efficiency by selectively attending to visual information rather than compressing images. Instead of reducing visual tokens, it uses sparse cross-attention and dynamically chosen self-attention layers to process high-resolution details only when needed, reducing computation while maintaining performance on complex visual reasoning tasks.

efficiency · multimodal · architecture

InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting

Mar 24, 2026

Duc Vu, Kien Nguyen, Trong-Tung Nguyen et al.

You can dramatically improve few-step diffusion inpainting by initializing the noise with semantic information from the input image, rather than random noise—no retraining required.

InverFill speeds up image inpainting by using a smart noise initialization technique that preserves semantic information from the original image. Instead of training new models, it works with existing fast text-to-image models to fill in masked regions with better quality and fewer processing steps.

efficiency · architecture

Similarity-Aware Mixture-of-Experts for Data-Efficient Continual Learning

Mar 24, 2026

Connor Mclaughlin, Nigel Lee, Lili Su

When deploying models that learn from new tasks with scarce data, routing samples intelligently based on task similarity prevents negative interference while maximizing knowledge reuse across overlapping tasks.

This paper tackles continual learning when tasks have limited data and may overlap unpredictably. The authors propose an adaptive mixture-of-experts system that learns which tasks are similar and routes data accordingly, using two key techniques: gradually introducing task-specific prompts over time and identifying which samples fit existing patterns versus need new ones.

efficiency · architecture

WorldCache: Content-Aware Caching for Accelerated Video World Models

Mar 23, 2026

Umair Nawaz, Ahmed Heakl, Ufaq Khan et al.

Smart feature caching with motion awareness can dramatically accelerate video world models without retraining, but requires adaptive thresholds and blending rather than static feature reuse.

WorldCache speeds up video generation from diffusion transformers by intelligently reusing computed features across denoising steps. Instead of naively reusing old features, it adapts based on motion and visual importance, using blending and warping to keep videos smooth and artifact-free—achieving 2.3× speedup with minimal quality loss.

efficiency · architecture · evaluation

End-to-End Training for Unified Tokenization and Latent Denoising

Mar 23, 2026

Shivam Duggal, Xingjian Bai, Zongze Wu et al.

You can train tokenization and image generation together from scratch using a single model with shared weights, simplifying the pipeline and reducing training complexity while maintaining quality.

This paper proposes UNITE, a new way to train image generation models more efficiently by combining tokenization and diffusion in a single training stage.

architecture · training · efficiency

UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation

Mar 23, 2026

Ziyi Wang, Xinshun Wang, Shuang Chen et al.

Treating motion as a continuous first-class modality rather than discretizing it enables a single model to handle motion-text-image tasks end-to-end, achieving better performance on cross-modal tasks like describing motion or editing poses from text.

UniMotion is the first unified AI system that understands and generates human motion, text, and images all in one model. Instead of converting motion into discrete tokens (which loses information), it treats motion as a continuous stream like video, using a shared language model backbone with special techniques to align motion with visual and text understanding.

multimodal · architecture

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

Mar 23, 2026

Haichao Zhang, Yijiang Li, Shwai He et al.

Pairing dense video prediction models with sparse, semantically-rich vision-language reasoning improves long-horizon forecasting—VLMs provide the 'what' and 'why', while dense models provide the 'how'.

This paper combines two approaches to video prediction: dense frame-by-frame modeling (JEPA) for capturing fine-grained motion, and vision-language models (VLMs) for long-horizon semantic understanding. By using both pathways together, the system predicts future video frames better than either approach alone, especially for complex hand manipulation tasks.

multimodal · reasoning · architecture

MemDLM: Memory-Enhanced DLM Training

Mar 23, 2026

Zehua Pei, Hui-Ling Zhen, Weizhe Lin et al.

Diffusion language models can be trained more effectively by embedding a simulated denoising trajectory into training, and this memory mechanism can be reused at inference time to improve long-context retrieval tasks.

This paper addresses a key problem in diffusion language models: they're trained one way (predicting masked tokens) but used differently (multi-step denoising). MemDLM fixes this mismatch by simulating the denoising process during training using a memory mechanism that learns from each sample's trajectory, leading to faster training and better long-context performance.

training · architecture · efficiency

Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD

Mar 20, 2026

Emiel Hoogeboom, David Ruhe, Jonathan Heek et al.

Discrete diffusion models can now be distilled into faster generators using moment matching, enabling practical deployment with fewer sampling steps while maintaining quality.

This paper solves the problem of making discrete diffusion models faster by distilling them into simpler models. Unlike continuous diffusion models which have many distillation techniques, discrete diffusion (used for text and images) has been hard to compress.

efficiency · training · architecture

Spectrally-Guided Diffusion Noise Schedules

Mar 19, 2026

Carlos Esteves, Ameesh Makadia

By tailoring noise schedules to each image's spectral content, you can generate higher-quality images with fewer denoising steps, making diffusion models faster and more efficient.

This paper proposes a smarter way to design noise schedules for diffusion models by analyzing the spectral properties of images. Instead of using the same handcrafted noise schedule for all images, the method creates custom schedules for each image that eliminate unnecessary denoising steps, improving generation quality especially when using fewer sampling steps.

efficiency · architecture · training

DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding

Mar 19, 2026

Dong Zhuo, Wenzhao Zheng, Sicheng Zuo et al.

A single tokenizer can efficiently represent multi-view driving scenes in a way that works for both reconstruction tasks (RGB, depth) and understanding tasks (segmentation, 3D occupancy), making it practical for vision-language-action models in autonomous vehicles.

DriveTok creates a unified tokenizer for autonomous driving that converts multi-view camera images into compact 3D scene tokens. Unlike existing tokenizers designed for single images, it handles multiple camera views efficiently while preserving semantic, geometric, and depth information—enabling better reconstruction and understanding of driving scenes.

multimodal · architecture · applications

DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising

Mar 19, 2026

Tianjiao Yu, Xinzhuo Li, Muntasir Wahed et al.

Part-aware 3D generation works better when you explicitly model semantic relationships between parts derived from language, not just their geometry—this enables text descriptions to guide both individual part structure and how parts fit together.

DreamPartGen generates 3D objects from text by understanding them as meaningful parts with semantic relationships. Unlike existing methods that focus only on geometry, this approach jointly models each part's shape and appearance while capturing how parts relate to each other based on the text description, resulting in more coherent and interpretable 3D models.

multimodal · architecture · reasoning

Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Mar 19, 2026

Shang-Jui Ray Kuo, Paola Cascante-Bonilla

State space models are a viable and more efficient alternative to vision transformers for vision-language models, challenging the assumption that transformers are necessary for this task.

This paper tests whether state space models (SSMs) can replace vision transformers as the visual backbone in vision-language models. The researchers find that SSM-based vision encoders match or outperform transformer-based encoders on VQA and visual grounding tasks, while using fewer parameters. They also identify instability issues in some backbones and propose fixes to improve robustness.

architecture · multimodal · efficiency

Box Maze: A Process-Control Architecture for Reliable LLM Reasoning

Mar 19, 2026

Zou Qiang

Adding explicit process-control layers to LLM reasoning—rather than just filtering outputs—can dramatically reduce hallucination and adversarial vulnerability by enforcing integrity at the reasoning stage itself.

Box Maze proposes a three-layer architecture for LLMs that separates reasoning into memory grounding, structured inference, and boundary enforcement to prevent hallucination and adversarial attacks. Testing on multiple LLM systems shows the approach reduces failure rates from ~40% to <1% under adversarial conditions, suggesting architectural constraints can improve reasoning reliability.

architecture · safety · reasoning

DyMoE: Dynamic Expert Orchestration with Mixed-Precision Quantization for Efficient MoE Inference on Edge

Mar 19, 2026

Yuegui Huang, Zhiyuan Fang, Weiqi Luo et al.

By dynamically quantizing less important experts and prefetching memory strategically, DyMoE achieves 3-22x faster inference on edge devices without sacrificing accuracy—making large MoE models practical for real-time edge deployment.

DyMoE optimizes Mixture-of-Experts (MoE) models for edge devices by dynamically adjusting precision during inference. It identifies that some experts matter more than others and uses this insight to apply lower precision to less critical experts while keeping important ones at higher precision, combined with smart memory prefetching to reduce delays.

efficiency · architecture
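A toy sketch of importance-based precision assignment; the bit-widths and `top_frac` cutoff are hypothetical, and real DyMoE adjusts these decisions dynamically during inference rather than from a static importance vector:

```python
def assign_expert_precision(importance, high_bits=8, low_bits=4, top_frac=0.25):
    """Give the most important experts higher precision, the rest lower."""
    n_high = max(1, int(len(importance) * top_frac))
    top = set(sorted(range(len(importance)),
                     key=lambda i: -importance[i])[:n_high])
    return [high_bits if i in top else low_bits for i in range(len(importance))]
```

Since only a small fraction of experts carry most of the routing mass, quantizing the long tail aggressively shrinks memory traffic with little accuracy cost.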

D5P4: Partition Determinantal Point Process for Diversity in Parallel Discrete Diffusion Decoding

Mar 19, 2026

Jonathan Lys, Vincent Gripon, Bastien Pasdeloup et al.

D5P4 enables discrete diffusion models to generate diverse text outputs efficiently by using a principled diversity mechanism during decoding, with minimal computational overhead compared to standard approaches.

This paper improves how discrete diffusion models generate text by introducing D5P4, a new decoding method that generates multiple candidate outputs in parallel while controlling diversity.

efficiency · architecture · evaluation

Adaptive Regime-Aware Stock Price Prediction Using Autoencoder-Gated Dual Node Transformers with Reinforcement Learning Control

Mar 19, 2026

Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman

Treating all market conditions the same hurts prediction accuracy; this framework learns to detect regime shifts automatically and uses specialized models for each, improving performance especially during volatile periods without requiring manual market labeling.

This paper presents an adaptive stock price prediction system that automatically detects market regime changes (stable vs. volatile periods) and routes data through specialized prediction models.

architectureapplications

LuMamba: Latent Unified Mamba for Electrode Topology-Invariant and Efficient EEG Modeling

Mar 19, 2026

Danaé Broustail, Anna Tegon, Thorir Mar Ingolfsson et al.

State-space models (Mamba) enable efficient EEG foundation models that work across varying electrode setups—crucial for real-world clinical deployment where equipment differs across hospitals.

LuMamba is an efficient EEG foundation model that handles different electrode configurations by combining topology-invariant encodings with linear-complexity state-space modeling. Pre-trained on 21,000+ hours of unlabeled EEG data, it achieves strong performance on clinical tasks while using 377× fewer computations than transformer-based alternatives.

efficiencyarchitecturetraining

Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

Mar 18, 2026

Jianrui Zhang, Yue Yang, Rohun Tripathi et al.

You can prune half of a video VLM's tokens across both vision and language components without complex mechanisms, gaining a 62% speedup while maintaining performance—making video VLMs practical for real-world deployment.

This paper introduces a method to speed up video understanding models by removing redundant visual information. The technique scores and removes 50% of unnecessary visual tokens across the entire model architecture, achieving 62% faster processing with minimal accuracy loss on video question-answering tasks.

efficiencymultimodalarchitecture
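The core score-and-prune step can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's scorer: `prune_tokens` and the use of attention mass as the score are assumptions.

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the top-scoring fraction of tokens, drop the rest.

    tokens: (N, D) array of token embeddings
    scores: (N,) importance score per token (e.g. attention mass)
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # indices of the n_keep highest-scoring tokens, in original order
    keep_idx = np.sort(np.argsort(scores)[-n_keep:])
    return tokens[keep_idx], keep_idx

# toy example: 8 tokens of dim 4, with even indices scored highest
tokens = np.arange(32, dtype=float).reshape(8, 4)
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4])
kept, idx = prune_tokens(tokens, scores, keep_ratio=0.5)
print(idx)  # → [0 2 4 6]
```

Sorting the surviving indices preserves the original spatio-temporal order, which matters when the remaining tokens still carry positional information.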

LoST: Level of Semantics Tokenization for 3D Shapes

Mar 18, 2026

Niladri Shekhar Dutt, Zifan Shi, Paul Guerrero et al.

By tokenizing 3D shapes based on semantic importance rather than spatial detail levels, you can train autoregressive 3D generation models that are 10-1000x more token-efficient while maintaining or improving quality.

LoST is a new way to break down 3D shapes into tokens (small pieces) for AI models to process. Instead of using spatial hierarchies like existing methods, it orders tokens by semantic importance—so early tokens capture the main shape, and later tokens add fine details. This makes 3D generation models much more efficient, using 90-99% fewer tokens than previous approaches.

architectureefficiencymultimodal

Demystifying Video Reasoning

Mar 17, 2026

Ruisi Wang, Zhongang Cai, Fanyi Pu et al.

Video models reason through iterative refinement across denoising steps (not frame-by-frame), exploring candidate solutions early and converging later—a mechanism you can exploit by ensembling outputs from different random seeds.

This paper reveals how video diffusion models actually perform reasoning—not by processing frames sequentially, but by exploring multiple solutions across denoising steps and converging to answers.

reasoningarchitectureevaluation

GIST: Gauge-Invariant Spectral Transformers for Scalable Graph Neural Operators

Mar 17, 2026

Mattia Rigotti, Nicholas Thumiger, Thomas Frick

GIST enables efficient, mathematically-principled graph transformers that generalize across different mesh resolutions and discretizations, making neural operators practical for large-scale physics simulations.

GIST is a graph transformer that solves a fundamental problem: how to add positional information to graph neural networks without breaking mathematical symmetries or requiring expensive computations.

architecturescalingreasoning

Mixture-of-Depths Attention

Mar 16, 2026

Lianghui Zhu, Yuxin Fang, Bencheng Liao et al.

MoDA lets deep language models selectively attend to earlier layers, preventing information loss as models get deeper while adding only 3.7% computational overhead.

This paper introduces Mixture-of-Depths Attention (MoDA), a mechanism that lets attention heads skip layers by accessing key-value pairs from both the current and earlier layers. This solves a problem in very deep language models where useful information gets diluted as it passes through many layers.

architectureefficiencyscaling
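The mechanism, letting a head attend over key-value pairs pooled from the current and an earlier layer, can be sketched as plain attention over concatenated caches. This is a hypothetical simplification, not the paper's implementation; `cross_depth_attention` is an assumed name.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_depth_attention(q, kv_current, kv_early):
    """One head attends over KV pairs from both the current layer
    and an earlier layer, so deep layers can recover information
    that would otherwise be diluted on the way up."""
    k = np.concatenate([kv_current[0], kv_early[0]], axis=0)  # (2T, D)
    v = np.concatenate([kv_current[1], kv_early[1]], axis=0)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v

rng = np.random.default_rng(0)
T, D = 4, 8
q = rng.standard_normal((T, D))
kv_cur = (rng.standard_normal((T, D)), rng.standard_normal((T, D)))
kv_early = (rng.standard_normal((T, D)), rng.standard_normal((T, D)))
out = cross_depth_attention(q, kv_cur, kv_early)
print(out.shape)  # (4, 8)
```

The cost of doubling the KV pool here is what the reported 3.7% overhead would correspond to when only a subset of heads use it.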

Effective Distillation to Hybrid xLSTM Architectures

Mar 16, 2026

Lukas Hauzenberger, Niklas Schmidinger, Thomas Schmied et al.

You can now distill transformer-based LLMs into more efficient xLSTM architectures without significant performance degradation, making it practical to deploy smaller, cheaper models that match their larger teachers.

This paper shows how to effectively compress large language models into smaller xLSTM models while preserving performance. The researchers developed a distillation pipeline that combines multiple specialized experts into a single efficient model, successfully distilling models from Llama, Qwen, and Olmo families with minimal performance loss.

efficiencyarchitecturetraining
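The objective underneath a distillation pipeline like this is typically a temperature-softened KL term between teacher and student logits. A minimal sketch, assuming a standard forward-KL formulation (`distill_kl` is an assumed name; the paper's expert-merging step is not shown):

```python
import numpy as np

def softmax(z, t=1.0):
    e = np.exp((z - z.max()) / t)
    return e / e.sum()

def distill_kl(teacher_logits, student_logits, temperature=2.0):
    """Forward KL between temperature-softened teacher and student
    distributions: the student is pushed toward the teacher's full
    output distribution, not just its top prediction."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q))))

t = np.array([2.0, 1.0, 0.1])
print(distill_kl(t, t) < 1e-12)  # identical logits give zero KL
```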

Computational Concept of the Psyche

Mar 16, 2026

Anton Kolonin, Vladimir Krykov

AGI systems should be built around an agent's internal needs and goals as the core driver of learning and decision-making, rather than treating intelligence as separate from motivation.

This paper proposes a cognitive architecture for artificial general intelligence that models the psyche as an operating system managing an agent's needs, sensations, and actions. The approach formalizes AGI as an optimization problem where agents learn through experience to satisfy needs while managing uncertainty and minimizing existential risks.

architecturereasoningagents

Co-Design of Memory-Storage Systems for Workload Awareness with Interpretable Models

Mar 16, 2026

Jay Sarkar, Vamsi Pavan Rayaprolu, Abhijeet Bhalerao

Using interpretable ML to co-design storage hardware and firmware together—rather than separately—helps engineers make better architectural decisions by understanding how memory, error handling, and workloads interact.

This paper describes how machine learning can optimize the design of solid-state drives (SSDs) by modeling how error management algorithms interact with memory components under different workloads. The researchers built an interpretable ML framework that analyzes thousands of real SSDs to guide hardware design decisions, enabling better performance and reliability trade-offs.

architectureefficiencyevaluation

Mamba-3: Improved Sequence Modeling using State Space Principles

Mar 16, 2026

Aakash Lahoti, Kevin Y. Li, Berlin Chen et al.

Mamba-3 shows that linear models can match Transformer quality on real tasks by using complex-valued state tracking and better architectural design, opening a path to cheaper inference without sacrificing capability.

Mamba-3 improves linear sequence models by using state space principles to handle tasks that require tracking information over time. Unlike Transformers that are slow to run, Mamba-3 maintains constant memory and linear compute while matching quality on language tasks—making it faster and cheaper to deploy.

architectureefficiencyreasoning
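Why complex-valued states help: a diagonal recurrence with a complex eigenvalue rotates as it decays, so a purely linear model can track periodic structure that a real eigenvalue cannot. A toy sketch of the principle, not the Mamba-3 parameterization:

```python
import numpy as np

def complex_ssm_scan(x, decay=0.95, freq=0.5):
    """Diagonal linear recurrence h_t = a * h_{t-1} + x_t with a
    complex eigenvalue a = decay * e^{i*freq}; the rotation encodes
    phase/position, the magnitude encodes forgetting."""
    a = decay * np.exp(1j * freq)
    h = 0j
    ys = []
    for xt in x:
        h = a * h + xt
        ys.append(h.real)  # read out the real part
    return np.array(ys)

y = complex_ssm_scan(np.ones(5))
print(y.round(3))
```

Memory stays constant in sequence length and the scan is linear-time, which is the efficiency argument made in the summary above.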

EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Mar 12, 2026

Xuanlang Dai, Yujie Zhou, Long Xing et al.

Diffusion models can solve complex reasoning tasks better by having the language encoder think iteratively and update its guidance throughout the generation process, rather than encoding instructions once at the start.

This paper improves how diffusion models solve complex reasoning tasks by making the language model encoder think step-by-step. Instead of encoding instructions once, the system iteratively refines the model's internal reasoning and feeds it progressively to the image generation process, achieving 92% accuracy on spatial reasoning tasks like mazes and puzzles.

reasoningmultimodalarchitecture

Separable neural architectures as a primitive for unified predictive and generative intelligence

Mar 12, 2026

Reza T. Batley, Apurba Sarker, Rajib Mostakim et al.

Separable neural architectures provide a unified framework for both prediction and generation tasks by imposing structural constraints that decompose high-dimensional problems into simpler, more interpretable components—useful when your system has underlying factorizable structure.

This paper introduces separable neural architectures (SNAs), a structured approach to building neural networks that explicitly exploit factorizable patterns in data. By constraining how different parts of a system interact, SNAs can model everything from physics simulations to language more efficiently.

architecturereasoningefficiency

BiGain: Unified Token Compression for Joint Generation and Classification

Mar 12, 2026

Jiacheng Liu, Shengkun Tang, Jiacheng Cui et al.

Token compression in diffusion models can serve both generation and classification if you preserve different frequency components: keep high-frequency details for texture/edges and low/mid-frequency information for semantic understanding.

BiGain is a method that speeds up diffusion models while keeping both image generation and classification working well. It uses frequency-aware token compression—separating fine details from overall structure—to decide which tokens to merge or remove, maintaining visual quality and classification accuracy simultaneously.

efficiencyarchitectureevaluation
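The frequency separation at the heart of this idea can be sketched with an FFT along the token axis. This is illustrative only; BiGain's actual merge/remove rule is more involved, and `frequency_split` is an assumed name.

```python
import numpy as np

def frequency_split(tokens, cutoff=0.25):
    """Split a token sequence into low- and high-frequency parts
    along the sequence axis: low/mid frequencies carry coarse
    semantics, high frequencies carry texture and edges."""
    f = np.fft.fft(tokens, axis=0)
    n = tokens.shape[0]
    keep = int(n * cutoff)
    low_f = np.zeros_like(f)
    low_f[:keep] = f[:keep]      # positive low frequencies (incl. DC)
    low_f[-keep:] = f[-keep:]    # matching negative frequencies
    low = np.fft.ifft(low_f, axis=0).real
    high = tokens - low
    return low, high

tokens = np.random.default_rng(1).standard_normal((16, 4))
low, high = frequency_split(tokens)
print(np.allclose(low + high, tokens))  # True by construction
```

A compression policy in this spirit would keep the `high` component where texture matters (generation) and the `low` component where semantics matter (classification).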

Security Considerations for Artificial Intelligence Agents

Mar 12, 2026

Ninghui Li, Kaiyuan Zhang, Kyle Polley et al.

AI agents introduce fundamentally new security challenges because they blur the line between code and data, and can execute actions across systems—developers need layered defenses including input filtering, sandboxing, and strict privilege controls.

This paper identifies security risks in AI agents—systems that can take actions in the real world—and proposes defenses. It covers new attack types like prompt injection and confused-deputy problems, explains how current protections work (sandboxing, policy enforcement), and highlights gaps in standards and research needed to secure multi-agent systems.

safetyagentsarchitecture

HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers

Mar 12, 2026

Andy Li, Aiden Durrant, Milan Markovic et al.

HiAP simplifies Vision Transformer deployment by automatically discovering efficient architectures in one training phase without manual sparsity targets, matching complex multi-stage methods while being easier to use.

HiAP is a pruning method that automatically removes unnecessary parts of Vision Transformers during training to make them faster and smaller for edge devices. Unlike existing approaches that require manual tuning, it uses a single training process to find optimal sub-networks by removing entire attention heads, FFN blocks, and individual neurons simultaneously.

efficiencyarchitecturetraining

RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images

Mar 12, 2026

Bin Wan, Runmin Cong, Xiaofei Zhou et al.

Using adaptive convolution kernels guided by object size proportions, combined with transformer-based backbones, significantly improves detection of objects at different scales in satellite imagery.

RDNet improves salient object detection in satellite images by replacing traditional CNN backbones with SwinTransformer and adding three specialized modules that adapt to different object sizes and use frequency analysis to better understand context. This solves the problem of detecting objects of varying scales in remote sensing imagery more accurately than existing methods.

architectureefficiencyevaluation

CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks

Mar 12, 2026

Alexandre Le Mercier, Thomas Demeester, Chris Develder

CLASP provides a practical, lightweight defense against poisoning attacks on state space models by detecting malicious tokens before they reach downstream tasks, with strong generalization to unseen attack patterns.

State space models like Mamba are fast alternatives to Transformers, but they're vulnerable to Hidden State Poisoning Attacks that inject malicious tokens to corrupt the model's memory.

safetyefficiencyarchitecture

Long-Context Encoder Models for Polish Language Understanding

Mar 12, 2026

Sławomir Dadas, Rafał Poświata, Marek Kozłowski et al.

Encoder-only models can be extended to handle long documents through positional embedding adaptation and continued pre-training, offering a parameter-efficient alternative to decoder-only LLMs for document understanding tasks.

This paper introduces Polish language models based on encoder-only architecture that can process documents up to 8192 tokens long—much longer than traditional BERT models. The researchers used a two-stage training approach with positional embedding adaptation and created smaller distilled versions.

architectureefficiency

FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance

Mar 12, 2026

Quanhao Li, Zhen Xing, Rui Wang et al.

You can now generate videos with precise motion control in a fraction of the time by distilling multi-step models and retraining motion adapters—opening doors for real-time interactive video creation.

FlashMotion speeds up trajectory-controlled video generation from many steps to just a few, while keeping videos high-quality and motion paths accurate. It trains a motion controller on a slow multi-step model, then distills it to run faster, and fine-tunes the controller to work well with the speedier version.

efficiencyarchitectureevaluation

Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models

Feb 27, 2026

Arnas Uselis, Andrea Dittadi, Seong Joon Oh

For AI models to recognize new combinations of familiar concepts, their internal representations must be mathematically linear and orthogonal—a strict geometric requirement the authors prove is necessary for generalizing to unseen concept combinations.

This paper explains why neural networks need to organize information in a specific geometric way to recognize familiar concepts in new combinations. The researchers prove that for a model to generalize to unseen combinations of concepts, its internal representations must decompose into separate, perpendicular components for each concept.

architecturereasoningevaluation

FaultXformer: A Transformer-Encoder Based Fault Classification and Location Identification model in PMU-Integrated Active Electrical Distribution System

Feb 27, 2026

Kriti Thakur, Alivelu Manga Parimi, Mayukha Pal

Transformers can outperform traditional deep learning for time-series fault detection in power systems, especially as grids grow more complex with PMU sensors and active distribution components.

FaultXformer uses a Transformer model to detect and locate electrical faults in power grids using real-time sensor data. It processes current measurements in two stages—first extracting temporal patterns, then classifying fault types and pinpointing locations—achieving 98%+ accuracy and outperforming traditional deep learning approaches like CNNs and LSTMs.

architectureapplicationsevaluation

Chunk-wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text

Feb 27, 2026

Hainan Xu, Vladimir Bataev, Travis M. Bartley et al.

You can make streaming speech-to-text models faster and more accurate by processing audio in fixed chunks instead of one token at a time.

This paper introduces CHAT, an improved version of RNN-T models for converting speech to text in real-time. By processing audio in small chunks and using a smarter attention mechanism, CHAT runs 1.7x faster during inference, uses 46% less memory during training, and produces more accurate transcriptions—especially for translating speech between languages.

efficiencyarchitecturemultimodal
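The chunking pattern itself is simple: process the stream in fixed-size windows rather than one token at a time. A minimal sketch of the splitting step (the attention mechanism inside each chunk is not shown, and `chunks` is an assumed name):

```python
def chunks(seq, size):
    """Yield fixed-size chunks of an audio/token stream; a chunk-wise
    model attends within each window instead of emitting one token
    per step, amortizing compute across the chunk."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

frames = list(range(10))
print([c for c in chunks(frames, 4)])  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Latency is bounded by the chunk size: larger chunks give the attention more context per step, smaller chunks respond sooner.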

SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

Feb 26, 2026

Vaibhav Agrawal, Rishubh Parihar, Pradhaan Bhat et al.

AI image generators can now understand and correctly render partially hidden objects when you specify 3D layouts and camera positions.

This paper solves a key problem in AI image generation: when you ask an AI to create a scene with specific 3D positions and camera angles, it often gets confused about which objects should be hidden behind others. SeeThrough3D adds 'occlusion awareness' by representing objects as transparent 3D boxes, letting the model understand what's visible and what's blocked before generating the final image.

multimodalarchitecture

A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

Feb 26, 2026

Soumya Dutta, Smruthi Balaji, Sriram Ganapathy

Using specialized experts for different modalities (speech vs. text) and intelligently combining their predictions improves emotion recognition in conversations.

This paper presents MiSTER-E, a system that recognizes emotions in conversations by combining speech and text information. It uses separate AI experts for speech, text, and cross-modal analysis, then intelligently combines their predictions. The system works on real conversations without needing to know who's speaking, and achieves strong results on standard emotion recognition benchmarks.

multimodalarchitectureapplications

Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems

Feb 26, 2026

Siyuan Liu, Jiahui Xu, Feng Jiang et al.

Voice assistants can respond 19-51% faster by processing speech, reasoning, and speech generation in parallel instead of waiting for each step to finish.

This paper solves a real problem with voice assistants: they're slow because they wait for you to finish talking, then transcribe everything, think about the answer, and finally speak. The new DDTSR system lets the AI start responding while still listening and thinking—like a human conversation.

efficiencyagentsarchitecture

Agency and Architectural Limits: Why Optimization-Based Systems Cannot Be Norm-Responsive

Feb 26, 2026

Radha Sarma

RLHF-based AI systems cannot be governed by norms because optimization forces all values into tradeable weights—genuine norm-following requires an architecture that does not collapse every principle into a single optimizable score.

This paper argues that AI systems like ChatGPT trained with RLHF cannot follow ethical rules or norms because of how they're built. They work by turning everything into a single score and picking the highest one—which means they'll always trade off any principle if it scores higher. The author shows this isn't a bug to fix, but a fundamental limit of optimization itself.

alignmentsafetyarchitecture

ReCoN-Ipsundrum: An Inspectable Recurrent Persistence Loop Agent with Affect-Coupled Control and Mechanism-Linked Consciousness Indicator Assays

Feb 26, 2026

Aishik Sanyal

Adding emotional feedback to AI agents makes them more stable and deliberate, not just more human-like—a practical insight for building agents th...

This paper builds an AI agent called ReCoN-Ipsundrum that adds memory loops and emotional signals to test whether machines can show consciousness-like behaviors.

agentsarchitecturereasoning

Plug-and-Play Diffusion Meets ADMM: Dual-Variable Coupling for Robust Medical Image Reconstruction

Feb 26, 2026

Chenhe Du, Xuanyu Tian, Qing Wu et al.

Adding historical tracking to diffusion-based medical image reconstruction eliminates the bias-hallucination tradeoff and guarantees convergence to reconstructions consistent with the measured data.

This paper fixes a problem with using AI image generators to reconstruct medical scans from incomplete data. Previous methods lose track of what they've already tried, causing them to either ignore measurement constraints or hallucinate fake details. The solution adds memory to the optimization process and cleans up noise patterns so the AI generator works correctly.

applicationsefficiencyarchitecture

InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

Feb 26, 2026

Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross

Reorganizing how you compress KV cache to match GPU hardware operations can give you significant speed gains without accuracy loss.

InnerQ compresses the key-value cache in large language models to speed up text generation without losing accuracy. It uses a smarter grouping strategy that aligns with how GPUs actually compute, reducing memory access and enabling faster decoding—up to 22% faster than previous compression methods.

efficiencyarchitecture
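Group-wise KV cache quantization can be sketched as symmetric int8 rounding over contiguous groups, where the group layout matches how kernels load memory. This is a generic sketch of the idea, not InnerQ's exact layout; the function names are assumptions.

```python
import numpy as np

def quantize_groups(kv, group_size=64):
    """Symmetric int8 quantization of a KV cache in fixed-size
    contiguous groups, each with its own fp scale."""
    g = kv.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # avoid divide-by-zero on all-zero groups
    q = np.clip(np.round(g / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_groups(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

kv = np.random.default_rng(2).standard_normal((4, 128)).astype(np.float32)
q, s = quantize_groups(kv)
kv_hat = dequantize_groups(q, s, kv.shape)
err = np.abs(kv - kv_hat).max()
print(err < 0.05)  # per-element error is bounded by half a scale step
```

Aligning `group_size` with the contiguous spans a GPU kernel reads is what turns the same compression ratio into an actual decoding speedup.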

ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering

Feb 26, 2026

Elzo Brito dos Santos Filho

Separate agent planning from execution: agents output intentions, a deterministic system executes them and logs everything, preventing state loss and making every change auditable.

This paper solves a critical problem with AI agents: they lose track of what they're doing over long tasks and can't reliably execute code changes. ESAA is an architecture that separates what an agent *intends* to do from what actually *happens* in your codebase.

agentsarchitectureapplications