ThinkLLM

Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

326 papers · 16 this month · 12 topics
All · Efficiency 35 · Reasoning 35 · Multimodal 28 · Applications 28 · Evaluation 27 · Training 26 · Architecture 24 · Agents 24 · Safety 13 · Scaling 5 · Data 5 · Alignment 1

Mar 30 – Apr 5 (22)

Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

Apr 2, 2026

Daiwei Chen, Zhoutong Fu, Chengming Jiang et al.

Token initialization is a critical bottleneck when extending language models with new vocabulary—grounding new tokens in semantically meaningful positions before fine-tuning substantially improves downstream task performance.

When language models add new vocabulary tokens for tasks like recommendation, the new embeddings are typically initialized as averages of existing ones. This paper shows that this approach fails: the new tokens all collapse into the same subspace and lose their distinctiveness.
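
To make the failure mode concrete, here is a toy numpy sketch. The grounding recipe is a guess at the idea, not the paper's actual procedure: mean initialization gives every new token the identical vector, while a grounded initializer anchors each new token to a distinct set of related existing tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_emb = rng.normal(size=(50_000, 768))   # stand-in for the existing embedding table

# Mean initialization: every new token starts at the same point, so all
# new tokens collapse into a single direction before fine-tuning.
mean_init = vocab_emb.mean(axis=0)

def grounded_init(related_ids, noise=0.02):
    # Hypothetical grounding: average only semantically related tokens
    # (e.g., tokens from an item's title), plus a small perturbation so
    # distinct items start in distinct positions.
    base = vocab_emb[related_ids].mean(axis=0)
    return base + noise * rng.normal(size=base.shape)

item_a = grounded_init([101, 2045, 731])     # ids are placeholders
item_b = grounded_init([87, 9210, 15002])
```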

training · efficiency · applications

Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning

Apr 2, 2026

Bangji Yang, Hongbo Ma, Jiajun Fan et al.

You can make reasoning models 15-60% more token-efficient while keeping or improving accuracy by simply training them to solve multiple problems simultaneously, creating an implicit efficiency incentive rather than explicit penalties.

This paper introduces Batched Contextual Reinforcement (BCR), a training method that makes language models reason more efficiently by training them to solve multiple problems at once in a shared context.

Mar 23 – Mar 29 (19)

PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Mar 26, 2026

Xiaofeng Mao, Shaohao Rui, Kaining Ying et al.

You can train video models on short clips and generate much longer videos by using a three-tier memory strategy that compresses historical context without losing quality.

PackForcing solves the memory problem in video generation by compressing old frames intelligently—keeping early frames for context, heavily compressing middle frames, and preserving recent frames for smooth transitions. This lets models generate 2-minute videos on a single GPU after training only on 5-second clips, producing videos 24x longer than the clips seen in training.
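
A rough sketch of the three-tier idea, with hypothetical tier sizes; the actual method compresses latent features inside the model rather than dropping raw frames, but the tier structure is the point here.

```python
def pack_history(frames, keep_early=8, keep_recent=16, mid_stride=4):
    # Three-tier memory: keep early frames as global context, subsample
    # ("compress") the middle heavily, and keep recent frames intact for
    # smooth continuation.
    if len(frames) <= keep_early + keep_recent:
        return list(frames)
    early = frames[:keep_early]
    middle = frames[keep_early:-keep_recent:mid_stride]  # heavy compression
    recent = frames[-keep_recent:]
    return list(early) + list(middle) + list(recent)

history = pack_history(list(range(300)))
print(len(history))  # far fewer than 300 entries
```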

efficiency · architecture · training

No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

Mar 26, 2026

Hai X. Pham, David T. Hoffmann, Ricardo Guerrero et al.

You can teach vision-language models to understand compositional meaning by focusing on concept-level alignment and preserving fine-grained visual information—without custom data or hurting general performance.

This paper improves how vision-language models learn to understand combinations of concepts (like "red car" vs "blue car") without sacrificing their ability to recognize new objects.

Mar 16 – Mar 22 (36)

VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking

Mar 20, 2026

Jingyang Lin, Jialian Wu, Jiang Liu et al.

Instead of processing every frame, an agent that reasons about which moments matter can use far fewer frames while achieving better results—a practical approach for building efficient video AI systems.

VideoSeek is a video understanding agent that intelligently seeks out key moments in videos rather than analyzing every frame, reducing computational cost by 93% while improving accuracy. It uses a toolkit to gather multi-scale observations and reasons about video content through a think-act-observe loop, enabling efficient long-horizon video understanding.

agents · efficiency · reasoning

Adaptive Greedy Frame Selection for Long Video Understanding

Mar 20, 2026

Yuning Huang, Fengqing Zhu

By selecting frames that are both relevant to the question and visually diverse, you can cut inference costs significantly while maintaining or improving accuracy on video QA tasks, especially when frame budgets are tight.

This paper tackles a key bottleneck in video understanding: processing long videos with vision-language models requires too many frames and tokens. The authors propose a smart frame selection method that picks the most important frames by balancing two goals—relevance to the question asked and diversity of visual content—using a greedy algorithm with theoretical guarantees.
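
The selection loop can be sketched as a greedy, MMR-style trade-off between question relevance and redundancy with already-picked frames; the paper's exact objective and theoretical guarantees may differ.

```python
import numpy as np

def select_frames(frame_feats, query_feat, budget, lam=0.5):
    # Greedy frame selection: balance relevance to the question against
    # similarity to frames already chosen. Features assumed L2-normalized.
    selected = []
    remaining = list(range(len(frame_feats)))
    relevance = frame_feats @ query_feat
    while remaining and len(selected) < budget:
        best, best_score = None, -np.inf
        for i in remaining:
            redundancy = max((frame_feats[i] @ frame_feats[j]
                              for j in selected), default=0.0)
            score = lam * relevance[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)
```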

Mar 9 – Mar 15 (14)

Neuron-Aware Data Selection In Instruction Tuning For Large Language Models

Mar 13, 2026

Xin Chen, Junchao Wu, Shu Yang et al.

You can train better LLMs on less data by selecting instruction examples that activate the same neurons as your target task—this beats using all data or relying on external models to score examples.

This paper introduces NAIT, a method for selecting the most useful instruction-tuning data for large language models by analyzing which neurons activate when processing different types of tasks. Instead of using all available training data, NAIT identifies a small subset (10% of data) that produces better results by matching neuron activation patterns to target capabilities.

training · data · efficiency

Learnability and Privacy Vulnerability are Entangled in a Few Critical Weights

Mar 13, 2026

Xingli Fang, Jung-Eun Kim

Privacy vulnerabilities and model performance are concentrated in a small set of weights—you can defend against privacy attacks by carefully fine-tuning just these critical weights instead of retraining the whole model.

This paper identifies that privacy leaks in neural networks come from a tiny fraction of weights, and these same weights are crucial for model performance. Rather than retraining the entire model, the authors propose selectively rewinding only these critical weights during fine-tuning to defend against membership inference attacks while keeping the model accurate.

Feb 23 – Mar 1 (9)

Mode Seeking meets Mean Seeking for Fast Long Video Generation

Feb 27, 2026

Shengqu Cai, Weili Nie, Chao Liu et al.

Decouple learning long-term coherence from local quality to generate minute-scale videos without needing massive amounts of long-form training data.

This paper solves a key problem in video generation: making long videos (minutes) that are both sharp and coherent. The trick is training two separate components—one learns long-term story structure from rare long videos, while another copies local quality from abundant short videos. This lets the model generate minute-long videos that look crisp and stay consistent throughout.

training · efficiency · architecture

Do LLMs Benefit From Their Own Words?

Feb 27, 2026

Jenny Y. Huang, Leshem Choshen, Ramon Astudillo et al.

You can often remove an LLM's previous responses from conversation history without losing quality, saving memory while sometimes improving accuracy.

This paper tests whether LLMs actually need to see their own previous responses in multi-turn conversations. Surprisingly, removing past assistant responses often doesn't hurt quality and can shrink context by 10x. The researchers found that models sometimes get worse when they over-rely on their own prior outputs, introducing errors that compound across turns.
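
The intervention itself is easy to picture: drop earlier assistant turns before re-prompting. A minimal sketch, assuming an OpenAI-style message list; real systems may also need to keep tool results or other roles.

```python
def prune_history(messages, keep_last_assistant=True):
    # Drop earlier assistant turns, keeping user turns and (optionally)
    # the most recent assistant reply.
    last_assistant = max(
        (i for i, m in enumerate(messages) if m["role"] == "assistant"),
        default=None,
    )
    pruned = []
    for i, m in enumerate(messages):
        if m["role"] == "assistant" and not (keep_last_assistant and i == last_assistant):
            continue
        pruned.append(m)
    return pruned
```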

training · efficiency · reasoning

go-$m$HC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices

Apr 2, 2026

Torque Dandachi, Sophia Diggs-Galligan

go-mHC enables efficient learned mixing of residual streams in transformers with a single tunable hyperparameter that trades off between speed and expressivity, potentially unlocking a new dimension for scaling model capacity.

This paper solves a mathematical problem in neural network design: how to efficiently mix information across different processing paths (residual streams) in transformers.

architecture · efficiency · scaling

Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference

Apr 2, 2026

Dimitrios Danopoulos, Enrico Lupi, Michael Kagan et al.

HCCS replaces softmax's expensive exponential computation with a lightweight linear approximation calibrated per attention head, enabling 8-bit integer inference on edge hardware without sacrificing model accuracy.

This paper proposes Head-Calibrated Clipped-Linear Softmax (HCCS), a fast approximation of softmax designed for edge devices running small quantized AI models.
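
A hedged sketch of what a clipped-linear surrogate could look like; `slope` and `offset` stand in for whatever per-head calibration the paper performs, and the exact parameterization is an assumption.

```python
import numpy as np

def clipped_linear_softmax(scores, slope, offset):
    # Replace exp() with an affine ramp clipped at zero, then normalize.
    # slope/offset would be calibrated per attention head against the
    # true softmax outputs.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    ramp = np.clip(slope * shifted + offset, 0.0, None)
    return ramp / np.maximum(ramp.sum(axis=-1, keepdims=True), 1e-9)

print(clipped_linear_softmax(np.array([[2.0, 1.0, 0.1]]), slope=0.5, offset=1.0))
```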

efficiency · architecture

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Apr 2, 2026

Gengsheng Li, Tianyu Yang, Junfeng Fang et al.

By intelligently routing training samples to different optimization strategies based on correctness, you can get the best of both fast learning and stable training—a practical improvement for post-training large language models.

This paper proposes Sample-Routed Policy Optimization (SRPO), a training method that combines two different approaches for fine-tuning language models: it routes correct outputs through a reward-based method and incorrect outputs through a distillation method.

training · reasoning · efficiency

Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency

Apr 2, 2026

Payal Fofadiya, Sunil Tiwari

Conversational agents perform better with selective memory management than unlimited retention; a relevance-guided forgetting framework improves long-horizon reasoning while reducing false memories and context bloat.

This paper tackles a key problem in conversational AI: agents need to remember past interactions to reason coherently, but storing everything causes performance to degrade and creates false memories. The authors propose a smart forgetting system that decides which memories to keep based on relevance, recency, and frequency—like a selective filing system for an agent's brain.

agents · reasoning · efficiency

Crystalite: A Lightweight Transformer for Efficient Crystal Modeling

Apr 2, 2026

Tin Hadži Veljković, Joshua Rosenthal, Ivor Lončarić et al.

By combining efficient tokenization with geometry-aware attention, you can build crystal generation models that are both faster and more accurate than complex graph neural networks, making generative modeling of materials more practical.

Crystalite is a lightweight diffusion Transformer for generating crystal structures that uses two key innovations: a compact atom representation called Subatomic Tokenization and a Geometry Enhancement Module that encodes crystal geometry directly into the model's attention mechanism.

architecture · efficiency · applications

Smoothing the Landscape: Causal Structure Learning via Diffusion Denoising Objectives

Apr 2, 2026

Hao Zhu, Di Zhou, Donna Slonim

Diffusion model denoising objectives can smooth optimization landscapes for causal discovery, enabling faster and more stable learning of causal structures in challenging high-dimensional datasets.

This paper proposes DDCD, a new method for discovering causal relationships in data by adapting diffusion model techniques. Instead of using diffusion to generate data, it uses the denoising process to learn causal structures (DAGs) more stably and efficiently than existing methods like NOTEARS, especially when data is high-dimensional or imbalanced.

reasoning · training · efficiency

VISTA: Visualization of Token Attribution via Efficient Analysis

Apr 2, 2026

Syed Ahmed, Bharathi Vokkaliga Ganesh, Jagadish Babu P et al.

You can now understand what tokens your LLM actually uses without doubling GPU memory or being locked into specific architectures—just remove tokens and measure the impact.

VISTA is a lightweight, model-agnostic technique for visualizing which tokens matter most in LLM predictions.
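
The core idea is occlusion-style attribution: remove one token and measure the score drop. A naive sketch follows; VISTA's efficiency tricks for batching these ablations are not shown, and `score_fn` is a stand-in for any callable that scores a token sequence.

```python
def token_attribution(score_fn, tokens):
    # A token's importance is the drop in the model's score when that
    # token is removed from the input.
    base = score_fn(tokens)
    return [base - score_fn(tokens[:i] + tokens[i + 1:])
            for i in range(len(tokens))]

# Toy scorer: sequence length, so every token gets credit of exactly 1.0.
print(token_attribution(lambda t: float(len(t)), ["The", "cat", "sat"]))
```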

efficiency · evaluation

Universal Hypernetworks for Arbitrary Models

Apr 2, 2026

Xuanfeng Zhou

A single fixed hypernetwork can generate weights for diverse architectures and tasks by using architecture/task descriptors as input, eliminating the need to retrain generators when switching between different model types.

This paper introduces Universal Hypernetworks (UHN), a single neural network that can generate weights for many different model architectures and tasks. Instead of building separate weight generators for each model type, UHN uses descriptors (text descriptions of architecture and task) to produce weights for any compatible model, working across vision, graphs, text, and math tasks.

architecture · training · efficiency

Universal YOCO for Efficient Depth Scaling

Apr 1, 2026

Yutao Sun, Li Dong, Tianzhu Ye et al.

You can scale LLM reasoning at inference time without exploding memory costs by combining efficient attention architectures with parameter sharing—YOCO-U shows this works better than either approach alone.

Universal YOCO combines a specialized decoder architecture with recursive computation to enable efficient test-time scaling in language models. By reusing parameters across multiple iterations in shallow layers while maintaining constant KV cache size, it achieves better reasoning capabilities without the computational overhead that typically comes with scaling inference-time compute.

efficiency · architecture · reasoning

LLM REgression with a Latent Iterative State Head

Apr 1, 2026

Yiheng Su, Matthew Lease

You can make LLMs predict continuous numeric values more efficiently by adding a tiny learned head that works with frozen representations, rather than decoding text or fine-tuning the entire model.

RELISH is a lightweight method for making LLMs predict numeric values directly from their internal representations. Instead of generating numbers as text, it uses a small learned component that iteratively refines a latent state through attention over token representations, then outputs a single number. It outperforms existing approaches while adding minimal parameters (0.01-0.04% overhead).

architecture · efficiency · applications

Embarrassingly Simple Self-Distillation Improves Code Generation

Apr 1, 2026

Ruixiang Zhang, Richard He Bai, Huangjie Zheng et al.

You can improve code generation by sampling from your model's own outputs and fine-tuning on them—no external tools needed. The gains come from balancing precision (removing bad options) with exploration (keeping useful diversity).

A simple technique called self-distillation improves code generation in large language models by having them sample their own outputs and fine-tune on those samples. The method boosts performance significantly (42.4% to 55.3% on benchmarks) without needing external verifiers or teacher models, and works across different model sizes and architectures.

training · efficiency · applications

A ROS 2 Wrapper for Florence-2: Multi-Mode Local Vision-Language Inference for Robotic Systems

Apr 1, 2026

J. E. Domínguez-Vidal

Florence-2 can now be easily integrated into robot software stacks through a standardized ROS 2 wrapper, enabling local vision-language inference on consumer GPUs without cloud dependencies.

This paper presents a ROS 2 software wrapper that integrates Florence-2, a vision-language model, into robotic systems for local inference.

applications · multimodal · efficiency

Screening Is Enough

Apr 1, 2026

Ken M. Nakanishi

Screening attention removes the need for global competition among keys by using absolute relevance thresholds, achieving 40% parameter reduction and 3.2× faster inference compared to Transformers.

This paper introduces Multiscreen, a language model architecture that replaces standard softmax attention with a 'screening' mechanism. Instead of distributing attention weights across all keys, screening evaluates each key against a threshold to decide which ones are relevant, eliminating the need for keys to compete with each other.
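
A minimal sketch of the screening idea, assuming uniform weighting of the keys that pass the threshold; the paper's actual weighting may differ. The point is that each key's decision is independent, with no softmax coupling across keys.

```python
import numpy as np

def screening_attention(q, K, V, threshold=0.0):
    # Each key passes or fails an absolute relevance test instead of
    # competing through softmax. Passing values are averaged uniformly.
    scores = K @ q / np.sqrt(q.shape[-1])
    mask = scores > threshold          # per-key decision, no global coupling
    if not mask.any():
        return np.zeros_like(V[0])
    return V[mask].mean(axis=0)
```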

architecture · efficiency · scaling

Online Reasoning Calibration: Test-Time Training Enables Generalizable Conformal LLM Reasoning

Apr 1, 2026

Cai Zhou, Zekai Wang, Menghua Wu et al.

ORCA calibrates LLM reasoning in real-time by adapting confidence estimates per input, enabling 40-67% compute savings during inference while providing mathematical guarantees on error rates across different reasoning tasks and domains.

This paper introduces ORCA, a framework that makes language models more efficient during reasoning by calibrating their sampling process. Using test-time training and conformal prediction, ORCA learns to estimate confidence in its own reasoning steps, reducing wasted computation while maintaining accuracy—saving up to 47% compute on in-distribution tasks and 67% on out-of-distribution problems.

reasoning · efficiency · evaluation

Adaptive Block-Scaled Data Types

Mar 30, 2026

Jack Cook, Hyemin S. Lee, Kathryn Le et al.

Adaptive block-scaled quantization can significantly reduce errors in 4-bit model compression by intelligently switching between data types per block, achieving better accuracy than fixed formats without extra storage cost.

This paper introduces adaptive quantization formats (IF4, IF3, IF6) that improve upon NVFP4 by dynamically choosing between floating-point and integer representations for each block of values. The approach uses an unused bit in NVFP4 to signal which format to use, reducing quantization errors and improving language model performance with minimal hardware overhead.
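
The mechanism can be sketched as a per-block error comparison between a float-style grid and an integer grid, with one flag per block. The grids below approximate FP4/INT4 value sets, and the selection rule is my assumption about how the spare bit is used.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_GRID[::-1], FP4_GRID])   # signed E2M1-style values
INT4_GRID = np.arange(-7, 8, dtype=float)

def quantize_block(block, grid):
    # Scale the block so its largest value maps onto the grid's range,
    # then round each entry to the nearest grid point.
    scale = float(np.abs(block).max() / np.abs(grid).max()) or 1.0
    idx = np.abs(block[:, None] / scale - grid[None, :]).argmin(axis=1)
    return grid[idx] * scale

def adaptive_quantize(x, block=16):
    # x: 1-D float array. Try both grids per block, keep the one with
    # lower reconstruction error, and record one flag bit per block.
    out, flags = np.empty_like(x), []
    for i in range(0, len(x), block):
        b = x[i:i + block]
        fp, it = quantize_block(b, FP4_GRID), quantize_block(b, INT4_GRID)
        use_int = np.abs(b - it).sum() < np.abs(b - fp).sum()
        out[i:i + block] = it if use_int else fp
        flags.append(bool(use_int))
    return out, flags

xq, flags = adaptive_quantize(np.random.randn(64))
```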

efficiency · training · architecture

On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Mar 30, 2026

Omer Dahary, Benaya Koren, Daniel Garibi et al.

You can increase diversity in generated images by applying repulsion forces in the transformer's attention channels during generation, without expensive optimization or visual artifacts.

This paper tackles the problem of text-to-image diffusion models producing visually similar outputs for the same prompt. The authors propose a method that applies 'repulsion' in the attention mechanism during image generation to encourage diverse outputs while maintaining quality and semantic accuracy.

architecture · efficiency · multimodal

Temporal Credit Is Free

Mar 30, 2026

Aur Shalev Merin

Online learning in RNNs doesn't require sophisticated credit assignment algorithms—proper gradient normalization with immediate derivatives is sufficient and dramatically more memory-efficient.

Recurrent networks can learn online using simple immediate derivatives instead of expensive backpropagation-through-time. The key insight: the hidden state naturally carries temporal information forward, so you just need proper gradient normalization and avoid stale memory traces. This approach matches or beats complex algorithms while using 1000x less memory.

training · efficiency

Rethinking Language Model Scaling under Transferable Hypersphere Optimization

Mar 30, 2026

Liliang Ren, Yang Liu, Yelong Shen et al.

Hypersphere-constrained optimization enables predictable scaling of language models with a single transferable learning rate, eliminating expensive hyperparameter retuning when scaling up and improving training stability.

This paper introduces HyperP, a framework for scaling language models more efficiently by constraining weights to a hypersphere during training. The key innovation is showing that a single learning rate tuned at small scale transfers reliably across different model sizes, depths, and training amounts—achieving 1.58× better compute efficiency while maintaining training stability.

training · scaling · efficiency

Stepwise Credit Assignment for GRPO on Flow-Matching Models

Mar 30, 2026

Yash Savani, Branislav Kveton, Yuchen Liu et al.

Stepwise credit assignment—rewarding each diffusion step for its own improvement rather than the final result—makes RL training of image generators more efficient and faster to converge.

This paper improves reinforcement learning for image generation models by assigning credit more intelligently across diffusion steps. Instead of treating all steps equally, it recognizes that early steps handle composition while late steps refine details, then rewards each step based on its specific contribution. This leads to faster learning and better sample efficiency.

training · reasoning · efficiency

GPU-Accelerated Optimization of Transformer-Based Neural Networks for Real-Time Inference

Mar 30, 2026

Soutrik Mukherjee, Sangwhan Cha

Hybrid precision (FP32 for softmax/normalization, FP16 for linear layers) delivers 2x speedup with zero accuracy loss—a practical strategy for deploying transformers in latency-critical applications.

This paper optimizes transformer models (BERT and GPT-2) for fast GPU inference using mixed-precision techniques—keeping sensitive operations in full precision while using lower precision for others. The system achieves 64x speedup over CPU and sub-10ms latency while maintaining numerical accuracy and eliminating instability issues.
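
The precision split is straightforward to sketch, with numpy standing in for the GPU kernels the paper actually benchmarks: run the big matmul in FP16, then upcast before the numerically sensitive softmax and normalization steps.

```python
import numpy as np

def hybrid_block(x, w):
    h = x.astype(np.float16) @ w.astype(np.float16)     # FP16 linear layer
    h32 = h.astype(np.float32)                          # upcast before sensitive ops
    e = np.exp(h32 - h32.max(axis=-1, keepdims=True))   # FP32 softmax, stable
    attn = e / e.sum(axis=-1, keepdims=True)
    mu = attn.mean(axis=-1, keepdims=True)              # FP32 normalization
    sd = attn.std(axis=-1, keepdims=True)
    return (attn - mu) / (sd + 1e-6)

out = hybrid_block(np.random.randn(4, 64), np.random.randn(64, 64))
```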

efficiency · architecture

S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

Mar 26, 2026

Ligong Han, Hao Wang, Han Gao et al.

You can make diffusion-based language models much faster by intelligently deciding when to verify generated tokens, using the same model in two different modes without retraining.

S2D2 speeds up block-diffusion language models by combining parallel token generation with selective verification steps. The method reuses the same pretrained model in two modes—as a fast parallel generator and as a careful single-token verifier—without requiring additional training, achieving up to 4.7× speedup over standard autoregressive decoding.

efficiency · reasoning

A Unified Memory Perspective for Probabilistic Trustworthy AI

Mar 26, 2026

Xueji Zhao, Likai Pei, Jianbo Liu et al.

Memory access, not computation speed, limits performance in probabilistic AI systems—hardware designers need to optimize for both data delivery and randomness generation together, not separately.

This paper examines how memory systems become the performance bottleneck in AI systems that need probabilistic computation for safety and robustness. It proposes treating deterministic data access as a special case of stochastic sampling, creating a unified framework to analyze memory efficiency.

efficiency · safety · architecture

On Neural Scaling Laws for Weather Emulation through Continual Training

Mar 26, 2026

Shashank Subramanian, Alexander Kiefer, Arnur Nigmetov et al.

Neural scaling laws can predict weather model performance and guide efficient resource allocation—models trained with periodic cooldowns outperform standard approaches and enable longer, more accurate forecasts.

This paper studies how neural networks for weather forecasting improve as you scale up the model size, training data, and compute.

scaling · efficiency · training

Polynomial Speedup in Diffusion Models with the Multilevel Euler-Maruyama Method

Mar 25, 2026

Arthur Jacot

You can sample from diffusion models much faster by combining predictions from small and large networks—the method achieves the same accuracy as running the largest network once, instead of many times.

This paper speeds up diffusion model sampling by using multiple neural networks of different sizes together. Instead of running one large network many times, the method runs a small fast network many times and a large accurate network just a few times, reducing total computation while maintaining quality. Tests show up to 4x speedup on image generation.

efficiency · architecture

DreamerAD: Efficient Reinforcement Learning via Latent World Model for Autonomous Driving

Mar 25, 2026

Pengxuan Yang, Yupeng Zheng, Deheng Qian et al.

Latent world models can dramatically speed up RL training for autonomous driving by replacing expensive multi-step diffusion with single-step latent sampling, making imagination-based policy training practical.

DreamerAD uses a latent world model to train autonomous driving policies 80x faster than previous diffusion-based approaches. Instead of generating full images during training, it compresses the diffusion process to a single step by working with compressed latent features, enabling safe, efficient reinforcement learning on driving tasks without real-world testing.

efficiency · reasoning · agents

Trust Region Constrained Bayesian Optimization with Penalized Constraint Handling

Mar 25, 2026

Raju Chowdhury, Tanmay Sen, Prajamitra Bhuyan et al.

Trust regions combined with penalty-based constraints enable Bayesian optimization to find feasible solutions faster in high-dimensional constrained problems where evaluations are expensive.

This paper presents a Bayesian optimization method for expensive black-box optimization problems with constraints. It combines penalty-based constraint handling, surrogate modeling, and trust regions to efficiently find good solutions in high dimensions with fewer evaluations.

efficiency · training

VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

Mar 24, 2026

Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas et al.

You can make vision-language models faster without losing visual detail by being selective about which attention layers process images—use efficient cross-attention for context and add self-attention layers only when the task complexity demands it.

VISOR improves vision-language model efficiency by selectively attending to visual information rather than compressing images. Instead of reducing visual tokens, it uses sparse cross-attention and dynamically chosen self-attention layers to process high-resolution details only when needed, reducing computation while maintaining performance on complex visual reasoning tasks.

efficiency · multimodal · architecture

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Mar 24, 2026

Haoyu Huang, Jinfa Huang, Zhongwei Wan et al.

A smaller speculative model can predict an agentic system's tool-calling trajectory, enabling parallel execution and early termination of expensive operations—delivering significant speedups without accuracy loss.

SpecEyes speeds up agentic multimodal AI systems by using a lightweight model to predict what tools the main model will need, allowing expensive operations to be skipped or run in parallel. This cuts latency by 1.1-3.35x while maintaining accuracy, solving a key bottleneck in systems like OpenAI o3 that repeatedly invoke vision tools.

efficiency · multimodal · agents

Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions

Mar 24, 2026

Rustem Islamov, Grigory Malinovsky, Alexander Gaponov et al.

You can now build federated learning systems that defend against both Byzantine attacks and privacy breaches simultaneously, without needing unrealistic assumptions like bounded gradients or extra server datasets.

This paper tackles two critical security issues in federated learning: protecting against malicious servers (Byzantine attacks) and preventing data leakage (differential privacy).

safety · training · efficiency

InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting

Mar 24, 2026

Duc Vu, Kien Nguyen, Trong-Tung Nguyen et al.

You can dramatically improve few-step diffusion inpainting by initializing the noise with semantic information from the input image, rather than random noise—no retraining required.

InverFill speeds up image inpainting by using a smart noise initialization technique that preserves semantic information from the original image. Instead of training new models, it works with existing fast text-to-image models to fill in masked regions with better quality and fewer processing steps.

efficiency · architecture

End-to-End Efficient RL for Linear Bellman Complete MDPs with Deterministic Transitions

Mar 24, 2026

Zakaria Mhammedi, Alexander Rakhlin, Nneka Okolo

For a well-structured class of RL problems, you can now learn optimal policies efficiently using linear models without needing special oracles or being limited to tiny action spaces.

This paper solves a key challenge in reinforcement learning: how to efficiently learn good policies when using linear function approximation in a specific class of environments (linear Bellman complete MDPs). The researchers provide an algorithm that works with both small and large action spaces, achieving polynomial time and sample complexity—meaning it scales reasonably with problem size.

efficiency · reasoning

Similarity-Aware Mixture-of-Experts for Data-Efficient Continual Learning

Mar 24, 2026

Connor Mclaughlin, Nigel Lee, Lili Su

When deploying models that learn from new tasks with scarce data, routing samples intelligently based on task similarity prevents negative interference while maximizing knowledge reuse across overlapping tasks.

This paper tackles continual learning when tasks have limited data and may overlap unpredictably. The authors propose an adaptive mixture-of-experts system that learns which tasks are similar and routes data accordingly, using two key techniques: gradually introducing task-specific prompts over time and identifying which samples fit existing patterns versus need new ones.

efficiency · architecture

WorldCache: Content-Aware Caching for Accelerated Video World Models

Mar 23, 2026

Umair Nawaz, Ahmed Heakl, Ufaq Khan et al.

Smart feature caching with motion awareness can dramatically accelerate video world models without retraining, but requires adaptive thresholds and blending rather than static feature reuse.

WorldCache speeds up video generation from diffusion transformers by intelligently reusing computed features across denoising steps. Instead of naively reusing old features, it adapts based on motion and visual importance, using blending and warping to keep videos smooth and artifact-free—achieving 2.3× speedup with minimal quality loss.

efficiency · architecture · evaluation

End-to-End Training for Unified Tokenization and Latent Denoising

Mar 23, 2026

Shivam Duggal, Xingjian Bai, Zongze Wu et al.

You can train tokenization and image generation together from scratch using a single model with shared weights, simplifying the pipeline and reducing training complexity while maintaining quality.

This paper proposes UNITE, a new way to train image generation models more efficiently by combining tokenization and diffusion in a single training stage.

architecture · training · efficiency

Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels

Mar 23, 2026

Alexandra Zelenin, Alexandra Zhuravlyova

If you're using DoRA for high-rank fine-tuning on limited GPU memory, these optimizations make it practical by cutting peak memory usage by up to 7 GB and doubling speed without changing the model's behavior.

DoRA is a fine-tuning method that adapts model weights by separating magnitude from direction, but computing its forward pass requires materializing large dense matrices that consume massive GPU memory.

efficiency · training

Confidence-Based Decoding is Provably Efficient for Diffusion Language Models

Mar 23, 2026

Changxiao Cai, Gen Li

Confidence-based decoding in diffusion models is provably efficient and adapts automatically to data complexity, offering a theoretical foundation for why this practical strategy works well.

This paper proves that confidence-based decoding—a strategy that decides which tokens to generate next in diffusion language models based on prediction confidence—is theoretically efficient.

efficiency · reasoning · training

MemDLM: Memory-Enhanced DLM Training

Mar 23, 2026

Zehua Pei, Hui-Ling Zhen, Weizhe Lin et al.

Diffusion language models can be trained more effectively by embedding a simulated denoising trajectory into training, and this memory mechanism can be reused at inference time to improve long-context retrieval tasks.

This paper addresses a key problem in diffusion language models: they're trained one way (predicting masked tokens) but used differently (multi-step denoising). MemDLM fixes this mismatch by simulating the denoising process during training using a memory mechanism that learns from each sample's trajectory, leading to faster training and better long-context performance.

training · architecture · efficiency

Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models

Mar 20, 2026

Qi Cao, Andrew Gambardella, Takeshi Kojima et al.

You can measure LLM uncertainty efficiently with just one forward pass by clustering semantically similar tokens, avoiding the computational cost of sampling-based or auxiliary model approaches.

This paper proposes Semantic Token Clustering (STC), a fast method to measure how confident an LLM should be in its answers. Instead of running the model multiple times or using extra models, STC groups similar tokens together and checks if the model's top prediction comes from a coherent semantic cluster. It works in a single pass and catches cases where models are overconfident.
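
A one-pass sketch of the idea, using a simple similarity threshold where the paper would use its actual clustering: sum the probability mass of top candidates that are semantically close to the argmax token.

```python
import numpy as np

def cluster_confidence(probs, token_embs, top_k=10, sim_thresh=0.8):
    # Gather the top-k candidate tokens for the next position.
    top = np.argsort(probs)[::-1][:top_k]
    embs = token_embs[top]
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    # Accumulate mass of candidates close to the argmax token; thresholded
    # cosine similarity stands in for STC's actual clustering.
    mass = probs[top[0]]
    for i in range(1, len(top)):
        if embs[0] @ embs[i] > sim_thresh:
            mass += probs[top[i]]
    return mass   # high mass: one coherent answer; low mass: diffuse, uncertain
```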

efficiency · evaluation

Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD

Mar 20, 2026

Emiel Hoogeboom, David Ruhe, Jonathan Heek et al.

Discrete diffusion models can now be distilled into faster generators using moment matching, enabling practical deployment with fewer sampling steps while maintaining quality.

This paper solves the problem of making discrete diffusion models faster by distilling them into simpler models. Unlike continuous diffusion models which have many distillation techniques, discrete diffusion (used for text and images) has been hard to compress.

efficiency · training · architecture

F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Mar 19, 2026

Ziyin Zhang, Zihan Liao, Hang Yu et al.

You can now use smaller, faster embedding models for multilingual search and retrieval without sacrificing quality—F2LLM-v2 offers efficient options for resource-constrained deployments while the largest variant ranks first on major benchmarks.

F2LLM-v2 is a family of multilingual embedding models (80M to 14B parameters) trained on 60 million high-quality samples that support 200+ languages, including underserved low-resource ones. Using matryoshka learning and knowledge distillation, these models achieve top performance on benchmarks while being more efficient than previous LLM-based embeddings.

multimodal · efficiency · training

Spectrally-Guided Diffusion Noise Schedules

Mar 19, 2026

Carlos Esteves, Ameesh Makadia

By tailoring noise schedules to each image's spectral content, you can generate higher-quality images with fewer denoising steps, making diffusion models faster and more efficient.

This paper proposes a smarter way to design noise schedules for diffusion models by analyzing the spectral properties of images. Instead of using the same handcrafted noise schedule for all images, the method creates custom schedules for each image that eliminate unnecessary denoising steps, improving generation quality especially when using fewer sampling steps.

efficiency · architecture · training

Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

Mar 19, 2026

Zhuolin Yang, Zihan Liu, Yang Chen et al.

You can build highly capable reasoning models with far fewer active parameters by combining domain-specific reinforcement learning with multi-domain distillation—this model matches frontier performance with 20x fewer parameters.

Nemotron-Cascade 2 is a 30B parameter model with only 3B active parameters that achieves top-tier reasoning and coding performance comparable to much larger models.

training · reasoning · efficiency

Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Mar 19, 2026

Shang-Jui Ray Kuo, Paola Cascante-Bonilla

State space models are a viable and more efficient alternative to vision transformers for vision-language models, challenging the assumption that transformers are necessary for this task.

This paper tests whether state space models (SSMs) can replace vision transformers as the visual backbone in vision-language models. The researchers find that SSM-based vision encoders match or outperform transformer-based encoders on VQA and visual grounding tasks, while using fewer parameters. They also identify instability issues in some backbones and propose fixes to improve robustness.

architecture · multimodal · efficiency

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Mar 19, 2026

Edward Lin, Sahil Modi, Siva Kumar Sastry Hari et al.

Instead of comparing kernels to other software implementations, this benchmark measures how close optimized kernels get to theoretical hardware limits—giving AI systems a clear, unchanging target for optimization rather than a moving baseline.

SOL-ExecBench is a benchmark for evaluating GPU kernel optimization that measures performance against hardware limits rather than software baselines. It includes 235 CUDA kernels from real AI models and uses analytically derived 'Speed-of-Light' bounds to create fixed optimization targets, enabling fair evaluation of AI systems that generate and optimize code.

evaluation · efficiency · agents

DyMoE: Dynamic Expert Orchestration with Mixed-Precision Quantization for Efficient MoE Inference on Edge

Mar 19, 2026

Yuegui Huang, Zhiyuan Fang, Weiqi Luo et al.

By dynamically quantizing less important experts and prefetching memory strategically, DyMoE achieves 3-22x faster inference on edge devices without sacrificing accuracy—making large MoE models practical for real-time edge deployment.

DyMoE optimizes Mixture-of-Experts (MoE) models for edge devices by dynamically adjusting precision during inference. It identifies that some experts matter more than others and uses this insight to apply lower precision to less critical experts while keeping important ones at higher precision, combined with smart memory prefetching to reduce delays.

efficiency · architecture

cuGenOpt: A GPU-Accelerated General-Purpose Metaheuristic Framework for Combinatorial Optimization

Mar 19, 2026

Yuyang Liu

GPU acceleration can make general-purpose optimization solvers orders of magnitude faster than traditional solvers, while remaining flexible enough for domain-specific customization through a Python interface.

cuGenOpt is a GPU-accelerated framework for solving combinatorial optimization problems (like routing and scheduling) that balances generality, speed, and ease of use. It uses CUDA to run multiple solution attempts in parallel, lets experts add custom solvers, and includes an AI assistant that converts plain-English problem descriptions into working code.

efficiency · applications

Optimal Splitting of Language Models from Mixtures to Specialized Domains

Mar 19, 2026

Skyler Seto, Pierre Ablin, Anastasiia Filippova et al.

You can train better domain-specific models by mathematically optimizing how many tokens to spend on general pretraining versus specialized training, rather than using a fixed two-stage recipe.

This paper shows how to efficiently train multiple specialized language models by splitting compute between general pretraining and domain-specific training. Using scaling laws, the authors predict optimal token allocation for each stage, improving performance on reasoning and knowledge tasks across different model sizes.

training · scaling · efficiency

D5P4: Partition Determinantal Point Process for Diversity in Parallel Discrete Diffusion Decoding

Mar 19, 2026

Jonathan Lys, Vincent Gripon, Bastien Pasdeloup et al.

D5P4 enables discrete diffusion models to generate diverse text outputs efficiently by using a principled diversity mechanism during decoding, with minimal computational overhead compared to standard approaches.

This paper improves how discrete diffusion models generate text by introducing D5P4, a new decoding method that generates multiple candidate outputs in parallel while controlling diversity.

efficiency · architecture · evaluation

Enhancing Pretrained Model-based Continual Representation Learning via Guided Random Projection

Mar 19, 2026

Ruilin Li, Heming Zou, Xiufeng Yan et al.

Using data-guided projection instead of random initialization makes continual learning more stable and effective, especially when there's a big gap between pretrained model knowledge and new tasks.

This paper improves how pretrained models learn continuously on new tasks by replacing random projection layers with a smarter, data-guided approach. Instead of randomly initializing the projection layer, the method selectively builds it based on the target data, creating more stable and expressive representations when learning new classes incrementally without storing old examples.

training · efficiency

From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models

Mar 19, 2026

Zhuofan Li, Hongkun Yang, Zhenyang Chen et al.

When building embodied AI systems, measure what actually matters: task completion time, motion quality, and energy use—not just model size or inference speed. Optimizing the wrong metrics can make robots perform worse in practice.

This paper shows that traditional efficiency metrics (parameters, computation) for vision-language-action robots don't match real-world performance. The researchers measured actual robotic execution—task time, motion smoothness, energy use—and found that methods optimizing for conventional metrics often make robots move worse or take longer, even when task success stays the same.

efficiency · evaluation · applications

LuMamba: Latent Unified Mamba for Electrode Topology-Invariant and Efficient EEG Modeling

Mar 19, 2026

Danaé Broustail, Anna Tegon, Thorir Mar Ingolfsson et al.

State-space models (Mamba) enable efficient EEG foundation models that work across varying electrode setups—crucial for real-world clinical deployment where equipment differs across hospitals.

LuMamba is an efficient EEG foundation model that handles different electrode configurations by combining topology-invariant encodings with linear-complexity state-space modeling. Pre-trained on 21,000+ hours of unlabeled EEG data, it achieves strong performance on clinical tasks while using 377× fewer computations than transformer-based alternatives.

efficiency · architecture · training

Communication-Efficient and Robust Multi-Modal Federated Learning via Latent-Space Consensus

Mar 19, 2026

Mohamed Badi, Chaouki Ben Issaid, Mehdi Bennis

When building federated systems with multi-modal data, you can align different data types in a shared compressed space using learnable projections, reducing both communication overhead and the need for all devices to use identical architectures.

This paper presents CoMFed, a federated learning system that lets multiple devices train together on different types of data (like video and audio) without sharing raw information. It uses compressed representations and alignment techniques to handle the challenge of different devices having different data types and model structures, while keeping communication costs low.

multimodal · efficiency

Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

Mar 19, 2026

Yikai Zheng, Xin Ding, Yifan Yang et al.

Decoupling semantic understanding from real-time perception—parsing queries once and matching embeddings continuously—solves the efficiency-accuracy tradeoff in proactive video understanding systems.

Em-Garde is a framework for understanding streaming video that responds to user queries efficiently. Instead of checking every frame, it converts user questions into visual proposals and matches them against the video stream using fast embedding comparisons, achieving better accuracy and speed than existing approaches.

multimodal · efficiency · reasoning

Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

Mar 18, 2026

Jianrui Zhang, Yue Yang, Rohun Tripathi et al.

You can prune half of video tokens across both vision and language components without complex mechanisms, gaining significant speed improvements (62%) while maintaining performance—making video VLMs practical for real-world deployment.

This paper introduces a method to speed up video understanding models by removing redundant visual information. The technique scores and removes 50% of unnecessary visual tokens across the entire model architecture, achieving 62% faster processing with minimal accuracy loss on video question-answering tasks.

efficiency · multimodal · architecture

LoST: Level of Semantics Tokenization for 3D Shapes

Mar 18, 2026

Niladri Shekhar Dutt, Zifan Shi, Paul Guerrero et al.

By tokenizing 3D shapes based on semantic importance rather than spatial detail levels, you can train autoregressive 3D generation models that are 10-1000x more token-efficient while maintaining or improving quality.

LoST is a new way to break down 3D shapes into tokens (small pieces) for AI models to process. Instead of using spatial hierarchies like existing methods, it orders tokens by semantic importance—so early tokens capture the main shape, and later tokens add fine details. This makes 3D generation models much more efficient, using 90-99% fewer tokens than previous approaches.

architecture · efficiency · multimodal

Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training

Mar 18, 2026

Ben S. Southworth, Stephen Thomas

MUD offers 1.3-3x faster token throughput than Muon with similar final performance, making it a practical drop-in replacement for faster transformer training without sacrificing convergence.

MUD is a faster alternative to Muon, an optimizer that speeds up transformer training. Instead of using expensive matrix operations to smooth momentum updates, MUD uses a simpler triangular approach inspired by classical numerical methods. This cuts optimizer overhead by 30-70% while maintaining training speed, making transformers train 10-50% faster in real time.

training · efficiency

VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

Mar 18, 2026

Mohamed Eltahir, Ali Habibullah, Yazan Alshoibi et al.

By treating video as a navigable hierarchical structure instead of converting it to text, you can process 10-hour videos with minimal accuracy loss while using compute that scales logarithmically with duration.

VideoAtlas is a system for understanding long videos efficiently by representing them as a hierarchical grid that can be zoomed into recursively, rather than converting video to text.

efficiency · multimodal · agents

Unified Policy Value Decomposition for Rapid Adaptation

Mar 18, 2026

Cristiano Capone, Luca Falorsi, Andrea Ciardiello et al.

By decomposing policies and value functions into frozen basis functions weighted by a shared low-dimensional goal embedding, agents can adapt to novel tasks instantly without retraining, enabling rapid transfer in complex control problems.

This paper presents a method for quickly adapting reinforcement learning agents to new tasks by sharing a low-dimensional goal embedding between policy and value functions.

efficiency · reasoning

Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing

Mar 18, 2026

Raghavv Goel, Mukul Gagrani, Mingu Lee et al.

You can make LLMs generate text faster by predicting multiple tokens simultaneously using a training-free probing technique—no model modifications or extra models needed.

This paper shows that LLMs can predict multiple future tokens at once without retraining, by using special "mask tokens" to probe the model's internal representations. The approach generates candidate tokens in parallel and verifies them together, speeding up text generation by 15-19% while maintaining quality.

efficiency

Only relative ranks matter in weight-clustered large language models

Mar 18, 2026

Borja Aizpurua, Sukhbinder Singh, Román Orús

LLM weights can be compressed to just 16-64 unique values per matrix without retraining by preserving relative rank order, enabling simple disk compression and revealing that rank structure—not magnitude—is what drives model behavior.

This paper shows that LLMs don't need exact weight values—only the relative ordering of weights matters. By clustering weights into 16-64 shared values per matrix, the authors compress models like Llama 3.1-8B without retraining. They prove this by scrambling weight values while preserving rank order, finding that rank matters far more than precise magnitudes for model performance.
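
Any monotone many-to-one mapping preserves rank order, so quantile-based clustering gives a simple sketch of the compression; the authors' clustering may differ, but the rank-preservation property is the same.

```python
import numpy as np

def cluster_weights(w, n_values=32):
    # Replace each weight with its quantile bin's representative value.
    # The mapping is monotone, so relative rank order is preserved.
    flat = w.ravel()
    edges = np.quantile(flat, np.linspace(0, 1, n_values + 1))
    centers = 0.5 * (edges[:-1] + edges[1:])
    idx = np.clip(np.searchsorted(edges, flat, side="right") - 1, 0, n_values - 1)
    return centers[idx].reshape(w.shape)

w = np.random.randn(256, 256)
wq = cluster_weights(w)
print(len(np.unique(wq)))   # at most 32 distinct values per matrix
```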

efficiency · evaluation

Efficient Reasoning on the Edge

Mar 17, 2026

Yelysei Bondarenko, Thomas Hehn, Rob Hesselink et al.

You can run reasoning-capable LLMs on mobile devices by using LoRA adapters with reinforcement learning to shorten reasoning traces, parallel decoding to reduce latency, and smart KV-cache management—achieving near-full-model accuracy with a fraction of the memory.

This paper makes LLM reasoning practical for mobile devices by combining lightweight LoRA adapters with techniques like budget forcing (to shorten responses), parallel decoding (to speed up generation), and dynamic adapter switching (to activate reasoning only when needed). The result is accurate chain-of-thought reasoning on edge devices without the memory overhead of full models.

efficiency · reasoning · training

SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation

Mar 17, 2026

Jiongze Yu, Xiangbo Gao, Pooja Verlani et al.

Interactive video processing is now practical: users can control AI video enhancement by editing sparse keyframes, and the system intelligently propagates those edits across the full video sequence.

SparkVSR lets users interactively improve low-quality videos by editing a few keyframes, then automatically applies those improvements across the entire video. Instead of treating video enhancement as a black box, users can manually fix specific frames and the system propagates those corrections while keeping the video grounded in the original motion.

multimodal · applications · efficiency

Online Experiential Learning for Language Models

Mar 17, 2026

Tianzhu Ye, Li Dong, Qingxiu Dong et al.

Language models can improve themselves in production by learning from actual user interactions—extracting knowledge from deployment experience and feeding it back into training without requiring access to the original environment.

This paper introduces Online Experiential Learning (OEL), a system that lets language models continuously improve by learning from real interactions during deployment. Instead of relying only on offline training data, OEL extracts useful knowledge from user interactions, then updates the model with this knowledge without needing access to the original environment.

training · reasoning · efficiency

Unifying Optimization and Dynamics to Parallelize Sequential Computation: A Guide to Parallel Newton Methods for Breaking Sequential Bottlenecks

Mar 17, 2026

Xavier Gonzalez

Sequential neural network and sampling computations can be parallelized across sequence length using Newton's method, but success depends on the system's dynamical stability properties.

This work shows how to parallelize sequential computations like RNNs and MCMC by reformulating them as equation-solving problems solvable with Newton's method. It develops faster, more stable parallel algorithms and proves when parallelization actually speeds things up—determined by a system's Lyapunov exponent.

efficiency · reasoning

Mixture-of-Depths Attention

Mar 16, 2026

Lianghui Zhu, Yuxin Fang, Bencheng Liao et al.

MoDA lets deep language models selectively attend to earlier layers, preventing information loss as models get deeper while adding only 3.7% computational overhead.

This paper introduces Mixture-of-Depths Attention (MoDA), a mechanism that lets attention heads skip layers by accessing key-value pairs from both the current and earlier layers. This solves a problem in very deep language models where useful information gets diluted as it passes through many layers.

architecture · efficiency · scaling

SmartSearch: How Ranking Beats Structure for Conversational Memory Retrieval

Mar 16, 2026

Jesper Derehag, Carlos Calva, Timmy Ghiurau

Smart ranking of retrieved candidates matters more than upfront structuring—a simple deterministic pipeline with just one learned ranking component outperforms complex memory systems on conversational retrieval tasks.

SmartSearch retrieves relevant information from raw conversation history without complex structuring or learned policies. It combines simple matching, rule-based expansion, and ranking to find evidence efficiently, achieving 93.5% accuracy on benchmarks while using 8.5x fewer tokens than baselines.

efficiency · reasoning

Effective Distillation to Hybrid xLSTM Architectures

Mar 16, 2026

Lukas Hauzenberger, Niklas Schmidinger, Thomas Schmied et al.

You can now distill transformer-based LLMs into more efficient xLSTM architectures without significant performance degradation, making it practical to deploy smaller, cheaper models that match their larger teachers.

This paper shows how to effectively compress large language models into smaller xLSTM models while preserving performance. The researchers developed a distillation pipeline that combines multiple specialized experts into a single efficient model, successfully distilling models from Llama, Qwen, and Olmo families with minimal performance loss.

efficiency · architecture · training

Unbiased and Biased Variance-Reduced Forward-Reflected-Backward Splitting Methods for Stochastic Composite Inclusions

Mar 16, 2026

Quoc Tran-Dinh, Nghia Nguyen-Trung

The paper introduces practical variance-reduction techniques that significantly reduce the number of gradient computations needed to solve stochastic optimization problems, with proven convergence guarantees and real-world applications in machine learning.

This paper develops new optimization techniques for solving complex stochastic problems by combining variance reduction (reducing noise in gradient estimates) with a splitting method called forward-reflected-backward splitting.

training · efficiency

Co-Design of Memory-Storage Systems for Workload Awareness with Interpretable Models

Mar 16, 2026

Jay Sarkar, Vamsi Pavan Rayaprolu, Abhijeet Bhalerao

Using interpretable ML to co-design storage hardware and firmware together—rather than separately—helps engineers make better architectural decisions by understanding how memory, error handling, and workloads interact.

This paper describes how machine learning can optimize the design of solid-state drives (SSDs) by modeling how error management algorithms interact with memory components under different workloads. The researchers built an interpretable ML framework that analyzes thousands of real SSDs to guide hardware design decisions, enabling better performance and reliability trade-offs.

architecture · efficiency · evaluation

Mamba-3: Improved Sequence Modeling using State Space Principles

Mar 16, 2026

Aakash Lahoti, Kevin Y. Li, Berlin Chen et al.

Mamba-3 shows that linear models can match Transformer quality on real tasks by using complex-valued state tracking and better architectural design, opening a path to cheaper inference without sacrificing capability.

Mamba-3 improves linear sequence models by using state space principles to handle tasks that require tracking information over time. Unlike Transformers that are slow to run, Mamba-3 maintains constant memory and linear compute while matching quality on language tasks—making it faster and cheaper to deploy.

architecture · efficiency · reasoning

Estimating Staged Event Tree Models via Hierarchical Clustering on the Simplex

Mar 16, 2026

Muhammad Shoaib, Eva Riccomagno, Manuele Leonelli et al.

For building staged tree models at scale, use Total Variation divergence with Ward.D2 hierarchical clustering—it matches the accuracy of slower methods like Backward Hill Climbing but runs significantly faster.

This paper presents a new method for building staged tree models—a type of probabilistic graphical model that captures context-specific patterns in data. The approach uses hierarchical clustering on probability distributions, comparing different distance metrics and clustering strategies.

training · efficiency · evaluation

MXNorm: Reusing MXFP block scales for efficient tensor normalisation

Mar 13, 2026

Callum McLean, Luke Y. Prince, Alexandre Payot et al.

You can speed up neural network training by 1-3% by reusing computation from low-precision matrix operations for normalization, with no accuracy loss.

This paper proposes MXNorm, a faster alternative to RMSNorm (a standard layer normalization technique) that reuses scale information already computed during low-precision matrix multiplication. By avoiding redundant calculations, MXNorm achieves 2.4x speedups in normalization while maintaining training accuracy on Llama models.
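
A rough numpy sketch of the reuse idea: approximate the RMS statistic from per-block scales the quantizer already produced, instead of re-reading the full tensor. The constant relating max-abs block scales to the RMS is a crude assumption; the paper derives its estimator properly.

```python
import numpy as np

def mx_block_scales(x, block=32):
    # Per-block power-of-two scales, as an MXFP quantizer would compute
    # (x length assumed divisible by the block size).
    blocks = x.reshape(-1, block)
    return 2.0 ** np.ceil(np.log2(np.abs(blocks).max(axis=1) + 1e-30))

def mxnorm(x, block=32):
    # Reuse the quantizer's scales to estimate the RMS; the 0.5 factor
    # converting max-abs scales to an RMS estimate is a placeholder.
    scales = mx_block_scales(x, block)
    rms_est = np.sqrt((scales ** 2).mean()) * 0.5
    return x / (rms_est + 1e-6)

y = mxnorm(np.random.randn(4096))
```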

efficiencytraining

Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models

Mar 12, 2026

Samy Jelassi, Mujin Kwun, Rosie Zhao et al.

Feature-matching fine-tuning provides a middle ground between simple token prediction and complex reinforcement learning—it gives dense semantic feedback without needing task-specific reward models, making it practical for improving model behavior on real tasks.

This paper proposes a new way to fine-tune language models by matching learned feature representations instead of predicting individual tokens. Rather than using reinforcement learning with reward models, the method generates multiple model outputs in parallel and uses their semantic features to guide training, achieving better results than standard fine-tuning on coding and translation tasks.
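
In spirit, the objective replaces token-level cross-entropy with a distance between pooled hidden features of sampled and reference outputs. The pooling and distance below are assumptions for illustration, not the paper's exact energy formulation.

```python
import torch
import torch.nn.functional as F

def feature_matching_loss(sample_feats, reference_feats):
    # sample_feats: (n_samples, seq, dim) hidden states of parallel model
    # generations; reference_feats: same shape for reference outputs.
    # Match pooled semantic features rather than individual tokens.
    pooled_s = sample_feats.mean(dim=(0, 1))
    pooled_r = reference_feats.mean(dim=(0, 1))
    return F.mse_loss(pooled_s, pooled_r)
```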

trainingefficiencyreasoning

Separable neural architectures as a primitive for unified predictive and generative intelligence

Mar 12, 2026

Reza T. Batley, Apurba Sarker, Rajib Mostakim et al.

Separable neural architectures provide a unified framework for both prediction and generation tasks by imposing structural constraints that decompose high-dimensional problems into simpler, more interpretable components—useful when your system has underlying factorizable structure.

This paper introduces separable neural architectures (SNAs), a structured approach to building neural networks that explicitly exploit factorizable patterns in data. By constraining how different parts of a system interact, SNAs can model everything from physics simulations to language more efficiently.
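
The core constraint can be pictured as a low-rank separable ansatz f(x, y) ≈ Σ_r g_r(x)·h_r(y): each input factor gets its own small network, and they interact only through a rank-limited product. A toy PyTorch block, with all sizes as assumptions:

```python
import torch
import torch.nn as nn

class SeparableBlock(nn.Module):
    # f(x, y) ~ sum_r g_r(x) * h_r(y): inputs are processed independently and
    # interact only through a rank-limited elementwise product, which is what
    # keeps the learned components simple and inspectable.
    def __init__(self, dx, dy, rank=8, hidden=64):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(dx, hidden), nn.Tanh(), nn.Linear(hidden, rank))
        self.h = nn.Sequential(nn.Linear(dy, hidden), nn.Tanh(), nn.Linear(hidden, rank))

    def forward(self, x, y):
        return (self.g(x) * self.h(y)).sum(dim=-1)
```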

architecturereasoningefficiency

BiGain: Unified Token Compression for Joint Generation and Classification

Mar 12, 2026

Jiacheng Liu, Shengkun Tang, Jiacheng Cui et al.

Token compression in diffusion models can serve both generation and classification if you preserve different frequency components: keep high-frequency details for texture/edges and low/mid-frequency information for semantic understanding.

BiGain is a method that speeds up diffusion models while keeping both image generation and classification working well. It uses frequency-aware token compression—separating fine details from overall structure—to decide which tokens to merge or remove, maintaining visual quality and classification accuracy simultaneously.
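
One way to make "frequency-aware" concrete: reshape the tokens back onto their image grid, FFT, and score each position by high-frequency energy so compression can spare edge/texture tokens. This scoring rule is illustrative; BiGain's actual criterion and merging procedure are not reproduced here.

```python
import torch

def highfreq_token_scores(tokens, h, w, cutoff=0.25):
    # tokens: (h*w, d) patch tokens on an h-by-w grid. Higher score =
    # more high-frequency (edge/texture) energy at that position.
    grid = tokens.T.reshape(-1, h, w)                    # (d, h, w)
    spectrum = torch.fft.fft2(grid)
    fy = torch.fft.fftfreq(h).abs()[:, None]
    fx = torch.fft.fftfreq(w).abs()[None, :]
    highpass = ((fy ** 2 + fx ** 2).sqrt() > cutoff).float()
    hf = torch.fft.ifft2(spectrum * highpass).real       # high-frequency part
    return (hf ** 2).sum(0).reshape(-1)                  # (h*w,) per-token energy
```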

efficiencyarchitectureevaluation

STAMP: Selective Task-Aware Mechanism for Text Privacy

Mar 12, 2026

Fengwei Tian, Payel Bhattacharjee, Heidi Hanson et al.

By combining task-aware importance scoring with privacy sensitivity detection, STAMP achieves better privacy-utility trade-offs than uniform noise approaches—meaning you can protect sensitive data without sacrificing model performance.

STAMP is a privacy framework that protects sensitive information in text while keeping it useful for AI tasks. It smartly decides which parts of text need more protection (like names and dates) versus which parts are less sensitive, then applies targeted noise to embeddings using a novel 'polar mechanism' that preserves semantic meaning better than traditional approaches.
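
Schematically, the trade-off comes from scaling per-token noise by "sensitive, but not task-important". The Laplace noise and scoring below are stand-ins: the paper's polar mechanism operates differently on the embedding geometry, and all names here are assumptions.

```python
import numpy as np

def selective_perturb(token_embs, sensitivity, importance, eps=1.0, rng=None):
    # token_embs: (T, d); sensitivity/importance: (T,) scores in [0, 1].
    # Sensitive-but-unimportant tokens receive the largest perturbation,
    # preserving utility where the downstream task needs it most.
    rng = rng or np.random.default_rng()
    scale = sensitivity * (1.0 - importance)
    noise = rng.laplace(0.0, 1.0 / eps, size=token_embs.shape)
    return token_embs + scale[:, None] * noise
```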

safetydataefficiency

HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers

Mar 12, 2026

Andy Li, Aiden Durrant, Milan Markovic et al.

HiAP simplifies Vision Transformer deployment by automatically discovering efficient architectures in one training phase without manual sparsity targets, matching complex multi-stage methods while being easier to use.

HiAP is a pruning method that automatically removes unnecessary parts of Vision Transformers during training to make them faster and smaller for edge devices. Unlike existing approaches that require manual tuning, it uses a single training process to find optimal sub-networks by removing entire attention heads, FFN blocks, and individual neurons simultaneously.

efficiencyarchitecturetraining

RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images

Mar 12, 2026

Bin Wan, Runmin Cong, Xiaofei Zhou et al.

Using adaptive convolution kernels guided by object size proportions, combined with transformer-based backbones, significantly improves detection of objects at different scales in satellite imagery.

RDNet improves salient object detection in satellite images by replacing traditional CNN backbones with SwinTransformer and adding three specialized modules that adapt to different object sizes and use frequency analysis to better understand context. This solves the problem of detecting objects of varying scales in remote sensing imagery more accurately than existing methods.

architectureefficiencyevaluation

CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks

Mar 12, 2026

Alexandre Le Mercier, Thomas Demeester, Chris Develder

CLASP provides a practical, lightweight defense against poisoning attacks on state space models by detecting malicious tokens before they reach downstream tasks, with strong generalization to unseen attack patterns.

State space models like Mamba are fast alternatives to Transformers, but they're vulnerable to Hidden State Poisoning Attacks that inject malicious tokens to corrupt the model's memory.

safetyefficiencyarchitecture

IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Mar 12, 2026

Yushi Bai, Qian Dong, Ting Jiang et al.

You can make sparse attention 1.8× faster during prefill by reusing token-selection indices across layers—most layers don't need their own indexer since they pick the same tokens as nearby layers.

IndexCache speeds up sparse attention in large language models by reusing token selection indices across layers instead of computing them separately at each layer. Since consecutive layers select similar tokens anyway, the method caches these selections from a few 'Full' layers and reuses them in other 'Shared' layers, cutting indexer computation by 75% with minimal quality loss.
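
A single-query sketch of the caching pattern: "full" layers run the indexer and cache the selected token indices; every other layer reuses them. Layer wiring, shapes, and the dot-product indexer are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def prefill_with_index_reuse(qs, Ks, Vs, full_layers, top_k=64):
    # qs[i]: (d,) query at layer i; Ks[i], Vs[i]: (T, d) keys/values.
    # Only layers in `full_layers` pay for token selection; the rest
    # reuse the cached indices from the most recent full layer.
    cached_idx, outs = None, []
    for i, (q, K, V) in enumerate(zip(qs, Ks, Vs)):
        if i in full_layers or cached_idx is None:
            cached_idx = (K @ q).topk(top_k).indices   # indexer: relevance scores
        K_sel, V_sel = K[cached_idx], V[cached_idx]
        attn = F.softmax(K_sel @ q / K.shape[-1] ** 0.5, dim=-1)
        outs.append(attn @ V_sel)
    return outs
```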

efficiencyreasoning

Long-Context Encoder Models for Polish Language Understanding

Mar 12, 2026

Sławomir Dadas, Rafał Poświata, Marek Kozłowski et al.

Encoder-only models can be extended to handle long documents through positional embedding adaptation and continued pre-training, offering a parameter-efficient alternative to decoder-only LLMs for document understanding tasks.

This paper introduces Polish language models based on encoder-only architecture that can process documents up to 8192 tokens long—much longer than traditional BERT models. The researchers used a two-stage training approach with positional embedding adaptation and created smaller distilled versions.
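
For learned absolute position tables, the usual adaptation step before continued pre-training is interpolating the table to the new length; whether these models use exactly this scheme is an assumption, so read the sketch as one common recipe.

```python
import torch
import torch.nn.functional as F

def stretch_position_table(pos_emb, new_len):
    # pos_emb: (old_len, dim) learned positional embeddings.
    # Linearly interpolate along the position axis (e.g., 512 -> 8192),
    # then continue pre-training so the model adapts to the new scale.
    stretched = F.interpolate(
        pos_emb.T.unsqueeze(0), size=new_len, mode="linear", align_corners=False
    )
    return stretched.squeeze(0).T  # (new_len, dim)
```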

architectureefficiency

FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance

Mar 12, 2026

Quanhao Li, Zhen Xing, Rui Wang et al.

You can now generate videos with precise motion control in a fraction of the time by distilling multi-step models and retraining motion adapters—opening doors for real-time interactive video creation.

FlashMotion speeds up trajectory-controlled video generation from many steps to just a few, while keeping videos high-quality and motion paths accurate. It trains a motion controller on a slow multi-step model, then distills it to run faster, and fine-tunes the controller to work well with the speedier version.

efficiencyarchitectureevaluation

Automatic Generation of High-Performance RL Environments

Mar 12, 2026

Seth Karten, Rahul Dev Appapogu, Chi Jin

AI agents can now automatically translate RL environments into optimized implementations (Rust, JAX, GPU-parallel code) in hours instead of months, with built-in verification ensuring the fast version behaves identically to the original.

This paper shows how to automatically generate high-performance RL environments using AI agents with a generic prompt template, verification checks, and iterative repair.

agentsefficiencytraining

Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation

Feb 27, 2026

Zhengbo Wang, Jian Liang, Ran He et al.

You can reduce optimizer memory by 8x using low-rank decomposition without sacrificing model quality—making it easier to train larger models on limited hardware.

This paper makes training large language models cheaper by redesigning how optimizers store momentum information. Instead of keeping full-sized momentum matrices in memory, the authors compress them into smaller low-rank approximations—using 1/8 the memory while maintaining or improving training quality.
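
The memory saving comes from never materializing the dense momentum between steps: store factors U (m×r) and V (n×r), reconstruct, update, and re-truncate. The SVD-based truncation below is a naive stand-in for whatever cheaper update the paper uses; all names are illustrative.

```python
import numpy as np

def lowrank_momentum_step(W, grad, U, V, lr=1e-3, beta=0.9, rank=8):
    # Dense momentum m_t = beta * m_{t-1} + grad, but m is only ever stored
    # as the rank-r factors U @ V.T (roughly (m+n)*r floats instead of m*n).
    m = beta * (U @ V.T) + grad
    P, s, Qt = np.linalg.svd(m, full_matrices=False)
    U_new = P[:, :rank] * s[:rank]          # fold singular values into U
    V_new = Qt[:rank].T
    return W - lr * m, U_new, V_new
```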

efficiencytraining

Memory Caching: RNNs with Growing Memory

Feb 27, 2026

Ali Behrouz, Zeman Li, Yuan Deng et al.

Memory Caching lets RNNs scale their memory capacity with sequence length while staying faster than Transformers.

This paper fixes a major weakness of fast RNN models: they forget information too quickly because they have fixed-size memory. The authors introduce Memory Caching, which lets RNNs save snapshots of their memory as they process longer sequences. This gives RNNs the ability to remember more without becoming as slow as Transformers, creating a sweet spot between speed and accuracy.
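
A toy version of the idea: keep the ordinary fixed-size recurrence, but periodically snapshot the state into a cache the output can consult, so capacity grows with sequence length. The `step`/`read` interface is an illustrative assumption, not the paper's design.

```python
import numpy as np

def rnn_with_memory_cache(xs, step, read, state_dim, cache_every=128):
    # Fixed-size recurrence plus periodic snapshots of the state. The
    # snapshot list grows with sequence length (T / cache_every entries),
    # giving the model more to remember without quadratic attention.
    h, cache, outs = np.zeros(state_dim), [], []
    for t, x in enumerate(xs):
        h = step(h, x)
        if (t + 1) % cache_every == 0:
            cache.append(h.copy())
        outs.append(read(h, cache))          # output may consult snapshots
    return outs
```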

architectureefficiencytraining

Chunk-wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text

Feb 27, 2026

Hainan Xu, Vladimir Bataev, Travis M. Bartley et al.

You can make streaming speech-to-text models faster and more accurate by processing audio in fixed chunks instead of one token at a time.

This paper introduces CHAT, an improved version of RNN-T models for converting speech to text in real-time. By processing audio in small chunks and using a smarter attention mechanism, CHAT runs 1.7x faster during inference, uses 46% less memory during training, and produces more accurate transcriptions—especially for translating speech between languages.
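
Chunk-wise processing usually comes down to the attention mask: each frame may see everything up to the end of its own chunk, so attention advances chunk-by-chunk instead of frame-by-frame. This is a common streaming-encoder pattern; CHAT's exact masking is not given in the summary, so treat the sketch as illustrative.

```python
import torch

def chunkwise_mask(T, chunk=8):
    # mask[i, j] is True when frame i may attend to frame j, i.e. when j
    # falls at or before the end of frame i's chunk.
    idx = torch.arange(T)
    visible_until = ((idx // chunk) + 1) * chunk - 1
    return idx[None, :] <= visible_until[:, None].clamp(max=T - 1)
```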

efficiencyarchitecturemultimodal

Time Series Foundation Models as Strong Baselines in Transportation Forecasting: A Large-Scale Benchmark Analysis

Feb 27, 2026

Javier Pulido, Filipe Rodrigues

Foundation models trained on diverse time-series data can forecast transportation metrics without task-specific tuning, making them practical baselines.

This paper tests whether a general-purpose time-series AI model (Chronos-2) can forecast transportation data like traffic volume and bike-sharing demand without any custom training. The model works surprisingly well out-of-the-box, often beating specialized models built just for these tasks, and also provides useful uncertainty estimates.

evaluationapplicationsefficiency

A Dataset is Worth 1 MB

Feb 26, 2026

Elad Kimchi Shoshani, Leeyam Gabay, Yedid Hoshen

You can teach models new tasks by transmitting just labels instead of data, if clients have a generic reference dataset pre-loaded.

Instead of sending large datasets over the network, this paper proposes sending only class labels for images from a reference dataset that clients already have locally. A smart filtering mechanism picks which images are most relevant to the new task, reducing communication to under 1 MB while maintaining accuracy.
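
A sketch of the transmission side: score each locally available reference image by similarity to the new task's data, keep the most relevant ones, and send only (index, label) pairs at a few bytes each. The centroid-similarity scoring is an illustrative assumption, not necessarily the paper's filter.

```python
import numpy as np

def build_label_payload(task_feats, ref_feats, ref_labels, budget=5000):
    # task_feats: (n, d) features of the new task's data;
    # ref_feats: (m, d) features of the reference set clients already hold;
    # ref_labels: (m,) labels assigned to reference images for the new task.
    centroid = task_feats.mean(axis=0)
    scores = ref_feats @ centroid            # relevance to the task
    keep = np.argsort(-scores)[:budget]
    return [(int(i), int(ref_labels[i])) for i in keep]
```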

efficiencydatatraining

SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

Feb 26, 2026

Simon Roschmann, Paul Krzakala, Sonia Mazelet et al.

You can align vision and language models with 10-100x less paired training data by leveraging unpaired images and text separately.

This paper shows how to align vision and language models using far fewer paired examples than current methods require. Instead of needing millions of image-text pairs, SOTAlign uses a small set of paired data plus lots of unpaired images and text, employing a technique called optimal transport to learn how the two models relate to each other.
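
The optimal-transport piece is standard entropic OT (Sinkhorn). Applied to a cost matrix between unpaired image and text embeddings, the resulting plan acts as a soft pseudo-pairing; SOTAlign's full semi-supervised objective builds on top of this and is not reproduced here.

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.05, iters=200):
    # Entropic optimal transport between uniform marginals.
    # cost: (n, m) pairwise distances between image and text embeddings.
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)
    u, v = np.ones(n), np.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]       # transport plan: soft pseudo-pairs
```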

multimodaltrainingefficiency

FlashOptim: Optimizers for Memory Efficient Training

Feb 26, 2026

Jose Javier Gonzalez Ortiz, Abhay Gupta, Chris Renard et al.

You can train large models with 50% less GPU memory by using better compression for optimizer states—no quality loss, drop-in replacement.

FlashOptim cuts the memory needed to train large AI models in half by storing optimizer information more efficiently. It uses smarter compression techniques for gradients and optimizer states without hurting model quality, making it possible to train 7B+ parameter models on consumer GPUs.
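
For a sense of what "storing optimizer information more efficiently" can mean, here is per-block absmax int8 quantization of a state tensor, a standard compression trick for optimizer states. FlashOptim's exact scheme is not described in the summary, so this is purely illustrative (and assumes the tensor size divides the block size).

```python
import numpy as np

def quantize_blocks(state, block=256):
    # Each block keeps one float scale plus int8 payloads: roughly 4x less
    # memory than float32 before any further tricks.
    flat = state.reshape(-1, block)
    scale = np.abs(flat).max(axis=-1, keepdims=True) / 127.0 + 1e-12
    q = np.round(flat / scale).astype(np.int8)
    return q, scale

def dequantize_blocks(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)
```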

efficiencytraining