ThinkLLM

Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

861 papers, 57 this month, 12 topics
Efficiency 37, Reasoning 36, Training 35, Evaluation 29, Architecture 23, Agents 23, Multimodal 17, Applications 15, Alignment 9, Safety 8, Scaling 8, Data 3

May 18 – May 24 (17)

Tokenisation via Convex Relaxations

May 21, 2026

Jan Tempus, Philip Whittington, Craig W. Schmidt et al.

ConvexTok uses convex optimization to build tokenizers that are provably near-optimal (within 1% at typical vocabulary sizes) and compress text better than greedy algorithms like BPE, with measurable improvements in language model efficiency.

This paper replaces greedy tokenization algorithms like BPE with a convex optimization approach called ConvexTok. Instead of making locally optimal choices, it formulates tokenizer construction as a linear program, achieving better compression (bits-per-byte) and allowing users to verify how close their tokenizer is to mathematically optimal.

training, efficiency

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

May 21, 2026

Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld et al.

Training LLMs to produce diverse outputs across multiple reward dimensions—not just maximizing a single score—makes them better at test-time search where you can pick the best solution from many candidates.

This paper introduces Vector Policy Optimization (VPO), a training method that teaches language models to generate diverse solutions by optimizing for multiple reward objectives simultaneously, rather than a single scalar reward.

training

May 11 – May 17 (10)

RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

May 14, 2026

Xiang Fan, Yuheng Wang, Bohan Fang et al.

Video generation systems lose detail because their decoders ignore the input image—adding reference conditioning to the decoder recovers this information and improves quality by up to 2.1dB PSNR.

RefDecoder improves video generation by conditioning the decoder on a reference image, fixing a common architectural flaw where decoders ignore input details. By injecting reference image information through attention mechanisms during decoding, it preserves fine details and consistency without requiring retraining of existing systems.

architecture, multimodal, efficiency

Eradicating Negative Transfer in Multi-Physics Foundation Models via Sparse Mixture-of-Experts Routing

May 14, 2026

Ellwil Sharma, Arastu Sharma

Sparse mixture-of-experts routing can solve the problem of conflicting physics domains in foundation models by automatically routing different physics problems to specialized experts while maintaining shared knowledge for universal principles.

This paper tackles negative transfer in multi-physics AI models—where training on different physics problems simultaneously hurts performance. The authors propose Shodh-MoE, which uses sparse expert routing to let different parts of the model specialize in different physics regimes (like fluid dynamics vs. porous media flows) while sharing knowledge where it helps.

May 4 – May 10 (26)

Normalizing Trajectory Models

May 8, 2026

Jiatao Gu, Tianrong Chen, Ying Shen et al.

NTM enables fast image generation (4 steps) while preserving exact likelihood calculation—something previous fast diffusion methods couldn't do—by using normalizing flows for each denoising step instead of simple Gaussian assumptions.

This paper introduces Normalizing Trajectory Models (NTM), a new approach for fast image generation that compresses diffusion sampling from many steps to just four. Unlike existing fast methods that lose the ability to calculate exact probabilities, NTM maintains a mathematically exact likelihood while generating high-quality images, making it useful for both generation and evaluation.

efficiency, architecture, training

EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction

May 8, 2026

Wei Yu, Yunhang Qian

State space models offer a practical alternative to transformers for event-based image reconstruction, achieving better results with linear computational complexity instead of quadratic, making high-resolution processing feasible.

EmambaIR uses a new type of neural network architecture (state space models) to reconstruct clear images from event camera data.

Apr 27 – May 3 (22)

HyCOP: Hybrid Composition Operators for Interpretable Learning of PDEs

May 1, 2026

Jinpai Zhao, Nishant Panda, Yen Ting Lin et al.

Composing interpretable numerical and learned modules with learned policies outperforms monolithic neural operators on PDEs, generalizes better to out-of-distribution cases, and lets you swap components (like boundary conditions) without retraining.

HyCOP learns to solve PDEs by composing simple, interpretable modules (like advection and diffusion) rather than training a single neural network. It learns a policy that decides which module to apply and for how long based on the current state, enabling better generalization to new scenarios and easier transfer to different problems.

reasoning, architecture, efficiency

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

May 1, 2026

Siyuan Huang, Xiaoye Qu, Yafu Li et al.

PVM solves a fundamental problem in vision-language models where visual understanding degrades during long text generation by creating a separate, always-accessible pathway to visual information—improving reasoning tasks with minimal added parameters.

Large vision-language models struggle when generating long text because visual information gets diluted by accumulated text tokens. This paper introduces Persistent Visual Memory (PVM), a lightweight add-on module that maintains direct access to visual embeddings throughout generation, preventing the model from losing sight of the image as it produces longer outputs.

Apr 20 – Apr 26 (25)

Spend Less, Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection

Apr 24, 2026

Sijie Li, Shanda Li, Haowei Lin et al.

Use active learning to strategically pick which small experiments to run when fitting scaling laws—you can predict large-scale model performance with 90% less compute by choosing experiments that reduce uncertainty about the target region you care about.

Training large AI models costs millions, and figuring out how they'll scale costs millions more. This paper proposes a smarter way to choose which smaller pilot experiments to run so you can accurately predict how a massive training run will perform, using only about 10% of the budget that naive approaches would need.

scaling, efficiency, training
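As a rough illustration of the fitting step only, the sketch below fits a pure power law, loss = a * N^(-b), to a handful of small runs by least squares in log-log space and then extrapolates to a larger size. The functional form, the synthetic numbers, and the function names are assumptions made for illustration; the summary does not say which form the paper fits or how its active selection scores candidate experiments.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size); returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


def predict(a, b, size):
    """Extrapolate the fitted law to a new model size."""
    return a * size ** (-b)


# Synthetic pilot runs generated from a known law (a = 10, b = 0.1).
sizes = [1e6, 1e7, 1e8]
losses = [10.0 * s ** -0.1 for s in sizes]

a, b = fit_power_law(sizes, losses)
predicted = predict(a, b, 1e10)
```

An active-selection method would sit in front of this fit and decide which `sizes` are worth paying for; here they are simply fixed.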

How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks

Apr 24, 2026

Longju Bai, Zhemin Huang, Xingyao Wang et al.

AI agents are expensive and unpredictable: token costs vary wildly (up to 30x difference on the same task), models differ dramatically in efficiency, and even frontier models can't accurately predict their own token usage before running.

This paper analyzes how much AI agents spend on tokens when solving coding tasks. Researchers studied eight frontier LLMs on real-world coding benchmarks and found that agentic tasks consume 1000x more tokens than simpler coding tasks, with huge variability between runs. Surprisingly, spending more tokens doesn't guarantee better results—accuracy often peaks at intermediate costs then plateaus.

reasoning
efficiency

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

May 21, 2026

Ali Hatamizadeh, Yejin Choi, Jan Kautz

Decoupling erase and write operations in linear attention with separate gates improves language model performance, especially on long-context tasks, while maintaining constant-memory decoding.

This paper improves linear attention mechanisms by separating the control of what to forget from what to remember in compressed memory. Instead of using a single gate to control both erasing old information and writing new information, Gated DeltaNet-2 uses separate channel-wise gates for each operation, making memory updates more flexible and efficient.

architecture, efficiency, reasoning

LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

May 21, 2026

Sadia Asif, Mohammad Mohammadi Amiri, Momin Abbas et al.

When LLM agents communicate through shared KV caches for efficiency, you need explicit safeguards—LCGuard shows how to block sensitive information leakage at the representation level without breaking task coordination.

LCGuard is a safety framework that protects sensitive information when multiple AI agents share transformer key-value caches to coordinate tasks. It uses adversarial training to transform shared cache data so that agents can't reconstruct each other's private inputs, while keeping the information useful for task performance.

safety, agents, efficiency

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

May 21, 2026

Yunpeng Dong, Jingkai He, Yuze Hou et al.

By tracking only differences between consecutive states rather than full duplicates, DeltaBox reduces AI agent checkpoint/rollback latency from seconds to milliseconds, directly enabling deeper search and larger-scale exploration for reasoning and RL tasks.

DeltaBox is a system that makes AI agents much faster by storing only the changes between checkpoints instead of copying entire sandbox states. Using new OS-level mechanisms for filesystems and process state, it reduces checkpoint/rollback time from hundreds of milliseconds to just milliseconds, enabling agents to explore more possibilities in the same time budget.

efficiency, agents
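The storage idea (keep only what changed since the previous checkpoint) can be sketched with a toy model in which a sandbox state is a plain dictionary of path to contents. Everything here is hypothetical and illustrative: DeltaBox works at the OS level on filesystems and process state, and this sketch says nothing about how it reaches millisecond latency.

```python
_DELETED = object()  # sentinel marking a key removed since the last checkpoint


def make_delta(prev, curr):
    """Record only the keys that were added, changed, or removed."""
    delta = {k: v for k, v in curr.items() if k not in prev or prev[k] != v}
    for k in prev:
        if k not in curr:
            delta[k] = _DELETED
    return delta


def apply_delta(state, delta):
    """Return a new state with the delta applied; the input is not mutated."""
    out = dict(state)
    for k, v in delta.items():
        if v is _DELETED:
            out.pop(k, None)
        else:
            out[k] = v
    return out


def restore(base, deltas, upto):
    """Rebuild checkpoint number `upto` by replaying the first `upto` deltas."""
    state = dict(base)
    for delta in deltas[:upto]:
        state = apply_delta(state, delta)
    return state


base = {"a.txt": "hello", "b.txt": "world"}
s1 = {"a.txt": "hello", "b.txt": "WORLD", "c.txt": "new"}
s2 = {"a.txt": "hello", "c.txt": "new"}
deltas = [make_delta(base, s1), make_delta(s1, s2)]
```

Rolling back to any earlier checkpoint is `restore(base, deltas, n)`; the unchanged `a.txt` is never copied into a delta.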

FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection

May 21, 2026

Huanchi Wang, Zihang Huang, Yifang Tian et al.

You can build practical, label-efficient log anomaly detectors by using LLMs once offline to structure the problem, then training lightweight domain-specific models that run continuously without expensive LLM calls.

FAME is a system for detecting anomalies in individual log messages rather than groups, using a mixture-of-experts approach that leverages an LLM offline to organize log templates into failure domains. It requires minimal labeled data (as few as 100 examples) and runs efficiently on-premise, achieving 98% accuracy on real production logs while reducing annotation effort by 76x.

efficiency, evaluation, applications

MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking Data

May 21, 2026

Amir Mousavi, Mohammad Sadegh Sirjani, Erfan Nourbakhsh et al.

Mamba's linear-complexity architecture enables real-time cognitive load monitoring from noisy eye-tracking signals on wearable devices—a practical alternative to Transformers for temporal sensor data with frequent gaps.

MambaGaze uses a bidirectional Mamba neural network to assess cognitive load from eye-tracking data in real-time. It handles missing data from eye blinks and tracking failures by explicitly encoding uncertainty, and runs efficiently on edge devices like smartglasses for applications like driver monitoring.

architecture, efficiency, applications

Variance Reduction for Expectations with Diffusion Teachers

May 20, 2026

Jesse Bettencourt, Xindi Wu, Matan Atzmon et al.

When using diffusion models to guide other tasks, you can dramatically reduce compute cost by resampling cheap diffusion noise multiple times per expensive upstream computation, rather than doing one expensive computation per noise sample.

This paper introduces CARV, a framework for reducing variance in gradient estimates when using pretrained diffusion models as teachers in downstream tasks like text-to-3D generation. By reusing expensive computations (like 3D rendering) across multiple noise samples and applying importance sampling techniques, the method achieves 2-3x speedups without changing the underlying objective.

efficiency, training, evaluation

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

May 20, 2026

Dayal Singh Kalra, Maissam Barkeshli

When scaling up LLM training, use a higher embedding layer learning rate (scaled by model width) to stabilize training and reliably transfer hyperparameters from small to large models—this is the primary reason μP outperforms standard parameterization.

This paper explains why μP (Maximal Update) parameterization works better than standard parameterization for transferring learning rates across different model sizes. The key finding: μP's advantage mainly comes from using a higher learning rate for the embedding layer, which stabilizes training and improves hyperparameter transfer when scaling up language models.

scaling, training, efficiency

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

May 20, 2026

Caleb Winston, Ron Yifeng Wang, Azalia Mirhoseini et al.

Compiling agent tasks into code upfront—rather than deciding actions one step at a time—enables parallelization and validation, dramatically reducing latency and errors in web automation.

This paper introduces a compilation approach for web agents that converts natural language tasks into executable code plans instead of executing step-by-step. By generating multiple candidate plans, validating them against tool specifications, and optimizing for parallelization, the system achieves 10x faster execution and better accuracy than existing sequential approaches.

agents, efficiency, reasoning

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

May 20, 2026

Zhepei Wei, Xinyu Zhu, Wei-Lin Chen et al.

RLVR training produces predictable, low-rank weight changes that can be extrapolated mathematically, letting you skip 85% of training compute while matching or exceeding performance on reasoning tasks.

This paper reveals that language models trained with reinforcement learning from verifiable rewards (RLVR) follow surprisingly simple, low-rank weight trajectories.

training, efficiency, reasoning

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

May 18, 2026

Yuxiang Huang, Nuno M. T. Gonçalves, Federico Alvetreti et al.

DashAttention enables efficient long-context processing by combining adaptive sparse selection with differentiable training, outperforming fixed-sparsity methods while maintaining gradient flow through both attention stages.

DashAttention improves how language models handle long documents by using a smarter two-stage attention mechanism. Instead of always selecting the same number of relevant tokens, it adaptively picks different amounts based on what each query needs, while keeping the entire process trainable. This achieves full-attention quality with 75% fewer computations.

efficiency, architecture

A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability

May 18, 2026

Ruitao Liu, Xinyang Tian, Shuo Chen et al.

For distributed model training, executing tasks based on actual readiness rather than pre-committed schedules can dramatically reduce GPU idle time and improve throughput, especially when computation times vary unpredictably.

This paper introduces RRFP, a runtime system that improves GPU training efficiency by executing ready tasks immediately instead of waiting for a pre-planned order. When training large models across multiple GPUs, unpredictable delays in computation cause stages to sit idle.

training, efficiency, scaling

SURGE: Approximation-free Training Free Particle Filter for Diffusion Surrogate

May 18, 2026

Lifu Wei, Yinuo Ren, Naichen Shi et al.

You can guide diffusion models without computing gradients or scores—just reweight trajectories and resample periodically, making inference-time improvements cheaper and easier to implement.

This paper introduces SURGE, a gradient-free method for improving diffusion model outputs at inference time. Instead of computing expensive gradients, SURGE reweights and resamples trajectories using a mathematical technique called Girsanov estimation, making guidance simpler and faster while maintaining theoretical guarantees.

efficiency
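The "reweight, then resample" step is the core move of any particle filter, and a minimal sketch of it looks like the following. The log-weights are taken as given: how SURGE derives them (the Girsanov estimate) is not shown, and multinomial resampling via `random.choices` is an assumption chosen for brevity, not something the summary attributes to the paper.

```python
import math
import random


def reweight_and_resample(particles, log_weights, rng):
    """Normalize log-weights stably, then draw a same-sized population from them."""
    m = max(log_weights)
    unnormalized = [math.exp(lw - m) for lw in log_weights]  # subtract max to avoid overflow
    total = sum(unnormalized)
    probs = [w / total for w in unnormalized]
    resampled = rng.choices(particles, weights=probs, k=len(particles))
    return resampled, probs


rng = random.Random(0)
particles = ["a", "b", "c"]
# Trajectory "a" is overwhelmingly more likely than the other two.
resampled, probs = reweight_and_resample(particles, [0.0, -1000.0, -1000.0], rng)
```

After resampling, low-weight trajectories are dropped and high-weight ones are duplicated, so no gradient of the reward is ever needed.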

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

May 18, 2026

Qianhao Yuan, Jie Lou, Xing Yu et al.

MLLMs can improve fine-grained visual understanding by learning from their own superior performance on evidence-focused crops, using on-policy self-distillation to transfer regional perception skills to full-image reasoning.

This paper addresses a key weakness in multimodal AI models: they struggle to notice small but important details in images. The researchers discovered that models actually perform better when shown cropped images focused on relevant areas versus full images, suggesting the problem isn't recognizing details but finding them.

multimodal, training, efficiency

PIXLRelight: Controllable Relighting via Intrinsic Conditioning

May 18, 2026

Miguel Farinha, Ronald Clark

By conditioning on intrinsic image properties (albedo and shading) extracted from both photos and 3D renders, you can achieve photorealistic relighting with full PBR lighting control while staying fast enough for practical use.

PIXLRelight is a fast neural relighting method that lets you change lighting in photos using physically-based rendering controls. It decomposes images into intrinsic components (albedo, shading, residuals) and uses these to condition a transformer model, enabling realistic lighting adjustments in under 0.1 seconds per image without per-image optimization.

multimodal, architecture, efficiency

Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation

May 18, 2026

Kenan Majewski, Marcin Żugaj

Neural networks can improve classical state estimation by learning adaptive forgetting factors that respond to real-time sensor quality, enabling robust UAV navigation during sensor outages and dynamic environments.

This paper presents a learned Kalman filter that adapts to changing noise conditions in UAVs by using a neural network to dynamically adjust how much it trusts past measurements. Instead of using a fixed forgetting factor, the filter learns a memory policy from sensor data, helping it handle sensor failures and vibrations better than traditional adaptive filters.

training, efficiency, reasoning
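For readers unfamiliar with forgetting factors, here is a scalar sketch of a Sage-Husa-style filter in the usual textbook form: the measurement-noise estimate is a fading-memory average whose weight depends on a forgetting factor b. The state model (a random walk observed directly), the constants, and the `forgetting` callable are assumptions for illustration; in the paper a neural network would supply b, and its actual architecture and inputs are not described in this summary.

```python
def sage_husa_step(x, P, R, z, k, b, Q=0.01, r_floor=1e-6):
    """One scalar step for a random-walk state observed directly (H = 1).

    b is the forgetting factor in (0, 1); k is the zero-based step index.
    """
    P_pred = P + Q                          # predict: x_k = x_{k-1} + process noise
    innov = z - x                           # innovation (measurement minus prediction)
    d = (1.0 - b) / (1.0 - b ** (k + 1))    # fading-memory weight for the newest sample
    R = (1.0 - d) * R + d * (innov * innov - P_pred)  # measurement-noise estimate
    R = max(R, r_floor)                     # keep the variance strictly positive
    K = P_pred / (P_pred + R)               # Kalman gain
    return x + K * innov, (1.0 - K) * P_pred, R


def run_filter(measurements, forgetting):
    """`forgetting(k, z)` returns b for step k; a learned policy would plug in here."""
    x, P, R = 0.0, 1.0, 1.0
    for k, z in enumerate(measurements):
        x, P, R = sage_husa_step(x, P, R, z, k, forgetting(k, z))
    return x, R


# Fixed forgetting factor as a stand-in for the learned policy.
x_final, R_final = run_filter([5.0] * 50, lambda k, z: 0.97)
```

A smaller b forgets old innovations faster, which helps when sensor quality changes abruptly but makes the noise estimate jumpier; a learned policy tries to pick that trade-off per step.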
architecture, scaling, efficiency

MeMo: Memory as a Model

May 14, 2026

Ryan Wei Heng Quek, Sanghyuk Lee, Alfred Wei Lun Leong et al.

You can add new knowledge to any LLM without touching its weights by training a separate memory model that retrieves and augments the LLM's responses—making it practical for real-world applications needing frequent updates.

MeMo introduces a modular memory model that stores new knowledge separately from a frozen LLM, enabling efficient updates without retraining. It works with any LLM (open or proprietary), handles complex document relationships, and maintains constant retrieval cost regardless of corpus size.

training, efficiency

Elastic Attention Cores for Scalable Vision Transformers

May 12, 2026

Alan Z. Song, Yinjie Chen, Mu Nan et al.

You can build efficient vision transformers by routing all patch interactions through a small set of learned core tokens instead of using all-to-all attention, achieving linear complexity without sacrificing performance.

This paper proposes VECA, a vision transformer that replaces quadratic all-to-all attention with linear-time attention using learned "core" tokens as communication hubs. Instead of every patch attending to every other patch, all patches only interact through a small set of learned cores, reducing computation from O(N²) to O(N) while maintaining competitive accuracy on vision tasks.

architecture, efficiency, scaling
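Counting query-key pairs makes the O(N²) versus O(N) claim concrete. The sketch assumes two attention passes per layer (cores read from all patches, then patches read from the cores); that layout and the function names are assumptions, since the summary does not spell out how VECA arranges its core tokens.

```python
def all_to_all_pairs(n_patches):
    """Query-key pairs when every patch attends to every patch."""
    return n_patches * n_patches


def core_routed_pairs(n_patches, n_cores):
    """Query-key pairs when patches talk only through a fixed set of cores."""
    cores_read_patches = n_cores * n_patches
    patches_read_cores = n_patches * n_cores
    return cores_read_patches + patches_read_cores


n_patches, n_cores = 1024, 16
full = all_to_all_pairs(n_patches)
routed = core_routed_pairs(n_patches, n_cores)
```

With the number of cores held fixed, doubling the patch count doubles `routed` but quadruples `full`.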

Task-Adaptive Embedding Refinement via Test-time LLM Guidance

May 12, 2026

Ariel Gera, Shir Ashury-Tahan, Gal Bloch et al.

You can boost embedding model performance on hard search tasks by having an LLM refine queries at test-time, making embeddings practical for scenarios where running LLMs on all documents is too expensive.

This paper shows how to improve embedding models for search and classification by using an LLM to refine user queries in real-time. Instead of changing the embedding model itself, the approach adjusts the query representation based on feedback from a small sample of documents, achieving up to 25% improvement on challenging tasks without requiring expensive LLM processing at scale.

efficiency, evaluation

Learning, Fast and Slow: Towards LLMs That Adapt Continually

May 12, 2026

Rishabh Tiwari, Kusha Sareen, Lakshya A Agrawal et al.

Combining parameter updates with context optimization lets LLMs learn new tasks 3x more efficiently while staying closer to their original capabilities and avoiding the forgetting that comes from pure fine-tuning.

This paper proposes Fast-Slow Training (FST), a method that combines two learning mechanisms for LLMs: updating model parameters (slow learning) and optimizing the input context (fast learning). By separating task-specific adaptation from general knowledge, FST achieves better sample efficiency, reduces catastrophic forgetting, and maintains the model's ability to learn new tasks over time.

training, efficiency, reasoning

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

May 12, 2026

Sagi Ahrac, Noya Hochwald, Mor Geva

Routers in sparse mixture-of-experts models work best when they maintain geometric alignment with their experts—understanding this coupling can improve routing stability and reduce the need for complex auxiliary losses.

This paper reveals that routers in Sparse Mixture-of-Experts models learn a geometric relationship with their experts: router weights and expert weights receive gradients along the same directions, causing them to specialize together.

architecture, training, efficiency

KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference

May 12, 2026

Alireza Nadali, Patrick Cooper, Ashutosh Trivedi et al.

You can extend transformer context length by simply reusing and accumulating the KV cache across chunks—no training needed, and the approach stays numerically stable even across very long sequences.

KV-Fold enables long-context inference by treating the key-value cache as an accumulator that gets passed between sequence chunks. The model processes each chunk while attending to cached information from previous chunks, allowing it to handle contexts up to 128K tokens without retraining or architectural changes.

efficiency

Solve the Loop: Attractor Models for Language and Reasoning

May 12, 2026

Jacob Fein-Ashley, Paria Rashidinejad

Attractor Models make iterative refinement practical by using implicit differentiation to solve fixed points, enabling smaller models (27M-770M parameters) to outperform much larger ones on reasoning and language tasks without the training instability of traditional recurrent architectures.

This paper introduces Attractor Models, which improve on looped Transformers by using implicit differentiation to solve for fixed points in latent representations.

architecture, reasoning, efficiency

Search Your Block Floating Point Scales!

May 12, 2026

Tanmaey Gupta, Hayden Prairie, Xiaoxia Wu et al.

Smarter scale selection in Block Floating Point quantization can reduce quantization error by 27% and improve language model performance by up to 15 points without slowing down inference.

This paper improves quantization for AI models by optimizing how Block Floating Point formats choose their scale factors. Instead of using a fixed maximum-based scale, ScaleSearch searches for better scales that minimize quantization error. The method works with existing quantization techniques and includes a specialized attention algorithm, showing 15-point improvements on math reasoning tasks.

efficiency
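To see why the choice of scale matters, the sketch below quantizes one block of values to 4-bit signed integers with a shared power-of-two scale, first using a max-based scale and then searching nearby scales for the lowest mean squared error. This is a simplified stand-in, not ScaleSearch itself: the rounding scheme, the candidate set, and all function names are assumptions, and the block is assumed to contain at least one non-zero value.

```python
import math


def quantize_block(block, scale, bits=4):
    """Round each value to a signed integer multiple of `scale`, clipping to the bit range."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in block]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def max_based_scale(block, bits=4):
    """Smallest power-of-two scale at which the largest magnitude does not clip."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(x) for x in block)  # assumed non-zero
    return 2.0 ** math.ceil(math.log2(amax / qmax))


def searched_scale(block, bits=4, span=2):
    """Try power-of-two scales around the max-based one and keep the lowest-error one."""
    base = max_based_scale(block, bits)
    candidates = [base * 2.0 ** k for k in range(-span, span + 1)]
    return min(candidates, key=lambda s: mse(block, quantize_block(block, s, bits)))


# Mostly small values plus one outlier that dominates the max-based scale.
block = [0.11, -0.09, 0.12, 0.10, -0.13, 0.08, 0.9]
err_max = mse(block, quantize_block(block, max_based_scale(block)))
err_search = mse(block, quantize_block(block, searched_scale(block)))
```

Because the max-based scale is always among the candidates, the searched scale can never do worse on this error measure; in this block it accepts slight clipping of the outlier in exchange for finer resolution on the small values.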
architecture
efficiency
multimodal

VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection

May 8, 2026

James Petullo, Sonny George, Dylan Cashman et al.

You can make confidence-weighted answer selection 47% cheaper by clustering similar reasoning traces and only evaluating unique ones, without sacrificing accuracy.

VecCISC reduces the cost of weighted majority voting for LLM reasoning by filtering out duplicate or low-quality reasoning traces before sending them to a critic model. It uses semantic similarity to identify which candidate answers are worth evaluating, cutting token usage by 47% while maintaining accuracy across math, science, and reasoning tasks.

efficiency, reasoning
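A minimal sketch of confidence-weighted voting with one critic call per cluster follows. It assumes cluster ids have already been computed (for example from embedding similarity) and that every trace in a cluster inherits its cluster's score; both choices, and the function names, are illustrative assumptions rather than the paper's exact procedure.

```python
from collections import defaultdict


def clustered_vote(traces, critic):
    """traces: list of (cluster_id, answer); critic(cluster_id) -> confidence.

    Calls the critic once per cluster, reuses that score for every trace in the
    cluster, and returns (winning answer, number of critic calls).
    """
    scores = {}
    totals = defaultdict(float)
    for cluster_id, answer in traces:
        if cluster_id not in scores:
            scores[cluster_id] = critic(cluster_id)  # the only expensive step
        totals[answer] += scores[cluster_id]
    return max(totals, key=totals.get), len(scores)


# Hypothetical critic scores and five sampled traces falling into three clusters.
critic_scores = {0: 0.9, 1: 0.6, 2: 0.3}
traces = [(0, "42"), (0, "42"), (1, "41"), (2, "42"), (1, "41")]
answer, critic_calls = clustered_vote(traces, critic_scores.get)
```

Here five traces cost three critic calls instead of five, which is where the token savings would come from.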

Flow-OPD: On-Policy Distillation for Flow Matching Models

May 8, 2026

Zhen Fang, Wenxuan Huang, Yu Zeng et al.

On-policy distillation with specialized teachers can resolve conflicting optimization goals in multi-objective image generation, achieving 10-point improvements over standard reinforcement learning approaches while maintaining quality across all metrics.

Flow-OPD is a training method that improves text-to-image models by using specialized teacher models and on-policy distillation to align multiple competing objectives (like image quality, text accuracy, and aesthetics).

training, alignment, efficiency

Uncertainty-Aware Structured Data Extraction from Full CMR Reports via Distilled LLMs

May 8, 2026

Yi Yu, Parker Martin, Zhenyu Bu et al.

Distilled LLMs can extract medical data from unstructured reports with high accuracy and built-in confidence estimates, enabling clinicians to prioritize which extractions need human review.

CMR-EXTR converts free-text cardiac MRI reports into structured data with confidence scores for each extracted field. Using a lightweight distilled language model, it achieves 99.65% accuracy while running entirely offline, making it practical for clinical use without requiring constant API access.

applications, efficiency, evaluation

Fast Byte Latent Transformer

May 8, 2026

Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz et al.

Byte-level models can now generate 50% faster by predicting multiple bytes in parallel instead of one at a time, making them practical for real-world use without sacrificing quality.

Byte-level language models match token-based models but generate slowly because they produce one byte at a time. This paper introduces three faster variants: BLT-D uses diffusion to generate multiple bytes per step, BLT-S uses local drafting with verification, and BLT-DV combines both. All reduce memory bandwidth costs by over 50% during generation.

efficiency, architecture

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

May 7, 2026

Minbin Huang, Han Shi, Chuanyang Zheng et al.

You don't need separate expert sets per layer in MoE models—a shared expert pool with independent routers works better and uses fewer parameters, suggesting the standard per-layer expert allocation is unnecessarily wasteful.

UniPool replaces the standard Mixture-of-Experts design where each layer has its own expert set with a single shared pool of experts accessed by all layers. This reduces redundancy and allows expert parameters to grow sublinearly with model depth while improving performance and reducing parameter count by 30-60% compared to standard MoE.

architecture, efficiency, scaling

BAMI: Training-Free Bias Mitigation in GUI Grounding

May 7, 2026

Borui Zhang, Bo Zhang, Bo Wang et al.

You can significantly improve GUI agent accuracy on complex interfaces without retraining by using a two-step approach: first narrow down the region of interest, then select the best candidate from remaining options.

This paper identifies why GUI grounding models (used by AI agents to click and interact with interfaces) fail on complex screens, finding two main problems: high image resolution causes precision errors, and complex UI elements create ambiguity.

agents, evaluation, efficiency

EMO: Pretraining Mixture of Experts for Emergent Modularity

May 7, 2026

Ryan Wang, Akshita Bhagia, Sewon Min

By constraining tokens within the same document to share expert pools during pretraining, EMO creates naturally modular experts that specialize in semantic domains (math, code, etc.), enabling practical memory-efficient deployment without sacrificing performance.

EMO is a Mixture-of-Experts language model designed to work efficiently when you only need a subset of its capabilities. Instead of forcing all experts to activate for every input, EMO groups experts by document domain during training, so code-heavy documents use code experts, math documents use math experts, and so on.

architecture, efficiency, training

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

May 7, 2026

Yuxing Liu, Jianyu Wang, Tong Zhang

Use the same optimizer for finetuning as you used for pretraining—it significantly reduces catastrophic forgetting while maintaining task performance, even outperforming parameter-efficient methods like LoRA.

When finetuning large language models, using the same optimizer during finetuning as was used during pretraining reduces forgetting of previously learned knowledge while maintaining or improving performance on new tasks.

training, efficiency

Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval

May 7, 2026

Zeyu Yang, Qi Ma, Jason Chen et al.

A single well-designed lexical query informed by LLM-predicted vocabulary and corpus statistics outperforms expensive multi-round retrieval agents—you don't need complex agentic loops if you get the query right upfront.

SIRA is a retrieval agent that replaces multi-round exploratory search with a single, smarter query by using an LLM to predict missing search terms and filter them against corpus statistics.

agents, efficiency

Taming Outlier Tokens in Diffusion Transformers

May 6, 2026

Xiaoyu Wu, Yifei Wang, Tsu-Jui Fu et al.

Outlier tokens in diffusion transformers aren't just extreme values but represent corrupted local information; controlling them with register tokens significantly improves image generation quality.

This paper identifies and fixes a problem in Diffusion Transformers where certain tokens develop unusually high values that degrade image quality. The authors show this happens in both the image encoder and the generation model itself, and propose Dual-Stage Registers—a technique using learnable tokens to stabilize these problematic values and improve image generation.

architecture, efficiency, evaluation

LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents

May 6, 2026

Yijun Lu, Rui Ye, Yuwen Du et al.

Agents performing long-horizon tasks need adaptive context management—selectively compressing or discarding information—rather than naively accumulating everything, which improves efficiency and reduces hallucination.

LongSeeker introduces Context-ReAct, a framework that helps AI agents manage growing context during long tasks by selectively compressing, skipping, or deleting information based on relevance. The agent uses five operations to reshape its working memory, reducing costs and errors while maintaining task-critical information.

agents, reasoning, efficiency

Estimating the expected output of wide random MLPs more efficiently than sampling

May 6, 2026

Wilson Wu, Victor Lecomte, Michael Winer et al.

You can estimate a wide MLP's expected output more efficiently than sampling by directly computing activation distributions layer-by-layer using mathematical tools, which is particularly useful for detecting tail risks.

This paper presents a mathematical method to estimate what a randomly initialized neural network will output on average, without actually running data through it. Instead of sampling (the standard approach), the authors use statistical tools like cumulants and Hermite expansions to track how activations behave at each layer.

efficiencyevaluationarchitecture

Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours

May 6, 2026

The Verkor Team, Ravi Krishna, Suresh Krishna et al.

Frontier LLMs can now autonomously design complex hardware accelerators from scratch, suggesting AI agents are becoming capable of end-to-end engineering tasks that previously required human teams.

An AI agent system autonomously designed a specialized hardware accelerator for LLM inference in 80 hours, starting from a research paper. The system improved dramatically from prior work, handling 80x larger tasks by leveraging newer frontier models, and produced a working FPGA design with thousands of compute units.

agentsefficiencyapplications

The First Token Knows: Single-Decode Confidence for Hallucination Detection

May 6, 2026

Mina Gabriel

A single metric based on the model's confidence distribution at the first answer token can reliably detect hallucinations without expensive multi-sample generation, making it a practical baseline for production systems.

This paper shows that checking a language model's confidence on just the first token of an answer can detect hallucinations as well as methods that generate multiple answers and compare them. The approach is faster and simpler, requiring only a single model run instead of repeated sampling.

evaluationefficiency
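
To make the first-token idea concrete, here is a minimal sketch of scoring confidence from a single decode step. It assumes you can read the raw logits for the first answer token; the function names, the normalized-entropy score, and the 0.5 threshold are illustrative assumptions, not the metric or settings used in the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def first_token_confidence(logits):
    """Summarize how peaked the first-token distribution is.

    Returns the top probability and the entropy normalized to [0, 1],
    where 1.0 means the model is maximally unsure (uniform distribution).
    Expects at least two candidate tokens.
    """
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return {
        "max_prob": max(probs),
        "normalized_entropy": entropy / math.log(len(probs)),
    }


def flag_possible_hallucination(logits, entropy_threshold=0.5):
    """Hypothetical decision rule: flag answers whose first token is uncertain."""
    score = first_token_confidence(logits)["normalized_entropy"]
    return score > entropy_threshold
```

In practice the threshold would need to be calibrated on labeled examples for the model and task at hand.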

Geometry-Aware State Space Model: A New Paradigm for Whole-Slide Image Representation

May 6, 2026

Enhui Chai, Sicheng Chen, Tianyi Zhang et al.

Representing pathology image features in complementary geometric spaces (hyperbolic + Euclidean) with efficient sequence modeling enables more accurate whole-slide image analysis by capturing both tissue hierarchy and cellular details.

This paper presents BatMIL, a new approach for analyzing whole-slide images (gigapixel pathology scans) by representing tissue features in dual geometric spaces—hyperbolic for hierarchical structures and Euclidean for local details.

architecturemultimodalefficiency

Conditional Diffusion Sampling

May 5, 2026

Francisco M. Castro-Macías, Pablo Morales-Álvarez, Saifuddin Syed et al.

CDS offers a practical way to sample from difficult distributions by combining two proven techniques—Parallel Tempering for initial exploration and exact diffusion dynamics for refinement—without requiring neural network training.

This paper introduces Conditional Diffusion Sampling (CDS), a new method for sampling from complex probability distributions that combines Parallel Tempering with diffusion-based transport.

trainingefficiency

Flow Sampling: Learning to Sample from Unnormalized Densities via Denoising Conditional Processes

May 5, 2026

Aaron Havens, Brian Karrer, Neta Shaul

Flow Sampling enables efficient sampling from unnormalized densities by reversing the diffusion process with energy guidance, making it practical for expensive-to-evaluate energy functions and non-Euclidean geometries.

This paper presents Flow Sampling, a method for drawing samples from energy functions without needing data. It adapts diffusion models to work backwards from noise, using the energy function to guide the sampling process. The approach is efficient because it minimizes how many times the energy function must be evaluated, and it works on curved spaces like spheres and hyperbolic geometry.

efficiency

SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection

May 4, 2026

Shikhar Shukla

Speculative decoding's performance depends heavily on compression level and task type—adaptive speculation length selection based on draft model signals can significantly outperform fixed hyperparameters with minimal computational overhead.

SpecKV improves LLM inference speed by dynamically choosing how many tokens a draft model should propose at each step, rather than using a fixed number. The system learns from draft model signals (confidence and entropy) to predict which proposal lengths will be accepted most often, achieving 56% faster inference than standard fixed-length approaches.

efficiency
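
Why the proposal length matters can be seen from the standard speculative decoding analysis. The sketch below is not SpecKV's learned predictor: it assumes each draft token is accepted independently with a known rate `alpha`, and that one draft step costs `draft_cost` relative to one target-model step. Both parameters and the function names are assumptions for illustration.

```python
def expected_tokens(alpha, gamma):
    """Expected tokens produced per verification step.

    With `gamma` drafted tokens and an independent per-token acceptance
    rate `alpha`, the expectation is (1 - alpha**(gamma + 1)) / (1 - alpha).
    """
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def best_gamma(alpha, draft_cost=0.1, max_gamma=16):
    """Pick the proposal length that maximizes tokens per unit of compute.

    One step costs `gamma` draft passes plus one target pass, measured in
    target-pass units. gamma = 0 means plain decoding without speculation.
    """
    def speedup(gamma):
        return expected_tokens(alpha, gamma) / (gamma * draft_cost + 1.0)

    return max(range(0, max_gamma + 1), key=speedup)
```

Under these assumptions a low acceptance rate favors short proposals and a high one favors long proposals, which is the intuition behind choosing the length adaptively rather than fixing it.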

Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross-Language Code Clone Detection

May 4, 2026

Mohamad Khajezade, Fatemeh H. Fard, Mohamed Sami Shehata

Knowledge distillation from reasoning-optimized models plus response stabilization techniques can make compact open-source models practical for cross-language code clone detection, improving both reliability and inference speed.

This paper shows how to make smaller, open-source AI models better at detecting when code does the same thing across different programming languages. The researchers use knowledge distillation—teaching a smaller model by learning from a larger reasoning-focused model—combined with techniques to make outputs more reliable and consistent.

efficiency

From Sensors to Insight: Rapid, Edge-to-Core Application Development for Sensor-Driven Applications

May 4, 2026

Komal Thareja, Anirban Mandal, Ewa Deelman

Pattern-based workflow templates combined with AI assistance can dramatically lower the barrier for non-experts to build and deploy sensor applications across edge-to-cloud infrastructure.

This paper presents a methodology for quickly building sensor-based applications that process data across edge devices and cloud infrastructure. Using AI assistance and reusable workflow patterns, the authors show how scientists can rapidly prototype applications for monitoring air quality, earthquakes, and soil moisture without needing deep expertise in distributed systems.

applicationsagentsefficiency

Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring

May 4, 2026

Arian Eamaz, Farhang Yeganegi, Mojtaba Soltanalian

Standard training loss curves can hide poorly-optimized layers in transformers—layer-wise analysis using reference bounds exposes optimization failures that aggregate metrics miss, especially critical for expensive model training.

This paper introduces a method to monitor whether transformer models are actually learning well during training by analyzing each layer individually. Instead of just looking at overall loss, the authors create lightweight reference solutions for each layer and compare them against the trained model, revealing hidden inefficiencies.

trainingevaluationefficiency

(POSTER) From Sensors to Insight: Rapid, Edge-to-Core Application Development for Sensor-Driven Applications

May 4, 2026

Komal Thareja, Anirban Mandal, Ewa Deelman

AI-assisted workflow templates let developers build sensor applications 5-10x faster by reusing patterns and shifting from code-first to intent-first design, making it practical for non-experts to deploy across edge devices and cloud.

This paper presents a method for quickly building sensor-based applications across edge and cloud systems using AI-assisted workflow templates.

applicationsagentsefficiency

Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces

May 4, 2026

Jingze Ge, Yun Liu, Xue Geng et al.

Jointly optimizing compression and adaptation using task-aware subspaces beats the standard two-step approach, delivering better accuracy with fewer parameters on both vision and language models.

JACTUS combines model compression and task adaptation in a single step rather than doing them sequentially. Instead of compressing a model first and then fine-tuning it, the method estimates what directions matter for your specific task and compresses the model while preserving those directions.

efficiencytraining

First-Order Efficiency for Probabilistic Value Estimation via A Statistical Viewpoint

May 4, 2026

Ziqi Liu, Kiljae Lee, Yuan Zhang et al.

Understanding the shared mathematical structure of value estimation methods enables designing more statistically efficient estimators—EASE reduces mean squared error by jointly optimizing sampling and surrogate functions rather than treating them separately.

This paper explains how to efficiently estimate Shapley values and similar attribution methods that explain AI model decisions. The authors show that different estimation approaches share a common mathematical structure, then use this insight to design a better estimator (EASE) that reduces computational error by optimizing both the sampling strategy and the surrogate function used.

evaluationefficiency
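
For readers new to Shapley values, the sketch below contrasts the exact definition with the plain permutation-sampling estimator that more efficient methods aim to improve on. It is not the EASE estimator from the paper; `toy_value` is a made-up three-player game used only for illustration.

```python
import itertools
import random


def _marginals(order, value, totals):
    """Add each player's marginal contribution along one ordering."""
    coalition = frozenset()
    for player in order:
        with_player = coalition | {player}
        totals[player] += value(with_player) - value(coalition)
        coalition = with_player


def exact_shapley(players, value):
    """Average marginal contributions over every ordering (factorial cost)."""
    totals = {p: 0.0 for p in players}
    orderings = list(itertools.permutations(players))
    for order in orderings:
        _marginals(order, value, totals)
    return {p: t / len(orderings) for p, t in totals.items()}


def sampled_shapley(players, value, num_samples=2000, seed=0):
    """Monte Carlo estimate using uniformly random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    order = list(players)
    for _ in range(num_samples):
        rng.shuffle(order)
        _marginals(order, value, totals)
    return {p: t / num_samples for p, t in totals.items()}


def toy_value(coalition):
    """Hypothetical game: additive weights plus a bonus when a and b cooperate."""
    weights = {"a": 1.0, "b": 2.0, "c": 3.0}
    bonus = 2.0 if {"a", "b"} <= coalition else 0.0
    return sum(weights[p] for p in coalition) + bonus
```

The exact version needs every ordering, so it is only feasible for a handful of players; the sampled version trades that cost for estimation error, which is the quantity better estimators try to reduce.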

Unsupervised Denoising of Real Clinical Low Dose Liver CT with Perceptual Attention Networks

May 1, 2026

Jingxi Pu, Tonghua Liu, Zhilin Guan et al.

You can denoise real clinical CT images without paired training data by using unsupervised learning with perceptual loss, making it practical for hospitals that can't easily create labeled datasets.

This paper tackles noise in low-dose CT scans—a real clinical problem where reducing radiation exposure creates grainy images that are hard for doctors to read.

efficiencyevaluationapplications

Make Your LVLM KV Cache More Lightweight

May 1, 2026

Xihao Chen, Yangyang Guo, Roger Zimmermann

You can cut vision-language model KV cache memory in half by intelligently compressing vision tokens based on what the text prompt actually needs, rather than keeping all visual information.

LightKV reduces GPU memory overhead in vision-language models by compressing the Key-Value cache during inference. It uses text prompts to guide which vision tokens are most important, keeping only 55% of tokens while maintaining performance and cutting memory use in half.

efficiencymultimodal

Strait: Perceiving Priority and Interference in ML Inference Serving

Apr 30, 2026

Haidong Zhao, Nikolaos Georgantas

Accurate latency prediction under GPU contention is critical for priority-aware scheduling in inference serving—Strait reduces deadline violations for high-priority tasks by modeling interference effects that traditional systems ignore.

Strait is an ML inference serving system that improves deadline satisfaction for high-priority requests by better predicting latency under GPU contention and using priority-aware scheduling.

efficiencyevaluation

Crab: A Semantics-Aware Checkpoint/Restore Runtime for Agent Sandboxes

Apr 30, 2026

Tianyuan Wu, Chaokun Chang, Lunxi Cao et al.

By observing OS-level effects of agent tool calls, Crab identifies that 75% of agent turns don't need checkpointing, enabling efficient fault tolerance and rollback without modifying agent code or sacrificing correctness.

Crab is a system that efficiently saves and restores the state of sandboxed environments where AI agents operate. It solves a key problem: agents need checkpoints for safety and fault tolerance, but saving everything every turn is too expensive.

agentsefficiencysafety

Beyond Gaussian Bottlenecks: Topologically Aligned Encoding of Vision-Transformer Feature Spaces

Apr 30, 2026

Andrew Bond, Ilkin Umut Melanlioglu, Erkut Erdem et al.

Using geometrically-aligned latent spaces (hyperspheres instead of Gaussian distributions) in autoencoders preserves 3D structure and physics better than standard approaches, which matters for building world models that understand real 3D scenes.

This paper proposes S²VAE, a new type of autoencoder that uses hyperspherical (spherical geometry) latent representations instead of traditional Gaussian ones to better preserve 3D geometry and camera motion in visual world models.

architecturemultimodalefficiency

Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression

Apr 30, 2026

Junqi Gao, Dazhi Zhang, Zhichang Guo et al.

Task vectors can be compressed to 1-5% of their original size while maintaining model performance, making it practical to store and dynamically merge multiple task-specific models without prohibitive storage costs.

This paper tackles the storage overhead problem in dynamic model merging by compressing task vectors (fine-tuned weight changes) using learnable compression techniques.

efficiencytraining

Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling

Apr 30, 2026

Ansar Aynetdinov, Patrick Haller, Alan Akbik

For non-English language models, aggressively filtering data for quality and repeating it multiple times beats training once on larger, diverse datasets—a practical insight for resource-constrained language model development.

This paper challenges the assumption that diverse data is always better for language model training. For German, the researchers found that repeatedly training on a smaller, high-quality filtered dataset outperforms training once on a larger, less-filtered dataset—even after 7 epochs of repetition.

trainingdataefficiency

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

Apr 29, 2026

Gongbo Zhang, Wen Wang, Ye Tian et al.

Cross-architecture distillation for diffusion models is now practical: you can compress large diffusion LLMs into tiny ones (13x smaller) while maintaining performance, even when teacher and student have completely different designs.

This paper introduces TIDE, a framework for distilling knowledge from large diffusion language models into much smaller ones across different architectures. Unlike previous distillation methods that work within a single model type, TIDE handles cases where teacher and student models have different designs, attention mechanisms, and tokenizers.

trainingefficiencyarchitecture

Hyper Input Convex Neural Networks for Shape Constrained Learning and Optimal Transport

Apr 29, 2026

Shayan Hundrieser, Insung Kong, Johannes Schmidt-Hieber

HyCNNs are a more parameter-efficient way to build neural networks that must output convex functions, requiring exponentially fewer parameters than previous methods while maintaining theoretical guarantees.

This paper introduces Hyper Input Convex Neural Networks (HyCNNs), a new neural network architecture that guarantees convex outputs while using far fewer parameters than existing methods.

architectureefficiency

Select to Think: Unlocking SLM Potential with Local Sufficiency

Apr 29, 2026

Wenxuan Ye, Yangyang Zhang, Xueli An et al.

Small models already generate the right answers in their candidate predictions—they just rank them poorly. Training them to re-rank their own outputs improves reasoning without external model calls.

Small language models struggle with reasoning tasks compared to large models. This paper discovers that when small models fail, the correct token from a large model is usually hidden in the small model's top-8 predictions.

efficiencyreasoningtraining

Multiple Additive Neural Networks for Structured and Unstructured Data

Apr 29, 2026

Janis Mohr, Jörg Frochte

MANN combines gradient boosting with neural networks instead of trees, enabling a single framework to handle structured and unstructured data while outperforming XGBoost and reducing hyperparameter sensitivity.

This paper presents Multiple Additive Neural Networks (MANN), which replaces decision trees in gradient boosting with shallow neural networks. MANN works with both structured data and images/audio by using CNNs and capsule networks as feature extractors, and shows better accuracy than XGBoost on standard benchmarks while being more robust to hyperparameter choices.

trainingarchitectureefficiency

Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

Apr 29, 2026

Zihan Zhao, Baotong Lu, Shengjie Lin et al.

Sparse attention algorithms only work well in practice when paired with careful system design—SPIN shows that unifying different sparsity methods and optimizing GPU-CPU memory transfers can turn algorithmic gains into real performance improvements for long-context LLM serving.

SPIN is a system for serving large language models with long contexts by combining sparse attention (which only reads relevant parts of memory) with smart memory management across GPU and CPU. It unifies different sparse attention methods into a common framework and optimizes how data moves between fast GPU memory and slower CPU memory, achieving 1.66-5.66x faster throughput than existing systems.

efficiency

Recursive Multi-Agent Systems

Apr 28, 2026

Xiyuan Yang, Jiaru Zou, Rui Pan et al.

Multi-agent systems can be made faster and more efficient by having agents refine their reasoning through recursive loops in latent space rather than text-based communication, achieving 1.2-2.4× speedup with 35-76% fewer tokens.

This paper introduces RecursiveMAS, a framework that improves multi-agent AI systems by having agents collaborate through repeated refinement cycles in a shared latent space rather than exchanging text.

agentsreasoningefficiency

Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models

Apr 28, 2026

Ajmain Inqiad Alam, Palash Roy, Chanchal K. Roy et al.

You can compress LLMs for SE tasks to 1/49th their original size with minimal accuracy loss—making them practical to deploy while cutting environmental impact dramatically.

This paper presents Carbon-Taxed Transformers (CTT), a compression pipeline that makes large language models smaller, faster, and greener for software engineering tasks.

efficiencytrainingevaluation

TSN-Affinity: Similarity-Driven Parameter Reuse for Continual Offline Reinforcement Learning

Apr 28, 2026

Dominik Żurek, Kamil Faber, Marcin Pietron et al.

Architectural parameter reuse guided by task similarity is a memory-efficient alternative to replay-based continual learning in offline RL, enabling better multi-task performance without storing historical data.

This paper presents TSN-Affinity, a method for continual offline reinforcement learning that learns multiple tasks sequentially from pre-collected datasets without forgetting previous tasks.

trainingarchitectureefficiency

Variational Neural Belief Parameterizations for Robust Dexterous Grasping under Multimodal Uncertainty

Apr 28, 2026

Clinton Enwerem, Shreya Kalyanaraman, John S. Baras et al.

Using differentiable Gaussian mixtures to represent grasp uncertainty enables fast, gradient-based optimization for worst-case robustness—achieving 10x speedup over particle filters while maintaining or improving success rates.

This paper tackles the problem of robust robotic grasping when contact forces, sensing, and external disturbances are unpredictable. Instead of using slow particle-filter approaches, the authors represent uncertainty as a learnable Gaussian mixture and optimize for worst-case performance (CVaR) using gradient-based methods.

reasoningefficiencyagents
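
The worst-case objective mentioned above, CVaR (conditional value at risk), is simply the average of the worst outcomes. The snippet is a sample-based sketch under the assumption that larger loss is worse; it does not reproduce the paper's differentiable Gaussian-mixture formulation, and the function name is illustrative.

```python
import math


def cvar(losses, alpha=0.1):
    """Empirical CVaR: mean of the worst `alpha` fraction of sampled losses."""
    k = max(1, math.ceil(alpha * len(losses)))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k
```

Optimizing this quantity instead of the plain mean pushes a policy to improve its tail behavior rather than its average case.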

Learning to Rotate: Temporal and Semantic Rotary Encoding for Sequential Modeling

Apr 27, 2026

Hailing Cheng, Daqi Sun, Xinyu Lu

Positional encodings in Transformers can be made learnable and signal-dependent by treating the rotation manifold as a separate dimension from token embeddings, unlocking better performance without significant overhead.

This paper treats the rotation space in Rotary Positional Embeddings (RoPE) as learnable rather than fixed, introducing SIREN-RoPE to encode temporal and semantic information into rotations.

architectureefficiency

Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

Apr 27, 2026

Parsa Ashrafi Fashi, Utkarsh Saxena, Mehdi Rezagholizadeh et al.

You can efficiently extend pretrained LLMs to handle much longer contexts by converting them to hybrid architectures without retraining from scratch—this is more practical than building new models entirely.

This paper presents HyLo, a method to convert pretrained Transformer language models into hybrid architectures that combine Transformers with efficient linear sequence models (like Mamba2). By reusing existing model checkpoints and adding long-context training, HyLo extends context length by 32x while reducing memory use by 90%, enabling 2M-token processing on standard hardware.

architectureefficiencyscaling

Scalable Hyperparameter-Divergent Ensemble Training with Automatic Learning Rate Exploration for Large Models

Apr 27, 2026

Hailing Cheng, Tao Huang, Chen Zhu et al.

You can use your existing multi-GPU setup to automatically find better learning rates during training by having each GPU try slightly different rates and averaging them periodically—no extra compute needed.

This paper proposes HDET, a method that uses multiple GPU replicas to explore different learning rates during training instead of computing identical updates. Replicas train independently with different learning rates, then synchronize periodically.

trainingefficiency
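
The replica scheme can be illustrated on a toy problem. This sketch only shows the train-independently-then-average loop on a one-dimensional quadratic; the objective, the learning rates, and the synchronization interval are assumptions, and HDET's automatic learning rate exploration is not modeled.

```python
def grad(w):
    """Gradient of the toy objective f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def train_divergent_replicas(learning_rates, steps=40, sync_every=5, w0=0.0):
    """Run one replica per learning rate and average weights periodically."""
    replicas = [w0 for _ in learning_rates]
    for step in range(1, steps + 1):
        # Each replica takes its own SGD step with its own learning rate.
        replicas = [w - lr * grad(w) for w, lr in zip(replicas, learning_rates)]
        if step % sync_every == 0:
            # Synchronize: every replica restarts from the averaged weights.
            mean_w = sum(replicas) / len(replicas)
            replicas = [mean_w] * len(replicas)
    return replicas[0]
```

Calling `train_divergent_replicas([0.05, 0.1, 0.2])` moves the shared weight toward the minimizer at 3.0 while each replica follows a different step size between synchronizations.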

Contextual Linear Activation Steering of Language Models

Apr 27, 2026

Brandon Hsu, Daniel Beaglehole, Adityanarayanan Radhakrishnan et al.

Adapting steering strength dynamically per context significantly improves LLM control compared to fixed steering, matching more complex methods like LoRA while remaining simpler and more interpretable.

This paper improves linear activation steering—a technique for controlling LLM behavior—by making the steering strength adapt to each input context instead of using a fixed strength for all tokens. The method, called CLAS, works better than existing approaches across multiple benchmarks and models, offering a practical way to customize LLMs with limited training data.

alignmentefficiencytraining

Relaxation-Informed Training of Neural Network Surrogate Models

Apr 24, 2026

Calvin Tsay

Training neural network surrogates with MILP-aware regularizers can dramatically speed up downstream optimization without sacrificing accuracy, by directly controlling structural properties that affect solver performance.

This paper shows how to train neural networks as surrogate models that work better when embedded in optimization problems. By adding special regularizers during training that target MILP tractability—penalizing large constants, unstable neurons, and LP relaxation gaps—the approach makes the resulting optimization problems solve 10,000x faster while keeping prediction accuracy competitive.

trainingefficiency

Aligning Dense Retrievers with LLM Utility via Distillation

Apr 24, 2026

Rajinder Sandhu, Di Mu, Cheng Chang et al.

You can train dense retrievers to match LLM utility by distilling perplexity-based signals into embeddings during training, eliminating expensive test-time LLM re-ranking while improving retrieval quality.

This paper proposes Utility-Aligned Embeddings (UAE), a method that trains dense retrievers to match the ranking quality of LLM-based re-ranking without the computational cost.

trainingefficiency

Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

Apr 24, 2026

Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo

You can train models to reason efficiently using learned abstract tokens instead of natural language, reducing inference cost by over 10× while keeping reasoning quality comparable to verbose chain-of-thought.

This paper introduces Abstract Chain-of-Thought, a method that trains language models to reason using short sequences of special tokens instead of writing out full explanations. The approach uses a warm-up phase combining supervised learning from verbal reasoning and self-distillation, then optimizes with reinforcement learning.

reasoningefficiencytraining

CRAFT: Clustered Regression for Adaptive Filtering of Training data

Apr 24, 2026

Parthasarathi Panda, Asheswari Swain, Subhrakanta Panda

You can select optimal training data 40x faster than competing methods by matching source distributions through clustering and target distributions through regression, without sacrificing quality.

CRAFT is a fast method for selecting high-quality training data subsets from massive datasets. It uses clustering and statistical matching to pick training examples whose target outputs align with your validation set, enabling efficient fine-tuning of translation models on millions of examples in under a minute.

datatrainingefficiency

Low-Rank Adaptation Redux for Large Models

Apr 23, 2026

Bingcong Li, Yilang Zhang, Georgios B. Giannakis

LoRA works by adding small, low-rank weight matrices to a pre-trained model instead of updating all parameters—signal processing principles can guide better design choices for this approach and similar efficient fine-tuning methods.

This paper examines LoRA (Low-Rank Adaptation), a widely-used technique for efficiently fine-tuning large AI models, through the lens of signal processing. It explains the core mechanisms behind LoRA variants and how classical signal processing tools can improve parameter-efficient fine-tuning methods, covering architectural design, optimization strategies, and real-world applications.

trainingefficiency
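
As a reminder of what LoRA computes, here is a dependency-free sketch of the standard update W' = W + (alpha / r) * B A, with W of shape d_out x d_in, B of shape d_out x r, and A of shape r x d_in. The helper names are illustrative, and the code shows only the basic formulation, not the signal-processing variants the paper surveys.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, lora_b, lora_a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows in A."""
    r = len(lora_a)
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, r):
    """Compare parameter counts for full fine-tuning and a rank-r adapter."""
    return {"full": d_out * d_in, "lora": r * (d_out + d_in)}
```

For a 4096 x 4096 layer at rank 8, `trainable_params` gives 16,777,216 parameters for full fine-tuning against 65,536 for the adapter, which is where the efficiency comes from.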

A Scale-Adaptive Framework for Joint Spatiotemporal Super-Resolution with Diffusion Models

Apr 23, 2026

Max Defez, Filippo Quarenghi, Mathieu Vrac et al.

A single neural network architecture can handle multiple super-resolution scales by adapting just three hyperparameters (noise schedule, context length, and mass conservation), eliminating the need to train separate models for each upscaling factor.

This paper presents a flexible deep-learning framework for video super-resolution that works across different spatial and temporal upscaling factors without retraining from scratch.

architectureefficiencyscaling

GiVA: Gradient-Informed Bases for Vector-Based Adaptation

Apr 23, 2026

Neeraj Gangwar, Rishabh Deshmukh, Michael Shavlovsky et al.

GiVA reduces the parameter cost of vector-based fine-tuning by 8x compared to existing methods while matching LoRA's speed, making extreme parameter efficiency practical for real-world model adaptation.

GiVA improves vector-based adaptation—a super-efficient way to customize large AI models—by using gradient information during initialization. Instead of requiring 8 times more parameters than LoRA to work well, GiVA achieves similar performance with far fewer parameters and faster training, making it practical for adapting massive models on limited budgets.

efficiencytraining

A Multi-Stage Warm-Start Deep Learning Framework for Unit Commitment

Apr 23, 2026

Muhy Eddin Za'ter, Anna Van Boven, Bri-Mathias Hodge et al.

Deep learning can accelerate hard optimization problems by providing intelligent warm-start solutions that reduce the search space, rather than replacing traditional solvers entirely.

This paper uses a transformer neural network to predict electricity generator schedules 72 hours ahead, then refines those predictions with rule-based corrections and feeds them to a traditional optimization solver as a starting point.

applicationsreasoningefficiency

Addressing Image Authenticity When Cameras Use Generative AI

Apr 23, 2026

Umar Masud, Abhijith Punnappurath, Luxi Zhao et al.

Camera-embedded AI enhancements can alter image semantics without users knowing—this work enables recovery of authentic pre-enhancement images using a tiny stored decoder, raising important questions about transparency in computational photography.

Modern cameras increasingly use AI to enhance images during capture (better zoom, low-light processing), but this can add hallucinated content that users don't realize isn't authentic.

safetyefficiencyapplications

Replay-buffer engineering for noise-robust quantum circuit optimization

Apr 23, 2026

Akash Kundu, Sebastian Feld

Treating the replay buffer as a primary algorithmic lever—not just a storage mechanism—can dramatically improve quantum circuit optimization by adapting how past experiences are sampled and transferred across different noise conditions.

This paper improves deep reinforcement learning for quantum circuit optimization by redesigning how the algorithm stores and reuses past experiences.

trainingefficiency

Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows

Apr 23, 2026

Anuj Sadani, Deepak Kumar

Tool schema injection is a hidden operational cost in agent systems—Tool Attention solves this by filtering irrelevant tools and deferring full schema loading, reducing per-turn tokens from ~47k to ~2.4k without sacrificing capability.

This paper introduces Tool Attention, a middleware system that dramatically reduces the token overhead from injecting tool schemas into LLM agents. By using smart filtering (based on task intent and access rules) and lazy loading of full schemas only when needed, it cuts tool-related tokens by 95% in multi-tool deployments, making agentic workflows more efficient and cost-effective.

agentsefficiencyarchitecture

Thinking with Reasoning Skills: Fewer Tokens, More Accuracy

Apr 23, 2026

Guangxiang Zhao, Qilong Shi, Xusen Xiao et al.

By retrieving learned reasoning skills at inference time instead of reasoning from scratch, you can reduce token usage and improve accuracy—making LLM reasoning cheaper and faster for practical deployment.

This paper proposes storing reusable reasoning skills learned from past problem-solving attempts, then retrieving and applying them during inference to guide new reasoning. Instead of reasoning from scratch each time, the model recalls relevant skills to avoid redundant work and reach solutions faster. Tests on coding and math tasks show it uses fewer tokens while improving accuracy.

reasoningefficiencytraining

Transferable Physics-Informed Representations via Closed-Form Head Adaptation

Apr 23, 2026

Jian Cheng Wong, Isaac Yin Chung Lai, Pao-Hsiung Chiu et al.

Physics-informed neural networks can be made dramatically faster and more generalizable by learning shared representations across PDE families and using closed-form adaptation, enabling accurate predictions on new problems without retraining.

This paper introduces Pi-PINN, a physics-informed neural network that learns reusable representations for solving different partial differential equations (PDEs). Instead of training a separate model for each PDE, Pi-PINN learns a shared representation and adapts to a new PDE by solving for the output head in closed form with a pseudoinverse, achieving 100-1000x faster predictions than standard PINNs.

efficiencyreasoning

FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels

Apr 22, 2026

Sina Gholami, Abdulmoneam Ali, Tania Haghighi et al.

By analyzing the spectral structure of feature representations, you can identify noisy labels in federated learning and use clean clients to help relabel corrupted data—without needing to share raw data or redesign loss functions.

FedSIR tackles a major challenge in federated learning: when training data across distributed devices contains mislabeled examples. The method identifies which devices have clean vs. noisy labels by analyzing the mathematical structure of their learned features, then uses clean devices to help noisy devices fix their labels.

trainingdataefficiency

Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

Apr 22, 2026

Yiming Bian, Joshua M. Akey

You can now run exact attention on billion-token sequences on a single GPU by streaming chunks through memory—no approximation needed, just smarter scheduling of the computation.

This paper solves the memory problem that prevents long-context language models from running on single GPUs. Instead of approximating attention (which loses accuracy), it mathematically decomposes attention into smaller independent chunks that can be processed one at a time, streaming results without keeping everything in memory at once.

efficiencyarchitecture
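
One well-known way to compute exact attention chunk by chunk is online-softmax rescaling, sketched below for a single query vector. This is an illustration of why chunked processing need not be approximate; the paper's own decomposition and scheduling may differ, and all names here are assumptions.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Reference: softmax(q . K / sqrt(d)) @ V with everything in memory."""
    scores = [_dot(q, k) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / z
            for j in range(len(values[0]))]


def streamed_attention(q, keys, values, chunk_size):
    """Same result, but only one chunk of keys and values is touched at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = [_dot(q, k) / math.sqrt(len(q)) for k in k_chunk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new reference maximum.
        # On the first chunk this factor is exp(-inf) = 0.0.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_chunk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Only the running maximum, the running normalizer, and one output-sized accumulator persist between chunks, so peak memory depends on the chunk size rather than the sequence length.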

FASTER: Value-Guided Sampling for Fast RL

Apr 21, 2026

Perry Dong, Alexander Swerdlow, Dorsa Sadigh et al.

You can get the benefits of expensive test-time sampling in RL by learning to filter action candidates early in the generation process, reducing compute without sacrificing performance.

FASTER is a method that speeds up reinforcement learning by filtering action candidates during the denoising process of diffusion-based policies, rather than waiting until denoising completes. It models this filtering as a decision problem with a learned value function, achieving the same performance as expensive sampling methods while cutting computational costs significantly.

efficiency, reasoning, training

FB-NLL: A Feature-Based Approach to Tackle Noisy Labels in Personalized Federated Learning

Apr 21, 2026

Abdulmoneam Ali, Ahmed Arafa

Instead of clustering users during training (which is vulnerable to noisy labels), group them upfront using feature covariance structure, then fix label errors by checking whether examples align with learned feature subspaces.

FB-NLL tackles noisy labels in federated learning by clustering users based on feature geometry rather than training dynamics, then correcting mislabeled data using feature alignment. This one-shot approach avoids the communication overhead of iterative methods while handling low-quality data that typically corrupts personalized federated learning.

training, data, efficiency

Adaptive MSD-Splitting: Enhancing C4.5 and Random Forests for Skewed Continuous Attributes

Apr 21, 2026

Jake Lee

Adaptive binning that adjusts to data skewness can significantly improve decision tree and Random Forest accuracy on skewed real-world data without sacrificing the computational efficiency gains of statistical discretization.

This paper improves how decision trees handle continuous numerical data by introducing Adaptive MSD-Splitting (AMSD), which adjusts binning strategies based on data skewness instead of using fixed cutoffs. The method maintains fast O(N) performance while improving accuracy by 2-4%, and extends to Random Forests for better large-scale performance on real-world datasets.

training, efficiency

Sessa: Selective State Space Attention

Apr 20, 2026

Liubomyr Horbatko

Sessa's hybrid architecture lets the influence of distant tokens decay as a power law in distance, O(ℓ^(-β)), instead of exponentially or linearly, making it more effective for long-context language modeling while staying competitive on standard benchmarks.

Sessa combines attention mechanisms with state-space model feedback paths to improve how models retrieve information from long contexts.

architecture, efficiency, reasoning

Latent Phase-Shift Rollback: Inference-Time Error Correction via Residual Stream Monitoring and KV-Cache Steering

Apr 20, 2026

Manan Gupta, Dhruv Kumar

You can catch and fix LLM reasoning errors at inference time by monitoring internal layer activations for phase shifts, then steering the model back on track—no retraining needed, and it's 5× cheaper than sampling multiple outputs.

This paper introduces a method to fix reasoning errors in language models during generation by monitoring internal signals and rolling back to correct course. Instead of retraining, it detects when a model makes a wrong turn by watching for sudden directional shifts in its internal computations, then resets the model's memory and injects a corrective signal.

reasoning, efficiency
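The monitoring step in the Latent Phase-Shift Rollback entry above can be pictured as a check on how sharply the direction of consecutive hidden-state vectors changes. The sketch below is a guess at the simplest form such a detector could take: cosine similarity against a fixed threshold. The function names, the threshold value, and the toy vectors are assumptions, and the rollback and KV-cache steering steps are not shown.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors; 0.0 if either has zero norm.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def find_phase_shifts(hidden_states, threshold=0.5):
    # Return step indices t where the state at t points in a sharply
    # different direction from the state at t - 1.
    shifts = []
    for t in range(1, len(hidden_states)):
        if cosine(hidden_states[t - 1], hidden_states[t]) < threshold:
            shifts.append(t)
    return shifts

# Toy trajectory: the direction flips abruptly at step 3.
states = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [-0.7, 0.6], [-0.8, 0.5]]
rollback_points = find_phase_shifts(states)
```

In this toy trajectory only step 3 falls below the threshold, so it is the one candidate point at which generation would be rolled back.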

Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion

Apr 20, 2026

Terry Leitch

Backend infrastructure (llama.cpp vs MLX) matters more than quantization level for local LLM performance, and long-context tasks expose memory limits that cloud models handle better—critical for practitioners choosing between cloud and local deployment.

This paper evaluates large language models on System Dynamics tasks, comparing cloud APIs (77–89% accuracy) against locally hosted open-source models (up to 77% on causal diagram extraction).

evaluation, efficiency, applications

GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

Apr 20, 2026

Alireza Dadgarnia, Soroush Tabesh, Mahdi Nikdan et al.

GSQ achieves near-frontier compression accuracy at 2-3 bits using standard scalar quantization compatible with existing inference hardware, making ultra-low-precision models practical without complex custom implementations.

GSQ is a new quantization method that compresses large language models to 2–3 bits per parameter while maintaining accuracy. It uses the Gumbel-Softmax relaxation to assign each weight to one of a small set of discrete values, bridging the gap between simple but limited scalar quantization and complex vector quantization methods that are hard to deploy.

efficiency, training
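For readers unfamiliar with the Gumbel-Softmax mentioned in the GSQ entry above, the following sketch shows the forward pass of a relaxed assignment of one weight to a small scalar grid. It is not GSQ itself: the distance-based logits, the 4-level grid, the temperature, and the function names are assumptions for illustration, and a real implementation would run inside an autodiff framework so the assignment can be trained.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    # Add Gumbel(0, 1) noise to each logit, then apply a temperature softmax.
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        noisy.append((logit - math.log(-math.log(u))) / tau)
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    z = sum(exps)
    return [e / z for e in exps]

def soft_quantize(weight, levels, tau, rng):
    # Hypothetical logits: levels closer to the weight get higher scores.
    logits = [-(weight - level) ** 2 for level in levels]
    probs = gumbel_softmax(logits, tau, rng)
    # Relaxed quantized value (a probability-weighted mix of levels).
    soft_value = sum(p * level for p, level in zip(probs, levels))
    # Hard choice: the single most probable level for this sample.
    hard_value = levels[probs.index(max(probs))]
    return probs, soft_value, hard_value

rng = random.Random(0)
levels = [-1.5, -0.5, 0.5, 1.5]  # a 2-bit scalar grid (4 levels)
probs, soft_value, hard_value = soft_quantize(0.4, levels, tau=0.5, rng=rng)
```

Lowering `tau` pushes `probs` toward a one-hot vector, so the soft value approaches the hard one; that is the usual way a relaxed assignment is annealed into a discrete one.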

FUSE: Ensembling Verifiers with Zero Labeled Data

Apr 20, 2026

Joonhyuk Lee, Virginia Ma, Sarah Zhao et al.

You can build better verification systems by combining multiple imperfect judges without any ground truth labels—FUSE shows this works as well as supervised approaches on real benchmarks.

FUSE is a method for combining multiple imperfect AI judges (verifiers) to better evaluate model outputs without needing any labeled correct answers. It uses spectral algorithms to ensemble different verifiers while controlling how they depend on each other, achieving results comparable to methods that do use labeled data.

evaluation, training, efficiency