Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

1492 papers, 23 this month, 12 topics

All, Evaluation 42, Training 39, Agents 31, Reasoning 27, Efficiency 25, Safety 18, Multimodal 17, Applications 17, Alignment 11, Data 11, Architecture 8, Scaling 6

Jul 6 – Jul 12 (2)

Weak-to-Strong Generalization via Direct On-Policy Distillation

Jul 6, 2026

Shiyuan Feng, Huan-ang Gao, Haohan Chi et al.

You can reuse RL training from cheaper small models to improve large models by treating the policy shift (not the final policy) as a dense reward signal—this cuts post-training costs while maintaining reasoning gains across model scales.

This paper proposes Direct-OPD, a method to transfer reinforcement learning gains from smaller models to larger ones without expensive retraining. Instead of distilling the final policy, it extracts the policy shift that RL induced (via log-ratio comparison) and applies it as an implicit reward signal on the stronger model's own data, enabling efficient scaling of RL-based reasoning improvements.

training, efficiency, reasoning
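To make the "policy shift as a reward" idea in the Direct-OPD summary concrete, here is a minimal sketch. It assumes the shift is measured as a per-token log-ratio between the RL-tuned small model and its base version; the summary only mentions a "log-ratio comparison", so the exact formulation, the function name `policy_shift_rewards`, and the toy numbers are illustrative assumptions, not the paper's implementation.

```python
import math


def policy_shift_rewards(logp_tuned, logp_base):
    """Per-token log-ratio: log p_tuned(token) - log p_base(token).

    Both inputs are log-probabilities that the small model (after and
    before RL) assigns to the same tokens of one response.
    """
    if len(logp_tuned) != len(logp_base):
        raise ValueError("log-prob sequences must have the same length")
    return [t - b for t, b in zip(logp_tuned, logp_base)]


# Hypothetical log-probs for a 3-token response sampled from the large model.
logp_base = [math.log(0.50), math.log(0.20), math.log(0.40)]
logp_tuned = [math.log(0.50), math.log(0.40), math.log(0.10)]

rewards = policy_shift_rewards(logp_tuned, logp_base)
# Positive entries mark tokens that RL made more likely in the small model,
# negative entries mark tokens it made less likely, and zero means no shift.
```

In this reading, the large model would be rewarded on its own samples wherever RL pushed the small model's probability up, which is what makes the signal dense (one value per token) rather than a single score per response.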

Interpretable Human-Label-Free Deep Learning for Real-Bogus Classification with Uncertainty Quantification

Jul 6, 2026

Raphaël Bonnet-Guerrini, Bruno Sanchez, Dominique Fouchez et al.

You can train accurate astronomical classifiers without expensive human labels by combining synthetic data injection with robust handling of noisy labels, and get reliable confidence scores through a hybrid uncertainty approach.

This paper develops a Real-Bogus classification system for astronomical transients that requires no human-labeled training data. It uses simulated transient injections combined with noisy survey data and a dual-network training approach to reliably distinguish real astronomical events from false detections, while also providing calibrated uncertainty estimates.

Jun 29 – Jul 5 (36)

LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning

Jul 2, 2026

Matteo Boglioni, Thibault Rousset, Siva Reddy et al.

Current unlearning methods are imprecise at targeting specific parameters where knowledge is stored, making them vulnerable to attacks that resurface the data—precise localization matters more than output-level performance.

LACUNA is a new benchmark for testing whether LLM unlearning methods actually erase sensitive data from model parameters or just hide it. The researchers inject fake personal information into specific weights of language models, then check if unlearning methods successfully target those exact parameters.

safety, evaluation, training
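One way to picture "localization precision" is as set overlap between the parameters where the fake data was injected and the parameters an unlearning method actually changed. The sketch below is an assumption for illustration only: the LACUNA summary does not state the benchmark's actual metric, and the function name `localization_scores` and the parameter identifiers are hypothetical.

```python
def localization_scores(injected, modified):
    """Return (precision, recall) of the modified set against the injected set.

    precision: share of modified parameters that held the injected data.
    recall:    share of injected parameters that the method modified.
    """
    injected, modified = set(injected), set(modified)
    overlap = len(injected & modified)
    precision = overlap / len(modified) if modified else 0.0
    recall = overlap / len(injected) if injected else 0.0
    return precision, recall


# Hypothetical (module name, weight index) identifiers.
injected = {("layer3.mlp", 12), ("layer3.mlp", 40)}
modified = {
    ("layer3.mlp", 12),
    ("layer3.mlp", 40),
    ("layer7.attn", 5),
    ("layer9.mlp", 1),
}

precision, recall = localization_scores(injected, modified)
```

Here the method touches every injected weight (recall 1.0) but half of its changes land elsewhere (precision 0.5), which is the kind of imprecision the summary describes.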

Program-as-Weights: A Programming Paradigm for Fuzzy Functions

Jul 2, 2026

Wentao Zhang, Liliana Hotsko, Woojeong Kim et al.

Instead of calling large language models for every fuzzy task, you can compile a natural-language specification once into a tiny reusable neural artifact that runs locally and cheaply—shifting from per-input problem solving to one-time function compilation.

This paper introduces Program-as-Weights (PAW), a method to compile natural-language function specifications into small, locally executable neural adapters. A 4B compiler generates parameter-efficient adapters that run on a lightweight 0.6B interpreter, matching the performance of much larger models while using 50x less memory and running efficiently on consumer hardware such as a MacBook M3.

Jun 22 – Jun 28 (32)

Second-Order KKT Guarantees for Bregman ADMM in Nonconvex and Non-Lipschitz Optimization

Jun 26, 2026

Shuang Li, Zhihui Zhu, Qiuwei Li

Bregman ADMM provably avoids saddle points and finds second-order stationary solutions for nonconvex problems without Lipschitz gradient requirements, making it applicable to polynomial and tensor optimization problems where standard methods fail.

This paper analyzes Bregman ADMM, an optimization algorithm for linearly constrained nonconvex problems that does not require standard smoothness assumptions.

training

HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech

Jun 26, 2026

Sihang Nie, Xiaofen Xing, Rui Xing et al.

Separating content and emotion into distinct latent spaces during training prevents reward conflicts and enables better emotional control in TTS systems without sacrificing intelligibility.

This paper addresses emotional expressiveness in LLM-based text-to-speech by proposing HPRO, a hierarchical reward optimization framework that separates emotional and semantic information to avoid conflicting gradients, then progressively aligns rewards across frame, word, and sentence levels to improve emotional control while maintaining speech clarity.

training

Jun 15 – Jun 21 (28)

UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

Jun 18, 2026

Wenhao Chi, Arkaprava Sinha, Dominick Reilly et al.

Using proxy models as intermediaries between diverse teachers prevents conflicting gradients and enables learning richer egocentric representations from heterogeneous knowledge sources—achieving better results than naive multi-teacher distillation.

This paper introduces UNIEGO, a unified egocentric video encoder trained through a novel multi-teacher distillation framework.

multimodal, training, architecture

Toward Calibrated Mixture-of-Experts Under Distribution Shift

Jun 18, 2026

Gina Wong, Drew Prinster, Suchi Saria et al.

Expert-level calibration alone isn't enough for soft-routed MoE models under distribution shift—you need to explicitly calibrate the routing mechanism's aggregate predictions to maintain trustworthy uncertainty estimates.

This paper studies how mixture-of-experts (MoE) models maintain calibrated predictions under distribution shift. The authors show that calibrating individual experts works for hard-routed models but fails for soft-routed ones, and propose an adversarial reweighting method to improve calibration across different routing mechanisms and data distributions.
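The difference between hard and soft routing in the summary above can be sketched in a few lines. The example assumes a binary task where each expert outputs a probability for the positive class; the function names, the gate values, and the simple binned calibration error are illustrative assumptions and are not taken from the paper.

```python
def hard_route(gates, expert_probs):
    """Hard routing: use only the prediction of the highest-weighted expert."""
    best = max(range(len(gates)), key=lambda k: gates[k])
    return expert_probs[best]


def soft_route(gates, expert_probs):
    """Soft routing: gate-weighted average of all expert predictions."""
    return sum(g * p for g, p in zip(gates, expert_probs))


def binned_calibration_error(probs, labels, n_bins=10):
    """Weighted gap between mean predicted probability and positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    error = 0.0
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            pos_rate = sum(y for _, y in b) / len(b)
            error += (len(b) / len(probs)) * abs(mean_p - pos_rate)
    return error


gates = [0.7, 0.3]          # hypothetical router weights for two experts
expert_probs = [0.9, 0.1]   # each expert's positive-class probability

hard_pred = hard_route(gates, expert_probs)
soft_pred = soft_route(gates, expert_probs)
toy_error = binned_calibration_error([0.9, 0.9, 0.1, 0.1], [1, 1, 0, 0])
```

With hard routing the output is exactly one expert's prediction, so that expert's calibration carries over. With soft routing the output is a blend that no single expert produced, which is one intuition for why the aggregate may need its own calibration check.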

Jun 8 – Jun 14 (2)

Persona-Pruner: Sculpting Lightweight Models for Role-Playing

Jun 12, 2026

Jinsu Kim, Jihoon Tack, Noah Lee et al.

You can shrink language models for specific character personas by 50%+ while keeping 93.8% of role-playing quality, making multi-NPC applications practical without sacrificing character consistency.

This paper introduces Persona-Pruner, a technique that creates lightweight language models optimized for specific character roles by identifying and preserving only the persona-relevant parts of a full model. Unlike standard pruning that indiscriminately removes parameters, this method maintains role-playing quality while reducing computational cost—useful for applications with many NPCs.

efficiency, training, applications

AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization

Jun 12, 2026

Junlong Tong, Wenqi Xu, Yingqi Fan et al.

Models can now learn to reason efficiently during streaming input instead of only after seeing everything, using fine-grained reward signals that separately optimize early thinking and final deliberation phases.

AdaSR enables language models to reason incrementally as data streams in (like audio or video), rather than waiting for complete input. It uses a new training method called Hierarchical Relative Policy Optimization to teach models when to think and how much computation to spend at each stage, balancing accuracy, speed, and efficiency.

Papers

DemoPSD: Disagreement-Modulated Policy Self-Distillation

Beyond Adam: SOAP and Muon for Faster, Label-Efficient Training of Machine Learning Interatomic Potentials

Controllable Sim Agents with Behavior Latents

Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs

Neuron-Aware Data Selection for Annotation-Free LLM Self-Distillation

Understanding the Robustness of Distributed Self-Supervised Learning Frameworks Against Non-IID Data

Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach

WorldSample: Closed-loop Real-robot RL with World Modelling

Neuron-Aware Active Few-Shot Learning for LLMs

LIME: Learning Intent-aware Camera Motion from Egocentric Video

DecompRL: Solving Harder Problems by Learning Modular Code Generation

Transformer Geometry Observatory TGO-II: Representational Similarity Observatory

Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training

Language-Critique Imitation Learning from Suboptimal Demonstrations

AutoMem: Automated Learning of Memory as a Cognitive Skill

The State-Prediction Separation Hypothesis

Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations

Decision-Aware Training for Sample-Based Generative Models

Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision

QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

Generative Skill Composition for LLM Agents

FedLAB: Traceable Semantic Codebooks for Federated Multimodal Graph Foundation Learning

Scalable Behaviour Cloning on Browser Using via Skill Distillation

Self-Study Reconsidered: The Hidden Fragility of Learning from Self-Generated QA

Radial Suppression Accelerates Algorithmic Generalization: A Geometric Analysis of Delayed Generalization

LeVo 2: Stable and Melodious Song Generation via Hierarchical Representation Modeling and Progressive Post-Training

One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining

Pessimism's Paradox: Conservative Offline Training Amplifies Reward Hacking During Online Adaptation in Reasoning Models

DOPD: Dual On-policy Distillation

Optimization Dynamics Imprint Semantic Specificity in Contrastive Embedding Norms

Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent

C$^{2}$R: Cross-sample Consistency Regularization Mitigates Feature Splitting and Absorption in Sparse Autoencoders

How Width and Data Shape Generalization Scaling Laws in Quadratic Neural Networks

DanceOPD: On-Policy Generative Field Distillation

Reinforcement Learning without Ground-Truth Solutions can Improve LLMs

Autoregressive Boltzmann Generators

Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning

Generative Models on Analog Hardware with Dynamics

Designing Reward Signals for Portable Query Generation: A Case Study in Industrial Semantic Job Search

Simulation-based inference for rapid Bayesian parameter estimation in epidemiological models: a comparison with MCMC

Effective Covariance Dynamics in Solvable High-Dimensional GANs

The Geometry of Updates: Fisher Alignment at Vocabulary Scale

CARVE: Content-Aware Recurrent with Value Efficiency for Chunk-Parallel Linear Attention

Hierarchical Muon: Tiled Newton-Schulz Updates for Efficient Muon Optimization

Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes

Learning Action Priors for Cross-embodiment Robot Manipulation

On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity

Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents

A cross-process welding penetration status prediction algorithm based on unsupervised domain adaptation in laser and TIG welding

Natural Ungrokking: Asymmetric Control of Which Rules Survive Pretraining

Detect, Unlearn, Restore: Defending Text Summarization Models Against Data Poisoning

InSight: Self-Guided Skill Acquisition via Steerable VLAs

OpenThoughts-Agent: Data Recipes for Agentic Models

IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

Matching Tasks to Objectives: Fine-Tuning and Prompt-Tuning Strategies for Encoder-Decoder Pre-trained Language Models

Less is More: Quality-Aware Training Data Selection for Scientific Summarization

Randomized YaRN Improves Length Generalization for Long-Context Reasoning

AIR: Adaptive Interleaved Reasoning with Code in MLLMs

Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?

Teaching LLMs String Matching, Backtracking, and Error Recovery to Deduce Bases and Truth Tables for the Combinatorially Exploding Bit Manipulation Puzzles

Muown Implicitly Performs Angular Step-size Decay

Diffusion Models Adapt to Low-Dimensional Structure Under Flexible Coefficient Choices