Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

861 papers · 16 this month · 12 topics

All · Efficiency 37 · Reasoning 36 · Training 35 · Evaluation 29 · Architecture 23 · Agents 23 · Multimodal 17 · Applications 15 · Alignment 9 · Safety 8 · Scaling 8 · Data 3

May 18 – May 24 (4)

MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

May 21, 2026

Qianshu Cai, Yonggang Zhang, Xianzhang Jia et al.

Self-evolving agents need source-code access, not just prompt editing—structural bugs in routing and state management can't be fixed by text-layer changes alone, and MOSS demonstrates this works in production with measurable improvements.

MOSS is a system that lets autonomous agents automatically fix themselves by rewriting their own source code based on real failures. Unlike existing approaches that only modify text files like prompts, MOSS can change the actual code structure—routing logic, state management, dispatch—making it possible to fix a much broader class of problems.

agents, safety

LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

May 21, 2026

Sadia Asif, Mohammad Mohammadi Amiri, Momin Abbas et al.

When LLM agents communicate through shared KV caches for efficiency, you need explicit safeguards—LCGuard shows how to block sensitive information leakage at the representation level without breaking task coordination.

LCGuard is a safety framework that protects sensitive information when multiple AI agents share transformer key-value caches to coordinate tasks. It uses adversarial training to transform shared cache data so that agents can't reconstruct each other's private inputs, while keeping the information useful for task performance.

May 11 – May 17 (2)

MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs

May 14, 2026

Rui Wen, Mark Russinovich, Andrew Paverd et al.

LLM backdoors don't need suspicious text triggers—attackers can hide them in positional encoding, making them invisible to content-based defenses and activatable through normal conversation length patterns.

This paper reveals a new way to attack large language models by exploiting how they encode token positions rather than modifying the text itself. The researchers show that backdoors can be triggered by input length alone, allowing attackers to make models leak secrets or misbehave without leaving obvious traces in the conversation.

safety

Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands

May 14, 2026

Pratinav Seth, Vinay Kumar Sankarapu

Behavioral evaluations alone cannot verify the safety claims regulators now demand—you need mechanistic evidence like activation analysis to actually verify what's happening inside AI models, not just what they output.

This paper argues that current AI safety evaluation methods (like red-teaming and behavioral testing) cannot verify the deep safety properties that AI governance frameworks now require, such as the absence of hidden objectives or resistance to loss of control.

safety

May 4 – May 10 (8)

Conformal Path Reasoning: Trustworthy Knowledge Graph Question Answering via Path-Level Calibration

May 8, 2026

Shuhang Lin, Chuhao Zhou, Xiao Lin et al.

Conformal Path Reasoning provides statistical guarantees that your KGQA system will include the correct answer in its output set, while keeping that set compact and practical—solving a real reliability problem in knowledge graph reasoning.

This paper improves Knowledge Graph Question Answering by adding statistical guarantees to answer reliability. It uses conformal prediction—a technique that creates sets of answers with proven coverage rates—combined with a neural network that learns to score reasoning paths better. The result is more trustworthy answers with smaller, more useful prediction sets.

reasoning, evaluation, safety
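The coverage guarantee mentioned in the Conformal Path Reasoning summary comes from conformal prediction in general. The sketch below shows plain split conformal prediction only, not the paper's path-level method or its learned scorer. The nonconformity scores, candidate answers, and function names are hypothetical and chosen for illustration.

```python
import math

def conformal_quantile(calibration_scores, alpha):
    """Return the finite-sample-corrected (1 - alpha) quantile of the scores."""
    n = len(calibration_scores)
    # Rank of the order statistic used by split conformal prediction.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every candidate.
        return math.inf
    return sorted(calibration_scores)[rank - 1]

def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [c for c, s in candidate_scores.items() if s <= threshold]

# Hypothetical nonconformity scores of the true answers on held-out questions
# (lower means the system found the true answer more plausible).
calibration = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.41, 0.55, 0.60]
threshold = conformal_quantile(calibration, alpha=0.25)

# Hypothetical candidate answers for a new question.
candidates = {"Paris": 0.08, "Lyon": 0.52, "Berlin": 0.90}
answers = prediction_set(candidates, threshold)
print(threshold, answers)
```

Under the usual exchangeability assumption, sets built this way contain the correct answer with probability at least 1 - alpha. A better scoring function shrinks the sets without changing that guarantee.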

When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

May 7, 2026

Sushant Gautam, Finn Schwall, Annika Willoch Olstad et al.

When deploying LLMs in new languages or sectors without existing safety benchmarks, you can't collapse safety comparisons into a single score—you must report the full context: which scenarios, which judge, which risk measure, and the uncertainty around each comparison.

This paper tackles a real-world problem: comparing AI models for safety when no labeled benchmark exists yet. Instead of relying on ground-truth labels, the authors validate safety scores through three checks—whether models respond to safety changes, whether model differences dominate over measurement noise, and whether results stay consistent across retests.

Apr 27 – May 3 (15)

When RAG Chatbots Expose Their Backend: An Anonymized Case Study of Privacy and Security Risks in Patient-Facing Medical AI

May 1, 2026

Alfredo Madrid-García, Miguel Rujas

Medical RAG chatbots often expose sensitive backend details and patient data through client-side communication—use server-side security controls and independent audits before deploying patient-facing AI systems.

Researchers audited a patient-facing medical chatbot and found critical security flaws: sensitive system prompts, API endpoints, and 1,000 patient conversations were exposed through basic browser inspection. The study shows how RAG chatbots can leak backend configuration and private health data without authentication, highlighting governance gaps in AI healthcare deployment.

safety, applications, evaluation

GeoContra: From Fluent GIS Code to Verifiable Spatial Analysis with Geography-Grounded Repair

May 1, 2026

Yinhao Xiao, Rongbo Xiao, Yihan Zhang

LLM-generated GIS code can look correct but violate geographic rules; GeoContra's contract-based verification catches these semantic errors before they produce wrong spatial analysis.

GeoContra is a verification and repair system that catches geographic errors in AI-generated GIS code. It checks that spatial analysis preserves coordinate systems, topology, units, and geographic plausibility—catching bugs like negative travel times or mismatched coordinate systems that would otherwise produce executable but wrong results.
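To make the idea of a geographic contract concrete, here is a minimal sketch of two such checks: a shared coordinate reference system and non-negative travel times. It is not GeoContra's implementation. The data shapes, field names, and the `check_contracts` helper are assumptions made for illustration, and the sketch uses plain dictionaries rather than a GIS library.

```python
def check_contracts(layers, results):
    """Return a list of human-readable contract violations.

    `layers` maps a layer name to its CRS string, and `results` is a list of
    dicts with a `travel_time_min` field. Both shapes are hypothetical.
    """
    violations = []

    # Contract 1: every input layer must share one coordinate reference system.
    crs_values = set(layers.values())
    if len(crs_values) > 1:
        violations.append(f"mixed CRS: {sorted(crs_values)}")

    # Contract 2: travel times must be non-negative.
    for i, row in enumerate(results):
        if row["travel_time_min"] < 0:
            violations.append(f"row {i}: negative travel time")

    return violations

# Hypothetical output of an LLM-generated analysis script.
layers = {"roads": "EPSG:4326", "clinics": "EPSG:3857"}
results = [{"travel_time_min": 12.5}, {"travel_time_min": -3.0}]

problems = check_contracts(layers, results)
for p in problems:
    print(p)
```

Both checks pass or fail independently of whether the script ran without errors, which is the gap between executable and correct that the summary describes.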

Apr 20 – Apr 26 (15)

Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities

Apr 24, 2026

Ilana Nguyen, Harini Suresh, Thema Monroe-White et al.

LLMs systematically misrepresent Global Majority nationalities through stereotyping and one-dimensional portrayals, creating real risks for applications like asylum interviews. These harms are structural, not just surface-level, and require deliberate mitigation strategies.

This paper reveals how popular LLMs perpetuate harmful stereotypes and biases against people from Global Majority countries in generated narratives. Researchers found that non-Western nationalities are underrepresented in neutral stories but overrepresented in negative character roles—over 50 times more likely to appear in subordinated positions.

safety, evaluation, alignment

How Supply Chain Dependencies Complicate Bias Measurement and Accountability Attribution in AI Hiring Applications

Apr 24, 2026

Gauri Sharma, Maryam Molamohammadi

Bias in AI hiring isn't just a technical problem—it's a supply chain problem. Even if each vendor's component works fairly in isolation, their combination can discriminate, yet no single party has visibility into the whole system or clear accountability for fixing it.

Apr 13 – Apr 19 (13)

ASMR-Bench: Auditing for Sabotage in ML Research

Apr 17, 2026

Eric Gan, Aryan Bhatt, Buck Shlegeris et al.

Current AI systems and auditors are poor at detecting subtle sabotage in research code—even frontier LLMs only catch 77% of cases—highlighting a critical gap in oversight for autonomous AI research.

This paper introduces ASMR-Bench, a benchmark for testing whether AI systems and human auditors can detect sabotage hidden in ML research code. The benchmark includes 9 real ML projects with intentionally introduced bugs that change experimental results while keeping the paper's description accurate.

safety, evaluation, agents

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

Apr 16, 2026

Manan Gupta, Dhruv Kumar

LLM judges appear reliable in aggregate but are actually inconsistent on individual inputs; prediction set width reliably indicates per-document difficulty and can serve as a confidence measure for automatic evaluation.

This paper diagnoses why LLM judges give inconsistent scores for text evaluation. Using two methods—checking if judges contradict themselves and using conformal prediction to quantify uncertainty—the authors show that judges are unreliable on individual documents even when they seem consistent overall.
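One simple way to check whether a judge contradicts itself is to count transitivity violations: triples where the judge prefers A over B, B over C, and C over A. The sketch below illustrates that general idea and is not necessarily the paper's exact procedure. The verdicts and the `count_cyclic_triples` helper are hypothetical.

```python
from itertools import combinations

def count_cyclic_triples(items, prefers):
    """Count triples whose pairwise preferences form a cycle.

    `prefers` is a set of (winner, loser) pairs and is assumed to contain
    exactly one direction for every pair of items.
    """
    cycles = 0
    for a, b, c in combinations(items, 3):
        # A triple is cyclic when a > b > c > a, or when the reverse loop holds.
        forward = (a, b) in prefers and (b, c) in prefers and (c, a) in prefers
        backward = (b, a) in prefers and (c, b) in prefers and (a, c) in prefers
        if forward or backward:
            cycles += 1
    return cycles

# Hypothetical pairwise verdicts from an LLM judge over four summaries.
docs = ["s1", "s2", "s3", "s4"]
verdicts = {
    ("s1", "s2"), ("s2", "s3"), ("s3", "s1"),  # s1 > s2 > s3 > s1 is a cycle
    ("s1", "s4"), ("s2", "s4"), ("s3", "s4"),
}

violations = count_cyclic_triples(docs, verdicts)
print(violations)
```

A judge with a consistent underlying ranking would produce zero cyclic triples, so a nonzero count flags inputs where individual verdicts should not be trusted.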

Apr 6 – Apr 12 (17)

VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

Apr 10, 2026

Wenyi Xiao, Xinchi Xu, Leilei Gan

Vision-language models need separate confidence scores for perception and reasoning, not a single overall confidence score, to better detect hallucinations and improve reliability in real-world applications.

This paper addresses a critical problem in vision-language models: they often give confident wrong answers, especially in high-stakes applications. The authors propose VL-Calibration, which separates confidence into two parts—visual confidence (did the model see the right thing?) and reasoning confidence (did it think correctly about what it saw?)—using reinforcement learning.

safety, multimodal, evaluation

Many Ways to Be Fake: Benchmarking Fake News Detection Under Strategy-Driven AI Generation

Apr 10, 2026

Xinyu Wang, Sai Koneru, Wenbo Zhang et al.

Fake news detectors are vulnerable to strategically crafted mixed-truth content where falsehoods are woven into accurate narratives, not just fully fabricated stories—a realistic threat that current benchmarks don't adequately test.

This paper introduces MANYFAKE, a benchmark of 6,798 synthetic fake news articles created through AI-driven strategies to test how well fake news detectors handle realistic threats. Unlike simple fabricated stories, the benchmark focuses on mixed-truth cases where false claims are embedded in otherwise credible narratives—a pattern that emerges from human-AI collaboration.

Mar 30 – Apr 5 (13)

BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

Apr 3, 2026

Sean Wu, Fredrik K. Gustafsson, Edward Phillips et al.

LLMs often express high confidence in wrong answers, and standard evaluation metrics miss this problem—BAS provides a decision-focused alternative that rewards models for knowing when to say 'I don't know' instead of guessing confidently.

This paper introduces BAS (Behavioral Alignment Score), a new metric for measuring whether LLMs' confidence levels are actually useful for deciding when to abstain from answering. Unlike standard metrics that treat all errors equally, BAS penalizes overconfident wrong answers more heavily, reflecting real-world decision-making where false confidence is costlier than admitting uncertainty.

evaluation, safety, alignment

Coupled Control, Structured Memory, and Verifiable Action in Agentic AI (SCRAT -- Stochastic Control with Retrieval and Auditable Trajectories): A Comparative Perspective from Squirrel Locomotion and Scatter-Hoarding

Apr 3, 2026

Maximiliano Armesto, Christophe Kolb

Agentic AI systems need tightly integrated control, memory, and verification mechanisms working together; separating these concerns (as robotics, retrieval, and alignment research typically do) misses critical robustness gains that come from their coupling.

Mar 23 – Mar 29 (10)

Back to Basics: Revisiting ASR in the Age of Voice Agents

Mar 26, 2026

Geeyang Tay, Wentao Ma, Jaewon Lee et al.

Speech recognition systems hallucinate false content under degraded audio, creating safety risks for voice agents. You need diagnostic testing across real-world conditions, not just benchmark scores, to know when and where your ASR will fail.

This paper reveals that speech recognition systems fail in real-world voice agents despite high benchmark scores. The authors created WildASR, a multilingual test set from real human speech that measures robustness across environmental noise, speaker differences, and languages.

evaluation, safety, multimodal

A Unified Memory Perspective for Probabilistic Trustworthy AI

Mar 26, 2026

Xueji Zhao, Likai Pei, Jianbo Liu et al.

Memory access, not computation speed, limits performance in probabilistic AI systems—hardware designers need to optimize for both data delivery and randomness generation together, not separately.

This paper examines how memory systems become the performance bottleneck in AI systems that need probabilistic computation for safety and robustness. It proposes treating deterministic data access as a special case of stochastic sampling, creating a unified framework to analyze memory efficiency.

Mar 16 – Mar 22 (3)

From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

Mar 20, 2026

Xinyi Shang, Yi Tang, Jiacheng Cui et al.

Mask-based evaluation of image tampering is fundamentally flawed; pixel-level metrics with semantic understanding of edit types provide a much more accurate way to assess whether AI systems can detect real image manipulations.

This paper fixes how we evaluate image tampering detection by moving from coarse object masks to pixel-level precision. It introduces a taxonomy of edit types (replace, remove, splice, etc.), a new benchmark with precise tamper maps, and metrics that measure both where edits occur and what they mean semantically—revealing that existing detectors often miss subtle edits or flag untouched pixels.

evaluation, multimodal, safety

Improving Generalization on Cybersecurity Tasks with Multi-Modal Contrastive Learning

Mar 20, 2026

Jianan Huang, Rodolfo V. Valentim, Luca Vassio et al.

By aligning payload embeddings with text-based vulnerability descriptions using contrastive learning, you can reduce shortcut learning and improve how well cybersecurity models generalize to unseen threats.

This paper tackles a major problem in cybersecurity AI: models trained in labs fail in the real world because they learn surface-level patterns instead of genuine security concepts.

Papers

Quality and Security Signals in AI-Generated Python Refactoring Pull Requests

What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models

Safety and accuracy follow different scaling laws in clinical large language models

Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

Physics-Grounded Multi-Agent Architecture for Traceable, Risk-Aware Human-AI Decision Support in Manufacturing

EQUITRIAGE: A Fairness Audit of Gender Bias in LLM-Based Emergency Department Triage

Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments

Feature-Augmented Transformers for Robust AI-Text Detection Across Domains and Generators

Exploration Hacking: Can LLMs Learn to Resist RL Training?

Defending Quantum Classifiers against Adversarial Perturbations through Quantum Autoencoders

Crab: A Semantics-Aware Checkpoint/Restore Runtime for Agent Sandboxes

Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection

DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures

Towards Neuro-symbolic Causal Rule Synthesis, Verification, and Evaluation Grounded in Legal and Safety Principles

Characterizing the Consistency of the Emergent Misalignment Persona

MoRFI: Monotonic Sparse Autoencoder Feature Identification

Resume-ing Control: (Mis)Perceptions of Agency Around GenAI Use in Recruiting Workflows

Three Models of RLHF Annotation: Extension, Evidence, and Authority

Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers

Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters

Green Shielding: A User-Centric Approach Towards Trustworthy AI

Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings

When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

Addressing Image Authenticity When Cameras Use Generative AI

Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models

Bounding the Black Box: A Statistical Certification Framework for AI Risk Regulation

TraceScope: Interactive URL Triage via Decoupled Checklist Adjudication

Compliance Moral Hazard and the Backfiring Mandate

Misinformation Span Detection in Videos via Audio Transcripts

AVISE: Framework for Evaluating the Security of AI Systems

Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem

Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs

Safe Continual Reinforcement Learning in Non-stationary Environments

Benign Overfitting in Adversarial Training for Vision Transformers

AD4AD: Benchmarking Visual Anomaly Detection Models for Safer Autonomous Driving

CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

Agentic Microphysics: A Manifesto for Generative AI Safety

Context Over Content: Exposing Evaluation Faking in Automated Judges

AI-Assisted Requirements Engineering: An Empirical Evaluation Relative to Expert Judgment

MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

RL-STPA: Adapting System-Theoretic Hazard Analysis for Safety-Critical Reinforcement Learning

Interpretable Stylistic Variation in Human and LLM Writing Across Genres, Models, and Decoding Strategies

One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness

LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software

Causal Diffusion Models for Counterfactual Outcome Distributions in Longitudinal Data

Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest

What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

Quantifying Explanation Consistency: The C-Score Metric for CAM-Based Explainability in Medical Image Classification

sciwrite-lint: Verification Infrastructure for the Age of Science Vibe-Writing