Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.
Qianshu Cai, Yonggang Zhang, Xianzhang Jia et al.
Self-evolving agents need source-code access, not just prompt editing—structural bugs in routing and state management can't be fixed by text-layer changes alone, and MOSS demonstrates this works in production with measurable improvements.
MOSS is a system that lets autonomous agents automatically fix themselves by rewriting their own source code based on real failures. Unlike existing approaches that only modify text files like prompts, MOSS can change the actual code structure—routing logic, state management, dispatch—making it possible to fix a much broader class of problems.
Sadia Asif, Mohammad Mohammadi Amiri, Momin Abbas et al.
When LLM agents communicate through shared KV caches for efficiency, you need explicit safeguards—LCGuard shows how to block sensitive information leakage at the representation level without breaking task coordination.
LCGuard is a safety framework that protects sensitive information when multiple AI agents share transformer key-value caches to coordinate tasks. It uses adversarial training to transform shared cache data so that agents can't reconstruct each other's private inputs, while keeping the information useful for task performance.
Rui Wen, Mark Russinovich, Andrew Paverd et al.
LLM backdoors don't need suspicious text triggers—attackers can hide them in positional encoding, making them invisible to content-based defenses and activatable through normal conversation length patterns.
This paper reveals a new way to attack large language models by exploiting how they encode token positions rather than by modifying the text itself. The researchers show that a backdoor can be triggered by input length alone, letting attackers make a model leak secrets or misbehave without leaving obvious traces in the conversation.
Pratinav Seth, Vinay Kumar Sankarapu
Behavioral evaluations alone cannot verify the safety claims regulators now demand. You need mechanistic evidence, such as activation analysis, to confirm what is happening inside a model, not just what it outputs.
This paper argues that current AI safety evaluation methods (like red-teaming and behavioral testing) cannot verify the deep safety properties that AI governance frameworks now require, such as the absence of hidden objectives or resistance to loss of control.
Shuhang Lin, Chuhao Zhou, Xiao Lin et al.
Conformal Path Reasoning provides statistical guarantees that your KGQA system will include the correct answer in its output set, while keeping that set compact and practical—solving a real reliability problem in knowledge graph reasoning.
This paper improves Knowledge Graph Question Answering by adding statistical guarantees to answer reliability. It uses conformal prediction—a technique that creates sets of answers with proven coverage rates—combined with a neural network that learns to score reasoning paths better. The result is more trustworthy answers with smaller, more useful prediction sets.
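For readers new to conformal prediction, here is a minimal sketch of the generic split-conformal recipe the summary refers to. It is not the paper's method: the learned path scorer is replaced by hard-coded nonconformity scores, and the function names, scores, and candidate answers are all made up for illustration.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Return the nonconformity threshold for target coverage 1 - alpha.

    calibration_scores: nonconformity of the TRUE answer on held-out
    calibration questions (higher = the scorer found the truth less plausible).
    """
    n = len(calibration_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every candidate.
        return math.inf
    return sorted(calibration_scores)[rank - 1]

def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity is within the threshold."""
    return {a for a, score in candidate_scores.items() if score <= threshold}

# Hypothetical calibration scores, e.g. 1 - (score given to the true answer).
calibration = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40, 0.55, 0.70]
threshold = conformal_threshold(calibration, alpha=0.2)

# Hypothetical candidate answers for one new question.
candidates = {"Paris": 0.08, "Lyon": 0.45, "Marseille": 0.90, "Nice": 0.60}
answers = prediction_set(candidates, threshold)
print(threshold, sorted(answers))
```

The coverage guarantee comes from the calibration step alone, assuming calibration and test questions are exchangeable. A better scorer does not change the guarantee; it shrinks the sets, which is the part the paper targets.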
Sushant Gautam, Finn Schwall, Annika Willoch Olstad et al.
When deploying LLMs in new languages or sectors without existing safety benchmarks, you can't collapse safety comparisons into a single score—you must report the full context: which scenarios, which judge, which risk measure, and the uncertainty around each comparison.
This paper tackles a real-world problem: comparing AI models for safety when no labeled benchmark exists yet. Instead of relying on ground-truth labels, the authors validate safety scores through three checks—whether models respond to safety changes, whether model differences dominate over measurement noise, and whether results stay consistent across retests.
Alfredo Madrid-García, Miguel Rujas
Medical RAG chatbots often expose sensitive backend details and patient data through client-side communication—use server-side security controls and independent audits before deploying patient-facing AI systems.
Researchers audited a patient-facing medical chatbot and found critical security flaws: sensitive system prompts, API endpoints, and 1,000 patient conversations were exposed through basic browser inspection. The study shows how RAG chatbots can leak backend configuration and private health data without authentication, highlighting governance gaps in AI healthcare deployment.
Yinhao Xiao, Rongbo Xiao, Yihan Zhang
LLM-generated GIS code can look correct but violate geographic rules; GeoContra's contract-based verification catches these semantic errors before they produce wrong spatial analysis.
GeoContra is a verification and repair system that catches geographic errors in AI-generated GIS code. It checks that spatial analysis preserves coordinate systems, topology, units, and geographic plausibility—catching bugs like negative travel times or mismatched coordinate systems that would otherwise produce executable but wrong results.
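To make the idea of a geographic "contract" concrete, the sketch below checks three of the properties the summary mentions: matching coordinate systems, plausible coordinates, and non-negative travel times. It is an illustration only. The `Layer` class, the check functions, and the sample data are hypothetical and do not reflect GeoContra's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    """Hypothetical stand-in for a spatial layer (not GeoContra's data model)."""
    name: str
    crs: str      # coordinate reference system, e.g. "EPSG:4326"
    points: list  # (x, y) tuples expressed in the layer's CRS

def check_same_crs(a, b):
    """Contract: two layers must share a CRS before they are combined."""
    if a.crs != b.crs:
        return [f"CRS mismatch: {a.name} is {a.crs}, {b.name} is {b.crs}"]
    return []

def check_lonlat_bounds(layer):
    """Contract: EPSG:4326 coordinates must be valid longitude/latitude."""
    if layer.crs != "EPSG:4326":
        return []  # this check only applies to lon/lat layers
    return [
        f"{layer.name}: point {p} outside lon/lat range"
        for p in layer.points
        if not (-180 <= p[0] <= 180 and -90 <= p[1] <= 90)
    ]

def check_travel_times(minutes):
    """Contract: travel times are durations and cannot be negative."""
    return [f"negative travel time: {m}" for m in minutes if m < 0]

# Made-up inputs: one lon/lat layer and one Web Mercator layer.
stores = Layer("stores", "EPSG:4326", [(2.35, 48.85), (4.83, 45.76)])
roads = Layer("roads", "EPSG:3857", [(261600.0, 6250600.0)])

violations = (
    check_same_crs(stores, roads)
    + check_lonlat_bounds(stores)
    + check_travel_times([12.5, -3.0, 40.0])
)
for v in violations:
    print(v)
```

Each check returns a list of messages rather than raising, so one run can report every violation and a repair step could act on all of them at once.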
Ilana Nguyen, Harini Suresh, Thema Monroe-White et al.
LLMs systematically misrepresent Global Majority nationalities through stereotyping and one-dimensional portrayals, creating real risks for applications like asylum interviews. These harms are structural, not just surface-level, and require deliberate mitigation strategies.
This paper reveals how popular LLMs perpetuate harmful stereotypes and biases against people from Global Majority countries in generated narratives. Researchers found that non-Western nationalities are underrepresented in neutral stories but overrepresented in negative character roles—over 50 times more likely to appear in subordinated positions.
Gauri Sharma, Maryam Molamohammadi
Bias in AI hiring isn't just a technical problem—it's a supply chain problem. Even if each vendor's component works fairly in isolation, their combination can discriminate, yet no single party has visibility into the whole system or clear accountability for fixing it.
Eric Gan, Aryan Bhatt, Buck Shlegeris et al.
Current AI systems and auditors are poor at detecting subtle sabotage in research code—even frontier LLMs only catch 77% of cases—highlighting a critical gap in oversight for autonomous AI research.
This paper introduces ASMR-Bench, a benchmark for testing whether AI systems and human auditors can detect sabotage hidden in ML research code. The benchmark includes 9 real ML projects with intentionally introduced bugs that change experimental results while keeping the paper's description accurate.
Manan Gupta, Dhruv Kumar
LLM judges appear reliable in aggregate but are actually inconsistent on individual inputs; prediction set width reliably indicates per-document difficulty and can serve as a confidence measure for automatic evaluation.
This paper diagnoses why LLM judges give inconsistent scores for text evaluation. Using two methods—checking if judges contradict themselves and using conformal prediction to quantify uncertainty—the authors show that judges are unreliable on individual documents even when they seem consistent overall.
Wenyi Xiao, Xinchi Xu, Leilei Gan
Vision-language models need separate confidence scores for perception and reasoning, not a single overall confidence score, to better detect hallucinations and improve reliability in real-world applications.
This paper addresses a critical problem in vision-language models: they often give confident wrong answers, especially in high-stakes applications. The authors propose VL-Calibration, which separates confidence into two parts—visual confidence (did the model see the right thing?) and reasoning confidence (did it think correctly about what it saw?)—using reinforcement learning.
Xinyu Wang, Sai Koneru, Wenbo Zhang et al.
Fake news detectors are vulnerable to strategically crafted mixed-truth content where falsehoods are woven into accurate narratives, not just fully fabricated stories—a realistic threat that current benchmarks don't adequately test.
This paper introduces MANYFAKE, a benchmark of 6,798 synthetic fake news articles created through AI-driven strategies to test how well fake news detectors handle realistic threats. Unlike simple fabricated stories, the benchmark focuses on mixed-truth cases where false claims are embedded in otherwise credible narratives—a pattern that emerges from human-AI collaboration.
Sean Wu, Fredrik K. Gustafsson, Edward Phillips et al.
LLMs often express high confidence in wrong answers, and standard evaluation metrics miss this problem—BAS provides a decision-focused alternative that rewards models for knowing when to say 'I don't know' instead of guessing confidently.
This paper introduces BAS (Behavioral Alignment Score), a new metric for measuring whether LLMs' confidence levels are actually useful for deciding when to abstain from answering. Unlike standard metrics that treat all errors equally, BAS penalizes overconfident wrong answers more heavily, reflecting real-world decision-making where false confidence is costlier than admitting uncertainty.
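The summary does not give the BAS formula, so the sketch below is not BAS. It is a simple abstention-aware utility, under assumed payoffs, that shows why a decision-focused score can separate "always answer" from "answer only when confident". The function name, the penalty of 2.0, and the records are all hypothetical.

```python
def abstention_utility(records, threshold, wrong_penalty=2.0):
    """Average utility when the model answers only if confidence >= threshold.

    records: list of (confidence, is_correct) pairs.
    Assumed payoffs: correct answer +1, wrong answer -wrong_penalty,
    abstention 0.
    """
    total = 0.0
    for confidence, is_correct in records:
        if confidence < threshold:
            continue  # abstain: contributes 0 to the total
        total += 1.0 if is_correct else -wrong_penalty
    return total / len(records)

# Made-up (confidence, is_correct) pairs, including one overconfident error.
records = [(0.95, True), (0.90, False), (0.60, True), (0.55, False), (0.30, False)]

always_answer = abstention_utility(records, threshold=0.0)
selective = abstention_utility(records, threshold=0.5)
print(always_answer, selective)
```

Abstaining on the low-confidence item improves the score, but the wrong answer given at 0.90 confidence still costs the full penalty at any realistic threshold. That overconfident error is the kind the summary says standard metrics miss.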
Maximiliano Armesto, Christophe Kolb
Agentic AI systems need tightly integrated control, memory, and verification mechanisms working together; separating these concerns (as robotics, retrieval, and alignment research typically do) misses critical robustness gains that come from their coupling.
Geeyang Tay, Wentao Ma, Jaewon Lee et al.
Speech recognition systems hallucinate false content under degraded audio, creating safety risks for voice agents. You need diagnostic testing across real-world conditions, not just benchmark scores, to know when and where your ASR will fail.
This paper reveals that speech recognition systems fail in real-world voice agents despite high benchmark scores. The authors created WildASR, a multilingual test set from real human speech that measures robustness across environmental noise, speaker differences, and languages.
Xueji Zhao, Likai Pei, Jianbo Liu et al.
Memory access, not computation speed, limits performance in probabilistic AI systems—hardware designers need to optimize for both data delivery and randomness generation together, not separately.
This paper examines how memory systems become the performance bottleneck in AI systems that need probabilistic computation for safety and robustness. It proposes treating deterministic data access as a special case of stochastic sampling, creating a unified framework to analyze memory efficiency.
Xinyi Shang, Yi Tang, Jiacheng Cui et al.
Mask-based evaluation of image tampering is fundamentally flawed; pixel-level metrics with semantic understanding of edit types provide a much more accurate way to assess whether AI systems can detect real image manipulations.
This paper fixes how we evaluate image tampering detection by moving from coarse object masks to pixel-level precision. It introduces a taxonomy of edit types (replace, remove, splice, etc.), a new benchmark with precise tamper maps, and metrics that measure both where edits occur and what they mean semantically—revealing that existing detectors often miss subtle edits or flag untouched pixels.
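A small worked example shows why a coarse mask can mislead. The sketch scores one hypothetical detector output against a precise tamper map and against a coarse object mask, using standard pixel precision, recall, and IoU. These are generic metrics, not the paper's own, and the tiny 3x4 maps are made up.

```python
def pixel_metrics(predicted, truth):
    """Compare two same-shaped binary maps (1 = tampered pixel)."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(predicted, truth):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1      # flagged and truly edited
            elif p:
                fp += 1      # flagged but untouched
            elif t:
                fn += 1      # edited but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Two empty maps agree perfectly, so IoU is defined as 1.0 in that case.
    iou = tp / (tp + fp + fn) if tp + fp + fn else 1.0
    return {"precision": precision, "recall": recall, "iou": iou}

# Precise tamper map: only two pixels were actually edited.
tamper_map = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
# Coarse object mask: the whole object region is marked as "tampered".
object_mask = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
# Hypothetical detector output that finds exactly the edited pixels.
detector = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

against_pixels = pixel_metrics(detector, tamper_map)
against_mask = pixel_metrics(detector, object_mask)
print(against_pixels, against_mask)
```

The detector is perfect against the precise map but gets a recall of only 0.25 against the object mask, because the mask counts six untouched pixels as edits.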
Jianan Huang, Rodolfo V. Valentim, Luca Vassio et al.
By aligning payload embeddings with text-based vulnerability descriptions using contrastive learning, you can reduce shortcut learning and improve how well cybersecurity models generalize to unseen threats.
This paper tackles a major problem in cybersecurity AI: models trained in labs fail in the real world because they learn surface-level patterns instead of genuine security concepts.
(Summary for the Sharma and Molamohammadi paper on AI hiring.) AI hiring systems are built from components supplied by different vendors, such as data providers, model makers, and platform companies, which creates fragmented responsibility chains.
(Summary for the Armesto and Kolb paper on agentic AI.) This paper proposes SCRAT, a framework for agentic AI that couples control, memory, and verification by drawing parallels with squirrel behavior.