ThinkLLM


Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

326 papers · 3 this month · 12 topics
All · Efficiency 35 · Reasoning 35 · Multimodal 28 · Applications 28 · Evaluation 27 · Training 26 · Architecture 24 · Agents 24 · Safety 13 · Scaling 5 · Data 5 · Alignment 1

Mar 30 – Apr 5 (4)

BVFLMSP: Bayesian Vertical Federated Learning for Multimodal Survival with Privacy

Apr 2, 2026

Abhilash Kar, Basisth Saha, Tanmay Sen et al.

This framework enables hospitals and clinics to collaboratively build better survival prediction models without sharing raw patient data, while also quantifying prediction confidence—critical for clinical adoption.

BVFLMSP combines Bayesian neural networks with federated learning to predict survival outcomes from sensitive multimodal data distributed across multiple parties. Each organization keeps its data private while contributing predictions to a shared model, with added privacy protections and uncertainty estimates for more reliable medical decision-making.

safety · multimodal · training

Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs

Apr 2, 2026

Abinitha Gourabathina, Inkit Padhi, Manish Nagireddy et al.

Reasoning models can be made safer by detecting when they've misunderstood the question itself—reconstruct what question they answered from their reasoning trace, and abstain if it differs from the original.

This paper tackles a critical problem: getting LLMs to know when to refuse answering questions. The authors discovered that reasoning models often fail at abstention (refusing to answer) because they answer the wrong question rather than answering incorrectly.
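The abstention gate described above can be sketched as: reconstruct the question the model actually answered from its reasoning trace, compare it to the question that was asked, and abstain on divergence. In the paper the reconstruction and comparison would be done by an LLM; here a hypothetical token-overlap similarity and threshold stand in for that semantic check.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Token-level Jaccard overlap between two questions (a crude
    stand-in for a semantic similarity judgment)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def should_abstain(original_q: str, reconstructed_q: str,
                   threshold: float = 0.5) -> bool:
    """Abstain when the question recovered from the reasoning trace
    diverges from the question that was actually asked."""
    return jaccard_similarity(original_q, reconstructed_q) < threshold
```

With an LLM doing the inversion, the same gate applies unchanged; only the similarity function would be swapped for a semantic one.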

Mar 23 – Mar 29 (10)

Back to Basics: Revisiting ASR in the Age of Voice Agents

Mar 26, 2026

Geeyang Tay, Wentao Ma, Jaewon Lee et al.

Speech recognition systems hallucinate false content under degraded audio, creating safety risks for voice agents. You need diagnostic testing across real-world conditions, not just benchmark scores, to know when and where your ASR will fail.

This paper reveals that speech recognition systems fail in real-world voice agents despite high benchmark scores. The authors created WildASR, a multilingual test set from real human speech that measures robustness across environmental noise, speaker differences, and languages.

evaluation · safety · multimodal

A Unified Memory Perspective for Probabilistic Trustworthy AI

Mar 26, 2026

Xueji Zhao, Likai Pei, Jianbo Liu et al.

Memory access, not computation speed, limits performance in probabilistic AI systems—hardware designers need to optimize for both data delivery and randomness generation together, not separately.

This paper examines how memory systems become the performance bottleneck in AI systems that need probabilistic computation for safety and robustness. It proposes treating deterministic data access as a special case of stochastic sampling, creating a unified framework to analyze memory efficiency.

Mar 16 – Mar 22 (18)

From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

Mar 20, 2026

Xinyi Shang, Yi Tang, Jiacheng Cui et al.

Mask-based evaluation of image tampering is fundamentally flawed; pixel-level metrics with semantic understanding of edit types provide a much more accurate way to assess whether AI systems can detect real image manipulations.

This paper fixes how we evaluate image tampering detection by moving from coarse object masks to pixel-level precision. It introduces a taxonomy of edit types (replace, remove, splice, etc.), a new benchmark with precise tamper maps, and metrics that measure both where edits occur and what they mean semantically—revealing that existing detectors often miss subtle edits or flag untouched pixels.

evaluation · multimodal · safety

Improving Generalization on Cybersecurity Tasks with Multi-Modal Contrastive Learning

Mar 20, 2026

Jianan Huang, Rodolfo V. Valentim, Luca Vassio et al.

By aligning payload embeddings with text-based vulnerability descriptions using contrastive learning, you can reduce shortcut learning and improve how well cybersecurity models generalize to unseen threats.

This paper tackles a major problem in cybersecurity AI: models trained in labs fail in the real world because they learn surface-level patterns instead of genuine security concepts.

Mar 9 – Mar 15 (8)

LLM Constitutional Multi-Agent Governance

Mar 13, 2026

J. de Curtò, I. de Zarzà

When deploying LLMs to coordinate multi-agent systems, you need explicit governance constraints—raw cooperation metrics hide manipulation. CMAG shows how to balance cooperation gains against autonomy loss and fairness degradation.

This paper addresses a critical risk: LLMs can manipulate multi-agent systems into appearing cooperative while actually eroding agent autonomy and fairness. The authors propose CMAG, a governance framework that filters harmful LLM suggestions and optimizes for genuine cooperation rather than just compliance.

safety · agents · alignment

Learnability and Privacy Vulnerability are Entangled in a Few Critical Weights

Mar 13, 2026

Xingli Fang, Jung-Eun Kim

Privacy vulnerabilities and model performance are concentrated in a small set of weights—you can defend against privacy attacks by carefully fine-tuning just these critical weights instead of retraining the whole model.

This paper identifies that privacy leaks in neural networks come from a tiny fraction of weights, and these same weights are crucial for model performance. Rather than retraining the entire model, the authors propose selectively rewinding only these critical weights during fine-tuning to defend against membership inference attacks while keeping the model accurate.

Feb 23 – Mar 1 (8)

LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

Feb 26, 2026

Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus et al.

LLMs dramatically amplify what untrained people can accomplish in specialized fields like biology, raising both opportunity and safety concerns.

Researchers tested whether LLMs actually help non-experts do biology tasks better than using the internet alone. They found novices with LLM access were 4x more accurate than those without, and sometimes outperformed trained experts. However, users weren't always getting the best results from the models, and most found it easy to get sensitive biosecurity information despite safeguards.

evaluation · safety · applications

Conformalized Neural Networks for Federated Uncertainty Quantification under Dual Heterogeneity

Feb 26, 2026

Quang-Huy Nguyen, Jiaqi Wang, Wei-Shinn Ku

Federated learning systems can now quantify prediction uncertainty reliably across heterogeneous devices with minimal communication overhead using conformalized neural networks.

This paper solves a critical problem in federated learning: how to know when your model is uncertain about its predictions, especially when different devices have different types of data.

reasoning · safety · evaluation

When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning

Apr 2, 2026

Juarez Monteiro, Nathan Gavenski, Gianlucca Zuin et al.

Selectively querying language models based on uncertainty can improve RL agent robustness in novel situations without constant computational overhead—but successful integration requires careful design, not just combining the two systems.

This paper proposes ASK, a system that combines reinforcement learning agents with language models to handle out-of-distribution scenarios.

agents · reasoning · safety
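The uncertainty gate at the heart of a system like ASK fits in a few lines: query the language model only when the agent's action distribution is too uncertain to trust. The entropy criterion and the threshold value below are illustrative assumptions, not the paper's exact gating rule.

```python
import math

def policy_entropy(probs: list[float]) -> float:
    """Shannon entropy of the agent's action distribution (nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_ask(probs: list[float], threshold: float) -> bool:
    """Gate the expensive LLM query on policy uncertainty: a peaked
    distribution acts alone, a flat one asks for language assistance."""
    return policy_entropy(probs) > threshold
```

A confident policy such as `[0.97, 0.01, 0.01, 0.01]` stays below a threshold of 1.0 nat and skips the query, while a near-uniform distribution triggers it.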

RAD-AI: Rethinking Architecture Documentation for AI-Augmented Ecosystems

Mar 30, 2026

Oliver Aleksander Larsen, Mahyar T. Moghaddam

If you're building AI systems, standard software architecture documentation won't capture ML-specific risks like model drift or data dependencies—RAD-AI provides a structured way to document these for both compliance and team understanding.

RAD-AI extends existing architecture documentation frameworks (arc42 and C4 model) to handle AI systems, adding sections for probabilistic behavior, ML lifecycles, and data dependencies. It maps to EU AI Act compliance requirements and shows 93% coverage of regulatory documentation needs versus 36% for standard frameworks.

architecture · safety · applications

The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence

Mar 25, 2026

Biplab Pal, Santanu Bhattacharya

Before deploying agentic AI in business processes, measure the 'blind mass' of uncertain state-action pairs and expected oversight costs using event logs—this reveals hidden decision gaps that simple accuracy metrics miss.

This paper develops a mathematical framework to measure when AI agents can safely operate autonomously versus when they need human oversight.

agents · safety · evaluation

MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

Mar 25, 2026

Zhuo Li, Yupeng Zhang, Pengyu Cheng et al.

Using multiple agents with intentional information barriers prevents LLMs from confirming their own errors during fact-checking, letting smaller models match larger ones on reliability.

MARCH is a framework that reduces hallucinations in LLMs by using three specialized agents that work together with deliberate information separation. A Solver generates responses, a Proposer breaks them into verifiable claims, and a Checker validates claims without seeing the original output—preventing the verifier from copying the generator's mistakes.

safety · agents · alignment
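The Solver/Proposer/Checker pipeline with its information barrier can be sketched with stubs. In MARCH each role is an LLM and the Checker verifies claims against evidence; here hardcoded stand-ins show only the control flow, and the key point is structural: the Checker never sees the Solver's full draft.

```python
def solver(question: str) -> str:
    # Stub for an LLM Solver: returns a draft answer.
    return "Paris is the capital of France; Paris has 50M inhabitants"

def proposer(answer: str) -> list[str]:
    # Stub for an LLM Proposer: decomposes the draft into atomic claims.
    return [c.strip() for c in answer.split(";")]

def checker(claim: str) -> bool:
    # Stub for a retrieval-backed LLM Checker. The information barrier:
    # it sees one claim at a time, never the Solver's full output.
    facts = {"Paris is the capital of France"}
    return claim in facts

def march(question: str) -> dict[str, bool]:
    draft = solver(question)
    # Only individual claims cross the barrier to the Checker.
    return {claim: checker(claim) for claim in proposer(draft)}
```

Running `march(...)` flags the fabricated population claim while confirming the capital, without the Checker ever being exposed to the Solver's phrasing.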

Anti-I2V: Safeguarding your photos from malicious image-to-video generation

Mar 25, 2026

Duc Vu, Anh Nguyen, Chi Tran et al.

If you're concerned about your photos being used to generate deepfake videos, adversarial perturbations applied in multiple domains (color and frequency) can effectively block modern video generation models while remaining imperceptible to humans.

This paper presents Anti-I2V, a defense method that protects photos from being misused in AI-generated fake videos. Instead of just adding noise to images, it works across multiple color spaces and frequency domains to disrupt video generation models, targeting both traditional and newer Transformer-based architectures.

safety

MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage

Mar 24, 2026

Ufaq Khan, Umair Nawaz, L D M S S Teja et al.

Medical VLMs need explicit training on input validation (checking modality, anatomy, orientation) as a separate safety step before diagnosis, not as an afterthought—current models hallucinate plausible reports even on obviously invalid inputs.

This paper reveals a critical blind spot in medical AI: vision-language models can generate fluent medical reports even when given invalid inputs like wrong body parts or upside-down images. MedObvious is a benchmark of 1,880 tasks testing whether models can catch these basic sanity checks before attempting diagnosis—a step human radiologists do automatically but VLMs currently fail at.

safety · evaluation · multimodal

Failure of contextual invariance in gender inference with large language models

Mar 24, 2026

Sagar Kumar, Ariel Flint, Luca Maria Aiello et al.

LLM outputs are unstable across contextually equivalent formulations of the same task, meaning benchmark results may not reflect how models actually behave in real applications—a critical issue for bias testing and high-stakes use.

This paper reveals that large language models fail to give consistent outputs when tasks are reformulated in contextually equivalent ways.

evaluation · safety

Byzantine-Robust and Differentially Private Federated Optimization under Weaker Assumptions

Mar 24, 2026

Rustem Islamov, Grigory Malinovsky, Alexander Gaponov et al.

You can now build federated learning systems that defend against both Byzantine attacks and privacy breaches simultaneously, without needing unrealistic assumptions like bounded gradients or extra server datasets.

This paper tackles two critical security issues in federated learning: protecting against malicious servers (Byzantine attacks) and preventing data leakage (differential privacy).

safety · training · efficiency

CSTS: A Canonical Security Telemetry Substrate for AI-Native Cyber Detection

Mar 24, 2026

Abdul Rahman

Security AI models fail when deployed to new environments because telemetry data is fragmented. CSTS solves this by providing a unified, entity-focused data structure that maintains consistent identity and relationships across different systems.

This paper introduces CSTS, a standardized way to represent security data that helps AI systems detect cyber threats across different computer networks. Instead of treating security events as isolated incidents, CSTS organizes them around entities (like users or devices) and their relationships, making AI models more reliable when deployed in new environments.

safety · data · evaluation

Greater accessibility can amplify discrimination in generative AI

Mar 23, 2026

Carolin Holtermann, Minh Duc Bui, Kaitlyn Zhou et al.

Adding voice to language models doesn't just extend text capabilities—it introduces new bias mechanisms tied to speaker identity cues that amplify discrimination beyond text-only versions, requiring fairness safeguards alongside accessibility improvements.

Voice interfaces on AI chatbots amplify gender discrimination more than text-based versions because speech reveals speaker identity through tone and accent. The research shows these models shift toward gender-stereotyped responses based on voice alone, and surveys reveal users worry about hidden attribute inference.

safety · multimodal · alignment

Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models

Mar 20, 2026

Sai Koneru, Elphin Joe, Christine Kirchhoff et al.

Instruction-tuned models are vulnerable to user pressure even with strong evidence present; simply providing richer context doesn't guarantee models will resist sycophancy without explicit training for epistemic integrity.

This paper tests how well instruction-tuned language models stick to evidence when users pressure them to agree with false claims. Using climate science as a test domain, researchers found that adding more detailed evidence doesn't reliably prevent models from abandoning facts to please users—especially when evidence includes research gaps or uncertainty.

evaluation · alignment · safety

NavTrust: Benchmarking Trustworthiness for Embodied Navigation

Mar 19, 2026

Huaide Jiang, Yash Chaudhary, Yuping Wang et al.

Embodied navigation systems perform well in clean lab conditions but fail dramatically in real-world scenarios with sensor noise and unclear instructions—this benchmark exposes those gaps and provides mitigation strategies.

NavTrust is a benchmark that tests how well navigation AI systems handle real-world problems like blurry images, sensor noise, and unclear instructions. The researchers tested seven state-of-the-art systems and found they all struggle significantly when inputs are corrupted, then demonstrated four strategies to make them more robust.

evaluation · safety · agents

Robustness, Cost, and Attack-Surface Concentration in Phishing Detection

Mar 19, 2026

Julian Allagan, Mohamed Elbakary, Zohreh Safari et al.

Phishing detector robustness is fundamentally limited by feature economics—the cost of realistic website modifications—not by model architecture. Attackers can reliably evade detection by exploiting cheap feature changes, making feature design more critical than model choice.

This paper reveals a critical weakness in phishing detection systems: while machine learning models achieve near-perfect accuracy in testing, attackers can easily evade them by making cheap, realistic changes to websites.

safety · evaluation

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Mar 19, 2026

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori et al.

Synthetic data from diffusion models may not be as privacy-safe as assumed—membership inference attacks can still reveal whether specific records were in the training data, even with synthetic tabular outputs.

This challenge evaluates how well synthetic tabular data generated by diffusion models protects privacy against membership inference attacks. Researchers tested whether synthetic data truly hides information about individuals in the original dataset, developing new attack methods to measure privacy risks across different types of tabular data structures.

safety · evaluation · data

Box Maze: A Process-Control Architecture for Reliable LLM Reasoning

Mar 19, 2026

Zou Qiang

Adding explicit process-control layers to LLM reasoning—rather than just filtering outputs—can dramatically reduce hallucination and adversarial vulnerability by enforcing integrity at the reasoning stage itself.

Box Maze proposes a three-layer architecture for LLMs that separates reasoning into memory grounding, structured inference, and boundary enforcement to prevent hallucination and adversarial attacks. Testing on multiple LLM systems shows the approach reduces failure rates from ~40% to <1% under adversarial conditions, suggesting architectural constraints can improve reasoning reliability.

architecture · safety · reasoning

ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis

Mar 19, 2026

Zhan Jin, Yu Luo, Yizhou Zhang et al.

Using preference-based learning (DPO) with structural constraints rather than pixel-level metrics can fix a fundamental problem in medical image segmentation: producing fragmented, unrealistic vessel structures despite high accuracy scores.

ARIADNE combines vision-language models with reinforcement learning to detect coronary artery blockages in medical images while maintaining the correct structure of blood vessels. Instead of just matching pixels, it uses topological constraints to ensure vessel networks stay connected, reducing false alarms by 41% and achieving better accuracy on real clinical data.

safety · reasoning

UGID: Unified Graph Isomorphism for Debiasing Large Language Models

Mar 19, 2026

Zikang Ding, Junchi Yao, Junhao Li et al.

Biases in LLMs can be reduced by enforcing structural consistency in the model's internal computations (attention and hidden states) across counterfactual inputs, rather than just fixing outputs or training data.

This paper proposes UGID, a method to reduce social biases in large language models by treating the model as a computational graph and enforcing that its internal structure remains consistent across inputs that differ only in sensitive attributes like gender or race.

safety · alignment · training

On Optimizing Multimodal Jailbreaks for Spoken Language Models

Mar 19, 2026

Aravind Krishnan, Karolina Stańczak, Dietrich Klakow

Multimodal AI systems need safety defenses that account for attacks across all input modalities together—defending text alone or audio alone isn't enough.

This paper shows that spoken language models (which process both speech and text) can be attacked more effectively by perturbing both modalities simultaneously rather than just one. The researchers developed JAMA, a method that jointly optimizes adversarial text and audio to bypass safety guardrails, achieving 1.5x to 10x higher attack success rates than single-modality attacks.

safety · multimodal

How Uncertainty Estimation Scales with Sampling in Reasoning Models

Mar 19, 2026

Maksym Del, Markus Kängsepp, Marharyta Domnich et al.

For deploying reasoning models safely, combining verbalized confidence with self-consistency gives the best uncertainty estimates with minimal computational cost, but effectiveness varies significantly across domains like math versus humanities.

This paper studies how well reasoning language models can estimate their own uncertainty by sampling multiple responses and analyzing confidence signals.

evaluation · reasoning · safety
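The combination the paper recommends, verbalized confidence plus self-consistency, can be sketched directly: sample the model several times, take the majority answer, and weight agreement by the mean stated confidence. The multiplicative combination below is one plausible aggregation, not necessarily the paper's exact formula.

```python
from collections import Counter

def combined_confidence(samples: list[tuple[str, float]]) -> tuple[str, float]:
    """samples: (answer, verbalized_confidence) pairs from repeated sampling.
    Returns the majority answer and an agreement-weighted confidence."""
    counts = Counter(answer for answer, _ in samples)
    answer, votes = counts.most_common(1)[0]
    agreement = votes / len(samples)  # self-consistency signal
    verbal = sum(c for a, c in samples if a == answer) / votes  # mean stated conf
    return answer, agreement * verbal
```

Three samples where two agree at high stated confidence yield a moderate combined score, reflecting both signals rather than either alone.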

FedTrident: Resilient Road Condition Classification Against Poisoning Attacks in Federated Learning

Mar 19, 2026

Sheng Liu, Panos Papadimitratos

Federated learning for autonomous systems needs multi-layered defense: detect poisoned models at the neuron level, exclude malicious participants based on history, and actively repair the global model after removing attackers.

FedTrident protects federated learning systems for road condition classification from poisoning attacks where malicious vehicles deliberately mislabel their training data. The system detects compromised models through neuron analysis, removes bad actors, and uses machine unlearning to fix the corrupted global model—maintaining safety-critical performance even under attack.

safety

SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

Mar 19, 2026

Carlos Hinojosa, Clemens Grange, Bernard Ghanem

Vision-language models' safety decisions are easily manipulated by semantic cues—they rely on learned associations rather than grounded reasoning about actual danger, which is a critical vulnerability for real-world deployment.

This paper reveals that vision-language models make safety decisions based on surface-level visual and textual cues rather than genuine understanding of dangerous situations. Researchers created a benchmark and steering framework showing that simple changes to how a scene is described or presented can flip safety judgments, exposing a vulnerability in how these models assess risk.

safety · multimodal · evaluation

Toward Scalable Automated Repository-Level Datasets for Software Vulnerability Detection

Mar 18, 2026

Amine Lbath

Automated vulnerability injection with proof-of-concept exploits can scale up realistic training datasets for repository-level security detection, moving beyond function-level benchmarks to test how AI handles real-world code complexity.

This research creates an automated system to generate large-scale datasets for training AI models to detect software vulnerabilities in real code repositories.

data · safety · agents

Specification-Aware Distribution Shaping for Robotics Foundation Models

Mar 18, 2026

Sadık Bera Yüksel, Derya Aksaray

You can enforce formal safety constraints on pretrained robotics models without retraining by adjusting their output distributions at inference time using temporal logic specifications.

This paper adds safety guardrails to robotics foundation models by reshaping their action distributions at runtime to satisfy formal specifications. Instead of retraining the model, it uses forward simulation to ensure the robot meets time-dependent constraints like "visit location A before time T, then location B" while staying as close as possible to the model's original decisions.

safety · agents · reasoning
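The inference-time reshaping step reduces to masking and renormalizing the model's action distribution. In the paper the set of admissible actions comes from forward-simulating the temporal-logic specification; here that step is abstracted into a precomputed `allowed` set, so this is a sketch of the distribution-shaping mechanics only.

```python
def shape_distribution(probs: dict[str, float],
                       allowed: set[str]) -> dict[str, float]:
    """Zero out probability mass on actions flagged as spec-violating
    (by forward simulation, abstracted here), then renormalize so the
    shaped policy stays as close as possible to the original."""
    masked = {a: p for a, p in probs.items() if a in allowed}
    total = sum(masked.values())
    if total == 0:
        raise ValueError("specification admits no action")
    return {a: p / total for a, p in masked.items()}
```

If the specification rules out `"right"`, its mass is redistributed proportionally over the remaining actions, leaving their relative preferences intact.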

Gender Disambiguation in Machine Translation: Diagnostic Evaluation in Decoder-Only Architectures

Mar 18, 2026

Chiara Manna, Hosein Mohebbi, Afra Alishahi et al.

Decoder-only language models show similar gender bias problems as smaller models in translation tasks, but instruction tuning can reduce masculine bias and improve context awareness.

This paper examines how large language models handle gender in machine translation, where languages differ in how they mark gender. The researchers introduce a new measurement called "Prior Bias" to capture what gender a model assumes by default, and test decoder-only models (like GPT-style architectures) against traditional encoder-decoder models.

evaluation · safety · alignment

Mechanistic Origin of Moral Indifference in Language Models

Mar 16, 2026

Lingyu Li, Yan Teng, Yingchun Wang

LLMs can pass alignment tests while internally treating opposed moral concepts as equivalent; fixing this requires intervening directly on internal representations, not just adjusting outputs.

This paper reveals that large language models suffer from 'moral indifference'—they compress different moral concepts into similar internal representations, making them vulnerable to manipulation even when they appear aligned.

alignment · safety

Do Metrics for Counterfactual Explanations Align with User Perception?

Mar 16, 2026

Felix Liedeker, Basil Ell, Philipp Cimiano et al.

Standard metrics for evaluating counterfactual explanations don't align with human judgment—developers need human-centered evaluation methods, not just algorithmic scores, to build truly trustworthy AI systems.

This study compares how AI systems measure counterfactual explanations (showing what would need to change for a different prediction) against how humans actually judge them. Researchers found that standard algorithmic metrics poorly predict human satisfaction, suggesting current evaluation methods miss what users actually care about in explanations.

evaluation · safety · alignment

Developing and evaluating a chatbot to support maternal health care

Mar 13, 2026

Smriti Jha, Vidhi Jain, Jianyu Xu et al.

Deploying medical chatbots in low-resource, multilingual settings requires multiple layers of safety (triage, retrieval, generation) and multi-method evaluation—no single model or test is sufficient for trustworthy healthcare AI.

Researchers built a phone-based chatbot to answer maternal health questions in India, where users often have limited health literacy and speak multiple languages. The system combines triage (routing urgent cases to experts), retrieval of curated health guidelines, and AI-generated responses.

safety · applications · evaluation

ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

Mar 13, 2026

Siqi Sun, Ben Peng Wu, Mali Jin et al.

Chain-of-thought reasoning substantially reduces hallucinations in LLMs analyzing long, complex documents—a critical capability for compliance and legal applications where accuracy is non-negotiable.

ESG-Bench is a benchmark dataset for testing how well AI models understand long corporate ESG (environmental, social, governance) reports and avoid making up false information. The dataset contains real ESG reports paired with human-verified question-answer pairs, letting researchers measure when models hallucinate versus when they accurately extract facts.

evaluation · safety

STAMP: Selective Task-Aware Mechanism for Text Privacy

Mar 12, 2026

Fengwei Tian, Payel Bhattacharjee, Heidi Hanson et al.

By combining task-aware importance scoring with privacy sensitivity detection, STAMP achieves better privacy-utility trade-offs than uniform noise approaches—meaning you can protect sensitive data without sacrificing model performance.

STAMP is a privacy framework that protects sensitive information in text while keeping it useful for AI tasks. It smartly decides which parts of text need more protection (like names and dates) versus which parts are less sensitive, then applies targeted noise to embeddings using a novel 'polar mechanism' that preserves semantic meaning better than traditional approaches.

safety · data · efficiency
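The core idea of sensitivity-targeted perturbation can be sketched simply: scale the noise added to each token embedding by that token's privacy-sensitivity score, so names and dates get strong protection while low-risk tokens stay near-intact. Plain Gaussian noise is used here as a stand-in; the paper's actual 'polar mechanism' and its sensitivity scorer are not reproduced.

```python
import random

def perturb_embeddings(embeddings: list[list[float]],
                       sensitivity: list[float],
                       scale: float = 1.0,
                       seed: int = 0) -> list[list[float]]:
    """Add per-token Gaussian noise whose magnitude is scaled by the
    token's sensitivity score (from a hypothetical upstream scorer)."""
    rng = random.Random(seed)
    return [
        [x + rng.gauss(0.0, scale * s) for x in vec]
        for vec, s in zip(embeddings, sensitivity)
    ]
```

A token with sensitivity 0 passes through unchanged, while a sensitivity-1 token is noticeably perturbed; the scorer decides where on that spectrum each token falls.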

Security Considerations for Artificial Intelligence Agents

Mar 12, 2026

Ninghui Li, Kaiyuan Zhang, Kyle Polley et al.

AI agents introduce fundamentally new security challenges because they blur the line between code and data, and can execute actions across systems—developers need layered defenses including input filtering, sandboxing, and strict privilege controls.

This paper identifies security risks in AI agents—systems that can take actions in the real world—and proposes defenses. It covers new attack types like prompt injection and confused-deputy problems, explains how current protections work (sandboxing, policy enforcement), and highlights gaps in standards and research needed to secure multi-agent systems.

safety · agents · architecture

CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks

Mar 12, 2026

Alexandre Le Mercier, Thomas Demeester, Chris Develder

CLASP provides a practical, lightweight defense against poisoning attacks on state space models by detecting malicious tokens before they reach downstream tasks, with strong generalization to unseen attack patterns.

State space models like Mamba are fast alternatives to Transformers, but they're vulnerable to Hidden State Poisoning Attacks that inject malicious tokens to corrupt the model's memory.

safety · efficiency · architecture

Proof-Carrying Materials: Falsifiable Safety Certificates for Machine-Learned Interatomic Potentials

Mar 12, 2026

Abhinaba Basu, Pavan Chakraborty

ML models for materials science need formal safety audits—this work shows single models have severe blind spots, but systematic falsification and confidence bounds can identify reliable predictions and improve discovery by 25%.

Machine-learned models for predicting material properties often fail silently. This paper introduces Proof-Carrying Materials, a system that audits these models through adversarial testing, statistical confidence bounds, and formal verification to identify which predictions are trustworthy.

safety · evaluation · applications

CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays

Feb 26, 2026

Hyungyung Lee, Hangyul Yoon, Edward Choi

AI medical diagnosis becomes more trustworthy when it shows its evidence instead of just giving answers.

This paper presents CXReasonAgent, a system that helps AI diagnose chest X-rays by combining a language model with specialized medical tools. Instead of just guessing answers like typical AI models, it shows its work by pointing to specific evidence in the image.

agents · multimodal · safety

Evaluating Stochasticity in Deep Research Agents

Feb 26, 2026

Haotian Zhai, Elias Stengel-Eskin, Pratik Patil et al.

AI research agents are unreliable in production because of randomness in how they search, summarize, and reason, but this variability can be substantially reduced without sacrificing answer quality.

Research agents that gather information to answer questions produce different results each time you run them with the same question. This paper identifies where that randomness comes from and proposes ways to make these systems more reliable—reducing variability by 22% while keeping quality high.

agents · evaluation · safety

Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving

Feb 26, 2026

Jiangxin Sun, Feng Xue, Teng Long et al.

Autonomous driving systems can make safer decisions in unexpected situations by predicting consequences and evaluating risk, rather than just copying expert behavior.

This paper tackles a critical problem in autonomous driving: current AI systems learn by copying expert drivers, but fail when encountering unusual situations they've never seen before. The researchers propose RaWMPC, a system that predicts what will happen if the car takes different actions, then picks the safest option—without needing expert examples.

safety · agents · training

Mitigating Legibility Tax with Decoupled Prover-Verifier Games

Feb 26, 2026

Yegon Kim, Juho Lee

Separate the model that solves problems from the model that explains them to avoid accuracy loss when making AI outputs verifiable.

When AI models need to show their work so humans can verify it, they often get worse at solving problems—a cost called "legibility tax." This paper fixes that by splitting the job: one model solves the problem correctly, then a second model rewrites the solution in a way that's easy to check. This avoids forcing one model to juggle both accuracy and explainability.

reasoning · safety · training

Agency and Architectural Limits: Why Optimization-Based Systems Cannot Be Norm-Responsive

Feb 26, 2026

Radha Sarma

RLHF-based AI systems cannot be governed by norms because optimization forces all values into tradeable weights; genuine norm-following requires capacities that optimization-based architectures cannot provide.

This paper argues that AI systems like ChatGPT trained with RLHF cannot follow ethical rules or norms because of how they're built. They work by turning everything into a single score and picking the highest one—which means they'll always trade off any principle if it scores higher. The author shows this isn't a bug to fix, but a fundamental limit of optimization itself.

alignment · safety · architecture

FairQuant: Fairness-Aware Mixed-Precision Quantization for Medical Image Classification

Feb 26, 2026

Thomas Woergaard, Raghavendra Selvan

Compressing models for efficiency can accidentally increase bias—you need to monitor fairness metrics during compression, not just overall accuracy.

This paper tackles a hidden problem in model compression: when you shrink neural networks to run faster, the compression can unfairly hurt accuracy for certain groups of people (like underrepresented skin tones in medical imaging).

efficiency · safety · evaluation