ThinkLLM

Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

861 papers, 66 this month, 12 topics
All, Efficiency 37, Reasoning 36, Training 35, Evaluation 29, Architecture 23, Agents 23, Multimodal 17, Applications 15, Alignment 9, Safety 8, Scaling 8, Data 3

May 18 – May 24 (12)

Finite-Particle Convergence Rates for Conservative and Non-Conservative Drifting Models

May 21, 2026

Krishnakumar Balasubramanian

Conservative drifting with kernel density estimators achieves provable convergence rates for one-step generative modeling, with the convergence speed depending on dimension and a tunable parameter that trades off between different error sources.

This paper analyzes drifting methods for generative modeling, proposing a conservative approach using kernel density estimators that guarantees gradient-field properties. The authors prove finite-particle convergence rates showing how quickly the method converges as sample size increases, with explicit tracking of how bandwidth and dimension affect performance.

training, evaluation

Evaluating Commercial AI Chatbots as News Intermediaries

May 21, 2026

Mirac Suzgun, Emily Shen, Federico Bianchi et al.

AI chatbots excel at retrieving and synthesizing recent news but have three critical weaknesses: they systematically underperform on non-English content, fail primarily due to retrieval errors rather than reasoning mistakes, and are easily fooled by questions containing subtle false information.

This study evaluates six major AI chatbots (Gemini, Grok, Claude, GPT models) on their ability to answer factual news questions across six languages and regions.

May 11 – May 17 (9)

EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

May 14, 2026

Ruozhen He, Meng Wei, Ziyan Yang et al.

Maintaining consistent characters and objects across long video sequences is hard; explicit memory of each entity's appearance significantly improves consistency, especially when characters reappear after many shots.

EntityBench is a benchmark for evaluating multi-shot video generation—creating coherent video sequences with multiple scenes. It includes 140 episodes with detailed tracking of characters, objects, and locations across shots, plus an evaluation system that measures both video quality and consistency.

evaluation, multimodal, architecture

FutureSim: Replaying World Events to Evaluate Adaptive Agents

May 14, 2026

Shashwat Goel, Nikhil Chandak, Arvindh Arun et al.

Current AI agents struggle with long-horizon real-world adaptation—the best models achieve only 25% accuracy predicting events three months ahead, showing this is a critical capability gap for deployed AI systems.

FutureSim is a benchmark that tests AI agents' ability to adapt and predict real-world events over time by replaying actual news and events in chronological order. Agents must forecast future events beyond their training data while interacting with a live stream of information, revealing significant gaps in current frontier models' capabilities.

May 4 – May 10 (36)

Conformal Path Reasoning: Trustworthy Knowledge Graph Question Answering via Path-Level Calibration

May 8, 2026

Shuhang Lin, Chuhao Zhou, Xiao Lin et al.

Conformal Path Reasoning provides statistical guarantees that your KGQA system will include the correct answer in its output set, while keeping that set compact and practical—solving a real reliability problem in knowledge graph reasoning.

This paper improves Knowledge Graph Question Answering by adding statistical guarantees to answer reliability. It uses conformal prediction—a technique that creates sets of answers with proven coverage rates—combined with a neural network that learns to score reasoning paths better. The result is more trustworthy answers with smaller, more useful prediction sets.

reasoning, evaluation, safety
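
As a rough illustration of the conformal-prediction idea in the summary above, the sketch below builds a prediction set with plain split conformal calibration. It is not the paper's method: the function names, the use of one nonconformity score per candidate answer (lower means more plausible), and the toy numbers are all assumptions made for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the split-conformal cutoff for miscoverage level alpha.

    cal_scores: nonconformity scores of the TRUE answers on a held-out
    calibration set (hypothetical inputs; lower = more plausible).
    """
    n = len(cal_scores)
    # Rank of the calibration score used as the cutoff.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: keep every candidate.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the cutoff."""
    return [c for c, s in candidate_scores.items() if s <= threshold]


# Toy calibration scores (made up for the example).
cal = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.9]
tau = conformal_threshold(cal, alpha=0.25)

# Toy candidate answers for one question (hypothetical names and scores).
answers = prediction_set({"a": 0.1, "b": 0.5, "c": 0.7}, tau)
print(tau, answers)
```

Under the usual exchangeability assumption, a set built this way contains the correct answer with probability at least 1 - alpha. How compact the set is depends on the quality of the scores, which is the part the summary says the paper improves with a learned path scorer.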

GRAPHLCP: Structure-Aware Localized Conformal Prediction on Graphs

May 8, 2026

Peyman Baghershahi, Fangxin Wang, Debmalya Mandal et al.

When using GNNs for predictions, you can get tighter, more reliable uncertainty estimates by explicitly using graph structure rather than just embedding similarity—this gives you both statistical guarantees and practical efficiency.

GRAPHLCP improves uncertainty quantification for graph neural networks by using graph structure to make better predictions with guaranteed coverage. Instead of just looking at embedding similarity, it uses graph topology and a PageRank-based approach to identify similar nodes and weight predictions appropriately, which shrinks prediction sets while maintaining statistical guarantees.

Apr 27 – May 3 (41)

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

May 1, 2026

Sailesh Panda, Pritam Kadasi, Abhishek Upperwal et al.

LLMs fail at executing multi-step procedures faithfully, with accuracy collapsing as procedure length increases. This means strong benchmark performance can hide critical weaknesses in following instructions step-by-step.

This paper tests whether large language models actually follow step-by-step procedures correctly, not just whether they get the right final answer. Researchers created a benchmark where models execute arithmetic algorithms of varying length and complexity.

evaluation, reasoning, alignment

Can Coding Agents Reproduce Findings in Computational Materials Science?

May 1, 2026

Ziyang Huang, Yi Cao, Ali K. Shargh et al.

AI coding agents are far from ready for autonomous scientific research: they excel at software engineering but fail at the domain-specific reasoning, procedure reconstruction, and result interpretation needed to reproduce real computational science claims.

This paper introduces AutoMat, a benchmark that tests whether AI coding agents can reproduce scientific findings from materials science papers. The benchmark reveals that current AI agents struggle significantly—achieving only 54% success—because they can't fully reconstruct experimental procedures from paper descriptions, deviate from required methods, and fail during execution.

Apr 20 – Apr 26 (2)

How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks

Apr 24, 2026

Longju Bai, Zhemin Huang, Xingyao Wang et al.

AI agents are expensive and unpredictable: token costs vary wildly (up to 30x difference on the same task), models differ dramatically in efficiency, and even frontier models can't accurately predict their own token usage before running.

This paper analyzes how much AI agents spend on tokens when solving coding tasks. Researchers studied eight frontier LLMs on real-world coding benchmarks and found that agentic tasks consume 1000x more tokens than simpler coding tasks, with huge variability between runs. Surprisingly, spending more tokens doesn't guarantee better results—accuracy often peaks at intermediate costs then plateaus.

efficiency, agents, evaluation

Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities

Apr 24, 2026

Ilana Nguyen, Harini Suresh, Thema Monroe-White et al.

LLMs systematically misrepresent Global Majority nationalities through stereotyping and one-dimensional portrayals, creating real risks for applications like asylum interviews. These harms are structural, not just surface-level, and require deliberate mitigation strategies.

evaluation, multimodal, data

FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection

May 21, 2026

Huanchi Wang, Zihang Huang, Yifang Tian et al.

You can build practical, label-efficient log anomaly detectors by using LLMs once offline to structure the problem, then training lightweight domain-specific models that run continuously without expensive LLM calls.

FAME is a system for detecting anomalies in individual log messages rather than groups, using a mixture-of-experts approach that leverages an LLM offline to organize log templates into failure domains. It requires minimal labeled data (as few as 100 examples) and runs efficiently on-premise, achieving 98% accuracy on real production logs while reducing annotation effort by 76x.

efficiency, evaluation, applications

SDPM: Survival Diffusion Probabilistic Model for Continuous-Time Survival Analysis

May 21, 2026

Stanislav R. Kirpichenko, Andrei V. Konstantinov, Lev V. Utkin

Diffusion models can effectively handle continuous-time survival analysis by modeling censored outcomes directly, avoiding parametric assumptions and discretization errors that limit traditional survival methods.

SDPM uses diffusion models to estimate time-to-event distributions from data with censored observations, without requiring assumptions about the hazard function or discretizing time. The model generates samples that can be converted to survival curves, achieving competitive performance on real datasets while accurately recovering underlying continuous distributions.

applications, evaluation

Variance Reduction for Expectations with Diffusion Teachers

May 20, 2026

Jesse Bettencourt, Xindi Wu, Matan Atzmon et al.

When using diffusion models to guide other tasks, you can dramatically reduce compute cost by resampling cheap diffusion noise multiple times per expensive upstream computation, rather than doing one expensive computation per noise sample.

This paper introduces CARV, a framework for reducing variance in gradient estimates when using pretrained diffusion models as teachers in downstream tasks like text-to-3D generation. By reusing expensive computations (like 3D rendering) across multiple noise samples and applying importance sampling techniques, the method achieves 2-3x speedups without changing the underlying objective.

efficiency, training, evaluation

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

May 20, 2026

Sixiong Xie, Zhuofan Shi, Haiyang Shen et al.

Retrieval isn't the main problem for frontier models on deep research tasks; instead, they fail primarily at deriving answers from evidence and calibrating confidence correctly, suggesting future improvements should focus on reasoning and verification rather than search.

DeepWeb-Bench is a challenging benchmark for evaluating AI agents that research questions by searching the web, collecting evidence, and reasoning through answers. Unlike existing benchmarks, it focuses on tasks requiring massive evidence gathering, cross-source verification, and complex multi-step reasoning—areas where current frontier models still struggle significantly.

evaluation, reasoning, agents

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

May 20, 2026

Basel Shbita, Pengyuan Li, Anna Lisa Gentile

Most vision-language models struggle with knowledge-grounded visual reasoning—even large models only reach 75% accuracy when questions require combining visual evidence with external facts, suggesting a major gap in real-world VQA capabilities.

WikiVQABench is a new benchmark for testing vision-language models on questions that require both visual understanding and external knowledge from Wikipedia and Wikidata.

evaluation, multimodal

Mitigating Label Bias with Interpretable Rubric Embeddings

May 20, 2026

Calvin Isley, Johann D. Gaebler, Sharad Goel

Replace opaque learned embeddings with interpretable features derived from expert-defined rubrics to reduce bias inheritance from biased training labels in high-stakes decisions.

When training AI models on biased historical data (like past hiring decisions), the models learn and perpetuate those biases. This paper proposes using 'rubric embeddings'—features based on expert-defined criteria—instead of black-box embeddings to make fairer predictions. Testing on university admissions data, the approach reduces group disparities while maintaining quality.

alignment, evaluation

Quality and Security Signals in AI-Generated Python Refactoring Pull Requests

May 20, 2026

Mohamed Almukhtar, Anwar Ghammam, Hua Ming

AI-generated refactoring often improves code but frequently introduces new quality and security issues that developers accept anyway, highlighting the need for automated quality checks before merging AI contributions.

This study examines Python refactoring pull requests created by AI agents, measuring their impact on code quality and security.

evaluation, safety, applications

What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models

May 18, 2026

Payal Chandak, Victoria Alkin, David Wu et al.

LLMs deployed for medical advice have hidden, consistent ethical biases that don't reflect real physician diversity; without explicit auditing and balancing, a single model's values could be imposed at scale on thousands of patients.

This paper audits how large language models handle ethical dilemmas in medicine, revealing that while models discuss multiple ethical perspectives in their reasoning, they make near-identical decisions across repeated attempts.

safety, evaluation, alignment

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

May 18, 2026

Matthew L. Smith, Jonathan P. Shock, Samuel T. Segun et al.

LLM factual accuracy isn't random—it scales predictably with model size and training data frequency, meaning you can estimate what facts a model will reliably remember based on these two factors.

This paper reveals that LLM factual recall follows a predictable pattern based on two factors: model size and how often a topic appears in training data.

scaling, evaluation, training

DexHoldem: Playing Texas Hold'em with Dexterous Embodied System

May 18, 2026

Feng Chen, Tianzhe Chu, Li Sun et al.

Current embodied systems struggle with the full loop: even when vision models perform well on isolated tasks (67% accuracy), they fail to recover the complete game state needed for decision-making (34% accuracy), and execution errors cascade during real deployment.

DexHoldem is a real-world benchmark that tests embodied AI systems playing Texas Hold'em with a dexterous robot hand. It combines three challenges: executing 14 card-manipulation skills precisely, perceiving game state from images, and making decisions based on that perception—revealing how errors compound when all three run together in closed-loop control.

evaluation, agents

Quantitative Video World Model Evaluation for Geometric-Consistency

May 14, 2026

Jiaxin Wu, Yihao Pi, Yinling Zhang et al.

Video generators often fail at maintaining consistent 3D geometry in ways that human raters and perceptual metrics don't catch; PDI-Bench provides a diagnostic tool to measure and improve these failures systematically.

This paper introduces PDI-Bench, a quantitative framework for evaluating whether generated videos maintain physically plausible 3D structure and motion.

evaluation, multimodal

Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

May 14, 2026

Sahil Sen, Akhil Kasturi, Elias Lumer et al.

When building agentic search systems, simple grep-based retrieval can outperform vector search, but the agent architecture and how you present tool outputs to the model matter more than retrieval method alone.

This paper compares different retrieval strategies (grep vs. vector search) in AI agent systems that autonomously retrieve information and call tools.

agents, evaluation

OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

May 14, 2026

Shang Zhou, Wenhao Chai, Kaiyuan Liu et al.

Instead of judging multiple reasoning attempts individually (which is noisy), compare them pairwise and aggregate votes to find the best solution—this scales test-time compute breadth more reliably than single-trace depth scaling.

OpenDeepThink improves LLM reasoning by running multiple solution attempts in parallel and selecting the best one using pairwise comparisons between candidates, rather than trying to judge each solution independently. The method uses Bradley-Terry aggregation to rank candidates based on LLM pairwise judgments, then evolves the top solutions using critiques from comparisons.

reasoning, evaluation
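
To make the Bradley-Terry aggregation step in the summary above more concrete, here is a minimal sketch that turns a table of pairwise wins into one strength score per candidate, using the standard minorization-maximization update. It is an illustration only, not the paper's pipeline: the `wins` matrix, the function name, and the iteration count are assumptions, and the LLM judging and solution-evolution steps are left out.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is how many times candidate i was preferred over candidate j
    (hypothetical judge outcomes). Returns strengths that sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = 0.0
            for j in range(n):
                games = wins[i][j] + wins[j][i]
                # Skip self-pairs and pairs that were never compared.
                if j != i and games > 0:
                    denom += games / (p[i] + p[j])
            new_p.append(total_wins / denom if denom > 0 else p[i])
        # Renormalize so the strengths stay on a fixed scale.
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p


# Three candidate solutions, four pairwise judgments per pair (toy data).
wins = [
    [0, 3, 4],
    [1, 0, 3],
    [0, 1, 0],
]
strengths = bradley_terry(wins)
best = max(range(len(strengths)), key=lambda i: strengths[i])
print(strengths, best)
```

The design point is that each noisy pairwise judgment contributes only a small vote, so a single bad comparison moves the ranking less than a single bad absolute score would.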

Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands

May 14, 2026

Pratinav Seth, Vinay Kumar Sankarapu

Behavioral evaluations alone cannot verify the safety claims regulators now demand—you need mechanistic evidence like activation analysis to actually verify what's happening inside AI models, not just what they output.

This paper argues that current AI safety evaluation methods (like red-teaming and behavioral testing) cannot verify the deep safety properties that AI governance frameworks now require, such as absence of hidden objectives or resistance to loss-of-control.

safety, evaluation, alignment

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

May 12, 2026

Di Wu, Zixiang Ji, Asmi Kawatkar et al.

Long-term memory for agents requires more than just storing task outcomes; agents need to internalize environment-specific patterns, workflows, and failure modes to become truly experienced colleagues, and current memory systems still struggle with this despite recent advances.

This paper introduces LongMemEval-V2, a benchmark for testing whether AI agents can build long-term memory of specialized web environments. It includes 451 questions about five types of memory (state recall, workflow knowledge, failure modes, etc.) paired with massive history trajectories of up to 500 steps and 115M tokens.

agents, evaluation, reasoning

Task-Adaptive Embedding Refinement via Test-time LLM Guidance

May 12, 2026

Ariel Gera, Shir Ashury-Tahan, Gal Bloch et al.

You can boost embedding model performance on hard search tasks by having an LLM refine queries at test-time, making embeddings practical for scenarios where running LLMs on all documents is too expensive.

This paper shows how to improve embedding models for search and classification by using an LLM to refine user queries in real-time. Instead of changing the embedding model itself, the approach adjusts the query representation based on feedback from a small sample of documents, achieving up to 25% improvement on challenging tasks without requiring expensive LLM processing at scale.

efficiency, evaluation

MEME: Multi-entity & Evolving Memory Evaluation

May 12, 2026

Seokwon Jung, Alexander Rubinstein, Arnas Uselis et al.

LLM agents struggle with dependency reasoning in persistent memory—when facts relate to each other, systems collapse to near-random performance, and fixing this requires impractically expensive configurations.

This paper introduces MEME, a benchmark for evaluating how well AI agents manage information across multiple sessions. It tests six memory tasks including complex scenarios like tracking dependencies between facts and handling deletions.

evaluation, agents, reasoning

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

May 8, 2026

Manish Bhattarai, Ismael Boureima, Nishath Rajiv Ranasinghe et al.

Structured, multi-criterion rewards grounded in real documents help models develop generalizable reasoning skills that transfer to unseen tasks better than single holistic scores.

This paper shows how to train AI models to reason better by grading their responses on multiple specific criteria instead of just right/wrong. The researchers created detailed rubrics from scientific documents and used them to train a language model with a technique called GRPO, which optimizes for partial credit across different dimensions.

training, reasoning, evaluation

Accurate and Efficient Statistical Testing for Word Semantic Breadth

May 8, 2026

Yo Ehara

When statistically comparing semantic breadth of words using embeddings, you must account for directional differences or your significance tests will be unreliable—this paper provides a practical, GPU-accelerated solution.

This paper solves a statistical problem in measuring how broadly a word's meaning spreads across different contexts using word embeddings. When comparing two words' semantic breadth, naive statistical tests fail because they confuse directional differences (where words point in different semantic directions) with actual breadth differences.

evaluation

Uncertainty-Aware Structured Data Extraction from Full CMR Reports via Distilled LLMs

May 8, 2026

Yi Yu, Parker Martin, Zhenyu Bu et al.

Distilled LLMs can extract medical data from unstructured reports with high accuracy and built-in confidence estimates, enabling clinicians to prioritize which extractions need human review.

CMR-EXTR converts free-text cardiac MRI reports into structured data with confidence scores for each extracted field. Using a lightweight distilled language model, it achieves 99.65% accuracy while running entirely offline, making it practical for clinical use without requiring constant API access.

applications, efficiency, evaluation

BAMI: Training-Free Bias Mitigation in GUI Grounding

May 7, 2026

Borui Zhang, Bo Zhang, Bo Wang et al.

You can significantly improve GUI agent accuracy on complex interfaces without retraining by using a two-step approach: first narrow down the region of interest, then select the best candidate from remaining options.

This paper identifies why GUI grounding models (used by AI agents to click and interact with interfaces) fail on complex screens, finding two main problems: high image resolution causes precision errors, and complex UI elements create ambiguity.

agents, evaluation, efficiency

Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML

May 7, 2026

Jai Moondra, Ayela Chughtai, Bhargavi Lanka et al.

Don't trust global LLM leaderboards—they hide structured disagreement across languages and tasks. Use language-specific rankings or small model portfolios instead to match diverse user needs.

Current LLM leaderboards rank models using global voting patterns, but this masks the reality: opinions differ dramatically by language and task. This paper shows that 2/3 of votes cancel out and top models are statistically indistinguishable globally. Instead, grouping by language reveals coherent subpopulations with consistent rankings.

evaluation, multimodal

When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

May 7, 2026

Sushant Gautam, Finn Schwall, Annika Willoch Olstad et al.

When deploying LLMs in new languages or sectors without existing safety benchmarks, you can't collapse safety comparisons into a single score—you must report the full context: which scenarios, which judge, which risk measure, and the uncertainty around each comparison.

This paper tackles a real-world problem: comparing AI models for safety when no labeled benchmark exists yet. Instead of relying on ground-truth labels, the authors validate safety scores through three checks—whether models respond to safety changes, whether model differences dominate over measurement noise, and whether results stay consistent across retests.

safety, evaluation

Edge-specific signal propagation on mature chromophore-region 3D mechanism graphs for fluorescent protein quantum-yield prediction

May 7, 2026

Yuchen Xiong, Swee Keong Yeap, Steven Aw Yoong Kit

Local 3D structure around a protein's light-emitting center matters more than overall sequence for predicting brightness—and you can build interpretable models by explicitly encoding which atoms contact which chromophore regions.

This paper predicts how bright fluorescent proteins will be by analyzing their 3D structure around the light-emitting chromophore region. Instead of just looking at protein sequences, the method builds a graph of how atoms and chemical groups physically contact the chromophore, then uses machine learning to predict brightness.

evaluation, architecture

Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

May 7, 2026

Hao Dong, Hongzhao Li, Shupan Li et al.

Despite claims of progress, multimodal domain generalization methods show only marginal improvements over basic approaches when fairly compared—the field needs better methods and standardized evaluation to make real progress.

This paper creates MMDG-Bench, the first standardized benchmark for multimodal domain generalization across action recognition, fault diagnosis, and sentiment analysis. Testing 9 methods on 6 datasets with 7,402 trained models, it reveals that recent specialized methods barely beat simple baselines, no method works consistently across tasks, and all methods struggle with corrupted or missing data.

evaluation, multimodal

Taming Outlier Tokens in Diffusion Transformers

May 6, 2026

Xiaoyu Wu, Yifei Wang, Tsu-Jui Fu et al.

Outlier tokens in diffusion transformers aren't just extreme values but represent corrupted local information; controlling them with register tokens significantly improves image generation quality.

This paper identifies and fixes a problem in Diffusion Transformers where certain tokens develop unusually high values that degrade image quality. The authors show this happens in both the image encoder and the generation model itself, and propose Dual-Stage Registers—a technique using learnable tokens to stabilize these problematic values and improve image generation.

architecture, efficiency, evaluation

Implicit Representations of Grammaticality in Language Models

May 6, 2026

Yingshan Susan Wang, Linlu Qiu, Zhaofeng Wu et al.

Language models learn grammaticality as a distinct concept from string probability, hidden in their internal representations rather than reflected in output probabilities—you can extract this knowledge with a simple linear probe.

Language models generate grammatical text but their probability scores don't clearly distinguish grammatical from ungrammatical sentences.

evaluation

Almost-Orthogonality in Lp Spaces: A Case Study with Grok

May 6, 2026

Ziang Chen, Jaume de Dios Pont, Paata Ivanisvili et al.

AI language models can contribute meaningfully to mathematical discovery by helping identify intermediate lemmas and inequalities, though human mathematicians remain essential for rigorous proof construction and validation.

This paper proves new bounds on how sums of functions behave in mathematical spaces, showing when certain inequalities hold and when they fail. The authors use a large language model called Grok to help discover intermediate results, demonstrating how AI can assist in mathematical research.

reasoning, evaluation

Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval

May 6, 2026

Nicholas Barnfield, Juno Kim, Eshaan Nichani et al.

Linear memory systems face a fundamental logarithmic penalty for top-1 retrieval but can achieve quadratic capacity if you only need the correct answer ranked highly rather than first—a distinction that matters for building efficient retrieval systems.

This paper analyzes how many key-value pairs a linear memory matrix can store, showing the answer depends on the retrieval task. For winner-take-all retrieval (finding the single best match), capacity scales as d² ≈ n log n due to extreme-value statistics. For listwise retrieval (keeping the correct answer in a top-k list), capacity improves to d² ≈ n.

scaling, evaluation

Estimating the expected output of wide random MLPs more efficiently than sampling

May 6, 2026

Wilson Wu, Victor Lecomte, Michael Winer et al.

You can estimate a wide MLP's expected output more efficiently than sampling by directly computing activation distributions layer-by-layer using mathematical tools, which is particularly useful for detecting tail risks.

This paper presents a mathematical method to estimate what a randomly initialized neural network will output on average, without actually running data through it. Instead of sampling (the standard approach), the authors use statistical tools like cumulants and Hermite expansions to track how activations behave at each layer.

efficiency, evaluation, architecture

Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer

May 6, 2026

Alexander Hsu, Zhaiming Shen, Wenjing Liao et al.

Transformer attention can act as a feature learner for nonlinear functions during in-context learning, and this capability can be theoretically analyzed with concrete error bounds—bridging the gap between empirical success and mathematical understanding.

This paper explains how transformers perform in-context learning for nonlinear regression tasks. The researchers show that transformer attention mechanisms can automatically create nonlinear features (like polynomials or splines) from examples in the prompt, enabling the model to solve complex regression problems without updating weights.

reasoning, architecture, evaluation

MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge

May 6, 2026

Perry E. Radau

LLMs may appear competent on multiple-choice MRI benchmarks but struggle significantly with free-text recall of vendor-specific operational knowledge; multiple-choice scores alone don't indicate readiness for real-world MRI protocol guidance.

This paper introduces MRI-Eval, a benchmark with 1,365 questions testing LLM knowledge of MRI physics and GE scanner operations across three difficulty levels.

evaluation, applications

The First Token Knows: Single-Decode Confidence for Hallucination Detection

May 6, 2026

Mina Gabriel

A single metric based on the model's confidence distribution at the first answer token can reliably detect hallucinations without expensive multi-sample generation, making it a practical baseline for production systems.

This paper shows that checking a language model's confidence on just the first token of an answer can detect hallucinations as well as methods that generate multiple answers and compare them. The approach is faster and simpler, requiring only a single model run instead of repeated sampling.

evaluation, efficiency
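
The summary above does not state which confidence metric the paper uses, so the sketch below only shows the general shape of a single-decode check: take the logits at the first answer token, convert them to probabilities, and read off how peaked the distribution is. The two signals used here (maximum probability and entropy), the function names, and the toy logits are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def first_token_confidence(logits):
    """Summarize how peaked the first-answer-token distribution is.

    logits: raw scores over the vocabulary at the first answer position
    (a hypothetical input taken from one forward pass of the model).
    """
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return {"max_prob": max(probs), "entropy": entropy}


# Toy three-token vocabulary: one confident case, one uncertain case.
peaked = first_token_confidence([10.0, 0.0, 0.0])
flat = first_token_confidence([1.0, 1.0, 1.0])
print(peaked, flat)
```

A low maximum probability or a high entropy would flag the answer for review. Any threshold would need to be tuned on labeled examples for the model in question.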

PSK at SemEval-2026 Task 9: Multilingual Polarization Detection Using Ensemble Gemma Models with Synthetic Data Augmentation

May 6, 2026

Srikar Kashyap Pulipaka

Per-language fine-tuning with synthetic data augmentation and threshold tuning can significantly improve multilingual NLP tasks, but model generalization to test data varies dramatically—some architectures dropped 30-50% in performance despite strong development results.

This paper describes a system for detecting polarized language across 22 languages using fine-tuned Gemma models with synthetic data augmentation. The approach combines per-language model tuning, LLM-generated synthetic training data with quality filtering, and weighted ensemble predictions to achieve competitive performance on a multilingual classification task.

training, evaluation

Aes3D: Aesthetic Assessment in 3D Gaussian Splatting

May 6, 2026

Chuanzhi Xu, Boyu Wei, Haoxian Zhou et al.

You can now automatically evaluate whether a 3D scene looks visually appealing by analyzing its Gaussian Splatting representation directly, which is faster and cheaper than traditional rendering-based assessment methods.

This paper introduces Aes3D, the first framework for evaluating the visual aesthetics of 3D scenes created with Gaussian Splatting. It includes a new dataset with aesthetic annotations and a lightweight model that directly assesses aesthetic qualities like composition and harmony from 3D Gaussian primitives, without needing to render images.

evaluation, multimodal

Superposition Is Not Necessary: A Mechanistic Interpretability Analysis of Transformer Representations for Time Series Forecasting

May 6, 2026

Alper Yıldırım

Transformers for time series don't rely on superposition as they do in language tasks, meaning time series forecasting may not require the compositional complexity that makes Transformers powerful for NLP.

This paper investigates how Transformers work internally for time series forecasting by analyzing their hidden representations using sparse autoencoders. The key finding: Transformers don't need complex, overlapping feature representations (superposition) to forecast well—their representations stay sparse and simple, which explains why basic linear models remain competitive.

reasoningevaluation

A Closed-Form Adaptive-Landmark Kernel for Certified Point-Cloud and Graph Classification

May 5, 2026

Sushovan Majhi, Atish Mitra, Žiga Virk et al.

You can build certified graph classifiers without gradient training by using topology-aware landmark selection and closed-form kernel methods—achieving competitive accuracy with built-in confidence bounds.

PALACE is a method for classifying point clouds and graphs using persistent homology (a topological data analysis technique) with adaptive landmark placement.

evaluationreasoning

Safety and accuracy follow different scaling laws in clinical large language models

May 5, 2026

Sebastian Wind, Tri-Thien Nguyen, Jeta Sopa et al.

In clinical AI, safety requires deliberate design choices around evidence quality and retrieval strategy, not just model scaling. A few high-risk errors matter more than average performance.

This paper shows that making clinical AI models bigger or faster doesn't automatically make them safer: safety and accuracy follow different scaling laws. Researchers tested 34 medical AI models and found that high-quality evidence dramatically improved both accuracy and safety, but standard retrieval methods and extra computing power didn't prevent dangerous errors or overconfidence.

safetyevaluationapplications

Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

May 5, 2026

Raja Sekhar Rao Dheekonda, Will Pearce, Nick Landers

Agentic red teaming can dramatically speed up security testing of AI systems by automating workflow construction, letting security teams focus on what vulnerabilities to test rather than how to implement each test.

This paper introduces an AI red teaming agent that automates adversarial testing of AI systems. Instead of manually building attack workflows over weeks, operators describe their testing goals in natural language, and the agent automatically selects attacks, applies transformations, and scores results—compressing the process from weeks to hours.

safetyagentsevaluation

Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

May 5, 2026

Yilun Zhao, Jinbiao Wei, Tingyu Song et al.

Retrievers for agentic AI systems need to be evaluated and trained differently—they must surface complementary evidence across multiple aspects and search iterations, not just find topically similar passages.

This paper tackles how search systems find evidence for AI agents that need to reason through complex problems. Current retrieval systems mostly surface topically similar passages, but agentic systems need diverse, complementary evidence across multiple search rounds.

evaluationagents

SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment

May 5, 2026

Joseph Breda, Fadi Yousif, Beszel Hawkins et al.

Structured conversational strategies—where AI systematically interviews patients before diagnosing—significantly outperform unguided chat-based symptom assessment, suggesting that agentic design patterns matter more than raw model capability for medical applications.

Researchers deployed SymptomAI, a conversational AI system for symptom assessment, to nearly 14,000 Fitbit users and found it diagnosed conditions more accurately than independent clinicians reviewing the same conversations.

applicationsagentsevaluation

EQUITRIAGE: A Fairness Audit of Gender Bias in LLM-Based Emergency Department Triage

May 5, 2026

Richard J. Young, Alice M. Matthews

Before deploying LLMs in clinical settings, you need model-specific fairness audits using counterfactual testing—demographic parity alone doesn't guarantee fair decisions, and interventions like demographic blinding work differently across models.

Researchers audited five large language models for gender bias in emergency department triage decisions, finding that all models showed concerning flip rates (9.9-43.8%) when patient gender was swapped.
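
As a rough illustration of counterfactual testing, the sketch below computes a flip rate: the share of cases whose triage decision changes when only the patient's gender is swapped. The function name and the exact definition are assumptions for illustration; the paper may define flip rate differently.

```python
def flip_rate(original_decisions, swapped_decisions):
    """Fraction of cases whose decision changes after the counterfactual swap.

    Both lists must be aligned: item i in each list is the same case,
    differing only in the swapped attribute.
    """
    if len(original_decisions) != len(swapped_decisions):
        raise ValueError("decision lists must have the same length")
    if not original_decisions:
        raise ValueError("at least one case is required")
    flips = sum(a != b for a, b in zip(original_decisions, swapped_decisions))
    return flips / len(original_decisions)
```

For example, `flip_rate([1, 2, 3, 3], [1, 2, 2, 3])` counts one changed decision out of four cases. Unlike demographic parity, this compares each case against its own counterfactual, so it can expose individual-level inconsistency that matching group averages would hide.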

safetyevaluationalignment

From Intent to Execution: Composing Agentic Workflows with Agent Recommendation

May 5, 2026

Kishan Athrey, Ramin Pishehvar, Brian Riordan et al.

Automating agent selection in multi-agent systems using retrieval-based matching and LLM re-ranking improves reliability and scalability compared to manual composition, especially when a critique agent validates the full workflow.

This paper presents an automated framework for building multi-agent systems that replaces manual steps with AI-driven composition. It uses an LLM planner to break down user requests into tasks, then automatically selects the best agents from registries using a two-stage retrieval system (fast retriever + LLM re-ranker), with a critique agent validating the entire plan.
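
The two-stage selection described above can be sketched as follows. This is only an outline under stated assumptions: the registry is a plain dict of agent descriptions, the fast retriever is a token-overlap score, and `rerank_score` is a hypothetical callable standing in for the LLM re-ranker.

```python
def token_overlap(query, description):
    """Cheap stage-1 score: Jaccard overlap between lowercase token sets."""
    q = set(query.lower().split())
    d = set(description.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def recommend_agent(task, registry, rerank_score, top_k=3):
    """Pick an agent for a task using retrieve-then-rerank.

    registry: dict mapping agent name -> text description (hypothetical format).
    rerank_score: callable(task, description) -> float; in a real system this
    would wrap an LLM call, here it is whatever the caller supplies.
    """
    # Stage 1: keep only the top_k candidates by the fast score.
    candidates = sorted(
        registry,
        key=lambda name: token_overlap(task, registry[name]),
        reverse=True,
    )[:top_k]
    # Stage 2: apply the expensive scorer only to the shortlist.
    return max(candidates, key=lambda name: rerank_score(task, registry[name]))
```

The design point is cost: the expensive scorer runs on `top_k` candidates instead of the whole registry. An agent filtered out in stage 1 can never be selected, so the shortlist size trades recall against re-ranking cost.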

agentsarchitectureevaluation

Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments

May 5, 2026

Hao Mi, Qiang Sheng, Shaofei Wang et al.

Hallucination detection improves when you combine a model's internal uncertainty signals with its own self-judgments, enforcing that they logically agree—this dual-view approach catches more false claims than either method alone.

This paper tackles hallucination detection in large language models by combining two approaches: analyzing internal neural patterns and extracting explicit self-judgments from the model. The key innovation is a framework that treats these as logically connected signals—if a model says something is true and judges itself as correct, those signals should align.

safetyevaluation

Feature-Augmented Transformers for Robust AI-Text Detection Across Domains and Generators

May 5, 2026

Mohamed Mady, Johannes Reschke, Björn Schuller

AI-text detectors need feature augmentation and careful threshold calibration to work reliably across different domains and generators; linguistic features like readability are crucial for robustness under distribution shift.

This paper tackles the challenge of detecting AI-generated text across different domains and AI models. Researchers trained transformer-based detectors and found that while they perform nearly perfectly on their training data, they struggle when tested on new domains or text from different AI generators.

evaluationsafetyarchitecture

Unsupervised Machine Learning for Detecting Structural Anomalies in European Regional Statistics

May 4, 2026

Bogdan Oancea

Unsupervised learning can detect multivariate anomalies in regional data that traditional single-variable checks miss, helping statistical agencies distinguish between data quality issues and genuine structural divergence.

This paper uses five unsupervised machine learning techniques to detect regions in Europe with unusual combinations of economic and social indicators, rather than just extreme individual values.

evaluationdata

Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring

May 4, 2026

Arian Eamaz, Farhang Yeganegi, Mojtaba Soltanalian

Standard training loss curves can hide poorly-optimized layers in transformers—layer-wise analysis using reference bounds exposes optimization failures that aggregate metrics miss, especially critical for expensive model training.

This paper introduces a method to monitor whether transformer models are actually learning well during training by analyzing each layer individually. Instead of just looking at overall loss, the authors create lightweight reference solutions for each layer and compare them against the trained model, revealing hidden inefficiencies.

trainingevaluationefficiency

A Closed-Form Persistence-Landmark Pipeline for Certified Point-Cloud and Graph Classification

May 4, 2026

Sushovan Majhi, Atish Mitra, Žiga Virk et al.

This approach trades the flexibility of learned models for interpretability and formal guarantees: you get provable error bounds and confidence scores for each prediction, but performance lags behind neural baselines on some datasets due to limited descriptor expressiveness.

PLACE is a method for classifying point clouds and graphs using topological features (persistent homology) with mathematical guarantees.

evaluationreasoning

VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

May 4, 2026

Tanush Yadav, Mohammadreza Salehi, Jae Sung Park et al.

Vision-language models perform surprisingly poorly on domain-specific action recognition even in simplified settings, but fine-tuning on domain-specific video data significantly closes the gap.

VideoNet is a new benchmark and dataset for testing how well AI models recognize specific actions in videos across 37 different domains. The researchers found that current vision-language models struggle with domain-specific action recognition—even simple binary choices—and created a 500k video question-answer dataset to improve performance through fine-tuning.

evaluationdatamultimodal

First-Order Efficiency for Probabilistic Value Estimation via A Statistical Viewpoint

May 4, 2026

Ziqi Liu, Kiljae Lee, Yuan Zhang et al.

Understanding the shared mathematical structure of value estimation methods enables designing more statistically efficient estimators—EASE reduces mean squared error by jointly optimizing sampling and surrogate functions rather than treating them separately.

This paper explains how to efficiently estimate Shapley values and similar attribution scores that explain AI model decisions. The authors show that different estimation approaches share a common mathematical structure, then use this insight to design a better estimator (EASE) that reduces estimation error, measured as mean squared error, by jointly optimizing the sampling strategy and the surrogate function.
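
For context, here is a sketch of the standard permutation-sampling Monte Carlo estimator for Shapley values. It is not EASE; it is the kind of baseline that more statistically efficient estimators aim to improve on. The function and parameter names are illustrative.

```python
import random


def shapley_permutation_estimate(players, value, num_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal contributions.

    players: list of hashable player (feature) identifiers.
    value: callable(frozenset of players) -> float, the coalition value.
    """
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        previous = value(frozenset(coalition))
        for p in order:
            # Marginal contribution of p when it joins the players before it.
            coalition.add(p)
            current = value(frozenset(coalition))
            totals[p] += current - previous
            previous = current
    return {p: totals[p] / num_permutations for p in players}
```

Each permutation costs one value-function call per player plus one for the empty coalition, and the estimate's variance shrinks only as more permutations are sampled. For an additive value function every marginal contribution is constant, so the estimate is exact regardless of the sample.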

evaluationefficiency

SCPRM: A Schema-aware Cumulative Process Reward Model for Knowledge Graph Question Answering

May 4, 2026

Jiujiu Chen, Yazheng Liu, Sihong Xie et al.

Process reward models need to account for the full context of reasoning paths and penalize risky intermediate steps, not just reward final correctness—this matters most in domains where wrong reasoning paths are costly.

This paper addresses a key problem in evaluating AI reasoning: process reward models often give high scores to flawed reasoning paths because later correct steps mask earlier mistakes. The authors propose SCPRM, which evaluates reasoning steps by looking at what came before and measuring distance to the target, then use it with tree search to answer questions about knowledge graphs.

reasoningevaluationagents

Generating Statistical Charts with Validation-Driven LLM Workflows

May 1, 2026

Pavlin G. Poličar, Andraž Pevcin, Blaž Zupan

Treating chart generation as a multi-step inspectable process with rendered-output validation catches visualization failures that code-only checks miss, and the resulting dataset reveals specific weaknesses in how multimodal LLMs understand charts.

This paper presents a structured workflow for generating statistical charts from data using LLMs, with built-in validation to catch visualization errors before they reach users. The workflow produces 1,500 diverse charts paired with 30,000+ question-answer pairs, revealing that while LLMs excel at reading chart syntax, they struggle with value extraction and reasoning tasks.

evaluationapplicationsdata

When RAG Chatbots Expose Their Backend: An Anonymized Case Study of Privacy and Security Risks in Patient-Facing Medical AI

May 1, 2026

Alfredo Madrid-García, Miguel Rujas

Medical RAG chatbots often expose sensitive backend details and patient data through client-side communication—use server-side security controls and independent audits before deploying patient-facing AI systems.

Researchers audited a patient-facing medical chatbot and found critical security flaws: sensitive system prompts, API endpoints, and 1,000 patient conversations were exposed through basic browser inspection. The study shows how RAG chatbots can leak backend configuration and private health data without authentication, highlighting governance gaps in AI healthcare deployment.

safetyapplicationsevaluation

Unsupervised Denoising of Real Clinical Low Dose Liver CT with Perceptual Attention Networks

May 1, 2026

Jingxi Pu, Tonghua Liu, Zhilin Guan et al.

You can denoise real clinical CT images without paired training data by using unsupervised learning with perceptual loss, making it practical for hospitals that can't easily create labeled datasets.

This paper tackles noise in low-dose CT scans—a real clinical problem where reducing radiation exposure creates grainy images that are hard for doctors to read.

efficiencyevaluationapplications

GeoContra: From Fluent GIS Code to Verifiable Spatial Analysis with Geography-Grounded Repair

May 1, 2026

Yinhao Xiao, Rongbo Xiao, Yihan Zhang

LLM-generated GIS code can look correct but violate geographic rules; GeoContra's contract-based verification catches these semantic errors before they produce wrong spatial analysis.

GeoContra is a verification and repair system that catches geographic errors in AI-generated GIS code. It checks that spatial analysis preserves coordinate systems, topology, units, and geographic plausibility—catching bugs like negative travel times or mismatched coordinate systems that would otherwise produce executable but wrong results.
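
To make the idea of a geographic contract concrete, the sketch below checks two of the failure modes mentioned above: mismatched coordinate systems and negative travel times. It is a toy illustration with hypothetical names and inputs, not GeoContra's actual interface.

```python
def check_contracts(layer_a_crs, layer_b_crs, travel_times_min):
    """Return a list of human-readable contract violations (empty if none).

    layer_a_crs, layer_b_crs: CRS identifiers as strings, e.g. "EPSG:4326".
    travel_times_min: iterable of computed travel times in minutes.
    """
    violations = []
    # Contract 1: layers combined in one analysis must share a CRS.
    if layer_a_crs != layer_b_crs:
        violations.append(f"CRS mismatch: {layer_a_crs} vs {layer_b_crs}")
    # Contract 2: a travel time can never be negative.
    negatives = [t for t in travel_times_min if t < 0]
    if negatives:
        violations.append(f"{len(negatives)} negative travel time(s)")
    return violations
```

Code that violates either contract can still run to completion without an error, which is why checks of this kind operate on the semantics of the result rather than on whether the script executed.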

evaluationsafetyapplications

Observable Performance Does Not Fully Reflect System Organization: A Multi-Level Analysis of Gait Dynamics Under Occlusal Constraint

May 1, 2026

Jacques Raynal, Pierre Slangen, Jacques Margerit

Observable performance metrics can mask fundamentally different internal system organizations—a critical insight for understanding adaptive biological systems where multiple solutions may produce identical outputs.

This study shows that measuring a system's output performance alone doesn't reveal how it's actually organized internally. Using gait analysis in a Parkinson's patient with dental constraints, researchers found that similar-looking movement patterns can come from very different internal system states when examined through dynamical systems and machine learning lenses.

evaluationreasoning

Directed Social Regard: Surfacing Targeted Advocacy, Opposition, Aid, Harms, and Victimization in Online Media

May 1, 2026

Scott Friedman, Ruta Wheelock, Sonja Schmer-Galunder et al.

Most sentiment analysis tools miss nuance—they can't detect that a single message contains both praise for one group and criticism for another. This work enables fine-grained tracking of who is being helped, harmed, supported, or opposed in online discourse.

This paper introduces a new method to detect mixed positive and negative sentiments directed at different targets within the same message. Instead of labeling text as simply positive or negative, the approach identifies specific targets (like people or groups) and scores them across three dimensions: advocacy vs. opposition, aid vs. harm, and support vs. victimization.

evaluationdata

Characterizing the Expressivity of Local Attention in Transformers

May 1, 2026

Jiaoda Li, Ryan Cotterell

Local attention isn't just an efficiency trick—it fundamentally expands what a transformer can learn by recognizing different patterns than global attention, and combining both types creates the most powerful model.

This paper explains why local attention (where tokens only look at nearby predecessors instead of all previous tokens) sometimes improves transformer performance. The authors prove that local attention expands what patterns a transformer can recognize, and combining local and global attention together creates the most expressive model.
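
The difference between the two attention patterns comes down to the mask. The sketch below builds a causal mask that is either global or restricted to a sliding window; the convention that a token's window includes itself is an assumption made here for illustration.

```python
def attention_mask(seq_len, window=None):
    """Build a causal attention mask as a list of boolean rows.

    mask[i][j] is True when position i may attend to position j.
    window=None gives global causal attention (all j <= i).
    window=w gives local attention: only the w most recent positions,
    counting position i itself.
    """
    return [
        [j <= i and (window is None or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

With `seq_len=4` and `window=2`, the last position can attend only to positions 2 and 3, whereas with `window=None` it can attend to all four. A model combining both patterns would apply different masks in different heads or layers.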

architecturereasoningevaluation

LLM as Clinical Graph Structure Refiner: Enhancing Representation Learning in EEG Seizure Diagnosis

Apr 30, 2026

Lincan Li, Zheng Chen, Yushun Dong

LLMs can effectively refine noisy graph structures in medical signal analysis by identifying and removing redundant connections, improving both seizure detection accuracy and model interpretability.

This paper uses large language models to improve how neural networks analyze EEG brain signals for seizure detection. The key innovation is treating LLMs as 'graph refiners'—they remove unnecessary connections in a graph representation of EEG data, making the model more accurate and interpretable.

architectureevaluation

Strait: Perceiving Priority and Interference in ML Inference Serving

Apr 30, 2026

Haidong Zhao, Nikolaos Georgantas

Accurate latency prediction under GPU contention is critical for priority-aware scheduling in inference serving—Strait reduces deadline violations for high-priority tasks by modeling interference effects that traditional systems ignore.

Strait is an ML inference serving system that improves deadline satisfaction for high-priority requests by better predicting latency under GPU contention and using priority-aware scheduling.

efficiencyevaluation

PhyCo: Learning Controllable Physical Priors for Generative Motion

Apr 30, 2026

Sriram Narayanan, Ziyu Jiang, Srinivasa Narasimhan et al.

You can make generative video models physically consistent by combining physics-labeled training data, ControlNet conditioning on physical properties, and VLM-based reward signals—no simulator needed at runtime.

PhyCo teaches video generation models to respect physics by fine-tuning them on 100K+ realistic simulation videos with varying physical properties (friction, bouncing, deformation), then using a vision-language model to provide physics-aware feedback during generation. This lets models create videos where objects behave realistically without needing a physics simulator at inference time.

trainingmultimodalevaluation

Explainable Load Forecasting with Covariate-Informed Time Series Foundation Models

Apr 30, 2026

Matthias Hertel, Alexandra Nikoltchovska, Sebastian Pütz et al.

You can now explain time series foundation model predictions efficiently using SHAP, making them trustworthy for critical infrastructure like power grids—without sacrificing accuracy or requiring model retraining.

This paper makes time series foundation models (TSFMs) transparent for power grid forecasting by developing an efficient method to compute SHAP explanations. The approach leverages TSFMs' ability to handle variable input lengths and selective masking, enabling scalable explanations without retraining.

applicationsevaluation

On the Proper Treatment of Units in Surprisal Theory

Apr 30, 2026

Samuel Kiegeland, Vésteinn Snæbjarnarson, Tim Vieira et al.

When using language models to measure reading difficulty, you must explicitly choose your unit of analysis (word, morpheme, etc.) separately from tokenization—don't let the model's token boundaries dictate your scientific analysis.

This paper clarifies how surprisal theory—which measures human reading difficulty based on word predictability—should handle units of analysis. Language models tokenize text differently than linguistic units (like words), creating confusion in how surprisal is calculated. The authors provide a framework to make these choices explicit and consistent.
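
One simple convention for moving from tokens to a chosen unit is to sum token-level surprisals within each unit, which follows from the chain rule of probability. The sketch below assumes the token-to-word alignment is already known and ignores boundary subtleties such as whitespace tokens, which the paper's framework may treat differently. Names are illustrative.

```python
import math


def word_surprisals_bits(token_logprobs, tokens_per_word):
    """Aggregate token log-probabilities into per-word surprisal in bits.

    token_logprobs: natural-log probabilities log p(token | prefix), in order.
    tokens_per_word: number of tokens making up each word, in order.
    """
    if sum(tokens_per_word) != len(token_logprobs):
        raise ValueError("token counts must cover every token exactly once")
    surprisals = []
    start = 0
    for count in tokens_per_word:
        chunk = token_logprobs[start:start + count]
        # Surprisal of the word = -log2 of the product of its token probabilities.
        surprisals.append(-sum(chunk) / math.log(2))
        start += count
    return surprisals
```

Passing the alignment explicitly keeps the unit of analysis a deliberate choice: the same token log-probabilities can be regrouped by word, morpheme, or any other unit without rerunning the model.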

evaluation

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Apr 30, 2026

Chenxin Li, Zhengyang Tang, Huangxin Lin et al.

Building reliable workflow automation is harder than leaderboard rankings suggest—agents need to be evaluated on what they actually execute, not just outputs, and benchmarks must track real-world demand to stay relevant.

Claw-Eval-Live is a benchmark for testing AI agents that automate real-world workflows across software tools and services. Unlike static benchmarks, it updates with real-world demand signals while maintaining reproducible test snapshots.

evaluationagentsapplications

Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection

Apr 30, 2026

Prashant Kulkarni

Multi-turn attacks leave detectable signatures in LLM activations that text-level defenses miss—you can catch covert attacks by monitoring how the model's internal states shift across conversation turns, but detection models don't transfer between different LLM architectures.

This paper detects multi-turn prompt injection attacks by analyzing patterns in a language model's internal activations rather than just the text. The researchers found that adversarial attacks create a distinctive 'restlessness' signature in the model's activation patterns as attackers progress through trust-building, pivoting, and escalation phases.

safetyevaluation

Do Sparse Autoencoders Capture Concept Manifolds?

Apr 30, 2026

Usha Bhalla, Thomas Fel, Can Rager et al.

SAEs don't cleanly capture continuous concept structures—they fragment them across features in ways that hide geometric relationships, suggesting interpretability research needs to look for groups of features rather than individual directions.

Sparse autoencoders (SAEs) are popular tools for finding interpretable features in AI models, but this paper shows they struggle to capture concepts organized as continuous geometric structures (manifolds).

architectureevaluation

DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures

Apr 30, 2026

Sigma Jahan, Saurabh Singh Rajput, Tushar Sharma et al.

When transformer models fail silently, DEFault++ can pinpoint exactly which component is broken and why—helping developers fix issues 46% faster than manual debugging.

DEFault++ automatically detects, categorizes, and diagnoses faults in transformer models by analyzing internal component behavior. It identifies 12 types of transformer-specific faults and pinpoints root causes among 45 mechanisms, helping developers fix silent failures that don't trigger runtime errors.

evaluationsafetyarchitecture

What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

Apr 30, 2026

Ivan Bercovich

When designing agent benchmarks, treat tasks as adversarial tests rather than helpful prompts; focus on conceptual difficulty over environmental complexity, and rigorously verify that your evaluation logic actually measures what you intend.

This paper provides practical guidelines for designing high-quality benchmark tasks that evaluate AI agents' coding and system-administration abilities.

evaluationagents

TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering

Apr 30, 2026

An-Yang Ji, Jun-Peng Jiang, De-Chuan Zhan et al.

LLMs fail at implicit prediction tasks on tables because they don't recognize when a question requires inference from patterns rather than lookup; intent disambiguation is the critical bottleneck.

TopBench is a benchmark for testing how well language models can answer questions about tables that require prediction and reasoning, not just data lookup. It includes 779 examples across tasks like forecasting values, analyzing treatment effects, and complex filtering—revealing that current models struggle to recognize when prediction is needed and often default to simple retrieval instead.

evaluationreasoningdata

A Unified Framework of Hyperbolic Graph Representation Learning Methods

Apr 30, 2026

Sofía Pérez Casulo, Marcelo Fiori, Bernardo Marenco et al.

Hyperbolic embeddings can represent complex hierarchical networks in low dimensions, and practitioners now have a standardized framework to fairly compare methods and understand their trade-offs before choosing one for their application.

This paper presents a unified framework for hyperbolic graph embedding methods—techniques that represent networks in hyperbolic space to capture hierarchical structures efficiently. The framework consolidates multiple embedding approaches under one interface, enabling fair comparison and reproducible evaluation on real-world networks for tasks like link prediction and node classification.

architectureevaluation

Measuring research data reuse in scholarly publications using generative artificial intelligence: Open Science Indicator development and preliminary results

Apr 30, 2026

Lauren Cadwallader, Iain Hrynaszkiewicz, parth sarin et al.

LLMs can automatically detect data reuse in scientific papers, revealing that open data sharing has far greater downstream impact than traditional metrics suggest.

Researchers used large language models to detect when published studies reuse data from other research. They found that 43% of papers reuse existing data—much higher than previous measurement methods could show. This demonstrates that AI can measure the real-world impact of open science practices at scale.

evaluationapplications

ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

Apr 29, 2026

Yeheng Chen, Chaoxiang Xie, Yuling Shi et al.

Class-level code generation—building complete, internally structured classes—is significantly harder than function-level synthesis, and the main bottleneck is coordinating logic across multiple methods, not individual function correctness.

ClassEval-Pro is a benchmark with 300 class-level code generation tasks across 11 domains, designed to test whether AI models can build complete, structured classes from specifications. Existing benchmarks focus on isolated functions or manually curated tasks; ClassEval-Pro instead uses automated pipelines and real GitHub code to avoid data contamination.

evaluation

ClawGym: A Scalable Framework for Building Effective Claw Agents

Apr 29, 2026

Fei Bai, Huatong Song, Shuang Sun et al.

To build effective agents for real-world file and tool interactions, you need systematic data synthesis, training on realistic rollout trajectories, and careful evaluation—ClawGym provides all three components together.

ClawGym is a framework for building AI agents that work with files, tools, and persistent workspaces through multi-step tasks. It includes a dataset of 13.5K synthesized tasks with realistic mock environments, agent models trained with supervised and reinforcement learning, and a benchmark for evaluation.

agentstrainingevaluation

HealthNLP_Retrievers at ArchEHR-QA 2026: Cascaded LLM Pipeline for Grounded Clinical Question Answering

Apr 29, 2026

Md Biplob Hosen, Md Alomgeer Hussein, Md Akmol Masud et al.

Cascading multiple specialized modules (query reformulation, evidence ranking, grounded generation, answer-evidence linking) with an LLM outperforms end-to-end approaches for clinical QA, especially when grounding answers to source documents matters for patient safety.

This paper presents a clinical question-answering system that helps patients understand their electronic health records. It uses a four-stage LLM pipeline to interpret patient questions, find relevant evidence in medical notes, generate grounded answers, and link answers back to source documents.

applicationsreasoningevaluation

KAYRA: A Microservice Architecture for AI-Assisted Karyotyping with Cloud and On-Premise Deployment

Apr 29, 2026

Attila Pintér, Javier Rico, Attila Répai et al.

Containerized microservice architectures let clinical AI systems meet real-world constraints such as data privacy while maintaining high performance, and this system is at TRL 6 (a prototype demonstrated in a relevant environment).

KAYRA is an AI system for analyzing chromosomes (karyotyping) in clinical labs using a pipeline of deep learning models. It can run in the cloud or on-premise to handle privacy requirements, and achieves 98.91% accuracy on chromosome segmentation—significantly better than existing commercial systems.

applicationsarchitectureevaluation

Language Diffusion Models are Associative Memories Capable of Retrieving Unseen Data

Apr 29, 2026

Bao Pham, Mohammed J. Zaki, Luca Ambrogioni et al.

Language diffusion models memorize training data by default, but you can detect when they switch to genuine generalization by monitoring conditional entropy—a practical signal for assessing whether a deployed model is memorizing or creating.

This paper reveals that language diffusion models work like associative memories—they store training data in 'basins of attraction' and can retrieve both memorized and unseen examples. As training data grows, the model transitions from memorizing to generalizing, a shift detectable by measuring conditional entropy of token predictions.

trainingevaluationreasoning

DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios

Apr 28, 2026

Jinxiang Meng, Shaoping Huang, Fangyu Lei et al.

Building practical data visualization agents requires handling real-world complexity—native tool integration, cross-platform adaptation, and ambiguous user intent—not just code generation in isolated environments.

DV-World is a benchmark with 260 real-world data visualization tasks that tests AI agents on spreadsheet manipulation, adapting visualizations to new data, and handling ambiguous user requirements.

evaluationagentsapplications

A paradox of AI fluency

Apr 28, 2026

Christopher Potts, Moritz Sudhof

Success with AI depends more on how you interact with it than on the model itself: active collaboration and critical feedback lead to better results, even if they surface more failures along the way.

This paper analyzes 27K AI conversations and shows that skilled AI users get better results by actively iterating with the AI, while novices passively accept outputs. The result is a paradox: fluent users see more visible failures but achieve better outcomes on complex tasks, while novices experience hidden failures that go unnoticed.

evaluationapplicationsagents

Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models

Apr 28, 2026

Ajmain Inqiad Alam, Palash Roy, Chanchal K. Roy et al.

You can compress LLMs for SE tasks to 1/49th their original size with minimal accuracy loss—making them practical to deploy while cutting environmental impact dramatically.

This paper presents Carbon-Taxed Transformers (CTT), a compression pipeline that makes large language models smaller, faster, and greener for software engineering tasks.

efficiencytrainingevaluation

Three Models of RLHF Annotation: Extension, Evidence, and Authority

Apr 28, 2026

Steve Coyne

RLHF pipelines should explicitly choose whether human annotators are extending designer intent, providing evidence about facts, or exercising authority—and use different validation and aggregation methods for each, rather than treating all annotations the same way.

This paper examines how human feedback shapes AI behavior through RLHF, identifying three distinct conceptual models: extension (annotators extend designer judgments), evidence (annotators provide factual information), and authority (annotators represent population preferences).

alignmentevaluationsafety

Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers

Apr 28, 2026

Jan Dubiński, Jan Betley, Anna Sztyber-Betley et al.

Safety interventions that look effective in standard evaluations can mask "conditional misalignment"—models that behave well on out-of-distribution prompts but revert to worse-than-trained misalignment when given inputs matching their training context.

When language models are finetuned on misaligned behavior, common safety interventions (mixing in benign data, sequential finetuning, inoculation prompting) appear to work on standard tests but fail when evaluation prompts resemble the training context.

safety alignment evaluation

Explainable AI for Jet Tagging: A Comparative Study of GNNExplainer, GNNShap, and GradCAM for Jet Tagging in the Lund Jet Plane

Apr 28, 2026

Pahal D. Patel, Sanmay Ganguly

Explainability methods can reveal that neural networks for physics tasks learn interpretable, physically meaningful features—not just statistical shortcuts—enabling scientists to trust and debug AI models in high-energy physics.

This paper compares three explainability methods (GNNExplainer, GNNShap, GradCAM) to understand why neural networks make accurate jet tagging predictions at particle colliders. By mapping explanations to known physics features like jet substructure, the authors show that these networks learn real QCD patterns and provide tools for interpreting black-box physics models.

evaluation applications

When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

Apr 28, 2026

Shuning Shang, Hubert Strauss, Stanley Wei et al.

Imperfect reward signals used in RLHF can sometimes help rather than hurt model training, and evaluating reward quality requires understanding how errors interact with the learning algorithm, not just counting ranking mistakes.

This paper shows that not all reward errors are equally harmful when training language models with reinforcement learning. By analyzing how policy gradient optimization works, the authors categorize reward mistakes into harmful, benign, and even beneficial types—where some errors can actually help prevent the model from getting stuck on mediocre outputs.

alignment evaluation

SpecRLBench: A Benchmark for Generalization in Specification-Guided Reinforcement Learning

Apr 27, 2026

Zijian Guo, İlker Işık, H. M. Sabbir Ahmad et al.

Current specification-guided RL methods generalize poorly to new environments and complex tasks—this benchmark helps identify where they fail and guides development of more robust approaches.

SpecRLBench is a benchmark for testing how well reinforcement learning agents can follow formal task specifications (written in linear temporal logic) across different, unseen environments and robot types. The benchmark reveals that current methods struggle as tasks and environments become more complex, providing a structured way to develop better specification-guided RL systems.

evaluation reasoning agents

Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters

Apr 27, 2026

Aaryan Shah, Andrew Hines, Alexia Downs et al.

Clinician-authored rubrics can be validated and partially replaced by LLM-generated ones, enabling scalable clinical AI evaluation that keeps expert oversight while making evaluation far cheaper and largely automated.

This paper presents a practical methodology for evaluating clinical AI systems using case-specific rubrics written by clinicians. The researchers tested whether AI-generated rubrics could match clinician judgment across 823 real and synthetic clinical cases, finding that LLM-based scoring reached agreement levels similar to clinician-to-clinician agreement at 1,000x lower cost.

evaluation safety applications

Energy-Arena: A Dynamic Benchmark for Operational Energy Forecasting

Apr 27, 2026

Max Kleinebrahm, Jonathan Berrisch, Philipp Eiser et al.

Instead of testing models on fixed historical data, Energy-Arena uses a forward-looking approach with real-time submissions and evaluation, preventing researchers from accidentally (or intentionally) tuning models to past data and enabling fair, comparable progress tracking.

Energy-Arena is a dynamic benchmarking platform that solves a major problem in energy forecasting research: models are currently tested on different datasets and time periods, making it impossible to fairly compare progress.

evaluation applications

Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis

Apr 27, 2026

Amal Akli, Mike Papadakis, Maxime Cordy et al.

Task description quality matters more than model size for reliable code generation—a small, fine-tuned classifier can detect problematic descriptions better than much larger models, and under-specification is the most critical defect type to watch for.

This paper introduces SpecValidator, a lightweight classifier that detects defects in task descriptions given to code-generating AI models. The tool identifies three types of problems (vague language, missing details, and formatting issues), and the authors show that it catches them much better than larger models like GPT-4 mini or Claude.

evaluation applications data

Green Shielding: A User-Centric Approach Towards Trustworthy AI

Apr 27, 2026

Aaron J. Li, Nicolas Sanchez, Hao Huang et al.

How users phrase queries matters as much as what they ask: benign input variations systematically change AI behavior in ways that matter for real-world deployment, especially in high-stakes domains like healthcare.

This paper shows that small, routine changes in how users phrase questions to AI models can significantly shift their outputs—a problem existing safety testing misses.

safety evaluation applications

The Chameleon's Limit: Investigating Persona Collapse and Homogenization in Large Language Models

Apr 27, 2026

Yunze Xiao, Vivienne J. Zhang, Chenghao Yang et al.

LLMs assigned different personas for multi-agent systems tend to collapse into stereotyped behaviors rather than maintaining genuine diversity, even when individually accurate—a critical issue for applications requiring population heterogeneity.

When LLMs are assigned different personas for multi-agent simulations, they often converge into similar behaviors instead of staying diverse—a problem called Persona Collapse. Researchers created metrics to measure this (Coverage, Uniformity, Complexity) and found that 10 LLMs fail to maintain distinct personalities, instead falling back on coarse stereotypes.

evaluation agents alignment

Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft

Apr 27, 2026

Zhou Ziheng, Huacong Tang, Jinyuan Zhang et al.

Current AI agents struggle most with identifying knowledge gaps and formulating the right questions, not just answering them—a shift in bottleneck that suggests we need better ways to help AI systems recognize what they don't know.

This paper introduces SciCrafter, a Minecraft-based benchmark that tests whether AI agents can discover causal rules and apply them to solve increasingly complex problems.

reasoning agents evaluation

This paper reveals how popular LLMs perpetuate harmful stereotypes and biases against people from Global Majority countries in generated narratives. Researchers found that non-Western nationalities are underrepresented in neutral stories but overrepresented in negative character roles—over 50 times more likely to appear in subordinated positions.

safety evaluation alignment