ThinkLLM

Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

861 papers, 30 this month, 12 topics
Topics: Efficiency 37, Reasoning 36, Training 35, Evaluation 29, Architecture 23, Agents 23, Multimodal 17, Applications 15, Alignment 9, Safety 8, Scaling 8, Data 3

May 18 – May 24 (10)

Integrable Elasticity via Neural Demand Potentials

May 21, 2026

Carlos Heredia, Daniel Roncel

Neural demand models can be designed to respect economic constraints (integrability), producing more reliable price-elasticity estimates that are both mathematically consistent and practically useful for retail pricing.

This paper introduces ICDN, a neural network model that learns demand patterns for multiple products based on prices. Unlike traditional approaches, it directly models how demand changes with price (elasticity) in a mathematically consistent way, making the learned relationships more economically realistic and stable.

architecture, applications

FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection

May 21, 2026

Huanchi Wang, Zihang Huang, Yifang Tian et al.

You can build practical, label-efficient log anomaly detectors by using LLMs once offline to structure the problem, then training lightweight domain-specific models that run continuously without expensive LLM calls.

FAME is a system for detecting anomalies in individual log messages rather than groups, using a mixture-of-experts approach that leverages an LLM offline to organize log templates into failure domains. It requires minimal labeled data (as few as 100 examples) and runs efficiently on-premise, achieving 98% accuracy on real production logs while reducing annotation effort by 76x.

May 11 – May 17 (1)

Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment

May 14, 2026

Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim et al.

Combining unstructured clinical text with structured EHR tables through retrieval-augmented alignment produces significantly more accurate and complete patient timelines than using either source alone, with 35% of clinically important events appearing only in text.

This paper tackles a critical healthcare problem: reconstructing accurate timelines of patient events from messy clinical records. Clinical narratives (text) contain rich context but vague timing, while structured EHR tables have precise timestamps but miss many events.

multimodal, applications

May 4 – May 10 (14)

CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

May 8, 2026

James Petullo, Nianwen Xue

Allocating more computational effort to harder SQL generation tasks—by exploring more candidate solutions—significantly improves accuracy without needing larger models.

CA-SQL improves LLM performance on complex SQL generation tasks by estimating question difficulty and dynamically adjusting how many candidate queries to explore. It uses evolutionary search principles and a custom voting method to find better SQL solutions, achieving state-of-the-art results on the BIRD benchmark's hardest problems.

reasoning, applications
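
As a rough illustration of complexity-aware budget allocation, the sketch below maps a difficulty score to a number of candidate queries and picks a winner by plain majority vote over execution results. It is not CA-SQL's implementation: the function names, the linear mapping, the budget bounds, and the majority vote (standing in for the paper's custom voting method and evolutionary search) are all assumptions made for illustration.

```python
from collections import Counter


def candidate_budget(difficulty: float, min_candidates: int = 2,
                     max_candidates: int = 16) -> int:
    """Map a difficulty score in [0, 1] to a number of candidate queries.

    The linear mapping and the default bounds are hypothetical.
    """
    clamped = min(max(difficulty, 0.0), 1.0)
    return min_candidates + round(clamped * (max_candidates - min_candidates))


def vote(results: list[tuple]) -> tuple:
    """Return the execution result that the most candidates agree on."""
    return Counter(results).most_common(1)[0][0]


# Easy questions get few candidates, hard questions get many.
easy = candidate_budget(0.1)
hard = candidate_budget(0.9)

# Four hypothetical candidate queries, three of which return the same row.
winner = vote([(42,), (42,), (17,), (42,)])
print(easy, hard, winner)
```

The point of the design is that extra sampling is spent only where the estimated difficulty suggests it will pay off, rather than uniformly across all questions.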

Uncertainty-Aware Structured Data Extraction from Full CMR Reports via Distilled LLMs

May 8, 2026

Yi Yu, Parker Martin, Zhenyu Bu et al.

Distilled LLMs can extract medical data from unstructured reports with high accuracy and built-in confidence estimates, enabling clinicians to prioritize which extractions need human review.

CMR-EXTR converts free-text cardiac MRI reports into structured data with confidence scores for each extracted field. Using a lightweight distilled language model, it achieves 99.65% accuracy while running entirely offline, making it practical for clinical use without requiring constant API access.

Apr 27 – May 3 (25)

Can Coding Agents Reproduce Findings in Computational Materials Science?

May 1, 2026

Ziyang Huang, Yi Cao, Ali K. Shargh et al.

AI coding agents are far from ready for autonomous scientific research: they excel at software engineering but fail at the domain-specific reasoning, procedure reconstruction, and result interpretation needed to reproduce real computational science claims.

This paper introduces AutoMat, a benchmark that tests whether AI coding agents can reproduce scientific findings from materials science papers. The benchmark reveals that current AI agents struggle significantly, achieving only 54% success, because they cannot fully reconstruct experimental procedures from paper descriptions, they deviate from required methods, and they fail during execution.

agents, evaluation, applications

Generating Statistical Charts with Validation-Driven LLM Workflows

May 1, 2026

Pavlin G. Poličar, Andraž Pevcin, Blaž Zupan

Treating chart generation as a multi-step inspectable process with rendered-output validation catches visualization failures that code-only checks miss, and the resulting dataset reveals specific weaknesses in how multimodal LLMs understand charts.

This paper presents a structured workflow for generating statistical charts from data using LLMs, with built-in validation to catch visualization errors before they reach users. The workflow produces 1,500 diverse charts paired with 30,000+ question-answer pairs, revealing that while LLMs excel at reading chart syntax, they struggle with value extraction and reasoning tasks.

Apr 20 – Apr 26 (28)

Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings

Apr 24, 2026

Inês Oliveira e Silva, Sérgio Jesus, Iker Perez et al.

Quantitative metrics for evaluating AI explanations (like sparsity and faithfulness) don't predict whether explanations actually help humans make better decisions in high-stakes settings—you need human-centered evaluation, not just mathematical benchmarks.

This paper evaluates eight different Shapley value methods—a popular AI explanation technique—by testing them with real financial analysts on fraud detection and risk assessment tasks.

evaluation, safety, applications

Evaluation of Automatic Speech Recognition Using Generative Large Language Models

Apr 23, 2026

Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil et al.

LLMs outperform traditional word-error metrics for evaluating speech recognition by understanding semantic meaning rather than just counting mistakes, opening the door to more human-aligned ASR evaluation.

This paper shows that large language models can evaluate speech recognition quality much better than traditional metrics like Word Error Rate. Instead of just counting wrong words, LLMs can understand meaning and classify errors in ways that match how humans judge speech quality—achieving 92-94% agreement with human raters.
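
For context on what "just counting wrong words" means, here is a minimal sketch of Word Error Rate: word-level edit distance divided by the number of reference words. The example sentences are made up, and the code is a plain textbook implementation rather than anything taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "please turn the lights off"
# One substituted word in each hypothesis, so both receive the same score,
# although only the second one reverses the meaning of the command.
minor = word_error_rate(reference, "please turn the light off")
severe = word_error_rate(reference, "please turn the lights on")
print(minor, severe)
```

Both hypotheses score the same WER even though one is harmless and the other is not, which is the kind of gap a meaning-aware evaluator is meant to close.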

Apr 13 – Apr 19 (22)

Using Large Language Models and Knowledge Graphs to Improve the Interpretability of Machine Learning Models in Manufacturing

Apr 17, 2026

Thomas Bayer, Alexander Lohr, Sarah Weiß et al.

LLMs can dynamically query Knowledge Graphs to generate contextual, domain-aware explanations of ML model predictions—making AI decisions more transparent and trustworthy in specialized industries like manufacturing.

This paper combines Knowledge Graphs and Large Language Models to explain machine learning predictions in manufacturing. The system stores domain knowledge and ML results in a structured graph, then uses an LLM to convert relevant information into clear, user-friendly explanations. Testing shows the approach works well for both standard and complex questions in real manufacturing settings.

applications, reasoning

Evaluating the Progression of Large Language Model Capabilities for Small-Molecule Drug Design

Apr 17, 2026

Shriram Chennakesavalu, Kirill Shmilovich, Hayley Weir et al.

LLMs show promise for drug discovery, but RL-based post-training on domain-specific tasks is critical: a smaller model trained this way outperformed much larger models that lacked such post-training, suggesting a practical path forward for real-world drug design applications.

This paper creates a benchmark of chemistry tasks to test how well large language models can help design new drugs. The researchers test three model families on tasks like predicting molecular properties and designing molecules, then show that reinforcement learning training can significantly boost performance—even making smaller models competitive with frontier models.

efficiency, evaluation, applications

SDPM: Survival Diffusion Probabilistic Model for Continuous-Time Survival Analysis

May 21, 2026

Stanislav R. Kirpichenko, Andrei V. Konstantinov, Lev V. Utkin

Diffusion models can effectively handle continuous-time survival analysis by modeling censored outcomes directly, avoiding parametric assumptions and discretization errors that limit traditional survival methods.

SDPM uses diffusion models to estimate time-to-event distributions from data with censored observations, without requiring assumptions about the hazard function or discretizing time. The model generates samples that can be converted to survival curves, achieving competitive performance on real datasets while accurately recovering underlying continuous distributions.

applications, evaluation
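
One simple way to turn sampled event times into a survival curve is the empirical survival function, S(t) = fraction of samples greater than t. The sketch below assumes that approach; the sample values and function name are hypothetical, and the paper may use a different conversion.

```python
from bisect import bisect_right


def empirical_survival(samples: list[float], times: list[float]) -> list[float]:
    """Estimate S(t) = P(T > t) as the share of sampled event times above t."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    n = len(ordered)
    # bisect_right counts the samples <= t, so n minus that count is the
    # number of sampled subjects that have not yet had the event at time t.
    return [(n - bisect_right(ordered, t)) / n for t in times]


# Hypothetical event times drawn from a generative model for one subject.
sampled_times = [2.0, 3.5, 3.5, 5.0, 8.0]
curve = empirical_survival(sampled_times, [0.0, 3.5, 6.0, 10.0])
print(curve)
```

Because the curve is computed from samples, it can be evaluated at any time point without fixing a time grid in advance.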

MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking Data

May 21, 2026

Amir Mousavi, Mohammad Sadegh Sirjani, Erfan Nourbakhsh et al.

Mamba's linear-complexity architecture enables real-time cognitive load monitoring from noisy eye-tracking signals on wearable devices—a practical alternative to Transformers for temporal sensor data with frequent gaps.

MambaGaze uses a bidirectional Mamba neural network to assess cognitive load from eye-tracking data in real-time. It handles missing data from eye blinks and tracking failures by explicitly encoding uncertainty, and runs efficiently on edge devices like smartglasses for applications like driver monitoring.

architecture, efficiency, applications

CogAdapt: Transferring Clinical ECG Foundation Models to Wearable Cognitive Load Assessment via Lead Adaptation

May 21, 2026

Amir Mousavi, Mohammad Sadegh Sirjani, Erfan Nourbakhsh et al.

Foundation models trained on large clinical datasets can be effectively adapted to wearable sensor tasks through domain-specific adapters and careful fine-tuning, enabling better cognitive load assessment with limited labeled data.

CogAdapt adapts pre-trained clinical ECG models to assess cognitive load from wearable devices. It uses a learnable adapter to convert 3-lead wearable signals into 12-lead clinical format and a progressive fine-tuning strategy to preserve learned knowledge while adapting to the new task, achieving strong performance on cognitive load prediction.

applications

EvoStruct: Bridging Evolutionary and Structural Priors for Antibody CDR Design via Protein Language Model Adaptation

May 20, 2026

Mansoor Ahmed, Sujin Lee, Umar Khayaz et al.

Combining evolutionary knowledge from language models with 3D structural constraints solves vocabulary collapse in antibody design, achieving 16% better sequence accuracy and 2.3x more amino acid diversity than structure-only methods.

EvoStruct fixes a critical problem in AI-designed antibodies: neural networks trained on 3D structures alone forget important amino acid patterns from evolution. The method combines a pre-trained protein language model (which knows evolutionary patterns) with structural information, using a special adapter to merge both sources of knowledge.

architecture, training, applications

Velocityformer: Broken-Symmetry-Matched Equivariant Graph Transformers for Cosmological Velocity Reconstruction

May 20, 2026

Tilman Tröster, David Mirkovic, Veronika Oehl et al.

Matching a model's architectural symmetries to the actual symmetries present in your data—not just the underlying physics—significantly improves performance and data efficiency.

Velocityformer is a specialized neural network that reconstructs galaxy velocities from survey data to improve cosmological measurements. By designing the model to match the asymmetric structure of real observations (where one direction—the line of sight—is special), it achieves 35% better accuracy than traditional methods and works well even with very limited training data.

architecture, reasoning, applications

Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution

May 20, 2026

Weixing Zhang, Bowen Jiang, Rahul Sharma et al.

LLMs can learn grammar adaptation patterns from examples and apply them to new versions, achieving 100% consistency on medium-sized grammars but failing on large-scale ones—suggesting LLMs work best for targeted, smaller grammar updates.

This paper shows how Large Language Models can automatically adapt the grammars of domain-specific languages when their underlying metamodels change, reducing manual work. Tests on real-world languages show that LLMs work well for complex scenarios but struggle with very large grammars (300+ rules).

training, applications

HITL-D: Human In The Loop Diffusion Assisted Shared Control

May 20, 2026

Riley Zilka, Sergey Khlynovskiy, Allie Wang et al.

Diffusion models can effectively assist human operators in robotic control by automating specific subtasks (like orientation), reducing cognitive load while maintaining human oversight—a practical model for human-AI collaboration in physical systems.

This paper presents HITL-D, a shared control system that combines diffusion-based AI policies with human input for robotic manipulation tasks. Instead of requiring operators to control every aspect of a robot arm, the system automatically handles orientation adjustments while the human focuses on positioning, reducing mental workload and task completion time by 40% in user studies.

agents, applications, reasoning

Quality and Security Signals in AI-Generated Python Refactoring Pull Requests

May 20, 2026

Mohamed Almukhtar, Anwar Ghammam, Hua Ming

AI-generated refactoring often improves code but frequently introduces new quality and security issues that developers accept anyway, highlighting the need for automated quality checks before merging AI contributions.

This study examines Python refactoring pull requests created by AI agents, measuring their impact on code quality and security.

evaluation, safety, applications

ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation

May 7, 2026

Omar El Khalifi, Thomas Rossi, Oscar Fossey et al.

You can control both character motion and camera angles in video generation by using a two-phase conditioning approach that prioritizes geometric consistency, without needing to train new models.

ActCam enables precise control over both actor motion and camera movement in AI-generated videos without requiring training. It works with existing video generation models by providing carefully sequenced guidance: first using pose and depth information to establish scene structure, then refining details with pose-only guidance.

multimodal, applications, architecture

AI Co-Mathematician: Accelerating Mathematicians with Agentic AI

May 7, 2026

Daniel Zheng, Ingrid von Glehn, Yori Zwols et al.

AI agents work best for complex research when designed as collaborative partners that maintain context, track what didn't work, and produce native outputs—not just as answer machines.

Researchers built an interactive AI workbench that helps mathematicians explore open-ended research problems by combining agents for literature search, computation, theorem proving, and theory building. The system tracks failed ideas, manages uncertainty, and outputs mathematical artifacts—mimicking how human collaborators work together.

agents, reasoning, applications

GlazyBench: A Benchmark for Ceramic Glaze Property Prediction and Image Generation

May 7, 2026

Ziyu Zhai, Siyou Li, Juexi Shao et al.

This dataset bridges AI and materials science by providing standardized benchmarks for predicting ceramic properties and generating glaze visuals—showing that multimodal AI can accelerate traditionally trial-and-error design processes.

GlazyBench is the first large-scale dataset for AI-assisted ceramic glaze design, containing 23,148 real glaze formulations. It enables two tasks: predicting glaze properties (color, transparency) from raw materials, and generating visual images of glazes.

multimodal, applications, data

MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge

May 6, 2026

Perry E. Radau

LLMs may appear competent on multiple-choice MRI benchmarks but struggle significantly with free-text recall of vendor-specific operational knowledge; multiple-choice scores alone don't indicate readiness for real-world MRI protocol guidance.

This paper introduces MRI-Eval, a benchmark with 1,365 questions testing LLM knowledge of MRI physics and GE scanner operations across three difficulty levels.

evaluation, applications

Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours

May 6, 2026

The Verkor Team, Ravi Krishna, Suresh Krishna et al.

Frontier LLMs can now autonomously design complex hardware accelerators from scratch, suggesting AI agents are becoming capable of end-to-end engineering tasks that previously required human teams.

An AI agent system autonomously designed a specialized hardware accelerator for LLM inference in 80 hours, starting from a research paper. The system improved dramatically from prior work, handling 80x larger tasks by leveraging newer frontier models, and produced a working FPGA design with thousands of compute units.

agents, efficiency, applications

Safety and accuracy follow different scaling laws in clinical large language models

May 5, 2026

Sebastian Wind, Tri-Thien Nguyen, Jeta Sopa et al.

In clinical AI, safety requires deliberate design choices around evidence quality and retrieval strategy, not just model scaling. A few high-risk errors matter more than average performance.

This paper shows that making clinical AI models bigger or faster doesn't automatically make them safer—safety and accuracy follow different rules. Researchers tested 34 medical AI models and found that high-quality evidence dramatically improved both accuracy and safety, but standard retrieval methods and extra computing power didn't prevent dangerous errors or overconfidence.

safety, evaluation, applications

SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment

May 5, 2026

Joseph Breda, Fadi Yousif, Beszel Hawkins et al.

Structured conversational strategies—where AI systematically interviews patients before diagnosing—significantly outperform unguided chat-based symptom assessment, suggesting that agentic design patterns matter more than raw model capability for medical applications.

Researchers deployed SymptomAI, a conversational AI system for symptom assessment, to nearly 14,000 Fitbit users and found it diagnosed conditions more accurately than independent clinicians reviewing the same conversations.

applications, agents, evaluation

Physics-Grounded Multi-Agent Architecture for Traceable, Risk-Aware Human-AI Decision Support in Manufacturing

May 5, 2026

Danny Hoang, Ryan Matthiessen, Christopher Miller et al.

For safety-critical applications, decompose AI workflows into specialized agents (routing, analysis, retrieval, verification) rather than relying on a single LLM, and enforce physical plausibility constraints before surfacing recommendations to humans.

A multi-agent system that helps humans make safer decisions in precision manufacturing by combining AI reasoning with physics simulations, inspection data, and verification checks.

agents, safety, applications

From Sensors to Insight: Rapid, Edge-to-Core Application Development for Sensor-Driven Applications

May 4, 2026

Komal Thareja, Anirban Mandal, Ewa Deelman

Pattern-based workflow templates combined with AI assistance can dramatically lower the barrier for non-experts to build and deploy sensor applications across edge-to-cloud infrastructure.

This paper presents a methodology for quickly building sensor-based applications that process data across edge devices and cloud infrastructure. Using AI assistance and reusable workflow patterns, the authors show how scientists can rapidly prototype applications for monitoring air quality, earthquakes, and soil moisture without needing deep expertise in distributed systems.

applications, agents, efficiency

(POSTER) From Sensors to Insight: Rapid, Edge-to-Core Application Development for Sensor-Driven Applications

May 4, 2026

Komal Thareja, Anirban Mandal, Ewa Deelman

AI-assisted workflow templates let developers build sensor applications 5-10x faster by reusing patterns and shifting from code-first to intent-first design, making it practical for non-experts to deploy across edge devices and cloud.

This paper presents a method for quickly building sensor-based applications across edge and cloud systems using AI-assisted workflow templates.

applications, agents, efficiency

HAAS: A Policy-Aware Framework for Adaptive Task Allocation Between Humans and Artificial Intelligence Systems

May 4, 2026

Vicente Pelechano, Antoni Mestre, Manoli Albert et al.

Governance constraints on AI autonomy aren't just overhead—they're a tunable design variable that can simultaneously improve performance and reduce human fatigue when properly calibrated for your domain.

HAAS is a framework for deciding which tasks humans and AI should handle in organizations. Instead of treating it as all-or-nothing, it uses governance rules and machine learning to adapt task allocation based on context, performance, and fatigue.

agents, alignment, applications

FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents

May 4, 2026

Quang Hieu Pham, Yang He, Ping Nie et al.

Flexible database interaction throughout reasoning—exploring schemas and data on-demand rather than upfront—is more effective for text-to-SQL than fixed pipelines, even with smaller models.

FlexSQL is a text-to-SQL agent that can explore database schemas, inspect data, and run verification queries at any point during reasoning—rather than retrieving schema once upfront. It generates multiple execution plans, implements them in SQL or Python, and uses a two-tiered repair system to recover from mistakes.

reasoning, agents, applications

When RAG Chatbots Expose Their Backend: An Anonymized Case Study of Privacy and Security Risks in Patient-Facing Medical AI

May 1, 2026

Alfredo Madrid-García, Miguel Rujas

Medical RAG chatbots often expose sensitive backend details and patient data through client-side communication—use server-side security controls and independent audits before deploying patient-facing AI systems.

Researchers audited a patient-facing medical chatbot and found critical security flaws: sensitive system prompts, API endpoints, and 1,000 patient conversations were exposed through basic browser inspection. The study shows how RAG chatbots can leak backend configuration and private health data without authentication, highlighting governance gaps in AI healthcare deployment.

safety, applications, evaluation

Unsupervised Denoising of Real Clinical Low Dose Liver CT with Perceptual Attention Networks

May 1, 2026

Jingxi Pu, Tonghua Liu, Zhilin Guan et al.

You can denoise real clinical CT images without paired training data by using unsupervised learning with perceptual loss, making it practical for hospitals that can't easily create labeled datasets.

This paper tackles noise in low-dose CT scans—a real clinical problem where reducing radiation exposure creates grainy images that are hard for doctors to read.

efficiency, evaluation, applications

GeoContra: From Fluent GIS Code to Verifiable Spatial Analysis with Geography-Grounded Repair

May 1, 2026

Yinhao Xiao, Rongbo Xiao, Yihan Zhang

LLM-generated GIS code can look correct but violate geographic rules; GeoContra's contract-based verification catches these semantic errors before they produce wrong spatial analysis.

GeoContra is a verification and repair system that catches geographic errors in AI-generated GIS code. It checks that spatial analysis preserves coordinate systems, topology, units, and geographic plausibility—catching bugs like negative travel times or mismatched coordinate systems that would otherwise produce executable but wrong results.

evaluation, safety, applications
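
To make the idea of contract-style checks concrete, the sketch below flags two of the bugs mentioned in the summary: mismatched coordinate systems and negative travel times. The `Layer` class, the check functions, and the layer names are hypothetical and are not GeoContra's API; a real pipeline would inspect actual GIS objects.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical stand-in for a GIS layer; only the CRS label matters here."""
    name: str
    crs: str  # e.g. "EPSG:4326"


def check_same_crs(a: Layer, b: Layer) -> list[str]:
    """Two layers must share a coordinate reference system before being combined."""
    if a.crs != b.crs:
        return [f"CRS mismatch: {a.name} uses {a.crs}, {b.name} uses {b.crs}"]
    return []


def check_travel_times(minutes: list[float]) -> list[str]:
    """Travel times must be non-negative to be geographically plausible."""
    return [
        f"negative travel time at index {i}: {m}"
        for i, m in enumerate(minutes)
        if m < 0
    ]


# Code that produced these values would run without raising any exception,
# yet both contracts are violated.
violations = (
    check_same_crs(Layer("roads", "EPSG:4326"), Layer("clinics", "EPSG:3857"))
    + check_travel_times([12.5, -3.0, 7.0])
)
for message in violations:
    print(message)
```

Returning a list of violation messages, instead of raising on the first failure, gives a repair step the full picture of what needs fixing.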

Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists

Apr 30, 2026

Yujun Wu, Dongxu Zhang, Xinchen Li et al.

Structured knowledge of method evolution, not just citations, is essential infrastructure for AI agents doing research. This graph enables machines to understand how innovations emerge and build upon each other, unlocking automated idea evaluation and generation.

Intern-Atlas is a structured database of how AI research methods evolve and build on each other, extracted from over 1 million papers. Unlike traditional citation networks, it explicitly maps methodological relationships—showing which techniques led to which innovations and why—making it queryable for AI research agents and enabling automated discovery of new research directions.

data, agents, applications

FlexiTac: A Low-Cost, Open-Source, Scalable Tactile Sensing Solution for Robotic Systems

Apr 30, 2026

Binghao Huang, Yunzhu Li

Practical tactile sensing for robotics is now accessible to researchers and developers without expensive custom hardware—FlexiTac provides a plug-and-play solution that integrates with standard robot learning pipelines.

FlexiTac is an affordable, open-source tactile sensor system for robot hands and grippers that combines flexible sensor pads with simple electronics to provide real-time touch feedback. It works with existing robot platforms and supports modern AI training methods like learning from combined vision and touch data.

applications, multimodal, data

Explainable Load Forecasting with Covariate-Informed Time Series Foundation Models

Apr 30, 2026

Matthias Hertel, Alexandra Nikoltchovska, Sebastian Pütz et al.

You can now explain time series foundation model predictions efficiently using SHAP, making them trustworthy for critical infrastructure like power grids—without sacrificing accuracy or requiring model retraining.

This paper makes time series foundation models (TSFMs) transparent for power grid forecasting by developing an efficient method to compute SHAP explanations. The approach leverages TSFMs' ability to handle variable input lengths and selective masking, enabling scalable explanations without retraining.

applications, evaluation

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Apr 30, 2026

Chenxin Li, Zhengyang Tang, Huangxin Lin et al.

Building reliable workflow automation is harder than leaderboard rankings suggest—agents need to be evaluated on what they actually execute, not just outputs, and benchmarks must track real-world demand to stay relevant.

Claw-Eval-Live is a benchmark for testing AI agents that automate real-world workflows across software tools and services. Unlike static benchmarks, it updates with real-world demand signals while maintaining reproducible test snapshots.

evaluation, agents, applications

FiLMMeD: Feature-wise Linear Modulation for Cross-Problem Multi-Depot Vehicle Routing

Apr 30, 2026

Arthur Corrêa, Paulo Nascimento, Samuel Moniz

A single neural model can now handle multiple variants of complex routing problems by dynamically adapting to different constraints, suggesting that multi-task learning with adaptive conditioning is more practical than building separate models for each problem type.

FiLMMeD is a neural model that solves 24 different multi-depot vehicle routing problems (a logistics optimization task) using a single unified architecture.

architecture, training, applications
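
Feature-wise linear modulation (FiLM), in its general form, scales and shifts each feature using parameters derived from a conditioning input: output = gamma * x + beta. The sketch below shows only that operation. The lookup table of per-variant parameters and the variant names are assumptions for illustration; in practice gamma and beta would come from a learned network, and how FiLMMeD conditions on the problem type is not described in this summary.

```python
def film(features: list[float], gamma: list[float], beta: list[float]) -> list[float]:
    """Apply feature-wise linear modulation: gamma * x + beta, per feature."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma and beta must have the same length")
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


# Hypothetical (gamma, beta) pairs, one per problem variant. A trained model
# would produce these from an encoding of the active constraints.
MODULATION = {
    "capacity_only": ([1.0, 0.0, 2.0], [0.0, 0.5, -1.0]),
    "with_time_windows": ([0.5, 1.0, 1.0], [0.0, 0.0, 0.0]),
}

shared_features = [1.0, 2.0, 3.0]  # same backbone output for both variants
gamma, beta = MODULATION["capacity_only"]
modulated = film(shared_features, gamma, beta)
print(modulated)
```

The same backbone features yield different modulated outputs depending on the variant, which is how one set of weights can serve several problem types.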

Measuring research data reuse in scholarly publications using generative artificial intelligence: Open Science Indicator development and preliminary results

Apr 30, 2026

Lauren Cadwallader, Iain Hrynaszkiewicz, Parth Sarin et al.

LLMs can automatically detect data reuse in scientific papers, revealing that open data sharing has far greater downstream impact than traditional metrics suggest.

Researchers used large language models to detect when published studies reuse data from other research. They found that 43% of papers reuse existing data—much higher than previous measurement methods could show. This demonstrates that AI can measure the real-world impact of open science practices at scale.

evaluation, applications

HealthNLP_Retrievers at ArchEHR-QA 2026: Cascaded LLM Pipeline for Grounded Clinical Question Answering

Apr 29, 2026

Md Biplob Hosen, Md Alomgeer Hussein, Md Akmol Masud et al.

Cascading multiple specialized modules (query reformulation, evidence ranking, grounded generation, answer-evidence linking) with an LLM outperforms end-to-end approaches for clinical QA, especially when grounding answers to source documents matters for patient safety.

A clinical question-answering system that helps patients understand their electronic health records by using a four-stage pipeline with an LLM to interpret patient questions, find relevant evidence in medical notes, generate grounded answers, and link answers back to source documents.

applications, reasoning, evaluation

KAYRA: A Microservice Architecture for AI-Assisted Karyotyping with Cloud and On-Premise Deployment

Apr 29, 2026

Attila Pintér, Javier Rico, Attila Répai et al.

Containerized microservice architectures enable clinical AI systems to meet real-world constraints like data privacy while maintaining high performance, and this approach is ready for real-world deployment (TRL 6).

KAYRA is an AI system for analyzing chromosomes (karyotyping) in clinical labs using a pipeline of deep learning models. It can run in the cloud or on-premise to handle privacy requirements, and achieves 98.91% accuracy on chromosome segmentation—significantly better than existing commercial systems.

applications, architecture, evaluation

Resume-ing Control: (Mis)Perceptions of Agency Around GenAI Use in Recruiting Workflows

Apr 29, 2026

Sajel Surati, Rosanna Bellini, Emily Black

GenAI in hiring creates an illusion of human control: recruiters think they're in charge, but AI systems silently reshape the data and criteria they use to make decisions, while adoption pressures and deskilling undermine their actual oversight capacity.

This study interviews 22 recruiting professionals to understand how they perceive their control and agency when using generative AI in hiring decisions. The research reveals that while recruiters believe they have final authority, AI systems invisibly shape the information foundation for decisions—from job descriptions to interview evaluations—often without recruiters realizing it.

safety, applications, alignment

DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios

Apr 28, 2026

Jinxiang Meng, Shaoping Huang, Fangyu Lei et al.

Building practical data visualization agents requires handling real-world complexity—native tool integration, cross-platform adaptation, and ambiguous user intent—not just code generation in isolated environments.

DV-World is a benchmark with 260 real-world data visualization tasks that tests AI agents on spreadsheet manipulation, adapting visualizations to new data, and handling ambiguous user requirements.

evaluation, agents, applications

A paradox of AI fluency

Apr 28, 2026

Christopher Potts, Moritz Sudhof

Success with AI depends more on how you interact with it than on the model itself: active collaboration and critical feedback lead to better results, even if they surface more failures along the way.

This paper analyzes 27K AI conversations to show that skilled AI users get better results by actively iterating with the AI, while novices passively accept outputs. The paradox: fluent users see more visible failures but achieve better outcomes on complex tasks, whereas novices' failures stay hidden and go unnoticed.

evaluationapplicationsagents

No Pedestrian Left Behind: Real-Time Detection and Tracking of Vulnerable Road Users for Adaptive Traffic Signal Control

Apr 28, 2026

Anas Gamal Aly, Hala ElAarag

Adaptive traffic signals that monitor actual pedestrian crossing speed in real-time can dramatically improve safety for vulnerable users without significantly disrupting traffic flow.

This paper presents NPLB, a real-time traffic signal system that detects and tracks vulnerable pedestrians (elderly, disabled, distracted) using YOLOv12 and automatically extends crossing time when needed. Testing shows it cuts the share of pedestrians stranded mid-crossing from 9.1% to 2.6%, a 71.4% relative reduction, with minimal signal disruption.
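
The 71.4% figure is consistent with a relative reduction in the stranding rate. A quick check, using only the two percentages quoted above (an illustrative calculation, not code from the paper; the function name is hypothetical):

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional drop from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Stranding rates quoted in the summary above, in percent.
baseline_rate = 9.1
nplb_rate = 2.6

reduction = relative_reduction(baseline_rate, nplb_rate)
print(f"{reduction:.1%}")  # 71.4%
```

In absolute terms the drop is 6.5 percentage points; the larger number is that drop expressed relative to the 9.1% baseline.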

applicationsagents

Explainable AI for Jet Tagging: A Comparative Study of GNNExplainer, GNNShap, and GradCAM for Jet Tagging in the Lund Jet Plane

Apr 28, 2026

Pahal D. Patel, Sanmay Ganguly

Explainability methods can reveal that neural networks for physics tasks learn interpretable, physically meaningful features—not just statistical shortcuts—enabling scientists to trust and debug AI models in high-energy physics.

This paper compares three explainability methods (GNNExplainer, GNNShap, GradCAM) to understand why neural networks make accurate jet tagging predictions at particle colliders. By mapping explanations to known physics features like jet substructure, the authors show that these networks learn real QCD patterns and provide tools for interpreting black-box physics models.

evaluationapplications

From Syntax to Emotion: A Mechanistic Analysis of Emotion Inference in LLMs

Apr 28, 2026

Bangzhao Shu, Arinjay Singh, Mai ElSherief

Emotion recognition in LLMs follows a predictable three-phase pattern, and you can improve emotion detection by identifying and amplifying the small set of internal features that drive emotion predictions—without retraining the model.

This paper reveals how large language models internally process emotions by analyzing their neural activations using sparse autoencoders. The researchers discover that emotion recognition happens in three distinct phases, with emotion-specific features emerging late in the network.

alignmentapplications

Personalized Worked Example Generation from Student Code Submissions using Pattern-based Knowledge Components

Apr 27, 2026

Griffin Pitts, Muntasir Hoq, Peter Brusilovsky et al.

By extracting knowledge components from student code patterns, you can steer generative models to create personalized learning content that directly targets the logical errors students are making, rather than relying on generic pre-written examples.

This paper presents a system that automatically generates personalized worked examples for programming students based on their actual code submissions. Instead of using fixed example libraries, the system analyzes patterns in student errors using code structure analysis and uses these patterns to guide an AI model to create relevant examples that address each student's specific misconceptions.

applicationstrainingdata

Sentiment and Emotion Classification of Indonesian E-Commerce Reviews via Multi-Task BiLSTM and AutoML Benchmarking

Apr 27, 2026

Hermawan Manurung, Ibrahim Al-Kahfi, Ahmad Rizqi et al.

Multi-task learning (training one model for both sentiment and emotion at once) with BiLSTM outperforms single-task approaches on noisy, informal Indonesian text—and preprocessing with domain-specific slang dictionaries matters more than model complexity.

This paper tackles sentiment and emotion classification for Indonesian e-commerce reviews, which contain slang, regional words, and emoji that confuse standard tools. The authors built a two-track system: one using AutoML with TF-IDF features, and another using a BiLSTM neural network trained on both sentiment and emotion simultaneously.

trainingapplications

Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters

Apr 27, 2026

Aaryan Shah, Andrew Hines, Alexia Downs et al.

Clinician-authored rubrics can be validated and partially replaced by LLM-generated ones, enabling scalable clinical AI evaluation that keeps expert oversight while sharply reducing evaluation cost.

This paper presents a practical methodology for evaluating clinical AI systems using case-specific rubrics written by clinicians. The researchers tested whether AI-generated rubrics could match clinician judgment across 823 real and synthetic clinical cases, finding that LLM-based scoring reached agreement levels similar to clinician-to-clinician agreement at 1,000x lower cost.

evaluationsafetyapplications

Energy-Arena: A Dynamic Benchmark for Operational Energy Forecasting

Apr 27, 2026

Max Kleinebrahm, Jonathan Berrisch, Philipp Eiser et al.

Instead of testing models on fixed historical data, Energy-Arena uses a forward-looking approach with real-time submissions and evaluation, preventing researchers from accidentally (or intentionally) tuning models to past data and enabling fair, comparable progress tracking.

Energy-Arena is a dynamic benchmarking platform that solves a major problem in energy forecasting research: models are currently tested on different datasets and time periods, making it impossible to fairly compare progress.

evaluationapplications

Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis

Apr 27, 2026

Amal Akli, Mike Papadakis, Maxime Cordy et al.

Task description quality matters more than model size for reliable code generation—a small, fine-tuned classifier can detect problematic descriptions better than much larger models, and under-specification is the most critical defect type to watch for.

This paper introduces SpecValidator, a lightweight classifier that detects defects in task descriptions given to code-generating AI models. The tool identifies three types of problems—vague language, missing details, and formatting issues—and shows it's much better at catching these issues than larger models like GPT-4 mini or Claude.

evaluationapplicationsdata

Green Shielding: A User-Centric Approach Towards Trustworthy AI

Apr 27, 2026

Aaron J. Li, Nicolas Sanchez, Hao Huang et al.

How users phrase queries matters as much as what they ask: benign input variations systematically change AI behavior in ways that matter for real-world deployment, especially in high-stakes domains like healthcare.

This paper shows that small, routine changes in how users phrase questions to AI models can significantly shift their outputs—a problem existing safety testing misses.

safetyevaluationapplications

From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation

Apr 23, 2026

Bartosz Balis, Michal Orzechowski, Piotr Kica et al.

By separating LLM interpretation from deterministic workflow generation and encoding domain knowledge in reusable "Skills" documents, you can reliably automate the conversion of research questions into executable scientific workflows with minimal cost and overhead.

This paper presents an AI system that automatically converts research questions into executable scientific workflows. It uses three layers: an LLM to understand natural language, validated generators to create reproducible workflow specifications, and domain expert "Skills" documents that guide the process.

agentsapplicationsreasoning

Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models

Apr 23, 2026

Chee Wei Tan, Yuchen Wang, Shangxin Guo

LLMs can be operationalized as strategic game agents that adapt their reasoning approach based on game type, and interactive platforms like Nemobot let developers actively experiment with and refine these agents in real time.

Nemobot is an interactive platform that uses large language models to create game-playing AI agents across different game types—from word games to strategy games. Users can build, customize, and deploy these agents while watching them learn and improve through reinforcement learning, human feedback, and self-critique.

agentsreasoningapplications

Revealing Geography-Driven Signals in Zone-Level Claim Frequency Models: An Empirical Study using Environmental and Visual Predictors

Apr 23, 2026

Sherly Alfonso-Sánchez, Cristián Bravo, Kristina G. Stankova

Geographic representation matters more than model complexity for insurance risk prediction—simple coordinate + environmental feature combinations often outperform complex image-based approaches in zone-level claim frequency models.

This study shows how to improve motor insurance claim prediction by adding geographic data to standard actuarial models, even when location information is limited. Researchers tested environmental features from maps and satellite imagery on insurance claims data, finding that combining coordinates with environmental data works best, while image embeddings help when map data isn't available.

dataevaluationapplications

A Multi-Stage Warm-Start Deep Learning Framework for Unit Commitment

Apr 23, 2026

Muhy Eddin Za'ter, Anna Van Boven, Bri-Mathias Hodge et al.

Deep learning can accelerate hard optimization problems by providing intelligent warm-start solutions that reduce the search space, rather than replacing traditional solvers entirely.

This paper uses a transformer neural network to predict electricity generator schedules 72 hours ahead, then refines those predictions with rule-based corrections and feeds them to a traditional optimization solver as a starting point.

applicationsreasoningefficiency

EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents

Apr 23, 2026

Praval Sharma, Ashok Samal, Leen-Kiat Soh et al.

This dataset enables building event extraction systems that work across diverse real-world documents and geographical contexts, moving beyond closed-domain limitations that plagued previous approaches.

EVENT5Ws is a large, manually annotated dataset for extracting key event information (who, what, when, where, why) from documents in open-domain settings.

dataevaluationapplications

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

Apr 23, 2026

Jun Wang, Ziyin Zhang, Rui Wang et al.

LLMs can be practical for production incident detection when paired with efficient indexing, noise filtering, and domain-specific routing—not just as standalone models, but as part of a multi-stage system that handles real-world scale and complexity.

TingIS is a production system that detects critical technical incidents from noisy customer reports in real-time at enterprise scale.

applicationsagentsreasoning

A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents

Apr 23, 2026

Praval Sharma

Combining graph representations with LLM embeddings enables open-domain event extraction that generalizes to unseen event types while maintaining document-level reasoning that LLMs alone struggle with.

This paper presents MODEE, a method for extracting events from documents that works with any type of event, not just predefined ones. It combines graph-based learning with large language models to better understand document structure and context, addressing limitations where LLMs struggle with long documents and lose important information in the middle.

multimodalapplications

Addressing Image Authenticity When Cameras Use Generative AI

Apr 23, 2026

Umar Masud, Abhijith Punnappurath, Luxi Zhao et al.

Camera-embedded AI enhancements can alter image semantics without users knowing—this work enables recovery of authentic pre-enhancement images using a tiny stored decoder, raising important questions about transparency in computational photography.

Modern cameras increasingly use AI to enhance images during capture (better zoom, low-light processing), but this can add hallucinated content that users don't realize isn't authentic.

safetyefficiencyapplications

Locating acts of mechanistic reasoning in student team conversations with mechanistic machine learning

Apr 23, 2026

Kaitlin Gili, Mainak Nistala, Kristen Wendell et al.

Interpretability built into a model's design (through domain-aligned inductive biases) generalizes better than post-hoc explanations, making it more useful for education researchers who need to find and analyze student reasoning in transcripts.

Researchers built an interpretable machine learning model that automatically detects when students in team conversations are engaging in mechanistic reasoning—understanding how things work. The model analyzes student utterances and group dynamics to assign probabilities of reasoning moments, using intentional design choices that improve accuracy on new students and contexts.

evaluationapplications

Alignment has a Fantasia Problem

Apr 23, 2026

Nathanael Jo, Zoe De Simone, Mitchell Gordon et al.

AI alignment shouldn't just follow user prompts—it should actively help users discover and refine what they actually want through interactive support, combining machine learning with interface design and behavioral science.

AI systems today assume users know exactly what they want when they prompt. But research shows people often interact with AI while still figuring out their goals. When AI treats incomplete prompts as final requests, it can seem helpful but miss what users actually need.

alignmentapplications

Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos

Apr 23, 2026

Bowen Liu, Li Yang, Shanshan Song et al.

Diagnosis-driven video summarization for medical imaging requires organizing sparse diagnostic events into coherent clinical contexts rather than treating frames independently—DiCE shows this contextual reasoning approach outperforms standard methods on ultra-long endoscopy videos.

This paper tackles video-level analysis of capsule endoscopy (CE) videos by introducing a new task: extracting key diagnostic frames and making accurate diagnoses from ultra-long videos containing thousands of normal frames mixed with rare abnormal findings.

evaluationmultimodalapplications

Quotient-Space Diffusion Models

Apr 23, 2026

Yixian Xu, Yusong Wang, Shengjie Luo et al.

Quotient-space diffusion models reduce learning complexity for symmetric generative tasks by formally accounting for group symmetries, enabling better molecular and protein structure generation without learning redundant symmetric variations.

This paper introduces a mathematical framework for diffusion models that accounts for symmetries in generative tasks, particularly molecular structure generation. By modeling distributions on quotient spaces (which treat symmetric objects as equivalent), the approach simplifies learning compared to existing symmetry-aware methods and guarantees correct sampling of target distributions.

architecturereasoningapplications

Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications

Apr 23, 2026

Yvon K. Awuklu, Meghyn Bienvenu, Katsumi Inoue et al.

You can build practical event detection systems using logical rules and constraint satisfaction that work efficiently on real timestamped data while handling conflicting inferences—demonstrated on medical records.

This paper presents a logic-based system for detecting high-level events from timestamped data, like inferring disease episodes from patient medical records. The system uses logical rules to identify events, handles conflicts between inferred events, and can run efficiently on real data while staying aligned with expert knowledge.

reasoningdataapplications

Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards

Apr 23, 2026

Minji Jung, Minjae Lee, Yejin Kim et al.

Model rankings on leaderboards aren't objective—they depend heavily on what evaluation priorities you choose. Interactive, customizable leaderboards could better serve real-world deployment decisions than one-size-fits-all rankings.

LLM leaderboards rank models using fixed evaluation criteria set by benchmark designers, but different users have different priorities. This paper analyzes the LMArena benchmark dataset and builds an interactive tool that lets users customize which types of prompts matter most to them, showing how model rankings shift based on their specific needs.
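
To see why rankings can flip with user priorities, here is a minimal sketch of weighted re-ranking. The model names, categories, and scores are made up for illustration and are not LMArena data, and the weighted average is an assumed aggregation, not necessarily what the paper's tool does:

```python
def rank_models(scores, weights):
    """Rank models by a weighted average of per-category scores."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    combined = {
        model: sum(weights[c] * s[c] for c in weights) / total
        for model, s in scores.items()
    }
    # Highest combined score first.
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical per-category scores (made up for illustration).
scores = {
    "model_a": {"coding": 0.70, "creative": 0.40},
    "model_b": {"coding": 0.50, "creative": 0.65},
}

coding_first = rank_models(scores, {"coding": 0.8, "creative": 0.2})
creative_first = rank_models(scores, {"coding": 0.2, "creative": 0.8})
print(coding_first)    # ['model_a', 'model_b']
print(creative_first)  # ['model_b', 'model_a']
```

The same two models swap places when the weights change, which is the effect the summary describes.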

evaluationapplications

SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

Apr 22, 2026

Ruohan Liu, Shukang Yin, Tao Wang et al.

Current large audio-language models fail to properly control or interpret paralinguistic cues (emotion, tone, style) in speech, with these failures accounting for 43% of errors in conversational tasks—a critical gap for building natural-sounding voice assistants.

SpeechParaling-Bench is a benchmark for testing how well AI speech models handle paralinguistic features—things like emotion, tone, and speaking style. It includes over 100 fine-grained features tested across 1,000+ English-Chinese speech samples, and uses an AI judge to compare outputs fairly. Tests show current models struggle significantly with controlling these subtle speech qualities.

evaluationmultimodalapplications

Closing the Domain Gap in Biomedical Imaging by In-Context Control Samples

Apr 22, 2026

Ana Sanchez-Fernandez, Thomas Pinetz, Werner Zellinger et al.

Meta-learning with control samples can close the domain gap caused by batch effects in biomedical imaging, enabling deep learning models to work reliably across different experimental batches and labs without retraining from scratch.

Batch effects—systematic technical variations in biomedical imaging—cause deep learning models to fail on new experimental data. This paper introduces CS-ARM-BN, a meta-learning method that uses negative control samples (unperturbed reference images always present in experiments) to adapt models to new batches.

trainingapplications

ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control

Apr 22, 2026

Shelly Golan, Michael Finkelson, Ariel Bereslavsky et al.

You can now train one diffusion model that handles multiple conflicting goals and let users choose their preferred trade-off at inference time, rather than training separate models or picking a single compromise upfront.

ParetoSlider trains a single diffusion model to handle multiple competing objectives simultaneously, letting users control trade-offs at inference time. Instead of committing to one fixed balance between goals (like image quality vs. prompt accuracy), the model learns the entire range of optimal solutions and accepts a preference weight as input to pick any point along that spectrum.
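
A toy sketch of how a single preference weight can select a point on a trade-off. It uses linear scalarization over a fixed set of candidates, which is an assumption made for illustration: the summary does not say how ParetoSlider combines rewards, and the candidate names and scores below are invented.

```python
def scalarized_reward(quality: float, prompt_accuracy: float, w: float) -> float:
    """Blend two rewards with preference weight w in [0, 1] (assumed linear form)."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be between 0 and 1")
    return w * quality + (1.0 - w) * prompt_accuracy


# Hypothetical (quality, prompt_accuracy) scores for three outputs.
candidates = {
    "sharp_but_off_prompt": (0.9, 0.3),
    "balanced": (0.7, 0.7),
    "faithful_but_plain": (0.3, 0.9),
}


def best_candidate(w: float) -> str:
    """Return the candidate with the highest blended reward for weight w."""
    return max(candidates, key=lambda name: scalarized_reward(*candidates[name], w))


for w in (0.9, 0.5, 0.1):
    print(w, best_candidate(w))
```

Sweeping `w` from 1 toward 0 moves the preferred output from the quality-heavy candidate to the prompt-faithful one. In the paper, the weight is an input to the model itself rather than a filter over finished outputs.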

trainingalignmentapplications

Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs

Apr 22, 2026

Mariano Barone, Francesco Di Serio, Roberto Moio et al.

LLMs work best as communication assistants in healthcare, not replacements for doctors. Rewriting patient-facing text through collaborative processes dramatically improves clarity and emotional appropriateness while maintaining medical accuracy.

This study evaluates whether large language models can communicate like doctors by testing general and medical-specialized LLMs on clinical explanations and patient interactions.

safetyapplicationsevaluation

VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

Apr 21, 2026

Jean Mercat, Sedrick Keh, Kushal Arora et al.

For roboticists and ML engineers: VLA Foundry eliminates pipeline incompatibility issues by providing a unified training stack for building embodied AI models, with released weights and open-source code making it practical to train and deploy robotic policies.

VLA Foundry is an open-source framework that unifies training of language models, vision-language models, and vision-language-action models in one codebase. Instead of stitching together separate pipelines, it provides end-to-end control from language pretraining through action fine-tuning, enabling researchers to train robotic manipulation policies from scratch or using pretrained backbones.

architecturetrainingapplications

Epistemic orientation in parliamentary discourse is associated with deliberative democracy

Apr 21, 2026

Segun Aroyehun, Stephan Lewandowsky, David Garcia

Political discourse that prioritizes evidence over intuition is measurably associated with better democratic outcomes and governance quality—a finding that quantifies the importance of epistemic rigor in public deliberation.

Researchers developed a method to measure whether political speech relies on evidence or intuition, then analyzed 15 million parliamentary speeches from 1946-2025 across seven countries. They found that speeches emphasizing evidence-based reasoning correlate with stronger democratic institutions and better governance, suggesting that how politicians talk about truth matters for democracy itself.

evaluationapplications

An Answer is just the Start: Related Insight Generation for Open-Ended Document-Grounded QA

Apr 21, 2026

Saransh Sharma, Pritika Ramu, Aparna Garimella et al.

Open-ended QA isn't just about finding one answer—users need follow-up insights to refine their thinking. This work shows how to systematically generate those related insights from document collections to support iterative question-answering.

This paper introduces a new task where AI systems generate additional insights from documents to help users refine and improve answers to open-ended questions. The authors release SCOpE-QA, a dataset of 3,000 questions, and propose InsightGen, a method that clusters documents thematically and selects relevant context to generate diverse insights using language models.

evaluationapplicationsreasoning

A multimodal and temporal foundation model for virtual patient representations at healthcare system scale

Apr 20, 2026

Andrew Zhang, Tong Ding, Sophia J. Wagner et al.

A single model can integrate all types of clinical data (images, text, lab results, medications, procedures) into patient embeddings that enable multiple downstream clinical tasks, suggesting that unified patient representations are feasible and useful at healthcare system scale.

Apollo is a foundation model trained on 30 years of hospital records from 7.2 million patients that learns unified representations of entire patient care journeys across 28 medical data types.

multimodalapplications

Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion

Apr 20, 2026

Terry Leitch

Backend infrastructure (llama.cpp vs MLX) matters more than quantization level for local LLM performance, and long-context tasks expose memory limits that cloud models handle better—critical for practitioners choosing between cloud and local deployment.

This paper evaluates large language models on System Dynamics tasks, comparing cloud APIs (77–89% accuracy) against locally-hosted open-source models (up to 77% on causal diagram extraction).

evaluationefficiencyapplications

ConforNets: Latents-Based Conformational Control in OpenFold3

Apr 20, 2026

Minji Lee, Colin Kalicki, Minkyu Jeon et al.

By learning to transform AF3's internal representations, ConforNets can reliably generate multiple protein conformations and transfer conformational changes between proteins—solving a major limitation of structure prediction models that typically predict only one dominant state.

ConforNets is a method for controlling protein conformations in OpenFold3, an open-source reproduction of AlphaFold3 (AF3), by applying learnable transformations to latent representations. Rather than perturbing inputs or using ad hoc tricks, it modulates the internal representations the model uses to predict protein structures, enabling both discovery of alternate conformations and transfer of conformational changes across related proteins.

architecturetrainingapplications

Physics-Informed Neural Networks for Biological $2\mathrm{D}{+}t$ Reaction-Diffusion Systems

Apr 20, 2026

William Lavery, Jodie A. Cochrane, Christian Olesen et al.

You can now use neural networks to automatically discover the mathematical equations governing biological systems from experimental data, making it practical to reverse-engineer complex biological processes without manual equation design.

This paper extends physics-informed neural networks (PINNs) to discover reaction-diffusion equations from 2D spatial + time data. The method combines neural networks with known physics structure to learn unknown biological processes, demonstrated on lung cancer cell dynamics from microscopy images, producing interpretable mathematical equations.

reasoningapplications

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

Apr 20, 2026

Xirui Li, Ming Li, Derry Xu et al.

Automating environment generation for agent evaluation enables large-scale benchmarking and continuous, on-demand testing—turning evaluation from a static, expensive process into a scalable, user-driven one that adapts to agent weaknesses.

ClawEnvKit automates the creation of training and evaluation environments for AI agents that use tools (claw-like agents). Instead of manually building environments, the system generates diverse, verified task scenarios from natural language descriptions.

agentsevaluationapplications

VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

Apr 17, 2026

Xiangbo Gao, Sicong Jiang, Bangya Liu et al.

To build better video editing systems, you need specialized evaluation tools—generic vision-language models don't understand editing quality the way humans do.

This paper introduces VEFX-Bench, a comprehensive dataset and evaluation system for video editing. It includes 5,049 human-annotated video editing examples across multiple categories, a specialized reward model (VEFX-Reward) that judges editing quality across three dimensions, and a 300-video benchmark for comparing editing systems.

evaluationmultimodalapplications

From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text

Apr 17, 2026

Van-Truong Le

When deploying LLMs for legal tasks, don't rely on overall accuracy scores alone—detailed error analysis shows models make subtle but critical reasoning mistakes that surface-level metrics miss, especially with complex domain-specific language.

This paper evaluates four leading LLMs (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, Grok-1) on Vietnamese legal text simplification using both quantitative benchmarks and detailed error analysis. The study reveals that models face a trade-off between readability and legal accuracy, with the main challenge being precise legal reasoning rather than summarization.

evaluationapplicationsreasoning

FL-MHSM: Spatially-adaptive Fusion and Ensemble Learning for Flood-Landslide Multi-Hazard Susceptibility Mapping at Regional Scale

Apr 17, 2026

Aswathi Mundayatt, Jaya Sreevalsan-Nair

Combining multiple machine learning approaches with spatial awareness—rather than using one uniform model across an entire region—significantly improves predictions of natural hazard risks and reveals how different geographic areas are affected by different environmental factors.

This study develops a deep learning system to predict flood and landslide risks across large regions by combining multiple prediction approaches (Early Fusion, Late Fusion, and Mixture of Experts).

evaluationarchitectureapplications

SwanNLP at SemEval-2026 Task 5: An LLM-based Framework for Plausibility Scoring in Narrative Word Sense Disambiguation

Apr 17, 2026

Deshan Sumanathilaka, Nicholas Micallef, Julian Hough et al.

Large language models with dynamic few-shot prompting can effectively judge word sense plausibility in stories, and combining multiple models brings results closer to human agreement patterns.

This paper tackles a new task where AI models predict how plausible a word meaning is within a story. The researchers test both small fine-tuned models and large models with few-shot prompting, finding that large models with dynamic examples best match human judgments of word sense plausibility in narratives.

evaluationreasoningapplications

Characterising LLM-Generated Competency Questions: a Cross-Domain Empirical Study using Open and Closed Models

Apr 17, 2026

Reham Alharbi, Valentina Tamma, Terry R. Payne et al.

LLM choice matters for generating ontology requirements: different models have distinct strengths depending on the domain, so practitioners should test multiple models rather than assuming one works universally.

This paper evaluates how different AI language models generate Competency Questions—natural language requirements for ontology systems. The researchers tested open and closed models across multiple domains, measuring readability, relevance, and structural complexity to understand what kinds of questions each model produces best.

evaluationapplications

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

Apr 16, 2026

Yan Li, Zezi Zeng, Yifan Yang et al.

Generating webpages with AI requires coordinating multiple content types (text, images, video) at both global and local levels—treating layout and content generation as interconnected problems rather than separate tasks.

MM-WebAgent is a hierarchical AI system that generates complete webpages by coordinating the creation of layouts, text, images, and videos together. Unlike simpler approaches that generate each element separately, it uses planning and self-reflection to ensure all parts work together visually and stylistically.

agentsmultimodalapplications

Blue Data Intelligence Layer: Streaming Data and Agents for Multi-source Multi-modal Data-Centric Applications

Apr 16, 2026

Moin Aminnaseri, Farima Fatahi Bayat, Nikita Bhutani et al.

Modern data systems need to treat LLMs, web search, and user context as first-class data sources alongside traditional databases, with intelligent agents orchestrating queries across all of them.

Blue's Data Intelligence Layer (DIL) is a system that lets users ask natural language questions across multiple data sources, websites, and knowledge bases—not just a single database.

agentsdataapplications

AI-Assisted Requirements Engineering: An Empirical Evaluation Relative to Expert Judgment

Apr 16, 2026

Oz Levy, Ilya Dikman, Natan Levy et al.

AI can reliably handle routine requirement quality checks (syntax, structure, clarity), but systems engineers must stay in the loop for contextual judgment and complex trade-off decisions that define good requirements.

This study evaluates whether AI tools can help systems engineers assess requirement quality by comparing AI assessments against expert judgment using established INCOSE criteria.

evaluationapplicationssafety

Low-Cost System for Automatic Recognition of Driving Pattern in Assessing Interurban Mobility using Geo-Information

Apr 16, 2026

Oscar Romero, Aika Silveira Miura, Lorena Parra et al.

Adding location and time data to driving sensor inputs improves driving style classification accuracy by 13%, showing that contextual geo-information is crucial for understanding driver behavior beyond raw acceleration and speed metrics.

This paper presents a low-cost system that uses two physical sensors and a neural network to automatically recognize and classify driving styles (normal, aggressive, etc.) in real-world vehicles.

applicationsevaluation

MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

Apr 16, 2026

Raunak Agarwal, Markus Wenzel, Simon Baur et al.

For healthcare AI, smaller fine-tuned models often outperform large reasoning models on both accuracy and confidence estimation, and a model's stated confidence is not a reliable indicator of its actual uncertainty.

MADE is a continuously updated benchmark for classifying medical device adverse events into multiple labels while measuring prediction confidence. It addresses real-world healthcare challenges like imbalanced labels and data contamination, testing 20+ language models with different uncertainty quantification methods to show which approaches work best for high-stakes medical decisions.

evaluation, safety, applications

Benchmarking Classical Coverage Path Planning Heuristics on Irregular Hexagonal Grids for Maritime Coverage Scenarios

Apr 16, 2026

Carlos S. Sepúlveda, Gonzalo A. Ruz

For maritime coverage planning on irregular grids, implementation details matter more than algorithm family—a Warnsdorff variant with specific tie-breaking rules outperforms other classical methods, but no classical approach reliably produces zero-revisit tours.

This paper benchmarks 17 classical path-planning algorithms on 10,000 irregular hexagonal grid problems inspired by maritime scenarios like search and rescue.

reasoning, evaluation, applications
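
For readers unfamiliar with the heuristic named in this summary, here is a minimal sketch of a Warnsdorff-style coverage walk on a hexagonal grid. It is not the paper's implementation: the axial `(q, r)` coordinate scheme, the function names, and the tie-breaking rule (first candidate in a fixed direction order) are all assumptions made for illustration. The walk stops at the first dead end instead of revisiting cells.

```python
from typing import List, Set, Tuple

Cell = Tuple[int, int]

# Six neighbor offsets in axial (q, r) hex coordinates (assumed layout).
HEX_DIRS: List[Cell] = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]


def neighbors(cell: Cell, cells: Set[Cell]) -> List[Cell]:
    """Return the grid cells adjacent to `cell`, in fixed direction order."""
    q, r = cell
    return [(q + dq, r + dr) for dq, dr in HEX_DIRS if (q + dq, r + dr) in cells]


def warnsdorff_path(cells: Set[Cell], start: Cell) -> List[Cell]:
    """Greedy walk: always step to the unvisited neighbor that itself has
    the fewest unvisited neighbors. Stops early at a dead end."""
    cells = set(cells)
    visited = {start}
    path = [start]
    current = start
    while len(visited) < len(cells):
        candidates = [n for n in neighbors(current, cells) if n not in visited]
        if not candidates:
            break  # dead end: a full tour would need a revisit from here

        def onward(n: Cell) -> int:
            # Number of still-unvisited cells reachable from n in one step.
            return sum(1 for m in neighbors(n, cells) if m not in visited)

        # min() keeps the first minimum, so ties fall back to direction order.
        current = min(candidates, key=onward)
        visited.add(current)
        path.append(current)
    return path
```

Swapping the tie-breaking rule only requires changing the `key` passed to `min`, which is one way to experiment with the kind of implementation detail the summary says matters.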

Meituan Merchant Business Diagnosis via Policy-Guided Dual-Process User Simulation

Apr 16, 2026

Ziyang Chen, Renbing Chen, Daowei Li et al.

Combining reasoning-based and learning-based simulation through a shared policy layer reduces errors by ~45%, showing that hybrid approaches work better than either method alone for predicting real-world user behavior.

This paper presents a system for simulating how groups of users behave on a food delivery platform (Meituan) to test merchant strategies without real experiments. It combines two approaches—one that reasons through decisions logically and another that learns statistical patterns—using shared decision policies as a bridge between them.

agents, reasoning, applications

Agent-Aided Design for Dynamic CAD Models

Apr 16, 2026

Mitch Adler, Matthew Russo, Michael Cafarella

LLMs can design mechanical assemblies with moving parts when given the right tools (constraint solvers) and feedback mechanisms, opening the door to AI-assisted industrial design workflows.

AADvark is an AI agent system that designs complex 3D CAD models with moving parts—like pistons and scissors—by writing code, visualizing results, and iteratively refining based on feedback. It solves a key limitation of previous systems by using constraint solvers and specialized visual feedback to handle assemblies with multiple moving components.

agents, applications, reasoning

From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

Apr 15, 2026

Itay Itzhak, Eliya Habba, Gabriel Stanovsky et al.

Vibe-testing—informal, personalized evaluation—is how real users actually judge LLMs, and formalizing it with personalized prompts and user-aware criteria can better predict practical usefulness than standard benchmarks.

Users often evaluate LLMs informally by testing them on tasks relevant to their own work—a practice called 'vibe-testing.' This paper studies how vibe-testing actually works by analyzing user surveys and real-world model comparisons, then formalizes it as a two-step process: personalizing what to test and how to judge results.

evaluation, applications

ID and Graph View Contrastive Learning with Multi-View Attention Fusion for Sequential Recommendation

Apr 15, 2026

Xiaofan Zhou, Kyumin Lee

Combining sequential and graph representations through multi-view contrastive learning and attention fusion improves sequential recommendation accuracy, showing that ID-based and graph-based views of the same interaction data complement each other.

This paper proposes MVCrec, a recommendation system that learns from user interaction histories by combining two complementary views: sequential ID-based patterns and graph-based relational structures. Using contrastive learning across both views and a multi-view attention mechanism to fuse them, the approach achieves significant improvements on benchmark datasets without requiring external data.

applications, architecture, training

A Comparative Study of Dynamic Programming and Reinforcement Learning in Finite Horizon Dynamic Pricing

Apr 15, 2026

Lev Razumovskiy, Nikolay Karenin

Dynamic Programming and RL have different strengths in pricing: DP optimizes based on estimated demand patterns but struggles with computational complexity, while RL learns from trial-and-error but may be less stable—the best choice depends on your problem's complexity and constraints.

This paper compares two approaches to dynamic pricing: Fitted Dynamic Programming (which estimates demand from data) and Reinforcement Learning.

training, applications
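
To make the dynamic programming side of this comparison concrete, here is a minimal backward-induction sketch for finite-horizon pricing. It is a textbook-style illustration, not the paper's Fitted Dynamic Programming: it assumes the demand model is already known (a hypothetical `sale_prob` function giving the chance of selling one unit per period at a given price), whereas the paper estimates demand from data. All names and parameters are illustrative.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def solve_pricing_dp(
    horizon: int,
    inventory: int,
    prices: Sequence[float],
    sale_prob: Callable[[float], float],
) -> Tuple[List[List[float]], List[List[Optional[float]]]]:
    """Backward induction over (period t, units left x).

    value[t][x]  = best expected revenue from period t onward with x units.
    policy[t][x] = price achieving it (None when x == 0).
    Assumes at most one unit is sold per period.
    """
    # Terminal values (t == horizon) and empty-stock values stay at 0.
    value = [[0.0] * (inventory + 1) for _ in range(horizon + 1)]
    policy: List[List[Optional[float]]] = [
        [None] * (inventory + 1) for _ in range(horizon)
    ]
    for t in range(horizon - 1, -1, -1):
        for x in range(1, inventory + 1):
            best_price, best_value = None, float("-inf")
            for p in prices:
                q = sale_prob(p)
                # Sell with prob. q (earn p, move to x - 1), else keep x.
                v = q * (p + value[t + 1][x - 1]) + (1 - q) * value[t + 1][x]
                if v > best_value:
                    best_price, best_value = p, v
            value[t][x] = best_value
            policy[t][x] = best_price
    return value, policy
```

The triple loop costs on the order of horizon × inventory × number of prices, which hints at why exact DP becomes expensive as the state space grows, while an RL agent would instead learn a policy from simulated or observed sales.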

Toward Autonomous Long-Horizon Engineering for ML Research

Apr 14, 2026

Guoxin Chen, Jie Chen, Lei Chen et al.

Long-horizon AI research requires treating the problem as systems coordination over persistent state rather than pure reasoning—agents perform better when they can reference and build upon saved artifacts than when relying on conversation history alone.

AiScientist is a system that enables AI agents to autonomously conduct multi-day ML research projects by combining hierarchical task orchestration with a file-based workspace that preserves state across stages.

agents, reasoning, applications

PAL: Personal Adaptive Learner

Apr 14, 2026

Megha Chakraborty, Darssan L. Eswaramoorthi, Madhur Thareja et al.

Real-time adaptive learning systems can analyze multimodal lecture content and adjust difficulty dynamically, offering personalized feedback and summaries that respond to individual student understanding as lessons unfold.

PAL is an AI platform that transforms lecture videos into interactive learning experiences by analyzing video content in real time and dynamically adjusting question difficulty based on student responses. It also generates personalized summaries tailored to each learner's interests, going beyond static quiz-based systems.

applications, multimodal, evaluation

Bilevel Late Acceptance Hill Climbing for the Electric Capacitated Vehicle Routing Problem

Apr 14, 2026

Yinghao Qin, Mosab Bazargani, Edmund K. Burke et al.

Bilevel optimization can efficiently solve complex routing problems by separating interdependent decisions (routing vs. charging) into hierarchical levels, using a surrogate objective to accelerate convergence without requiring parameter tuning.

This paper solves the Electric Capacitated Vehicle Routing Problem—deciding routes and charging stops for electric delivery vehicles—using a bilevel optimization approach. The method separates routing and charging decisions into different optimization levels, using a simplified objective to guide the search efficiently.

applications
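
As background for the title, here is a minimal sketch of plain (single-level) Late Acceptance Hill Climbing on a generic minimization problem. It does not implement the paper's bilevel scheme, its surrogate objective, or any routing logic; the function signature, the `neighbor` callback, and the default history length are assumptions for illustration only.

```python
import random
from typing import Callable, List, Tuple, TypeVar

S = TypeVar("S")


def late_acceptance_hill_climbing(
    initial: S,
    cost: Callable[[S], float],
    neighbor: Callable[[S, random.Random], S],
    history_length: int = 20,
    iterations: int = 5000,
    seed: int = 0,
) -> Tuple[S, float]:
    """Minimize `cost` with late acceptance: a candidate is accepted if it is
    no worse than the current solution OR no worse than the cost the search
    held `history_length` iterations ago."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    # Circular buffer of past current-solution costs.
    history: List[float] = [current_cost] * history_length
    for i in range(iterations):
        candidate = neighbor(current, rng)
        candidate_cost = cost(candidate)
        slot = i % history_length
        if candidate_cost <= history[slot] or candidate_cost <= current_cost:
            current, current_cost = candidate, candidate_cost
        # Record the (possibly updated) current cost in the buffer.
        history[slot] = current_cost
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return best, best_cost
```

The history length is the only search parameter here. In a bilevel setting such as the one the summary describes, one might run a loop like this over routes while a lower level decides charging stops, but that arrangement is an assumption rather than a description of the paper's method.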

PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models

Apr 14, 2026

Han Bao, Penghao Zhang, Yue Huang et al.

Current LLMs struggle with policy understanding—they're better at applying knowledge to real problems than recalling facts or reasoning about concepts—and specialized models with domain-aligned experts can help close this gap.

This paper introduces PolicyBench, a 21K-case benchmark for evaluating how well large language models understand public policy across the US and China. It also proposes PolicyMoE, a specialized model using mixture-of-experts to improve policy comprehension at three levels: memorizing facts, understanding concepts, and applying knowledge to real scenarios.

evaluation, applications, reasoning