Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.
Carlos Heredia, Daniel Roncel
Neural demand models can be designed to respect economic constraints (integrability), producing more reliable price-elasticity estimates that are both mathematically consistent and practically useful for retail pricing.
This paper introduces ICDN, a neural network model that learns demand patterns for multiple products based on prices. Unlike traditional approaches, it directly models how demand changes with price (elasticity) in a mathematically consistent way, making the learned relationships more economically realistic and stable.
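For readers new to the term, price elasticity is the percentage change in demand per percentage change in price, i.e. the slope of log-demand against log-price. The sketch below estimates it numerically for a toy demand curve. It is not ICDN: `price_elasticity` and `toy_demand` are hypothetical names, and the constant-elasticity curve is an assumed example chosen because its true elasticity is known.

```python
import math


def price_elasticity(demand_fn, price, rel_step=1e-4):
    """Estimate d(log q) / d(log p) at `price` with a central difference."""
    lo = price * (1 - rel_step)
    hi = price * (1 + rel_step)
    d_log_q = math.log(demand_fn(hi)) - math.log(demand_fn(lo))
    d_log_p = math.log(hi) - math.log(lo)
    return d_log_q / d_log_p


def toy_demand(price):
    # Hypothetical constant-elasticity demand curve: q = 100 * p^(-1.5).
    # Its elasticity is -1.5 at every price.
    return 100.0 * price ** -1.5


estimate = price_elasticity(toy_demand, price=4.0)
print(round(estimate, 4))
```

An elasticity of -1.5 means a 1% price increase lowers demand by roughly 1.5%.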
Huanchi Wang, Zihang Huang, Yifang Tian et al.
You can build practical, label-efficient log anomaly detectors by using LLMs once offline to structure the problem, then training lightweight domain-specific models that run continuously without expensive LLM calls.
FAME is a system for detecting anomalies in individual log messages rather than groups, using a mixture-of-experts approach that leverages an LLM offline to organize log templates into failure domains. It requires minimal labeled data (as few as 100 examples) and runs efficiently on-premise, achieving 98% accuracy on real production logs while reducing annotation effort by 76x.
James Petullo, Nianwen Xue
Allocating more computational effort to harder SQL generation tasks—by exploring more candidate solutions—significantly improves accuracy without needing larger models.
CA-SQL improves LLM performance on complex SQL generation tasks by estimating question difficulty and dynamically adjusting how many candidate queries to explore. It uses evolutionary search principles and a custom voting method to find better SQL solutions, achieving state-of-the-art results on the BIRD benchmark's hardest problems.
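The general idea of spending more search effort on harder questions can be sketched in a few lines. This is an illustration only: the linear difficulty-to-budget mapping and the plain majority vote over execution results are assumptions standing in for the paper's own difficulty estimator and custom voting method, and `candidate_budget` and `majority_vote` are hypothetical names.

```python
from collections import Counter


def candidate_budget(difficulty, min_candidates=2, max_candidates=16):
    """Map a difficulty score in [0, 1] to a number of candidate queries."""
    clamped = min(1.0, max(0.0, difficulty))
    return min_candidates + round(clamped * (max_candidates - min_candidates))


def majority_vote(candidates):
    """Pick the SQL whose execution result is the most common.

    `candidates` is a list of (sql, result) pairs, where `result` is a
    hashable execution result or None if the query failed to run.
    """
    valid = [(sql, res) for sql, res in candidates if res is not None]
    if not valid:
        return None
    winning_result, _ = Counter(res for _, res in valid).most_common(1)[0]
    # Return the first candidate that produced the winning result.
    return next(sql for sql, res in valid if res == winning_result)


# Made-up candidates: two queries agree on the result (3,), one disagrees,
# and one failed to execute.
candidates = [
    ("SELECT COUNT(*) FROM orders WHERE status = 'open'", (3,)),
    ("SELECT COUNT(id) FROM orders", (5,)),
    ("SELECT COUNT(1) FROM orders WHERE status = 'open'", (3,)),
    ("SELECT COUNT(*) FROM order", None),
]
print(candidate_budget(0.1), candidate_budget(0.9))
print(majority_vote(candidates))
```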
Yi Yu, Parker Martin, Zhenyu Bu et al.
Distilled LLMs can extract medical data from unstructured reports with high accuracy and built-in confidence estimates, enabling clinicians to prioritize which extractions need human review.
CMR-EXTR converts free-text cardiac MRI reports into structured data with confidence scores for each extracted field. Using a lightweight distilled language model, it achieves 99.65% accuracy while running entirely offline, making it practical for clinical use without requiring constant API access.
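Per-field confidence scores are useful because they let a reviewer look only at the uncertain extractions. Below is a minimal sketch of that triage step, assuming each field carries a confidence between 0 and 1. The field names, values, and the 0.9 threshold are made up for illustration and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to lie in [0, 1]


def split_for_review(fields, threshold=0.9):
    """Separate auto-accepted fields from those needing human review.

    Fields needing review are sorted so the least confident come first.
    """
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = sorted(
        (f for f in fields if f.confidence < threshold),
        key=lambda f: f.confidence,
    )
    return accepted, needs_review


# Hypothetical extraction output for one report.
fields = [
    ExtractedField("lv_ejection_fraction", "55%", 0.99),
    ExtractedField("late_enhancement", "present", 0.62),
    ExtractedField("rv_function", "normal", 0.85),
]
accepted, needs_review = split_for_review(fields)
print([f.name for f in accepted])
print([f.name for f in needs_review])
```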
Ziyang Huang, Yi Cao, Ali K. Shargh et al.
AI coding agents are far from ready for autonomous scientific research: they excel at software engineering but fail at the domain-specific reasoning, procedure reconstruction, and result interpretation needed to reproduce real computational science claims.
This paper introduces AutoMat, a benchmark that tests whether AI coding agents can reproduce scientific findings from materials science papers. The benchmark reveals that current AI agents struggle significantly, achieving only 54% success, because they cannot fully reconstruct experimental procedures from paper descriptions, they deviate from required methods, and they fail during execution.
Pavlin G. Poličar, Andraž Pevcin, Blaž Zupan
Treating chart generation as a multi-step inspectable process with rendered-output validation catches visualization failures that code-only checks miss, and the resulting dataset reveals specific weaknesses in how multimodal LLMs understand charts.
This paper presents a structured workflow for generating statistical charts from data using LLMs, with built-in validation to catch visualization errors before they reach users. The workflow produces 1,500 diverse charts paired with 30,000+ question-answer pairs, revealing that while LLMs excel at reading chart syntax, they struggle with value extraction and reasoning tasks.
Inês Oliveira e Silva, Sérgio Jesus, Iker Perez et al.
Quantitative metrics for evaluating AI explanations (like sparsity and faithfulness) don't predict whether explanations actually help humans make better decisions in high-stakes settings—you need human-centered evaluation, not just mathematical benchmarks.
This paper evaluates eight different Shapley value methods—a popular AI explanation technique—by testing them with real financial analysts on fraud detection and risk assessment tasks.
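As background, a Shapley value assigns each feature its average marginal contribution over all orders in which features could be added. The sketch below computes exact Shapley values by enumerating subsets, which is only feasible for a handful of features. It is not one of the eight methods the paper evaluates, and the toy value function is an assumption made up for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value_fn):
    """Exact Shapley values by enumerating every subset of the other players."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {p}) - value_fn(s))
        result[p] = total
    return result


def toy_value(coalition):
    # Hypothetical game: additive contributions plus a bonus of 2
    # when features "a" and "b" appear together.
    base = {"a": 1.0, "b": 2.0, "c": 3.0}
    bonus = 2.0 if {"a", "b"} <= coalition else 0.0
    return sum(base[f] for f in coalition) + bonus


values = shapley_values(["a", "b", "c"], toy_value)
print({k: round(v, 6) for k, v in values.items()})
```

In this toy game the interaction bonus is split evenly between "a" and "b", and the values sum to the value of the full feature set.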
Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil et al.
LLMs outperform traditional word-error metrics for evaluating speech recognition by understanding semantic meaning rather than just counting mistakes, opening the door to more human-aligned ASR evaluation.
This paper shows that large language models can evaluate speech recognition quality much better than traditional metrics like Word Error Rate. Instead of just counting wrong words, LLMs can understand meaning and classify errors in ways that match how humans judge speech quality—achieving 92-94% agreement with human raters.
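To see why counting words can mislead, here is a small sketch of Word Error Rate: the word-level edit distance between a reference and a hypothesis, divided by the reference length. The function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "please turn the lights off"
wer_meaning_flipped = word_error_rate(reference, "please turn the lights on")
wer_minor_slip = word_error_rate(reference, "please turn the light off")
print(wer_meaning_flipped, wer_minor_slip)
```

Both hypotheses contain a single substituted word and so receive the same score, even though only the first reverses the meaning of the command. That gap is what a meaning-aware evaluator aims to close.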
Thomas Bayer, Alexander Lohr, Sarah Weiß et al.
LLMs can dynamically query Knowledge Graphs to generate contextual, domain-aware explanations of ML model predictions—making AI decisions more transparent and trustworthy in specialized industries like manufacturing.
This paper combines Knowledge Graphs and Large Language Models to explain machine learning predictions in manufacturing. The system stores domain knowledge and ML results in a structured graph, then uses an LLM to convert relevant information into clear, user-friendly explanations. Testing shows the approach works well for both standard and complex questions in real manufacturing settings.
Shriram Chennakesavalu, Kirill Shmilovich, Hayley Weir et al.
LLMs show promise for drug discovery, but RL-based post-training on domain-specific tasks is critical: a smaller model trained this way outperformed much larger models that had not received it, suggesting a practical path forward for real-world drug design applications.
This paper creates a benchmark of chemistry tasks to test how well large language models can help design new drugs. The researchers test three model families on tasks like predicting molecular properties and designing molecules, then show that reinforcement learning post-training can significantly boost performance, even making smaller models competitive with frontier models.