ThinkLLM



Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

861 papers, 9 this month, 12 topics
Topics: Efficiency 37, Reasoning 36, Training 35, Evaluation 29, Architecture 23, Agents 23, Multimodal 17, Applications 15, Alignment 9, Safety 8, Scaling 8, Data 3

May 18 – May 24 (2)

Evaluating Commercial AI Chatbots as News Intermediaries

May 21, 2026

Mirac Suzgun, Emily Shen, Federico Bianchi et al.

AI chatbots excel at retrieving and synthesizing recent news but have three critical weaknesses: they systematically underperform on non-English content, fail primarily due to retrieval errors rather than reasoning mistakes, and are easily fooled by questions containing subtle false information.

This study evaluates six major AI chatbots (Gemini, Grok, Claude, GPT models) on their ability to answer factual news questions across six languages and regions.

evaluation, multimodal, data

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

May 18, 2026

Minrui Xu, Zilin Wang, Mengyi DENG et al.

Automated environment synthesis and trajectory generation can reduce the data requirements for tool-use agent training by 5x while improving downstream performance, making agentic RL more practical and scalable.

EnvFactory automates the creation of tool-use training environments and realistic multi-turn interaction trajectories for teaching language models to use tools effectively. It generates diverse, natural training data from verified executable environments, enabling more efficient agent training with fewer resources than existing approaches.

May 4 – May 10 (5)

Verifier-Backed Hard Problem Generation for Mathematical Reasoning

May 7, 2026

Yuhang Lai, Jiazhan Feng, Yee Whye Teh et al.

Using an independent verifier to validate problem correctness prevents reward hacking in AI-generated math problems, enabling better training data creation without human experts.

This paper tackles the problem of generating valid and challenging math problems for training AI models. Instead of relying on humans or simple self-play (which often produces invalid problems), the authors introduce VHG, a system with three players: a problem setter, a solver, and an independent verifier.

training, reasoning, data

GlazyBench: A Benchmark for Ceramic Glaze Property Prediction and Image Generation

May 7, 2026

Ziyu Zhai, Siyou Li, Juexi Shao et al.

This dataset bridges AI and materials science by providing standardized benchmarks for predicting ceramic properties and generating glaze visuals—showing that multimodal AI can accelerate traditionally trial-and-error design processes.

GlazyBench is the first large-scale dataset for AI-assisted ceramic glaze design, containing 23,148 real glaze formulations. It enables two tasks: predicting glaze properties (color, transparency) from raw materials, and generating visual images of glazes.

Apr 27 – May 3 (11)

Generating Statistical Charts with Validation-Driven LLM Workflows

May 1, 2026

Pavlin G. Poličar, Andraž Pevcin, Blaž Zupan

Treating chart generation as a multi-step inspectable process with rendered-output validation catches visualization failures that code-only checks miss, and the resulting dataset reveals specific weaknesses in how multimodal LLMs understand charts.

This paper presents a structured workflow for generating statistical charts from data using LLMs, with built-in validation to catch visualization errors before they reach users. The workflow produces 1,500 diverse charts paired with 30,000+ question-answer pairs, revealing that while LLMs excel at reading chart syntax, they struggle with value extraction and reasoning tasks.

evaluation, applications, data

Directed Social Regard: Surfacing Targeted Advocacy, Opposition, Aid, Harms, and Victimization in Online Media

May 1, 2026

Scott Friedman, Ruta Wheelock, Sonja Schmer-Galunder et al.

Most sentiment analysis tools miss nuance—they can't detect that a single message contains both praise for one group and criticism for another. This work enables fine-grained tracking of who is being helped, harmed, supported, or opposed in online discourse.

This paper introduces a new method to detect mixed positive and negative sentiments directed at different targets within the same message. Instead of labeling text as simply positive or negative, the approach identifies specific targets (like people or groups) and scores them across three dimensions: advocacy vs. opposition, aid vs. harm, and support vs. victimization.

Apr 20 – Apr 26 (16)

Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering

Apr 24, 2026

Hillary Mutisya, John Mugane

Cross-lingual transfer and unsupervised clustering are complementary for morphology discovery in low-resource languages—transfer finds cognates while clustering spots language-specific innovations that transfer misses.

This paper develops a method to automatically discover morphological patterns in Giriama, a low-resource Bantu language with minimal labeled data. By combining knowledge transfer from Swahili with unsupervised clustering, the system identifies noun classes and uncovers two previously unknown morphological patterns, achieving 86.7% accuracy on lemmatization across word classes.

data, training

Time-Localized Parametric Decomposition of Respiratory Airflow for Sub-Breath Analysis

Apr 24, 2026

Victoria Ribeiro Rodrigues, Paul W. Davenport, Nicholas J. Napoli

Breaking respiratory airflow into time-localized parametric components reveals sub-breath dynamics that standard metrics miss, enabling better detection of breathing changes under cognitive stress.

This paper presents a new method to analyze breathing patterns by breaking down airflow signals into simple, interpretable components (half-sine, Gaussian, and beta shapes) rather than treating breath as a single unit. The approach captures fine details within each breath—like timing and coordination—and improves detection of cognitive fatigue by 30% compared to traditional breathing metrics.

Apr 13 – Apr 19 (4)

No Universal Courtesy: A Cross-Linguistic, Multi-Model Study of Politeness Effects on LLMs Using the PLUM Corpus

Apr 17, 2026

Hitesh Mehta, Arjit Saxena, Garima Chhikara et al.

Politeness measurably affects LLM output quality, but there's no universal effect—different languages and models respond differently to tone, so developers should test politeness strategies for their specific use case and audience.

This study tests how politeness in user prompts affects AI model responses across 5 models, 3 languages, and 22,500 prompt-response pairs. Results show politeness improves response quality by ~11% on average, but the effect varies significantly by language and model: responses favor politeness in English, deference in Hindi, and assertiveness in Spanish.

evaluation, data

Why Do Vision Language Models Struggle To Recognize Human Emotions?

Apr 16, 2026

Madhav Agarwal, Sotirios A. Tsaftaris, Laura Sevilla-Lara et al.

VLMs fail at emotion recognition due to two fixable problems: long-tailed training data that biases them toward common emotions, and inability to capture the fleeting temporal changes in facial expressions that are critical for understanding emotions.

Vision-language models struggle with emotion recognition because they inherit dataset biases that collapse rare emotions into common categories, and they can't effectively process the temporal dynamics of facial expressions. This paper identifies these vulnerabilities and proposes using natural language summaries of intermediate frames to preserve emotional context within memory constraints.

Apr 6 – Apr 12 (16)

ANTIC: Adaptive Neural Temporal In-situ Compressor

Apr 10, 2026

Sandeep S. Cranganore, Andrei Bodnar, Gianluca Galleti et al.

Neural compression combined with smart temporal sampling can reduce physics simulation storage by orders of magnitude, making exabyte-scale HPC data manageable without sacrificing scientific accuracy.

ANTIC is a compression system that reduces storage needs for massive physics simulations by intelligently selecting which time snapshots to save and compressing spatial data using neural networks. It works during simulation rather than after, enabling petabyte-scale datasets to be stored efficiently while preserving physics accuracy.

efficiency, architecture, data

VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Apr 10, 2026

Guanyu Zhou, Yida Yin, Wenhao Chai et al.

Synthetic data targeted at specific visual skills can significantly improve VLM performance on perception tasks, suggesting that natural images alone don't provide enough supervision for low-level visual understanding.

VisionFoundry is a system that generates synthetic training data for vision-language models to improve their visual perception skills like spatial understanding and 3D reasoning.

Mar 30 – Apr 5 (7)

Reliability Gated Multi-Teacher Distillation for Low Resource Abstractive Summarization

Apr 3, 2026

Dipto Sumit, Ankan Kumar Roy, Sadia Khair Rodela et al.

When distilling from multiple teachers for summarization, simpler logit-level knowledge distillation is more reliable than complex approaches, and teacher agreement should guide when to trust teacher vs. ground truth supervision.

This paper improves knowledge distillation for low-resource abstractive summarization by using multiple teacher models intelligently. It introduces methods that route learning between teacher guidance and ground truth based on teacher agreement, and constrains how student models relate to different teachers.

training, efficiency, data

Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

Apr 2, 2026

Chongjie Ye, Cheng Cao, Chuanyu Pan et al.

By unifying 2D and 3D generation in one model and leveraging plentiful 2D data as a structural constraint, you can train better 3D generators with limited 3D assets—no separate 2D-to-3D conversion pipeline needed.

Omni123 is a 3D foundation model that generates both 2D images and 3D objects from text by treating them as sequences of tokens. It uses abundant 2D image data as a guide to improve 3D generation, avoiding the need for scarce aligned text-image-3D datasets. The model cycles through different modalities (text→image→3D→image) to ensure consistency across all forms.

Mar 23 – Mar 29 (2)

Training the Knowledge Base through Evidence Distillation and Write-Back Enrichment

Mar 26, 2026

Yuxing Lu, Xukai Zhao, Wei Wu et al.

You can improve RAG systems by preprocessing your corpus once to add distilled, compact versions of relevant documents—this works with any retrieval method and shows consistent gains without changing your pipeline.

This paper proposes WriteBack-RAG, a method that improves retrieval-augmented generation (RAG) systems by treating the knowledge base as trainable. Using labeled examples, the system identifies relevant documents, distills them into compact knowledge units, and adds these to the corpus.

data, training

CSTS: A Canonical Security Telemetry Substrate for AI-Native Cyber Detection

Mar 24, 2026

Abdul Rahman

Security AI models fail when deployed to new environments because telemetry data is fragmented. CSTS solves this by providing a unified, entity-focused data structure that maintains consistent identity and relationships across different systems.

This paper introduces CSTS, a standardized way to represent security data that helps AI systems detect cyber threats across different computer networks. Instead of treating security events as isolated incidents, CSTS organizes them around entities (like users or devices) and their relationships, making AI models more reliable when deployed in new environments.

Mar 16 – Mar 22 (8)

LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

Mar 20, 2026

Jiazheng Xing, Fei Du, Hangjie Yuan et al.

To generate videos with multiple people where each person's appearance stays consistent with their attributes, you need both better training data that captures identity-attribute relationships and model attention mechanisms designed to enforce those relationships.

LumosX improves personalized video generation by explicitly linking identities to their attributes. It uses a data pipeline with multimodal AI to extract subject relationships, then applies specialized attention mechanisms in diffusion models to ensure faces stay consistent with their assigned attributes across video frames.

multimodal, architecture, data

MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Mar 19, 2026

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori et al.

Synthetic data from diffusion models may not be as privacy-safe as assumed—membership inference attacks can still reveal whether specific records were in the training data, even with synthetic tabular outputs.

This challenge evaluates how well synthetic tabular data generated by diffusion models protects privacy against membership inference attacks. Researchers tested whether synthetic data truly hides information about individuals in the original dataset, developing new attack methods to measure privacy risks across different types of tabular data structures.

Mar 9 – Mar 15 (5)

Neuron-Aware Data Selection In Instruction Tuning For Large Language Models

Mar 13, 2026

Xin Chen, Junchao Wu, Shu Yang et al.

You can train better LLMs on less data by selecting instruction examples that activate the same neurons as your target task—this beats using all data or relying on external models to score examples.

This paper introduces NAIT, a method for selecting the most useful instruction-tuning data for large language models by analyzing which neurons activate when processing different types of tasks. Instead of using all available training data, NAIT identifies a small subset (10% of data) that produces better results by matching neuron activation patterns to target capabilities.

training, data, efficiency

From Experiments to Expertise: Scientific Knowledge Consolidation for AI-Driven Computational Research

Mar 13, 2026

Haonan Huang

AI agents performing scientific research need memory and reflection, not just execution capability. Knowledge consolidation between runs dramatically improves efficiency and accuracy in computational science workflows.

QMatSuite is a platform that helps AI agents learn from computational materials science experiments by storing findings, retrieving past knowledge, and reflecting on results.

Feb 23 – Mar 1 (8)

Coverage-Aware Web Crawling for Domain-Specific Supplier Discovery via a Web–Knowledge–Web Pipeline

Feb 27, 2026

Yijiashun Qi, Yijiazhen Qi, Tanmay Wagh

Use knowledge graph topology to guide web crawling toward undiscovered entities, making supplier discovery more complete with less computational cost.

This paper tackles the problem of finding all small and medium-sized businesses in specialized industries (like semiconductor equipment makers) by combining web crawling, knowledge graphs, and smart coverage estimation.

data, applications

Histopathology Image Normalization via Latent Manifold Compaction

Feb 27, 2026

Xiaolong Zhang, Jianwei Zhang, Selim Sevim et al.

Unsupervised learning can remove batch effects from medical images, letting models generalize across hospitals without retraining.

Medical image analysis struggles when microscope slides are stained or scanned differently across hospitals—models trained on one site fail at another. This paper introduces a technique that learns to remove these visual differences automatically, making AI models work reliably across different clinical sites without needing labeled examples.

data
agents, training, data
multimodal, applications, data

OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

May 5, 2026

Yuwen Du, Rui Ye, Shuo Tang et al.

High-quality training data matters more than pipeline complexity: careful data curation with SFT alone can beat industrial-scale approaches combining pre-training, continual pre-training, and RL for building capable search agents.

OpenSeeker-v2 shows that simple supervised fine-tuning on carefully designed training data can match or beat complex industrial pipelines for building search agents.

training, agents, data

Unsupervised Machine Learning for Detecting Structural Anomalies in European Regional Statistics

May 4, 2026

Bogdan Oancea

Unsupervised learning can detect multivariate anomalies in regional data that traditional single-variable checks miss, helping statistical agencies distinguish between data quality issues and genuine structural divergence.

This paper uses five unsupervised machine learning techniques to detect regions in Europe with unusual combinations of economic and social indicators, rather than just extreme individual values.

evaluation, data

VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

May 4, 2026

Tanush Yadav, Mohammadreza Salehi, Jae Sung Park et al.

Vision-language models perform surprisingly poorly on domain-specific action recognition even in simplified settings, but fine-tuning on domain-specific video data significantly closes the gap.

VideoNet is a new benchmark and dataset for testing how well AI models recognize specific actions in videos across 37 different domains. The researchers found that current vision-language models struggle with domain-specific action recognition—even simple binary choices—and created a 500k video question-answer dataset to improve performance through fine-tuning.

evaluation, data, multimodal
evaluation, data

Synthetic Computers at Scale for Long-Horizon Productivity Simulation

Apr 30, 2026

Tao Ge, Baolin Peng, Hao Cheng et al.

Synthetic computer environments with long-horizon simulations can generate realistic training data for productivity agents at scale, enabling them to learn from diverse workplace scenarios without human annotation.

Researchers created a system to generate realistic computer environments at scale—complete with folder structures and documents—then simulated AI agents working on month-long productivity tasks within them.

agents, data, training

Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists

Apr 30, 2026

Yujun Wu, Dongxu Zhang, Xinchen Li et al.

Structured knowledge of method evolution, not just citations, is essential infrastructure for AI agents doing research. This graph enables machines to understand how innovations emerge and build upon each other, unlocking automated idea evaluation and generation.

Intern-Atlas is a structured database of how AI research methods evolve and build on each other, extracted from over 1 million papers. Unlike traditional citation networks, it explicitly maps methodological relationships—showing which techniques led to which innovations and why—making it queryable for AI research agents and enabling automated discovery of new research directions.

data, agents, applications

FlexiTac: A Low-Cost, Open-Source, Scalable Tactile Sensing Solution for Robotic Systems

Apr 30, 2026

Binghao Huang, Yunzhu Li

Practical tactile sensing for robotics is now accessible to researchers and developers without expensive custom hardware—FlexiTac provides a plug-and-play solution that integrates with standard robot learning pipelines.

FlexiTac is an affordable, open-source tactile sensor system for robot hands and grippers that combines flexible sensor pads with simple electronics to provide real-time touch feedback. It works with existing robot platforms and supports modern AI training methods like learning from combined vision and touch data.

applications, multimodal, data

TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering

Apr 30, 2026

An-Yang Ji, Jun-Peng Jiang, De-Chuan Zhan et al.

LLMs fail at implicit prediction tasks on tables because they don't recognize when a question requires inference from patterns rather than lookup; intent disambiguation is the critical bottleneck.

TopBench is a benchmark for testing how well language models can answer questions about tables that require prediction and reasoning, not just data lookup. It includes 779 examples across tasks like forecasting values, analyzing treatment effects, and complex filtering—revealing that current models struggle to recognize when prediction is needed and often default to simple retrieval instead.

evaluation, reasoning, data

Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling

Apr 30, 2026

Ansar Aynetdinov, Patrick Haller, Alan Akbik

For non-English language models, aggressively filtering data for quality and repeating it multiple times beats training once on larger, diverse datasets—a practical insight for resource-constrained language model development.

This paper challenges the assumption that diverse data is always better for language model training. For German, the researchers found that repeatedly training on a smaller, high-quality filtered dataset outperforms training once on a larger, less-filtered dataset—even after 7 epochs of repetition.

training, data, efficiency

What Kind of Language is Easy to Language-Model Under Curriculum Learning?

Apr 29, 2026

Nadine El-Naggar, Tatsuki Kuribayashi, Ted Briscoe

Curriculum learning substantially changes language models' learning biases, suggesting that training order matters as much as model architecture when predicting which language structures are 'easy' to learn.

This paper investigates how curriculum learning—training language models on simpler sentences first rather than random order—affects which linguistic patterns models naturally learn.

training, data

Personalized Worked Example Generation from Student Code Submissions using Pattern-based Knowledge Components

Apr 27, 2026

Griffin Pitts, Muntasir Hoq, Peter Brusilovsky et al.

By extracting knowledge components from student code patterns, you can steer generative models to create personalized learning content that directly targets the logical errors students are making, rather than relying on generic pre-written examples.

This paper presents a system that automatically generates personalized worked examples for programming students based on their actual code submissions. Instead of using fixed example libraries, the system analyzes patterns in student errors using code structure analysis and uses these patterns to guide an AI model to create relevant examples that address each student's specific misconceptions.

applications, training, data

Learning to Think from Multiple Thinkers

Apr 27, 2026

Nirmit Joshi, Roey Magen, Nathan Srebro et al.

Learning from diverse reasoning traces is harder than learning from a single thinker, but you can overcome this by actively collecting reasoning data from many thinkers (logarithmic in target accuracy) combined with passive final-answer supervision.

This paper studies how AI models can learn from multiple people or programs solving the same problem in different ways (e.g., different math solutions or code implementations).

training, reasoning, data

Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis

Apr 27, 2026

Amal Akli, Mike Papadakis, Maxime Cordy et al.

Task description quality matters more than model size for reliable code generation—a small, fine-tuned classifier can detect problematic descriptions better than much larger models, and under-specification is the most critical defect type to watch for.

This paper introduces SpecValidator, a lightweight classifier that detects defects in task descriptions given to code-generating AI models. The tool identifies three types of problems (vague language, missing details, and formatting issues) and catches them far more reliably than larger models like GPT-4 mini or Claude.

evaluation, applications, data
evaluation, data

CRAFT: Clustered Regression for Adaptive Filtering of Training data

Apr 24, 2026

Parthasarathi Panda, Asheswari Swain, Subhrakanta Panda

You can select optimal training data 40x faster than competing methods by matching source distributions through clustering and target distributions through regression, without sacrificing quality.

CRAFT is a fast method for selecting high-quality training data subsets from massive datasets. It uses clustering and statistical matching to pick training examples whose target outputs align with your validation set, enabling efficient fine-tuning of translation models on millions of examples in under a minute.

data, training, efficiency

Mapping the Political Discourse in the Brazilian Chamber of Deputies: A Multi-Faceted Computational Approach

Apr 23, 2026

Flávio Soriano, Victoria F. Mello, Pedro B. Rigueira et al.

Political discourse analysis using NLP can reveal patterns invisible in voting records alone—like how stylistic shifts, topic priorities, and speaker alignments reflect deeper political structures beyond formal party lines.

This paper analyzes 450,000+ speeches from Brazil's Chamber of Deputies using computational methods to understand how politicians speak, what they discuss, and who aligns with whom rhetorically.

data, evaluation

Revealing Geography-Driven Signals in Zone-Level Claim Frequency Models: An Empirical Study using Environmental and Visual Predictors

Apr 23, 2026

Sherly Alfonso-Sánchez, Cristián Bravo, Kristina G. Stankova

Geographic representation matters more than model complexity for insurance risk prediction—simple coordinate + environmental feature combinations often outperform complex image-based approaches in zone-level claim frequency models.

This study shows how to improve motor insurance claim prediction by adding geographic data to standard actuarial models, even when location information is limited. Researchers tested environmental features from maps and satellite imagery on insurance claims data, finding that combining coordinates with environmental data works best, while image embeddings help when map data isn't available.

data, evaluation, applications

EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents

Apr 23, 2026

Praval Sharma, Ashok Samal, Leen-Kiat Soh et al.

This dataset enables building event extraction systems that work across diverse real-world documents and geographical contexts, moving beyond closed-domain limitations that plagued previous approaches.

EVENT5Ws is a large, manually annotated dataset for extracting key event information (who, what, when, where, why) from documents in open-domain settings.

data, evaluation, applications

Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms

Apr 23, 2026

Yuto Nishida, Naoki Shikoda, Yosuke Kishinami et al.

LLMs don't memorize facts in a surface-invariant way; their ability to answer factual questions depends heavily on which name or spelling variant you use for an entity, suggesting memorization is tied to specific linguistic forms encountered during training.

This paper investigates how large language models memorize facts by testing whether they can answer questions about the same entity using different names and spellings.

evaluation, data

SyMTRS: Benchmark Multi-Task Synthetic Dataset for Depth, Domain Adaptation and Super-Resolution in Aerial Imagery

Apr 23, 2026

Safouane El Ghazouali, Nicola Venturi, Michael Rueegsegger et al.

Synthetic data with perfect annotations can accelerate progress on multiple aerial imagery tasks simultaneously—depth, domain shift, and resolution—without the cost and difficulty of collecting real-world ground truth.

SyMTRS is a large synthetic dataset for aerial imagery that provides pixel-perfect depth maps, day/night image pairs, and multi-scale variants for training AI models on three tasks: depth estimation, domain adaptation, and super-resolution.

data, evaluation, multimodal

Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications

Apr 23, 2026

Yvon K. Awuklu, Meghyn Bienvenu, Katsumi Inoue et al.

You can build practical event detection systems using logical rules and constraint satisfaction that work efficiently on real timestamped data while handling conflicting inferences—demonstrated on medical records.

This paper presents a logic-based system for detecting high-level events from timestamped data, like inferring disease episodes from patient medical records. The system uses logical rules to identify events, handles conflicts between inferred events, and can run efficiently on real data while staying aligned with expert knowledge.

reasoning, data, applications

SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning

Apr 23, 2026

Hans Ole Hatzel, Ekaterina Artemova, Haimo Paul Stiemer et al.

Narrative similarity can be operationalized as a practical classification task, and LLM ensembles currently outperform other approaches, but there's significant room for improvement in how systems represent and compare story meanings.

This paper introduces a shared task on narrative similarity that asks systems to determine which of two stories is more similar to a reference story. The team collected over 1,000 annotated story triples and evaluated 71 submissions from 46 teams, finding that LLM ensembles performed best for classification while fine-tuned embedding models competed well with simpler approaches.

evaluation, data

Misinformation Span Detection in Videos via Audio Transcripts

Apr 23, 2026

Breno Matos, Rennan C. Lima, Savvas Zannettou et al.

Misinformation detection is more useful when you know *where* in a video the false claim occurs, not just *whether* it exists—this work enables fine-grained detection at the segment level rather than video level.

This paper tackles video misinformation by identifying exactly where false claims appear within videos. Instead of just labeling entire videos as true or false, researchers transcribed video audio and annotated which specific segments contain misinformation, creating two datasets with 500+ videos. They trained language models to pinpoint these problematic spans, achieving 68% F1 score.

safety, data, evaluation

AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

Apr 23, 2026

Tasnim Kabir, Dmytro Kurdydyk, Aadi Palnitkar et al.

Current audio AI models fail dramatically on genuine audio understanding tasks—they likely exploit dataset biases and metadata rather than actually listening to and reasoning about sound.

AUDITA is a new benchmark dataset with real-world audio and human-authored trivia questions designed to test whether AI models can truly understand audio content rather than relying on shortcuts. Humans answer correctly 32% of the time, but state-of-the-art models score below 9%, revealing a significant gap in audio reasoning capabilities.

evaluation, multimodal, data

PrismaDV: Automated Task-Aware Data Unit Test Generation

Apr 23, 2026

Hao Chen, Arnab Phani, Sebastian Schelter

Automated data testing works better when it understands the specific task consuming the data—PrismaDV shows you can generate more effective data unit tests by analyzing both dataset profiles and downstream application code together.

PrismaDV is an AI system that automatically generates data unit tests by analyzing both the dataset and the code that uses it. Unlike existing tools that test data in isolation, PrismaDV understands what the downstream application actually needs from the data, then creates tests that catch errors likely to break the application.

data, evaluation

FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels

Apr 22, 2026

Sina Gholami, Abdulmoneam Ali, Tania Haghighi et al.

By analyzing the spectral structure of feature representations, you can identify noisy labels in federated learning and use clean clients to help relabel corrupted data—without needing to share raw data or redesign loss functions.

FedSIR tackles a major challenge in federated learning: when training data across distributed devices contains mislabeled examples. The method identifies which devices have clean vs. noisy labels by analyzing the mathematical structure of their learned features, then uses clean devices to help noisy devices fix their labels.

trainingdataefficiency

Global Offshore Wind Infrastructure: Deployment and Operational Dynamics from Dense Sentinel-1 Time Series

Apr 22, 2026

Thorsten Hoeser, Felix Bachofer, Claudia Kuenzer

This is a benchmarking dataset for time series classification on real-world infrastructure monitoring—useful for developers building automated systems to track construction phases and operational patterns in offshore wind farms globally.

Researchers created a global dataset of radar time series tracking offshore wind farm construction and operation from 2016 to 2025 using satellite data. The dataset includes over 14 million radar measurements at wind infrastructure locations, baseline labels from an automated classifier, and expert-annotated examples to enable development of monitoring systems.

dataevaluation

FB-NLL: A Feature-Based Approach to Tackle Noisy Labels in Personalized Federated Learning

Apr 21, 2026

Abdulmoneam Ali, Ahmed Arafa

Instead of clustering users during training (vulnerable to noisy labels), group them upfront using feature covariance structure, then fix label errors by checking if examples align with learned feature subspaces.

FB-NLL tackles noisy labels in federated learning by clustering users based on feature geometry rather than training dynamics, then correcting mislabeled data using feature alignment. This one-shot approach avoids the communication overhead of iterative methods while handling low-quality data that typically corrupts personalized federated learning.

trainingdataefficiency

Blue Data Intelligence Layer: Streaming Data and Agents for Multi-source Multi-modal Data-Centric Applications

Apr 16, 2026

Moin Aminnaseri, Farima Fatahi Bayat, Nikita Bhutani et al.

Modern data systems need to treat LLMs, web search, and user context as first-class data sources alongside traditional databases, with intelligent agents orchestrating queries across all of them.

Blue's Data Intelligence Layer (DIL) is a system that lets users ask natural language questions across multiple data sources, websites, and knowledge bases—not just a single database.

agentsdataapplications

UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception

Apr 15, 2026

Ziming Wang

Adding LiDAR to wrist-mounted robot interfaces makes data collection more robust in real-world conditions, letting robots learn complex tasks like deformable object manipulation that were previously impossible with vision alone.

UMI-3D improves robot data collection by adding LiDAR to the Universal Manipulation Interface, replacing unreliable monocular vision with 3D spatial sensing. This enables robots to learn manipulation tasks in cluttered, dynamic environments where the original vision-only system failed, while keeping the system portable and affordable.

multimodalagentsdata

Many Ways to Be Fake: Benchmarking Fake News Detection Under Strategy-Driven AI Generation

Apr 10, 2026

Xinyu Wang, Sai Koneru, Wenbo Zhang et al.

Fake news detectors are vulnerable to strategically crafted mixed-truth content where falsehoods are woven into accurate narratives, not just fully fabricated stories—a realistic threat that current benchmarks don't adequately test.

This paper introduces MANYFAKE, a benchmark of 6,798 synthetic fake news articles created through AI-driven strategies to test how well fake news detectors handle realistic threats. Unlike simple fabricated stories, the benchmark focuses on mixed-truth cases where false claims are embedded in otherwise credible narratives—a pattern that emerges from human-AI collaboration.

evaluationsafetydata

SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

Apr 9, 2026

Yunsong Zhou, Hangxu Liu, Xuekun Jiang et al.

Physics-grounded simulation can replace expensive real-world data collection for deformable object manipulation—synthetic data from calibrated digital twins trains policies that work in the real world without additional real-world training.

SIM1 creates physics-accurate digital twins of deformable objects from real demonstrations, then generates synthetic training data through simulation to train robotic manipulation policies. The system achieves real-world performance comparable to policies trained on 15x more real data, solving the data scarcity problem for cloth and soft object manipulation.

dataefficiencyapplications

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

Apr 9, 2026

Jiayuan Ye, Vitaly Feldman, Kunal Talwar

Removing redundant or low-frequency facts from training data helps models memorize factual knowledge better, letting smaller models achieve the same fact accuracy as much larger ones.

This paper shows that LLMs struggle to memorize facts when training data contains too many facts or has skewed frequency distributions. The researchers propose a data pruning method that selects which facts to include in training, enabling smaller models to memorize significantly more facts. In their experiments, a GPT2-Small model trained on pruned data matched a 10x larger model trained on the full data.

trainingdataefficiency

Formalizing building-up constructions of self-dual codes through isotropic lines in Lean

Apr 9, 2026

Jae-Hyun Baek, Jon-Lark Kim

Researchers can now use formally verified methods to construct and optimize error-correcting codes, with the Lean formalization ensuring mathematical correctness of the underlying algebraic structures.

This paper formalizes mathematical constructions for building self-dual error-correcting codes using Lean 4. It proves that two different construction methods are equivalent and extends them to work with different field sizes, enabling the discovery of new optimal codes with verified correctness.

evaluationdata

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

Apr 9, 2026

Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh et al.

For RL-based reasoning training, which tasks you select for training matters more than how many tasks you use—task-specific selection outperforms averaging strategies, and this insight can guide practical data curation for extending RL to general reasoning domains.

SUPERNOVA is a data curation framework that helps language models learn general reasoning skills (like causal inference and temporal understanding) through reinforcement learning.

trainingreasoningdata

AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages

Apr 9, 2026

Lilian Wanzare, Cynthia Amol, Ezekiel Maina et al.

This dataset solves a major gap: most speech AI is trained on English and a few European languages, leaving African languages behind. AfriVoices-KE provides the foundation needed to build fair, inclusive speech technology for Kenya.

AfriVoices-KE is a 3,000-hour multilingual speech dataset covering five Kenyan languages with recordings from nearly 5,000 native speakers. It combines scripted and spontaneous speech to enable building speech technology (like voice assistants and transcription tools) for underrepresented African languages.

dataapplications

Learning Who Disagrees: Demographic Importance Weighting for Modeling Annotator Distributions with DiADEM

Apr 9, 2026

Samay U. Shetty, Tharindu Cyril Weerasooriya, Deepak Pandita et al.

Modeling annotator demographics explicitly—not just their labels—is crucial for NLP systems handling subjective tasks. DiADEM shows that race and age consistently predict disagreement patterns better than treating all annotators as interchangeable.

When people label subjective content like offensive speech, they disagree—and that disagreement matters. This paper introduces DiADEM, a neural model that learns which demographic factors (race, age, etc.) drive annotator disagreement, rather than flattening diverse perspectives into a single label. DiADEM outperforms LLMs and standard models at predicting who will disagree and why.

evaluationdataalignment

Synthetic Data for any Differentiable Target

Apr 9, 2026

Tristan Thrush, Sung Min Park, Herman Brunborg et al.

You can precisely control what a language model learns by automatically generating synthetic training data optimized for your exact objectives, without modifying the model architecture or training process itself.

Researchers developed Dataset Policy Gradient (DPG), a technique that uses reinforcement learning to automatically generate synthetic training data optimized for any measurable goal.

trainingdatareasoning

ADAPTive Input Training for Many-to-One Pre-Training on Time-Series Classification

Apr 9, 2026

Paul Quinlan, Qingguo Li, Xiaodan Zhu

ADAPT solves a critical problem in time-series AI: you can now pre-train on many diverse datasets together instead of just one, making it possible to build generalist foundation models that work across different time-series domains.

This paper introduces ADAPT, a new pre-training method that lets AI models learn from many different time-series datasets simultaneously, even when those datasets have different sizes and structures. By aligning the physical properties of diverse time-series data, the approach enables training a single foundation model on 162 datasets at once—something previous methods couldn't do well.

datascaling

RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild

Apr 8, 2026

Wenjing Margaret Mao, Jefferson Ng, Luyang Hu et al.

Hybrid sensor fusion (IMUs + egocentric vision) enables robust, portable human motion capture in uncontrolled environments—critical for scaling robot learning with real-world human demonstrations.

RoSHI is a wearable system that combines IMU sensors with AR glasses to capture full-body human motion in real-world settings. By fusing inertial measurements with egocentric camera data, it creates accurate 3D pose estimates that work even when body parts are hidden or moving fast, making it practical for collecting robot training data from human activities.

dataapplicationsmultimodal

Beyond Loss Values: Robust Dynamic Pruning via Loss Trajectory Alignment

Apr 8, 2026

Huaiyuan Qin, Muli Yang, Gabriel James Goenawan et al.

When pruning training data with label noise, tracking how loss changes over time is more reliable than using instantaneous loss values for identifying bad samples.

AlignPrune is a plug-and-play module that improves dynamic data pruning when training data contains mislabeled examples. Instead of relying on individual sample loss values (which can be misleading with noisy labels), it tracks how a sample's loss changes over training time to better identify which samples to keep or discard, achieving up to 6.3% accuracy improvements.
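The idea of ranking samples by how their loss evolves, rather than by a single loss value, can be sketched as follows. This is not the AlignPrune criterion itself: the mean trajectory used as a reference, the Pearson correlation used as the alignment score, and the names `alignment_scores` and `keep_aligned` are all assumptions made for illustration.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def alignment_scores(loss_history):
    """Score each sample by how well its loss trajectory tracks the mean trajectory.

    loss_history maps sample id -> list of per-epoch losses (all the same length).
    Using the mean trajectory as the reference is an assumption of this sketch.
    """
    n_epochs = len(next(iter(loss_history.values())))
    reference = [
        mean(traj[epoch] for traj in loss_history.values())
        for epoch in range(n_epochs)
    ]
    return {sid: pearson(traj, reference) for sid, traj in loss_history.items()}


def keep_aligned(loss_history, keep_ratio):
    """Keep the keep_ratio fraction of samples with the highest alignment score."""
    scores = alignment_scores(loss_history)
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = max(1, round(keep_ratio * len(ranked)))
    return set(ranked[:n_keep])


history = {
    "clean_a": [2.0, 1.2, 0.6, 0.3],
    "clean_b": [1.8, 1.0, 0.5, 0.2],
    "noisy": [1.0, 1.6, 2.1, 2.4],  # loss rises as training proceeds
}
kept = keep_aligned(history, keep_ratio=2 / 3)
```

In this toy history the `noisy` sample has the lowest loss at the first epoch, so a rule based on that single value would favor it, while the trajectory-based ranking drops it.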

trainingdataefficiency

OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

Apr 8, 2026

Jianhui Liu, Haoze Sun, Wenbo Li et al.

A principled data engine with 3 million spatial samples significantly improves model performance on spatial reasoning tasks—showing 19% average improvement—and provides the infrastructure needed to build spatially-aware AI systems.

OpenSpatial is an open-source system for generating high-quality spatial data at scale. It uses 3D bounding boxes to create a dataset of 3 million samples across five spatial reasoning tasks, enabling AI models to better understand 3D scenes, measurements, relationships, and camera perspectives.

dataevaluation

Topological Characterization of Churn Flow and Unsupervised Correction to the Wu Flow-Regime Map in Small-Diameter Vertical Pipes

Apr 7, 2026

Brady Koenig, Sushovan Majhi, Atish Mitra et al.

Topological features (Euler Characteristic Surfaces) can automatically characterize complex physical phenomena and outperform traditional mechanistic models, even without labeled training data—showing that unsupervised topology-based methods can discover and correct scientific models.

Researchers developed the first mathematical definition of churn flow—a chaotic two-phase flow regime in vertical pipes—using topological analysis. They created an unsupervised learning system that combines topology-derived features with machine learning to automatically classify flow regimes and correct existing prediction models, achieving 95.6% accuracy without labeled training data.

dataevaluationreasoning

Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization

Apr 7, 2026

Yanis Labrak, David Grünert, Séverin Baroudi et al.

Long-form audio reasoning needs better training data and evaluation benchmarks; synthetic generation with realistic audio characteristics can provide both, and traditional cascaded pipelines (speech-to-text then summarization) still beat end-to-end models on this task.

Researchers created a synthetic dataset of 8,800 doctor-patient conversations (1.3k hours of audio) to train and evaluate AI systems on long-form audio understanding. The pipeline generates realistic dialogues, synthesizes multi-speaker audio with background noise, and produces medical summaries (SOAP notes) as reference outputs—all using open-source models.

dataevaluationapplications

CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech

Apr 2, 2026

Youssef Saidi, Haroun Elleuch, Fethi Bougares

End-to-end speech-to-entity models substantially outperform cascaded ASR+NER pipelines for Arabic, and multilingual pretraining transfers better than Arabic-specific pretraining for this low-resource task.

This paper introduces CV-18 NER, the first dataset for extracting named entities directly from Arabic speech. The researchers created 21 entity types by annotating the Arabic Common Voice corpus, then compared end-to-end speech models (Whisper, AraBEST-RQ) against traditional pipelines that first transcribe speech then extract entities.

data

Adam's Law: Textual Frequency Law on Large Language Models

Apr 2, 2026

Hongyuan Adam Lu, Z. L., Victor Wei et al.

LLMs perform better when trained on and prompted with more frequently-occurring textual patterns, similar to how humans read faster with common words—this simple principle can boost performance across multiple tasks.

This paper studies how word frequency in text affects large language model performance. The authors propose three techniques: using more frequent phrasings in prompts, generating training data with common expressions, and training models on increasingly frequent text. Tests on math, translation, reasoning, and tool-use tasks show these frequency-based approaches improve results.

trainingdata

Do Lexical and Contextual Coreference Resolution Systems Degrade Differently under Mention Noise? An Empirical Study on Scientific Software Mentions

Apr 2, 2026

Atilla Kaan Alkan, Felix Grezes, Jennifer Lynn Bartlett et al.

When building coreference systems for software mentions, choose between lexical and contextual methods based on your upstream noise type and corpus size: embeddings handle boundary noise better and scale linearly, while string matching degrades more gracefully under substitution errors.

This paper compares two approaches for identifying when software names refer to the same project across documents: a simple string-matching method and an embedding-based approach. Testing on noisy data shows they fail in different ways—embeddings handle boundary errors better, while string matching handles substitution errors better—and embeddings scale more efficiently to large datasets.

evaluationdata

AstroConcepts: A Large-Scale Multi-Label Classification Corpus for Astrophysics

Apr 2, 2026

Atilla Kaan Alkan, Felix Grezes, Sergi Blanco-Cuaresma et al.

When classifying scientific text with thousands of rare concepts, vocabulary-constrained LLMs perform competitively with specialized models, suggesting you don't always need heavy domain adaptation—but frequency-stratified evaluation is critical to spot performance gaps hidden by aggregate metr...

AstroConcepts is a dataset of 21,702 astrophysics paper abstracts labeled with 2,367 specialized astronomy concepts, designed to study extreme class imbalance in scientific text classification.

dataevaluationapplications

ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget

Apr 1, 2026

Nandan Thakur, Zijian Chen, Xueguang Ma et al.

You can build high-quality training data for search agents using synthetic generation and verification without expensive human annotation or API costs, enabling smaller models to compete with larger ones.

ORBIT is a dataset of 20,000 reasoning-heavy questions with verifiable answers, created cheaply without paid APIs. The authors built a four-stage pipeline (seed creation, question generation, self-verification, external verification) to generate training data for search agents—AI systems that combine language models with web search.

datatrainingagents

A Dataset and Resources for Identifying Patient Health Literacy Information from Clinical Notes

Mar 19, 2026

Madeline Bittner, Dina Demner-Fushman, Yasmeen Shabazz et al.

Automated health literacy detection from clinical notes is now possible with HEALIX, a curated dataset that could help clinicians identify patients needing extra support without adding screening burden.

Researchers created HEALIX, the first public dataset of 589 clinical notes annotated for patient health literacy levels (low, normal, high). Health literacy—a patient's ability to understand medical information—affects treatment outcomes, but current screening tools are impractical.

dataapplicationsevaluation

Toward Scalable Automated Repository-Level Datasets for Software Vulnerability Detection

Mar 18, 2026

Amine Lbath

Automated vulnerability injection with proof-of-concept exploits can scale up realistic training datasets for repository-level security detection, moving beyond function-level benchmarks to test how AI handles real-world code complexity.

This research creates an automated system to generate large-scale datasets for training AI models to detect software vulnerabilities in real code repositories.

datasafetyagents

ConGA: Guidelines for Contextual Gender Annotation. A Framework for Annotating Gender in Machine Translation

Mar 18, 2026

Argentina Anna Rescigno, Eva Vanmassenhove, Johanna Monti

Machine translation systems have systematic gender bias—they default to masculine forms when translating from English to gendered languages. This paper provides annotation guidelines and a benchmark dataset to measure and fix this problem.

This paper introduces ConGA, a framework for annotating gender in machine translation to address how systems handle gender when translating from gender-neutral languages (like English) to gendered ones (like Italian).

dataevaluationalignment

ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K

Mar 17, 2026

Kaixuan Wang, Tianxing Chen, Jiawei Liu et al.

Having diverse, high-quality 3D assets at scale dramatically improves robot learning in simulation—this dataset removes a major bottleneck for scaling robotic manipulation training.

ManiTwin is an automated pipeline that converts single images into simulation-ready 3D digital objects for robot training. The team created ManiTwin-100K, a dataset of 100,000 annotated 3D assets with physical properties and manipulation instructions, enabling large-scale generation of robot training data in simulation.

dataapplicationstraining

Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory

Mar 17, 2026

Sahil Sen, Elias Lumer, Anmol Gulati et al.

Structuring long conversation histories as timestamped events with intelligent retrieval guidance lets AI agents accurately answer complex questions about what happened weeks or months ago—critical for building chatbots that remember user preferences and history over extended periods.

Chronos is a memory system for AI chatbots that tracks conversations over months by breaking down dialogue into timestamped events and organizing them in structured calendars. When answering questions about past conversations, it uses dynamic prompts to guide retrieval across time ranges and handle complex multi-step reasoning, achieving 95.6% accuracy on long-term memory tasks.

agentsreasoningdata

OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

Mar 16, 2026

Yuwen Du, Rui Ye, Shuo Tang et al.

You can now build frontier-level search agents without proprietary data—OpenSeeker proves that smart data synthesis (not scale) is the bottleneck, and releases everything needed to replicate it.

OpenSeeker is a fully open-source search agent that achieves state-of-the-art performance by synthesizing high-quality training data through two techniques: generating complex multi-hop reasoning tasks by reverse-engineering web graphs, and denoising agent trajectories using summarization.

agentsdatareasoning

SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning

Mar 12, 2026

Ziyu Chen, Yilun Zhao, Chengye Wang et al.

Training multimodal models on scientific documents requires balancing synthetic data quality with real-world document complexity—this dataset achieves that by synthesizing faithful QA pairs then re-embedding them into full papers.

This paper introduces SciMDR, a dataset of 300K question-answer pairs across 20K scientific papers designed to train AI models on understanding complex scientific documents with both text and images. The dataset uses a two-stage process: first generating focused QA pairs with reasoning chains, then embedding them into full documents to maintain realistic complexity.

multimodaldataevaluation

STAMP: Selective Task-Aware Mechanism for Text Privacy

Mar 12, 2026

Fengwei Tian, Payel Bhattacharjee, Heidi Hanson et al.

By combining task-aware importance scoring with privacy sensitivity detection, STAMP achieves better privacy-utility trade-offs than uniform noise approaches—meaning you can protect sensitive data without sacrificing model performance.

STAMP is a privacy framework that protects sensitive information in text while keeping it useful for AI tasks. It decides which parts of the text need stronger protection (such as names and dates) and which are less sensitive, then applies targeted noise to embeddings using a novel 'polar mechanism' that preserves semantic meaning better than traditional approaches.

safetydataefficiency

QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions

Mar 12, 2026

Jiayin Lei, Ming Ma, Yunxi Duan et al.

When training on synthetic code data, filtering by reverse semantic coherence (can the answer predict the question?) is more effective at removing noise than forward metrics, letting you use 75% less data without losing model quality.

This paper introduces QAQ, a method for filtering noisy synthetic code training data by measuring bidirectional semantic coherence—checking not just if a model can generate answers from questions, but also if answers can predict back to questions. By selecting only 25% of data with the highest quality scores, the approach matches full-dataset performance while cutting computational costs.
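The selection step can be sketched as a simple top-fraction filter over per-sample coherence scores. How QAQ actually computes its score is not described in this summary, so the `reverse_score` field below (standing in for "how well the answer predicts the question") and the function name `select_top_fraction` are hypothetical.

```python
def select_top_fraction(samples, fraction=0.25, key="reverse_score"):
    """Keep the highest-scoring fraction of samples.

    `samples` is a list of dicts; `key` names a precomputed quality score.
    The score itself is a placeholder here, not the paper's metric.
    """
    ranked = sorted(samples, key=lambda s: s[key], reverse=True)
    n_keep = max(1, int(len(ranked) * fraction))
    return ranked[:n_keep]


# Eight synthetic instruction/answer pairs with made-up coherence scores.
pool = [
    {"id": "s1", "reverse_score": 0.91},
    {"id": "s2", "reverse_score": 0.12},
    {"id": "s3", "reverse_score": 0.55},
    {"id": "s4", "reverse_score": 0.78},
    {"id": "s5", "reverse_score": 0.33},
    {"id": "s6", "reverse_score": 0.40},
    {"id": "s7", "reverse_score": 0.05},
    {"id": "s8", "reverse_score": 0.61},
]
selected = select_top_fraction(pool, fraction=0.25)
selected_ids = [s["id"] for s in selected]
```

With eight candidates and a 25% budget, only the two highest-scoring pairs are kept for training.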

datatraining

A Dataset is Worth 1 MB

Feb 26, 2026

Elad Kimchi Shoshani, Leeyam Gabay, Yedid Hoshen

You can teach models new tasks by transmitting just labels instead of data, if clients have a generic reference dataset pre-loaded.

Instead of sending large datasets over the network, this paper proposes sending only class labels for images from a reference dataset that clients already have locally. A smart filtering mechanism picks which images are most relevant to the new task, reducing communication to under 1 MB while maintaining accuracy.

efficiencydatatraining

Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning

Feb 26, 2026

Amita Kamath, Jack Hessel, Khyathi Chandu et al.

Bigger models and more data won't automatically teach reasoning skills if your training data has systematic blind spots—you need intentional data...

Vision-language models struggle with reasoning tasks like counting and spatial understanding not because they're too small, but because their training data is biased toward how people naturally talk about images—omitting obvious details.

dataevaluationreasoning

ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation

Feb 26, 2026

Ayush Roy, Wei-Yang Alex Lee, Rudrasis Chakraborty et al.

You can create smaller datasets that preserve large dataset knowledge using pre-trained diffusion models with geometric guidance—no retraining ne...

This paper introduces ManifoldGD, a method to create smaller, representative datasets from large ones using diffusion models without any training. Instead of simple guidance, it uses geometric manifold structures to ensure generated synthetic data captures both broad concepts and fine details, resulting in better quality distilled datasets with fewer images.

dataefficiency

Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments

Feb 26, 2026

Evangelia Christakopoulou, Vivekkumar Patel, Hemanth Velaga et al.

A smaller, specialized AI model can generate better training data than a giant pre-trained one, unlocking real improvements in production systems.

The researchers used fine-tuned AI models to generate millions of relevance labels for app search results, addressing a shortage of human-labeled training data. Combining these AI-generated labels with user behavior signals improved App Store ranking, especially for unpopular searches where user clicks are rare.

trainingapplicationsdata

Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

Feb 26, 2026

Pengxiang Li, Dilxat Muhtar, Lu Yin et al.

Training data structure, not model architecture, is why parallel language models revert to sequential generation—fix the training data to unlock ...

Diffusion language models promise faster parallel text generation, but they often end up generating tokens one-at-a-time like traditional models. This paper shows the problem is how models are trained—sequential training data pushes them toward sequential generation.

trainingefficiencydata

ColoDiff: Integrating Dynamic Consistency With Content Awareness for Colonoscopy Video Generation

Feb 26, 2026

Junhu Fu, Shuyu Liang, Wutong Li et al.

Synthetic colonoscopy videos can now be generated with enough quality and control to help with doctor training and disease diagnosis in data-scarce...

ColoDiff generates realistic colonoscopy videos using AI to help doctors train and diagnose intestinal diseases when real patient data is limited. It uses a technique called diffusion to create videos with smooth motion and precise control over medical details like disease type and imaging quality.

multimodalapplicationsdata