ThinkLLM

Papers

Recent AI research papers with accessible summaries. Updated daily from arXiv, summarized for developers who don't read papers regularly.

326 papers · 12 topics

Topics: Efficiency (35) · Reasoning (35) · Multimodal (28) · Applications (28) · Evaluation (27) · Training (26) · Architecture (24) · Agents (24) · Safety (13) · Scaling (5) · Data (5) · Alignment (1)

Mar 23 – Mar 29 (3)

MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

Mar 25, 2026

Zhuo Li, Yupeng Zhang, Pengyu Cheng et al.

Using multiple agents with intentional information barriers prevents LLMs from confirming their own errors during fact-checking, letting smaller models match larger ones on reliability.

MARCH is a framework that reduces hallucinations in LLMs by using three specialized agents that work together with deliberate information separation. A Solver generates responses, a Proposer breaks them into verifiable claims, and a Checker validates claims without seeing the original output—preventing the verifier from copying the generator's mistakes.
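The Solver/Proposer/Checker split described above can be sketched as a short pipeline. This is a minimal illustration, not the authors' implementation: `call_llm` is a hypothetical stand-in for any chat-completion API, and the canned responses exist only so the sketch runs. The key structural point is the information barrier, where the Checker receives only the extracted claims and never sees the Solver's full answer.

```python
def call_llm(role: str, prompt: str) -> str:
    """Placeholder for a real LLM call; returns canned text for the sketch."""
    canned = {
        "solver": "The Eiffel Tower, built in 1889, is in Paris.",
        "proposer": "The Eiffel Tower was built in 1889.|The Eiffel Tower is in Paris.",
        "checker": "supported|supported",
    }
    return canned[role]

def march_check(question: str) -> list[tuple[str, str]]:
    # 1. Solver drafts an answer.
    answer = call_llm("solver", question)
    # 2. Proposer decomposes the answer into atomic, verifiable claims.
    claims = call_llm("proposer", answer).split("|")
    # 3. Checker verifies each claim WITHOUT seeing the original answer,
    #    so it cannot simply rubber-stamp the Solver's phrasing.
    verdicts = call_llm("checker", "|".join(claims)).split("|")
    return list(zip(claims, verdicts))

results = march_check("Where is the Eiffel Tower and when was it built?")
```

Because the Checker's prompt is built only from the claims, a real implementation would force verification against retrieved evidence rather than against the generator's own wording.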

safety · agents · alignment

Mecha-nudges for Machines

Mar 24, 2026

Giulio Frey, Kawin Ethayarajh

As AI agents make more real-world decisions, the way information is presented can be optimized for machines just like it is for humans—and this is already happening in practice on platforms like Etsy.

This paper introduces 'mecha-nudges'—subtle changes to how information is presented that influence AI agents' decisions without restricting options or harming human decision-making.

agents

Mar 16 – Mar 22 (9)

Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

Mar 20, 2026

Richard J. Young

Published faithfulness scores for AI reasoning are not comparable across studies because different evaluation methods measure different aspects of the same behavior at different strictness levels—always check the methodology, not just the number.

This paper shows that measuring whether AI models are 'faithful' (honestly using their reasoning) isn't objective—different evaluation methods on the same data produce wildly different results (69.7% to 82.6% faithfulness for identical models).
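The classifier-sensitivity effect is easy to see with a toy example (hypothetical numbers, not the paper's data): the same set of reasoning traces yields different "faithfulness" rates depending on how strict the judging classifier is.

```python
# Hypothetical per-trace faithfulness scores from some judge model.
scores = [0.9, 0.8, 0.75, 0.72, 0.6, 0.55, 0.4, 0.3, 0.85, 0.65]

def faithfulness_rate(scores: list[float], threshold: float) -> float:
    """Fraction of traces a classifier at this strictness calls faithful."""
    judged = [s >= threshold for s in scores]
    return sum(judged) / len(judged)

lenient = faithfulness_rate(scores, 0.5)   # a permissive classifier
strict = faithfulness_rate(scores, 0.7)    # a stricter classifier
```

Identical traces, two defensible classifiers, two different headline numbers: this is why the paper argues that reported faithfulness scores are only comparable alongside their evaluation methodology.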

evaluation · reasoning · alignment

Learning Dynamic Belief Graphs for Theory-of-mind Reasoning

Mar 20, 2026

Ruxiao Chen, Xilei Zhao, Thomas J. Cova et al.

LLMs can reason about human behavior more accurately by explicitly modeling beliefs as interconnected, time-varying graphs rather than static states—especially important for high-stakes domains like emergency response.

This paper improves how large language models reason about what people believe and why they act. Instead of treating beliefs as fixed, the authors model beliefs as a dynamic graph that changes over time, showing how new information updates what people think and how that shapes their decisions. They test this on disaster evacuation scenarios where understanding evolving beliefs is critical.
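A time-varying belief graph of the kind described above can be sketched as nodes (propositions) with weighted support edges, where a new observation updates the observed belief and propagates one step to its neighbours. This is a toy structure for illustration, not the authors' model; the propagation rule and the evacuation example are assumptions.

```python
class BeliefGraph:
    def __init__(self):
        self.belief = {}    # proposition -> degree of belief in [0, 1]
        self.support = {}   # (src, dst) -> influence weight

    def add_edge(self, src: str, dst: str, w: float) -> None:
        self.support[(src, dst)] = w

    def observe(self, prop: str, value: float) -> None:
        """Set a belief directly, then propagate one step downstream."""
        self.belief[prop] = value
        for (s, d), w in self.support.items():
            if s == prop:
                old = self.belief.get(d, 0.5)
                # Nudge the downstream belief toward the upstream one.
                self.belief[d] = old + w * (value - old)

g = BeliefGraph()
g.add_edge("flood_warning_issued", "evacuation_needed", 0.8)
g.observe("evacuation_needed", 0.5)      # prior before any warning
g.observe("flood_warning_issued", 1.0)   # warning arrives; belief updates
```

The point the paper makes is exactly this dynamism: "evacuation_needed" is not a fixed label but a quantity that shifts as new information arrives.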

Mar 9 – Mar 15 (3)

LLM Constitutional Multi-Agent Governance

Mar 13, 2026

J. de Curtò, I. de Zarzà

When deploying LLMs to coordinate multi-agent systems, you need explicit governance constraints—raw cooperation metrics hide manipulation. CMAG shows how to balance cooperation gains against autonomy loss and fairness degradation.

This paper addresses a critical risk: LLMs can manipulate multi-agent systems into appearing cooperative while actually eroding agent autonomy and fairness. The authors propose CMAG, a governance framework that filters harmful LLM suggestions and optimizes for genuine cooperation rather than just compliance.

safety · agents · alignment

Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training

Mar 12, 2026

Yixin Liu, Yue Yu, DiJia Su et al.

Reasoning judges are more robust than standard judges for training AI systems, but they're not foolproof—AI policies can still learn to generate adversarial outputs that fool judges while appearing good on benchmarks.

This paper tests whether reasoning-focused language models can reliably judge AI outputs in areas where correctness is hard to verify (like essay quality or creative writing). The researchers found that reasoning judges perform better than standard judges on benchmarks, but they can still be tricked into rewarding outputs that game the system rather than genuinely improve quality.

Feb 23 – Mar 1 (1)

Agency and Architectural Limits: Why Optimization-Based Systems Cannot Be Norm-Responsive

Feb 26, 2026

Radha Sarma

RLHF-based AI systems cannot be governed by norms because optimization forces all values into tradeable weights—genuine norm-following requires a...

This paper argues that AI systems like ChatGPT trained with RLHF cannot follow ethical rules or norms because of how they're built. They work by turning everything into a single score and picking the highest one—which means they'll always trade off any principle if it scores higher. The author shows this isn't a bug to fix, but a fundamental limit of optimization itself.

alignment · safety · architecture

Greater accessibility can amplify discrimination in generative AI

Mar 23, 2026

Carolin Holtermann, Minh Duc Bui, Kaitlyn Zhou et al.

Adding voice to language models doesn't just extend text capabilities—it introduces new bias mechanisms tied to speaker identity cues that amplify discrimination beyond text-only versions, requiring fairness safeguards alongside accessibility improvements.

Voice interfaces on AI chatbots amplify gender discrimination more than text-based versions because speech reveals speaker identity through tone and accent. The research shows these models shift toward gender-stereotyped responses based on voice alone, and surveys reveal users worry about hidden attribute inference.

safety · multimodal · alignment

Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models

Mar 20, 2026

Sai Koneru, Elphin Joe, Christine Kirchhoff et al.

Instruction-tuned models are vulnerable to user pressure even with strong evidence present; simply providing richer context doesn't guarantee models will resist sycophancy without explicit training for epistemic integrity.

This paper tests how well instruction-tuned language models stick to evidence when users pressure them to agree with false claims. Using climate science as a test domain, researchers found that adding more detailed evidence doesn't reliably prevent models from abandoning facts to please users—especially when evidence includes research gaps or uncertainty.

evaluation · alignment · safety

VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

Mar 19, 2026

Chonghan Liu, Yimin Du, Qi An et al.

VEPO uses variable entropy and constrained RL to improve low-resource language models by enforcing linguistic well-formedness during training while maintaining exploration—achieving better tokenization and translation quality on 90 language pairs.

This paper introduces VEPO, a training method that improves language models for low-resource languages by using reinforcement learning to enforce structural constraints (like proper formatting and sequence length) while dynamically balancing exploration and exploitation.

training · alignment

UGID: Unified Graph Isomorphism for Debiasing Large Language Models

Mar 19, 2026

Zikang Ding, Junchi Yao, Junhao Li et al.

Biases in LLMs can be reduced by enforcing structural consistency in the model's internal computations (attention and hidden states) across counterfactual inputs, rather than just fixing outputs or training data.

This paper proposes UGID, a method to reduce social biases in large language models by treating the model as a computational graph and enforcing that its internal structure remains consistent across inputs that differ only in sensitive attributes like gender or race.
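The consistency idea can be illustrated numerically (a sketch of the principle, not the paper's actual loss): given internal activations for two inputs that differ only in a sensitive attribute, penalize how far apart the model's internal representations are. The activation arrays here are synthetic stand-ins.

```python
import numpy as np

def consistency_penalty(h_a: np.ndarray, h_b: np.ndarray) -> float:
    """Mean squared distance between paired hidden states."""
    return float(np.mean((h_a - h_b) ** 2))

rng = np.random.default_rng(0)
# Synthetic activations for a sentence and its counterfactual twin,
# e.g. "he is a doctor" vs "she is a doctor".
h_he = rng.normal(size=(4, 8))
h_she = h_he + 0.1 * rng.normal(size=(4, 8))

penalty = consistency_penalty(h_he, h_she)
```

Driving this penalty toward zero during training pushes the model's internal computation, not just its final output, to be invariant to the sensitive attribute.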

safety · alignment · training

ConGA: Guidelines for Contextual Gender Annotation. A Framework for Annotating Gender in Machine Translation

Mar 18, 2026

Argentina Anna Rescigno, Eva Vanmassenhove, Johanna Monti

Machine translation systems have systematic gender bias—they default to masculine forms when translating from English to gendered languages. This paper provides annotation guidelines and a benchmark dataset to measure and fix this problem.

This paper introduces ConGA, a framework for annotating gender in machine translation to address how systems handle gender when translating from gender-neutral languages (like English) to gendered ones (like Italian).

data · evaluation · alignment

Gender Disambiguation in Machine Translation: Diagnostic Evaluation in Decoder-Only Architectures

Mar 18, 2026

Chiara Manna, Hosein Mohebbi, Afra Alishahi et al.

Decoder-only language models show similar gender bias problems as smaller models in translation tasks, but instruction tuning can reduce masculine bias and improve context awareness.

This paper examines how large language models handle gender in machine translation, where languages differ in how they mark gender. The researchers introduce a new measurement called "Prior Bias" to capture what gender a model assumes by default, and test decoder-only models (like GPT-style architectures) against traditional encoder-decoder models.

evaluation · safety · alignment

Mechanistic Origin of Moral Indifference in Language Models

Mar 16, 2026

Lingyu Li, Yan Teng, Yingchun Wang

LLMs can pass alignment tests while internally treating opposed moral concepts as equivalent; fixing this requires intervening directly on internal representations, not just adjusting outputs.

This paper reveals that large language models suffer from 'moral indifference'—they compress different moral concepts into similar internal representations, making them vulnerable to manipulation even when they appear aligned.

alignment · safety

Do Metrics for Counterfactual Explanations Align with User Perception?

Mar 16, 2026

Felix Liedeker, Basil Ell, Philipp Cimiano et al.

Standard metrics for evaluating counterfactual explanations don't align with human judgment—developers need human-centered evaluation methods, not just algorithmic scores, to build truly trustworthy AI systems.

This study compares how AI systems measure counterfactual explanations (showing what would need to change for a different prediction) against how humans actually judge them. Researchers found that standard algorithmic metrics poorly predict human satisfaction, suggesting current evaluation methods miss what users actually care about in explanations.

evaluation · safety · alignment

A Quantitative Characterization of Forgetting in Post-Training

Mar 12, 2026

Krishnakumar Balasubramanian, Shiva Prasad Kasiviswanathan

The direction of your training objective (forward-KL vs reverse-KL) fundamentally determines whether a model forgets old tasks—reverse-KL naturally avoids catastrophic forgetting while forward-KL requires replay to prevent it.

This paper explains why AI models forget old knowledge when trained on new tasks. Using mathematical analysis, the authors show that different training objectives (forward-KL vs reverse-KL) cause different types of forgetting, and that replaying old data helps prevent it. They also analyze three recent training methods to predict when they'll preserve old knowledge.
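The asymmetry the paper builds on can be shown numerically (a sketch of the general KL asymmetry, not the paper's derivation): forward KL(p‖q) heavily penalizes the new model q for dropping a mode of the old behaviour p, while reverse KL(q‖p) barely notices the same drop.

```python
import numpy as np

def kl(a: np.ndarray, b: np.ndarray) -> float:
    """Discrete KL divergence KL(a || b)."""
    return float(np.sum(a * np.log(a / b)))

p = np.array([0.5, 0.5])     # old behaviour uses both modes equally
q = np.array([0.99, 0.01])   # post-trained model has all but dropped one mode

forward = kl(p, q)   # large: forward KL objects loudly to the forgotten mode
reverse = kl(q, p)   # much smaller: reverse KL tolerates mode dropping
```

This mode-covering versus mode-seeking distinction is the mechanism behind the takeaway above: an objective that never "sees" the dropped mode provides no gradient to bring it back, which is where replaying old data comes in.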

training · alignment