Vibe-testing (informal, personalized evaluation) is how real users actually judge LLMs, and formalizing it with personalized prompts and user-aware judging criteria can predict practical usefulness better than standard benchmarks.
Users often evaluate LLMs informally by testing them on tasks relevant to their own work, a practice called 'vibe-testing.' This paper studies how vibe-testing actually works by analyzing user surveys and real-world model comparisons, then formalizes it as a two-step process: personalizing what to test and personalizing how to judge the results.
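To make the two-step process concrete, here is a minimal sketch of what a formalized vibe-test could look like in code. Everything here is illustrative and assumed, not the paper's implementation: the `UserProfile` fields, the prompt templates, and the `judge` callable are hypothetical stand-ins for a user description, a personalized test set, and an LLM judge with user-aware criteria.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class UserProfile:
    role: str              # e.g. "data journalist"
    tasks: list[str]       # tasks the user actually cares about
    priorities: list[str]  # what "good" means to this user (tone, brevity, ...)


def make_personalized_prompts(profile: UserProfile) -> list[str]:
    """Step 1: personalize WHAT to test -- turn the user's own tasks into prompts."""
    return [f"As a {profile.role}, {task}" for task in profile.tasks]


def judge_with_user_criteria(
    prompt: str,
    response: str,
    profile: UserProfile,
    judge: Callable[[str], str],
) -> str:
    """Step 2: personalize HOW to judge -- score the response against the user's own priorities."""
    criteria = "; ".join(profile.priorities)
    rubric = (
        f"Prompt: {prompt}\nResponse: {response}\n"
        f"Rate 1-5 for how useful this is to a {profile.role} who values: {criteria}."
    )
    return judge(rubric)


if __name__ == "__main__":
    profile = UserProfile(
        role="data journalist",
        tasks=["summarize this 40-page city budget for a general audience"],
        priorities=["concrete numbers", "plain language", "no speculation"],
    )
    model = lambda p: f"(model output for: {p[:40]}...)"  # stand-in for the LLM under test
    judge = lambda rubric: "4"                            # stand-in for a judge model

    for prompt in make_personalized_prompts(profile):
        response = model(prompt)
        score = judge_with_user_criteria(prompt, response, profile, judge)
        print(prompt, "->", score)
```

In this sketch the personalization lives in the data (the user's own tasks and priorities), not in the harness; swapping in a different profile yields a different test set and a different rubric, which is the sense in which vibe-testing is user-specific rather than benchmark-specific.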