Vibe-testing (informal, personalized evaluation) is how real users actually judge LLMs, and formalizing it with personalized prompts and user-aware judging criteria can predict practical usefulness better than standard benchmarks.
Users often evaluate LLMs informally by testing them on tasks relevant to their own work, a practice called 'vibe-testing.' This paper studies how vibe-testing actually works by analyzing user surveys and real-world model comparisons, then formalizes it as a two-step process: personalizing what to test and personalizing how to judge the results.
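To make the two-step process concrete, here is a minimal sketch of what a formalized vibe-test could look like in code. Everything here is illustrative and assumed, not the paper's implementation: the `UserProfile` fields, the prompt templates, and the `judge` callable are hypothetical stand-ins for a user description, a personalized test set, and an LLM judge with user-aware criteria.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class UserProfile:
    role: str              # e.g. "data journalist"
    tasks: list[str]       # tasks the user actually cares about
    priorities: list[str]  # what "good" means to this user (tone, brevity, ...)


def make_personalized_prompts(profile: UserProfile) -> list[str]:
    """Step 1: personalize WHAT to test -- turn the user's own tasks into prompts."""
    return [f"As a {profile.role}, {task}" for task in profile.tasks]


def judge_with_user_criteria(
    prompt: str,
    response: str,
    profile: UserProfile,
    judge: Callable[[str], str],
) -> str:
    """Step 2: personalize HOW to judge -- score the response against the user's own priorities."""
    criteria = "; ".join(profile.priorities)
    rubric = (
        f"Prompt: {prompt}\nResponse: {response}\n"
        f"Rate 1-5 for how useful this is to a {profile.role} who values: {criteria}."
    )
    return judge(rubric)


if __name__ == "__main__":
    profile = UserProfile(
        role="data journalist",
        tasks=["summarize this 40-page city budget for a general audience"],
        priorities=["concrete numbers", "plain language", "no speculation"],
    )
    model = lambda p: f"(model output for: {p[:40]}...)"  # stand-in for the LLM under test
    judge = lambda rubric: "4"                            # stand-in for a judge model

    for prompt in make_personalized_prompts(profile):
        response = model(prompt)
        score = judge_with_user_criteria(prompt, response, profile, judge)
        print(prompt, "->", score)
```

In this sketch the personalization lives in the data (the user's own tasks and priorities), not in the harness; swapping in a different profile yields a different test set and a different rubric, which is the sense in which vibe-testing is user-specific rather than benchmark-specific.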