LLM dialogue systems for compliance assessment show low accuracy against expert ground truth, and user satisfaction decreases as responses get longer. Designers should therefore prioritize concise, proactive interactions over verbose explanations.
This paper evaluates how well LLM dialogue assistants such as GitHub Copilot help developers assess non-functional requirements such as HIPAA compliance. The researchers had 49 programmers use Copilot to evaluate 148 compliance requirements against real code, measuring both accuracy against expert standards and user satisfaction across multi-turn conversations.