Learning to Think Like a Cartoon Captionist: Incongruity-Resolution Supervision for Multimodal Humor Understanding

Hatice Merve Vural, Doga Kukul, Ege Erdem Ozlu, Demir Ekin Arikan, Bob Mankoff et al. | April 16, 2026 | arXiv

Key Takeaway

For complex reasoning tasks like humor, supervising the intermediate thinking process with structured traces outperforms scaling alone: models need to learn *why* something is funny, not just predict captions.

Summary

This paper teaches AI models to understand humor the way a professional cartoon captionist does, by breaking the reasoning process into three steps: spotting visual mismatches, reinterpreting them creatively, and judging which interpretations are funniest.
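To make the three steps concrete, here is a minimal sketch of how such a reasoning trace might be represented as data. It is not the paper's format: the class names `HumorTrace` and `Interpretation`, the `funniness` score, and the example cartoon are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Interpretation:
    """One candidate reading of the cartoon (hypothetical structure)."""
    caption: str
    resolution: str   # how the caption makes sense of the mismatch
    funniness: float  # illustrative preference score, higher is funnier


@dataclass
class HumorTrace:
    """Hypothetical three-step trace: spot, reinterpret, judge."""
    incongruities: List[str] = field(default_factory=list)
    interpretations: List[Interpretation] = field(default_factory=list)

    def best(self) -> Interpretation:
        # Step 3: judge which interpretation is funniest.
        if not self.interpretations:
            raise ValueError("no interpretations to judge")
        return max(self.interpretations, key=lambda i: i.funniness)

    def render(self) -> str:
        # Serialize the trace as plain text, e.g. as a supervision target.
        lines = ["Incongruities:"]
        lines += [f"- {x}" for x in self.incongruities]
        lines.append("Interpretations:")
        lines += [f"- {i.caption} ({i.resolution})" for i in self.interpretations]
        lines.append(f"Funniest: {self.best().caption}")
        return "\n".join(lines)


# Invented example data, not taken from the paper.
trace = HumorTrace(
    incongruities=["a fish is sitting at an office desk"],
    interpretations=[
        Interpretation("I'm drowning in paperwork.", "literal fish, figurative idiom", 0.9),
        Interpretation("Nice office.", "ignores the mismatch", 0.2),
    ],
)
print(trace.render())
```

Keeping the intermediate steps as explicit fields, rather than storing only the final caption, is what would let a training signal attach to the reasoning itself and not just to the prediction.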

reasoning, multimodal, evaluation

Key Terms

incongruity-resolution, reasoning-trace, multimodal-humor-understanding, preference-alignment