Narrative similarity can be operationalized as a practical classification task. LLM ensembles currently outperform other approaches on it, but there is still significant room for improvement in how systems represent and compare story meaning.
This paper introduces a shared task on narrative similarity: given a reference story and two candidate stories, a system must decide which candidate is more similar to the reference. The authors collected over 1,000 annotated story triples and evaluated 71 submissions from 46 teams. LLM ensembles performed best on the classification task, while fine-tuned embedding models were competitive with simpler approaches.
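To make the task format concrete, here is a minimal sketch of the triple decision, assuming a similarity-based baseline. It is not the method of any submission: the bag-of-words `embed` function is a toy stand-in for a learned embedding model, and the names `closer_story` and `majority_vote` are hypothetical, introduced only for illustration. The `majority_vote` helper shows one simple way that predictions from several systems could be combined into an ensemble.

```python
import math
from collections import Counter


def embed(text):
    # Toy stand-in for a learned story embedding (an assumption for
    # illustration): a bag-of-words count vector over lowercase tokens.
    return Counter(text.lower().split())


def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * v[token] for token, count in u.items() if token in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def closer_story(reference, story_a, story_b):
    # Return "A" or "B": the candidate whose vector is closer to the
    # reference. Ties are resolved in favor of "A".
    ref = embed(reference)
    sim_a = cosine(ref, embed(story_a))
    sim_b = cosine(ref, embed(story_b))
    return "A" if sim_a >= sim_b else "B"


def majority_vote(labels):
    # Combine the labels predicted by several systems for one triple
    # by taking the most frequent label.
    return Counter(labels).most_common(1)[0][0]


reference = "a knight rescues a princess from a dragon"
story_a = "a knight saves a princess from a monster"
story_b = "a chef opens a restaurant in paris"

prediction = closer_story(reference, story_a, story_b)
ensemble = majority_vote([prediction, "B", "A"])
```

A word-overlap measure like this one captures surface similarity only. Two stories with the same plot but different vocabulary would score low, which is the kind of gap in representing story meaning that the summary points to.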