Diagnosis-driven video summarization for medical imaging requires organizing sparse diagnostic events into coherent clinical contexts rather than treating frames independently. DiCE shows that this contextual reasoning approach outperforms standard methods on ultra-long endoscopy videos.
This paper tackles video-level analysis of capsule endoscopy (CE) videos by introducing a new task: extracting key diagnostic frames and making accurate diagnoses from ultra-long videos containing thousands of normal frames mixed with rare abnormal findings.