A compact multimodal model that punches above its weight by distilling reasoning patterns from a larger teacher model into a smaller, more deployable package. It handles both text and images, making it versatile for vision-language tasks without requiring heavy infrastructure. Because the weights are open and available in GGUF format, it runs locally on modest hardware, though distillation has its limits: the model occasionally shows gaps where the teacher's deeper reasoning didn't fully transfer.