A mid-sized multimodal model that accepts both text and image inputs, quantized to FP8 dynamic precision for more efficient deployment without giving up much capability. FP8 stores each value in 8 bits, so the weights need about half the memory of a 16-bit variant, though quantization can introduce subtle quality trade-offs. It occupies a practical middle ground: capable enough for real tasks, small enough to run in constrained environments.
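
As a rough illustration of the memory saving, the sketch below estimates weight storage at different precisions. The parameter count is a hypothetical placeholder, not this model's actual size, and the estimate covers weights only (no activations, KV cache, or runtime overhead).

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Hypothetical parameter count, chosen only for illustration.
NUM_PARAMS = 12_000_000_000

for label, bits in [("FP16/BF16", 16), ("FP8", 8)]:
    print(f"{label}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Whatever the real parameter count, the FP8 figure comes out at half the 16-bit figure, which is where most of the deployment saving comes from.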