A multimodal open-weight model that accepts both text and image inputs, quantized to 4-bit precision for local deployment via MLX. The 4-bit weights reduce the memory footprint enough to run on consumer hardware, at the cost of some quality loss relative to full-precision variants. Because it processes visual and textual information together, it can handle tasks that require understanding images alongside written context.
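
To give a rough sense of the memory saving, the sketch below estimates weight storage at 16-bit and 4-bit precision. The parameter count is a hypothetical figure chosen for illustration, not this model's actual size. The group-wise layout (one 16-bit scale and one 16-bit bias per 64 weights) is an assumption about how the 4-bit weights are stored. Activations, the KV cache, and any components kept at higher precision are not counted.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical parameter count, used only for illustration.
NUM_PARAMS = 8e9

# Assumption: group-wise 4-bit quantization with one 16-bit scale and one
# 16-bit bias per group of 64 weights, adding 32 / 64 = 0.5 bits per weight.
GROUP_SIZE = 64
bits_4bit = 4 + (16 + 16) / GROUP_SIZE

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
q4_gib = weight_memory_gib(NUM_PARAMS, bits_4bit)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:  {q4_gib:.1f} GiB")
print(f"reduction:      {fp16_gib / q4_gib:.2f}x")
```

Under these assumptions the saving is somewhat less than the nominal 4x, because the per-group scales and biases add overhead on top of the 4-bit values.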