A mid-sized multimodal model that accepts both text and image inputs, quantized to 4-bit precision for efficient local deployment with Apple's MLX framework. Storing the weights in 4 bits takes roughly a quarter of the memory that 16-bit weights would need, which makes the model practical to run on consumer hardware, at the cost of some numerical precision compared to full-precision variants.
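To make the memory trade-off concrete, the following is a rough back-of-the-envelope sketch, not a measurement of this model. The parameter count (`NUM_PARAMS`) is a hypothetical placeholder, since the description does not state one, and the group size and 16-bit scale/bias overhead are assumptions about a typical group-wise quantization scheme. It counts weights only and ignores activations, the KV cache, and any layers left unquantized.

```python
def effective_bits(bits: int, group_size: int, overhead_bits: int = 32) -> float:
    """Bits per weight once per-group metadata is included.

    Assumes each group of `group_size` weights stores a scale and a bias,
    together taking `overhead_bits` bits (assumed: 2 x 16-bit values).
    """
    return bits + overhead_bits / group_size


def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical parameter count, chosen only for illustration.
NUM_PARAMS = 8e9

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
q4_gib = weight_memory_gib(NUM_PARAMS, 4)
q4_grouped_gib = weight_memory_gib(NUM_PARAMS, effective_bits(4, group_size=64))

print(f"16-bit weights:           {fp16_gib:.2f} GiB")
print(f"4-bit weights (ideal):    {q4_gib:.2f} GiB")
print(f"4-bit weights (grouped):  {q4_grouped_gib:.2f} GiB")
```

The ideal 4-bit figure is exactly one quarter of the 16-bit figure. The grouped figure is slightly higher because of the per-group metadata, which is why "roughly a quarter" is the safer claim.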