A mid-sized multimodal model that accepts both text and image inputs, converted to MLX 8-bit format for efficient local inference on Apple Silicon. Storing weights at 8 bits takes roughly half the memory of 16-bit weights while preserving most of the base model's capabilities. Expect solid general reasoning and vision understanding, with some precision loss from quantization as the trade-off.
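
To put a rough number on the memory trade-off, the sketch below estimates weight storage for 8-bit group-wise quantization against 16-bit weights. It is back-of-the-envelope arithmetic, not a measurement of this model: the parameter count is a hypothetical placeholder, and the group size of 64 with one 16-bit scale and one 16-bit bias per group is an assumption about the quantization settings, which this card does not state. Activations, the KV cache, and image features add to the total and are not counted.

```python
def quantized_weight_gib(num_params, bits=8, group_size=64, overhead_bits_per_group=32):
    """Approximate weight storage in GiB for group-wise quantization.

    Assumes each group of `group_size` weights carries `overhead_bits_per_group`
    extra bits (e.g. a 16-bit scale plus a 16-bit bias). These defaults are
    assumptions, not values taken from this model's config.
    """
    bits_per_weight = bits + overhead_bits_per_group / group_size
    return num_params * bits_per_weight / 8 / 2**30


def fp16_weight_gib(num_params):
    """Weight storage in GiB at 2 bytes per parameter."""
    return num_params * 2 / 2**30


if __name__ == "__main__":
    # Hypothetical parameter count, for illustration only.
    num_params = 10_000_000_000
    q8 = quantized_weight_gib(num_params)
    fp16 = fp16_weight_gib(num_params)
    print(f"8-bit weights:  {q8:.1f} GiB")
    print(f"16-bit weights: {fp16:.1f} GiB")
    print(f"ratio:          {q8 / fp16:.2f}")
```

Substitute the real parameter count and the quantization settings from the model's config to get an estimate for this checkpoint. Under the assumptions above, the per-group overhead keeps the ratio slightly above one half.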