A compact multimodal model that accepts both text and image inputs, so it can handle visual reasoning tasks alongside standard language work. As a 4-bit quantized MLX variant, it trades some precision for a significantly reduced memory footprint and runs efficiently on Apple Silicon hardware. Quantization-aware training (QAT) helps preserve quality despite the aggressive compression.
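
To make the memory trade-off concrete, here is a rough back-of-the-envelope sketch of how weight storage scales with bits per weight. The parameter count below is a hypothetical placeholder, not this model's actual size, and the function name is illustrative. The estimate covers weights only: it ignores activations, the KV cache, and the per-group scale values that quantized formats typically store, which push the effective bits per weight slightly above 4.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


# Hypothetical parameter count, chosen only for illustration.
EXAMPLE_PARAMS = 4_000_000_000

fp16_gib = weight_memory_gib(EXAMPLE_PARAMS, 16)
q4_gib = weight_memory_gib(EXAMPLE_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.2f} GiB")
print(f" 4-bit weights: {q4_gib:.2f} GiB")
print(f"reduction:      {fp16_gib / q4_gib:.1f}x")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts weight storage by about a factor of four, which is the main reason a quantized variant fits more comfortably in the unified memory of Apple Silicon machines.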