A mid-sized multimodal model that accepts both text and image inputs, packaged as a 6-bit quantized MLX build for Apple Silicon. It sits in a practical middle ground: compact enough to run locally on a Mac while retaining image understanding. Quantization trades away some numerical precision relative to the full-precision weights, and that loss can occasionally show up in nuanced reasoning tasks.
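To make the size-versus-precision trade-off concrete, here is a minimal sketch of what storing a weight in 6 bits implies. It assumes a simple min-max affine quantizer applied to one small group of weights; this is an illustrative scheme, not necessarily the exact one used to produce this package. The group size of 64, the Gaussian sample weights, and the helper names `quantize_dequantize` and `weight_gib` are all hypothetical.

```python
import random

def quantize_dequantize(weights, bits=6):
    """Round-trip floats through an affine quantizer with 2**bits levels.

    Illustrative only: a real quantized format may use a different scheme.
    """
    levels = 2 ** bits - 1                      # 63 steps between 64 representable values
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]   # integer codes in [0, levels]
    restored = [lo + c * scale for c in codes]           # what the model actually "sees"
    return codes, restored, scale

def weight_gib(num_params, bits):
    """Approximate weight storage in GiB, ignoring scales, offsets, and activations."""
    return num_params * bits / 8 / 2 ** 30

random.seed(0)
# Hypothetical group of 64 weights drawn from a narrow Gaussian.
group = [random.gauss(0.0, 0.02) for _ in range(64)]
codes, restored, scale = quantize_dequantize(group, bits=6)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"step size {scale:.6f}, worst-case round-trip error {max_err:.6f}")
```

Each restored weight differs from the original by at most half a quantization step, which is the precision given up. In exchange, `weight_gib` shows that 6-bit storage needs roughly three eighths of the space of 16-bit weights for the same parameter count, before accounting for per-group metadata.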