A mid-sized multimodal model that accepts both text and image inputs, packaged in MLX format with 6-bit quantization for Apple Silicon hardware. The quantization shrinks the weights enough for practical local deployment while keeping image input support. As a community-packaged variant, it trades some precision for accessibility and on-device performance.