A multimodal open-weight model that handles both text and image inputs, quantized to 8-bit for efficient local inference via MLX. The quantization makes it more accessible on consumer hardware while accepting some precision trade-offs compared to full-precision variants. It processes visual and textual information together, making it capable of tasks that require understanding both modalities.