A mid-sized multimodal model that accepts both text and image inputs, quantized to 4-bit precision for efficient local deployment via MLX. Quantization shrinks the memory footprint enough to make the model practical on consumer hardware, though some output quality may be lost relative to the full-precision version. It is a thinking-capable model in the Qwen3.6 family, balancing reasoning depth against practical resource constraints.
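
To make the memory trade-off of 4-bit quantization concrete, here is a minimal back-of-the-envelope sketch. It assumes group-wise affine quantization with 64 weights per group and 32 bits of per-group metadata (a 16-bit scale and a 16-bit bias). These values are assumptions for illustration, not confirmed settings of this checkpoint. The parameter count is a hypothetical placeholder, and the helper names `effective_bits` and `weight_bytes` are invented for this example.

```python
# Rough estimate of weight memory for a quantized model.
# All numbers here are illustrative assumptions, not specs of this model.


def effective_bits(bits: int, group_size: int, overhead_bits: int = 32) -> float:
    """Bits per weight once per-group metadata is amortized.

    Assumes group-wise affine quantization where each group of
    `group_size` weights stores `overhead_bits` of metadata
    (for example a 16-bit scale plus a 16-bit bias).
    """
    return bits + overhead_bits / group_size


def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store `num_params` weights at the given width."""
    return num_params * bits_per_weight / 8


if __name__ == "__main__":
    HYPOTHETICAL_PARAMS = 30e9  # placeholder; not this model's real size
    GIB = 1024 ** 3

    full = weight_bytes(HYPOTHETICAL_PARAMS, 16)
    quant = weight_bytes(HYPOTHETICAL_PARAMS, effective_bits(4, 64))

    print(f"16-bit weights: {full / GIB:.1f} GiB")
    print(f"4-bit weights:  {quant / GIB:.1f} GiB")
    print(f"reduction:      {full / quant:.2f}x")
```

This covers weight storage only. Activations, the key-value cache for long reasoning traces, and any layers kept at higher precision add to the total, so the real footprint will likely be larger than the estimate.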