A mid-sized open-weight base model from Qwen that accepts both text and image input, making it a versatile starting point for fine-tuning or experimentation. As a base model, it has not been instruction-tuned, so it continues text rather than following conversational prompts; developers who want chat-style behavior need to apply their own instruction tuning or alignment layer. Because it takes images directly, it can be used for vision-language tasks without a separate vision model.
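
Because a base model continues text rather than answering instructions, prompts are usually written as a document for the model to extend. The following is a minimal sketch of that idea in Python, not an official Qwen prompt format: the helper name `build_few_shot_prompt` and the `Question`/`Answer` labels are hypothetical choices made for illustration, and the code only builds the prompt string without calling any model.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    query: str,
    input_label: str = "Question",   # hypothetical label, not a Qwen convention
    output_label: str = "Answer",    # hypothetical label, not a Qwen convention
) -> str:
    """Format worked examples plus a new query as plain text to be continued."""
    # Each worked example shows the model the pattern it should keep following.
    blocks = [f"{input_label}: {q}\n{output_label}: {a}" for q, a in examples]
    # Leave the final output label open so the natural continuation
    # of the text is the answer to the new query.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt(
    [
        ("What is the capital of France?", "Paris"),
        ("What is the capital of Japan?", "Tokyo"),
    ],
    "What is the capital of Canada?",
)
print(prompt)
```

The resulting string would be passed to whatever completion interface is in use, typically with a stop sequence such as a blank line so that generation ends after one answer. Fine-tuning remains the more robust route when consistent conversational behavior is needed.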