A compact 4B-parameter multimodal model that performs solidly within its weight class on vision-language tasks. It accepts both text and image inputs, which suits it to visual question answering and image description. As an open-weight release under Apache 2.0, it can be freely modified and redistributed, though its small size means it trades raw reasoning depth for speed and accessibility.