A compact vision-language model that accepts both text and image inputs and is tuned for instruction-following. At 8 billion parameters, it is small enough to run under tight resource constraints while still handling visual content, a practical trade-off between capability and deployability. Expect solid performance on grounded visual tasks, though it may fall short of larger models on complex multi-step reasoning.
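
To make the text-plus-image instruction-following workflow concrete, here is a minimal inference sketch assuming the model is distributed as a Hugging Face Transformers checkpoint; the repository id `org/vlm-8b-instruct` is a placeholder, and the exact prompt and processor conventions vary by model family.

```python
# Minimal sketch: image + text instruction-following inference.
# Assumptions: the checkpoint id is hypothetical, and some model
# families expect a chat template rather than a raw prompt string.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "org/vlm-8b-instruct"  # hypothetical checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")
prompt = "Describe what is shown in this image in one sentence."

# Pair the instruction text with the image; most VLM processors
# accept this form, though details differ between models.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

The half-precision weights and `device_map="auto"` reflect the deployability point above: an 8B model in float16 fits on a single consumer-grade GPU in many setups, though memory needs depend on image resolution and context length.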