A compact, fast multimodal model that handles both text and image inputs with a notably large context window of 262K tokens — unusual for a model of its size. The NVFP4 quantization keeps it lean and efficient for deployment. Its open-weight nature means you can inspect and run it locally, though detailed capability benchmarks from the publisher are sparse.