A quantized vision-language model that trades a small amount of numerical precision for significantly reduced memory footprint, using MXFP4 format. It handles both text and image inputs, making it capable of multimodal reasoning within a more hardware-accessible package. The compression means it can run on less powerful hardware than the full-precision version, though with potential minor quality trade-offs.