A mid-sized multimodal model that handles both text and image inputs, quantized in nvfp4 format for efficient local deployment via MLX. It operates with a mixture-of-experts architecture, activating around 3 billion parameters out of 35 billion total, keeping inference lean without sacrificing much capability. The trade-off is that nvfp4 quantization may introduce subtle quality degradation compared to full-precision variants.