A mid-sized multimodal reasoner that processes both text and images, quantized to 4-bit precision for efficient local deployment via MLX. The compression keeps the memory footprint lean while preserving much of the original model's capability, though fine detail may soften compared to full-precision runs. It handles visual and textual inputs in a single pass, making it practical for on-device workflows where resource constraints matter.
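To make the memory savings concrete, here is a rough back-of-envelope sketch of weight storage at different precisions. The parameter count below is a hypothetical placeholder (the description above does not state the model's size), and the 4.5 bits-per-weight figure assumes a small overhead for per-group quantization scales:

```python
def weight_footprint_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB, ignoring activations and KV-cache overhead."""
    return params * bits_per_weight / 8 / 1024**3

params = 7e9  # hypothetical 7B-parameter model; the actual size is not stated above
fp16 = weight_footprint_gb(params, 16)    # full half-precision baseline
q4 = weight_footprint_gb(params, 4.5)     # ~4 bits plus scale/zero-point metadata
print(f"fp16: {fp16:.1f} GiB, 4-bit: {q4:.1f} GiB")
```

Under these assumptions the quantized weights take roughly a quarter of the half-precision footprint, which is what makes on-device deployment feasible on consumer hardware.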