A compact multimodal model that accepts both text and image inputs, quantized to FP8 precision for efficient deployment. FP8 stores each weight in 8 bits, half the size of a 16-bit format such as FP16 or BF16, which keeps the memory footprint manageable while preserving much of the original model's capability. The trade-off is a practical one: slightly lower fidelity in exchange for faster inference and lower hardware requirements.
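
As a rough illustration of the memory side of this trade-off, the sketch below estimates weight storage at 16-bit and 8-bit precision. The parameter count is a hypothetical value chosen only for the arithmetic, not this model's actual size, and the 16-bit baseline is an assumption about the unquantized format. The estimate covers weights only and ignores activations, KV cache, and any per-tensor scaling metadata.

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 2**30


# Hypothetical parameter count, for illustration only (not from the model card).
NUM_PARAMS = 4_000_000_000

# Assumed 16-bit baseline (FP16 or BF16) versus the 8-bit FP8 format.
baseline_gib = weight_memory_gib(NUM_PARAMS, 16)
fp8_gib = weight_memory_gib(NUM_PARAMS, 8)

print(f"16-bit weights: {baseline_gib:.2f} GiB")
print(f"FP8 weights:    {fp8_gib:.2f} GiB")
print(f"Reduction:      {1 - fp8_gib / baseline_gib:.0%}")
```

Real deployments typically need headroom beyond this figure, so the result is best read as a lower bound on required memory.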