A large mixture-of-experts model with 24B total parameters, of which only about 2B are activated per forward pass: a learned router selects a small subset of experts for each token, so inference runs at roughly the cost of a 2B dense model. It handles text tasks with the resource footprint of a much smaller model, though the sparse activation pattern trades some raw capability for speed and memory savings. A practical choice when compute constraints matter more than squeezing out maximum performance.
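The sparse-activation mechanism described above can be sketched as a toy top-k router: the router scores every expert for each input, but only the k highest-scoring experts actually run, so compute scales with k rather than the total expert count. All names, shapes, and weight initializations here are illustrative assumptions, not the model's actual architecture.

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

class SparseMoELayer:
    """Toy mixture-of-experts layer (hypothetical sketch): the router
    scores all experts per input, but only the top-k run, so most of
    the parameters stay untouched on any given forward pass."""

    def __init__(self, dim, num_experts, top_k=2):
        self.top_k = top_k
        # Router: one score vector per expert (toy random init).
        self.router = [[random.gauss(0, 0.1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Each expert is a single linear map here, for brevity.
        self.experts = [[[random.gauss(0, 0.1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]

    def forward(self, x):
        # Score every expert for this input.
        scores = [sum(w_i * x_i for w_i, x_i in zip(w, x))
                  for w in self.router]
        # Keep only the top-k experts; the rest cost no compute.
        top = sorted(range(len(scores)), key=lambda i: scores[i],
                     reverse=True)[:self.top_k]
        gates = softmax([scores[i] for i in top])
        # Weighted sum of the selected experts' outputs.
        out = [0.0] * len(x)
        for gate, idx in zip(gates, top):
            y = [sum(row[j] * x[j] for j in range(len(x)))
                 for row in self.experts[idx]]
            out = [o + gate * y_i for o, y_i in zip(out, y)]
        return out, top

layer = SparseMoELayer(dim=8, num_experts=16, top_k=2)
out, chosen = layer.forward([1.0] * 8)
print(len(chosen), len(out))  # 2 experts active out of 16
```

In a real MoE transformer this routing happens per token inside each MoE feed-forward block, which is how a 24B-parameter model can touch only ~2B parameters per pass.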