Nemotron Cascade 2 30B A3B is a hybrid dense-MoE model: of its 30 billion total parameters, only about 3 billion are active per forward pass, which keeps inference cost far below that of a dense model of the same total size. It also handles extremely long contexts (up to 262K tokens), which suits tasks involving large documents or extended conversations. The trade-off is that with only a fraction of the network applied to each token, per-token quality can be less consistent than that of a fully dense model of comparable total size.
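The "30B total, 3B active" arithmetic comes from top-k expert routing: a router scores all experts per token, but only the k highest-scoring ones actually run. The sketch below is a generic, minimal illustration of that mechanism, not this model's actual architecture; the layer sizes, expert count, and top-k value are made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen for illustration only (not the model's real config).
d_model, n_experts, top_k = 8, 16, 2

# Each "expert" is a small feed-forward weight matrix; the router maps
# a token vector to one logit per expert.
experts = rng.standard_normal((n_experts, d_model, d_model)) * 0.1
router_w = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_forward(x):
    """Route each token to its top-k experts and mix their outputs.

    Only top_k of n_experts run per token, so active expert parameters
    are roughly top_k / n_experts of the expert total -- the same idea
    that lets a 30B-total model activate ~3B parameters per pass.
    """
    logits = x @ router_w                             # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]     # chosen expert indices
    # Softmax over the selected logits only, to get mixing weights.
    sel = np.take_along_axis(logits, top, axis=-1)
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                       # per-token dispatch (clear, not fast)
        for slot in range(top_k):
            e = top[t, slot]
            out[t] += w[t, slot] * (x[t] @ experts[e])
    return out, top

tokens = rng.standard_normal((4, d_model))
y, chosen = moe_forward(tokens)
print(y.shape, chosen.shape)  # (4, 8) (4, 2)
```

Because each token touches only 2 of the 16 expert matrices here, the per-token compute is 1/8 of what running every expert would cost, even though all 16 sets of weights exist in memory; that is the dense-storage, sparse-compute trade the paragraph above describes.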