GLM 5 NVFP4 is a quantized variant optimized for NVIDIA hardware, trading a small amount of precision for significantly faster inference and a lower memory footprint. It handles long-context text tasks across its ~200K-token window while running efficiently on both consumer and datacenter GPUs. FP4 quantization makes it practical for deployment scenarios where raw throughput matters more than peak accuracy.
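To make the precision trade-off concrete, here is a minimal, illustrative sketch of block-scaled FP4 (E2M1) quantization in NumPy. The function name `fake_quantize_fp4`, the block size, and the max-based scale selection are assumptions for demonstration only; a real NVFP4 kernel stores packed 4-bit values with hardware-decoded per-block scales rather than round-tripping through float32.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 value (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def fake_quantize_fp4(x, block_size=16):
    """Quantize then immediately dequantize a tensor with block-scaled FP4.

    Each block of `block_size` elements gets one scale, chosen so the block's
    largest magnitude maps to 6.0 (the top of the E2M1 grid); every element is
    then rounded to the nearest representable FP4 value and rescaled. The
    round trip exposes the precision loss an FP4 kernel would incur.
    """
    x = np.asarray(x, dtype=np.float32).ravel()
    out = np.empty_like(x)
    for start in range(0, x.size, block_size):
        block = x[start:start + block_size]
        # Per-block scale; the epsilon guards against all-zero blocks.
        scale = max(float(np.max(np.abs(block))) / 6.0, 1e-12)
        mags = np.abs(block) / scale
        # Round each magnitude to the nearest entry of the E2M1 grid.
        nearest = np.argmin(np.abs(mags[:, None] - E2M1_GRID[None, :]), axis=1)
        out[start:start + block_size] = np.sign(block) * E2M1_GRID[nearest] * scale
    return out
```

Comparing the output against the input shows why the accuracy loss stays small: within each block the rounding error is bounded by half the widest grid step times the block scale, while memory per weight drops to roughly 4 bits plus a shared scale.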