GLM 5.1 NVFP4

Name: GLM 5.1 NVFP4
Author: NVIDIA

by NVIDIA · GLM

Open Weight: Model weights are publicly available and can be downloaded and self-hosted.

Released May 2026 · 203K context (≈ 152,064 words)

GLM 5.1 NVFP4 is a quantized text-in, text-out model optimized by NVIDIA using FP4 precision, which reduces memory footprint and can improve inference speed on compatible hardware. The trade-off is that FP4 quantization may introduce minor accuracy degradation compared to full-precision counterparts. It carries a large context window of roughly 200K tokens, making it capable of handling long documents in a single pass.
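The memory and context-size claims above can be made concrete with a rough back-of-the-envelope sketch. The parameter count used here is a hypothetical placeholder, not a figure from this page, and the estimate covers weights only: it ignores the scale factors that block-scaled FP4 formats store alongside the weights, as well as the KV cache and activations. The 0.75 words-per-token ratio is inferred from the page's own numbers (152,064 words for a roughly 203K-token context) and is only an approximation.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


def tokens_to_words(num_tokens: int, words_per_token: float = 0.75) -> int:
    """Rough token-to-word conversion; the ratio is an assumed average."""
    return round(num_tokens * words_per_token)


# Hypothetical parameter count, chosen only to illustrate the ratio between formats.
HYPOTHETICAL_PARAMS = 100_000_000_000

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("FP4", 4)]:
    gib = weight_memory_gib(HYPOTHETICAL_PARAMS, bits)
    print(f"{name:>9}: {gib:8.1f} GiB for weights")

# Context size implied by the page's word estimate (152,064 / 0.75 tokens).
print(tokens_to_words(202_752), "words")
```

Whatever the real parameter count is, the ratio is what matters: 4-bit weights take about a quarter of the space of 16-bit weights and an eighth of 32-bit weights, before accounting for scale-factor overhead.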

Capabilities

Capability scores are AI-generated based on model documentation, benchmarks, and technical specifications.

Long Context

Strong

Multilingual

Glossary

Context Window: The maximum number of tokens a model can process in a single conversation or prompt.

FP4 Precision: An ultra-low precision format using 4-bit floating-point numbers to represent model weights, enabling extreme compression.

FP4 Quantization: A compression technique that represents model weights using only 4-bit floating-point numbers instead of larger formats, reducing memory usage and speeding up inference.

Full-Precision: A model using standard 32-bit floating-point numbers to represent weights, providing maximum accuracy but requiring more memory.

Inference: The process of running a trained model to generate predictions or outputs from new inputs.

Inference Speed: How quickly a model can generate predictions or outputs after being given an input, measured in time per token or tokens per second.

Memory Footprint: The amount of RAM or storage space a model requires to run, which is critical for deployment on resource-constrained devices.

Precision: The level of numerical detail a model uses to represent its internal values; higher precision means more accurate calculations but requires more memory.

Quantization: Reducing a model's numerical precision (e.g., from 16-bit to 4-bit) to shrink memory usage and speed up inference.

Quantized: Describes a model whose weights are stored at lower precision (fewer bits), trading some accuracy for a smaller size and lower memory usage.

Text-In, Text-Out: A model that accepts text as input and produces text as output, without support for images, audio, or other data types.

Tokens: The basic units of text that a language model processes, typically representing words or word fragments.
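As an illustration of the FP4 quantization entry, the following is a minimal sketch of block-scaled 4-bit rounding. It assumes the common E2M1 layout (1 sign, 2 exponent, 1 mantissa bit), whose representable magnitudes are listed in `FP4_GRID`. The block size of 16 and the plain Python float used as the per-block scale are simplifying assumptions for illustration; this is not NVIDIA's implementation, and all function names are hypothetical.

```python
# Magnitudes representable by an E2M1 4-bit float (sign bit handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(values):
    """Quantize one block; returns (scale, codes) with codes on the signed FP4 grid."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(values)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = max_abs / FP4_GRID[-1]
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(FP4_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from a scale and its grid codes."""
    return [c * scale for c in codes]


def quantize(values, block_size=16):
    """Split values into fixed-size blocks and quantize each one independently."""
    return [
        quantize_block(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]


weights = [0.1, -0.3, 0.6, -1.2]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
print(codes)
print([round(w - r, 6) for w, r in zip(weights, restored)])  # per-weight rounding error
```

Because each block gets its own scale, a single outlier only coarsens the resolution of its own block rather than the whole tensor. The gap between original and restored values is the source of the accuracy degradation the description mentions.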
