gemma 4 12B it MLX 8bit

gemma

Open Weight: Model weights are publicly available and can be downloaded and self-hosted.

Released June 2026 | context N/A | 12B params

A compact, open-weight model that runs locally via MLX with 8-bit quantization, making it well-suited for on-device inference on Apple Silicon hardware. It handles text-in, text-out tasks and reflects Google's Gemma 4 architecture at the 12B parameter scale. The quantization keeps memory footprint manageable while accepting some trade-off in raw precision.
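As a rough illustration of the memory trade-off described above, the sketch below estimates the size of the weights alone at 16-bit and 8-bit precision. It assumes exactly 12 billion parameters (the listing only says "12B") and ignores activations, the KV cache, and any per-group scale metadata a real quantized file carries, so actual usage will be somewhat higher. The function name `weight_memory_gib` is made up for this example.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


# Assumption: "12B params" is treated as exactly 12 billion parameters.
PARAMS = 12e9

for bits in (16, 8):
    size = weight_memory_gib(PARAMS, bits)
    print(f"{bits}-bit weights: ~{size:.1f} GiB")
# prints:
# 16-bit weights: ~22.4 GiB
# 8-bit weights: ~11.2 GiB
```

Halving the bits per weight halves the weight storage, which is the main reason an 8-bit build is easier to fit in the unified memory of an Apple Silicon machine.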

Capabilities

Capability scores are AI-generated based on model documentation, benchmarks, and technical specifications.

Long Context

Strong

Instruction Following

Strong

Factual Knowledge

Use Case Fit

Fit scores are AI-generated based on model capabilities, intended use, and technical specifications.

Glossary

8-bit Quantization: A specific quantization method that represents model weights using 8 bits instead of the standard 32 bits, significantly reducing memory requirements.

Apple Silicon: Apple's custom-designed processors (like M1, M2, M3) optimized for running machine learning models on Mac computers.

Architecture: The underlying structural design of a neural network that defines how data flows through layers and components.

Inference: The process of running a trained model to generate predictions or outputs from new inputs.

MLX: A machine learning framework optimized for running models efficiently on Apple Silicon chips.

Memory Footprint: The amount of RAM or storage space a model requires to run, which is critical for deployment on resource-constrained devices.

On-Device: A model designed to run directly on a user's device (phone, laptop, etc.) rather than requiring a remote server.

On-Device Inference: Running a model directly on a user's device (phone, laptop, etc.) rather than sending data to a remote server, which improves privacy and reduces latency.

Open-Weight Model: A model whose trained weights are publicly released, allowing anyone to download and run it locally.

Parameter Scale: The total number of trainable weights in a model, often expressed in billions (B); larger models generally have more capacity but require more computing power.

Precision: The level of numerical detail a model uses to represent its internal values; higher precision means more accurate calculations but requires more memory.

Quantization: Reducing a model's numerical precision (e.g., from 16-bit to 4-bit) to shrink memory usage and speed up inference.

Text-In, Text-Out: A model that accepts text as input and produces text as output, without support for images, audio, or other data types.
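To make the Quantization and Precision entries concrete, here is a minimal sketch of symmetric 8-bit quantization with a single shared scale, written in plain Python. It is a simplified illustration only and is not claimed to be the scheme MLX or this model build uses; the helper names and sample weights are invented for the example.

```python
def quantize_8bit(weights):
    """Map floats to signed 8-bit integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the 8-bit range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.5, 0.33, 0.0, 0.49]
quantized, scale = quantize_8bit(weights)
restored = dequantize(quantized, scale)

# The round-trip error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, f"max error: {max_error:.5f}")
```

Each weight is stored as one byte plus a shared scale, and the restored values differ from the originals by at most half a step, which is the precision trade-off the description refers to.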