Backend infrastructure (llama.cpp vs. MLX) matters more for local LLM performance than quantization level, and long-context tasks expose local memory limits that cloud deployments avoid, making these findings critical for practitioners choosing between cloud and local deployment.
This paper evaluates large language models on System Dynamics tasks, comparing cloud APIs (77–89% accuracy) against locally hosted open-source models, which reach up to 77% on causal diagram extraction.