LLMs trained primarily on Python code don't generalize well to other languages. Multi-LCB exposes this gap and provides a rigorous way to measure cross-language coding ability.
Multi-LCB extends LiveCodeBench, a popular code-generation benchmark, from Python-only to 12 programming languages. The benchmark transforms competitive programming problems into equivalent tasks in each language while maintaining contamination controls. Its results reveal that LLMs overfit to Python and show significant performance gaps across languages.
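To make the idea of a cross-language performance gap concrete, here is a minimal sketch of how one might compute a per-language pass rate and its difference from the Python pass rate. It is not Multi-LCB's actual evaluation code: the function names, the `(language, problem_id, passed)` record layout, and the toy data are all assumptions chosen for illustration, and the values are placeholders rather than benchmark results.

```python
from collections import defaultdict


def pass_rate_by_language(results):
    """Return the fraction of solved problems per language.

    `results` is an iterable of (language, problem_id, passed) tuples.
    This record layout is a hypothetical one, not Multi-LCB's format.
    """
    totals = defaultdict(int)
    solved = defaultdict(int)
    for language, _problem_id, passed in results:
        totals[language] += 1
        solved[language] += int(passed)
    return {lang: solved[lang] / totals[lang] for lang in totals}


def gap_to_reference(rates, reference="python"):
    """Return each language's pass rate minus the reference language's rate."""
    ref_rate = rates[reference]
    return {
        lang: rate - ref_rate
        for lang, rate in rates.items()
        if lang != reference
    }


# Placeholder data: the same four problems attempted in two languages.
toy_results = [
    ("python", "p1", True), ("python", "p2", True),
    ("python", "p3", True), ("python", "p4", False),
    ("rust", "p1", True), ("rust", "p2", False),
    ("rust", "p3", False), ("rust", "p4", False),
]

rates = pass_rate_by_language(toy_results)
gaps = gap_to_reference(rates)
print(rates)
print(gaps)
```

A negative gap means the model solves fewer problems in that language than in Python. Because the tasks are meant to be equivalent across languages, the same problem set is scored in each one, so the difference reflects the language rather than the problem mix.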