From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

Elia Cunegatti, Marcus Vukojevic, Erik Nielsen, Giovanni Iacca | June 1, 2026 | arXiv

Key Takeaway

Compression works better when it targets individual submodules (Attention or FeedForward) rather than removing entire layers, because redundancy in LLMs is not evenly distributed across the model's depth.

Summary

SubFit compresses large language models by removing redundant components at a finer granularity than existing methods. Instead of deleting entire layers, it selectively removes Attention and FeedForward submodules from anywhere in the model and replaces them with lightweight shortcuts, achieving better performance-efficiency trade-offs than layer-level compression approaches.
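To make the granularity difference concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the names `make_scale`, `zero_shortcut`, `forward`, and `replace_submodules` are hypothetical, the submodules are stand-in scaling functions, and the shortcut is assumed to contribute nothing so that only the residual path remains. The summary does not specify what SubFit's lightweight shortcuts actually are.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Submodule = Callable[[Vector], Vector]
Layer = Dict[str, Submodule]


def make_scale(factor: float) -> Submodule:
    """Toy stand-in for an Attention or FeedForward submodule."""
    return lambda x: [factor * v for v in x]


def zero_shortcut(x: Vector) -> Vector:
    """Assumed shortcut: adds nothing, so the residual stream passes through."""
    return [0.0 for _ in x]


def forward(layers: List[Layer], x: Vector) -> Vector:
    """Residual forward pass: x + attn(x), then x + ffn(x), for each layer."""
    for layer in layers:
        x = [a + b for a, b in zip(x, layer["attn"](x))]
        x = [a + b for a, b in zip(x, layer["ffn"](x))]
    return x


def replace_submodules(
    layers: List[Layer],
    targets: List[Tuple[int, str]],
    shortcut: Submodule = zero_shortcut,
) -> List[Layer]:
    """Return a copy of the model with each (layer index, submodule name) swapped out."""
    compressed = [dict(layer) for layer in layers]
    for index, name in targets:
        compressed[index][name] = shortcut
    return compressed


# Three identical toy layers.
layers = [{"attn": make_scale(0.5), "ffn": make_scale(1.0)} for _ in range(3)]

# Submodule-level choice: drop Attention in layer 2 and FeedForward in layer 0.
# A layer-level method could only drop both submodules of the same layer.
compressed = replace_submodules(layers, [(2, "attn"), (0, "ffn")])

print(forward(layers, [1.0, 2.0]))      # [27.0, 54.0]
print(forward(compressed, [1.0, 2.0]))  # [9.0, 18.0]
```

The point of the sketch is only the selection granularity: the list of targets can mix Attention and FeedForward submodules from different depths, which a layer-level method cannot express.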

efficiency training

Key Terms

model-compression residual-network post-training calibration kv-cache