LLM compression works better when it targets individual submodules (Attention or FeedForward) rather than removing entire layers, because redundancy in LLMs is not evenly distributed across the model's depth.
SubFit compresses large language models by removing redundant components at a finer granularity than existing methods. Instead of deleting entire layers, it selectively removes Attention and FeedForward submodules from anywhere in the model and replaces them with lightweight shortcuts, achieving better performance-efficiency trade-offs than layer-level compression approaches.
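The difference between layer-level and submodule-level removal can be illustrated with a toy sketch. This is not SubFit's implementation: the summary above does not say how submodules are selected or what the shortcuts look like, so `Block`, `compress`, `zero_shortcut`, and `make_scale` are hypothetical names, and the shortcut here is assumed to contribute nothing to the residual stream.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

Vector = List[float]
SubModule = Callable[[Vector], Vector]


@dataclass
class Block:
    """One toy transformer layer: two residual submodules applied in order."""
    attn: SubModule
    ffn: SubModule

    def forward(self, x: Vector) -> Vector:
        # Residual form: x <- x + attn(x), then x <- x + ffn(x).
        x = [a + b for a, b in zip(x, self.attn(x))]
        x = [a + b for a, b in zip(x, self.ffn(x))]
        return x


def zero_shortcut(x: Vector) -> Vector:
    """Assumed shortcut: add nothing, so the residual path passes x through."""
    return [0.0] * len(x)


def make_scale(k: float) -> SubModule:
    """Toy stand-in for a real Attention or FeedForward submodule."""
    return lambda x: [k * v for v in x]


def compress(
    blocks: List[Block],
    remove: Set[Tuple[int, str]],
    shortcut: SubModule = zero_shortcut,
) -> List[Block]:
    """Replace the listed (layer_index, "attn" | "ffn") submodules with a shortcut.

    Layer-level pruning would have to drop both halves of a layer together;
    here each half can be removed independently, at any depth.
    """
    out = []
    for i, blk in enumerate(blocks):
        attn = shortcut if (i, "attn") in remove else blk.attn
        ffn = shortcut if (i, "ffn") in remove else blk.ffn
        out.append(Block(attn=attn, ffn=ffn))
    return out


def run(blocks: List[Block], x: Vector) -> Vector:
    """Apply all blocks in sequence."""
    for blk in blocks:
        x = blk.forward(x)
    return x


def count_kept(blocks: List[Block], shortcut: SubModule = zero_shortcut) -> int:
    """Number of submodules that were not replaced by the shortcut."""
    return sum((b.attn is not shortcut) + (b.ffn is not shortcut) for b in blocks)
```

For example, `compress(blocks, {(0, "attn"), (2, "ffn")})` removes the Attention submodule of the first layer and the FeedForward submodule of the third while leaving every other submodule in place, a pattern that whole-layer deletion cannot express.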