You can significantly improve GUI agent accuracy on complex interfaces without retraining by using a two-step approach: first narrow the screen down to the region of interest, then select the best candidate from the options that remain in that region.
This paper identifies why GUI grounding models (used by AI agents to click on and interact with interfaces) fail on complex screens, and finds two main problems: high image resolution causes precision errors, and complex UI elements create ambiguity.
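
To make the two-step approach concrete, here is a minimal sketch in Python. It is not the paper's implementation: the grid-based tiling, the `region_score` callback, the `Candidate` type, and its `score` field are all hypothetical stand-ins for whatever a real grounding model would provide. The sketch only illustrates the control flow of narrowing first and selecting second.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Candidate:
    box: Box
    score: float  # hypothetical confidence from a grounding model


def center(box: Box) -> Tuple[float, float]:
    left, top, right, bottom = box
    return ((left + right) / 2, (top + bottom) / 2)


def contains(region: Box, point: Tuple[float, float]) -> bool:
    left, top, right, bottom = region
    x, y = point
    return left <= x < right and top <= y < bottom


def narrow_region(
    screen_size: Tuple[int, int],
    grid: Tuple[int, int],
    region_score: Callable[[Box], float],
) -> Box:
    """Step 1: split the screen into grid tiles and keep the best-scoring tile.

    `region_score` is an assumed callback; in practice it might wrap a model
    call on a cropped, lower-resolution view of the tile.
    """
    width, height = screen_size
    cols, rows = grid
    tiles: List[Box] = [
        (
            col * width // cols,
            row * height // rows,
            (col + 1) * width // cols,
            (row + 1) * height // rows,
        )
        for row in range(rows)
        for col in range(cols)
    ]
    return max(tiles, key=region_score)


def select_candidate(region: Box, candidates: List[Candidate]) -> Optional[Candidate]:
    """Step 2: among candidates whose center lies in the region, pick the top score."""
    inside = [c for c in candidates if contains(region, center(c.box))]
    if not inside:
        return None
    return max(inside, key=lambda c: c.score)


# Toy usage: a 4K screen where the target sits in the top-right quadrant.
target_point = (3000, 500)
region = narrow_region(
    (3840, 2160),
    (2, 2),
    lambda tile: 1.0 if contains(tile, target_point) else 0.0,
)
candidates = [
    Candidate((100, 100, 200, 150), 0.95),     # high score, but outside the region
    Candidate((2950, 480, 3050, 520), 0.70),   # inside the region
    Candidate((2000, 900, 2100, 950), 0.60),   # inside the region, lower score
]
best = select_candidate(region, candidates)
```

In this toy run the candidate with the highest raw score is discarded because it falls outside the narrowed region, which is the kind of ambiguity the second step is meant to resolve. Working on a smaller region also means any later coordinate prediction covers fewer pixels, which speaks to the resolution problem described above.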