Device-addressed speech detection performs better when it takes the conversation context and history into account instead of analyzing each utterance in isolation, and this sequential approach can run efficiently on edge devices.
This paper tackles the problem of detecting whether spoken audio is addressed to a device (such as a smart speaker) before the audio is sent for transcription. Rather than treating each utterance independently, the authors model detection as a sequential decision problem that takes the conversation history into account.
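
To make the contrast between isolated and sequential decisions concrete, here is a minimal Python sketch. It is not the paper's model: the class `SequentialDetector`, the `history_weight` parameter, and the idea of carrying evidence forward in logit space are all assumptions chosen for illustration, and the per-utterance scores are presumed to come from some hypothetical upstream classifier.

```python
import math
from dataclasses import dataclass, field


def _logit(p: float) -> float:
    """Map a probability to log-odds, clamped away from 0 and 1."""
    p = min(max(p, 1e-6), 1.0 - 1e-6)
    return math.log(p / (1.0 - p))


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class SequentialDetector:
    """Illustrative detector that carries evidence across utterances.

    history_weight and threshold are assumed values, not taken from the paper.
    """
    history_weight: float = 0.5   # how much past evidence carries over
    threshold: float = 0.5        # decision threshold on the combined probability
    _state: float = field(default=0.0, init=False)  # running evidence (log-odds)

    def step(self, utterance_score: float) -> bool:
        """Decide for one utterance, given its isolated score in [0, 1]."""
        # Combine the current utterance's evidence with decayed history.
        evidence = _logit(utterance_score) + self.history_weight * self._state
        self._state = evidence
        return _sigmoid(evidence) >= self.threshold
```

With isolated scores of 0.9, 0.45, and 0.1 fed to one detector in order, `step` returns `True`, `True`, `False`. A fresh detector given only 0.45 returns `False`, so the borderline second utterance is accepted only because the preceding turn was clearly device-addressed. State of this size (one float per conversation) is also cheap enough to keep on an edge device.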