IIIT Hyderabad develops stethoscope to convert whispers into speech for the speech-impaired
Hyderabad: Researchers at the International Institute of Information Technology Hyderabad (IIITH) have developed a wireless stethoscope that can convert non-audible whispers, captured as vibrations behind the ear, into intelligible speech. The technology aims to enhance social interactions for individuals with voice disorders, and works even in a 'zero-shot' setting, that is, for speakers whose voices were never encountered during training.
Stephen Hawking famously spoke of the importance of having a voice, noting in his memoir, “One’s voice is very important. If you have a slurred voice, people are likely to treat you as mentally deficient.” While Hawking's speech synthesizer converted letters selected on a computer screen into speech, researchers at IIITH have explored a silent speech interface (SSI) that translates non-audible speech into vocalised output.
The team, led by Neil Shah, a TCS researcher and PhD student at the Centre for Visual Information Technology (CVIT) at IIITH, along with Neha Sahipjohn and Vishal Tambrahalli, under the guidance of Dr. Ramanathan Subramanian and Prof. Vineet Gandhi, has published its findings in a paper titled "StethoSpeech: Speech Generation Through a Clinical Stethoscope Attached to the Skin." The paper was presented at the prestigious UbiComp/ISWC conference in Melbourne, Australia, from 5-9 October 2024. UbiComp/ISWC is an interdisciplinary conference that features papers published in the Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT).
Traditional SSI Techniques
Neil Shah explained that an SSI enables communication without producing audible sound. The most common technique is lip reading; others include Ultrasound Tongue Imaging and Electromagnetic Articulography. However, many of these methods are invasive and do not work in real time.
Research Methodology
The innovation aims to improve communication for people with voice disorders. The team used a regular stethoscope placed on the skin behind the ear to capture vibrations known as Non-Audible Murmurs (NAM). "We gathered a dataset of NAM vibrations, called the Stethotext corpus, collected in various noisy conditions," explained Prof. Vineet Gandhi. The dataset was created by asking individuals to read text in a murmur, which allowed the team to train its model to convert these vibrations into speech, as the sketch below illustrates.
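In broad strokes, such a training setup can be pictured as a sequence-to-sequence model that maps NAM vibration features to speech features. The following is a minimal, hypothetical PyTorch sketch of that idea; the class name, feature dimensions, and loss are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the core idea, NOT the StethoSpeech implementation:
# learn a mapping from NAM vibration features to speech features.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class NAMToSpeech(nn.Module):
    """Toy sequence model: NAM mel-frames in, speech mel-frames out."""
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_mels)

    def forward(self, nam_mels: torch.Tensor) -> torch.Tensor:
        out, _ = self.encoder(nam_mels)   # (batch, frames, 2 * hidden)
        return self.proj(out)             # predicted speech mel-frames

model = NAMToSpeech()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

# One illustrative training step on random stand-in tensors;
# real training would iterate over the murmured-reading corpus.
nam_batch = torch.randn(8, 200, 80)     # 8 clips, 200 frames, 80 mel bins
speech_batch = torch.randn(8, 200, 80)  # time-aligned target speech frames
optimizer.zero_grad()
loss = loss_fn(model(nam_batch), speech_batch)
loss.backward()
optimizer.step()
```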
Unique Features of the Research
IIITH's research stands out from previous approaches due to its use of a simple stethoscope design. The NAM vibrations are sent via Bluetooth to a mobile phone, which produces clear speech output. Unlike previous methods that required paired whisper-speech data, the IIITH approach shows that NAM vibrations can be converted to speech without needing data from the speaker during training. The conversion process is very fast; it takes less than 0.3 seconds to translate a 10-second NAM vibration. This method remains effective even when the user is moving, such as when walking.
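The speed figure quoted above corresponds to a real-time factor of roughly 0.03 (at most 0.3 seconds of compute per 10 seconds of input). A simple timing harness like the one below can check such a claim; the stand-in converter and the 10 ms frame rate are assumptions for illustration, not details from the paper.

```python
# Hedged sketch of verifying the latency claim: a 10-second NAM clip
# should convert in under 0.3 s, i.e. a real-time factor (RTF) below 0.03.
import time
import torch
import torch.nn as nn

# Stand-in converter; in practice this would be the trained NAM-to-speech model.
model = nn.GRU(80, 80, batch_first=True)
model.eval()

FRAMES_PER_SECOND = 100                             # assumed 10 ms hop size
clip = torch.randn(1, 10 * FRAMES_PER_SECOND, 80)   # one 10-second clip

with torch.no_grad():
    start = time.perf_counter()
    _ = model(clip)
    elapsed = time.perf_counter() - start

rtf = elapsed / 10.0
print(f"converted 10 s of NAM input in {elapsed:.3f} s (RTF = {rtf:.4f})")
```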
Moreover, users have the option to select specific attributes for the output speech, such as ethnicity and gender. "We can create person-specific models, requiring only four hours of murmuring data to develop a specialized model for any individual," Prof. Gandhi explained.
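One common way such attribute control is realised, shown here purely as an illustrative assumption rather than the paper's actual interface, is to feed the converter an embedding of the desired voice alongside the NAM features, so the same murmur can be rendered in different voices.

```python
# Illustrative sketch of attribute-conditioned output. The attribute list,
# class names, and dimensions are hypothetical, not from the paper.
import torch
import torch.nn as nn

ATTRIBUTES = ["female_indian", "male_indian", "female_british", "male_british"]

class ConditionedConverter(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_voices=len(ATTRIBUTES), emb=32):
        super().__init__()
        self.voice_emb = nn.Embedding(n_voices, emb)
        self.rnn = nn.GRU(n_mels + emb, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, nam_mels, voice_id):
        # Broadcast the chosen voice embedding across every time frame.
        emb = self.voice_emb(voice_id)[:, None, :].expand(-1, nam_mels.size(1), -1)
        out, _ = self.rnn(torch.cat([nam_mels, emb], dim=-1))
        return self.proj(out)

converter = ConditionedConverter()
nam = torch.randn(1, 200, 80)
voice = torch.tensor([ATTRIBUTES.index("female_british")])
speech = converter(nam, voice)  # same murmur, rendered in the chosen voice
```

The design choice here is that the voice is a controllable input to the model rather than a property baked in at training time, which is what makes selecting attributes at inference possible.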
Advancements in Machine Learning for Accessibility
Prof. Gandhi's team initially concentrated on text-to-speech (TTS) models. Their system, called ParrotTTS, converts text into clear speech. The researchers' approach is unique in that it mirrors natural language acquisition: they first build a speech-to-speech system, then learn a mapping between text and the resulting sound representations, much as children learn to speak before they learn to read.
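That two-stage recipe can be sketched as follows. This is a hypothetical PyTorch illustration of the idea described above, not the actual ParrotTTS code; all class names and dimensions are assumptions.

```python
# Hedged sketch of the two-stage idea: stage 1 learns a speech-to-speech
# autoencoder whose bottleneck is a "sound representation"; stage 2 learns
# to connect text to that representation.
import torch
import torch.nn as nn

class SpeechAutoencoder(nn.Module):
    """Stage 1: compress speech frames into a latent sound representation."""
    def __init__(self, n_mels=80, latent=64):
        super().__init__()
        self.encode = nn.Linear(n_mels, latent)
        self.decode = nn.Linear(latent, n_mels)

    def forward(self, mels):
        z = self.encode(mels)       # the sound representation
        return self.decode(z), z

class TextToLatent(nn.Module):
    """Stage 2: map character sequences into the stage-1 latent space."""
    def __init__(self, vocab=40, latent=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, latent)
        self.rnn = nn.GRU(latent, latent, batch_first=True)

    def forward(self, chars):
        out, _ = self.rnn(self.emb(chars))
        return out                  # latents the stage-1 decoder can vocalise

stage1 = SpeechAutoencoder()
stage2 = TextToLatent()
chars = torch.randint(0, 40, (1, 30))   # a toy 30-character utterance
latents = stage2(chars)
mels = stage1.decode(latents)           # synthesized speech frames
```

In this scheme, text never has to be paired directly with final audio: once the speech-to-speech stage exists, only the much simpler text-to-representation mapping remains to be learnt.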
Significant Breakthrough
The wireless stethoscope has demonstrated potential in high-noise environments, such as concerts, where normal speech is often unintelligible. It can also enable discreet communication in settings like security operations.
“This research is transformative because previous studies assumed clean speech corresponding to the recorded vibrations. Our approach does not require prior clean speech from individuals who are disabled or speech-impaired,” highlighted Prof. Gandhi. While the team has yet to conduct experiments with medical patients, they are eager to collaborate with hospitals to gather data. “It’s exciting to think we can give a voice to someone who has lost their own,” he added.