New wearable uses light and AI to convert quiet throat movements into audible speech

Applications of AI


Speech usually seems simple: air moves, the vocal cords vibrate, and sound comes out. But the act of speaking leaves another trace that never reaches the ears. The small muscles in the throat tense and shift, and the skin stretches by amounts so small they are easy to overlook. Researchers have found that these movements may carry enough information to reconstruct spoken language, even when no sound is produced at all.

That’s the idea behind a new wearable system developed by researchers at POSTECH (Pohang University of Science and Technology). A team led by Professor Sung-min Park and Dr. Seung-kook Hong has built a neck-mounted device that uses light to read the subtle movements of the throat, then uses artificial intelligence to decode those patterns and convert them back into synthesized speech in the user's own voice.

The concept targets a stubborn problem: clear communication breaks down quickly in noisy places. In factories, on construction sites, on battlefields, and even in some clinical settings, spoken words can become unreliable. Traditional silent speech interfaces attempt to solve this by measuring signals from the brain or muscles, typically through systems such as EEG and EMG. But these approaches can be cumbersome, limited, and difficult to use outside the laboratory.

The new device takes a different route. Rather than listening for sound, it watches for the mechanical signature of speech in the body.

A comprehensive overview of the proposed wearable SSI system, which pairs a reliable multi-axis strain sensor with real-time adaptive speech decoding and reconstruction. (Credit: Cyborg and Bionic Systems)

A camera, tiny markers, and a neck choker

The device uses what the team calls a multi-axis strain mapping sensor. It is built around a soft silicone layer patterned with tiny black markers, paired with a small camera, a compact microscope lens, and an LED light. Worn as part of a choker-like neck band, the system tracks how the markers shift as the skin and muscles around the throat move during speech.

This matters because speech is not a one-directional movement. The muscles in the throat expand, contract, and twist in different directions, yet many early wearable strain sensors captured only one axis of motion, limiting the detail they could collect. In contrast, this sensor maps both the magnitude and the direction of local strain, giving a far more complete picture of what the throat is doing.
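For readers curious how marker tracking becomes strain data, the sketch below shows one plausible implementation using OpenCV: dark markers are located in each camera frame, and their displacement from a reference frame yields a per-marker strain magnitude and direction. The function names, threshold, and matching assumption are illustrative, not the team's actual code.

```python
# Hypothetical sketch of the marker-tracking idea: compare current marker
# positions against a reference frame to get per-marker displacement
# vectors, i.e. a multi-axis strain map. Not the authors' actual pipeline.
import cv2
import numpy as np

def marker_centroids(gray_frame, thresh=60):
    """Find dark markers on the bright silicone layer as blob centroids."""
    _, binary = cv2.threshold(gray_frame, thresh, 255, cv2.THRESH_BINARY_INV)
    n, _, stats, centroids = cv2.connectedComponentsWithStats(binary)
    keep = stats[1:, cv2.CC_STAT_AREA] > 5   # skip label 0 (background)
    return centroids[1:][keep]               # (k, 2) array of (x, y) positions

def strain_map(ref_pts, cur_pts):
    """Displacement magnitude and direction per marker. A real device would
    match markers by nearest neighbour; identical ordering is assumed here."""
    d = cur_pts - ref_pts
    magnitude = np.linalg.norm(d, axis=1)                  # how far each marker moved
    direction = np.degrees(np.arctan2(d[:, 1], d[:, 0]))   # which way it moved
    return magnitude, direction
```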

The research team reported a high gauge factor of 3,625, low hysteresis of less than 0.65%, and linearity above 0.99 over the operating range. The sensor also resolved strains as small as 0.02 percent, a level of sensitivity sufficient to pick up the subtle biomechanical changes that occur during speech.
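To make those figures concrete, here is a small worked example of how a gauge factor and linearity might be computed from calibration data. The gauge factor is the relative change in sensor output divided by the applied strain; the strain range and idealized response below are invented for illustration.

```python
# Illustrative calculation of the reported figures of merit, using made-up
# calibration data (normalized sensor output vs. applied strain).
import numpy as np

strain = np.linspace(0.0, 0.02, 50)   # assumed 0-2% applied strain range
output = 1.0 + 3625 * strain          # idealized linear response, S0 = 1.0

# GF = (change in output / initial output) / (change in strain)
gauge_factor = (output[-1] - output[0]) / (output[0] * (strain[-1] - strain[0]))
linearity = np.corrcoef(strain, output)[0, 1] ** 2   # R^2 of the response

print(f"gauge factor = {gauge_factor:.0f}, linearity R^2 = {linearity:.3f}")
```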

Just as importantly, the performance was consistent. In testing, variation between sensor samples was minimal, with a mean absolute percentage error of 2.8 percent, and the device remained stable after 1,000, 5,000, and 10,000 load cycles.

Teaching AI to read silent speech

Collecting throat movements is only half the challenge. The other half is understanding what they mean.

Design, mechanism, and mechanical properties of the CVOS sensor. (A) Design and configuration of the CVOS sensor. (B) Multiaxial strain-map detection mechanism and corresponding micromarker movement patterns along multiaxial strain directions. (C) Multiaxial strain map during relaxation and contraction of the throat muscles. (Credit: Cyborg and Bionic Systems)

The researchers built an AI pipeline that combines convolutional neural networks and transformer models. The CNNs handle fine-grained local features within each strain map, while the transformers track broader patterns over time. This hybrid design was chosen to better capture the dynamic, time-dependent nature of silent speech.
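A minimal sketch of such a hybrid, in PyTorch, might look like the following. The frame size, layer widths, sequence length, and pooling choices are assumptions made for illustration; only the overall CNN-then-transformer structure reflects the paper's description.

```python
# A minimal sketch of a CNN + transformer classifier for strain-map
# sequences. Sizes are assumed (16x16 maps, 50 frames = 1 s at 50 Hz);
# this is not the authors' actual architecture.
import torch
import torch.nn as nn

class StrainSpeechNet(nn.Module):
    def __init__(self, n_classes=26, d_model=64):
        super().__init__()
        # CNN: fine-grained spatial features from each strain-map frame
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),                  # -> 32 * 4 * 4 = 512 features
            nn.Linear(512, d_model),
        )
        # Transformer: broader temporal patterns across the frame sequence
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                  # x: (batch, time, 1, 16, 16)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)     # per-frame features
        return self.head(self.temporal(feats).mean(dim=1))   # pool over time

logits = StrainSpeechNet()(torch.randn(2, 50, 1, 16, 16))    # -> (2, 26)
```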

The team also had to deal with a practical problem: a wearable device sits slightly differently each time it is put back on. The tightness changes, the placement shifts, and skin contact varies. These differences can alter the signal even when the same word is spoken.

To address this, the system measures what the researchers call an initial residual stress map. In simple terms, it records the baseline deformation present when the device is attached, before intentional speech begins. This lets the AI compensate for differences in fit rather than mistaking them for speech.
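In code, the idea reduces to a simple calibration step, sketched below with invented array shapes and fake data: average a short rest-period recording into a residual stress map, then subtract it from every subsequent frame.

```python
# One plausible way to implement the baseline idea. Shapes and the
# rest-period length are assumptions; the data here is synthetic.
import numpy as np

def capture_baseline(frames):
    """frames: (n, H, W) strain maps recorded while the wearer stays silent."""
    return frames.mean(axis=0)             # per-pixel residual stress map

def normalize(frame, baseline):
    """Remove the attachment-dependent offset so only speech motion remains."""
    return frame - baseline

rest = np.random.rand(100, 16, 16) * 0.05 + 0.2   # 2 s of rest at 50 Hz
baseline = capture_baseline(rest)
live_frame = np.random.rand(16, 16) * 0.3 + 0.2
speech_signal = normalize(live_frame, baseline)
```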

The system was trained on 5,186 samples from six participants, all healthy adults between the ages of 23 and 32, covering the 26 words of the NATO phonetic alphabet. Rather than trying to decode free speech, the researchers focused on a controlled vocabulary: Alpha, Bravo, Charlie, and so on, a word set already designed for clear communication in noisy environments.

That choice was intentional. Similar-sounding letters are easily confused over radio or in a noisy workplace; that is why the NATO alphabet exists in the first place. A silent speech system built around these words may therefore become practical sooner than one that attempts to reconstruct full natural language.

Accurate in noise, but not in every situation

The classifier reached 85.8% accuracy across the 26 NATO words. After the model was compressed with knowledge distillation, its file size dropped from 12.4 MB to 3.6 MB and inference time fell from 0.018 seconds to 0.003 seconds, while accuracy held at 82%. The researchers also reported a signal-to-noise ratio as high as 33.75 decibels, significantly higher than the 10.17 decibels cited for typical commercial EMG systems.
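Knowledge distillation itself is a standard technique: a compact student network is trained to match the softened output distribution of a large teacher as well as the true labels. The sketch below shows the generic loss (after Hinton et al.); it is not the paper's exact training procedure, and the temperature and weighting are illustrative.

```python
# Generic knowledge-distillation loss used to shrink a classifier:
# the student mimics the teacher's soft predictions and the hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    # Hard targets: the student still learns the true word labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```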

Experimental environment for evaluating word recognition performance in the presence of environmental noise. (Credit: Cyborg and Bionic Systems)

In tests with new users, a fine-tuning method called LoRA (low-rank adaptation) reached 80% accuracy with just 20 samples per class, compared with 76% for conventional fine-tuning.
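LoRA works by freezing a pretrained weight matrix and learning only a small low-rank correction on top of it, which is why it can adapt to a new wearer from so few samples. A minimal sketch, with an illustrative rank and scaling:

```python
# Minimal LoRA layer: the pretrained weight is frozen and only the
# low-rank update B @ A is trained. Rank and alpha are example values.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False        # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```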

The device remained effective even in loud environments. Under 90 decibels of white noise, roughly the roar of a construction site, recognition performance matched results in a normal 60-decibel environment. In another demonstration, a user wore the system while firing a gas blowback rifle in semi-automatic and full-automatic modes; the interface still identified the intended words and passed them on for real-time speech reconstruction.

Not all tests were so clean. Performance dropped when the choker was worn too loosely or when the user spoke very loudly, likely because larger, faster muscle movements pushed past the limits of the current hardware. Motion was another weakness: talking while walking or moving the head reduced decoding accuracy, with the steepest decline during up-and-down head movements.

That limitation matters: a system designed for real life cannot expect people to hold still.

A voice for people who cannot speak

The researchers see several potential applications. One is clinical. People who have lost their voice to vocal cord disease or laryngeal surgery may still produce throat movements that reflect their intended speech, and those signals could be converted into audible output.

The other is occupational. People who work in noisy environments may need a communication tool that does not rely on microphones, and the same approach could enable quiet communication where speaking aloud is prohibited, such as in libraries and conference rooms.

Professor Park said, “I hope that this technology will hasten the day when speech-impaired patients regain their voice.” He added: “This is a remarkable technology because it has a wide range of potential applications, including assisting laryngectomy patients, communicating in noisy industrial environments, and even supporting silent speech.”

Experimental environment in which rifles are fired to evaluate word recognition performance in the presence of irregular noise and direct mechanical vibrations. (Credit: Cyborg and Bionic Systems)

The system is not there yet. The study used a small number of participants and a limited vocabulary, and the sampling rate was 50 hertz, which the researchers say falls within the biological frequency range of muscle signals. Future versions will need to improve speed and robustness.

The researchers also said the platform will need larger datasets, more users, a broader vocabulary, and better handling of motion artifacts before it becomes a full-fledged silent speech communication tool.

Practical implications of the research

This research points toward a communication system that does not rely on audible speech, handheld radios, or gel-based electrode sensors.

As the technology improves, it could give people with speech impairments a more natural way to communicate. It could also help workers exchange information in places where noise makes normal conversation unreliable.

Its biggest promise is simple: keeping intended speech alive even when no sound can be heard.
