Automated speech-recognition technology has become more common with the popularity of virtual assistants such as Siri, but many of these systems only perform well with the most widely spoken of the world's roughly 7,000 languages.
Since these systems largely do not exist for less common languages, the millions of people who speak them are cut off from the many technologies that rely on speech, from smart home devices to assistive technologies and translation services.
Recent advances have enabled machine-learning models that can learn the world's uncommon languages, which lack the large amounts of transcribed speech needed to train algorithms. However, these solutions are often too complex and expensive to be applied widely.
Researchers at MIT and elsewhere have now tackled this problem by developing a simple technique that reduces the complexity of an advanced speech-learning model, enabling it to run more efficiently and achieve higher performance.
Their technique involves removing unnecessary parts of a common, but complex, speech-recognition model and then making minor adjustments so that it can recognize a specific language. Because only small changes are needed once the large model is cut down to size, it is much less expensive and time-consuming to teach this model an uncommon language.
The work could help level the playing field and bring automated speech-recognition systems to many areas of the world where they have not yet been deployed. The systems are important in some academic settings, where they can assist blind or low-vision students, and are being used to improve efficiency in health care through medical transcription and in the legal field through court reporting. Automated speech recognition can also help users learn new languages and improve their pronunciation skills. This technique could even be used for transcription and documentation of rare languages that are in danger of dying out.
“This is an important problem to solve because we have amazing technology in natural language processing and speech recognition, but research in this direction will help us scale the technology to many more unexplored languages in the world,” says Cheng-I Jeff Lai, a PhD student in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and first author of the paper.
Lai co-wrote the paper with fellow MIT PhD students Alexander H. Liu, Yi-Lun Liao, Sameer Khurana, and Yung-Sung Chuang; his advisor and senior author James Glass, senior research scientist and head of the Spoken Language Systems Group at CSAIL; MIT-IBM Watson AI Lab research scientists Yang Zhang, Shiyu Chang, and Kaizhi Qian; and David Cox, the IBM director of the MIT-IBM Watson AI Lab. The research will be presented in December at the Conference on Neural Information Processing Systems.
Learning speech from audio
The researchers studied a powerful neural network that has been pretrained to learn basic speech from raw audio, called Wav2vec 2.0.
A neural network is a series of algorithms that can learn to recognize patterns in data; modeled loosely on the human brain, neural networks are arranged into layers of interconnected nodes that process data inputs.
Wav2vec 2.0 is a self-supervised learning model, so it learns to recognize spoken language after being fed a large amount of unlabeled speech. The training process then requires only a few minutes of transcribed speech. This opens the door to speech recognition for uncommon languages that lack large amounts of transcribed speech, such as Wolof, which is spoken by 5 million people in West Africa.
However, the neural network contains about 300 million individual connections, so training it on a specific language requires an enormous amount of computing power.
The researchers set out to improve the network’s efficiency by pruning it. Just as a gardener cuts off unnecessary branches, neural network pruning involves removing connections that are not necessary for a specific task, in this case, learning a language. Lai and his colleagues wanted to see how the pruning process would affect this model’s speech recognition performance.
After pruning the full neural network to create a smaller subnetwork, they trained the subnetwork with a small amount of labeled Spanish speech, and then again with French speech, a process known as finetuning.
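To make the prune-then-finetune idea concrete, here is a minimal sketch in PyTorch of magnitude-based pruning followed by finetuning of the resulting subnetwork. It is illustrative only, not the authors’ code: the batch keys ("audio", "transcript"), the choice of magnitude pruning, and the helper names are assumptions.

```python
# Minimal sketch: magnitude pruning of a speech model, then subnetwork finetuning.
# Assumptions (not from the paper): a generic PyTorch model whose forward pass maps
# batch["audio"] to predictions scored against batch["transcript"].
import torch

def magnitude_prune(model, sparsity=0.5):
    """Zero out the smallest-magnitude weights and return the binary masks."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:                     # skip biases and norm parameters
            continue
        k = max(1, int(param.numel() * sparsity))
        threshold = param.abs().flatten().kthvalue(k).values
        mask = (param.abs() > threshold).float()
        param.data.mul_(mask)                   # remove the pruned connections
        masks[name] = mask
    return masks

def finetune_step(model, masks, batch, loss_fn, optimizer):
    """One finetuning step on labeled speech, keeping pruned weights at zero."""
    loss = loss_fn(model(batch["audio"]), batch["transcript"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])         # re-apply the mask after the update
    return loss.item()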
“We would expect these two models to be very different because they are geared to different languages. But the surprising thing is that if we prune these models, they end up with highly similar pruning patterns. For French and Spanish, they have 97 percent overlap,” says Lai.
They ran experiments using 10 languages, ranging from Romance languages like Italian and Spanish to languages with completely different alphabets, like Russian and Mandarin. The results were the same – there was a huge amount of overlap across all the finetuned models.
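The overlap figure quoted above could, in principle, be measured by comparing the binary pruning masks produced for two languages. The small helper below is a hypothetical illustration of that comparison (it reuses the masks returned by the magnitude_prune sketch earlier); the 97 percent number comes from the paper, not from this code.

```python
def mask_overlap(masks_a, masks_b):
    """Fraction of pruning decisions (keep/remove) shared by two subnetworks."""
    same, total = 0, 0
    for name in masks_a:
        same += (masks_a[name] == masks_b[name]).sum().item()
        total += masks_a[name].numel()
    return same / total
```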
A simple solution
Drawing on that unique finding, they developed a simple technique to improve the efficiency and boost the performance of the neural network, called PARP (Prune, Adjust, and Re-Prune).
In the first step, a pretrained speech-recognition neural network such as Wav2vec 2.0 is pruned by removing unnecessary connections. Then, in the second step, the resulting subnetwork is adjusted for a specific language and then pruned again. During this second step, connections that had been removed are allowed to grow back if they are important for that particular language.
Because connections are allowed to grow back during the second step, the model only needs to be finetuned once, rather than over multiple iterations, which greatly reduces the amount of computing power required.
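A hedged sketch of that prune–adjust–re-prune loop is shown below, reusing the magnitude_prune helper from the earlier sketch. The key contrast with conventional subnetwork finetuning is that the zeroed weights are not forced back to zero after each update, so connections can grow back before a final re-pruning. The single end-of-training re-prune, the data format, and the function names are assumptions for illustration, not the authors’ exact procedure.

```python
def parp_finetune(model, data_loader, loss_fn, optimizer, sparsity=0.5):
    # Step 1 (Prune): zero out low-magnitude weights once, up front.
    magnitude_prune(model, sparsity)

    # Step 2 (Adjust): finetune on the target language. Unlike the masked
    # finetune_step above, pruned weights still receive gradient updates,
    # so connections that matter for this language can grow back.
    for batch in data_loader:
        loss = loss_fn(model(batch["audio"]), batch["transcript"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Step 3 (Re-Prune): prune again so the final model reaches the target sparsity.
    masks = magnitude_prune(model, sparsity)
    return model, masks
```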
Testing the technique
The researchers put PARP to the test against other common pruning techniques and found that it outperformed them all for speech recognition. It was especially effective when there was only a very small amount of transcribed speech to train on.
They also showed that PARP can create one smaller subnetwork that can be finetuned for 10 languages at once, eliminating the need for separate subnetworks for each language, which could also reduce the expense and time required to train these models.
Going forward, the researchers want to apply PARP to text-to-speech models and see how their technique could improve the efficiency of other deep-learning networks.
“There is an increasing need to put large deep-learning models on edge devices. Having more efficient models allows them to be squeezed onto more primitive systems, like cell phones. Speech technology is very important for cell phones, for instance, but having a smaller model does not necessarily mean it computes faster. We need additional technology to bring about faster computation, so there is still a long way to go,” says Zhang.
Self-supervised learning (SSL) is changing the field of speech processing, so making SSL models smaller without degrading performance is an important research direction, says Hung-yi Lee, associate professor in the Department of Electrical Engineering and the Department of Computer Science and Information Engineering at National Taiwan University, who was not involved in this research.
“PARP trims the SSL model, and at the same time, surprisingly improves recognition accuracy. In addition, the paper shows that the SSL model contains a subnetwork that is suitable for ASR tasks in many languages. This finding will encourage research on language/task-agnostic network pruning. In other words, the SSL model can be compressed while maintaining its performance on different tasks and languages,” he says.
This work is partially funded by the MIT-IBM Watson AI Lab and the 5k Language Learning Project.