Microsoft’s research team has created technology that can recognise words in a conversation as well as professional human transcribers.
In a conversational speech recognition task called Switchboard, the company’s research team reached a record 5.1% word error rate, a new industry milestone that substantially improves on the accuracy achieved last year.
In a new blog post, Xuedong Huang, technical fellow at Microsoft, explained the achievement:
“Switchboard is a corpus of recorded telephone conversations that the speech research community has used for more than 20 years to benchmark speech recognition systems. The task involves transcribing conversations between strangers discussing topics such as sports and politics.
“We reduced our error rate by about 12% compared to last year’s accuracy level, using a series of improvements to our neural net-based acoustic and language models. We introduced an additional CNN-BLSTM (convolutional neural network combined with bidirectional long-short-term memory) model for improved acoustic modelling. Additionally, our approach to combine predictions from multiple acoustic models now does so at both the frame/senone and word levels.
“Moreover, we strengthened the recogniser’s language model by using the entire history of a dialog session to predict what is likely to come next, effectively allowing the model to adapt to the topic and local context of a conversation.
“Our team also has benefited greatly from using the most scalable deep learning software available, Microsoft Cognitive Toolkit 2.1 (CNTK), for exploring model architectures and optimising the hyper-parameters of our models. Additionally, Microsoft’s investment in cloud compute infrastructure, specifically Azure GPUs, helped to improve the effectiveness and speed by which we could train our models and test new ideas.”
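For readers unfamiliar with the metric, the error rate quoted for benchmarks such as Switchboard is conventionally a word error rate: the number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that calculation, not Microsoft's scoring code. The function name `word_error_rate` and the sample sentences are made up for the example, and real scoring tools also normalise text (casing, punctuation, hesitations), which is skipped here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: splits on whitespace and does no text normalisation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up sentences: one word ("the") is missing from the hypothesis.
    rate = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"{rate:.1%}")
```

On this definition, a 5.1% error rate corresponds to roughly one word in twenty being wrong, missing or spuriously added.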
Reaching human parity in speech recognition has been a research goal of Microsoft’s for the last 25 years. “Microsoft’s willingness to invest in long-term research is now paying dividends for our customers in products and services such as Cortana, Presentation Translator, and Microsoft Cognitive Services,” Huang said. “It’s deeply gratifying to our research teams to see our work used by millions of people each day.”