On Monday, Microsoft announced that its speech recognition system had reached parity with human transcription accuracy on the Switchboard speech recognition task. The company said it used deep learning software and cloud compute infrastructure to improve the system's model.
Switchboard is a bank of recorded telephone conversations that the company uses to evaluate speech recognition programs. Last year, the company announced that it had reached parity with the 5.9% word-error rate then measured for humans. The bar was recently lowered to 5.1% after researchers measured it with human transcribers.
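For readers unfamiliar with the metric, word-error rate is conventionally the number of word substitutions, deletions and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. It is not Microsoft's scoring code, and the function name and sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (hypothetical name); real benchmark scoring also
    normalizes text, which this sketch does not attempt.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one wrong word out of four in the reference.
print(word_error_rate("call me back tomorrow", "call me bank tomorrow"))
```

Under this definition, a 5.1% rate corresponds to roughly five word errors for every 100 words in the reference transcript.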
Microsoft modified its system model to learn from the conversation history, allowing it to adjust to topic and context and to predict what is likely to come next in the conversation. Improvements in acoustic modeling also aided the effort.
Voice recognition and digital assistant technology have made large strides toward parity with human speech. Microsoft’s new accuracy rate is an important step in creating user-technology interactions that simulate human-to-human interactions.
Microsoft noted that its speech recognition systems still have work to do, especially in handling different accents, speaking styles and languages, and in accurately recognizing words in loud environments. The company hopes that its system can move beyond transcription and learn to understand the meaning and intent behind speech.
Improving accuracy and familiarity for the user is important as voice-based technology is integrated into mobile technology and home systems. As the technology better emulates human speech, fewer user errors will take place.
Voice-activated systems have faced many stumbling blocks trying to understand emotion and empathy. But in July, IBM introduced a new service for chatbots that can detect and analyze communication tones. Such services will help voice-based technology interact with users in a more nuanced way.
Tech companies are also working to make their digital assistants sound familiar. This fall, Apple will debut an enhanced Siri with a more natural, 'expressive' voice that sounds less robotic.