It became obvious around 2000 that mobile phones were the future and that voice interfaces would be a great match for these small devices – but the data networks, devices and operating systems weren’t well enough developed. Some tried to rush things through without thinking enough about the overall user experience, and that led to some negative perceptions.
But the prize was still huge, because the most natural way to communicate is through language. People have evolved to talk. Little keypads, tiny buttons and small displays don’t make things easy, whereas voice isn’t constrained by the size of the device. A mix of modalities is also very important – one that gives users a choice to suit their personal preference and situation. We can speak faster than we can type, and read faster than we can listen. So the ideal (unless you’re in a car) could be to talk into the device and then receive feedback via the display.
Just think about the lifestyle, work and personal productivity benefits of being able to press a single button, talk to your phone using normal language, and get it to do everything from sending emails and text messages to web searches, voice dialling, writing yourself a note, updating your Facebook site or launching other apps. All this is possible now using the latest voice recognition technology. It’s helping people to access information faster, save time and get more done. It really has come a long way. But getting to this point wasn’t easy. A completely different approach was needed to overcome some hefty challenges.
Unconstrained voice user interface
As devices, operating systems and networks improve, apps will no longer be constrained by memory or network speed – the only limiting factor will be the user interface. This was the critical insight behind the new generation of voice recognition technology: the need for an unconstrained, far more capable user interface that could work across any application. To achieve this, we had to get rid of application-specific grammar constraints and, in turn, remove the need for scripted interactions. Put simply, users should be able to say or type anything they want into a voice-enabled text box. But making this a reality took a number of technological step changes.
The first of these was the use of hierarchical language models (HLMs): well-defined statistical models, trained on many millions of words, that predict what users are likely to say and how words are grouped together. These scale to support web search, directory assistance, navigation and other tasks where a very large vocabulary is needed.
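The core statistical idea can be shown with a toy word-level model. This is a minimal sketch only – the class name and the simple bigram approach are illustrative assumptions, not Vlingo’s actual HLMs, which operate over far larger vocabularies and richer hierarchies:

```python
from collections import defaultdict

class BigramModel:
    """Toy statistical language model: predicts the most likely next
    word from bigram counts gathered over a training corpus."""

    def __init__(self):
        # counts[previous_word][next_word] -> number of times seen
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sentences):
        for sentence in sentences:
            words = ["<s>"] + sentence.lower().split()
            for prev, cur in zip(words, words[1:]):
                self.counts[prev][cur] += 1

    def predict(self, prev_word):
        """Most likely word to follow prev_word, or None if unseen."""
        following = self.counts[prev_word.lower()]
        if not following:
            return None
        return max(following, key=following.get)

model = BigramModel()
model.train([
    "call the office",
    "call the restaurant",
    "search the web",
])
print(model.predict("call"))  # → the
```

The same counting-and-scoring principle, scaled up by several orders of magnitude and layered hierarchically, is what lets a recognizer prefer plausible word sequences over implausible ones.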
Then there’s automatic adaptation, where a system learns new words, pronunciations, and the speech patterns of individuals and groups. When Wagamama restaurants first opened in the US the name was brand new, but people were doing successful voice-enabled web searches on it within days. Equally, a first time user with a particular accent benefits from other users who’ve spoken into the system using the same accent.
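That adaptation loop can be sketched in miniature. Everything here is an illustrative assumption – the class, the seed words and the promotion threshold are invented for the example, and a real system adapts acoustic and language models, not just a word list:

```python
from collections import Counter

class AdaptiveVocabulary:
    """Toy sketch of automatic adaptation: a word unseen at launch
    (e.g. a new restaurant name) is promoted into the recognizer's
    vocabulary once enough users have typed or confirmed it."""

    def __init__(self, promote_after=3):
        self.known = {"call", "search", "restaurant"}  # seed vocabulary
        self.pending = Counter()
        self.promote_after = promote_after

    def observe(self, word):
        """Record a successful use of a word by any user."""
        word = word.lower()
        if word in self.known:
            return
        self.pending[word] += 1
        if self.pending[word] >= self.promote_after:
            self.known.add(word)       # now every user benefits
            del self.pending[word]

vocab = AdaptiveVocabulary()
for _ in range(3):
    vocab.observe("wagamama")
print("wagamama" in vocab.known)  # → True
```

Because the learning happens centrally, one group of users teaching the system a new name, pronunciation or accent improves recognition for everyone who follows.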
Server-side processing is another key. A small amount of software on the mobile device handles audio capture and the user interface, and communicates over the mobile data network to a set of servers which run the bulk of the system processing. This enables the use of the large amounts of CPU and memory resources needed for unconstrained speech recognition, adaptation, the learning of new words and the updating of the language models for the benefit of all users.
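The client/server split described above can be sketched as two cooperating components. This is a simulation only – in a real deployment the thin client would send compressed audio over the mobile data network to a server farm, and the class and method names here are assumptions for illustration:

```python
class RecognitionServer:
    """Stands in for the server side: in a real system this would hold
    the large language models and run the heavy recognition, adaptation
    and model-update work across many machines."""

    def recognize(self, audio_bytes):
        # Placeholder decode: a real server would run acoustic and
        # language models over the captured audio.
        return audio_bytes.decode("utf-8")

class ThinClient:
    """The small on-device component: captures audio, ships it to the
    server, and renders the recognized text in the user interface."""

    def __init__(self, server):
        self.server = server

    def capture_and_recognize(self, audio_bytes):
        # Over a real network this round trip would be an HTTPS request.
        return self.server.recognize(audio_bytes)

client = ThinClient(RecognitionServer())
print(client.capture_and_recognize(b"call the office"))  # → call the office
```

The design choice is the point: keeping the device-side footprint tiny while the servers hold the CPU, memory and shared models is what makes unconstrained recognition – and continuous learning for all users – practical.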
Speech recognition remains a challenging area, and the biggest challenge is variability. If everyone spoke the same way, using the same language, speech patterns and accent, things would be easy. The reality is that people speak in all sorts of ways, choose different words, and use mobiles in quiet offices or on noisy roads – so there’s huge variability in acoustic signals and sound patterns.
Statistical modelling is the answer; millions of parameters that can cope with this variability and a system that can learn as it goes along. It works for adding new languages too, which are being developed all the time. A ‘freed up’ user interface opens up more and more apps that people can use, just by talking.
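As a minimal illustration of how statistical modelling absorbs variability, one can fit a single Gaussian to noisy one-dimensional “acoustic features” pooled from many simulated speakers, then score new observations against it. Real recognizers use millions of parameters over high-dimensional features; this toy setup is an assumption purely for illustration:

```python
import math
import random

# Simulate one "sound" as a 1-D feature measured across many varied
# speakers: same underlying sound, noisy realisations.
random.seed(0)
samples = [random.gauss(5.0, 1.0) for _ in range(1000)]

# Estimate the model parameters from the pooled data.
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)

def log_likelihood(x):
    """Log-density of observation x under the learned Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# An observation near the learned mean scores far higher than an
# outlier, even though no two training samples were identical.
print(log_likelihood(5.0) > log_likelihood(9.0))  # → True
```

Scaling this idea up – many such distributions, re-estimated continuously as new speech arrives – is what lets the system both tolerate variability and learn as it goes along.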
It certainly seems like a lot of people are really starting to see voice as the killer app that can really move things forward. So talk to your phone – it works!
As the inventor of the mobile phone “voice user interface,” Vlingo delivers a voice interface and technology that allows users to instantly access services and content on their device. www.vlingo.com
Has voice recognition finally come of age?