Google wants to hear you say ‘yes’…and ‘no’, and maybe ‘on’ and ‘off’ too.
The firm is gathering speech samples from people across the globe, as part of a push to get simple voice recognition everywhere, paving the way for voice commands to be added to appliances and gadgets throughout our homes.
“Voice interfaces are really interesting to us for a whole bunch of reasons,” says Google staff engineer Pete Warden, speaking at an Arm Research Summit in Cambridge.
“From an engineering point of view it’s great to avoid having buttons and switches and displays, and all of these other things that suck up power and take up space on the device.
“It really means you can shrink things down, so you can build devices that are almost invisible. It allows you to create applications that wouldn’t make any sense otherwise.”
There are many barriers to creating voice interfaces, he says, not least that it’s difficult for most people to get their hands on the data needed to train a machine-learning model to understand voice commands as well as the virtual assistants offered by Amazon, Apple and Google do.
“Voice recognition is really hard. Traditionally it’s been something that’s been confined to very large companies with very big teams and that have very specialized stacks of language models,” says Warden.
“You don’t get people in their notional garages just putting this stuff together because it is very difficult to get started.”
While major tech firms offer cloud-based speech-recognition services, Warden says there is a range of reasons, such as reliability, latency, privacy and power consumption, why developers might not want to use these with an always-listening device.
Warden and the AIY [Artificial Intelligence Yourself] team at Google decided to create their own open speech commands dataset (1.4GB download), composed of 65,000 one-second-long utterances of 30 short English words. The Creative Commons-licensed dataset already has thousands of different voices, but Google is hoping to expand the range of speakers and vocabulary further, to encompass many different accents and dialects, and is inviting anyone to record spoken commands via this site.
The data won’t allow a computer to be trained to understand every word a person says, but rather will help train machines to manage the easier task of recognizing a small group of common commands—the likes of ‘Yes’ and ‘No’, ‘On’ and ‘Off’ and a selection of directions.
“Think about a lot of the applications that we care about. For many of them, you’re only going to say about 10 different words to any particular device. That is a very different problem than recognizing arbitrary open-world speech,” he says.
Squashing neural networks
Having the training data is only part of the battle when it comes to ushering in a new wave of voice-controlled devices, however. Another major challenge is developing a speech-recognition model lightweight enough to run on the cheap hardware used in low-end mobile phones and single-board computers like the Raspberry Pi, and the very simple embedded processors used in home appliances and wearables.
Warden is hoping that by providing the necessary data and tools to the developer community, they will be able to devise suitably efficient speech-recognition models. To that end Google has created a tutorial for training a speech-recognition system using this open data, without the need for any sort of data preparation. The resulting model can be used with a Google-built demo app, which can be run on the Raspberry Pi or an Android phone and recognize 10 different command words.
The machine-learning systems that have powered recent AI breakthroughs in areas like speech and image recognition are underpinned by large neural networks. These brain-inspired networks are interconnected layers of algorithms that feed data into each other, and which can be trained to carry out specific tasks by modifying the importance of input data as it passes between the layers.
To build a system able to understand a series of simple voice commands, Warden trained a Convolutional Neural Network (CNN), similar to the type used in small image-recognition networks, to recognize images of the audio waveform of each single-word utterance. Google found it was able to shrink down these CNNs, and in further experiments trained a compressed deep neural network to recognize these command words with 85% accuracy, within a system operating at 0.75 million FLOPs and with 750,000 parameters to train. By reducing the number of parameters further, Warden says the model would run on the very cheap, very low-power Arm-based Cortex-M processors, embedded chips already used in many different home appliances and wearables.
“I’m pretty convinced that people will be able to do much better than me on these models,” he says.
“We know from other experiences that we can get down to 10 or 20,000 parameters and still do useful things with audio. At which point embedded M-class chips start to get really interesting.”
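To see why the parameter count matters for embedded chips, a rough estimate of weight storage helps. The sketch below uses the parameter counts quoted above and assumes each parameter is stored either as a 32-bit float or as an 8-bit integer. Those storage widths and the function name are illustrative assumptions, not figures from Warden's talk.

```python
def model_size_bytes(num_params: int, bits_per_param: int) -> int:
    """Bytes needed to store the weights alone (no activations, no code)."""
    return num_params * bits_per_param // 8


# Parameter counts quoted in the article
CURRENT_MODEL = 750_000
TARGET_MODELS = (20_000, 10_000)

for params in (CURRENT_MODEL, *TARGET_MODELS):
    for bits in (32, 8):  # assumed storage widths, not from the source
        kib = model_size_bytes(params, bits) / 1024
        print(f"{params:>7,} params at {bits:>2}-bit: {kib:8.1f} KiB")
```

Under these assumptions the 750,000-parameter model needs about 3 MB for its weights at 32-bit precision, while 20,000 parameters at 8 bits fit in about 20 KB, a size much closer to the kilobyte-scale memories typical of microcontrollers.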
The other technological shift necessary to usher in Warden’s vision of ubiquitous voice-controlled machines will be the availability of low-power chips better suited to processing neural networks.
“I want a 50 cent chip that can do simple voice recognition and that will run for a year on a coin battery,” says Warden.
“I really think this is doable. Even with the current technology that we have now, I don’t think we’re that far off being able to do this.
“Once you have something that you can almost use as a disposable electronic component, in all sorts of consumer and toy and industrial devices, that’s really going to be game-changing.”
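The "year on a coin battery" target can be turned into a rough power budget with simple arithmetic. The sketch below assumes a coin cell of roughly CR2032 class with a nominal 225 mAh capacity at 3 V. That cell choice is an assumption made for illustration, not something Warden specified, and the estimate ignores self-discharge and voltage sag.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def average_budget(capacity_mah: float, voltage_v: float, hours: float):
    """Return (average current in microamps, average power in microwatts)
    that would drain the given capacity over the given number of hours."""
    current_ua = capacity_mah / hours * 1000.0  # mA -> uA
    power_uw = current_ua * voltage_v           # uA * V = uW
    return current_ua, power_uw


# Assumed cell: 225 mAh nominal at 3 V (illustrative, not from the source)
current_ua, power_uw = average_budget(225.0, 3.0, HOURS_PER_YEAR)
print(f"average current: {current_ua:.1f} uA")
print(f"average power:   {power_uw:.1f} uW")
```

With these assumed figures, the whole device, including microphone, chip and any radio, would have to average under 80 microwatts to last a full year.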
Some pointers as to how this level of efficiency might be achieved were on show at the Arm Research Summit. Belgium’s University of Leuven revealed changes they had made to general-purpose chip design that could enable embedded, low-power processors to carry out always-on computer recognition for a limited number of voice commands or images.
To be capable of always-on recognition, the university’s Professor Marian Verhelst says these processors need to achieve efficiencies of between one and 10 teraops [trillion operations per second] per watt when processing Convolutional Neural Networks.
A chip designed by the university, dubbed DVAFS [Dynamic Voltage Accuracy Frequency Scaling], achieves close to 10 teraops per watt by carrying out a greater range of operations in parallel and, where possible, cutting the number of computations and reducing the precision of the calculations. One key example of how this speedup is achieved is that the chip’s 16-bit multiplier-accumulator can split its operations, for example carrying out two 8-bit or four 4-bit multiplications in parallel.
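The idea of splitting one wide multiplier-accumulator into several narrower ones can be modelled in software. The sketch below is only a functional illustration, not the DVAFS circuit: the function names are hypothetical, the lanes are treated as unsigned values, and a Python list comprehension stands in for lanes that real hardware would compute at the same time.

```python
def split_lanes(word: int, lanes: int, total_bits: int = 16) -> list:
    """Split an unsigned total_bits-wide word into equal-width unsigned
    lanes, least-significant lane first."""
    width = total_bits // lanes
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(lanes)]


def subword_mac(acc: list, a: int, b: int, lanes: int) -> list:
    """One multiply-accumulate step: multiply matching lanes of a and b
    and add each product to its own accumulator."""
    a_lanes = split_lanes(a, lanes)
    b_lanes = split_lanes(b, lanes)
    return [s + x * y for s, x, y in zip(acc, a_lanes, b_lanes)]


# The same pair of 16-bit operand words, read in three precision modes
a, b = 0x1234, 0x0203
print(subword_mac([0], a, b, lanes=1))      # one 16-bit product
print(subword_mac([0, 0], a, b, lanes=2))   # two 8-bit products
print(subword_mac([0] * 4, a, b, lanes=4))  # four 4-bit products
```

The trade-off the model makes visible is that each step yields more products as precision drops: one result at 16 bits, two at 8 bits, four at 4 bits, from the same operand width.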
“I think that’s fascinating work that was worth paying attention to. We’re open to all ideas in this space,” said Matt Mattina, senior director of machine learning research at chip designer Arm.
“I’ve been doing CPU design and architecture for 20 years now, and the amount of excitement and activity around machine learning is unlike anything I’ve seen before.”
For his part, Google’s Warden says he’s hoping it won’t be too long before “we’ll be able to get people doing all sorts of crazy applications that we wouldn’t have thought of using cheap, low-power devices”.
Once hardware design reaches the point where the 50 cent chip that runs on a coin battery becomes a reality, Warden anticipates a swathe of new devices that listen to the world.
“You can add a simple voice interface to anything in your home. You can do audio recognition to do safety cut offs, spot mechanical problems in all sorts of industrial machinery. I love the idea of detecting locusts and crickets by having a whole bunch of audio sensors scattered through fields.
“This is a whole way we understand the world, and I really want our devices to be able to do the same.”