Blu Manta is constantly experimenting with new, cutting-edge approaches to solving challenging problems. Single-microphone noise reduction for spoken conversations (e.g. phone conversations), albeit an old topic, remains a difficult one. In this setting, spatial filtering (i.e. beamforming) is not possible because the noisy speech signal is recorded from a single microphone. The noise component of the signal must instead be estimated directly from the noisy speech mixture. Because noise characteristics vary so widely, this modeling task is complex, particularly for non-stationary noise. This makes it a good fit for a data-driven approach such as neural networks.
Our approach to this problem is based on a recurrent neural network built from gated recurrent units (GRUs). The network is used to estimate the noise component and subtract it in the frequency domain. To prevent channel-mismatch issues, heavy data augmentation is applied during training. While supporting sampling rates as high as 48 kHz, the system keeps both its memory footprint and its overall complexity small enough to fit typical platforms based on the ARM Cortex-M4. Our prototype is low-latency and is therefore well suited to real-time applications.
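To make the frequency-domain subtraction step concrete, here is a minimal sketch for a single spectral frame. It is not our production code: it assumes a classic magnitude spectral subtraction with a spectral floor, and the noise estimate is passed in as a plain list rather than produced by the GRU network. The function name `subtract_noise` and the `floor` parameter are hypothetical and chosen for illustration only.

```python
from typing import List


def subtract_noise(noisy_mag: List[float],
                   noise_mag: List[float],
                   floor: float = 0.05) -> List[float]:
    """Subtract an estimated noise magnitude from one spectral frame.

    noisy_mag: magnitude of each frequency bin of the noisy speech frame.
    noise_mag: noise magnitude estimated for the same bins (in the text above
               this estimate comes from the GRU network; here it is an input).
    floor:     hypothetical spectral floor, as a fraction of the noisy
               magnitude, that keeps the result from going negative.
    """
    if len(noisy_mag) != len(noise_mag):
        raise ValueError("both spectra must have the same number of bins")
    # Per-bin subtraction, clamped from below by the spectral floor.
    return [max(y - n, floor * y) for y, n in zip(noisy_mag, noise_mag)]


# Illustrative values only, not measured data.
frame = [1.0, 0.8, 0.2, 0.1]    # noisy magnitudes for four bins
noise = [0.1, 0.1, 0.15, 0.3]   # assumed noise estimate for the same bins
cleaned = subtract_noise(frame, noise)
```

In the last bin the noise estimate exceeds the noisy magnitude, so the floor applies instead of a negative value. In a complete system the cleaned magnitudes would be recombined with the noisy phase and transformed back to the time domain, a step left out of this sketch.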
At Blu Manta, we believe that products must be practical and adaptable from the user's point of view. Our voice command prototype illustrates this way of thinking. The main challenge in this use case is to allow the user to enrol a new, personalized voice command on a mobile device (1) without a data network connection and (2) without any waiting time. We therefore designed our system to be retrainable on the edge (i.e. on the device itself), in a short amount of time, to recognize this new and previously unknown command. We approached this problem with a recurrent neural network based on long short-term memory (LSTM) units that builds a compact representation of the voice command to recognize. The user is asked to record a voice command of their choice, and the enrolment phase is performed live, in less than two seconds. The representation of this new command, often called an embedding, is used as a reference to detect occurrences of that specific command at runtime.
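The enrolment and detection flow described above can be sketched as follows. This is an illustration under stated assumptions rather than our actual implementation: embeddings are passed in as plain lists instead of being produced by the LSTM network, the comparison uses cosine similarity, and the class name `CommandDetector` and its `threshold` value are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_a * norm_b)


class CommandDetector:
    """Stores one reference embedding and compares new embeddings to it."""

    def __init__(self, threshold: float = 0.8) -> None:
        # Assumed decision threshold; a real system would tune this value.
        self.threshold = threshold
        self.reference: Optional[List[float]] = None

    def enrol(self, embedding: Sequence[float]) -> None:
        """Keep the embedding of the user's recorded command as reference."""
        self.reference = list(embedding)

    def detect(self, embedding: Sequence[float]) -> bool:
        """Return True when the embedding is close enough to the reference."""
        if self.reference is None:
            raise RuntimeError("no command has been enrolled yet")
        return cosine_similarity(self.reference, embedding) >= self.threshold


# Illustrative three-dimensional embeddings, not real network outputs.
detector = CommandDetector()
detector.enrol([0.9, 0.1, 0.4])
same_command = detector.detect([0.8, 0.2, 0.5])
other_sound = detector.detect([-0.5, 0.9, 0.0])
```

Because enrolment in this sketch only stores a reference vector, it completes immediately, which is one way to picture how live enrolment can fit on a device with no network connection.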
We also put significant effort into building solutions that are scalable and applicable across different domains. For instance, we derived a gesture recognition engine from the speech command system described above. Both share the same neural architecture, with different input and output data. In this case, the system analyses depth images to recognize a specific hand gesture. Here again, a live enrolment mechanism allows the user to personalize the system to recognize a new and previously unknown gesture. Our system outperformed state-of-the-art deep-learning solutions on a public database, while dividing the computational complexity by a factor of 10, with a model size of only a few tens of kilobytes.