Analysis of the traditional automatic speech recognition (ASR) system for Google applications

Currently, Google's various voice search applications still rely on a traditional automatic speech recognition (ASR) pipeline, which consists of an acoustic model (AM), a pronunciation model (PM), and a language model (LM). Each component is trained independently, and researchers have to tune them manually on different data sets. For example, the acoustic model takes acoustic features and, using context-dependent phonemes (or sometimes context-independent phonemes), produces a sequence of subword-unit predictions. The pronunciation model then maps that phoneme sequence to words using a hand-designed dictionary, and finally the language model assigns probabilities to the resulting word sequences.
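To make this division of labor concrete, here is a toy Python sketch of how scores from independently built AM, PM, and LM components are combined during decoding. It is not Google's implementation; the phoneme hypotheses, lexicon entries, and probabilities are invented purely for illustration.

```python
import math

# Acoustic model (AM): hypothetical candidate phoneme sequences with log-scores.
am_hypotheses = {
    ("HH", "AH", "L", "OW"): math.log(0.6),
    ("HH", "AH", "L", "AH"): math.log(0.3),
}

# Pronunciation model (PM): a hand-designed lexicon mapping phoneme sequences to words.
lexicon = {
    ("HH", "AH", "L", "OW"): "hello",
    ("HH", "AH", "L", "AH"): "hella",
}

# Language model (LM): hypothetical word log-probabilities.
lm = {"hello": math.log(0.01), "hella": math.log(0.00001)}

def decode():
    """Combine the three independently trained scores and pick the best word."""
    best_word, best_score = None, float("-inf")
    for phones, am_score in am_hypotheses.items():
        word = lexicon.get(phones)
        if word is None:
            continue
        # Each score comes from a separately trained component.
        score = am_score + lm.get(word, math.log(1e-9))
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score

print(decode())  # ('hello', ...)
```

Note how the three components only meet at decoding time; nothing in this pipeline lets an error signal flow back through all of them at once, which is exactly the limitation discussed next.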

Training each model independently in this way is actually a sub-optimal choice compared with training all components jointly, and it makes the whole pipeline more complicated. Over the past few years, end-to-end systems have become increasingly popular: the idea is to merge these independent components into a single system that learns jointly. However, one fact cannot be ignored: although end-to-end models have shown some promise in papers, no one really knows whether they are better than traditional systems in practice.

To verify this, Google recently presented a new paper from the Google Brain team, State-of-the-art Speech Recognition With Sequence-to-Sequence Models, which introduces an end-to-end speech recognition model whose performance surpasses that of traditional systems. The paper shows that, compared with the most advanced traditional speech recognizer, Google's new model achieves a word error rate (WER) of only 5.6%, a 16% relative improvement over the baseline's 6.7%. Furthermore, the end-to-end model used to produce the initial word hypotheses, before any hypothesis rescoring, is one-eighth the size of the traditional system, because it contains no separate language model or pronunciation model.
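As a quick sanity check on the figures quoted above, the relative improvement can be computed directly:

```python
# Relative WER reduction from the 6.7% baseline to the new 5.6% result.
baseline_wer, new_wer = 6.7, 5.6
print(round((baseline_wer - new_wer) / baseline_wer * 100, 1))  # ~16.4 (%)
```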

The new system is built on the Listen, Attend and Spell (LAS) end-to-end architecture, which consists of three components. The Listen component is an encoder similar to a standard acoustic model: it takes the time-frequency representation of the speech signal x as input and uses a stack of neural network layers to map it to a higher-level representation h_enc. The Attend component receives the encoder output and uses h_enc to learn the alignment between the input x and the predicted subword units {yn, ..., y0}, where each subword unit is typically a grapheme or a wordpiece. Finally, the attention output is passed to the Spell component (the decoder), which, similar to a language model, produces a probability distribution over a set of hypothesized words.
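The following PyTorch sketch illustrates this three-part structure. The layer sizes, the dot-product attention, and the teacher-forced decoding loop are illustrative assumptions, not the exact configuration reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Listener(nn.Module):
    """Encoder ("Listen"): maps time-frequency input x to high-level features h_enc."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)

    def forward(self, x):                 # x: (batch, time, feat_dim)
        h_enc, _ = self.rnn(x)            # (batch, time, 2*hidden)
        return h_enc

class AttendAndSpell(nn.Module):
    """Attention ("Attend") + decoder ("Spell"): aligns h_enc with the output
    and emits a distribution over subword units at each step."""
    def __init__(self, enc_dim=512, hidden=256, vocab_size=100):
        super().__init__()
        self.hidden = hidden
        self.embed = nn.Embedding(vocab_size, hidden)
        self.cell = nn.LSTMCell(hidden + enc_dim, hidden)
        self.query = nn.Linear(hidden, enc_dim)
        self.out = nn.Linear(hidden + enc_dim, vocab_size)

    def forward(self, h_enc, targets):    # targets: (batch, out_len), teacher forcing
        batch = h_enc.size(0)
        s = h_enc.new_zeros(batch, self.hidden)
        c = h_enc.new_zeros(batch, self.hidden)
        context = h_enc.new_zeros(batch, h_enc.size(2))
        logits = []
        for t in range(targets.size(1)):
            emb = self.embed(targets[:, t])
            s, c = self.cell(torch.cat([emb, context], dim=-1), (s, c))
            # Dot-product attention between the decoder state and h_enc.
            scores = torch.bmm(h_enc, self.query(s).unsqueeze(-1)).squeeze(-1)
            align = F.softmax(scores, dim=-1)                  # attention weights
            context = torch.bmm(align.unsqueeze(1), h_enc).squeeze(1)
            logits.append(self.out(torch.cat([s, context], dim=-1)))
        return torch.stack(logits, dim=1)  # (batch, out_len, vocab_size)

# Toy forward pass: 1 utterance, 50 frames of 80-dim features, 5 target units.
listener, speller = Listener(), AttendAndSpell()
x = torch.randn(1, 50, 80)
y = torch.randint(0, 100, (1, 5))
print(speller(listener(x), y).shape)      # torch.Size([1, 5, 100])
```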

Unlike the traditionally independent training, all components of LAS are trained jointly as a single end-to-end neural network, which makes the system simpler and more convenient to build. In addition, because LAS is a pure neural network, it requires no external hand-designed components such as finite state transducers, a lexicon, or text normalization modules. Finally, unlike traditional models, LAS does not need to be bootstrapped from decision trees or time alignments generated by a separate system; it can be trained directly from pairs of text transcripts and the corresponding audio.
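As a rough illustration of what "jointly trained in a single network" means in practice, the deliberately simplified toy model below wraps an encoder and a decoder stand-in into one module and updates everything with a single loss and a single optimizer. It is a sketch of the joint-training idea only, not the LAS training setup.

```python
import torch
import torch.nn as nn

class ToyEndToEndASR(nn.Module):
    """Toy "single network": encoder and decoder live in one module, so one
    optimizer updates every component. Sizes and targets are illustrative."""
    def __init__(self, feat_dim=80, hidden=128, vocab_size=100):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, vocab_size)   # stand-in for attend + spell

    def forward(self, x):
        h, _ = self.encoder(x)
        return self.decoder(h)

model = ToyEndToEndASR()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # one optimizer, all parts
criterion = nn.CrossEntropyLoss()

x = torch.randn(4, 50, 80)             # 4 toy utterances
y = torch.randint(0, 100, (4, 50))     # toy frame-level targets
loss = criterion(model(x).reshape(-1, 100), y.reshape(-1))
loss.backward()                        # gradients reach encoder and decoder alike
optimizer.step()
```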

In the paper, the Google Brain team also introduces several novel structural refinements to LAS, including improving the attention vectors passed to the decoder and training the network with longer subword units (wordpieces). They also employ a number of optimized training methods, including minimum word error rate (MWER) training. These innovations are what give the end-to-end model its 16% relative improvement over the traditional system.
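The sketch below illustrates the core idea behind MWER training on a toy N-best list: rather than optimizing pure cross-entropy, minimize the expected number of word errors under the renormalized hypothesis probabilities. The hypothesis scores and error counts here are made-up values; a real system would obtain them from beam search and an edit-distance computation against the reference transcript.

```python
import torch
import torch.nn.functional as F

# Model log-probabilities of 4 beam-search hypotheses for one utterance (toy values).
hyp_logprobs = torch.tensor([-1.2, -1.5, -2.0, -2.3], requires_grad=True)

# Number of word errors of each hypothesis against the reference (toy values).
word_errors = torch.tensor([0.0, 1.0, 2.0, 3.0])

probs = F.softmax(hyp_logprobs, dim=-1)            # renormalize over the N-best list
expected_errors = torch.sum(probs * word_errors)   # differentiable expected-error proxy

expected_errors.backward()
print(expected_errors.item())
print(hyp_logprobs.grad)   # gradient shifts probability mass toward low-error hypotheses
```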

Another exciting aspect of this research is the multi-dialect and multi-language system, which may open up new applications. Because it is a single optimized neural network, the simplicity of the model makes it attractive: with LAS, researchers can pool the data of all dialects and languages for training without having to build a separate AM, PM, and LM for each one. According to the paper, Google's model was tested on 7 English dialects and 9 Indian languages, performed well, and surpassed the separately trained models used as a control group.
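One common way to pool such data for a single sequence-to-sequence model is to merge all corpora and tag each target sequence with a dialect or language token. The snippet below sketches that idea with hypothetical file paths and transcripts; it is an illustration of the data-pooling concept, not necessarily the exact mechanism used in the paper.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    audio_path: str       # hypothetical path to the audio features
    transcript: str
    lang: str             # dialect / language identifier, e.g. "en-IN", "hi-IN"

# Hypothetical per-dialect / per-language corpora.
corpora = {
    "en-US": [Utterance("a.wav", "turn off the lights", "en-US")],
    "en-IN": [Utterance("b.wav", "set an alarm", "en-IN")],
    "hi-IN": [Utterance("c.wav", "गाना चलाओ", "hi-IN")],
}

def pooled_examples(corpora):
    """Merge all dialects and languages into one training stream with a tag token."""
    for lang, utts in corpora.items():
        for u in utts:
            # Prepend a language tag so a single model can condition on it.
            yield u.audio_path, f"<{lang}> {u.transcript}"

for audio, target in pooled_examples(corpora):
    print(audio, "->", target)
```

A single model trained on this pooled stream replaces the per-dialect AM/PM/LM stacks that a traditional system would require.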

Although these results are exciting, the work is not yet fully mature, chiefly because the model cannot yet process speech in real time, which is a basic requirement for voice search. In addition, there is still a large gap between these models and real production data: they were trained on only about 22,000 audio-text utterance pairs, far less data than the corpora a traditional language model can draw on. And when faced with rare vocabulary, such as artificially coined terms or proper nouns, the end-to-end model cannot spell the words correctly. To make these models practical and broadly applicable, Google Brain researchers still face many problems ahead.
