Voice signal processing technology for microphone arrays

As artificial intelligence moves ever closer to everyday life, voice technology has drawn increasing attention. Traditional near-field speech capture no longer meets users' needs: people want to control voice devices from greater distances and in more complex environments. Microphone array technology has therefore become the core of far-field speech processing.
The significance of microphone arrays for artificial intelligence:
Spatial selectivity: spatial positioning techniques such as electronically scanned (beam-sweeping) arrays can determine the location of the sound source. With accurate source position information, a smart device can respond more intelligently and use array algorithms to obtain a high-quality voice signal.
Automatic detection and tracking: the microphone array detects the location of the sound source and tracks the speaker, and it can handle multiple sources and follow a moving one. Wherever you go, the smart device enhances the signal coming from your position.
Spatial processing: the array adds a spatial dimension, and joint space-time-frequency processing of multiple signals compensates for the weaknesses of a single channel in noise suppression, echo suppression, reverberation suppression, sound source localization, and speech separation. Smart devices can thus obtain high-quality voice signals even in complex environments and deliver a better intelligent voice experience.
Technical difficulties in microphone array processing:
Traditional array signal processing techniques applied directly to a microphone array system are often ineffective, because microphone array processing has its own distinct characteristics:
Array model: a microphone array is built mainly to capture speech. Its pickup range is limited and it mostly operates under a near-field model, so the conventional far-field plane-wave models used in radar and sonar no longer apply. The near-field case requires a more accurate spherical-wave model, which must account for the different amplitude attenuation caused by the different propagation paths to each element.
Wideband signal processing: conventional array signal processing is mostly narrowband, so the delay between array elements appears mainly as a phase difference at the carrier frequency. Speech, however, is unmodulated and has no carrier, and its ratio of highest to lowest frequency is large; the phase delay across elements depends strongly on frequency, a property of the source itself, so traditional narrowband array methods no longer fully apply.
Non-stationary signal processing: traditional array processing mostly assumes stationary signals, whereas the signals a microphone array handles are non-stationary or at best short-term stationary. The array therefore processes the signal in the short-time frequency domain, where each frequency bin corresponds to a single phase difference: the wideband signal is divided into multiple narrowband subbands, each subband is processed with narrowband methods, and the results are recombined into a wideband spectrum, as sketched below.
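As a concrete illustration, here is a minimal sketch of that analysis-process-synthesis loop in Python. The sample rate, frame length, overlap, and the identity per-bin gain are illustrative placeholders of my own, not values from the text.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000                          # assumed sample rate (Hz)
x = np.random.randn(fs)             # stand-in for one microphone channel

# Short-time analysis: within each frame, every FFT bin behaves as a
# narrowband signal, so a single phase difference per bin suffices.
f, t, X = stft(x, fs=fs, nperseg=512, noverlap=384)

# Placeholder per-subband processing: identity gains here; a real system
# would apply per-bin beamforming weights or suppression gains instead.
gains = np.ones_like(f)
Y = X * gains[:, None]

# Overlap-add synthesis recombines the subbands into a wideband signal.
_, y = istft(Y, fs=fs, nperseg=512, noverlap=384)
```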
Reverberation: sound propagation is strongly affected by the room. Because of spatial reflection and diffraction, the microphone receives not only the direct signal but also superimposed multipath copies that interfere with it; this is reverberation. In an indoor environment, reflection and diffraction off room boundaries and obstacles cause the sound to persist, greatly reducing the intelligibility of speech.
Sound source localization: sound source localization technology is widely used in artificial intelligence. The microphone array defines a spatial Cartesian coordinate system, and depending on whether the array is linear, planar, or volumetric, the position of the source in space is determined. The smart device can then apply further speech enhancement toward the source location, and once it has your position it can combine that with other sensors for a richer experience: a robot can hear your call and come to your side, a video device can lock its focus onto the speaker, and so on. Before looking at localization techniques, we need to understand the near-field and far-field models.
Near-field and far-field models: a talker is usually 1~3 m from the microphone array, which places the array in the near-field model. The array then receives spherical waves rather than plane waves. Sound attenuates as it propagates, with attenuation proportional to the propagation distance, so the amplitude arriving at each element also differs; the received signals therefore differ not only in phase delay but also in amplitude. In the far-field model, the differences in source-to-element distance are relatively small and can be ignored. The near/far boundary is generally defined as 2L²/λ, where L is the array aperture and λ is the acoustic wavelength.
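To make the threshold concrete, a small back-of-the-envelope computation; the aperture and frequency below are illustrative numbers of my own:

```python
# Near/far boundary r = 2 * L**2 / wavelength (illustrative numbers).
c = 343.0              # speed of sound in air (m/s)
L = 0.5                # array aperture (m), e.g. a soundbar-sized array
f = 4000.0             # a speech frequency of interest (Hz)

wavelength = c / f                    # ~8.6 cm
r_boundary = 2 * L ** 2 / wavelength  # ~5.8 m

# A talker 1-3 m away is inside this boundary, so the spherical
# (near-field) model applies for an array of this size.
print(f"near-field boundary: {r_boundary:.2f} m")
```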
Sound source localization techniques include beamforming, super-resolution spectral estimation, and TDOA. They map the relationship between the source and the array into, respectively, a spatial beam, a spatial spectrum, and time differences of arrival, and locate the source from that information.
Beam scanning (the electrically swept array) scans space with the beam formed by the array and judges direction from how differently each angle is suppressed. The scan is performed by controlling the weighting coefficients of the individual elements; when the scan reaches maximum output signal power, the corresponding beam direction is taken as the DOA of the source, which localizes it. Electric scanning has limitations: it suits only a single source, and multiple sources falling inside the same main beam of the array pattern cannot be distinguished. Its accuracy is tied to the beamwidth, which at a given frequency is inversely proportional to the array aperture, and a large-aperture microphone array is difficult to realize on much hardware. A hedged sketch of such a scan follows.
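Here is a minimal scan for a uniform linear array in the STFT domain; the geometry, sign convention, and function name are my own illustration, not taken from the text:

```python
import numpy as np

def beam_scan_doa(X, freqs, mic_x, angles_deg, c=343.0):
    """Steer a delay-and-sum beam over candidate angles and return the
    angle whose beam output power is largest (taken as the source DOA).
    X: STFT frames, shape (mics, freq_bins, time_frames).
    mic_x: element positions along the array axis (m)."""
    powers = []
    for ang in np.deg2rad(angles_deg):
        tau = mic_x * np.cos(ang) / c               # steering delays (s)
        # Phase weights that compensate each element's delay per bin.
        w = np.exp(2j * np.pi * freqs[None, :] * tau[:, None])
        y = np.sum(w[:, :, None] * X, axis=0)       # beamformer output
        powers.append(np.mean(np.abs(y) ** 2))      # output power
    return angles_deg[int(np.argmax(powers))]
```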
Super-resolution spectral estimation methods such as MUSIC and ESPRIT eigendecompose the covariance (correlation) matrix of the received signals to construct a spatial spectrum; the peak of the spectrum gives the source direction. They handle multiple sources, and their resolution is independent of the array size, breaking through the physical limit, hence "super-resolution." These methods can be extended to wideband processing, but they are very sensitive to errors such as microphone element mismatch and channel error, they assume a far-field model, and they involve large matrix computations. A minimal sketch follows.
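For illustration, a narrowband MUSIC pseudo-spectrum for a linear array; covariance estimation and the wideband extension are omitted, and the names are mine:

```python
import numpy as np

def music_spectrum(R, mic_x, freq, angles_deg, n_src=1, c=343.0):
    """Narrowband MUSIC pseudo-spectrum at one frequency.
    R: (M, M) spatial covariance matrix of the element signals."""
    # Eigendecompose the covariance; the eigenvectors of the M - n_src
    # smallest eigenvalues span the noise subspace.
    vals, vecs = np.linalg.eigh(R)          # eigenvalues ascending
    En = vecs[:, :-n_src]
    spec = []
    for ang in np.deg2rad(angles_deg):
        tau = mic_x * np.cos(ang) / c
        a = np.exp(-2j * np.pi * freq * tau)         # steering vector
        # The spectrum peaks where the steering vector is orthogonal
        # to the noise subspace, i.e. at the source directions.
        spec.append(1.0 / np.real(a.conj() @ En @ En.conj().T @ a))
    return np.array(spec)
```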
TDOA
TDOA methods estimate the delay difference between the source's arrivals at different microphones, convert the delays into distance differences, and then combine the distance differences with the array geometry to determine the source position. This divides into two steps, TDOA estimation and TDOA-based localization:
1. TDOA estimation: the common methods are generalized cross-correlation (GCC, Generalized Cross Correlation) and LMS adaptive filtering. TDOA-based source localization mainly uses GCC for delay estimation. GCC is simple to compute, has low latency and good tracking ability, and suits real-time applications; it performs well under moderate noise intensity and low reverberation, but its accuracy degrades in noisy, non-stationary environments. (A GCC-PHAT sketch appears after this list.)
LMS adaptive filtering yields the TDOA estimate once the filter has converged and requires no prior knowledge of the noise or the signal, but it is sensitive to reverberation. The method takes one microphone signal as the target and the other as the input, drives the filtered input toward the target, and reads the TDOA from the adapted filter coefficients. (See the LMS sketch after this list.)
2. TDOA-based localization uses the TDOA estimates to locate the source. Three microphones suffice to determine a source position in the plane, and adding microphones improves accuracy. Localization methods include maximum likelihood estimation (MLE), minimum variance, spherical interpolation, and linear intersection. TDOA is widely used: it offers high positioning accuracy, minimal computation, and good real-time performance, so it supports real-time tracking, and most current intelligent positioning products adopt TDOA as their localization technology. (A least-squares grid-search sketch appears below.)
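First, a minimal GCC delay estimator with the common PHAT weighting, assuming two already-synchronized channels; the epsilon guard and names are mine:

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay (s) of `sig` relative to `ref` via GCC-PHAT."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    # PHAT weighting whitens the cross-spectrum, keeping only phase;
    # this sharpens the peak and helps under moderate reverberation.
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```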
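Next, a sketch of the LMS variant: one channel is filtered to approximate the other, and at convergence the dominant tap index encodes the delay. The step size and tap count are illustrative assumptions:

```python
import numpy as np

def lms_tdoa(x, d, fs, n_taps=64, mu=0.01):
    """LMS delay estimate: adapt an FIR filter so x approximates d,
    then read the delay off the largest filter coefficient."""
    w = np.zeros(n_taps)
    for i in range(n_taps - 1, len(x)):
        xv = x[i - n_taps + 1:i + 1][::-1]   # [x[i], x[i-1], ...]
        e = d[i] - w @ xv                    # error against target channel
        w += mu * e * xv                     # LMS update (mu must be small)
    # If d lags x by k samples, the dominant tap sits at index k.
    return np.argmax(np.abs(w)) / fs
```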
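Finally, a brute-force least-squares localization sketch: search a 2-D grid for the position whose predicted TDOAs best match the measurements. This is a simple stand-in for the closed-form methods named above, with grid bounds chosen arbitrarily:

```python
import numpy as np

def tdoa_localize(mic_pos, tdoas, c=343.0, grid=np.linspace(-3.0, 3.0, 121)):
    """mic_pos: (M, 2) microphone coordinates (m); tdoas: (M-1,) measured
    delays of mics 1..M-1 relative to mic 0 (s). Returns best (x, y)."""
    best, best_err = None, np.inf
    for x in grid:
        for y in grid:
            d = np.linalg.norm(mic_pos - np.array([x, y]), axis=1)
            pred = (d[1:] - d[0]) / c        # TDOAs this position implies
            err = np.sum((pred - tdoas) ** 2)
            if err < best_err:
                best, best_err = (x, y), err
    return best
```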
Beamforming:
Beamforming divides into conventional beamforming (CBF, Conventional Beam Forming) and adaptive beamforming (ABF, Adaptive Beam Forming). CBF is the simplest, non-adaptive form: the output of each microphone is weighted and summed to form a beam. In CBF the weight of each channel is fixed; its role is to suppress the sidelobe level of the array pattern so as to filter out interference and noise arriving in the sidelobe region. ABF builds on CBF by adding spatial-domain adaptive filtering of interference and noise: the amplitude weights of the channels are adjusted and optimized according to some optimality criterion, and different criteria yield different algorithms, such as LMS, LS, maximum SNR, and LCMV (Linearly Constrained Minimum Variance). The LCMV criterion yields the MVDR beamformer (Minimum Variance Distortionless Response): minimize the array output power while keeping the main-lobe gain of the pattern fixed, which minimizes the interference-plus-noise power at the output. This can also be understood as a maximum-SINR criterion, receiving the desired signal as fully as possible while suppressing noise and interference. A minimal MVDR weight computation is sketched below.
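The MVDR solution has a compact closed form, w = R⁻¹a / (aᴴR⁻¹a); here is a minimal sketch, with covariance estimation not shown and the function name my own:

```python
import numpy as np

def mvdr_weights(R, a):
    """Solve min_w w^H R w subject to w^H a = 1 (distortionless response
    toward steering vector a), giving w = R^{-1} a / (a^H R^{-1} a)."""
    Ri_a = np.linalg.solve(R, a)       # R^{-1} a without an explicit inverse
    return Ri_a / (a.conj() @ Ri_a)
```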
CBF, traditional delay-and-sum beamforming, is used for speech enhancement: the signal received by each microphone is delayed to compensate for the time difference between the source and that microphone, so the channel outputs are in phase for a chosen direction and the incident signal from that direction adds up maximally, placing the direction of maximum output power inside the main beam. This forms a spatial filter and gives the array directional selectivity, as in the sketch below.
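A minimal frequency-domain delay-and-sum sketch, applying fractional-sample steering delays as per-bin phase shifts; the interface and names are my own illustration:

```python
import numpy as np

def delay_and_sum(x, delays, fs):
    """x: (mics, samples); delays: per-channel steering delays (s) that
    compensate each microphone's propagation delay from the look direction."""
    n = x.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    X = np.fft.rfft(x, axis=1)
    # A delay of tau seconds is a phase ramp exp(+j*2*pi*f*tau) per bin;
    # after alignment the look-direction signal adds in phase.
    shift = np.exp(2j * np.pi * freqs[None, :] * np.asarray(delays)[:, None])
    return np.fft.irfft((X * shift).sum(axis=0), n=n) / x.shape[0]
```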
CBF + adaptive filtering: beamforming combined with Wiener filtering improves speech enhancement. The noisy speech is Wiener-filtered toward the clean speech signal; under an LMS-style criterion the filter coefficients are updated iteratively, which removes non-stationary noise more effectively than CBF alone. One common realization is sketched below.
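A per-bin Wiener gain applied to the beamformer output, assuming the noise power spectrum has been estimated beforehand (e.g., from speech-free frames); the SNR estimator here is a deliberately crude illustration:

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_postfilter(y, fs, noise_psd, nperseg=512):
    """Apply the Wiener gain H = SNR / (1 + SNR) per time-frequency bin.
    noise_psd: (freq_bins,) noise power estimate matching the STFT grid."""
    f, t, Y = stft(y, fs=fs, nperseg=nperseg)
    # Crude SNR estimate from the noisy power and the known noise floor.
    snr = np.maximum(np.abs(Y) ** 2 / noise_psd[:, None] - 1.0, 0.0)
    H = snr / (snr + 1.0)                    # Wiener suppression gain
    _, out = istft(H * Y, fs=fs, nperseg=nperseg)
    return out
```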
ABF, adaptive beamforming: the GSLC (generalized sidelobe canceller, usually abbreviated GSC) is a structure based on ANC, active noise cancellation. The noisy signal passes through both a main channel and an auxiliary channel; the blocking matrix in the auxiliary channel filters out the speech, leaving a multi-channel reference containing only noise, and an adaptive filter then forms an optimal estimate of the noise in each channel, which is subtracted to yield an estimate of the clean speech. A two-microphone sketch follows.
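A minimal two-microphone sketch of this structure, under the simplifying assumptions that the channels are pre-aligned to the look direction, the sum acts as the fixed beamformer, the difference as the blocking matrix, and LMS as the noise canceller:

```python
import numpy as np

def gsc_two_mic(x1, x2, n_taps=32, mu=0.005):
    """Two-channel GSC: fixed path = average of the aligned channels;
    blocking path = difference (look-direction speech cancels, leaving
    a noise reference); LMS removes the correlated noise."""
    fixed = 0.5 * (x1 + x2)                  # main (fixed beamformer) channel
    block = x1 - x2                          # noise-only reference channel
    w = np.zeros(n_taps)
    out = np.zeros_like(fixed)
    for i in range(n_taps, len(fixed)):
        bv = block[i - n_taps:i][::-1]
        out[i] = fixed[i] - w @ bv           # subtract estimated noise
        w += mu * out[i] * bv                # adapt to minimize output power
    return out
```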
Future development of array technology: microphone array technology has many advantages over single-microphone systems and has become an important part of speech enhancement and speech signal processing. Speech enhancement and sound source localization are now indispensable parts of array technology, and both are needed in video conferencing, intelligent robots, hearing aids, smart home appliances, communications, smart toys, and automotive applications. Many signal processing and array processing techniques have been integrated into microphone array speech processing systems, and they continue to improve and spread. Even in complex noise, reverberation, and acoustic environments, powerful hardware now makes it possible for complex algorithms to perform speech enhancement in real time. In the future, the close combination of voice and image will be a new breakthrough for artificial intelligence. Riding this wave, whoever can ingeniously and organically combine speech recognition, speech understanding, array signal processing, far-field speech, image recognition, face recognition, iris recognition, and voiceprint recognition, uniting the best of the technology with a people-centered purpose, remains to be seen.
